Technical Papers
Apr 22, 2022

Deep Learning Image Captioning in Construction Management: A Feasibility Study

Publication: Journal of Construction Engineering and Management
Volume 148, Issue 7

Abstract

Deep learning image captioning methods can generate one or more natural-language sentences that describe the contents of construction images. By deconstructing these sentences, construction object and activity information can be retrieved together for automated scene analysis. However, the feasibility of deep learning image captioning in construction remains unclear. To fill this gap, this research investigates the feasibility of deep learning image captioning methods in construction management. First, a linguistic schema for annotating construction machine images was established, and a captioning data set was developed. Then, six deep learning image captioning methods from the computer vision community were selected and tested on the construction captioning data set. In the sentence-level evaluation, the transformer-self-critical sequence training (Tsfm-SCST) method obtained the best performance of the six methods, with a bilingual evaluation understudy (BLEU)-1 score of 0.606, BLEU-2 of 0.506, BLEU-3 of 0.427, BLEU-4 of 0.349, metric for evaluation of translation with explicit ordering (METEOR) of 0.287, recall-oriented understudy for gisting evaluation (ROUGE) of 0.585, consensus-based image description evaluation (CIDEr) of 1.715, and semantic propositional image caption evaluation (SPICE) of 0.422. In the element-level evaluation, the Tsfm-SCST method achieved an average precision of 91.1%, recall of 83.3%, and F1 score of 86.6% for recognizing construction machine objects from the deconstructed sentences. This research indicates that deep learning image captioning is feasible as a method of generating accurate and precise text descriptions from construction images, with potential applications in construction scene analysis and image documentation.
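The full methodology is behind the access wall, but the two evaluation levels summarized in the abstract are straightforward to prototype. The sketch below is a minimal illustration, not the authors' released code: it computes sentence-level BLEU-1 through BLEU-4 with NLTK and element-level precision, recall, and F1 by deconstructing a generated caption into machine objects. The example captions, the extract_machines helper, and the machine vocabulary are hypothetical placeholders.

```python
# Minimal sketch of the two evaluation levels described in the abstract.
# Not the authors' code: the captions, extract_machines helper, and
# machine vocabulary below are illustrative placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# --- Sentence-level evaluation: BLEU-1 to BLEU-4 for one generated caption ---
reference = "an excavator is digging soil next to a dump truck".split()
candidate = "an excavator digs soil beside a dump truck".split()
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")

# --- Element-level evaluation: precision/recall/F1 for machine objects ---
def extract_machines(caption, vocabulary):
    """Stand-in for the sentence-deconstruction step: pull machine terms out of a caption."""
    return {term for term in vocabulary if term in caption}

MACHINE_VOCAB = {"excavator", "dump truck", "loader", "bulldozer", "grader"}
predicted = extract_machines(" ".join(candidate), MACHINE_VOCAB)
ground_truth = {"excavator", "dump truck"}

tp = len(predicted & ground_truth)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(ground_truth) if ground_truth else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

In the paper's sentence-level evaluation, METEOR, ROUGE, CIDEr, and SPICE accompany BLEU; in practice those metrics are usually computed with the COCO caption evaluation toolkit rather than NLTK.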

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request (e.g., captioning data sets and models).

Acknowledgments

The authors would like to thank Mr. Zicong Huang and Ms. Dilyara Tulegenova for assisting with the development of the image captioning data set.

References

Anderson, P., B. Fernando, M. Johnson, and S. Gould. 2016. “SPICE: Semantic propositional image caption evaluation.” In Proc., European Conf. on Computer Vision, 382–398. New York: Springer.
Balali, V., and M. Golparvar-Fard. 2015. “Segmentation and recognition of roadway assets from car-mounted camera video streams using a scalable non-parametric image parsing method.” Autom. Constr. 49 (Jan): 27–39. https://doi.org/10.1016/j.autcon.2014.09.007.
Bang, S., and H. Kim. 2020. “Context-based information generation for managing UAV-acquired data using image captioning.” Autom. Constr. 112 (Apr): 103116. https://doi.org/10.1016/j.autcon.2020.103116.
Chen, C., Z. Zhu, and A. Hammad. 2020. “Automated excavators activity recognition and productivity analysis from construction site surveillance videos.” Autom. Constr. 110 (Feb): 103045. https://doi.org/10.1016/j.autcon.2019.103045.
Lin, C.-Y. 2004. “ROUGE: A package for automatic evaluation of summaries.” In Text summarization branches out, 74–81. Cambridge, MA: Association for Computational Linguistics.
Chun, P., T. Yamane, and Y. Maemura. 2021. “A deep learning-based image captioning method to automatically generate comprehensive explanations of bridge damage.” Comput.-Aided Civ. Infrastruct. Eng. 1–15. https://doi.org/10.1111/mice.12793.
Fang, W., L. Ma, P. E. D. Love, H. Luo, L. Ding, and A. Zhou. 2020. “Knowledge graph for identifying hazards on construction sites: Integrating computer vision with ontology.” Autom. Constr. 119 (Nov): 103310. https://doi.org/10.1016/j.autcon.2020.103310.
Gao, L., X. Li, J. Song, and H. T. Shen. 2019. “Hierarchical LSTMs with adaptive attention for visual captioning.” IEEE Trans. Pattern Anal. Mach. Intell. 42 (5): 1112–1131. https://doi.org/10.1109/TPAMI.2019.2894139.
Golparvar-Fard, M., A. Heydarian, and J. C. Niebles. 2013. “Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers.” Adv. Eng. Inf. 27 (4): 652–663. https://doi.org/10.1016/j.aei.2013.09.001.
Hara, K., H. Kataoka, and Y. Satoh. 2018. “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?” In Proc., IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 6546–6555. New York: IEEE.
He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep residual learning for image recognition.” In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 770–778. New York: IEEE.
Hossain, M. Z., F. Sohel, M. F. Shiratuddin, and H. Laga. 2019. “A comprehensive survey of deep learning for image captioning.” ACM Comput. Surv. 51 (6): 1–36. https://doi.org/10.1145/3295748.
Huang, L., W. Wang, J. Chen, and X.-Y. Wei. 2019. “Attention on attention for image captioning.” In Proc., IEEE/CVF Int. Conf. on Computer Vision, 4633–4642. New York: IEEE.
Johnson, J., A. Karpathy, and L. Fei-Fei. 2015. “DenseCap: Fully convolutional localization networks for dense captioning.” In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 4565–4574. New York: IEEE.
Kim, H., H. Kim, Y. W. Hong, and H. Byun. 2018. “Detecting construction equipment using a region-based fully convolutional network and transfer learning.” J. Comput. Civ. Eng. 32 (2): 04017082. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000731.
Kim, H., K. Kim, and H. Kim. 2016. “Data-driven scene parsing method for recognizing construction site objects in the whole image.” Autom. Constr. 71 (Nov): 271–282. https://doi.org/10.1016/j.autcon.2016.08.018.
Kolar, Z., H. Chen, and X. Luo. 2018. “Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images.” Autom. Constr. 89 (May): 58–70. https://doi.org/10.1016/j.autcon.2018.01.003.
Konstantinou, E., J. Lasenby, and I. Brilakis. 2019. “Adaptive computer vision-based 2D tracking of workers in complex environments.” Autom. Constr. 103 (Jul): 168–184. https://doi.org/10.1016/j.autcon.2019.01.018.
Kulchandani, J. S., and K. J. Dangarwala. 2015. “Moving object detection: Review of recent research trends.” In Proc., Int. Conf. on Pervasive Computing, 1–5. New York: IEEE.
Lavie, A., and A. Agarwal. 2007. “METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments.” In Proc., 2nd Workshop on Statistical Machine Translation, 228–231. Cambridge, MA: MIT Press.
LeCun, Y., Y. Bengio, and G. Hinton. 2015. “Deep learning.” Nature 521 (7553): 436–444. https://doi.org/10.1038/nature14539.
Lin, T.-Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. “Microsoft COCO: Common objects in context.” In Proc., European Conf. on Computer Vision, 740–755. New York: Springer.
Liu, H., G. Wang, T. Huang, P. He, M. Skitmore, and X. Luo. 2020. “Manifesting construction activity scenes via image captioning.” Autom. Constr. 119 (Nov): 103334. https://doi.org/10.1016/j.autcon.2020.103334.
Lu, J., J. Yang, D. Batra, and D. Parikh. 2018. “Neural baby talk.” Preprint, submitted March 27, 2018. http://arxiv.org/abs/1803.09845.
Luo, X., H. Li, D. Cao, F. Dai, J. Seo, and S. Lee. 2018a. “Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks.” J. Comput. Civ. Eng. 32 (3): 04018012. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000756.
Luo, X., H. Li, D. Cao, Y. Yu, X. Yang, and T. Huang. 2018b. “Towards efficient and objective work sampling: Recognizing workers’ activities in site surveillance videos with two-stream convolutional networks.” Autom. Constr. 94 (Oct): 360–370. https://doi.org/10.1016/j.autcon.2018.07.011.
Mao, J., W. Xu, Y. Yang, J. Wang, and A. L. Yuille. 2014. “Explain images with multimodal recurrent neural networks.” Preprint, submitted October 4, 2014. http://arxiv.org/abs/1410.1090.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. “BLEU: A method for automatic evaluation of machine translation.” In Proc., 40th Annual Meeting of the Association for Computational Linguistics, 311. Cambridge, MA: MIT Press.
Pour Rahimian, F., S. Seyedzadeh, S. Oliver, S. Rodriguez, and N. Dawood. 2020. “On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning.” Autom. Constr. 110 (Feb): 103012. https://doi.org/10.1016/j.autcon.2019.103012.
Redmon, J., and A. Farhadi. 2018. “YOLOv3: An incremental improvement.” Preprint, submitted April 8, 2018. http://arxiv.org/abs/1804.02767.
Rennie, S. J., E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. 2017. “Self-critical sequence training for image captioning.” In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 1179–1195. New York: IEEE.
Roberts, D., and M. Golparvar-Fard. 2019. “End-to-end vision-based detection, tracking and activity analysis of earthmoving equipment filmed at ground level.” Autom. Constr. 105 (Sep): 102811. https://doi.org/10.1016/j.autcon.2019.04.006.
Son, H., H. Seong, H. Choi, and C. Kim. 2019. “Real-time vision-based warning system for prevention of collisions between workers and heavy equipment.” J. Comput. Civ. Eng. 33 (5): 1–14. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000845.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. “Attention is all you need.” Preprint, submitted June 12, 2017. http://arxiv.org/abs/1706.03762.
Vedantam, R., C. L. Zitnick, and D. Parikh. 2014. “CIDEr: Consensus-based image description evaluation.” Accessed October 3, 2021. http://arxiv.org/abs/1411.5726.
Vig, J. 2019. “A multiscale visualization of attention in the transformer model.” Accessed October 3, 2021. http://arxiv.org/abs/1906.05714.
Vinyals, O., A. Toshev, S. Bengio, and D. Erhan. 2017. “Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge.” IEEE Trans. Pattern Anal. Mach. Intell. 39 (4): 652–663. https://doi.org/10.1109/TPAMI.2016.2587640.
Wang, X., and Z. Zhu. 2021. “Vision-based hand signal recognition in construction: A feasibility study.” Autom. Constr. 125 (Feb): 103625. https://doi.org/10.1016/j.autcon.2021.103625.
Wang, Y., P.-C. Liao, C. Zhang, Y. Ren, X. Sun, and P. Tang. 2019. “Crowdsourced reliable labeling of safety-rule violations on images of complex construction scenes for advanced vision-based workplace safety.” Adv. Eng. Inf. 42 (Oct): 101001. https://doi.org/10.1016/j.aei.2019.101001.
Xiao, B., and S. Kang. 2021a. “Development of an image data set of construction machines for deep learning object detection.” J. Comput. Civ. Eng. 35 (2): 05020005. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000945.
Xiao, B., and S. Kang. 2021b. “Vision-based method integrating deep learning detection for tracking multiple construction machines.” J. Comput. Civ. Eng. 35 (2): 04020071. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000957.
Xiao, B., Q. Lin, and Y. Chen. 2021a. “A vision-based method for automatic tracking of construction machines at nighttime based on deep learning illumination enhancement.” Autom. Constr. 127 (Jul): 103721. https://doi.org/10.1016/j.autcon.2021.103721.
Xiao, B., Y. Zhang, Y. Chen, and X. Yin. 2021b. “A semi-supervised learning detection method for vision-based monitoring of construction sites by integrating teacher-student networks and data augmentation.” Adv. Eng. Inf. 50 (Oct): 101372. https://doi.org/10.1016/j.aei.2021.101372.
Xiao, B., and Z. Zhu. 2018. “Two-dimensional visual tracking in construction scenarios: A comparative study.” J. Comput. Civ. Eng. 32 (3): 04018006. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000738.
Xu, K., J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. “Show, attend and tell: Neural image caption generation with visual attention.” Accessed October 3, 2021. http://arxiv.org/abs/1502.03044.
Yang, J., M.-W. Park, P. A. Vela, and M. Golparvar-Fard. 2015. “Construction performance monitoring via still images, time-lapse photos, and video streams: Now, tomorrow, and the future.” Adv. Eng. Inf. 29 (2): 211–224. https://doi.org/10.1016/j.aei.2015.01.011.
Yang, J., Z. Shi, and Z. Wu. 2016. “Vision-based action recognition of construction workers using dense trajectories.” Adv. Eng. Inf. 30 (3): 327–336. https://doi.org/10.1016/j.aei.2016.04.009.
Yin, X., Y. Chen, A. Bouferguene, H. Zaman, M. Al-Hussein, and L. Kurach. 2020. “A deep learning-based framework for an automated defect detection system for sewer pipes.” Autom. Constr. 109 (Aug): 102967. https://doi.org/10.1016/j.autcon.2019.102967.
Zhang, B., I. Titov, and R. Sennrich. 2019. “Improving deep transformer with depth-scaled initialization and merged attention.” Accessed October 3, 2021. http://arxiv.org/abs/1908.11365.
Zhu, Z., X. Ren, and Z. Chen. 2017. “Integrated detection and tracking of workforce and equipment from construction jobsite videos.” Autom. Constr. 81 (Sep): 161–171. https://doi.org/10.1016/j.autcon.2017.05.005.

Information & Authors

Published In

Journal of Construction Engineering and Management
Volume 148, Issue 7, July 2022

History

Received: Oct 3, 2021
Accepted: Feb 22, 2022
Published online: Apr 22, 2022
Published in print: Jul 1, 2022
Discussion open until: Sep 22, 2022

Authors

Affiliations

Bo Xiao [email protected]
Research Assistant Professor, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon, Hong Kong. ORCID: https://orcid.org/0000-0003-0798-8018. Email: [email protected]
Yiheng Wang [email protected]
Ph.D. Student, Dept. of Civil and Environmental Engineering, Univ. of Alberta, Edmonton, AB, Canada T6G 2R3. Email: [email protected]
Shih-Chung Kang [email protected]
Professor, Dept. of Civil and Environmental Engineering, Univ. of Alberta, Edmonton, AB, Canada T6G 2R3 (corresponding author). Email: [email protected]

Cited by

  • Bi-Directional Image-to-Text Mapping for NLP-Based Schedule Generation and Computer Vision Progress Monitoring, Construction Research Congress 2024, 10.1061/9780784485262.084, (826-835), (2024).
