Technical Papers
Apr 22, 2022

Deep Learning Image Captioning in Construction Management: A Feasibility Study

Publication: Journal of Construction Engineering and Management
Volume 148, Issue 7

Abstract

Deep learning image captioning methods can generate one or more natural-language sentences that describe the contents of construction images. By deconstructing these sentences, construction object and activity information can be retrieved together for automated scene analysis. However, the feasibility of deep learning image captioning in construction remains unclear. To fill this gap, this research investigates the feasibility of deep learning image captioning methods in construction management. First, a linguistic schema for annotating construction machine images was established, and a captioning data set was developed. Then, six deep learning image captioning methods from the computer vision community were selected and tested on the construction captioning data set. In the sentence-level evaluation, the transformer-self-critical sequence training (Tsfm-SCST) method obtained the best performance of the six methods, with a bilingual evaluation understudy (BLEU)-1 score of 0.606, BLEU-2 of 0.506, BLEU-3 of 0.427, BLEU-4 of 0.349, metric for evaluation of translation with explicit ordering (METEOR) of 0.287, recall-oriented understudy for gisting evaluation (ROUGE) of 0.585, consensus-based image description evaluation (CIDEr) of 1.715, and semantic propositional image caption evaluation (SPICE) of 0.422. In the element-level evaluation, the Tsfm-SCST method achieved an average precision of 91.1%, recall of 83.3%, and F1 score of 86.6% for recognizing construction machine objects from the deconstructed sentences. This research indicates that deep learning image captioning is feasible as a method of generating accurate and precise text descriptions from construction images, with potential applications in construction scene analysis and image documentation.
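The full methodology is behind the access wall, but the two evaluation levels summarized in the abstract are straightforward to prototype. The sketch below is a minimal illustration, not the authors' released code: it computes sentence-level BLEU-1 through BLEU-4 with NLTK and element-level precision, recall, and F1 by deconstructing a generated caption into machine objects. The example captions, the extract_machines helper, and the machine vocabulary are hypothetical placeholders.

```python
# Minimal sketch of the two evaluation levels described in the abstract.
# Not the authors' code: the captions, extract_machines helper, and
# machine vocabulary below are illustrative placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# --- Sentence-level evaluation: BLEU-1 to BLEU-4 for one generated caption ---
reference = "an excavator is digging soil next to a dump truck".split()
candidate = "an excavator digs soil beside a dump truck".split()
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")

# --- Element-level evaluation: precision/recall/F1 for machine objects ---
def extract_machines(caption, vocabulary):
    """Stand-in for the sentence-deconstruction step: pull machine terms out of a caption."""
    return {term for term in vocabulary if term in caption}

MACHINE_VOCAB = {"excavator", "dump truck", "loader", "bulldozer", "grader"}
predicted = extract_machines(" ".join(candidate), MACHINE_VOCAB)
ground_truth = {"excavator", "dump truck"}

tp = len(predicted & ground_truth)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(ground_truth) if ground_truth else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

In the paper's sentence-level evaluation, METEOR, ROUGE, CIDEr, and SPICE accompany BLEU; in practice those metrics are usually computed with the COCO caption evaluation toolkit rather than NLTK.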

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request (e.g., captioning data sets and models).

Acknowledgments

The authors would like to thank Mr. Zicong Huang and Ms. Dilyara Tulegenova for assisting with the development of the image captioning data set.

References

Anderson, P., B. Fernando, M. Johnson, and S. Gould. 2016. “SPICE: Semantic propositional image caption evaluation.” In Proc., European Conf. on Computer Vision, 382–398. New York: Springer.
Balali, V., and M. Golparvar-Fard. 2015. “Segmentation and recognition of roadway assets from car-mounted camera video streams using a scalable non-parametric image parsing method.” Autom. Constr. 49 (Jan): 27–39. https://doi.org/10.1016/j.autcon.2014.09.007.
Bang, S., and H. Kim. 2020. “Context-based information generation for managing UAV-acquired data using image captioning.” Autom. Constr. 112 (Apr): 103116. https://doi.org/10.1016/j.autcon.2020.103116.
Chen, C., Z. Zhu, and A. Hammad. 2020. “Automated excavators activity recognition and productivity analysis from construction site surveillance videos.” Autom. Constr. 110 (Feb): 103045. https://doi.org/10.1016/j.autcon.2019.103045.
Lin, C.-Y. 2004. “ROUGE: A package for automatic evaluation of summaries.” In Text summarization branches out, 74–81. Cambridge, MA: Association for Computational Linguistics.
Chun, P., T. Yamane, and Y. Maemura. 2021. “A deep learning-based image captioning method to automatically generate comprehensive explanations of bridge damage.” Comput.-Aided Civ. Infrastruct. Eng. 1–15. https://doi.org/10.1111/mice.12793.
Fang, W., L. Ma, P. E. D. Love, H. Luo, L. Ding, and A. Zhou. 2020. “Knowledge graph for identifying hazards on construction sites: Integrating computer vision with ontology.” Autom. Constr. 119 (Nov): 103310. https://doi.org/10.1016/j.autcon.2020.103310.
Gao, L., X. Li, J. Song, and H. T. Shen. 2019. “Hierarchical LSTMs with adaptive attention for visual captioning.” IEEE Trans. Pattern Anal. Mach. Intell. 42 (5): 1112–1131. https://doi.org/10.1109/TPAMI.2019.2894139.
Golparvar-Fard, M., A. Heydarian, and J. C. Niebles. 2013. “Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers.” Adv. Eng. Inf. 27 (4): 652–663. https://doi.org/10.1016/j.aei.2013.09.001.
Hara, K., H. Kataoka, and Y. Satoh. 2018. “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?” In Proc., IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 6546–6555. New York: IEEE.
He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep residual learning for image recognition.” In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 770–778. New York: IEEE.
Hossain, M. Z., F. Sohel, M. F. Shiratuddin, and H. Laga. 2019. “A comprehensive survey of deep learning for image captioning.” ACM Comput. Surv. 51 (6): 1–36. https://doi.org/10.1145/3295748.
Huang, L., W. Wang, J. Chen, and X.-Y. Wei. 2019. “Attention on attention for image captioning.” In Proc., IEEE/CVF Int. Conf. on Computer Vision, 4633–4642. New York: IEEE.
Johnson, J., A. Karpathy, and L. Fei-Fei. 2015. “DenseCap: Fully convolutional localization networks for dense captioning.” In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 4565–4574. New York: IEEE.
Kim, H., H. Kim, Y. W. Hong, and H. Byun. 2018. “Detecting construction equipment using a region-based fully convolutional network and transfer learning.” J. Comput. Civ. Eng. 32 (2): 04017082. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000731.
Kim, H., K. Kim, and H. Kim. 2016. “Data-driven scene parsing method for recognizing construction site objects in the whole image.” Autom. Constr. 71 (Nov): 271–282. https://doi.org/10.1016/j.autcon.2016.08.018.
Kolar, Z., H. Chen, and X. Luo. 2018. “Transfer learning and deep convolutional neural networks for safety guardrail detection in 2D images.” Autom. Constr. 89 (May): 58–70. https://doi.org/10.1016/j.autcon.2018.01.003.
Konstantinou, E., J. Lasenby, and I. Brilakis. 2019. “Adaptive computer vision-based 2D tracking of workers in complex environments.” Autom. Constr. 103 (Jul): 168–184. https://doi.org/10.1016/j.autcon.2019.01.018.
Kulchandani, J. S., and K. J. Dangarwala. 2015. “Moving object detection: Review of recent research trends.” In Proc., Int. Conf. on Pervasive Computing, 1–5. New York: IEEE.
Lavie, A., and A. Agarwal. 2007. “METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments.” In Proc., 2nd Workshop on Statistical Machine Translation, 228–231. Cambridge, MA: MIT Press.
LeCun, Y., Y. Bengio, and G. Hinton. 2015. “Deep learning.” Nature 521 (7553): 436–444. https://doi.org/10.1038/nature14539.
Lin, T.-Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. “Microsoft COCO: Common objects in context.” In Proc., European Conf. on Computer Vision, 740–755. New York: Springer.
Liu, H., G. Wang, T. Huang, P. He, M. Skitmore, and X. Luo. 2020. “Manifesting construction activity scenes via image captioning.” Autom. Constr. 119 (Nov): 103334. https://doi.org/10.1016/j.autcon.2020.103334.
Lu, J., J. Yang, D. Batra, and D. Parikh. 2018. “Neural baby talk.” Preprint, submitted March 27, 2018. http://arxiv.org/abs/1803.09845.
Luo, X., H. Li, D. Cao, F. Dai, J. Seo, and S. Lee. 2018a. “Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks.” J. Comput. Civ. Eng. 32 (3): 04018012. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000756.
Luo, X., H. Li, D. Cao, Y. Yu, X. Yang, and T. Huang. 2018b. “Towards efficient and objective work sampling: Recognizing workers’ activities in site surveillance videos with two-stream convolutional networks.” Autom. Constr. 94 (Oct): 360–370. https://doi.org/10.1016/j.autcon.2018.07.011.
Mao, J., W. Xu, Y. Yang, J. Wang, and A. L. Yuille. 2014. “Explain images with multimodal recurrent neural networks.” Preprint, submitted October 4, 2014. http://arxiv.org/abs/1410.1090.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. “BLEU: A method for automatic evaluation of machine translation.” In Proc., 40th Annual Meeting of the Association for Computational Linguistics, 311. Cambridge, MA: MIT Press.
Pour Rahimian, F., S. Seyedzadeh, S. Oliver, S. Rodriguez, and N. Dawood. 2020. “On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning.” Autom. Constr. 110 (Feb): 103012. https://doi.org/10.1016/j.autcon.2019.103012.
Redmon, J., and A. Farhadi. 2018. “YOLOv3: An incremental improvement.” Preprint, submitted April 8, 2018. http://arxiv.org/abs/1804.02767.
Rennie, S. J., E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. 2017. “Self-critical sequence training for image captioning.” In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 1179–1195. New York: IEEE.
Roberts, D., and M. Golparvar-Fard. 2019. “End-to-end vision-based detection, tracking and activity analysis of earthmoving equipment filmed at ground level.” Autom. Constr. 105 (Sep): 102811. https://doi.org/10.1016/j.autcon.2019.04.006.
Son, H., H. Seong, H. Choi, and C. Kim. 2019. “Real-time vision-based warning system for prevention of collisions between workers and heavy equipment.” J. Comput. Civ. Eng. 33 (5): 1–14. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000845.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. “Attention is all you need.” Preprint, submitted June 12, 2017. http://arxiv.org/abs/1706.03762.
Vedantam, R., C. L. Zitnick, and D. Parikh. 2014. “CIDEr: Consensus-based image description evaluation.” Accessed October 3, 2021. http://arxiv.org/abs/1411.5726.
Vig, J. 2019. “A multiscale visualization of attention in the transformer model.” Accessed October 3, 2021. http://arxiv.org/abs/1906.05714.
Vinyals, O., A. Toshev, S. Bengio, and D. Erhan. 2017. “Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge.” IEEE Trans. Pattern Anal. Mach. Intell. 39 (4): 652–663. https://doi.org/10.1109/TPAMI.2016.2587640.
Wang, X., and Z. Zhu. 2021. “Vision-based hand signal recognition in construction: A feasibility study.” Autom. Constr. 125 (Feb): 103625. https://doi.org/10.1016/j.autcon.2021.103625.
Wang, Y., P.-C. Liao, C. Zhang, Y. Ren, X. Sun, and P. Tang. 2019. “Crowdsourced reliable labeling of safety-rule violations on images of complex construction scenes for advanced vision-based workplace safety.” Adv. Eng. Inf. 42 (Oct): 101001. https://doi.org/10.1016/j.aei.2019.101001.
Xiao, B., and S. Kang. 2021a. “Development of an image data set of construction machines for deep learning object detection.” J. Comput. Civ. Eng. 35 (2): 05020005. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000945.
Xiao, B., and S. Kang. 2021b. “Vision-based method integrating deep learning detection for tracking multiple construction machines.” J. Comput. Civ. Eng. 35 (2): 04020071. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000957.
Xiao, B., Q. Lin, and Y. Chen. 2021a. “A vision-based method for automatic tracking of construction machines at nighttime based on deep learning illumination enhancement.” Autom. Constr. 127 (Jul): 103721. https://doi.org/10.1016/j.autcon.2021.103721.
Xiao, B., Y. Zhang, Y. Chen, and X. Yin. 2021b. “A semi-supervised learning detection method for vision-based monitoring of construction sites by integrating teacher-student networks and data augmentation.” Adv. Eng. Inf. 50 (Oct): 101372. https://doi.org/10.1016/j.aei.2021.101372.
Xiao, B., and Z. Zhu. 2018. “Two-dimensional visual tracking in construction scenarios: A comparative study.” J. Comput. Civ. Eng. 32 (3): 04018006. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000738.
Xu, K., J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. “Show, attend and tell: Neural image caption generation with visual attention.” Accessed October 3, 2021. http://arxiv.org/abs/1502.03044.
Yang, J., M.-W. Park, P. A. Vela, and M. Golparvar-Fard. 2015. “Construction performance monitoring via still images, time-lapse photos, and video streams: Now, tomorrow, and the future.” Adv. Eng. Inf. 29 (2): 211–224. https://doi.org/10.1016/j.aei.2015.01.011.
Yang, J., Z. Shi, and Z. Wu. 2016. “Vision-based action recognition of construction workers using dense trajectories.” Adv. Eng. Inf. 30 (3): 327–336. https://doi.org/10.1016/j.aei.2016.04.009.
Yin, X., Y. Chen, A. Bouferguene, H. Zaman, M. Al-Hussein, and L. Kurach. 2020. “A deep learning-based framework for an automated defect detection system for sewer pipes.” Autom. Constr. 109 (Aug): 102967. https://doi.org/10.1016/j.autcon.2019.102967.
Zhang, B., I. Titov, and R. Sennrich. 2019. “Improving deep transformer with depth-scaled initialization and merged attention.” Accessed October 3, 2021. http://arxiv.org/abs/1908.11365.
Zhu, Z., X. Ren, and Z. Chen. 2017. “Integrated detection and tracking of workforce and equipment from construction jobsite videos.” Autom. Constr. 81 (Sep): 161–171. https://doi.org/10.1016/j.autcon.2017.05.005.

Information & Authors

Published In

Journal of Construction Engineering and Management
Volume 148, Issue 7, July 2022

History

Received: Oct 3, 2021
Accepted: Feb 22, 2022
Published online: Apr 22, 2022
Published in print: Jul 1, 2022
Discussion open until: Sep 22, 2022

Authors

Affiliations

Bo Xiao [email protected]
Research Assistant Professor, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon, Hong Kong. ORCID: https://orcid.org/0000-0003-0798-8018. Email: [email protected]
Yiheng Wang [email protected]
Ph.D. Student, Dept. of Civil and Environmental Engineering, Univ. of Alberta, Edmonton, AB, Canada T6G 2R3. Email: [email protected]
Shih-Chung Kang [email protected]
Professor, Dept. of Civil and Environmental Engineering, Univ. of Alberta, Edmonton, AB, Canada T6G 2R3 (corresponding author). Email: [email protected]

Cited by

  • Bi-Directional Image-to-Text Mapping for NLP-Based Schedule Generation and Computer Vision Progress Monitoring, Construction Research Congress 2024, 10.1061/9780784485262.084, (826-835), (2024).
