Technical Papers
May 14, 2024

Explainable Image Captioning to Identify Ergonomic Problems and Solutions for Construction Workers

Publication: Journal of Computing in Civil Engineering
Volume 38, Issue 4

Abstract

The high occurrence of work-related musculoskeletal disorders (WMSDs) in construction remains a pressing concern, causing numerous nonfatal injuries. Preventing WMSDs requires implementing an ergonomic process that encompasses identifying ergonomic problems and their corresponding solutions. Identifying ergonomic problems and solutions on active construction sites demands significant effort from personnel with ergonomics expertise, yet such experts and training programs are often lacking in construction. To address this issue, the authors applied deep learning (DL)–based explainable image captioning to identify ergonomic problems and their corresponding solutions from images, which are readily available on construction sites. To this end, the authors proposed a vision-language model (VLM), aided by data augmentation, that identifies ergonomic problems and their solutions. The bilingual evaluation understudy (BLEU) score was used to measure the similarity between the ergonomic problems and solutions identified by the proposed VLM and those specified in an ergonomic guideline. Tested on 222 real-site images, the proposed VLM achieved the highest BLEU-4 score (0.796), outperforming a traditional convolutional neural network-long short-term memory (CNN-LSTM) model and a state-of-the-art VLM, bootstrapping language-image pretraining (BLIP). In addition, the authors developed an explainability module that visualizes which areas of an image the proposed VLM focuses on when identifying ergonomic problems and which words are most important for identifying ergonomic solutions. The highest BLEU score and the visual explanations demonstrate the potential and credibility of the proposed VLM in identifying ergonomic problems and their solutions. The proposed VLM and explainability module can greatly contribute to implementing the ergonomic process in construction by identifying ergonomic problems and their solutions from site images alone.
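As a minimal illustration of the evaluation metric used above, the following Python sketch computes a BLEU-4 score between a model-generated caption and a guideline reference caption using NLTK; the example captions are hypothetical and are not drawn from the paper's dataset.

```python
# Minimal sketch of BLEU-4 scoring for a generated ergonomic caption,
# using NLTK's implementation of BLEU (Papineni et al. 2002).
# The example captions below are hypothetical, not from the paper's dataset.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ("the worker bends the back while lifting a heavy block; "
             "use a lifting aid or team lifting")
candidate = "the worker bends the back while lifting a block; use a lifting aid"

# Tokenize by whitespace; sentence_bleu expects a list of reference token lists.
refs = [reference.split()]
hyp = candidate.split()

# Equal weights over 1- to 4-grams give BLEU-4; smoothing avoids a zero score
# when some higher-order n-gram has no overlap with the reference.
bleu4 = sentence_bleu(refs, hyp,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```

A corpus-level score over all test images can be obtained analogously with NLTK's corpus_bleu, which aggregates n-gram statistics before computing the final score.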

Practical Applications

To prevent WMSDs, the National Institute for Occupational Safety and Health (NIOSH) recommends implementing an ergonomic process that encompasses ergonomic problem identification, ergonomic risk assessment, and ergonomic solution identification. Current practice on sites relies on intermittent, manual implementation of this process and thus often falls short of protecting workers against WMSDs, owing to rapidly changing site conditions and the lack of on-site ergonomic expertise. Many automated tools have been developed for ergonomic risk assessment, but none for ergonomic problem and solution identification. By combining our approach with these existing assessment tools, we aim to streamline the recommended ergonomic process in an automated manner. To this end, we propose a deep learning–based explainable image captioning model for automated ergonomic problem and solution identification. Using an ordinary camera (e.g., a smartphone or site surveillance camera), safety managers can easily identify ergonomic problems, assess risk levels, and identify corresponding solutions. Additionally, our model justifies its outputs by visualizing the reasoning behind the identified ergonomic problems and solutions. With such an easily accessible and trustworthy model, the on-site ergonomic process can be streamlined, potentially reducing workers’ WMSDs.
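As an illustration of this image-to-caption workflow, the following Python sketch captions a single site photo with an off-the-shelf BLIP checkpoint from the Hugging Face transformers library. This is not the authors' X-BLIPBART model (available only on request); the checkpoint name is a public captioning model and the image path is a placeholder.

```python
# Hedged sketch of a caption-based inspection workflow using an off-the-shelf
# BLIP captioning checkpoint (not the paper's X-BLIPBART, which is available
# from the authors on request). The image path is a placeholder; any photo
# from a smartphone or site surveillance camera would work.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"  # public captioning model
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("site_photo.jpg").convert("RGB")  # placeholder path

# Encode the image and generate a caption; a model fine-tuned as in the paper
# would instead emit guideline-style problem/solution statements.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```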


Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request, including all models (i.e., X-BLIPBART, BLIP, and CNN-LSTM).

Acknowledgments

This research was financially supported by VelocityEHS. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of VelocityEHS. SangHyun Lee and Meiyin Liu also work for VelocityEHS as machine learning scientists. The intellectual property stemming from this paper is licensed to VelocityEHS.

Disclaimer

The appearance of US Department of Defense (DoD) visual information does not imply or constitute DoD endorsement.


Information & Authors


Published In

Journal of Computing in Civil Engineering
Volume 38, Issue 4, July 2024

History

Received: Sep 26, 2023
Accepted: Jan 23, 2024
Published online: May 14, 2024
Published in print: Jul 1, 2024
Discussion open until: Oct 14, 2024


Authors

Affiliations

Gunwoo Yong, S.M.ASCE
Ph.D. Candidate, Dept. of Civil and Environmental Engineering, Univ. of Michigan, 2350 Hayward St., Ann Arbor, MI 48109. Email: [email protected]
Meiyin Liu, A.M.ASCE
Assistant Professor, Dept. of Civil and Environmental Engineering, Rutgers Univ., 500 Bartholomew Rd., Piscataway, NJ 08854. Email: [email protected]
SangHyun Lee, M.ASCE
Professor, Dept. of Civil and Environmental Engineering, Univ. of Michigan, 2350 Hayward St., Ann Arbor, MI 48109 (corresponding author). Email: [email protected]
