Technical Papers
Mar 25, 2024

Cross-Lingual Information Retrieval from Multilingual Construction Documents Using Pretrained Language Models

Publication: Journal of Construction Engineering and Management
Volume 150, Issue 6

Abstract

The growth of the global construction market has attracted international companies to participate in overseas projects. Overseas projects are extremely dynamic with numerous uncertainties, raising the need to collect information about construction in host countries. Because the construction industry produces vast amounts of text data, an automated method, specifically information retrieval, is required to find the necessary information. Previous studies have suggested automated methods to review various construction documents. However, these studies required substantial manual effort and mainly focused on a single language, so vital information buried in documents written in the host country’s language was lost. To address these limitations, this study proposes a cross-lingual information retrieval (CLIR) framework using pretrained Bidirectional Encoder Representations from Transformers (BERT) models to retrieve information from multilingual construction documents. The proposed framework employs three types of language models (monolingual, multilingual, and cross-lingual) and trains them on a construction data set to improve their handling of construction-specific text. The framework achieved reliable retrieval performance, even with minimal additional training on domain-specific data. The results indicate that training on the domain data set improves retrieval, increasing the mean reciprocal rank of a specific task by up to 0.2128. By pairing a monolingual model with machine translation, CLIR in a specific domain could be performed effectively without the need for a labeled data set. The suggested CLIR framework offers a practical alternative for dealing with construction documents in overseas projects, reducing time and cost while improving risk identification and mitigation.
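The abstract reports its gain in mean reciprocal rank (MRR). As a reading aid, the following is a minimal sketch of the standard MRR definition: for each query, take 1 divided by the rank of the first relevant document (0 if none is retrieved), then average over all queries. It is not the authors' evaluation code; the function names and the toy document IDs are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    runs = list(runs)
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical toy data: three queries with made-up document IDs.
toy_runs = [
    (["d3", "d1", "d7"], {"d3"}),  # first relevant hit at rank 1 -> 1.0
    (["d2", "d5", "d9"], {"d5"}),  # first relevant hit at rank 2 -> 0.5
    (["d4", "d6", "d8"], {"d0"}),  # relevant document not retrieved -> 0.0
]
toy_mrr = mean_reciprocal_rank(toy_runs)
print(f"MRR = {toy_mrr:.4f}")  # prints "MRR = 0.5000"
```

Because MRR lies between 0 and 1, the reported increase of up to 0.2128 is an absolute change on that scale.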


Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request (fine-tuned monolingual, multilingual, and cross-lingual models, and Python scripts).

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) (No. RS-2023-00241758). This research was also supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008475, the Competency Development Program for Industry Specialist).


Published In

Journal of Construction Engineering and Management
Volume 150, Issue 6, June 2024

History

Received: Jul 17, 2023
Accepted: Jan 2, 2024
Published online: Mar 25, 2024
Published in print: Jun 1, 2024
Discussion open until: Aug 25, 2024


Authors

Affiliations

Jungyeon Kim
Integrated M.S. and Ph.D. Program Student, Dept. of Civil and Environmental Engineering, Seoul National Univ., Seoul 08826, Republic of Korea.
Ph.D. Candidate, Dept. of Civil and Environmental Engineering, Seoul National Univ., Seoul 08826, Republic of Korea. ORCID: https://orcid.org/0000-0002-9256-5548.
Professor, Dept. of Civil and Environmental Engineering, Seoul National Univ., Seoul 08826, Republic of Korea; Adjunct Professor, Institute of Construction and Environmental Engineering, Seoul National Univ., Seoul 08826, Republic of Korea (corresponding author). ORCID: https://orcid.org/0000-0002-0409-5268.
