Technical Papers
Aug 14, 2021

Comparing Natural Language Processing Methods to Cluster Construction Schedules

Publication: Journal of Construction Engineering and Management
Volume 147, Issue 10

Abstract

The names of construction activities are the only unstructured data attribute in construction schedules, and they often guide construction execution. Activity names are devised for communication between stakeholders, so repetitive activities are often named with inconsistent terminology and with contextual information omitted. This presents a challenge for machine learning systems that learn patterns from construction schedules. This paper compared the performance of state-of-the-art text clustering methods in identifying repetitive activities. A ground truth data set was created on the basis of the standard construction work classification, and the precision, recall, and F1 score of latent semantic analysis (LSA), latent Dirichlet allocation (LDA), word2vec, and fastText were then compared for grouping activity names in 27 construction schedules. Results indicated that the F1 score of LSA outperformed LDA (0.84% versus 0.88%), whereas the results of language model–based clustering depended on the quality of the word embedding and the paired clustering method. This study provides insight into how to preprocess activity names of construction schedules for further artificial intelligence (AI)-based quantitative analysis. The methodologies described in this study will help researchers who work on natural language–related research in construction (e.g., safety and contract management) to better capture the features of words, rather than only counting word frequencies.
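The evaluation described above scores a grouping of activity names against ground truth labels using precision, recall, and F1. The abstract does not state how these were counted, so the following is only a minimal sketch under one common assumption: pairwise counting, where a pair of activities is a true positive if both the clustering and the ground truth place them together. The function name `pairwise_scores`, the activity names, and the labels are all hypothetical and are not taken from the study's data.

```python
from itertools import combinations


def pairwise_scores(truth, predicted):
    """Return pairwise (precision, recall, F1) of a clustering.

    Assumption: pair counting. Every pair of items is checked for whether
    the ground truth and the predicted clustering put the two items in
    the same group.
    """
    if len(truth) != len(predicted):
        raise ValueError("label lists must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_truth:
            tp += 1  # grouped together, and correctly so
        elif same_pred:
            fp += 1  # grouped together, but should be apart
        elif same_truth:
            fn += 1  # kept apart, but should be together

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical activity names with inconsistent wording across repetitions.
activities = [
    "Pour concrete slab L1",
    "Concrete pour - level 2 slab",
    "Install windows L1",
    "Window installation level 2",
    "Erect scaffold",
]
# Hypothetical ground truth classes and one hypothetical clustering result.
truth = ["concrete", "concrete", "windows", "windows", "scaffold"]
predicted = [0, 0, 1, 2, 2]

precision, recall, f1 = pairwise_scores(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Cluster identifiers need not match the ground truth class names, because only co-membership of pairs is compared. The same function could score the output of any of the methods named above once each activity name has been assigned a cluster.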

Data Availability Statement

Some or all data, models, or code generated or used during the study are available in a repository (online at https://doi.org/10.17863/CAM.63898) in accordance with funder data retention policies.

Acknowledgments

We thank Kier for sharing experiences and knowledge in scheduling practice and using scheduling software. We thank nPlan for sharing the data parsing prototype and valuable discussion about machine learning techniques. The presented work was based on research funded by InnovateUK (Project reference: 104795).

Published In

Journal of Construction Engineering and Management
Volume 147, Issue 10, October 2021

History

Received: Jan 28, 2021
Accepted: Jun 22, 2021
Published online: Aug 14, 2021
Published in print: Oct 1, 2021
Discussion open until: Jan 14, 2022

Authors

Affiliations

Ying Hong, M.ASCE
Research Associate, Dept. of Engineering, Univ. of Cambridge, Cambridge CB3 0FA, UK (corresponding author).
Haiyan Xie, M.ASCE
Professor, Dept. of Technology, Illinois State Univ., Normal, IL 61761.
Gary Bhumbra
Research Scientist, Dept. of Neuroscience, Physiology, and Pharmacology, Univ. College London, London WC1E 6BT, UK.
Ioannis Brilakis, M.ASCE
Laing O’Rourke Reader, Dept. of Engineering, Univ. of Cambridge, Cambridge CB3 0FA, UK.

Cited by

  • Uncovering Critical Causes of Highway Work Zone Accidents Using Unsupervised Machine Learning and Social Network Analysis, Journal of Construction Engineering and Management, 10.1061/JCEMD4.COENG-13952, 150, 3, (2024).
  • Graph-Based Automated Construction Scheduling without the Use of BIM, Journal of Construction Engineering and Management, 10.1061/JCEMD4.COENG-12687, 149, 2, (2023).
  • Organisational Factors of Artificial Intelligence Adoption in the South African Construction Industry, Frontiers in Built Environment, 10.3389/fbuil.2022.823998, 8, (2022).
  • A Novel and Intelligent Safety-Hazard Classification Method with Syntactic and Semantic Features for Large-Scale Construction Projects, Journal of Construction Engineering and Management, 10.1061/(ASCE)CO.1943-7862.0002382, 148, 10, (2022).
  • Natural language processing for smart construction: Current status and future directions, Automation in Construction, 10.1016/j.autcon.2021.104059, 134, (104059), (2022).
  • Application of NLP-based topic modeling to analyse unstructured text data in annual reports of construction contracting companies, CSI Transactions on ICT, 10.1007/s40012-022-00355-w, 10, 2, (97-106), (2022).
  • Towards the Development of a Budget Categorisation Machine Learning Tool: A Review, Trends on Construction in the Digital Era, 10.1007/978-3-031-20241-4_8, (101-110), (2022).
