Technical Papers
Nov 16, 2022

Investigating the Role of Clustering in Construction-Accident Severity Prediction Using a Heterogeneous and Imbalanced Data Set

Publication: Journal of Construction Engineering and Management
Volume 149, Issue 2

Abstract

Despite remarkable advances, construction remains among the most hazardous industries, and its accidents occur at different severity levels. Construction accident data sets are available for analysis, but they suffer from heterogeneity and class imbalance. Heterogeneity results from the many complexities and uncertainties in construction projects, and it leads to poor predictive performance of machine learning algorithms. Class imbalance arises because accident severities are unequally distributed, which biases prediction results. This study aimed to assess the impact of clustering on construction accident analysis when a data set is heterogeneous and imbalanced, and to take a step toward making incidents more predictable. Accidents were predicted following four data preparation approaches: unmodified, balanced, clustered, and clustered + balanced. The k-means clustering algorithm was adopted to split the data into homogeneous clusters. Synthetic minority oversampling technique (SMOTE) and k-means SMOTE (KMSMOTE) were used to overcome the class imbalance issue. Five supervised machine learning algorithms were employed for the prediction process: classification and regression tree (CART), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and artificial neural network (ANN). The results indicated that clustering significantly improved the predictive performance of the algorithms. Combining clustering with oversampling was the most appropriate approach for analyzing accidents, providing more accurate and reliable predictions. Applying this approach improved average precision, recall, and F1-score by about 33%, 23%, and 33%, respectively. Moreover, the ensemble learning classifiers used, RF and XGB, outperformed the other models. Ultimately, this research can help safety professionals predict outcomes more accurately and take more appropriate safety measures.
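
The clustered + balanced preparation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain Lloyd's k-means followed by a basic SMOTE-style interpolation inside each cluster, uses only the Python standard library, and runs on invented toy data. The function names (`kmeans`, `smote`, `cluster_then_balance`) and the parameters `n_clusters`, `k`, and `seed` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        if not neighbors:  # a lone sample cannot be interpolated; duplicate it
            synthetic.append(base)
            continue
        mate = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def cluster_then_balance(X, y, n_clusters=2, seed=0):
    """Cluster X, then oversample minority classes inside each cluster."""
    labels = kmeans(X, n_clusters, seed=seed)
    balanced = {}
    for c in range(n_clusters):
        by_class = defaultdict(list)
        for p, cls, lab in zip(X, y, labels):
            if lab == c:
                by_class[cls].append(p)
        if not by_class:  # skip clusters that ended up empty
            continue
        target = max(len(pts) for pts in by_class.values())
        Xc, yc = [], []
        for cls, pts in by_class.items():
            pts = pts + smote(pts, target - len(pts), seed=seed)
            Xc += pts
            yc += [cls] * len(pts)
        balanced[c] = (Xc, yc)
    return balanced


# Invented toy data: two separated groups, each with few "severe" cases.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (0.9, 0.1), (0.4, 0.6),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5), (10.2, 10.8)]
y = ["minor"] * 6 + ["severe"] * 2 + ["minor"] * 4 + ["severe"] * 2

balanced = cluster_then_balance(X, y, n_clusters=2)
for cid, (Xc, yc) in sorted(balanced.items()):
    print(cid, dict(Counter(yc)))
```

In a workflow of this kind, a separate classifier would then be trained on each balanced cluster, and oversampling would be applied to training folds only so that synthetic points do not leak into the evaluation data.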

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

History

Received: Jan 7, 2022
Accepted: Jul 18, 2022
Published online: Nov 16, 2022
Published in print: Feb 1, 2023
Discussion open until: Apr 16, 2023

Authors

Affiliations

Dept. of Civil Engineering, Ferdowsi Univ. of Mashhad, Mashhad 93, Iran. ORCID: https://orcid.org/0000-0003-3073-7671.
Hossein Etemadfard
Assistant Professor, Dept. of Civil Engineering, Ferdowsi Univ. of Mashhad, Mashhad 93, Iran (corresponding author).
Ali Rahimzadegan
Dept. of Civil Engineering, Ferdowsi Univ. of Mashhad, Mashhad 93, Iran.
Professor, Dept. of Civil Engineering, Ferdowsi Univ. of Mashhad, Mashhad 93, Iran. ORCID: https://orcid.org/0000-0002-1667-7812.

Cited by

  • Predicting Safety Accident Costs in Construction Projects Using Ensemble Data-Driven Models, Journal of Construction Engineering and Management, 10.1061/JCEMD4.COENG-14397, 150, 7, (2024).
  • A Data-Driven Recommendation System for Construction Safety Risk Assessment, Journal of Construction Engineering and Management, 10.1061/JCEMD4.COENG-13437, 149, 12, (2023).
