Technical Papers
Nov 22, 2023

Semisupervised Clustering Approach for Pipe Failure Prediction with Imbalanced Data Set

Publication: Journal of Water Resources Planning and Management
Volume 150, Issue 2

Abstract

In recent years, machine learning (ML) approaches have been widely used for water pipe condition assessment and failure prediction. These methods require a considerable amount of data from water distribution networks (WDNs). Imbalanced and missing data, whether asset or failure data, compromise a model’s prediction performance. In this research, using only 2 years of failure data from a real WDN, three ML methods (XGBoost, random forest, and logistic regression) were used to prioritize asset rehabilitation. To address the issue of imbalanced data, a novel semisupervised clustering method is proposed that combines domain knowledge with unsupervised learning to divide the data set into homogeneous categories and enhance classification accuracy. The proposed approach performed better than well-known class imbalance treatment techniques from data science. Furthermore, analysis of the results indicated that classification evaluation metrics struggled to assess the practical effectiveness of the various methods. To address this, an economic indicator is proposed to rank the pipes for rehabilitation based on their cost and likelihood of failure (LoF). Preventive maintenance guided by the economic indicator reduces the number of failures at a small fraction of the total replacement cost. Moreover, another indicator was developed to consider the consequence of failures and the LoF simultaneously. This indicator cost-effectively mitigates the flow capacity reductions in WDNs caused by failures. The results of this study provide asset managers with a powerful tool to prioritize assets for rehabilitation.
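The idea of combining domain knowledge with unsupervised learning can be illustrated with a minimal Python sketch. The abstract does not describe the actual procedure, so everything here is an assumption made for illustration: the domain rule (splitting by pipe material), the clustering feature (pipe age), the tiny one-dimensional k-means, and all field and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means; returns a cluster index for each value."""
    lo, hi = min(values), max(values)
    # Spread the initial centres evenly over the observed range (deterministic).
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Move each centre to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = mean(members)
    return labels


def semisupervised_clusters(pipes, k=2):
    """Split by a domain rule (material), then cluster each group by age."""
    by_material = defaultdict(list)
    for pipe in pipes:
        by_material[pipe["material"]].append(pipe)  # assumed expert rule

    assignment = {}
    for material, group in by_material.items():
        ages = [p["age"] for p in group]
        n_clusters = min(k, len(set(ages)))  # never ask for more clusters than distinct ages
        labels = kmeans_1d(ages, n_clusters)
        for p, label in zip(group, labels):
            assignment[p["id"]] = (material, label)
    return assignment


# Hypothetical pipe records, not data from the study.
pipes = [
    {"id": "P1", "material": "cast_iron", "age": 80},
    {"id": "P2", "material": "cast_iron", "age": 75},
    {"id": "P3", "material": "cast_iron", "age": 20},
    {"id": "P4", "material": "pvc", "age": 30},
    {"id": "P5", "material": "pvc", "age": 5},
]
clusters = semisupervised_clusters(pipes)
```

In a workflow of this kind, a separate classifier would then be trained on the pipes in each resulting cluster, so that each model sees a more homogeneous subset of the network.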

Practical Applications

In recent years, machine learning algorithms have gained popularity for assessing water pipe conditions and predicting failures. However, their effectiveness relies on substantial data from water distribution networks (WDNs), and limited or imbalanced data reduce prediction accuracy. This study focused on a specific WDN with only 2 years of failure data, aiming to identify priority assets for rehabilitation. Three ML methods (XGBoost, random forest, and logistic regression) and a novel semisupervised clustering approach were employed. The clustering approach combines expert knowledge with unsupervised learning to divide the data into homogeneous clusters, and applying the ML algorithms within these clusters notably improved predictive accuracy. Two novel metrics were introduced for prioritizing pipe rehabilitation: one combining failure likelihood and replacement costs, and the other evaluating pipes based on their significance within the WDN and associated rehabilitation expenses. These models enable asset managers to optimize the allocation of pipe replacement budgets and enhance network performance.
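As a rough illustration of the first metric, the sketch below ranks pipes by failure likelihood per unit of replacement cost and then selects pipes greedily within a budget. The exact form of the study's indicator is not given on this page, so the ratio used here, the greedy budget rule, and all names and numbers are assumptions.

```python
def rank_by_economic_indicator(pipes):
    """Order pipes by likelihood of failure per unit replacement cost (assumed form)."""
    return sorted(pipes, key=lambda p: p["lof"] / p["cost"], reverse=True)


def select_within_budget(ranked, budget):
    """Walk the ranking and keep every pipe that still fits in the budget."""
    chosen, spent = [], 0.0
    for pipe in ranked:
        if spent + pipe["cost"] <= budget:
            chosen.append(pipe["id"])
            spent += pipe["cost"]
    return chosen, spent


# Hypothetical pipes: likelihood of failure (LoF) and replacement cost.
pipes = [
    {"id": "A", "lof": 0.30, "cost": 10000.0},
    {"id": "B", "lof": 0.10, "cost": 2000.0},
    {"id": "C", "lof": 0.05, "cost": 5000.0},
]
ranked = rank_by_economic_indicator(pipes)
chosen, spent = select_within_budget(ranked, budget=12000.0)
```

With these made-up values, pipe B ranks first despite its lower LoF because it is cheap to replace, which is the kind of trade-off a cost-aware ranking is meant to expose.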

Data Availability Statement

Some or all data, models, or code generated or used during the study are proprietary or confidential in nature and may only be provided with restrictions. All case study data are owned by the utility company and are subject to a nondisclosure agreement (NDA), thereby limiting their availability for public dissemination. Requests for noncommercial use of the scripts will be evaluated on a case-by-case basis.

Acknowledgments

This research has been funded by Datatecnics Corporation Limited and UK Research and Innovation (UKRI), under Knowledge Transfer Partnership (KTP) InnovateUK with Grant number 12418.

Information & Authors

Information

Published In

Journal of Water Resources Planning and Management
Volume 150, Issue 2, February 2024

History

Received: May 26, 2023
Accepted: Sep 21, 2023
Published online: Nov 22, 2023
Published in print: Feb 1, 2024
Discussion open until: Apr 22, 2024

Authors

Affiliations

Ph.D. Candidate, Centre for Water Systems, Univ. of Exeter, North Park Rd., Exeter EX4 4PY, UK (corresponding author). ORCID: https://orcid.org/0000-0003-2814-9838.
Postdoctoral Researcher, Centre for Water Systems, Univ. of Exeter, North Park Rd., Exeter EX4 4PY, UK. ORCID: https://orcid.org/0000-0002-5275-3587
Akbar A. Javadi
Professor, Centre for Water Systems, Univ. of Exeter, North Park Rd., Exeter EX4 4PY, UK.
Raziyeh Farmani
Professor, Centre for Water Systems, Univ. of Exeter, North Park Rd., Exeter EX4 4PY, UK.

Metrics & Citations

Cited by

  • Efficacy of Tree-Based Models for Pipe Failure Prediction and Condition Assessment: A Comprehensive Review, Journal of Water Resources Planning and Management, 10.1061/JWRMD5.WRENG-6334, 150, 7, (2024).
