Analysis of the Influences of Sampling Bias and Class Imbalance on Performances of Probabilistic Liquefaction Models
Publication: International Journal of Geomechanics
Volume 17, Issue 6
Abstract
Sampling bias and class imbalance are important sources of model uncertainty that significantly affect the predicted probabilities of classification models. This study analyzed the influences of sampling bias and class imbalance on the performance of four common methods for seismic liquefaction prediction, implemented in 10 models: Bayesian network (BN), artificial neural network (ANN), logistic regression (LR), and support vector machine (SVM). The controlled experiments were based on standard penetration test (SPT) data from 350 case histories. The data were divided into two data sets with class distributions of 150:150 and 200:100, each of which was stratified and sampled to obtain 11 different class distributions (10:90, 20:80, 25:75, 33:67, 40:60, 50:50, 60:40, 67:33, 75:25, 80:20, and 90:10). The predictive performance of the models was quantified using statistical model validation metrics, such as overall accuracy, area under the receiver operating characteristic curve, precision, recall, and F-score. The experiments show that the best distribution of liquefaction samples for training is not a fixed point but a range. The authors suggest that the best range of the sample ratio (liquefaction/nonliquefaction) is from 1 to 1.5 for the BN method, from 0.67 to 1 for the ANN method, approximately 0.5 for the LR model, and from 0.5 to 1 for the SVM method. Furthermore, oversampling was used to try to improve the predictive capability of four models for two samples (10:90 and 90:10) with severe class imbalance and sampling bias. Relative to the original samples, the improvement in predictive performance from oversampling was considerable for the LR model and the SVM polynomial (SVM-Pol) model but not for the BN maximum likelihood estimation (BN-MLE) model and the ANN radial basis function (ANN-RBF) model. In addition, in fields where the true class distribution of the population is unknown, when a training sample has severe class imbalance or sampling bias, the authors recommend that researchers choose an oversampled sample with the same class distribution as the population of the collected data to ensure optimal performance.
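The resampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure, which the abstract does not specify in detail. The sketch assumes simple random sampling without replacement within each class to reach a target class distribution, and random oversampling with replacement to rebalance a skewed sample. The function names, the sample size of 100, and the placeholder labels are all hypothetical and chosen only for illustration. The area under the receiver operating characteristic curve is omitted to keep the example short.

```python
import random

# The 11 training distributions listed in the abstract, written as the
# percentage of liquefaction cases (10 means 10:90 liquefaction:nonliquefaction).
DISTRIBUTIONS = [10, 20, 25, 33, 40, 50, 60, 67, 75, 80, 90]


def _split(labels, indices):
    """Split indices into liquefaction (label 1) and nonliquefaction (label 0)."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    return pos, neg


def stratified_sample(labels, pos_pct, n_total, seed=0):
    """Draw n_total indices without replacement; pos_pct percent are label 1."""
    rng = random.Random(seed)
    pos, neg = _split(labels, range(len(labels)))
    n_pos = round(n_total * pos_pct / 100)
    n_neg = n_total - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("pool too small for this class distribution")
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)


def oversample(labels, indices, pos_pct, seed=0):
    """Duplicate random cases of the short class until pos_pct percent are label 1."""
    rng = random.Random(seed)
    pos, neg = _split(labels, indices)
    if not pos or not neg or not 0 < pos_pct < 100:
        raise ValueError("need both classes and 0 < pos_pct < 100")
    want_pos = round(pos_pct * len(neg) / (100 - pos_pct))
    if want_pos >= len(pos):
        # Liquefaction cases are under-represented: add copies of them.
        extra = rng.choices(pos, k=want_pos - len(pos))
    else:
        # Nonliquefaction cases are under-represented: add copies of them.
        want_neg = round((100 - pos_pct) * len(pos) / pos_pct)
        extra = rng.choices(neg, k=max(0, want_neg - len(neg)))
    return list(indices) + extra


def classification_metrics(y_true, y_pred):
    """Overall accuracy, precision, recall, and F-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Placeholder labels only (not the paper's case histories): a 150:150 pool.
labels = [1] * 150 + [0] * 150
for pct in DISTRIBUTIONS:
    train = stratified_sample(labels, pct, n_total=100)
    n_liq = sum(labels[i] for i in train)
    print(f"{pct}:{100 - pct} -> {n_liq} liquefaction, {len(train) - n_liq} nonliquefaction")
```

Under these assumptions, a 10:90 sample passed to `oversample(labels, train, 50)` is rebalanced to 50:50 by duplicating liquefaction cases, and any classifier's predictions can then be scored with `classification_metrics`. Random duplication is only one of several oversampling schemes, and the choice here is an assumption rather than a statement about the study.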
Acknowledgments
The work presented in this paper is part of research sponsored by the National Science Council of the People’s Republic of China under Grant 2011CB013605-2.
Copyright
© 2016 American Society of Civil Engineers.
History
Received: Dec 3, 2015
Accepted: Aug 22, 2016
Published online: Nov 8, 2016
Discussion open until: Apr 8, 2017
Published in print: Jun 1, 2017