Automated Procedure to Assess Civil Infrastructure Data Quality: Method and Validation
Publication: Journal of Infrastructure Systems
Volume 11, Issue 3
Abstract
Monitoring data are collected to measure the condition, environment, usage, and performance of civil infrastructure. High-quality monitoring data are necessary for decision-support systems, design analysis, and research. However, little work has been done on generic, automated procedures for data quality assessment and cleansing. We have developed an automated, two-level data quality assessment procedure to address this deficiency. In the first level of our procedure, several different data quality assessment methods are used in a voting scheme to identify concentrations of anomalies in aggregate data. In the second level, differences between anomalies and normal data at the individual data level are identified; combined with domain knowledge, these differences can be used to identify different types of errors, such as missing data and calibration errors. In our case studies, we were able to cleanse the data effectively using the results from our data quality assessment procedure. We have also developed a test bench to explore the sensitivity of the data quality assessment algorithms used in our approach. The test bench introduces a known error into a clean, artificial data set and then evaluates how well each assessment method identifies the error. The test bench results show that our approach is able to effectively identify anomalies, even those with small magnitudes of error.
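The first-level voting scheme and the test bench idea can be illustrated with a minimal sketch. This is not the procedure from the paper: the abstract does not name the assessment methods, so the three detectors below (a z-score check, Tukey fences, and a median-absolute-deviation check), all function names, the thresholds, and the two-vote rule are assumptions chosen only for illustration.

```python
import random
import statistics


def zscore_flag(groups, threshold=2.0):
    """Flag groups whose mean lies far from the mean of all group means."""
    means = [statistics.mean(g) for g in groups]
    mu = statistics.mean(means)
    sd = statistics.pstdev(means)
    return [sd > 0 and abs(m - mu) / sd > threshold for m in means]


def iqr_flag(groups, k=1.5):
    """Flag groups whose median falls outside Tukey fences of group medians."""
    medians = [statistics.median(g) for g in groups]
    q1, _, q3 = statistics.quantiles(medians, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [m < low or m > high for m in medians]


def mad_flag(groups, threshold=3.5):
    """Flag groups by a modified z-score based on median absolute deviation."""
    means = [statistics.mean(g) for g in groups]
    center = statistics.median(means)
    mad = statistics.median([abs(m - center) for m in means])
    return [mad > 0 and 0.6745 * abs(m - center) / mad > threshold
            for m in means]


def vote(groups, methods, min_votes=2):
    """Return the indices of groups flagged by at least min_votes methods."""
    ballots = [method(groups) for method in methods]
    return [i for i, column in enumerate(zip(*ballots))
            if sum(column) >= min_votes]


def make_test_data(n_groups=30, group_size=50, bad_index=7, shift=5.0, seed=0):
    """Test bench input: clean artificial groups with one known offset error."""
    rng = random.Random(seed)
    groups = [[rng.gauss(100.0, 2.0) for _ in range(group_size)]
              for _ in range(n_groups)]
    # Inject a known, calibration-like offset into a single group.
    groups[bad_index] = [x + shift for x in groups[bad_index]]
    return groups


groups = make_test_data()
flagged = vote(groups, [zscore_flag, iqr_flag, mad_flag])
print("groups flagged by at least two methods:", flagged)
```

Rerunning `make_test_data` with progressively smaller `shift` values and checking whether the injected group is still flagged gives a rough sensitivity curve for each detector, which is the kind of question the test bench is described as answering.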
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. NSFCMS-9987871 and partially supported by Illinois Department of Transportation through the Metropolitan Transportation Support Initiative (METSI) at University of Illinois, Chicago. The writers would also like to thank Margaret H. Chalkline and the Minnesota Department of Transportation for giving us the opportunity to study their weigh-in-motion data.
Copyright
© 2005 ASCE.
History
Received: Jul 10, 2002
Accepted: Nov 29, 2004
Published online: Sep 1, 2005
Published in print: Sep 2005