Introduction
Globally, cities are facing pressing challenges, including rapid urbanization and intensifying climate change. These challenges will likely undermine urban water management around the world. Traditionally, urban water management has been supply-focused, meaning that cities have focused on increasing their water supply to meet the demand (
Gleick 2003). Recently, however, many cities have started to integrate demand management with their existing policies (
Mitchell 2006;
Luo et al. 2015;
Wang et al. 2020). As droughts become more frequent and intense (
Dai 2011;
AghaKouchak et al. 2015;
Bruss et al. 2019;
Ault 2020), it is likely that these integrated management strategies will become increasingly important in many regions. To ensure that integrated strategies can be implemented successfully, there is a need to better understand the nuances of residential water consumption, particularly within cities.
Many water consumption models aimed at characterizing intra-city water consumption focus on demographics and housing characteristics, which are often correlated with water consumption. For example, in a case study conducted in Reno, Nevada, Viñoles et al. (
2015) found that length of residency was associated with higher water consumption. The authors reasoned that this was due to the growth in physical and social capital that participants experienced the longer they lived in the area (
Viñoles et al. 2015). A similar study, conducted in Phoenix, Arizona, found that longer residence times increased water consumption and that longer-term residents were more likely to believe in the idea that Phoenix is an oasis (
Harlan et al. 2009). Interestingly, a recent study conducted in Southern California found that water consumption decreased as residents occupied a house for longer periods of time (
Bolorinos et al. 2020), which suggests that there may be significant differences between areas, possibly based on localized norms. In addition to length of residency, a number of studies have demonstrated an increase in water consumption as household income increases (
Harlan et al. 2009;
Shandas and Parandvash 2010;
Ghavidelfar et al. 2017). Many of these studies attributed this to the larger homes and lot sizes that are often associated with higher incomes. Recently, Cominola et al. (
2018) adopted a segmentation approach to classify the water and electricity demand profiles in Los Angeles, California. The authors found that increased water consumption was driven by large house sizes, more occupants, and intensive outdoor uses (
Cominola et al. 2018), echoing previous work that leveraged linear models. Going beyond income and housing characteristics, a recent study focused on the impact of the COVID-19 pandemic on water consumption (
Li et al. 2021). The authors found that although California’s urban water consumption decreased while residents were working from home, the residential sector experienced an increase in consumption (
Li et al. 2021). This suggests that remote work (one of the demographics included in the present study) may impact water consumption, although it is not as common a predictor as income or housing characteristics. These studies highlight the need to consider a wide array of demographic variables when modeling water consumption.
In addition to the work on demographics-based models, there is a significant body of literature dedicated to understanding the role that social norms play in determining water consumption. For example, several studies have demonstrated that environmental awareness often leads to increased water conservation, especially if there are social norms that encourage pro-environmental behaviors (
Pinto et al. 2011;
Willis et al. 2011;
Beal et al. 2016;
Ramsey et al. 2017). However, the norms that influence water conservation may vary within a given city or lead to different reactions from different segments of a population. For example, Bhanot (
2017) found that competitive messaging about water consumption among neighbors led to lower consumption for those that were already among the more efficient users, while the households that consumed more water (and thus had a lower rank) were likely to
increase their water consumption when presented with competitive messaging. The author attributed this to the “last place effect” and subsequent demotivation to work towards reducing water use (
Bhanot 2017). In a similar study, Cominola et al. (
2021) used smart meter data to evaluate the impact of getting feedback on water consumption via a mobile application, including comparisons with peer households. The authors found that households that had access to the application reduced their water consumption in both the short- and long-term (
Cominola et al. 2021). Social influences can also arise through media consumption. Quesnel and Ajami (
2017) found that increased news media coverage and subsequent internet searches led to a reduction in water consumption during an extreme drought. Overall, these studies demonstrate the importance of social norms in explaining trends in water consumption in the context of groups (i.e., segments of populations), as well as the benefit of integrating these norms within local water management policies.
Going beyond demographics and social norms, some studies consider the climate impacts on water consumption. For example, Ashoori et al. (
2016) found that residential water consumption in Los Angeles was sensitive to the climate, namely, temperature and precipitation. Specifically, Ashoori et al. (
2016) found that precipitation was especially significant in predicting water consumption in single-family homes, with more precipitation leading to lower water consumption. This was tied to the increased outdoor water use common to single-family homes in the area. Using higher resolution data, Balling et al. (
2008) further demonstrated that climate played a role in determining water consumption at the census tract level. The authors found that water consumption increased across the city during periods of higher-than-normal temperatures, as well as lower-than-normal precipitation (
Balling et al. 2008). In a study conducted in San Francisco, California, the authors found that higher temperatures were correlated with higher water consumption across income levels, while precipitation was found to be insignificant (
Quesnel and Ajami 2017). Finally, a recent study used future climate scenarios to predict water consumption (
Rasifaghihi et al. 2020). Rasifaghihi et al. (
2020) found that future increases in temperature and changes in precipitation patterns are likely to lead to increased seasonal water consumption, while base water consumption is likely to remain constant. Many of these studies considered only temperature and precipitation as climatic variables and often used linear regression as the main modeling technique (
Balling et al. 2008;
Ashoori et al. 2016;
Rasifaghihi et al. 2020). In fact, in a seminal review by House-Peters and Chang (
2011), the authors found that studies focused on the climate impact on water consumption primarily focused on precipitation and temperature, with only a few studies considering wind speed and evapotranspiration. Recent work, however, has provided evidence that additional climatic variables, such as relative humidity and dew point temperature, are needed to accurately model water consumption (
Obringer et al. 2020a). Moreover, it has been shown that the relationship between climate and water consumption is nonlinear (
Obringer et al. 2019,
2020a;
Wongso et al. 2020), which calls for more complex modeling techniques than linear regression.
Much of the previously discussed literature has focused on evaluating the impact of a single data type (e.g., demographics
or climate variables). That being said, there have been a number of models aimed at integrating multiple data types into a single analysis of urban water consumption. For example, Ashoori et al. (
2016) included price and population in addition to the climate variables to model water consumption in Los Angeles. However, this study focused on sector-level water use (e.g., single-family residential or commercial), and did not account for the intra-city differences beyond housing type. In this sense, the study may have overlooked potential cultural or demographic indicators of water consumption. Several studies have integrated climate variables and demographics to characterize the climate sensitivity of intra-city water consumption. House-Peters et al. (
2010), for example, found that the census blocks with newer homes and more educated residents tended to be more sensitive to climatic conditions (e.g., drought). Similar results were found in a study by Balling et al. (
2008), which demonstrated that the water consumption in census tracts with large lots and higher income was more sensitive to the climate. These studies primarily relied on linear models and did not account for the cultural aspects that may also have been driving water consumption. Finally, Ramsey et al. (
2017) evaluated the impact of demographics and social norms on water consumption in India, but did not account for the various climate influences that may shape water consumption. Given that water consumption is multi-faceted, with a number of different influences, including demographics, climate, and social norms, it is important to build predictive models that consider a variety of input variables, including the socio-demographic characteristics of the population and other social variables when available. Moreover, models that account for the intra-city water consumption patterns will enable practitioners to develop community-specific plans to curb water consumption.
Here, we present a data-driven model to predict intra-city residential water consumption, accounting for the variability in demographics and climate, while leveraging social norms to aid the interpretation of the model results. In particular, we combine demographic variables measured by the census, such as education level and household income, with high resolution climate data, including precipitation, temperature, and relative humidity. These quantitative variables are used to create a data-driven model of water consumption, the results of which are then interpreted through the novel incorporation of qualitative social norms data. This work advances the growing body of work surrounding the use of data-driven models within water resources research by (1) focusing on a non-linear modeling technique, which has been shown to be effective in other scenarios (
Obringer and Nateghi 2018;
Wongso et al. 2020), (2) expanding the included climate variables beyond precipitation and temperature, which better captures consumption trends (
Obringer et al. 2019), and (3) emphasizing intra-city patterns, which are underexplored in comparison to larger, sector-level studies. Additionally, the use of social norms as a tool for improving interpretation of model results aids in bridging the gap between quantitative and qualitative work. The integration of quantitative and qualitative data is a key aspect of socio-environmental systems research, particularly as qualitative data can be used to provide insight to quantitative measurements (
Elsawah et al. 2020). The data-driven model is based on observational data, but can be used to predict water consumption at the census tract level, assuming the demographics and climate input data are available. We then expand this quantitative model by leveraging qualitative data on social norms to provide insights to the model results that cannot be inferred from demographics or climate data alone. This study aims to increase the scientific understanding of the driving factors behind urban water consumption, allowing water utilities to implement community-specific demand management techniques. In the following sections, we first discuss the data and methods used within this study. Then, we delve into the results and discussion. Finally, we conclude with a summary of the implications for practitioners.
Results
In this section, we discuss the results from the moderate-intensity analysis (described in the section “Methods”). In particular, we first show the model performance, including measures of error and the difference between the actual and predicted water consumption values. Then, we discuss the variable importance in terms of predictive accuracy. Finally, we discuss the results of the qualitative interviews, which are then used to interpret the predicted residential water consumption. Results of the high-intensity analysis are presented in Figs. S5–S10.
Model Performance
The statistical performance of a model is often measured in terms of out-of-sample prediction error, as well as the ability of the model to explain the variance of the actual data. Table
2 outlines the model performance across each season. In particular, the table contains values for out-of-sample (i.e., the test sample) $R^2$, root mean squared error (RMSE), and the normalized root mean squared error (NRMSE). The $R^2$ values can be used to evaluate the ability of the model to fit the data (i.e., the amount of variance explained by the model). These performance measures indicate that the model can adequately capture the variance in the water consumption data across the seasons, with $R^2$ values from 0.77 to 0.83 (Table
2). The other two measures of model performance (RMSE and NRMSE) represent the prediction error. The NRMSE is the normalized form of the RMSE, providing a unitless measure of prediction error. Here, lower values indicate lower prediction error. The results therefore demonstrate that the model has high predictive accuracy (low error) across the seasons (Table
2). In particular, the model performs best in the summer, which is a critical time for demand management, as it generally represents peak usage.
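As a rough illustration of how these three performance measures relate to one another, the following Python sketch computes them for a handful of made-up values. The function names and the toy numbers are hypothetical, and dividing the RMSE by the mean (or, optionally, the range) of the observed values is an assumption, since the exact normalization is not stated here.

```python
import numpy as np


def r_squared(actual, predicted):
    """Share of variance in the observations explained by the predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the observations."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def nrmse(actual, predicted, normalizer="mean"):
    """Unitless RMSE. The normalizer is an assumption: 'mean' or 'range'."""
    actual = np.asarray(actual, dtype=float)
    if normalizer == "mean":
        scale = actual.mean()
    elif normalizer == "range":
        scale = actual.max() - actual.min()
    else:
        raise ValueError("normalizer must be 'mean' or 'range'")
    return rmse(actual, predicted) / float(scale)
```

In practice these functions would be applied to the held-out test sample for each season, so that the reported values reflect out-of-sample performance rather than fit to the training data.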
Additionally, the NRMSE can be used to gauge uncertainty. Our model, for example, results in an NRMSE ranging from 0.096 in the summer to 0.114 in the spring (see Table
2). This means that the average error in our model is 9.6% to 11.4%, depending on the season. In other words, the prediction results may be about 10% more or less than the actual values. That being said, there remains uncertainty in the model, which can present a challenge when applying the framework. For example, a 9.6% error in the model is relatively low, but translates to about 2.57 million liters of water (see Table
2)—this could cause issues for utility companies that plan to allocate a certain portion of water to the city but end up with an unexpected deficit or surplus. Nonetheless, the model performs well and improves upon previous work. In terms of the variance, our results indicate that the model explains 77% to 83% of the variance in the actual data, depending on the season (see Table
2). This is a significant improvement over previous work conducted in Phoenix, Arizona, in which the average $R^2$ value was 0.25 (
Balling et al. 2008). The improvement is likely due to the use of a nonlinear model that does not require any strict parameterizations of the relationship between the dependent variable (water consumption) and the independent variables (demographics and climate).
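The link between a unitless NRMSE and an error expressed in liters can be made concrete with a short calculation. The sketch below is illustrative only: it assumes the NRMSE is the RMSE divided by a single reference volume (for example, mean seasonal consumption), which is not confirmed here, and the function names are hypothetical. Under that assumption, the figures quoted above (an NRMSE of 0.096 corresponding to about 2.57 million liters) imply a reference volume of roughly 26.8 million liters.

```python
def implied_reference(rmse_volume, nrmse):
    """Reference volume implied by an RMSE and its normalized form.

    Assumes NRMSE = RMSE / reference, where the reference might be the
    mean or the range of observed consumption (an assumption).
    """
    if nrmse <= 0:
        raise ValueError("nrmse must be positive")
    return rmse_volume / nrmse


def error_band(prediction, nrmse):
    """Approximate lower and upper bounds around a prediction.

    Treats the NRMSE as a typical relative error, which is a
    simplification rather than a formal prediction interval.
    """
    return prediction * (1.0 - nrmse), prediction * (1.0 + nrmse)


# Summer values quoted in the text, in millions of liters.
reference_volume = implied_reference(2.57, 0.096)
```

A utility could use a band of this kind as a rough planning margin, keeping in mind that it describes average error and that individual tracts may deviate by more.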
Fig.
2 shows the differences between predicted and actual water consumption in each census tract over the course of 2018, which can be used to visualize the spatial variations in predictive accuracy. One can, for example, evaluate where the model over-predicts residential water consumption (blue shades) and which areas the model under-predicts water consumption (red shades), as well as the magnitude of those over-/under-predictions. The figure shows relatively small differences across the study area with some seasonal differences. In particular, the summer and fall months include more extreme differences (
liters) than the winter and spring seasons. Around 4% of the census tracts in the summer model have extreme differences between the actual and predicted data, compared to 2% in the winter model. Likewise, the fall model shows 3% of tracts with extreme differences (
liters), compared to 2% in the spring model. These differences could be due to the increase in outdoor water consumption during the summer and fall seasons, compared to the winter and spring seasons. In most of these extreme cases, the model predicted less water consumption than the observations (red tracts in Fig.
2). This is likely due to housing characteristics (lot size, house age, etc.), which were not included in this study. It is possible that adding these variables would increase the predictive accuracy in certain tracts, but that is beyond the scope of this study.
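The tract-level comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the study: the function name is hypothetical, the sign convention (predicted minus actual, so positive values are over-predictions) is an assumption, and the threshold that defines an "extreme" difference is left as a parameter because its value is not given here.

```python
import numpy as np


def summarize_differences(actual, predicted, extreme_threshold):
    """Summarize over- and under-prediction across census tracts.

    extreme_threshold is a hypothetical cutoff (in liters) chosen by
    the analyst; it is not a value taken from the study.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    diff = predicted - actual            # > 0: over-predicted, < 0: under-predicted
    extreme = np.abs(diff) > extreme_threshold
    n_extreme = int(extreme.sum())
    under_share = (
        float((extreme & (diff < 0)).sum()) / n_extreme if n_extreme else 0.0
    )
    return {
        "n_over": int((diff > 0).sum()),
        "n_under": int((diff < 0).sum()),
        "share_extreme": float(extreme.mean()),
        "under_share_of_extreme": under_share,
    }
```

Running this once per season would yield the kind of percentages reported above, such as the share of tracts with extreme differences and whether those extremes are mostly under-predictions.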
It is notable that the model performs better in the central areas of the city, where water consumption tends to be less than the outer, more suburban areas (see Fig. S11). The suburban areas likely have similar demographics to some of the more central areas (house value, income, etc.), but different end-uses (e.g., outdoor landscaping). Considering the generalized data-driven model architecture for the entire study domain, if two different tracts have similar demographic data, the model will predict similar water consumption. Overall, the majority of the tracts are being predicted accurately, with minimal differences between the predicted and actual values—a benefit for water utilities and policymakers interested in better understanding the intra-city water consumption patterns.
Variable Importance and Partial Dependence
Following the variable selection process outlined in the section "Data and Methods," five to six variables per season were selected for the final model. The final variables are shown in Fig.
3. Among the important variables, home ownership was found to be the most influential across all seasons. This means that the percentage of houses that are owned within a census tract is crucial for predicting the water consumption within that tract. Furthermore, house value and household income are repeatedly among the most important variables, indicating a close relationship between socioeconomic status and water consumption. Additionally, the percentage of families with kids in a given census tract was found to be important for predicting water consumption. Education was also shown to play an important role in predicting total water consumption within a census tract. In particular, in the spring, fall, and winter months, the percent of residents with some college education was an important predictor, while in the summer months, the percent of residents with an associate’s degree was found to be important. Finally, the percentage of people that walk to work was important in the spring months. Notably, none of the climate variables remained following the final threshold-based variable selection. This suggests that while changes in climate may be important at the larger inter-city scale (
Obringer et al. 2019), they are less important for predicting intra-city water consumption within the city of Indianapolis. This may be due to the variability in the data. In other words, the climate does not vary as much within the city as the demographics do. It is possible, then, that an analysis conducted in a more climatically variable area may be more sensitive to the changes in intra-city climate, as previously shown in Phoenix (
Balling et al. 2008).
The variable importance plots (Fig.
3) indicate the most crucial independent variables for predicting intra-city water consumption, but they do not indicate the direction of the relationship (i.e., positive or negative correlation). To understand
how the important variables impact water consumption, we can use partial dependence plots. Fig.
4 shows the results of the partial dependence analysis for the summer months (see Figs.
S2–
S4 for the other seasons included in the analysis). In particular, the figure shows the partial dependence for the six variables shown in Fig.
3. Fig.
4 shows that as the percentage of home owners increases, the water consumption increases as well. Similarly, as the percentage of families increases, so does total water consumption within the census tract. In terms of socioeconomic indicators, as the percentage of houses valued between $50,000 and $100,000 increases, there is an initial reduction in water consumption. However, this drop is followed by a steady increase in water consumption as the percentage of lower-valued homes increases. Finally, the percentage of people with associate’s degrees was an important variable in the summer months (Fig.
3). The partial dependence plot indicates that at first, water consumption increases with the percentage of associate’s degrees, but then decreases around 5%. The partial dependence plots allow us to attach a direction to the sensitivity of important variables and begin to understand the relationship between the predictor and response variables.
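For readers unfamiliar with the technique, partial dependence can be computed by fixing one predictor at each value on a grid, predicting for every observation, and averaging the predictions. The sketch below illustrates this with a toy prediction function; `toy_model` and its coefficients are hypothetical stand-ins and do not represent the model or the data used in this study.

```python
import numpy as np


def partial_dependence(predict, X, feature_index, grid):
    """Average prediction as one feature is swept over a grid of values.

    predict: any callable mapping a 2-D array of predictors to a 1-D
    array of predictions (here a stand-in for a fitted model).
    """
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value   # hold the feature fixed for all rows
        averages.append(predict(X_mod).mean())
    return np.array(averages)


def toy_model(X):
    """Hypothetical nonlinear response: rises with feature 0 and is
    U-shaped in feature 1, with its minimum at 0.3."""
    return 2.0 * X[:, 0] + (X[:, 1] - 0.3) ** 2


X_demo = np.array([[0.1, 0.2], [0.5, 0.6]])
pd_curve = partial_dependence(toy_model, X_demo, feature_index=1, grid=[0.0, 0.3, 0.6])
```

The U-shaped curve produced for the second toy feature mirrors the kind of non-monotonic pattern described above, in which consumption first falls and then rises as a predictor increases, something a variable importance ranking alone would not reveal.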
Qualitative Interviews
The interviews were coded and analyzed thematically to discern perceptions of water conservation within various neighborhoods. We paid particular attention to the presence (or lack thereof) of social norms related to water conservation. In general, most interviewees indicated that there were no strong social norms regarding water conservation within their neighborhoods. Rather than expressing an expectation of their neighbors in terms of water conservation, the interviewees discussed having personal values that were not shared by the neighborhood as a whole, as evident in the following quote: “I think it’s very much on a house to house basis, there are people doing individual things, but I don’t know in our neighborhood that there is necessarily like a grassroots effort.” That being said, there were a few respondents who indicated their neighborhoods did in fact have social norms surrounding water conservation. These interviewees indicated that environmentally-friendly activities, such as recycling, driving electric vehicles, and using rain barrels, were popular among the residents. Moreover, they stated that those behaviors were expected and that conservation would often be brought up at neighborhood gatherings. One particular interviewee discussed pro-environmental behavior and mindsets in their neighborhood by saying: “I think that from a community perspective, at a micro-level, [the neighborhood] is going to be much more socially conscious of environmental conservation efforts, if you will, than Indiana as a whole or even Indianapolis. I think we’re somewhat unique in that regard and I certainly think that is the reality, especially from the interactions that I have had, not just in person, but on social media.” By geographically connecting these neighborhoods with the census tract(s) within the same designated area, we use the interview results to interpret the quantitative model findings discussed above.
Discussion
Within the body of literature on water demand modeling, there have been a number of studies aimed at determining the various factors that ultimately impact water consumption. For example, Sankarasubramanian et al. (
2017) found that higher education and income often led to higher adoption of efficiency techniques. House-Peters et al. (
2010) found that water consumption depended on several housing characteristics (e.g., outdoor space and house size), as well as education levels. Finally, Ashoori et al. (
2016) found that temperature and precipitation were important factors for predicting water consumption, particularly in single family homes. The work presented here shares some similarities with previous findings, as well as some differences, which are discussed below.
For example, our study did not include housing characteristics, but did include home ownership, which was found to be important across all four seasons. Fig.
3 shows that the percentage of home ownership was the most important variable across the entire year—indicating a potential demographic variable for the water utility to use for targeted conservation initiatives. Fig.
4 demonstrates that as the number of owned houses increases, the total water consumption does as well, which is expected, as home ownership usually means larger lot sizes and houses, leading to higher water consumption, especially in the summer when landscaping is popular (
House-Peters et al. 2010;
Sankarasubramanian et al. 2017). Likewise, as the number of families with kids increases, so does water consumption (Fig.
4). This is likely due to the increased number of people within a household, which leads to more water consumption. This finding is aligned with previous studies on the subject (
Worland et al. 2018). The percentage of families with kids was also an important predictor in the high-intensity user group (see Fig.
S5), indicating that this may be an optimal demographic predictor across the city. These findings suggest that efforts to limit water consumption ought to be targeted at areas in which home ownership is significant, rather than areas with a majority of renters, as well as areas that are primarily made up of families.
One of the particularly novel results of our analysis was the importance of walking to work in predicting the water consumption, which has not, to our knowledge, been reported prior to this work. The percentage of people that walk to work is not a variable that many would intuitively connect to water consumption, thus it is possible that previous analyses have simply not included variables related to commutes. This is an advantage of starting with a large dataset (e.g., 72 demographic variables) and doing an automated variable selection procedure—the algorithm was able to use a variable to make a prediction that otherwise might not have even been included. Given that the percentage of people who walk to work is positively correlated with the percentage of single people and people in their twenties while negatively correlated with the percentage of families and home ownership (see Fig.
S1), it is likely that walking to work is an indirect proxy for location within the city. In particular, the population that walks to work is primarily located in the city center (see Fig.
S11), which is also an area that contains a lot of apartments and smaller houses with little to no outdoor space. In this sense, it may not be walking to work itself that predicts water consumption; rather, walking to work is representative of one’s location within the city and thus the predominant style of living, which may be the true predictor of water consumption. This echoes previous work, which demonstrated that suburban households (which are unlikely to be associated with walking to work) tend to consume more water due to outdoor landscaping, as well as larger house sizes (
Balling et al. 2008;
House-Peters et al. 2010). Conversely, in the high-intensity user group, which contains a number of rural census tracts, the percentage of people that worked from home and the percentage of people that drive to work were found to be important predictors in the summer and winter season, respectively (see Fig.
S5). The partial dependence analysis suggests that as the number of people that work from home increases in these high-intensity tracts, the water consumption increases (see Fig.
S8). This could be due to increased use of indoor water while people are home most of the day, as shown by a recent study focused on the impact of remote work (
Li et al. 2021). Another possibility is that the additional water is being used primarily outdoors for work-related purposes, such as farming, since the majority of the high-intensity census tracts are rural. Additional research is needed to better understand these impacts on the high-intensity user base and how they might be leveraged to encourage water conservation.
A common predictor of water consumption is household income. A number of studies have found positive relationships between these two variables, with higher income often leading to higher water consumption (
Harlan et al. 2009;
House-Peters et al. 2010;
Worland et al. 2018). Similarly, in our model, during the summer months, the percentage of houses valued between $50,000 and $100,000 (the median house value for the city is $130,000) was found to be positively related to water consumption. In particular, there is a notable inflection point when 30% of the homes within the tract are valued between $50,000 and $100,000 in which the water consumption begins to steadily increase. In fact, census tracts with percentages of lower-valued houses above this threshold tend to also have higher percentages of low income households (less than $50k a year), as shown in Fig.
S11. Previous work has shown that lower-income households are less likely to have efficient appliances (
Sankarasubramanian et al. 2017), which may contribute to the increased water consumption shown in Fig.
4. In the high-intensity group, income was also shown to be important, with higher percentages of affluence leading to higher consumption of water (see Figs.
S7–
S10). This aligns with previous research into the connection between income and water consumption (
Harlan et al. 2009;
House-Peters et al. 2010;
Worland et al. 2018). Finally, education levels were found to be important across the seasons. In particular, the percentage of people with associate’s degrees was found to be important in the summer months. Previous work has suggested that education levels were a significant predictor of water use efficiency (
Sankarasubramanian et al. 2017), which may explain the decrease once the census tract reaches 5% of the population having associate’s degrees. Overall, the important variables presented here not only echo previous studies, but also provide valuable information on how to target conservation within cities.
The majority of previous literature has focused on quantitative modeling of water consumption, which has its own limitations in terms of interpretation. That is, one cannot interpret the results beyond the correlation or the predictive accuracy. One way to move past this limitation is to use qualitative data as a source of insight (
Elsawah et al. 2020). Here, we used qualitative interviews to assess the presence of social norms in several census tracts, shown in Fig.
5.
Visually, these census tracts cluster into two groups—the center-top and the central group, with the center-top group containing more census tracts where the water consumption was overpredicted than the central group. The central group represents the city center, while the center-top group is still relatively close to the city center, but larger properties (often with large yards) and detached houses are common. Normally, these neighborhoods would have higher than usual water consumption—given the percentage of home ownership and families. However, the model over-predicted the consumption, indicating that in reality these neighborhoods consume less water than their demographics alone would suggest, based on the rest of the study area. This finding was confirmed by the interview results, discussed above, in which interviewees from the center-top group indicated that there were social norms within their neighborhoods that encouraged water conservation practices. In particular, the interviewees discussed the prevalence of using rain barrels for outdoor landscaping, rather than relying on water from the tap. This information regarding the local water conservation practices cannot be obtained through the usual data sources (i.e., census data, housing records, etc.), but it is, nonetheless, critical to understanding intra-city water consumption patterns. Based on the demographics of the area (e.g., presence of families, detached homes, higher income, etc.), water utilities may assume that they need to run a conservation program in these neighborhoods, when in fact, conservation-based norms already exist and are leading to less consumption than neighborhoods with similar demographics. By integrating an understanding of local social norms into the interpretation of quantitative modeling results, utility companies may be able to focus their efforts on areas of the city in which demand management interventions will be more effective.
Limitations of the Study
This study has a few limitations. First, the interpretation analysis relies on only 15 interviews. While the sample size was guided by the data saturation point in our semi-structured interview process (
Guest et al. 2006), we recognize that it represents a small fraction of the population in Indianapolis. Broader data coverage across the city, rather than a focus on a few central neighborhoods, could contribute to more inclusive and successful conservation interventions. It would be especially important to extend this analysis to high-intensity users, as they account for a disproportionately high share of water consumption and therefore represent an opportunity to greatly reduce it (
Rosenberg 2007;
Suero et al. 2012;
Abdallah and Rosenberg 2014). However, conducting interviews at such a large scale is labor intensive and may be infeasible for research teams, policymakers, and water and other resource management practitioners due to time and monetary constraints. In place of interviews, a survey may be more practical in future studies for assessing attitudes towards water conservation at a large scale, as well as any social norms that are present within different areas of the city. Surveys have their own set of challenges, however, and must be carefully designed to limit bias in the results. Although a survey was outside the scope of the current study, the results from the interviews can be used to develop future survey questionnaires and large-scale datasets. Similar work has been done in other areas of the country, although not specifically focused on social norms (
White et al. 2019). In addition, in our current study, social norm data was only collected and used
a posteriori to improve interpretation of model results, not to build the mathematical models. There is great potential for including social norm variables in future models and using relevant large-scale data as model inputs.
Another limitation of the study is the lack of landscaping variables, such as lot size or irrigation requirements. These variables play a major role in determining total water consumption, particularly in the summer months (
House-Peters and Chang 2011). Moreover, landscaping variables, or more broadly outdoor water use, are likely to be more influential among the high-intensity group, so including them could improve the interpretation of the model results from these tracts. Often these data are collected from a variety of sources, such as real estate websites or remote sensing images. However, the lack of a unified public database for the city of Indianapolis led us to exclude these data from the quantitative analysis, although some interviewees discussed irrigation habits within their neighborhoods. This exclusion of landscaping variables likely reduced the predictive accuracy of the model, especially for suburban census tracts, which are likely to have larger yards that require irrigation. Future work should seek to include these variables, particularly if the end goal is to improve predictive accuracy.
Finally, this study leveraged a two-stage approach for variable selection that relied on a predetermined correlation threshold. While this process has been used in several previous studies (
Genuer et al. 2010;
Mukherjee and Nateghi 2017;
Obringer et al. 2020b), the correlation pre-screening may add bias, particularly if it involves expert opinions in the decision-making process. To minimize this opportunity for added bias, we maintained a strict computational criterion for the pre-screening filter, which did not rely on expert opinions. Additionally, the two-stage process implemented in this analysis was found to be more computationally efficient, as well as more interpretable, when compared to a heuristic-based algorithm. That being said, there is likely to be a shift towards automated variable selection in the future, particularly as algorithms become more efficient. There is a growing number of such algorithms, including variable selection using random forests (VSURF), which is similar to the threshold-based approach implemented here (
Genuer et al. 2015). These procedures can be used with a variety of models, from simple linear regression (
Li et al. 2013) to more complex tree-based models (
Galelli and Castelletti 2013). In the future, researchers may opt to implement these procedures to further limit any potential for added bias.
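To make the idea of a correlation-based pre-screening filter concrete, the following Python sketch shows one possible form of a two-stage procedure. It is not the implementation used in this study: the threshold value, the choice to screen on each predictor's correlation with the response, the variable names, and the data are all assumptions made for illustration. The second stage here is only a placeholder that ranks the surviving predictors by absolute correlation; in practice this stage would typically be replaced by a model-based importance measure, such as one derived from a random forest.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def prescreen(predictors, response, threshold):
    """Stage 1 (assumed form): keep predictors whose absolute correlation
    with the response meets a fixed, purely computational threshold."""
    kept = {}
    for name, values in predictors.items():
        r = pearson(values, response)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


def rank_candidates(kept):
    """Stage 2 (placeholder): order survivors by absolute correlation.
    A model-based importance score would normally be used here."""
    return sorted(kept, key=lambda name: abs(kept[name]), reverse=True)


# Hypothetical data, for illustration only.
response = [1, 2, 3, 4, 5]
predictors = {
    "income": [2, 4, 6, 8, 10],
    "household_size": [5, 3, 4, 2, 1],
    "noise": [3, 1, 4, 1, 5],
}

kept = prescreen(predictors, response, threshold=0.5)
ranked = rank_candidates(kept)
print(ranked)
```

Because the filter depends only on a fixed numerical threshold, no expert judgment enters the first stage, which is the property the pre-screening criterion described above is meant to preserve.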