This section presents and discusses the results of the four types of analyses (A–D) described above.
Analysis of Model Performance per Streamflow Gauge and Model
First, the model performance at each of the 46 streamflow gauging stations and for each of the 17 models was analyzed with regard to the NSE of the simulated streamflow (Fig. 2). Summary statistics of the median NSE over all calibration and validation gauges for each objective can be found in Table S2. The gauging stations in Fig. 2 are sorted using a hierarchical clustering such that stations with similar performance patterns across all models appear closer to each other than gauges that show a different pattern. Each of the four panels in the figure is divided by horizontal lines into three blocks, separating the two Machine Learning models (first block) from the seven locally calibrated models (second block) and the eight globally calibrated models (third block).
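To make this ordering concrete, the following minimal sketch (not the authors' code; the array names and the Ward linkage are assumptions) computes the NSE for every gauge-model pair and orders the gauges by hierarchical clustering of their per-gauge NSE patterns, which is the kind of ordering used in Fig. 2.

```python
# Minimal sketch: build an NSE matrix (gauges x models) and order the gauges
# so that gauges with similar performance patterns across models appear
# adjacent, as in Fig. 2. Input names and the linkage choice are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def nse_matrix(simulated, observed, gauges, models):
    # simulated[gauge][model] and observed[gauge] are hypothetical containers
    return np.array([[nse(simulated[g][m], observed[g]) for m in models]
                     for g in gauges])

def cluster_order(nse_mat):
    # Ward linkage on the per-gauge NSE vectors; leaves_list returns the
    # dendrogram leaf order, i.e., the plotting order of the gauges.
    return leaves_list(linkage(nse_mat, method="ward"))
```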
As expected, the locally calibrated models (second block) in general yield better results in calibration than the globally calibrated models (third block) [Figs. 2(a and b)]. Most locally calibrated models, however, perform surprisingly well in validation [Figs. 2(c and d)]. It is also surprising that no significant difference between objectives 1 and 2 can be seen. Most models perform equally well in both objectives (see the median NSE for each model added as a label to the right of each panel). The reason for this is that the stations for objective 1 (low human impact watersheds) were chosen purely based on the regulation types “natural” and “reference” provided by WSC and USGS, respectively. This classification, however, is constant in time, and a watershed that had little human impact in the past might no longer be unimpacted (or vice versa). In hindsight, the stations initially selected as low human impact should have been manually double-checked to ensure that they are indeed not affected by catchment management.
Following the definition of at least satisfactory streamflow performance as an NSE of 0.5 or higher (Moriasi et al. 2015), the only models with a satisfactory median performance in all four setups (panels A–D) are GR4J-Raven-lp, HYMOD2-DS, mHM-Waterloo, and mHM-UFZ. Satisfactory in both calibration setups but not in both validation setups (panels A and B) are ML-LSTM, GR4J-Raven-sd, and SWAT-Guelph. LBRM is satisfactory in both calibration setups as well; its validation results are not considered here because some information from the validation stations might have informed its calibration already (due to the operational setup of LBRM). Satisfactory in both validation setups but not in both calibration setups (panels C and D) are VIC, VIC-GRU, GEM-Hydro, and MESH-SVS. The likely reason for the superior performance of these models in validation is that the validation stations were selected after a visual inspection of hydrographs and drainage areas to make sure they were not heavily managed and not located in highly urbanized areas. LBRM and MESH-CLASS are not considered for validation because they both used validation stations for calibration.
The machine learning model ML-XGBoost is the only model that performed substantially better for objective 1 than for objective 2 in calibration. All other models have very similar median NSEs (in both calibration and validation). However, the second machine learning model, ML-LSTM, performed much better than ML-XGBoost. It has been shown before that ML-XGBoost does not perform well when insufficient training data are available (Gauch et al. 2019). The data-driven ML-LSTM model serves here as a baseline for all hydrologic models, as it shows how much information content is provided in the data and what performance can be achieved without hydrologic knowledge. Notably, both data-driven models perform equally well (or poorly) in both objectives in calibration (NSE of 0.79 versus 0.79 for ML-LSTM, and NSE of 0.36 versus 0.26 for ML-XGBoost). Even the stations that did not perform well for the globally calibrated models [the last six in Fig. 2(a) and the first five in Fig. 2(b)] and were deemed likely to be managed did not show decreased performance for the ML models compared with other stations, even though no information on water management policies was used. It may be possible to augment hydrologic models with machine learning estimates to improve the simulations in basins like 04166100 and 02GC007, where almost all globally calibrated models fail to achieve NSEs above 0.
mHM-Waterloo is practically as good as ML-LSTM, showing that most of the information contained in the forcing and geophysical data can be extracted by the model. HYMOD2-DS and LBRM are ranked second and third in the locally calibrated group. mHM-Waterloo and mHM-UFZ significantly outperform the other models in their respective groups (local and global calibration, respectively). Reasons for mHM’s performance might include its unique multiscale parameter regionalization approach (Samaniego et al. 2010), which uses geophysical data at their native resolution and relates them directly to hydrologic variables without losing the subgrid variability. Another reason is the internal model structure of mHM, which maintains a prescribed nonlinear relationship among internal model variables (Rakovec et al. 2019); this is achieved by implementing transfer functions for each mHM parameter for most of the processes and by eliminating unnecessary parameter interactions. mHM-UFZ significantly outperforms the other globally calibrated models in calibration. It is hard to assign further ranks, as three other globally calibrated models (HYPE, GEM-Hydro, and MESH-SVS) perform very similarly in calibration. The results in validation are closer together, but mHM-UFZ is still the best and MESH-SVS is ranked second.
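As an illustration of the multiscale parameter regionalization idea, a hedged sketch is given below (the transfer function, its coefficients, and the upscaling operator are hypothetical and not the actual mHM formulations): a parameter field is first derived from geophysical data at their native resolution and only then aggregated to the model grid.

```python
# Illustrative sketch of multiscale parameter regionalization (MPR,
# Samaniego et al. 2010): apply a transfer function to geophysical data at
# their native fine resolution and upscale the resulting parameter field,
# instead of upscaling the data first. All functions below are hypothetical.
import numpy as np

def transfer_function(sand_frac, clay_frac, a=0.35, b=0.2, c=0.1):
    # hypothetical pedotransfer-like relation: porosity from soil texture
    return a + b * clay_frac - c * sand_frac

def upscale_mean(field, factor):
    # simple block averaging from the fine grid to the coarser model grid
    ny, nx = field.shape
    cropped = field[: ny - ny % factor, : nx - nx % factor]
    return cropped.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

# hypothetical fine-resolution soil texture maps (e.g., 1 km grids)
sand = np.random.rand(120, 120)
clay = np.random.rand(120, 120)

# MPR: transfer function at native resolution, then upscaling to the model grid
porosity_model_grid = upscale_mean(transfer_function(sand, clay), factor=12)
```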
mHM-UFZ, like other globally calibrated models, had problems performing well for a few stations (i.e., 04166100, 02GA047, 02GB007, 04161820, 0416400, 04165500, 02GC007, 04174500, and 04166500). These stations are located in highly urbanized areas such as metropolitan Detroit, which is consistent with the fact that the models were not informed a priori about any human influence or regulation. Several models, such as those based on the land-surface scheme SVS (used by both GEM-Hydro and MESH-SVS) as well as WATFLOOD, are known to have problems with flashy responses because tile drainage is not represented in these models. GEM-Hydro and MESH-SVS perform almost equally in both objectives in calibration and validation, certainly because they both use exactly the same underlying land-surface scheme, SVS. Furthermore, the calibrated SVS parameters of MESH-SVS were directly used for GEM-Hydro without further calibration. MESH-CLASS, which uses the CLASS land-surface scheme, does not perform as well as GEM-Hydro and MESH-SVS; this might be due to a different calibration strategy or to differences in the land-surface schemes.
For the GR4J model, no major improvement can be observed when switching from a lumped setup (GR4J-Raven-lp) to a semidistributed setup (GR4J-Raven-sd). This might be because the discretization of the semidistributed version was not detailed enough, or because the maximum performance of GR4J is already reached with the lumped version and further improvement could only be achieved using additional data such as watershed management rules.
The traditional VIC model, based on regular grid cells with defined tiles, and VIC-GRU show similar performance. The main difference between the tiles and GRUs is that the latter take soil classes into account at their native resolution, while the tiles only use the major soil class per grid cell. This, however, does not make much difference in the Lake Erie watershed because the soil classes do not exhibit high spatial heterogeneity there. In this study, the VIC model uses about 1,390 tiles (381 grid cells, each with 3–4 vegetation tiles) and VIC-GRU uses about 2,400 grouped response units. Some stations are simulated better by one model or the other, but the overall performance is similar. The fact that VIC/VIC-GRU reaches an upper limit of streamflow performance at a certain discretization level was previously described by Gharari et al. (2020). The differences in performance between the two models are likely due to the different input datasets used, different routing models, and/or different calibration strategies.
SWAT-EPA provided results only for the calibration of objective 1 stations, and yielded significantly worse performance than any other model, particularly worse than SWAT-Guelph, which was set up by an independent group. The weak performance of SWAT-EPA is due either to the calibration strategy or to the datasets used. This shows that model intercomparisons need to take into account the modelers’ expertise and perseverance in addition to the traditionally considered influence of the datasets used, calibration strategies, model discretization, etc.
The performance of all models at the seven validation streamflow gauging stations is displayed in Figs. 2(c and d). In validation, the Machine Learning models show significantly degraded performance relative to calibration, and both hydrologic model groups (locally and globally calibrated) perform better than the ML models. This is most likely because there were not enough data (e.g., streamflow gauge data, forcing data, and basin attributes) to train the ML models properly, as previously discussed by Gauch et al. (2019). The best model in validation is LBRM, which, however, cannot be considered because it had already used all validation stations for calibration. Hence, the two versions of mHM (mHM-Waterloo and mHM-UFZ) are ranked first in the local and global model groups, respectively. HYMOD2-DS and MESH-SVS are ranked second in the two groups, respectively.
It should be noted that the locally calibrated models (second group) use either the area-ratio method (Fry et al. 2014) (LBRM and SWAT-Guelph) or the nearest-neighbor method to determine streamflow at the validation stations. The donor basins used to derive those estimates might differ between the models. The list of donor basins used for each validation station can be found in the README files online (e.g., “data/objective_1/lake-erie/validation/model/HYMOD2-DS/README.md” in the GitHub repository associated with this publication).
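For clarity, a minimal sketch of the area-ratio method is given below (variable names and example numbers are hypothetical); the nearest-neighbor method instead reruns the model for the validation basin using the calibrated parameters of the closest donor basin.

```python
# Minimal sketch of the area-ratio method (Fry et al. 2014): scale the
# simulated streamflow of a donor basin by the ratio of drainage areas to
# estimate streamflow at an ungauged/validation basin. Names are hypothetical.
import numpy as np

def area_ratio_streamflow(q_donor, area_donor_km2, area_target_km2):
    """Estimate streamflow at the target basin from a donor basin's hydrograph."""
    return np.asarray(q_donor, float) * (area_target_km2 / area_donor_km2)

# example: donor basin of 1,250 km2, validation basin of 980 km2
q_donor = np.array([12.0, 15.3, 40.1, 22.7])            # m3/s at the donor gauge
q_validation = area_ratio_streamflow(q_donor, 1250.0, 980.0)
```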
Analysis of Model Performance across Gauging Stations
Fig. 3 shows the median NSE per model across all gauges of objective 1 and objective 2 in calibration [Fig. 3(a)] and validation [Fig. 3(b)], mostly to ease comparison between calibration and validation results (panel A versus panel B) and differences between the two objectives for each model (light versus dark colored bar per model). A few models report different values for objective 1 and objective 2 in validation (i.e., SWAT-Guelph, mHM-UFZ, and VIC) because they provided an optimal model setup for each of the two objectives. All other models provided only one final model setup (independent of the objective) and hence only one modeled streamflow time series for each gauge.
The Machine Learning models show a significant drop in performance between calibration and validation (e.g., for ML-LSTM), caused by the limited training data available. This could be resolved by pretraining the ML models with a larger dataset, for example, CAMELS (Newman et al. 2015; Addor et al. 2017), and then fine-tuning them with the datasets available for the study domain of Lake Erie. This would partly resolve the imbalance between hydrologic models, which were developed based on expert knowledge over many years or even decades, and Machine Learning models, which are trained only on the current data without benefiting from any form of expert knowledge.
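A hedged sketch of this pretrain-then-fine-tune strategy is shown below, assuming PyTorch and hypothetical data loaders; it is not the ML-LSTM configuration used in this study, only an illustration of the suggested workflow.

```python
# Sketch of pretraining an LSTM on a large dataset (e.g., CAMELS) and then
# fine-tuning it on the study domain. Loaders, input sizes, and
# hyperparameters are hypothetical placeholders.
import torch
import torch.nn as nn

class RunoffLSTM(nn.Module):
    def __init__(self, n_inputs, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # streamflow at the last time step

    def forward(self, x):                      # x: (batch, time, n_inputs)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:                    # loader yields (forcings, streamflow)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

model = RunoffLSTM(n_inputs=10)                # number of forcings/attributes assumed
# 1) pretrain on a large dataset such as CAMELS (hypothetical loader):
# train(model, camels_loader, epochs=30, lr=1e-3)
# 2) fine-tune on the Lake Erie stations with a smaller learning rate:
# train(model, lake_erie_loader, epochs=10, lr=1e-4)
```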
As for the locally calibrated models (red bars), the performance of SWAT-Guelph also drops significantly for both objectives due to the known weak transferability of parameters in SWAT (Heuvelmans et al. 2004). Another reason for the drop in performance might be the use of the area-ratio method to obtain the estimates at the validation stations, while all other models (except LBRM) used a nearest-neighbor method. The lumped and semidistributed GR4J setups, GR4J-Raven-lp and GR4J-Raven-sd, as well as HYMOD2-DS, decrease in performance, but not as much as the ML models; mHM-Waterloo maintains almost the same performance in validation as in calibration. Almost all globally calibrated models maintain or, surprisingly, even improve their performance in validation. This is probably a side effect of the much smaller sample size (7 stations versus 28 and 31 in calibration); but, more importantly, it is certainly caused by the large set of urban watersheds in the calibration set, for which all global models had problems obtaining good results. Such gauges were avoided in the set of validation stations, which was chosen after we observed that the WSC/USGS classification needs manual verification. mHM-UFZ maintains the same performance and HYPE decreases slightly; VIC and VIC-GRU show slightly improved performance (the latter by 0.08 in median NSE), as do GEM-Hydro, MESH-SVS, and MESH-CLASS (the latter two by 0.11 and 0.14, respectively). MESH-CLASS, however, is slightly biased, as it already used two of the validation stations for calibration, and thus its validation results might appear better than they actually are. WATFLOOD is the only global model that showed drastically decreased performance in validation, which may indicate a flawed validation setup given that WATFLOOD was developed using catchments in the Lake Erie basin (Kouwen et al. 1993) and has shown much more robust performance in this region in the past.
In addition, the results in Fig. 3 again highlight that only minor differences exist between the performance of the models for objective 1 and objective 2 stations, originally defined as low human impact and most downstream stations, respectively. This is most likely due to the fact that the assignment of low human impact was based mainly on the tags provided by WSC and USGS in their gauge information. We see, however, that there are several stations in objective 1 (i.e., 04166100, 02GA047, 02GB007, 04161820, 0416400, and 04165500) where all models had problems achieving good results, even in calibration, due to their location within highly urbanized areas. These stations should therefore not have been classified as objective 1 (although labeled as “natural” by WSC and “reference” by USGS). They lowered the overall performance for objective 1 and led to similar results for both objectives. The validation stations were selected later in the project with more care, making sure they are not located in urban areas. We also visually inspected the observed streamflow to make sure there was no obvious human influence or management at the selected gauges. The absence of human-impacted watersheds in validation explains why performance improved for so many models.
Analysis of Performance per Gauging Station across All Models
Fig. 4 shows the variability of the streamflow simulation performance for each gauging station over sets of models. Panel A considers all models, while the other three panels separate the models into the categories of machine learning (ML) models (panel B), locally calibrated models (panel C), and globally calibrated models (panel D). The gauges (horizontal axis) are sorted according to decreasing median NSE over all models (horizontal line in each box of panel A).
The calibration results show that the gauges with the worst performance have the highest variability in streamflow simulation performance, with only a few exceptions (see panel A). The high variability in performance at the calibration stations is driven by the high variability of the globally calibrated models (panel D), while the ML and locally calibrated models (panels B and C, respectively) show much less variability and, in general, better results for those stations. The stations with the greatest variability are those identified in the two preceding subsections as being located in highly urbanized areas, where all the global models except mHM-UFZ, HYPE, and VIC-GRU have difficulties simulating streamflow. All of the above holds for both objectives in calibration mode.
The picture changes when it comes to the performance at the validation stations. Most of the overall variability (panel A) is caused by the ML models (panel B) and the locally calibrated models (panel C), which all used either the area-ratio or the nearest-neighbor method to transfer streamflow or parameters from various donor basins to the stations used for validation. The variability of the ML models is a less reliable estimate compared with that of the locally calibrated models, as it is based on only two ML models, while the boxplots for the locally calibrated models represent six models (effectively five, because LBRM should not really be counted, as it used all validation stations for calibration). This again shows that the globally calibrated models are superior in validation due to their less subjective approach to modeling ungauged basins; however, they need overall improvement when it comes to representing urban areas.
The best-performing station across all models in calibration mode for objective 1 is 02GC010; the station with the lowest performance is 04166100. The best-performing station across all models in validation for objective 1 is 04208000, while the worst-performing one is 04167000. The simulated streamflow time series of these four stations are analyzed in the next section.
Analysis of Simulated Streamflow Ensembles
In Fig. 5, streamflow ensembles for the three groups of models (ML, locally calibrated, and globally calibrated) are shown for selected gauging stations in calibration and validation mode, for objective 1 only (objective 2 results are similar but not shown). The best and worst objective 1 stations in calibration and validation are used here, i.e., the stations with the overall highest and lowest median NSE across all models [Fig. 4(a)].
The ML ensemble consists of only two models (grey hydrograph plots). The two simulated hydrographs in calibration mode [Figs. 5(a and b)] look very realistic: low flows are well represented, and while some peak flows are missed, most are captured. For the worst station in the calibration set of objective 1 [04166100; Fig. 5(b)], one ML model seems to overestimate the flows consistently, possibly because this is a station with human influence. No data are provided to the ML models to train them regarding human impacts and management, and it is hence expected that these stations are not very well represented by data-driven models (or any other globally calibrated model).
The ensembles of streamflow time series for the locally and globally calibrated models (red and blue hydrographs, respectively) look very similar for the well-performing stations in both calibration and validation [Figs. 5(a and c)]. This, however, changes for the worst-performing station: the variance of the simulated hydrographs in calibration mode [blue panel in Fig. 5(b)] is much larger for the globally calibrated models, while the ensemble of the locally calibrated models looks, as expected, very consistent. The low flows are well represented, but some peak flows are missed by the locally calibrated models [red panel in Fig. 5(b)]. This improves surprisingly for the validation of the overall worst-performing station [blue panel in Fig. 5(d)], possibly because the validation stations are not in urban areas, which are the stations where almost all globally calibrated models (namely GEM-Hydro, MESH-SVS, MESH-CLASS, and WATFLOOD, but also VIC and VIC-GRU) already had problems in calibration.
Rather than evaluating the performance of each individual model using the median NSE (added as a label for each model group to the panels in Fig. 5), the model intercomparison also allows evaluating the performance of the model ensemble. Therefore, the mean ensemble streamflow time series for each model group is derived and compared with the observations (also added as a label to the panels in Fig. 5). For all model groups and both stations, this leads to large improvements in performance, highlighting the strength of using model ensembles rather than individual models, as reported by others (Duan et al. 2007; Muhammad et al. 2018; Darbandsari and Coulibaly 2019).
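A minimal sketch of this ensemble evaluation (hypothetical names; not the authors' code) averages the simulated hydrographs of all models in a group and scores the ensemble mean against the observations.

```python
# Sketch: compare per-model NSEs with the NSE of the ensemble-mean hydrograph
# for one model group (e.g., globally calibrated models) at one gauge.
import numpy as np

def nse(sim, obs):
    # Nash-Sutcliffe efficiency, as in the first sketch above
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def ensemble_scores(group_sims, obs):
    """group_sims: dict mapping model name -> simulated hydrograph (same length as obs)."""
    individual = {name: nse(q, obs) for name, q in group_sims.items()}
    q_ensemble_mean = np.mean(np.vstack(list(group_sims.values())), axis=0)
    return individual, nse(q_ensemble_mean, obs)   # per-model NSEs and ensemble-mean NSE
```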