Open access
Technical Papers
Feb 26, 2014

Concepts of Information Content and Likelihood in Parameter Calibration for Hydrological Simulation Models

Publication: Journal of Hydrologic Engineering
Volume 20, Issue 1

Abstract

There remains a great deal of uncertainty about uncertainty estimation in hydrological modeling. Given that hydrology is still a subject limited by the available measurement techniques, it does not appear that the issue of epistemic error in hydrological data will go away for the foreseeable future, and it may be necessary to find a way to allow for robust model conditioning and more subjective treatments of potential epistemic errors in prediction. In this paper an attempt is made to analyze how this is the result of the epistemic uncertainties inherent in the hydrological modeling process and their impact on model conditioning and hypothesis testing. Some ideas are proposed about how to deal with assessing the information in hydrological data and how it might influence model conditioning based on hydrological reasoning, with an application to rainfall-runoff modeling of a catchment in northern England, where inconsistent data for some events can introduce disinformation into the model conditioning process. A methodology is presented to make an assessment of the relative information content of calibration data before running a model that can then inform the evaluation of model runs and resulting prediction uncertainties.

Once upon a Time

This paper is intended as a polemic. Although it has been invited as a review paper, it will not be balanced in its presentation. Neither will it be particularly biased toward one or another approach to modeling parameter calibration, though clearly the authors have a history in having developed the Generalised Likelihood Uncertainty Estimation (GLUE) methodology. In fact, the thoughts in this paper are the result of a process of thinking about the so-called GLUE controversy (Beven 2006, 2008, 2012a; Beven et al. 2008, 2012a; Clark et al. 2012) over a long period of time. But the story starts a long time ago.
Once upon a time, in the days not so long ago when computers were big and slow and had little memory, there was a belief that it might be possible to find a single computer model of a catchment that was in some sense optimal. This was, in part, a response to the limitations of the computers of the time, but it was also in part a deeper form of modernist philosophy of science expressed in terms of gradually evolving toward the true model of a catchment system. This was despite the fact that many of the conceptual model structures used were very simple [indeed, this was held to be a virtue by Dawdy and O’Donnell (1965)]; despite the fact that a lot of model calibration was carried out by manual trial and error [though a number of automatic calibration techniques were being used; see, for example, Blackie and Eeles (1985)]; despite the knowledge that the optimum model would depend on the objective function used; and despite the fact that the objective function response surface might be complex [again, see Blackie and Eeles (1985)], especially when it was necessary to estimate many different parameters (the original Stanford Watershed Model had on the order of 35 parameters if snowmelt was included, including multipliers for both rainfall and evapotranspiration estimates; there was a suggestion at the time that the only person who could really calibrate it successfully was Norman Crawford, who wrote the model as part of his Ph.D. thesis, supervised by Ray Linsley). In fact, some very early work on model structures was driven by a desire to produce simple response surfaces to facilitate automatic calibration (e.g., Ibbitt and O’Donnell 1974), something that has resurfaced recently in terms of using poor numerical algorithms in more complex models, resulting in greater complexity of likelihood surfaces (e.g., Kavetski and Clark 2010).
One response to these difficulties (and also the sheer number of conceptual hydrological models available) was to take an alternative view of model calibration. In this view, model parameter values would have a basis in the physics of the flow processes and would be estimated by measurement or by knowledge of the characteristics of the soils, vegetation, and channels in the catchment. This was the basis for the Freeze and Harlan (1969) blueprint for a physically based digitally simulated model of basin hydrology. Such distributed models could, at least in principle (a much used phrase in the 1970s and 1980s), have parameter values that varied from element to element in the distributed solution, with the result that thousands, rather than tens, of parameters were now required. Such models have persisted, of course, despite both the physics of their conceptualization and the possibility of estimating effective parameter values being questioned (e.g., Beven 1989, 2001a, 2002a; Grayson et al. 1992). Some distributed models, e.g., Soil and Water Assessment Tool (SWAT), even provide databases of default parameter values for different conditions and have been used worldwide on this basis. This makes it easier for users to simulate even ungauged basins—but should really not be expected to guarantee good results. The extrapolation of model parameter values from one catchment to another is fraught with difficulty, though in some cases it can work even when land use is changing (e.g., Buytaert and Beven 2009). This was one of the drivers behind the International Association of Hydrological Sciences IAHS Prediction of Ungauged Basins (PUB) initiative (e.g., Sivapalan 2003), to see whether better hydrological understanding could lead to more success in transferring parameter values to new catchments (Blöschl et al. 2013).
Another strand of research concerned itself more directly with the problem of parameter calibration on complex, high-dimensional response surfaces. Various techniques have been developed since the 1980s, including Shuffled Complex Evolution (SCE), simulated annealing, particle swarm optimization, Pareto optimization, and other algorithms. All were attempts to find a global optimum (or Pareto set of optima in the case of multiple competing objective functions) while being robust to the distractions of local optima and local discontinuities in a surface. The SCE-University of Arizona (SCE-UA) algorithm in particular was made freely available and has been very widely used in hydrological modeling, including being the basis for other more computationally intensive algorithms [e.g., Vrugt et al. (2009), Laloy and Vrugt (2012)]. It is not totally immune to the problem of local optima, however. Early on, the study of Duan et al. (1992) demonstrated how starting the algorithm from different randomly chosen points in the model space had a high probability of finding parameter sets close to the global optimum but might occasionally fail even for only a six-parameter model.
Over the same period of time, the question of uncertainty associated with the predictions of hydrological models has also assumed greater importance. This was partly a result of the general increase in computer power available to hydrological modelers (in the 1970s and 1980s, estimates of uncertainty were generally limited to analytical statistical calculations around some so-called best fit or maximum likelihood model). More computer power means that either much more complex model structures or much finer spatial discretizations can be implemented or, alternatively, that many more runs of a model can be made. One way of using those many runs is in uncertainty assessment [Monte Carlo experiments have been carried out even with the SHE model; see Vazquez et al. (2009)]. It was also partly a result of the more general recognition of the uncertainties associated with environmental models that can arise from many different sources and be manifest in complex ways (e.g., Beven 2009, 2013). Uncertainty estimation can be important in the modeling process if it makes a difference in decision making, affecting either the decision that is made (e.g., Todini 2004) or the way that a decision is made (e.g., Beven 2011). But the different sources of uncertainty are complex and involve a lack of knowledge as well as random natural variability (epistemic as well as aleatory errors). Thus there is an important issue about how far formal statistical theory can be used to represent different types and sources of uncertainty [e.g., the recent exchange of Beven et al. (2012a) and Clark et al. (2011, 2012), as well as Montanari and Koutsoyiannis, this issue].
As pointed out by the referees on this paper, all methods of uncertainty estimation can be considered to be statistical in the sense that the only evidence that is available about potential distributions of future errors are those errors that have been seen in conditioning a model or models during calibration. The present authors agree but wish to distinguish between formal statistical methods of uncertainty estimation that assume that errors are fundamentally aleatory and methods that try to take more explicit account of epistemic errors, at least in trying to reflect the expectation of reduced information content relative to purely aleatory assumptions when estimating prediction uncertainties.
Formal statistical methods involve the evaluation of a likelihood function for different model runs, which depends on assumptions made about the modeling errors (including assumptions about distributional form; structured bias, correlation, and heteroscedasticity; stationarity; and the information content of a single residual). Where the formal assumptions hold, the likelihoods and uncertainties associated with the predictions have a formal probabilistic interpretation. The question is, however, whether the relatively simple assumptions that are generally made are adequate in the face of uncertainties arising from lack of knowledge. By definition, of course, not enough is known about epistemic errors, so any uncertainty estimation method that tries to allow for epistemic error cannot ensure a formal probabilistic interpretation. It might also fail in prediction, but this will also be the case for formal methods that treat epistemic errors as if they were aleatory in calibration when the epistemic errors in prediction are (at least for some events) rather different. This argument hinges on how far error distributions can be considered stationary in going from calibration to prediction periods. The question of stationarity is complex and considered further in what follows.
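For reference, in the simplest case of independent, identically distributed Gaussian residuals, such a formal likelihood function takes the familiar form

$$L(\theta \mid \mathbf{y}) = \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp\!\left[-\frac{(y_t - \hat{y}_t(\theta))^2}{2\sigma_\varepsilon^2}\right] \propto (\sigma_\varepsilon^2)^{-N/2} \exp\!\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^{N}\varepsilon_t^2\right)$$

where $\varepsilon_t = y_t - \hat{y}_t(\theta)$ is the residual between observation and model prediction at time $t$; allowing for bias, autocorrelation, or heteroscedasticity adds parameters to this form but does not change its essentially aleatory character.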
In fact, the main questions that arise in calibrating hydrological simulation models have not changed in the last 30 years or more (e.g., Nash and Sutcliffe 1970; Kirkby 1975; Gupta and Sorooshian 1985; Gupta et al. 1998, 2009; Kuczera and Mroczkowski 1998; Beran 1999; Vrugt et al. 2002; Liu and Gupta 2007). They can be summarized as follows:
1. How much information content is there in a period of hydrological data that might be used to calibrate a simulation model?
2. What is an appropriate likelihood measure for assessing the performance of different models, given the many sources of uncertainty in the modeling process?
To these questions might be added a third:
3. How does one deal with elements of nonstationarity and surprise in prediction when epistemic errors might appear that have not been seen in calibration/conditioning?
It is these general questions that this paper aims to address from a hydrological (rather than information theory or statistical) perspective. Reference will be made only to simulation models. Models used for forecasting the N-step-ahead output variables given current values of those variables require a somewhat different perspective [this includes the increasingly common activity of using off-line remote sensing data assimilation to correct the predictions of distributed simulation models; see for example Pauwels et al. (2001), Rodell et al. (2004), and Reichle et al. (2008)]. The use of data assimilation in forecasting can be useful for practical applications, but it is not simulation in the sense used here.
Simulation models are needed in practical applications of predicting the impacts of future change and in predicting the response of ungauged areas, situations where data assimilation is not possible. Simulation models are also an important means of demonstrating improvements in hydrological understanding and science (whereas data assimilation can compensate for deficiencies in either observations or science, often in very useful ways when forecasts are required in real time). The question, then, is how far scientific progress can be demonstrated given the limitations and uncertainties in the modeling process that affect the calibration of model parameters (Beven 2000, 2002b; Beven and Freer 2001; Beven and Westerberg 2011; Westerberg et al. 2011a; Beven et al. 2011; McMillan et al. 2010, 2012).

Perceptual Model of Uncertainty in Hydrological Modeling Process

In previous work (Beven 1989, 2002b, 2006, 2012b) the concept of a perceptual model of hydrological processes and the way in which any mathematical or conceptual model of those processes necessarily involves a gross simplification of the perceptual model were discussed. A perceptual model can include all the complexities, heterogeneities, and nonstationarities in the functioning of a catchment. It can serve as a useful test for the assumptions that are used in developing a mathematical model but is not in itself a useful predictor since it is only qualitative in nature. Some processes, such as preferential flows, can be perceived as important in catchment response, but there are no agreed or satisfactory mathematical descriptions of such processes that can be used to quantify their effect (Beven and Germann 1982, 2013; Bachmair and Weiler 2011). They are thus often ignored (which in itself might be a cause of epistemic error in prediction). And, of course, there may well be important omissions from even the qualitative perceptual process model that might be important in the actual catchment response.
One could similarly outline a perceptual model of uncertainty in the hydrological modeling process from a hydrological perspective. There are many different sources of uncertainty, all of which will affect the calibration of model parameters. Since parameter identification will interact with all the sources of uncertainty (including scale and commensurability effects), one would expect that values of parameters resulting from calibration would be effective values that might be different from prior estimations of values based on direct observations or other sources.
The main sources of uncertainty that will affect model calibration (e.g., Beven 2005, 2012a; Liu and Gupta 2007) are as follows:
1. Uncertainty introduced by errors in the model structure,
2. Uncertainty introduced by errors in boundary and initial conditions,
3. Uncertainty introduced by errors in the observations used in model calibration (including the commensurability errors arising from differences in meaning between model variables and parameters and observables), and
4. Uncertainty introduced by computational constraints and likelihood specification in the calibration process.
Note that no parameter uncertainty is included in this list because it cannot be considered independently of the listed sources in the case of calibration or conditioning (e.g., Beven 2005, 2006, 2009). A separate source of uncertainty is an absence of data for calibration, but this is a less interesting case as the uncertainty estimated then depends only on the prior assumptions about each source of uncertainty. Prior parameter estimates will also have an effect in calibration/conditioning, but it is rare in hydrological modeling for there to be confidence in defining prior distributions (and covariation) for parameter values.
All of these sources of uncertainty include both variation that can be represented as aleatory natural random variability and the epistemic errors that result from a lack of knowledge (e.g., Hall 2003; Rougier 2012; Rougier and Beven 2012; Beven 2013; Sun et al. 2012; Beven and Young 2013). A much more detailed classification, including linguistic and cultural uncertainties, can also be developed (Mayo 1996; Allchin 2004), but this simple differentiation will suffice here. This difference has a much older history (e.g., Knight 1921; Keynes 1936; Popper 1957; Helton and Burmaster 1996; Howson 2003) but has been generally neglected in hydrological simulation. It is suggested here, however, that its importance should be recognized more explicitly.
The reason it is important is that in general it is not possible to represent epistemic uncertainties by a formal statistical model with identifiable parameters because epistemic uncertainties will generally result in arbitrarily nonstationary error characteristics. Comments received on the first draft of this paper demonstrate that this difference requires some further explanation because purely aleatory errors can be considered arbitrary and because many relatively simple statistical error models might produce nonstationary characteristics for short periods of data, even if their parameters are constant and the generating process stationary in the long term. In addition, the only evidence for the nature of either aleatory or epistemic error will be from the available calibration data. If there are nonstationary characteristics that lead to quite different errors in prediction, then there can be no question of a priori evidence for them. A good illustration of this is the postaudit analysis of the predictions of groundwater models in Konikow and Bredehoeft (1992), where success depended largely on how well future boundary conditions had been estimated [also a form of epistemic error problem; see Popper (1957) for a similar statement].
This is one reason why many modelers do not see why epistemic errors should not be treated in the same way as aleatory errors within a formal statistical framework (see the arguments of Montanari and Koutsoyiannis, this issue). This also has the claimed advantage of objectivity in the sense that the assumptions of an error model can be checked against the actual model residuals (it is, indeed, good practice to do so). One recent referee even went so far as to say that the distinction between aleatory and epistemic error is not important because in principle statistical error models can be “infinitely complex.” That just begs the question of how such a complex error model might be identified for arbitrary epistemic errors [and how that error model might interact with the identification of the underlying hydrological model; see, for example, the discussion of Beven (2009)].
There is that word arbitrary again. So how is arbitrary different from aleatory, and why is the difference important? This is really a question of the time scale of variability relative to the time scales in calibration and prediction, and particularly in calibration, where all the information content lies that informs the model predictions. From the law of large numbers, under the assumptions of an aleatory statistical error model (after allowing for bias or autocorrelation, for example), it is expected that the low-order moments of a distribution of residuals will converge quite rapidly given a reasonably long period of available calibration data. Indeed, it is hoped that the calibration period will be long enough to ensure convergence to the moments of the underlying population so as to increase confidence in the error model parameters when they are used to estimate uncertainty in prediction. The Gaussian model of likelihood, under the aleatory assumption, also reflects this. The dominant term is normally proportional to $(\sigma_\varepsilon^2)^{-N/2}$, where $\sigma_\varepsilon^2$ is the variance of the model residuals and $N$ is the number of residuals. Therefore, when $N$ is very large (as is quite common in the calibration of hydrological models), two models with very similar error variances will have orders-of-magnitude differences in likelihood. In effect, the likelihood space is stretched hugely by the aleatory error assumption.
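To make this stretching concrete, a small numerical sketch (with a hypothetical record length and hypothetical residual variances) shows how the $(\sigma_\varepsilon^2)^{-N/2}$ term separates two nearly indistinguishable models:

```python
import numpy as np

# Hypothetical example: two models with almost identical residual variances,
# evaluated over roughly five years of hourly residuals.
N = 5 * 365 * 24        # number of residuals (hypothetical)
var_a = 1.00            # residual variance of model A (arbitrary units)
var_b = 1.01            # model B fits only 1% worse

# Dominant term of the Gaussian log-likelihood: -(N/2) * ln(variance)
log_L_a = -0.5 * N * np.log(var_a)
log_L_b = -0.5 * N * np.log(var_b)

orders_of_magnitude = (log_L_a - log_L_b) / np.log(10)
print(f"model A is favored by a factor of ~10^{orders_of_magnitude:.0f}")
# ~10^95: a model that is almost indistinguishable in fit receives
# negligible posterior weight under the aleatory assumption.
```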
That might, of course, be absolutely the right thing to do if the errors really are aleatory (though it still seems a little odd given the possibility of model structural error that might itself lead to nonstationarity of residual characteristics). But what if the errors are arbitrary in the sense of having odd periods of rather different characteristics during the calibration period in any one of the different sources of uncertainty in the modeling process? Most modelers will recognize this situation. Even models that fit most of the data well will have a poor fit to some odd events [e.g., comment of Beven (2009) on the Leaf River data used in Vrugt et al. (2009)]. In this case there may be no asymptotic distribution for any of the sources of error for the time scale of an application, implying also that the arbitrary errors in prediction might be different from those in calibration. The point is that treating such errors as if they were aleatory will overestimate the information content of the calibration period residuals.
It is not difficult to identify potential errors of this (epistemic) type. Consider the very practical example of a perceptual model of the uncertainties associated with rainfall inputs to a hydrological simulation model. The uncertainty will be different for different numbers of rain gauges. It will be different depending on rain gauge design, placement, and exposure (particularly in catchments with a significant elevation range and wind effects on catch). It will be different depending on whether rain gauge information can be combined with radar data (with all of the epistemic uncertainties involved in converting the radar return signal into rainfall intensities). One might have the perception that for a given number of rain gauges the epistemic error might be different for small patchy rainstorms than for synoptic events, but even in synoptic events there might be cells moving over a catchment that are poorly represented by a sparse measurement network. Most of these sources of uncertainty are epistemic, not aleatory. Their characteristics might be quite different from event to event and in some cases might lead to observations that are hydrologically inconsistent, such as events where the apparent runoff coefficient is greater than 1. Such events should not be used in model calibration (Beven and Westerberg 2011; Beven et al. 2011, see subsequent discussion), but this has rarely been considered in calibration exercises (partly because of the difficulty of estimating runoff coefficients or identifying inconsistent data for individual events). In fact, hydrological reasoning also suggests that epistemic errors of this type can have an impact beyond the single event because of the way input errors are processed through the dynamics of a model to produce an effect on the antecedent state of the (modeled) catchment in predicting the next event.
Similar perceptual arguments can be made about other boundary conditions required to run a model and about the observed discharges used to evaluate model predictions in calibration [since discharge is generally a constructed variable dependent itself on a model; see Beven et al. (2012b) and Westerberg et al. (2011a)]. This is also not something new: in their seminal paper Stephenson and Freeze (1974) pointed out that the validation of a distributed hillslope model would be impossible because it was not possible to know both the initial and boundary conditions for the simulation of any real hillslope with sufficient accuracy and precision. One would also expect the nature of errors to be nonstationary for good hydrological reasons. The way in which boundary condition uncertainties interact with model structural uncertainties might be quite different in predictions of the rising limb of a hydrograph compared to the falling limb of a hydrograph. One would also expect that the magnitude of prediction uncertainties might increase with predicted discharge and that any time step correlation in model errors might be quite different for the rising and falling limbs. Such effects should be expected—and they should be expected not to have purely random characteristics.

Uncertainty and Complexity

All perceptual models of catchment processes suggest that catchments are complex, nonlinear, nonstationary, dynamic systems within which the response to an event is highly dependent on the antecedent state of the system, as well as the sequence and pattern of inputs for that event (e.g., Graham et al. 2010; Bachmair and Weiler 2011; Beven 2012b, Ch. 1). This has led to arguments that uncertainty in predicting outputs should be an expectation as a result of the nonlinear dynamics (e.g., Sivakumar 2000, 2007, 2009; Islam and Sivakumar 2002; Koutsoyiannis 2010, 2011; Montanari and Koutsoyiannis 2012; Weijs 2009). Certainly, simplified model structures might not properly represent the complex dynamics [a point that has equally been made in other areas, e.g., Smith (2001)], but the response of a hydrological system is by its very nature constrained by the inputs and mass and energy balances. In general, the response to an event is limited by mass balance, and the difference between output and input cannot be greater than the volume stored in the system at the start of the event. Actual evapotranspiration is limited by the energy balance (and by the available water). Thus, nonlinear complexity will add to model structural error but will be constrained. However, it seems reasonable to expect that model structural errors should be dependent on the complex trajectories of system states. A further issue in model structural uncertainty will be the nonlinear responses to perturbations in numerical implementation of the model equations (Clark and Kavetski 2010; Kavetski and Clark 2010).

Uncertainty, Equifinality, and GLUE

The complexity of issues that arise in model calibration means that it is impossible to define a so-called optimal model of a catchment. Whether a model appears optimal will depend on the likelihood measures used and the period of calibration data used (with all the associated, but generally unknown a priori, aleatory and epistemic errors). Thus, there may be many different model structures and parameter sets within model structures that produce simulations considered acceptable to the user in the sense that they might be useful in prediction. This is what Beven (1993, 2006) refers to as equifinality (see note at end of paper). The term was chosen over the more common terms (at the time) of nonuniqueness and nonidentifiability to indicate that this was a generic issue in model calibration rather than simply a difficulty in finding the optimum model. The term ambiguity has also been used in a similar sense (e.g., Silbergeld 1987; in hydrology, Zin and Saulnier 2000).
The recognition of equifinality as a generic issue led to the development of the Generalised Likelihood Uncertainty Estimation (GLUE) methodology of Beven and Binley (1992) as a way of defining a set of acceptable models and using the likelihood weighted ensemble in prediction. The likelihood weights used were originally based on informal measures to avoid the need to define a model of the errors, including measures such as the Nash-Sutcliffe efficiency (Nash and Sutcliffe 1970) that had been widely used in model optimization. Beven and Binley (1992) gave examples of a number of such measures and how they could be shaped by a user-defined parameter to reflect user preferences. The researchers also showed how different likelihoods could be combined in different ways (including Bayesian multiplication, weighted addition, fuzzy intersection, and fuzzy union) and gave an example of how a more efficient search algorithm could be used to find acceptable models, based on a nearest-neighbor surface interpolation technique with a randomized choice as to whether to run a model. More recently, GLUE has been extended by evaluating models relative to the limits of acceptability applied to individual measurements (e.g., Blazkova and Beven 2009; Liu et al. 2009; Westerberg et al. 2011b).
In this respect, GLUE was ahead of its time in preceding formal Bayesian methods of model calibration introduced into hydrological modeling by Kuczera and Parent (1998). GLUE has, however, been strongly criticized by adherents of formal Bayesian methods because of the subjective choices involved in any application [e.g., Mantovan and Todini 2006; Stedinger et al. 2008; Clark et al. 2011; but see the responses of Beven et al. (2008, 2012a)].

Reducing Epistemic Uncertainty by Improving Knowledge

It is often suggested that one way of differentiating between aleatory and epistemic uncertainties is that aleatory uncertainties represent irreducible natural variability while epistemic uncertainties can be reduced by additional study. The implication is that epistemic uncertainties might (eventually) be reduced to aleatory uncertainties, given enough additional observation and research. In hydrology, however, this does not seem to be a useful argument, let alone a justification for treating epistemic uncertainties as if they were aleatory, pending those further observations and understanding. As noted earlier, hydrology as a science is limited by the techniques available for estimating the variables in the water balance equation. Despite advances in geophysical methodologies, that is even more the case for what happens to water in the subsurface. Thus, there is much about the characteristics of catchments that will remain unknowable (except by uncertain model inference) for the foreseeable future, particularly with respect to past periods of data that might be used for model calibration, when in general it will be impossible to reduce uncertainties further. Other ways of dealing with nonstationary epistemic uncertainties must therefore be found.

Perceptions of the Meaning of Uncertainty

An interesting issue that then arises in any assessment of uncertainty in model calibration and prediction is the communication of the meaning of the uncertainty estimates (e.g., Faulkner et al. 2007). The word uncertainty itself evokes quite different perceptions from different people and in different circumstances. This is where linguistic and cultural uncertainties can also be important. Uncertainty and likelihood have formal scientific meanings in ideal cases of random variability, but the importance of epistemic uncertainties in environmental modeling makes the interpretation and communication of meaning more problematic. One perception of uncertainty is of a lack of predictability as a result of a lack of understanding of how given systems function (an argument used, for example, by climate skeptics). This is an interpretation of uncertainty as ignorance or lack of predictability.
Very often, however, such an interpretation is not justified. There are many types of prediction where uncertainty estimation might be a useful part of the decision-making process as an expression of confidence in predictions even when uncertainties are epistemic rather than aleatory (e.g., Pappenberger and Beven 2006; Juston et al. 2013). Confidence limits, of course, also have a formal meaning within a statistical framework, as the expression of the uncertainty in a quantity of interest resulting from sampling variability. In this framework, the expectation is that as the sample size increases, the confidence limits should narrow. If the uncertainties due to sampling really are aleatory, every new set of samples should be useful in conditioning the confidence [this is the coherence argument of Mantovan and Todini (2006)].
There is no analogous expectation if the uncertainties are epistemic (or, indeed, if the scale of aleatory variability is large relative to the calibration set; it may be difficult to distinguish the two, but rather than a reason for treating epistemic uncertainties as aleatory, this would seem to be an argument for treating aleatory uncertainty as epistemic). As the sample size increases, one might expect new issues to arise, limiting the degree to which the confidence in a predicted quantity might be reduced. It is difficult to allow for such future contingencies since one can still only express the confidence in future predictions according to past performance. But one should not expect every new event or set of samples to be coherent in the same way as for aleatory uncertainties, and one should be wary of real future surprises [see the discussion of Beven et al. (2008), Beven (2013), and the subsequent section on assessing information content].

Rethinking Concepts of Information Content of Hydrological Data

Thus, given the potential for both aleatory and epistemic sources of uncertainty in hydrological simulation, how should one go about assessing the information content of a set of data that might be used in parameter calibration? Traditionally, information theory has assessed information content in terms of entropy measures [e.g., going back at least as far as Amorocho and Espildora (1973); see also Singh (1998) and Weijs et al. (2010)], while systems analysis and statistics have made use of Akaike, Bayes, Deviance, Young, and other information criteria that focus on the assessment of model fit and parameter uncertainties. Both approaches have been applied widely in hydrological modeling but seem to represent the wrong place to start from a hydrological perspective. Entropy measures [and the equivalent U-uncertainty measures in a possibilistic framework; see Beven and Binley (1992)] are based only on the shapes of distributions, so that any information about error sequences is largely lost (except when looking at lagged cross-entropy measures). Statistical information measures, on the other hand, assume that all errors can be treated as if they were aleatory, which is also clearly not the case. A new approach seems necessary.
Starting from a hydrological perspective, such an approach should be event based. Immediately the practical import of the nature of uncertainties in the modeling process becomes a little clearer. Let us begin with an assumption that all events, a priori, should contribute equally to the information content of a calibration data set. It is known, however, that such an assumption is certainly not appropriate: the real contribution of information for different events will depend on multiple factors as follows:
1. What will increase the relative information content of an event?
   a. The relative accuracy of the estimation of the inputs driving the model;
   b. The relative accuracy of observations with which model outputs will be compared (including commensurability issues); and
   c. The unusualness of an event (e.g., extremes, rarity of initial conditions).
2. What will decrease the relative information content of an event?
   d. Repetition (multiple examples of similar conditions);
   e. The inconsistency of input and output data;
   f. The relative uncertainty of observations (e.g., highly uncertain overbank flood discharges would reduce information content of an extreme event, discharges for catchments with ill-defined rating curves might be less informative than in catchments with well-defined curves); and
   g. A prior disinformative/less informative event over the dynamic response time scale of the catchment.
Ideally, of course, one would wish to quantify the information contributed by each event, but nearly all of these factors are subject to epistemic uncertainties. For example, unusualness would be an advantage if the data are consistent. If a model can then reproduce the catchment response of an unusual event, then one’s confidence in that model as a simulator would increase [point c above; see also Singh and Bárdossy (2012)]. But an event might be unusual only because of epistemic uncertainty in the available observations (a local convective cell of rainfall happens to be centered on a rain gauge such that the estimate of input rainfall over the catchment is unusually high, but the discharge hydrograph shows little or no response). In some cases, such inconsistencies might be easily recognizable [e.g., when runoff coefficients are greater than 1 or exceptionally low; see Beven (2009), Beven et al. (2011), and the example in what follows]. Any hydrologist with modeling experience will be able to recognize a variety of other similar cases. Such events might be disinformative in the sense that the data might be misleading in the specific context of the evaluation of sets of model parameter values to determine whether they will be useful in prediction. Clearly such data might be informative in other contexts (e.g., in providing information about epistemic data errors). In many cases, distinguishing between usefully unusual and inconsistent events might be rather difficult, even more so when the inconsistency is generated by a prior event (point g above). What is clear, however, is that the effects are certainly not aleatory in nature in the sense of being drawn from a consistent and stationary statistical distribution of potential errors.
These arguments suggest that an attempt should be made to identify the relative information content of a series of events independent of any model run. Otherwise, there is the possibility of a reductio ad absurdum of declaring as disinformative all events that cannot be fitted by a particular model. Making information content conditional on the assumption that the model is a correct representation of the system seems to be an unsuitable strategy in a hydrological context (albeit traditional in both frequentist and Bayesian statistical inference, possibly after adding a model discrepancy function).
So what is the alternative? How can an evaluation of the information content (or potential disinformation) of an event be assessed for the purposes of parameter estimation? One suggestion, given enough calibration events, is to take advantage of the tension between the consistency and information redundancy of event data. There is a (perceptual) expectation that events with similar antecedent conditions and similar rainfall amounts and intensities will have similar observed responses. Thus, classifying events according to antecedent conditions and rainfall characteristics (based on the data alone) will result in a division of events into different clusters, within which the events should be more or less consistent with each other. Unusual events in each cluster would be expected to stand out and could be assessed as to whether or not they might be informative for parameter calibration purposes. Multiple events that are similarly unusual might indicate the need for a new class of events.
In this way, it is possible to assess a likelihood weight associated with each storm period that takes into account the characteristics listed earlier that should either increase or decrease the potential information content of that period of data. This approach is explored in the example application in what follows, where clustering based on storm characteristics is used both in the determination of disinformative periods and in the formulation of a measure of storm information content.

Rethinking Concepts of Likelihood of Hydrological Models

Why One Should Not Get Hung up on Formal Bayes

There has been an intense and continuing debate about uncertainty estimation in hydrological simulation (and indeed elsewhere in different modeling communities). There is a fundamental reason why this has been the case. As noted earlier, it is not actually known how to characterize the many different sources of epistemic uncertainty, and their interactions, that affect the hydrological modeling process. Thus, there can be no correct answer, a situation that is therefore ripe for different (strongly held) opinions, methodologies, and even philosophies [see most recently Beven et al. (2012) and Clark et al. (2012)].
This is actually a situation that would have been recognized by Thomas Bayes in the eighteenth century (Howson and Urbach 1993). Bayes actually died before his work was published, but a paper found among his effects after his death showed that he had tried to find a general way of expressing the odds that should be accepted for a certain hypothesis given some prior beliefs about that hypothesis and some evidence supporting that hypothesis. Bayes’s equation allows this combination of (possibly subjective) prior belief and (possibly subjective) evidence in a very simple (multiplicative) way. Bayes’s equation works nicely when both prior belief and evidence can be expressed in terms of odds or probabilities. There is nothing to object to in this original formulation of Bayes. A similar discrete form of the equation was derived independently by Laplace (1986).
It is, however, rather different from the use of Bayes’s equation in recent hydrological applications, as taken directly from the development of Bayesian statistics in the twentieth century. The critical aspect of this development has been the expression of the evidence in terms of a likelihood function. The likelihood function is based on expressing how well a hypothesis (a hydrological model in this case) can reproduce some observations. It is generally based on a sample of residual differences between the model predictions and the observations. A formal likelihood function can include some parametric structure in representing the residuals (bias or simple trend, simple autocorrelation structures, or simple structured heteroscedasticity) but fundamentally assumes that the residual series is aleatory with stationary properties. This allows the likelihood to have a formal probabilistic interpretation. There is no reason why this should also apply in the case of epistemic uncertainties (actually, even if a posterior analysis of model residuals shows apparently aleatory characteristics, the perceptual expectation should still be for nonstationarity and surprise in prediction). Thus, for real data, it is difficult to justify the strong formal Bayesian position taken in hydrology, for example, by Mantovan and Todini (2006), Stedinger et al. (2008), and Clark et al. (2011).
There is, however, a defense that can be used in support of the formal statistical methods. This suggests that the residual model structure used in the likelihood function can be tested against the actual characteristics of the residual series. This, it is suggested, can provide objective support for a choice of likelihood. In addition, the use of Bayes’s equation in a way that allows the estimation of the probability of predicting a future observation conditional on a model allows the verification of the consequent uncertainty estimates in future predictions. This is the basis for the claims of objectivity for the formal Bayesian statistical approach. There is a certain irony in this claim: one of the drivers for the work of R. A. Fisher in developing what is now called frequentist statistics in the early twentieth century was the aim of avoiding the subjectivity inherent in Bayes. In hydrological modeling, moreover, one’s prior knowledge about the effective parameter values that might be required for a model to provide good predictions of the observations is generally poor (clearly not an operational problem in Bayes if a noninformative prior is used, but then all the posterior inference depends on the definition of the likelihood function).
It is worth making clear that a Bayesian approach does not depend on the use of such a formal statistical likelihood or on the Gaussian definition of information (based on squared residual errors) that underlies most uses of formal likelihoods. In Bayes’s original formulation, a posterior belief in a hypothesis was developed based on a prior belief and some estimates of general odds that the hypothesis was correct. The question then is how best to estimate such odds, or an equivalent likelihood, in the difficult case of hydrological simulation subject to epistemic error. The important issue here is that the odds should reflect in some way the expectation that errors will have both epistemic and aleatory components, not treated as if the information content were purely aleatory, which results in overconfidence in inference about models and parameters. The Gaussian function of information is also not a necessary assumption. In the derivation by Laplace (1986), he assumed a function to assess the information associated with a new observation that depends on the absolute value of a residual. Other choices are also possible (Tarantola 2006).
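In standard form, the per-residual weightings implied by the two assumptions can be written as

$$L(\varepsilon_t) \propto \exp\!\left(-\frac{\varepsilon_t^2}{2\sigma_\varepsilon^2}\right) \;\text{(Gauss)} \qquad \text{and} \qquad L(\varepsilon_t) \propto \exp\!\left(-\frac{|\varepsilon_t|}{b}\right) \;\text{(Laplace)}$$

where $\sigma_\varepsilon$ and $b$ are scale parameters; the Laplace form penalizes large residuals much less severely than the quadratic Gaussian form.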

One Should Expect a Model to Perform Better Than When Using Estimates Based Directly on the Available Data

So how does one determine the appropriate odds that a model might be useful in prediction when, from a perceptual model of uncertainties, one expects that not all error should be treated as aleatory? One possibility is to revisit the spread of observed hydrographs in the classification analysis used in the assessment of information content. That spread of hydrographs provides a scale against which the errors in model predictions can be assessed, one that allows some implicit accounting for epistemic and aleatory uncertainty in the event rainfall estimates. Observation error on each hydrograph in the cluster can be added to the spread, where this can be explicitly represented (e.g., Liu et al. 2009; Krueger et al. 2009; Westerberg et al. 2011a, b). The event hydrographs for each model run can then be compared against the variability in the hydrograph clusters. For a model to be useful in prediction, one would expect it to take more account of the particularities of each event in each cluster such that a good model would be expected to perform better than the averaged response over all storms in the cluster.

Reinforcement, Redundancy, Out-of-Range Prediction, and Surprise

Within the formal Bayesian paradigm the coherency argument suggests that every new observation and associated model residual should be informative in shaping the posterior distribution of model parameters (e.g., Mantovan and Todini 2006). It is, however, this characteristic of the formulation of the likelihood function that leads to the very strong conditioning (stretching of the likelihood surface) that is common in formal Bayesian inference. There is a question as to whether this strong conditioning is appropriate when the errors are epistemic rather than aleatory (or, actually, even when the errors are aleatory if some information is effectively redundant). If a model does well at predicting two very similar events, is the information provided by the second event a strong reason to reinforce the belief in that model or (to some extent) redundant (in that it is already known that the model could perform well on such an event)? The information provided by the second event is a useful reinforcement, but it should essentially provide less real information than that provided by a quite different event. The concepts for defining the information content of a period of data outlined previously are intended to reflect the range of events within a period (see also Singh and Bárdossy 2012).
It is of course possible that in prediction, new types of events will appear so that some out-of-range prediction will be required. It is also possible that events that might have been defined as disinformative in calibration will occur in prediction, but without the prior knowledge that the event will prove to be inconsistent. This will only be revealed in a post hoc analysis. The potential for different epistemic error characteristics in prediction is one reason why the performance of hydrological models in so-called validation periods is commonly found to be worse than in calibration periods.

What Are the Required Characteristics of a Likelihood Measure Given Epistemic Uncertainty?

Accept, at least as a working hypothesis, that a formal probabilistic likelihood function will have limited value given the expectation of nonstationary epistemic uncertainties that cannot be adequately represented by a statistical model. The question that then arises is how to define a likelihood measure given the available information. The name likelihood function will be reserved for the type of formal statistical likelihoods currently in use. A likelihood function will have a formal probabilistic interpretation (if and only if the assumptions of the error model are correct and hold in prediction). A likelihood measure might conform to axioms of probability or possibility (see, for example, Smith et al. 2008) but by necessity cannot express the probabilities of potential epistemic errors because those probabilities cannot be known (otherwise they would not be epistemic). Likelihood is here intended to express the belief, in terms of odds, that a model as hypothesis might be useful in prediction. Note that it can be entirely consistent with Bayes’s equation in this sense, although other ways of combining likelihood measures could also be chosen (Beven and Binley 1992, 2013). It is also suggested that such a measure should take into account the estimate of the relative information content of different periods of data as outlined earlier, as well as performance as expressed in a series of model residuals.
It might also be required that a model hypothesis in which one has no confidence that it will prove useful in prediction should be given a likelihood of zero. Note that this does not happen in formal Bayes inference unless a bounded distribution is used in deriving the formal likelihood. Models might be given a very low likelihood, but never zero. All hypothesis testing is then relative (for example using Bayes ratios), but the final choice of model or models to be used for prediction is necessarily subjective, especially where one model structure does not appear to be particularly better than another.
Each model is associated with a complex series of residuals that reflect all sources of aleatory and epistemic uncertainty. One measure of belief might then be to evaluate the relative performance of a model compared to the variability in a data-based model based on classification. The simplest, one-parameter, data-based model is the mean flow for a cluster. Then, following Laplace rather than Gauss, a likelihood contribution could be expressed in terms of absolute model residuals within a cluster as
$$L \propto f\!\left(\frac{\varepsilon_s}{\varepsilon_d}\right)$$
where $\varepsilon_s$ = mean absolute error for a simulation model; and $\varepsilon_d$ = mean absolute error from the mean flow based on the data. If $f(\cdot)$ were $\exp\{-|\cdot|\}$, then this would be analogous to the Laplace discrete form of Bayes, and results over all clusters could be readily combined multiplicatively.
Such a formulation allows for the consideration of what the limit of an acceptable simulation would be. On the basis that the hydrological model should outperform the simplest data-based model, the ratio $\varepsilon_s/\varepsilon_d$ should be at most one. Limits tighter than this reflect (subjective) statements as to how much better the modeler wishes the model to be relative to the data-based model.
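A minimal sketch of how such a measure might be evaluated is given below, assuming hypothetical arrays of observed and simulated discharge and an event-to-cluster assignment produced by the kind of classification described earlier (the function name, array layout, and acceptability limit are illustrative, not those of the study):

```python
import numpy as np

def cluster_likelihood(obs, sim, cluster_id, limit=1.0):
    """Log-likelihood measure based on the ratio of mean absolute errors,
    combined multiplicatively (summed in log space) over clusters.

    obs, sim   : arrays of observed and simulated discharge
    cluster_id : integer cluster label for each time step (-1 = excluded)
    limit      : maximum acceptable eps_s / eps_d ratio (a subjective choice)
    """
    log_like = 0.0
    for c in np.unique(cluster_id[cluster_id >= 0]):
        idx = cluster_id == c
        # eps_d: mean absolute error of the simplest data-based model,
        # i.e., the mean observed flow for the cluster
        eps_d = np.mean(np.abs(obs[idx] - obs[idx].mean()))
        # eps_s: mean absolute error of the simulation model
        eps_s = np.mean(np.abs(obs[idx] - sim[idx]))
        ratio = eps_s / eps_d
        if ratio > limit:
            return -np.inf      # model rejected: zero likelihood
        log_like += -ratio      # Laplace-type contribution exp{-ratio}
    return log_like
```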

Example Application

To illustrate the practical application of the preceding discussion, consider the calibration of a rainfall-runoff model to observed data. The observed data used are discharges and hourly rainfall totals for the 322 km² headwater catchment of the South Tyne (U.K.) above Featherstone. This upland catchment is predominantly covered by moorland that is used for rough grazing and overlies a Carboniferous sequence of coal measures, sandstones, mudstones, and shales. Calibration was performed using data from 1990 to 1997, with the data from 1998 to 2003 held back for evaluation.
The discharge data $y_t$ (indexed by time) are observed using a compound Crump weir. The weir contains the flow at all recorded stages and remains modular throughout its range. The discharge series is generally considered to be of good quality and to represent the so-called natural flow to within 10% at the 95th percentile flow.
There are five recording rain gauges within the catchment. These generally recorded higher rainfall totals when compared with storage gauges within the catchment. Analysis of the rainfall data and the spatial distribution of the gauges indicate there is scope for both the under- and overestimation of the inputs to the catchment under different rainfall patterns. In the modeling exercise that follows, catchment average rainfall estimates, $r_t$, are computed from the recording rain gauges using a Thiessen polygon method.
The catchment averaged rainfall, along with a synthetic potential evapotranspiration series, $p_t$, based on annual and daily sinusoids, is used to force the hydrological model that is to be calibrated. The model is a lumped storage model consisting of a single tank. The volume of water stored in the tank, which has a maximum capacity of $s_{max}$, is denoted by $s_t$ and evolves according to Eqs. (1)-(3):
$$f_{t+1} = r_{t+1} - \gamma s_t - p_t \frac{s_t}{s_{max}}$$
(1)
$$s = \max(s_t + f_{t+1}, 0)$$
(2)
$$s_{t+1} = \min(s, s_{max})$$
(3)
The effective rainfall is given by $u_{t+1} = \gamma s_t + \max(s - s_{t+1}, 0)$. This is routed through two parallel tanks expressed as a linear transfer function [Eq. (4)] using the backward shift operator $z^{-1}$ (i.e., $z^{-i} w_t = w_{t-i}$) [see, for example, Young and Beven (1994), Young et al. (2004)]. The output of these tanks is then summed to give the model output $x_t$. Limiting the parameters such that $0 < \alpha_1 < \alpha_2 < 1$ and $0 < \rho < 1$ ensures that the linear routing is mass conservative, with $\rho$ being the fraction of the effective input going to the tank with the faster response
$$x_t = \frac{(1-\alpha_1)\rho}{1-\alpha_1 z^{-1}} u_{t-2} + \frac{(1-\alpha_2)(1-\rho)}{1-\alpha_2 z^{-1}} u_{t-2}$$
(4)
The model has five parameters $\theta = (s_{max}, \gamma, \alpha_1, \alpha_2, \rho)$ that require calibration.
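A sketch of the model as reconstructed from Eqs. (1)-(4) is given below; the variable names, the assumed initial storage, and the two-step time delay read off the $u_{t-2}$ terms are assumptions of this illustration rather than details given in the text:

```python
import numpy as np

def simulate(theta, rain, pet):
    """Lumped single-tank model with two parallel linear routing stores.

    theta     : (smax, gamma, alpha1, alpha2, rho)
    rain, pet : catchment-average rainfall and potential ET (mm per hour)
    Returns the modeled discharge x (mm per hour).
    """
    smax, gamma, alpha1, alpha2, rho = theta
    n = len(rain)
    u = np.zeros(n)                  # effective rainfall
    x = np.zeros(n)                  # model output
    s = 0.5 * smax                   # assumed initial storage
    for t in range(n - 1):
        # Eq. (1): net flux into the store (rain minus drainage minus ET)
        f = rain[t + 1] - gamma * s - pet[t] * s / smax
        # Eqs. (2)-(3): storage bounded below by zero and above by smax
        s_star = max(s + f, 0.0)
        s_next = min(s_star, smax)
        # Effective rainfall: drainage plus any overflow above smax
        u[t + 1] = gamma * s + max(s_star - s_next, 0.0)
        s = s_next
    x1 = x2 = 0.0
    for t in range(n):
        # Eq. (4): two parallel first-order transfer functions with a
        # pure time delay of two steps
        u_in = u[t - 2] if t >= 2 else 0.0
        x1 = alpha1 * x1 + (1 - alpha1) * rho * u_in
        x2 = alpha2 * x2 + (1 - alpha2) * (1 - rho) * u_in
        x[t] = x1 + x2
    return x
```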

Identifying Disinformative Periods

The first stage of the calibration process is to identify disinformative periods of data. An event-based approach is taken in which both the properties of the event rainfall and the runoff coefficient are considered. Event runoff coefficients may be expected to vary due to inadequacies in their calculation, shortcomings in the observation of both the precipitation input and the output discharge, and differing internal states of the system. Despite these limitations, events whose runoff coefficients differ substantially from similar events (e.g., those with similar rainfall characteristics falling in the same season) or from the average over the whole period (in this catchment, 0.8) should be further investigated. These events may be examples of catchment dynamics not often observed, and so of use in informing the calibration, or disinformative due to the level of observational error present.
Calculating the total runoff needed for the computation of the event runoff coefficient requires the extrapolation of the falling limb of the event hydrograph to estimate the volume of discharge that might have been observed if subsequent rainfall had not occurred. Extensive reviews of the analysis of recession limbs have been given by Hall (1968) and Tallaksen (1995). While there are well-known difficulties with recession variability, an analysis of recession curves can often give some indication of the characteristics of the subsurface discharges from a catchment and can be used to develop catchment storage models (Lamb and Beven 1997).
The approach taken for this study was to develop a master recession curve (MRC) by piecing together individual shorter recession curves [see also Beven et al. (2011)]. Ideally, only true flow recessions should be selected where there is no rainfall and minimal evapotranspiration during the flow recession period. In practice this ideal is difficult to achieve, and in this study the MRC was constructed by piecing together individual recession curves of greater than 12 h duration during which less than 0.2 mm of rain fell. While such an approach may be taken on catchments such as the one studied, where the response to rainfall is not dominated by groundwater, it will not be universally applicable. The time series of observed discharge and rainfall were divided into events. A new event started with the first rainfall after 12 dry hours (hours with less than 2 mm of rainfall) when the discharge at that time was less than 0.8 mm/h and hence within the range of the MRC.
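A simplified sketch of the event separation and runoff-coefficient calculation described above is given below; the master-recession-curve extrapolation of the falling limb is reduced to a placeholder argument, and the thresholds simply follow the text:

```python
import numpy as np

def separate_events(rain, q, dry_thresh=2.0, dry_hours=12, q_start_max=0.8):
    """Split hourly rainfall (mm) and discharge (mm/h) series into events.

    A new event starts with the first rainfall after `dry_hours` consecutive
    hours each with less than `dry_thresh` mm of rain, provided the discharge
    at that time is below `q_start_max` mm/h (i.e., within the MRC range).
    Returns a list of (start, end) index pairs.
    """
    events, start, dry_run = [], None, 0
    for t in range(len(rain)):
        if rain[t] < dry_thresh:
            dry_run += 1
            continue
        if dry_run >= dry_hours and q[t] < q_start_max:
            if start is not None:
                events.append((start, t))
            start = t
        dry_run = 0
    if start is not None:
        events.append((start, len(rain)))
    return events

def runoff_coefficient(rain, q, start, end, recession_volume=0.0):
    """Event runoff coefficient: total runoff divided by total rainfall.

    `recession_volume` stands in for the volume obtained by extrapolating
    the falling limb with the master recession curve (not implemented here).
    """
    p_tot = rain[start:end].sum()
    q_tot = q[start:end].sum() + recession_volume
    return q_tot / p_tot if p_tot > 0 else np.nan
```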
Fig. 1 shows the results of an agglomerative hierarchical clustering analysis (Jain and Dubes 1988) of the 418 calibration events. The clustering is performed using the runoff coefficient, total precipitation, maximum precipitation over a time window similar to the concentration time of the catchment (in this case 5 h), event duration, and the runoff volume assigned to the preceding event (the variable initial tail in the plots), which represents the antecedent conditions. All volumes are expressed in millimeters, and the variables are standardized (i.e., scaled to zero mean and unit standard deviation) prior to the application of the clustering procedure. The clustering procedure starts from the most complex case, with each event in its own cluster, and agglomerates clusters by minimizing the average Euclidean distance between points in the clusters being combined until a single cluster is achieved. This results in a hierarchical cluster tree whose trimming determines the complexity of the clustering. In this study trimming is performed using the inconsistency coefficient (Jain and Dubes 1988), which measures the similarity between combined clusters at each step. The value of the inconsistency coefficient at which links are kept was determined by graphical analysis of its relationship to the cophenetic correlation of the trimmed tree; the value chosen lies at a cusp in the plot, where further simplification of the tree results in a marked deterioration in the cophenetic correlation while increased complexity yields only small improvements in the fit.
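Equivalent functionality is available in standard libraries. The following sketch, using scipy's hierarchical clustering routines and synthetic event summaries in place of the real ones, illustrates average-linkage agglomeration, trimming by the inconsistency coefficient, and computation of a cophenetic correlation; the threshold value is illustrative, and scipy's cophenetic correlation is reported for the full tree rather than the trimmed tree used in the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

# Toy stand-in for the event summary table: one row per event with columns
# [runoff coefficient, total precip, max 5-h precip, duration, initial tail].
rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, scale=1.0, size=(418, 5))

Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize each variable
Z = linkage(Xs, method="average", metric="euclidean")
c_full, _ = cophenet(Z, pdist(Xs))                 # full-tree cophenetic correlation

# Trim the tree using the inconsistency coefficient; the threshold below is
# illustrative - the paper selects it graphically from its relationship to
# the cophenetic correlation of the trimmed tree.
labels = fcluster(Z, t=1.0, criterion="inconsistent", depth=2)
print(labels.max(), "clusters; full-tree cophenetic correlation", round(c_full, 2))
```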
Fig. 1. Box-and-whisker plots summarizing properties of events in each cluster of an initial clustering of calibration events based on antecedent conditions, rainfall characteristics, and runoff coefficient; the clustering identifies five groups of disinformative events
Five of the clusters in the resulting classification contain only events with runoff coefficients greater than one. These were discounted as being disinformative. Another five of the remaining clusters contain events whose runoff coefficients are less than zero or greater than one. These events were investigated further, resulting in the discarding of all events in the calibration data whose runoff coefficient is less than 0.05 or greater than 0.95. It is hypothesized that epistemic error adversely affects the computation of accurate event characteristics for most of the events discarded. In some cases, particularly the short, low-volume rainfall events in Class 7, where the runoff coefficient is heavily influenced by the tail of the preceding event, the uncertainty associated with the MRC may also affect the computation of accurate event characteristics. The clustering analysis shows that the remaining 375 events used in the following section for calibration form distinct classes with few precipitation or other outliers.

Information Content and Formulating a Likelihood

In the previous subsection the quality of the data was considered without reference to the model being calibrated. The calibration of the model proceeds from this by specifying a likelihood relating the model output to the observed discharge. In this analysis an informal likelihood with the properties outlined in Smith et al. (2008) is constructed, so that in a Bayesian analysis the resulting posterior distribution of the parameters satisfies the axioms of probability. In constructing this informal likelihood, both the information content of each event and its interaction with the model through the sequencing of events must be considered. In particular, if the observational error occurs in the precipitation data, the model output at time periods subsequent to a disinformative event is expected to be in error: the model output will be affected both by the incorrect estimate of the effective rainfall during the disinformative event and by the resulting misspecification of the initial condition of the tank for subsequent events.
The persistence of the effect of an incorrect effective input on the model output can be considered by an analysis of the transfer function used in the model to route the effective rainfall. Though this analysis could be performed for every parameter combination for consistency, the analysis here is based on the optimal mean absolute error fit to the data. Fig. 2 shows the fitted transfer function. After 50 time steps the contribution to the model output of a unit input will be approximately 5×10^{-4} mm/h. The maximum effective input for the mean absolute error simulation is of the order of 10^{1}, indicating a maximum effect on the model output after 50 time steps of the order of 10^{-3}. For comparison, the maximum observed discharge during calibration is 4.25 mm/h and the baseflow is around 10^{-2} mm/h. Assuming a maximum error in effective input similar in magnitude to the maximum of the effective input for this simulation, it is considered that modeled values within 50 time steps of the introduction of disinformative data may be severely compromised, whereas those up to approximately 200 time steps later may be altered by a magnitude that will be significant only in periods of low flow. Thus, for this catchment, events that start within 50 time steps of a disinformative period are considered not to reliably inform the calibration, given the potential for errors in the antecedent states, and so are discarded in calculating the informal likelihood. Recall that data assimilation to correct for such errors in the model states is not being considered in this simulation case study. A sensitivity analysis suggested that the effects on a third or subsequent event were small, at least in this catchment.
Fig. 2. Unit hydrograph given by linear routing for model parameters that minimize the sum of absolute errors
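The persistence argument can be reproduced directly from the impulse response (unit hydrograph) of Eq. (4). The sketch below uses illustrative, not fitted, routing parameters; the printed values correspond to the contribution of a unit input after 50 time steps and the same contribution scaled by an effective input of order 10 mm/h, for comparison with the baseflow magnitude quoted in the text:

```python
import numpy as np

def unit_hydrograph(alpha1, alpha2, rho, n=201):
    """Impulse response of the parallel routing in Eq. (4), including the
    two-step delay. The parameter values used below are illustrative, not
    the mean-absolute-error optimum of the paper."""
    t = np.arange(n)
    h = np.zeros(n)
    lag = t >= 2
    h[lag] = (rho * (1 - alpha1) * alpha1 ** (t[lag] - 2)
              + (1 - rho) * (1 - alpha2) * alpha2 ** (t[lag] - 2))
    return h   # sums to 1 over all lags, i.e., mass conservative

h = unit_hydrograph(alpha1=0.5, alpha2=0.95, rho=0.7)
# Contribution of a unit input after 50 steps, and the same contribution
# scaled by an effective input of order 10 mm/h, for comparison with a
# baseflow of order 1e-2 mm/h.
print(h[50], 10 * h[50])
```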
The basis of the informal likelihood assigned to the ith model, with parameter set θ_i, is the performance of the model on each event. The use of a grouped, event-based measure integrates over, and penalizes against, nonstationary bias and autocorrelation without the need to assume specific error structures. For the kth event the performance of the ith model is assessed in terms of the ratio ε_{s,i,k}/ε_{d,k}, the mean absolute error of the model over the event divided by the mean absolute deviation of the observed discharge from its mean over the event.
In combining these quantifications of the model performance on each event to produce a likelihood, additional weight is given to unique or rare events that are considered informative. The rarity of an event is based on a clustering of the informative events based on their rainfall characteristics and antecedent conditions, as outlined in the previous section. This classification (Fig. 3) reflects the assignment into groups of events with similar antecedent conditions and rainfall properties. A prediction event can therefore be assigned to a cluster prior to observing the response to the observed precipitation. This classification is relatively homogeneous, and the few outliers do not appear to be indicative of disinformation or unique input scenarios.
Fig. 3. Box-and-whisker plots summarizing properties of events in each cluster of clustering used in likelihood formulation based on antecedent conditions and rainfall characteristics; runoff coefficient and number of events in each cluster are shown for comparative purposes
Events within each cluster are then pooled to give, for the jth cluster, a performance measure for the ith model of

M_{j,i} = \exp\left(-\frac{1}{n_j}\sum_{k \in K_j} \frac{\varepsilon_{s,i,k}}{\varepsilon_{d,k}}\right)   (5)

where n_j = number of events in the jth cluster and K_j = the set of events belonging to the jth cluster.
Pooling the events by cluster and discarding the events affected by previous disinformative events reduces the temporal dependency between the events. Though it is clear that successive events will not be entirely independent because of antecedent effects on state variables, they will be more nearly independent than within-event errors.
In the belief that each cluster provides an independent, equal increment of information to the conditioning process, the likelihood of the observed data given the ith model can be computed as the product of the M_{j,i}. This approach is followed in defining the likelihood, with the incorporation of an additional feature, an indicator function I. The indicator function takes the value zero for the ith model if ε_{s,i,k} is greater than 10^{-0.3} ≈ 0.5 for any event. This applies a limit of acceptability by ensuring that zero weight is given to all sets of parameter values for which it is felt the model may not be of value in prediction.
The likelihood of the ith model is then given as
L_i \propto I_i \prod_{j=1,\ldots,12} M_{j,i}   (6)
which can be written as
L_i \propto I_i \exp\left(-\sum_{j}\sum_{k \in K_j} \frac{\varepsilon_{s,i,k}}{n_j\,\varepsilon_{d,k}}\right)   (7)
The term within the exponential is a weighted sum of the ε_{s,i,k}, which are summaries of the model fit across events and clusters. Note that the weights in this sum are derived, independently of any model, from hydrological reasoning about relative information content and the clustering process.
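A compact implementation of Eqs. (5)–(7) for a single parameter set might look as follows. The limit-of-acceptability threshold and its application to ε_{s,i,k} follow the reconstructed text and should be treated as assumptions:

```python
import numpy as np

def informal_likelihood(eps_s, eps_d, cluster, threshold=0.5):
    """Sketch of Eqs. (5)-(7) for a single parameter set (model i).

    eps_s   : mean absolute error of the model for each informative event
    eps_d   : mean absolute deviation of the observed discharge for each event
    cluster : cluster label of each event
    threshold : limit of acceptability for the indicator function; the value
                and its application to eps_s are assumptions based on the
                reconstructed text.
    Returns an unnormalized likelihood L_i."""
    eps_s = np.asarray(eps_s, float)
    eps_d = np.asarray(eps_d, float)
    cluster = np.asarray(cluster)
    if np.any(eps_s > threshold):       # indicator I_i = 0: model rejected
        return 0.0
    ratio = eps_s / eps_d
    log_l = 0.0
    for j in np.unique(cluster):
        log_l += -ratio[cluster == j].mean()   # log M_{j,i}, Eq. (5)
    return np.exp(log_l)                       # product over clusters, Eqs. (6)-(7)
```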
If further events in a cluster are observed then, following the analysis in Smith et al. (2008), M_{j,i} converges stochastically with order n_j^{-1/2}, since the additional events add information only through revision of the estimate of the sample mean of the ratio ε_{s,i,k}/ε_{d,k} for that cluster. This suggests that the informal likelihood will converge, and cease to learn further as more events are observed, provided the error characteristics of the informative events remain largely stationary.
Fig. 4 shows the marginal cumulative distributions of the parameters for the resulting Bayesian analysis, where the prior parameter distributions are taken to be uniform on the ranges shown in the figure. For comparison, the summaries of two further posterior parameter distributions are shown. The first is what would result if the disinformative data had not been removed and the initial clustering used. The second is based on a formal statistical analysis using all the calibration data and the likelihood of Schoups and Vrugt (2010) with the autocorrelation terms excluded (the autocorrelation is high here, but including it leads to wide asymptotic uncertainty bounds for the simulation case; however, excluding autocorrelation is expected to bias parameter inference).
Fig. 4. Marginal cumulative distributions of model parameters using informal likelihood with disinformative data removed (solid line) and with disinformative data included (dotted line) and with that resulting from Schoups and Vrugt (2010) likelihood (dashed line)
Comparison of the posterior distributions reveals that the inclusion of the disinformative data produces a stronger conditioning of the parameter space, which is indicative of the potential for making Type II errors of rejecting a plausible model because of epistemic errors (Beven 2010, 2012a). This is even clearer in the case of the formal likelihood, where the posterior distribution of all the parameters is much more heavily conditioned.
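The marginal cumulative distributions in Fig. 4 are the standard GLUE-style likelihood-weighted empirical distributions of the sampled parameter values. A minimal sketch of that summary (not the authors' own code) is:

```python
import numpy as np

def weighted_marginal_cdf(param_values, likelihoods):
    """Likelihood-weighted marginal CDF of one parameter over the Monte Carlo
    sample of retained models (a standard GLUE-style summary; a sketch, not
    the authors' own code)."""
    v = np.asarray(param_values, float)
    w = np.asarray(likelihoods, float)
    w = w / w.sum()                    # normalize the likelihood weights
    order = np.argsort(v)
    return v[order], np.cumsum(w[order])
```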

Evaluating the Ensemble Model Predictions

The informal likelihood used to condition the model does not include a formal error model that can be used in a Bayesian framework for predicting future observed values. In many applications of GLUE the residual errors associated with a model are treated implicitly, under the assumption that the residual errors in prediction will be similar to those in calibration. This can provide acceptable prediction limits where both model and data are reliable (e.g., Beven et al. 2008). Here, it is not certain that this will be the case, so a nonparametric approach to providing prediction bounds is used within the GLUE framework.
The approach taken constructs the predictive distribution from the calibration data and the expected value of the model simulation at the prediction time step. The predictive uncertainty around the expected value of the model simulation can be characterized through the differences d_t = y_t − ŷ_t found during calibration. The characterization used takes the form of a locally constant quantile regression, as outlined in Yu and Jones (1998, Section 2.1). For a time step t in prediction where the expected value of the model simulation is ŷ_t, the difference d_{t'} associated with the t'th calibration time step is assigned a weight w_{t,t'}. This weight is zero if the class of the event that t falls in is not the same as that of the t'th calibration time step; otherwise, w_{t,t'} ∝ exp[−λ_t(ŷ_t − ŷ_{t'})^2]. Quantiles of the empirical distribution of the weighted d_{t'} provide a summary of the uncertainty around the prediction ŷ_t. The parameter λ_t depends on ŷ_t and is selected at each time step to ensure that the effective sample size (Σ_{t'} w_{t,t'})^2 / Σ_{t'} w_{t,t'}^2 meets some predetermined value. This provides a form of state-dependent smoothing of the conditional distribution, with λ_t taking smaller values where calibration data are sparse. The target effective sample sizes are chosen on an event-class basis using leave-one-out cross-validation of the calibration data so as to match the 95% coverage probability. Prior to a prediction event it is impossible to determine whether the event will be informative or disinformative. To allow for this, two sets of prediction bounds are derived, one using data from the informative calibration events, the other using data from the disinformative events, which are assigned to an appropriate cluster based on their rainfall characteristics.
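The locally constant quantile regression step can be sketched as follows for a single prediction time step, using only calibration time steps from the matching event class. The target effective sample size and the bisection search for λ_t are illustrative choices; the paper selects the effective sample size per event class by leave-one-out cross-validation:

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile of an empirical distribution defined by weighted samples."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    cdf = np.cumsum(w) / w.sum()
    return np.interp(q, cdf, v)

def prediction_bounds(y_hat_t, y_hat_cal, d_cal, target_ess=50.0, q=(0.025, 0.975)):
    """Sketch of the locally constant quantile regression for one prediction
    time step, using calibration time steps from the same event class only.

    y_hat_t    : expected model output at the prediction time step
    y_hat_cal  : expected model outputs at the same-class calibration steps
    d_cal      : calibration differences d = y_obs - y_hat at those steps
    target_ess : target effective sample size (illustrative; the paper chooses
                 it per event class by leave-one-out cross-validation)."""
    y_hat_cal = np.asarray(y_hat_cal, float)
    d_cal = np.asarray(d_cal, float)
    lo, hi = 0.0, 1.0e3
    for _ in range(60):                # bisection on lambda_t so that the
        lam = 0.5 * (lo + hi)          # effective sample size hits the target
        w = np.exp(-lam * (y_hat_t - y_hat_cal) ** 2)
        ess = w.sum() ** 2 / (w ** 2).sum()
        lo, hi = (lam, hi) if ess > target_ess else (lo, lam)
    lower = y_hat_t + weighted_quantile(d_cal, w, q[0])
    upper = y_hat_t + weighted_quantile(d_cal, w, q[1])
    return lower, upper
```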
Fig. 5 summarizes the coverage of the 95% prediction confidence intervals on an event-by-event basis. Although there is some evidence of poorer performance in validation on the informative events, the similarity of the coverage probabilities for the informative events in calibration and validation suggests that the means of conditioning the parameter space, along with the formulation of the predictive confidence intervals, is, for this catchment, transferable in time. The coverage of the disinformative validation events by the prediction bounds derived from informative data is typically lower than for the informative validation events, since these uncertainty bounds do not attempt to characterize the disinformation in the data during these periods. The disinformative prediction bounds, however, perform adequately on the disinformative validation data, although the smaller number of disinformative calibration events limits the characterization of the associated error distributions.
Fig. 5. Box-and-whisker plot summarizing coverage (as a fraction) on an event basis of 95% prediction confidence intervals for (a) informative bounds on informative calibration events; (b) informative bounds on validation events that would be considered informative; (c) informative bounds on disinformative validation events; (d) disinformative bounds on disinformative validation events; (e) formal likelihood on validation events considered informative; (f) formal likelihood on validation events considered disinformative
Coverage of the prediction intervals for the formal likelihood appears to be conservative for both the informative and disinformative periods. In this case, this result is indicative of the error model’s inability to characterize the distribution of the residuals adequately. The inclusion of autoregressive (AR) or autoregressive moving average (ARMA) components in the likelihood (Schoups and Vrugt 2010) would help address this, but their inclusion in this case results in a nonstationary autoregressive integrated moving average (ARIMA) model. Though such a nonstationary time series model may be adequate for forecasting at short lead times (over a small number of time steps), it is not suitable for simulating prediction error bounds over the longer periods used in this study: given the high autocorrelation of the model residuals, the resulting asymptotic prediction bounds of the ARIMA model are so wide as to not be useful.
Fig. 6 shows both sets of prediction bounds for a period of validation data. It is apparent that the prediction bounds based on the assumption that the event is disinformative are less precise than those based on the informative assumption. Nonetheless, a surprise is still possible in the presence of a disinformative event that neither prediction interval captures correctly. The excursion of the observed discharge during this event could still be consistent with 95% prediction bounds, but there is, of course, no guarantee that the types of epistemic error detected in calibration will cover all surprises in prediction. Fig. 6 also supports the suggestion that the calibration of the formal likelihood results in conservative prediction bounds for informative data periods, especially during high flows.
Fig. 6. Plot of informative (shaded) and disinformative (dashed) 95% prediction intervals, observed discharge (solid line), and formal likelihood 95% prediction interval (dotted line) for a prediction period; lighter shaded area represents prediction for event later deemed disinformative

Summary

As Beven (2008) observed, there remains a great deal of uncertainty about uncertainty estimation in hydrological modeling. In this paper an attempt has been made to analyze how this is the result of the epistemic uncertainties inherent in the hydrological modeling process and to propose some ideas about how to deal with assessing the relative information content of hydrological data (independently of any model) and how it might influence model conditioning based on hydrological reasoning. As noted earlier, by the very nature of epistemic errors, there can be no right answer without the additional information that would be necessary to define their characteristics. The proposed method consists of the following steps within a GLUE framework that have been demonstrated in the simple application presented:
1. Classification of events and identification of events that might be disinformative to the calibration process;
2. Definition of a relative measure of information for each informative event;
3. Evaluation of each model run relative to a simple data model in each class of events, including exclusion of nonacceptable/nonbehavioral runs;
4. Definition of a relative likelihood for each retained model (here based on the median residual for each event);
5. Prediction using the likelihood-weighted ensemble of retained models, including a nonparametric estimation of residual error; and
6. An evaluation of the additional error that might arise in prediction because of data inconsistencies.
Software supporting these steps for the rainfall-runoff modeling case will be available from the authors.
Hydrology remains a subject limited by its measurement techniques. As was pointed out previously (e.g., Beven 2001b), even the water balance equation for a catchment cannot be verified by measurement without allowing for significant uncertainties in each of the input, output, and storage terms. It has been suggested here that the fact that these uncertainties involve epistemic and aleatory characteristics has important consequences for model calibration, hypothesis testing, and prediction. For individual events input and output data can be shown to be hydrologically inconsistent and may be disinformative to the model calibration and testing processes. It does not appear that the issue of epistemic error in the hydrological data will go away for the foreseeable future, but a proper evaluation of informative data is probably a prerequisite for any real improvements in testing hydrological model structures as hypotheses about catchment response (Beven 2010, 2012a; Beven et al. 2012a; Clark et al. 2011, 2012).
There is long experience in hydrology of the performance of hydrological models in prediction or validation being worse than in calibration (though this is not always the case; a particular realization of errors can give better performance in prediction). Clearly one explanation for this is the possibility of epistemic errors that are quite different in calibration from those in prediction. Removing inconsistent or disinformative events from the calibration process might lead to more robust predictions of hydrologically consistent events in validation but, as Fig. 6 shows, does not guard against the failure to predict inconsistent events. The possibility of defining additional prediction bounds for these types of events needs to be explored in future work. This may be one way of separating robust model calibration and testing from more subjective decisions about projections of potential epistemic errors.

A Note on Equifinality

KB recently rediscovered that equifinality was first used in the context of hydrological model calibration in his Ph.D. thesis in 1975, but the 1993 prophecy paper is the first reference in print. To his knowledge, the first use of equifinality in this sense is in the work on general systems theory by Ludwig von Bertalanffy (1950, 1968). Culling (1957) and later Chorley (1962) first used it in relation to the development of similar landforms from different initial conditions by different histories, though Culling (1987) later preferred an interpretation in terms of nonlinear dynamics. A fuller discussion in this context is given in Beven (1996).

Acknowledgments

KB started doing work on Monte Carlo simulations of hydrological models while working at the University of Virginia in 1980 on a CDC6600 mainframe computer. Discussions with George Hornberger at that time helped shape the ideas that led to the GLUE methodology, published with Andy Binley in 1992. Since that time, discussions with many others have helped to clarify the issues of aleatory and epistemic uncertainties and their impact on likelihoods. This work is a contribution to the CREDIBLE consortium funded by the U.K. Natural Environment Research Council (Grant NE/J017299/1). Comments from Bellie Sivakumar, Alberto Montanari, and an anonymous referee helped to improve the paper.

References

Allchin, D. (2004). “Error types.” Perspect. Sci., 9(1), 38–59.
Amorocho, J., and Espildora, B. (1973). “Entropy in the assessment of uncertainty in hydrologic systems and models.” Water Resour. Res., 9(6), 1511–1522.
Bachmair, S., and Weiler, M. (2011). “New dimensions of hillslope hydrology.” Forest hydrology and biogeochemistry: Synthesis of past research and future directions, Ecological Studies 216, D. F. Levia, et al., eds., Springer, Dordrecht.
Beran, M. (1999). “Hydrograph prediction-how much skill?” Hydrol. Earth Syst. Sci. Discuss., 3(2), 305–307.
Beven, K. (2002b). “Towards a coherent philosophy for modelling the environment.” Proc., R. Soc. London Ser. A Math. Phys. Eng. Sci., 458(2026), 2465–2484.
Beven, K., and Binley, A. (2013). “GLUE: 20 years on.” Hydrol. Process., in press.
Beven, K., and Germann, P. (2013). “Macropores and water flow in soils revisited.” Water Resour. Res., 49(6), 3071–3092.
Beven, K., Smith, P., Westerberg, I., and Freer, J. (2012a). “Comment on ‘Pursuing the method of multiple working hypotheses for hydrological modeling’.” Water Resour. Res., 48(11), W11801.
Beven, K., Smith, P. J., and Wood, A. (2011). “On the colour and spin of epistemic error (and what we might do about it).” Hydrol. Earth Syst. Sci., 15, 3123–3133.
Beven, K., and Young, P. (2013). “A guide to good practice in modeling semantics for authors and referees.” Water Resour. Res., 49(8), 5092–5098.
Beven, K. J. (1989). “Changing ideas in hydrology: The case of physically based models.” J. Hydrol., 105(1–2), 157–172.
Beven, K. J. (1993). “Prophecy, reality and uncertainty in distributed hydrological modeling.” Adv. Water Resour., 16(1), 41–51.
Beven, K. J. (1996). “Equifinality and uncertainty in geomorphological modeling.” The scientific nature of geomorphology, B. L. Rhoads and C. E. Thorn, eds., Wiley, Chichester, U.K., 289–313.
Beven, K. J. (2000). “Uniqueness of place and process representations in hydrological modeling.” Hydrol. Earth Syst. Sci., 4(2), 203–213.
Beven, K. J. (2001a). “Dalton medal lecture: How far can we go in distributed hydrological modeling?” Hydrol. Earth Syst. Sci., 5(1), 1–12.
Beven, K. J. (2001b). “On hypothesis testing in hydrology.” Hydrol. Processes, 15(9), 1655–1657.
Beven, K. J. (2002a). “Towards an alternative blueprint for a physically-based digitally simulated hydrologic response modeling system.” Hydrol. Processes, 16(2), 189–206.
Beven, K. J. (2005). “On the concept of model structural error.” Water Sci. Technol., 52(6), 165–175.
Beven, K. J. (2006). “A manifesto for the equifinality thesis.” J. Hydrol., 320(1–2), 18–36.
Beven, K. J. (2008). “On doing better hydrological science.” Hydrol. Processes, 22(17), 3549–3553.
Beven, K. J. (2009). “Comment on ‘Equifinality of formal (DREAM) and informal (GLUE) Bayesian approaches in hydrologic modeling?’ by J. A. Vrugt, C. J. F. ter Braak, H. V. Gupta, and B. A. Robinson.” Stochastic Environ. Res. Risk Assess., 23(7), 1059–1060.
Beven, K. J. (2010). “Preferential flows and travel time distributions: defining adequate hypothesis tests for hydrological process models.” Hydrol. Processes, 24(12), 1537–1547.
Beven, K. J. (2011). “I believe in climate change but how precautionary do we need to be in planning for the future?” Hydrol. Processes, 25(9), 1517–1520.
Beven, K. J. (2012a). “Causal models as multiple working hypotheses about environmental processes.” C.R. Geosci., 344(2), 77–88.
Beven, K. J. (2012b). Rainfall-runoff models: The primer, 2nd Ed., Wiley, Chichester, U.K.
Beven, K. J. (2013). “So how much of your error is epistemic? Lessons from Japan and Italy.” Hydrol. Processes, 27(11), 1677–1680.
Beven, K. J., and Binley, A. M. (1992). “The future of distributed models: Model calibration and uncertainty prediction.” Hydrol. Processes, 6(3), 279–298.
Beven, K. J., Buytaert, W., and Smith, L. A. (2012b). “On virtual observatories and modeled realities (or why discharge must be treated as a virtual variable).” Hydrol. Processes, 26(12), 1905–1908.
Beven, K. J., and Freer, J. (2001). “Equifinality, data assimilation, and uncertainty estimation in mechanistic modeling of complex environmental systems.” J. Hydrol., 249(1–4), 11–29.
Beven, K. J., and Germann, P. F. (1982). “Macropores and water flow in soils.” Water Resour. Res., 18(5), 1311–1325.
Beven, K. J., Smith, P. J., and Freer, J. (2008). “So just why would a modeller choose to be incoherent?” J. Hydrol., 354(1–4), 15–32.
Beven, K. J., and Westerberg, I. (2011). “On red herrings and real herrings: Disinformation and information in hydrological inference.” Hydrol. Processes, 25(10), 1676–1680.
Blackie, J. R., and Eeles, C. W. O. (1985). “Lumped catchment models.” Hydrological Forecasting, M. G. Anderson and T. P. Burt, eds., Wiley, Chichester, U.K., 311–346.
Blazkova, S., and Beven, K. J. (2009). “A limits of acceptability approach to model evaluation and uncertainty estimation in flood frequency estimation by continuous simulation: Skalka catchment, Czech Republic.” Water Resour. Res., 45(12), W00B16.
Blöschl, G., Sivapalan, M., Wagener, T., Viglione, A., and Savenije, H. (2013). Runoff prediction in ungauged basins: Synthesis across processes, places and scales, Cambridge University Press, Cambridge, U.K.
Buytaert, W., and Beven, K. J. (2009). “Regionalisation as a learning process.” Water Resour. Res., 45(11), W11419.
Chorley, R. J. (1962). “Geomorphology and general systems theory.” Professional Paper 500-B, USGS, Washington, DC.
Clark, M., Kavetski, D., and Fenicia, F. (2011). “Pursuing the method of multiple working hypotheses for hydrological modeling.” Water Resour. Res., 47(9), W09301.
Clark, M., Kavetski, D., and Fenicia, F. (2012). “Reply to K. Beven et al., Comment on Clark et al., Pursuing the method of multiple working hypotheses for hydrological modeling.” Water Resour. Res., 48(11), W11802.
Clark, M. P., and Kavetski, D. (2010). “Ancient numerical daemons of conceptual hydrological modeling: 1. Fidelity and efficiency of time stepping schemes.” Water Resour. Res., 46(10), W10510.
Culling, W. E. H. (1957). “Multicyle streams and the equilibrium theory of grade.” J. Geol., 65(3), 259–274.
Culling, W. E. H. (1987). “Equifinality: Modern approaches to dynamical systems and their potential for geographical thought.” Trans. Inst. Br. Geographers, 12(1), 57–72.
Dawdy, D. R., and O’Donnell, T. (1965). “Mathematical models of catchment behaviour.” J. Hydraul. Div., 91(HY4), 123–127.
Duan, Q., Sorooshian, S., and Gupta, V. (1992). “Effective and efficient global optimisation for conceptual rainfall–runoff models.” Water Resour. Res., 28(4), 1015–1031.
Faulkner, H., Parker, D., Green, C., and Beven, K. J. (2007). “Developing a translational discourse to communicate uncertainty in flood risk between science and the practitioner.” Ambio, 16(7), 692–703.
Freeze, R. A., and Harlan, R. L. (1969). “Blueprint for a physically-based, digitally-simulated hydrologic response model.” J. Hydrol., 9(3), 237–258.
Graham, C. B., Woods, R. A., and McDonnell, J. J. (2010). “Hillslope threshold response to rainfall: (1) A field based forensic approach.” J. Hydrol., 393(1–2), 65–76.
Grayson, R. B., Moore, I. D., and McMahon, T. A. (1992). “Physically based hydrologic modeling: 2. Is the concept realistic?” Water Resour. Res., 28(10), 2659–2666.
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F. (2009). “Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modeling.” J. Hydrol., 377(1–2), 80–91.
Gupta, H. V., Sorooshian, S., and Yapo, P. O. (1998). “Toward improved calibration of hydrologic models: Multiple and noncommensurable measures of information.” Water Resour. Res., 34(4), 751–763.
Gupta, V. K., and Sorooshian, S. (1985). “The relationship between data and the precision of parameter estimates of hydrologic models.” J. Hydrol., 81(1–2), 57–77.
Hall, F. R. (1968). “Base-flow recessions—A review.” Water Resour. Res., 4(5), 973–983.
Hall, J. W. (2003). “Handling uncertainty in the hydroinformatic process.” J. Hydroinf., 5(4), 215–232.
Helton, J. C., and Burmaster, D. E. (1996). “Treatment of aleatory and epistemic uncertainty in performance assessments for complex systems.” Reliab. Eng. Syst. Saf., 54(2–3), 91–258.
Howson, C. (2003). Hume’s problem: Induction and the justification of belief, Oxford University Press, Oxford.
Howson, C., and Urbach, P. (1993). Scientific reasoning: The Bayesian approach, 2nd Ed., Open Court, Chicago.
Ibbitt, R. P., and O’Donnell, T. (1974). “Designing conceptual catchment models for automatic fitting methods.” Mathematical Models in Hydrology, Publication 101, International Association of Hydrological Sciences, Wallingford, U.K., 461–475.
Islam, M. N., and Sivakumar, B. (2002). “Characterization and prediction of runoff dynamics: A nonlinear dynamical view.” Adv. Water Resour., 25(2), 179–190.
Jain, A., and Dubes, R. (1988). Algorithms for clustering data, Prentice Hall, Englewood Cliffs, NJ.
Juston, J. M., Kauffeldt, A., Montano, B. Q., Seibert, J., Beven, K. J., and Westerberg, I. K. (2013). “Smiling in the rain: Seven reasons to be positive about uncertainty in hydrological modelling.” Hydrol. Processes, 27(7), 1117–1122.
Kavetski, D., and Clark, M. (2010). “Ancient numerical daemons of conceptual hydrological modeling: 2. Impact of time stepping schemes on model analysis and prediction.” Water Resour. Res., 46(10), W10511.
Keynes, J. M. (1936). The general theory of employment, interest and money, Harcourt Brace, New York.
Kirkby, M. J. (1975). “Hydrograph modelling strategies.” Chapter 3, Processes in physical and human geography, R. Peel, M. Chisholm, and P. Haggett, eds., Heinemann, London, 69–90.
Knight, F. (1921). Risk, uncertainty and profit, Houghton Mifflin, New York, 293.
Konikow, L. F., and Bredehoeft, J. D. (1992). “Groundwater models cannot be validated.” Adv. Water Resour., 15(2), 75–83.
Koutsoyiannis, D. (2010). “HESS opinions ‘A random walk on water’.” Hydrol. Earth Syst. Sci., 14(3), 585–601.
Koutsoyiannis, D. (2011). “Hurst-Kolmogorov dynamics and uncertainty.” J. Am. Water Resour. Assoc., 47(3), 481–495.
Krueger, T., et al. (2009). “Uncertainties in data and models to describe event dynamics of agricultural sediment and phosphorus transfer.” J. Environ. Qual., 38(3), 1137–1148.
Kuczera, G., and Mroczkowski, M. (1998). “Assessment of hydrologic parameter uncertainty and the worth of multiresponse data.” Water Resour. Res., 34(6), 1481–1489.
Kuczera, G., and Parent, E. (1998). “Monte Carlo assessment of parameter uncertainty in conceptual catchment models: The Metropolis algorithm.” J. Hydrol., 211(1–4), 69–85.
Laloy, E., and Vrugt, J. A. (2012). “High-dimensional posterior exploration of hydrologic models using multiple-try DREAM(ZS) and high-performance computing.” Water Resour. Res., 48, W01526.
Lamb, R., and Beven, K. J. (1997). “Using interactive recession curve analysis to specify a general catchment storage model.” Hydrol. Earth Syst. Sci., 1, 101–113.
Laplace, P. S. (1986). “Memoir on the probability of the causes of events.” Stat. Sci., 1(3), 364–378.
Liu, Y., Freer, J. E., Beven, K. J., and Matgen, P. (2009). “Towards a limits of acceptability approach to the calibration of hydrological models: Extending observation error.” J. Hydrol., 367(1–2), 93–103.
Liu, Y., and Gupta, H. (2007). “Uncertainty in hydrologic modeling: Towards an integrated data assimilation framework.” Water Resour. Res., 43(7), W07401.
Mantovan, P., and Todini, E. (2006). “Hydrological forecasting uncertainty assessment: Incoherence of the GLUE methodology.” J. Hydrol., 330(1–2), 368–381.
Mayo, D. (1996). Error and the growth of experimental knowledge, University of Chicago Press, Chicago.
McMillan, H., Freer, J., Pappenberger, F., Krueger, T., and Clark, M. (2010). “Impacts of uncertain river flow data on rainfall-runoff model calibration and discharge predictions.” Hydrol. Processes, 24(10), 1270–1284.
McMillan, H., Krueger, T., and Freer, J. (2012). “Benchmarking observational uncertainties for hydrology: Rainfall, river discharge and water quality.” Hydrol. Processes, 26(26), 4078–4111.
Montanari, A., and Koutsoyiannis, D. (2012). “A blueprint for process-based modeling of uncertain hydrological systems.” Water Resour. Res., 48(9), W09555.
Nash, J. E., and Sutcliffe, J. V. (1970). “River flow forecasting through conceptual models: 1. A discussion of principles.” J. Hydrol., 10(3), 282–290.
Pappenberger, F., and Beven, K. J. (2006). “Ignorance is bliss: Or seven reasons not to use uncertainty analysis.” Water Resour. Res., 42(5), W05302.
Pauwels, V., Hoeben, R., Verhoest, N. E., and De Troch, F. P. (2001). “The importance of the spatial patterns of remotely sensed soil moisture in the improvement of discharge predictions for small-scale basins through data assimilation.” J. Hydrol., 251(1–2), 88–102.
Popper, K. (1957). “Probability magic, or knowledge out of ignorance.” Dialectica, 11(3–4), 354–374.
Reichle, R. H., Crow, W. T., and Keppenne, C. L. (2008). “An adaptive ensemble Kalman filter for soil moisture data assimilation.” Water Resour. Res., 44(3), W03423.
Rodell, M., et al. (2004). “The global land data assimilation system.” Bull. Am. Meteorol. Soc., 85(3), 381–394.
Rougier, J., and Beven, K. J. (2012). “Epistemic uncertainty.” Risk and uncertainty assessment for natural hazards, J. Rougier, S. Sparks, and L. Hill, eds., Cambridge University Press, Cambridge, U.K.
Rougier, J. K. J. (2012). “Aleatory uncertainty.” Risk and uncertainty assessment for natural hazards, J. Rougier, S. Sparks, and L. Hill, eds., Cambridge University Press, Cambridge, U.K.
Schoups, G., and Vrugt, J. A. (2010). “A formal likelihood function for parameter and predictive inference of hydrologic models with correlated, heteroscedastic, and non-Gaussian errors.” Water Resour. Res., 46(10), W10531.
Silbergeld, E. K. (1987). “Five types of ambiguity: Scientific uncertainty in risk assessment.” Hazard. Waste Hazard. Mater., 4(2), 139–150.
Singh, S. K., and Bárdossy, A. (2012). “Calibration of hydrological models on hydrologically unusual events.” Adv. Water Resour., 38, 81–91.
Singh, V. P. (1998). Entropy-based parameter estimation in hydrology, Vol. 30, Springer, Berlin.
Sivakumar, B. (2000). “Chaos theory in hydrology: Important issues and interpretations.” J. Hydrol., 227(1), 1–20.
Sivakumar, B. (2007). “Nonlinear determinism in river flow: Prediction as a possible indicator.” Earth Surf. Process. Landf., 32(7), 969–979.
Sivakumar, B. (2009). “Nonlinear dynamics and chaos in hydrologic systems: latest developments and a look forward.” Stochastic Environ. Res. Risk Assess., 23(7), 1027–1036.
Sivapalan, M. (2003). “Prediction in ungauged basins: A grand challenge for theoretical hydrology.” Hydrol. Processes, 17, 3163–3170.
Smith, L. A. (2001). “Disentangling uncertainty and error: On the predictability of nonlinear systems.” Nonlinear dynamics and statistics, A. I. Mees, ed., Springer, Dordrecht.
Smith, P. J., Tawn, J., and Beven, K. J. (2008). “Informal Likelihood measures in model assessment: Theoretic development and investigation.” Adv. Water Resour., 31(8), 1087–1100.
Stedinger, J. R., Vogel, R. M., Lee, S. U., and Batchelder, R. (2008). “Appraisal of the generalized likelihood uncertainty estimation (GLUE) method.” Water Resour. Res., 44(12), W00B06.
Stephenson, G. R., and Freeze, R. A. (1974). “Mathematical simulation of subsurface flow contributions to snowmelt runoff, Reynolds Creek, Idaho.” Water Resour. Res., 10(2), 284–298.
Sun, S., Fu, G., Djordjevic, S., and Khu, S.-T. (2012). “Separating aleatory and epistemic uncertainties: Probabilistic sewer flooding evaluation using probability box.” J. Hydrol., 420–421, 360–372.
Tallaksen, L. M. (1995). “A review of baseflow recession analysis.” J. Hydrol., 165(1–4), 349–370.
Tarantola, A. (2006). “Popper, Bayes and the inverse problem.” Nature Phys., 2, 492–494.
Todini, E. (2004). “Role and treatment of uncertainty in real-time flood forecasting.” Hydrol. Processes, 18, 2743–2746.
Vazquez, R., Beven, K. J., and Feyen, J. (2009). “GLUE based assessment on the overall predictions of a MIKE SHE application.” Water Resour. Manage., 23(7), 1325–1349.
Von Bertalanffy, L. (1950). “The theory of open systems in physics and biology.” Science, 111(2872), 23–29.
Von Bertalanffy, L. (1968). General systems theory, George Braziller, New York.
Vrugt, J. A., Bouten, W., Gupta, H. V., and Sorooshian, S. (2002). “Toward improved identifiability of hydrologic model parameters: The information content of experimental data.” Water Resour. Res., 38(12), W01312.
Vrugt, J. A., Ter Braak, C. J., Gupta, H. V., and Robinson, B. A. (2009). “Equifinality of formal (DREAM) and informal (GLUE) Bayesian approaches in hydrologic modeling?” Stochastic Environ. Res. Risk Assess., 23(7), 1011–1026.
Weijs, S. (2009). “Interactive comment on HESS opinions ‘A random walk on water’.” Hydrol. Earth Syst. Sci. Discuss., 6, C2733–C2745.
Weijs, S. V., Schoups, G., and van de Giesen, N. (2010). “Why hydrological predictions should be evaluated using information theory.” Hydrol. Earth Syst. Sci., 14, 2545–2558.
Westerberg, I., Guerrero, J.-L., Seibert, J., Beven, K. J., and Halldin, S. (2011a). “Stage-discharge uncertainty derived with a non-stationary rating curve in the Choluteca River, Honduras.” Hydrol. Processes, 25(4), 603–613.
Westerberg, I. K., et al. (2011b). “Calibration of hydrological models using flow-duration curves.” Hydrol. Earth Syst. Sci., 15, 2205–2227.
Young, P. C., and Beven, K. J. (1994). “Data-based mechanistic modeling and the rainfall-flow non-linearity.” Environmetrics, 5(3), 335–363.
Young, P. C., Chotai, A., and Beven, K. J. (2004). “Data-based mechanistic modeling and the simplification of environmental systems.” Environmental modeling: Finding simplicity in complexity, J. Wainwright and M. Mulligan, eds., Wiley, Chichester, U.K., 371–388.
Yu, K., and Jones, M. C. (1998). “Local linear quantile regression.” J. Am. Stat. Assoc., 93(441), 228–237.
Zin, I., and Saulnier, G. M. (2000). “Assessing ambiguity in physically-based hydrological models: Stochastic approach with application to the Ardeche Catchment (Southern France).” Accuracy 2000, Proc., 4th Int. Symp. on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, G. B. M. Heuvelink and M. J. P. Lemmens, eds., Delft University Press, Delft, Netherlands, 759–766.

Information & Authors

Published In

Journal of Hydrologic Engineering, Volume 20, Issue 1, January 2015

History

Received: Nov 17, 2012
Accepted: Feb 24, 2014
Published online: Feb 26, 2014
Published in print: Jan 1, 2015
Discussion open until: Jan 5, 2015

Authors

Affiliations

Keith Beven [email protected]
Distinguished Professor, Lancaster Environment Centre, Lancaster Univ., Lancaster LA1 4YQ, U.K.; Dept. of Earth Sciences, Uppsala Univ., Uppsala, Sweden; Professor Invité, ECHO, EPFL, Lausanne, Switzerland; and Visiting Professor, Laboratoire d'Ecohydrologie, Ecole Polytechnique Fédérale de Lausanne, Centre for Analysis of Time Series, London School of Economics, London (corresponding author). E-mail: [email protected]
Paul Smith
Postdoctoral Research Fellow, Lancaster Environment Centre, Lancaster Univ., Lancaster LA1 4YQ, U.K.
