TECHNICAL PAPERS
May 17, 2010

Appraisal of Statistical Predictability under Uncertain Inputs: SST to Rainfall

Publication: Journal of Hydrologic Engineering
Volume 16, Issue 12

Abstract

Climatic variables that are used as inputs in hydrologic models often have large measurement uncertainties that are mostly ignored in hydrologic applications because of a lack of appropriate tools. This study develops a set of algorithms to engage uncertainty information in three of the most common statistical procedures applied to climatic data, namely correlation (BaNCorr), principal component analysis (VBaNPCA), and regression (VNRVM). These new algorithms are developed within a common framework of Bayesian learning, and together they provide a comprehensive tool to account for uncertainty in various stages of model development. The developed algorithms are first tested and compared with traditional methods and state-of-the-art algorithms on synthetic data. Practical application of the proposed algorithms is demonstrated by developing a seasonal prediction model for all India summer monsoon rainfall by using sea surface temperature (SST) data and associated measurement errors as inputs. The results suggest that incorporating measurement errors in hydrologic models improves their prediction performance and provides better assessment of their predictive capabilities.

Introduction

A recent trend in hydrologic prediction models is to derive predictability from variables governing atmospheric and oceanic circulations that are measured by using in situ and remote-sensing observations and have various sources of uncertainties. The magnitude of measurement uncertainties, which are now often quantified, differs from variable to variable and, for a given variable, varies substantially in space and time. It is well established that ignoring uncertainty in data can lead to misleading results (Pappenberger and Beven 2006). Nevertheless, this information is mostly ignored in traditional models (both physical models such as general circulation models and statistical models), probably because until recently such information was not readily available. Hardly any research has been done to develop practical tools to engage this information, let alone to demonstrate the advantages of doing so. This study is an attempt to bridge this gap.
Uncertainties in input data propagate through various stages of model development leading to the final prediction, and therefore quantification of prediction uncertainties requires accounting for uncertainties in all the intermediate steps. Most of the existing studies address the issue of uncertainty in only one aspect of model development (Chowdhury and Sharma 2007) and as such, cannot reliably estimate the prediction uncertainty. In this study, a methodology to engage uncertainty in various stages of developing a statistical hydrologic prediction model, which involves selecting predictors and learning the relationship between selected predictors and predictand, is developed within a Bayesian framework.
Identifying relevant predictors from a pool of potential predictors to circumvent the curse of dimensionality is accomplished by screening or filtering procedures (Wilks 2006) that can be broadly categorized into feature-selection and feature-extraction methods (Duda et al. 2001). Feature-selection methods select a subset of features from the pool of potential predictors by using predictand information (supervised learning), whereas feature-extraction methods derive or extract predictors by combining features in the original pool of predictors typically without using predictand information (unsupervised learning).
Among feature-selection methods, the Pearson correlation coefficient is the most commonly used method. Features are selected by assigning a threshold on either the absolute value of the correlation or the significance level. Presence of white noise in the data attenuates the absolute value of correlation and widens the confidence interval around it (Kulathinal et al. 2002; Fan 2003). Hence, ignoring measurement errors in the data can result in selection of undesirable features. Estimating correlation when measurement errors are homogeneous (do not vary from sample to sample) has been extensively studied in the literature (Fuller 1987; Schisterman et al. 2003), but studies on heterogeneous errors are limited (Kulathinal et al. 2002) and few, if any, employ a Bayesian framework. Bayesian methods for correlation estimation offer a distinctive advantage over traditional maximum likelihood and method of moments in that they are more robust when the sample size is small and allow incorporation of domain knowledge into the estimation procedure (Liechty et al. 2004; Schisterman et al. 2003; Barnard et al. 2000). In this study, a Bayesian algorithm to estimate correlation is developed for data corrupted by heterogeneous noise. The algorithm is termed the Bayesian noisy correlation (BaNCorr).
Principal component analysis (PCA) is arguably the most popular feature-extraction method in hydrologic prediction models (Kumar et al. 2006; Tootle and Piechota 2006). Tripathi and Govindaraju (2008) demonstrated that engaging data uncertainty information improves the performance of PCA by extracting principal components (PCs) that have low data reconstruction error. Additionally, their method, Bayesian noisy PCA (BaNPCA), can identify the appropriate number of PCs and impute missing values in the data. However, the BaNPCA algorithm is based on type-II maximum likelihood by using a local Gaussian approximation for the posterior distribution of the model parameters, an assumption that is likely to be violated if the sample size is small. This study develops an algorithm for BaNPCA using variational inference or the variational Bayes method (MacKay 2005; Jordan et al. 1999) that avoids local approximations and provides a computationally efficient alternative. The new algorithm is termed variational Bayes noisy PCA (VBaNPCA).
The predictors identified in the screening step are mapped to the predictand by a functional relationship that takes the form of a regression model if the predictand is a continuous variable or classification model if the predictand is a categorical variable. This study considers the former case and develops a new algorithm to incorporate data uncertainty in learning regression relationships. The proposed algorithm is a modification of the relevance vector machine (RVM; Tipping 2001) to engage uncertainty in predictors. The RVM was selected as the base model for the new algorithm because it offers the following advantages within the present context:
1.
RVMs belong to a class of embedded methods in which screening and mapping steps are embedded into a single algorithm, making them less prone to overfitting (Guyon and Elisseeff 2003).
2.
The RVM algorithm yields model and prediction error, thus quantifying the inherent uncertainties in making hydrologic predictions.
3.
The RVM is based on Bayesian framework facilitating incorporation of measurement errors into its algorithm.
Quiñonero Candela et al. (2003) modified the RVM algorithm to engage uncertainties in predictors during the prediction phase and demonstrated its advantages for multistep forecast. However, their method ignores uncertainty during the RVM training phase, a limitation that, to our knowledge, has not been addressed in the literature. In this study, a methodology is proposed to utilize information of measurement errors in predictors during both training and prediction phases of the RVM algorithm. The algorithm is based on variational principles and is termed as variational noisy RVM (VNRVM).

Bayesian Inference

Let D denote the observed data, and Θ = [θ_1, θ_2, …, θ_p]^T denote the unknown variables (model parameters and latent variables) in the model. In Bayesian inference, the unknown variables are estimated by first assigning a prior distribution, p(Θ), over them and then calculating the posterior distribution by using Bayes rule as
p(\Theta \mid D) = \frac{p(D \mid \Theta)\, p(\Theta)}{\int p(D \mid \Theta)\, p(\Theta)\, d\Theta}
(1)
For most nontrivial models, the posterior distributions do not have a closed form because the integration in the denominator of Eq. (1) is not tractable and is therefore approximated. The two broad categories of approximations that are used in the study, namely deterministic and stochastic approximation, are briefly presented next.

Stochastic Approximation

In stochastic approximation, p(Θ|D) is numerically approximated by using sampling algorithms. Markov chain Monte Carlo (MCMC) methods, developed in the physics literature, are the most popular sampling methods used in Bayesian inference because they are efficient in sampling high dimensional spaces (large p), easy to design, and applicable to a large class of distributions (Bishop 2006). These methods sample by constructing a Markov chain whose stationary distribution (equilibrium distribution) is the required posterior distribution. Two MCMC methods used in this paper, namely the Metropolis-Hastings algorithm and the Gibbs sampler, are briefly outlined.
The Metropolis-Hastings (MH) algorithm was developed by Metropolis et al. (1953) and later extended by Hastings (1970). The algorithm draws samples from p(Θ|D) by generating a first-order Markov chain in which each state Θ_t depends only on its previous state Θ_{t-1} through a jumping distribution J(Θ_t|Θ_{t-1}), also known as the proposal or candidate generating distribution, as follows:
1.
Set time step t=0 and initialize Θt.
2.
Generate candidate sample Θ* from J(Θ*|Θt) and estimate
s = \frac{p(\Theta^* \mid D)\,/\,J(\Theta^* \mid \Theta_t)}{p(\Theta_t \mid D)\,/\,J(\Theta_t \mid \Theta^*)} = \frac{p(D \mid \Theta^*)\, p(\Theta^*)\,/\,J(\Theta^* \mid \Theta_t)}{p(D \mid \Theta_t)\, p(\Theta_t)\,/\,J(\Theta_t \mid \Theta^*)}
(2)
3.
Select the candidate sample with probability r, in which r=min(s,1). If the candidate is selected, set Θt+1=Θ*, otherwise set Θt+1=Θt; increment t=t+1 and go to Step 2.
After a sufficient burn-in period (say τ steps), the Markov chain approaches its stationary distribution and the samples generated thereafter, (Θ_{τ+1}, …, Θ_{τ+Ns}), are from p(Θ|D).
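As an illustration, the following minimal Python sketch (not from the original paper) applies these steps to a stand-in standard normal target; with a symmetric jumping distribution, the J terms in Eq. (2) cancel and s reduces to a ratio of posterior densities.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    """Unnormalized log p(theta|D); a standard normal stands in for the target."""
    return -0.5 * theta ** 2

theta, draws, burn_in = 0.0, [], 1000
for t in range(20000):
    cand = rng.normal(theta, 1.0)       # candidate from jumping distribution J
    s = np.exp(log_post(cand) - log_post(theta))   # Eq. (2) for a symmetric J
    if rng.uniform() < min(s, 1.0):     # accept candidate with probability r = min(s, 1)
        theta = cand
    if t >= burn_in:                    # keep only post-burn-in samples
        draws.append(theta)
print(np.mean(draws), np.std(draws))    # should approach 0 and 1 for this target
```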
Gibbs sampling (Geman and Geman 1984) is a special case of the MH algorithm in which r = 1, i.e., candidate samples are always accepted. The method is used when it is easy to draw samples from the univariate conditional distribution p(θ_i | θ_{-i}), where θ_i is the ith unknown variable and θ_{-i} denotes θ_1, …, θ_p with θ_i omitted. The following steps are involved:
1.
Set time step t = 0 and initialize θ_i^{(t)}; i = 1, …, p.
2.
Generate the following samples:
\theta_1^{(t+1)} \sim p(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)})
\theta_2^{(t+1)} \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)})
\qquad\vdots
\theta_p^{(t+1)} \sim p(\theta_p \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)})
3.
Increment t=t+1 and go to Step 2.
After a sufficient burn-in period, the samples generated will be from the required posterior distribution p(Θ|D).
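The sweep pattern of Step 2 can be seen in the following toy sketch, which assumes a zero-mean bivariate Gaussian target with unit variances and correlation ρ, chosen only because its full conditionals are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Full conditionals of the assumed target: theta_i | theta_j ~ N(rho*theta_j, 1 - rho^2)
rho, burn_in = 0.8, 1000
theta1, theta2, draws = 0.0, 0.0, []
for t in range(20000):
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho ** 2))  # sample theta1 | theta2
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho ** 2))  # sample theta2 | theta1
    if t >= burn_in:
        draws.append((theta1, theta2))
print(np.corrcoef(np.array(draws).T)[0, 1])  # sample correlation, close to rho
```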
MCMC methods, owing to increased computational power of modern computers, have revolutionized the field of Bayesian inference. These methods are guaranteed to converge given infinite processing time. However, because processing time is finite, various diagnostics have been developed to ascertain convergence. The most commonly used tools are trace-plots of the generated samples, Geweke test (Geweke 1992), Raftery-Lewis test (Raftery and Lewis 1992), Gelman and Rubin test (Gelman and Rubin 1992), and diagnostics on the basis of autocorrelation of the generated samples.

Deterministic Approximation

The aim in this study is to approximate p(Θ|D) by Q(Θ|D) such that Q(Θ|D) is similar to p(Θ|D) and at the same time has a closed form. The methods are based on the principles of calculus of variations and are therefore referred to as variational inference or variational Bayes (Bishop 2006) methods. In the machine learning literature, they are also known as ensemble methods (MacKay 2005). The variational method used in this study has its origin in mean-field theory (Parisi 1988) where it was developed to approximate the lowest-energy state in quantum mechanical calculations. In this method, the approximating distribution Q(Θ|D) is assumed to factorize as
Q(\Theta \mid D) = \prod_{i=1}^{p} q_i(\theta_i \mid D)
(3)
In the remainder of the paper, to make the notation uncluttered, explicit dependence of approximating distribution on data will be dropped and the ith component of the distribution will be denoted by q(θi).
To estimate Q(Θ), the Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) between Q(Θ) and p(Θ|D) [Eq. (4)], which is a measure of the distance between the two distributions, is minimized
\mathrm{KL}(Q \,\|\, p) = -\int Q(\Theta)\, \ln\!\left[\frac{p(\Theta \mid D)}{Q(\Theta)}\right] d\Theta
(4)
The KL-divergence is nonnegative and attains its minimum value of zero only when Q(Θ) = p(Θ|D). Substituting Q(Θ) from Eq. (3) into Eq. (4) and minimizing the KL-divergence with respect to the distributions q(θ_i) yields the following estimates:
\ln q(\theta_i) = \mathbb{E}_{j \neq i}[\ln p(D, \Theta)] + \text{const}; \qquad i = 1, \ldots, p
(5)
where q(θ_i) = optimal value of the approximating distribution; p(D,Θ) = joint distribution over the data and the unknown variables; and E_{j≠i}(·) denotes expectation with respect to θ_1, …, θ_p except θ_i. Eq. (5) provides the optimal solution for Q(Θ) subject to the factorization constraint [Eq. (3)]. However, the solution is not explicit because q(θ_i) depends on expectations taken with respect to the other factors q(θ_j), j ≠ i, and must therefore be estimated iteratively. Nevertheless, convergence is guaranteed because the KL-divergence is convex with respect to each factor q(θ_i).
An additional benefit of variational methods is that they can be used for model selection, i.e., selecting the best model from a set of candidate models. The candidate that corresponds to the minimum value of KL-divergence is the best model. Because KL-divergence cannot be estimated directly as it is a function of p(Θ|D), the following decomposition of the model evidence is utilized:
\ln p(D) = L(Q) + \mathrm{KL}(Q \,\|\, p)
(6)
where L(Q) = \int Q(\Theta)\,\ln[p(D,\Theta)/Q(\Theta)]\,d\Theta. Because p(D) is constant, minimizing the KL-divergence is equivalent to maximizing L(Q). The quantity L(Q) is known as the lower bound and can be easily estimated for a given Q(Θ). The best model is the one that maximizes the lower bound. A secondary advantage of calculating L(Q) is realized in monitoring the convergence of the iterative estimation procedure and in checking the software implementation, by noting that each iteration should always increase the value of L(Q).
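To make the coordinate-wise updates of Eq. (5) concrete, the sketch below applies mean-field VB to the textbook problem of inferring the mean and precision of a univariate Gaussian; the conjugate priors and update formulas are the standard ones (not from this paper), and the loop illustrates the iterative estimation whose convergence the lower bound would monitor.

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, n_iter=50):
    """Mean-field VB for x_i ~ N(mu, tau^-1) with priors mu ~ N(mu0, (lam0*tau)^-1)
    and tau ~ Gamma(a0, b0), assuming the factorization Q(mu, tau) = q(mu) q(tau)."""
    N, xbar = len(x), np.mean(x)
    E_tau = 1.0 / np.var(x)                  # initial guess for <tau>
    a_N = a0 + (N + 1) / 2.0                 # shape update is fixed across sweeps
    for _ in range(n_iter):
        # q(mu) = N(m_N, var_mu): Eq. (5) applied to mu
        lam_N = lam0 + N
        m_N = (lam0 * mu0 + N * xbar) / lam_N
        var_mu = 1.0 / (lam_N * E_tau)
        # q(tau) = Gamma(a_N, b_N), using <(x_i - mu)^2> = (x_i - m_N)^2 + var_mu
        b_N = b0 + 0.5 * (np.sum((x - m_N) ** 2) + N * var_mu
                          + lam0 * ((m_N - mu0) ** 2 + var_mu))
        E_tau = a_N / b_N                    # each sweep cannot decrease L(Q)
    return m_N, var_mu, a_N, b_N
```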

Mathematical Formulation

This section outlines the formulation of the models proposed to engage data uncertainty in various phases of developing a hydrologic prediction model, namely Bayesian correlation with noisy data (BaNCorr), variational BaNPCA (VBaNPCA), and variational noisy RVM (VNRVM). The following notation is used: N samples of the independent variables in p-dimensional real space are denoted by X = [x_1, …, x_N]^T (x_n ∈ ℝ^p; n = 1, …, N), and the corresponding dependent variable is given by y = [y_1, …, y_N]^T (y_n ∈ ℝ). The independent and dependent variables are observed with known measurement errors given by
\tilde{x}_n = x_n + \zeta_n, \qquad \tilde{y}_n = y_n + \xi_n
(7)
where x̃n and ỹn = observed values of the nth sample; and ζn and ξn = measurement errors that follow Gaussian distribution with zero means and variances Bn-1 and hn-1, respectively.
Matrices are represented by uppercase bold italic letters. All vectors are column vectors and are represented by lowercase bold italic letters. Normal font italic letters represent scalars; the m×m identity matrix is shown as I_m; the symbol ⟨x⟩ denotes the expected value of a random variable x; and diag(x) represents a diagonal matrix with diagonal elements x.

Bayesian Correlation with Noisy Data (BaNCorr)

Let a univariate independent variable x and dependent variable y follow a bivariate Gaussian distribution with mean μ = [μ_x μ_y]^T and covariance Σ, i.e., p(x,y|μ,Σ) ∼ N([x y]^T | μ, Σ). Further, consider N i.i.d. samples from this distribution, [(x_1,y_1), (x_2,y_2), …, (x_N,y_N)], measured with heterogeneous errors
\begin{bmatrix} \tilde{x}_n \\ \tilde{y}_n \end{bmatrix} = \begin{bmatrix} x_n \\ y_n \end{bmatrix} + \begin{bmatrix} \zeta_n \\ \xi_n \end{bmatrix}
(8)
where [x̃nỹn]T = nth observed sample; and [ζnξn]T = associated measurement error vector. The measurement errors are assumed to be independent and follow zero-mean Gaussian distribution
\zeta_n \sim N(\zeta_n \mid 0, b_n^{-1}), \qquad \xi_n \sim N(\xi_n \mid 0, h_n^{-1}); \qquad n = 1, 2, \ldots, N
(9)
where b_n^{-1} and h_n^{-1} denote the variances of the measurement errors in x_n and y_n, respectively; b_n and h_n are also referred to as the precisions of the individual measurements. In BaNCorr, the true values (x_n and y_n) are treated as latent variables. The covariance matrix is decomposed as
\Sigma = \mathrm{diag}(s)\, R\, \mathrm{diag}(s)
(10)
In Eq. (10), s=[σxσy]T = vector of standard deviations; diag(s) = diagonal matrix with diagonal elements s; and R = correlation matrix given by
R = \begin{bmatrix} 1 & \rho_{xy} \\ \rho_{xy} & 1 \end{bmatrix}
(11)
To estimate correlation between x and y, ρxy, prior distributions are assigned to all unknown parameters, which for the most general (also most typical) case include μ, s, and ρxy. The posterior distribution is estimated by using Bayes rule as follows:
p(\mu, s, \rho_{xy} \mid \tilde{x}, \tilde{y}) = \frac{p(\tilde{x}, \tilde{y} \mid \mu, \Sigma)\, p(\mu, s, \rho_{xy})}{p(\tilde{x}, \tilde{y})}
(12)
where the likelihood of the data p(x̃,ỹ|μ,Σ) is given by
p(\tilde{x}, \tilde{y} \mid \mu, \Sigma) = \prod_{n=1}^{N} p(\tilde{x}_n \mid x_n, b_n^{-1})\, p(\tilde{y}_n \mid y_n, h_n^{-1})\, p(x_n, y_n \mid \mu, \Sigma)
(13)
In Eq. (13), owing to the assumption of zero-mean Gaussian measurement errors, the distributions p(x̃_n | x_n, b_n^{-1}) and p(ỹ_n | y_n, h_n^{-1}) are Gaussian with means x_n and y_n and variances b_n^{-1} and h_n^{-1}, respectively.
The prior distribution in Eq. (12), p(μ, s, ρ_xy), can be selected in various ways. In BaNCorr, the selection was guided by a desire to have mathematical elegance and simplicity in incorporating prior information into the learning process. The priors for μ, s, and ρ_xy are assumed to be independent, i.e., p(μ, s, ρ_xy) = p(μ)p(s)p(ρ_xy). For typical hydro-climatic variables, it is difficult to elicit reliable prior information on how mean, variance, and correlation are related. The prior distributions for μ_x and μ_y are selected to be Gaussian, i.e., p(μ_x) ∼ N(μ_x | ν_x0, β_x0^{-1}) and p(μ_y) ∼ N(μ_y | ν_y0, β_y0^{-1}); Gamma distributions are selected for the precisions (inverse of the variance) of the individual variables, i.e., p(σ_x^{-2}) ∼ Gamma(σ_x^{-2} | c_x0, d_x0), p(σ_y^{-2}) ∼ Gamma(σ_y^{-2} | c_y0, d_y0), and p(s) = p(σ_x)p(σ_y); and the prior for ρ_xy is set to be a uniform distribution, p(ρ_xy) ∼ U(ρ_xy | ρ_L, ρ_U). Prior knowledge of the data can be incorporated into BaNCorr by adjusting the hyperparameters in the prior distribution. If no a priori knowledge is available, the prior distribution can be made noninformative (broad) by setting ν_x0 = ν_y0 = 0; β_x0 = β_y0 = 10^{-3}; c_x0 = d_x0 = c_y0 = d_y0 = 10^{-3}; and ρ_L = -1, ρ_U = 1. Fig. 1 shows the schematic of the BaNCorr model by using a directed acyclic graph (DAG). The notational convention used in a DAG is given in the appendix.
Fig. 1. Directed acyclic graph representing BaNCorr model developed for finding correlation in noisy data
The posterior distribution for the model parameters, μ, s, and ρxy, and latent variables, xn and yn, cannot be estimated analytically, and therefore they were approximated by using the Gibbs sampler. The prior distribution over latent variables is implicitly decided on the basis of the prior distribution of model parameters as shown in Fig. 1. The Gibbs sampling procedure requires distribution of each unknown variable conditioned on the values of all remaining variables. For the BaNCorr model, the required distributions are given by (to make notations uncluttered, explicit dependence of the distributions on the observed data x̃ and ỹ has been dropped)
p(\mu_x \mid \mu_y, \sigma_x^{-2}, \sigma_y^{-2}, \rho_{xy}, x, y) = N(\mu_x \mid m_{\mu_x}, \beta_x^{-1})
(14)
p(\sigma_x^{-2} \mid \mu_x, \mu_y, \sigma_y^{-2}, \rho_{xy}, x, y) = \mathrm{Gamma}(\sigma_x^{-2} \mid c_x, d_x)
(15)
p(x_n \mid \mu_x, \mu_y, \sigma_x^{-2}, \sigma_y^{-2}, \rho_{xy}, y) = N(x_n \mid m_{x_n}, p_{x_n}^{-1}); \qquad n = 1, \ldots, N
(16)
where
\beta_x = \beta_{x0} + N\sigma_x^{-2}, \qquad m_{\mu_x} = \beta_x^{-1}\Big(\sigma_x^{-2}\sum_{n=1}^{N} x_n + \beta_{x0}\,\nu_{x0}\Big)
c_x = c_{x0} + N/2, \qquad d_x = d_{x0} + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu_x)^2
p_{x_n} = b_n + \sigma_x^{-2}, \qquad m_{x_n} = p_{x_n}^{-1}\big(\sigma_x^{-2}\mu_x + b_n \tilde{x}_n\big)
A similar set of equations can be easily derived for the variable y. To complete the sampling procedure, a method is required to draw samples from conditional distribution of ρxy. Because this distribution does not have an analytical form, the samples were drawn by using the Griddy Gibbs sampler (Barnard et al. 2000; Ritter and Tanner 1992). The sampling strategy used in BaNCorr was adopted from Chevrier (2007).
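A compact Python sketch of the sampler is given below. It transcribes the conditional updates of Eqs. (14)-(16) as printed (together with the symmetric updates for y) and replaces the Griddy Gibbs step with a simple grid evaluation of the conditional density of ρ_xy on the standardized latent values; initial values, grid resolution, and chain length are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def bancorr_gibbs(x_t, y_t, b, h, n_iter=10000, burn_in=5000):
    """Gibbs sampler for BaNCorr; x_t, y_t are the noisy observations and
    b, h the known measurement-error precisions (scalars or length-N arrays)."""
    rng = np.random.default_rng(0)
    N = len(x_t)
    grid = np.linspace(-0.99, 0.99, 199)            # grid for the rho step
    # noninformative hyperparameters suggested in the text
    nu_x0 = nu_y0 = 0.0
    beta_x0 = beta_y0 = c_x0 = d_x0 = c_y0 = d_y0 = 1e-3
    x, y = x_t.copy(), y_t.copy()                   # latent true values
    mu_x, mu_y = x.mean(), y.mean()
    tau_x, tau_y = 1.0 / x.var(), 1.0 / y.var()     # precisions sigma^-2
    rho_draws = []
    for it in range(n_iter):
        # Eq. (14): conditional for mu_x (and symmetrically for mu_y)
        beta_x = beta_x0 + N * tau_x
        mu_x = rng.normal((tau_x * x.sum() + beta_x0 * nu_x0) / beta_x, beta_x ** -0.5)
        beta_y = beta_y0 + N * tau_y
        mu_y = rng.normal((tau_y * y.sum() + beta_y0 * nu_y0) / beta_y, beta_y ** -0.5)
        # Eq. (15): conditionals for the precisions
        tau_x = rng.gamma(c_x0 + N / 2, 1.0 / (d_x0 + 0.5 * ((x - mu_x) ** 2).sum()))
        tau_y = rng.gamma(c_y0 + N / 2, 1.0 / (d_y0 + 0.5 * ((y - mu_y) ** 2).sum()))
        # Eq. (16): conditionals for the latent true values
        p_x = b + tau_x
        x = rng.normal((tau_x * mu_x + b * x_t) / p_x, p_x ** -0.5)
        p_y = h + tau_y
        y = rng.normal((tau_y * mu_y + h * y_t) / p_y, p_y ** -0.5)
        # grid ("griddy") step for rho on the standardized latent values
        dx, dy = (x - mu_x) * tau_x ** 0.5, (y - mu_y) * tau_y ** 0.5
        ll = (-0.5 * N * np.log(1 - grid ** 2)
              - ((dx ** 2).sum() - 2 * grid * (dx * dy).sum() + (dy ** 2).sum())
              / (2 * (1 - grid ** 2)))
        w = np.exp(ll - ll.max())
        rho = rng.choice(grid, p=w / w.sum())
        if it >= burn_in:
            rho_draws.append(rho)
    return np.asarray(rho_draws)                    # e.g., report np.median(...)
```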

Variational BaNPCA

The goal of PCA is to linearly map x_n from a p-dimensional to a q-dimensional real space so that the transformed data z_n (z_n ∈ ℝ^q; q < p), for a given q, preserve the maximum variance in x_n. In the probabilistic formulation of PCA, x_n is defined as a linear transformation of z_n by using the following linear Gaussian latent variable model:
xn=Wzn+μ+εn
(17)
where z_n = latent variables; ε_n = isotropic (symmetric) Gaussian noise, ε_n ∼ N(ε_n | 0, σ²I_p); μ = p-dimensional mean vector; and W = p×q matrix. For notational simplicity, the inverse of the error variance σ² will be represented by τ, where τ = σ^{-2}. Tipping and Bishop (1999) proved that if the prior distribution over z_n is a zero-mean Gaussian distribution, p(z_n) ∼ N(z_n | 0, I_q), then the maximum likelihood solution of W contains the principal directions of the data matrix X in its columns (although with a rotational ambiguity that can be easily removed, as described in Golub and van Loan (1996) and Ahn and Oh (2003)).
Tripathi and Govindaraju (2008), building on the work of Sanguinetti et al. (2005), extended probabilistic PCA to engage measurement uncertainties of observed data in the analysis and developed a Bayesian framework for estimating the model parameter W. The algorithm was termed Bayesian noisy PCA (BaNPCA) and is shown schematically in Fig. 2(a). BaNPCA was demonstrated to have the following advantages over standard PCA: (1) it possesses better data compressibility, (2) yields smoother principal vectors, (3) imputes missing values in the data, and (4) identifies the appropriate number of PCs. However, the algorithm was not fully Bayesian because, apart from W, the other model parameters and hyperparameters were estimated by using the maximum a posteriori (MAP) method, which approximates a posterior distribution by a delta function at its mode. This approximation is justified when N ≫ p, a condition seldom satisfied by hydro-climatic data sets, including sea surface temperature (SST) data. In this study, a complete Bayesian treatment of the BaNPCA algorithm is proposed; the new algorithm is referred to as variational BaNPCA (VBaNPCA). Mathematically, the algorithm is given by Eqs. (7) and (17). An important difference between BaNPCA and VBaNPCA is the introduction of the true data x_n as a latent variable in VBaNPCA, which substantially reduces the mathematical complexity in estimating the noise variance σ², a computational bottleneck in the earlier BaNPCA algorithm.
Fig. 2. Schematic of indicated algorithms
In VBaNPCA, the following hierarchical priors are assigned to model parameters and hyperparameters
p(W \mid \alpha) = \prod_{j=1}^{q} N(w_j \mid 0, \alpha_j^{-1} I_p); \qquad \text{where } \alpha = [\alpha_1, \ldots, \alpha_q]^T
(18)
p(\alpha_j) = \mathrm{Gamma}(\alpha_j \mid a_0, b_0); \qquad j = 1, \ldots, q
(19)
p(μ)=N(μ|0,β0-1Ip)
(20)
p(τ)=Gamma(τ|c0,d0)
(21)
where the column vector w_j denotes the jth column of W. The priors are made broad or noninformative by setting a_0 = b_0 = c_0 = d_0 = β_0 = 10^{-3}. The joint distribution of the observed variables, latent variables, and model parameters for the VBaNPCA model [schematically shown in Fig. 2(b)] is given by
p(\tilde{X}, X, Z, W, \mu, \alpha, \tau) = \prod_{n=1}^{N}\big[p(\tilde{x}_n \mid x_n, \zeta_n)\, p(\zeta_n \mid B_n)\, p(x_n \mid z_n, W, \mu, \tau)\, p(z_n)\big]\, p(\mu \mid \beta_0)\, p(\tau \mid c_0, d_0)\, p(W \mid \alpha) \prod_{j=1}^{q} p(\alpha_j \mid a_0, b_0)
(22)
The posterior distribution obtained by using the mean-field approximation is given in the appendix.
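For intuition, the following sketch draws synthetic data from the VBaNPCA generative model, i.e., the latent variable model of Eq. (17) observed through the heterogeneous measurement errors of Eq. (7); all dimensions and noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, q = 100, 10, 3                       # samples, observed and latent dimensions

W = rng.normal(size=(p, q))                # loading matrix
mu = rng.normal(size=p)                    # mean vector
tau = 25.0                                 # model noise precision, sigma^2 = 1/tau
Z = rng.normal(size=(N, q))                # latent variables z_n ~ N(0, I_q)
X = Z @ W.T + mu + rng.normal(scale=tau ** -0.5, size=(N, p))   # Eq. (17)

# Eq. (7): heterogeneous measurement errors with known per-element variances
err_var = rng.uniform(0.0, 0.5, size=(N, p))
X_tilde = X + rng.normal(scale=np.sqrt(err_var))   # observed (noisy) data
```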

Variational Noisy RVM

Let the regression relationship between independent variable xn and dependent variable yn be given by
yn=f(xn,X;w0,w)+ε=w0+wTϕn+ε
(23)
where the parameter w_0, which allows for a fixed offset in the data, is termed the "bias"; w = [w_1, …, w_M]^T = weight vector; ε = noise term assumed to have a zero-mean Gaussian distribution with variance σ_ε² (precision τ_ε = σ_ε^{-2}), i.e., ε ∼ N(ε | 0, σ_ε²); and ϕ_n = [ϕ(x_1, x_n), …, ϕ(x_M, x_n)]^T = vector of basis functions. In RVM, the basis functions are kernel functions, with one kernel associated with each data point in the training data, i.e., M = N. A popular choice of kernel function is the Gaussian or radial basis function (RBF) kernel given by
\phi(x_j, x_n) = \exp\!\left[-\frac{(x_j - x_n)^T (x_j - x_n)}{\sigma_{\mathrm{kernel}}^2}\right]
(24)
where σkernel is known as width of the kernel function. In standard RVM [Tipping 2001; Fig. 3(a)] typically used in the hydrologic literature (Khalil et al. 2005), the complexity of the regression model [Eq. (23)] is governed by assigning the following hierarchical priors, also known as automatic relevance determination (ARD) priors, over bias and weight vectors:
p(w_j \mid \alpha_j) = N(w_j \mid 0, \alpha_j^{-1}); \qquad j = 0, \ldots, M
(25)
where α_j = hyperparameter with a noninformative uniform prior in log space, as is also assumed for σ_ε². Ideally, model parameters should be estimated by determining their posterior distribution by using Bayes rule [Eq. (1)]. Because this is analytically intractable, the posterior distribution is estimated only for the bias and weight vectors, whereas point estimates are obtained for the other parameters by using the type-II maximum likelihood approach, which approximates posterior distributions by a delta function at their modes. This assumption is justified when the number of data points (N) is very large because in that case the posterior distribution is tightly concentrated around its mean. However, in typical hydrologic applications, N tends to be small, and this approximation is rarely justified.
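For reference, the N×N RBF design matrix implied by Eq. (24) can be assembled as follows; the helper name and example width are illustrative, and note that Eq. (24), as written, carries no factor of 2 in the denominator.

```python
import numpy as np

def rbf_design_matrix(X, sigma_kernel):
    """N x N design matrix of Eq. (24) for a 2-D array X of shape (N, p):
    one RBF kernel centred on each training point, as in the standard RVM."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-sq / sigma_kernel ** 2)

# Example: Phi = rbf_design_matrix(X_train, sigma_kernel=2.0); row n of Phi is phi_n
```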
Fig. 3. Schematic of indicated algorithms
A complete Bayesian treatment of RVM through variational inference was proposed by Bishop and Tipping (2000). In this study, their method is extended to incorporate measurement uncertainties of input data into RVM learning. The methodology developed is motivated by the works of Weigend et al. (1996) for state-space models and Quiñonero Candela (2004) for Gaussian processes.
As in BaNCorr and VBaNPCA, the true input xn is treated as a latent variable in VNRVM. The schematic of the VNRVM is shown by using a DAG in Fig. 3(b). The following prior distributions were assigned to model hyperparameters and noise variance:
p(\alpha_j) = \mathrm{Gamma}(\alpha_j \mid a_0, b_0); \qquad j = 0, \ldots, M
(26)
p(τε)=Gamma(τε|c0,d0)
(27)
The priors were made flat or noninformative by setting a_0 = b_0 = c_0 = d_0 = 10^{-3}. The joint distribution of the model parameters and observed data is given by
p(y, X, \tilde{X}, w, w_0, \alpha, \alpha_0, \tau_\varepsilon) = \prod_{n=1}^{N}\big[p(y_n \mid x_n, w, w_0, \tau_\varepsilon)\, p(\tilde{x}_n \mid x_n, \zeta_n)\, p(\zeta_n \mid B_n)\big]\, p(w \mid \alpha)\, p(w_0 \mid \alpha_0)\, p(\tau_\varepsilon) \prod_{j=0}^{M} p(\alpha_j)
(28)
where α = [α_1, …, α_M]^T. The mean-field approximation to the posterior distribution of the model parameters and latent variables is given in the appendix.

Data Used in the Study

SST Data

Gridded sea surface temperature data available at monthly time step from HadSST2 (Rayner et al. 2006) prepared at the Hadley Centre for Climate Prediction and Research, U. K., 〈http://hadobs.metoffice.com/hadsst2/〉, were used in the study. HadSST2 is an un-interpolated sea surface temperature data set based on in situ measurements of SST obtained primarily from ships and buoys. The climatological values were subtracted from the SST measurements to estimate anomalies. The resulting anomalies were gridded onto a 5° latitude by 5° longitude monthly grid by taking the winsorized mean. Bias correction was applied on the gridded anomalies to remove spurious trends caused by changes in SST measuring practices, following the procedure of Folland and Parker (1995).
HadSST2 accounts for uncertainty arising from (1) measurement and sampling errors (M&SE), where measurement errors are related to the precision of the instruments used for measuring SST and sampling errors are caused by estimating an area-averaged quantity from a finite number of noisy observations; and (2) bias attributable to different measurement practices over the observation period. The combined uncertainty was obtained by adding M&SE and the errors resulting from bias correction in quadrature, i.e., by taking the square root of the sum of the squares of the errors. This step is justified under the assumption that M&SE and the errors resulting from bias correction are independent.
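In symbols, for a grid cell with M&SE standard deviation \sigma_{\mathrm{mse}} and bias-correction error standard deviation \sigma_{\mathrm{bias}} (notation introduced here only for illustration), the combined value is

\sigma_{\mathrm{total}} = \sqrt{\sigma_{\mathrm{mse}}^2 + \sigma_{\mathrm{bias}}^2}

so that, for example, components of 0.3°C and 0.4°C combine to 0.5°C.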

All India Summer Monsoon Rainfall

The area-weighted rainfall data for India (Parthasarathy et al. 1994) were extracted from the website of the Indian Institute of Tropical Meteorology, Pune, 〈http://www.tropmet.res.in〉. The data extend from January 1871 to December 2006; the primary source is the India Meteorological Department. The time series of summer monsoon rainfall was derived by adding the monthly rainfall values for the monsoon months (JJAS). Because uncertainties in the rainfall data are not reported, the rainfall values were assumed to be noise free in this study.

Synthetic Data Sets

Synthetic data were used to evaluate the performance of the proposed algorithms because evaluation requires true observations and results that are, in general, not available with real-world data. The synthetic data sets were either obtained from the literature or designed to highlight the advantages of the proposed algorithms. They will be described in the sections in which they are used.

Results and Discussion

First, the performance of the proposed algorithms on synthetic data will be presented and compared with state-of-the-art methods. Following this, by using the seasonal forecast of AISMR as an example, application of the proposed algorithms for hydrologic prediction will be illustrated.

Bayesian Noisy Correlation (BaNCorr)

One hundred samples (N=100) from two normally distributed random variables x and y, with means μ_x = 0 and μ_y = 0, standard deviations σ_x = 1 and σ_y = 1, and correlation ρ_xy = 0.6, were generated and subjected to additive zero-mean Gaussian noise with variance sampled uniformly from the interval [0,u] to yield x̃ and ỹ. The data were generated for different values of u and correlation coefficients were estimated; the process was repeated 100 times. Fig. 4 shows the correlation estimated by the standard method and the median of the 5,000 samples generated from the posterior distribution of correlation by using the BaNCorr algorithm (sampled with the MCMC algorithm after a burn-in time of 5,000 steps). As expected, the correlations estimated from the standard method deviate from the true value, whereas those from the BaNCorr algorithm are closer to the true values, even in the presence of large noise, thereby highlighting the advantages of engaging uncertainty information.
Fig. 4. Comparison of standard correlation and BaNCorr for a data set of size N=100 with population correlation of 0.6 denoted by dashed line in the figure; measurement errors were induced by generating samples from Gaussian distribution with mean zero and standard deviation uniformly distributed between 0 and u
The BaNCorr algorithm is next compared with the algorithm proposed by Kulathinal et al. (2002) in the medical science literature, which also engages data uncertainty but yields maximum likelihood estimate of correlation, abbreviated in this paper as MLECorr. Fig. 5 shows the results from the two algorithms for the data set described in the previous paragraph but for different sample sizes and u=0.5. For large N, the estimates from BaNCorr and MLECorr are similar; however, for small N, MLECorr tends to give higher correlation with large variability whereas BaNCorr estimates are more robust. Similarly for a given sample size, MLECorr gives higher correlation for data having large measurement errors, as shown in Fig. 6. The estimate of covariance matrix tends to be unstable in MLECorr for small sample size and large noise but is stable in the BaNCorr algorithm owing to the shrinkage effect implicit in Bayesian methods (Yang and Berger 1994; Barnard et al. 2000).
Fig. 5. Comparison of MLECorr and BaNCorr for data sets of size N with population correlation of 0.6 denoted by dashed line in the figure; measurement errors were induced by generating samples from Gaussian distribution with mean zero and standard deviation uniformly distributed between 0 and 0.5; MLECorr overpredicts the correlation for smaller N
Fig. 6. Comparison of MLECorr and BaNCorr for a data set of size N=50 with population correlation of 0.6 denoted by dashed line in the figure; measurement errors were induced by generating samples from Gaussian distribution with mean zero and standard deviation uniformly distributed between 0 and u; MLECorr tends to overpredict the correlation for large measurement errors
The confidence interval (CI) for the estimated correlation given by the standard method depends only on the sample size. Intuitively, measurement errors should widen CI. Nevertheless, it is not reflected in the standard method owing to its inability to engage uncertainty information. For BaNCorr results, this effect is exhibited in the form of increased dispersion in the posterior distribution of the correlation. Figs. 7 and 8 present the posterior distribution of correlation for different sample sizes and data subjected to different noise levels. The posterior distribution is distinctively different from Gaussian distribution.
Fig. 7. Posterior distributions of correlation with varying sample size; for the synthetic example, population correlation was 0.60 and induced measurement errors were Gaussian distributed with mean zero and standard deviation uniformly distributed between zero and unity
Fig. 8. Posterior distributions of correlation with varying noise levels characterized by u; for the synthetic example, population correlation was 0.60 and sample correlation with 200 samples was 0.56
The standard and BaNCorr algorithms differ markedly in their procedures for selecting a feature, reflecting their frequentist and Bayesian origins. In the standard method, a feature is selected if its correlation with the dependent variable is significant, more precisely, if the null hypothesis of no correlation can be rejected at a given significance level. In BaNCorr, a feature is selected if the posterior probability of correlation exceeds a threshold. Both procedures have relative merits, although the Bayesian method is more straightforward to understand and implement. Additionally, the Bayesian method allows incorporation of knowledge of the physical process into the inference procedure, whenever available. For instance, a priori knowledge about a positive correlation between the dependent and independent variables could be used by appropriately adjusting the prior distribution.

Variational Bayes Noisy PCA

Tripathi and Govindaraju (2008) presented and discussed the advantages of BaNPCA over the traditional PCA methods. The desired properties of BaNPCA are also shared by the variational Bayes noisy PCA (VBaNPCA) algorithm. Hence, for brevity, only two properties, namely imputation of missing values and identification of appropriate number of PCs, will be briefly presented.
Beckers and Rixen (2003) designed a data set for evaluating the performance of data imputing algorithms; a modified version was used by Tripathi and Govindaraju (2008). The data consist of a deterministic and a random component. The ijth element of the deterministic component X is given as
x_{ij} = \sin(t_i)\sin(l_j) + \sin(2.1 t_i)\sin(2.1 l_j) + \sin(3.1 t_i)\sin(3.1 l_j) + \cos(t_i)\tanh(l_j) + \cos(2.1 t_i)\tanh(2 l_j) + \cos(0.1 t_i)\tanh(4 l_j) + \cos(1.1 t_i)\tanh(2.4 l_j) + \tanh(t_i + l_j) + \tanh(2 t_i + l_j)
(29)
where ti=2πi/n and lj=2πj/p, with time index i=1,,n, and space or location index j=1,,p. For the simulations, n and p were fixed to 100 and 50, respectively. The random component consists of two parts; the first part mimics zero-mean normally distributed observation errors whose variance varies log-linearly between 0.05 and 5 across space and fluctuates randomly across time. The second part is an additive zero-mean Gaussian noise with standard deviation of 0.05, and it remains constant for all data points. A hundred replicates of the data set were generated and 10% of the values were randomly removed from each set. The performance is quantified by comparing imputed values with true values. Fig. 9 shows the mean absolute error (MAE) in filling deleted values by VBaNPCA, BaNPCA, and two state-of-the-art data imputation algorithms, namely Regularized EM (RegEM; Schneider 2001), and data interpolating empirical orthogonal functions (DINEOF; Beckers and Rixen 2003). BaNPCA and VBaNPCA have smaller MAE compared to RegEM and DINEOF. The median MAE from BaNPCA and VBaNPCA are similar; however, VBaNPCA estimates have less scatter, suggesting improved performance.
Fig. 9. Mean absolute error in imputing missing values by different algorithms; statistics are based on 100 simulations; RegEM, DINEOF, BaNPCA, and VBaNPCA refer to regularized EM algorithm, data interpolating empirical orthogonal functions, Bayesian noisy PCA, and variational BaNPCA, respectively
To test the capability of VBaNPCA in identifying the optimal number of principal components, a synthetic data set with N=100 samples from a multivariate Gaussian distribution in p=10 dimensional space, with a variance of 5 in the first four dimensions and a variance of 1 in the remaining six dimensions, was generated and subjected to (1) randomly fluctuating additive zero-mean Gaussian noise with variance sampled from the range [0, 0.5]; and (2) fixed additive zero-mean Gaussian noise with variance 0.1. The principal vectors were obtained by using standard PCA (SPCA), noisy PCA (Sanguinetti et al. 2005), BaNPCA, and VBaNPCA, and are displayed by using Hinton diagrams in Fig. 10, wherein each column represents a principal vector. (A Hinton diagram provides a qualitative display of the values in a matrix: each value is represented by a square whose size is proportional to its magnitude and whose color indicates its sign.) Except for SPCA, the three other algorithms use the variance of the randomly fluctuating noise, mimicking measurement errors, for estimating principal vectors. The results indicate that both BaNPCA and VBaNPCA can identify the correct number of PCs, suggesting the efficacy of the automatic relevance determination (ARD) priors [Eq. (18)] used in them.
Fig. 10. Hinton diagram of principal vectors estimated by different algorithms for data set in 10 dimensional space drawn from a multivariate Gaussian distribution having larger variance in 4 dimensions and smaller variance in remaining 6 dimensions: (a) standard PCA; (b) noisy PCA; (c) BaNPCA; (d) VBaNPCA; Both BaNPCA and VBaNPCA suppress all but four relevant dimensions in the data

Variational Noisy RVM

Two experiments were performed to assess the performance of variational noisy RVM (VNRVM) for linear regression. The results were compared with the simulation extrapolation (SIMEX) algorithm (Carroll et al. 2006), recently introduced into the hydrologic literature by Chowdhury and Sharma (2007). Briefly, in the SIMEX algorithm, a relationship between the noise level and the estimated regression coefficient is established by adding known white noise to the data; this relationship is then extrapolated, typically with a polynomial, to estimate the regression coefficient for the no-noise case, as sketched below.
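A minimal sketch of SIMEX for a single regression slope is given below, assuming known heterogeneous error variances and quadratic extrapolation; the function name, noise levels, and defaults are illustrative and are not taken from Carroll et al. (2006).

```python
import numpy as np

def simex_slope(x_noisy, y, err_var, lambdas=(0.5, 1.0, 1.5, 2.0), B=50, deg=2):
    """SIMEX: re-add scaled noise (lambda * known error variance), refit the
    slope, and extrapolate the noise-slope relationship back to lambda = -1."""
    rng = np.random.default_rng(0)
    lam_grid = [0.0]
    slopes = [np.polyfit(x_noisy, y, 1)[0]]         # slope at the observed noise level
    for lam in lambdas:
        fits = []
        for _ in range(B):                          # average over B noise replicates
            x_b = x_noisy + rng.normal(scale=np.sqrt(lam * err_var))
            fits.append(np.polyfit(x_b, y, 1)[0])
        lam_grid.append(lam)
        slopes.append(np.mean(fits))
    coef = np.polyfit(lam_grid, slopes, deg)        # polynomial extrapolation
    return np.polyval(coef, -1.0)                   # estimate for the no-noise case
```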
In the first experiment, 100 samples of independent variable x were generated from xU(x|-10,10) and were subjected to zero-mean Gaussian error with standard deviation drawn randomly from the range [0,5] to yield x̃. The dependent variable y was estimated as y=wx+ε, where εN(ε|0,1) and w was set to 10. The top panel in Fig. 11 shows the value of w estimated by different algorithms from 100 repetitions of the experiment, by using box-plots. Both SIMEX and VNRVM performed better in estimating the regression coefficient than standard least square regression, which ignores uncertainty in data. The results from the SIMEX algorithm were sensitive to the choice of extrapolation method, which is reflected in the results obtained from linear, quadratic, and cubic extrapolation. The bias in estimating w over 100 simulations appears to be similar for SIMEXC and VNRVM; however, VNRVM has a notably smaller variance which can again be attributed to ARD prior that regularizes estimates of w.
Fig. 11. Linear regression for noisy data using (1) standard least square method; (2) SIMEX with linear (SIMEXL), quadratic (SIMEXQ), and cubic (SIMEXC) extrapolation; and (3) VNRVM; dashed lines show the true values of the regression coefficients and the box-plots represent the estimated coefficients from hundred simulations of synthetic data by respective methods
The second experiment was similar to the first, except that a redundant variable, drawn randomly from the range [-10,10], was introduced into the regression problem. The experiment mimics typical problems in hydrology, in which, in the absence of a priori knowledge, redundant variables are included in regression analysis. The middle and the bottom panels in Fig. 11 show the estimated coefficients by different methods for 100 simulations of the experiment. The performance of all the methods in estimating the coefficient for the relevant variable degrades from the first experiment as reflected in slightly higher bias and increased variance. For the redundant variable, unlike SIMEX, VNRVM forces its coefficient to zero, thus effectively removing the variable from the analysis. The experiment highlights the feature-selection capabilities of linear VNRVM.
To test VNRVM for nonlinear regression, data from the sinc function, sinc(x) = sin(x)/x, were used. The sinc function is a benchmark problem for evaluating kernel methods (Tipping 2001). The input variables were randomly generated from the range x ∈ [-10,10] and subjected to an additive zero-mean Gaussian noise with standard deviation sampled from the range [0,8]. To the output variable y = sinc(x), a zero-mean Gaussian noise of standard deviation 0.1 was added. Fig. 12 shows the sinc function and the noisy data set sampled from it. RVM and VNRVM were used to learn the function from the noisy data by using an RBF kernel [Eq. (24)] with σkernel=2. The learned functions are shown as dotted lines in Fig. 12. It is evident that the function learned by RVM is overly smooth, whereas that by VNRVM is closer to the true function.
Fig. 12. Learning sinc function from hundred samples of noisy data (plus symbol); a standard RVM algorithm (RVM) learns an oversmooth function (dotted line with cross marker), whereas the VNRVM estimate (dotted line) closely matches the true function
The value for σkernel was assumed in the results reported for the sinc data. In practice, however, it is not known and should be estimated as part of the learning process. Tripathi and Govindaraju (2007) demonstrated the importance of σkernel in RVM and provided a set of procedures for selecting σkernel for hydrologic applications. However, their methods are not directly applicable to the case with noisy input data. The use of the variational principle in VNRVM provides a handy way of selecting σkernel: the model that maximizes the variational lower bound, L(Q), corresponds to the best model. Fig. 13 shows the plot of the variational lower bound versus σkernel for the sinc data. The plot suggests σkernel ≈ 2 to be optimal, and the results reported in Fig. 12 support this choice. The selected value of σkernel is very close to the optimal value suggested by Tripathi and Govindaraju (2007).
Fig. 13. Lower-bound L(Q) of VNRVM for sinc data versus kernel width σkernel; optimal σkernel corresponds to maximum value of L(Q)

Seasonal forecasting of AISMR

Seasonal prediction of all India summer monsoon rainfall (AISMR) has been a topic of scientific study for more than a century, and it is still pursued because of the associated socioeconomic benefits. The literature abounds with theories and models, both physical and empirical, but the goal of reliable prediction remains elusive (Shukla 2007; Gadgil et al. 2005).
In the early 1980s, Charney and Shukla (1981) proposed that seasonal monsoon rainfall is governed by the slowly varying surface boundary conditions of SST, soil wetness, vegetation, and snow cover. This theory, which is among the most influential theories on predictability of AISMR, forms the basis of the model developed in this study. The predictors for the model were selected from global SST data. Tripathi and Govindaraju (2008) studied the effect of engaging uncertainties in SST data on prediction of AISMR, but their focus was on predicting extreme states of AISMR by using robust optimization. In this study, predictability of seasonal mean monsoon rainfall is investigated by using the Bayesian tools developed in this study.
HadSST2 data from January 1901 to December 2006 were used in developing AISMR prediction models. Grid points with more than 50% missing records were not considered in the analysis. The data were partitioned into five nonoverlapping sectors, (1) tropical Pacific sector (30°S–30°N), (2) North Pacific sector (north of 30°N), (3) tropical Atlantic sector (30°S–30°N), (4) North Atlantic sector (north of 30°N), and (5) Indian Ocean sector (north of 30°S), to reduce the influence of strong ENSO signals on the variability of SST data outside the tropical Pacific Ocean. The partitioning of the data allows the influence of each oceanic sector on AISMR to be investigated separately, and it also brings down the computational cost of processing global SST data. The monthly SST anomalies were averaged to get seasonal anomalies for four seasons (DJF, MAM, JJA, and SON). Because the measurement errors in monthly anomalies were independent, the error variance in the seasonal anomaly was estimated by taking the average of the error variances in the monthly anomalies, as sketched below. Seasonal SST tendencies, defined as the change in SST anomalies from the previous season, were estimated, and the associated error variances were calculated by adding the variances in the anomalies of the current and the previous seasons. The term "SST data" refers to both SST anomalies and SST tendencies. Following Sahai et al. (2008), the predictors for mean monsoon rainfall were identified from the SST data at lags of one season to three years from the monsoon season. The data were partitioned into a training set (1901–1985) and a test set (1986–2006), and only the former set was used for identifying predictors.
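The aggregation of anomalies and error variances described above might be coded as follows; the array layout, with a season's three months along the last axis, is a hypothetical convention chosen only for illustration.

```python
import numpy as np

def seasonal_stats(monthly_anom, monthly_err_var):
    """monthly_anom, monthly_err_var: arrays of shape (..., n_seasons, 3).
    Seasonal anomaly = mean of the monthly anomalies; its error variance is
    taken as the average of the monthly error variances (as in the text)."""
    anom = monthly_anom.mean(axis=-1)
    var = monthly_err_var.mean(axis=-1)
    # tendency = change from the previous season; variances add (independence)
    tendency = anom[..., 1:] - anom[..., :-1]
    tendency_var = var[..., 1:] + var[..., :-1]
    return anom, var, tendency, tendency_var
```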
To develop a prediction model, SST data for each season in an oceanic sector were processed by using PCA; the PCs obtained were correlated with AISMR to select predictors, which were then used by a regression model to learn a functional relationship between them and AISMR. Table 1 shows correlation and relative root mean square error (RRMSE) between observed and predicted values during training and testing periods. The algorithms in the first row, namely standard PCA (SPCA), Pearson product-moment correlation (Pearson), and RVM, do not use information on SST uncertainty, which is used by the algorithms in subsequent rows, with the last row using it in all the three steps. The VBaNPCA can automatically identify optimum number of PCs, whereas no objective way exists of identifying the number of PCs in SPCA. Therefore, to facilitate comparison of different algorithms, only the first three PCs of SST data from SPCA and VBaNPCA were used.
Table 1. Prediction Performance of All India Summer Monsoon Rainfall
| PCA | Correlation | Regression | Training r | Training RRMSE | Testing r | Testing RRMSE |
| SPCA | Pearson | RVM | 0.80 | 5.52 | 0.33 | 6.97 |
| VBaNPCA | Pearson | RVM | 0.88 | 3.09 | 0.52 | 1.57 |
| VBaNPCA | Pearson | VNRVM | 0.89 | 2.99 | 0.51 | 1.70 |
| VBaNPCA | BaNCorr | RVM | 0.87 | 3.13 | 0.44 | 1.57 |
| VBaNPCA | BaNCorr | VNRVM | 0.88 | 3.55 | 0.42 | 1.76 |
Note: The symbols r and RRMSE correspond to the correlation and relative root mean square error between observed and predicted rainfall, respectively.
A PC was selected as a potential predictor if its Pearson product-moment correlation with AISMR was significantly different from zero at the 10% significance level, or if the posterior probability that the absolute value of the correlation exceeds zero was more than 0.9, i.e., p(|r|>0)>0.9, in which p(r) represents the posterior distribution of correlation obtained from the BaNCorr algorithm. The regression relationships were learned by using linear versions of the RVM and VNRVM algorithms. These algorithms can identify relevant predictors while learning the regression relationship. The first step of deriving PCs reduced the dimensionality of the predictors from a few thousand to a few hundred. The screening of PCs by correlation further reduced the number of potential predictors to approximately 50. Finally, the RVM identified approximately 10 of those 50 features as relevant for learning the regression relationship. Most of the finally selected predictors were from the tropical Pacific and Indian Ocean sectors.
The results presented in Table 1 indicate that the models engaging data uncertainty information perform better than the models that do not use uncertainty information, for both training and testing phases. The VBaNPCA algorithm appears to significantly affect the prediction performance, whereas the other two methods, BaNCorr and VNRVM, only marginally alter the prediction performance. Fig. 14 shows the observed and predicted time series during the testing phase of two models: one that does not engage uncertainty (SPCA-Pearson-RVM; NoUnc) and another that incorporates uncertainty in all three steps of model development (VBaNPCA-BaNCorr-VNRVM; WithUnc). The gray shaded region represents one standard error of prediction. The results indicate that the NoUnc model is poor in capturing extreme events, a behavior that was also noted in the sinc data example (Fig. 12). The WithUnc model can predict the extreme years of 1987, 1988, and 1994; however, it could not successfully predict the weak monsoon years of 2002 and 2004. Furthermore, the normal monsoon year of 1997 is wrongly predicted to be a weak monsoon year by the model.
Fig. 14. Observed and predicted time series of AISMR along with standard error of prediction; predictions are made by using (a) SPCA-Pearson-RVM (NoUnc); and (b) BaNCorr-VBaNPCA-VNRVM (WithUnc)
The model that utilizes SST measurement uncertainties results in a slightly larger standard error of prediction (Fig. 15). The standard error in NoUnc is attributable to model uncertainty and uncertainties in the estimate of model parameters. In addition to these two components, measurement uncertainties in the SST data also contribute toward the prediction uncertainty in WithUnc. Among the different components of prediction uncertainty, the model uncertainty was observed to be largest. The model uncertainty was 53.5 mm for NoUnc and 46.4 mm for WithUnc. Engaging data uncertainty information marginally reduces the model uncertainty, but it is still very large, and therefore, the effects of data uncertainty are not conspicuous in the overall estimates of prediction uncertainty.
Fig. 15. Comparison of standard error of prediction yielded by (a) SPCA-Pearson-RVM (NoUnc); and (b) BaNCorr-VBaNPCA-VNRVM (WithUnc)
A small sample size and a relatively large pool of potential predictors make assessment of prediction skill difficult. Traditionally, data are divided into a training and a validation set; the model is trained on the former set, and the performance of the trained model on the latter set gives an estimate of the prediction skill. This estimate is not reliable if the sample size of the validation set is small. For such cases, cross-validation (CV; Wilks 2006) gives a better estimate of prediction skill. In a typical k-fold CV, the data are divided into k disjoint subsets, the model is trained on all subsets except one, and the prediction is made on the subset left out. The procedure is repeated k times, each time selecting a different subset for validation, such that each data point becomes a part of the validation set once. If the predictors for the model are selected from the same data used for validation, selection bias is induced, a subtle point that is sometimes ignored not only in hydrology but even in the statistical and machine learning literature (Ambroise and McLachlan 2002). The problem of selection bias becomes acute when the set of potential predictors is very large compared to the sample size. In such cases, a reliable estimate of prediction skill requires two-level cross-validation (Wessels et al. 2005). In this procedure, at the first level, the data are divided into G disjoint sets, with G-1 subsets forming the training data and the remaining set forming the validation data. The training data then undergo an inner loop in which k-fold CV is used to select predictors. A model is then trained, for the selected predictors, on the training data, and the performance of the trained model is evaluated on the validation data. The process of splitting the data into a training and a validation set is repeated L times, and the average performance over the L cycles gives an estimate of the prediction skill. Table 2 presents the CV results for the NoUnc and WithUnc models by using the naïve k-fold CV with k=N (leave-one-out CV), and two-level CV with G, k, and L set to 3, 5, and 100, respectively; a code sketch of the two-level procedure follows Table 2. It is evident from the table that the presence of selection bias in the naïve k-fold CV gives an overly optimistic estimate of prediction skill. The two-level CV gives a more conservative estimate of prediction skill than that obtained from the testing data set in Table 1.
Table 2. Predictive Skill of Models by Using Cross-Validation
| Approach | Naïve CV r | Naïve CV RRMSE | Two-level CV r | Two-level CV RRMSE |
| NoUnc | 0.38 | 1.02 | 0.02 (0.075) | 1.44 (0.265) |
| WithUnc | 0.68 | 0.75 | 0.31 (0.084) | 1.08 (0.077) |
Note: The naïve CV does not account for selection bias. The symbols r and RRMSE correspond to the correlation and relative root mean square error between observed and predicted rainfall in the validation data, respectively. Values in parentheses are standard deviations over the repetitions of the algorithm.
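The two-level procedure can be summarized in the following sketch, in which ridge regression and a simple correlation threshold stand in for the RVM/VNRVM mapping and the Pearson/BaNCorr screening used in the paper; G, k, and L mirror the settings quoted above, and all other choices are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def select_predictors(X, y, k=5, threshold=0.3):
    """Inner k-fold loop: keep features whose mean |correlation| with y across
    the inner training folds exceeds a threshold (stand-in for the screening step)."""
    scores = np.zeros(X.shape[1])
    for train, _ in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        scores += np.abs([pearsonr(X[train, j], y[train])[0] for j in range(X.shape[1])])
    return np.where(scores / k > threshold)[0]

def two_level_cv(X, y, G=3, L=100):
    """Outer loop: predictors are selected inside each training split only, so the
    validation fold never influences feature selection (no selection bias)."""
    rng, r_scores = np.random.default_rng(0), []
    for _ in range(L):
        seed = int(rng.integers(1_000_000))
        for train, valid in KFold(n_splits=G, shuffle=True, random_state=seed).split(X):
            feats = select_predictors(X[train], y[train])
            if feats.size == 0:
                continue
            model = Ridge().fit(X[train][:, feats], y[train])
            r_scores.append(pearsonr(y[valid], model.predict(X[valid][:, feats]))[0])
    return np.mean(r_scores), np.std(r_scores)      # cf. the values in Table 2
```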
Predicting extreme monsoon states is the most challenging aspect of AISMR prediction. Table 3 presents the performance of the NoUnc and WithUnc models for predicting below normal, normal, and above normal states of AISMR in training, testing, and cross-validation. Years with rainfall within 5% of the long term AISMR mean are termed normal years; below (above) normal years are years with rainfall more than 5% below (above) the long term AISMR mean. This table reinforces the earlier conclusion that engaging uncertainty improves the ability to predict extreme years. Nevertheless, the WithUnc model could predict only approximately one-third of the extreme years, which is disappointingly low.
Table 3. Performance of NoUnc (SPCA-Pearson-RVM) and WithUnc (VBaNPCA-BaNCorr-VNRVM) Models for Predicting Below Normal, Normal, and Above Normal States of AISMR
| Method | Training (bn / n / an) | Testing (bn / n / an) | Naïve CV (bn / n / an) | Two-level CV (bn / n / an) |
| Obs | 26 / 30 / 27 | 9 / 9 / 3 | 31 / 41 / 32 | 31 / 41 / 32 |
| NoUnc | 18 / 24 / 12 | 0 / 9 / 0 | 13 / 28 / 13 | 7.86 / 22.66 / 8.91 |
| WithUnc | 19 / 27 / 18 | 5 / 4 / 2 | 20 / 25 / 17 | 11.96 / 20.10 / 12.17 |
Note: Below normal (bn), normal (n), and above normal (an). For two-level CV, the values represent the mean of the performance over 100 repetitions of the algorithm.

Concluding Remarks

A suite of Bayesian models was developed to engage data uncertainty when conducting correlation, PCA, and regression analyses. The performance of the proposed models was tested and compared with standard models on synthetic data. Their applicability to hydrologic problems was demonstrated by developing a seasonal forecast model for all India summer monsoon rainfall.
The Bayesian noisy correlation (BaNCorr) algorithm uses the Gibbs sampler to estimate correlation among variables with measurement errors. The presence of error attenuates correlation and increases uncertainty in its estimate. Standard correlation analysis by using Pearson product-moment correlation ignores both of these aspects, but these are accounted for by the BaNCorr algorithm. Comparison of BaNCorr with maximum likelihood estimate proposed by Kulathinal et al. (2002) suggested that BaNCorr is more robust for small sample size and large measurement errors.
Variational Bayes noisy PCA (VBaNPCA) was proposed as an improvement over the BaNPCA algorithm (Tripathi and Govindaraju 2008). VBaNPCA was developed by using variational principles. It is fully Bayesian in the sense that all its parameters are treated as random variables and are estimated by using Bayes rule. Unlike BaNPCA, it avoids making the large sample approximation and is therefore more suitable for large dimensional data with small sample size. Application of VBaNPCA to synthetic examples showed that it preserves the desired features of BaNPCA, namely imputing missing values and identifying the appropriate number of PCs, and is also computationally efficient.
The problem of learning the relationship between predictors and predictands, when the former have measurement errors, was addressed by developing the variational noisy relevance vector machine (VNRVM). For linear regression, VNRVM provides a closed form iterative solution and can identify relevant predictors among noisy input variables. Comparison of VNRVM with the SIMEX algorithm (Carroll et al. 2006) showed that the estimates of the regression coefficient by VNRVM are more robust than those by SIMEX. With a combination of variational methods and the Metropolis-Hastings algorithm, VNRVM can learn nonlinear relationships in noisy data, as illustrated by the synthetic example of the sinc function. Additionally, the variational lower bound of VNRVM provides an elegant way of selecting the kernel width parameter used in nonlinear kernel regression.
Practical applicability of the three proposed algorithms was demonstrated by developing the seasonal forecast model for AISMR by using SST data as inputs. It was shown that uncertainty in the SST data can be engaged in developing hydrologic models that may improve prediction performance. The proposed methods propagate measurement uncertainty in input data through various stages of model development up to the predicted value, thus providing correct assessment of prediction capabilities under noisy data. The problem of selection bias in estimating prediction performance was discussed.
This study demonstrated the benefits of engaging measurement errors in developing hydroclimatic models and provides a set of tools to achieve this goal. Clearly, prediction skills need to improve further for better forecasts. It is hoped that this study will encourage the use of measurement-error information in hydrologic studies.

Appendix: DAG Notation and Posterior Distributions

Directed Acyclic Graphs

The following convention is used in the directed acyclic graphs (DAGs) shown in Figs. 1–3: (1) a circle (node) denotes a random variable; observed variables are shown as shaded circles, whereas nonshaded circles represent model parameters or hidden variables; (2) solid elliptical dots represent deterministic parameters; (3) an arrow denotes the distribution of the variable at its head conditioned on the variable at its tail, e.g., an arrow connecting x to y represents p(y|x); and (4) a plate marked n = 1, …, N denotes that N nodes of the indicated type exist, of which only one example is shown inside the plate.

Posterior Distribution for Variational BaNPCA Parameters

The mean-field approximation of the posterior distribution is given by $Q(X,Z,W,\mu,\alpha,\tau) = q(X)\,q(Z)\,q(W)\,q(\alpha)\,q(\mu)\,q(\tau)$, and by using Eq. (5), the optimal solution for the individual components takes the following closed form:

$$
\begin{aligned}
q(X) &= \prod_{n=1}^{N} \mathcal{N}\!\left(x_n \mid m_{x_n}, P_{x_n}^{-1}\right), &
q(Z) &= \prod_{n=1}^{N} \mathcal{N}\!\left(z_n \mid m_{z_n}, P_{z_n}^{-1}\right) \\
q(W) &= \prod_{i=1}^{p} \mathcal{N}\!\left(\acute{w}_i \mid m_{\acute{w}_i}, P_{w}^{-1}\right), &
q(\alpha) &= \prod_{j=1}^{q} \mathrm{Gamma}\!\left(\alpha_j \mid a_{\alpha}, b_{\alpha_j}\right) \\
q(\mu) &= \mathcal{N}\!\left(\mu \mid m_{\mu}, \beta^{-1} I_p\right), &
q(\tau) &= \mathrm{Gamma}\!\left(\tau \mid c, d\right)
\end{aligned}
$$

where the column vector $\acute{w}_i$ denotes the $i$th row of $W$, and

$$
\begin{aligned}
P_{x_n} &= B_n + \langle\tau\rangle I_p, &
m_{x_n} &= P_{x_n}^{-1}\!\left[\langle\tau\rangle\left(\langle W\rangle\langle z_n\rangle + \langle\mu\rangle\right) + B_n \tilde{x}_n\right] \\
P_{z_n} &= \langle\tau\rangle\langle W^{T} W\rangle + I_q, &
m_{z_n} &= \langle\tau\rangle P_{z_n}^{-1}\langle W\rangle^{T}\left(\langle x_n\rangle - \langle\mu\rangle\right) \\
P_{w} &= \mathrm{diag}(\langle\alpha\rangle) + \langle\tau\rangle\sum_{n=1}^{N}\langle z_n z_n^{T}\rangle, &
m_{\acute{w}_i} &= \langle\tau\rangle P_{w}^{-1}\sum_{n=1}^{N}\langle z_n\rangle\left(\langle x_{ni}\rangle - \langle\mu_i\rangle\right) \\
a_{\alpha} &= a_0 + p/2, &
b_{\alpha_j} &= b_0 + \langle w_j^{T} w_j\rangle/2 \\
\beta &= \beta_0 + N\langle\tau\rangle, &
m_{\mu} &= \langle\tau\rangle\beta^{-1}\sum_{n=1}^{N}\left(\langle x_n\rangle - \langle W\rangle\langle z_n\rangle\right) \\
c &= c_0 + Np/2, &
d &= d_0 + \tfrac{1}{2}\sum_{n=1}^{N}\Big[\langle x_n^{T} x_n\rangle + \mathrm{trace}\!\left(\langle W^{T}W\rangle\langle z_n z_n^{T}\rangle\right) + \langle\mu^{T}\mu\rangle + 2\langle\mu\rangle^{T}\langle W\rangle\langle z_n\rangle - 2\langle x_n\rangle^{T}\left(\langle W\rangle\langle z_n\rangle + \langle\mu\rangle\right)\Big]
\end{aligned}
$$
The expected values required in the preceding equations were evaluated by using the following properties. For a Gaussian distribution $p(x) \sim \mathcal{N}(x \mid \mu, P^{-1})$:

$$
\langle x\rangle = \mu, \qquad \langle x x^{T}\rangle = \mu\mu^{T} + P^{-1}, \qquad \langle x^{T} x\rangle = \mu^{T}\mu + \mathrm{trace}(P^{-1})
\tag{30}
$$

and for a Gamma distribution $p(x) \sim \mathrm{Gamma}(x \mid a, b)$:

$$
\langle x\rangle = a/b, \qquad \langle x^{2}\rangle - \langle x\rangle^{2} = a/b^{2}
\tag{31}
$$
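To make these updates concrete, the following is a minimal numerical sketch of the VBaNPCA coordinate updates (not the authors' implementation): posterior means stand in for some second moments, the prior on $\mu$ is taken as vague, and all sizes and hyperparameter values are illustrative.

```python
# Minimal sketch of the VBaNPCA coordinate updates above (illustrative only).
# The ARD precisions alpha should prune the two surplus latent columns.
import numpy as np

rng = np.random.default_rng(2)
N, p, q_true, q = 100, 8, 3, 5
W_true = rng.standard_normal((p, q_true))
clean = rng.standard_normal((N, q_true)) @ W_true.T
err_sd = rng.uniform(0.1, 0.5, size=(N, p))
X_obs = clean + rng.standard_normal((N, p)) * err_sd
B = 1.0 / err_sd**2                                     # diagonal error precisions B_n

a0 = b0 = c0 = d0 = 1e-3
W = rng.standard_normal((p, q))
mu = X_obs.mean(axis=0)
alpha = np.ones(q)
tau = 1.0
X = X_obs.copy()

for it in range(100):
    # q(Z): P_z = <tau><W^T W> + I_q,  m_z = <tau> P_z^{-1} W^T (x_n - mu)
    Sz = np.linalg.inv(np.eye(q) + tau * (W.T @ W))
    Z = tau * (X - mu) @ W @ Sz
    # q(X): P_x = B_n + <tau> I_p is diagonal, so combine elementwise
    X = (tau * (Z @ W.T + mu) + B * X_obs) / (B + tau)
    # q(W): P_w = diag(alpha) + <tau> sum_n <z_n z_n^T>
    Pw = np.diag(alpha) + tau * (Z.T @ Z + N * Sz)
    W = tau * np.linalg.solve(Pw, Z.T @ (X - mu)).T
    # q(alpha): <alpha_j> = (a0 + p/2) / (b0 + <w_j^T w_j>/2)
    alpha = (a0 + p / 2) / (b0 + 0.5 * np.sum(W**2, axis=0))
    # q(mu): with a vague prior this reduces to the mean residual
    mu = (X - Z @ W.T).mean(axis=0)
    # q(tau): <tau> = (c0 + Np/2) / (d0 + 0.5 * sum of squared residuals)
    resid = X - Z @ W.T - mu
    tau = (c0 + N * p / 2) / (d0 + 0.5 * np.sum(resid**2))

print("columns retained by ARD:", int(np.sum(alpha < 1e3)))   # expect about 3
```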

Posterior Distribution for VNRVM Parameters

The mean-field approximation to the posterior distribution of the VNRVM parameters and latent variables is given by

$$
Q(X, w, w_0, \alpha, \alpha_0, \tau_{\varepsilon}) = q(X)\,q(w)\,q(w_0)\,q(\alpha)\,q(\alpha_0)\,q(\tau_{\varepsilon})
$$

The optimal solution for $q(X)$ depends on the choice of kernel function and will be discussed shortly. Optimal solutions for the other components of $Q$ were estimated analytically as

$$
\begin{aligned}
q(w) &= \mathcal{N}\!\left(w \mid m_{w}, P_{w}^{-1}\right), &
q(w_0) &= \mathcal{N}\!\left(w_0 \mid m_{w_0}, p_{w_0}^{-1}\right) \\
q(\alpha) &= \prod_{j=1}^{M}\mathrm{Gamma}\!\left(\alpha_j \mid a_{\alpha}, b_{\alpha_j}\right), &
q(\alpha_0) &= \mathrm{Gamma}\!\left(\alpha_0 \mid a_{\alpha}, b_{\alpha_0}\right) \\
q(\tau_{\varepsilon}) &= \mathrm{Gamma}\!\left(\tau_{\varepsilon} \mid c, d\right)
\end{aligned}
$$

where

$$
\begin{aligned}
P_{w} &= \mathrm{diag}(\langle\alpha\rangle) + \langle\tau_{\varepsilon}\rangle\sum_{n=1}^{N}\langle\phi_n\phi_n^{T}\rangle, &
m_{w} &= \langle\tau_{\varepsilon}\rangle P_{w}^{-1}\sum_{n=1}^{N}\langle\phi_n\rangle\left(y_n - \langle w_0\rangle\right) \\
p_{w_0} &= \langle\alpha_0\rangle + N\langle\tau_{\varepsilon}\rangle, &
m_{w_0} &= \langle\tau_{\varepsilon}\rangle p_{w_0}^{-1}\sum_{n=1}^{N}\left(y_n - \langle w\rangle^{T}\langle\phi_n\rangle\right) \\
a_{\alpha} &= a_0 + 1/2, &
b_{\alpha_j} &= b_0 + \langle w_j^{2}\rangle/2; \quad j = 0, \ldots, M \\
c &= c_0 + N/2, &
d &= d_0 + \tfrac{1}{2}\sum_{n=1}^{N}\Big[y_n^{2} + \mathrm{trace}\!\left(\langle w w^{T}\rangle\langle\phi_n\phi_n^{T}\rangle\right) + \langle w_0^{2}\rangle + 2\langle w_0\rangle\langle w\rangle^{T}\langle\phi_n\rangle - 2 y_n\left(\langle w\rangle^{T}\langle\phi_n\rangle + \langle w_0\rangle\right)\Big]
\end{aligned}
$$
and the required moments are estimated by using Eqs. (30) and (31). For the linear VNRVM, i.e., $\phi_n = x_n$, the optimal solution for $q(X)$ has a closed form given by

$$
q(X) = \prod_{n=1}^{N}\mathcal{N}\!\left(x_n \mid m_{x_n}, P_{x_n}^{-1}\right)
\tag{32}
$$

where

$$
P_{x_n} = B_n + \langle\tau_{\varepsilon}\rangle\langle w w^{T}\rangle, \qquad
m_{x_n} = P_{x_n}^{-1}\!\left[\langle\tau_{\varepsilon}\rangle\langle w\rangle\left(y_n - \langle w_0\rangle\right) + B_n \tilde{x}_n\right]
$$
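A compact numerical sketch of these linear-kernel updates follows (again not the authors' code; posterior means replace full moments, and the data and hyperparameters are illustrative):

```python
# Compact sketch of the linear VNRVM updates above (phi_n = x_n); illustrative.
# ARD should drive the weights of irrelevant inputs toward zero.
import numpy as np

rng = np.random.default_rng(3)
N, M = 120, 5
X_true = rng.standard_normal((N, M))
w_true = np.array([1.5, 0.0, -2.0, 0.0, 0.0])           # sparse truth
y = X_true @ w_true + 0.3 + 0.05 * rng.standard_normal(N)
err_sd = rng.uniform(0.2, 0.6, size=(N, M))
X_obs = X_true + rng.standard_normal((N, M)) * err_sd
B = 1.0 / err_sd**2                                     # error precisions B_n

a0 = b0 = c0 = d0 = 1e-4
alpha, alpha0, tau = np.ones(M), 1.0, 1.0
w, w0 = np.zeros(M), 0.0
X = X_obs.copy()

for it in range(200):
    # q(w): P_w = diag(alpha) + tau * sum_n phi_n phi_n^T
    Pw = np.diag(alpha) + tau * (X.T @ X)
    w = tau * np.linalg.solve(Pw, X.T @ (y - w0))
    # q(w0)
    pw0 = alpha0 + N * tau
    w0 = tau * np.sum(y - X @ w) / pw0
    # q(alpha): a = a0 + 1/2, b = b0 + <w_j^2>/2
    alpha = (a0 + 0.5) / (b0 + 0.5 * w**2)
    alpha0 = (a0 + 0.5) / (b0 + 0.5 * w0**2)
    # q(X), Eq. (32): P_x = B_n + tau w w^T
    for n in range(N):
        P = np.diag(B[n]) + tau * np.outer(w, w)
        X[n] = np.linalg.solve(P, tau * w * (y[n] - w0) + B[n] * X_obs[n])
    # q(tau_eps): c = c0 + N/2
    resid = y - X @ w - w0
    tau = (c0 + N / 2) / (d0 + 0.5 * np.sum(resid**2))

print("estimated weights:", np.round(w, 2))             # near zero where w_true is zero
```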
For the RBF kernel, $q(X)$ is given by

$$
\ln q(X) = \left\langle \ln p\!\left(y, X, \tilde{X}, w, w_0, \alpha, \alpha_0, \tau_{\varepsilon}\right)\right\rangle_{-X}
= -\frac{1}{2}\sum_{n=1}^{N}\Big\{\langle\tau_{\varepsilon}\rangle\Big[\mathrm{trace}\!\left(\langle w w^{T}\rangle \phi_n\phi_n^{T}\right) - 2 y_n \langle w\rangle^{T}\phi_n\Big] + x_n^{T} B_n x_n - 2\tilde{x}_n^{T} B_n x_n\Big\} + \mathrm{const}
\tag{33}
$$

where $\langle\cdot\rangle_{-X}$ denotes the expectation with respect to all variables in the model except $X$. The distribution $q(X)$ does not resemble any standard distribution, and hence the required moments for VNRVM have to be approximated. An easy choice is to approximate it by a Gaussian distribution by using either variational or Laplace methods. However, because $q(X)$ is often multimodal, a Gaussian approximation gives poor estimates of the higher-order moments. Stochastic approximation by using the MH algorithm yielded better estimates of the moments and is therefore recommended for use in VNRVM. The jumping distribution was selected to be $\mathcal{N}(x_n \mid \tilde{x}_n, B_n^{-1})$. Thus, for the nonlinear case, VNRVM combines variational and MCMC methods; the usefulness of such a combination has been demonstrated in the context of other statistical models (de Freitas et al. 2001).
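The following sketch illustrates such an MH step for a single latent input $x_n$ (the RBF feature map, centers, and all values below are assumptions for illustration, not the paper's setup), with proposals drawn from the stated jumping distribution and the acceptance ratio corrected for the non-symmetric proposal:

```python
# Sketch of the recommended MH step for the RBF case (illustrative). Proposals
# come from the stated jumping distribution N(x_n | x~_n, B_n^{-1}); the moment
# <w w^T> in Eq. (33) is replaced by its mean for brevity.
import numpy as np

rng = np.random.default_rng(4)

def phi(x, centers, width):
    """RBF feature vector for a single input x (assumed kernel form)."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width**2))

def log_q(x, n, y, x_obs, B, w, w0, tau, centers, width):
    """log q(x_n) up to a constant [Eq. (33) with moments replaced by means]."""
    fit = y[n] - w @ phi(x, centers, width) - w0
    return -0.5 * tau * fit**2 - 0.5 * (x - x_obs[n]) @ (B[n] * (x - x_obs[n]))

def mh_moments(n, y, x_obs, B, w, w0, tau, centers, width, n_samp=500):
    """Independence MH sampler: first two moments of x_n under q(X)."""
    x = x_obs[n].copy()
    lp = log_q(x, n, y, x_obs, B, w, w0, tau, centers, width)
    samples = []
    for _ in range(n_samp):
        prop = x_obs[n] + rng.standard_normal(x.size) / np.sqrt(B[n])
        lp_prop = log_q(prop, n, y, x_obs, B, w, w0, tau, centers, width)
        # correct for the non-symmetric (independence) jumping distribution
        lj_x = -0.5 * (x - x_obs[n]) @ (B[n] * (x - x_obs[n]))
        lj_p = -0.5 * (prop - x_obs[n]) @ (B[n] * (prop - x_obs[n]))
        if np.log(rng.uniform()) < (lp_prop - lp) - (lj_p - lj_x):
            x, lp = prop, lp_prop
        samples.append(x.copy())
    S = np.array(samples)
    return S.mean(axis=0), S.T @ S / len(S)             # <x_n>, <x_n x_n^T>

# tiny usage example with made-up values
N, d, M = 20, 2, 6
x_obs = rng.standard_normal((N, d))
B = np.full((N, d), 4.0)                                # error precision 1/0.5^2
centers, width = rng.standard_normal((M, d)), 1.0
w, w0, tau = rng.standard_normal(M), 0.0, 10.0
y = rng.standard_normal(N)
m1, m2 = mh_moments(0, y, x_obs, B, w, w0, tau, centers, width)
print("posterior mean of x_0:", m1)
```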

References

Ahn, J., and Oh, J. (2003). “A constrained EM algorithm for principal component analysis.” Neural Comput., 15(1), 57–65.
Ambroise, C., and McLachlan, G. J. (2002). “Selection bias in gene extraction on the basis of microarray gene-expression data.” Proc., National Academy of Sciences of the United States of America, 99(10), 6562–6566.
Barnard, J., McCulloch, R., and Meng, X.-L. (2000). “Modeling covariance matrices in terms of standard deviations and correlations, with applications to shrinkage.” Statistica Sinica, 10(4), 1281–1311.
Beckers, J. M., and Rixen, M. (2003). “EOF calculations and data filling from incomplete oceanographic datasets.” J. Atmos. Ocean. Technol., 20(12), 1839–1856.
Bishop, C. M. (2006). “Pattern recognition and machine learning.” Information science and statistics, 1st Ed., Springer, New York.
Bishop, C. M., and Tipping, M. E. (2000). “Variational relevance vector machines.” Proc., 16th Conf. on Uncertainty in Artificial Intelligence, C. Boutilier, and M. Goldszmidt, eds., Morgan Kaufmann, San Francisco, 46–53.
Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective, 2nd Ed., Chapman and Hall/CRC, Boca Raton, FL.
Charney, J. G., and Shukla, J. (1981). “Predictability of monsoons.” Chap. 6, Monsoon dynamics, J. Lighthill, and R. P. Pearce, eds., Cambridge Univ., New York, 99–108.
Chevrier, T. (2007). “Using economic theory to build optimal portfolios.” Ph.D. dissertation, Univ. of Chicago.
Chowdhury, S., and Sharma, A. (2007). “Mitigating parameter bias in hydrological modelling due to uncertainty in covariates.” J. Hydrol. (Amsterdam), 340(3-4), 197–204.
de Freitas, N., Højen-Sørensen, P., Jordan, M. I., and Russell, S. J. (2001). “Variational MCMC.” Proc., 17th Conf. on Uncertainty in Artificial Intelligence, J. S. Breese, and D. Koller, eds., Morgan Kaufmann, San Francisco, 120–127.
Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern classification, 2nd Ed., Wiley Interscience, New York.
Fan, X. (2003). “Two approaches for correcting correlation attenuation caused by measurement error: Implications for research practice.” Educ. Psychol. Meas., 63(6), 915–930.
Folland, C. K., and Parker, D. E. (1995). “Correction of instrumental biases in historical sea surface temperature data.” Q. J. R. Meteorol. Soc., 121(522), 319–367.
Fuller, W. A. (1987). Measurement error models, Wiley, New York.
Gadgil, S., Rajeevan, M., and Nanjundiah, R. S. (2005). “Monsoon prediction—Why yet another failure?” Curr. Sci., 88(9), 1389–1400.
Gelman, A., and Rubin, D. B. (1992). “Inference from iterative simulation using multiple sequences.” Stat. Sci., 7(4), 457–472.
Geman, S., and Geman, D. (1984). “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.” IEEE Trans. Pattern Anal. Mach. Intell., PAMI-6(6), 721–741.
Geweke, J. (1992). “Evaluating the accuracy of sampling-based approaches to calculating posterior moments.” Valencia Int. Meetings on Bayesian Statistics, Bayesian Statistics 4, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, eds., Oxford University, New York, 169–193.
Golub, G. H., and van Loan, C. F. (1996). Matrix computations, 3rd Ed., Johns Hopkins University, Baltimore.
Guyon, I., and Elisseeff, A. (2003). “An introduction to variable and feature selection.” J. Mach. Learn. Res., 3, 1157–1182.
Hastings, W. K. (1970). “Monte Carlo sampling methods using Markov chains and their applications.” Biometrika, 57(1), 97–109.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). “An introduction to variational methods for graphical models.” Mach. Learn., 37(2), 183–233.
Khalil, A., McKee, M., Kemblowski, M., and Asefa, T. (2005). “Sparse Bayesian learning machine for real-time management of reservoir releases.” Water Resour. Res., 41(11), W11401.
Kulathinal, S. B., Kuulasmaa, K., and Gasbarra, D. (2002). “Estimation of an errors-in-variables regression model when the variances of the measurement errors vary between the observations.” Stat. Med., 21(8), 1089–1101.
Kullback, S., and Leibler, R. A. (1951). “On information and sufficiency.” Annals of Mathematical Statistics, 22(1), 79–86.
Kumar, K. K., Rajagopalan, B., Hoerling, M., Bates, G., and Cane, M. (2006). “Unraveling the mystery of Indian monsoon failure during El Niño.” Science, 314(5796), 115–119.
Liechty, J. C., Liechty, M. W., and Muller, P. (2004). “Bayesian correlation estimation.” Biometrika, 91(1), 1–14.
MacKay, D. J. C. (2005). Information theory, inference and learning algorithms, Version 7.2, Cambridge University, U.K., 〈http://www.inference.phy.cam.ac.uk/mackay/itila/book.html〉 (Dec. 7, 2011).
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). “Equation of state calculations by fast computing machines.” J. Chem. Phys., 21(6), 1087–1092.
Pappenberger, F., and Beven, K. J. (2006). “Ignorance is bliss: Or seven reasons not to use uncertainty analysis.” Water Resour. Res., 42(5), W05302.
Parisi, G. (1988). Statistical field theory, Addison-Wesley, New York.
Parthasarathy, B., Munot, A. A., and Kothawale, D. R. (1994). “All-India monthly and seasonal rainfall series: 1871-1993.” Theor. Appl. Climatol., 49(4), 217–224.
Quiñonero Candela, J. (2004). “Learning with uncertainty—Gaussian processes and relevance vector machines.” Ph.D. dissertation, Technical Univ. of Denmark, Kongens Lyngby.
Quiñonero Candela, J., Girard, A., Larsen, J., and Rasmussen, C. E. (2003). “Propagation of uncertainty in Bayesian kernel models—Application to multiple-step ahead forecasting.” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2, IEEE, Hong Kong, 701–704.
Raftery, A. E., and Lewis, S. (1992). “How many iterations in the Gibbs sampler?” Valencia Int. Meetings on Bayesian Statistics, Bayesian Statistics 4, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, eds., Oxford University, U.K., 763–773.
Rayner, N. A., et al. (2006). “Improved analyses of changes and uncertainties in sea surface temperature measured in situ since the mid-nineteenth century: The HadSST2 dataset.” J. Clim., 19(3), 446–469.
Ritter, C., and Tanner, M. A. (1992). “Facilitating the Gibbs sampler: The Gibbs stopper and the Griddy-Gibbs sampler.” J. Am. Stat. Assoc., 87(419), 861–868.
Sahai, A. K., Chattopadhyay, R., and Goswami, B. N. (2008). “A SST based large multi-model ensemble forecasting system for Indian summer monsoon rainfall.” Geophys. Res. Lett., 35, L19705, .
Sanguinetti, G., Milo, M., Rattray, M., and Lawrence, N. D. (2005). “Accounting for probe level noise in principal component analysis of microarray data.” Bioinformatics, 21(19), 3748–3754.
Schisterman, E., Moysich, K., England, L., and Rao, M. (2003). “Estimation of the correlation coefficient using the Bayesian approach and its applications for epidemiologic research.” BMC Med. Res. Methodol., 3(5), 4.
Schneider, T. (2001). “Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values.” J. Clim., 14(5), 853–871.
Shukla, J. (2007). “Atmosphere: Monsoon mysteries.” Science, 318(5848), 204–205.
Tipping, M. E. (2001). “Sparse Bayesian learning and the relevance vector machine.” J. Mach. Learn. Res., 1(3), 211–244.
Tipping, M. E., and Bishop, C. M. (1999). “Mixtures of probabilistic principal component analyzers.” Neural Comput., 11(2), 443–482.
Tootle, G. A., and Piechota, T. C. (2006). “Relationships between Pacific and Atlantic ocean sea surface temperatures and U.S. streamflow variability.” Water Resour. Res., 42(7), W07411.
Tripathi, S., and Govindaraju, R. S. (2007). “On selection of kernel parameters in relevance vector machines for hydrologic applications.” Stoch. Environ. Res. Risk Assess., 21(6), 747–764.
Tripathi, S., and Govindaraju, R. S. (2008). “Engaging uncertainty in hydrologic data sets using principal component analysis: BaNPCA algorithm.” Water Resour. Res., 44, W10409.
Weigend, A. S., Zimmermann, H. G., and Neuneier, R. (1996). “Clearning.” Neural Networks in Financial Engineering, Proc., 3rd Int. Conf. on Neural Networks in the Capital Markets, A.-P. N. Refenes, Y. Abu-Mostafa, J. Moody, and A. S. Weigend, eds., World Scientific, London, 511–522.
Wessels, L. F. A., et al. (2005). “A protocol for building and evaluating predictors of disease state based on microarray data.” Bioinformatics, 21(19), 3755–3762.
Wilks, D. S. (2006). “Statistical methods in the atmospheric sciences: An introduction.” International Geophysics, 2nd Ed., Vol. 91, Elsevier Academic, San Diego.
Yang, R., and Berger, J. O. (1994). “Estimation of a covariance matrix using the reference prior.” Ann. Stat., 22(3), 1195–1211.

Information & Authors

Information

Published In

Journal of Hydrologic Engineering, Volume 16, Issue 12, December 2011
Pages: 970–983

History

Received: Jan 30, 2009
Accepted: May 3, 2010
Published online: May 17, 2010
Published in print: Dec 1, 2011


Authors

Affiliations

Shivam Tripathi, M.ASCE
Graduate Student, School of Civil Engineering, Purdue Univ., West Lafayette, IN 47907.
Rao S. Govindaraju, M.ASCE
Professor, School of Civil Engineering, Purdue Univ., West Lafayette, IN 47907 (corresponding author). E-mail: [email protected]
