Estimating aggregated nutrient fluxes in four Finnish rivers via Gaussian state space models

Reliable estimates of the nutrient fluxes carried by rivers from land‐based sources to the sea are needed for efficient abatement of marine eutrophication. Although nutrient concentrations in rivers generally display large temporal variation, sampling and analysis for nutrients, unlike flow measurements, are rarely performed on a daily basis. The infrequent data calls for ways to reliably estimate the nutrient concentrations of the missing days. Here, we use the Gaussian state space models with daily water flow as a predictor variable to predict missing nutrient concentrations for four agriculturally impacted Finnish rivers. Via simulation of Gaussian state space models, we are able to estimate aggregated yearly phosphorus and nitrogen fluxes, and their confidence intervals.


INTRODUCTION
Abatement of marine eutrophication calls for reliable estimates of the nutrient fluxes carried by rivers from land-based sources to the sea. Monitoring programs of many important rivers in Finland, and elsewhere, typically involves daily measurements of water flow, but due to the costs, much more infrequent sampling and analysis of phosphorus and nitrogen concentrations. Yet, the concentrations of nutrients often show large temporal variation, especially in rivers receiving loading from diffuse sources (Kauppila and Koskiaho, 2003). The more infrequent the water quality data are, the more sensitive the flux estimates are to the method used to estimate the concentrations for the unsampled days. Several interpolation and extrapolation methods have been suggested to estimate missing monitoring data (Young et al., 1988;Rekolainen et al., 1991;Kronvang and Bruhn, 1996;Quilbé et al., 2006). Although many of the methods simply assume that the observation made on a specific day represents the concentration level for a longer period (e.g., between the midpoints of the preceding, current, and next observation), other approaches make use of the relationship between the concentration and some other variable, usually the flow.
Our aim is to develop a method for estimating fluxes of total phosphorus and total nitrogen for rivers mainly impacted by diffuse loading from agriculture for a given period, commonly a year. For prediction of the missing nutrient concentration measurements, we use a time varying regression model with an additional autoregressive component using the water flow measurements as predictor variables. Various simulation techniques are employed for evaluating our results. As a general framework, we use Gaussian state space models together with Kalman filter and smoother.

Interpolation via state space models and simulation
Our approach to modeling nutrient concentrations and fluxes is based on state space modeling with Kalman filtering, smoothing, and interpolation. The form of the Gaussian state space model sufficient for our purposes is NID.0; H / (1) tC1 D Tˇt C Á t ; Á t NID.0; Q/; t D 1; 2; : : : ; T where NID stands for "normally and independently distributed." The first row (1) is called an observation equation and the second row (2) a state equation. The observed process fy t g may be a scalar or vector valued. The unobserved state process fˇt g is often a vector process. The process starts withˇ1 N.b 1 ; P 1 / independently of error processes f t g; fÁ t g. In our application the system matrix T is a time invariant diagonal matrix, whereas the system matrices X t contain time varying predictor values. The state process fˇt g is a latent process of time varying levels and regression coefficients. The model is defined in more detail in Sections 4.1 and 4.2. Further, the covariance matrices H and Q are time invariant. In our application the interpolation problem arises because there are missing observations. Let Y comprise all the non-missing observations. If the value y t at time t is missing, then the Kalman smoother provides its estimate as the conditional mean O y t D X t Ǒ t together with Ǒ t D E.ˇt j Y / and the conditional covariance matrix Var.y t j Y / D S t . The Gaussian assumption then yields which can be used for obtaining prediction error limits. Plainly, the interpolated value is unbiased in the sense that E.y t O y t / D 0. Formula (3) is useful for single missing values. However, our primary interest is a nonlinear compound measure over a time span t C 1; : : : ; t C s of length s (e.g., a calendar year), denoted by where q t is the water flow on the day t , and e y t is the daily nutrient concentration. If we had the values q t and y t measured on each day, then we would have correct nutrient fluxes. Admittedly, this is not exactly true due to the measurement errors, but it would satisfy the practical needs of evaluating the yearly fluxes. In the subsequent analysis, we focus on the effects of missing nutrient measurements compared to the ideal case of having all measurements.
In section 4, we define our model. Under the specified model, we replace the missing values with the estimates that are simply their conditional expectations. Furthermore, to assess their accuracy, we need the conditional variances as well. Formally, we need to determine Although the conditional means are easily estimated by using known results of log-normal variables, the variances are more complicated because of correlations between the smoothed state variables (see Durbin and Koopman (2002, section 4.5)). Therefore, we rely on simulations (see Durbin and Koopman (2002)). Additionally, these simulations allow easy constructions for the prediction intervals, which are analytically intractable, because the distribution of the sum of the log-normal variables cannot be given in a closed form.
For simulating the missing observations conditionally on Y , we simulate realizations . Q ; Q / from their joint conditional distribution p.ˇ; jY /. Then simulated observations are obtained from Q y t D X t Q t C Q t ; t D 1; : : : ; n. As we are simulating conditionally on Y , Q y t D y t if y t is observed, as y t belongs to Y . The simulation from p.ˇ; jY / can be done by augmenting state vectorˇt with disturbance t , similarly as in Durbin and Koopman (2001, p. 131), and by using the simulation smoothing algorithm of Durbin and Koopman (2002) for the augmented state vector. With a large number of replications, the conditional mean (4) and variance (5) are computed naturally as averages. More specifically, let Q y 1j ; : : : ; Q y nj be the j th simulated series, and Then, with N replicates, the conditional expectations and variances are obtained respectively as . Q m s;t;j m t;s / 2 Assuming that the estimated model is true, the accuracy of the yearly total nutrient fluxes can be computed in terms of prediction intervals. The prediction interval with coverage probability 1 2˛is found by taking the r th smallest and the r th largest value among f Q m s;t;j g; j D 1; : : : ; N with r D N˛; denoted as Q m s;t;low and Q m s;t;up . Assuming the estimated parameters true, the required prediction interval is OE Q m s;t;low ; Q m s;t;up The other measure of accuracy is the coefficient of variation All the computations in this paper have been done in R (R Development Core Team, 2012), using the state space modeling package KFAS (Helske, 2012), where the simulation of the state vector is done by using the simulation smoothing with two antithetic variables to reduce the error due to the simulation (Durbin and Koopman, 2002).

Model fitting and evaluation
The unknown parameters of the nutrient concentration model can be estimated by maximum likelihood method, by using the Kalman filter for computing the log-likelihood of the model. The Kalman filter updating formulas yield us the predicted state b tC1 D E.ˇt C1 j y 1 ; : : : ; y t /, the prediction X tC1 b tC1 for y tC1 , the prediction error v tC1 D y tC1 X tC1 b tC1 , and the prediction error variance (or the covariance matrix in multivariate case) Var.v t / D F t .
The log-likelihood of a linear Gaussian state space model can be written in terms of prediction errors and their covariance matrices, which in applications, depend on unknown parameters. Let us denote the parameter vector by , and let v t; and F t; be the prediction errors and their covariance matrices under . Then the likelihood is given by where p is the dimension of y t . The non-stationary part of the state vector is initialized by the diffuse method suggested by (Durbin and Koopman, 2001), whereas the stationary components are assumed to have a stationary distribution at start. When the series fy t g is multivariate, we transform it into a univariate form as in Durbin and Koopman (2001). This enables us to treat totally and partially missing values automatically as well as automatically adjust the likelihood correctly. The effect of model uncertainty, comprising parameter uncertainty and the uncertainty due to model choice, is evaluated by removing k nutrient measurement vectors from the dataset. The model is then fitted to the thinned data. Let f k be the total nutrient flux of the removed days, and b f k is the corresponding figure estimated using the thinned dataset. The relative error due to thinning is then . b f k f k /=f k . Assuming the model is true and ignoring the parameter estimation error, the difference e k D b f k f k has mean zero. Furthermore, with larger k, the average error per day e k =k tends to be smaller. The same is true also for the relative error e k =f k D .e k =k/=.f k =k/. Therefore, if we plot the absolute relative errors je k j=f k on the thinning size k, we expect to see a decreasing curve, given our model is true. However, if these values remain more or less constant or are increasing, then our model is severely biased.
The overall effect of thinning is assessed through a Monte Carlo experiment. We remove randomly k nutrient measurement vectors and compute the mean relative error where B denotes the number of random replicates and i refers to i th replicate.

DATA
Our data consist of the concentrations of total phosphorus and total nitrogen, and daily water flow measurements from four rivers located in southern Finland, Paimionjoki, Aurajoki, Porvoonjoki, and Vantaanjoki, during 1985-2010. The nutrient data are taken from the databases of the Finnish Environment Institute. Total phosphorus and total nitrogen concentrations have been determined spectrometrically from water samples after digestion with peroxodisulphate. The catchments of these rivers all have a high proportion of the agricultural land (24-43%, Table 1), and the soil is dominated by clay, which renders the water turbid. Much of the phosphorus in these rivers is transported in association with eroded soil particles. In addition, the catchments contain only few lakes (lake percentage 0.30-2.6), which results in high day to day variation in flow. In all the rivers, agriculture is the major source of nutrients. At the beginning of our observation period, the Porvoonjoki has received substantial waste-water loading from the city of Lahti, but due to improved treatment the share of waste-water to total loading has decreased with time, to an average 12% of the anthropogenic loading. In the Vantaanjoki, the respective proportion of waste-water loading is 6.3%, whereas in the other two rivers, it is below 1%.
Daily measurements on nutrient concentrations are available for only 5-10% of the time, whereas flow measurements are usually available for each day.  (1) and (2) with all matrices X t and T being identity matrices. The model is called a local level model, for example, see Harvey (1989). Amisigo and van de Giesen (2005) have used a similar model to patch gaps in daily riverflow series.

Relating nutrient concentration and river flow
It can be argued, as has been done by Wartiovaara (1975) and Rankinen et al. (2010), that the high water flow due to the precipitation has two opposite effects on the nutrient loadings. Precipitation increases the diffuse loading from the agriculture while simultaneously diluting waste-water loading. We have tried to take both these aspects into account. In Figure 1, we have plotted the concentrations on the flow, both in logarithms, but due to zero values, we have first added one to the flow values. To address both of the mutually opposing effects caused by precipitation-induced high flows, we have decided to regress the log-concentration y t on both log.1 C q t /, and 1= log.2 C q t /. In the latter, we have added two to ensure a finite value. Figure 1 includes also some regression curves: the loess curves of first degree (Cleveland and Devlin, 1988), and the ordinary least squares regression of y t onˇ0 Cˇ1 log.1 C q t / and onˇ0 Cˇ1 log.1 C q t / Cˇ2= log.2 C q t /.
By visual inspection, the relation between the concentration and the flow seems to be linear or slightly curved in a log scale. Moreover, the loess curve and the regression curve from model with two predictor variables are quite close to each other, whereas the regression line from model with one predictor lies apart, especially for nitrogen measurements. Therefore, in some cases, it seems clearly beneficial to include both x 1;t D log.1 C q t / and x 2;t D 1= log.2 C q t / D 1= log .1 C e x 1t / as the predictor variables in the model. To treat all series equally, both predictors are present in each model. Note that this visual inspection with regression and loess curves is about finding the proper relationship between concentration and flow, and it ignores the time aspect of the problem which, as we will later see, is an important part of the modeling.

State space specification
As the phosphorus and nitrogen concentration measurements are correlated, we model them together but separately for each river. The model applied to each river is of the form where y P t ; y N t Á is a bivariate process of the logarithms of phosphorus and nitrogen concentrations, respectively;ˇt consists of all coef-ficientsˇi j;t , i D P; N; j D 1; 2, and˛t consists of zero-mean first-order autoregressive processes˛P t and˛N t with T D diagOE P ; N containing the corresponding autoregressive parameters. The disturbance processes are independent of each other. For simplicity, † Á is assumed to be a diagonal matrix. When the diagonal elements are positive, the regression coefficients vary according to a random walk allowing the dependence between the flow and the nutrient concentration to change in time. Note that the model collapses to an ordinary regression model when † Á D 0 and T D 0 (i.e., P D N D 0). The first restriction means that the regression coefficientsˇt are constants. The second one implies that level processes˛P t ;˛N t are white noise processes merged into the errors P t and N t , respectively. Zero variances for the components of coefficient processˇt are sometimes obtained. The state space modeling automatically handles the zero variances in the covariance matrices so that the time invariant regression coefficients coincide with the appropriate generalized least squares estimates. Also, the simulation algorithm is capable of handling the constant states without modifications.
The long-term seasonal weather conditions such as the starting times of snowmelt and autumn rains, as well as the short-term weather conditions such as daily temperature or precipitation also affect concentrations. We assume here that their effects come mainly through flow. In addition, we assume that other environmental effects are mostly captured by the latent autoregressive level processes and coefficient processes of the flow series. We deliberately aim at a parsimonious model with practical formulas for the interpolation of the nutrient fluxes, although the true phenomena behind the variation of nutrient concentrations are obviously more complicated than our model suggests.

Estimated nutrient fluxes and model parameters
The yearly estimates of the nutrient fluxes obtained by simulating the model are given in Table A.1 in the Appendix. Yearly estimates of nutrient fluxes with their simulated 95% prediction intervals are also shown in Figure 2. Each river exhibits a similar fluctuating patterns without a clear trend. Especially yearly phosphorus fluxes, but also nitrogen fluxes clearly peak in 2008, followed by an even larger drop in 2009. Overall, fewer nutrient measurements result in somewhat wider prediction intervals for Porvoonjoki and Vantaanjoki than for Paimionjoki and Aurajoki.
The estimated values of the unknown variance and autoregressive parameters are shown in Table 2. Occasionally, the estimation process yields the variance estimates close to zero (i.e., values less than 10 8 ). In such cases, these are replaced with fixed zeros, and estimation process for other parameters is repeated. In all cases, the likelihood remained practically unchanged. Standard errors of the estimates are computed by inverting the Hessian matrix given by the optimization function optim in R. The variance parameters are estimated in logarithmic scale. By using their standard errors, the confidence intervals for the log-variances can be obtained. Then the confidence intervals for the variances themselves are easily derived.
In all models, the values of autoregressive parameters are close to one (0:95 0:98), and therefore the standard errors might not be very useful, as the sampling distributions are far from normal distribution. The correlations P ; N between the disturbances of the autoregressive processes are around 0.5 for all rivers. This indicates moderate long-term correlation between the underlying phosphorus and nitrogen concentration processes at a given flow level. The instantaneous correlations P ; N , again given the flow, are smaller and more variable: 0.2 or slightly higher in the Paimionjoki and Porvoonjoki, and negligible in the Aurajoki and Vantaanjoki.
The coefficient processes are shown in Figure 3. Somewhat larger regression coefficients of the reciprocal log-flow of the Porvoonjoki and Vantaanjoki compared with those of Paimionjoki and Aurajoki are in concordance with the fact that the former rivers are subject to higher waste-water loads. Otherwise, the interpretation of the regression coefficient processes is difficult. Nevertheless, as predictive tools, individual river-specific models appear to be highly useful.

Model criticism
We have also tested models where the autoregressive processes have been replaced with random walks (i.e., P D N D 1) and a multivariate local level model without regressors, but where the concentration processes are augmented with water flow. In addition, we also tested the ordinary multivariate regression model. All these models yield large autocorrelations of the standardized residuals, and in case of time-varying models, there is a clear inverse relationship between the size of residuals and observed concentration. These apparent violations are avoided by using the model (7). However, even despite obvious violations of model assumptions, yearly estimates of the nutrient fluxes from different time varying models have very similar coefficients of variation with deviations being usually less than one percentage point.  In the case of the ordinary regression model, the coefficients of variation are often substantially smaller. In Figure 4, the coefficients of variation are plotted against the yearly sample sizes of the concentration measurements. The coefficients of variation from the model (7) depend on the yearly sample sizes, whereas results from the ordinary regression model are overoptimistic and counterintuitive: uncertainty in the yearly flux estimate is independent from the amount of measurements in a given year. Both models use the daily water flow for the prediction of the missing concentration measurements, but the ordinary regression is immune to the time order of the measurements, and only the total number of measurements is important. However, we acknowledge that because yearly flux estimates are always conditioned on the model, all models underestimate the true errors of yearly flux estimates, and none of the models considered is "true".
The quantile-to-quantile plots of the standardized residuals of the models reveal heavier tails compared with the normal distribution. This would be problematic if the interest is on the daily values, but because we are interested in yearly values, we believe that the possible non-normality is not critical here. This is because the yearly measure of nutrient fluxes is a sum, which tends to be more normal than its components by the central limit theorem. For evaluating the effects of non-normality, we have made a simulation experiment where the errors t are a random sample from a heavy-tailed bivariate t -distribution with three degrees of freedom scaled to have Var. t / D † . New values representing the concentration measurements, on the same days as the true ones, are then simulated from the model with the estimated parameters. By using these simulated measurements, we fit our proposed model (under Gaussian assumptions), and we computed the coefficients of variation for the yearly fluxes. The coefficients of variation from the simulation are, on the average, within one percentage point of those obtained from the actual dataset thus displaying the negligible effect of non-normality.
The main purpose of our model is to estimate the yearly nutrient flux. To this end, we developed the thinning experiment explained at the end of section 2.2. We have made five experiments by randomly removing 10%, 20%, 30%, 40%, and 50% from the concentration values. The resulting relative errors (6) are reported in Table A.2 in the Appendix. The number of simulations is B D 2000, and each time, the parameters are re-estimated. If the model is correct, we expect a decreasing trend, and this is mostly what we observe. The loss of relative accuracy with 30% thinning is about 5% or less. However, the mean absolute errors MAE k D P B iD1 j b f i;k f i;k j=B increase rapidly as expected when thinning is increased (Table A.3 in the Appendix). When measuring bias using average errors AE k D P B iD1 . b f i;k f i;k /=B (Table A.4 in the Appendix), the total phosphorus flux is underestimated in all rivers, whereas the total nitrogen flux is usually overestimated, except for Aurajoki, where the nitrogen flux is underestimated. Overall, the results suggest that our model performs well enough for practical  1985 1989 1993 1997 2001 2005 2009 Smoothed value of the coefficient process Year  The average errors show that the ordinary regression model overestimates the nitrogen fluxes more than our model, whereas the bias of phosphorus fluxes is slightly smaller. Finally, we note that removing predictor 1= log.2 C q t / from the final model (7) always worsens the model performance compared with including it.

DISCUSSION
We have used Gaussian state space models with partially sparse data for modeling the yearly nutrient fluxes of four rivers running through catchments dominated by agricultural land use. The large proportion of "missing" daily nutrient concentration measurements for corresponding daily flow measurements increased the uncertainty regarding the model selection, parameter estimation, and prediction, thus encouraging the use of models with simple structure and large flexibility. During the observational period covered by this study, Finnish agricultural farmlands experienced a substantial decrease in phosphorus and nitrogen balance OECD (2012). Despite this drastic decrease in nutrient balance, we did not observe any corresponding trends in nutrient fluxes over the last 25 years for any of the four rivers examined here. Greatly reduced nutrient balances do not always lead to concurrent reduction in riverine nutrient fluxes, for example, due to high nutrient reserves in soil and groundwater (e.g., Stålnacke et al. (2004)). Moreover, although nutrient balances form a crucial indicator of the risk of nutrient losses from agriculture, changes in other agricultural practices or in climate may have had an opposite effect on the load (Ekholm et al., 2007).
While we have reported results when the daily water flow is only predictor variable, we have also augmented the model with locally important variables such as daily air temperature, precipitation, and several functions of these. To examine the possible effect of large scale climate patterns, we have also used the North Atlantic Oscillation and Arctic Oscillation indices in combination with flow. Additions of variables operating at either small (temperature and precipitation) or large scales (North Atlantic Oscillation or Arctic Oscillation) did not improve results for any of the models we used.
Many studies examining nutrient dynamics of rivers have stated the need for extensive datasets to be able to make precise statements on the nutrient flux (e.g., Rekolainen et al. (1991)). Although we are conscious that the thinning of an originally sparse data by half can include possible computational caveats and thus may lead to artifacts, our results seem to indicate that when daily flow data are available, relatively sparse data on nutrient concentrations can be used to estimate yearly fluxes. If the aim of monitoring is to assess yearly fluxes of principal nutrients from agriculturally dominated watersheds to receiving downstream locations (e.g., the sea), our findings imply the potential to lower the frequency of water quality (i.e., nutrient) sampling intensities for rivers with permanent gauging stations and long-term records of flow. It should be noted that these concentration measurements could be used for other types of analysis as well, where the number of samples cannot not be reduced.  (7) and for the ordinary least squares regression model with two predictors (marked by )