Autoregressive

It is well known that the so-called plug-in prediction intervals for autoregressive processes with Gaussian disturbances are too narrow, i.e. the coverage probabilities fall below the nominal ones. However, simulation experiments show that the formulas borrowed from ordinary linear regression theory yield one-step prediction intervals whose coverage probabilities are very close to what is claimed. From a Bayesian point of view the resulting intervals are posterior predictive intervals when uniform priors are assumed for both the autoregressive coefficients and the logarithm of the disturbance variance. This finding opens the way to treating multi-step prediction intervals, which are obtained easily by simulation either directly from the posterior distribution or using importance sampling. A notable improvement is gained in frequentist coverage probabilities. As an application, the method is used to forecast the annual gross domestic product growth in the United Kingdom and Spain for the period 2002–2011, using the estimation period 1962–2001.


Introduction
A traditional approach to time series forecasting usually involves the selection of a family of suitable models, e.g. the class of autoregressive integrated moving average (ARIMA) models. Then, using model selection criteria based on the autocorrelation and partial autocorrelation functions together with formal criteria such as the Akaike or Bayesian information criterion, the analyst chooses a suitable "best fitting" representative from the model family, estimates the parameters, makes diagnostic checks, and, if satisfied with the choice, computes predictions and prediction intervals. The prediction intervals are usually computed as if the chosen model were correct and the parameters completely known, with no reference to the process of model selection. Chatfield (1993, 1996) has strongly criticized the omission of model uncertainty in forecasting. Clements and Hendry (1999, sections 1.3.8 and 2.2) introduce a detailed taxonomy of forecast errors, and stress the effects of structural breaks in time series forecasting. As a remedy they propose robustifying forecasts, for example by differencing and intercept corrections.
It is the common view of the references given in the previous paragraph that parameter uncertainty is often a minor source of prediction errors in practical applications when the sample size is large enough. Clements and Hendry (1999, p. 128) remark that although parameter uncertainty is unlikely to lead to serious forecast failure, it may have a larger effect in conjunction with model misspecification. Nevertheless, we believe that it is justified to handle this part of the model uncertainty as well. In textbooks it is a common topic; see for example Harvey (1993, pp. 58–59). Here we show how to make corrections in a fairly simple way under autoregressive (AR) models.
Several proposals have been made for improving prediction intervals when parameters are estimated. One group of solutions focuses on finding a more accurate prediction mean squared error in the presence of estimation; see, for example, Phillips (1979), Fuller and Hasza (1981), Ansley and Kohn (1986), Quenneville and Singh (2000), and Pfeffermann and Tiller (2005). Both analytic and bootstrap approaches have been tried. Barndorff-Nielsen and Cox (1996) give general results for prediction intervals in the presence of estimated parameters. These results are further developed for time series models by Vidoni (2004, 2009). Unfortunately, fairly complicated expressions appear already in rather simple models. Bootstrap solutions are given by several authors; see for example Beran (1990), Masarotto (1990), Grigoletto (1998), Kim (2004), Pascual, Romo, and Ruiz (2004), Clements and Kim (2007), Kabaila and Syuhada (2008), and Rodriguez and Ruiz (2009).
Our aim is to find solutions to prediction interval problems from a Bayesian viewpoint, with a requirement of objectivity for our predictions. Berger (2006) strongly encourages the use of the term "objective Bayes" in such situations. Operationally, "objectivity" in our application means adopting priors which produce prediction intervals exhibiting approximately the same coverage probabilities in both the Bayesian and the frequentist sense.
There is a vast literature on matching posterior and frequentist inferences to some degree of approximation. Methods based on invariance arguments, on information criteria or divergence measures, as well as on asymptotic expansions have been tried. The introduction of Fraser et al. (2010) gives a short review of these, as well as a list of relevant references to their own work and that of others. The starting point of their approach is to replace p value computations in the sample space with posterior integration over the parameter space. Matching the Bayesian and frequentist coverage probabilities of prediction intervals in regression models with independent cases is treated in Datta and Mukerjee (2003) and Datta and Mukerjee (2004). A common feature of all these approaches is that the so-called Jeffreys's prior, or a modification of it, often shows up as a solution; as we will see, this happens in this article as well. In a recent article by Arellano and Bonhomme (2009), where the main issue is bias reduction in panel data models, the authors use a prior related to Jeffreys's prior. A predecessor in bias reduction with the help of Jeffreys's prior is Firth (1993).
Early examples using the Bayesian approach in autoregressive models are given by Zellner (1971, p. 188) and Chow (1974). Later, Broemeling and Land (1984) showed, among other things, that with a normal-gamma prior for the parameters the one-step ahead prediction follows a t distribution (understood as a location-scale family). They further deduced that all predictions up to k steps ahead have a joint predictive density which is the product of k univariate t densities. Thompson and Miller (1986) propose a simulation method for computing prediction intervals based on Bayesian arguments, called "sampling the future". This differs from our approach: we compute the prediction interval by simulating the posterior prediction probability directly, which is more accurate and considerably less time consuming. Liu (1994) develops an approximation to their method, and Snyder, Ord, and Koehler (2001) make an approximate extension of it to ARIMA models. They also compare several types of prediction intervals in terms of frequentist coverage probabilities.
After the emergence of Markov chain Monte Carlo simulation, the Bayesian paradigm has gained increasing popularity in time series analysis and econometrics; see for example Geweke (2005) and Prado and West (2010) and the references therein. A thorough exposition of Bayesian forecasting is given by Geweke and Whiteman (2006). Nevertheless, for forecasting under an AR model, either direct simulation or importance sampling is fast and sufficient. Being based on independent simulation replicates, it also yields rather simple formulas for assessing the Monte Carlo simulation error.
The conditional maximum likelihood estimates for β coincide with the least squares estimates, and the maximum likelihood estimate for σ², corrected by the degrees of freedom, is
$$\hat\sigma^2 = \frac{1}{n-p-1}\sum_{t=1}^{n}\big(y_t - x_t'\hat\beta\big)^2. \qquad (2.4)$$
The predictive value for $y_{n+1}$ is $x_{n+1}'\hat\beta$, and the standard prediction interval with approximate coverage probability $1-2\alpha$ is
$$\big(x_{n+1}'\hat\beta + z_\alpha\hat\sigma,\; x_{n+1}'\hat\beta + z_{1-\alpha}\hat\sigma\big), \qquad (2.5)$$
where $z_\alpha$ is the α quantile of the standard normal distribution. In practice, the true coverage may be considerably below the nominal value $1-2\alpha$.
Let us suppose for a moment that we have an ordinary regression model with some truly exogenous variables. Then, using the same notation $X$ and $x_{n+1}$ as before, exact coverage probability $1-2\alpha$ is obtained by the interval
$$x_{n+1}'\hat\beta \pm t_{1-\alpha}(n-p-1)\,\hat\sigma\sqrt{1 + x_{n+1}'(X'X)^{-1}x_{n+1}}. \qquad (2.6)$$
As is well known, the extra factor here compared to (2.5) takes into account the estimation error in the regression coefficients. In addition, a minor correction is made by replacing the normal quantile with that from Student's t with n − p − 1 degrees of freedom. Although AR models do not satisfy the assumptions leading to (2.6), our simulations show that the very same intervals (2.6) have practically correct coverage probabilities also under AR models.
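The two one-step intervals are easy to compare numerically. The following sketch fits an AR(p) model by least squares and returns both the plug-in interval (2.5) and the regression-type interval (2.6). It is an illustration only: the paper's own computations are in R, this version uses Python, and all function names are ours.

```python
import numpy as np
from scipy import stats

def ar_design(y, p):
    """Rows x_t = (1, y_{t-1}, ..., y_{t-p}); response z_t = y_t."""
    n = len(y) - p
    X = np.column_stack(
        [np.ones(n)] + [y[p - j:p - j + n] for j in range(1, p + 1)])
    return X, y[p:]

def one_step_intervals(y, p=1, alpha=0.05):
    """Plug-in interval (2.5) and regression-type interval (2.6)."""
    y = np.asarray(y, float)
    X, z = ar_design(y, p)
    n, q = X.shape                                   # q = p + 1 parameters
    beta = np.linalg.lstsq(X, z, rcond=None)[0]      # least squares = cond. ML
    s = np.sqrt(np.sum((z - X @ beta) ** 2) / (n - q))   # sigma-hat of (2.4)
    x_new = np.concatenate(([1.0], y[-1:-p - 1:-1]))     # (1, y_n, ..., y_{n-p+1})
    yhat = x_new @ beta
    z_q = stats.norm.ppf(1 - alpha)
    plug_in = (yhat - z_q * s, yhat + z_q * s)
    # widening factor sqrt(1 + x'(X'X)^{-1} x) and Student t quantile
    f = np.sqrt(1.0 + x_new @ np.linalg.solve(X.T @ X, x_new))
    t_q = stats.t.ppf(1 - alpha, n - q)
    regression = (yhat - t_q * s * f, yhat + t_q * s * f)
    return plug_in, regression
```

On any data set the regression-type interval strictly contains the plug-in one, which is precisely the direction of the correction described in the text.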
Combining formulas (2.2)–(2.4) with the identity
$$\sum_{t=1}^{n}(y_t - x_t'\beta)^2 = (n-p-1)\hat\sigma^2 + (\beta-\hat\beta)'X'X(\beta-\hat\beta),$$
the conditional likelihood function (2.2) can be written as
$$L_c(\beta,\sigma) \propto \sigma^{-n}\exp\Big(-\frac{(n-p-1)\hat\sigma^2 + (\beta-\hat\beta)'X'X(\beta-\hat\beta)}{2\sigma^2}\Big). \qquad (2.7)$$
Until now we have followed the frequentist approach. But it is also illuminating to view the prediction interval (2.6) from a Bayesian point of view, where β and σ are treated as random. If we multiply the likelihood $L_c(\beta, \sigma)$ by the improper prior $p(\beta, \sigma) = 1/\sigma$, we find, as is well known from ordinary regression, that a posteriori
$$\frac{(n-p-1)\hat\sigma^2}{\sigma^2}\,\Big|\,Y \sim \chi^2(n-p-1), \qquad \beta \,\big|\, \sigma, Y \sim N\big(\hat\beta,\; \sigma^2(X'X)^{-1}\big). \qquad (2.8)$$
Together these lead to the result that the interval (2.6) is a Bayesian predictive interval with exact coverage probability 1 − 2α. The corresponding frequentist coverage probability is, however, only approximately 1 − 2α (2.9). In section 5 we will find that the approximation (2.9) is very good when the coefficients are well within the stationarity region. The approximation seems to be worst in the nearly unit root cases. The explanation is likely as follows. If the coefficients are not too close to the boundary of the stationarity region, the sampling distribution of $\hat\beta$ is approximately as in (2.8) with the roles of β and $\hat\beta$ interchanged; in the nearly unit root case this is not true. So the question arises what a "good" prior is in AR models in general. It turns out that a certain modification of Jeffreys's prior seems preferable to the uniform prior in nearly unit root models. One solution to the nearly unit root problem is provided by Clements and Hendry (1999, p. 92), who recommend in some cases imposing a unit root even when it is not warranted by a unit root test.
At this point it may be interesting to recall the vigorous debate that broke out in the early 1990s between Peter Phillips and Christopher Sims on the value of unit root econometrics as such and on its relation to the Bayesian approach in statistics and econometrics; see Sims and Uhlig (1991), Phillips (1991), Sims (1991) and their references. Unit root inference is not an issue here. The Bayesian controversy focused on the choice of an appropriate prior distribution. Nevertheless, we do not want to revive this controversy, simply because we are not so much interested in inferences on the autoregressive coefficients themselves as in prediction intervals. The role of the prior in this article is to produce a good weighting scheme for the predictive distribution.

Uniform prior
In this section we develop prediction formulas under the Bayesian paradigm, employing the noninformative uniform prior for (β, log σ), i.e. the prior takes the form p(β, σ) = 1/σ for β ∈ R^{p+1} and σ > 0. We have already seen that when k = 1 the predictive distribution is Student's t. But for k > 1 the predictive distribution is not any common known distribution. Therefore we have to rely on simulation.
We use the generic notation p(·) for a density. The Bayesian predictive density of $y_{n+k}$ is defined as the conditional density of $y_{n+k}$ given Y. It is obtained by integration as
$$p(y_{n+k} \mid Y) = \int p(y_{n+k} \mid \beta, \sigma, Y)\, p(\beta, \sigma \mid Y)\, d\beta\, d\sigma. \qquad (3.1)$$
It is also plain from (2.1) that for some constants $\psi_1, \psi_2, \ldots$ depending on β we have
$$y_{n+k} = \mathrm{E}(y_{n+k} \mid \beta, \sigma, Y) + \sum_{j=0}^{k-1}\psi_j\,\epsilon_{n+k-j}.$$
Box et al. (2008) give recursions for the coefficients: $\psi_j = \beta_1\psi_{j-1} + \cdots + \beta_p\psi_{j-p}$, with $\psi_0 = 1$ and $\psi_j = 0$ for $j < 0$.
We find $\psi_1 = \beta_1$, $\psi_2 = \beta_1^2 + \beta_2$, and so on. The prediction error variance, given β and σ², is then
$$\mathrm{var}(y_{n+k} \mid \beta, \sigma, Y) = \sigma^2\sum_{j=0}^{k-1}\psi_j^2.$$
Combining this with (3.1) and changing the order of integration, we find
$$P(y_{n+k} \le b \mid Y) = \int \Phi\Big(\frac{b - \mathrm{E}(y_{n+k} \mid \beta, \sigma, Y)}{\sigma\sqrt{\sum_{j=0}^{k-1}\psi_j^2}}\Big)\, p(\beta, \sigma \mid Y)\, d\beta\, d\sigma, \qquad (3.2)$$
where Φ is the cumulative distribution function of the standard normal distribution. When k > 1 we cannot do the integration analytically. Nevertheless, a Monte Carlo solution is straightforward. We can proceed as follows:
1. Draw independent $q_i$, $i = 1, \ldots, N$ from $\chi^2(n-p-1)$, and let $\sigma_i^2 = (n-p-1)\hat\sigma^2/q_i$.
2. Draw $\beta_i$ from $N(\hat\beta, \sigma_i^2(X'X)^{-1})$, in accordance with (2.8).
3. Compute the average
$$P_N(b) = \frac{1}{N}\sum_{i=1}^{N}\Phi\Big(\frac{b - \mathrm{E}(y_{n+k} \mid \beta_i, \sigma_i, Y)}{\sigma_i\sqrt{\sum_j \psi_j^2(\beta_i)}}\Big). \qquad (3.3)$$
The prediction interval is then found by solving separately $P_N(b) = \alpha$ and $P_N(b) = 1-\alpha$. Let the solutions be $\hat b_\alpha$ and $\hat b_{1-\alpha}$, respectively. Then $(\hat b_\alpha, \hat b_{1-\alpha})$ is the prediction interval with posterior coverage probability $1-2\alpha$ when N is large. Broemeling and Land (1984) noted that multi-step predictions can be constructed through a sequence of one-step predictions, each step involving a t distribution. However, they do not suggest any computational method for utilizing this in practice; Thompson and Miller (1986) instead simulate complete future sample paths, "sampling the future".
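The simulation scheme above can be sketched in code. The following Python function (again a sketch under our own naming; the paper's computations are in R) draws $(\sigma_i, \beta_i)$ from (2.8), forms the average of step 3, and solves $P_N(b)=\alpha$ by bisection, which is valid because $P_N$ is increasing in b:

```python
import numpy as np
from scipy import stats

def psi_weights(lags, k):
    """psi_0..psi_{k-1} from psi_j = beta_1 psi_{j-1} + ... + beta_p psi_{j-p}."""
    psi = np.zeros(k)
    psi[0] = 1.0
    for j in range(1, k):
        for l in range(1, min(j, len(lags)) + 1):
            psi[j] += lags[l - 1] * psi[j - l]
    return psi

def bayes_interval(y, p, k, alpha=0.05, N=2000, seed=1):
    """k-step interval with posterior coverage 1 - 2*alpha, uniform prior."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    n = len(y) - p
    X = np.column_stack(
        [np.ones(n)] + [y[p - j:p - j + n] for j in range(1, p + 1)])
    z = y[p:]
    beta_hat = np.linalg.lstsq(X, z, rcond=None)[0]
    df = n - p - 1
    s2 = np.sum((z - X @ beta_hat) ** 2) / df
    chol = np.linalg.cholesky(np.linalg.inv(X.T @ X))
    # step 1: sigma_i^2 = (n-p-1) * sigma-hat^2 / q_i, q_i ~ chi^2(n-p-1)
    sig2 = df * s2 / rng.chisquare(df, N)
    # step 2: beta_i ~ N(beta_hat, sigma_i^2 (X'X)^{-1}), cf. (2.8)
    betas = beta_hat + np.sqrt(sig2)[:, None] * (
        rng.standard_normal((N, p + 1)) @ chol.T)
    # step 3: conditional k-step mean and sd for each draw
    means, sds = np.empty(N), np.empty(N)
    for i in range(N):
        b, hist = betas[i], list(y[-p:])
        for _ in range(k):
            m = b[0] + np.dot(b[1:], hist[::-1])   # E(y_{n+j} | beta, Y)
            hist = (hist + [m])[-p:]
        means[i] = m
        sds[i] = np.sqrt(sig2[i] * np.sum(psi_weights(b[1:], k) ** 2))
    def P_N(bnd):                                  # the averaged normal cdf
        return np.mean(stats.norm.cdf((bnd - means) / sds))
    def solve(target):                             # P_N is increasing: bisection
        lo = means.min() - 10 * sds.max()
        hi = means.max() + 10 * sds.max()
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if P_N(mid) < target else (lo, mid)
        return 0.5 * (lo + hi)
    return solve(alpha), solve(1 - alpha)
```

For a stationary series the interval widens with the horizon k, as the prediction error variance accumulates through the ψ weights.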

Prediction With General Prior
Combine the conditional likelihood (2.2) with a general prior of the form p(β, σ)/σ. Then by (2.7) and (2.8) the joint posterior is proportional to the uniform-prior posterior (2.8) multiplied by p(β, σ). It can be evaluated using importance sampling. In the simulation algorithm only step 3 needs to be changed, to
3'. Compute the weighted average
$$P_N(b) = \frac{\sum_{i=1}^{N} w_i\,\Phi\big((b - \mathrm{E}(y_{n+k} \mid \beta_i, \sigma_i, Y))/(\sigma_i\sqrt{\textstyle\sum_j \psi_j^2(\beta_i)})\big)}{\sum_{i=1}^{N} w_i}, \qquad (3.4)$$
with weights
$$w_i = p(\beta_i, \sigma_i). \qquad (3.5)$$
The prediction interval is then solved as before.
It is also possible to incorporate the starting values $y_0 = (y_{-p+1}, \ldots, y_0)'$ into the likelihood. Denote $\mathrm{cov}(y_0 \mid \beta, \sigma) = \sigma^2 V$. The matrix V can be obtained from the Yule–Walker equations; see Box et al. (2008, p. 58). To obtain the full likelihood, the conditional likelihood (2.2) is simply multiplied by
$$p(y_0 \mid \beta, \sigma) \propto \sigma^{-p}\,|V|^{-1/2}\exp\Big(-\frac{(y_0 - \mu\mathbf{1})'V^{-1}(y_0 - \mu\mathbf{1})}{2\sigma^2}\Big),$$
where $\mu = \beta_0/(1 - \beta_1 - \cdots - \beta_p)$ is the stationary mean. This leads to changing the weights in step 3'. The new weights are
$$w_i = p(y_0 \mid \beta_i, \sigma_i)\, p(\beta_i, \sigma_i)\, I(\beta_i \in S_p), \qquad (3.6)$$
where I(·) is an indicator and $S_p$ is the stationarity region of the AR(p) process. The first coordinate of $\beta_i$ (the intercept) can, of course, take any real value.
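The two ingredients of the modified step 3' are easy to sketch (our helper names, not the paper's): the stationarity indicator can be checked through the eigenvalues of the companion matrix of the lag coefficients, and the weighted average replaces the plain mean of step 3.

```python
import numpy as np
from scipy.stats import norm

def in_stationarity_region(lags):
    """I(beta in S_p): all companion-matrix eigenvalues inside the unit circle."""
    lags = np.atleast_1d(np.asarray(lags, float))
    p = len(lags)
    if p == 1:
        return abs(lags[0]) < 1.0
    C = np.zeros((p, p))
    C[0, :] = lags               # first row holds beta_1, ..., beta_p
    C[1:, :-1] = np.eye(p - 1)   # subdiagonal identity shifts the state
    return np.max(np.abs(np.linalg.eigvals(C))) < 1.0

def weighted_P_N(b, means, sds, weights):
    """Importance-sampling average: sum_i w_i Phi(.) / sum_i w_i."""
    w = np.asarray(weights, float)
    return float(np.sum(w * norm.cdf((b - means) / sds)) / np.sum(w))
```

With all weights equal to one, `weighted_P_N` reduces to the plain average of step 3, so the uniform-prior algorithm is recovered as a special case.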

Standard Errors for Monte Carlo Prediction Limits
Here we give simple formulas for approximate standard errors when computing the prediction limits by Monte Carlo simulation. Let $b_\alpha$ be such that $P(y_{n+k} \le b_\alpha \mid Y) = \alpha$ in (3.2), and $\hat b_\alpha$ be such that $P_N(\hat b_\alpha) = \alpha$ in (3.3). A first order Taylor expansion at $\hat b_\alpha$ leads to
$$\hat b_\alpha - b_\alpha \approx -\frac{P_N(b_\alpha) - \alpha}{P_N'(\hat b_\alpha)},$$
where $P_N'$ is the derivative of $P_N$. The variance of the numerator on the right side can be estimated from the sample values by $S^2/N$, where
$$S^2 = \frac{1}{N}\sum_{i=1}^{N}\Phi_i^2 - P_N(\hat b_\alpha)^2, \qquad P_N'(b) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sigma_i s_i}\,\varphi\Big(\frac{b - m_i}{\sigma_i s_i}\Big),$$
with $m_i = \mathrm{E}(y_{n+k} \mid \beta_i, \sigma_i, Y)$, $s_i^2 = \sum_j \psi_j^2(\beta_i)$, $\Phi_i = \Phi\big((\hat b_\alpha - m_i)/(\sigma_i s_i)\big)$, and ϕ the density of the standard normal distribution. This yields the standard error $\mathrm{s.e.}(\hat b_\alpha) = S/\big(\sqrt{N}\,P_N'(\hat b_\alpha)\big)$. Omitting the details, we can show that $(\hat b_\alpha - b_\alpha)/\mathrm{s.e.}(\hat b_\alpha)$ tends to the standard normal distribution as $N \to \infty$.
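The delta-method standard error can be sketched as follows (our notation: `means` and `scales` hold $m_i$ and $\sigma_i s_i$ for the N draws, `b` is the solved limit):

```python
import numpy as np
from scipy.stats import norm

def mc_limit_se(b, means, scales):
    """Approximate s.e. of a Monte Carlo prediction limit:
    s.e. = S / (sqrt(N) * P_N'(b)), where S^2 is the sample variance of the
    Phi terms and P_N'(b) the derivative of the averaged normal cdf."""
    means = np.asarray(means, float)
    scales = np.asarray(scales, float)
    N = len(means)
    u = (b - means) / scales
    Phi = norm.cdf(u)
    S2 = np.mean(Phi ** 2) - np.mean(Phi) ** 2
    dP = np.mean(norm.pdf(u) / scales)       # P_N'(b)
    return np.sqrt(S2 / N) / dP
```

The $1/\sqrt{N}$ rate discussed in section 6 (standard errors roughly tenfold when N drops from 100,000 to 1,000) follows directly from the formula.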
In case we use the weighted average (3.4) with weights (3.5) or (3.6), a similar technique leads to a corresponding standard error in which the plain sample averages are replaced by their weighted counterparts.

Priors
Here we give some examples of priors that are likely to improve the frequentist coverage probabilities. A popular principle for generating priors, often improper, is to follow Jeffreys's rule, which leads to the square root of the determinant of the information matrix. Applying this to the conditional likelihood (2.2), we first find that the second derivatives of the log-likelihood involve the sum $\sum_t x_t x_t'$.
The information matrix is obtained by taking the expectation given the parameters and changing the sign. Assuming stationarity, the determinant of the matrix involved is easily seen to equal |V| (note that V depends only on $\beta_j$, $j = 1, \ldots, p$). We now adopt the convention that the parameter groups {σ}, {β₀} and {β₁, …, β_p} are independent a priori, and that log σ and β₀ are uniform. This leads to the prior
$$p_J(\beta, \sigma) \propto \frac{|V|^{1/2}}{\sigma}, \qquad (4.1)$$
which we here call Jeffreys's prior.
The uniform prior over the stationarity region is
$$p_U(\beta, \sigma) \propto \frac{I(\beta \in S_p)}{\sigma}.$$
In our simulations we combine these priors with the full likelihood, i.e. the weights are $p(y_0 \mid \beta_i, \sigma_i)\,p_J(\beta_i, \sigma_i)$ and $p(y_0 \mid \beta_i, \sigma_i)\,p_U(\beta_i, \sigma_i)$ for Jeffreys's prior and the uniform prior on the stationarity region, respectively. Note that when Jeffreys's prior is used, the determinant |V| cancels when forming the weights.
For the stationary AR(1) model Jeffreys's marginal prior density of β₁ is
$$p_J(\beta_1) \propto \frac{1}{\sqrt{1 - \beta_1^2}}, \qquad |\beta_1| < 1,$$
and for AR(2) the marginal prior density of (β₁, β₂) is
$$p_J(\beta_1, \beta_2) \propto \frac{1}{(1 + \beta_2)\sqrt{(1 - \beta_2)^2 - \beta_1^2}}.$$
Note that the first density is proper whereas the second one is not; the latter property holds for all AR(p) with p > 1. Without a stationarity restriction, Berger and Yang (1994) defined an alternative proper marginal prior density for the AR(1) model as
$$p_R(\beta_1) = \begin{cases} \dfrac{1}{\pi\sqrt{1 - \beta_1^2}}, & |\beta_1| < 1, \\[1ex] \dfrac{1}{\pi|\beta_1|\sqrt{\beta_1^2 - 1}}, & |\beta_1| > 1, \end{cases} \qquad (4.2)$$
which they call a reference prior. It is easily seen that the reference prior is invariant under the transformation $\beta_1 \to 1/\beta_1$. Note that the reference prior should be combined with the conditional likelihood (2.2); it then also produces peaks in the posterior at $\beta_1 = \pm 1$. There is no point in combining it with the full likelihood, because then it reduces to Jeffreys's prior. There seems to be no generalization of the reference prior to higher order models.
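For AR(1) these densities are simple enough to code directly. The sketch below (hypothetical helper names, ours) evaluates both; the invariance of the reference prior under $\beta_1 \to 1/\beta_1$ amounts to the change-of-variables identity $p(\beta_1) = p(1/\beta_1)/\beta_1^2$, which the usage check below verifies.

```python
import numpy as np

def jeffreys_ar1(b1):
    """Jeffreys's marginal prior for stationary AR(1); proper on (-1, 1)."""
    return 1.0 / (np.pi * np.sqrt(1.0 - b1 ** 2))

def reference_ar1(b1):
    """Berger-Yang reference prior, defined for all b1 with |b1| != 1."""
    a = abs(b1)
    if a < 1.0:
        return 1.0 / (np.pi * np.sqrt(1.0 - a ** 2))
    return 1.0 / (np.pi * a * np.sqrt(a ** 2 - 1.0))
```

Substituting $\beta_1 = \sin\theta$ turns the Jeffreys density into the constant $1/\pi$ on $(-\pi/2, \pi/2)$, which is why it integrates to one (an arcsine-type law).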
Simulation experiments

We now turn back to the frequentist interpretation of probability. Suppose that we have a realization from an autoregressive process with parameters β, σ which are fixed but unknown. Despite this we wish to compute the prediction interval $(b_\alpha, b_{1-\alpha})$ via the Bayesian route from the probability distribution (3.2). Although the posterior coverage probability is exactly 1 − 2α, the corresponding frequentist probability usually is not. It refers to an infinite sequence of new realizations from the same model with parameters β, σ, and it is defined by the probability $P_{\beta,\sigma}(b_\alpha < y_{n+k} < b_{1-\alpha})$, where all of $y_{n+k}$, $b_\alpha$, $b_{1-\alpha}$ are random. In short, the frequentist coverage probability is an average of the conditional frequentist coverage probabilities $P_{\beta,\sigma}(b_\alpha < y_{n+k} < b_{1-\alpha} \mid Y)$ over the realizations.

The chosen priors are the uniform prior, the uniform prior on the stationarity region, Jeffreys's prior (4.1) and, for AR(1), the reference prior (4.2) of Berger and Yang (1994). In all comparisons we have used 50,000 replicates, where each replicate is a realization from a stationary AR(p) process of length n + p. Within each replicate the Monte Carlo sample size is N = 50. In this type of simulation experiment N need not be large, because the main variation in the coverage probabilities is due to the different replicates. Nevertheless, in an actual application N should be considerably larger, as is seen in the next section. All computations are done in the R environment (R Development Core Team, 2012).
Start with AR(1). Figure 5.1 shows the coverage probabilities of one-step ahead prediction intervals (2.6) based on the t distribution for several values of β₁ when n = 30. Here, and in all other experiments, we aim for a coverage probability of 0.9. For negative β₁ the coverage probabilities tend to be slightly above the nominal value, whereas when β₁ is close to 1 they drop below it. However, note that even for β₁ = 0.9 the coverage probability is 0.894. The standard errors of the coverage probabilities are less than 3 × 10⁻⁴ in all cases. Although not shown in the figure, the prediction intervals appear to have approximately equal tail probabilities. The coverage probabilities of the standard prediction intervals stay below 0.88.
Figure 5.2 shows the coverage probabilities of multi-step prediction for AR(1) processes with different prior choices and four different parameter values for β₁. The standard error of the coverage probability was less than 7 × 10⁻⁴ in all cases. The reference prior seems to be slightly preferred to the others. Especially when β₁ is near 1, the reference prior leads to coverage probabilities that are closest to the nominal frequentist coverage probability. Elsewhere the priors give almost the same coverage probabilities.

In order to examine the differences between the plug-in method and the Bayesian method with different priors in the case of AR(2) processes, we use nine different parameter combinations of β₁ and β₂. The parameters are defined through the roots of the characteristic equation
$$1 - \beta_1 z - \beta_2 z^2 = 0.$$
Let the roots be $r_1$ and $r_2$. Then the parameters β₁ and β₂ can be written as
$$\beta_1 = \frac{1}{r_1} + \frac{1}{r_2}, \qquad \beta_2 = -\frac{1}{r_1 r_2}.$$
The reciprocals of the roots and the corresponding parameters β₁ and β₂ are given in Table 5.1. Figure 5.3 shows the coverage probabilities for each process with the nominal coverage probability 0.9 and n = 30. The standard error of the coverage probability was less than 5 × 10⁻⁴ in all cases. The Bayesian methods perform much better in all cases. Figures 5.4 and 5.5 show the spectral densities and the associated autocorrelation functions of the corresponding processes. The spectral densities are scaled such that the total density integrates to 1.
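The map from roots to coefficients used for Table 5.1 can be sketched as follows (our helper name); stationarity requires both roots to lie outside the unit circle, and a complex conjugate pair yields real coefficients.

```python
def ar2_from_roots(r1, r2):
    """Coefficients of 1 - b1*z - b2*z^2 with prescribed roots r1, r2.

    A complex conjugate pair gives real b1, b2 (imaginary parts cancel)."""
    b1 = 1.0 / r1 + 1.0 / r2
    b2 = -1.0 / (r1 * r2)
    if isinstance(b1, complex):
        b1, b2 = b1.real, b2.real
    return b1, b2
```

For example, roots 2 and 5 give β₁ = 0.7 and β₂ = −0.1, and the characteristic polynomial indeed vanishes at both roots.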
Table 5.1: The AR(2) models used in the simulation experiments.
When the mass of the spectral density is concentrated on the medium or high frequencies and the corresponding autocorrelation functions alternate, the forecast horizon does not seem to affect the coverage probabilities of either the Bayesian or the plug-in methods. The Bayesian methods are almost exactly at the nominal coverage or somewhat exceed it, while the plug-in method stays clearly below it. In the case where the spectral density is very flat, the Bayesian methods give coverage probabilities slightly over the nominal level, while the plug-in method stays below it. Assuming stationarity with the uniform prior seems to give coverage probabilities a little below those of the uniform and Jeffreys's priors.

Annual gross domestic product growth
As an empirical example we applied our method to forecasting the annual gross domestic product (GDP) changes (in percentages) in the United Kingdom (UK) and Spain (Figure 6.1). The data are from the World Bank's databank (http://databank.worldbank.org). Observations from 1962–2001 are used for finding an appropriate model for each series. The series are then forecast 10 years ahead, for 2002–2011. In the following we first pretend, as far as possible, that we do not know what happened in the forecast period. Afterwards, when the future has been revealed, we try to learn from it.
Based on the autocorrelation and partial autocorrelation functions, we assume an AR(1) model for both series. The least squares estimates of the autoregressive coefficients are 0.35 for the UK and 0.65 for Spain. The estimates of β₀ are 1.77 (UK) and 1.25 (Spain), and those of σ are 1.93 (UK) and 1.83 (Spain).
The residuals of the Spanish series are well behaved: there is neither apparent autocorrelation nor deviation from normality. Now, of course, we know that the preceding reasoning is too optimistic. In 2001 there was no indication of the financial crisis starting in 2008. Figure 6.1 shows the actual values, the 90% prediction intervals and the point predictions for both series. The point predictions in the Bayesian forecasting are posterior medians, computed by setting α = 0.5. As expected, the Bayesian method produces wider prediction intervals than the standard plug-in method for both series, and substantially wider for the Spanish series. We expected at most one minor violation in the Spanish series, whereas we see one large drop below the 90% prediction limit. In the UK series there is one value at the boundary (in 2008) and, as in Spain, a large drop in 2009; the realized series is more turbulent than what our model predicts.
There is also one more thing we can learn. Although the differences between the Bayesian intervals are mostly negligible, we see that Jeffreys's prior leads to a significantly higher upper limit for the Spanish series. The same is true for the point forecasts (i.e. medians). A closer examination of the importance weights under Jeffreys's prior shows that large weights are associated with large values of the autoregressive coefficient. This seems to be due to the large difference between the first observation and the stationary mean, which casts doubt on the stationarity of the series. Our conclusion is that the uniform prior combined with the conditional likelihood is preferable to Jeffreys's prior in this case.
What should we think of the drops in 2009 occurring in both series? Looking at the series as such, and ignoring other information, they seem to be single unlikely events rather than signs of a structural break. There are at least two arguments for this. Firstly, the values after 2009 (in 2010 and 2011) fit rather well into the main body of the data. Secondly, we computed prediction bands such that the observed values of the year 2009 lie exactly on the boundary. The corresponding coverage probabilities are 1 − α with α = 0.0016 for the UK and α = 0.0078 for the Spanish series. Moreover, the values in 2009 are also extremely aberrant compared with those in the whole period 1962–2011. In summary, the values in 2009 are highly improbable in the light of the rest of the data.
Nevertheless, the accumulated information on world economy and the crisis concerning the euro countries till the end of 2012 makes us think that the assumption of a structural break should still be considered seriously.If we are to forecast from 2012 onwards we should carefully explore the methods suggested by Clements and Hendry (1999).
Table 6.1 gives numerical information on the predictions and prediction intervals related to our application, under the assumption that our model is correct. Firstly, it contains the actual prediction limits with their Monte Carlo standard errors obtained from the formulas in section 3.3. Secondly, it shows the coverage probabilities for both models in case they were true. The forecasting horizons are k = 1, 10 and the nominal coverage is 0.90. The two Bayesian methods give one-step ahead prediction intervals which are practically correct. When k = 10, the coverage probabilities for the UK differ slightly from the nominal ones. The standard method gives intervals whose coverage probabilities are below the nominal level in both cases. These comparisons concern the chosen models only, not the actual series: the reported coverage probabilities are averages corresponding to AR(1) models with coefficients 0.35 and 0.65. This explains why the actual intervals for the Spanish series with the uniform and Jeffreys's priors are of considerably different widths, though their coverage probabilities are close to each other on average. The deviation may also be related to the aberrant starting value. We have used a sample size of N = 100,000 in our Monte Carlo simulations, and therefore the standard errors are fairly small. The standard errors should decrease at the rate 1/√N, which is also confirmed by our experiments: with N = 1000 the standard errors are approximately tenfold compared to those in Table 6.1.

We have also compared the probabilities that the future value lies below the lower limit or above the upper limit.They seem to be approximately equal.Thus, our method seems to produce equal tail prediction intervals.

Discussion
We have shown the benefits of the Bayesian approach to prediction interval calculation under autoregressive schemes. Our message to practitioners is that there are appropriate prior distributions leading to improved prediction intervals compared to those obtained by the common plug-in method. It has turned out that the uniform and Jeffreys's priors meet most practical goals for such intervals. Jeffreys's prior might have a slight advantage as regards coverage probabilities, although its dependence on the initial observations may have detrimental effects if the starting values are too far from their mean. Our simulation method is straightforward and easy to understand and implement. An estimate of the Monte Carlo error can also be obtained.
It is plain that when the length of the time series increases, the parameter uncertainty decreases, and thus the coverage probabilities also get closer to the nominal level. For example, in the AR(1) case with n = 100, h = 10, and β₁ = 0.5, the coverage probability of the plug-in method roughly achieves the nominal probability. But giving guidelines on when to use the simulation method proposed here, instead of the plug-in method, seems to require massive Monte Carlo experimentation, because the outcome depends on many variables: the length of the series, the forecasting horizon, the order of the model, and the values of the coefficients. We believe that the proposed approach could be taken as a default method, because it is computationally light and hardly ever gives results worse than the plug-in method.
The prediction intervals are computed under the Gaussian assumption. We have made some simulation experiments under AR(1) and AR(2) models where the true errors are from Student's t distribution with 5 degrees of freedom or from the Laplace distribution, but the prediction intervals are still computed under the Gaussian assumption. In both cases the coverage probabilities are smaller than in the case where the true errors are Gaussian, but usually the difference is about one percentage point or less. Although such a small experiment does not afford any general conclusions, it makes us conjecture that the method is fairly robust against minor deviations from normality.
Although we have handled univariate AR processes only, the method could be extended to general ARIMA and vector autoregressive models, but more careful consideration of prior distributions is then needed. In addition, these more complex models can be more sensitive to structural breaks and the other issues discussed by Clements and Hendry (1999).
Finally, when dealing with a particular data set, some evidence of the adopted prior is obtained by simulating the estimated model as we have done in section 6.

Figure 5.1: Coverage probabilities of one-step ahead predictions in AR(1) models when n = 30. The nominal coverage is 0.9. Dots are the estimated values, based on 50,000 replicates. The upper dots relate to the Bayesian intervals with uniform prior and the lower ones to the standard intervals.
Figure 5.2: Coverage probabilities of multi-step ahead predictions in AR(1) models with different values of β₁ when n = 30. The nominal coverage is 0.9. The top dashed black line is based on the reference prior and the bottom dotted line represents the standard coverage probabilities. The lines based on the uniform (solid dark grey), Jeffreys's (light grey dashed) and uniform stationary (black dot-and-dash) priors are almost indiscernible apart from the case β₁ = 0.9.
Figure 6.1: The annual GDP 1962–2001 in the UK and Spain together with the point predictions, actual values and 90% prediction intervals for the years 2002–2011. Solid grey lines represent the intervals and point estimates computed by the Bayesian method with uniform prior, grey dashed lines correspond to the Bayesian method with Jeffreys's prior, and black dotted lines correspond to the standard plug-in method.