Importance sampling type estimators based on approximate marginal MCMC

We consider importance sampling (IS) type weighted estimators based on Markov chain Monte Carlo (MCMC) targeting an approximate marginal of the target distribution. In the context of Bayesian latent variable models, the MCMC typically operates on the hyperparameters, and the subsequent weighting may be based on IS or sequential Monte Carlo (SMC), but allows for multilevel techniques as well. The IS approach provides a natural alternative to delayed acceptance (DA) pseudo-marginal/particle MCMC, and has many advantages over DA, including a straightforward parallelisation and additional flexibility in MCMC implementation. We detail minimal conditions which ensure strong consistency of the suggested estimators, and provide central limit theorems with expressions for asymptotic variances. We demonstrate how our method can make use of SMC in the state space models context, using Laplace approximations and time-discretised diffusions. Our experimental results are promising and show that the IS type approach can provide substantial gains relative to an analogous DA scheme, and is often competitive even without parallelisation.


Introduction
Markov chain Monte Carlo (MCMC) has become a standard tool in Bayesian analysis. The greatest benefit of MCMC is its general applicability: it is guaranteed to be consistent with virtually no assumptions on the underlying model. However, the practical applicability of MCMC generally depends on the dimension of the unknown variables, the number of data, and the computational resources available. Because MCMC is only asymptotically unbiased, and sequential in nature, it can be difficult to implement efficiently with modern parallel and distributed computing facilities [44,64,102].
We promote a simple two-phase inference approach, based on importance sampling (IS), which is well-suited for parallel implementation. It combines a typically low-dimensional MCMC targeting an approximate marginal distribution with independently calculated estimators, which yield exact inference over the full posterior. The estimator is similar to self-normalised importance sampling, but is more general, allowing for sequential Monte Carlo and multilevel type corrections. The method is naturally applicable in a latent variable models context, where the MCMC operates on the hyperparameter distribution using an approximate marginal likelihood, and re-weighting is based on a sampling scheme on the latent variables. We detail the application of the method with Bayesian state space models, where we use importance sampling and particle filters for correction.
1.1. Related work. We consider a framework which combines and generalises various previously suggested methods, and which, to our knowledge, has not been systematically explored before. Importance sampling correction of MCMC was suggested early in the MCMC literature [e.g. 20,38,46], and used, for instance, to estimate Bayes factors using a single MCMC output [21]. Related confidence intervals have been suggested based on regeneration [11] and in the case of multiple Markov chains [94]. Using unbiased estimators of importance weights in this context has been suggested at least in [65,68], who consider marginal inference with a generalisation of the pseudo-marginal method, allowing for likelihood estimators that may take negative values, and in [82] with data sub-sampling.
Nested or compound sampling has also appeared in many forms in the Monte Carlo literature. The SMC² algorithm [13] is based on an application of nested sequential Monte Carlo steps, which has similarities with our framework, and the IS² method [96] focuses on the case where the preliminary inference is based on independent sampling. We focus on the MCMC approximation of the marginal distribution, which we believe is often easy to implement in practice, also when the marginal distribution has a non-standard form. The Markov dependence in the marginal Monte Carlo approximation comes with some extra theoretical issues, which we address in detail.
Our setting highlights explicitly the connection of IS type correction and delayed acceptance (DA) [15,33,67], and recently developed pseudo-marginal type MCMC [4,65] such as particle MCMC [2], grouped independence Metropolis-Hastings [9], approximate Bayesian computation (ABC) MCMC [69], the algorithm for estimation of discretely observed diffusions suggested in [10], and annealed IS [57,74]. Theoretical advances of pseudo-marginal methods [3,6,7,14,27,66,92] have already led to more efficient implementation of such methods, but have also revealed fundamental limitations. For instance, the methods may suffer from slow (non-geometric) convergence in practically interesting scenarios [4,62]. Adding dependence to the estimators [cf. 7], such as using the recently proposed correlated version of the pseudo-marginal MCMC [18], may help in more efficient implementation in certain scenarios, but a successful implementation of such a method may not always be possible, and the question of efficient parallelisability remains a challenge. The blocked parallelisable particle Gibbs [93] has appealing limiting properties, but its implementation still requires synchronisation between every update cycle, which may be costly in some computing environments.
The IS approach which we propose may assuage some of the aforementioned challenges of the pseudo-marginal framework; see Section 2.3.

1.2. Outline. We introduce a generic Bayesian latent variable model in Section 2, detail our approach algorithmically, and compare it with DA. We also discuss practical implications, modifications and possible extensions. After introducing notation in Section 3, we formulate general IS type correction of MCMC and related consistency results in Section 4. We detail the general case (Theorem 3), based on a concept (Definition 2), which we call a 'proper weighting' scheme (following the terminology of Liu [67]), which is natural and convenient in many contexts. In Section 5, we state central limit theorems and expressions for asymptotic variances. Section 6 focuses on estimators which calculate the IS correction once for each accepted state, stemming from a so-called 'jump chain' representation. Section 7 details consistency of our estimators in case the approximate chain is pseudo-marginal.
We detail proper weighting schemes for state space models (SSMs) using sequential Monte Carlo (SMC) in Section 8. We then focus on SSMs with linear-Gaussian state dynamics in Section 9, and show how a Laplace approximation can be used both for approximate inference and for construction of efficient proper weighting schemes. Section 10 describes an instance of our approach in the context of discretely observed diffusions, with an approximate pseudo-marginal chain. We compare empirically several algorithmic variations in Section 11 with Poisson observations, with a stochastic volatility model and with a discretely observed geometric Brownian motion. Section 12 concludes with a discussion.

The proposed latent variable model inference methodology
A generic Bayesian latent variable model is defined in terms of three random vectors and corresponding conditional densities:
• Θ ∼ pr(·), the prior density of (hyper)parameters,
• X | Θ = θ ∼ μ^(θ)(·), the prior of latent variables given parameters, and
• Y | (Θ = θ, X = x) ∼ g^(θ)(· | x), the observation model.
The aim is inference over the posterior of (Θ, X) given observations Y = y, with density π(θ, x) ∝ pr(θ)μ^(θ)(x)g^(θ)(y | x). Standard MCMC algorithms may, in principle, be applied directly for inference, but the typically high dimension of the latent variable x and the common strong dependency structures often lead to poor performance of generic algorithms.
Our inference approach focuses on the specific structure of the model, based on the factorisation π(θ, x) = π_m(θ)r(x | θ), where the marginal posterior density π_m and the corresponding conditional r are

π_m(θ) := ∫ π(θ, x) dx ∝ pr(θ)L(θ) and r(x | θ) := p^(θ)(x, y)/L(θ),

with the joint density of the latent and the observed variables, and the marginal likelihood, given as follows:

p^(θ)(x, y) := μ^(θ)(x)g^(θ)(y | x) and L(θ) := ∫ p^(θ)(x, y) dx.

Two particularly successful latent variable model inference methods, the integrated nested Laplace approximation (INLA) [87] and the particle MCMC methods (PMCMC) [2], rely on this structure. In essence, INLA is based on an efficient Laplace approximation p_a^(θ)(x, y) of p^(θ)(x, y), determining an approximate marginal likelihood L_a(θ) and an approximate conditional distribution r_a(x | θ). Particle MCMC uses a specialised SMC algorithm, which provides an unbiased approximation of expectations with respect to p^(θ)(x, y), allowing for exact inference, and which is particularly efficient in the state space models context.

2.1. An algorithmic description. The primary aim of this paper is the efficient use of an approximate marginal likelihood L_a(θ) within a Monte Carlo framework that leads to efficient, parallelisable and exact inference. For instance, Laplace approximations often lead to a natural choice for L_a(θ). The inference method which we propose comprises two algorithmic phases, which are summarised below:

Phase 1: Simulate a Markov chain (Θ_k)_{k=1,…,n} targeting an approximate hyperparameter posterior π_a(θ) ∝ pr(θ)L_a(θ).

Phase 2: For each k = 1,…,n, sample (V_k^(i), X_k^(i))_{i=1,…,m} given Θ_k, where the X_k^(i) are in the latent variable space, and calculate W_k^(i) := V_k^(i)/L_a(Θ_k). Report the self-normalised estimator

[Σ_{k=1}^n Σ_{i=1}^m W_k^(i) f(Θ_k, X_k^(i))] / [Σ_{k=1}^n Σ_{i=1}^m W_k^(i)].   (1)

The essential conditions required for the validity of the estimator are:

C1: The approximation is consistent, in the sense that L_a(θ) > 0 whenever L(θ) > 0, and ∫ pr(θ)L_a(θ) dθ < ∞.

C2: The Markov chain (Θ_k)_{k≥1} is Harris ergodic (Definition 1) with respect to π_a.

C3: For all θ ∈ T and all functions f of interest,

E[ m^{-1} Σ_{i=1}^m V_k^(i) f(Θ_k, X_k^(i)) | Θ_k = θ ] = c_w L(θ) ∫ r(x | θ) f(θ, x) dx,   (2)

and (2) also holds for f ≡ 1 (i.e. with f(·) and the integral against f omitted). The value of c_w need not be known.

Both C1 and C2 are easily satisfied by construction of the approximation, and C3 is satisfied by many schemes; Section 8 reviews how (unnormalised) importance sampling and the particle filter lead to such schemes. There is also a (mild) integrability condition which the weights (W_k^(i)) must satisfy in order to guarantee strong convergence. When W_k^(i) ≥ 0 almost surely, it suffices that |f| satisfies (2); see Section 4 for details. Further conditions ensure a central limit theorem, as detailed in Section 5. When Phase 1 is a Metropolis-Hastings algorithm, it is possible to generate only one batch of (Ṽ_k^(i), X̃_k^(i))_{i=1,…,m} for each accepted state Θ̃_k. If N_k stands for the time spent at Θ̃_k, then the corresponding weights are determined as W̃_k^(i) := N_k Ṽ_k^(i)/L_a(Θ̃_k); see Section 6 for details about such 'jump chain' estimators.
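The two phases above can be sketched on a toy Gaussian latent variable model, where every density is tractable so the output can be checked against the exact posterior. Everything model-specific below (the deliberately inflated approximation L_a, the random-walk proposal, the conditional proposal q in Phase 2) is an illustrative assumption, not part of the paper's method specification.

```python
import math, random

random.seed(1)

def npdf(x, mean, var):
    # Gaussian density N(x; mean, var)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: Theta ~ N(0,1), X | theta ~ N(theta,1), Y | x ~ N(x,1); data y = 1.
# Exact marginal likelihood L(theta) = N(y; theta, 2); the "approximation"
# L_a(theta) = N(y; theta, 2.5) deliberately inflates the variance.
y = 1.0
prior = lambda th: npdf(th, 0.0, 1.0)
L_a = lambda th: npdf(y, th, 2.5)

# Phase 1: random-walk Metropolis targeting pi_a(theta) oc prior(theta) * L_a(theta).
n, theta, chain = 20000, 0.0, []
for _ in range(n):
    prop = theta + random.gauss(0.0, 1.0)
    if random.random() < min(1.0, (prior(prop) * L_a(prop)) / (prior(theta) * L_a(theta))):
        theta = prop
    chain.append(theta)

# Phase 2: for each Theta_k, draw X ~ q(.|theta) = N((theta + y)/2, 1/2),
# set V = mu(x | theta) g(y | x) / q(x), and weight W = V / L_a(theta).
num = den = 0.0
for th in chain[1000:]:                      # discard burn-in
    for _ in range(2):                       # m = 2 correction samples
        x = random.gauss((th + y) / 2, math.sqrt(0.5))
        V = npdf(x, th, 1.0) * npdf(y, x, 1.0) / npdf(x, (th + y) / 2, 0.5)
        W = V / L_a(th)
        num += W * x
        den += W

post_mean_x = num / den                      # exact posterior mean E[X | y] = 2y/3
```

The weighting corrects both the marginal discrepancy (L versus L_a) and produces latent-variable inference at the same time; here q coincides with the exact conditional r(x | θ), so V = L(θ) with zero variance, which keeps the sketch short.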

2.2. Use with approximate pseudo-marginal MCMC. In many scenarios, such as with time-discretised diffusions, the latent variable prior density μ^(θ) cannot be evaluated, and exact simulation is impossible or very expensive. Simulation is also expensive with a fine enough time-discretisation. A coarsely discretised model leads to a natural cheap approximation μ̂^(θ), but then the approximate marginal likelihood can typically only be estimated unbiasedly, and in Phase 1 the Markov chain will often be a pseudo-marginal MCMC [cf. 4], in which case our scheme takes the following form:

Phase 1': Simulate a pseudo-marginal Metropolis-Hastings chain (Θ_k, U_k) for k = 1, …, n, where one iteration consists of the following steps:
(i) Draw a proposal Θ̃_k from q(Θ_{k-1}, ·) and, given Θ̃_k, construct an estimator Ũ_k of L_a(Θ̃_k).
(ii) With probability min{1, [pr(Θ̃_k)Ũ_k q(Θ̃_k, Θ_{k-1})] / [pr(Θ_{k-1})U_{k-1} q(Θ_{k-1}, Θ̃_k)]}, accept and set (Θ_k, U_k) = (Θ̃_k, Ũ_k); otherwise reject the move and set (Θ_k, U_k) = (Θ_{k-1}, U_{k-1}).

Phase 2': For each (Θ_k, U_k), sample (V_k^(i), X_k^(i))_{i=1,…,m} as in Phase 2, and calculate W_k^(i) := V_k^(i)/U_k.

Algorithmically, the pseudo-marginal version above is similar to the method in Section 2.1, with the likelihood L_a(Θ_k) replaced by its estimator U_k. The requirements C1 for the approximate likelihood and C3 for its estimator remain identical, and C2 must hold for the pseudo-marginal chain (Θ_k, U_k), together with the following condition:

C4: The estimators Ũ_k are strictly positive, almost surely, for all Θ̃_k ∈ T.

These are enough to guarantee consistency; see Section 7, and in particular Proposition 15, for details, which also justify why C4 is needed for consistency. In practice C4 may be easily satisfied, because the likelihood estimators Ũ_k may be inflated if necessary (see Section 12).
Note that the variables (V_k^(i)) may depend on both Θ_k and the related likelihood estimate U_k. The dependence may be useful if positively correlated V_k^(i) and U_k are available, leading to lower-variance weights W_k^(i). This is similar to the correlated pseudo-marginal algorithm [18], which relies on a particular form of dependence between the estimators.
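Phase 1' can be illustrated in a toy setting in which even L_a is estimated by plain Monte Carlo; the model, the estimator U, and the number of inner samples s are illustrative assumptions chosen so that E[U] = L_a(θ) holds exactly.

```python
import math, random

random.seed(2)

def npdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Approximate model: prior N(0,1) and L_a(theta) = N(y; theta, 2.5). Since
# N(y; theta, 2.5) = E[N(y; x, 1.5)] with x ~ N(theta, 1), averaging s such
# draws gives an unbiased, strictly positive estimator U (condition C4).
y, s = 1.0, 10

def estimate_La(th):
    return sum(npdf(y, random.gauss(th, 1.0), 1.5) for _ in range(s)) / s

n = 20000
theta, U = 0.0, estimate_La(0.0)
chain = []
for _ in range(n):
    prop = theta + random.gauss(0.0, 1.0)
    U_prop = estimate_La(prop)
    prior_ratio = npdf(prop, 0.0, 1.0) / npdf(theta, 0.0, 1.0)
    # Pseudo-marginal acceptance: the estimator U replaces L_a in the MH ratio,
    # and the accepted estimate is carried along with the state.
    if random.random() < min(1.0, prior_ratio * U_prop / U):
        theta, U = prop, U_prop
    chain.append(theta)

pm_mean = sum(chain[1000:]) / len(chain[1000:])   # E_{pi_a}[theta] = y/3.5
```

Marginally, the chain targets π_a exactly despite the noisy likelihood, which is the pseudo-marginal property the section relies on.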

2.3. Comparison with delayed acceptance. The key condition, under which we believe our method to be useful, is that the Phase 1 Markov chain is computationally relatively cheap compared to the construction of the random variables (W_k^(i), X_k^(i)). The same condition underlies the related delayed acceptance (DA) algorithm, with one iteration consisting of the following steps:

DA 1: Draw Θ̃_k ∼ P(Θ_{k-1}, ·), where P is a Metropolis-Hastings kernel targeting π_a. If Θ̃_k = Θ_{k-1}, reject and set Θ_k = Θ_{k-1}; otherwise go to (DA 2).

DA 2: Conditional on Θ̃_k, draw (Ṽ_k^(i), X̃_k^(i)) which satisfy (2) with Θ̃_k in place of Θ_k, and set W̃_k^(i) := Ṽ_k^(i)/L_a(Θ̃_k). With probability min{1, Σ_i W̃_k^(i) / Σ_i W_{k-1}^(i)}, accept and set Θ_k = Θ̃_k; otherwise reject and set Θ_k = Θ_{k-1}.

If the pseudo-marginal method is used in DA 1, the value L_a(Θ̃_k) is replaced with the related likelihood estimator. Under essentially the same assumptions as required by our scheme, and additionally requiring that W̃_k^(i) ≥ 0, the DA scheme described above leads to a consistent weighted estimator of the same self-normalised form. Our IS scheme is a natural alternative to such a DA scheme, replacing the independent Metropolis-Hastings type accept-reject step DA 2 with analogous weighting. This relatively small algorithmic change brings many, potentially substantial, benefits over DA, which we note next.
(i) Phase 2 corrections are entirely independent 'post-processing' of the Phase 1 MCMC output (Θ_k)_{k=1,…,n}, which is easy to implement efficiently using parallel or distributed computing. This is unlike DA 1 and DA 2, which must be iterated sequentially.
(ii) If Phase 2 correction variables are calculated only once for each accepted Θ_k (the so-called 'jump chain' representation, see Section 6), the IS method will typically be computationally less expensive than DA with the same number of iterations, even without parallelisation.
(iii) The Phase 1 MCMC chain (Θ_k) may be (further) thinned before applying the (much more computationally demanding) Phase 2.
(vi) Non-negativity of the weights W̃_k^(i) is required in DA 2, but not in Phase 2. This may be useful in certain contexts where multilevel [37,47] or debiasing [71,84,98] techniques are applicable. (See also the discussion in [52] on why the pseudo-marginal method may not be applicable at all in such a context.)
(vii) The separation of 'approximate' Phase 1 and 'exact' Phase 2 allows for two-level inference. In statistical practice, preliminary analysis could be based on (fast) purely approximate inference, and the (computationally demanding) exact method could be applied only as a final verification to ensure that the approximation did not affect the findings.
To elaborate the last point, the approximate likelihood L_a(θ) is usually based on an approximation p_a^(θ)(x, y), which also determines an approximate joint posterior π̃(θ, x). Then, Phase 2 allows for quantification of the bias E_π̃[f(Θ, X)] − E_π[f(Θ, X)], and confirmation that both inferences lead to the same conclusions.
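For comparison, the DA steps can be sketched on the same kind of toy model; for brevity the second-stage weight w(θ) = L(θ)/L_a(θ) is assumed computable in closed form here (in practice it would itself be estimated), and all densities are illustrative assumptions.

```python
import math, random

random.seed(3)

def npdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: prior N(0,1), exact L(theta) = N(y; theta, 2), L_a = N(y; theta, 2.5).
y = 1.0
pi_a = lambda th: npdf(th, 0.0, 1.0) * npdf(y, th, 2.5)
w = lambda th: npdf(y, th, 2.0) / npdf(y, th, 2.5)   # second-stage weight L/L_a

n, theta, chain = 20000, 0.0, []
for _ in range(n):
    # DA 1: one Metropolis-Hastings step targeting the approximation pi_a
    prop = theta + random.gauss(0.0, 1.0)
    if random.random() < min(1.0, pi_a(prop) / pi_a(theta)):
        # DA 2: second-stage accept-reject with the weight ratio; only reached
        # for first-stage acceptances, so the expensive part runs less often.
        if random.random() < min(1.0, w(prop) / w(theta)):
            theta = prop
    chain.append(theta)

da_mean = sum(chain[1000:]) / len(chain[1000:])   # exact E_{pi_m}[theta] = y/3
```

Unlike the IS scheme, the two stages must run sequentially, which is precisely the point of items (i) and (ii) above.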
The further work [35] considers the relationship between IS and DA in terms of the asymptotic variance.

Notation and preliminaries
Throughout the paper, we consider general state spaces while using standard integral notation. If the model at hand is given in terms of standard probability densities, the rest of this paragraph can be skipped. Each space X is assumed to be equipped with a σ-finite dominating measure 'dx' on a σ-algebra denoted with a corresponding calligraphic letter, such as X. Product spaces are equipped with the related product σ-algebras and product dominating measures. If X is a subset of a Euclidean space R^d, dx is taken by default as the Lebesgue measure and X as the Borel subsets of X. R_+ stands for the non-negative real numbers, and the constant unit function is denoted by 1.
We follow the conventions 0/0 := 0 and N := {1, 2, …}. For integers a ≤ b, we denote by a:b the integers within the interval [a, b]. We use this notation in indexing, writing for instance y_{a:b} = (y_a, …, y_b); when a > b the vector is void, so that for example g(x, y_{1:0}) is interpreted as g(x). Similarly, if i_{1:T} is a vector of indices, then x^{(i_{1:T})} = (x^{(i_1)}, …, x^{(i_T)}). We also use double indexing, such as x_{1:T}^{(1:m)}. Throughout the paper, we assume the underlying MCMC scheme to satisfy the following standard condition.

Definition 1 (Harris ergodicity). A Markov chain is called Harris ergodic with respect to ν if it is ψ-irreducible, Harris recurrent, and has invariant probability ν.
Virtually all MCMC schemes are Harris ergodic [cf. 75,95], although in some cases careless implementation could lead to a non-Harris chain [cf. 85]. Thanks to the Harris assumption, all the limit theorems which we give hold for any initial distribution of the related Markov chain.

General importance sampling type correction of MCMC
Hereafter, π_a is a probability density on T which represents an approximation of a probability density π_m of interest. The consistency of IS type correction relies on the following mild assumption.

Assumption 1. The Markov chain (Θ_k)_{k≥1} and the density π_a satisfy:
(i) (Θ_k)_{k≥1} is Harris ergodic with respect to π_a.
(ii) supp(π_m) ⊂ supp(π_a).
(iii) The unnormalised weight function is w_u(θ) := c_w π_m(θ)/π_a(θ), where c_w > 0 is a constant.
If Assumption 1 holds and it is possible to calculate the unnormalised importance weight w_u(θ) pointwise, the chain (Θ_k)_{k≥1} can be weighted in order to approximate π_m(g) for every g ∈ L¹(π_m), using (self-normalised) importance sampling [e.g. 20,38] as

Σ_{k=1}^n w_u(Θ_k)g(Θ_k) / Σ_{k=1}^n w_u(Θ_k) → π_m(g) almost surely,

because Harris ergodicity guarantees the almost sure convergence of both the numerator and the denominator. In the case where π_m is a marginal density, which we will focus on, both the ratio w_u(θ) and the function g (which will be a conditional expectation) are typically intractable. Instead, it is often possible to construct unbiased estimators, which may be used in place of w_u(Θ_k) and g(Θ_k) in order to estimate the numerator and the denominator, under mild conditions. In order to formalise such a setting, we give the following generic condition for ratio estimators, which resemble the IS correction above.

Assumption 2. Suppose Assumption 1 holds, and let (S_k)_{k≥1}, where S_k = (A_k, B_k) ∈ R², be conditionally independent given (Θ_k)_{k≥1}, such that the distribution of S_k depends only on the value of Θ_k, and for π_a-almost every θ ∈ T,

E[A_k | Θ_k = θ] = w_u(θ)g(θ) and E[B_k | Θ_k = θ] = w_u(θ),

with both conditional expectations absolutely convergent. We record the following simple statement which guarantees consistency under Assumption 2.
Lemma 1. If Assumption 2 holds for some g ∈ L¹(π_m), then

Σ_{k=1}^n A_k / Σ_{k=1}^n B_k → π_m(g) almost surely as n → ∞.

The proof of Lemma 1 follows by observing that (Θ_k, S_k)_{k≥1} is Harris ergodic, so that the strong law of large numbers applies to both the numerator and the denominator. In the latent variable model discussed in Section 2, the aim is inference over a joint target density π(θ, x) = π_m(θ)r(x | θ). The following formalises a scheme which satisfies Assumption 2 with g = f*, where f*(θ) := ∫ f(θ, x)r(x | θ) dx, and therefore guarantees consistency for a class of functions f ∈ L ⊂ L¹(π).
Definition 2 (L-proper weighting scheme). Suppose Assumption 1 holds, and let (P_k)_{k≥1} be conditionally independent given (Θ_k)_{k≥1}, such that the distribution of each P_k = (M_k, W_k^(1:M_k), X_k^(1:M_k)) depends only on the value of Θ_k. Define

ξ_k(f) := (1/M_k) Σ_{i=1}^{M_k} W_k^(i) f(Θ_k, X_k^(i)).

The scheme is called L-proper if, for every f ∈ L and π_a-almost every θ ∈ T,
(i) E[ξ_k(f) | Θ_k = θ] = w_u(θ)f*(θ), and
(ii) E[|ξ_k(f)| | Θ_k = θ] < ∞.

Remark 2. Regarding Definition 2:
(i) In the case of non-negative weights, that is, W_k^(i) ≥ 0 almost surely, condition (ii) follows from condition (i) applied with |f| in place of f.
(ii) When certain multilevel [37,47] or debiasing type estimators are used, the weights may generally take also negative values. In such a case, an extra integrability condition is necessary, and we believe (ii) is required for consistency in general.
(iii) Note that L is closed under linear operations, that is, if a, b ∈ R and f, g ∈ L, then af + bg ∈ L. This, together with L containing constant functions, implies that if f ∈ L, then f̄ := f − π(f) ∈ L. (iv) In fact, ξ_k may be interpreted as a random (signed) measure. Our results extend also to such a generalisation, which may be a useful interpretation for instance in the context of Rao-Blackwellisation, where the ξ_k could be mixtures of Gaussians.
The following consistency result is a direct consequence of Lemma 1.
Theorem 3. If (ξ_k)_{k≥1} form an L-proper weighting scheme, then the IS type estimator is consistent, that is, for every f ∈ L,

Σ_{k=1}^n ξ_k(f) / Σ_{k=1}^n ξ_k(1) → π(f) almost surely as n → ∞.

Let us next exemplify a 'canonical' setting of a proper weighting scheme, stemming from standard unnormalised importance sampling.
Proposition 4. Suppose Assumption 1 holds, and q^(θ)(·) defines a probability density on X for each θ ∈ T, with supp(r(· | θ)) ⊂ supp(q^(θ)) for π_m-almost every θ. Let X_k^(1:m) be conditionally i.i.d. draws from q^(Θ_k)(·), and set

W_k^(i) := w_u(Θ_k) r(X_k^(i) | Θ_k) / q^(Θ_k)(X_k^(i)).

Then (M_k ≡ m, W_k^(1:m), X_k^(1:m)) forms an L¹(π)-proper weighting scheme.

When the weights are all positive, we record the following simple observations on how the proper weighting property is inherited under sub-sampling, which may be useful for instance due to memory constraints.
Proposition 5. Suppose (ξ_k)_{k≥1} forms an L-proper weighting scheme with non-negative weights, and let (I_k)_{k≥1} be random indices which are conditionally independent given (Θ_k, P_k)_{k≥1}, with P(I_k = i | P_k) = W_k^(i) / Σ_{j=1}^{M_k} W_k^(j). Then

ξ'_k(f) := S_k f(Θ_k, X_k^(I_k)), where S_k := (1/M_k) Σ_{j=1}^{M_k} W_k^(j),

forms an L-proper weighting scheme, and the sub-sampling estimator simplifies to

Σ_{k=1}^n S_k f(Θ_k, X_k^(I_k)) / Σ_{k=1}^n S_k.

We conclude by recording a complementary statement about convex combinations, allowing one to merge multiple proper weighting schemes.

Proposition 6. Suppose (ξ_{k,j})_{k≥1} forms an L-proper weighting scheme for each j ∈ {1:N}. Then, for any constants β_1, …, β_N ≥ 0 with Σ_{j=1}^N β_j = 1, the convex combinations ξ_k(f) := Σ_{j=1}^N β_j ξ_{k,j}(f) form an L-proper weighting scheme.
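The sub-sampling idea can be demonstrated in the i.i.d. special case of the chain, with plain importance sampling per batch; the target, proposal and batch size below are illustrative assumptions. Keeping one particle per batch, drawn proportionally to its weight and carrying the batch's average weight, preserves the proper weighting property.

```python
import math, random

random.seed(4)

def npdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Target N(0,1), proposal q = N(0, 4); estimate E[X^2] = 1 both with full
# weighted batches and with one sub-sampled particle per batch.
n, m = 5000, 10
num_full = den_full = num_sub = den_sub = 0.0
for _ in range(n):
    xs = [random.gauss(0.0, 2.0) for _ in range(m)]
    ws = [npdf(x, 0.0, 1.0) / npdf(x, 0.0, 4.0) for x in xs]
    wbar = sum(ws) / m
    # Full estimator: every weighted particle contributes
    num_full += sum(wi * xi * xi for wi, xi in zip(ws, xs)) / m
    den_full += wbar
    # Sub-sampling: draw one index proportionally to the weights
    u, c, pick = random.random() * sum(ws), 0.0, xs[-1]
    for wi, xi in zip(ws, xs):
        c += wi
        if u <= c:
            pick = xi
            break
    num_sub += wbar * pick * pick
    den_sub += wbar

est_full = num_full / den_full
est_sub = num_sub / den_sub
```

Both ratios converge to the same limit; the sub-sampled version stores one particle per batch instead of m, at the price of some extra variance.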

Asymptotic variance and a central limit theorem
The asymptotic variance is a common efficiency measure for Markov chains, which coincides with the limiting variance of related estimators when a central limit theorem (CLT) holds.

Definition 3. Suppose the Markov chain (Θ_k)_{k≥1} on T has transition probability P which is Harris ergodic with respect to the invariant probability π_a. For f ∈ L²(π_a), the asymptotic variance of f with respect to P is

Var(f, P) := lim_{n→∞} (1/n) Var(Σ_{k=1}^n f(Θ_k)),

where the chain is started from stationarity, Θ_1 ∼ π_a.

In what follows, we denote by f̄(θ, x) := f(θ, x) − π(f) the centred version of any f ∈ L¹(π), and recall that if f ∈ L, then f̄ ∈ L. We also denote by μ_f(θ) := E[ξ_1(f) | Θ_1 = θ] the conditional mean of the weighting scheme. The proof of the following CLT is given in Appendix B.
Theorem 7. Suppose that the conditions of Theorem 3 are satisfied, f ∈ L ∩ L²(π), the conditional variance v(θ) := Var(ξ_1(f̄) | Θ_1 = θ) satisfies π_a(v) < ∞, and either of the following hold:
(i) P is reversible and Var(μ_f̄, P) < ∞; or
(ii) Σ_{n≥1} n^{-3/2} ‖Σ_{k=0}^{n-1} P^k μ_f̄‖_{L²(π_a)} < ∞.
Then,

n^{1/2} ( Σ_{k=1}^n ξ_k(f) / Σ_{k=1}^n ξ_k(1) − π(f) ) → N(0, σ_f²) in distribution, where σ_f² := [Var(μ_f̄, P) + π_a(v)] / π_a(μ_1)².

Remark 8. In the case of reversible chains, the condition in Theorem 7 (i) is essentially optimal, and the CLT relies on a result due to Kipnis and Varadhan [58]. The condition always holds when (Θ_k)_{k≥1} is geometrically ergodic, for instance when (Θ_k)_{k≥1} is a random-walk Metropolis algorithm and π_a is light-tailed [53,86]. When (Θ_k)_{k≥1} is sub-geometric, for example polynomially ergodic, extra conditions are required; see for instance [54]. Condition (ii), which applies for non-reversible chains, is also nearly optimal, and relies on a result due to Maxwell and Woodroofe [70]. See also the review on Markov chain CLTs by Jones [55].
Note that the latter term π_a(v) in the asymptotic variance expression contains the contribution of the 'noise' in the IS estimates. If the estimators ξ_k(f) are made increasingly accurate, so that π_a(v) becomes negligible, the limiting case corresponds to an IS corrected approximate MCMC which averages the conditional expectations μ_f̄(θ). We conclude by relating the asymptotic variance to a straightforward estimator.
Theorem 9. Suppose f ∈ L ∩ L²(π), π_a(v) < ∞ where v is defined in Theorem 7, and also π_a(μ_f̄²) < ∞. Then, with f̂_n := f − Σ_{k=1}^n ξ_k(f)/Σ_{k=1}^n ξ_k(1),

v_n := Σ_{k=1}^n ξ_k(f̂_n)² / [Σ_{k=1}^n ξ_k(1)]² satisfies n v_n → [π_a(μ_f̄²) + π_a(v)] / π_a(μ_1)² almost surely.

The proof of Theorem 9 is given in Appendix B. The estimator n v_n in Theorem 9 provides a consistent estimate of the CLT variance σ_f²/n when P corresponds to i.i.d. sampling, in which case Var(μ_f̄, P) = π_a(μ_f̄²). Typically, Var(μ_f̄, P) ≥ π_a(μ_f̄²) (which is always true when P is positive), and then n v_n provides a lower bound for the variance. It can provide useful information about the importance sampling noise contribution, and may be used as an optimisation criterion when adjusting the accuracy of the related estimators. Generic Markov chain asymptotic variance estimators (see, e.g., the review [32] and references therein) may also be used with IS correction, by estimating the asymptotic variance of n^{-1} Σ_{k=1}^n ξ_k(f) and dividing it by [n^{-1} Σ_{k=1}^n ξ_k(1)]².
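The last suggestion, estimating the asymptotic variance of the weighted averages and normalising by the squared average weight, can be sketched with the batch means method; the i.i.d.-sampling setting, the densities, and the batch size below are illustrative assumptions.

```python
import math, random

random.seed(5)

def npdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# 'Chain' ~ pi_a = N(0, 2.25), i.i.d. for simplicity, corrected to pi_m = N(0,1);
# xi_k(f) = w_u(Theta_k) f(Theta_k) with f(theta) = theta, xi_k(1) = w_u(Theta_k).
n = 20000
thetas = [random.gauss(0.0, 1.5) for _ in range(n)]
xi_1 = [npdf(t, 0.0, 1.0) / npdf(t, 0.0, 2.25) for t in thetas]
xi_f = [w * t for w, t in zip(xi_1, thetas)]

est = sum(xi_f) / sum(xi_1)          # IS estimate of E_{pi_m}[theta] = 0
wbar = sum(xi_1) / n

# Delta-method sequence whose asymptotic variance drives the CLT of the ratio
d = [(a - est * b) / wbar for a, b in zip(xi_f, xi_1)]

# Batch means: variance of batch averages, scaled back up by the batch size
B = 100
size = n // B
means = [sum(d[i * size:(i + 1) * size]) / size for i in range(B)]
mbar = sum(means) / B
asvar = size * sum((m - mbar) ** 2 for m in means) / (B - 1)
stderr = math.sqrt(asvar / n)        # standard error of the IS estimate
```

With a Markov (rather than i.i.d.) Phase 1 chain, the same batch means construction picks up the autocorrelation automatically, provided the batches are long relative to the mixing time.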

Jump chain estimators
Many MCMC algorithms, such as Metropolis-Hastings, include an accept-reject mechanism, which results in blocks of repeated values Θ_k = … = Θ_{k+b}. In the context of IS type correction, when the computational cost of each estimate ξ_k is high, it may be desirable to construct only one estimator for each accepted state. To formalise such an algorithm, we consider the 'jump chain' representation of the approximate marginal chain [cf. 19,23,27].
Definition 4 (Jump chain). Suppose that (Θ_k)_{k≥1} is Harris ergodic with respect to π_a. The corresponding jump chain (Θ̃_k)_{k≥1} with holding times (N_k)_{k≥1} is defined as

Θ̃_k := Θ_{Ñ_{k-1}+1} and N_k := min{ j ≥ 1 : Θ_{Ñ_{k-1}+j} ≠ Θ_{Ñ_{k-1}+j+1} },

where Ñ_k := Σ_{j=1}^k N_j and Ñ_0 ≡ 0.

Remark 10. If (Θ_k)_{k≥1} corresponds to a Metropolis-Hastings chain with non-diagonal proposal distributions q (that is, q(θ, {θ}) = 0 for every θ ∈ T), then the jump chain (Θ̃_k) consists of the accepted states, and N_k − 1 is the number of rejections that occurred at the state Θ̃_k.
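Extracting the jump chain and holding times from a stored MCMC trace is a single pass over the output; the helper below is an illustrative sketch, with states compared by equality as in the non-diagonal proposal setting of Remark 10.

```python
def jump_chain(chain):
    """Return (accepted states, holding times) for a stored MCMC trace."""
    states, holds = [], []
    for x in chain:
        if states and x == states[-1]:
            holds[-1] += 1          # repeat: a rejection extends the holding time
        else:
            states.append(x)        # a new value starts a block (accepted state)
            holds.append(1)
    return states, holds

trace = [1.0, 1.0, 2.5, 2.5, 2.5, -0.3, 1.0]
states, holds = jump_chain(trace)

# A weighted average over the jump chain, with weights N_k, reproduces the
# plain ergodic average over the original trace exactly.
avg_trace = sum(trace) / len(trace)
avg_jump = sum(n * s for s, n in zip(states, holds)) / sum(holds)
```

The exact identity between the two averages is what makes the jump chain representation a free reorganisation of the computation rather than an approximation.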
Hereafter, we denote by α(θ) := P(Θ_{k+1} ≠ Θ_k | Θ_k = θ) the overall acceptance probability at θ. We consider next the practically important 'jump IS' estimator, involving one proper weighting for each accepted state.
Assumption 3. Suppose that Assumption 1 holds, and let (Θ̃_k, N_k)_{k≥1} denote the corresponding jump chain (Definition 4). Let (ξ_k)_{k≥1} be an L-proper weighting scheme, where the variables (M_k, W_k^(1:M_k), X_k^(1:M_k)) in the scheme are now allowed to depend on both Θ̃_k and N_k, and the conditions (i) and (ii) in Definition 2 are replaced with the following:
(i) E[ξ_k(f) | Θ̃_k = θ, N_k = j] = w_u(θ)f*(θ) for all j ≥ 1; and
(ii) m̃^(1)(θ) := E[N_k |ξ_k(f)| | Θ̃_k = θ] < ∞ for π_a-almost every θ.

Theorem 11. Suppose Assumption 3 holds. Then,

Σ_{k=1}^n N_k ξ_k(f) / Σ_{k=1}^n N_k ξ_k(1) → π(f) almost surely as n → ∞.   (4)

The proof follows from Lemma 1, because (Θ̃_k) is Harris ergodic with invariant probability π̃_a(θ) ∝ π_a(θ)α(θ); see Proposition 27 in Appendix C. Furthermore, the holding times N_k ≥ 1 are, conditional on (Θ̃_k), independent geometric random variables with parameter α(Θ̃_k) (Proposition 27), and therefore E[N_k | Θ̃_k = θ] = 1/α(θ).
Remark 12. Regarding Assumption 3:
(i) Condition (ii) in Assumption 3 is practically convenient, because the ξ_k are usually chosen either as independent of N_k, or increasingly accurate in N_k (often taking M_k proportional to N_k); see the discussion below. However, (ii) is not optimal: it is not hard to find examples where the estimator is strongly consistent even though m̃^(1)(θ) = ∞ for some θ ∈ T.
(ii) In case each ξ_k is constructed as a mean of independent (ξ_{k,1}, …, ξ_{k,N_k}) (cf. Proposition 6), the jump chain estimator coincides with the simple estimator discussed in Section 5 (at jump times). However, the jump chain estimator offers more flexibility, which may allow for variance reduction, for instance by using a single mN_k-particle filter (cf. Section 8) instead of an average of N_k independent m-particle filters, or by stratification or control variates.
(iii) Even though we believe that estimators of the form (4) are often appropriate, we note that in some cases Rao-Blackwellised lower-variance estimators of 1/α(Θ̃_k) may be used instead of N_k, as suggested in [23].
Let us finally consider a central limit theorem corresponding to the estimator in Theorem 11, whose proof is given in Appendix C.
Theorem 13. Suppose Assumption 3 holds, f ∈ L ∩ L²(π), and

π_a(αṽ) < ∞,   (5)

where ṽ is the jump chain analogue of the conditional variance v in Theorem 7, and either P is reversible with Var(μ_f̄, P) < ∞, or there exists a solution g ∈ L²(π_a) of the Poisson equation g − Pg = μ_f̄. Then, the estimator of Theorem 11 satisfies a CLT, where the limiting variance can be given as

σ̃_f² := [Var(μ_f̄, P) + π_a(αṽ)] / π_a(μ_1)².

Let us briefly discuss the conditions and implications of Theorem 13 in certain specific cases. When the acceptance probability is bounded from below, inf_θ α(θ) > 0, using a proper weighting with π_a(b) < ∞, where b denotes the conditional second moment of the weighting, guarantees (5). For example, if (Θ_k)_{k≥1} is L²-geometrically ergodic, then the acceptance probability is (essentially) bounded away from zero [86], and g := Σ_{k≥0} P^k μ_f̄ ∈ L²(π_a) satisfies g − Pg = μ_f̄, so that the non-reversible condition is satisfied. When ξ_k corresponds to an average of i.i.d. ξ_{k,1}, …, ξ_{k,N_k} (cf. Proposition 6) which do not depend on N_k, we have π_a(αṽ) = π_a(v), which leads to an asymptotic variance that coincides with the simple IS correction (cf. Theorem 7).
Remark 14. In the non-reversible case, our CLT only applies when a solution g ∈ L²(π_a) to the Poisson equation g − Pg = μ_f̄ exists. We believe that the result holds more generally, but this requires showing that the jump chain (Θ̃_k)_{k≥1} inherits a central limit theorem from the base chain (Θ_k)_{k≥1} under more general conditions.

Pseudo-marginal approximate chain
We next discuss how our limiting results still apply when the approximate chain is a pseudo-marginal MCMC, as discussed in Section 2.2. Let us next formalise a pseudo-marginal chain (Θ_k, Φ_k)_{k≥1}, whose transition proposes (Θ̃_k, Φ̃_k) from q(Θ_{k-1}, ·)Q_a(Θ̃_k, ·) and accepts with probability

min{ 1, [U(Φ̃_k) q(Θ̃_k, Θ_{k-1})] / [U(Φ_{k-1}) q(Θ_{k-1}, Θ̃_k)] }.

Above, Q_a(θ, ·) defines a (regular conditional) distribution on (a measurable space) S_Φ, and U : S_Φ → R_+ is a (measurable) function. Under the following condition, the Markov chain (Θ_k, Φ_k)_{k≥1} is reversible with respect to the probability measure π_a•(dθ, dφ) := dθ Q_a(θ, dφ)U(φ)/c_a, which admits the marginal π_a(θ) [e.g. 6]:

Assumption 4. There exists a constant c_a > 0 such that for each θ ∈ T, the random variable Φ_θ ∼ Q_a(θ, ·) satisfies E[U(Φ_θ)] = c_a π_a(θ).

In addition, (Θ_k, Φ_k)_{k≥1} is easily shown to be Harris ergodic under minimal conditions. Let us consider next an abstract minimal condition which ensures consistency of an IS type estimator. We discuss practically relevant sufficient conditions later in Proposition 17.

Let us finally consider different conditions, which guarantee Assumption 5 (i); the integrability Assumption 5 (ii) may be shown similarly.
Proposition 17. Assumption 5 (i) holds if one of the following holds:
(i) For π_a-a.e. θ ∈ T, U(Φ_θ) > 0 almost surely.
(ii) The ζ_k depend only on Θ_k, and (7) holds for π_a-a.e. θ ∈ T.
(iii) For π_a-a.e. θ ∈ T, (7) holds, and
In the case of (ii), we set p(θ) := P(U(Φ_θ) > 0).

Remark 18. Proposition 17 (i) is the most straightforward in the latent variable context, and often sufficient, since we may choose a positive U(φ) (e.g. by considering the inflated U_ε(φ) := U(φ) + ε for some ε > 0 instead). Proposition 17 (ii) may be used directly to verify the validity of an MCMC version of the lazy ABC algorithm [81]. It also demonstrates why positivity plays a key role: if only (7) is assumed and p(θ) is non-constant, then p(θ) must be accounted for, or else we end up with biased estimators targeting a marginal proportional to π_m(θ)p(θ). Proposition 17 (iii) demonstrates that strict positivity is not necessary, but in this case a delicate dependency structure is required.

General state space models and sequential Monte Carlo
State space models (SSMs) are latent variable models which are commonly used in time series analysis [cf. 12]. In the setting of Section 2, SSMs are parametrised by θ ∈ T, with x = z_{1:T} ∈ X = S_z^T and y = y_{1:T} ∈ Y = S_y^T, and where, by convention,

μ^(θ)(z_{1:T}) = μ_1^(θ)(z_1) Π_{t=2}^T μ_t^(θ)(z_t | z_{t-1}) and g^(θ)(y_{1:T} | z_{1:T}) = Π_{t=1}^T g_t^(θ)(y_t | z_t).

That is, the latent states Z_{1:T} form a Markov chain with initial density μ_1^(θ) and transition densities μ_t^(θ), and the observations are conditionally independent given the latent states. This section reviews general techniques to generate random variables (V^(i), X^(i))_{i=1:m} such that

E[ (1/m) Σ_{i=1}^m V^(i) h(X^(i)) ] = ∫ p^(θ)(z_{1:T}, y_{1:T}) h(z_{1:T}) dz_{1:T}   (8)

for any θ and for some class of functions h : S_z^T → R. These random variables may be used in order to construct a proper weighting; see Corollary 21 below.
Simple IS correction may be applied directly (see Proposition 4). Note that (8) is satisfied for all integrable h, so L = L 1 (π). It is often useful to combine such schemes as in Proposition 6, allowing for instance variance reduction by using pairs of antithetic variables [29].
For the rest of the section, we focus on the particle filter (PF) algorithm [43]; see also the monographs [12,16,24]. We consider a generic version of the algorithm (Algorithm 1), with the following components [cf. 16]:
(i) Proposal distributions: M_1 is a probability density on S_z, and M_t(· | z_{1:t-1}) defines conditional densities on S_z given z_{1:t-1} ∈ S_z^{t-1}.
(ii) Potential functions: G_t : S_z^t → R_+.
(iii) Resampling laws: Res(· | ω̃^(1:m)) defines a probability distribution on {1:m}^m for every discrete probability mass vector ω̃^(1:m).

The following two conditions are minimal for consistency.

Assumption 6. Suppose that the following hold:
(i) The proposals and potentials are compatible with the model, in the sense that M_1(z_1) Π_{t=2}^T M_t(z_t | z_{1:t-1}) Π_{t=1}^T G_t(z_{1:t}) = p^(θ)(z_{1:T}, y_{1:T}).
(ii) The resampling is unbiased, that is, E[#{i : A_t^(i) = j} | ω̃^(1:m)] = m ω̃^(j) for any j ∈ {1:m} and any probability mass vector ω̃^(1:m).

Assumption 6 (i) holds with the traditionally used 'filtering' potentials G_t(z_{1:t}) = g_t^(θ)(y_t | z_t) and 'bootstrap' proposals M_t(z_t | z_{1:t-1}) = μ_t^(θ)(z_t | z_{t-1}), assuming a suitable support condition. We discuss another choice of M_t and G_t in Section 9, inspired by the 'twisted SSM' approach of [45]. It allows a 'look-ahead' strategy based on approximations of the full smoothing distributions q^(θ)(z_{1:T} | y_{1:T}). Assumption 6 (ii) allows for multinomial resampling, where the A_t^(i) are drawn i.i.d. with P(A_t^(i) = j) = ω̃^(j).

Remark 19. If ω*_t = 0, then Algorithm 1 is terminated immediately, and all the estimators considered (cf. Proposition 20) equal zero.
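A minimal bootstrap PF (filtering potentials G_t = g_t, proposals M_t = μ_t, multinomial resampling) illustrates the ingredients above; the scalar linear-Gaussian model is an illustrative assumption, chosen so that the exact likelihood is available from the Kalman filter for comparison with the PF's unbiased estimate.

```python
import math, random

random.seed(6)

def npdf_log(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Scalar model: z_1 ~ N(0,1), z_t = 0.9 z_{t-1} + N(0,0.5), y_t = z_t + N(0,1).
a, q, r = 0.9, 0.5, 1.0
ys = [0.3, -0.2, 0.6, 1.1, 0.4]

def kalman_loglik(ys):
    # Exact log-likelihood via the Kalman filter (predict-update recursion)
    m, P, ll = 0.0, 1.0, 0.0
    for y in ys:
        S = P + r                       # predictive variance of y_t
        ll += npdf_log(y, m, S)
        K = P / S                       # Kalman gain
        m, P = m + K * (y - m), (1 - K) * P
        m, P = a * m, a * a * P + q     # predict the next state
    return ll

def pf_loglik(ys, m_particles=500):
    # Bootstrap PF; the product of average unnormalised weights is unbiased for L
    xs = [random.gauss(0.0, 1.0) for _ in range(m_particles)]
    ll = 0.0
    for y in ys:
        ws = [math.exp(npdf_log(y, x, r)) for x in xs]   # potentials G_t
        ll += math.log(sum(ws) / m_particles)            # factor omega*_t
        anc = random.choices(xs, weights=ws, k=m_particles)  # multinomial resampling
        xs = [a * x + random.gauss(0.0, math.sqrt(q)) for x in anc]
    return ll

exact = kalman_loglik(ys)
estimate = pf_loglik(ys)
```

The unbiasedness of exp(pf_loglik) is exactly what makes the PF usable both inside pseudo-marginal MCMC and as the V-variables of a proper weighting scheme.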
The following result summarises alternative ways in which the random variables V may be constructed from the PF output, in order to satisfy (8). The results stated below are scattered in the literature [e.g. 16,79], and some may be stated under slightly more stringent conditions, but a self-contained and concise proof of Proposition 20 may be found in [99].
Proposition 20. Let θ ∈ T be fixed, assume Res, M_t and G_t satisfy Assumption 6, and let h : S_z^T → R be such that the integral in (8) is well-defined and finite. Consider the random variables generated by Algorithm 1, with ancestral trajectories z_{1:T}^{(i)} and likelihood estimate U, and define for t ∈ {2:T}, and any i_t, i_{t−1} ∈ {1:m}, the backwards sampling probabilities b_t(i_{t−1} | i_t) ∝ ω_{t−1}^{(i_{t−1})} μ_t^{(θ)}(z_t^{(i_t)} | z_{t−1}^{(i_{t−1})}). (i) The filter-smoother estimator, which averages h over the terminal particles' ancestral trajectories with their normalised terminal weights, satisfies (8). (ii) Let I_{1:T} be random indices generated recursively backwards, with I_T drawn proportionally to the terminal weights and I_t ∼ b_{t+1}( · | I_{t+1}); the resulting backward trajectories yield random variables satisfying (8). (iii)-(iv) Rao-Blackwellised variants are available for functions depending on a single marginal (pair). The estimator in Proposition 20 (i) was called the filter-smoother in [59]. This property was shown in [16, Theorem 7.4.2] in the case of multinomial resampling, and extended later [cf. 2]. The statement holds also when the PF is applied with a general sequence of distributions rather than the SSM [16]. Proposition 20 (ii) corresponds to backwards simulation smoothing [41]. Drawing a single backward trajectory is, perhaps surprisingly, probabilistically equivalent to subsampling one trajectory from the filter-smoother estimate in Proposition 20 (i) [26]. However, drawing several trajectories independently as in Proposition 20 (ii) may lead to lower variance estimators. Proposition 20 (iii) and its special case (iv) correspond to the forward-backward smoother [25]; see also [12]. It is a Rao-Blackwellised version of (ii), but applicable only when considering estimates of a single marginal (pair). This scheme can lead to lower variance, but its quadratic complexity in m makes it inefficient for large m.
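The backward simulation of Proposition 20 (ii) can be sketched as follows, assuming stored particles and normalised filter weights from a PF run, and a Gaussian random-walk transition μ_t (an illustrative choice of ours):

```python
import numpy as np

def backward_sample(zs, ws, sigma_x=1.0, seed=None):
    """Draw one smoothing trajectory from stored PF particles by backward
    simulation: I_T is drawn proportionally to the terminal weights, and
    I_{t-1} ~ b_t(. | I_t) with b_t(i | j) proportional to
    w_{t-1}^{(i)} * mu_t(z_t^{(j)} | z_{t-1}^{(i)}),
    here for the random-walk transition mu_t = N(z_{t-1}, sigma_x^2).
    zs, ws: (T, m) arrays of particles and normalised filter weights."""
    rng = np.random.default_rng(seed)
    T, m = zs.shape
    idx = rng.choice(m, p=ws[-1])           # I_T ~ terminal weights
    traj = [zs[-1, idx]]
    for t in range(T - 1, 0, -1):
        # unnormalised backward probabilities for I_{t-1} given I_t = idx
        trans = np.exp(-0.5 * ((zs[t, idx] - zs[t - 1]) / sigma_x) ** 2)
        b = ws[t - 1] * trans
        b /= b.sum()
        idx = rng.choice(m, p=b)
        traj.append(zs[t - 1, idx])
    return np.array(traj[::-1])
```

Drawing several independent trajectories with this routine gives the lower-variance option mentioned after Proposition 20 (ii).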
We next formally state how Proposition 20 allows us to use Algorithm 1 to derive a proper weighting scheme. Corollary 21. The weighted samples with W_k := pr(Θ_k) V_k / π_a(Θ_k) provide a proper weighting scheme for the target distribution π(θ, x_{1:T}) = p(θ, x_{1:T} | y_{1:T}) (Definition 2), for the following classes of functions, respectively. In case (Θ_k, U_k)_{k≥1} is a pseudo-marginal algorithm, W_k := pr(Θ_k) V_k^{(i)} / U_k. Remark 22. The latter two cases in Corollary 21 are stated for a single marginal (pair), but it is clear that we may calculate estimates simultaneously for several marginals (pairs), so that Proposition 20 (iii) is applicable for every function of the corresponding additive form. Finally, we state an implication of Proposition 20 outside the main focus of this paper, in the general SSM smoothing context (with fixed θ). This result is widely known among particle filtering experts, but appears not to be widely adopted. Proposition 23 states that the corresponding IS estimator E_n(h) is strongly consistent, and (ii) if also σ² < ∞, that a central limit theorem holds. The proof is similar to that of Theorem 9 in Appendix B.
The estimator E_n(h) in Proposition 23 is an importance sampling analogue of the particle independent Metropolis-Hastings (PIMH) algorithm suggested in [2]. Unlike PIMH, the calculation of E_n(h) is parallelisable, and allows for straightforward consistent confidence intervals E_n(f) ± β √v_n, where β corresponds to the desired standard Gaussian quantile. The calculation of consistent confidence intervals for a single realisation of a particle smoothing algorithm requires sophisticated techniques [63]. Another promising method, recently suggested in [50], relies on unbiased estimators obtained by coupling of conditional sequential Monte Carlo and debiasing tricks as in [39,71,84].
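A minimal sketch of such a weighted estimate with a plug-in confidence interval, assuming independent replicates as in the setting of Proposition 23 and ignoring serial dependence (function and variable names are ours):

```python
import numpy as np

def is_estimate_ci(w, v, beta=1.96):
    """Self-normalised IS-type estimate E_n = sum(w*v)/sum(w) with a
    plug-in confidence interval E_n +/- beta*sqrt(v_n), where v_n is a
    delta-method variance estimate based on the weighted residuals.
    w: nonnegative weights; v: per-term estimates V_k(h), assumed
    independent in this sketch."""
    w = np.asarray(w, float)
    v = np.asarray(v, float)
    n = len(w)
    e = np.sum(w * v) / np.sum(w)
    resid = w * (v - e) / w.mean()          # influence-function residuals
    v_n = np.mean(resid ** 2) / n
    half = beta * np.sqrt(v_n)
    return e, (e - half, e + half)
```

With equal weights this reduces to the sample mean with a standard normal-approximation interval.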

State space models with linear-Gaussian state dynamics
We consider a special case of the general SSM in Section 8, where both S_z and S_y are Euclidean and the state densities μ_t^{(θ)} are linear-Gaussian, but the observation models g_t^{(θ)} may be non-linear and/or non-Gaussian. Our setting covers exponential family observation models with Gaussian, Poisson, binomial, negative binomial, and Gamma distributions, as well as a stochastic volatility model. This class contains a large number of commonly used models, such as structural time series models, cubic splines, generalised linear mixed models, and classical autoregressive integrated moving average models. 9.1. Marginal approximation. The scheme we consider here is based on [28,90], and relies on a Laplace approximation p_a^{(θ)}(z_{1:T} | y_{1:T}). The linear-Gaussian terms g̃_t approximate g_t in terms of pseudo-observations ỹ_t^{(θ)} and pseudo-covariances R_t^{(θ)}, which are found by an iterative process, which we detail next for a fixed θ. Assume that z̄_{1:T} is an initial estimate for the mode ẑ_{1:T}^{(θ)} of p^{(θ)}(z_{1:T} | y_{1:T}), and iterate the following: (i) Determine ỹ_t^{(θ)} and R_t^{(θ)} by matching the first two derivatives of log g̃_t^{(θ)}(ỹ_t^{(θ)} | z_t) with those of log g_t^{(θ)}(y_t | z_t) at z_t = z̄_t. (ii) Run the Kalman filter and smoother for the model with g_t^{(θ)} replaced by g̃_t^{(θ)}, and update z̄_{1:T} to the smoothed mean. At convergence, z̄_{1:T} = ẑ_{1:T}^{(θ)}, and the Kalman filter yields the marginal likelihood L̃_a(θ) of the approximate model. When p_a^{(θ)}(z_{1:T} | y_{1:T}) is close to p^{(θ)}(z_{1:T} | y_{1:T}), the expectation in (9) is close to one. Our approximation is based on dropping the expectation in (9): L_a(θ) := L̃_a(θ) g^{(θ)}(y_{1:T} | ẑ_{1:T}) / g̃^{(θ)}(ỹ_{1:T}^{(θ)} | ẑ_{1:T}). The same approximate likelihood L_a(θ) was also used in a maximum likelihood setting by [31], as an initial objective function before more expensive importance sampling based maximisation was done.
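For a Poisson observation with log link, the derivative-matching step (i) has a closed form. The following sketch (our notation, dropping constants in y) matches a Gaussian N(ỹ; z, R) to the Poisson log-density log g(y|z) = y z − e^z by a second-order Taylor expansion at the current estimate z̄:

```python
import numpy as np

def poisson_pseudo_obs(y, z_bar):
    """Pseudo-observation and pseudo-variance for a Poisson observation
    with log link, log g(y|z) = y*z - exp(z) (up to a constant).
    Matching first and second derivatives at z_bar gives
        R       = exp(-z_bar)
        y_tilde = z_bar + (y - exp(z_bar)) * exp(-z_bar).
    Iterating this with a Kalman smoother pass locates the mode z_hat."""
    R = np.exp(-z_bar)
    y_tilde = z_bar + (y - np.exp(z_bar)) * R
    return y_tilde, R
```

The test below checks the derivative matching: the Gaussian score (ỹ − z̄)/R equals the Poisson score y − e^{z̄}, and the curvatures agree.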
The evaluation of the approximation L_a(θ) above requires a reconstruction of the Laplace approximation for each value of θ. We call this the local approximation, and consider also a faster global approximation variant, where the pseudo-observations and covariances are constructed only once, at the maximum likelihood estimate of θ.

9.2.
Proper weighting schemes. The simplest approach to construct a proper weighting scheme based on the Laplace approximations is to use the approximate smoothing distribution p_a^{(θ)}(z_{1:T} | y_{1:T}) as the IS proposal. Such a scheme, using the simulation smoother [30] with antithetic variables, we call SPDK, following [90].
We consider also several variants of M_t and G_t in the particle filter discussed in Section 8. The bootstrap filter [43], abbreviated as BSF, uses the state transitions as proposals, M_t = μ_t^{(θ)}, with potentials G_t = g_t^{(θ)}(y_t | · ), and hence does not rely on an approximation. Inspired by the developments in [45,101], we consider also the choice of M_t and G_t based on the approximate smoothing distribution p_a^{(θ)}(z_{1:T} | y_{1:T}). This would be optimal in our setting if the G_t were constants [45]. As they are often approximately so, we believe that this choice, which we call ψ-APF following [45], can provide substantial benefits over BSF.

Discretely observed diffusions
In many applications, for instance in finance or physical systems modelling, the SSM state transitions arise naturally from a continuous-time diffusion model, such as dZ̃_t = m^{(θ)}(Z̃_t) dt + σ^{(θ)}(Z̃_t) dB_t, where B_t is a (vector-valued) Brownian motion and where m^{(θ)} and σ^{(θ)} are functions (vector- and matrix-valued, respectively). The latent variables X = (Z_1, . . . , Z_T) are assumed to follow the law of (Z̃_{t_1}, . . . , Z̃_{t_T}), so μ_k^{(θ)} would ideally be the transition densities of Z̃_{t_k} given Z̃_{t_{k−1}}. These transition densities are generally unavailable (for non-linear diffusions), but standard time-discretisation schemes allow for straightforward approximate simulation [cf. 60]. The denser the time-discretisation mesh used, the less bias is introduced. However, the computational complexity of the simulation is higher, generally proportional to the size of the mesh.
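The bias/cost trade-off can be illustrated with a minimal Euler-Maruyama sketch; the paper's experiments use the (higher-order) Milstein scheme, so this simpler scheme and all names below are for illustration only:

```python
import numpy as np

def euler_maruyama(z0, drift, diff, t, n_steps, seed=None):
    """Simulate one path of dZ = m(Z) dt + s(Z) dB on [0, t] with an
    n_steps-point uniform mesh (Euler-Maruyama). A denser mesh reduces
    the discretisation bias but increases the simulation cost linearly,
    which is the coarse/fine trade-off exploited in Section 10."""
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    z = z0
    for _ in range(n_steps):
        z = z + drift(z) * dt + diff(z) * np.sqrt(dt) * rng.standard_normal()
    return z
```

With zero diffusion the scheme reduces to deterministic Euler, where the shrinking bias of finer meshes is easy to verify against the exact exponential solution.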
The MCMC-IS may be applied to speed up the inference of discretely observed diffusions by the following simple two-level approach. The 'true' state transitions μ_t^{(θ)} are based on 'fine enough' discretisations, which are assumed to ensure a negligible bias, but which are expensive to simulate. A cheaper 'coarse' discretisation corresponds to transitions μ̃_t^{(θ)}. Because neither of the models admits exact calculations, we may only use a pseudo-marginal approximate chain (as discussed in Sections 2.2 and 7). More specifically, we may use the bootstrap filter (Section 8) with the coarse SSM (μ̃_t^{(θ)}, g_t^{(θ)}) to generate the likelihood estimators Ũ_k in Phase 1', and in Phase 2', we may use bootstrap filters for the fine SSM (μ_t^{(θ)}, g_t^{(θ)}). Assuming that the observation model satisfies g_t^{(θ)} > 0 guarantees the validity of this scheme, because then Ũ_k > 0 (see Proposition 17 (i)). It is most straightforward to simulate the bootstrap filters in Phases 1' and 2' independently of each other, but they may be made dependent as well, by using a coupling strategy [cf. 89]. The correction phase could also be based on exact sampling for diffusions [10], which allows for elimination of the discretisation bias entirely.
The recent work [34] details how unbiased inference is also possible with IS type correction, using randomised multilevel Monte Carlo.

Experiments
We did experiments for our generic framework with SSMs, using Laplace approximations (Section 9) and an approximation based on coarsely discretised diffusions (Section 10). We compared several approaches in our experiments:
AI: Approximate inference with MCMC targeting π_a(θ), and for each accepted Θ_k, sampling one realisation from p_a^{(Θ_k)}(z_{1:T} | y_{1:T}).
PM: Pseudo-marginal MCMC with m samples targeting directly π(θ, x).
DA: Two-level delayed acceptance pseudo-marginal MCMC with first-stage acceptance based on π_a(θ) and with target π(θ, x).
IS1: Jump chain IS correction with m N_k samples for each accepted Θ̃_k.
IS2: Jump chain IS correction with m samples for each accepted Θ̃_k.
The IS1 algorithm is similar to the simple IS estimator (1), but is expected to be generally safer; see Remark 12 (ii). Except for AI, all the algorithms are asymptotically exact. Ignoring the effects of parallel implementation, the average computational complexities, or costs, of DA and IS2 are roughly comparable, and we have a similar pairing between PM and IS1. However, as the weighting in the IS methods is based only on the post-burn-in chain, the IS methods are generally somewhat faster.
We used a random walk Metropolis algorithm for π a with a Gaussian proposal distribution, whose covariance was adapted during burn-in following [97], targeting the acceptance rate 0.234. In DA, the adaptation was based on the first stage acceptance probability.
All the experiments were conducted in R [83] using the bssm package which is available online [48]. The experiments were run on a Linux server with eight octa-core Intel Xeon E7-8837 2.67GHz processors with total 1TB of RAM.
In each experiment, we calculated the Monte Carlo estimates several times independently, and the inverse relative efficiency (IRE) was reported. The IRE, defined as the mean square error (MSE) of the estimate multiplied by the average computation time, provides a justified way to compare Monte Carlo algorithms with different costs [40].
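The IRE computation itself is a one-liner; the sketch below (names ours) shows how multiplying the MSE by the average time makes methods with different costs comparable:

```python
import numpy as np

def ire(estimates, truth, avg_time):
    """Inverse relative efficiency [40]: MSE of the replicated estimates
    times the average computation time. Lower is better; the time factor
    penalises slow methods so different-cost algorithms are comparable."""
    est = np.asarray(estimates, float)
    mse = np.mean((est - truth) ** 2)
    return mse * avg_time
```

For example, a method that is five times slower but ten times more accurate (in MSE) still has a lower IRE.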
Further details and results of the experiments may be found in the preprint version of our article [99].
11.1. Laplace approximations. In the case of Laplace approximations, the maximum likelihood estimate of θ was always used as the starting value of MCMC. We used sub-sampling as in Proposition 5, and sampled one trajectory Z_{1:T} per each accepted state. We tested the exact methods with three different IS correction schemes, SPDK, BSF and ψ-APF, described in Section 9.2. For BSF and ψ-APF, the filter-smoother estimates as in Proposition 20 (i) were used. When calculating the MSE, we used the average over all estimates from all unbiased algorithms as the ground truth.
For all the exact methods, we chose the IS accuracy parameter m based on a pilot experiment, following the guidelines for optimal tuning of pseudo-marginal MCMC in [27]. More specifically, m was set so that the standard deviation of the logarithm of the likelihood estimate, denoted by δ, was around 1.2 in the neighbourhood of the posterior mean of θ. We kept the same m for all methods, for comparability, even though in some cases the optimal choice might differ [91].
11.1.1. Poisson observations. Our first model has Poisson observations g_t^{(θ)}(y_t | z_t) = Poisson(y_t ; e^{u_t}), with a two-dimensional linear-Gaussian state Z_t = (U_t, V_t), driven by independent noise terms ξ_t, η_t ∼ N(0, 1) scaled by the parameters θ = (σ_η, σ_ξ), and with Z_1 = (U_1, V_1) ∼ N(0, 0.1 I). For testing our algorithms, we simulated a single set of observations y_{1:100} from this model with Z_1 = (0, 0) and θ = (σ_η, σ_ξ) = (0.1, 0.01). We used a uniform prior U(0, 2s) for the parameters, where the cut-off parameter s was set to 1.6 based on the sample standard deviation of log(y_{1:T}), where zeros were replaced with 0.1. Results were not sensitive to this upper bound. Based on a pilot optimisation, we set m = 10 for SPDK and ψ-APF, leading to δ ≈ 0.1, and m = 200 for BSF with δ ≈ 1.2. For all algorithms, we used 100,000 MCMC iterations with the first half discarded as burn-in. We ran all the algorithms independently 1000 times. Table 1 shows the IREs, which are re-scaled such that all IREs of PM-BSF equal one. The overall acceptance rate of DA-BSF was around 0.104, and 0.234 for all others. All exact methods led to essentially the same overall mean estimates (0.093, 0.016, −0.075, 2.618–2.619) for (σ_η, σ_ξ, u_1, u_100), in contrast with AI showing some bias in (u_1, u_100), with overall mean estimates (−0.064, 2.629) and (−0.065, 2.631) with the local and global approximation, respectively. IS2-BSF outperformed DA-BSF by about a factor of two in terms of IRE, because of the burn-in benefit. Similarly, IS1-BSF outperformed PM-BSF by a clear margin. With SPDK and ψ-APF, IS1 and IS2 outperformed the PM and DA alternatives, but with a smaller margin, because of smaller overall execution times. There were no significant differences between the SEs of the local and global variants, but the global one was faster, leading to smaller IREs.
11.1.2. Stochastic volatility model. Our second illustration is more challenging, involving analysis of a real time series: the daily log-returns of the S&P index from 4/1/1995 to 28/9/2016, with a total number of observations T = 5473.
The data was analysed with a stochastic volatility model, where the log-volatility follows a stationary first-order autoregression with parameters (φ, σ_η, ν). We used a uniform prior on [−0.9999, 0.9999] for φ, a half-Gaussian prior with standard deviation 5 for σ_η, and a zero-mean Gaussian prior with standard deviation 5 for ν. SPDK was expected to be problematic, due to its well-known exponential deterioration in T, unlike the particle filter, which often scales much better in T [100]. In addition, it is known that for this particular model, the importance weights may have large variability [61,80]. While in principle the ψ-APF may also be affected by such fluctuations, we did not observe any problems with it in our experiments.
Based on our pilot experiment, we chose m = 10 for ψ-APF, m = 70 for SPDK and m = 300 for BSF, which all led to δ ≈ 1.1. We used 100,000 MCMC iterations with the first half discarded as burn-in, and 100 independent replications. The IREs, re-scaled here with respect to DA-BSF, are shown in Table 2. The PM and IS1 were not tested because of their high costs. The results with the global approximation are shown only for AI, and indicate significant computational savings. The parallelisation with 8 cores dropped the execution time nearly ideally. The total acceptance rates were 0.1 for DA-BSF, PM-SPDK and DA-ψ-APF, 0.06 for DA-SPDK, and 0.15 for PM-ψ-APF.
As in the Poisson experiment, the overall means of the exact methods were close to each other, but AI had some bias, this time also in some of the hyperparameters (σ_η and ν). The IS1 and IS2 methods outperformed the PM and DA methods similarly as in the Poisson experiment. Due to a much smaller m, DA-SPDK and DA-ψ-APF were an order of magnitude faster than DA-BSF. Diagnostics from the individual runs of PM-SPDK and DA-SPDK sometimes showed poor mixing, and despite the large reductions in execution time, the IREs were worse than for PM-BSF. We also observed cases with a few very large correction weights in IS1-SPDK and IS2-SPDK, which had some impact also on their efficiencies. The SEs of DA-ψ-APF were comparable with those of DA-BSF. We did not experience problems with mixing or overly large weights with ψ-APF, which suggests that ψ-APF is more robust than SPDK. There were no significant differences in the SEs between the exact methods when using the local and global approximation schemes.
11.2. Discretely observed geometric Brownian motion. Our last experiment concerned a discretely observed diffusion as discussed in Section 10. The model was a geometric Brownian motion, dZ̃_t = ν Z̃_t dt + σ_x Z̃_t dB_t with Z̃_0 ≡ 1, observed with noisy log-observations Y_k | (Z_k = z) ∼ N(log(z), σ_y²), where (B_t)_{t≥0} stands for the standard Brownian motion and Z_k = Z̃_k. The discretisations μ_t^{(θ)} and μ̃_t^{(θ)} were based on a Milstein discretisation with uniform meshes of sizes 2^{L_F} and 2^{L_C}, respectively, with L_C = 4 and L_F = 16, reflected to positive values. We did not consider optimising L_C and L_F, but rather aimed to illustrate the potential gains for the IS2 algorithm from parallelisation. The data was a single simulated realisation of length 50 from the exact model, with ν = 0.05, σ_x = 0.3, and σ_y = 1. We used a half-Gaussian prior with s.d. 0.1 for ν, a half-Gaussian prior with s.d. 0.5 for σ_x, and an N(1.5, 0.5²) prior truncated to values greater than 0.5 for σ_y.
For both IS2 and DA, and both levels, we used m = 50 which led to δ ≈ 0.6.
Assuming a unit cost for each step in the BSF, the total average cost of a parallel IS2 run is n 2^{L_C} + α(n − n_b) 2^{L_F}/M, where α is the mean acceptance rate of the approximate MCMC, n_b is the length of the burn-in, and M is the number of parallel cores used for the weighting. We chose n = 5000, n_b = 2500, M = 48, and the target acceptance rate α = 0.234, leading to an expected 43-fold speed-up due to the parallelisation of IS2.
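The cost formula and the quoted speed-up can be checked directly; this sketch (function name ours) simply evaluates the expression above with M parallel cores versus M = 1:

```python
def is2_cost(n, n_b, L_C, L_F, alpha, M):
    """Average cost of an IS2 run under a unit cost per BSF step:
    approximate chain on the coarse mesh (2**L_C per iteration) plus
    weighting of the accepted post-burn-in states on the fine mesh
    (2**L_F), parallelised over M cores."""
    return n * 2**L_C + alpha * (n - n_b) * 2**L_F / M

serial = is2_cost(5000, 2500, 4, 16, 0.234, 1)
parallel = is2_cost(5000, 2500, 4, 16, 0.234, 48)
speedup = serial / parallel   # roughly the 43-fold figure quoted above
```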
A single run of DA cannot be easily parallelised, but we instead ran multiple independent DA chains in parallel, and averaged their outputs for inference. While such parallelisation may not be optimal, it allows for utilisation of all of the available computational resources. The running time of each DA chain was constrained to be similar to the time required by IS2, leading to n = 100 with n_b = 50. Because of the short runs, we suspected that initial bias could play a significant role, which we explored by running two experiments, with the MCMC initialised either at the prior mean θ_0 = (0.08, 0.4, 1.5), or at an independent sample from the prior. We also experimented with further thinning, by forming the IS2 correction based on every other accepted state. Table 3 summarises the results from 100 replications. The run time of the parallel DA algorithms was defined as the maximum run time over all parallel chains. The parallelisation speed-up of IS2 was nearly optimal, as was the further speed-up from thinning. The SEs with the prior mean initialisation were similar between DA and IS2, but DA produced slightly biased results, leading to 9.5 to 13.0 times higher IREs. The efficiency gains of thinning were inconclusive, indicating some gains for the hyperparameters θ, but not for the state variables. The smaller memory requirements and smaller absolute time requirements still make thinning appealing. With prior sample initialisation, DA sometimes behaved poorly, in contrast with IS2, which behaved similarly with both initialisation strategies.
11.3. Summary of results. In our experiments with Laplace approximations, IS1 and IS2 were competitive alternatives to PM and DA, respectively, even without parallelisation. The differences were more pronounced when the cost of correction (number of samples m) was higher. The ψ-APF was generally preferable over SPDK, and BSF was the least efficient.
The global approximation gave an additional performance boost in our experiments, without compromising the accuracy of the estimates, but we stress that it may not be stable in all scenarios.
As noted earlier, the use of the guidelines of [27] was not necessarily optimal in our setting. We did an additional experiment to inspect how the choice of m affects the IRE with BSF in the Poisson model, and with ψ-APF in the SV model. Figure 1 shows the average IREs as a function of m.
Figure 1. Average IRE of (σ_η, σ_ξ, Z_1, Z_100) in the Poisson model with BSF (left) and of (φ, σ_η, ν, Z_1, Z_5473) in the SV model with ψ-APF (right). DA is shown in black and IS2 in red.
Both IS2 and DA behaved similarly, and IS2 was uniformly below DA in terms of IRE. In the Poisson-BSF case, the choice m = 200 based on [27] appears nearly optimal. In the case of the SV-ψ-APF, the optimal m for both DA and IS2 was around 50, which was higher than the m = 10 based on [27]. This is likely because of the initial overhead cost of the approximation. The discretely observed geometric Brownian motion example illustrated the potential gains which may be achieved by using the IS2 method in a parallel environment. While we admit that our experiment is academic, we believe that it is indicative, and shows that IS2 can provide substantial gains, and makes reliable inference possible in a much shorter time than DA. The IS framework is less prone to issues with burn-in bias, which can be problematic with naive MCMC parallelisation based on independent chains.

Discussion
Our framework of IS type estimators based on approximate marginal MCMC provides a general way to construct consistent estimators. Our experiments demonstrate that the IS estimator can provide substantial speed-up relative to a delayed acceptance (DA) analogue with parallel computing, and appears to be competitive with DA even without parallelisation. We believe that IS is often better than DA in practice, but it is not hard to find simple examples where DA can be arbitrarily better than IS (and vice versa) [35]. Our follow-up work [35] complements our findings with theoretical considerations, including guaranteed asymptotic variance bounds between IS and DA.
IS is known to be difficult to implement efficiently in high dimensions, but this is not a major concern in most latent variable models, where the hyperparameters are low-dimensional. The IS weight may also be easily regularised by inflating the (estimated) approximate likelihood, for instance to L_a(θ) + ε, with some ε > 0. If the likelihood L is bounded, then w_u(θ) ∝ L(θ)/(L_a(θ) + ε) is bounded as well. This approach can be seen as an instance of defensive importance sampling [49]. Other generic safe IS schemes may also be useful [cf. 77], and tempering may be applied to the likelihood as well.
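The boundedness property of the regularised weight is easy to see in a short sketch (function name and data are ours):

```python
import numpy as np

def regularised_weight(L, L_a, eps):
    """Regularised unnormalised IS weight w_u proportional to L/(L_a + eps):
    when the likelihood L is bounded by some L_max, the weight is bounded
    by L_max/eps, regardless of how small the approximation L_a becomes."""
    return np.asarray(L, float) / (np.asarray(L_a, float) + eps)
```

Even when the approximate likelihood badly under-estimates the true one, the regularised weight stays finite and bounded.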
We used adaptive MCMC in order to construct the marginal chain (Θ k ) k≥1 in our experiments, and believe that it is often useful [cf. 5]. Note, however, that our theoretical results do not apply directly with adaptive MCMC, unless the adaptation is stopped after suitable burn-in. Our results could be extended to hold with continuous adaptation, under certain technical conditions. We detailed proper weighting schemes based on standard IS and particle filters. We note that various PF variations, such as Rao-Blackwellisation, alternative resampling strategies [12], or quasi-Monte Carlo updates [36], apply directly. PFs can also be useful beyond the state space models context [17]. Twisted particle filters [1,101] could also be applied, instead of the ψ-APF.
In a diffusion context, a proper weighting can be constructed based on randomised multilevel Monte Carlo, as recently described in [34]. We are currently investigating various other instances of our framework. Laplace approximations are available for a wider class of Gaussian latent variable models beyond SSMs [cf. 87]. Variational approximations [8,56] and expectation propagation [73] have been found useful in a wide variety of models. In the SSM context, various non-linear filters could also be applied [cf. 88]. Our framework provides a generic validation mechanism for approximate inference, where assessment of bias is difficult in general [cf. 76]. Contrary to purely approximate inference, our approach only requires moderately accurate approximations, as demonstrated by our experiment with global Laplace approximations. Debiased MCMC, as suggested in [39] and further explored in [50,51], may also lead to useful proper weighting schemes.
(v) Suppose h : X × S → R is measurable and such that m_h(x) := ∫ Q(x, ds) h(x, s) and (K^n m_h)(x) are well-defined. Then, for any n ≥ 1, ∫ Ǩ^n((x, s), dx′ × ds′) h(x′, s′) = (K^n m_h)(x). Proof. The inheritance of irreducibility measures (i), maximal irreducibility measures (ii), invariant measures (iii), and reversibility is straightforward. For Harris recurrence (iv), let the probability φ_K be a maximal irreducibility measure for K; then φ_Ǩ(dx × ds) := φ_K(dx) Q(x, ds) is the maximal irreducibility measure for Ǩ. Let C ∈ X ⊗ S with φ_Ǩ(C) > 0, and choose ε > 0 such that φ_K(C(ε)) > 0, where C(ε) := {x ∈ X : Q(x, C_x) ≥ ε} with C_x := {s ∈ S : (x, s) ∈ C}, and where τ_k are the hitting times of (X_k) to C(ε). This concludes the proof, because the indicators I(S_{τ_k} ∈ C_{X_{τ_k}}) are independent Bernoulli random variables with success probability at least ε. The converse statement is similar.
For (v), it is enough to notice that for any (x, s) ∈ X × S and n ≥ 1, it holds that Ǩ^n((x, s), dx′ × ds′) = K^n(x, dx′) Q(x′, ds′).
We next state generic results about the asymptotic variance and the central limit theorem of an augmented Markov chain. For h ∈ L_0^2(ν), we denote as above the conditional mean m_h(x) := ∫ Q(x, ds) h(x, s) and the conditional variance v_h(x) := ∫ Q(x, ds) h²(x, s) − m_h²(x). Lemma 25. Let h ∈ L_0^2(ν). The asymptotic variance of an augmented Markov chain satisfies Var(h, Ǩ) = Var(m_h, K) + ν(v_h). Proof. Let (X_k, S_k) be a stationary Markov chain with transition probability Ǩ. Proof. The reversible case (i) follows from Lemma 25 and the Kipnis and Varadhan CLT [58], which implies (10) for the initial distribution ν̌. The jump chain is Harris by Lemma 24 (iv), so [72, Proposition 17.1.6] guarantees (10) for every initial distribution. The case (ii) follows similarly, but relies on a result due to Maxwell and Woodroofe [70], which guarantees (10) for the initial distribution ν̌. Because (a + b)^{1/2} ≤ a^{1/2} + b^{1/2} for a, b ≥ 0 and ν(v_h) < ∞, the claim follows.
The denominator converges to c_w > 0 almost surely, so, by Slutsky's lemma, it is enough to show that the numerator converges in distribution to N(0, Var(f̄, P) + π_a(v)). This follows from Lemma 26 (i) and (ii), under conditions (i) and (ii), respectively.
Proof of Theorem 9. For n large enough such that Σ_{j=1}^n ξ_j(1) > 0, we may write the estimator in ratio form. The denominator converges to c_w², and the numerator can be written as (1/n) Σ_{k=1}^n [ξ_k²(f̄) + ξ_k²(1) D_n² + 2 ξ_k(1) ξ_k(f̄) D_n], with D_n := π(f) − E_n(f).

Appendix C. Proofs about jump chain estimators
In this section, K is assumed to be a Markov kernel on X which is non-degenerate, that is, a(x) := K(x, X \ {x}) > 0 for all x ∈ X. The following proposition complements [23, Lemma 1] and [19], which are stated for more specific cases.
Proposition 27. Suppose (X_k) is a Markov chain with kernel K and (X̃_k) the corresponding jump chain with holding times (N_k) (Definition 4). Then, the following hold: (i) (X̃_k) is Markov with transition kernel K̃(x, A) = K(x, A \ {x})/a(x).
(ii) The holding times (N_k) are conditionally independent given (X̃_k), and each N_k has a geometric distribution with parameter a(X̃_k). (iii) If K admits an invariant probability ν(dx), then K̃ admits the invariant probability ν̃(dx) := ν(dx) a(x)/ν(a). In addition, if K is reversible with respect to ν, then K̃ is reversible with respect to ν̃. (iv) (X̃_k) is ψ-irreducible if and only if (X_k) is ψ-irreducible, with the same maximal irreducibility measure. (v) (X̃_k) is Harris recurrent if and only if (X_k) is Harris recurrent.
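The jump chain decomposition of Proposition 27 can be illustrated with a short sketch (function name ours) that collapses a chain realisation into its distinct consecutive states and holding times:

```python
import numpy as np

def to_jump_chain(path):
    """Collapse a chain realisation into its jump chain (distinct
    consecutive states) and holding times N_k, as in Definition 4.
    Weighting each jump state by N_k recovers plain chain averages,
    which is the basis of the jump chain IS estimators (IS1/IS2)."""
    states, holds = [path[0]], [1]
    for x in path[1:]:
        if x == states[-1]:
            holds[-1] += 1
        else:
            states.append(x)
            holds.append(1)
    return np.array(states), np.array(holds)
```

The N_k-weighted average over the jump chain equals the ordinary average over the original path, as the test below checks.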
We now state results about the asymptotic variance of the jump chain, complementing the reversible case characterisation of [19,27].
Proof. The reversible case (i) is a restatement of [19,Theorem 1].