Unbiased inference for discretely observed hidden Markov model diffusions

We develop a Bayesian inference method for diffusions observed discretely and with noise, which is free of discretisation bias. Unlike existing unbiased inference methods, our method does not rely on exact simulation techniques. Instead, our method uses standard time-discretised approximations of diffusions, such as the Euler--Maruyama scheme. Our approach is based on particle marginal Metropolis--Hastings, a particle filter, randomised multilevel Monte Carlo, and importance sampling type correction of approximate Markov chain Monte Carlo. The resulting estimator leads to inference without a bias from the time-discretisation as the number of Markov chain iterations increases. We give convergence results and recommend allocations for algorithm inputs. Our method admits a straightforward parallelisation, and can be computationally efficient. The user-friendly approach is illustrated on two examples, where the underlying diffusion is an Ornstein--Uhlenbeck process or a geometric Brownian motion.


Introduction
Hidden Markov models (HMMs) are widely used in real applications, for example in financial and physical systems modelling [cf. 5]. We focus on the case where the hidden Markov chain arises from a diffusion process that is observed with noise at a number of discrete points in time [cf. 30]. The parameters associated with the model are static and assigned a prior density. Bayesian inference involves expectations with respect to (w.r.t.) the joint posterior distribution of parameters and states, and is important in problems of model calibration and uncertainty quantification. A difficult part of Bayesian inference for these models is simulation of the diffusion dynamics. Except in special cases where the transition probability is explicitly known [cf. 22, Section 4.4] or exact simulation [3] type methods can be applied [cf. 3,4,10,33], one must time-discretise the diffusion dynamics with an approximation scheme in order to facilitate tractable inference. This is despite the fact that one is ideally interested in inference free of time-discretisation error: unbiased inference.
Our goal is unbiased inference for HMM diffusions. As mentioned above, one approach to unbiased inference is based on exact simulation type methods [3,4,10,33]. At present, exact simulation type methods are mostly applicable only to one-dimensional models where the Lamperti transformation [cf. 24] can be applied (cf. [25,28,33] for reviews). In contrast, we proceed with an Euler--Maruyama [cf. 22] (henceforth referred to as Euler) or similar time-discretisation of the diffusion, which is generally applicable.
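To fix ideas, the Euler scheme we rely on can be sketched in a few lines. The following Python helper is a minimal illustration for a scalar SDE dX_t = a(X_t)dt + b(X_t)dW_t; the function name and the Ornstein--Uhlenbeck test case are ours, not part of the paper's development.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, T=1.0, level=4, rng=None):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW on [0, T]
    with step size h = T * 2**(-level). Hypothetical helper for illustration."""
    rng = np.random.default_rng(rng)
    h = T * 2.0 ** (-level)
    n = int(round(T / h))
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(h))   # Brownian increment over one step
        x[k + 1] = x[k] + drift(x[k]) * h + diffusion(x[k]) * dW
    return x

# Ornstein--Uhlenbeck example: dX = -0.5 X dt + 0.3 dW, started at 1
path = euler_maruyama(1.0, lambda x: -0.5 * x, lambda x: 0.3, level=6, rng=0)
```

Halving the step (increasing `level` by one) is exactly the refinement indexed by the discretisation level in what follows.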
Traditional inference approaches based on time-discretisations face a trade-off between bias and computational cost. Once the user has decided on a suitably fine discretisation size, one can run, for example, the particle marginal Metropolis--Hastings (PMMH) algorithm [2]. This algorithm uses a particle filter (PF) [cf. 7], where proposals between time points are generated by the approximation scheme and ultimately accepted or rejected according to a Metropolis--Hastings type acceptance ratio [cf. 16]. As the discretisation size adopted must be quite fine, a PMMH algorithm can be computationally intensive.
To deal with the computational cost of PMMH, [19] develop a PMMH based method which uses (deterministic) multilevel Monte Carlo (dMLMC) [14,17]. The basic premise of MLMC is to introduce a telescoping sum representation of the posterior expectation associated with the most precise time discretisation. Then, given an appropriate coupling of posteriors with 'consecutive' time discretisations, the cost associated with a target mean square error is reduced, relative to exact sampling from the most precise (time-discretised) posterior. In the HMM diffusion context, the standard MLMC method is not possible, so based upon a PF coupling approach and PMMH, an MLMC method is devised in [19,20], which achieves fine-level, though biased, inference.
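The telescoping-sum premise of MLMC can be illustrated with a deliberately simple toy, far from the diffusion setting: approximations Y_ℓ of a random quantity are coupled through common randomness, and the fine-level expectation is recovered as E[Y_0] plus a sum of coupled level differences. All names below are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def coupled_difference(level, n_samples):
    """Toy coupled estimator of E[Y_l - Y_{l-1}], where Y_l is a dyadic
    (resolution 2**-level) approximation of a Uniform(0,1) draw u; the
    SAME u drives both levels, which is the coupling."""
    u = rng.random(n_samples)
    y_fine = np.floor(u * 2 ** level) / 2 ** level
    y_coarse = np.floor(u * 2 ** (level - 1)) / 2 ** (level - 1)
    return np.mean(y_fine - y_coarse)

L = 6
# telescoping sum: E[Y_L] = E[Y_0] + sum_{l=1}^{L} E[Y_l - Y_{l-1}];
# here Y_0 = 0 identically, and E[Y_L] = 63/128 = 0.4921875 exactly
est = 0.0
for level in range(1, L + 1):
    est += coupled_difference(level, 20000)
```

Because the coupled differences are of size at most 2^(-level), most of the sampling effort can be spent on the cheap coarse levels, which is the source of the cost reduction.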
1.1. Method. The unbiased and computationally efficient inference method suggested in this paper is built on PMMH with Euler type discretisations, but with the PMMH targeting a coarse-level model, which is computationally cheaper. This does not yet yield unbiased inference, but unbiasedness can be achieved by an importance sampling (IS) type correction [cf. 32].
We suggest an IS type correction that is based on a single-term randomised MLMC (rMLMC) type estimator [23,26] and the PF coupling approach of [19]. The rMLMC correction is based on randomising the running level in the multilevel context of a certain PF, which we refer to as the 'delta PF (∆PF)' (Algorithm 3). In short, the ∆PF uses the PF coupling introduced in [19], but here an estimator is used for unbiased estimation of the difference of unnormalised integrals corresponding to two consecutive discretisation levels, over the latent states with the parameter held fixed (cf. Section 2), rather than for the difference of self-normalised PMMH averages.
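The single-term randomisation device of [23,26] mentioned above can be sketched abstractly: draw a random level L from a distribution p and return the coupled level difference divided by p_L, which is unbiased for the whole series of differences. The toy deterministic differences below are ours, chosen only so the expected value is known.

```python
import numpy as np

def single_term_estimate(delta, p, rng):
    """Single-term rMLMC sketch: draw a level L ~ p and return
    delta(L) / p_L, an unbiased estimate of sum_{l>=1} E[delta(l)]."""
    L = rng.choice(len(p), p=p) + 1   # levels are 1, 2, ...
    return delta(L) / p[L - 1]

rng = np.random.default_rng(2)
p = 2.0 ** (-1.5 * np.arange(1, 12))   # p_l proportional to 2**(-1.5 l), truncated toy
p /= p.sum()
# toy deterministic level differences delta(l) = 2**-l, whose series sums
# to 1 - 2**-11; the randomised estimator recovers this in expectation
vals = np.array([single_term_estimate(lambda l: 2.0 ** -l, p, rng)
                 for _ in range(4000)])
```

The choice of p trades variance against expected cost; this is exactly the allocation question addressed in Section 5.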
The resulting IS type estimator leads to unbiased inference over the joint posterior distribution, and is highly parallelisable, as the more costly (randomised) ∆PF corrections may be performed independently en masse given the PMMH base chain output. We are also able to suggest optimal choices for algorithm inputs in a straightforward manner (Recommendation 1 and Figure 1). This is because there is no bias, and therefore the difficult cost-variance-bias trade-off triangle associated with dMLMC is not present. Besides being unbiased and efficient, our method is user-friendly, as it is a combination of well-known and relatively straightforward components: PMMH, Euler approximations, PF, rMLMC, and an IS type estimator. For more about the strengths of the method, see Remark 10 later, as well as [12,32] for more discussion about IS (type) estimators based on approximate Markov chain Monte Carlo (MCMC).
Key to verifying consistency of the method is a finite variance assumption for the r∆PF estimator. We verify a parameter-uniform bound for the variance under a simple set of HMM diffusion conditions in Section 3. Note, however, that consistency of our method is likely to hold more generally. This is in contradistinction to methods based on exact simulation, which require analytically tractable transformations to unit covariance diffusion term and computable bounds in the rejection sampler, in order to even apply the method (see for example the review in the recent preprint [33]).
If an exact simulation method is applicable, the obvious question arises whether our method or the exact simulation method should be applied. The efficiency of exact simulation type methods depends on several factors, different from those affecting our method, including proper tuning and tight computable bounds for the rejection sampler. In an ideal scenario for exact simulation, a method based on exact simulation is likely to perform better than ours. Conversely, when the efficiency of exact simulation is poor, our method can perform better. For instance, the efficiency of exact simulation decreases to zero as the analytically computed upper bound of the IS weight used in the rejection sampler increases to infinity.
Although we mostly have in mind Euler approximation schemes for the diffusion dynamics, which are generally implementable, other schemes could possibly be used as well [cf. 13]. However, suitable couplings for these schemes in dimensions greater than one may not be trivial. For the sake of theory and proof of consistency, these should ideally also have known weak and strong order convergence rates [cf. 22]. Indeed, assuming a coupling exists, such higher-order schemes can improve the convergence of our method (see Sections 5 and 6). More generally, our approach based on PMMH or other approximate MCMC, increasingly fine families of approximations, MLMC, and IS correction, could be applied beyond the HMM diffusion context, for example to HMM jump-diffusions [cf. 21].
1.2. Outline. Section 2 introduces the aforementioned ∆PF (Algorithm 2) and subsequently discusses some applications of randomisation techniques. The theoretical properties of the ∆PF in the HMM diffusion context are summarised in Section 3. Section 4 presents the suggested IS type estimator (Algorithm 4), based on PMMH with rMLMC (i.e. r∆PF) correction, and details its consistency and a corresponding central limit theorem (CLT). Section 5 suggests suitable allocations in the ∆PF based on rMLMC efficiency considerations. The numerical experiments in Section 6 illustrate our method in practice in the setting of an Ornstein--Uhlenbeck process and a geometric Brownian motion. Proofs for the technical results of Sections 3, 4 and 5 are given in Appendices A, B and C, respectively.

Delta particle filter for unbiased estimation of level differences
Consider the (Itô) diffusion process with X_t ∈ X := R^d, model parameter θ ∈ T, and {W_t}_{t≥0} a Brownian motion of appropriate dimension. We suppose that there are data {Y_p = y_p}_{p=0}^n, y_p ∈ R^m, observed at equally spaced discrete times, p = 0:n for simplicity. The Markov transition between X_{p−1} and X_p is given by some kernel M_p, which together with the observation densities G_p defines the HMM diffusion; this is an example of a so-called Feynman--Kac model [cf. 7] described below. As the results of this section can just as easily be stated in terms of Feynman--Kac models, we do so in the following, which shows the generality of our approach.
The particle filter (Algorithm 1) [cf. 7] generates sets of samples and weights corresponding to the Feynman--Kac model, which for ϕ : E_{0:n} → R lead to an unbiased estimator of the (unnormalised) smoother γ_n(G_n ϕ), defined here in terms of the (unnormalised) predictor γ_n.

Algorithm 1 Particle filter for model (M_{0:n}, G_{0:n}) := (M_t, G_t)_{t=0:n} with N particles. In each line, i takes values 1:N. (In case ω*_t = 0, the algorithm is terminated with V^(i) = 0 and with arbitrary X^(i) ∈ E_{0:n}.)

Proposition 1. Suppose that ϕ : E_{0:n} → R is such that γ_n(G_n ϕ) < ∞. Then, the output of Algorithm 1 satisfies E[N^{-1} Σ_{i=1}^N V^(i) ϕ(X^(i))] = γ_n(G_n ϕ).

2.2. Level difference estimation. Suppose that we have two Feynman--Kac models (M^F_n, G^F_n) and (M^C_n, G^C_n) defined on common spaces (E_n, E_n), corresponding to 'finer' and 'coarser' Euler type discretised HMM diffusions. We are interested in estimating (unbiasedly) the level difference γ^F_n(G^F_n ϕ) − γ^C_n(G^C_n ϕ). If the models are close to each other, as they will be in the multilevel (diffusion) context, we would like the estimator also to be typically small. In many contexts, if one can estimate the difference using a coupling, it is possible to obtain a variance reduction. The particular coupling approach we use here is based on a combined Feynman--Kac model as in [19], which provides a simple, general and effective coupling of PFs, and which we use to estimate the level difference of the unnormalised smoothers (3).
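Algorithm 1 is the standard bootstrap particle filter with multinomial resampling; the following compact sketch is ours (toy linear-Gaussian model, our function names), and returns the running log of the unnormalised likelihood estimate, whose exponential is the unbiased quantity of Proposition 1 with ϕ ≡ 1.

```python
import numpy as np

def bootstrap_pf(y, transition, loglik, x0, N, rng):
    """Minimal bootstrap particle filter with multinomial resampling.
    `transition(x, rng)` propagates all particles one step (kernel M_k);
    `loglik(x, yk)` returns the log potentials log G_k. Returns the
    log of the unnormalised likelihood estimate."""
    x = np.full(N, x0, dtype=float)
    log_gamma = 0.0
    for yk in y:
        x = transition(x, rng)                      # propagate with M_k
        logw = loglik(x, yk)                        # weight with G_k
        m = logw.max()
        w = np.exp(logw - m)
        log_gamma += m + np.log(w.mean())           # accumulate likelihood factor
        idx = rng.choice(N, size=N, p=w / w.sum())  # multinomial resampling
        x = x[idx]
    return log_gamma

rng = np.random.default_rng(3)
# toy model: X_k = 0.9 X_{k-1} + N(0, 0.25), Y_k = X_k + N(0, 0.25)
trans = lambda x, rng: 0.9 * x + rng.normal(0, 0.5, size=x.shape)
ll = lambda x, yk: -0.5 * (yk - x) ** 2 / 0.25 - 0.5 * np.log(2 * np.pi * 0.25)
est = bootstrap_pf([0.1, -0.2, 0.05], trans, ll, 0.0, N=200, rng=rng)
```

The log-sum-exp shift `m` is a standard numerical safeguard and does not alter the estimator.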
Hereafter, we denote x̄_n = (x^F_n, x^C_n) ∈ E_n × E_n, and for x̄_{0:n} = (x̄_0, …, x̄_n) ∈ E²_0 × … × E²_n, we set x̄^s_{0:n} := (x^s_0, …, x^s_n) ∈ E_{0:n} for s ∈ {F, C}.

Assumption 2. Suppose that (M̌_t, Ǧ_t) is a Feynman--Kac model on the product spaces (E_t × E_t, E_t ⊗ E_t), such that: (i) M̌_t is a coupling of the probabilities M^F_t and M^C_t, i.e. for all A ∈ E_t, we have M̌_t(x̄_{t−1}, A × E_t) = M^F_t(x^F_{t−1}, A) and M̌_t(x̄_{t−1}, E_t × A) = M^C_t(x^C_{t−1}, A); (ii) Ǧ_t(x̄_{0:t}) := (G^F_t(x̄^F_{0:t}) + G^C_t(x̄^C_{0:t}))/2.

Algorithm 2 Delta particle filter (∆PF) for unbiased estimation of level differences.
Proposition 3. Under Assumption 2, the output of Algorithm 2 satisfies E[∆(ϕ)] = γ^F_n(G^F_n ϕ) − γ^C_n(G^C_n ϕ), whenever both integrals on the right are well-defined and finite.
Proof. By the unbiasedness property of the PF (Algorithm 1), we have the claim, where Assumption 2(ii) guarantees Ǧ_t > 0 whenever G^F_t > 0, and (i) implies that the marginal law of Π^n_{t=0} M̌_t is Π^n_{t=0} M^F_t (and correspondingly for C).

Remark 4. (i) Typically, in the discretisation of diffusions context [14,26], the couplings M̌_t would be based on using the same underlying Brownian motion [cf. 22]. That is, if the fine transition corresponds to two steps of an Euler discretisation with step-size h_F, then we can use a single coarse step with h_C := 2h_F for the coarser Euler discretisation. The kernels M̌_t on the joint space then move N particles according to the fine-level discretisation, and N according to the coarse-level discretisation, both based on the same underlying sequence of standard normals (δW^F_{t+kh_F})_{k≥1}.
(ii) The choice of Ǧ_t in Assumption 2(ii) provides a safe 'balance' between the approximations, as w^F and w^C are upper bounded by 2^{n+1}. Indeed, the coupled Feynman--Kac model can be thought of as an 'average' of the two extreme cases: with the choice Ǧ_t(x̄_{0:t}) = G^F_t(x̄^F_{0:t}), the coupled PF would coincide marginally with the Feynman--Kac model with dynamics M^F_t. The optimal choice of Ǧ_t is an interesting question. (iii) Clearly, the choice of Ǧ_{0:t} can also be made in other ways. It is sufficient for unbiasedness to choose Ǧ_t(x̄_{0:t}) strictly positive whenever either the G^F_t(x̄^F_{0:t}) or the G^C_t(x̄^C_{0:t}) product is positive, but choices which make w^F and w^C bounded are safer. This was the original choice made in [19] for approximation of normalised smoother differences. This PF coupling approach, based on a change of measure and the weight corrections w^F and w^C, has also been used, for example, in [20]. (iv) Later, in the HMM diffusion context, we set G^F_t = G^C_t, corresponding to common observational densities, but the method is also of interest with differing potentials.
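The common-Brownian-increment coupling of Remark 4(i) can be sketched as follows (scalar case, our own function names): two fine Euler steps of size h_F, and one coarse step of size 2h_F driven by the sum of the same two increments.

```python
import numpy as np

def coupled_euler_step_pair(x_fine, x_coarse, drift, diff, h_fine, rng):
    """Advance the fine chain (step h_F) and the coarse chain (step
    h_C = 2 h_F) over one coarse step, using the SAME Brownian increments,
    as in the coupling of Remark 4(i)."""
    dW1 = rng.normal(0.0, np.sqrt(h_fine))
    dW2 = rng.normal(0.0, np.sqrt(h_fine))
    # two fine steps
    x_fine = x_fine + drift(x_fine) * h_fine + diff(x_fine) * dW1
    x_fine = x_fine + drift(x_fine) * h_fine + diff(x_fine) * dW2
    # one coarse step with the summed increment dW1 + dW2 ~ N(0, 2 h_F)
    x_coarse = x_coarse + drift(x_coarse) * 2 * h_fine + diff(x_coarse) * (dW1 + dW2)
    return x_fine, x_coarse

rng = np.random.default_rng(4)
xf = xc = 1.0
for _ in range(8):   # 8 coarse steps of an OU dynamic over [0, 1], h_F = 1/16
    xf, xc = coupled_euler_step_pair(xf, xc, lambda x: -x, lambda x: 0.3, 1 / 16, rng)
```

Because both chains see the same noise, their difference reflects only the discretisation error, which is what makes the coupled level-difference estimator small.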
2.3. Unbiased latent inference. We show here how the randomisation techniques of [23,26] can be used with the output of Algorithms 1 and 2 to provide an unbiased estimator according to the true model, even though the PFs are only run according to approximate models. Let us index the transitions M^(ℓ)_p and potentials G^(ℓ)_p by ℓ ≥ 0. They are assumed throughout to be increasingly refined approximations of the true model, in a (weak) sense made precise below. In Assumption 2 we set the symbols (F, C) to be (ℓ, ℓ−1) for ℓ ≥ 1. Algorithm 3 can then provide unbiased estimation of γ^(∞)_n(G_n ϕ) (Lemma 6), leading to unbiased inference w.r.t. the normalised smoother, which is stated as Proposition 7 below.

Assumption 5. Assumption 2 holds, p = (p_ℓ)_{ℓ∈N} is a probability on N := Z_{≥1} with p_ℓ > 0 for all ℓ ≥ 1, and g : E_{0:n} → R is a function.

Lemma 6. Under Assumption 5, the estimator formed from the output of Algorithm 3 is unbiased for the level-difference series, so the result follows by Proposition 1 and the linearity of expectation.
The following suggests a fully parallelisable algorithm for unbiased inference over the normalised smoother, and is an unbiased alternative to particle independent Metropolis--Hastings (PIMH) [2] run at some fine level of discretisation.

Proposition 7. If the finite-variance condition (7) holds for each k, then the estimator converges almost surely.
The above result follows directly from the results of Section 4. It can also be seen as a multilevel version of [32, Proposition 23], with straightforward estimators for σ². See Section 5 for suggested choices of p and N.

A variance bound for the delta particle filter
In this section we give theoretical results for the ∆PF (Algorithm 2) in the setting of HMM diffusions, which can be used to verify finite variance and therefore consistency of related estimators.

3.1. Hidden Markov model diffusions.
We consider an HMM diffusion and the corresponding Feynman--Kac model as in Section 2. We omit θ from the notation in the following, which is justified because the remaining conditions and results of this section hold uniformly in θ (i.e. any constants are independent of θ). The following will be assumed throughout.
Condition (D). The coefficients a_j, b_{j,k} are twice differentiable for j, k = 1, …, d.

Let M^(∞)(x, dy) for p = 0:n denote the Markov transition of the unobserved diffusion (1), i.e. the distribution of the solution X_1 of (1) started at X_0 = x. With a similar setup to Section 2, with E_n := X^{n+1}, the smoother (2) takes the corresponding form. In practice one usually must approximate the true dynamics M^(∞)(x, dy) of the underlying diffusion with a simpler transition M^(ℓ)(x, dy), based on some Euler type scheme using a discretisation parameter h_ℓ = 2^{-ℓ} for ℓ ≥ 0 [cf. 22]. The scheme allows for a coupling of the diffusions (X^ℓ_t)_{t≥0} running at discretisation levels ℓ and ℓ − 1 (based on using the same Brownian path W_t), such that the strong error bound (8) holds for some β ∈ {1, 2}: if the diffusion coefficient of (1) is constant, or if a Milstein scheme can be applied otherwise, then β = 2; otherwise β = 1 [cf. 18, Proposition D.1].

3.2.
Variance bound. Assume we are in the above HMM diffusion setting, and that the coupling of Assumption 2 holds, with the symbols (F, C) equal to (ℓ, ℓ−1) for ℓ ≥ 1, and with common potentials G^(ℓ)_t = G_t. Running Algorithm 2, we recall that ∆_ℓ(ϕ), defined in (6), satisfies, by Proposition 3, E[∆_ℓ(ϕ)] = γ^(ℓ)_n(G_n ϕ) − γ^(ℓ−1)_n(G_n ϕ), regardless of the number N ≥ 1 of particles.
Recall that a (measurable) function ϕ : X^{n+1} → R belongs to Lip(X^{n+1}) if it is Lipschitz.

Condition (A). The following conditions hold for the model (M_n, G_n).

In the following results for ∆_ℓ(ϕ), the constant M < ∞ may change from line to line. It will not depend upon N or ℓ (or θ), but may depend on the time-horizon n or the function ϕ. E denotes expectation w.r.t. the law associated with the ∆PF started at (x, x), with x ∈ X. Below we only consider multinomial resampling in the ∆PF for simplicity, though Theorem 8 and Corollary 9 can also be proved for other resampling schemes.
The proofs are given in Appendix A. Based on Corollary 9, Recommendation 1 of Section 5 suggests allocations for p and N_ℓ in the ∆PF (Algorithm 2) which optimally use resources and minimise the variance (5).

Unbiased joint inference for hidden Markov model diffusions
We are interested in unbiased inference for the Bayesian model posterior, where pr(dθ) = pr(θ)dθ is the prior on the model parameters, and the transitions M^(θ,ℓ)_p depend on θ as in Section 3.

Remark 10. Before stating consistency and central limit theorems, we briefly discuss various aspects of this approach which are appealing from a practical perspective, and we also mention certain algorithmic modifications which could be considered further.
(i) The first phase (P1) of Algorithm 4 implements a PMMH type algorithm [2]. If the algorithm constant is zero, this is exactly PMMH targeting the level-zero model. It is generally safer to choose a strictly positive constant [32], which ensures that the IS type correction in phase (P2) will yield consistent inference for the ideal model π^(∞)(dθ, dx_{0:n}) ∝ pr(dθ) G^(θ)_n(x_n) γ^(θ,∞)_n(x_{0:n}) (Theorem 11). A positive constant may be helpful otherwise in terms of improved mixing, as the PMMH will marginally target an averaged probability between a 'flat' prior and a possibly 'multimodal' level-zero marginal posterior.

Algorithm 4 Randomised multilevel importance sampling type estimator.
Choose N_0 > 0, and for k = 1:m_iter, iterate steps (P1)-(P2).

(ii) It is only necessary to implement PMMH for the coarsest level. This is typically relatively cheap, and therefore allows for a relatively long MCMC run. Consequently, the relative cost of burn-in is small, and if the proposal q is adapted [cf. 1], it has time to converge. (iii) The (potentially costly) r∆PFs are applied independently for each Θ_k, which allows for efficient parallelisation. (iv) We suggest that the number of particles 'N_0' used in the PMMH be chosen based on [9,29], while the number of particles 'N_ℓ' (and p_ℓ) can be optimised for each level based on Recommendation 1 of Section 5, or kept constant. One can also afford to increase the number of particles when a 'jump chain' representation is used (see the following remark). (v) The r∆PF corrections may be calculated only once for each accepted state (Θ_k, Ṽ_k) [32], but it is not clear how such dependence could be used in practice to achieve better performance. Likewise, the 'zeroth level' estimate in Algorithm 4 is based solely on the particles in (P1), but it could also be based on (additional) new particle filter output. (ix) In order to save memory, it is also possible to 'subsample' only one trajectory X*_k from X_{k,0}, and similarly in Algorithm 2 find X̄* such that P[X̄* = X̄^(i)] ∝ V̄^(i), setting X̄^{*(1:2)}_{k,L_k} := X̄*, with the weights defined from the usual output of Algorithm 2. The subsampling output estimator then takes an analogous form. Note, however, that the asymptotic variance of this estimator is higher.

Theorem 11. Under the stated conditions, the estimator of Algorithm 4 is strongly consistent.

Remark 12. Regarding Theorem 11, whose proof is given in Appendix B: (i) If all potentials G_t are strictly positive, the algorithm constant may be taken to be zero. However, if the constant is zero and Algorithm 1 with (M^(Θ_k,0)_{0:n}, G^(Θ_k)_{0:n}, N) can produce an estimate with Σ^N_{i=1} V^(i) = 0 with positive probability, consistency may be lost [32].
Proposition 13. Suppose that the conditions of Theorem 11 hold. Suppose additionally that π^(∞)(f²) < ∞ and that the base chain (Θ_k, V_k) satisfies a CLT. Then a CLT holds for the estimator of Algorithm 4, whenever the asymptotic variance σ² is finite.

Remark 14. Proposition 13 follows from [32, Theorem 7]. We suggest that N = N_0 for (P1) be chosen based on [9,29] to minimise var(P, µf), and that (p_ℓ) and N = N_ℓ in (P2) for the r∆PF be chosen as in Recommendation 1 of Section 5, to minimise σ²_ξ subject to cost constraints, in order to jointly minimise σ². However, the question of the optimal choice of N_0 in the IS context is not yet settled.
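The two-phase structure — a cheap coarse-level chain (P1) followed by conditionally independent randomised corrections (P2) — is what makes the method parallelisable (Remark 10(iii)). The sketch below illustrates only the mechanics of phase (P2), with a toy stand-in for the r∆PF; none of these names or numbers come from the paper.

```python
import numpy as np

def r_delta_pf_correction(theta, seed):
    """Toy stand-in for one randomised Delta-PF run of phase (P2): draw a
    level L ~ p and return delta_L / p_L. Here delta_L is an artificial
    quantity with mean ~ 2**-L, purely to show the randomisation mechanics."""
    rng = np.random.default_rng(seed)
    p = 2.0 ** -np.arange(1, 25)
    p /= p.sum()
    L = rng.choice(len(p), p=p) + 1
    delta = 2.0 ** -L * (1.0 + 0.1 * rng.normal())
    return delta / p[L - 1]

# (P1): pretend the coarse-level PMMH chain has already been run and stored.
# (P2): the corrections are conditionally independent given the chain, so
# they can be computed after the fact, in parallel, one per chain state.
thetas = np.zeros(2000)                        # placeholder chain states
corr = np.array([r_delta_pf_correction(t, seed=k) for k, t in enumerate(thetas)])
```

Because each correction depends only on its own Θ_k and seed, the loop can be farmed out to independent workers without any communication between iterations.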

Asymptotic efficiency and randomised multilevel considerations
We summarise the results of this section by suggesting the following safe allocations for the probability p = (p_ℓ)_{ℓ∈N} and the number N = N_ℓ of particles at level ℓ ≥ 1 in the ∆PF (Algorithm 2) used in Algorithms 3 and 4 and Proposition 7, with β given in (8) in the HMM diffusion context of Section 3, or, indeed, with β given in the abstract framework under Assumption 18 below. See also Figure 1 for the recommended allocations.
Recommendation 1. With strong error convergence rate β given in (8), we suggest the following for p = (p_ℓ)_{ℓ∈N} and N_ℓ ∈ N in the ∆PF (Algorithm 2):
(β = 1) (e.g. Euler scheme). Choose p_ℓ ∝ (1/2)^ℓ ℓ (log₂ ℓ)² and N_ℓ ∝ 1 constant.
(β = 2) (e.g. Milstein scheme). Choose p_ℓ ∝ 2^{-1.5ℓ} ≈ (1/3)^ℓ and N_ℓ ∝ 1 constant.
The suggestions are based on Corollary 9 of Section 3, and Propositions 20 (β = 2) and 26 (β = 1) given below (with weak convergence rate α = 1; see Figure 1 for general α). In the Euler case, although the theory below gives the same computational complexity order for any ρ ∈ [0, 1] with p_ℓ ∝ 2^{-(1+ρ)ℓ} and N_ℓ ∝ 2^{ρℓ}, the experiment in Section 6 gave a better result using simply ρ = 0, corresponding to no scaling.

5.1. Efficiency framework. The asymptotic efficiency of Monte Carlo was considered theoretically in [15]; see [14] in the dMLMC context. The developments of this section follow [26] for rMLMC (originally in the i.i.d. setting), while also giving some extensions (also applicable to that setting). We will see that the basic rMLMC results carry over to our setting involving MCMC and randomised estimators based on PF outputs. Proofs are given in Appendix C.
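As a quick numerical sanity check of the β = 2 allocation, one can verify that the level distribution concentrates on coarse levels while the expected per-draw cost stays finite (assumptions of the sketch: γ = 1 so the level-ℓ cost scales like 2^ℓ, and a truncation at 29 levels):

```python
import numpy as np

levels = np.arange(1, 30)
p = 2.0 ** (-1.5 * levels)   # Recommendation 1, beta = 2: p_l ~ 2**(-1.5 l)
p /= p.sum()
rng = np.random.default_rng(6)
draws = rng.choice(levels, size=10000, p=p)
# with cost ~ 2**l per level (gamma = 1), the mean cost per draw is finite,
# since 2**l * 2**(-1.5 l) = 2**(-l/2) is summable: the canonical regime
mean_cost = float(np.sum(p * 2.0 ** levels))
```

With β = 1 and the same geometric cost, the analogous sum 2^ℓ · 2^{-ℓ} is not summable, which is exactly the subcanonical dilemma discussed in Section 5.2.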
We are interested in modelling the computational costs involved in running Algorithm 4; the algorithm of Proposition 7 is recovered with T = {θ}. Let τ_{Θ_k, L_k} represent the combined cost at iteration k of the base Markov chain and weight calculation in Algorithm 4, so that the total cost of Algorithm 4 with m iterations is C(m) = Σ^m_{k=1} τ_{Θ_k, L_k}. The following assumption seems natural in our setting.
Assumption 15. For Θ_k ∈ T, the family {τ_{Θ_k, ℓ}}_{k, ℓ≥1} consists of positive-valued random variables that are independent of {L_k}_{k≥1}, where L_k ∼ p i.i.d., and that are conditionally independent given {Θ_k}_{k≥1}, such that the law of τ_{Θ_k, ℓ} depends only on Θ_k ∈ T and ℓ ∈ N.
Remark 17. The quantity E[τ]σ² is called the 'inverse relative efficiency' by [15], and is considered a more accurate measure than the asymptotic variance (σ² here) for comparing Monte Carlo algorithms run on the same computer, as it also takes into account the average computational time.
In the following we consider (possibly) variance reduced (if ρ > 0) versions of ∆(g) of Assumption 5, denoted ∆_ℓ, where g = f^(θ), based on running the ∆PF (Algorithm 2) with the parameter θ and level ℓ fixed. The constant C < ∞ may change from line to line; it does not depend on N_ℓ, ℓ, or θ, but may depend on the time-horizon n and the function f.

Assumption 18. Assumption 15 holds, and constants 2α ≥ β > 0, γ > 0, and ρ ≥ 0 are such that the following cost and variance bounds hold.

Remark 19. Regarding Assumption 18: (i) We only assume a bounded mean cost in (i), rather than the almost sure cost bound commonly used. This generalisation allows for settings where occasional algorithmic runs may take a long time. (ii) In the original MLMC setting, the cost scaling γ in (i) is taken to be γ = 1 [14,26].
Remark 21. Regarding Proposition 20, in the common case γ = 1 for simplicity: (i) If β > 1 (the 'canonical convergence regime') and ρ = 0, then a choice of r ∈ (1, β) exists. See also [26, Theorem 4] for a discussion of the theoretically optimal p. (ii) If β ≤ 1 (the 'subcanonical convergence regime'), then β + ρ ≤ 1 + ρ and so no such choice of r exists. In our particular experiment in Section 6, however, the simple choice ρ = 0, corresponding to no scaling in the particles, will turn out to be optimal.

5.2. Subcanonical convergence. When β > 1, we have seen within the framework above that a canonical convergence rate holds (Proposition 20), because E[τ] < ∞ and σ² < ∞. When β ≤ 1, this is no longer the case, and one must choose between a finite asymptotic variance and infinite expected cost, or vice versa. Assuming the former, and that a CLT holds (Proposition 13), for ε > 0 and 0 < δ < 1 the Chebyshev inequality implies that, for (11) to hold, the number of iterations m of Algorithm 4 must be of order O(ε^{-2}). The question is then how to minimise the total cost C(m), or computational complexity, involved in obtaining the m samples. This involves optimising (p_ℓ) and N_ℓ to minimise C(m) while keeping the asymptotic variance finite.

Proposition 22.
Suppose that the assumptions of Proposition 13 hold with σ² < ∞, and that Assumption 18 holds with E[τ_{Θ_{k_0}, L_{k_0}}] = ∞ for some k_0 ≥ 1. If the stated condition holds with a_k = O(k^{c_1} (log₂ k)^{c_0}) for some constants c_0 > 0 and c_1 ≥ 1, then (11) can be obtained with the stated computational complexity.

Remark 23. The above result shows that even for costs with unbounded tails, reasonable confidence intervals and complexity order may be possible. This may be the case, for example, when a rejection sampler or an adaptive resampling mechanism is used within Algorithm 1 or 4, which may lead to large costs for some Θ_k, for example a cost with a geometric tail.
The next results are as in [26, Propositions 4 and 5] in the standard rMLMC setting, and show how one can choose p, assuming an additional almost sure cost bound, so that σ² < ∞ with reasonable complexity.

Numerical simulations
We now demonstrate the theoretical results for the method introduced herein on two examples: one in the canonical regime and one in the subcanonical regime, both of which have likelihoods that can be computed exactly, so that the ground truth π^(∞)(f) can be calculated to arbitrary precision. We run each example with 100 independent replications, and calculate the MSE at chain length m, which is depicted as the thick red line (the average of the thin lines) in Figure 2 below. The error decays with the optimal rate of cost^{-1} and log(cost)·cost^{-1} in the canonical and subcanonical cases, respectively, where cost is the realised cost of the run, C(m) from Section 5, measured in seconds, with m iterations of the Markov chain.
6.1. Ornstein--Uhlenbeck process. The filter at time k is given by a simple exact recursion, and the incremental marginal likelihoods (14) can also be computed exactly. The parameters are chosen as γ = 1, σ² = 0.1, n = 5, and the data is generated with θ = (0, 0)^T. Our aim is to compute E(θ|y_{1:n}) (or E[(a, b)^T|y_{1:n}], etc., but we will content ourselves with the former). This is done via a brute force random walk MCMC for m = 10^8 steps using the exact likelihood P[y_{1:n}|θ] as above. The IACT is around 10, so this gives a healthy limit for MSE computations.
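For reference, an exact filter recursion of this type can be sketched for a generic scalar linear-Gaussian HMM; the parameterisation X_k = φX_{k−1} + N(0, q), Y_k = X_k + N(0, r) and the function name are ours, and only illustrate the kind of Kalman-style recursion used to obtain the exact likelihood.

```python
import numpy as np

def kalman_loglik(y, phi, q, r, m0=0.0, p0=1.0):
    """Exact log-likelihood log P[y_{1:n}] for the scalar linear-Gaussian
    HMM X_k = phi X_{k-1} + N(0, q), Y_k = X_k + N(0, r), X_0 ~ N(m0, p0),
    via the standard Kalman filter recursion."""
    m, P = m0, p0
    ll = 0.0
    for yk in y:
        m, P = phi * m, phi ** 2 * P + q         # predict
        S = P + r                                 # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + (yk - m) ** 2 / S)
        K = P / S                                 # Kalman gain
        m, P = m + K * (yk - m), (1 - K) * P      # update
    return ll

ll = kalman_loglik([0.2, -0.1, 0.3], phi=0.8, q=0.1, r=0.1)
```

A brute-force MCMC over θ can then call such a routine for the exact acceptance ratio, which is how the ground truth is obtained here.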
For the numerical experiment, we use the Euler--Maruyama method at resolution h_ℓ = 2^{-ℓ} to solve (12), stepping for k = 1, …, K = h_ℓ^{-1}. Levels ℓ and ℓ − 1 are coupled in the simulation of ∆_ℓ by defining B^C_{1:K/2} = B^F_{1:2:K−1} + B^F_{2:2:K}. Algorithm 2 is then run using the standard bootstrap particle filter (Algorithm 1) with N = 20 particles and O(N)-complexity multinomial resampling [cf. 5]. Theorem 8 provides a rate of β = 2 for Algorithm 2, because the diffusion coefficient is constant, which means we are essentially running a Milstein scheme (cf. (8) and [22]). Recommendation 1 (or Proposition 20) of Section 5 suggests that arbitrary precision can be obtained by Algorithm 4 with p_ℓ ∝ 2^{-3ℓ/2} and no scaling of the particle numbers with ℓ in this canonical β = 2 regime (with weak rate α = 1). We choose a positive PMMH algorithm constant of 10^{-6} (cf. Remark 10(i)). We run Algorithm 4 for 10^4 steps, with 100 replications. The results are presented in Figure 2, where it is clear that the theory holds and the MSE decays according to 1/cost. The variance of the run-times over replications is very small.
6.2. Geometric Brownian motion. We next consider the stochastic differential equation (16) dX_t = a X_t dW_t, with initial condition x_0 = 1, and a := a_θ = exp(θ) with θ ∼ N(0, σ²). This equation is analytically tractable as well: the solution of the transformed equation Z = log X is given via Itô's formula, so that with W_k ∼ N(0, 1) i.i.d. the values Z_k can be simulated exactly, and the solution of (16) is obtained by exponentiation: X_k = e^{Z_k}. Moreover, noisy observations are introduced of the form given above, where ξ_k ∼ N(0, γ²) i.i.d. as before. Again P[y_{1:n}|θ] can be computed analytically, by integrating over (z_1, …, z_n). In order to investigate the theoretical subcanonical rate, we return to (16) and approximate it directly using the Euler--Maruyama method (15), which introduces artificial approximation error. This problem suffers from stability problems when X < 0, so we take h_ℓ = 2^{-5-ℓ}. Algorithm 1 is then used along with the corresponding selection functions. Here the diffusion coefficient is not constant, and the Euler--Maruyama method yields a rate of β = 1 = α, the borderline case, which is expected to give a logarithmic penalty. Based on Recommendation 1 (or Proposition 26) of Section 5, we consider scaling the particles as 2^{ρℓ} with ρ = 2α − β = 1 and with ρ = 0, taking p_ℓ ∝ 2^{-2ℓ} ℓ log(ℓ)² in both cases. Again the algorithm constant is 10^{-6}, and the standard bootstrap particle filter is used, with N_ℓ = 20 × 2^{ρℓ} particles. Algorithm 4 is run for 10^5 steps, with 100 replications. The results are presented in Figure 2, and they show good agreement with the theory in terms of rate. On the other hand, the cost for ρ = 0 is smaller than that of ρ = 1 by a factor of approximately 100.

Figure 2. The MSE of the PMMH rMLMC IS Algorithm 4 applied to the problem of parameter inference for the discretely observed OU process (left plot) and GBM process (middle plot with ρ = 0, right plot with ρ = 1). Replications are given by the thin curves, while the bold curves give the average MSE over replications, together with the lines cost^{-1} (left plot) and log(cost)·cost^{-1} (middle and right plots) to guide the eye.
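The exact simulation of the GBM via the log transform amounts to the following sketch (assumptions: unit observation times and our own helper name):

```python
import numpy as np

def gbm_exact(a, n, rng):
    """Exact simulation of dX = a X dW at unit times via Z = log X:
    Ito's formula gives dZ = -(a**2 / 2) dt + a dW, so
    Z_k = Z_{k-1} - a**2 / 2 + a W_k with W_k ~ N(0, 1) i.i.d.,
    and X_k = exp(Z_k), with x_0 = 1."""
    z = np.cumsum(-(a ** 2) / 2 + a * rng.normal(size=n))
    return np.exp(np.concatenate(([0.0], z)))

rng = np.random.default_rng(7)
x = gbm_exact(0.3, 50, rng)
```

Note that the exact solution is positive by construction, whereas the direct Euler scheme applied to (16) can cross zero, which is the stability issue motivating the smaller step sizes above.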
Note that the coupling assumption (17) for M̌_n can be equivalently formulated for the joint kernel M̌_n of the particle system.
As noted in [19], it is simple to establish the corresponding identity for ϕ ∈ B_b(E_n). In order to approximate γ̌_n(ψ), one can run the following abstract version of Algorithm 2 (recall from Section 3 that we only consider multinomial resampling). Define the required quantities for n ≥ 1; the algorithm begins by generating the initial particle system, and proceeds recursively, so that the joint law of the particles up to time n factorises accordingly. Hence we have the estimate γ̌^N_n(ψ).

Remark 27. Note that γ̌^N_n(ψ) corresponds to the quantity ∆_ℓ(ϕ) in (6) from the ∆PF output (Algorithm 2).
A.3. General hidden Markov model case. Define for p ≥ 1 the semigroup Q̌_{p,n}; if p = n, Q̌_{n,n} is clearly the identity operator. For any 0 ≤ p ≤ n and ϕ ∈ B_b(E_n × E_n), we set Q̌_{−1,n}(ϕ)(u_{−1}) = 0. Now, following [7, Chapter 7], we have the following martingale (w.r.t. the natural filtration of the particle system), for ϕ ∈ B_b(E_n × E_n), with the convention that φ_p(η^N_{p−1}) = η_0 if p = 0. The representation immediately establishes that E[γ̌^N_n(ϕ)] = γ̌_n(ϕ), where the expectation is w.r.t. the law associated with the particle system. We use the convention that C is a finite positive constant that does not depend upon n, N, or any of the G_n, M^s_n (s ∈ {F, C}), M̌_n; the value of C may change from line to line. Define for 0 ≤ p ≤ n < ∞, G_{p,n} = Π^n_{q=p} G_q, with the convention that if p = 0 we write G_n. We have the following result.

Proposition 28. Suppose that G_n < ∞ for each n ≥ 0. Then there exists a C < ∞ such that the stated bound holds for any n ≥ 0.

Proof. Given the martingale representation (20), one can apply the Burkholder--Davis--Gundy inequality to obtain the first bound.

Application of the (conditional) Marcinkiewicz-Zygmund inequality yields
After applying the $C_2$ and Jensen inequalities, we conclude by (21).
A.4. Diffusion case. We now consider the model of Section 3, where we recall that $\theta$ is omitted from the notation. A series of technical results is given, and the proofs of Theorem 8 and Corollary 9 appear at the end of this section. We recall the joint probability density of the observations and the unobserved diffusion at the observation times. As the true dynamics cannot be simulated, in practice we work with an (Euler) approximation scheme with discretisation $h_\ell = 2^{-\ell}$, $\ell \geq 0$. In our context, then, $M_n^F$ corresponds to $Q^{(\ell)}$ ($\ell \geq 1$) and $M_n^C$ corresponds to $Q^{(\ell-1)}$. The initial distribution $\eta_0$ is simply the (Euler) kernel started at some given $x_0$. As noted earlier in Remark 4(i), a natural coupling of $M_n^F$ and $M_n^C$ (and hence of $\eta_0$) exists. As established in [18, eq. (32)], one has, uniformly in $\theta$ (as (D) holds with $\theta$-independent constants), a strong coupling bound between the fine and coarse kernels with some $C < \infty$. We also recall that (8) holds (recall that (D) is assumed). We use $M < \infty$ to denote a constant that may change from line to line; it will not depend upon $\theta$, $N$ or $\ell$, but may depend on the time parameter or on a function. The following result will be needed later on. The proof is given after the proof of Lemma 30 below.
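The natural coupling mentioned above (fine kernel with step $h_\ell = 2^{-\ell}$, coarse kernel with step $h_{\ell-1}$, both driven by the same Brownian path) can be sketched as follows. Here `drift` and `diff` are placeholder callables for the SDE coefficients, and the unit observation interval is an assumption of the sketch.

```python
import numpy as np

def coupled_euler_step(xf, xc, drift, diff, level, rng, t_len=1.0):
    """One observation-interval transition under the natural coupling of
    M_n^F (step h_l = 2^-level) and M_n^C (step 2^-(level-1)): the coarse
    chain consumes sums of consecutive fine Brownian increments."""
    hf = 2.0 ** (-level)
    n_fine = int(round(t_len / hf))          # assumed even (t_len = 1)
    dwf = np.sqrt(hf) * rng.standard_normal(n_fine)
    for k in range(n_fine):                  # fine Euler recursion
        xf = xf + drift(xf) * hf + diff(xf) * dwf[k]
    hc = 2.0 * hf
    for k in range(0, n_fine, 2):            # coarse recursion, shared noise
        xc = xc + drift(xc) * hc + diff(xc) * (dwf[k] + dwf[k + 1])
    return xf, xc
```

Because both chains see the same noise, the discrepancy $|x^F - x^C|$ is small in mean square, which is what the strong coupling bound cited from [18, eq. (32)] quantifies.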
Proof of Proposition 29. We have the standard collapsing sum representation, with summand $T_p$. By Lemma 30, $\mathbb{E}_C[\varphi(x_{0:p}, X_{p+1:n}) \prod_{q=p+1}^{n} G_q(X_q) \mid x_p] \in \mathcal{B}_b(\mathsf{X}^{p+1}) \cap \text{Lip}(\mathsf{X}^{p+1})$, and by (A1) (i) and (ii), $G_p \in \mathcal{B}_b(\mathsf{X}) \cap \text{Lip}(\mathsf{X})$, so (22) applies. Application of (A1) (i) then gives $|T_p| \leq M h_\ell$, and the proof is hence concluded.
Lemma 31. Assume (A1). Then for any $n \geq 0$ there exists an $M < \infty$ such that for any $x_{0:n} \in \mathsf{X}^{2(n+1)}$:

Proof. The proof is by induction. For the case $n = 0$, application of (A1) (ii) and (iii) yields the claim. Assume now that the result holds at rank $n - 1$; then
For the first term on the R.H.S., one can follow the argument at the initialisation and apply (A1) (i) and (iii). For the second term on the R.H.S., the induction hypothesis and (A1) (i) and (iii) can be used; that is, one can deduce the required bound. Recall (18) for the definition of $\psi$.

Lemma 32. Assume (A1-2). Then for any $0 \leq p < n$ and $\varphi \in \mathcal{B}_b(\mathsf{X}^{n+1}) \cap \text{Lip}(\mathsf{X}^{n+1})$, there exists an $M < \infty$ such that, for any $x_{0:p}$, the stated bound holds.

Proof. It follows that $\check{\boldsymbol{Q}}_{p,n}(\psi)(x_{0:p}) = \check{G}_p(x_p)(T_1 + T_2)$. By Lemma 31, together with $\varphi \in \mathcal{B}_b(\mathsf{X}^{n+1}) \cap \text{Lip}(\mathsf{X}^{n+1})$ and (A1) (i), one can bound the first term. For $T_3$ one can use Lemma 30 (along with (A1) (i) and (iii)). For $T_4$, a collapsing sum argument similar to that used in the proof of Proposition 29 yields $|T_4| \leq M h_\ell$. One can then conclude the proof via the above bounds (along with (A1) (i)).
Below, $\mathbb{E}$ denotes expectation w.r.t. the particle system described in Section A.2, started at position $(x, x)$ at time $n = 0$ with $x \in \mathsf{X}$, in the diffusion case of Section A.4. Recall the particle $U_n^i \in E_n \times E_n$ at time $n \geq 0$ in path space. We denote by $U_n^{i,s}(j) \in \mathsf{X}$ the $j \in \{0, \ldots, n\}$ component of particle $i \in \{1, \ldots, N\}$ at time $n \geq 0$ of the $s \in \{F, C\}$ coordinate. Recall that $(U_n^{i,F}(n), U_n^{i,C}(n))$ for $n \geq 1$ is sampled from the kernel $\check{M}_n((\bar{u}_{n-1}^{i,F}(n-1), \bar{u}_{n-1}^{i,C}(n-1)), \cdot)$, where $\bar{u}$ denotes post-resampling quantities, and the components $(U_n^{i,F}(j), U_n^{i,C}(j)) = (\bar{u}_{n-1}^{i,F}(j), \bar{u}_{n-1}^{i,C}(j))$ for $j \in \{0, \ldots, n-1\}$ are kept the same for the earlier components of the particle.
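The bookkeeping in the paragraph above (copy the resampled ancestor's path, append the newly sampled component) amounts to the following small helper; the array shapes are illustrative assumptions.

```python
import numpy as np

def extend_paths(paths, ancestors, new_states):
    """Path-space update: particle i keeps components 0..n-1 of its
    post-resampling ancestor and appends its newly sampled component n.

    paths:      (N, n) array, one row per particle path
    ancestors:  (N,) integer indices from (multinomial) resampling
    new_states: (N,) freshly sampled components at time n
    """
    return np.column_stack([paths[ancestors], new_states])
```

In the coupled setting this update would be applied separately to the $F$ and $C$ coordinates of each particle, with the same ancestor indices.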
Proof. Our proof is by induction, the case $n = 0$ following by (8). Assuming the result at $n - 1$, we have the following, where we have used (A1) (i) and (iii). Applying the induction hypothesis along with (8) then yields the next estimate. Then, by (A1) (i) and (iii), and via the induction hypothesis, one obtains the bound $\leq M h_\ell^{\beta}$, and the proof is concluded.

Recall Remark 27.
Proof of Theorem 8. This follows by first applying Proposition 28, then Lemma 32, then some standard calculations, and finally Lemma 33.
Proof of Corollary 9. This follows easily by adding and subtracting $\check{\boldsymbol{\gamma}}_n(\psi)$, applying the $C_2$ inequality along with Theorem 8, and then using (19) combined with Proposition 29.
Lemma 34. Let $\{X_k\}_{k \geq 1}$ be a sequence of independent random variables with $\mathbb{E}[X_{k_0}] = \infty$ for at least one $k_0$, and let $\{a_k\}_{k \geq 1}$ be a sequence of monotonically increasing real numbers with $a_k/k \longrightarrow \infty$. Suppose that one of the following assumptions holds: (i) $\sum_{k \geq 1} \mathbb{P}[X_k > a_k] < \infty$, and $\{X_k\}_{k \geq 1}$ are also identically distributed, or (ii) $\sum_{k \geq 1} \sup_{m \geq 1} \mathbb{P}[X_m > a_k] < \infty$. Then $\mathbb{P}\big[\sum_{k=1}^{m} X_k > a_m \text{ for infinitely many } m \in \mathbb{N}\big] = 0$.

Proof. A Borel-Cantelli argument applies to $\mathbb{P}\big[\sum_{k=1}^{m} X_k > a_m \text{ for infinitely many } m\big]$, where the first equality comes from (i), and so we conclude.

Proof of Proposition 24. With the prescribed choice of $p_\ell$ we have finite variance, uniformly in $\theta \in T$. To determine the order of complexity, we would like to apply Lemma 34(i) to the i.i.d. sequence $\{\tau_*^{L_k}\}_{k \geq 1}$, where $\tau_*^{\ell} := C 2^{\gamma(1+\rho)\ell}$. For any $k \geq 1$ and any real $a_k > 0$, we have
(26) $\mathbb{P}[\tau_*^{L_k} > a_k] = \sum_{\ell \geq 1} \mathbb{I}\big[\tau_*^{\ell} > a_k\big]\, p_\ell = \sum_{\ell \geq 1} \mathbb{I}\Big[\ell > \tfrac{1}{\gamma(1+\rho)} \log_2 \tfrac{a_k}{C}\Big]\, p_\ell.$
Because $\sum_{\ell \geq 1} p_\ell = 1$ and $p_\ell$ is monotonically decreasing, the tail sum $\sum_{\ell \geq \ell_*} p_\ell$ is $O(p_{\ell_*})$. Setting $\ell_* = \frac{1}{\gamma(1+\rho)} \log_2 \frac{a_k}{C}$, we therefore obtain that (26) is $O(p_{\ell_*})$, and conclude as before, by using that $\mathcal{C}(m)$ is asymptotically bounded by $a_m$ and setting $m = O(\epsilon^{-2})$.
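The tail calculation in (26) can be checked numerically. The sketch below uses a truncated level distribution $p_\ell \propto 2^{-2\ell}\, \ell \log(\ell)^2$, started at $\ell = 2$ so that the weight is strictly positive (a choice made here purely for illustration), and evaluates $\mathbb{P}[\tau_*^{L} > a]$ via the indicator identity; `C`, `gamma` and `rho` are placeholder constants.

```python
import numpy as np

def level_probs(l_max):
    """Truncated, normalised p_l proportional to 2^(-2l) * l * log(l)^2."""
    ls = np.arange(2, l_max + 1)
    p = 2.0 ** (-2.0 * ls) * ls * np.log(ls) ** 2
    return ls, p / p.sum()

def cost_tail_prob(a, C, gamma, rho, ls, p):
    """P[C * 2^(gamma*(1+rho)*L) > a], computed as the tail sum
    sum_l 1{l > log2(a/C) / (gamma*(1+rho))} p_l, as in (26)."""
    thresh = np.log2(a / C) / (gamma * (1.0 + rho))
    return p[ls > thresh].sum()
```

The tail probability decays geometrically in $\log_2 a$, mirroring the $O(p_{\ell_*})$ bound used in the complexity argument.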
Proof of Proposition 26. We are in the basic setting of Proposition 24 as before, but may additionally choose $\rho \geq 0$ as we please. The growth of $a_k$ given in (27) is essentially determined by $\gamma(1+\rho)/(2b)$, which can be made small by choosing $\rho = 2\alpha - \beta$, implying $b = \alpha$.