Coupled conditional backward sampling particle filter

Unbiased estimation for hidden Markov models has recently been proposed by Jacob et al. (to appear), using a coupling of two conditional particle filters (CPFs). Unbiased estimation has many advantages, such as enabling the construction of asymptotically exact confidence intervals and straightforward parallelisation. In this work we propose a new coupling of two CPFs for unbiased estimation that uses backward sampling steps, an important efficiency-enhancing technique in particle filtering. We show that this coupled conditional backward sampling particle filter (CCBPF) algorithm has better stability properties, in the sense that with a fixed number of particles, the coupling time in terms of iterations increases only linearly with respect to the time horizon, under a general (strong mixing) condition. In contrast, existing coupled CPFs require the number of particles to increase with the horizon length. An important corollary of our results is a new quantitative bound on the convergence rate of the popular backward sampling conditional particle filter. Previous theoretical results have not been able to demonstrate the improvement brought by backward sampling to the CPF, whereas we provide rates showing that backward sampling ensures that the CPF can remain effective with a fixed number of particles, independent of the time horizon.


Introduction
The conditional particle filter (CPF) (Andrieu et al., 2010) is a promising methodology for hidden Markov model smoothing, especially when combined with backward sampling (Whiteley, 2010) or the equivalent ancestor sampling (Lindsten et al., 2014), which we refer to as the conditional backward sampling particle filter (CBPF). Chopin and Singh (2015) introduced a coupled CPF (CCPF) in order to prove a uniform ergodicity result for the CPF. Jacob et al. (2017) proposed to use the CCPF within the scheme of Glynn and Rhee (2014), in order to produce unbiased estimators of smoothing functionals, and also suggested the optional use of coupled ancestor sampling (Lindsten et al., 2014). Unbiased estimation is important for several reasons. It allows for standard estimation of confidence intervals and straightforward parallelisation, and, when applied with a stochastic gradient, for instance using stochastic approximation expectation maximisation (SAEM), unbiased estimators ensure martingale noise, which has good supporting theory.
We focus here on a simple algorithmic modification of the schemes of Chopin and Singh (2015) and Jacob et al. (2017), involving an index-coupled version of Whiteley's (2010) backward sampling step. This approach, which we call the coupled conditional backward sampling particle filter (CCBPF), appears to be more stable than the CCPF (with or without ancestor sampling). Under a general (strong mixing) condition, we prove (Theorem 7) that the coupling time of the CCBPF grows at most linearly with the time horizon, with a fixed (but sufficiently large) number of particles. As an important corollary, we obtain an analogous convergence rate for the CBPF (Theorem 4). This result differs from the earlier guarantees for conditional particle filters, which assume (super)linear growth of the number of particles (Andrieu et al., 2018; Lindsten et al., 2015). Our result confirms a long-held view, stemming from numerous empirical studies, that the CBPF is stable for long observation sequences with a fixed number of particles. Another version of the CPF which is stable with a fixed number of particles is the blocked version of the CPF introduced in Singh et al. (2017).
We also complement the empirical findings of Jacob et al. (2017) by showing quantitative bounds on the 'one-shot' coupling probability of the CCPF. These results are noteworthy, as we believe the CCPF's coupling probability does diminish with the length of the time series T unless the particle number is also increased. Under the minimal assumption of bounded potentials, we prove (Theorem 5) that the coupling probability of the CCPF is at least 1 − O(N^{−1}), similar to what has been shown for the CPF (Andrieu et al., 2018; Lindsten et al., 2015). The constants involved grow very rapidly with T in the absence of the usual stringent mixing condition on the underlying model. When the stringent mixing conditions do hold, we are able to give a more favourable rate of convergence as T increases (Theorem 6), which still requires an increasing number of particles with T.

Notation and preliminaries
Throughout the paper, we assume a general state space X, which is typically R^d equipped with the Lebesgue measure. However, our results hold for any measurable space X equipped with a σ-finite dominating measure, which is denoted by 'dx'. Product spaces are equipped with the related product measures. We use the notation a:b = a, . . . , b for any integers a ≤ b, and use similar notation in indexing: x_{a:b} = (x_a, . . . , x_b) and x^{(a:b)} = (x^{(a)}, . . . , x^{(b)}); we also use combined indexing, such that for instance x_{1:T}^{(1:N)} = (x_1^{(1:N)}, . . . , x_T^{(1:N)}). We adopt the usual conventions concerning empty products and sums, namely ∏_{t=a}^{b}( · ) = 1 and ∑_{t=a}^{b}( · ) = 0 when a > b. We denote x ∧ y := min{x, y}, x ∨ y := max{x, y} and (x)_+ := x ∨ 0.
We use standard notation for the k-step transition probability of a Markov kernel P: P^k(x, A) := ∫ P(x, dy) P^{k−1}(y, A) and P^0(x, A) := 1{x ∈ A}. If ν is a probability measure and f is a real-valued function, then (νP)(A) := ∫ ν(dx) P(x, A), (Pf)(x) := ∫ P(x, dy) f(y) and ν(f) := ∫ ν(dx) f(x), whenever well-defined. The total variation metric between two probability measures µ, ν is defined as ‖µ − ν‖_tv := sup_{|f|≤1} |µ(f) − ν(f)|, and ‖f‖_∞ := sup_x |f(x)|. If two random variables X and Y share a common law, we write X =_d Y. We are interested in computing expectations of smoothing functionals, with respect to the following probability density on the space X^T (cf. Del Moral, 2004):

π_T(x_{1:T}) = Z_T^{−1} M_1(x_1) G_1(x_1) ∏_{t=2}^{T} M_t(x_{t−1}, x_t) G_t(x_{t−1}, x_t),    (1)

where M_1 is a probability density, M_{2:T} are Markov transition densities, G_{1:T} are non-negative potential functions, and Z_T is an unknown normalising constant. In the context of hidden Markov models, the potentials are often taken to be of the form G_t(x_{t−1}, x_t) = g_t(y_t | x_t) with M_t = f_t, in which case π_T(x_{1:T}) corresponds to the smoothing distribution of the hidden Markov model {g_{1:T}, f_{1:T}} conditional on observations y_{1:T}. We will consider two different conditions for the model. The first is generally regarded as non-restrictive in the particle filtering literature and is essentially equivalent to the uniform ergodicity of the CPF (Andrieu et al., 2018).
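To fix ideas, the following is a minimal sketch of a bootstrap particle filter targeting a model of the form (1) in the hidden Markov model case, with M_t = f_t and potentials G_t(x_{t−1}, x_t) = g_t(y_t | x_t). The linear-Gaussian toy model, all parameter values and the name `bootstrap_pf` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def bootstrap_pf(y, N=200, phi=0.9, sigma_x=1.0, sigma_y=0.5, seed=0):
    """Bootstrap particle filter for the toy model X_t = phi*X_{t-1} + sigma_x*V_t,
    Y_t = X_t + sigma_y*W_t. Returns an estimate of log Z_T together with the
    final particle cloud and normalised weights."""
    rng = np.random.default_rng(seed)
    T = len(y)
    x = rng.normal(0.0, sigma_x, size=N)     # draw from M_1 (assumed stationary-ish init)
    log_Z = 0.0
    for t in range(T):
        if t > 0:
            # multinomial resampling, then propagation through M_t
            idx = rng.choice(N, size=N, p=w)
            x = phi * x[idx] + sigma_x * rng.normal(size=N)
        # potential G_t = g_t(y_t | x_t): Gaussian observation density
        logg = -0.5 * ((y[t] - x) / sigma_y) ** 2 - np.log(sigma_y * np.sqrt(2 * np.pi))
        m = logg.max()
        wu = np.exp(logg - m)
        log_Z += m + np.log(wu.mean())       # accumulate the normalising constant estimate
        w = wu / wu.sum()
    return log_Z, x, w
```

The conditional particle filters discussed below reuse exactly this propagation/weighting structure, but condition on a fixed reference trajectory.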
The second is a much stronger assumption, again typical in the particle filtering literature when proving time-uniform error bounds for particle filter estimates (cf. Del Moral, 2004; Del Moral et al., 2010).

Convergence time of conditional backward sampling particle filter
Before proceeding to the construction of the coupled conditional particle filters, we formalise the important implication of our result for the convergence time of the conditional backward sampling particle filter (CBPF) (Whiteley, 2010), or its ancestor sampling implementation (Lindsten et al., 2014), which are probabilistically equivalent and reversible with respect to π_T (Chopin and Singh, 2015).
Theorem 4. Suppose Assumption 2 (strong mixing) holds, and denote by P_{T,N} the Markov transition probability of the CBPF with M_{1:T}, G_{1:T} and N particles. For any α, β ∈ (1, ∞), there exists N_0 ∈ N such that for all N ≥ N_0:

Proof. The upper bound (i) follows from Theorem 7 and Lemma 27, and (ii) follows directly from (i).
Theorem 4 indicates that under the strong mixing assumption, the mixing time of the CBPF increases at most linearly in the number of observations T. We remark that unlike existing results on the CPF, we do not derive a one-shot coupling bound (Chopin and Singh, 2015), or a one-step minorisation measure (Andrieu et al., 2018; Lindsten et al., 2015), to prove the uniform ergodicity of the CBPF transition probability P_{T,N}. This is because the enhanced stability of the CBPF's Markov kernel over that of the CPF can only be established by considering the behaviour of the iterated kernel P_{T,N}^k of Theorem 4, which has thus far proven elusive to study. Thus, in addition to the result, the proof technique is itself novel and of interest. For this reason we dedicate Section 5 to its exposition. Finally, even though α and β in Theorem 4 may be taken arbitrarily small and large, respectively, by increasing N_0, the rate at which N_0 is required to increase in our results with respect to α and β is fast and far more conservative than what has been observed in practice.
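For concreteness, the backward sampling pass of Whiteley (2010), referred to throughout this section, can be sketched as follows: after a forward filtering pass, a single trajectory is drawn backwards in time, reweighting the stored particles at each step by their compatibility with the state already chosen at the next step. The interface (`particles`, `logw`, `log_m`) and the AR(1)-style usage below are our own illustrative assumptions, not the paper's notation.

```python
import numpy as np

def backward_sample(particles, logw, log_m, rng):
    """Draw one trajectory from a stored particle history by backward sampling.
    particles: list of length T of (N,) arrays; logw: list of (N,) unnormalised
    log weights; log_m(t, x_prev, x_next): log transition density M_t, vectorised
    over the array x_prev for a fixed x_next."""
    T = len(particles)
    N = particles[0].shape[0]
    traj = np.empty(T)
    # terminal state: sample proportionally to the final-time weights
    p = np.exp(logw[T - 1] - logw[T - 1].max())
    b = rng.choice(N, p=p / p.sum())
    traj[T - 1] = particles[T - 1][b]
    for t in range(T - 2, -1, -1):
        # reweight by compatibility with the already-chosen next state
        lp = logw[t] + log_m(t + 1, particles[t], traj[t + 1])
        p = np.exp(lp - lp.max())
        b = rng.choice(N, p=p / p.sum())
        traj[t] = particles[t][b]
    return traj
```

In the CBPF this drawn trajectory becomes the reference trajectory for the next iteration of the kernel.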

Coupled conditional particle filters and unbiased estimators
This section is devoted to the CCPF and CCBPF algorithms (in short CCxPF, where x is a placeholder), and the construction of unbiased estimators of E_{π_T}[h(X_{1:T})] using them. We start with Algorithm 1, where the CCxPF algorithms are given in pseudo-code. The algorithms differ only in lines 12-17, highlighting the small, but important, difference between CCPF and CCBPF. The CCBPF incorporates index-coupled backward sampling, which is central to our results. Algorithm 2 details the index-coupled resampling (Chopin and Singh, 2015) employed within the methods.
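At its core, index-coupled resampling is a maximal coupling of two categorical distributions: with probability ∑_i min(w_i, w̄_i) both filters draw a common ancestor index from the overlap measure, and otherwise they draw independently from the normalised residuals. The sketch below is one common way to implement this construction; the interface and names are our own, not the paper's Algorithm 2.

```python
import numpy as np

def index_coupled_resample(w, w_bar, N, rng):
    """Maximal ('index-coupled') resampling: returns index arrays (I, I_bar)
    with I[j] ~ w and I_bar[j] ~ w_bar marginally, and I[j] == I_bar[j] with
    probability sum_i min(w_i, w_bar_i)."""
    nu = np.minimum(w, w_bar)            # overlap (sub-probability) measure
    a = nu.sum()                         # probability of drawing a common index
    I = np.empty(N, dtype=int)
    I_bar = np.empty(N, dtype=int)
    common = rng.random(N) < a
    n_c = common.sum()
    if n_c > 0:
        I[common] = I_bar[common] = rng.choice(len(w), size=n_c, p=nu / a)
    n_r = N - n_c
    if n_r > 0:
        # independent draws from the normalised residual measures,
        # whose supports are disjoint by construction
        r, r_bar = w - nu, w_bar - nu
        I[~common] = rng.choice(len(w), size=n_r, p=r / r.sum())
        I_bar[~common] = rng.choice(len(w), size=n_r, p=r_bar / r_bar.sum())
    return I, I_bar
```

Note that the residual measures have disjoint supports, so indices drawn in the residual branch never coincide; the coupling probability is therefore exactly ∑_i min(w_i, w̄_i).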
The CCxPF algorithms define Markov transition probabilities on X^T × X^T. It is direct to check that CCPF and CCBPF coincide marginally with the CPF (Andrieu et al., 2010) and the CBPF (Whiteley, 2010), respectively. Following Chopin and Singh (2015), it is easy to see that the CCxPF kernels are π̄_T-reversible, where π̄_T(ds, ds̄) := π_T(s) δ_s(ds̄) ds.
Let us first state a result that complements the findings of Jacob et al. (2017). It implies that the CCPF enjoys similar uniform ergodicity to the CPF, with the same rate as the number of particles is increased (cf. Andrieu et al., 2018; Lindsten et al., 2015).
The proof of Theorem 5 is given in Appendix A.
Theorem 5 is stated with a fixed time horizon T, and shows that if the number of particles is large enough, then one-shot coupling happens with high probability from any initial state (s_ref, s̄_ref). The coupling is one-shot since it compares S and S̄, which are the outputs of a single application of the CCPF algorithm. The concern is that if T is large, N may have to be taken very large in order to guarantee some desired minimum coupling probability. Indeed, we have only been able to show the following result:

Theorem 6. Under the setting of Theorem 5, but with Assumption 17 in Appendix B,

Theorem 6, which follows from Lemma 20, shows that the probability of coupling does not diminish when N = O(2^T T). That is, roughly doubling the particle number with every unit increase in T ensures a non-diminishing coupling probability.
In our experiments, the CCBPF exhibited stable behaviour with a fixed and small number of particles, even for large T. Our main result, which we state next, consolidates these empirical findings. In contrast to Theorem 5, the statement of the coupling behaviour for the CCBPF is not one-shot in nature. In our analysis we show that the pair of trajectories output by repeated application of the CCBPF kernel couple progressively, starting from their time 1 components, until eventually all their components up to time T have coupled. For this reason the result is stated in terms of the law of this coupling time, and not P(S = S̄) as in Theorem 5.
The proof of the bound (2) is given in Section 5, and the linear coupling time statement follows trivially. The most striking element of this statement is that the coupling time τ does not exceed ρT, with a guarantee that only improves as T increases.
Let us then turn to the use of CCxPF together with the scheme of Glynn and Rhee (2014), as suggested by Jacob et al. (2017). Algorithm 3 has two adjustable parameters, a 'burn-in' b ≥ 1 and a 'number of particles' N ≥ 2, which may be tuned to maximise its efficiency.

Algorithm 3 Unbiased(b, N)
1: Run a standard particle filter with (M_t, G_t)_{t=1:T} to get a trajectory S̄_0 ∈ X^T.
2: Set (S_0, − ) ← CCxPF(S̄_0, S̄_0, N).
3: for n = 1, 2, . . . do
4:   Set (S_n, S̄_n) ← CCxPF(S_{n−1}, S̄_{n−1}, N).
5: end for

Algorithm 3 iterates either the coupled conditional particle filter CCPF or the coupled conditional backward sampling particle filter CCBPF until a perfect coupling of the trajectories S_n and S̄_n is obtained.
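The debiasing mechanism of Algorithm 3 can be illustrated with a toy stand-in for CCxPF: the sketch below runs the same lag-one coupled iteration and the usual Glynn-Rhee correction sum on a two-state Markov chain, whose faithful coupled kernel plays the role of CCxPF. The chain, its parameters and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def coupled_step(s, s_bar, rng, p_stay=0.8):
    """Coupled Markov kernel for a two-state {0,1} chain: the chains move
    together once they have met (faithful coupling), and otherwise move
    independently."""
    if s == s_bar:
        nxt = s if rng.random() < p_stay else 1 - s
        return nxt, nxt
    x = s if rng.random() < p_stay else 1 - s
    y = s_bar if rng.random() < p_stay else 1 - s_bar
    return x, y

def unbiased_estimate(h, b, rng, n_max=10_000):
    """Glynn-Rhee estimator in the shape of Algorithm 3:
    Z = h(S_b) + sum_{n=b+1}^{tau-1} [h(S_n) - h(S_bar_n)],
    with tau = inf{n >= b : S_n = S_bar_n}."""
    s_bar = 0                                 # fixed initial state (plays the role of line 1)
    s, _ = coupled_step(s_bar, s_bar, rng)    # advance the first chain one step ahead (line 2)
    S, S_bar = [s], [s_bar]
    tau = None
    for n in range(1, n_max):
        s, s_bar = coupled_step(s, s_bar, rng)
        S.append(s)
        S_bar.append(s_bar)
        if n >= b and s == s_bar:
            tau = n
            break
    assert tau is not None, "chains failed to couple within n_max steps"
    return h(S[b]) + sum(h(S[n]) - h(S_bar[n]) for n in range(b + 1, tau))
```

Since S_n and S̄_n follow the same kernel with a one-step lag, the correction sum telescopes and E[Z] equals the stationary expectation of h (here 1/2 by symmetry of the toy chain), regardless of the biased initialisation.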
The following result records general conditions under which the scheme above produces unbiased finite variance estimators.
Theorem 8. Suppose Assumption 1 holds and h : X^T → R is bounded and measurable. Then, Algorithm 3 with CCPF, b ≥ 1 and N ≥ 2, denoting by τ the running time (iterations before producing output), satisfies:

Proof. The bound on |var(Z) − var_{π_T}(h(X))| follows from Lemma 28, and part (ii) follows from Theorem 24 and Lemma 26.
Theorem 8 complements the consistency result of Jacob et al. (2017) by quantifying the convergence rates. Fix T: if N is large, then E[τ] ≈ b, and if b is large, then var(Z) ≈ var_{π_T}(h(X)). As mentioned before, the growth of the constant c with respect to T can be very rapid. In contrast, in the case of the CCBPF, the results may be refined as follows:

Theorem 9. Suppose Assumption 2 holds, let α, β ∈ (1, ∞) and let N_0 ∈ N be as in Theorem 7. Then, Algorithm 3 satisfies, with ρ := log α/log β:

Proof. The results follow from Theorem 7 and Lemma 28, similarly as in the proof of Theorem 8.
Note that the latter term in Theorem 9 (i) is at most (β − 1)^{−1}, showing that the expected coupling time is linear in T. Theorem 9 (ii) may be interpreted as saying that the CCBPF algorithm is almost equivalent to perfect sampling from π_T when b is increased (super)linearly with respect to T.
We conclude the section with a number of remarks regarding Algorithm 3:
(i) We follow Jacob et al. (2017) and suggest an initialisation based on a standard particle filter in line 1. However, this initialisation may be changed to any other scheme which ensures that S_0 and S̄_1 have identical distributions. Our results above do not depend on the chosen initialisation strategy.
(ii) The estimator Z is constructed for a single function h : X^T → R, but several estimators may be constructed simultaneously for a number of functions h_1, . . . , h_m. In fact, if we let τ := inf{n ≥ b : S_n = S̄_n}, we may regard the random signed measure μ̂ := δ_{S_b} + ∑_{n=b+1}^{τ−1} (δ_{S_n} − δ_{S̄_n}) as the output, which will satisfy the unbiasedness E[μ̂(ϕ)] = π_T(ϕ) at least for all bounded measurable ϕ : X^T → R.
(iii) We believe that the method is valid also without Assumption 1, but may exhibit poor performance, similar to the conditional particle filter, which is sub-geometrically ergodic with unbounded potentials (Andrieu et al., 2018).
We use the stochastic ordering X ≤_st Y of two random variables X and Y, which holds if their distribution functions are ordered: P(X ≤ x) ≥ P(Y ≤ x) for all x ∈ R. Two random vectors X and Y satisfy X ≤_st Y if E[φ(X)] ≤ E[φ(Y)] for all functions φ : R^n → R for which the expectations exist, and which are increasing in the sense that φ(x) ≤ φ(y) whenever x ≤ y, where '≤' is the usual partial order: x ≤ y if x_i ≤ y_i for all i = 1:n. Recall also that X ≤_st Y if and only if there exists a probability space with random variables X̃ and Ỹ such that X =_d X̃, Y =_d Ỹ and X̃ ≤ Ỹ almost surely (Shaked and Shanthikumar, 2007, Theorem 6.B.1).
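The 'if and only if' characterisation cited above can be illustrated with the classical quantile coupling: feeding a common uniform variate through both quantile functions yields a monotone coupling whenever the distribution functions are ordered. The sketch and the Exp(2) ≤_st Exp(1) example below are our own illustrations, not part of the paper.

```python
import numpy as np

def monotone_coupling(inv_cdf_x, inv_cdf_y, n, rng):
    """Quantile coupling: pass a common uniform through both quantile
    functions, giving X ~ F_X and Y ~ F_Y with X <= Y pointwise whenever
    F_X(x) >= F_Y(x) for all x (i.e. X <=_st Y)."""
    u = rng.random(n)
    return inv_cdf_x(u), inv_cdf_y(u)

rng = np.random.default_rng(0)
# Exp(rate 2) <=_st Exp(rate 1): the rate-2 exponential has the larger CDF everywhere
x, y = monotone_coupling(lambda u: -np.log1p(-u) / 2.0,
                         lambda u: -np.log1p(-u) / 1.0,
                         1000, rng)
```

Here x ≤ y holds for every sample, realising the almost-sure ordering of Theorem 6.B.1 while preserving both marginals.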
Lemma 12. Given any N ∈ N, consider the random variable ∆ defined in Lemma 10. For any β < ∞ and α ∈ (1, (1 − ǫ)^{−1}), there exists N_0 < ∞ such that for all N ≥ N_0,

Proof. Suppose ϕ : N → R_+ is decreasing and L ∈ N; then, because p_u^{(1)} = 1 and p_u^{(0)} = ǫ for u ≤ 0, we may write

Furthermore, Lemma 13 implies that for t = 1:L, the terms R_t := Ĉ_t/N → 1 in probability as N → ∞, and consequently also p_t^{(1)} → 1. We get the first bound by applying the result above with ϕ(t) = (L − t)_+, because

Lemma 13. The expectation of Ĉ_t generated in Lemma 12 may be lower bounded as follows:

Therefore, for any t ∈ N and ε > 0, there exists N_0 such that for all N ≥ N_0 and all u = 1:t,

Proof. Denote R_t := Ĉ_t/N; then for any t ≥ 1, we have

Note that for a, b > 0 and λ ∈ [0, b), the function x ↦ ax(b − λx)^{−1} is convex on [0, 1]. Therefore, by Jensen's inequality,

Starting with a_t = 1, b_t = 1 and λ_t = 0, we conclude that

and consequently,

This equals the desired bound.
Remark 14. In order to make the bound in Lemma 13 large, it is sufficient to take N ≫ t and log(N) ≥ c + t(− log δ). Usually the latter is dominant, meaning that N of order δ^{−t} is necessary.
We simulated the random variables Ĉ_t, and observed a similar 'cutoff': a δ^{−1}-fold increase in N caused the 'boundary' where Ĉ_t/N starts to drop from around one to zero to extend one step further. We believe that Lemma 13 captures the behaviour of Ĉ_t rather accurately. However, we believe that the bounds for Ĉ_{t−κ} are often rather pessimistic compared with |C_t|.
Proof of Theorem 7. The result follows from the fact that P(τ ≥ n) ≤ P(τ̂ ≥ n), where τ̂ is given in (3), and a Chernoff bound

where the final inequality uses Lemma 12 by choosing u = log α̃, for some α̃ ≤ α such that

Proof. The claim is trivial whenever P(A + ∑_{j=1}^{n} f(Z^{(j)}) = 0 | G) > 0, so consider the case

whence the result follows.
(ii) There exists a coupling such that C_t ⊃ C_t^φ almost surely for all t = 1:T.
The marginal equivalence (i) is straightforward. For the stochastic minorisation (ii), we consider running CCPF(s, s̄, N) and CCPF(s, (φ, . . . , φ), N) simultaneously, in a coupled manner. More specifically, set

For t ≥ 2, we proceed inductively, assuming that X_{t−1} and C_{t−1} ⊃ C_{t−1}^φ satisfy the above, which obviously holds for t = 2. Note that then

Consequently, the outputs of CRes satisfy

and we may couple the outputs such that

and consequently we may also couple X_t^{(i)} ∈ C_t^φ.

Proof of Theorem 5. Consider CCPF(s, (φ, . . . , φ), N), and let Č_t := {i ∈ {2:N} : X̃

Note that the latter quantity does not depend on X_t^{(i)}, but only on the marginal conditional particle filter X̃_t, so we may iterate similarly as above to obtain

where the quantities above are defined as before, with the convention G_0 ≡ 1.
Lemma 19. Suppose that Assumption 17 holds. Then, for any t = 1:T − 1,

Proof. We may apply Lemma 18 recursively with h

Lemma 20. Under Assumption 17,

Proof. The first inequality follows once we prove inductively that

which holds for t = 1 by Lemma 19 (which is symmetric with respect to ρ_t and ρ̄_t). Then,

and the same bound applies to 1 − E[ρ_t]. The latter bound follows since 1 − β_N^T ≤ T(1 − β_N).

Appendix C. Unbiased estimator based on a coupled Markov kernel
We formalise here the construction of unbiased estimators of Markov chain equilibrium expectations due to Glynn and Rhee (2014), and complement the results in Jacob et al. (2017).
Definition 22 (Coupled Markov kernel). Suppose P is a Markov kernel on S, and P̄ is a Markov kernel on S × S. If P̄(x, x̄; · ) ∈ Γ(P(x, · ), P(x̄, · )) for all x, x̄ ∈ S, then P̄ is a coupled kernel corresponding to P.
Definition 23 (Coupling time). The coupling time of the bivariate Markov chain (X_n, X̄_n)_{n≥0} is the random variable τ := inf{n ≥ 0 : X_k = X̄_k for all k ≥ n}.
Theorem 24. Let P be an ergodic Markov kernel on X with invariant distribution π (that is, for all x ∈ X, ‖P^n(x, · ) − π‖_tv → 0 as n → ∞), and suppose P̄ is a corresponding coupled Markov kernel. Let ν be any probability distribution on X, and suppose that λ ∈ Γ(νP, ν).
Below, we use ‖h‖_osc := sup_{x,y∈X} |h(x) − h(y)| and ‖h‖_∞ := sup_{x∈X} |h(x)| ≥ ‖h‖_osc/2. Under the following assumed distribution on the coupling time, not only is the sequence Z_{m,L} defined in (4) uniformly square integrable, but the corresponding sequence {ζ_m} is also an L^2 Cauchy sequence.
The sum may be upper bounded by ∑_{n,ℓ=m}^{L}

A simple calculation yields the desired bound.
Lemma 27. Suppose P̄ is a coupled kernel corresponding to a π-ergodic Markov kernel P. Let τ_{x,x̄} stand for the coupling time of the Markov chain (X_n, X̄_n)_{n≥0} with transition probability P̄ and with (X_0, X̄_0) ≡ (x, x̄). Then, ‖P^n(x, · ) − π‖_tv ≤ 2 sup_{x̄∈X} P(τ_{x,x̄} > n).
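Lemma 27 can be checked numerically on a toy example; the symmetric two-state chain, the 'independent until met' coupling and all parameter values below are illustrative assumptions of ours.

```python
import numpy as np

def tail_prob(n, reps, rng, p_stay=0.8):
    """Monte-Carlo estimate of P(tau_{0,1} > n) for the symmetric two-state
    {0,1} chain, under the coupling that moves the chains independently
    until they meet."""
    exceed = 0
    for _ in range(reps):
        x, y, t = 0, 1, 0
        while x != y and t < n:
            x = x if rng.random() < p_stay else 1 - x
            y = y if rng.random() < p_stay else 1 - y
            t += 1
        exceed += int(x != y)
    return exceed / reps

rng = np.random.default_rng(0)
n = 5
# Exact total variation in the sup_{|f|<=1} convention used in the paper:
# ||P^n(0, .) - pi||_tv = (2*p_stay - 1)^n for this chain.
tv_exact = (2 * 0.8 - 1) ** n
bound = 2 * tail_prob(n, 20_000, rng)
```

For this chain the per-step meeting probability from unequal states is 2·0.8·0.2 = 0.32, so P(τ > n) = 0.68^n, and the coupling bound 2·0.68^n indeed dominates the exact total variation 0.6^n, as Lemma 27 requires.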
Proof. Using Lemma 25 together with Lemma 27 and Lemma 26, we get

The claim follows easily, because ‖h‖_osc ≤ 2‖h‖_∞.