Unbiased estimators and multilevel Monte Carlo

Multilevel Monte Carlo (MLMC) and unbiased estimators recently proposed by McLeish (Monte Carlo Methods Appl., 2011) and Rhee and Glynn (Oper. Res., 2015) are closely related. This connection is elaborated by presenting a new general class of unbiased estimators, which admits previous debiasing schemes as special cases. New lower variance estimators are proposed, which are stratified versions of earlier unbiased schemes. Under general conditions, essentially when MLMC admits the canonical square root Monte Carlo error rate, the proposed new schemes are shown to be asymptotically as efficient as MLMC, both in terms of variance and cost. The experiments demonstrate that the variance reduction provided by the new schemes can be substantial.


Introduction
Multilevel Monte Carlo (MLMC) methods, pioneered by Heinrich (2001) and Giles (2008), are now standard for estimation of expectations of functionals of processes defined by stochastic differential equations (SDEs). While the MLMC techniques have their origins in integral operators (Heinrich, 2001) and SDEs (Giles, 2008), they can also be applied in other application domains where estimators of gradually increasing accuracy are available; see the recent review by Giles (2015) and references therein.
More recently, the so-called debiasing techniques (McLeish, 2011; Rhee and Glynn, 2012, 2015) have attracted a lot of research activity (e.g. Agapiou et al., 2014; Glynn and Rhee, 2014; Jacob and Thiery, 2015; Lyne et al., 2015; Walter, 2017), although similar ideas have been suggested much earlier in more specific contexts (e.g. Glynn, 1983; Rychlik, 1990, 1995). These techniques are based on similar ideas as MLMC, but instead of optimal allocation of computational resources for minimising the error, the primary focus is on providing unbiased estimators. Monte Carlo inference is straightforward with independent unbiased samples, allowing confidence intervals to be constructed in a reliable way (Glynn and Whitt, 1992b). Debiasing techniques may also be employed within a stochastic approximation algorithm (Dereich and Mueller-Gronbach, 2015). In particular, in a stochastic gradient descent type algorithm (Robbins and Monro, 1951), relevant for instance in maximum likelihood inference (Delyon et al., 1999), an unbiased gradient estimate implies pure martingale noise, which is supported by a well-established theory (e.g. Borkar, 2008; Kushner and Yin, 2003).
The debiasing techniques involve balancing cost and variance, which often boils down to similar methods and conditions as those used with MLMC. The connection between MLMC and debiasing techniques has been pointed out earlier, at least by Rhee and Glynn (2015), Giles (2015) and Dereich and Mueller-Gronbach (2015), but this connection has not been fully explored yet. The purpose of this paper is to further clarify the connection of MLMC and debiasing techniques, within a general framework for unbiased estimators.
Many techniques for unbiased estimation other than those considered here have been suggested in the literature. For instance, there is a whole body of literature on 'perfect sampling' by Markov chains (Propp and Wilson, 1996) or with certain classes of SDE models (Beskos and Roberts, 2005); see also the recent monograph by Huber (2015) and references therein. Perfect sampling can be used to construct unbiased estimators, but the problem is generally more demanding, and the methods are often harder to implement. See also the recent article by Jacob and Thiery (2015) for discussion of other related unbiased estimation techniques.
The rest of the paper is organised as follows. Multilevel Monte Carlo and the previous debiasing methods are presented briefly in Sections 2 and 3, respectively. Section 4 introduces a new general unbiased scheme with an explicit expression for its variance (Theorem 7). The unbiased estimators suggested by McLeish (2011) and Rhee and Glynn (2012, 2015) are reformulated as specific instances of this scheme, as is an obvious 'hybrid' scheme with MLMC and unbiased components (Example 11).
New unbiased estimators are suggested in Section 5. Two of the new schemes, termed stratified and residual sampling estimators, have provably lower variance than simple averages of independent unbiased estimators (Proposition 16). Because stratification is a well-known variance reduction technique, these estimators may already be known, but they do not seem to have been recorded in the literature. The first main finding of this paper is Theorem 19, which shows that the variances of two of the new schemes are asymptotically equal to that of MLMC under general conditions. This result suggests that unbiasedness can often be achieved with asymptotically negligible additional variance.
The expected cost of the methods is discussed in Section 6. The new schemes appear even more appealing after seeing that the expected costs of MLMC and the unbiased schemes are also asymptotically equivalent (Proposition 22), and therefore the efficiency of an estimator based on the new schemes can be made arbitrarily close to that of MLMC (Corollary 23). The limiting variance formulation in Theorem 19 leads to an easily applicable optimisation criterion for the sampling distribution related to the new estimators.
Section 7 presents a generalisation of the unbiased scheme, which accommodates further conditioning and dependent randomisation schemes based on stopping times. Numerical experiments in Section 8 show how the efficiency bounds predicted by theory are attained in four examples, three of which were also studied by Rhee and Glynn (2015). The paper is concluded by a discussion about the implications of the findings in Section 9. Some practical guidelines and possible topics of further research are discussed as well.

Multilevel Monte Carlo
The starting point of MLMC is the existence of estimators (Y i ) i≥1 , which are increasingly accurate approximations of a given 'target' random variable Y , whose expectation is of interest. That is, (Y i ) i≥1 are random variables with EY i → EY , and the task is to provide a numerical approximation of EY . The simulation cost of a single realisation of Y i increases in i, which calls for optimisation of computational resources.
Such a scenario arises for instance in the context of SDE models, commonly applied in option pricing. In such an application, we might have Y := f(X_T), where f : R → R is a given 'payoff' function, and X_T corresponds to the terminal value of the process (X_t)_{t∈[0,T]} solving the SDE

  dX_t = µ(X_t) dt + σ(X_t) dB_t,   X_0 = x_0,   (1)

where (B_t)_{t≥0} stands for the standard Brownian motion, the functions µ, σ : R → R are the drift and diffusion parameters of the model, and x_0 ∈ R is the initial value. Unless µ and σ are of certain specific form, the random variable X_T cannot be simulated exactly, but it admits easily implementable approximations based on time-discretisation of (1). The commonly applied dyadic uniform meshes are defined as τ_i := {kT2^{−i} : k = 0, . . . , 2^i}. The Milstein discretisation (e.g. Kloeden and Platen, 1992) corresponding to τ_i is defined by letting X_0^{(τ_i)} := x_0 and

  X_{k+1}^{(τ_i)} := X_k^{(τ_i)} + µ(X_k^{(τ_i)}) h + σ(X_k^{(τ_i)}) ∆B_k + (1/2) σ(X_k^{(τ_i)}) σ'(X_k^{(τ_i)}) ((∆B_k)^2 − h),

where h := T2^{−i} and the ∆B_k are independent N(0, h) random variables. The final value of such an iteration provides an approximation of X_T, so we may set Y_{i+1} := f(X_{2^i}^{(τ_i)}) for i ≥ 0. The cost of simulating a single realisation of Y_i is of order 2^i, and under certain, fairly general, conditions on µ, σ and f, the mean square error satisfies E(Y − Y_{i−1})^2 ≤ c2^{−λi} with λ > 1 (e.g. Kloeden and Platen, 1992).
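As a concrete illustration, the Milstein iteration above can be sketched as follows. This is a minimal Python sketch, not the paper's implementation; the model functions and the geometric Brownian motion parameters are assumed placeholders chosen for illustration only.

```python
import math
import random

def milstein_path(mu, sigma, dsigma, x0, T, level, rng):
    """Approximate X_T with the Milstein scheme on the dyadic mesh of 2**level steps."""
    n = 2 ** level
    h = T / n
    x = x0
    for _ in range(n):
        db = rng.gauss(0.0, math.sqrt(h))  # Brownian increment over one step
        x += mu(x) * h + sigma(x) * db + 0.5 * sigma(x) * dsigma(x) * (db * db - h)
    return x

# Assumed example: geometric Brownian motion dX = r X dt + s X dB.
r, s, x0, T = 0.05, 0.2, 1.0, 1.0
rng = random.Random(1)
y = milstein_path(lambda x: r * x, lambda x: s * x, lambda x: s, x0, T, level=6, rng=rng)
```

Doubling `level` halves the step size, giving the increasingly accurate family (Y_i) at a cost of order 2^i per realisation.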
MLMC is an efficient way to use such approximations in order to approximate EY. It is based on the following, seemingly trivial, observation (with Y_0 ≡ 0):

  EY_m = Σ_{i=1}^m E(Y_i − Y_{i−1}),

which suggests that one may construct an estimate of EY_m using a separate Monte Carlo approximation of each term E(Y_i − Y_{i−1}):

  Z_ML := Σ_{i=1}^m (1/n_i) Σ_{j=1}^{n_i} ∆_i^{(j)},

where n_1, . . . , n_m ∈ N and the ∆_i^{(j)} are independent random variables with E∆_i^{(j)} = EY_i − EY_{i−1}. It is direct to check that then EZ_ML = EY_m. There are two key reasons why MLMC is useful:
(i) It allows the use of coupled differences ∆_i := Y_i − Y_{i−1}, where the approximations Y_i and Y_{i−1} are dependent, such that |∆_i| is typically small when i is large. In the context of the SDE example discussed above, such a coupling arises naturally when using a common Brownian path in the discretisations leading to Y_i and Y_{i−1}.
(ii) It allows the computational effort devoted to each 'level' i, that is, to estimating E(Y_i − Y_{i−1}), to be optimised. The key benefit is that fewer samples n_i are often necessary with higher i, which leads to lower overall cost.
Theorem 1 of Giles (2015), quoted below, gives the complexity of MLMC under the common exponential framework. It assumes that the expected cost (computational complexity) of simulating each term ∆_i is at most c_3 2^{γi}.

Theorem 1. Suppose the (∆_i^{(j)}) are independent, and there exist positive c_1, c_2, c_3, α, β, γ with α ≥ (β ∧ γ)/2, such that for all i ≥ 1, |EY_i − EY| ≤ c_1 2^{−αi}, var(∆_i) ≤ c_2 2^{−βi}, and the expected cost of ∆_i is at most c_3 2^{γi}. Then, there exists a positive constant c_4 such that for any ε < 1/e there are values m and n_1, . . . , n_m ∈ N such that the MLMC estimator satisfies the mean square error (MSE) bound E(Z_ML − EY)^2 ≤ ε^2, with the expected cost κ satisfying κ ≤ c_4 ε^{−2} if β > γ, κ ≤ c_4 ε^{−2} (log ε)^2 if β = γ, and κ ≤ c_4 ε^{−2−(γ−β)/α} if β < γ.

The SDE application discussed above often satisfies the conditions of Theorem 1 with β > γ (Kloeden and Platen, 1992). In a multidimensional SDE setting, the antithetic truncated Milstein scheme (Giles and Szpruch, 2014) often leads to β > γ as well. This highlights why MLMC has become so popular: it is often possible to attain the 'canonical' rate ε^{−2}, equivalent to a square root error rate κ^{−1/2} in terms of the cost κ, despite the bias, using simple standard discretisation methods.
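For concreteness, the estimator Z_ML can be sketched as follows. This is a minimal Python sketch with an artificial, analytically tractable family of levels standing in for the SDE discretisations; the level family and the per-level sample sizes are assumptions made for illustration only.

```python
import math
import random

def mlmc(sample_delta, n_samples, rng):
    """Z_ML = sum_{i=1}^m (1/n_i) sum_{j=1}^{n_i} Delta_i^(j), independent per-level samples."""
    z = 0.0
    for i, n_i in enumerate(n_samples, start=1):
        z += sum(sample_delta(i, rng) for _ in range(n_i)) / n_i
    return z

# Assumed artificial levels (not an SDE): Y_i = (1 - 2**-i) exp(W) with W ~ N(0, 1),
# so Delta_i = Y_i - Y_{i-1} = 2**-i exp(W) and EY_m = (1 - 2**-m) exp(1/2).
def sample_delta(i, rng):
    return 2.0 ** (-i) * math.exp(rng.gauss(0.0, 1.0))

rng = random.Random(1)
z = mlmc(sample_delta, [4000, 2000, 1000, 500, 250], rng)  # estimates EY_5
```

The geometrically decaying sample sizes mimic the optimised allocation of point (ii): most of the work is spent on the cheap, high-variance first level.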
Crude Monte Carlo, that is, taking a fixed level m and averaging independent realisations of Y m , leads to worse rates. The same canonical square root error rate can be attained with similar assumptions using the debiasing techniques which are discussed next.

Debiasing techniques
As with MLMC, assume that (Y_i)_{i≥1} and Y are integrable random variables with EY_i → EY and Y_0 ≡ 0; then EY = Σ_{i≥1} E(Y_i − Y_{i−1}). The fundamental observation behind the unbiased schemes is that one may employ randomisation to pick a finite number of terms of the series to construct estimators with expectation EY.
The two results below due to Rhee and Glynn (2015) propose two such estimators, along with expressions for their second moments. Proofs of Theorems 3 and 5 are shown to follow as a consequence of Theorem 7 in Section 4 (Examples 9 and 10).
Condition 2. Suppose (∆_i)_{i≥1} are independent random variables with E∆_i = EY_i − EY_{i−1}, and (p_i)_{i≥1} is a probability distribution with p_i > 0 for all i ∈ N such that Σ_{i≥1} E∆_i^2/p_i < ∞.

Theorem 3 (Single term estimator). Suppose (∆_i)_{i≥1} and (p_i)_{i≥1} satisfy Condition 2, and R ∈ N is a random variable independent of (∆_i)_{i≥1} with P(R = i) = p_i; then the single-term estimator Z := ∆_R/p_R satisfies EZ = EY and EZ^2 = Σ_{i≥1} E∆_i^2/p_i.

McLeish (2011) first suggested the following estimators, but with different conditions and a different expression for the variance.
Condition 4. Suppose (∆_i)_{i≥1} are random variables with E∆_i = EY_i − EY_{i−1}, and (p_i)_{i≥1} is a probability distribution such that p̃_i := Σ_{j≥i} p_j > 0 for all i ∈ N, and either one of the following holds:
(i) ∆_i = Y_i − Y_{i−1} and Σ_{i≥1} E(Y − Y_{i−1})^2/p̃_i < ∞; or
(ii) the (∆_i)_{i≥1} are independent and Σ_{i≥1} [var(∆_i) + (EY − EY_{i−1})^2]/p̃_i < ∞.

Theorem 5 (Sum estimator). Suppose (∆_i)_{i≥1} and (p_i)_{i≥1} satisfy Condition 4 and R ∈ N is a random variable independent of (∆_i)_{i≥1} with P(R = i) = p_i; then the sum estimator Z_Σ := Σ_{i=1}^R ∆_i/p̃_i satisfies EZ_Σ = EY. In case of Condition 4 (i),

  EZ_Σ^2 = Σ_{i≥1} [E(Y − Y_{i−1})^2 − E(Y − Y_i)^2]/p̃_i,

and, in case of Condition 4 (ii),

  EZ_Σ^2 = Σ_{i≥1} [var(∆_i) + (EY − EY_{i−1})^2 − (EY − EY_i)^2]/p̃_i.

Remark 6. We follow Rhee and Glynn (2015) and call the estimator of Theorem 5 the coupled sum in case of Condition 4 (i) and the independent sum in case of Condition 4 (ii). Note that Condition 4 (ii) is equivalent to the assumption of (Rhee and Glynn, 2015, Theorem 5(a)), because p̃_i is non-increasing.

Let us briefly return to the case where Y_i corresponds to the solution of a time-discretised SDE with a mesh of 2^i points, as discussed in Section 2. Many approximation schemes admit a weak rate α > 1/2, which implies that (EY_i − EY)^2 ≤ c2^{−2αi} (Kloeden and Platen, 1992). If additionally var(∆_i) ≤ c2^{−βi} with β > 1, often satisfied for instance by the Milstein scheme (Jentzen et al., 2009; Kloeden and Platen, 1992) or in the multivariate case by the antithetic truncated Milstein scheme (Giles and Szpruch, 2014), then taking p_i ∝ 2^{−ξi}, where ξ ∈ (1, 2α ∧ β), satisfies Conditions 2 and 4 (ii). Because the cost of level i is of order 2^i, this sampling scheme admits a finite expected cost (see Section 6 for details). This shows that unbiased estimators with finite variance and finite expected cost can be obtained under the same conditions under which MLMC admits the canonical error rate. In case of correlated (∆_i)_{i≥1} in Condition 4 (i), a slightly more stringent condition on the strong rate, E(Y − Y_{i−1})^2 ≤ c2^{−λi} with λ > 1, is required.

Giles (2015) pointed out that averaging n independent single term estimators Z^{(1)} introduced in Theorem 3 corresponds to an estimator of the form Z = Σ_{i≥1} (np_i)^{−1} Σ_{j=1}^{N_i} ∆_i^{(j)}.
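The single term estimator of Theorem 3 and the coupled sum estimator of Theorem 5 can be sketched as follows. The artificial level family (chosen so that EY = exp(1/2)) and the geometric randomisation distribution p_i = 2^{-i} are assumed examples, not the paper's SDE models.

```python
import math
import random

def sample_R(rng):
    # R geometric: P(R = i) = 2**-i, so ptilde_i = 2**-(i-1)
    i = 1
    while rng.random() >= 0.5:
        i += 1
    return i

# Assumed artificial levels: Y_i = (1 - 4**-i) exp(W), hence Delta_i = 3 * 4**-i * exp(W).
def sample_delta(i, rng):
    return 3.0 * 4.0 ** (-i) * math.exp(rng.gauss(0.0, 1.0))

def single_term(rng):
    # Z = Delta_R / p_R  (Theorem 3)
    r = sample_R(rng)
    return sample_delta(r, rng) * 2.0 ** r

def coupled_sum(rng):
    # Z = sum_{i<=R} Delta_i / ptilde_i  (Theorem 5); one W couples all levels
    r = sample_R(rng)
    w = math.exp(rng.gauss(0.0, 1.0))
    return sum(3.0 * 4.0 ** (-i) * w * 2.0 ** (i - 1) for i in range(1, r + 1))

rng = random.Random(7)
est1 = sum(single_term(rng) for _ in range(20000)) / 20000   # approx exp(1/2)
est2 = sum(coupled_sum(rng) for _ in range(20000)) / 20000   # approx exp(1/2)
```

Both averages are unbiased for EY = exp(1/2) ≈ 1.6487, even though each individual draw only touches finitely many levels.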

General unbiased scheme
Here N_i is the number of samples from level i, and EN_i = np_i. This shows a close connection with MLMC, which corresponds to taking N_i ≡ n_i = np_i (up to some level m). Inspired by this remark, and the techniques of Rhee and Glynn (2015), consider the following general unbiased scheme, with an expression for its variance.
Theorem 7. Suppose (N_i)_{i≥1} are non-negative integer-valued random variables, independent of (∆_i^{(j)})_{i,j≥1}, with EN_i ∈ (0, ∞) for all i ≥ 1, and for 0 ≤ ℓ < m denote by v_{ℓ,m} the variance of the partial sum Σ_{i=ℓ+1}^m (EN_i)^{−1} Σ_{j=1}^{N_i} ∆_i^{(j)}. If there exists a strictly increasing sequence of positive integers (m_k)_{k≥1} such that lim_{k→∞} sup_{j≥1} v_{m_k, m_{k+j}} = 0 and Σ_{i≥1} N_i < ∞ almost surely, then the estimator Z := Σ_{i≥1} (EN_i)^{−1} Σ_{j=1}^{N_i} ∆_i^{(j)} satisfies EZ = EY and var(Z) = lim_{k→∞} v_{0,m_k}. The proof of Theorem 7 is given in Appendix B.
Consider first a simple example of Theorem 7, where N i are taken independent.
Example 8. If the (N_i)_{i≥1} are independent and the conditions of Theorem 7 hold, then the corresponding estimator is unbiased with variance v = Σ_{i≥1} (EN_i)^{−2} [EN_i var(∆_i) + var(N_i)(E∆_i)^2]. In particular, taking N_i Poisson with intensity np_i, where Σ_{i≥1} p_i = 1, we have Σ_i N_i < ∞ almost surely by the Borel-Cantelli lemma, and v = n^{−1} Σ_{i≥1} E∆_i^2/p_i. The averages of single term estimators and sum estimators introduced in Theorems 3 and 5 correspond to certain dependence structures of (N_i)_{i≥1}, as we shall see next. The following two examples are important, because the new schemes introduced in Section 5 will be based on the same constructions.
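A sketch of the scheme with independent Poisson counts N_i might look as follows. The artificial level family is an assumed example, and the truncation at max_i is a simplification for the sketch (the neglected levels carry a negligible probability of a non-zero N_i here).

```python
import math
import random

def poisson(lam, rng):
    # Knuth's product method, adequate for the moderate intensities used here
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poisson_scheme(sample_delta, n, p, max_i, rng):
    """Z = sum_i (n p_i)^{-1} sum_{j=1}^{N_i} Delta_i^(j) with N_i ~ Poisson(n p_i)."""
    z = 0.0
    for i in range(1, max_i + 1):
        lam = n * p(i)
        for _ in range(poisson(lam, rng)):
            z += sample_delta(i, rng) / lam
    return z

# Assumed artificial levels: Delta_i = 3 * 4**-i * exp(W), so EY = exp(1/2).
def sample_delta(i, rng):
    return 3.0 * 4.0 ** (-i) * math.exp(rng.gauss(0.0, 1.0))

rng = random.Random(3)
est = sum(poisson_scheme(sample_delta, 500, lambda i: 2.0 ** -i, 30, rng)
          for _ in range(50)) / 50
```

With p_i = 2^{-i}, almost all of the Poisson counts beyond the first few levels are zero, so each realisation simulates only finitely many differences.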
Example 9 (Single term estimators). Suppose Condition 2 holds. Define N_i := #{j ≤ n : R^{(j)} = i}, where R^{(1)}, . . . , R^{(n)} are independent random variables with P(R^{(j)} = i) = p_i, and call the corresponding estimator of Theorem 7 Z^{(n)}_iid. It is easy to see that this corresponds to an average of n independent single term estimators Z^{(1)}. Note that (N_i)_{i≥1} follow a multinomial distribution with parameters n and (p_i)_{i≥1}. We have EN_i = np_i and var(Z^{(n)}_iid) = n^{−1}(Σ_{i≥1} E∆_i^2/p_i − (EY)^2). The above also proves Theorem 3 with n = 1.
Example 10 (Sum estimators). Suppose (∆_i)_{i≥1} and (p_i)_{i≥1} satisfy Condition 4, let R^{(1)}, . . . , R^{(n)} be independent with P(R^{(j)} = i) = p_i, and define Ñ_i := #{j ≤ n : R^{(j)} ≥ i}, so that EÑ_i = np̃_i. The corresponding estimator of Theorem 7 is an average of n independent sum estimators; changing the indexing to ensure that (R^{(j)}) are non-increasing does not affect the distribution. Suppose then that Condition 4 (ii) holds. It is straightforward to check that the corresponding v_{ℓ,m} satisfy lim_{ℓ→∞} sup_{m>ℓ} v_{ℓ,m} = 0 by assumption, and the claim follows similarly as above. Taking n = 1 concludes also the proof of Theorem 5.
The last example is an obvious 'hybrid' scheme involving MLMC and an 'unbiased tail' scheme, which also falls into the framework of Theorem 7.
Example 11. Assume that r, m ∈ N, that N_i := n_i ∈ N are deterministic for i ≤ m, and that N_i := Σ_{j=1}^r I(R^{(j)} = i) for i > m, where R^{(1)}, . . . , R^{(r)} are independent with P(R^{(j)} = i) = p_i for i > m and Σ_{i>m} p_i = 1. The estimator can be written as

  Z = Σ_{i=1}^m (1/n_i) Σ_{j=1}^{n_i} ∆_i^{(j)} + (1/r) Σ_{j=1}^r ∆_{R^{(j)}}^{(j)}/p_{R^{(j)}}.

The first term coincides with an MLMC estimator with m levels, with expectation EY_m, and the second term is an average of single term estimators of EY − EY_m, with r samples. Note that we could have used any unbiased scheme in the latter part, provided it satisfies the conditions in Theorem 7.
The new sampling schemes discussed next in Section 5 provide a different view on the balancing; see in particular Theorem 19.

New unbiased estimators
Let us now turn to new, practically interesting estimators, which correspond to specific choices in Theorem 7. The estimators are based on stratification, a classical variance reduction technique in survey sampling (e.g. Hansen et al., 1953), which has also been widely used in Monte Carlo; see for instance Douc et al. (2005) and Glasserman (2003).
The following lemma states classical results on stratification (cf. Vihola, 2015, for proof).
Lemma 12. Let X be an integrable random variable, let A_1, . . . , A_m be exhaustive disjoint events with P(A_i) = q_i > 0, and let ℓ_1, . . . , ℓ_m ∈ N. Assume the X_i^{(j)} are independent random variables with conditional laws P(X ∈ · | A_i) for j = 1, . . . , ℓ_i. Then the stratified average Σ_{i=1}^m (q_i/ℓ_i) Σ_{j=1}^{ℓ_i} X_i^{(j)} has expectation EX, and if ℓ_i = nq_i ∈ N for some n ∈ N, its variance is at most var(X)/n.

In what follows, assume (p_i)_{i≥1} are positive and such that Σ_{i≥1} p_i = 1, and denote p̃_i := Σ_{j≥i} p_j. Let us first consider uniform stratification schemes. For that purpose, recall the definition of the generalised inverse distribution function F^{−1} : (0, 1) → N corresponding to (p_i)_{i≥1}, F^{−1}(u) := min{i ∈ N : p_1 + · · · + p_i ≥ u}.

Definition 13 (Uniformly stratified estimators). Let n ∈ N, assume the U^{(j)} are independent U((j − 1)/n, j/n) random variables for j = 1, . . . , n, and let R^{(j)} := F^{−1}(U^{(j)}).
(i) Define N_i := #{j ≤ n : R^{(j)} = i}; then Z^{(n)}_str defined as in Theorem 7 is the uniformly stratified single term estimator.
(ii) Define Ñ_i := #{j ≤ n : R^{(j)} ≥ i}; then Z^{(n)}_{Σ,str} defined as in Theorem 7 is the uniformly stratified sum estimator.
Definition 14 (Systematic sampling estimators). Let u_j := (j − 1)/n and define U^{(j)} := u_j + U for all j = 1, . . . , n, where U ∼ U(0, 1/n), and let R^{(j)} := F^{−1}(U^{(j)}). Define then N_i and Ñ_i as in Definition 13 (i) and (ii), respectively, and the corresponding systematic sampling single term estimator Z^{(n)}_sys and systematic sampling sum estimator Z^{(n)}_{Σ,sys}. The consistency and a variance bound for the uniformly stratified and systematic sampling estimators are stated in Proposition 16, after introducing another slightly different stratification scheme.
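The level draws behind Definitions 13 and 14 can be sketched as follows. The geometric distribution p_i = 2^{-i} and the truncation of F^{-1} are assumed choices for illustration.

```python
import random

def make_inv_cdf(p, max_i=64):
    """Generalised inverse F^{-1}(u) = min{i : p_1 + ... + p_i >= u}."""
    cum, c = [], 0.0
    for i in range(1, max_i + 1):
        c += p(i)
        cum.append(c)
    def inv(u):
        for i, ci in enumerate(cum, start=1):
            if ci >= u:
                return i
        return max_i  # truncation guard beyond the tabulated mass
    return inv

def stratified_levels(n, inv, rng):
    # Definition 13: U^(j) uniform on ((j-1)/n, j/n), one draw per stratum
    return [inv((j + rng.random()) / n) for j in range(n)]

def systematic_levels(n, inv, rng):
    # Definition 14: U^(j) = (j-1)/n + U with a single common U ~ U(0, 1/n)
    u = rng.random() / n
    return [inv(j / n + u) for j in range(n)]

inv = make_inv_cdf(lambda i: 2.0 ** -i)  # assumed geometric p_i = 2**-i
levels = stratified_levels(8, inv, random.Random(3))
sys = systematic_levels(8, inv, random.Random(3))
```

Because the strata partition (0, 1) and F^{-1} is monotone, both schemes reproduce the expected level counts N_i ≈ np_i almost deterministically, which is the source of the variance reduction.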
Definition 15 (Residual sampling estimators). Let n ∈ N, define n_i := ⌊np_i⌋ and let r := n − Σ_{i≥1} n_i ≥ 0. If r > 0, define the 'residual' probability distribution (p*_i)_{i≥1} as p*_i := (np_i − n_i)/r, and let Q^{(j)} be independent random variables such that P(Q^{(j)} = i) = p*_i for j = 1, . . . , r.
Define N_i := n_i + Σ_{j=1}^r I(Q^{(j)} = i), and Ñ_i analogously, and the corresponding residual sampling single term estimator Z^{(n)}_res and residual sampling sum estimator Z^{(n)}_{Σ,res} as in Theorem 7.

Proposition 16. The estimators of Definitions 13-15 are unbiased, and the uniformly stratified and residual sampling estimators satisfy var(Z^{(n)}_str) ≤ var(Z^{(n)}_iid) and var(Z^{(n)}_res) ≤ var(Z^{(n)}_iid), with the analogous bounds for the sum estimators. Proposition 16 follows as a consequence of Lemma 12; the detailed proof is given in Appendix C.
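A sketch of the residual sampling level allocation of Definition 15 might look as follows; the geometric distribution and the truncation at max_i are assumed simplifications for the sketch.

```python
import math
import random

def residual_levels(n, p, max_i, rng):
    """Definition 15 sketch: floor(n p_i) deterministic copies of level i, plus
    r = n - sum_i floor(n p_i) draws from the residual distribution p*_i."""
    base = [math.floor(n * p(i)) for i in range(1, max_i + 1)]
    levels = [i + 1 for i, k in enumerate(base) for _ in range(k)]
    r = n - len(levels)
    weights = [n * p(i + 1) - b for i, b in enumerate(base)]
    total = sum(weights)  # equals r, up to truncation of the tail at max_i
    for _ in range(r):
        u, i = rng.random() * total, 0
        while i < len(weights) - 1 and u > weights[i]:
            u -= weights[i]
            i += 1
        levels.append(i + 1)
    return levels

levels = residual_levels(10, lambda i: 2.0 ** -i, 30, random.Random(5))
```

With n = 10 and p_i = 2^{-i}, levels 1-3 receive their 5, 2 and 1 guaranteed samples, and only the two residual draws are random, so the level counts can never stray far from np_i.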
Remark 17. Instead of using u_j = (j − 1)/n in systematic sampling, we could instead use any sequence u_1, . . . , u_n ∈ [0, 1], and let U^{(j)} := u_j + U mod 1, where U ∼ U(0, 1). This would not be stratification, but it is direct to check that the estimators still retain the same expectation, and also the same pessimistic variance bound. For instance, using a low-discrepancy sequence (u_j) would correspond to randomised quasi-Monte Carlo (e.g. Dick et al., 2013).
Proposition 16 states that all the new estimators are unbiased, and that the uniformly stratified and residual sampling estimators cannot be worse than averages of independent single term and sum estimators, in terms of variance. In fact, they often have a strictly lower variance, but it is generally difficult to quantify the benefit. The following Theorem 19 is often more useful, indicating that stratified and residual sampling schemes attain asymptotically the efficiency of MLMC under general conditions. Let us first formulate 'idealised' MLMC strategies based on limiting allocations (p i ) i≥1 .
Definition 18. Assume (p_i)_{i≥1} are non-negative with Σ_{i≥1} p_i = 1 and define p̃_i := Σ_{j≥i} p_j. For any n ∈ N, define n_i := ⌊np_i⌋, ñ_i := ⌊np̃_i⌋, m_n := max{i ≥ 0 : n_i > 0} and m̃_n := max{i ≥ 0 : ñ_i > 0}, and denote the corresponding MLMC estimators

  Z^{(n)}_ML := Σ_{i=1}^{m_n} (1/n_i) Σ_{j=1}^{n_i} ∆_i^{(j)}  and  Z^{(n)}_{Σ,ML} := Σ_{i=1}^{m̃_n} (1/ñ_i) Σ_{j=1}^{ñ_i} ∆_i^{(j)}.

Practical implementations of MLMC are often based on application of stopping rules, which may determine n_i and m_n during simulation. Definition 18 can therefore be viewed as an 'idealised' version of MLMC, where a (nearly) optimal allocation strategy (p_i)_{i≥1} is known beforehand, and n_i and m_n are determined in terms of a single 'running-time' parameter n.
Theorem 19. Below, Z^{(n)}_* stands for any of the estimators Z^{(n)}_str, Z^{(n)}_res or Z^{(n)}_ML, and Z^{(n)}_{Σ,*} for any of Z^{(n)}_{Σ,str}, Z^{(n)}_{Σ,res} or Z^{(n)}_{Σ,ML}.
(i) If Condition 2 holds, then lim_{n→∞} n var(Z^{(n)}_*) = σ^2_∞ := Σ_{i≥1} var(∆_i)/p_i.
(ii) If Condition 4 (i) holds, then lim_{n→∞} n var(Z^{(n)}_{Σ,*}) = σ^2_{Σ,∞} := Σ_{i≥1} [var(Y − Y_{i−1}) − var(Y − Y_i)]/p̃_i; if Condition 4 (ii) holds, the same limit holds with σ^2_{Σ,∞} := Σ_{i≥1} var(∆_i)/p̃_i.
The proof of Theorem 19 is given in Appendix D.
Remark 20. The limiting variances in Theorem 19 can be significantly lower than the upper bounds given in Proposition 16, which correspond to averaging independent unbiased estimators. Indeed, var(∆_i) = E∆_i^2 − (E∆_i)^2, so the difference between the independent-average variance and the limiting variance can be written as a sum over i; note that p̃_{i+1} ≤ p̃_i for all i, so all the terms in the sum are positive.

Expected cost and asymptotically optimal distribution
Let us next consider the cost of simulating estimators Z of the form given in Theorem 7. Assume each ∆_i^{(j)} has a random cost K_{i,j}, such that the total cost of Z is K := Σ_{i≥1} Σ_{j=1}^{N_i} K_{i,j}. Assume also that the {K_{i,j}}_{i,j≥1} are independent of (N_i)_{i≥1} and that the {K_{i,j}}_{j≥1} have a common mean κ_i := EK_{i,j} < ∞. The expected cost can then be written as EK = Σ_{i≥1} EN_i κ_i.
The following records the immediate fact that the estimators introduced in Section 5 have the same expected cost as the simple averages of independent estimators.
Proposition 21. Denote the cost of the estimator Z^{(n)}_* by K^{(n)}_* and of Z^{(n)}_{Σ,*} by K^{(n)}_{Σ,*}, where '*' stands for any of the schemes iid, str, sys or res. Then EK^{(n)}_* = n Σ_{i≥1} p_i κ_i and EK^{(n)}_{Σ,*} = n Σ_{i≥1} p̃_i κ_i.

The following proposition records that both of the MLMC estimators introduced in Definition 18 have a cost that is asymptotically equivalent to the corresponding unbiased estimators, which are based on the same limiting allocation strategy.

Proposition 22. The MLMC estimators of Definition 18 satisfy n^{−1} EK^{(n)}_ML → Σ_{i≥1} p_i κ_i and n^{−1} EK^{(n)}_{Σ,ML} → Σ_{i≥1} p̃_i κ_i as n → ∞.

Proof. Clearly n_i/n ≤ p_i, n_i/n → p_i and I{n_i > 0} → 1 as n → ∞, so if Σ_{i≥1} p_i κ_i < ∞, the result follows by dominated convergence; otherwise one can use Fatou's lemma. The proof with K^{(n)}_{Σ,ML} is identical.

Glynn and Whitt (1992a) formulate an asymptotic efficiency principle, which states that if (Z^{(j)})_{j≥1} are i.i.d. estimators with finite variance σ^2 < ∞ and with expected cost κ < ∞, then the average of such estimators has asymptotic relative efficiency [κσ^2]^{−1}. Considering this notion of efficiency, let n ∈ N, and consider the estimator

  Z^{(n,m)}_* := (1/m) Σ_{j=1}^m Z^{(n,j)}_*,   (3)

which is the average of m independent realisations of the estimators above, and where '*' is a placeholder. By letting m → ∞, one may consider the asymptotic efficiency of this estimator as in the following result, stating that stratified and residual sampling procedures always improve upon averages of independent single term and sum estimators, and can be made arbitrarily close to MLMC in asymptotic efficiency.
Corollary 23. Suppose n ∈ N is fixed; then the following asymptotic efficiency holds when m → ∞.
(i) If Condition 2 holds and Σ_{i≥1} p_i κ_i < ∞, then the asymptotic efficiency of both Z^{(n,m)}_str and Z^{(n,m)}_res is no worse than that of Z^{(n,m)}_iid, and can be made arbitrarily close to that of Z^{(n,m)}_ML by increasing n.
(ii) If Condition 4 holds and Σ_{i≥1} p̃_i κ_i < ∞, then the same conclusions hold for Z^{(n,m)}_{Σ,str} and Z^{(n,m)}_{Σ,res} relative to Z^{(n,m)}_{Σ,iid} and Z^{(n,m)}_{Σ,ML}.
Strictly speaking, the asymptotic efficiency principle would require that the estimators be guaranteed to satisfy a functional central limit theorem, which is out of the scope of this work; see, however, the recent works of Alaya and Kebaier (2015) and Zheng and Glynn (2017) in this direction in a different setting.
If we assume n is taken sufficiently large so that the limits in Theorem 19 are approximately attained, the asymptotic efficiency principle suggests a rule for tuning the distribution (p_i)_{i≥1} for the stratified and residual sampling schemes: the distribution (p_i)_{i≥1} which minimises the asymptotic inverse relative efficiency (IRE) σ^2_∞ EK^{(n)}_* or σ^2_{Σ,∞} EK^{(n)}_{Σ,*} maximises the efficiency. For the single term estimator, Condition 2, this leads to

  (4)  min over (p_i)_{i≥1} of (Σ_{i≥1} var(∆_i)/p_i)(Σ_{i≥1} p_i κ_i).

The solution of (4) is proportional to β_i := √(var(∆_i)/κ_i), provided Σ_i β_i < ∞ (Rhee and Glynn, 2015, Proposition 1). This is straightforward to implement in practice, because reliable estimates of var(∆_i) are easily available. A straightforward practical procedure, also implemented in the experiments below, is to use the 'empirical optimal distribution' β_i/Σ_i β_i to define directly the first m probabilities p_1, . . . , p_m and a tail probability Σ_{j>m} p_j. The tail probabilities p_j for j > m are then defined to follow a parametric distribution which guarantees a finite variance based on theory. In the experiments, the tail distribution was chosen to be geometric. In case of the independent sum estimator, Condition 4 (ii), a similar minimisation (4) with p̃_i in place of p_i yields the asymptotically optimal tail probabilities p̃_i, and thereby p_i = p̃_i − p̃_{i+1}. However, contrary to the single-term estimator, (p̃_i)_{i≥1} must additionally be non-increasing. Because (4) is invariant under multiplicative constants on (p_i)_{i≥1}, the independent sum estimator can never outperform the single term estimator in terms of asymptotic IRE.
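The practical tuning rule described above, empirical probabilities proportional to β_i with a geometric tail, can be sketched as follows. The empirical level variances and costs are assumed inputs, and the way the tail is attached to the last empirical weight is an assumed choice, mirroring (but not reproducing) the procedure of the experiments.

```python
import math

def optimal_p(level_vars, level_costs, gamma=1.5, max_i=60):
    """Unnormalised weights beta_i = sqrt(var(Delta_i)/kappa_i) for the empirically
    estimated levels, continued with an assumed geometric tail of ratio 2**-gamma,
    then normalised into a probability distribution (p_i)."""
    beta = [math.sqrt(v / c) for v, c in zip(level_vars, level_costs)]
    m = len(beta)
    tail = [beta[-1] * 2.0 ** (-gamma * k) for k in range(1, max_i - m + 1)]
    w = beta + tail
    total = sum(w)
    return [x / total for x in w]

# Assumed empirical inputs: variances and costs for the first two levels.
p = optimal_p([1.0, 0.125], [1.0, 2.0])
```

Here p_1/p_2 = sqrt(1/1)/sqrt(0.125/2) = 4, and the geometric tail guarantees finite variance and cost whenever the theoretical rates allow it.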
The coupled sum estimator, Condition 4 (i), leads to the optimisation problem

  min over (p_i)_{i≥1} of (Σ_{i≥1} [var(Y − Y_{i−1}) − var(Y − Y_i)]/p̃_i)(Σ_{i≥1} p̃_i κ_i).

This is more involved for two reasons. First, estimation of the terms var(Y − Y_i) is not straightforward. In practice, a reasonable (but expensive) approximation may be obtained by approximating Y with Y_m, where m is large. Second, as noted in (Rhee and Glynn, 2015, §3), the coupled sum estimator offers one more degree of freedom. Namely, it is possible to consider a coupled sum estimator for a subsequence of the variables (Y_i), that is, employing the levels (Y_{j_i})_{i≥1} for an increasing sequence of indices J = (j_i)_{i≥1}. Then, the optimisation requires minimising jointly over the subsequence J and the distribution (p_i)_{i≥1}. This leads to a combinatorial problem, which is discussed in depth in (Rhee and Glynn, 2015, §3), who also describe a dynamic programming algorithm which finds the optimal J, up to an index m, in O(m^3) time.
The asymptotic optimality results above suggest that when considering an average of estimators of the form (3), it is possible to sequentially refine the randomisation distribution (p_i)_{i≥1} used for Z^{(n,j)}_* based on the random variables generated for the previous estimators Z^{(n,1)}_*, . . . , Z^{(n,j−1)}_*, as suggested by Rhee and Glynn (2015). In particular, the probabilities (p_i)_{i≥1} of single term estimators could be chosen based on empirical variances of the level i variables generated for previous estimators. Then, the Z^{(n,j)}_* would not be independent, but would remain unbiased, and in fact (Z^{(n,j)}_* − EY)_{j≥1} would be martingale differences. In some applications it is not possible to choose (p_i)_{i≥1} which yield both finite variance and finite expected cost. Then, it is not possible to attain the canonical square root convergence rate, but Rhee and Glynn (2015) suggest another approach: choose (p_i)_{i≥1} that ensure finite variance, but which imply infinite expected cost. Using a result due to Feller (1946), they deduce complexity results for unbiased estimators which are close to what is possible with MLMC; see also Zheng et al. (2017). However, quantifying the efficiency in the present setting is not straightforward, so this is left for future work.

Generalised unbiased scheme and dependent randomisation
The general unbiased scheme proposed in Theorem 7 is based on independent randomisation, that is, (N_i)_{i≥1} are assumed independent of (∆_i^{(j)}). Such a scheme is often appropriate in practice, but it is also possible to think of cases where (N_i)_{i≥1} could depend on (∆_i^{(j)}), for instance in a stopping time fashion. It is also possible to retain unbiasedness while replacing EN_i in the estimator by a conditional expectation. We consider below a scheme which accommodates both of these generalisations, while retaining unbiasedness.
Condition 26. Suppose Condition 25 holds for all i, j ≥ 1.

Remark 27. (i) Condition 26 (i) holds, and Condition 26 (ii) holds similarly, with c_i = C_i = E|∆_i|. (ii) In particular, if the (∆_i^{(j)})_{i,j} are independent and the N_i are stopping times with respect to (∆_i^{(j)})_{j≥1} for all i, j ≥ 1, then Condition 26 holds as above. (iii) If c_i = E|∆_i| and Condition 2 holds for some probability distribution (p_i)_{i≥1}, then Condition 26 holds.

Theorem 28. Assume Condition 25 holds, and denote by v_{ℓ,k}, 0 ≤ ℓ < k < ∞, the variances of the corresponding partial sums. If lim_{k→∞} sup_{j≥1} v_{m_k,m_{k+j}} = 0 for some subsequence (m_k)_{k≥1}, then EZ = EY and var(Z) = lim_{k→∞} v_{0,m_k}.
Proof of Theorem 28 is given in Appendix E. Theorem 28 (ii) is a generalisation of Theorem 7, and leads to new, potentially interesting estimators. For instance, let N_i := Σ_{j=1}^n I(R^{(j)} = i), where R^{(1)}, . . . , R^{(n−1)} are positive random integers independent of R^{(n)} ∼ (p_i)_{i≥1}. If we take F_i := σ(R^{(1)}, . . . , R^{(n−1)}) for all i ≥ 0, then E(N_i | F_{i−1}) = p_i + Σ_{j=1}^{n−1} I(R^{(j)} = i). This can be viewed as a residual sampling scheme applied with the (random) probability distribution p̂_i := E(N_i | F_{i−1})/n. Analogously, the residual sampling scheme may be viewed through such conditioning, where R^{(1)}, . . . , R^{(n−r)} are deterministic. It is unclear whether Theorem 28 (ii) allows estimators that have practical appeal, such as greater efficiency compared with the estimators introduced in Section 5.
Theorem 28 (i) with Remark 27 is similar to Wald's identity. The stopping time formulation may prove theoretically useful, and suggests the possibility of use in conjunction with stopping rules developed in the MLMC context (e.g. Collier et al., 2015). However, the practical relevance of such an approach may be limited, because the expectation of a stopping time is often unavailable. It also appears difficult to derive useful explicit variance expressions when the (N_i) depend on (∆_i^{(j)}). The sequential refinement of (p_i)_{i≥1} during repeated simulation of unbiased estimators, as discussed in Section 6, could be used instead.

Numerical experiments
The fourth model is an artificial model termed modified gBM, which has the same volatility term as gBM but a time-dependent drift, with σ = 0.1. The target functional is the mean, which was found to have the approximate expected value EX_1 = 1.395612139. This last model is intended to have larger |EY_i − EY| than the previous examples, highlighting the differences between the new estimators and averages of independent estimators. Algorithm 1 summarises an implementation of the single-term and coupled-sum estimators; for the residual sampling versions Z_{·,res}, one may construct R^{(1)}, . . . , R^{(n)} as in the proof of Proposition 16 in Appendix C. The independent sum estimators may be implemented similarly as the coupled sum, by interchanging the lines 8 and 9. The C++ source code of the implementation developed for the tests is available at https://bitbucket.org/mvihola/unbiased-mlmc.
In all but the Heston model, the standard Milstein scheme described in Section 2 was employed. The antithetic truncated Milstein scheme proposed by Giles and Szpruch (2014) was applied with the Heston model. We considered single term, independent sum and coupled sum estimators. With the first two, the distributions (p_i)_{i≥1} were set approximately optimal in each model as discussed in Section 6: p_1, . . . , p_m were set empirically to minimise variance, and the tail probability P_tail was calculated from prior simulated data. The tail distribution p_i for i > m was geometric, p_i = P_tail (1 − 2^{−γ}) 2^{−γ(i−m)}, with the parameter γ set in all cases to 1.5. This provided a very good fit with the empirical optimal distribution β_i/Σ_{j≥1} β_j, with m = 1 in case of gBM and m = 6 with all other examples; see also theoretical results on the Milstein scheme (Jentzen et al., 2009; Kloeden and Platen, 1992) and the antithetic truncated Milstein scheme (Giles and Szpruch, 2014). In the coupled sum estimator, the differences (∆_i)_{i≥1} were all based on a single Brownian trajectory. The optimisation was based on estimators of var(Y_m − Y_i), where Y_m corresponded to discretisation with a mesh of 2^13 points. An optimal subsequence was found using the dynamic programming algorithm of Rhee and Glynn (2015).

[Algorithm 1: Unbiased estimators in the SDE context. The function Unbiased takes (p_i)_{i≥1} and R^{(1)}, . . . , R^{(n)} as input, and for each j = 1, . . . , n simulates a Brownian path B at mesh τ_{R^{(j)}−1} and accumulates the corresponding level differences.]

Figure 1 shows results based on averages of 10^5 independent runs of each algorithm in each model. The graphs in Figure 1 show the estimated inverse relative efficiency (IRE) of the methods: the average cost multiplied by the average square deviation from the ground truth values given above. According to the theory, this quantity is constant with independent averages, and with residual and stratified sampling it converges to the same limit as the corresponding MLMC estimator. Table 1 shows the corresponding numerical values with n = 10^6.
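To illustrate the kind of variance reduction measured in the experiments outside the SDE setting, the following self-contained sketch compares averages of independent single term estimators with their uniformly stratified counterpart on an artificial level family (an assumed example, not one of the models above).

```python
import math
import random

# Assumed artificial levels: Delta_i = 3 * 4**-i * (1 + 0.1 W) with W ~ N(0, 1),
# so EY = sum_i E Delta_i = 1; randomisation distribution p_i = 2**-i.
def delta(i, rng):
    return 3.0 * 4.0 ** (-i) * (1.0 + 0.1 * rng.gauss(0.0, 1.0))

def inv_cdf(u):
    # F^{-1}(u) for p_i = 2**-i: the smallest i with 1 - 2**-i >= u
    return max(1, math.ceil(-math.log2(1.0 - u)))

def single_term_avg(n, rng, stratified):
    z = 0.0
    for j in range(n):
        u = (j + rng.random()) / n if stratified else rng.random()
        i = inv_cdf(u)
        z += delta(i, rng) * 2.0 ** i  # Delta_i / p_i
    return z / n

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(42)
iid = [single_term_avg(100, rng, stratified=False) for _ in range(300)]
strat = [single_term_avg(100, rng, stratified=True) for _ in range(300)]
```

Both estimators are unbiased for EY = 1, but because most of the variance of the independent version comes from the random level counts, stratifying the level draws removes the bulk of it, in the spirit of Remark 20.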
The experiments appear to align well with the theoretical findings. In all but the last example, the MLMC estimator admitted the best IRE with small n, and the new estimators exhibit performance in between the independent and the MLMC case. The performance of the new schemes comes close to the MLMC performance as n increases, as verified by the differences with n = 10^6 reported in Table 1, which are all relatively small, except for the sum estimators in the modified gBM example, indicating some remaining discrepancy.
Note that the MLMC implemented in the experiments is the 'idealised' version involving the same pre-determined allocation strategy as the unbiased estimators (Definition 18). This explains the discrepancy between the MLMC IRE reported here and by Rhee and Glynn (2015), who employ the original version of MLMC (Giles, 2008). The findings of Rhee and Glynn (2015) suggest that unbiased estimators applied with stopping rules may sometimes improve upon the original MLMC. The differences in performance of the single term and independent sum estimators in the gBM, CIR and Heston examples are all relatively small. As n increases, the differences in IRE between the new single term and sum estimators and MLMC become negligible, as anticipated by the theory. The coupled sum estimator appears to admit greater efficiency in the CIR and Heston examples, but the differences between independent average estimators and the new estimators are small. The modified gBM example demonstrates that the new estimators can be significantly more efficient. The numerical values shown in Table 1 indicate a 13-16 fold increase in relative efficiency with the new single term estimators, with performance similar to MLMC. The increase is 17-60 fold with the sum estimators. In both cases, the stratified and systematic sampling estimators appear to perform slightly better than the residual sampling estimator.

Discussion
This paper presented a general framework for unbiased MLMC estimation, which admits previous debiasing schemes as special cases, and accommodates new, lower variance unbiased estimators. The proposed stratified sampling and residual sampling schemes are promising classes of estimators, as they enjoy good theoretical behaviour: they can not only improve on averages of independent estimators (Proposition 16 and Proposition 21), but can also provide a significant gain in efficiency, as illustrated in the experiments. Indeed, the stratified and residual sampling estimators can be made arbitrarily close to MLMC in efficiency under general assumptions (Theorem 19 and Proposition 22), highlighting the close connection between the MLMC and the debiasing schemes, and showing that unbiasedness may be achieved with virtually no sacrifice in efficiency. Unbiasedness is an important quality of estimators when they are employed as part of stochastic optimisation algorithms (Borkar, 2008; Kushner and Yin, 2003), or in a 'compound sampling' context (Vihola et al., 2016). It also enables rigorous stopping rules, which can lead to benefits over MLMC stopping rules (Rhee and Glynn, 2015, §4).
While the theory presented in this paper does not give guarantees on the limiting efficiency of the systematic sampling, it is expected to often behave similarly to the stratified and residual sampling schemes, as illustrated by the experiments. However, as stratified and residual sampling enjoy good theoretical properties, and because systematic sampling appears to perform only comparably in practice, stratified or residual sampling is recommended as the safer alternative. The empirical evidence from the numerical experiments in Section 8 suggests that stratified sampling might sometimes perform slightly better than residual sampling. This, together with the straightforward implementation of stratified sampling, makes it appealing for practical purposes.
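The two recommended allocation schemes can be sketched as follows: given a level distribution (p_i) and a budget of n samples, stratified sampling draws one uniform per stratum [(j−1)/n, j/n) and maps it through the inverse CDF of (p_i), while residual sampling takes ⌊n p_i⌋ copies of level i deterministically and draws the remaining indices i.i.d. from the normalised residual probabilities. The following is a minimal NumPy sketch under these assumptions, not the paper's implementation:

```python
import numpy as np

def stratified_levels(p, n, rng):
    # One uniform per stratum [(j-1)/n, j/n), mapped through the
    # inverse CDF of the level distribution p.
    u = (np.arange(n) + rng.random(n)) / n
    cdf = np.cumsum(p)
    return np.searchsorted(cdf, u)

def residual_levels(p, n, rng):
    # floor(n*p_i) deterministic copies of level i, plus i.i.d. draws
    # from the normalised residual probabilities for the remainder.
    p = np.asarray(p, dtype=float)
    counts = np.floor(n * p).astype(int)
    levels = np.repeat(np.arange(len(p)), counts)
    m = n - counts.sum()
    if m > 0:
        r = n * p - counts
        extra = rng.choice(len(p), size=m, p=r / r.sum())
        levels = np.concatenate([levels, extra])
    return levels
```

Both functions return n level indices; the stratified variant guarantees roughly proportional representation of each level, while the residual variant guarantees at least ⌊n p_i⌋ occurrences of level i.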
Averages of independent realisations of the single term and the independent sum estimators may have different efficiencies in general. In the case of residual and stratified sampling, the optimally tuned estimators often coincide in asymptotic efficiency. In general, the single term estimator always dominates the independent sum estimator in terms of asymptotic efficiency, rendering the single term estimator preferable over the independent sum estimator. Based on the experiments, the optimally tuned coupled sum estimator may sometimes lead to significant performance gains. This suggests also that dependent level estimators might be worth considering in the MLMC context, where independent level estimators are currently widely employed.
Despite the potential performance gain of the coupled estimators, it should be noted that tuning of (p_i)_{i≥1} requires estimation of var(Y − Y_i) ≈ var(Y_m − Y_i), with m large, which is often computationally demanding. This is in sharp contrast with the single term estimator, where var(∆_i) are easily accessible and inexpensive for moderate i. The simplicity of the single term estimator optimisation criterion suggests, as discussed in Section 6, an algorithm which automatically tunes the distribution (p_i)_{i≥1} during repeated simulation of Z. Methods have also been suggested for finding optimal discretisation hierarchies in the MLMC context, which could likewise be explored in the debiasing context.
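For the single term estimator this tuning is indeed straightforward: pilot estimates of var(∆_i) and the costs c_i give the standard near-optimal allocation p_i ∝ sqrt(var(∆_i)/c_i), and one realisation draws a single level N ~ (p_i) and returns ∆_N/p_N. The sketch below is hypothetical; `sample_delta` stands in for a user-supplied simulator of ∆_i, and the truncation to K levels is an assumption for illustration:

```python
import numpy as np

def tune_p(var_delta, cost, K):
    # Near-optimal level distribution for the single term estimator:
    # p_i proportional to sqrt(var(Delta_i) / c_i), truncated to K levels.
    # var_delta and cost are pilot estimates.
    w = np.sqrt(np.asarray(var_delta[:K]) / np.asarray(cost[:K]))
    return w / w.sum()

def single_term(sample_delta, p, rng):
    # Draw a level N ~ p and return Delta_N / p_N; unbiased for
    # sum_i E[Delta_i] = E[Y] under the usual summability conditions.
    i = rng.choice(len(p), p=p)
    return sample_delta(i) / p[i]
```

Averaging independent realisations of `single_term` then gives an unbiased estimator whose variance depends on the chosen (p_i).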
The MLMC literature provides many other potential further research topics. It is possible to employ essentially all techniques developed in the context of MLMC with debiasing. These include, for instance, quasi-Monte Carlo (Dick et al., 2013; Giles and Waterhouse, 2009), adaptive time steps (Hoel et al., 2012) or an adaptive importance sampling scheme based on a drifted Brownian motion (Kebaier and Lelong, 2015). It is yet unclear how well the unbiased estimators can compete with MLMC in scenarios with slower than canonical rate of convergence. This was investigated by Rhee and Glynn (2015) and elaborated in Zheng et al. (2017), but quantifying the effect in the context of the new estimators requires further research. Recent work of Zheng and Glynn (2017) suggests an infinite stratification approach, which is slightly different from what was proposed here.

Appendix A. Subsequence lemma
The following lemma is due to an argument by Rhee and Glynn (2015).
Lemma 29. Suppose E(Y_i − Y)^2 → 0 as i → ∞. Then, there exists a strictly increasing sequence (m_k)_{k≥1} such that for all k ≥ 1 and all m_1 ≤ i ≤ m_k,
E(Y_{m_k} − Y)^2 ≤ E(Y_i − Y)^2.
Proof. Let δ_i := Y_i − Y. If Eδ_i^2 = 0 for infinitely many i, choose those indices as (m_k)_{k≥1}. Otherwise, let m_1 be such that Eδ_i^2 > 0 for all i ≥ m_1, and define recursively m_{k+1} := min{i > m_k : Eδ_i^2 ≤ Eδ_{m_k}^2}, which is finite because Eδ_i^2 → 0. Now, for any k and any m_1 ≤ i ≤ m_k, Eδ_{m_k}^2 ≤ Eδ_i^2.

Appendix B. Proof of Theorem 7
Define Z_0 := 0 and, for m ≥ 1, proceed by dominated convergence. We deduce that, for 0 ≤ ℓ < m, var(Z^{(n,m)}) admits the stated bound. The second term on the left equals n^{-1} Σ_{i=1}^m ξ_i^{(m)}/p_i, and clearly n var(Z^{(n,m)}_{Σ,res}) → Σ_{i=1}^m ξ_i^{(m)}/p_i as n → ∞. Now, take (m_k)_{k≥1} from Lemma 29; then stratification implies the analogous bound for all k, j ≥ 1. We conclude as above by writing, for any k ≥ 1, the corresponding decomposition.
Because n|cov(A, B)| ≤ √(n var(A)) √(n var(B)), all terms on the right can be made arbitrarily small by choosing m_k large enough and letting n → ∞. Finally, the proof of (iii) follows similarly to (i) and (ii).
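The recursive subsequence construction in the proof of Lemma 29 can be illustrated numerically: starting from the first index, the next index is the first later one whose value does not exceed the current value. The following sketch applies the recursion to a finite prefix of a positive sequence, with 0-based indices:

```python
def subsequence_indices(e):
    # Construction from the proof of Lemma 29 (finite prefix, 0-based):
    # m_{k+1} = min{ i > m_k : e[i] <= e[m_k] }, assuming e[i] > 0.
    # Along the returned indices, e is non-increasing, and e[m_k] is
    # minimal among e[m_1], ..., e[m_k].
    ms = [0]
    for i in range(1, len(e)):
        if e[i] <= e[ms[-1]]:
            ms.append(i)
    return ms
```

For the sequence [4, 5, 3, 6, 2, 2, 1] the construction selects the indices 0, 2, 4, 5, 6, along which the values 4, 3, 2, 2, 1 are non-increasing.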
Appendix E. Proof of Theorem 28
Consider (i) and notice that E|S_i| ≤ c_i; by dominated convergence, a similar calculation yields EZ_m = EY_m. For the latter claim, note that only finitely many of {N_i}_{i≥1} are non-zero, so Z is well-defined, and the result follows by the dominated convergence theorem. The statement (ii) is a generalisation of Theorem 7, and the proof follows similarly. Indeed, Condition 26 (i) and (ii) hold by independence, with c_i = C_i = E|∆_i|, so part (i) implies EZ_m = EY_m. Denoting F_{i,k} := σ(F_{i−1}, F_{k−1}), a straightforward calculation similar to (5) yields