Unbiased Markov chain Monte Carlo methods with couplings

Markov chain Monte Carlo (MCMC) methods provide consistent approximations of integrals as the number of iterations goes to ∞. MCMC estimators are generally biased after any fixed number of iterations. We propose to remove this bias by using couplings of Markov chains together with a telescopic sum argument of Glynn and Rhee. The resulting unbiased estimators can be computed independently in parallel. We discuss practical couplings for popular MCMC algorithms. We establish the theoretical validity of the estimators proposed and study their efficiency relative to the underlying MCMC algorithms. Finally, we illustrate the performance and limitations of the method on toy examples, on an Ising model around its critical temperature, on a high dimensional variable‐selection problem, and on an approximation of the cut distribution arising in Bayesian inference for models made of multiple modules.


Introduction
Markov chain Monte Carlo (MCMC) methods constitute a popular class of algorithms to approximate high dimensional integrals arising in statistics and other fields (Liu, 2008;Robert and Casella, 2004;Brooks et al., 2011;Green et al., 2015). These iterative methods provide estimators that are consistent as the number of iterations grows large but potentially biased for any fixed number of iterations, which discourages the parallel execution of many short chains (Rosenthal, 2000). Consequently, efforts have focused on exploiting parallel processors within each iteration (Tjelmeland, 2004;Brockwell, 2006;Lee et al., 2010;Jacob et al., 2011;Calderhead, 2014;Goudie et al., 2017;Yang et al., 2017) and on the design of parallel chains targeting different distributions (Altekar et al., 2004;Wang et al., 2015;Srivastava et al., 2015). Still, MCMC estimators are ultimately justified by asymptotics in the number of iterations, which is discordant with current trends in computing hardware, characterized by increasing parallelism but stagnating clock speeds.
In this paper we propose a general construction to produce unbiased estimators of integrals with respect to a target probability distribution from MCMC kernels. The lack of bias means that these estimators can be implemented on parallel processors in the framework of , without communication between processors. Confidence intervals can be constructed with asymptotic guarantees in the number of processors, in contrast with standard MCMC confidence intervals that are justified asymptotically in the number of iterations (e.g. Flegal et al. (2008), Gong and Flegal (2016), Atchadé (2016) and Vats et al. (2018)). The lack of bias has additional benefits, as discussed in Section 5.5 in which we make use of its interplay with the law of iterated expectations to perform modular inference; see also the discussion in Section 6.
Our contribution follows the path breaking work of , which uses couplings to construct unbiased estimators of integrals with respect to an invariant distribution. They illustrated their construction on Markov chains represented by iterated random functions, leveraging the contraction properties of such functions.  also considered Harris recurrent chains for which an explicit minorization condition holds. Previously, McLeish (2011) employed similar debiasing techniques to obtain 'nearly unbiased' estimators from a single MCMC chain. More recently Jacob et al. (2019) removed the bias from conditional particle filters  by coupling chains so that they meet in finite time. The present paper brings this type of 'Rhee-Glynn' construction to generic MCMC algorithms, with a novel analysis of estimator efficiency and a variety of examples. Our proposed construction involves couplings of MCMC algorithms, which we discuss for generic Metropolis-Hastings and Gibbs samplers. Couplings have been used to study the convergence properties of MCMC algorithms from both theoretical and practical points of view (e.g. Reutter and Johnson (1995), Johnson (1996), Rosenthal (1997), Johnson (1998Johnson ( , 2013, , Roberts and Rosenthal (2004) and Johndrow and Mattingly (2017)). Couplings also underpin perfect samplers Murdoch and Green, 1998;Casella et al., 2001;Flegal and Herbei, 2012;. A notable aspect of the approach of  that is preserved in our method is that only two chains must be coupled for the proposed estimator to be unbiased, without further assumptions on the state space or target distribution. Thus the approach applies more broadly than perfect samplers (see ) while yielding unbiased estimators rather than exact samples. Coupling pairs of Markov chains also forms the basis of the approach of , with a similar motivation for parallel computation. The proposed estimation technique also shares aims with regeneration methods (e.g.  and ), and we propose a numerical comparison in Section 5.2.
In Section 2 we introduce our estimators and present a coupling of random-walk Metropolis-Hastings (MH) chains as an illustration. In Section 3 we establish the efficiency properties of these estimators, discuss the verification of key assumptions and describe the use of the proposed estimators on parallel processors in light of results from for example . In Section 4 we describe how to couple some important MCMC algorithms and illustrate the effect of dimension on algorithms' performance with a multivariate normal distributions target. Section 5 contains more challenging examples including a multimodal target, a comparison with regeneration methods, sampling problems in large dimensional discrete spaces arising in Bayesian variable selection and Ising models, and an application to modular inference. We discuss our findings in Section 6. Scripts in R (R Core Team, 2015) are available from https://github.com/pierrejacob/unbiasedmcmc and supplementary materials are available on line.

Rhee-Glynn estimator
Given a target probability distribution π on a Polish space X and a measurable real-valued test function h that is integrable with respect to π, we want to estimate the expectation E π [h.X/] = h.x/π.dx/. Let P denote a Markov transition kernel on X that leaves π invariant, and let π 0 be some initial probability distribution on X . Our estimators are based on a coupled pair of Markov chains .X t / t 0 and .Y t / t 0 , which marginally start from π 0 and evolve according to P. In particular, we suppose thatP is a transition kernel on the joint space X × X such thatP{.x, y/, A × X } = P.x, A/ andP{.x, y/, X × A} = P.y, A/ for any x, y ∈ X and any measurable set A. We then construct the coupled Markov chain .X t , Y t / t 0 as follows. We draw .X 0 , Y 0 / such that X 0 ∼ π 0 and Y 0 ∼ π 0 . Given .X 0 , Y 0 / we draw X 1 ∼ P.X 0 , ·/, and then for any t 1, given X 0 , .X 1 , Y 0 /, : : : , .X t , Y t−1 /, we draw .X t+1 , Y t / ∼P{.X t , Y t−1 /, ·}. We consider the following assumptions.
Assumption 3. The chains stay together after meeting, i.e. X t = Y t−1 for all t τ .
By construction, each of the marginal chains .X t / t 0 and .Y t / t 0 has initial distribution π 0 and transition kernel P. Assumption 1 requires these chains to result in a uniformly bounded .2 + η/-moment of h; more discussion on moments of Markov chains can be found in Tweedie (1983). Since X 0 and Y 0 may be drawn from any coupling of π 0 with itself, it is possible to set X 0 = Y 0 . However, X 1 is then generated from P.X 0 , ·/, so that X 1 = Y 0 in general. Thus one cannot force the meeting time to be small by setting X 0 = Y 0 . Assumption 2 puts a condition on the coupling that is operated byP and would not in general be satisfied for an independent coupling. Coupled kernels must be carefully designed, using for example common random numbers and maximal couplings, for assumption 2 to be satisfied. We present a simple case in Section 2.2 and further examples in Section 4. We stress that the state space is not assumed to be discrete, and that the constants D and η of assumption 1 and C and δ of assumption 2 do not need to be known to implement the approach proposed. Assumption 3 typically holds by design; coupled chains that stay identical after meeting were termed 'faithful' in Rosenthal (1997).
In Section 3 we establish the validity of the estimator under the three conditions above; this formally justifies the swap of expectation and limit. The estimator can be viewed as a debiased version of h.X k /. Unbiasedness is guaranteed for any choice of k 0, but both the cost and the variance of H k .X, Y/ are sensitive to k; see Section 3.1. Thanks to this unbiasedness property, we can sample R ∈ N independent copies of H k .X, Y/ in parallel and average the results to estimate E π [h.X/] consistently as R → ∞; we defer further considerations on the use of unbiased estimators on parallel processors to Section 3.3.
Before presenting examples and enhancements to the estimator above, we discuss the relationship between our approach and existing work. There is a rich literature applying forward couplings to study Markov chain convergence (Johnson, 1996(Johnson, , 1998(Johnson, , 2013Thorisson, 2000;Lindvall, 2002;Rosenthal, 2002;Douc et al., 2004;Nikooienejad et al., 2016), and to obtain new algorithms such as perfect samplers  and the methods of  and Neal and Pinto (2001). Our approach is closely related to , who employed pairs of Markov chains to obtain unbiased estimators. The present work combines similar arguments with couplings of MCMC algorithms and proposes further improvements to remove bias at a reduced loss of efficiency.
Indeed  did not apply their methodology to the MCMC setting. They considered chains that are associated with contractive iterated random functions (see also Diaconis and Freedman (1999)), and Harris recurrent chains with an explicit minorization condition. A minorization condition refers to a small set C, λ > 0, an integer m 1 and a probability measure ν such that, for all x ∈ C and some measurable set A, P m .x, A/ λν.A/. Such a condition is said to be explicit if the set, constant and probability measure are known by the user. Finding explicit small sets that are useful in practice can present a technical challenge, even for MCMC experts (see the discussion and references in Cowles and Rosenthal (1998)). When available, explicit minorization conditions can also be employed to identify regeneration times, yielding estimators that are amenable to parallel computation in the framework of  and . By contrast Johnson (1996Johnson ( , 1998 and  addressed the question of coupling MCMC algorithms so that pairs of chains meet exactly, without analytical knowledge on the target distribution. The present paper focuses on the use of couplings of this type in the framework of .

Coupled Metropolis-Hastings example
Before further examination of our estimator and its properties, we present a coupling of MH chains that will typically satisfy assumptions 1-3 in realistic settings; this coupling was proposed in Johnson (1998) as part of a method to diagnose convergence. We postpone discussion of other couplings of MCMC algorithms to Section 4. We recall that each iteration t of the MH algorithm  begins by drawing a proposal X Å from a Markov kernel q.X t , ·/, where X t is the current state. The next state is set to X t+1 = X Å if U π.X Å /q.X Å , X t /={π.X t /q.X t , X Å /}, where U denotes a uniform random variable on [0,1], and X t+1 = X t otherwise.
We define a pair of chains so that each proceeds marginally according to the MH algorithm and jointly so that the chains will meet exactly after a random number of steps. We suppose that the pair of chains are in states X t and Y t−1 , and we consider how to generate X t+1 and Y t so that {X t+1 = Y t } might occur.
If X t = Y t−1 , the event {X t+1 = Y t } cannot occur if both chains reject their respective propos-als, X Å and Y Å . Meeting will occur if these proposals are identical and if both are accepted. Marginally, the proposals follow X Å |X t ∼ q.X t , ·/ and Y Å |Y t−1 ∼ q.Y t−1 , ·/. If q.
x, x Å / can be evaluated for all x and x Å , then we can sample from a maximal coupling between the two proposal distributions, which is a coupling of q.X t , ·/ and q.Y t−1 , ·/ maximizing the probability of the event {X Å = Y Å }. How to sample from maximal couplings of continuous distributions is described in Thorisson (2000) and in Section 4.1. One can accept or reject the two proposals by using a common uniform random variable U. The chains will stay together after they meet: at each step after meeting, the proposals will be identical with probability 1, and jointly accepted or rejected with a common uniform variable. This coupling requires neither explicit minorization conditions nor contractive properties of a random-function representation of the chain.

Time-averaged estimator
To motivate our next estimator, we note that we can compute H k .X, Y/ for several values of k from the same realization of the coupled chains, and that the average of these is unbiased as well. For any fixed integer m with m k, we can run coupled chains for max.m, τ / iterations, compute the estimator H l .X, Y/ for each l ∈ {k, : : : , m} and take the average H k:m .X, Y/ = .m − k + 1/ −1 Σ m l=k H l .X, Y/, as we summarize in algorithm 1 in Table 1. We refer to H k:m .X, Y/ as the time-averaged estimator; the estimator H k .X, Y/ is retrieved when m = k. Alternatively we could average the estimators H l .X, Y/ by using weights w l ∈ R for l ∈ {k, : : : , m}, to obtain Σ m l=k w l H l .X, Y/. This will be unbiased if Σ m l=k w l = 1. Rearranging terms in .m − k + 1/ −1 Σ m l=k H l .X, Y/, we can write the time-averaged estimator as The term .m − k + 1/ −1 Σ m l=k h.X l / corresponds to a standard MCMC average with m total iterations and a burn-in period of k − 1 iterations. We can interpret the other term as a bias correction. If τ k + 1, then the correction term equals 0. This provides some intuition for the choice of k and m: large k-values lead to the bias correction being equal to 0 with large probability, and large values of m result in H k:m .X, Y/ being similar to an estimator obtained from a long MCMC run. Thus we expect the variance of H k:m .X, Y/ to be similar to that of MCMC estimators for appropriate choices of k and m.
The estimator H k:m .X, Y/ requires τ − 1 calls toP and max.1, m + 1 − τ / calls to P, which are overall comparable with m calls to P when m is large. Indeed, for the proposed couplings, calls toP are approximately twice as expensive as calls to P. Therefore, the cost of H k:m .X, Y/ is Step 1: draw X 0 and Y 0 from an initial distribution π 0 and draw X 1 ∼ P.X 0 , ·/ Step 2: Step 3: for each l ∈ {k , : : : , m}, compute H l .X, Y/ = h.X l / + Σ τ −1 t=l+1 {h.X t / − h.Y t−1 /}; return H k:m .X, Y/ = .m − k + 1/ −1 Σ m l=k H l .X, Y/, or compute H k:m .X, Y/ with equation (2.1) comparable with 2.τ − 1/ + max.1, m + 1 − τ / iterations of the underlying MCMC algorithm. Thus both the variance and the cost of H k:m .X, Y/ will approach those of MCMC estimators for large values of k and m. This motivates the use of the estimator H k:m .X, Y/ with m > k, which enables us to control the loss of efficiency that is associated with the removal of burn-in bias in contrast with the basic estimator H k .X, Y/ of Section 2.1. We discuss the choice of k and m in further detail in Section 3 and in the subsequent experiments. A variant of estimator (2.1) can be obtained by considering a time lag that is greater than 1 between the two chains .X t / t 0 and .Y t / t 0 , with the meeting time defined as the first time t for which {X t = Y t−lag } occurs. This introduces another tuning parameter but was found to be fruitful in .
We conclude this section with a few remarks on practical implementations. First, the test function h does not have to be specified at run time in algorithm 1 in Table 1. One can store the coupled chains and choose the test function later. Also, one typically resorts to thinning the output of an MCMC sampler if the memory cost of storing chains is prohibitive, or if the cost of evaluating the test function of interest is significant compared with the cost of each MCMC iteration (e.g. Owen (2017)). This is feasible in the framework proposed: one could consider a variation of algorithm 1 where each call to the Markov kernels P andP would be replaced by multiple calls to them. We also observe that the estimators proposed can take values outside the range of the test function h; for instance they can take negative values even if the range of the test function contains only non-negative values.
Finally, we stress the difficulty that is inherent in choosing an initial distribution π 0 . The estimators are unbiased for any choice of π 0 , including point masses, but this choice has an effect on both the computing cost and the variance. There is also a choice about whether to draw X 0 and Y 0 independently from π 0 or not; in our experiments we use independent draws. We shall see in Section 5.1 that unfortunate choices of initial distributions can severely affect the performance of the estimators proposed. This suggests trying more than one choice of initialization, especially in the setting of multimodal targets. Overall the choice of π 0 and its relative importance compared with standard MCMC sampling are open questions.

Signed measure estimator
We can formulate the proposed estimation procedure in terms of a signed measureπ defined bŷ which is obtained by replacing test function evaluations by delta masses in equation (2.1), as in Section 4 of . The measureπ is of the formπ = Σ N l=1 ω l δ Z l where the weights satisfy Σ N l=1 ω l = 1 and where the atoms .Z l / are values among the history of the coupled chains. Some of the weights .ω l / may be negative, makingπ a signed empirical measure. In this view the unbiasedness property states that E[Σ N l=1 ω l h.Z l /] = E π [h.X/] for a test function h. We can consider the convergence behaviour ofπ R = R −1 Σ R r=1π .r/ towards π, where .π .r/ / for r ∈ {1, : : : , R} are independent replications ofπ.  obtained a Glivenko-Cantelli result for a similar measure related to their estimator. In the current setting, assume for simplicity that π is univariate or else consider only one of its marginals. To emphasize the importance of the number of replications R, we rewrite the weights and atoms asπ R = Σ N R l=1 ω l δ Z l . Introduce the function s →F R .s/ = Σ N R l=1 ω l 1.Z l s/ on R. Proposition 2 in Section 3 states that F R converges to F as R → ∞ uniformly with probability 1, where F is the cumulative distribution function of π.
The function s →F R .s/ is not monotonically increasing because of negative weights among .ω l /, which motivates the following comments regarding the estimation of quantiles of π. Assume from now on that the pairs .ω l , Z l / are ordered such that Z l Z l+1 . For any q ∈ .0, 1/ there might be more than one index l such that Σ l−1 i=1 ω i q and Σ l i=1 ω i > q; the quantile estimate might be defined as Z l for any such l. The convergence ofF R to F indicates that all such estimates are expected to converge to the qth quantile of π. Therefore the signed measure representation leads to a way of estimating quantiles of the target distribution in a consistent way as R → ∞. The construction of confidence intervals for these quantiles, perhaps by bootstrapping the R independent copies, stands as an interesting area for future research. Another route to estimate quantiles of π would be to project marginals ofπ R onto the space of probability measures, for instance by using a generalization of the Wasserstein metric to signed measures (Mainini, 2012). One could also estimate F by using isotonic regression (Chatterjee et al., 2015), considerinĝ F R .s/ for various values s as noisy measurements of F.s/.

Properties and parallel implementation
The proofs of the results of this section are in the on-line supplementary materials. Our first result establishes the basic validity of the estimators proposed.
Proposition 1. Under assumptions 1-3, for all k 0 and m k, the estimator H k:m .X, Y/ has expectation E π [h.X/], a finite variance and a finite expected computing time.
A direct consequence of proposition 1 is that an average of R independent copies of H k:m .X, Y/ converges to E π [h.X/] as R → ∞. We discuss more sophisticated results on unbiased estimators and parallel processing in Section 3.3 and other uses of such estimators in Sections 5.5 and 6. Following , we provide proposition 2 on the signed measure estimator (2.2). We recall that such estimators apply to univariate target distributions or to the marginal distributions of a multivariate target.
Proposition 2. Under assumptions 2 and 3, for all m k 0, and assuming that .X t / t 0 converges to π in total variation, introduce the function s →F R .s/ = Σ N R l=1 ω l 1.Z l s/, where .ω l , Z l / N R l=1 are weighted atoms obtained from R independent copies ofπ in equation (2.2). Denote by F the cumulative distribution function of π. Then sup s∈R |F R .s/ − F.s/| −→ R→∞ 0 almost surely: Section 3.1 studies the variance and efficiency of H k:m .X, Y/, Section 3.2 concerns the verification of assumption 2 by using drift conditions and Section 3.2 discusses estimation on parallel processors in the presence of a budget constraint.

Variance and efficiency
We consider the effect of k and m on the efficiency of the estimators proposed, which will then suggest guidelines for the choice of these tuning parameters. Estimators H .r/ k:m .X, Y/, for r = 1, : : : , R, can be generated independently and averaged. More estimators can be produced in a given computing budget if each estimator is cheaper to produce. The trade-off can be understood in the framework of Glynn and Whitt (1992) (see also Rhee and Glynn (2012) and ), by defining the asymptotic inefficiency as the product of the variance and expected cost of the estimator. That product is the asymptotic variance of R −1 Σ R r=1 H .r/ k:m .X, Y/ as the computational budget, as opposed to the number of estimators R, goes to ∞ (Glynn and Whitt, 1992). Of primary interest is the comparison of this asymptotic inefficiency with the asymptotic variance of standard MCMC estimators. We start by writing the time-averaged estimator (2.1) as where MCMC k:m is the MCMC average .m − k + 1/ −1 Σ m l=k h.X l / and BC k:m is the bias correction term. The variance of H k:m .X, Y/ can be written To bound E[BC 2 k:m ], we introduce a geometric drift condition on the Markov kernel P. Assumption 4. The Markov kernel P is π invariant, ϕ irreducible and aperiodic, and there is a measurable function V : X → [1, ∞/, λ ∈ .0, 1/, b < ∞, and a small set C such that, for all x ∈ X , We refer the reader to Meyn and Tweedie (2009) for the definitions and core theoretical tools for working with Markov chains on a general state space: in particular chapter 5 for aperiodicity, ϕ-irreducibility and small sets, and chapter 15 for geometric drift conditions; see also Roberts and Rosenthal (2004). Geometric drift conditions are known to hold for various MCMC algorithms (e.g. Roberts and Tweedie (1996a, b), Jarner and Hansen (2000), Atchade (2006), Khare and Hobert (2013), Choi and Hobert (2013) and Pal and Khare (2014)). Assumption 4 often plays a central role in establishing geometric ergodicity (e.g. theorem 9 in Roberts and Rosenthal (2004)). We show next that this assumption enables an informative bound on E[BC 2 k:m ]. Proposition 3. Suppose that assumptions 2-4 hold, with a function V for which the integral V.x/π 0 .dx/ is finite. If the function h is such that sup x∈X |h.x/|=V.x/ β < ∞ for some β ∈ [0, 1 2 /, then for all m k 0 we have for some constants C δ,β < ∞, and δ β = δ 1−2β ∈ .0, 1/, with δ ∈ .0, 1/ as in assumption 2.
Using proposition 3, inequality (3.1) becomes The variance of H k:m .X, Y/ is thus bounded by the mean-squared error MSE of an MCMC estimator plus additive terms that vanish geometrically in k and polynomially in m − k.
To facilitate the comparison between the efficiency of H k:m .X, Y/ and that of MCMC estimators, we add simplifying assumptions. First, the rightmost terms of inequality (3.2) decrease geometrically with k, at a rate that is driven by δ β = δ 1−2β where δ is as in assumption 2. This motivates a choice of k depending on the distribution of the meeting time τ . In practice, we can sample independent realizations of the meeting time and choose k such that P.τ > k/ is small, i.e. we choose k as a large quantile of the meeting times.
Dropping the third term on the right-hand side of inequality (3.2), which is smaller than the second term, assuming that MSE k:m > 0 and that m > τ with large probability, we obtain the approximate inequality As k increases we expect .m − k + 1/MSE k:m to converge to V{.m − k + 1/ −1=2 Σ m t=k h.X t /}, where X k would be distributed according to π. Denote this variance by V k,m . The limit of V k,m as m → ∞ is the asymptotic variance of the MCMC estimator, denoted by V ∞ . Hence, for k and m − k both large, the loss of efficiency of the method compared with standard MCMC methods is approximately 1 This informal series of approximations suggests that we can retrieve an asymptotic efficiency that is comparable with the underlying MCMC estimators with appropriate choices of k and m that depend on the distribution of the meeting time τ . These choices are thus sensitive to the coupling of the chains and not only to the performance of the underlying MCMC algorithm. Choosing m as a multiple of k, such as 5k or 10k, makes intuitive sense when considering that k=m is the proportion of iterations that are simply discarded in the event that τ < k. In other words, the bias of the MCMC algorithm can be removed at the cost of an increased variance, which can in turn be reduced by choosing sufficiently large values of k and m. This results in a trade-off with the desired level of parallelism: one might prefer to keep k and m small, yielding a suboptimal efficiency for H k:m .X, Y/, but enabling more independent copies to be generated in a given computing time.

Verifying assumption 2
We discuss how assumption 4 on the Markov kernel P can be used to verify assumption 2, on the shape of the meeting time distribution. Informally, assumption 4 guarantees that the bivariate chain {.X t , Y t−1 /, t 1} visits C × C infinitely often, where C is a small set. If there is a positive probability of the event {X t+1 = Y t } for every t such that .X t , Y t−1 / ∈ C × C, then we expect assumption 2 to hold. The next result formalizes that intuition. The proof is based on a modification of an argument by Douc et al. (2004). We introduce D = {.x, y/ ∈ X × X : x = y}. Then assumption 3 readsP{.x, x/, D} = 1 for all x ∈ X .
If assumption 4 holds with a small set of the form C = {x : V.x/ L} for some L > 0, then it holds also for C = {x : V.x/ L } for all L L. In that case we can always choose L sufficiently large that λ + b=.1 + L/ < 1. Hence the main restriction in proposition 4 is the assumption that the small sets in assumption 4 are of the form {x : V.x/ L}, i.e. level sets of V. This is known to be true in some cases. For instance it is known from theorem 2.2 of Roberts and Tweedie (1996b) that, for a large class of MH algorithms, any non-empty compact set is a small set, and therefore for these algorithms it suffices to check that the level sets of the drift function V are compact. Common examples of drift functions include V.x/ = c= √ π.x/ (Roberts and Tweedie, 1996b;Jarner and Hansen, 2000;Atchade, 2006), V.x/ = c exp.b|x|/ (Roberts and Tweedie, 1996a) or the example in Pal and Khare (2014), which all have compact level sets under mild regularity conditions. The work of  contains results that generalize propositions 3 and 4 to Markov chains satisfying polynomial drift conditions (e.g. Andrieu and Vihola (2015)), leading to polynomial tails for the associated meeting times.

Parallel implementation under budget constraints
Our main motivation for unbiased estimators comes from parallel processing; see Sections 5.5 and 6 for other motivations. Independent unbiased estimators with finite variance can be generated on separate machines and combined into consistent and asymptotically normal estimators. If the number of estimators is prespecified, this follows from the central limit theorem for independent and identically distributed variables. We might prefer to specify a time budget, and to generate as many estimators as possible within the budget. The lack of bias allows the application of a variety of results on budget-constrained parallel simulations, which we briefly review here, following Heidelberger (1990, 1991).
We denote the proposed estimator by H and its expectation, which is the object of interest here, by π.h/. Generating H takes a random time C. We write N.t/ for the number of independent copies of H that can be produced by time t. The sequence .H n , C n / n∈N refers to independent and identically distributed copies of .H, C/, so we can write N.t/ = sup{n 0 : C 1 + : : : + C n t}, with N.t/ = 0 if t < C 1 . We add the subscript p to refer to objects that are associated with processor p ∈ {1, : : : , P}.
The first result is that the estimatorH p .t/, defined for all 1 p P as 0 if N p .t/ = 0, and by the sample average of H p1 , : : : , H pN p .t/ otherwise, is biased: Glynn and Heidelberger (1990)  We can define an unbiased estimator of π.h/ with a slight modification ofH p .t/. For all p ∈ {1, : : :  Glynn and Heidelberger (1990)). We denote the average ofH p .t/ over P processors byH.P, t/ = P −1 Σ P p=1H p .t/, which is unbiased for π.h/. Asymptotic results onH.P, t/ can be found in  and are summarized below. We first have the consistency results: lim t→∞H .P, t/ = lim P→∞H .P, t/ = π.h/ almost surely for all t and P, and, if E[|H| 1+δ ] for some δ > 0 and if {t P } is a sequence such that lim P→∞ t P = ∞, thenH.P, t P / converges to π.h/ in probability as P → ∞. Next, we can construct confidence intervals for π.h/ based onH.P, t/, following the end of section 3 in . Indeed, define σ 2 1 .P, t/ = will be used to construct confidence intervals in Sections 5.3 and 5.4. We conclude this section with a remark on the setting where t is fixed and the number of processors P goes to ∞. There, the time to obtainH.P, t/ would typically increase with P. Indeed at least one estimator needs to be completed on each processor forH.P, t/ to be available. The completion time behaves as the maximum of independent copies of the cost C. Under assumption 2, the completion time forH.P, t/ has expectation behaving as log.P/ when P → ∞, for fixed t. Other tail assumptions  would lead to different behaviour for the completion time that is associated withH.P, t/.

Couplings of Markov chain Monte Carlo algorithms
We consider couplings of various MCMC algorithms that satisfy assumptions 2 and 3. These couplings are widely applicable and do not require extensive analytical knowledge of the target distribution. We stress that they are not optimal in general, and we expect that other constructions would yield more efficient estimators. We begin in Section 4.1 by reviewing maximal couplings.

Sampling from maximal couplings
A maximal coupling between two distributions p and q on a space X is a distribution of a pair of random variables .X, Y/ that maximizes P.X = Y/, subject to the marginal constraints X ∼ p and Y ∼ q. We write p and q both for these distributions and for their probability density functions with respect to a common dominating measure, and we refer to the uniform distribution on the interval [a, b] by U. [a, b]/. A procedure to sample from a maximal coupling is described in algorithm 2 in Table 2; see for example section 4.5 of chapter 1 of Thorisson (2000), and Johnson (1998) where it is termed γ-coupling.
We justify algorithm 2 and compute its cost. Denote by .X, Y/ the output of the algorithm. First, X follows p from step 1. To prove that Y follows q, introduce a measurable set A. We write Table 2. Algorithm 2: sampling from a maximal coupling of p and q Step 1: sample X ∼ p and W |X ∼ U{[0, p.X/]}: if W q.X/, output .X, X/ Step 2: otherwise, sample Y Å ∼ q and W Å |Y Å ∼ U{[0, q.Y Å /]} until W Å > p.Y Å /, and output .X, Y Å / P.Y ∈ A/ = P.Y ∈ A, step 1/ + P.Y ∈ A, step 2/, where the events {step 1} and {step 2} refer to the algorithm terminating at step 1 or 2. We compute We can deduce from this that P .step 1/ = X min{p.x/, q.x/}dx. For P.Y ∈ A, step 2/ to be equal to Step 2 is a standard rejection sampler using q as a proposal distribution to targetq, which concludes the proof that Y ∼ q. We also confirm that algorithm 2 maximizes the probability of {X = Y }. Under the algorithm, where d TV .p, q/ = 1 2 X |p.x/ − q.x/|dx is the total variation distance. By the coupling inequality (Lindvall, 2002), this proves that the algorithm implements a maximal coupling.
To assess the cost of algorithm 2, note that step 1 costs one draw from p, one evaluation from p and one from q. Each attempt in the rejection sampler of step 2 costs one draw from q, one evaluation from p and one from q. Hereafter we refer to the cost of one draw and two evaluations by 'one unit'. Observe that the probability of acceptance in step 2 is given by P{W Å p.Y Å /} = 1 − X min{p.y/, q.y/}dy. Then, the number of attempts in step 2 has a geometric distribution with mean [1 − X min{p.y/, q.y/}dy] −1 , and step 2 itself occurs with probability 1 − X min{p.y/, q.y/}dy. Therefore the overall expected cost is two units. The expectation of the cost is the same for all distributions p and q, whereas the variance of the cost depends on d TV .p, q/, and in fact goes to ∞ as this distance goes to 0.
In algorithm 2, the value of X is not used in the generation of Y Å within step 2. In other words, conditionally on {X = Y }, the two output variables are independent. We might prefer to correlate the outputs in the event {X = Y }, e.g. in random-walk MH steps as in the next section. We describe a maximal coupling presented in . It applies to distributions p and q on R d such that X ∼ p can be represented as X = μ 1 + Σ 1=2Ẋ , and Y ∼ q as Y = μ 2 + Σ 1=2Ẏ , where the pair .Ẋ,Ẏ/ follows a coupling of some distribution s with itself. The construction requires that s is spherically symmetrical: s.x/ = s.y/ for all x, y ∈ R d such that x = y , where · denotes the Euclidean norm. For instance, if s is a standard multivariate normal distribution, then X ∼ N .μ 1 , Σ/ and Y ∼ N .μ 2 , Σ/.
Let z = Σ −1=2 .μ 1 − μ 2 / and e = z= z . We independently drawẊ ∼ s and U ∼ U. This procedure outputs a pair .Ẋ,Ẏ/ that follows a coupling of s with itself. We then define .X, Y/ = .μ 1 + Σ 1=2Ẋ , μ 2 + Σ 1=2Ẏ /. On the event {Ẏ =Ẋ + z}, we have X = Y . On the event {Ẏ =Ẋ + z}, the vectorẊ − 2.e Ẋ /e is the reflection ofẊ through the hyperplane orthogonal to e that passes through the origin. We show that the output .X, Y/ follows a maximal coupling of p and q, which we refer to as a maximal coupling with reflection on the residuals, or a 'reflection maximal coupling'. First we show thatẎ follows s, closely following the argument in . For a measurable set B, we compute The first integral above becomes 1 B .w/ min{s.w − z/, s.w/}dw, after a change of variables w := x + z. To simplify the second integral we make the change of variables w := x − 2.e x/e. Since this corresponds to a reflection with respect to a plane orthogonal to e, we have dw = dx, and x = w − 2.e w/e; thus where we have used s{w − 2.e w/e} = s.w/ and s{w − 2.e w/e + z} = s.w − z/, because w − 2.e w/e = w and w − 2.e w/e + z = w − z . Summing the two integrals we obtain P.Ẏ ∈ B/ = B s.w/dw, soẎ ∼ s.
To verify that the procedure corresponds to a maximal coupling of p and q, we observe that with the change of variablex := μ 1 + Σ 1=2 x. This is precisely the total variation distance between p and q, on writing their densities in terms of the density of s. Note that the computational cost that is associated with the above sampling technique is deterministic, in contrast with the cost of algorithm 2. Finally, for discrete distributions with common finite support, a procedure for sampling from a maximal coupling is described in Section 5.4, with a cost that is also deterministic.

Metropolis-Hastings steps
In Section 2.2 we described a coupling of MH chains due to Johnson (1998); we summarize the coupled kernelP{.X t , Y t−1 /, ·} in the following procedure.
Step 2: Step 3: if U min{1, π.X Å /q.X Å , X t /=π.X t /q.X t , X Å /}, then X t+1 = X Å ; otherwise X t+1 = X t . Step Here we address the verification of assumptions 1-3 for this algorithm. Assumption 1 can be verified for MH chains under conditions on the target and the proposal (Nummelin, 2002;Roberts and Rosenthal, 2004). In some settings the explicit drift function that is given in theorem 3.2 of Roberts and Tweedie (1996b) may be used to verify assumption 2 as in Section 3.2. The probability of coupling at the next step given that the chains are in X t and Y t−1 can be controlled as follows. First, the probability of proposing the same value X Å depends on the total variation distance between q.X t , ·/ and q.Y t−1 , ·/, which is typically strictly positive if X t and Y t−1 are in bounded subsets of X . Furthermore, the probability of accepting X Å is often strictly positive on bounded subsets of X , for instance when π.x/ > 0 for all x ∈ X . Assumption 3 is satisfied by design thanks to the use of maximal couplings and common uniform variable U in the above procedure.
Different considerations drive the choice of proposal distribution in standard MCMC sampling and in our proposed estimators. In the case of random-walk proposals with variance Σ, larger variances lead to smaller total variation distances between q.X t , ·/ and q.Y t−1 , ·/ and thus larger probabilities of proposing identical values. However, meeting events only occur if proposals are accepted, which is unlikely if Σ is too large. This trade-off could lead to a different choice of Σ than the optima known for the marginal chains (Roberts et al., 1997), and deserves further investigation.
We perform experiments with a d-dimensional normal target distribution N .0, V/, where V is the inverse of a matrix drawn from a Wishart distribution with identity scale matrix and d degrees of freedom. This setting, which has been borrowed from Hoffman and Gelman (2014), yields normal distribution targets with strong correlations and a dense precision matrix. Below, each independent run is performed with an independent draw of V. We consider normal random-walk proposals with variance Σ set to V=d. The division by d heuristically follows from the scaling results of Roberts et al. (1997). We initialize the chains either from the target distribution, or from a normal distribution centred at .1, : : : , 1/ with identity covariance matrix. We first couple the proposals with a maximal coupling given by algorithm 2. The resulting average meeting times, based on 1000 independent runs, are given in Fig. 1(a). The plot indicates an exponential increase of the average meeting times with the dimension, under both initialization strategies. In passing, this illustrates that meeting times can be large even if the chains marginally start at stationarity, i.e. in a setting where there is no burn-in bias.
Next we perform the same experiments with the reflection maximum coupling that was described in the previous section. The results are shown in Fig. 1(b). The average meeting times now increase at a rate that appears closer to linear in the dimension. This is to be compared with established theoretical results on the linear performance of standard MH estimators with respect to the dimension (Roberts et al., 1997). A formal justification of the scaling that is observed in Fig. 1(b) is an open question, and so is the design of more effective coupling strategies.

Gibbs sampling
Gibbs sampling is another popular class of MCMC algorithms, in which components of a Markov chain are updated alternately by sampling from the target conditional distributions (chapter 10 of Robert and Casella (2004)), implemented for example in the software packages JAGS (Plummer, 2003). In Bayesian statistics, these conditional distributions sometimes belong to a standard family such as the normal, gamma or inverse gamma distributions. Otherwise, the conditional updates might require MH steps. We can introduce couplings in each conditional update, using either maximal couplings of the target conditionals, if these are standard distributions, or maximal couplings of the proposal distributions in MH steps targeting the target conditionals. Controlling the probability of meeting at the next step over a set, as required for the application of proposition 4, can be done case by case. Drift conditions for Gibbs samplers also tend to rely on case-by-case arguments (see for example Rosenthal (1996)).
Gibbs samplers tend to perform well for targets with weak correlations between the components being updated; otherwise Gibbs chains are expected to mix poorly. We perform numerical experiments on normal target distributions in varying dimensions to observe the effect of correlations on the meeting times of coupled Gibbs chains. For each target N .0, V/, we introduce an MH-within-Gibbs sampler, where each univariate component i is updated with a single Metropolis step, using normal distribution proposals with variance V i,i . Here an iteration of the sampler refers to a complete scan of the components. Fig. 2(a) presents the median meeting times as a function of the dimension, when V is the inverse of a Wishart draw as in the previous section. In this highly correlated setting, the meeting times scale poorly with the dimension. The plot presents the median instead of the average, because we have stopped the runs after 500000 iterations; the median is robust to this truncation, but not the average. We remark that shorter meeting times are obtained when initializing the chains away from the target distribution. Next we consider a normal distribution target with covariance matrix V defined by V i,j = 0:5 −|i−j| , which induces weak correlations between components; the inverse of V is tridiagonal. In that case, the same Gibbs sampler performs much more favourably, as we can see from

Coupling of other Markov chain Monte Carlo algorithms
Among extensions of the MH algorithm, Metropolis-adjusted Langevin algorithms (e.g. Roberts and Tweedie (1996a)) are characterized by the use of a proposal distribution given current state X t that is normal with mean X t + h∇ log{π.X t /}=2 and variance hΣ, with tuning parameter h > 0 and covariance matrix Σ. Maximal couplings or reflection maximal couplings of the proposals could be readily implemented to obtain faithful chains. Going further in the use of gradient information, the Hamiltonian or hybrid Monte Carlo algorithm (Duane et al., 1987;Neal, 1993Neal, , 2011) is a popular MCMC algorithm for large dimensional targets. In , the framework of the present paper is applied to pairs of Hamiltonian Monte Carlo chains, with a focus of the verification of assumptions 1-3 in that context. Such couplings were analysed in detail in  and  to obtain convergence rates for the underlying chains. We refer to Heng and Jacob (2019) for more details and provide for completeness some experiments on the normal distribution target that was described above in the on-line supplementary materials. The present paper generalizes unbiased estimators that are obtained by coupling conditional particle filters in Jacob et al. (2019). These algorithms, which were introduced in , target the distribution of latent processes given observations and fixed parameters for non-linear state space models. The couplings of conditional particle filters in Jacob et al. (2019) involve a combination of common random numbers and maximal couplings. Couplings of particle independent MH sampling, which is a particular case of MH sampling with an independent proposal distribution, are simpler to design and are considered in .
The design of generic and efficient MCMC kernels is a topic of active on-going research (see for example Murray et al. (2010), Goodman and Weare (2010), Pollock et al. (2016), Vanetti et al. (2017) and Titsias and Yau (2017) and references therein). Any new kernel could lead to unbiased estimators with the framework proposed, as long as appropriate couplings can be implemented.

Illustrations
Section 5.1 illustrates the effect of k, m and the initial distribution π 0 , identifying a situation where some care is required. Section 5.2 considers the removal of the bias from a Gibbs sampler previously considered for perfect sampling and regeneration methods. Section 5.3 introduces an Ising model and a coupling of a replica exchange algorithm, and we present experiments performed on parallel processors. Section 5.4 considers a high dimensional variable-selection example, with an MH algorithm that has previously been shown to scale linearly with the number of variables. Finally, Section 5.5 focuses on the problem of approximating the cut distribution arising in modular inference, which illustrates the appeal of unbiased estimators beyond parallel computing.

Bimodal target
We use a bimodal target distribution and a random-walk MH algorithm to illustrate our method and to highlight some of its limitations. In particular, we consider a mixture of univariate normal distributions with density π.x/ = 0:5 N .x; −4, 1/ + 0:5 N .x; 4, 1/, which we sample from using random-walk MH with normal proposal distributions of variance σ 2 q = 9. This enables regular jumps between the modes of π. We set the initial distribution π 0 to N .10, 10 2 /, so that chains are likely to start closer to the mode at 4 than the mode at −4. Over 1000 independent runs, we find that the meeting time τ has an average of 20 and a 99% quantile of 105.
We consider the task of estimating 1.x > 3/π.dx/ ≈ 0:421. First, we consider the choice of k and m. Over 1000 independent experiments, we approximate the expected cost E[2.τ − 1/ + max.1, m − τ + 1/] and the variance V{H k:m .X, Y/}, and compute the inefficiency as the product of the two (as in Section 3.1). We then divide the inefficiency by the asymptotic variance of the MCMC estimator, which is denoted by V ∞ , which we obtain from 10 6 iterations and a burn-in period of 10 4 by using the R package CODA (Plummer et al., 2006).
We present the results in Table 3. First, we see that the inefficiency is sensitive to the choice of k and m. Second, we see that when k and m are sufficiently large we can retrieve an inefficiency that is comparable with that of the underlying MCMC algorithm. The ideal choice of k and m will depend on trade-offs between inefficiency, the desired level of parallelism and the number of processors that are available. We present a histogram of the target distribution, obtained by using k = 200 and m = 2000, in Fig. 3(a). These histograms are produced by averaging unbiased estimators of expectations of indicator functions, corresponding to consecutive intervals. Confidence intervals at level 95% are obtained from the central limit theorem and are represented as grey boxes, with vertical bars showing the point estimates.
Next, we consider a more challenging case by setting σ 2 q = 1, again with π 0 = N .10, 10 2 /. These values make it difficult for the chains to jump between the modes of π. Over R = 1000 runs we find an average meeting time of 769, with a 99% quantile of 9186. When the chains start in different modes, the meeting times are often dramatically larger than when the chains start by the same mode. One can still recover accurate estimates of the target distribution, but k and m must be set to larger values. With k = 20000 and m = 30000, we obtain the 95% confidence interval [0.397, 0.430] for 1.x > 3/π.dx/ ≈ 0:421. We show a histogram of π in Fig. 3 . 3. Histograms of the mixture target distribution of Section 5.1, obtained with the proposed unbiased estimators, based on a normal random-walk MH algorithm, with a proposal variance σ 2 q and an initial distribution π 0 , over R D 1000 experiments ( , target density function): (a) σ 2 q D 3 2 and π 0 D N .10, 10 2 /I (b) σ 2 q D 1 2 and π 0 D N .10, 10 2 /I (c) σ 2 q D 1 2 and π 0 D N .10, 1 2 / Finally we consider a third case, with σ 2 q = 1 as before but now with π 0 set to N .10, 1/. This initialization makes it unlikely for a chain to start near the mode at −4. The pair of chains typically converge around the mode at 4 and meet in a small number of iterations. Over R = 1000 replications, we find an average meeting time of 9 and a 99% quantile of 35. A 95% confidence interval on 1.x > 3/π.dx/ obtained from the estimators with k = 50 and m = 500 is [0.799, 0.816], which is far from the true value of 0.421. The associated histogram of π is shown in Fig.  3(c).
Sampling 9000 additional estimators yields a 95% confidence interval [−0:353, 1:595], again using k = 50 and m = 500. Among these extra 9000 values, a few correspond to cases where one chain jumped to the leftmost mode before meeting the other. This resulted in large meeting times and thus a large empirical variance for H k:m . On noting a large empirical variance one can then decide to use larger values of k and m. We conclude that, although our estimators are unbiased and are consistent in the limit as R → ∞, poor performance of the underlying Markov chains combined with ill-chosen initializations can still produce misleading results for any finite R, such as 1000 in this example.

Gibbs sampler for nuclear pump failure data
Next we consider a classic Gibbs sampler for a model of pump failure counts, which was used for example in Murdoch and Green (1998) to illustrate perfect samplers for continuous distributions, and in  to illustrate their regeneration approach. Here we focus on a comparison with the regeneration approach, which was motivated by similar practical concerns to those in this paper, in particular to avoid an arbitrary choice of burn-in, to construct confidence intervals on the expectations of interest and to make principled use of parallel processors.  showed how to construct regeneration times-random times between which the chain forms independent and identically distributed 'tours'. They defined a consistent estimator for arbitrary test functions, whose asymptotic variance takes a simple form. The estimator is then obtained by aggregating over these independent tours.
To form our estimator we apply maximal couplings at each conditional update of the Gibbs sampler, as described in Section 4.3. We begin by drawing 1000 meeting times independently. Following the guidelines of Section 3.1, we set k = 7, corresponding to the 99% quantile of τ and m = 10k = 70. For the regeneration approach,  gave a set of tuning parameters which we adopt below. Applying the regeneration approach to 1000 Gibbs sampler runs of 5000 iterations each, we observe on average 1996 complete tours per run with an average length of 2.50 iterations per tour. These values agree with the count of 1967 tours of average length 2.56 reported in . We observe a posterior mean estimate for β of 2.47 with a variance of 1:89 × 10 −4 over the 1000 independent runs, which implies an efficiency value of .5000 × 1:89 × 10 −4 / −1 = 1:06.
This exceeds the efficiency of 0.94 that was achieved by our estimator with the choice of k = 7 and m = 70. In contrast, the regeneration approach often requires more extensive analytical work with the underlying Markov chain; we refer to  for a detailed description. For reference, the underlying Gibbs sampler achieves an efficiency of 1:08, based on a long run of 5 × 10 5 iterations and a burn-in of 10 3 iterations. More extensive comparisons with other regeneration approaches such as that of  would deserve investigation.

Ising model
We consider an Ising model on a 32 × 32 square lattice with periodic boundaries. This provides a setting where a basic MCMC sampler can mix slowly depending on an inverse temperature parameter θ, and where a replica exchange strategy as in Geyer (1991) can be helpful. We also use this example to illustrate the use of our estimators on a large computing cluster, with the considerations that were reviewed in Section 3.3. For i and j in {1, : : : , 32} 2 we write i ∼ j if i and j are neighbours in the square lattice with periodic boundaries. We write x i ∈ {−1, 1} for the spin at location i, and x = {x i } for the full grid. We write t.x/ for the 'natural statistic' t.x/ = 0:5Σ i∈{1,:::,32} 2 Σ j∼i x i x j summing the products of pairs of neighbours. The 0.5-multiplier here results in each pair of neighbouring sites being counted only once. Under the model, the probability that is associated with a grid x is π θ .x/ ∝ exp{θt.x/}, where θ > 0 denotes an inverse temperature parameter that calibrates the degree of correlation between neighbouring sites.
We consider a single-site Gibbs sampler, called a heat bath algorithm in this context, to approximate the distribution π θ given a value of θ. One iteration of the algorithm consists of a sweep through all the locations i ∈ {1, : : : , 32} 2 . For each i we draw x i from its conditional distribution under π θ given all the other spins. It can be checked that the conditional probability of {x i = 1} given the other spins equals exp.θs i /={exp.θs i / + exp.−θs i /}, where s i denotes the sum of spins over the four neighbours of i. We initialize the chains by drawing spins uniformly in {−1, 1} at each site, independently across sites.
A simple strategy to couple heat bath chains consists of sampling from the maximal coupling of each conditional distribution. For a grid of θ-values from 0.3 and 0.48, we run 100 pairs of chains until they meet. We then plot the average meeting time as a function of θ in Fig. 4(a), noting that the average meeting time increases sharply to values above 10 6 as θ approaches its critical value (see the related discussion in ). We conclude that it would be expensive to produce unbiased estimators based on the heat bath algorithm for values of θ above 0.48, for reasons related to the behaviour of the underlying algorithm.
A coupling of this algorithm involves a pair of ensembles with N chains each; the two ensembles are identical if chain n in the first ensemble equals chain n in the second ensemble, for all We use common random numbers to decide whether to perform swap moves or single-site Gibbs moves, and whether to accept the proposed states in the event of a swap move. In the event of a single-site Gibbs move, we maximally couple each conditional update.
Throughout the following experiments we use p swap = 0:01 and introduce an equally spaced grid of θ-values from θ .1/ = 0:3 to θ .N/ = 0:55 for several choices of N. We note that these grids includes θ-values at which we have seen that the single-site Gibbs sampler mixes poorly. Fig.  4(b) shows the resulting average meeting times over 100 independent runs, as a function of the number of chains N. The average meeting time first decreases with the number of chains but then increases again. A possible explanation is that the mixing of the chains first improves as N increases and then stabilizes; however, it becomes more difficult for the ensembles to meet when N increases since all chains in the ensembles must meet. The minimum average meeting time is here attained for N = 16 chains per ensemble.
Setting N = 16, k = 10 5 and m = 2 × 10 5 we now illustrate the use of the proposed unbiased estimators on a cluster. The test function is taken as x → t.x/ defined above, so we estimate Σ x t.x/π θ .x/ for various values of θ. We use 500 processors to generate unbiased estimates with a time budget of 30 min. Within that time, each processor generated between one and seven estimators, with an average of 3.7 estimators per processor and a total of 1858 estimators. The chronology of the generation of these estimates is illustrated in Fig. 5(a). For each processor, horizontal segments of different colours indicate the duration that is associated with each estimator. The final estimates with standard errors are shown in Fig. 5(b), where we can see that the standard errors are very small compared with the values of the estimates, for each value of θ. These standard errors were computed asσ 1 .P, t/= √ P following equation (3.4), the central limit theorem corresponding to the large processor count limit.

Variable selection
For our next example we consider a variable-selection problem following Yang et al. (2016) to illustrate the scaling of our proposed method on high dimensional discrete state spaces. For integers p and n, let Y ∈ R n represent a response variable depending on covariates X 1 , : : : , X p ∈ R n . We consider the task of inferring a binary vector γ ∈ {0, 1} p representing which covariates to select as predictors of Y , with the convention that X i is selected if γ i = 1. For any γ, we write |γ| = Σ p i=1 γ i for the number of selected covariates and X γ for the n × |γ| matrix of covariates chosen by γ. Inference on γ proceeds by way of a linear regression model relating We assume a prior on γ of π.γ/ ∝ p −κ|γ| 1.|γ| s 0 /. This distribution puts mass only on vectors γ with fewer than s 0 1s, imposing a degree of sparsity. Given γ we assume a normal prior for the regression coefficient vector β γ ∈ R |γ| with zero mean and variance gσ 2 .X γ X γ / −1 . Finally, we give the precision σ −2 an improper prior π.σ −2 / ∝ 1=σ −2 . This leads to the marginal likelihood To approximate the distribution π.γ|X, Y/, Yang et al. (2016) employed an MCMC algorithm whose kernel P is a mixture of two Metropolis kernels. The first component P 1 .γ, ·/ selects a co-ordinate i ∈ {1, : : : , p} uniformly at random and flips γ i to 1 − γ i . The resulting vector γ Å is then accepted with probability 1 ∧ π.γ Å |X, Y/=π.γ|X, Y/, where a ∧ b denotes min.a, b/ for a, b ∈ R. Sampling a vector γ from the second kernel P 2 .γ, ·/ proceeds as follows. If |γ| equals 0 or p, then γ is set to γ. Otherwise, co-ordinates i 0 and i 1 are drawn uniformly among {j : γ j = 0} and {j : γ j = 1} respectively. The proposal γ Å has γ Å i 0 = γ i 1 , γ Å i 1 = γ i 0 and γ Å j = γ j for the other components. Then γ is set to γ Å with probability 1 ∧ π.γ Å |X, Y/=π.γ|X, Y/, and to γ otherwise. The MCMC kernel P.γ, ·/ targets π.γ|X, Y/ by sampling from P 1 .γ, ·/ or from P 2 .γ, ·/ with equal probability. Note that each MCMC iteration can only benefit from parallel processors to a limited extent, since |γ| is always less than s 0 , itself chosen to be a small value; thus the calculation of R 2 γ involves only linear algebra of small matrices. We consider the following strategy to couple the above MCMC algorithm. To sample a pair of states .γ ,γ / given .γ,γ/, we first use a common uniform random variable to decide whether to sample from a couplingP 1 of P 1 to itself or a couplingP 2 of P 2 to itself. The coupled kernel P 1 {.γ,γ/, ·} proposes flipping the same co-ordinate for both vectors γ andγ and then uses a common uniform random variable in the acceptance step. For the coupled kernelP 2 {.γ,γ/, ·}, we need to select two pairs of indices: .i 0 ,ĩ 0 / and .i 1 ,ĩ 1 /. We obtain the first pair by sampling from a maximal coupling of the discrete uniform distributions on {j : γ j = 0} and {j :γ j = 0}. This yields indices .i 0 ,ĩ 0 / with the greatest possible probability that i 0 =ĩ 0 . We use the same approach to sample a pair .i 1 ,ĩ 1 / to maximize the probability that i 1 =ĩ 1 . Finally we use a common uniform variable to accept or reject the proposals. If either vector γ orγ has no 0s or no 1s, then it is kept unchanged.
For various values of n, p and SNR, and the two types of design, we run coupled chains 100 times independently until they meet. We report the average meeting times in Tables 4 and 5. The average meeting times are of the order of 10 4 -10 5 , depending on the problem; the maximum is attained in the correlated design at n = 500, p = 1000 and SNR = 2. In contrast with this, the experiments in Yang et al. (2016) identify the scenario n = 500, p = 5000 and SNR = 1 as the most challenging. This discrepancy deserves further study; it could be due to variations from a synthetic data set to another, or to differences in the criteria being reported.
To illustrate the effect of dimension, we focus on the independent design setting with n = 500 and SNR = 1, and we consider values of p between 100 and 1000. For each value of p, we run coupled chains 1000 times independently until they meet. We present violin plots representing the distributions of meeting times divided by p in Fig. 6(a). The distribution of scaled meeting times appears to be approximately constant as a function of p, suggesting that meeting times increase linearly in p. This is consistent with the findings of Yang et al. (2016), where mixing times are shown to increase linearly in p.
Focusing now on the independent design case with n = 500, p = 1000 and SNR = 1 we consider various values of the prior hyperparameter κ in {0:1, 1, 2}. We set k = 75000 and m = 150000 and generate unbiased estimators on a cluster for 120 min, using 200 processors for each value of κ, and so 600 processors in total. The test function is chosen so that the estimand π.h/ is the vector of inclusion probabilities P.γ i = 1|X, Y/ for i ∈ {1, : : : , 20}. Within the time budget, 39282 estimates were produced, with each processor producing between eight and 181 of these. The largest observed meeting time was 81423. The meeting times were similar across the three values of κ.   Fig. 6(b) shows the results in the form of 95% confidence intervals shown as error bars, using expression (3.4), the central limit theorem relevant when the time budget is fixed and the number of processors grows large. We observe that κ has a strong effect on the probability of including the first 10 variables in this setting, and that the most satisfactory results are obtained for κ = 0:1 rather than for κ = 2, recalling that β Å has non-zero entries in its first 10 components. Note that the error bars are narrow but still noticeable, particularly for κ = 0:1. On Fig. 6(b), the full lines represent estimates that were obtained with 10 independent MCMC runs with 10 6 iterations each, discarding the first 10 5 iterations as burn-in. These MCMC estimates present noticeable variability in spite of the large number of iterations. In a standard MCMC setting, we might run chains for more iterations until the estimates agree across independent runs. In the framework proposed, we increase the precision by generating more independent unbiased estimators without necessarily modifying k or m. Fig. 6(b) suggests that the variable selection procedure that is considered here is sensitive to the prior hyperparameter κ; we refer to Yang et al. (2016), and to Johnson (2013) and Nikooienejad et al. (2016) for related discussions on Bayesian variable selection in high dimension and convergence of MCMC algorithms.

Cut distribution
Finally, our proposed estimator can be used to approximate the cut distribution, which poses a significant challenge for existing MCMC methods (Plummer, 2014;. This illustrates another appeal of the unbiasedness property, beyond the motivation for parallel computation. Consider two models: one with parameters θ 1 and data Y 1 and another with parameters θ 2 and data Y 2 , where the likelihood of Y 2 might depend on both θ 1 and θ 2 . For instance the first model could be a regression with data Y 1 and coefficients θ 1 , and the second model could be another regression whose covariates are the residuals, coefficients or fitted values of the first regression (Pagan, 1984;Murphy and Topel, 2002). In principle one could introduce an encompassing model and conduct joint inference on θ 1 and θ 2 via the posterior distribution. In that case, misspecification of either model would lead to misspecification of the ensemble and thus to a misleading quantification of uncertainty, as noted in several studies (e.g. ), Plummer (2014, Lunn et al. (2009), McCandless et al. (2010,  and Blangiardo et al. (2011)).
The cut distribution (Spiegelhalter et al., 2003;Plummer, 2014) allows the propagation of uncertainty about θ 1 to inference on θ 2 while preventing misspecification in the second model from affecting estimation in the first. The cut distribution is defined as π cut .θ 1 , θ 2 / = π 1 .θ 1 /π 2 .θ 2 |θ 1 /: Here π 1 .θ 1 / refers to the distribution of θ 1 given Y 1 in the first model alone, and π 2 .θ 2 |θ 1 / refers to the distribution of θ 2 given Y 2 and θ 1 in the second model. Often, the density π 2 .θ 2 |θ 1 / can be evaluated up to only a constant in θ 2 , which may vary with θ 1 . This makes the cut distribution difficult to approximate with MCMC algorithms (Plummer, 2014).
A naive approach consists of first running an MCMC algorithm targeting π 1 .θ 1 / to obtain a sample .θ n 1 / N 1 n=1 , perhaps after discarding a burn-in period and thinning the chain. Then, for each θ n 1 , one can run an MCMC algorithm targeting π 2 .θ 2 |θ n 1 /, yielding N 2 samples. One might again discard some burn-in and thin the chains, or just keep the final state of each chain. The resulting joint samples approximate the cut distribution. However, the validity of this approach relies on a double limit in N 1 and N 2 . Diagnosing convergence may also be difficult given the number of chains in the second stage, each of which targets a different distribution π 2 .θ 2 |θ n 1 /.
We consider the example that was described in Plummer (2014), inspired by an investigation of the international correlation between human papilloma virus (HPV) prevalence and cervical cancer incidence (Maucort-Boulch et al., 2008). The first module concerns HPV prevalence, with data independently collected in 13 countries. The parameter θ 1 = .θ 1,1 , : : : , θ 1,13 / receives a beta(1,1) prior distribution independently for each component. The data .Y 1 , : : : , Y 13 / consist of 13 pairs of integers. The first represents the number of women who were infected with high-risk HPV, and the second represents population sizes. The likelihood specifies a binomial model for Y i , independently for each component i. The posterior for this model is given by a product of beta distributions.
The second module concerns the relationship between HPV prevalence and cancer incidence, and posits a Poisson regression. The parameters are θ 2 = .θ 2,1 , θ 2,2 / ∈ R 2 and receive a normal prior with zero mean and variance 10 3 per component. The likelihood in this module is given by Z 1,i ∼ Poisson{exp.θ 2,1 + θ 1,i θ 2,2 + Z 2,i /} for i ∈ {1, : : : , 13}, where the data .Z 1,i , Z 2,i / 13 i=1 are pairs of integers. The first component represents numbers of cancer cases, whereas the second is the number of woman-years of follow-up. The Poisson regression model might be misspecified, motivating departures from inference based on the joint model (Plummer, 2014).
Here we can draw directly from the first posterior, denoted by π 1 .θ 1 /, and obtain a sample .θ n 1 / N n=1 . For each θ n 1 we consider an MH algorithm targeting π 2 .θ 2 |θ n 1 /, using a normal randomwalk proposal with variance Σ. We couple this algorithm by using reflection maximal couplings of the proposals as in Section 4.1. In preliminary runs, starting with a standard bivariate normal distribution as an initial distribution and a proposal covariance matrix set to identity, we estimate the first two moments of the cut distribution, and we use them to refine the initial distribution π 0 and the proposal covariance matrix Σ. With these settings we obtain the distribution of meeting times that is shown in Fig. 7(a). We then set k = 100 and m = 10k, and obtain approximations of the cut distribution represented by histograms in Figs 7(b) and 7(c), using N = 10000 unbiased estimators. The overlaid curves correspond to a kernel density estimate obtained by running m = 1000 steps of MCMC targeting π 2 .θ 2 |θ n 1 / with θ n 1 drawn from π 1 .θ 1 /, for n ∈ {1, : : : , N}, and keeping the final mth state of each chain. The proposed estimators can be refined by increasing the number N of independent replications, whereas the MCMC estimators would converge only in the double limit of N and m going to ∞. , marginal probability density functions of the cut distribution, obtained by running N D 10000 MCMC chains for 1000 steps, each targeting π 2 .θ 2 jθ n 1 / for θ n 1 drawn from π 1 .θ 1 /, and retaining the last state only)

Discussion
By combining the powerful technique of  with couplings of MCMC algorithms, unbiased estimators of integrals with respect to the target distribution can be constructed. Their efficiency can be controlled with tuning parameters k and m, for which we have proposed guidelines: k can be chosen as a large quantile of the meeting time τ , and m as a multiple of k. Improving on these simple guidelines stands as a subject for future research. In numerical experiments we have argued that the estimators proposed yield a practical way of parallelizing MCMC computations in a range of settings. We stress that coupling pairs of Markov chains does not improve their marginal mixing properties, and that poor mixing of the underlying chains can lead to poor performance of the resulting estimator. The choice of initial distribution π 0 can have undesirable effects on the estimators, as in the multimodal example of Section 5.1. Unreliable estimators would also result from stopping the chains before their meeting time.
Couplings of MCMC algorithms can be devised by using maximal couplings reflection couplings and common random numbers. We have focused on couplings that can be implemented without further analytical knowledge about the target distribution or about the MCMC kernels. However, these couplings might result in prohibitively large meeting times, either because the marginal chains mix slowly, as in Section 5.1, or because the coupling strategy is ineffective, as in Section 4.2.
Regarding convergence diagnostics, the framework proposed yields the following representation for the total variation between π k and π, where π k denotes the marginal distribution of X k : Here the supremum ranges over all bounded measurable functions under the assumptions stated.
The equality above has several consequences. For instance, the triangle inequality implies that d TV .π k , π/ min.1, E[max{0, .τ − k − 1/}]/, and we can approximate E[max{0, .τ − k − 1/}] by Monte Carlo sampling for a range of k-values. This is pursued in , where the construction proposed is extended to allow for arbitrary time lags between the coupled chains. Thanks to its potential for parallelization, the framework proposed can facilitate a consideration of MCMC kernels that might be too expensive for serial implementation. For instance, one can improve MH-within-Gibbs samplers by performing more MH steps per component update, Hamiltonian Monte Carlo sampling by using smaller step sizes in the numerical integrator (Heng and Jacob, 2019) and particle MCMC sampling by using more particles in the particle filters Jacob et al., 2019). We expect the optimal tuning of MCMC kernels to be different in the proposed framework from when used marginally.
On top of enabling the application of the results of  to accommodate budget constraints, the lack of bias of the estimators proposed can be beneficial in combination with the law of total expectation, to implement modular inference procedures as in Section 5.5. In Rischard et al. (2018) the lack of bias was exploited in new estimators of Bayesian cross-validation criteria. In Chen et al. (2018) similar unbiased estimators were used in the expectation step of an expectation-maximization algorithm. There may be other settings where the lack of bias is appealing, for instance in gradient estimation for stochastic gradient descents (Tadić and Doucet, 2017).

Discussion on the paper by Jacob, O'Leary and Atchadé
Darren J. Wilkinson (Newcastle University, Newcastle upon Tyne) Jacob, O'Leary and Atchadé are to be congratulated for this interesting and useful contribution to the Markov chain Monte Carlo (MCMC) literature. Since in general we cannot initialize an MCMC chain with an exact sample from the target, we rely on asymptotic convergence to equilibrium, but lack of convergence leads to biased estimates. The method outlined in the paper provides a method for removing the bias from any estimates that are produced from the algorithm output, by using a pair of coupled chains. The approach is quite general and can potentially be applied to many different kinds of MCMC algorithms-this paper concentrates on Metropolis-Hastings and Gibbs sampling, but there are several related papers exploring applications to other MCMC schemes.
It is important to keep in mind that coalescence of the coupled chains and convergence to equilibrium are different. Practical application of the method relies on the time-averaged estimator (equation (2.1)). This is just the regular MCMC estimate with a correction term that terminates once the chains have coupled.

Supporting information
Additional 'supporting information' may be found in the on-line version of this article: 'Unbiased Markov chain Monte Carlo with couplings'. Choosing k (the 'burn-in') to be a high quantile of the coupling time distribution will ensure that most estimates contain no correction at all. But, since k represents 'wasted' computation, it seems desirable to encourage the chains to couple rapidly. Although it is possible to engineer early coupling (via identical initial conditions and a low acceptance rate Metropolis-Hastings kernel, for example), this might be a dangerous strategy (see the example in Section 5.1). When the correction term is present, it can sometimes be large, so developing a better understanding of the distributional properties of the correction term is likely to be useful. The technique proposed for coupling Metropolis-Hastings kernels works for multivariate as well as univariate samplers, though reflection maximal coupling works much better than a simple independent maximal coupling in high dimensions. However, it also makes sense to want to couple Gibbs samplers and other componentwise update algorithms. The paper proposes that, for each component update of a Gibbs sampler, the full conditionals for the two chains should be coupled. It is clear that, once all components have coalesced, the full state of the chain will match, and the coupling of the two chains will be faithful. There is little theory providing reassurance that this will happen in reasonable time for challenging problems, though it does seem to work reasonably well, empirically.
Consider a linear Gaussian auto-regressive AR(1) process defined by X t = αX t−1 + t , t ∼ N.0, σ 2 /, for t = 1, 2, : : : , T , with periodic boundaries. The full conditional for each variable X t is of the form We can run two chains by cycling through each variable in turn and sampling from the coupled full conditionals. Here we use T = 200 and α = 0:99, and run initially for n = 500 iterations (Fig. 8). By studying the coupling of variables in the Gibbs sampler, we see that coupling follows a stochastic process, somewhat reminiscent of the annealing of a one-dimensional Ising model, and that it is not always rapid. We can study the coupling time distribution for this model with simple Monte Carlo sampling, to find that the mean coupling time is around 2:6 × 10 3 , with a median of around 1 × 10 3 . The distribution is long tailed, with a maximum observed coupling time over 1000 replicates of 67000. Choosing k to be a very high quantile of this distribution will therefore be computationally demanding.
One of the main motivations for obtaining unbiased estimates of posterior expectations is to fix one of the issues with parallel chains MCMC sampling. We would like to run many chains independently on different processors and then to pool results in some way. In this case any bias in the chains will not 'average out' across multiple processors, since there will be non-negligible bias in every chain. Unbiased estimators can be safely averaged to obtain new unbiased estimators with reduced variance. However, the debiasing does not eliminate burn-in-it just corrects for it, so there is still repeated burn-in, limiting the scalability of the parallel chains approach. The paper recommends choosing k as a high quantile of the coupling time distribution, and m a multiple of k. Any non-zero k limits speed-up of parallel coupled chains relative to one long run, via Amdahl's law for parallel chains MCMC sampling (Wilkinson, 2005): for burn-in b and monitoring run n on N processors (e.g. an asymptotic limit of 10 for n = 9b). This suggests that parallel chains may be useful in the context of a small number of available processors (of the order of 10), but possibly an increasingly inefficient proposition for large numbers of processors (greater than of the order of 10 2 ). The method proposed in the paper is often straightforward to implement and seems to work on a broad range of samplers. It is potentially useful in the context of parallelization of MCMC algorithms, since averaging unbiased estimators is relatively safe, but it is not clear that it fundamentally solves parallel MCMC sampling, since some kind of burn-in must still be repeated for every pair of chains. Many open questions remain regarding the codevelopment of MCMC and coupling algorithms to optimize efficiency. Overall, this represents an interesting and thought-provoking contribution to the literature, and so it gives me great pleasure to propose the vote of thanks.
Chris Sherlock (Lancaster University) I congratulate Jacob, O'Leary and Atchadé for this methodological development which is relatively straightforward to implement yet has the potential to broaden the use of 'embarrassingly parallel' (Dean Reverse engineering the key construction on the third page, we see that H k .X, Y/ is unbiased because on average X t is one step ahead of Y t−1 ; moreover, as is clear from Fig. 9(a), most of the bias correction occurs during the relatively brief phase that in standard MCMC sampling is called 'burn-in'. Also, from Fig. 9(a), the value of the bias correction term Σ τ −1 t=k+1 {h.X t / − h.Y t−1 /} depends on the meanderings of the two chains until their coupling time and is potentially very noisy. This is mitigated by averaging to form H k:m in equation (2.1), but it still depends on the meanderings of the two chains. Now var.BC k:m / E[var.BC k:m |τ /], but a simple back of the envelope calculation suggests that var.BC k:m |τ / = O{.τ − k/ 3 =.m − k/ 2 }1 .τ >k+1/ , motivating setting k to a high quantile of τ and choosing m so that, when τ > k + l, the denomator dominates the numerator. This also underlines the importance of reflection maximal coupling where, for random-walk proposals, τ is approximately linear in dimension rather than exponential. See, for example, Bladt et al. (2014) and references therein for the background to reflection coupling and other similar methods such as projection coupling.
A target of N{0, diag.1 2 , 2 2 , : : : , 20 2 /} was explored using a proposal of X Å |x ∼ N.x, λ 2 I 20 / for 10 5 iterations starting from N{.10, 20, : : : , 200/ T , I 20 }, and Y Å was obtained by using the reflection maximal coupling. Fig. 9(b) shows the distribution of log 10 .τ / over 5000 repeat simulations. In standard MCMC sampling, the burn-in is decided from the trace plots of a few runs, with the post-burn-in values used for estimation. Thus we keep m fixed at 10 5 and examine the effect of increasing k on estimation of h.X/ = X 20 , the least well-explored component. Fig. 9(c) shows that the mean-squared error (MSE) of unbiased MCMC is prohibitively large until beyond the 0.999-quantile of τ , yet the MSE of the MCMC value might be deemed adequate and is stable throughout.
As mentioned, the noise is due to the meanderings of the two chains and the large variability in τ ; Fig. 9 and the argument above show that the dominant contribution is from the most extreme τ -values. I now provide a Rao-Blackwellization for Metropolis-Hastings MCMC sampling, integrating over the distribution of coupling events, which brings the most extreme τ -values under control and leads to the MSEs in Fig. 9(d). These show a modest reduction for low and medium values of k but an order of magnitude at the 0.99-quantile and a factor of 30 at k = 10 4:25 , at a cost of less than twice that of unbiased MCMC sampling.
Conditionally on .X t , Y t−1 / reflection coupling proposes X Å = Y Å with probability α rc ; otherwise the proposal Y ÅÅ = X Å is a deterministic function of .X t , Y t−1 / and X Å . Let α.a, b/ denote the appropriate Metropolis-Hastings acceptance probability from a current value a to a proposed value b.
; with probability α ∧ , X Å is accepted by both chains and coupling occurs. Instead of sampling whether or not coupling occurs, we ascertain its probability: p coup := α rc × α ∧ . If p coup = 1 then coupling is inevitable; we set τ max = t and sample the X-chain from t onwards. Otherwise, we know that, if coupling occurred at τ = t, X t:m would be sufficient additional information to calculate H k:m , so we may continue sampling the Y -chain conditionally on its not coupling with X at time t and then simply ignore these later Y -values when conditioning on τ = t. Specifically, we track and store a variable p nc t : the probability that the chains are not coupled by and including time t. At iteration t + 1 we perform the following steps.
Step 2: if p coup < 1 then sample an action from {'perform accept-reject steps on X Å and Y ÅÅ ', 'accept X Å and reject Y Å ', 'reject X Å and accept Y Å ', 'reject X Å and Y Å '} according to the probability vector otherwise set τ max = t + 1.

The Rao-Blackwellized bias correction is then
where Y.τ Å / := .Y 1:.τ Å−1/, X τ Å:.m−1//. In the above experiments, always, τ max m; moreover var.E[τ 3 ]/ ≈ 2:4E[var.τ 3 /], suggesting that variability between simulations is more important; however, a few very influential simulations had extremely large values for τ , from the tail of their distribution, and it is control of these which so reduces the MSE. Certain two-block Gibbs samplers, such as for exploring an auto-regressive AR(1) process via a block of odd-numbered components and a block of even-numbered components, could be similarly Rao-Blackwellized since, conditionally on one of the blocks coupling, the other block would be guaranteed to couple. The Rao-Blackwellization could also be applied to the estimate of d TV from Section 6.
The vote of thanks was passed by acclamation.

Radu Craiu (University of Toronto) and Xiao-Li Meng (Harvard University, Cambridge)
We thank Jacob, O'Leary and Atchadé for their tantalizing paper. As developers of Markov chain Monte Carlo (MCMC) methodology our curiosity is triggered, and as users of MCMC sampling our hopes are boosted. Indeed, we are inspired to continue our exploration of improving MCMC samplers and estimators Table 6. Steps for implementing the proposed control variate estimators based on N (independent) paralleled coupled chains Step 1: run N coupled chains and draw independent and identically distributed Bernoulli(0.5) Step 2:  Meng (2001, 2005), Craiu and Lemieux (2007) and Yu and Meng (2011)), among which figures prominently the control variates method. We start by noting that H k .X, Y/, defined in Section 2, can be expressed in two ways:  where A ∨ B = max{A, B}. The first expression is the authors' which rendered the insight that H k .X, Y/ includes a time forward correction of h.X k /'s bias, when τ > k + 1. The second expression indicates that H k .X, Y/ is also a time backward correction for the bias in h.X τ −1 / when τ > k + 1. (No correction is needed when τ k + 1.) The most important insight from the second expression, however, is that Δ j ≡ h.X j / − h.Y j / has zero expectation for any j. This implies that, for any integer-valued η independent of {.X j , Y j /, j = 1, : : :}, we can use C k .η/ = Σ η−2 j=k Δ j as a control variate for H k .X, Y/, i.e.
always shares the mean of H k .X, Y/ but can have a smaller variance with a judicious choice of η. Our preliminary theoretical investigations (Craiu and Meng, 2020) show that a good choice of η is the median of τ − ξ, where ξ is an independent Bernoulli(0.5) variable. This suggests immediately the method given in Table 6, where H k:m is the time averaging extension of H k , defined by the authors' expression (2.1), and similarly H Å k:m extends H Å k . Table 7 comparesH Å k:m withH k:m , the average of the authors' N estimators {H k:m .X .i/ , Y .i/ /} N i=1 , using their pump failure data, but without time averaging (hence m = k = 7). We see a decrease of standard deviations for all cases. The reductions are small because τ i s are small (e.g. τ i s vary mostly from 1 to 6), and hence C k cannot contribute much. Indeed the gains vanish with time averaging using m = 70, as seen in Table 8.
To confirm that this vanishing is due to fast coupling, we simulated a new data set (Table 9) by using the same model for the pump failure data, for which the τ i s are somewhat larger (e.g. mostly {τ i = 1 ∼ 9). The distributions of coupling times for the pump data and the simulated examples are shown in Fig. 10 respectively for two and 10 parallel chains. Table 10 repeats Table 8 with these new data, where we see the gains appear again, albeit quite small. More extensive investigations are reported in Craiu and Meng (2020), especially on the use of equation (2) for improving the theoretical bounds given in .

Andi Q. Wang (University of Bristol), Murray Pollock and Gareth O. Roberts (University of Warwick, Coventry, and Alan Turing Institute, London) and David Steinsaltz (University of Oxford)
We congratulate Jacob, O'Leary and Atchadé for a stimulating and interesting paper. Exploiting coupling constructions within Markov chain Monte Carlo (MCMC) methods can be very powerful; in this work they have exhibited a range of coupling constructions, which can be adopted for certain commonly used MCMC samplers to debias the output. An alternative approach to removing bias is to build global couplings into the design of new MCMC algorithms. Section 5.2 briefly reviews one traditional version of this approach, using regenerations, following a method proposed by .
We wish to call attention to recent advances in regenerative algorithms, that build on two decades of development in the theory of quasi-stationarity in Markov processes. One recent example is the 'Restore' sampler, proposed in Wang et al. (2019), a continuous time regenerative sampler, constructed from three components: a continuous time Markov process Y taking values in the state space X , a regeneration rate κ : X → [0, ∞/ and a probability density function μ on X , the regeneration distribution.
From an algorithmic perspective, the process X t runs according to the law of the Markov process Y but also randomly regenerates at rate t → κ.X t /: a new location is drawn (independently) from the regeneration distribution μ. Under appropriate conditions on the components Y , κ and μ, the resulting process X has a given target probability distribution π as its invariant measure. Fixing Y , μ and the target density π, the corresponding choice of regeneration rate κ, to ensure π-stationarity, is where L is the infinitesimal generator of the Markov process Y , and the constant C is chosen such that κ 0. The process's killing and subsequent regeneration mechanism enables us to construct a natural coupling on the entire state space. Suppose that a killing time can be identified that would have occurred regardless of the state of the process. (This can be done straightforwardly if, say, the regeneration rate κ is uniformly bounded away from 0.) In this case, regardless of the prior evolution of the process, all sample paths can be coupled-because of the fixed regeneration distribution-to a common point in the state space.
By appealing to the coupling from the past principle , it suffices then to simulate the restore process from the last such occurrence in the past forwards to time 0 to achieve a perfect draw from the target distribution. For further details, we refer the reader to Wang et al. (2019).

Anthony Lee (University of Bristol), Sumeetpal S. Singh (University of Cambridge) and Matti Vihola (University of Jyväskylä)
We congratulate Jacob, O'Leary and Atchadé for their inspiring work, which elaborates on the general Markov chain debiasing method of  and demonstrates how it can be applied in practice in various settings. We would like to highlight the importance of understanding the scalability properties of the related couplings, in particular with respect to increasing model dimension, to assess the applicability of the method for challenging problems for which inference is difficult with existing techniques.
We focus here on the context of hidden Markov model smoothing, where debiasing was suggested earlier (Jacob et al., 2020) based on coupled conditional particle filters (Chopin and Singh, 2015). Our follow-up work (Lee et al., 2020) presented another coupling based on the backward sampling conditional particle filter (Whiteley, 2010). The latter approach, termed the coupled conditional backward sampling particle filter, has provably stable behaviour in increasing data record length T (Lee et al., 2020). More specifically, under suitabie 'strong mixing' conditions, the coupling time τ T of this filter-with fixed number of particles-satisfies assumption 2 of the present paper such that P.τ T > t/ C T δ t , with constants log.C T / = O.T/ and δ ∈ .0, 1/, and these orders appear tight (Lee et al., 2020).
Since the Markov chain is uniformly ergodic, an easy modification of the proof of proposition 3 for bounded functions yields the following upper bound for the 'bias correction' term: Clearly, for any > 0, there exists c ∈ .0, ∞/ such that for all k k T = cT it holds that E[BC 2 k:m ] h 2 ∞ .m − k + 1/ −2 . Furthermore, c may be chosen so that log.C T δ k T / = O.−T/, demonstrating that the bias correction part quickly becomes negligible for large T , as long as k k T .
If we consider the suggested unbiased parallelization with P processing units and aim for minimal wall time, the required run time is κτ Å P , where κ is the cost of a single iteration and τ Å P is the maximum of P independent coupling times. It is easy to see that in our context E[τ Å P ] = O{T + log.P/} and, since κ = O.T/, we conclude that the minimal time of the maximum of P independent realizations is O{T 2 + T log.P/}. The use of a fixed number of particles implies that the space complexity is, however, O.TP/.
The following contributions were received in writing after the meeting.

Christine P. Chai (Microsoft Corporation, Redmond)
Bayesian methodology is known to provide biased estimates with a lower mean-squared error (Hoff, 2009), so I am excited to see improvements on removing the bias from Markov chain Monte Carlo (MCMC) kernels. This paper provides excellent theoretical methodology, and I have several questions about the description of the unbiased MCMC estimators.
(a) Why does the paper use only confidence intervals, not credible intervals? I understand that the central limit theorem is a frequentist concept (Gray et al., 2015) but, since this paper is about Bayesian methods, I wonder where the term 'credible interval' applies. (b) What is the contextual definition of the 'meeting time' of the chains? The mathematical definition is provided, but I think that stating the definition in plain text would also be helpful. (c) Why is the convergence speed measured as the 'average meeting time' in the figures? I believe that one time unit refers to the cost of one draw and two evaluations in the sampling algorithm, so the concept makes sense. But the word 'time' makes me relate this term to the actual computation time.
Since the details of the platform (e.g. RStudio) is not mentioned in the paper, I wonder whether there is a better term to describe the computation cost of MCMC sampling. (d) I agree that parallel computing is not a panacea, and that each MCMC iteration can receive only limited benefits from parallel processors. Given the potential for parallelization of the framework proposed, how is the scalability related to the asymptotically exact, 'embarrassingly parallel' MCMC algorithm described in Neiswanger et al. (2013)?
(The opinions and views expressed here are those of the author and do not necessarily state or reflect those of Microsoft.)

Martin Chak, Nikolas Kantas and Grigorios A. Pavliotis (Imperial College London)
We congratulate Jacob, O'Leary and Atchadé for their major and inspiring contribution. In our discussion we shall sketch our ideas for extending the use of couplings to include variance reduction in addition to bias removal. Our approach is based on extending the framework presented in Nüsken and Pavliotis (2019). There, couplings of n π-invariant continuous time Markov processes .Z i ; i = 1, : : : , n/ were considered for variance reduction (with respect to asymptotic variance). Sections 3 and 5 of Nüsken and Pavliotis (2019) present non-trivial constructions for such couplings of Markov processes that can be used in sampling contexts. Typically one should expect that these couplings enhance the exploration of the state space, which seems to be in contrast with constructing Markov processes that meet as required for unbiased Markov chain Monte Carlo sampling. We believe that these two opposing behaviours can be combined. Instead of using independent samples of coupled faithful pairs of chains, one could consider correlating these pairs, Z i = .X i , Y i /, such that each of X i and Y i have marginally in i the same law and then use an unbiased estimator. Given X i one could design Y i so that they eventually meet and the X i s could be jointly propagated by using correlated noise inputs designed for improved joint asymptotic variance or better exploration of the state space as in Nüsken and Pavliotis (2019). This can be achieved in discrete time (or in continuous time) by extending the work of Nüsken and Pavliotis (2019) to discrete time Markov processes and Markov chain Monte Carlo sampling (or by extending the work in this paper for continuous Markov processes). Finally we note that in the most simple case moving from X to work on a joint state space X n requires modifying the test function h.x/ to .1=n/Σ n i=1 h.x i / or a normalized weighted average, so it would be interesting to include variance reduction techniques such as control variates.

Nicolas Chopin (École Nationale de la Statistique et de l'Administration Economique, Institut
Polytechnique de Paris) I congratulate Jacob, O'Leary and Atchadé on their very elegant, coupling-based, approach to making Markov chain Monte Carlo algorithms parallelizable. To illustrate the points I would like to make, I consider the Gibbs sampler of Albert and Chib (1993), for a probit model .y i = sgn.z i /, z i = β T x i + i , i ∼ N.0, 1// where one simulates alternately step 1, β|y, z (Gaussian), and step 2, z|β, y (independently, z i ∼ N >0 .β T x i , 1/ if y i = 1).
Even if we restrict our attention to maximum coupling for step 1, several options are available to couple at step 1: (a) maximum coupling for the z i s; (b) maximum coupling for the residuals i (in this specific case, coupling is particularly simple (and fast); to couple two variates from distributions N >c 1 .0, 1/ and N >c 2 .0, 1/, assuming c 1 < c 2 , sample X ∼ N >c l .0, 1/, and take Y = X if X > c 2 ; otherwise sample Y ∼ N [c 1 , c 2 ] .0, 1/); (c) optimal transport for the z i s (i.e. use the inverse cumulative distribution function algorithm with the same input). In that case, the z i s become increasing close but do not become equal. But note that this still helps to increase the probability that the β-chains become coupled. Fig. 11. Kernel density estimates of the coupling times for the three coupling approaches: , latent coupling; , residuals coupling; , latent transport Fig. 11 shows kernel density estimates for the coupling times of the three approaches (100 runs; data, German credit, with standardized predictors; prior, N.0, 10 2 / for each β j ). If computational time is taken into account, approach (b) is the clear winner, as it seems 20 times faster (at least on this particular data set, and with the usual caveats regarding implementation). But it is also interesting to see that approach (c) (transport) may be competitive without forcing the z i s to be equal.
I would be glad to hear the authors' views, more generally, on (a) the effect of parameterization on coupling performance and (b) the opportunity to use transport coupling (or any coupling that does not generate equal outputs with positive probability) for certain components of the chain.

Mathieu Gerber and Anthony Lee (University of Bristol)
This is a very interesting paper that demonstrates how couplings can be used in practical applications to obtain unbiased approximations.

Reparameterization
Section 4.3 of the paper discusses the coupling of Metropolis-within-Gibbs samplers, and the numerical results suggest that the average meeting time scales (sub)linearly with the dimension d of the target distribution π, provided that its components are weakly dependent. When this condition is not verified one can find a diffeomorphism g such that the components of π g = π • g −1 , the image of π by g, are weakly dependent. Although finding such a diffeomorphism can be difficult in general, when π is a posterior distribution the Bernstein-von Mises theorem suggests taking g.x/ = A −1 .x − μ/ with μ ≈ E π [X] and A such that AA T ≈ cov π .X/. In this case, π g ≈ N .0, I d / and we can follow the approach used in Fig. 2 (with V = I d ) to approximate π g , and therefore also π.
To illustrate, we consider a logistic regression model, with an N .0, I d / prior distribution and n = 100d covariates, and report in Fig. 12 the average meeting time (over 20 replications) as a function of d, which appears to scale (sub)linearly with d.
Jiaqi Gu (University of Hong Kong) I congratulate Jacob, O'Leary and Atchadé for their thought-provoking paper. In the paper, I am interested in two aspects: (a) adapting couplings to substitute a meeting time with finite expected value for traditional convergence diagnosis methods in Markov chain Monte Carlo (MCMC) algorithms and (b) removing the bias of MCMC estimators with telescopic sums at finite expected computational cost.
MCMC methods (Metropolis et al., 1953;Geman and Geman, 1984) have been widely used for Bayesian inference. Although the asymptotic convergence of MCMC estimators has been proved in theory (Tierney, 1994) Consequently, unbiased estimator H k:m .X, Y/ could be obtained at finite expected computation cost O.E[τ ]/. By investigating the performance of couplings in numerical experiments under various settings, the authors propose simple guidelines to implement their unbiased Bayesian inference method as efficiently as possible.
There are still some possible extensions of their method. One example is missing data. As illustrated in Sections 4.2 and 4.3, the average meeting time would increase as the dimension grows. This would make it more difficult to adapt such a strategy to latent variable models. Like the Ising model in Section 5.3, latent variable models f.X obs ; θ/ = f.X obs , X mis ; θ/dX mis usually consist of high dimensional latent variables X mis which need updates but are irrelevant to the target E π [h.θ/]. Given that the dimension of X mis is far greater than parameters θ, the expected meeting time E[τ ] will be extremely large and the unbiased estimators could not be obtained within available computational time if the definition of meeting time is kept unchanged. An extension of the proposed method to deal with missing data problems is needed.

Ajay Jasra (King Abdullah University of Science and Technology, Thuwal), Kody Law (University of Manchester) and Jeremy Heng (École Supérieure des Sciences et Commerciales Business School, Singapore)
We thank Jacob, O'Leary and Atchadé for a most interesting paper. Our discussion centres on a very promising extension of the method for unbiased estimation as we now explain. Consider a probability measure π on measurable space .X, X / for which we want to compute π.ϕ/ := X ϕ.x/π.dx/ with ϕ : X → R, π integrable and measurable. Suppose that one can deal with only a sequence of biased probability measures .π l / l 0 on .X, X /, with π l .|ϕ|/ < ∞ ∀ l ∈ Z + , such that lim l→∞ π l .ϕ/ = π.ϕ/; examples include partially observed diffusion processes (e.g. Jasra et al. (2017)) or inverse problems (e.g. Beskos et al. (2017)). Consider a positive probability mass function P L on Z + . It is known (Rhee and Glynn, 2015;Vihola, 2018) that if one can find a sequence of independent random variables .ξ l / l 0 that are independent of L ∼ P L such that E[ξ 0 ] = π 0 .ϕ/, E[ξ l ] = π l .ϕ/ − π l−1 .ϕ/ ∀ l ∈ N, then ξ L =P L .L/ is an unbiased estimator of π.ϕ/. This is the 'single-term' estimator as discussed by Rhee and Glynn (2015) and Vihola (2018), but alternatives such as the 'coupled sum' estimator are also possible. If l is a parameter induced by an Euler discretization of a stochastic differential equation or a finite element approximation of a solution of a partial differential equation one can completely remove the discretization bias, while only working with biased versions of π. The main idea is now to utilize the authors' approach to obtain the sequence of .ξ l / l 0 . If l = 0 then one can apply the coupled Metropolis-Hastings (MH) method associated with π 0 . For l ∈ N one must be more careful as, to satisfy condition (3), it will typically not be sufficient to run two independent unbiased Markov chain Monte Carlo algorithms for-estimating π l .ϕ/ and π l−1 .ϕ/ respectively. The scheme which we are developing is based on constructing a four-way coupling of MH kernels. This is possible, for example, by sampling a four-way 'maximally' coupled MH proposal, i.e., if Q : X → P.X/ (P.X/ are the probability measures on .X, X /) is a Markov kernel and x 1:4 ∈ X 4 , we sample a proposal kernelQ : X 4 → P.X 4 /, such that for any j ∈ {1, : : : , 4} x −j ∈X 3Q .x 1:4 , dx 1:4 / = Q.x j , dx j / (x −j is x 1:4 without the jth co-ordinate) and Q.x j , dx j /: Acceptance or rejection of proposals for these four chains can be conducted in a similar manner to the authors' but, when l ∈ N, care is required to ensure faithfulness of the chains targeting π l and π l−1 . We believe that this doubly randomized scheme will have multiple applications and moreover have finite variance and expected cost, as well as providing a competitive alternative to Agapiou et al. (2018). In addition, the benefits of such unbiased multilevel Monte Carlo methods (Giles, 2008), relative to advanced multilevel Monte Carlo methods such as in Beskos et al. (2017) and Jasra et al. (2017), would be the lack of bias, amenability to parallelization and comparable or reduced computational cost for a given error.

Lawrence Middleton, George Deligiannidis and Arnaud Doucet (University of Oxford)
We congratulate Jacob, O'Leary and Atchadé for this seminal work for both its elegance and its importance. The implementation of this methodology requires being able to couple two Markov chain Monte Carlo (MCMC) chains and is highly dependent on the specific MCMC algorithm used. We review here the scheme proposed in  to couple two independent Metropolis-Hastings (MH) chains and explain how this idea can be combined with sequential Monte Carlo (SMC) methods to estimate unbiasedly expectations with respect to high dimensional probability measures. Given a target density π.x/ and proposal density q.x/, we can couple two MH chains X and Y to ensure that X t = Y t−1 for all t τ for some meeting time τ by using the same proposal and uniform random variables for X at iteration i and Y at iteration i − 1. Furthermore, by also using q.x/ as initial distribution for both chains and selecting the proposal at the first iteration for X to be equal to the initial distribution of Y , we can ensure that P.τ = 1/ 1 2 . Additionally, conditionally on X 0 , the meeting time follows a geometric distribution; see  for details.
The independent MH algorithm is of limited interest for high dimensional targets. However, the particle independent MH (PIMH) method introduced in  empirically scales much better. This algorithm is nothing but an independent MH algorithm targeting an artificial target π N .x, z/ admitting the desired target π.x/ as a marginal by using an SMC proposal q N .x, z/, where N is the number of particles and z some auxiliary variables. We can thus couple PIMH chains exactly like IMH chains and, under weak regularity assumptions, P.τ = 1/ → 1 as N → ∞. In particular, we can use PIMH steps to estimate unbiasedly expectations with respect to the smoothing distribution of non-linear non-Gaussian state space models by using a particle filter proposal. To estimate unbiasedly expectations with respect to a general probability measure, we can simply use SMC samplers (Del Moral et al., 2006) or annealed importance sampling (Neal, 2001) proposals within PIMH sampling and couple the resulting chains , appendix B).
From a practical point of view, the coupling approach of Jacob and his colleagues is easier to put into practice than the random-truncation technique of  in this context. Indeed, as long as the likelihood of the observations given the latent states is bounded for state space models or the likelihood is bounded for general posterior targets, then the PIMH algorithm is uniformly ergodic for any N and, under very weak additional assumptions, the coupling procedure returns finite variance estimates in finite expected time. To obtain similar guarantees for the methodology of , quantitative convergence results for the PIMH algorithm would need to be obtained to select an appropriate distribution for the truncation time.

Daniel Paulin (University of Edinburgh)
Congratulations go to Jacob, O'Leary and Atchadé for their very interesting paper and the many followups. They have opened up a new field of unbiased Markov chain Monte Carlo sampling that is widely applicable and helps convergence diagnostics and parallelization.
A question for future research is whether the expected coupling times scale well in high dimensional settings. In recent years, considerable progress has been made in the understanding of high dimensional scaling of the Metropolis-adjusted Langevin algorithm  and Hamiltonian Monte Carlo sampling  and Gibbs sampling (Qin and Hobert, 2018). It could be interesting to see whether couplings where the expected coupling time has similarly good dimension dependence can be constructed for such practically relevant algorithms.

Emilia Pompe and Chris Holmes (University of Oxford)
We congratulate Jacob, O'Leary and Atchadé on their impressive contribution, which has already stimu-lated further research in the area and will certainly inspire many other developments. The idea of running parallel computations for Markov chain Monte Carlo (MCMC) algorithms, which is sequential by nature, has recently been attracting much attention in the Bayesian community, e.g. Scott et al. (2016), Dai et al. (2019) and Rendell et al. (2018). We believe that the authors achieved a major breakthrough in this field ); we set the parameter of the PB as T D 10000 and c D 1); (b) density of the coupling times of the Markov chains; (c) violin plots obtained for a function g (see expression (4)) defined asα 1α2 (a product of co-ordinates) whereα :D arg max γ Σ n jD1 w j log{f γ .x j /} C Σ T jD1 w nCj log{f γ .x nCj /} and log.f γ / denotes the log-likelihood function (we compare results obtained with 200 samples drawn from the PB and where unbiased MCMC sampling is used at the first sampling stage, for various values of k and m used for defining the unbiased estimator (see Section 2.3); the parameter k was each time set to 0:1m; each experiment was repeated 100 times; we can see that the variance decreases dramatically as the values of k and m increase; it can also be observed that using the PB or the PB combined with unbiased MCMC sampling indeed corrects the VB approximation-after applying any of these procedures the results are closer to the true value of α 1 α 2 calculated for the posterior) by proposing a general framework for parallelizing computations, applicable to many MCMC algorithms .
Moreover, an important advantage of this method is applicability to models composed of modules Plummer, 2015;, where because of misspecification of the full model it may be beneficial to estimate parameters of these modules sequentially, as discussed in Section 5.5. The law of total expectation ensures that the estimators of integrals with respect to the 'cut distribution' in such models are unbiased. This can be useful more generally, when one wants to use different Monte Carlo techniques preserving unbiasedness, e.g. (standard) importance sampling at subsequent stages of inference. This remarkable property of the unbiased MCMC technique could also be useful beyond cut models as we now discuss.
We can extend unbiased MCMC to the area of Bayesian non-parametrics and the posterior bootstrap proposed by Lyddon et al. (2018) to address issues arising in model misspecification. Following the notation of algorithm 1 of Lyddon et al. (2018), suppose that we want to obtain an unbiased estimator of g.α/. In the current version of algorithm 1 this can be obtained via averaging g.α .i/ / since it is assumed that we can draw independent and identically distributed samples from γ .i/ ∼ π.γ|x 1:n / and x .i/ k ∼ f γ .i/ for k = n + 1, : : : , n + T . In many settings, however, sampling directly from one of those distributions (or both) is impossible, e.g. when the prior for γ is non-conjugate.
To construct an unbiased estimator of g.α/, we can follow the procedure described in Section 5.5 for θ 1 := γ and π 1 .γ/ ∼ π.γ|x 1:n /, An interesting line of research would be developing similar methodology for adaptive MCMC sampling where the proposal or target distributions are updated as the algorithm runs (Roberts and Rosenthal, 2009;Pompe et al., 2018). This is particularly important in the case of two-stage estimation procedures, where after obtaining θ .1/ 1 , : : : , θ .N 1 / 1 in the second stage we run N 1 chains, possibly in parallel. Since each chain targets a different distribution π 2 .θ 2 |θ .i/ 1 /, there are N 1 transition kernels to be tuned-this is why the efficiency of the algorithm could potentially be significantly improved by allowing for updating the proposal distributions on the fly.
Associating Markov chain Monte Carlo (MCMC) sampling with unbiasedness is indeed quite a challenge since MCMC algorithms very rarely produce simulations from the exact target, unless specific tools like renewal control can be exploited in an efficient manner. Although renewal events are often easy to produce, renewal control is too rare an occurrence to consider it as a generic convergence assessment method (Guihenneuc-Jouyaux and Robert, 1998;Robert, 1998). Although coupling is also at the base of perfect simulation methods (Kendall and Møller, 2000;Huber, 2004), the analogy between this debiasing technique and perfect sampling is difficult to fathom, since the coupling of two chains is not a perfect sampling instant, occurring instead much earlier. When discussing the implementation of coupling in Metropolis and Gibbs settings, the authors propose a most simple optimal coupling algorithm which we were not aware of: a form of accept-reject sampling also found in a similar way in perfect sampling. (Something that is only obvious in retrospect is that the variance of the resulting unbiased estimator is at least the variance of the original MCMC estimator.) We would, however, stress that unbiasedness is not the ultimate goal one should seek when running a Monte Carlo approximation, since the paper may otherwise give the impression of achieving a 'free-lunch' result. Reaching (exact) stationarity and exploring the posterior target within an allocated number of iterations was and remains the primary goal of an MCMC algorithm. Assessing both features is a considerable challenge that has not been solved by a foolproof technique, but coupling techniques have been proposed in the early days of MCMC methods towards this goal. In addition to , see, for example, Robert (1995Robert ( , 1998, Guihenneuc-Jouyaux and Robert (1998), Hobert et al. (2002), Roberts and Rosenthal (2001), Hobert and Robert (2004) or Qin and Hobert (2019). Although renewal approaches include a burn-in period, this part of the simulation can be recycled thanks to the authors' technique and added to the independent and identically distributed errors terms in the renewal representation of ergodic averages. However, coupling is clearly working in the settings examined therein, whereas renewal apparently does not. In toy examples like the Efron and Morris (1973) baseball data and the Gelfand and Smith (1990) pump failure data, the parameters k and m of the algorithm can even be optimized against the variance of the averaged averages, which proves a remarkable feat. In 'traditional' MCMC methods, it is standard to check that stationarity has been attained by running a small number of parallel chains, initiated at different starting points, to verify that the final distribution is independent of the initialization-even though the singleversus multiple-chain(s) debate erupted initially with Gelman and Rubin (1992) versus Geyer (1992).
As noted by the authors, a bad choice of the initial distribution p 0 can lead to poor properties. In essence, this occurs and remains undetected for the current proposal because the coupling of the chains occurs long before the chain reaches stationarity. We make two suggestions to alleviate this issue, and hence add a stationarity check as a by-product of the run.
(a) The chains X and Y need to have the same initial distribution, but different pairs of chains on different parallel cores can afford different initial distributions. The resulting estimator remains unbiased. We would therefore suggest that parallel chains be initiated from distributions which put weight on different parts of the parameter space. Ideas from the quasi-Monte-Carlo literature (see Gerber and Chopin (2015)) could be used here. (b) We also note that, although the marginal distributions of X and Y need to be identical, any joint distribution on .X, Y/ produces an unbiased algorithm. We would suggest that it is preferable that X and Y meet (shortly) after the chains have reached stationarity. Here is one possible strategy for this: let p and p be two distributions which put weight on different parts of the space, and Z ∼ Bernoulli. 1 2 /. If Z = 0, take X 0 ∼ p and Y 0 ∼ p ; otherwise take X 0 ∼ p and Y 0 ∼ p. The marginal distribution of both X 0 and Y 0 is 1 2 .p + p /, but the two chains will start in different parts of the parameter space and are likely to meet after they have both reached stationarity.
The ideal algorithm is one which gives a correct answer when it has converged, and a warning or error when it has not. MCMC chains which have not yet reached stationarity (e.g. because they have not found all modes of a multimodal distribution) can be difficult to detect. Here, this issue is more likely to be detected since it would lead to the coupling not occurring: E[τ ] is large, and this is a feature, since it warns the practitioner that their kernel is ill fitted to the target density.
Leah F. South and Chris Nemeth (Lancaster University) and Chris. J. Oates (Newcastle University, Newcastle upon Tyne, and Alan Turing Institute, London) Jacob, O'Leary and Atchadé are to be congratulated on an impressive and thought-provoking contribution to the field. Traditional Markov chain Monte Carlo (MCMC) sampling has benefitted from the development of gradient-based control variates (Assaraf and Caffarel, 1999;Barp et al., 2018;Mira et al., 2013;Oates et al., 2017;South et al., 2018), but it may be more difficult to design gradient-based control variates for unbiased MCMC sampling. Following the notation in the paper, letπ R .   (2019) (the following strategies were compared: (i) direct minimization ofσ.h g/ and (ii) minimization of the bound (5) on σ.h g/, with η, λ 1, γ 1 and π approximated with MCMC output; in each case g was a first-order Stein control variate (Mira et al., 2013;South et al., 2018) estimated by using the first bR=2c chains, whereas π.h/ was estimated by using the remaining R bR=2c chains so that the estimators remain unbiased; runs are based on k D 330, m D 3300 and R D 32; the empirical means from approach (ii) are subtracted for visualization; the median variance reduction factor by using approaches (i) and (ii) is approximately 20; however, approach (i) depended strongly on the numerical approach that was used to minimize the non-convex objective functionσ.h g/ 2 ; code to reproduce the experiment is provided at https://github.com/LeahPrice/debiasedhmc): (a) unbiased MCMC sampling; (b) approach (i); (c) approach (ii) variance from traditional MCMC sampling; existing gradient-based control variates can therefore be used (Belomestny et al., 2019;Mijatović and Vogrinc, 2018). However, at finite m and k the dependence of σ.h/ on h is far from explicit. One could use sample splitting to construct an approximation of the form and attempt to minimizeσ.h − g/. Alternatively, one could bound σ.h − g/ in terms of quantities that are independent of the Markov chain and then minimize the bound. One such bound is provided in the following result, stated for k = m = 0 for simplicity, which we do not claim to be in any sense optimal. Let assumptions 1-3 be satisfied, with η as in assumption 1 and C and δ as in assumption 2. Let π t be the law of X t and assume that λ := sup t 0 d TV .π, π t / < ∞. Let H be a reproducing kernel Hilbert space, with norm denoted ' · H ' and with kernel K : X × X → R satisfying K.x, x/ 1 for all x ∈ X . If |h| 2+η ∈ H then σ.h/ γ{π.|h| 2+η / + λ |h| 2+η where the positive constants γ and λ are h independent. The proof can be found in South et al. (2019). None of the three h-dependent quantities in the bound depend on the law of the Markov chain and thus minimization of the bound may be practical. The, in practice unknown, values of γ.δ/ and λ determine which of the three terms dominate the bound. Fig. 14 displays the variance reduction that is achieved in the 302-dimensional logistic regression example of . These results are encouraging, but more work is required to develop an understanding of gradient-based control variates for unbiased MCMC sampling.

Paul Vanetti and Arnaud Doucet (University of Oxford)
Bias correction for Markov chain Monte Carlo sampling is a long-standing problem, for which this paper represents a breakthrough. We propose two generalizations to derive new unbiased estimators.
First we propose the lagged estimator, obtained by starting chain Y after L steps of the chain X and coupling those two chains such that X t = Y t−L for all t τ where τ is the meeting time. For some N, k such that N k + L, we can then exploit the identity . 6/ Lags greater than 1 were used in  to improve probability metric bounds. The estimator (6) is similar to the time-averaged estimator in the paper under discussion when L = m − k, but the bias correction term that is incurred is not inflated by a large coefficient. Our empirical results (Table 11) x/ = N .x; 0, 1/, initialization π 0 .x 0 / = N .x 0 ; 0, 5 2 /, proposal q.x, x Å / = N .x Å ; x, 1/ and test function h.x/ = x 2 . σ is the standard deviation of a single estimate. Fig. 15. Sharing chains between pairs (the top three curves are three pairs; the X -chain of the second is used as the Y -chain for the first, and similarly the X -chain of the third for the Y -chain of the second; the final curve is the aggregate chain): , positive contributions to the total estimate; , negative contributions in a simple scenario suggest that estimator (8) can outperform the time-averaged estimator at a similar computational cost.
For L sufficiently large, where X L is approximately stationary, the bias correction term of estimator (8) can be interpreted as a removal of the burn-in, representing the difference between a stationary chain and the first iterations of a new chain. This motivates our second innovation, which is to use the X-chain of each pair of coupled chains as the Y -chain for another pair.
If we simulate R pairs .X .r/ , Y .r/ /, using the chain Y .r/ i = X .r+1/ i for r ∈ {1, : : : , R − 1}, and for r = R we take a novel chain (and not the X for any other pair), then averaging the estimates over the R pairs yields an estimator in which the negative bias correction terms for the first R − 1 pairs cancel with the first samples from the next pair. Thus the estimator is equivalent to that obtained by a single long chain with a lag choice of RL; see Fig. 15 for an illustration.
If we had used Y .R/ i = X .1/ i for the Rth pair, the resulting scheme would very closely match the parallel implementation of circular coupling . The framework that is provided in this discussion gives a new interpretation to this scheme, which was designed to provide 'states that all have close to the equilibrium distribution'. A natural question to ask is whether it yields unbiased estimates and, if this is not so, whether it can be modified to achieve exact unbiasedness.

Dootika Vats (Indian Institute of Technology Kanpur) and Galin L. Jones (University of Minnesota, Twin Cities)
We congratulate Jacob, O'Leary and Atchadé on considerably furthering the cause of unbiased estimation in Markov chain Monte Carlo sampling. Although there is often a need for the methods discussed in the paper, unbiased estimation is trivially achieved by successfully starting from stationarity. The authors mention coupling-based perfect simulation techniques as a means to draw exact samples from the target distribution, but we note that simple accept-reject samplers are useful in surprisingly many examples. Accept-reject samplers can be inefficient for high dimensional target distributions but, even there, in conjunction with linchpin variables (Archila, 2016;, reasonably efficient accept-reject samplers can be used to produce one exact draw from the target.
For example, consider the model in Section 5.2 used to study the nuclear pump failure data. The joint posterior density of λ = .λ 1 , : : : , λ K / and β can factorized as f.λ, β/ = f.λ|β/f.β/: Since it is easy to draw samples from f.λ|β/, β is a linchpin variable. Accept-reject samples from the univariate marginal posterior of β are easier to obtain than accept-reject samples from the .K + 1/dimensional joint posterior of .λ, β/. In fact, a simple and efficient accept-reject sampler that targets f.β/ is easy to construct. Samples from the joint posterior can then be obtained via sequential sampling.
Based on a similar linchpin construction, Jones et al. (2006) presented an efficient accept-reject sampler for the baseball batting averages data and model presented in section S3. If the number of covariates is small, efficient accept-reject samplers are also possible for the Bayesian lasso posterior in section S5 and the variable selection model in section 5.2. Archila (2016) presented a few other hierarchical models where efficient accept-reject samplers can be implemented via the linchpin variable approach.
In many situations, accept-reject sampling is too computationally burdensome for a full Monte Carlo procedure, but it may be able to provide one draw at a reasonable cost. When this draw is used as a starting value in a Markov chain, the usual Monte Carlo estimator of any expectation is trivially unbiased. This procedure is easily parallelizable and requires no burn-in. However, in settings where direct sampling is inefficient, the unbiased estimation techniques of Jacob, O'Leary and Atchadé are promising. on processor p has completion time C p .T/ = max.T , C p,1 / where C p,1 is the cost of the first estimator. The final estimator is the average P −1 Σ P p=1H p .T/. Its expected completion time C.P, T/ is the expectation of max{C 1 .T/, : : : , C P .T/}, and we take its precision α.P, T/ to be the inverse of its variance.
(a) When we set the time budget T to 6 s, the estimators take on average between 6 and 9 s to complete, with a visible increase with P. If we had set the budget to 0, we would have still observed approximately the same expected completion times. Thus, in that regime increasing the number of processors increases the run time. When we set the budget to 12 or 24 s, the expected completion times become nearly constant and equal to the budget; the actual increase is logarithmic in P. (b) The precision increases linearly with P, on every row of the table, but there are some noticeable Monte Carlo variations. By comparing two consecutive cells of any row the precision approximately doubles whereas the completion time increases very slowly. When T is sufficiently large, doubling the number of processors doubles the precision while keeping the expected completion time essentially fixed. (c) In contrast, if we compare for example T = 24 and P = 512, and T = 12 and P = 1024, we see that the run time is indeed halved, but that some precision is lost. We believe that this corresponds to Amdahl's law in Wilkinson's discussion. This effect would eventually disappear as the budget T increases.
Similarly, we believe that comparisons between unbiased MCMC estimators and 'one long run' of an MCMC algorithm would be most informative with adequate amounts of context, including the number of processors, budget constraints, the desired precision, the choice of k and m for the proposed estimators and the choice of burn-in for the MCMC algorithm.

Theory on meeting times
Several discussants highlight the importance of understanding the behaviour of the meeting time in the estimators proposed. Indeed the meeting time drives both the cost of the estimator and its variance via the bias correction term.
A few results on the meeting times or closely related objects already exist. In their comment, Lee, Singh and Vihola summarize their analysis of couplings of conditional particle filters, including precise results on the effect of the number of observations and particles on the meeting times. Paulin's comment recalls that much progress has been made on the understanding of standard MCMC methods in recent years; we would like to add that many of these works have employed coupling strategies, such as Diaconis et al. (2010) and Qin and Hobert (2019) for Gibbs samplers, or  and  for Hamiltonian Monte Carlo sampling.
We hope that similar techniques will provide a precise understanding of the cost and variance of the proposed estimators in the future. More discussion in the context of Hamiltonian Monte Carlo sampling can be found in . We also note that the rich literature around coupling from the past  contains theoretical results on meeting times. For example, Collevecchio et al. (2018) found the asymptotic distribution of a rescaled meeting time to be Gumbel in the context of a Fortuin-Kasteleyn model.

Variance reduction
It is encouraging to see the enthusiasm and creativity of the discussants around variance reduction tech-niques. Indeed these techniques directly improve the estimators proposed and the discussants have managed to come up with control variates, Rao-Blackwellization and antithetic variables.
Vanetti and Doucet propose a generalization of the proposed estimators by using a lag L 1 between the chains. In this approach, one samples X 0 , X 1 , : : : , X L by using the original MCMC algorithm with kernel P, Y 0 from π 0 , and then recursively X t+1 and Y t−L+1 given X t and Y t−L fromP. The meeting time τ can be defined as inf{t 1 : X t = Y t−L }. Generalizing equation (2.1) in the paper, we write the lagged, time-averaged estimator as Using a value of L that is larger than 1 is always possible and can yield a large improvement in efficiency. We thank Vanetti and Doucet for this important generalization. We encourage users to implement such lagged estimators and to experiment with the choice of L. Craiu and Meng propose an equally general variance reduction technique in the form of a control variate exploiting the fact that X t and Y t have the same marginal distribution for any fixed t. We look forward to their upcoming paper on the topic. Sherlock proposes a Rao-Blackwellization method for coupled Metropolis-Hastings chains, which leverages an estimate of the probability of coupling at the next step conditionally on the current states. South, Nemeth and Oates propose gradient-based control variates-adapted from those recently developed for standard MCMC estimators (South et al., 2018)with promising preliminary results in a logistic regression setting. Finally Chak, Kantas and Pavliotis envision an approach that is reminiscent of antithetic variables, where different pairs of Markov chains would be constructed to produce a negative correlation in the resulting unbiased estimators. Such couplings might be generated by adapting techniques from Nüsken and Pavliotis (2019).

Cost reduction
With bias out of the picture, the variance and average cost of the estimators proposed constitute their most notable performance characteristics. Thus, it is important to develop a better understanding of how to reduce the cost as well as the variance. Selecting the most efficient MCMC kernel available for the problem at hand is obviously recommended. Then, for a fixed MCMC kernel P, reducing the cost amounts to improving the coupling strategy.
Chopin provides numerical results in the case of the Gibbs sampler of Albert and Chib (1993), emphasizing that various strategies are possible for each update of the Gibbs sampler. In more complicated Gibbs samplers, the possibilities quickly multiply, and thus it would be useful to develop general guidelines for users. In our experience, 'common random numbers' provide a good starting point, since these are simple to implement and often yield chains that contract towards one another to at least some extent. Generating enough contraction to bring the chains close together is the main challenge, since other techniques, such as the maximal couplings that were used extensively in this paper, can be used to obtain exact meetings when the chains are already close; see  for example.
The parameterization of a Gibbs sampler will also certainly impact not only its marginal performance (Papaspiliopoulos and Roberts, 2008) but also the implementation and performance of coupling strategies. The comment of Gerber and Lee regarding parameterization with Bernstein-von Mises heuristics is helpful in that regard, as well as the comment of Vats and Jones regarding the potential presence of low dimensional linchpin variables.

New applications
Some comments pointed out the potential use of the proposed estimators in new applications, which we find exciting.
We look forward to reading more about the work described by Jasra, Law and Heng, and note that it will be an interesting methodological application of multimarginal couplings. There might be connections with the comment of Chak, Kantas and Pavliotis where anticorrelations between pairs of chains would result from four-way couplings.
The application described in the comment by Pompe and Holmes, following up on the 'cut distribution' of Section 5.5, is very exciting to us as it leverages another appeal of unbiased estimators, beyond parallel computation. We agree with Pompe and Holmes that it would be useful to obtain unbiased estimators from adaptive MCMC methods, but at this point this appears to be a challenging open question.
The discussion of Gu raises the question of the applicability of the proposed framework to missing data and latent variable models. We refer to the comments of Lee, Singh and Vihola and of Middleton, Deligiannidis and Doucet, as well as references therein, which concern the smoothing problem in hidden Markov models or state space models. These provide examples of unbiased estimation strategies that have proved effective in high dimensional latent variable models.

Bayesian computation
In response to the comments of Chai and Gu, we emphasize that the methods that are considered in the paper are not specifically Bayesian, in the sense that they do not require the conceptualization of unknown quantities as random variables on which prior distributions would be specified. For Bayesian methods in numerical analysis, see Diaconis (1988), Bernardo et al. (1992) or more recently Cockayne et al. (2019). The term 'Bayesian computation' can refer either to computational methods that are used in Bayesian statistics or to Bayesian methods to perform computation, which seems to foster some confusion. This paper concerns non-Bayesian computational methods which might be useful in Bayesian statistics and elsewhere.

Initialization
The discussion of Ryder, Clarté, Hairault, Lawless and Robert proposes an interesting strategy for initializing the chains. We share their view that large meeting times can sometimes indicate convergence issues with the underlying Markov kernel, which we have experienced ourselves, and can indeed be seen as a feature rather than a bug.
We note that using different initial distributions for different runs result in estimators that are unbiased but not identically distributed, and thus care would be advised for the construction of confidence intervals. In contrast, π 0 can be chosen to be a mixture of various components and X 0 and Y 0 can indeed be drawn from any coupling of π 0 with itself. Thus one could for instance maximize the probability that X 0 and Y 0 are drawn from different components of the same mixture.
The lagged estimator that is proposed in the comment of Vanetti and Doucet has implications for the selection of an initialization strategy. As the lag increases, the issue of small meeting times that can occur when two chains start nearby will tend to disappear. Thus users might find lagged estimators with L > 1 to be safer in general, as was found for the estimation of total variation upper bounds in .

Perfect sampling
Various discussants noted the connections and distinctions between our method and perfect samplers, which often rely on couplings of Markov chains . As noted in Robert, Clarté, Hairault, Lawless and Ryder the relationship between perfect sampling and unbiased estimation is indeed elusive. It is unclear whether the machinery of this paper bring us any closer to perfect sampling.
Vats and Jones remind us that perfect samplers might be easier to implement than commonly thought. Similarly, we can wonder why regeneration approaches such as  are not used more often. There might remain an important gap between the generic applicability of MCMC algorithms such as Metropolis-Hastings algorithms and that of current perfect samplers. Perhaps the work advertised by Wang, Pollock, Roberts and Steinsaltz will help to close that gap. We speculate that, even if perfect samplers were to become fully generic, there would still be room for MCMC strategies, unbiased or not, if they come at a smaller computational cost.