Quasi‐stationary Monte Carlo and the ScaLE algorithm

This paper introduces a class of Monte Carlo algorithms which are based on the simulation of a Markov process whose quasi‐stationary distribution coincides with a distribution of interest. This differs fundamentally from, say, current Markov chain Monte Carlo methods which simulate a Markov chain whose stationary distribution is the target. We show how to approximate distributions of interest by carefully combining sequential Monte Carlo methods with methodology for the exact simulation of diffusions. The methodology introduced here is particularly promising in that it is applicable to the same class of problems as gradient‐based Markov chain Monte Carlo algorithms but entirely circumvents the need to conduct Metropolis–Hastings type accept–reject steps while retaining exactness: the paper gives theoretical guarantees ensuring that the algorithm has the correct limiting target distribution. Furthermore, this methodology is highly amenable to ‘big data’ problems. By employing a modification to existing naive subsampling and control variate techniques it is possible to obtain an algorithm which is still exact but has sublinear iterative cost as a function of data size.


Introduction
Advances in methodology for the collection and storage of data have led to scientific challenges and opportunities in a wide array of disciplines. This is particularly so in statistics as the complexity of appropriate statistical models often increases with data size. Many current state of the art statistical methodologies have algorithmic cost that scales poorly with increasing volumes of data. As noted by Jordan (2013), 'many statistical procedures either have unknown runtimes or runtimes that render the procedure unusable on large-scale data' and this has resulted in a proliferation in the literature of methods ': : : which may provide no statistical guarantees and which in fact may have poor or even disastrous statistical properties'. This is particularly keenly felt in computational and Bayesian statistics, in which the standard computational tools are Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods and their many variants (see for example Robert and Casella (2004)). MCMC methods are exact in the (weak) sense that they construct Markov chains which have the correct limiting distribution. Although MCMC methodology has had considerable success in being applied to a wide variety of substantive areas, they are not well suited to this new era of 'big data' as their computational cost will increase at least linearly with the number of data points. For example, each iteration of the Metropolis-Hastings algorithm requires an evaluation of the likelihood, the calculation of which, in general, scales linearly with the number of data points. The motivation behind the work presented in this paper is developing Monte Carlo methods that are exact, in the same sense as MCMC methods, but that have a computational cost per effective sample size that is sublinear in the number of data points.
To date, the success of methods that aim to adapt MCMC to reduce its algorithmic cost has been mixed and has invariably led to a compromise on exactness-such methodologies generally construct a stochastic process with limiting distribution which is (at least hopefully) close to the desired target distribution. Broadly speaking these methods can be divided into three classes of approach: 'divide-and-conquer' methods, 'exact subsampling' methods and 'approximate subsampling' methods. Each of these approaches has its own strengths and weaknesses which will be briefly reviewed in the following paragraphs.
Divide-and-conquer methods (for instance, Neiswanger et al. (2014), ,  and Minsker et al. (2014)) begin by splitting the data set into a large number of smaller data sets (which may or may not overlap). Inference is then conducted on these smaller data sets and resulting estimates are combined in some appropriate manner. A clear advantage of such an approach is that inference on each small data set can be conducted independently, and in parallel, and so if one had access to a large cluster of computing cores then the computational cost could be significantly reduced. The primary weakness of these methods is that the recombination of the separately conducted inferences is inexact. All current theory is asymptotic in the number of data points, n (Neiswanger et al., 2014;Li et al., 2017). For these asymptotic regimes the posterior will tend to a Gaussian distribution (Johnson, 1970), and it is questionable whether divide-and-conquer methods offer an advantage over simple approaches such as a Laplace approximation to the posterior . Most results on convergence rates (e.g. Srivastava et al. (2016)) have rates that are of order O.m −1=2 /, where m is the number of data points in each subset. As such they are no stronger than convergence rates for analysing just a single batch. One exception is in Li et al. (2017), where convergence rates of O.n −1=2 / are obtained, albeit under strong conditions. However, these results relate only to estimating marginal posterior distributions, rather than the full posterior.
Subsampling methods are designed so that each iteration requires access to only a subset of the data. Exact approaches in this vein typically require subsets of the data of random size at each iteration. One approach is to construct unbiased estimators of pointwise evaluations of the target density by using subsets of the data, which could then be embedded within the pseudomarginal MCMC framework developed by Andrieu and Roberts (2009). Unfortunately, the construction of such positive unbiased estimators is not possible in general (Jacob and Thiéry, 2015) and such methods often require both bounds on, and good analytical approximations of, the likelihood (Maclaurin and Adams, 2015).
More promising practical results have been obtained by approximate subsampling approaches. These methods use subsamples of the data to estimate quantities such as acceptance probabilities (Nicholls et al., 2012;Korattikara et al., 2014;Bardenet et al., 2014), or the gradient of the posterior, that are used within MCMC algorithms. These estimates are then used in place of the true quantities. Although this can lead to increases in computational efficiency, the resulting algorithms no longer target the true posterior. The most popular of these algorithms is the stochastic gradient Langevin dynamics algorithm of Welling and Teh (2011). This approximately samples a Langevin diffusion which has the posterior as its stationary distribution. To do this requires first approximating the continuous time diffusion by a discrete time Markov process, and then using subsampling estimates of the gradient of the posterior within the dynamics of this discrete time process. This idea has been extended to approximations of other continuous time dynamics that target the posterior (Ahn et al., 2012;Chen et al., 2014;Ma et al., 2015).
Within these subsampling methods it is possible to tune the subsample size, and sometimes the algorithm's step size, to control the level of approximation. This leads to a trade-off, whereby increasing the computational cost of the algorithm can lead to samplers that target a closer approximation to the true posterior. There is also substantial theory quantifying the bias in, say, estimates of posterior means, that arise from these methods Vollmer et al., 2016;Chen et al., 2015;Huggins and Zou, 2017;Dalalyan and Karagulyan, 2019), and how this depends on the subsample size and step size. However, although they often work well in practice it can be difficult to know just how accurate the results are for any given application. Furthermore, many of these algorithms still have a computational cost that increases linearly with data size Nagapetyan et al., 2017;Baker et al., 2019).
The approach to the problem of big data that is proposed here is a significant departure from the current literature. Rather than building our methodology on the stationarity of appropriately constructed Markov chains, a novel approach based on the quasi-limiting distribution of suitably constructed stochastically weighted diffusion processes is developed. A quasi-limiting distribution for a Markov process X with respect to a Markov stopping time ζ is the limit of the distribution of X t |ζ > t as t → ∞, and such distributions are automatically quasi-stationary distributions; see ; this concept is completely unrelated to the popular area of quasi-Monte-Carlo methods. These quasi-stationary Monte Carlo (QSMC) methods that have been developed can be used for a broad range of Bayesian problems (of a similar type to MCMC methods) and exhibit interesting and differing algorithmic properties. The QSMC methods developed are exact in the same (weak) sense as MCMC methods, in that they give the correct (quasi-)limiting distribution. There are a number of possible implementations of the theory which open up interesting avenues for future research, in terms of branching processes, by means of stochastic approximation methods, or (as outlined in this paper) SMC methods. We note that the use of continuous time SMC and related algorithms to obtain approximations of large time limiting distributions of processes conditioned to remain alive has also been explored in settings in which a quantity of interest admits a natural representation of this form (see Del Moral and Miclo (2003), Rousset (2006) and related work in the physics literature, such as Giardina et al. (2011) and references therein); a substantial difference between these and the present work is that the QSMC methods that are described here construct a process for which quite a general distribution of interest is the quasi-stationary distribution and entirely avoid time discretization errors. One particularly interesting difference between our class of Monte Carlo and MCMC algorithms is that QSMC methods enable us to circumvent entirely the Metropolis-Hastings-type accept-reject steps, while still retaining theoretical guarantees that the correct limiting target distribution is recovered. In the case of big data problems, this removes one of the fundamental O.n/ bottlenecks in computation.
QSMC methods can be applied in big data contexts by using a novel subsampling approach. We call the resulting algorithm ScaLE, for scalable Langevin exact algorithm. The name refers to the 'Langevin' diffusion which is used in the mathematical construction of the algorithm, although it should be emphazised that it is not explicitly used in the algorithm itself. As shown in Section 4, the approach to subsampling that is adopted here can potentially decrease the computational complexity of each iteration of a QSMC algorithm to be O.1/. Furthermore, for a rejection sampler implementation of QSMC, the use of subsampling introduces no additional error-as the rejection sampler will sample from the same stochastic process, a killed Brownian motion, regardless of whether subsampling is used or not. There can be a computational cost of using subsampling, as the number of rejection sampler iterations that are needed to simulate the killed Brownian motion for a given time interval will increase. However, this paper will show that, by using control variates  to reduce the variability of subsampling estimators of features of the posterior, the on-going algorithm computational cost will be O.1/. Constructing the control variates involves a preprocessing step whose cost is O.n/ (at least in the case of posterior contraction at rate n −1=2 ) but after this preprocessing step the resulting cost of ScaLE per effective sample size can be O.1/. The importance of using control variates to obtain a computational cost that is sublinear in n is consistent with other recent work on scalable Monte Carlo methods (Huggins and Zou, 2017;Quiroz et al., 2016;Dubey et al., 2016;Nagapetyan et al., 2017;Baker et al., 2019).
The next section presents the main result that motivates development of QSMC methods. The following sections then provide detail on how to implement QSMC algorithms in practice, and how and why they are amenable to use with subsampling ideas. For clarity of presentation, much of the technical and algorithmic detail has been suppressed but can be found in the appendices.

Quasi-stationary Monte Carlo
Given a target density π on R d , traditional (i.e. Metropolis-Hastings type) MCMC methods propose, at each iteration from Markov dynamics with proposal density q.x, y/, 'correcting' its trajectory by either accepting the move with probability α.x, y/ = min 1, π.y/q.y, x/ π.x/q.x, y/ , .1/ or rejecting the move and remaining at state x. In QSMC sampling, rather than rejecting a move and staying at x, the algorithm kills the trajectory entirely, according to probabilities which relate to the target density. Simulation of a Markov process with killing inevitably leads to death of the process. Thus it is natural to describe the long-term behaviour of the process through its conditional distribution given that the process is still alive. The limit of this distribution is called the quasi-stationary distribution (see, for example, ). The idea of QSMC is to construct a Markov process whose quasi-stationary distribution is the distribution, π.x/, from which the user wishes to sample. Simulations from such a process can then be used to approximate expectations with respect to π.x/ just as in MCMC sampling.
Although in principle QSMC methods can be used with any Markov process, this paper will work exclusively with killed Brownian motion as it has some convenient properties that can be exploited. Therefore let {X t , t 0} denote d-dimensional Brownian motion initialized at X 0 = x 0 . Suppose that κ.x/ denotes a non-negative hazard rate at which the Brownian motion is killed when it is in state x, and let ζ be the killing time itself. Finally define μ t .dx/ := P.X t ∈ dx|ζ > t/ : .2/ the distribution of X t given that it has not yet been killed. The limit of this distribution as t → ∞ is the quasi-stationary distribution of the killed Brownian motion. The aim will be to choose κ in such a way that μ t converges to π and, with this in mind, we introduce the function φ : R d → R: where ' · ' denotes the usual Euclidean norm and Δ the Laplacian operator. By further imposing the following condition the first theorem can be proved.
Theorem 1. Under the regularity conditions (26) and (27) in Appendix A, suppose that condition 1 holds and set .4/ then it follows that μ t converges in L 1 and pointwise to π.
For a proof of theorem 1, see Appendix A.
Note that the regularity conditions in Appendix A are largely technical smoothness and other weak regularity conditions that are common in stochastic calculus. In contrast, condition 1 is necessary for us to be able to construct QSMC methods. However, since non-pathological densities on R d are typically convex in the tails, the second identity in expression (3) demonstrates that condition 1 is almost always satisfied in real examples.
Theorem 1 can be exploited for statistical purposes by noting that, for some sufficiently large t Å , μ t ≈ π for t > t Å . Thus, given samples from μ t for t > t Å , one would have an (approximate) sample from π. This is analogous to MCMC sampling, with t Å being the burn-in period, the only difference being the need to simulate from the distribution of the process conditionally on its not having died.
The next two sections describe how to simulate from μ t . Firstly a description of how to simulate a killed Brownian motion process exactly in continuous time is provided. A naive approach to sample from μ t is to simulate independent realizations of this killed Brownian motion, and to use the values at time t of those processes which have not yet died by time t. In practice this is impracticable, as the probability of survival will, in general, decay exponentially with t. To overcome this SMC methods are employed.
Both these steps introduce additional challenges that are not present within standard MCMC problems. Thus a natural question is why use QSMC methods at all? This is addressed in Section 4 where it is shown that simulating the killing events can be carried out using just subsamples of data. In fact subsamples of size 2 can be used without introducing any approximation into the dynamics of the killed Brownian motion.

Simulating killed Brownian motion
Theorem 1 relates a target distribution of interest to the quasi-stationary distribution of an appropriate killed Brownian motion. To be able to simulate from this quasi-stationary distribution it is necessary to be able to simulate from killed Brownian motion.
To help to convey the main ideas, first consider the case where the killing rate κ.x/ is bounded above by some constant, K say. In this case it is possible to use thinning (see, for example, Kingman (1992)) to simulate the time at which the process will die. This involves simulating the Brownian motion independently of a Poisson process with rate K. Each event of the Poisson process is a potential death event, and an appropriate Bernoulli variable then determines whether or not the death occurs. For an event at time ξ the probability that death occurs depends on the state of the Brownian motion at time ξ and is equal to κ.x ξ /=K. Thus to simulate the killed Brownian motion to time t the first step is to simulate all events in the Poisson process up to time t. Then, by considering the events in time order, it is straightforward to simulate the Brownian motion at the first event time and as a result to determine whether death occurs. If death does not occur, the next event time can be considered. This is repeated until either the process dies or the process has survived the last potential death event in [0, t]. If the latter occurs, Brownian motion can be simulated at time t without any further conditions. This can be viewed as a rejection sampler to simulate from μ t .x/, the distribution of the Brownian motion at time t conditional on its surviving to time t. Any realization that has been killed is 'rejected' and a realization that is not killed is a draw from μ t .x/. It is easy to construct an importance sampling version of this rejection sampler. Assume that there are k events in the Poisson process before time t, and these occur at times ξ 1 , : : : , ξ k . The Brownian motion path is simulated at each event time and at time t. The output of the importance sampler is the realization at time t, x t , together with an importance sampling weight that is equal to the probability that the path survives each potential death event: Given a positive lower bound on the killing rate, κ.x/ K ↓ for all x, it is possible to improve the computational efficiency of the rejection sampler by splitting the death process into a death process of rate K ↓ and one of rate κ.x/ − K ↓ . Actual death occurs at the first event in either of these processes. The advantage of this construction is that the former death process is independent of the Brownian motion. Thus it is possible first to simulate whether or not death occurs in this process. If it does not we can then simulate, using thinning as above, a killed Brownian motion with rate κ.x/ − K ↓ . The latter will have a lower intensity and thus be quicker to simulate. Using the importance sampling version instead, events in a Poisson process of rate K − K ↓ , ξ 1 , : : : , ξ k say, are simulated, and our realization at time t is assigned a weight This is particularly effective as the exp.−K ↓ t/ term is a constant which will cancel on normalization of the importance sampling weights.

Simulating killed Brownian motion by using local bounds
The approach in Section 3.1 is not applicable in the absence of an upper bound on the killing rate. Even in situations where a global upper bound does exist, the resulting algorithm may be inefficient if this bound is large. Both of these issues can be overcome by using local bounds on the rate. For this section we shall work with the specific form of the killing rate in theorem 1, namely φ.x/ − Φ. The bounds that are used will be expressed in terms of bounds on φ.x/.
Given an initial value for the Brownian motion, x 0 , define a hypercube which contains x 0 . In practice this cube is defined to be centred on x 0 with a user-chosen side length (which may depend on x 0 ). Denote the hypercube by H 1 , and assume that upper and lower bounds U .1/ X and L .1/ X respectively can be found for φ.x/ with x ∈ H 1 . The thinning idea of the previous section can be used to simulate the killed Brownian motion while the process stays within H 1 . Furthermore it is possible to simulate the time at which the Brownian motion first leaves H 1 and the value of the process when this happens (see Appendix C). Thus our approach is to use our local bounds on φ.x/, and hence on the killing rate, to simulate the killing process while x remains in H 1 . If the process leaves H 1 before t it is then necessary to define a new hypercube, H 2 say, to obtain new local bounds on φ.x/ for x ∈ H 2 and to repeat simulating the killing process by using these new bounds until the process either first leaves the hypercube or time t is reached.
The details of this approach are now given, describing the importance sampling version which is used later-though a rejection sampler can be obtained by using similar ideas. The first step is to calculate the hypercube H 1 and the bounds L .1/ X and U .1/ X . We then simulate the time and position at which x first leaves H 1 . We call this the layer information and denote it as R .1/ X = .τ 1 , x τ 1 /. The notion of a layer for diffusions was formalized in Pollock et al. (2016), and we refer the interested reader there for further details. Next the possible killing events on [0, t ∧ τ 1 ] are generated by simulating events of a Poisson process of rate U .1/ X − L .1/ X : ξ 1 , : : : , ξ k say. The next step involves simulating the values of the Brownian motion at these event times (the simulation of which is conditional on R .1/ X -see Appendix C.2 and algorithm 5 there for a description of how this can be done). An incremental importance sampling weight for this segment of time is given as . 5/ If τ 1 < t, then this process is repeated with a hypercube centred on x τ 1 until simulation to time t has been achieved. This gives successive incremental weights W .2/ , W .3/ , : : :. A simulated value for the Brownian motion at time t is given, again simulated conditionally on the layer information for the current segment of time, and an importance sampling weight that is the product of the incremental weights associated with each segment of time. At time t, J.t/ incremental weights have been simulated, leading to the cumulative weight W .j/ : . 6/ Full algorithmic details of the description above are given in algorithm 1. In practice every sample X t will have an importance weight that shares a common constant of exp.Φt/ in equation (6). As such it is omitted from algorithm 1 and the weights are asterisked to denote this. It is straightforward to prove that this approach gives valid importance sampling weights in the following sense.
Proof. First note that, by direct calculation of its Doob-Meyer decomposition conditionally on X[0, T ], W t exp{ t 0 φ.X s /ds} is a martingale; see for example Revuz and Yor (2013). Therefore E[W t |X[0, T ]] exp{ t 0 φ.X s /ds} = 1 and the result follows.

Simulating from the quasi-stationary distribution
In theory we can use our ability to simulate from μ t .x/, using either rejection sampling to simulate from the quasi-stationary distribution of our killed Brownian motion, or importance Algorithm 1. Importance sampling killed Brownian motion algorithm 1. Initialize: input initial value X 0 , and time interval length t. Set i = 1, j = 0, i/ X / and return to step 3 sampling to approximate this distribution. We would need to specify a 'burn-in' period of length t Å say, as in MCMC sampling and then to simulate from μ t AE .x/. If t Å is chosen appropriately these samples would essentially be draws from the quasi-stationary distribution. Furthermore we can propagate these samples forward in time to obtain samples from μ t .x/ for t > t Å , and these would, marginally, be draws from essentially the quasi-stationary distribution.
However, in practice this simple idea is unlikely to work. We can see this most clearly with the rejection sampler, as the probability of survival will decrease exponentially with t-and thus the rejection probability will often be prohibitively large.
Various approaches have been suggested to overcome the inefficiency of this naive approach to simulating from a quasi-stationary distribution (see for example de Oliveira and Dickman (2005), , and the recent rebirth methodology of ). Our approach is to use ideas from SMC methods. In particular, we shall discretize time into m intervals of length T=m for some chosen T and m. Defining t i := iT=m for i = 1, : : : , m, we use our importance sampler to obtain an N-sample approximation of μ t 1 .x/; this will give us N particles, i.e. realizations of x t 1 , and their associated importance sampling weights. We normalize the importance sampling weights and calculate the empirical variance of these normalized weights at time t 1 . If this is sufficiently large we resample the particles, by simulating N times from the empirical distribution that is defined by the current set of weighted particles. If we resample, we assign each of the new particles a weight 1=N.
The set of weighted particles at time t 1 is then propagated to obtain a set of N weighted particles at time t 2 . The new importance sampling weights are just the weights at time t 1 , before propagation, multiplied by the (incremental) importance sample weight calculated when propagating the particle from time t 1 to t 2 . The above resampling procedure is applied, and this whole iteration is repeated until we have weighted particles at time T . This approach is presented as the QSMC algorithm in algorithm 2 in which N eff is the effective sample size (ESS) of the weights (Kong et al., 1994), a standard way of monitoring the variance of the importance sampling weights within SMC sampling, and N th is a user-chosen threshold which determines whether or not to resample. The algorithm outputs the weighted particles at the end of each iteration.
Given the output from algorithm 2, the target distribution π can be estimated as follows. After choosing a burn-in time, t Å .∈ .t 0 , : : : , t m //, sufficiently large to provide reasonable confidence that quasi-stationarity has been 'reached', the approximation to the law of the killed process is then simply the weighted occupation measures of the particle trajectories in the interval [t Å , T ]. More precisely, using the output of the QSMC algorithm, .7/ For concreteness, for a suitable L 1 .π/ function g, the Monte Carlo estimator can simply be set to .8/ The general (g-specific) theoretical ESS is given by var π .g/=var{ π.g/}. Practical approximation of ESS is discussed in Appendix I.

Subsampling
We now return to the problem of sampling from the posterior in a big data setting and assume that we can write the target posterior as where f 0 .x/ is the prior and f 1 .x/, : : : , f n .x/ are likelihood terms. To be consistent with our earlier notation x refers to the parameters in our model. The assumption of this factorization is quite weak and includes many classes of models exhibiting various types of conditional independence structure.
It is possible to sample from this posterior by using algorithm 2 by choosing φ.x/, and hence κ.x/, which determines the death rate of the killed Brownian motion, as defined in expressions (3) and (4) respectively. In practice this will be computationally prohibitive as at every potential death event we determine acceptance by evaluating φ.x/, which involves calculating derivatives of the log-posterior, and so requires accessing the full data set of size n. However, it is easy to estimate φ.x/ unbiasedly by using subsamples of the data as the log-posterior is a sum over the different data points. Here we show that we can use such an unbiased estimator of φ.x/ while still simulating the underlying killed Brownian motion exactly.

Simulating killed Brownian motion with an unbiased estimate of the killing rate
To introduce the approach proposed we begin by assuming that we can simulate an auxiliary random variable A ∼ A, and (without loss of generality) construct a positive unbiased estimator κ A .·/ such that .10/ The approach relies on the following simple result which is stated in a general way as it is of independent interest for simulating from events of probability which are expensive to compute but that admit a straightforward unbiased estimator. Its proof is trivial and will be omitted.
Proposition 1. Let 0 p 1, and suppose that P is a random variable with E[P] = p and 0 P 1 almost surely. Then, if u ∼ U[0, 1], the event {u P} has probability p.
We now adapt this result to our setting, noting that the randomness that is obtained by direct simulation of a p-coin, and that using proposition 1, is indistinguishable.
Recall that, in Section 3.1 to simulate a Poisson process of rate κ, Poisson thinning was used. The initial step is first to find, for the Brownian motion trajectory constrained to the hypercube H, a constant K X ∈ R + such that ∀ x ∈ H, κ.x/ K X holds. Then a dominating Poisson process of rate K X is simulated to obtain potential death events, and then in sequence each potential death event is accepted or rejected. A single such event, occurring at time ξ, will be accepted as a death with probability κ.x ξ /=K X .
An equivalent formulation would simulate a Poisson process of rate κ by using a dominating Poisson process of higher rateK X K X . This is achieved by simply substituting K X forK X in the argument above. However, the penalty for doing this is an increase in the expected computational cost by a factor ofK X =K X -therefore it is reasonable to expect to have a larger number of potential death events, each of which will have a smaller acceptance probability. Now, suppose for our unbiased estimatorκ A that it is possible to identify someK X ∈ R + such that for A almost all A, and all x ∈ H, 0 κ.x/ K X . Noting from equation (10) that we have an unbiased [0, 1]-valued estimator of the probability of a death event in the above argument (i.e. E A [κ A .x/=K] = κ.x/=K) and, by appealing to proposition 1, another (entirely equivalent) formulation of the Poisson thinning argument above is to use a dominating Poisson process of rateK X , and to determine acceptance or rejection of each potential death event by simulating A ∼ A and accepting with probabilityκ A .x ξ /=K (instead of κ.x ξ /=K).
In the remainder of this section we exploit this extended construction of Poisson thinning (using an auxiliary random variable and unbiased estimator), to develop a scalable alternative to the QSMC approach that was introduced in algorithm 2. The key idea in doing so is to find an auxiliary random variable and unbiased estimator which can be simulated and evaluated without fully accessing the data set, while ensuring that the increased number of evaluations that is necessitated by the ratioK X =K X 1 does not grow too severely.

Constructing a scalable replacement estimator
Noting from expressions (3) and (4) that the selection of κ.x/ required to sample from a posterior π.x/ is determined by φ.x/, in this section we focus on finding a practical construction of a scalable unbiased estimator for φ.x/. Recall that φ.x/ := . ∇ log{π.x/} 2 + Δ log{π. φ.x/ U .i/ X . As motivated by Section 4.1, it is then possible to construct an auxiliary random variable A ∼ A, and an unbiased estimator φ A such that .12/ and to determine constantsŨ i/ X such that within the same hypercube we haveL To ensure the validity of our QSMC approach, as justified by theorem 1 in Section 3.3, it is necessary to substitute condition 1 with the following (similarly weak) condition.
To ensure practicality and scalability it is crucial to focus on ensuring that the ratiõ i/ X , does not grow too severely with the size of the data set (as this determines the multiplicative increase in the rate, and hence increased inefficiency, of the dominating Poisson process required within algorithm 2). To do this, our approach develops a tailored control variate, of a similar type to that which has since been successfully used within the concurrent work of two of the authors on MCMC methods (see ).
To implement the control variate estimator, it is first necessary to find a point that is close to a mode of the posterior distribution π, denoted byx. In fact, for the scaling arguments to hold,x should be within O.n −1=2 / of the true mode, and achieving this is a less demanding task than actually locating the mode. Moreover we note that this operation is required to be done only once, and not at each iteration, and so can be done fully in parallel. In practice it would be possible to use a stochastic gradient optimization algorithm to find a value that is close to the posterior mode, and we recommend then starting the simulation of our killed Brownian motion from this value, or from some suitably chosen distribution centred at this value. Doing this substantially reduces the burn-in time of our algorithm. In the following section we describe a simpler method that is applicable when two passes of the full data set can be tolerated in the algorithm's initialization.
Addressing scalability for multimodal posteriors is a more challenging problem, and goes beyond what is addressed in this paper, but is of significant interest for future work. We do, however, make the following remarks. In the presence of multimodality, stochastic gradient optimization schemes may converge to the wrong mode. This is still sufficiently good as long as possible modes are separated by a distance which is O.n −1=2 /; when separate modes which are separated by more than O.n −1=2 / exist, an interesting option would be to adopt multiple control variates.
Remembering that log{π.x/} = Σ n i=0 log{f i .x/} and letting A be the law of I ∼ U{0, : : : , n}, our control variate estimator is constructed thus: . 14/ As such, φ.x/ can be re-expressed as (b) Replace step 7 with τ i : if ξ j = τ i , set i = i + 1, and return to step 2. Else simulate A j = .I j , J j /, with I j , J j ∼ IID U{0,: : : , n}, and set (whereφ A j is defined as in equation (16)) and return to algorithm 1, step 3 †As per algorithm 2 unless stated otherwise.
Letting A now be the law of I, J ∼ IID U{0, : : : , n} the following unbiased estimator of φ can be constructed: .16/ The estimatorsα I .x/ andφ A .x/ are nothing more than classical control variate estimators, albeit in a fairly elaborate setting, and henceforth we shall refer to these accordingly. The construction of the estimator requires evaluation of the constants ∇ log{π.x/} and Δ log{π.x/}. Although both are O.n/ evaluations they must be computed only once and, furthermore, as mentioned above, can be calculated entirely in parallel.
The unbiased estimatorsα I .x/ andφ A .x/ use (respectively) single and double draws from {1, : : : , n} although it is possible to replace these by averaging over multiple draws (sampled with replacement), although this is not studied theoretically in the present paper and is exploited only in Section 7.5 of the empirical study. Embedding our subsampling estimator described above within the QSMC algorithm of Section 3.3 results in algorithm 3, termed the scalable Langevin exact algorithm (ScaLE). A similar modification could be made to the rejection sampling version; called R-QSMC, which was discussed in Section 3.3 and is detailed in Appendix F. This variant is termed the rejection scalable Langevin exact algorithm (R-ScaLE), and full algorithmic details are provided in Appendix G.

Implementation details
In this section we detail some simple choices of the various algorithmic parameters which lead to a concrete implementation of ScaLE. These choices have been made on the bases of parsimony and convenience and are certainly not optimal.
In practice, we are likely to want to employ a suitable preconditioning transformation X = Λ −1 X before applying the algorithm to equate roughly scales for different components. If we did not do this, it is likely that some components would mix particularly slowly. Obtaining a suitablê x and Λ is important. One concrete approach, and that used throughout our empirical study except where otherwise stated, is as follows. Divide a data set into a number of batches which are sufficiently small to be processed by using standard maximum likelihood estimation approaches and estimate the maximum likelihood estimate (MLE) and observed Fisher information for each batch;x can then be chosen to be the mean of these MLEs and Λ −1 to be a diagonal matrix with elements equal to the square root of the sum of the diagonal elements of the estimated information matrices. Better performance would generally be obtained by using a non-diagonal matrix, but this serves to illustrate a degree of robustness to the specification of these parameters. The constants that are required within the control variate can then be evaluated. For a given hypercube H, a boundK X on the dominating Poisson process intensity can then be obtained by simple analytic arguments facilitated by extending that hypercube to includex and obtaining bounds on the modulus of continuity ofφ A . In total, two passes of the full data set are required to obtain the necessary algorithmic parameters and to specify the control variate fully.
As discussed in Section 3.3, it is necessary to choose an execution time T for the algorithm and an auxiliary mesh (t 0 := 0, t 1 , : : : , t m := T ) on which to evaluate g in the computation of the QSMC estimator (8). Within the algorithm the particle set is evolving according to killed Brownian motion with a preconditioning matrix Λ −1 chosen to match approximately the square root of the information matrix of the target posterior. As such, T should be chosen to match the time that is taken for preconditioned Brownian motion to explore such a space, which in the examples considered in this paper ranged from T ≈ 1 to T ≈ 100. The number of temporal mesh points, m, was chosen with computational considerations in mind-increasing m increases the cost of evaluating the estimator and leads to greater correlation between the particle set at consecutive mesh points but ensures when running the algorithm on a multiple-user cluster that the simulation is periodically saved and reduces the variance of the estimator. As the computational cost of the algorithm is entirely determined by the bounds on the discussed modulus of continuity ofφ A , in each of the examples we later consider that our mesh size was loosely determined by the inverse of this quantity and ranged from The initial distribution f x 0 is not too critical, provided that it is concentrated reasonably close (within a neighbourhood of size O.n −1=2 /) to the mode of the distribution. The stability properties of the SMC implementation ensure that the initial conditions will be forgotten (see chapter 7 of Del Moral (2004) for a detailed discussion). The empirical results that are presented below were obtained by choosing, as f x , either a singular distribution concentrated atx or a normal distribution centred at that location with a covariance matrix matching ΛΛ T ; results were found to be insensitive to the particular choice.

Complexity of ScaLE
The computational cost of ScaLE will be determined by two factors: the speed at which μ t approaches π and the computational cost of running the algorithm per unit algorithm time. Throughout the exposition of this paper, the proposal process is simple Brownian motion. Because of posterior contraction, as n grows, this proposal Brownian motion moves increasingly rapidly through the support of π. However, as n grows, killing rates will grow. In this section we shall explore in detail how the computational cost of ScaLE varies with n (its complexity) while bringing out explicitly the delicate link to the rate of posterior contraction and the effect of the choice ofx.
We start by examining the speed of convergence of μ t and in particular its dependence on posterior contraction. Being more explicit about posterior contraction, we say that {π n } are O.n −η=2 / or have contraction rate η=2 for some η > 0 to a limit x 0 if for all > 0 there exists K > 0 such that, when X n ∼ π n , P.|X n − x 0 | > Kn −η=2 / < . It is necessary to extend the definition of μ t to a setting where n increases; hence define μ n,u t .dx/ := P.X t ∈ dx|ζ > t, X 0 = x 0 + n −η=2 u/: .17/ Since we are dealing with Markov processes that are essentially never uniformly ergodic, it is impossible to control convergence times uniformly. The specification of the initial value as X 0 = x 0 + n −η=2 u, which, as n increases, remains close to the centre of the posterior as specified through the contraction rate, goes as far as we can before incurring additional computational costs for bad starting values. Set where ' · ' represents total variation distance. It will be necessary to make the following technical assumption. For all , K > 0 lim sup n→∞ sup |u|<K n η T n,u, < ∞: . 18/ At first sight, assumption (18) may seem strong, but it is very natural and is satisfied in reasonable situations. For example suppose that we have a contraction scaling limit: (A special case of this is the Bernstein-von Mises theorem with η = 1 and h being Gaussian, but our set-up is far broader.) If {X n t } denotes ScaLE on π n , then by simple scaling and time change properties of Brownian motion it is easily checked that if Y t = X n −η t then Y is (approximately) ScaLE on h which is clearly independent of n. Thus to obtain a process which converges in O.1/ we need to slow down X by a time scaling factor of time factor = n η : . 19/ Similar arguments have been used for scaling arguments of other Monte Carlo algorithms that use similar control variates; see for instance the concurrent work of .
Although posterior contraction has a positive effect on computational cost, also, for large n, the rate at which a likelihood subsample needs to be calculated, as measured byλ, needs to increase. Sinceλ depends on the current location in the state space, where we need to be precise we shall setλ n,K to be an available bound which applies uniformly for |x − x 0 | < Kn −η=2 .
The following notion of convergence cost will be required: setting Therefore to understand iteration complexity of ScaLE it is necessary to understand the rate at whichλ n,K grows with n. A general way to do this is to use global, or local, bounds on the second derivatives of the log-likelihood for each datum. To simplify the following exposition a global bound is assumed, so that ρ.∇ 2 log{f I .x/}/ P n , .20/ for some P n > 0, where ρ.·/ represents the spectral radius and ∇ 2 is the Hessian matrix. For smooth densities with Gaussian and heavier tails, the Hessian of the log-likelihood is typically uniformly bounded (in both data and parameter). In such cases such a global bound would be expected, and in fact P n would be constant in n.
Recalling the layer construction of Section 3.2 for a single trajectory of killed Brownian motion, we can ensure that over any finite time interval we have x ∈ H, some hypercube. Let the centre of the hypercube be x Å .
In this section, eventually the assumption that the posterior contracts at a rate n −η=2 will be made, i.e. that {n η=2 .x − x 0 /, n = 1, 2, : : :} is tight. The so-called regular case corresponds to the case where η = 1, although there is no need to make any explicit assumptions about normality in what follows. The practitioner has complete freedom to choose H, and it makes sense to choose this so that x − x Å < C Å n −η=2 for some C Å > 0 and for all x ∈ H.
It is possible to boundφ A .x/ both above and below if we can bound |φ A .x/| over A-almost all possible realizations of A. To bound |φ A .x/|, the approach here is first to consider the elementary estimator in expression (14). By imposing condition (20) we can then obtain max x∈H,I∈{0,:::,n} . 21/ Thus it is possible to bound estimator (16) as follows: + P n d.n + 1/: .

22/
We can use the fact that max x∈H x −x x Å −x + C Å n −η=2 to bound the terms in this expression.
We now directly consider the iteration complexity of ScaLE. We note that the appropriate killing rate to ensure mixing in time O.1/ involves slowing down by the time factor given in expression (19) and is therefore just n −ηλ . Assuming that η 1, and using the bound on |φ A .x/ − C| for the hypercube centred on x Å , we have that, while we remain within the hypercube, 1 n ηλ = O.P n n 1−3η=2 .P n n 1−η=2 + |∇ log{π.x/}|//: .23/ Here the assumption has been made that at stationarity x Å will be a draw from the support of the posterior, so that, under the assumption of posterior contraction at the n −η=2 -rate, then This discussion is summarized in the following result.
This result also illuminates the role that is played by |∇ log{π.x/}| in the efficiency of the algorithm. In the following discussion it is assumed that η = 1. It is worth noting that although a completely arbitrary starting value forx might make |∇ log{π.x/}| an O.n/ quantity leading to an iterative complexity of the algorithm which is O.n 1=2 /, to obtain O.1/ it is simply required that |∇ log{π.x/}| be O.n 1=2 / which gives considerable leeway for any initial explorative algorithm to find a good value forx.
Note that, given bounds on the third derivatives, equation (23) can be improved by linearizing the divergence term in expression (16). This idea is exploited later in a logistic regression example (see Sections 7.2, 7.3 and 7.4).
In the absence of a global bound on the second derivatives, it is possible to replace P n in the above arguments by any constant that bounds the second derivatives for all x such that x −x max x∈H x −x . In this case, the most extreme rate at whichλ can grow is logarithmically with n, for instance for light-tailed models where the data really come from the model being used. Where the tails are misspecified and light-tailed models are being used, the algorithmic complexity can be considerably worse. There is considerable scope for more detailed analyses of these issues in future work.
The above arguments give insight into the influence of our choice ofx. It affects the bound oñ λ, and hence the computational efficiency of ScaLE, through the terms x Å −x . Furthermore the main term in the order ofλ is the square of this distance. Ifx is the posterior mean, then the square of this distance will, on average, be the posterior variance. By comparison, ifx is k posterior standard deviations away from the posterior mean, then on average the square distance will be k 2 + a times the posterior variance (for some constant a), and the computational cost of ScaLE will be increased by a factor of roughly k 2 + a.

Overall complexity
Here we shall briefly discuss the overall complexity of ScaLE. The general set-up of theorem 3 describes the iteration complexity of ScaLE on the assumption that |∇ log{π.x/}| grows no worse than O.n ι /. However, there is a substantial initial computational cost in locatingx and calculating ∇ log{π.x/} which is likely to be O.n/ as there are n terms in the calculation of the latter. Therefore the overall complexity of ScaLE can be described as where t represents algorithm time. This is in contrast with an MCMC algorithm for which the iteration cost would be O.n/, leading to overall complexity tn. A Laplace approximation will involve an initial cost that is (at very least) O.n/ but no further computation.
Since they both involve full likelihood calculations, finding the posterior mode and findinĝ x are both likely to be O.n/ calculations. This can be shown to be so for strongly log-concave posterior densities (Nesterov, 2013), though the cost may be higher if the log-posterior is not concave. However, the above discussion shows that to achieve O.1/ scaling with data we typically only need to findx within O.n −1=2 / of the posterior model. Thus findingx is certainly no more difficult than finding the posterior mode, as we can use the same mode finding algorithm, e.g. Bottou (2010), Nesterov (2013) and Jin et al. (2018), but have the option of stopping earlier.
If n is sufficiently large that the cost of initialization dominates the iteration cost, ScaLE will computationally be no more expensive to implement than the Laplace approximation. In this case we obtain an exact approximate algorithm (ScaLE) for at most the computational complexity of an approximate method (Laplace). These complexity considerations are summarized in Table 1.

Theoretical properties
SMC algorithms in both discrete and continuous time have been studied extensively in the literature (for related theory for approximating a fixed point distribution, including for algorithms with resampling implemented in continuous time, see Del Moral andMiclo (2000, 2003) and Rousset (2006); a discrete time algorithm to approximate a fixed point distribution in a different context was considered by Whiteley and Kantas (2017)). To avoid a lengthy technical diversion, we restrict ourselves here to studying a slightly simplified version of the problem to obtain the simplest and most interpretable possible form of results. The technical details of this construc- Table 1. Complexity of algorithms for big data † n t n+ t †This is split into the complexity of initiation, C init , and the cost of the iterative algorithm, C iter . Here n denotes sample size, t denotes algorithm time, and and η are as given in theorem 3.
tion are deferred to Appendix H and we give here only a qualitative description that is intended to guide intuition and the key result: that the resulting estimator satisfies a Gaussian central limit theorem with the usual Monte Carlo rate.
Consider a variant of the algorithm in which (multinomial) resampling occurs at times kh for k ∈ N where h is a time step resolution that is specified in advance and consider the behaviour of estimates that are obtained at these times. Extension to resampling at a random subset of these resampling times would be possible by using the approach of Del Moral et al. (2012), considering precisely the QSMC algorithm that was presented in algorithm 2 and the ScaLE algorithm in algorithm 3 would require additional technical work that is somewhat beyond the scope of this paper; no substantial difference in behaviour was observed.
To employ standard results for SMC algorithms it is convenient to consider a discrete time embedding of the algorithms described. We consider an abstract formalism in which between the specified resampling times the trajectory of the Brownian motion is sampled, together with such auxiliary random variables as are required in any particular variant of the algorithm. Provided that the potential function that is employed to weight each particle before resampling has conditional expectation (given the path) proportional to the exact killing rate integrated over these discrete time intervals a valid version of ScaLE is recovered.
This discrete time formalism enables results on more standard SMC algorithms to be applied directly to ScaLE. We provide in the following proposition a straightforward corollary to a result in chapter 9 of Del Moral (2004), which demonstrates that estimates that are obtained from a single algorithmic time slice of the ScaLE algorithm satisfy a central limit theorem.
Proposition 2 (central limit theorem). In the context described, under mild regularity conditions (see references given in Appendix H), Z is a standard normal random variable, '⇒' denotes convergence in distribution and σ k .ϕ/ depends on the precise choice of subsampling scheme as well as the test function of interest and is specified in Appendix H along with the law K x hk .

Examples
In this section we present five example applications of the methodology that is developed in this paper, each highlighting a different aspect of ScaLE and contrasted with appropriate competing algorithms. In particular, in Section 7.1 we consider a simple pedagogical example which has a skewed target distribution, contrasted with MCMC sampling; Section 7.2 considers the performance of a logistic regression model in which significantly less information is available from the data about one of the covariates than the others; in Section 7.3 we apply both ScaLE and the stochastic gradient Langevin diffusion algorithm SGLD to a regression problem based on the American Statistical Association's data expo airline on-time performance data, which are of moderately large size (of the order of 10 8 ); Section 7.4 considers ScaLE applied to a very large logistic regression problem, with a data set of size n = 2 34 ≈ 10 10:2 , along with consideration of scalability with respect to data size; finally, in Section 7.5, parameter inference for a contaminated regression example is given, motivated by a big data application with n = 2 27 ≈ 10 8:1 , and illustrating the potential of an approximate implementation of ScaLE even when misinitialized. All simulations were conducted in R on an Xeon X5660 central processor unit running at 2.8 GHz. For the purposes of presenting the ScaLE methodology as cleanly as possible, in each example no prior has been specified. In practice, a prior can be simply included within the methodology as described in Section 4.
Details of the data and code that were used to produce the output within this section can be found at https://rss.onlinelibrary.wiley.com/hub/journal/14679868/seriesb-datasets.

Skewed target distribution
To illustrate ScaLE applied to a simple non-Gaussian target distribution, we constructed a small data set of size n = 10, to which we applied a logistic regression model The data were chosen to induce a skewed target, with y T = .1, 1, 0, : : : , 0/ and x T i = .1, .−1/ i =i/. We used the glm R package to obtain the maximum likelihood estimate (β Å ≈ .−1:5598, −1:3971/) and observed Fisher information, to (mis)initialize the particles in ScaLE. In total N = 2 10 particles were used, along with a subsampling mechanism of size 2 and a control variate computed as in Section 4.2 by settingx = β Å . For comparison we ran a random-walk Metropolis algorithm on the same example initialized at β Å by using the MCMClogit function provided by MCMCpack (Martin et al., 2011), computed the posterior marginals based on 1 million iterations thinned to 100000 and after discarding a burn-in of 10000 iterations, and overlaid them together with those estimated by ScaLE in Fig. 1. These are accompanied by the glm fit used to misinitialize ScaLE.
It is clear from Fig. 1 that the posterior that was obtained by simulating ScaLE matches that of MCMC sampling, and both identify the skew which would be overlooked by a simple normal approximation. The particle set in ScaLE quickly recovers from its misinitialization, and only a modest burn-in period is required. In practice, we would of course not advocate using ScaLE for such a small data setting-the computational and implementational complexity of ScaLE does not compete with MCMC sampling in this example. However, as indicated in Section 5 and the subsequent examples, ScaLE is robust to increasing data size whereas simple MCMC sampling will scale at best linearly.

Heterogeneous logistic regression
For this example a synthetic data set of size n = 10 7 was produced from the logistic regression model (24). Each record contained three covariates, in addition to an intercept. The covari- , and with the true β = .0, 2, − 2, 2/ (where the first co-ordinate corresponds to the intercept). The specification of this data set is such that significantly less information is available from the data about the second covariate than about the others. Data were then generated from model (24) by using the simulated covariates. As before, the glm R package was used to obtain the MLE and observed Fisher information, which was used within ScaLE to set β Å =x ≈ .2:3581 × 10 −4 , 2:3407, − 2:0009, 1:9995/ and Λ ≈ diag.7:6238 × 10 −4 , 1:3202, 1:5137 × 10 −3 , 1:5138 × 10 −3 / respectively. For the control variate ∇ log{π.x/} ≈ .2:0287 × 10 −9 , 2:2681 × 10 −9 , − 2:3809 × 10 −6 , − 2:3808 × 10 −6 / was calculated by using the full data set and as expected (and required for computational considerations) is extremely small, along with Δ log{π.x/}. ScaLE was then applied to this example, using N = 2 10 particles initialized by using a normal approximation given by the computedx and Λ, and a subsampling mechanism of size 2. The simulation was run for 20 h, in which time 84935484 individual records of the data set were accessed (equivalent to roughly 8.5 full data evaluations). Trace plots for the simulation can be found in Fig. 2, along with posterior marginals given by the output (after discarding as burn-in a tenth of the simulation). The posterior marginals are overlaid with the normal approximation given by the R glm fit.
To assess the quality of the output we adopted a standard method for estimating the ESS for a single parameter. In particular, we first estimated a marginal ESS associated with the particles from ScaLE at a single time point, with this defined as the average of the ratio of the variance of the estimator of the parameter by using these particles to the posterior variance of the parameter (Carpenter et al., 1999). To calculate the overall ESS, the dependence of these estimators over time is accounted for by modelling this dependence as an AR(1) process. Full details of this approach are given in Appendix I. The resulting average ESS per parameter by using this approach was found to be 352.
The ScaLE output is highly stable and demonstrates that, despite the heterogeneity in the information for different parameters, the Bernstein-von Mises limit (Laplace approximation) proves here to be an excellent fit. Although the generalized linear model fit is therefore excellent in this case, ScaLE can be effectively used to verify this. This is in contrast with the example in Section 7.1 where ScaLE demonstrates that the generalized linear model-Laplace approximation is a poor approximation of the posterior distribution.

Airline data set
To demonstrate our methodology applied to a real (and moderately large) data set we consider the 'Airline on-time performance' data set which was used for the 2009 American Statistical Association data expo and can be obtained from http://stat-computing.org/dataexpo/ 2009/. The 'airline' data set consists in its entirety of a record of all flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. In total the data set comprises 123534969 such flights together with 29 covariates.
For the purposes of this example we selected a number of covariates to investigate what effect (if any) they may have on whether a flight is delayed. The Federal Aviation Administration considers an arriving flight to be late if it arrives more than 15 min later than its scheduled arrival time. As such we take the flight arrival delay as our observed data (given by ArrDelay in the Airline data) and treat it as binary, taking a value of 1 for any flight delayed in excess of the Federal Aviation Administration definition.
In addition to an intercept, we determine three further covariates which may reasonably affect flight arrival: a weekend covariate, which we obtain by treating DayOfWeek as binary, taking a value of 1 if the flight operated on a Saturday or Sunday, a night flight covariate, which we obtain by taking DepTime (departure time) and treating it as binary, taking a value of 1 if the departure is between 8 p.m. and 5 a.m., and flight distance, which we obtain by taking Distance and normalizing by subtracting the minimum distance and dividing by the range.
The resulting data set that was obtained by the above process contained some missing entries, and so all such flights were omitted from the data set (in total 2786730 rows), leaving n =120748238 rows. We performed logistic regression taking the flight arrival delay variable as the response and treating all other variables as covariates.
To allow computation ofx and Λ as required by ScaLE the data were first divided into 13 subsets each of size 9288326 and the MLE and observed information matrix for each were obtained by using the R glm package. The airline data set is highly structured, and so for robustness the order of the flight records was first permuted before applying glm to the data subsets. An estimate for the MLE and observed information matrix for the full data set was obtained by simply taking the mean for each coefficient of the subset MLE fits, and summing the subset information matrices. The centring pointx ≈ .−1:5609, − 0:1698, 0:2823, 0:9865/ was chosen to be the computed MLE fit, and for simplicity Λ −1 was chosen to be the square root of the diagonal of the computed information matrix (Λ ≈ diag.2:309470 × 10 −4 , 4:632830 × 10 −4 , 6:484359 × 10 −4 , 1:2231 × 10 −5 /). As before, and as detailed in Section 4.2, we use the full data set to compute ∇ log{π.x/} ≈ .0:00249, 0:0018, 0:0021, 0:0029/ (which again is small as suggested by the theory, and required for efficient implementation of ScaLE) and Δ log{π.x/} ≈ −3:999.
ScaLE was initialized by using the normal approximation that is available from the glm fit. In total N = 2 12 particles were used in the simulation, and for computing the unbiased estimatorφ A .x/ we used a subsample of size 2. The algorithm was executed so that n individual records of the data set were accessed (i.e. a single access to the full data set), which took 36 h of computational time. The first tenth of the simulation trajectories were discarded as burn-in and the remainder used to estimate the posterior density. The trace plots and posterior densities for each marginal for the simulation can be found in Fig. 3.
For comparison, we also ran SGLD (Welling and Teh, 2011). This algorithm approximately simulates from a Langevin diffusion which has the posterior distribution as its stationary distribution. The approximation comes from both simulating an Euler discretized version of the Langevin diffusion and from approximating gradients of the log-posterior at each iteration. The approximation error can be controlled by tuning the step size of the Euler discretization-with smaller step sizes meaning less approximation but slower mixing. We implemented SGLD by using a decreasing step size, as recommended by the theoretical results of Teh et al. (2016) and used pilot runs to choose the smallest scale for the step size schedule which still led to a well mixing algorithm. As such, the preprocessing expenditure matched that of ScaLE. The accuracy of the estimate of the gradient is also crucial to the performance of SGLD (Dalalyan and Karagulyan, 2019), and we employed an estimator that used control variates (similar to that developed in ScaLE) and a mini-batch size of 1000, following the guidance of Baker et al. (2019), Nagapetyan et al. (2017) and Brosse et al. (2018). For comparable results we ensured that SGLD had the same number of log-likelihood evaluations as ScaLE (i.e. equivalent to one single access to the full data set) and initiated SGLD from the centring value that was used for the control variates. In total the SGLD simulation took 4 h to execute. The first tenth was discarded as burn-in and the remainder was used to estimate the marginal posteriors, which are overlaid with those estimated by ScaLE in Fig. 3.
As can be seen in Fig. 3, SGLD estimates seem to be unstable here, with the algorithm struggling to mix effectively under the decreasing step size constraint, particularly for the fourth covariate. Indeed, the marginal posteriors should be convex and SGLD deviates strongly from this. This unstable behaviour was confirmed in replicate SGLD runs, and indeed it would be difficult to separate out bias from Monte Carlo error for SGLD without much longer runs. This is in contrast with ScaLE which produces far more stable output in this example.

Large data scenario
In this subsection we consider an application of ScaLE to a five-dimensional logistic regression model, considering data sets of up to size n = 2 34 ≈ 10 10:2 . Logistic regression is a model that is frequently employed within big data settings , and here the scalability of ScaLE is illustrated for this canonical model. In this example, we generate a data set of size 2 34 from model (24) by first constructing a design matrix in which the ith entry x i := .1, ζ i,1 , : : :  ζ 1,1 , : : : , ζ n,4 are independent and identically distributed truncated normal random variables with support [−1, 1]. In the big data setting it is natural to assume such control on the extreme entries of the design matrix, either through construction or physical limitation. On simulating the design matrix, binary observations are obtained by simulation using the parameters β = .1, 1, −1, 2, −2/ T . Because of the extreme size of the data we realized observations only as they were required to avoid storing the entire data set; see the code provided for implementation details.
ScaLE was applied to this example by using N = 2 10 particles initialized using a normal approximation given by the available glm fit, and a subsampling mechanism of size 2. The simulation was run for 70 h, in which time 49665450 individual records of the data set were accessed (equivalent to roughly 0.0029 full data evaluations). Trace plots for the simulation can be found in Fig. 4. The first tenth of the simulation trajectories was discarded as burn-in and the remainder used to estimate the posterior density of each marginal. These can also be found in Fig. 4, together with the normal approximation to the posterior marginals given by the R glm fit, which is again very accurate here, agreeing closely with the ScaLE output. Using the ESS approach that was described in Section 7.2 and Appendix I, the average ESS per parameter was found to be 553.
To investigate scaling with data size for this example, we considered the same model using the same process as outlined above with data sets varying in size by a factor of 2 from n = 2 21 to n = 2 33 . Computing explicitly the dominating intensityλ n,K over the support of the density the relative cost of ScaLE for each data set with respect to the data set of size n = 2 34 can be inferred. This is shown in Fig. 5.

Contaminated mixture
In this subsection we consider parameter inference for a contaminated mixture model. This is motivated by big data sets obtained from Internet applications, in which the large data sets are readily available, but the data are of low quality and corrupted with noisy observations. In particular, in our example each datum comprises two features and a model is fitted in which the likelihood of an individual observation (y i ) is .25/ In this model p represents the level of corruption and φ the variance of the corruption. A common approach uses MCMC sampling with data augmentation (Tanner and Wong, 1987). However, for large data sets this is not feasible as the dimensionality of the auxiliary variable vector will be O.n/. For convenience a transformation of the likelihood was made so that each 22 24 26 28 30 32 34 11.020 11.025 11.030 11.035 log 2 (n) A data set of size n = 2 27 ≈ 10 8:1 was generated from the model with parameters μ = [α, β, σ, φ, p] = [2, 5, 1, 10, 0:05]. To illustrate a natural future direction for the ScaLE methodology, in this example we instead implemented an approximate version of ScaLE (as opposed to the exact version that was illustrated in the other examples of Section 7). In particular, the primary implementational and computational bottleneck in ScaLE is the formal 'localization procedure' to obtain almost sure bounds on the killing rate by constraining Brownian motion to a hypercube (as fully detailed in Section 3.2 and Appendix C). Removing the localization procedure results in the Brownian motion trajectories being unconstrained, and the resulting dominating intensitỹ λ being infinite. However, in practice the probability of such an excursion by Brownian motion outside a suitably chosen hypercube can be made vanishingly small (along with the consequent effect on the Monte Carlo output) by simply adjusting the temporal resolution at which the ergodic average is obtained from the algorithm (noting that Brownian motion scaling is O. √ t/, and inflating the bounds on the Hessian for computing the intensity. The resulting 'approximate' algorithm is approximate in a different (more controllable and monitorable) sense than for instance SGLD, but results in substantial (10-50 times) computational speed-ups over the available (but expensive) 'exact' ScaLE.
In contrast with the other examples of Section 7, rather than fitting an approximate model to initialize the algorithm, instead in this example a single point mass to initialize the algorithm was chosen (μ = [2:00045, 5:00025, 0:999875, 10:005 0:0499675]), and this was also used as the point to compute our control variate (described in Section 4.2). The preprocessing for executing ScaLE took approximately 6 h of computational time (and is broadly indicative of the length of time that a single iteration of an alternative MCMC scheme such as the Metropolis adjusted Langevin algorithm would require). As discussed in Section 5, this 'misinitialization' impacts the efficiency of the algorithm by a constant factor but is, however, representative of what one in practice may conceivably be able to do (i.e. find by means of an optimization scheme a point within the support of the target posterior close to some mode and conduct a single O.n/ calculation). The forgetting of this initialization is shown in Fig. 6.
Applying ScaLE for this application we used a particle set of size N = 2 11 and ran the algorithm for a diffusion time of T = 200, with observations of each trajectory at a resolution of t i − t i−1 = 0:1. Again, the choice of N was made as in Section 7.4 as it provided the required stability. (c), (h) σ; The choice of T was made as it corresponded approximately to a computational budget of 1 week.
Each particle trajectory at each time t ∈ [0, T ] was associated with a subsample of the full data set of size 32, rather than 2, with the resulting likelihood estimates combined by simple averaging. This size was chosen as it provided balance with other components of the algorithm but allowed stabilization of the importance weights which was beneficial for the approximate algorithm. In total the entire run required accessing 500 million individual data points, which correspond to approximately four full evaluations of the data set.
An example of a typical run can be found in Fig. 6. A burn-in period of 100 was chosen, and alongside the trace plots in Fig. 6 an estimate of the marginal density of the parameters is provided using the occupation measure of the trajectories in the interval t ∈ [100,200].
To assess the quality of the simulation, the same batch mean method is employed to estimate the marginal ESS for the run post burn-in as detailed in Section 7.4. The mean ESS per dimension for this run was around 930. An analysis of the Metropolis adjusted Langevin algorithm (for a necessarily much smaller run) indicated that it is possible to achieve an ESS of around T=3, where T corresponds to the run length subsequent to burn-in. As indicated above, and neglecting burn-in, this would mean an achievable ESS for a comparable computational budget for the Metropolis adjusted Langevin algorithm would be around 10-15.

Conclusions
In this paper we have introduced a new class of QSMC methods which are genuinely continuous time algorithms for simulating from complex target distributions. We have emphasized its particular effectiveness in the context of big data by developing novel subsampling approaches and the scalable Langevin exact algorithm ScaLE. Unlike its immediate competitors, our subsampling approach within ScaLE is essentially computationally free and does not result in any approximation to the target distribution. Our methodology is embedded within an SMC framework, supported by underpinning theoretical results. In addition, examples to which ScaLE is applied demonstrate its robust scaling properties for large data sets.
We show through theory and examples that the computational cost of ScaLE is more stable to data set size than gold standard MCMC approaches. Moreover we have seen it substantially outperform other approaches such as SGLD which are designed to be robust to data size at the cost of bias and serial correlation. ScaLE can both confirm that simpler approaches such as Laplace approximation are accurate and identify when such approximations are poor (as we see in the examples). We see this as a first step in a fruitful new direction for computational statistics. Many ideas for variations and extensions to our implementation exist and will stimulate further investigation.
Firstly, the need to simulate a quasi-stationary distribution creates particular challenges. Although quasi-stationarity is underpinned by an elegant mathematical theory, the development of numerical methods for quasi-stationarity is understudied. We have presented an SMC methodology for this problem, but alternatives exist. For instance,  have suggested alternative approaches.
Even within an SMC framework for extracting the quasi-stationary distribution, there are interesting alternatives that we have not explored. For example, by a modification of our reweighting mechanism it is possible to relate the target distribution of interest to the limiting smoothing distribution of the process, as opposed to the filtering distribution as we do here. Within the quasi-stationary literature this is often termed the type II quasi-stationary distribution. As such, the rich SMC literature offers many other variations on the procedures that were adopted here.
Using SMC methods benefits from the rich theory that they have. However, the use of QSMC methods actually demands new questions of SMC methods. Theorem 1 gives convergence as T → ∞, whereas proposition 2 gives a precise description of the limit as the number of particles N increases. Theoretical and practical questions are associated with letting both N and T tend to ∞ together. Within the examples in this paper ad hoc rules are used to assign computational effort to certain values of N and T . However, the general question of how to choose these parameters seems completely open.
Throughout the paper, we have concentrated on so-called 'exact approximate' QSMC methods. Of course in many cases good approximations are sufficiently good and frequently computationally less demanding. However, for many approximate methods it will be difficult to quantify the systematic error being created by the approximation. Moreover, we emphasize that there are different strategies for creating effective approximations that emanate directly from exact approximate methods, and where the approximation error can be well understood. We have given an example of this in Section 7.5 but other options are possible also.
There are interesting options for parallel implementation of SMC algorithms in conjunction with ScaLE. For instance an appealing option would be to implement the island particle filter (Del Moral et al., 2016) which could have substantial effects on the efficiency of our algorithms where large numbers of particles are required. Alternatively one could attempt to embed our scheme in other divide-and-conquer schemes as described in Section 1.
The approach in this paper has concentrated solely on killed (or reweighted) Brownian motion, and this strategy has been demonstrated to have robust convergence properties. However, given existing methodology for the exact simulation of diffusions in Beskos and Roberts (2005), , Pollock (2013Pollock ( , 2015 and Pollock et al. (2016), there is scope to develop methods which use proposal measures which much better mimic the shape of the posterior distribution.
The subsampling and control variate approaches that were developed here offer dramatic computational savings for tall data as we see from the examples and from the theory of results like theorem 3. We have not presented ScaLE as a method for high dimensional inference, and the problem of large n and d will inevitably lead to additional challenges. However, there may be scope to extend the ideas of ScaLE still further in this direction. For instance, it might be possible to subsample dimensions and thus to reduce the dimensional complexity for implementing each iteration.
We conclude by noting that, as a by-product, the theory behind our methodology offers new insights into problems concerning the existence of quasi-stationary distributions for diffusions killed according to a state-dependent hazard rate, complementing and extending current state of the art literature (Steinsaltz and Evans, 2007).

Appendix A: Proof of theorem 1
Here we present a proof of theorem 1. However, we first formally state the required regularity conditions. We suppose that π.x/ is bounded, .26/ and, defining ν.x/ = π 2 .x/, we further assume that, for some γ > 0, where Δ represents the Laplacian.

Proof. Consider the diffusion with generator given by
Af.x/ = 1 2 Δf.x/ + 1 2 ∇ log{ν.x/}∇f.x/: As ν is bounded, we assume without loss of generality that its upper bound is 1. Our proof will proceed by checking the conditions of corollary 6 of , which establishes the result. In particular, we need to check that the following conditions are satisfied.
Condition 4. All closed bounded sets are petite.
Condition 3 holds for any regular diffusion since the diffusion has positive continuous transition densities over time intervals t > 0; and positivity and continuity of the density also imply condition 4. For the final condition we require that lim sup .31/ Therefore inequality (29) will hold whenever inequality (27) is true since we have the constraint that η 1 and ∇ν.x/ 2 is clearly non-negative. As such the result holds as required.
Note that condition (27) is essentially a condition on the tail of ν. This will hold even for heavy-tailed distributions, and we show that this is so for a class of one-dimensional target densities in Appendix B.

Appendix B: Polynomial tails
In this appendix we examine condition (27) which we use within theorem 1. This is essentially a condition on the tail of ν, and so we examine the critical case in which the tails of ν are heavy. More precisely, we demonstrate for polynomial tailed densities in one dimension that condition (27) essentially amounts to requiring that ν 1=2 is integrable. Recall that by construction ν 1=2 will be integrable as we have chosen ν 1=2 = π.

Appendix C: Simulation of a path space layer and intermediate points
In this appendix we present the methodology and algorithms that are required for simulating an individual proposal trajectory of (layered) killed multivariate Brownian motion, which is what is required in Section 3. Our exposition is as follows. In Appendix C.1 we present the work of Devroye (2009), in which a highly efficient rejection sampler was developed (based on the earlier work of Burq and Jones (2008)) for simulating the first-passage time for univariate standard Brownian motion for a given symmetric boundary, extending it to consider the case of the univariate first-passage times of d-dimensional standard Brownian motion with non-symmetric boundaries. This construction enables us to determine an interval (given by the first, first-passage time) and layer (a hypercube inscribed by the user-specified univariate boundaries) in which the sample path is almost surely constrained, and by application of the strong Markov property can be applied iteratively to find, for any interval of time, a layer (a concatenation of hypercubes) which almost surely constrains the sample path. In Appendix C.2 we present a rejection sampler enabling the simulation of constrained univariate standard Brownian motion as developed in Appendix C.1, at any desired intermediate point. As motivated in Section 3 these intermediate points may be at some random time (corresponding to a proposed killing point of the proposed sample path), or a deterministic time (in which the sample path is extracted for inclusion within the desired Monte Carlo estimator of QSMC (7)). Finally, in Appendix C.3 we present the full methodology that is required in Sections 3 and 4 in which we simulate multivariate Brownian motion at any desired time marginal, with d-dimensional hypercubes inscribing intervals of the state space in which the sample path almost surely lies.

C.1. Simulating the first-passage times of univariate and multivariate standard Brownian motion
To begin with we restrict our attention to the (ith) dimension of multivariate standard Brownian motion initialized at 0, and the first-passage time of the level θ .i/ (which is specified by the user). In particular we denote Recalling the self-similarity properties of Brownian motion (Karatzas and Shreve (1991), section 2.9), we can further restrict our attention to the simulation of the first-passage time of univariate Brownian motion of the level 1, noting that τ .i/ D =.θ .i/ / 2τ wherē τ := inf{t ∈ R + : |W t − W 0 | 1}, .35/ noting that, at this level, P.W τ = W 0 + 1/ = P.W τ = W 0 − 1/ = 1 2 : . 36/ Denoting by fτ the density ofτ (which cannot be evaluated pointwise), the approach that was outlined in Devroye (2009) for drawing random samples from fτ is a series sampler. In particular, an accessible dominating density of fτ is found (denoted gτ ) from which exact proposals can be made; then upper and lower monotonically convergent bounding functions are constructed (lim n→∞ f ↑ τ , n → fτ and lim n→∞ f ↓ τ , n → fτ such that for any t ∈ R + and > 0 ∃ n Å .t, / such that ∀ n n Å .t, / we have f ↑ τ , n .t/ − f ↓ τ , n .t/ < ), and then evaluated to sufficient precision such that acceptance or rejection can be made while retaining exactness. A minor complication arises in that no known, tractable dominating density is uniformly efficient on R + , and furthermore no single representation of the bounding function converges monotonically to the target density pointwise on R + . As such, the strategy that was deployed by Devroye (2009) is to exploit a dual representation of fτ given by Ciesielski and Taylor (1962) to construct a hybrid series sampler, using one representation of fτ for the construction of a series sampler on the interval .0, t 1 ] and the other representation for the interval [t 2 , ∞/ (fortunately we have t 1 > t 2 , and so we have freedom to choose a threshold t Å ∈ [t 2 , t 1 ] in which to splice the series samplers). In particular, as shown in Ciesielski and Taylor (1962) fτ .t/ = πΣ ∞ k=0 .−1/ k a k .t/ where the elements of the two expansions are given by and so by consequence upper and lower bounding sequences can be constructed by simply taking either representation and truncating the infinite sum to have an odd or even number of terms respectively (and thresholding to between 0 and the proposal gτ , introduced below). More precisely, .−1/ k a k .t/ ∧ gτ .t/: .38/ As shown in lemma 1 of Devroye (2009), the bounding sequences based on the representation of fτ .t/ in equation (37a) are monotonically converging for t ∈ .0, 4= log.3/], and for equation (37b) monotonically converging for t ∈ [log.3/=π 2 , ∞/. After choosing a suitable threshold t Å ∈ [4= log.3/, log.3/=π 2 ] for which to splice the series samplers, then by simply taking the first term in each representation of fτ .t/ a dominating density can be constructed as follows: . 39/ Devroye (2009) empirically optimized the choice of t Å = 0:64 to minimize the normalizing constant of expression (39). With this choice M 1 := g .1/ τ .t/dt ≈ 0:422599 (to six decimal places) and M 2 := g .2/ τ .t/dt ≈ 0:578103 (to six decimal places), and so we have a normalizing constant M = M 1 + M 2 ≈ 1:000702 (to six decimal places) which equates to the expected number of proposal random samples drawn from gτ before we would expect an accepted draw (the algorithmic 'outer loop'). Now considering the iterative algorithmic 'inner loop'-in which the bounding sequences are evaluated to precision sufficient to determine acceptance or rejection-as shown in Devroye (2009), the exponential convergence of the sequences ensures that the number of iterations required is uniformly bounded in expectation by 3.
Generalizing to the case where we are interested in the first-passage time of Brownian motion of a non-symmetric barrier, in particular for l .i/ , υ .i/ ∈ R + , is trivial algorithmically. In particular, using the strong Markov property we can iteratively apply algorithm 4, setting θ .i/ := min.l .i/ , υ .i/ / and simulating intermediate first-passage times of lesser barriers, halting whenever the desired barrier is attained. We suppress this (desirable) flexibility in the remainder of the paper to avoid the resulting notational complexity.

C.2. Simulating intermediate points of multivariate standard Brownian motion conditioned on univariate first-passage times
Clearly in addition to being able to simulate the first-passage times of a single dimension of Brownian motion, we want to be able to simulate the remainder of the dimensions of Brownian motion at that time, or indeed the sample path at times other than its first-passage times. As the dimensions of standard Brownian motion are independent (and so Brownian motion can be composed by considering each dimension separately), we can restrict our attention to simulating a single dimension of the sample path for an intermediate time q ∈ [s, τ ] given W s , the extremal value W τ , and constrained such that, Furthermore, as we are interested in only the forward simulation of Brownian motion, by application of the strong Markov property we need to consider only the simulation of a single intermediate point (although by application of Pollock et al. (2016), section 7, simulation at times conditional on future information is possible).
To proceed, note that (as outlined in Asmussen et al. (1995), proposition 2) the law of a univariate Brownian motion sample path in the interval [s, τ ] (where s < τ ) initialized at .s, W s /, and constrained to attain its extremal value at .τ , W τ /, is simply the law of a three-dimensional Bessel bridge. We require the additional constraint that, ∀ u ∈ [s, τ ], W u ∈ [W s − θ, W s + θ], which can be imposed in simulation by deploying a rejection sampling scheme in which a Bessel bridge sample path is simulated at a single required point (as above) and accepted if it meets the imposed constraint at either side of the simulated point, and rejected otherwise.
As presented in  and Pollock (2013), the law of a Bessel bridge sample path (parameterized as above) coincides with that of an appropriate time rescaling of three independent Brownian bridge sample paths of unit length conditioned to start and end at the origin (denoted by {b .i/ } 3 i=1 ). Supposing that we require the realization of a Bessel bridge sample path at some time q ∈ [s, τ ], then, by simply realizing three independent Brownian bridge sample paths at that time The method by which the proposed Bessel bridge intermediate point is accepted or rejected (recall, to impose the constraint that, ∀ u ∈ [s, τ ], W u ∈ [W s − θ, W s + θ]) is non-trivial as a closed form representation of the required probability does not exist (which we shall denote in this appendix by p). Instead, as shown in theorem 4, a representation for p can be found as the product of two infinite series, which as a consequence of this form cannot be evaluated directly to make the typical acceptance-rejection comparison (i.e. determining whether u p or u > p, where u ∼ U[0, 1]). The strategy that we deploy to retain exactness and to accept with the correct probability p is that of a retrospective Bernoulli sampler (Pollock et al. (2016), section 6). In particular, in corollary 1 we construct monotonically convergent upper and lower bounding probabilities (p ↑ n and p ↓ n respectively) with the property that lim n→∞ p ↑ n → p and lim n→∞ p ↓ n → p such that for any u ∈ [0, 1] and > 0 ∃ n Å .t/ such that ∀n n Å .t/ we have p ↑ n − p ↓ n < , which are then evaluated to sufficient precision to make the acceptance-rejection decision, taking almost surely finite computational time.
Theorem 4. The probability that a three-dimensional Bessel bridge sample path W ∼ W Ws, Wτ s, τ |.W τ , W q /, for s < q < τ attaining its boundary value at .τ , W τ /, remains in the interval [W s − θ, W s + θ] can be represented by the following product of infinite series (where we denote by m := 1. .46/ Relating the decomposition to the statement of theorem 4, p 1 follows directly from the parameterization given and the representation in Pollock (2013), theorem 6.1.2, of the result in , proposition 3. p 2 similarly follows from the representation found in Pollock et al. (2016), theorem 5.
, set n = n + 1 5. p: if u p ↓ n accept, else reject and return to step 1 6. Return .q, W q / . 48/ Furthermore we have =: r n r ∈ .0, 1/, .49/ and soK . 50/ Proof. The summations in the first term of the sequences (47) and (48) follow from theorem 4 and , proposition 3. The summations in the second term of the sequences (47) and (48), and the necessary condition on n 0 , follow from Pollock et al. (2016), corollary 5. The validity of the product form of sequences (47) and (48)  Having established theorem 4 and corollary 1 we can now construct a (retrospective) rejection sampler in which we simulate W q (as per the law of a Bessel bridge) and, by means of an algorithmic loop in which the bounding sequences of the acceptance probability are evaluated to sufficient precision, we make the determination of acceptance or rejection. This is summarized in algorithm 5 further noting that, although the embedded loop is of random length, by corollary 1 we know that it halts in finite expected time (K can be interpreted as the expected computational cost of the nested loop, noting that E[iterations] := Σ ∞ i=0 iP.halt at step i/ = Σ ∞ i=0 P.halt at step i or later/ =K).

C.3. Simulation of a single trajectory of constrained Brownian motion
We now have the constituent elements for Section 3, in which we simulate multivariate Brownian motion at  Pollock et al. (2016)). Recall from Section 3 that the killing times are determined by a random variable whose distribution depends on the inscribed layers, and so the presentation of algorithm 6 necessitates a loop in which the determination of whether the stopping time occurs in the interval is required.
We require the user-specified vector θ to determine the default hypercube inscription size. In practice, as with other MCMC methods, we might often apply a preconditioning matrix to the state space before applying the algorithm.
Further note that, because of the strong Markov property, it is user preference whether this algorithm is run in its entirety for every required time marginal, or whether it resets layer information whenever any one component breaches its boundary and reinitializes from that time on according to algorithm 6, step 4(b).

Appendix D: Path space rejection sampler for μ T
A path space rejection sampler for μ T can be constructed by drawing from Brownian motion measure, X ∼ W x T , accepting with probability P.X/ given by . 52/ The algorithmic pseudocode for this approach is thus presented in algorithm 7. Crucially, determination of acceptance is made using only a path skeleton (as introduced in Pollock et al. (2016), a path skeleton is a finite dimensional realization of the sample path, including a layer constraining the sample path, sufficient to recover the sample path at any other finite collection of time points without error as desired). The path space rejection sampler for μ T outputs the skeleton composed of all intermediate simulations: which is sufficient to simulate any finite dimensional subset of the remainder of the sample path (denoted by X rem ) as desired without error (as outlined in Pollock et al. (2016), section 3.1 and Appendix C), Algorithm 7. Path space rejection sampler for μ T algorithm 1. Input: X 0 2. R: simulate layer information R ∼ R as per Appendix C 3. P .1/ : with probability 1 − exp{ΦT − Σ n R i=1 L .i/ X ..τ i ∧ T/ − τ i−1 /} reject and return to step 2 4. n R : , reject path and return to step 2 . 54/

Appendix E: Killed Brownian motion
In algorithm 4 we detailed an approach to simulate the killing time and location, (τ , Xτ ), for killed Brownian motion. To avoid unnecessary algorithmic complexity, note that we can recover the pair (τ , Xτ ) by a simple modification of algorithm 7 in which we set ∀ iL .i/ X := Φ and return the first rejection time. This is presented in algorithm 8. A variant in which L .i/ X is incorporated would achieve greater efficiency, but has been omitted for notational clarity.
As in the path space rejection sampler for μ T that was presented in Appendix D, in killed Brownian motion (algorithm 8) we can recover in the interval [0,τ / the remainder of the sample path as desired without error as follows (where for clarity we have suppressed the full notation, but can be conducted as described in Appendix C): .55/

Appendix F: Rejection-sampling-based quasi-stationary Monte Carlo algorithm
In Section 3.3 we considered the embedding of the importance sampling killed Brownian motion of algorithm 1 within an SMC algorithm. A similar embedding for the rejection sampling variant (killed Brownian motion) of algorithm 8 is considered here as the probability of the killed Brownian motion trajectory of algorithm 8 remaining alive becomes arbitrarily small as the diffusion time increases. As such, if one wanted to approximate the law of the process conditioned to remain alive until large T it would have prohibitive computational cost.
Considering the killed Brownian motion algorithm that was presented in Appendix E, in which we simulate trajectories of killed Brownian motion, the most natural embedding of this within an SMC framework is to assign each particle constant unnormalized weight while alive, and zero weight when killed. Resampling in this framework simply consists of sampling killed particles uniformly at random from the remaining alive particle set. The manner in which we have constructed algorithm 8 enables us to conduct this resampling in continuous time, and so we avoid the possibility of at any time having an alive particle set of size zero. We term this the (continuous time) rejection QSMC approach, R-QSMC, and present it in algorithm 9. In algorithm 9 we denote m.k/ as a count of the number of killing events of particle trajectory k in the time elapsed until the mth iteration of the algorithm.
Iterating R-QSMC beyond some time t Å at which point we believe that we have obtained convergence and, halting at time T > t Å , we can approximate the law of the killed process by the weighted occupation measures of the trajectories (where ∀ tw .·/ t = 1=N): .dx/dt: .56/ In some instances the tractable nature of Brownian motion will admit an explicit representation of expression (56). If not, one can simply sample the trajectories exactly at equally spaced points to find an unbiased approximation of expression (56), by means detailed in Appendix C.2 and algorithm 4. In particular, if we let t 0 := 0 < t 1 < : : : < t m := T such that t i − t i−1 := T=m, then we can approximate the law of the killed process as we did in expression (7), where w .1:N/ t AE :T = 1=N.

Appendix G: Rejection sampling scalable Langevin exact algorithm R-ScaLE
In Section 4 we noted that the survival probability of a proposal Brownian motion sample path was related to the estimator P.X/ of Appendix D and in Section 4.2 where we developed a replacement estimator. The construction of control variates in Section 4.2 enables us to construct the replacement estimator such that it has good scalability properties. In a similar fashion to the embedding of this estimator within a QSMC algorithm (algorithm 2) resulting in ScaLE (algorithm 3), we can embed this estimator with the rejection sampling variant R-QSMC (algorithm 9) resulting in the rejection scalable Langevin exact algorithm R-ScaLE which we present in algorithm 10.
As presented in algorithm 10 we may also be concerned with the absolute growth ofΦ (relative to Φ) as a function of n to study its computational complexity. Note, however, as remarked on in Appendix E, if this growth is not favourable one can modify algorithm 8 to incorporate the additional path space bound L .i/ X for each layer. Details of this modification have been omitted for notational clarity.

Appendix H: Discrete time sequential Monte Carlo construction
Consider the discrete time system with state space E k = .C.h.k − 1/, hk], Z k / at discrete time k, with the process denoted X k = .X .h.k−1/, hk] , Z k / in which the auxiliary variables Z k take values in some space Z k . ScaLE, with resampling conducted deterministically at times h, 2h, : : :, coincides exactly with the mean field particle approximation of a discrete time Feynman-Kac flow, in the sense and notation of Del Moral (2004), with transition kernel M k .X k−1 , dX k / = W X h.k−1/ h.k−1/, hk .dX .h.k−1/, hk] /Q k .X .h.k−1/, hk] , dZ k / and a potential function G k .X k /, which is left intentionally unspecified to allow a broad range of variants of the algorithm to be included: the property which it must have to lead to a valid form of ScaLE is specified below. AllowW The validity of such a ScaLE algorithm depends on the following identity holding: It is convenient to define some simplifying notation. We define the law of a discrete time process (in which is embedded a continuous time process taking values in C[0, ∞/): and of a family of processes indexed by k,K x k , again incorporating a continuous time process taking values in With a slight abuse of notation, we use the same symbol to refer to the associated finite dimensional distributions, with the intended distribution being indicated by the argument. We also define the marginal Proposition 3. Under mild regularity conditions (see Del Moral (2004) and Chopin (2004)), for any ϕ : R d → R, any algorithm within the framework described admits a central limit in that where Z is a standard normal random variable, '⇒' denotes convergence in distribution and

H.1. Proof outline
It follows by a direct application of the argument underlying the proposition of Johansen and Doucet (2008) (which itself follows from simple but lengthy algebraic manipulations from the results of Del Moral (2004) and Chopin (2004)) that, for any test function ϕ : R d → R satisfying mild regularity conditions (see Del Moral (2004) and Chopin (2004)), where Z is a standard normal random variable, '⇒' denotes convergence in distribution and .ϕ.X hk / −K x k .ϕ.X hk /// 2 with {F p } p 0 being the natural filtration associated withW x . This can be straightforwardly simplified to We conclude with the following corollary, showing that the particular combination of subsampling scheme and path space sampler fits into this framework and providing its particular asymptotic variance expression.
Corollary 2. Such a central limit theorem is satisfied in particular (a) if no subsampling is used and one evaluates the exact (intractable) killing rate (as described in algorithm 2), and (b) if subsampling is employed within the construct of the layered path space rejection sampler (as described in algorithm 3).

Discussion on the paper by Pollock, Fearnhead, Johansen and Roberts
Natesh S. Pillai (Harvard University, Cambridge) I congratulate Pollock, Fearnhead, Johansen and Roberts for a very stimulating paper. The paper introduces the ScaLE algorithm-a novel algorithm for sampling that is not based on the usual Metropolis accept-reject schemes. At its core, ScaLE is still an accept-reject sampler: for fixed running time T , one can obtain (approximate) samples from the distribution π by simulating a Brownian motion {X t } 0 t T and a killing time ζ, and then accepting if ζ > T . The algorithm can be thought of as importance sampling of the Brownian paths. The authors ingeniously use sequential Monte Carlo algorithms to 'revive' killed trajectories.
From the perspective of a user, some details need to be worked out. The famous random-walk Metropolis algorithm is ubiquitous because of its simplicity; to implement it, the user needs only to evaluate π.·/ pointwise. In contrast, for the ScaLE algorithm. the user needs to compute the upper .U x / and lower bounds .L x / of the function φ.·/ on the trajectory of X t -or at least use some extra information on φ. It would be good to streamline this step if possible so that it demands less from the user for an effortless implementation.
Next, how large should T be? For π ∝ 1, φ.x/ = 0 and consequently the algorithm never kills a trajectory and outputs X T ∼ N.0, T/. Thus the 'uniform distribution' π ∝ 1 is approximated by N.0, T/. Thus T cannot be too small; how should T scale with the dimension of the target π? This computation also sheds light on a related point: if the posterior distribution is improper, the usual Markov chain Monte Carlo algorithms can be null recurrent instead of positive recurrent and consequently may not have valid credible regions. What is the analogue here?
Equally important for practical implementation are diagnostic criteria for assessing 'convergence'. Initialization seems to play an important role for ScaLE and poor initializations could hurt its performance. Thus it seems that one cannot use the usual Gelman-Rubin statistic for the ScaLE algorithm without modification.
In big data settings, ScaLE has favourable complexity compared with its alternatives. It has better scaling than naive subsampling and enjoys exactness unlike Laplace approximation methods. However, I believe that there is no 'free lunch'. In Johndrow et al. (2020), we show that subsampling algorithms are just as expensive compared with the original algorithm that uses all of the data at every step. Naive subsampling algorithms use fewer data points (and thus less information) at each step, but require more steps to explore the state space. Working out the details of the analogous story for ScaLE is a worthwhile direction to pursue.
The underlying proposal for ScaLE is Brownian motion or more generally diffusions. Despite their beauty and tractability, diffusions are 'slow' in their exploration of the state space. This is also evident from the research on optimal scaling. Informally, in dimension d, the hybrid Monte Carlo algorithm takes only O.d 1=4 / steps (Beskos et al., 2013) to explore O.1/ distance in the state space, whereas 'diffusive algorithms' such as the random-walk Metropolis and the Langevin algorithms respectively take O.d 1=2 / and O.d 1=3 / steps and thus are markedly inferior. This begs the question of how does ScaLE compare with a state of the art Hamiltonian Monte Carlo algorithm and its variants?
The paper is so rich with ideas that it raises more questions than it answers. In the context of big data, is the assumption of independent and identically distributed data necessary for ScaLE to maintain its exactness? For example, is it possible to have an exact version of ScaLE for hierarchical models? Can one use the discretized Langevin diffusion of the target as the underlying stochastic process instead of Brownian motion? Of course, naive discretization schemes (e.g. Euler) for simulating this Langevin diffusion might destroy its ergodicity; however, sophisticated numerical schemes are available, but how about diffusions with jumps for heavy-tailed targets?
The authors do a terrific job of infusing a tremendous amount of technology from other areas of contemporary Markov chain Monte Carlo research into this paper. In particular, it is exciting to see the power of sequential Monte Carlo algorithms in taking the abstract idea of quasi-stationarity to producing a practical implementation and even providing theoretical guarantees. Also noteworthy is the creative use of control variates to obtain exactness in the context of big data. Connecting the efficiency of ScaLE to posterior contraction rates of the target provides a much needed bridge between Markov chain Monte Carlo theory and mathematical statistics and Bayesian non-parametrics. This is only a start, and the ideas that are initiated in this paper will no doubt keep all of us busy for the next several years. I look forward to further progress in this area and congratulate the authors again for their multifaceted contribution. I enthusiastically propose the vote of thanks! I thank Aaron Smith and Zachary Moore for their helpful comments.

Nicolas Chopin (Ecole Nationale de la Statistique et de l'Administration Economique, Institut Polytechnique de Paris)
This paper proposes an algorithm, ScaLE, which relies on a very elegant probabilistic construction. I would like to make a few comments regarding its practicality, and to ask a few questions to the authors regarding potential improvements. Logistic regression is a very popular benchmark in papers concerned with Bayesian computation. This motivated James Ridgway and me (Chopin and Ridgway, 2017) to discuss which algorithms actually perform best on such a model. Some of our findings were as follows: the posterior of a logistic regression is not very challenging to sample from, because it is very Gaussian like: it is easy and cheap to compute a Gaussian approximation of the posterior: below dimension 50, importance sampling (based on a Gaussian proposal) beats every other Monte Carlo method.
Thus, I was a little disappointed to see that four of the five examples in the paper were based on a logistic regression, of respective dimension 2, 4, 4 and 5. I also note that ScaLE seems to depend on how well behaved is the function Δπ=π. But, if the target density has exponential tails (which is the case for a logistic regression), this function seems to be bounded. In contrast, for a Gaussian distribution this function behaves like O.x 2 /. Could this mean that a logistic regression posterior is even more favourable to ScaLE than a Gaussian distribution?
These comments are slightly unfair: although the examples considered seem particularly simple, in terms of the dimension and the regularity of the model, most of them feature a very large sample size (the so-called big n problem). ScaLE manages to 'touch' each data point a very small number of times. This is quite an achievement. If we were to use importance sampling instead, each data point would be touched N times, e.g. N = 10 4 , when N is the Monte Carlo sample size. Importance sampling is trivial to parallelize, and a parallel implementation of importance sampling may still be competitive with ScaLE for a very large data set: note, however, that ScaLE, being a sequential Monte Carlo sampler, is also amenable to parallelization. (In a sequential Monte Carlo sampler, the only step where particles are not processed independently is the resampling step.) There have been quite a few recent attempts at developing new algorithms to deal with the big n problem; see the very good review of . Another way to attack this problem, which seems often overlooked, would be to perform sequential inference (in one way or another) until 'enough' data have been seen, e.g. when the posterior variance of a given parameter falls below a chosen threshold. Such an approach is of course very pragmatic and does not even attempt to approximate the posterior.
One big advantage of ScaLE is that it is an exact algorithm. Note, however, the double asymptotics: convergence occurs when both N (the number of particles in the sequential Monte Carlo algorithm) and t (the time when you stop the algorithm) go to ∞. Contrast with Markov chain Monte Carlo algorithms, which converge as the number of samples goes to ∞. Since convergence in t seems fast, could we get rid of the bias that is incurred by stopping at time t by using a debiasing techniqueà la Glynn and Rhee (2014)?
Regarding the sequential Monte Carlo algorithm, I would like to make two comments: first, such algorithms provide estimates of the normalizing constants of the underlying Feynman-Kac model; in ScaLE, I believe that these constants equal the probability of survival of the simulated process until time t. Could this quantity be used in some way, e.g. to assess convergence in t?
Second, the authors mention the possibility of using better proposal distributions. Could they elaborate on that point? It seems difficult to construct proposal distributions that would push the particles towards the support of the target π, under the constraint that a small number of data points may be accessed at a given iteration.
To sum up, I think that it is fair to say that more work is required to make ScaLE work for a larger, more realistic class of problems. Still, I remain impressed by what the authors have achieved with this paper, and thus I am more than happy to second the vote of thanks.
The vote of thanks was passed by acclamation. In the traditional study of Markov chains, one begins with a chain, defined, say, by a transition matrix, and goes on to study its resulting properties, such as the (non-)existence of an invariant distribution. The historical development of Markov chain Monte Carlo methods 'inverted' this procedure: given a target distribution π, instead one is interested in constructing a chain with the given π as its invariant distribution.
In a similar vein, the notion of quasi-stationary distributions (QSDs) of killed Markov processes has existed in the probability literature for some time (see, for instance, ): π is quasi-stationary for a Markov process X which is killed at time ζ, if for X 0 ∼ π, for all measurable sets A and t 0, P π .X t ∈ A|ζ > t/ = π.A/: QSDs have found various applications, such as the study of ecology and population processes (Meleard and Villemonais, 2012), where the killing time ζ truly corresponds to the extinction time of a stochastic population. However, now, for the first time, we are interested in constructing a killed Markov process whose QSD coincides with a given distribution π. For a detailed review of this connection, see chapter 1 of Wang (2020).
I now present two results which extend and complement the theory given in the present paper. Their proofs can be found in Wang et al. (2019).
The authors' theorem 1 gives sufficient conditions for a killed Brownian motion to have π as its QSD. We extend this result to more general reversible diffusions and have removed their technical tail condition found in Appendix A.
Theorem 5. Consider the diffusion dX t − ∇A.X t /dt + dW t , .61/ with killing rate where K > 0 is assumed to be chosen so that κ 0. Provided that and basic regularity conditions on A hold, the diffusion (61) killed at rate (62) has π as its QSD.
Since we have allowed for non-zero drift in diffusion (61) this potentially enables us to select a drift function to optimize convergence.
Secondly, when implementing quasi-stationary Monte Carlo methods the rate of convergence is a natural question to pose. For a full statement of the following theorem, see Wang et al. (2019).
Theorem 6. Assume the same conditions as theorem 5. Consider the diffusion dZ t = 1 2 ∇ log π 2 exp.2A/ .Z t /dt + dW t : . 63/ Then the spectral gap of the generator of Z coincides with the spectral gap of the generator of diffusion (61) killed at rate (62).
Heuristically this says that the killed diffusion (61) killed at rate (62) converges to quasi-stationarity at the same rate as diffusion (63) converging to stationarity. I hope that these results will facilitate the construction of novel quasi-stationary Monte Carlo methods going forwards.

Hector McKimm (University of Warwick, Coventry) and Andi Q. Wang (University of Bristol)
We thank Pollock, Fearnhead, Johansen and Roberts for their fascinating work, whose focus on the quasistationary distribution of a Markov process marks a significant break from more traditional Markov chain Monte Carlo methods relying on the stationary distribution of a Markov chain. In particular, we congratulate them for creating a method which is practicable in the big data setting while remaining exact, achieved through their use of subsampling and control variate techniques.
The quasi-stationary Monte Carlo (QSMC) method detailed in the paper relies on simulating an inhomogeneous Poisson process with rate where φ is defined in Section 2 and Φ is a constant chosen to ensure that κ 0. This methodology is similar to that used for generating a class of processes called 'Restore' (Wang et al., 2020): a recently studied class of Markov process applicable to Monte Carlo sampling. A Restore process X is obtained by enriching an underlying continuous time Markov process Y with regenerations from a distribution μ at rate whereκ is a function depending on π and Y , and the constant C is chosen such that κ 0. At the arrival time of the Poisson process with rate κ .x/, instead of being killed, the process regenerates from distribution μ. This choice of regeneration rate implies that the process has invariant distribution π. The event rate (65) can be seen as a generalization of rate (64). When the local process Y in Restore is a Brownian motion,κ coincides with φ. The constant −Φ in expression (64) can also be interpreted as the constant C in expression (65) in the special case μ = π. Indeed, in light of results such as lemma 4.3 of Benaïm et al. (2018) and proposition 2.12 of Wang et al. (2020b), we know that this special case precisely corresponds to when π is quasi-stationary, as in expression (64).
Restore is also comparable with the QSMC method known as ReScaLE, studied in Kumar (2018). In ReScaLE, at the arrival times of a Poisson process with rate also given by expression (64), the process location is redrawn from its empirical occupation measure. Algorithmically this is difficult to do; complex constructions using diffusion bridges were used in Kumar (2018), and there are severe memory constraints. Restore avoids these issues entirely as regenerations are from a fixed distribution μ.
However, subsampling methods are as yet unavailable for the Restore sampler. Indeed, the strength of QSMC methods is their suitability to tall data. The examples that are considered in this work illustrate, in the context of logistic regression, the sampler's ability to capture the distribution of interest accurately while making the equivalent of only a single full data evaluation. By contrast, the Restore sampler is more suited to sampling from multimodal target distributions and potentially in higher dimensions.

David Steinsaltz (University of Oxford) and Andi Q. Wang (University of Bristol)
We congratulate Pollock, Fearnhead, Johansen and Roberts on their novel contribution to the Markov chain Monte Carlo toolbox. What they offer is not simply a new algorithm, but a new paradigm that has already inspired, and will continue to inspire, a broad range of new thinking on the possibilities of Monte Carlo techniques. To illustrate this, we describe here one recently proposed alternative quasi-stationary Monte Carlo method that is derived from the ScaLE algorithm.
One of the practical problems that the authors address is how to sample efficiently from the quasistationary distribution (QSD) of a killed diffusion. They solve this problem through a continuous time sequential Monte Carlo approach, based on weighted particle systems. This approach goes back to the work of Burdzy et al. (2000) and belongs to the class of 'Fleming-Viot' systems. In the context of simulating QSDs, these methods are discussed in Villemonais (2011) and .
However, there is another broad approach to the simulation of QSDs based on stochastic approximation and reinforced processes. Heuristically, the idea is to run a single copy of the killed process forwards in time until the first killing event. At that time, the process is instantaneously reborn at a point in the space, drawn from the empirical occupation measure of the process. The process continues to evolve; each time it is killed and then reborn from its occupation measure.
This alternative approach can be traced back to Aldous et al. (1988) where it was studied on finite state spaces in discrete time and shown to converge to the QSD. Since then it has been studied in a variety of settings: see , Benaïm et al. (2018Benaïm et al. ( , 2019 Benaïm and Cloez (2015), Wang et al. (2020b) and Mailler and Villemonais (2020).
When this stochastic approximation approach is applied to the QSMC setting, we obtain a new QSMC method, dubbed ReScaLE (regenerating ScaLE), studied mathematically in Wang et al. (2020b) and practically in Kumar (2018), who demonstrated that subsampling can still be applied and hence ReScaLE boasts a performance that is competitive with that of ScaLE.
Unlike ScaLE, rather than simulating a cloud of weighted particles, we have only a single Brownian particle. This is conceptually much simpler, avoiding the difficulties that are associated with interacting particle systems, and algorithmically more transparent.
However ReScaLE faces a new computational bottleneck: in the implementation of Kumar (2018) it is necessary to keep track of previous locations of the particle carefully, which can lead to memory issues once the process has been run for a substantial amount of time.

Daniel Rudolf (Universität Göttingen) and Andi Wang (University of Bristol)
We congratulate Pollock, Fearnhead, Johansen and Roberts for this extraordinary contribution. We are pleased to see that the study of quasi-stationarity, which has long been investigated in the probability literature, is now receiving attention in statistics. The numerical experiments conducted in the paper are very promising, in particular with the scaling of the data size. It would appear, however, that the algorithms presented are quite delicate to implement and the choice of killing rate is crucial. Thus, one might ask whether or not using an approximate killing rate will have a significant effect. This is closely related to the perturbation theory of Markov chains; for details and recent advances see Hosseini and Johndrow (2019), Medina-Aguayo et al. (2020), Mitrophanov (2005) and Rudolf and Schweizer (2018) and the references therein. In contrast, however, the present situation is more subtle, since we must work with conditional distributions and quasi-stationary distributions. We state a preliminary result for discrete time Markov chains.
We assume that we have access to an approximate version of k: a functionk: S 0 → [0, 1]. Let .X n / n∈N 0 denote the Markov chain (constructed analogously to above), which corresponds to .Q,k/.
Assume that X 0 =X 0 = x ∈ S 0 , indicated by P x .
Writing ' · TV ' for total variation distance, we have the following theorem.
Theorem 7. Under assumption 1, P x .X n ∈ ·|X n ∈ S 0 / − P x .X n ∈ ·|X n ∈ S 0 / TV k −k ∞ K.x/ min{n, |α −α| −1 }, where K.x/ := 2 sup z∈S 0 S 0c u .x/c u .x/ min{c l .x/,c l .x/} Q.z, dx/: For given n having k −k ∞ sufficiently small compared with n −1 , say for example k −k ∞ n −2 , then the right-hand side of the estimate in theorem 7 is of order n −1 . Dai (Alan Turing Institute, London, and University of Essex, Colchester) We congratulate Pollock, Fearnhead, Johansen and Roberts for a stimulating and interesting paper which introduces a new class of algorithms that differs from traditional Markov chain Monte Carlo (MCMC) methods, in that the approach is based on the quasi-stationary distribution of an appropriately constructed diffusion process. A particularly impressive contribution of the method proposed is that it can be applied in big data contexts while remaining exact by adopting a subsampling approach. Interestingly, ScaLE has connections to another method for tackling large data in the Bayesian framework, namely the Monte Carlo fusion algorithm proposed by Dai et al. (2019). In particular, both algorithms utilize methodology for the exact simulation of diffusions  and use the Langevin diffusion in their mathematical construction (although it is not explicitly used in ScaLE). Further, the Monte Carlo fusion algorithm uses the function φ : R d → R (which is defined in Section 2). However, the use of subsampling ideas was not explored in Dai et al. (2019). The unbiased estimators for φ outlined in Section 4 in this paper is a contribution which might he employed in the Monte Carlo fusion algorithm.

Ryan Chan and Hongsheng
Divide-and-conquer methods (for instance, , , Neiswanger et al. (2013) and Dai et al. (2019)) have been proposed to adapt MCMC algorithms for reducing the computational cost of the algorithm. In these approaches, the data set is split into disjoint subsets and then standard MCMC methods are used for each subset. Inference is then combined into a single inference. In this framework, the target is of the form where each subposterior f c .x/ is a density (up to a multiplicative constant) representing one of the C distributed inferences that we wish to unify. As noted in Section 1 of the paper, the primary weakness of these methods thus far is that the recombination of the separately conducted inferences is inexact and involves some approximation of the subposteriors. However, the Monte Carlo fusion algorithm is exact and is the first exact fusion inference method that enables perfect sampling from expression (66). This is achieved by constructing a rejection sampling scheme on an extended space with the main difficulty being computing an intractable acceptance probability which requires the auxiliary simulation of collections of Brownian bridges. An advantage of the fusion approach is that it can be conducted in a distributed setting and one can exploit large clusters of computing cores. In contrast, the quasi-stationary Monte Carlo algorithm detailed in this paper and more traditional MCMC methods are single-core algorithms. It would be interesting to see whether the authors have any ideas for parallel implementations of ScaLE in the future.

Kengo Kamatani (Osaka University) and Joris Bierkens and Sebastiano Grazzi (Technische Universiteit Delft)
We highly appreciate the new Monte Carlo strategy using killed diffusions. Much remains to be done in this new direction with respect to practical and implementational aspects. For example there are many tuning parameters for the method such as the terminal time, the number of subsampling splittings, the choice of hypercubes and the number of particles. The purpose of this comment is to consider robustness with respect to dimension.
Pollock, Fearnhead, Johansen and Roberts mainly considered large data set scenarios and concluded that the method proposed has an attractive scaling property (described in Section 5). The techniques used are reminiscent of those used for piecewise deterministic Monte Carlo (PDMC) methods (Fearnhead et al., 2018). Among PDMC methods, the recent Boomerang sampler (Bierkens et al., 2020) was devised to tackle high dimension scenarios also and is our inspiration for this discussion. In the high dimensional setting, we expect ScaLE to have difficulties. For instance, when the target distribution is a d-dimensional standard normal distribution, the killing rate increases with order d, with expected killing time being of order 1=d. The application of sequential Monte Carlo methods does not completely solve the issue.
In a high dimensional (or infinite dimensional) setting, a Gaussian reference measure arises naturally in several applications such as Gaussian process regression, spatial statistics, Bayesian inverse problems and many other areas. Therefore it is sensible to construct Monte Carlo methods based on Gaussian reference measures. This is obtained for example by the preconditioned Crank-Nicolson scheme (Cotter et al., 2013) for discrete time Markov chain Monte Carlo methods and by the Boomerang sampler (Bierkens et al., 2020) for PDMC methods.
ln similar spirit, it will be of considerable interest to extend ScaLE to quasi-stationary Monte Carlo methods which use mean reverting reference processes and take advantage of their ergodic stationary distributions. In the present study, the Wiener process has been used as a reference which has the Lebesgue measure as invariant measure. In contrast, the Ornstein-Uhlenbeck process has an invariant Gaussian distribution. If the target distribution is well approximated by the Gaussian distribution (which is often so in the high dimensional setting), using the Ornstein-Uhlenbeck process instead of the Wiener process in ScaLE could potentially lead to better performance and scaling properties in high dimension. We expect that the simulation techniques detailed in the paper could in principle be extended to this class of linear processes.
The following contributions were received in writing after the meeting.
Jere Koskela (University of Warwick, Coventry) and Andi Q. Wang (University of Bristol) This paper is a stimulating contribution to the emerging field of quasi-stationary Monte Carlo (QSMC) methods. Our aim is to draw connections between ScaLE and other existing QSMC approaches, and to highlight the extent to which such methods remain uncharted. The performance of the ScaLE algorithm provides ample motivation for further investigation of the QSMC class and extensions.
QSMC methods are constructed so that a target distribution π is the quasi-stationary distribution of a killed Markov process. ScaLE simulates approximately from the quasi-stationary distribution by using sequential Monte Carlo methods. This has inspired alternative approaches, focusing on path-based regenerative schemes, in which sample trajectories are stochastically killed with a state-dependent rate κ.x/ as in ScaLE but are then restarted from a state-dependent family of regeneration densities μ x .z/. There is considerable flexibility in choosing κ and μ x : Wang et al. (2020a) considered the case where μ x ≡ μ and κ ∝ μ=π (along with generalizations), whereas Kumar (2018) and Wang et al. (2020b) have introduced the non-Markovian QSMC method ReScaLE, where, in an abuse of notation, μ x is the empirical measure of the entire killed path.
A general condition for Markovian, π-invariant pairs .κ, μ x / is κ.x/π.x/ = Q Å π.x/ + Ω π.z/κ.z/μ z .x/dz, .67/ where Ω is the appropriate domain, and Q Å is the adjoint of the generator of the motion of sample paths before they are killed, which need not be π invariant. When Q Å is π invariant, equation (67) is solved by κ ∝ 1=π, and any doubly stochastic family {μ x .z/} x, z∈Ω . To our knowledge this family of solutions has not been identified previously, and little is known about the class of solutions beyond the special cases mentioned above. The family of doubly stochastic solutions has the appealing feature that the whole state need not be regenerated simultaneously. For example, a randomly sampled co-ordinate of a vector-valued process can be regenerated from an a priori fixed, doubly stochastic, scalar density, whereas the remaining coordinates remain unchanged. Both sequential Monte Carlo and regeneration-based methods are difficult to generalize to high dimension, and the ability to use lower dimensional updates provides a promising tool for the endeavour.

Tze Leung Lai and Huanzhong Xu (Stanford University)
Given a target density π on R d , quasi-stationary Monte Carlo (QSMC) sampling replaces rejection in the traditional Metropolis-Hastings acceptance-rejection iterations by terminating (or killing) the iterations, so that the long-term behaviour of the simulation process is described by the quasi-stationary distribution, which is the limiting conditional distribution given that the process is still alive. The basic idea underlying QSMC sampling is to construct a Markov process whose quasi-stationary distribution is the target distribution, and equation (3) and theorem 1 of the paper demonstrate that this can indeed be achieved for the case of a d-dimensional Brownian motion with drift. The authors note that the regularity conditions that are used to prove theorem 1 in Appendix A are 'common in stochastic calculus', for which they cite  and to which we supplement with Pitman and Yor (2018). Sections 3 and 4 of the paper provide important insights, details and historical background for implementing QSMC methods, in particular, embedding subsampling within QSMC sampling leads to the ScaLE (scalable Langevin exact) algorithm whose computational complexity is described in Section 5. The sequential Monte Carlo approach provides the basic building blocks and is described in Section 6 where it is shown how standard sequential Monte Carlo algorithms can be applied to ScaLE, for which a central limit theorem (proposition 2) is derived. Illustrative data sets and software are given in Section 7, whereas Section 8 discusses the interplay between QSMC sampling and machine learning, new directions and open problems in QSMC sampling. We would like to add one such direction or problem, namely, the connection between QSMC and quasi-Monte-Carlo Metropolis-Hastings algorithms, first introduced by Owen and Tribble (2005).

Paul Vanetti and Arnaud Doucet (University of Oxford)
ScaLE is an interesting Monte Carlo scheme preserving exactness under subsampling and is thus relevant to large data applications. Piecewise deterministic Markov chain Monte Carlo schemes also demonstrate similar features Bouchard-Côté et al., 2018); however, these continuous time algorithms can be difficult to understand and implement. A discrete time alternative is the scalable Metropolis-Hastings (SMH) algorithm (Cornish et al., 2019); a subsampling Markov chain Monte Carlo method whose invariant distribution is also the true posterior.
Like ScaLE, the SMH algorithm makes use of control variate ideas, leveraging a Taylor series approximation to the log-posterior density at the maximum likelihood estimate. Like the Metropolis-Hastings (MH) algorithm, the SMH algorithm relies on a proposal q but uses the acceptance probability α.x, x / = min π.x /q.x|x / π.x/q.x |x/ n i=1 min 1, where f i .x/ is the likelihood from observation i,f i .x/ its control variate approximation andπ.
x/ = f 0 .x/Π n i=1f i .x/. Naively simulating an event of probability α.x, x / admits an O.n/ cost. However, as with ScaLE, this can be improved if the derivatives of the per-datum terms are uniformly bounded and if the posterior concentrates. When using a first-order Taylor series approximation forπ, the SMH algorithm accesses O.1/ data at each iteration and O.n −1=2 / when using a second-order approximation (Cornish et al., 2019). We tested the SMH algorithm on the airline and heterogeneous data sets using a second-order Taylor series approximation centred at the maximum likelihood estimate given in the text: see Table 2 for the results of these experiments. We used independent proposals drawn from the Gaussian distribution corresponding to this Taylor series approximation of the log-posterior.
For the airline data, the SMH algorithm demonstrates a notable improvement over the MH algorithm, visiting only 0.34 data on average per iteration. As for the heterogeneous data set, it was necessary to use more finely grained bounds than those described in Cornish et al. (2019) by considering each of the d 3 partial derivatives: see Vanetti (2020) for details. Our implementation generated approximately 3:0 × 10 5 effective samples per second and it visited only 0.015 data per iteration on average.
Although the SMH algorithm significantly outperforms the MH algorithm on these data sets, Monte Carlo methods are of limited interest here. Over 99.9% of the proposals sampled from our independent normal proposal distribution were accepted, suggesting that the posterior distribution is very close to its Laplace approximation in both cases.
Despite their current limitations, it is our hope that subsampling Monte Carlo methods such as ScaLE and the SMH algorithm might form a basis for new algorithms to perform Bayesian inference in more challenging settings, such as random-effect models.
The authors replied later, in writing, as follows.
We are grateful to the discussants for their many insightful and stimulating contributions, and we are very pleased to see quasi-stationary Monte Carlo (QSMC) methods and the ScaLE algorithm inspiring many new directions in scalable Monte Carlo methods for Bayesian inference.
Many discussants ask about alternatives to the ScaLE implementation of QSMC sampling. Indeed Wang's contribution points to work which describes exactly when the approach can be generalized to more general diffusion proposals. Outside the diffusion context, much less is currently known although this seems a promising direction of research especially if the proposal dynamics are easy to simulate. As an example of this approach, Kamatani, Bierkens and Grazzi suggest using piecewise deterministic Markov processes (PDMPs) for this purpose and in particular the Boomerang algorithm as a proposal distribution.
Chopin asks about implementations which can avoid the double asymptotics of sequential Monte Carlo (SMC) methods, whereas Steinsaltz and Wang give one solution, describing the ReScaLE procedure: a single trajectory non-Markov stochastic process which can achieve the quasi-stationary limit. Moreover the ReStoRE algorithm as described by McKimm and Wang is a completely new type of non-reversible Markov chain Monte Carlo algorithm inspired by the ReScaLE algorithm. As McKimm and Wang point out, ReStoRE has considerable potential for high dimensional Bayesian computation problems, although currently it cannot be implemented in a principled way using subsampling and so is less scalable to large data contexts than are the QSMC methods ScaLE and ReSCaLE. Chopin also suggests the ingenious Rhee and Glynn approach to debias the algorithm output. This approach relies crucially on the use of practical and effective coupling techniques. It seems plausible to use this approach to debias in time (Agapion et al., 2018;Jacob et al., 2020); more speculatively, it might be possible to adapt ideas that are used in the context of multilevel sequential Monte Carlo (SMC) methods such as Jasry et al. (2020) to eliminate further the bias arising from the finite sample size, although this would not be trivial and the cost is likely to be prohibitive.
We are impressed by the computational efficiency that is obtained in the subsampled Metropolis-Hastings approach described in the contribution by Vanetti and Doucet. There are other methods which can achieve subsampled iteration complexity O.1/ without compromising on exactness, including the class of PDMPs (Bouchard-Côté et al., 2018;Bierkans et al., 2019), which includes the bouncy particle and ZigZag samplers and their elaborations. Although all these methods can effectively use subsampling with control variates, this comes at the cost of slower Markov chain Monte Carlo mixing. One currently unique feature of R-ScaLE (which we found to be outperformed by ScaLE) is that the statistical properties of the subsampled algorithm are identical with the algorithm without subsampling, so the convergence properties of the QSMC algorithm do not deteriorate as a result. This property is not shared by the subsampled PDMP or scalable Metropolis-Hastings methods, although the control variate constructions used in Metropolis-Hastings methods mitigate the deterioration of mixing caused by subsampling. Lai and Xu remind us of the ingenious quasi-MCMC algorithms that were introduced by Owen and Tribble, although note that the use of 'quasi' in that work refers to quasi-random variables rather than the concept of quasi-stationarity which underpins our paper.
There is no intrinsic reason why QSMC methods should not be applicable to problems in higher dimension than the modest examples in the paper, although the specific ScaLE implementation that we concentrate on in the paper is limited by the complexity of the layered Brownian motion constructions involved. Thus the examples in our paper are all low dimensional as noted by Chopin, and we do not claim that ScaLE can be competitive in high dimensions with algorithms such as the Hamiltonian Monte Carlo algorithm as asked by Pillai.
The focus of the present paper has been mostly on the construction of exact and scalable methods. Of course this comes at a significant computational cost, including the need to provide bespoke bounds on the φ-function and to calibrate tuning parameters such as T as noted by Pillai. One value of the construction of exact methods is that it allows many natural approximations just by omitting or approximating certain algorithmic steps. Therefore, since these algorithms can be understood purely algorithmically, it is often possible to obtain a mathematical understanding of the biases that are introduced in the approximation. In the excellent contribution by Rudolf and Wang, first steps are taken to quantify biases of perturbed quasi-stationary processes. As they note, approximations based around truncation of the killing rate are very natural for simplifying ScaLE and we look forward to future work which can better understand the effects of such truncations.
Chopin and Pillai each touch on convergence diagnostics. The quality of the particle approximation of the law of the process which it approximates can be assessed by the usual SMC tools, such as effective sample size. Assessing the proximity of the finite time marginal law of the process to the quasi-stationary distribution is less standard. In principle the normalizing constant can indeed be estimated from the simulation as suggested and does correspond to the survival probability. However, it is not clear how to use this directly to assess convergence; it might be feasible to exploit the rate of change of the normalizing constant estimate together with the close connection between the killing rate and the Laplacian of the target density.
The possibility of parallelizing ScaLE was mentioned several times. The use of SMC methods means that the only synchronous operation is resampling and so parallel implementation would be straightforward and could indeed substantially accelerate the algorithm. Use of resampling algorithms optimized for parallel or distributed contexts such as Whiteley et al. (2016) could further improve performance, as could exploring mean field particle approximations in continuous time  in which selection operations are not necessarily synchronous.
Pillai asks about how ScaLE might behave on an improper target. In this case, the killed Brownian motion cannot have a quasi-stationary limit. Seneta's classic book on stationarity (Seneta, 2006) talks about extensions of the concepts of transience and null recurrence which we conjecture may be relevant here.
We conclude by thanking again the discussants. We see this paper as the first steps towards establishing QSMC sampling as a viable methodolog for big data problems, and the questions raised will inspire many directions for continuing research in this area.