High dimensional change point estimation via sparse projection

Change points are a very common feature of 'big data' that arrive in the form of a data stream. We study high-dimensional time series in which, at certain time points, the mean structure changes in a sparse subset of the coordinates. The challenge is to borrow strength across the coordinates to detect smaller changes than could be observed in any individual component series. We propose a two-stage procedure called inspect for estimation of the change points: first, we argue that a good projection direction can be obtained as the leading left singular vector of the matrix that solves a convex optimization problem derived from the cumulative sum transformation of the time series. We then apply an existing univariate change point estimation algorithm to the projected series. Our theory provides strong guarantees on both the number of estimated change points and the rates of convergence of their locations, and our numerical studies validate its highly competitive empirical performance for a wide range of data-generating mechanisms. Software implementing the methodology is available in the R package InspectChangepoint.


Introduction
One of the most commonly-encountered issues with Big Data is heterogeneity. When collecting vast quantities of data, it is usually unrealistic to expect that stylised, traditional statistical models of independent and identically distributed observations can adequately capture the complexity of the underlying data generating mechanism. Departures from such models may take many forms, including missing data, correlated errors and data combined from multiple sources, to mention just a few.
When data are collected over time, heterogeneity often manifests itself through nonstationarity, where the data generating mechanism varies with time. Perhaps the simplest form of non-stationarity assumes that population changes occur at a relatively small number of discrete time points. If correctly estimated, these 'changepoints' can be used to partition the original data set into shorter segments, which can then be analysed using methods designed for stationary time series. Moreover, the locations of these changepoints are often themselves of significant practical interest.
In this paper, we study high-dimensional time series that may have changepoints; moreover, we consider in particular settings where at a changepoint, the mean structure changes in a sparse subset of the coordinates. Despite their simplicity, such models are of great interest in a wide variety of applications. For instance, in the case of stock price data, it may well be the case that stocks in related industry sectors experience virtually simultaneous 'shocks' (Chen and Gupta, 1997). In internet security monitoring, a sudden change in traffic at multiple routers may be an indication of a distributed denial of service attack (Peng, Leckie and Ramamohanarao, 2004). In functional Magnetic Resonance Imaging (fMRI) studies, a rapid change in blood oxygen level dependent (BOLD) contrast in a subset of voxels may suggest neurological activity of interest (Aston and Kirch, 2012).
Our main contribution is to propose a new method for estimating the number and locations of the changepoints in such high-dimensional time series, a challenging task in the absence of knowledge of the coordinates that undergo a change. In brief, we first seek a good projection direction, which should ideally be closely aligned with the vector of mean changes. We can then apply an existing univariate changepoint estimation algorithm to the projected series. For this reason, we call our algorithm inspect, short for informative sparse projection for estimation of changepoints; it is implemented in the R package InspectChangepoint.
In more detail, in the single changepoint case, our first observation is that at the population level, the vector of mean changes is the leading left singular vector of the matrix obtained as the cumulative sum (CUSUM) transformation of the mean matrix of the time series. This motivates us to begin by applying the CUSUM transformation to the time series. Unfortunately, computing the k-sparse leading left singular vector of a matrix is a combinatorial optimisation problem, but nevertheless, we are able to formulate an appropriate convex relaxation of the problem, from which we derive our projection direction. At the second stage of our algorithm, we compute the vector of CUSUM statistics for the projected series, identifying a changepoint if the maximum absolute value of this vector is sufficiently large. For the case of multiple changepoints, we combine our single changepoint algorithm with the method of Wild Binary Segmentation (Fryzlewicz, 2014) to identify changepoints recursively.
A brief illustration of the inspect algorithm in action is given in Figure 1. Here, we simulated a 2000 × 1000 data matrix having independent normal columns with identity covariance and with three changepoints in the mean structure at locations 500, 1000 and 1500. Changes occur in 40 coordinates, where consecutive changepoints overlap in half of their coordinates, and the squared ℓ_2 norms of the vectors of mean changes were 0.4, 0.9 and 1.6 respectively. The top-left panel shows the original data matrix and the top-right shows its CUSUM transformation, while the bottom-left panel shows overlays for the three detected changepoints of the univariate CUSUM statistics after projection. Finally, the bottom-right panel displays the largest absolute values of the projected CUSUM statistics obtained by running the wild binary segmentation algorithm to completion (in practice, we would apply a termination criterion instead, but this is still helpful for illustrative purposes). We see that the three detected changepoints are very close to their true locations, and it is only for these three locations that we obtain a sufficiently large CUSUM statistic to declare a changepoint. We emphasise that our focus here is on the so-called offline version of the changepoint estimation problem, where we observe the whole data set before seeking to locate changepoints. The corresponding online problem, where one aims to declare a changepoint as soon as possible after it has occurred, is also of great interest (Tartakovsky, Nikiforov and Basseville, 2014), but is beyond the scope of the current work.

Figure 1: Example of the inspect algorithm in action. Top-left: visualisation of the data matrix. Top-right: its CUSUM transformation. Bottom-left: overlay of the projected CUSUM statistics for the three changepoints detected. Bottom-right: visualisation of thresholding; the three detected changepoints are above the threshold (dotted red line), whereas the remaining numbers are the test statistics obtained if we run wild binary segmentation to completion without applying a termination criterion.
Our theoretical development proceeds first by controlling the angle between the estimated projection direction and the optimal direction, which is given by the normalised vector of mean changes. Under appropriate conditions, this enables us to provide finite-sample bounds which guarantee that with high probability we both recover the correct number of changepoints, and estimate their locations to within a specified accuracy. Indeed, in the single changepoint case, the rate of convergence for the changepoint location estimation of our method is within a doubly logarithmic factor of the minimax optimal rate. Our extensive numerical studies indicate that the algorithm performs extremely well in a wide variety of settings.
The study of changepoint problems dates back at least to Page (1955), and has since found applications in many different areas, including genetics (Olshen et al., 2004), disease outbreak detection (Sparks, Keighley and Muscatello, 2010) and aerospace engineering (Henry, Simani and Patton, 2010), in addition to those already mentioned. There is a vast and rapidly growing literature on methods for changepoint detection and localisation, especially for the univariate problem. Surveys of various methods can be found in Csörgő and Horváth (1997) and Horváth and Rice (2014). In the case of univariate changepoint estimation, state-of-the-art methods include the Pruned Exact Linear Time (PELT) method (Killick, Fearnhead and Eckley, 2012), Wild Binary Segmentation (WBS) (Fryzlewicz, 2014) and the Simultaneous Multiscale Changepoint Estimator (SMUCE) (Frick, Munk and Sieling, 2014).
Some of these univariate changepoint methodologies have been extended to multivariate settings. Examples include Horváth, Kokoszka and Steinebach (1999), Ombao, Von Sachs and Guo (2005), Aue et al. (2009) and Kirch, Muhsal and Ombao (2014). However, there are fewer available tools for high-dimensional changepoint problems, where both the dimension p and the length n of the data stream may be large, and where we may allow a sparsity assumption on the coordinates of change. Bai (2010) investigates the performance of the least squares estimator of a single changepoint in the high-dimensional setting. Zhang et al. (2010), Horváth and Hušková (2012) and Enikeeva and Harchaoui (2014) consider estimators based on ℓ_2 aggregations of CUSUM statistics in all coordinates, but without using any sparsity assumptions. Enikeeva and Harchaoui (2014) also consider a scan statistic that takes sparsity into account. Jirak (2015) considers an ℓ_∞ aggregation of the CUSUM statistics that works well for sparse changepoints. Cho and Fryzlewicz (2015) propose Sparse Binary Segmentation, which also takes sparsity into account and can be viewed as a hard-thresholding of the CUSUM matrix followed by an ℓ_1 aggregation. Cho (2016) proposes a double-CUSUM algorithm that performs a CUSUM transformation along the location axis on the columnwise-sorted CUSUM matrix. In a slightly different setting, Lavielle and Teyssiere (2006), Aue et al. (2009), Bücher et al. (2014), Preuß et al. (2015) and Cribben and Yu (2015) deal with changes in cross-covariance, while Soh and Chandrasekaran (2017) study a high-dimensional changepoint problem where all mean vectors are sparse. Aston and Kirch (2014) considered the asymptotic efficiency of detecting a single changepoint in a high-dimensional setting, and the oracle projection-based estimator under a cross-sectional dependence structure.
The outline of the rest of the paper is as follows. In Section 2, we give a formal description of the problem and the class of data generating mechanisms under which our theoretical results hold. Our methodological development in the single changepoint setting is presented in Section 3, and includes theoretical guarantees on both the projection direction and location of the estimated changepoint in the simplest case of observations that are independent across both space and time. Section 4 extends these ideas to the case of multiple changepoints with the aid of Wild Binary Segmentation, and our numerical studies are given in Section 5. Section 6 studies in detail important cases of temporal and spatial dependence. For temporal dependence, no change to our methodology is required, but new arguments are needed to provide theoretical guarantees; for spatial dependence, we show how to modify our methodology to try to maximise the signal to noise ratio of the projected univariate series, and also provide corresponding theoretical results on the performance of this variant of the basic inspect algorithm. Proofs of our main results are given in Section 7; additional results and their proofs are given in the online supplementary material, hereafter referred to simply as the online supplement.
We conclude this section by introducing some notation used throughout the paper. For a vector u = (u_1, . . . , u_M)^T ∈ R^M, a matrix A = (A_{ij}) ∈ R^{M×N} and q ∈ [1, ∞), we write ‖u‖_q := (Σ_{i=1}^M |u_i|^q)^{1/q} and ‖A‖_q := (Σ_{i=1}^M Σ_{j=1}^N |A_{ij}|^q)^{1/q} for their (entrywise) ℓ_q-norms, as well as ‖u‖_∞ := max_{i=1,...,M} |u_i| and ‖A‖_∞ := max_{i=1,...,M, j=1,...,N} |A_{ij}|. We write ‖A‖_* := Σ_{i=1}^{min(M,N)} σ_i(A) and ‖A‖_op := max_i σ_i(A) respectively for the nuclear norm and operator norm of a matrix A, where σ_1(A), . . . , σ_{min(M,N)}(A) are its singular values. For S ⊆ {1, . . . , M} and T ⊆ {1, . . . , N}, we write u_S := (u_i : i ∈ S) and write A_{S,T} for the |S| × |T| submatrix of A obtained by extracting the rows and columns with indices in S and T respectively. For two matrices A, B ∈ R^{M×N}, we denote their trace inner product by ⟨A, B⟩ := tr(A^T B). For two non-zero vectors u, v ∈ R^p, we write ∠(u, v) := cos^{−1}(|⟨u, v⟩|/(‖u‖_2 ‖v‖_2)) for the acute angle between them. We let S^{p−1} := {x ∈ R^p : ‖x‖_2 = 1} be the unit Euclidean sphere in R^p, and let S^{p−1}(k) := {x ∈ S^{p−1} : ‖x‖_0 ≤ k}, where ‖x‖_0 denotes the number of non-zero components of x. Finally, we write a_n ≍ b_n to mean 0 < lim inf_{n→∞} |a_n/b_n| ≤ lim sup_{n→∞} |a_n/b_n| < ∞.

Problem description
We initially study the following basic model: let X_1, . . . , X_n be independent p-dimensional random vectors with

X_t ∼ N_p(μ_t, σ² I_p), t = 1, . . . , n, (1)

and combine the observations into a matrix X = (X_1, . . . , X_n) ∈ R^{p×n}. Extensions to settings of both temporal and spatial dependence will be studied in detail in Section 6. We assume that the mean vectors follow a piecewise-constant structure with ν + 1 segments. In other words, there exist ν changepoints

1 ≤ z_1 < z_2 < · · · < z_ν ≤ n − 1,

where we adopt the convention that z_0 := 0 and z_{ν+1} := n. For i = 1, . . . , ν, write

θ^{(i)} := μ_{z_i + 1} − μ_{z_i}

for the (non-zero) difference in means between consecutive stationary segments. We will later assume that the changes in mean are sparse, in the sense that there exists k ∈ {1, . . . , p} (typically k is much smaller than p) such that

‖θ^{(i)}‖_0 ≤ k for each i = 1, . . . , ν, (4)

since our methodology performs best when aggregating signals spread across an (unknown) sparse subset of coordinates; see also the discussion after Corollary 2 below. However, we remark that our methodology does not require knowledge of the sparsity level and can be applied in non-sparse settings as well.
Our goal is to estimate the set of changepoints {z_1, . . . , z_ν} in the high-dimensional regime, where p may be comparable to, or even larger than, the length n of the series. The signal strength of the estimation problem is determined by the magnitudes of the mean changes {θ^{(i)} : 1 ≤ i ≤ ν} and the lengths of the stationary segments {z_{i+1} − z_i : 0 ≤ i ≤ ν}, whereas the noise is related to the variance σ² and the dimensionality p of the observed data points. For our theoretical results, we will assume that the changepoint locations satisfy

n^{−1} min_{i=0,...,ν} (z_{i+1} − z_i) ≥ τ, (5)

and that the magnitudes of the mean changes are such that

min_{i=1,...,ν} ‖θ^{(i)}‖_2 ≥ ϑ. (6)

Suppose that an estimation procedure outputs ν̂ changepoints located at 1 ≤ ẑ_1 < · · · < ẑ_ν̂ ≤ n − 1. Our finite-sample bounds will imply a rate of convergence for inspect in an asymptotic setting where the problem parameters are allowed to depend on n. Suppose that P_n is a class of distributions of X ∈ R^{p×n} with sample size n. In this context, we follow the convention in the literature and say that the procedure is consistent for P_n with rate of convergence ρ_n if

inf_{P ∈ P_n} P(ν̂ = ν and max_{i=1,...,ν} |ẑ_i − z_i| ≤ nρ_n) → 1 (7)

as n → ∞.
Data-driven projection estimator for a single changepoint

We first consider the problem of estimating a single changepoint (i.e. ν = 1) in a high-dimensional time series dataset X ∈ R^{p×n}. For simplicity, write z := z_1, θ = (θ_1, . . . , θ_p)^T := θ^{(1)} and τ := n^{−1} min{z, n − z}. We seek to aggregate the rows of the data matrix X in an almost optimal way so as to maximise the signal-to-noise ratio, and then to locate the changepoint using a one-dimensional procedure. For any a ∈ S^{p−1}, the projected series a^T X is one-dimensional, with a^T X_t ∼ N(a^T μ_t, σ²).
Hence, the choice a = θ/‖θ‖_2 maximises the magnitude of the difference in means between the two segments. However, θ is typically unknown in practice, so we should seek a projection direction that is close to the oracle projection direction v := θ/‖θ‖_2. Our strategy is to perform a sparse singular value decomposition on the CUSUM transformation of X. The method and limit theory of CUSUM statistics in the univariate case can be traced back to Darling and Erdős (1956). For p ∈ N and n ≥ 2, we define the CUSUM transformation T_{p,n} : R^{p×n} → R^{p×(n−1)} by

[T_{p,n}(M)]_{j,t} := √(t(n−t)/n) { (1/(n−t)) Σ_{r=t+1}^n M_{j,r} − (1/t) Σ_{r=1}^t M_{j,r} }.

To simplify the notation, we will write T for T_{p,n}, since p and n can be inferred from the dimensions of the argument of T. Note also that T reduces to computing the vector of classical one-dimensional CUSUM statistics when p = 1. We write X = μ + W, where μ = (μ_1, . . . , μ_n) ∈ R^{p×n} and W = (W_1, . . . , W_n) is a p × n random matrix with independent N_p(0, σ² I_p) columns. Let T := T(X), A := T(μ) and E := T(W), so by the linearity of the CUSUM transformation we have the decomposition T = A + E. We remark that when σ is known, each |T_{j,t}| is the likelihood ratio statistic for testing the null hypothesis that the jth row of μ is constant against the alternative that the jth row of μ undergoes a single change at time t. Moreover, if the direction v ∈ S^{p−1} of the potential single change at a given time t were known, then the most powerful test of whether or not ϑ = 0 would be based on |(v^T T)_t|. In the single changepoint case, the entries of the matrix A can be computed explicitly:

A_{j,t} = √(t/(n(n−t))) (n − z) θ_j for t ≤ z, and A_{j,t} = √((n−t)/(nt)) z θ_j for t > z.

Hence we can write A = θγ^T, where

γ_t := √(t/(n(n−t))) (n − z) for t ≤ z, and γ_t := √((n−t)/(nt)) z for t > z. (10)

In particular, this implies that the oracle projection direction is the leading left singular vector of the rank 1 matrix A. In the ideal case where k is known, we could in principle let v̂_max,k be a k-sparse leading left singular vector of T, defined by

v̂_max,k ∈ argmax_{ṽ ∈ S^{p−1}(k)} ‖T^T ṽ‖_2, (11)

and it can then be shown using a perturbation argument akin to the Davis-Kahan 'sin θ' theorem (cf. Davis and Kahan (1970); Yu, Wang and Samworth (2015)) that v̂_max,k is a consistent estimator of the oracle projection direction v under mild conditions (see Proposition 8 in the online supplement). However, the optimisation problem in (11) is non-convex and hard to implement. In fact, computing the k-sparse leading left singular vector of a matrix is known to be NP-hard (e.g. Tillmann and Pfetsch (2014)). The naive algorithm that scans through all possible k-subsets of the rows of T has running time exponential in k, which quickly becomes impractical for even moderate sizes of k.
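To make the transformation concrete, here is a minimal NumPy sketch of the CUSUM transformation (the function name is ours; the InspectChangepoint package provides its own implementation):

```python
import numpy as np

def cusum_transform(X):
    """CUSUM transformation T(X) of a p x n data matrix: for t = 1, ..., n-1,
    [T(X)]_{j,t} = sqrt(t(n-t)/n) * (mean of row j over columns t+1..n
                                     minus mean of row j over columns 1..t)."""
    p, n = X.shape
    t = np.arange(1, n)                            # t = 1, ..., n-1
    S = np.cumsum(X, axis=1)                       # partial sums along each row
    left_mean = S[:, :-1] / t                      # (1/t) sum_{r <= t} X_{j,r}
    right_mean = (S[:, -1:] - S[:, :-1]) / (n - t)
    return np.sqrt(t * (n - t) / n) * (right_mean - left_mean)
```

For p = 1 this reduces to the classical univariate CUSUM statistic, and for a noiseless single change in mean the resulting row profile peaks exactly at the changepoint.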
A natural approach to remedy this computational issue is to work with a convex relaxation of the optimisation problem (11) instead. In fact, we can write

max_{ṽ ∈ S^{p−1}(k)} ‖T^T ṽ‖_2 = max { ⟨T, M⟩ : M ∈ R^{p×(n−1)}, ‖M‖_* = 1, rank(M) = 1, M has at most k non-zero rows }.

This motivates us to absorb the rank constraint into the nuclear norm constraint, which we relax from an equality constraint to an inequality constraint in order to make it convex. Furthermore, we can relax the row sparsity constraint to an entrywise ℓ_1-norm penalty. The optimisation problem of finding

M̂ ∈ argmax_{M ∈ S_1} { ⟨T, M⟩ − λ‖M‖_1 }, (13)

where S_1 := {M ∈ R^{p×(n−1)} : ‖M‖_* ≤ 1} and λ > 0 is a tuning parameter to be chosen later, is therefore a convex relaxation of (11). We remark that a similar convex relaxation has appeared in the different context of sparse principal component estimation (d'Aspremont et al., 2007), where the sparse leading left singular vector is also the optimisation target. The convex problem (13) may be solved using the alternating direction method of multipliers (ADMM; see Gabay and Mercier (1976); Boyd et al. (2011)) as in Algorithm 1. More specifically, the optimisation problem in (13) is equivalent to maximising ⟨T, Y⟩ − λ‖Z‖_1 − I_{S_1}(Y) subject to Y = Z, where I_{S_1} is the function that is 0 on S_1 and ∞ on S_1^c. Its augmented Lagrangian is given by

L(Y, Z, R) := ⟨T, Y⟩ − λ‖Z‖_1 − I_{S_1}(Y) − ⟨R, Y − Z⟩ − (1/2)‖Y − Z‖_2²,

with the Lagrange multiplier R being the dual variable. Each iteration of the main loop in Algorithm 1 first performs a primal update by maximising L(Y, Z, R) marginally with respect to Y and then Z, followed by a dual gradient update of R with constant step size. The function Π_{S_1}(·) in Algorithm 1 denotes projection onto the convex set S_1 with respect to the Frobenius norm distance. If A = UDV^T is the singular value decomposition of A ∈ R^{p×(n−1)} with rank(A) = r, where D is a diagonal matrix with diagonal entries d_1, . . . , d_r, then Π_{S_1}(A) = UD̃V^T, where D̃ is a diagonal matrix with diagonal entries d̃_1, . . . , d̃_r such that (d̃_1, . . . , d̃_r)^T is the Euclidean projection of the vector (d_1, . . . , d_r)^T onto the standard (r − 1)-simplex ∆^{r−1} := { (x_1, . . . , x_r)^T ∈ R^r : Σ_{l=1}^r x_l = 1 and x_l ≥ 0 for all l }.
For an efficient algorithm for such simplicial projection, see Chen and Ye (2011). The function soft in Algorithm 1 denotes the entrywise soft-thresholding operator defined by soft(A, λ)_{ij} := sgn(A_{ij}) max{|A_{ij}| − λ, 0} for any λ ≥ 0 and matrix A = (A_{ij}).
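The two primitives used in Algorithm 1 can be sketched as follows (an illustrative NumPy rendering; `project_simplex` uses the standard sorting-based Euclidean simplex projection rather than the Chen and Ye (2011) algorithm, and all function names are ours):

```python
import numpy as np

def soft_threshold(A, lam):
    """Entrywise soft-thresholding: soft(A, lam)_{ij} = sgn(A_ij) max(|A_ij| - lam, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

def project_simplex(d):
    """Euclidean projection of d onto the standard simplex {x : sum x = 1, x >= 0}."""
    u = np.sort(d)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - 1) / k > 0)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(d - theta, 0.0)

def project_nuclear_ball(A):
    """Frobenius projection Pi_{S1}(A) onto {M : ||M||_* <= 1}: when ||A||_* > 1,
    project the singular values onto the simplex, as described in the text."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    if d.sum() <= 1:
        return A
    return (U * project_simplex(d)) @ Vt      # rescale columns of U by projected values
```

The `if d.sum() <= 1` guard handles the case where A already lies in S_1, so no projection is needed.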
Algorithm 1: Pseudo-code for an ADMM algorithm that computes the solution to the optimisation problem (13).
We remark that one may be interested in further relaxing (13) by replacing S_1 with the larger set S_2 := {M ∈ R^{p×(n−1)} : ‖M‖_2 ≤ 1} defined by the entrywise ℓ_2-unit ball. We see from Proposition 9 in the online supplement that the smoothness of S_2 results in a simple dual formulation, which implies that

M̂ := soft(T, λ)/‖soft(T, λ)‖_2 (14)

is the unique optimiser of the primal problem. This soft-thresholding operation is significantly faster than the ADMM algorithm in Algorithm 1. Hence, by enlarging S_1 to S_2, we can significantly speed up the running time of the algorithm in exchange for some loss in statistical efficiency caused by the further relaxation of the constraint set. See Section 5 for further discussion. Let v̂ be the leading left singular vector of

M̂ ∈ argmax_{M ∈ S} { ⟨T, M⟩ − λ‖M‖_1 } (15)

for either S = S_1 or S = S_2. In order to describe the theoretical properties of v̂ as an estimator of the oracle projection direction v, we introduce the following class of distributions: let P(n, p, k, ν, ϑ, τ, σ²) denote the class of distributions of X = (X_1, . . . , X_n) ∈ R^{p×n} with independent columns drawn from (1), where the changepoint locations satisfy (5) and the vectors of mean changes are such that (4) and (6) hold. Although this notation accommodates the multiple changepoint setting studied in Section 4 below, we emphasise that our focus here is on the single changepoint setting. The error bound in Proposition 1 below relies on a generalisation of the curvature lemma of Vu et al. (2013, Lemma 3.1), presented as Lemma 13 in the online supplement.
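Under the S_2 relaxation, estimating the projection direction collapses to a few lines. A sketch, taking the CUSUM matrix T as input (illustrative only; the Frobenius normalisation of soft(T, λ) does not change the singular vectors, so it is omitted when only v̂ is needed):

```python
import numpy as np

def projection_direction_s2(T, lam):
    """Closed-form S2 estimate of the projection direction: soft-threshold T
    entrywise at lam and return the leading left singular vector."""
    M = np.sign(T) * np.maximum(np.abs(T) - lam, 0.0)   # soft(T, lam)
    if not M.any():
        raise ValueError("lambda too large: soft(T, lambda) is identically zero")
    return np.linalg.svd(M, full_matrices=False)[0][:, 0]
```

Note that the returned vector is only defined up to sign, which is immaterial since the subsequent CUSUM maximisation uses absolute values.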
Proposition 1. Suppose that M̂ satisfies (15) for either S = S_1 or S = S_2. Let v̂ ∈ argmax_{ṽ ∈ S^{p−1}} ‖M̂^T ṽ‖_2 be the leading left singular vector of M̂. If n ≥ 6 and if we choose λ ≥ 2σ√(log(p log n)), then

sup_{P ∈ P(n,p,k,1,ϑ,τ,σ²)} E sin ∠(v̂, v) ≤ Cλ√k/(τϑ√n)

for a universal constant C > 0.

The following corollary restates the rate of convergence of the projection estimator in a simple asymptotic regime.
Corollary 2. Consider an asymptotic regime where log p = O(log n), σ is a constant, ϑ ≍ n^{−a}, τ ≍ n^{−b} and k ≍ n^c for some a ∈ R, b ∈ [0, 1] and c ≥ 0. Then, setting λ := 2σ√(log(p log n)) and provided a + b + c/2 < 1/2, we have for every δ > 0 that

sup_{P ∈ P(n,p,k,1,ϑ,τ,σ²)} E sin ∠(v̂, v) = o(n^{−(1/2−a−b−c/2)+δ}).

Proposition 1 and Corollary 2 illustrate the benefits of assuming that the changes in mean structure occur only in a sparse subset of the coordinates. Indeed, these results mimic similar findings in other high-dimensional statistical problems where sparsity plays a key role, indicating that one pays only a logarithmic price for the absence of knowledge of the true sparsity set. See, for instance, Bickel, Ritov and Tsybakov (2009) in the context of the Lasso in high-dimensional linear models, or Johnstone and Lu (2009) and Wang, Berthet and Samworth (2016) in the context of sparse principal component analysis.
Algorithm 2: Pseudo-code for a single high-dimensional changepoint estimation algorithm.
Step 4: Let ẑ ∈ argmax_{1≤t≤n−1} |v̂^T T_t|, where T_t is the tth column of T, and set T̂_max ← |v̂^T T_ẑ|. Output: ẑ, T̂_max.

After obtaining a good estimator v̂ of the oracle projection direction, the natural next step is to project the data matrix X along the direction v̂, and to apply an existing one-dimensional changepoint localisation method to the projected data. In this work, we apply a one-dimensional CUSUM transformation to the projected time series and estimate the changepoint by the location of the maximum of the CUSUM vector. Our overall procedure for locating a single changepoint in a high-dimensional time series is given in Algorithm 2. In our description of this algorithm, the noise level σ is assumed to be known. If σ is unknown, we can estimate it robustly using, for example, the median absolute deviation of the marginal one-dimensional time series (Hampel, 1974). Note that, for convenience of later reference, we have required Algorithm 2 to output both the estimated changepoint location ẑ and the associated maximum absolute post-projection one-dimensional CUSUM statistic T̂_max.
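Putting the pieces together, the single-changepoint procedure amounts to a handful of lines. A self-contained sketch in the spirit of Algorithm 2, using the S_2 relaxation in place of Algorithm 1 (function names are ours, not from the InspectChangepoint package):

```python
import numpy as np

def cusum(M):
    # CUSUM transformation T(M) of a p x n matrix (columns indexed t = 1, ..., n-1)
    p, n = M.shape
    t = np.arange(1, n)
    S = np.cumsum(M, axis=1)
    return np.sqrt(t * (n - t) / n) * ((S[:, -1:] - S[:, :-1]) / (n - t) - S[:, :-1] / t)

def single_changepoint(X, lam):
    """Sketch of Algorithm 2 with the S2 relaxation: v_hat is the leading left
    singular vector of soft(T, lam); z_hat maximises the projected CUSUM |v_hat' T_t|."""
    T = cusum(X)
    M = np.sign(T) * np.maximum(np.abs(T) - lam, 0.0)   # soft-thresholded CUSUM matrix
    v = np.linalg.svd(M, full_matrices=False)[0][:, 0]  # estimated projection direction
    proj = np.abs(v @ T)                                # |v_hat' T_t|, t = 1, ..., n-1
    z_hat = int(np.argmax(proj)) + 1                    # location in {1, ..., n-1}
    return z_hat, float(proj[z_hat - 1])
```

On noiseless data with a single sparse mean change, the projected CUSUM is maximised exactly at the true changepoint.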
From a theoretical point of view, the fact that v̂ is estimated using the entire dataset X makes it difficult to analyse the post-projection noise structure. For this reason, in the analysis below, we work with a slight variant of Algorithm 2. We assume for convenience that n = 2n_1 is even, and define X^{(1)}, X^{(2)} ∈ R^{p×n_1} to consist respectively of the odd- and even-indexed columns of X, so that X^{(1)}_t := X_{2t−1} and X^{(2)}_t := X_{2t} for t = 1, . . . , n_1. We then use X^{(1)} to estimate the oracle projection direction, and use X^{(2)} to estimate the changepoint location after projection (see Algorithm 3). However, we recommend using Algorithm 2 in practice, to exploit the full signal strength in the data.
Algorithm 3: Pseudo-code for a sample-splitting variant of Algorithm 2.
Step 2: Use Algorithm 1 or (14) (with inputs T^{(1)}, λ in either case) to solve for M̂^{(1)}, and let v̂^{(1)} be its leading left singular vector.
Step 4: Let ẑ ∈ argmax_{1≤t≤n_1−1} |(v̂^{(1)})^T T^{(2)}_t|, where T^{(2)}_t is the tth column of T^{(2)} := T(X^{(2)}).

We summarise the overall estimation performance of Algorithm 3 in the following theorem.
Theorem 3. Suppose that σ > 0 is known. Let ẑ be the output of Algorithm 3 with input X ∼ P ∈ P(n, p, k, 1, ϑ, τ, σ²) and λ := 2σ√(log(p log n)). There exist universal constants C, C′ > 0 such that if n ≥ 12 is even, z is even and

ϑτ√n ≥ Cσ√(k log(p log n)), (17)

then, with high probability, |ẑ − z| ≤ C′σ²ϑ^{−2} log log n. We remark that, under the conditions of the theorem, the rate of convergence obtained is minimax optimal up to a factor of log log n; see Proposition 10 in the online supplement. It is interesting to note that, once (17) is satisfied, the final rate of changepoint estimation does not depend on τ.
Corollary 4. Suppose that σ is a constant, log p = O(log n), ϑ ≍ n^{−a}, τ ≍ n^{−b} and k ≍ n^c for some a ∈ R, b ∈ [0, 1] and c ≥ 0. If a + b + c/2 < 1/2, then the output ẑ of Algorithm 3 with λ := 2σ√(log(p log n)) is a consistent estimator of the true changepoint z, with rate of convergence ρ_n = o(n^{−1+2a+δ}) for any δ > 0.
Finally in this section, we remark that this asymptotic rate of convergence has previously been observed in Csörgő and Horváth (1997, Theorem 2.8.2) for a CUSUM procedure in the special case of univariate observations with τ bounded away from zero (i.e. b = 0 in Corollary 4 above).

Estimating multiple changepoints
Our algorithm for estimating a single changepoint can be combined with the wild binary segmentation scheme of Fryzlewicz (2014) to sequentially locate multiple changepoints in high-dimensional time series. The principal idea behind a wild binary segmentation procedure is as follows. We first randomly sample a large number of pairs (s_1, e_1), . . . , (s_Q, e_Q) uniformly from the set {(l, r) ∈ Z² : 0 ≤ l < r ≤ n}, and then apply our single changepoint algorithm to X^{[q]}, for 1 ≤ q ≤ Q, where X^{[q]} is defined to be the submatrix of X obtained by extracting columns {s_q + 1, . . . , e_q} of X. For each 1 ≤ q ≤ Q, the single changepoint algorithm (Algorithm 2 or 3) estimates an optimal sparse projection direction v̂^{[q]}, computes a candidate changepoint location s_q + ẑ^{[q]} within the time window [s_q + 1, e_q] and returns a maximum absolute CUSUM statistic T̂^{[q]}_max along the projection direction. We aggregate the Q candidate changepoint locations by choosing as our best candidate the one with the largest projected CUSUM statistic T̂^{[q]}_max. If this statistic is above a certain threshold value ξ, we admit the best candidate to the set Ẑ of estimated changepoint locations and repeat the above procedure recursively on the sub-segments to the left and right of the estimated changepoint. Note that, while recursing on a sub-segment, we only consider those time windows that are completely contained in that sub-segment. The precise algorithm is detailed in Algorithm 4.
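The recursion just described can be sketched as follows (a simplified rendering: `single_cp` is any single-changepoint routine returning (ẑ, T̂_max), the Q windows are drawn up front, and, unlike the β-trimmed version in Algorithm 4, the current segment itself is always included as a candidate window; all names are ours):

```python
import numpy as np

def wbs(X, single_cp, xi, Q=100, rng=None):
    """Wild-binary-segmentation wrapper around a single-changepoint routine.
    single_cp(X_sub) -> (z_hat, T_max); xi is the detection threshold.
    Returns the sorted list of estimated changepoint locations."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    windows = []
    while len(windows) < Q:                    # sample Q random time windows
        s, e = sorted(rng.integers(0, n + 1, size=2))
        if e - s >= 2:
            windows.append((int(s), int(e)))
    found = []

    def recurse(lo, hi):
        # candidate windows fully contained in the current segment, plus the segment itself
        cand = [(s, e) for (s, e) in windows if lo <= s and e <= hi]
        cand.append((lo, hi))
        best = None
        for s, e in cand:
            if e - s < 2:
                continue
            z, tmax = single_cp(X[:, s:e])
            if best is None or tmax > best[1]:
                best = (s + z, tmax)           # candidate location on the global time scale
        if best is not None and best[1] > xi:  # admit best candidate if above threshold
            found.append(best[0])
            recurse(lo, best[0])
            recurse(best[0], hi)

    recurse(0, n)
    return sorted(found)
```

The recursion terminates because each admitted changepoint lies strictly inside its segment, so the sub-segments shrink at every level.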
Algorithm 4 requires three tuning parameters: a regularisation parameter λ, a Monte Carlo parameter Q for the number of random time windows and a thresholding parameter ξ that determines termination of recursive segmentation. Theorem 5 below provides choices for λ, Q and ξ that yield theoretical guarantees for consistent estimation of all changepoints as defined in (7).
We remark that if we apply Algorithm 2 or 3 to the entire dataset X instead of to random time windows of X, and then iterate after segmentation, we arrive at a multiple changepoint algorithm based on the classical binary segmentation scheme. The main disadvantage of this classical binary segmentation procedure is its sensitivity to model misspecification. Algorithms 2 and 3 are designed to optimise the detection of a single changepoint. When we apply them in conjunction with classical binary segmentation to a time series containing more than one changepoint, the signals from multiple changepoints may cancel each other out in two different ways that will lead to a loss of power. First, as Fryzlewicz (2014) points out in the one-dimensional setting, multiple changepoints may offset each other in the CUSUM computation, resulting in a smaller peak of the CUSUM statistic that is more easily contaminated by the noise. Moreover, in a high-dimensional setting, different changepoints can undergo changes in different (sparse) sets of coordinates. This also attenuates the signal strength, in the sense that the estimated oracle projection direction from Algorithm 1 is aligned with some linear combination of θ^{(1)}, . . . , θ^{(ν)}, but not necessarily well aligned with any one particular θ^{(i)}. The wild binary segmentation scheme addresses the model misspecification issue by examining sub-intervals of the entire time length. When the number of time windows Q is sufficiently large and τ is not too small, with high probability we have reasonably long time windows that contain each individual changepoint. Hence the single changepoint algorithm will perform well on these segments.

Algorithm 4: Pseudo-code for the multiple changepoint algorithm based on sparse singular vector projection and wild binary segmentation.
Function wbs(s, e): set Q_{s,e} ← {q : s + nβ ≤ s_q < e_q ≤ e − nβ}; for each q ∈ Q_{s,e}, run Algorithm 2 with X^{[q]}, λ as input, and let ẑ^{[q]}, T̂^{[q]}_max be the output; find q_0 ∈ argmax_{q ∈ Q_{s,e}} T̂^{[q]}_max.
Step 3: Let ν̂ ← |Ẑ| and sort the elements of Ẑ in increasing order to yield ẑ_1 < · · · < ẑ_ν̂. Output: ẑ_1, . . . , ẑ_ν̂.
Just as in the case of single changepoint estimation, it is easier to analyse the theoretical performance of a sample-splitting version of Algorithm 4. However, to avoid notational clutter, we will prove a theoretical result without sample splitting, but with the assumption that, whenever Algorithm 2 is used within Algorithm 4, its second and third steps (i.e. the steps for estimating the oracle projection direction) are carried out on an independent copy X′ of X. We refer to such a variant of the algorithm with access to an independent sample X′ as Algorithm 4′. Theorem 5 below, which provides theoretical guarantees for Algorithm 4′, can then be readily adapted to work for a sample-splitting version of Algorithm 4, where we replace n by n/2 where necessary.
We remark that the consistency described in Corollary 6 is a rather strong notion, in the sense that it implies convergence in several other natural metrics. For example, if we let

d_H(A, B) := max{ sup_{a∈A} inf_{b∈B} |a − b|, sup_{b∈B} inf_{a∈A} |a − b| }

denote the Hausdorff distance between non-empty sets A, B ⊆ R, then (7) implies that, with probability tending to 1,

d_H({ẑ_1, . . . , ẑ_ν̂}, {z_1, . . . , z_ν}) ≤ nρ_n.

Similarly, denote the L_1-Wasserstein distance between probability measures P and Q on R by

d_W(P, Q) := inf E|U − V|,

where the infimum is taken over all pairs of random variables U and V defined on the same probability space with U ∼ P and V ∼ Q. Then (7) also implies that, with probability tending to 1,

d_W( ν̂^{−1} Σ_{i=1}^{ν̂} δ_{ẑ_i}, ν^{−1} Σ_{i=1}^{ν} δ_{z_i} ) ≤ nρ_n,

where δ_a denotes a Dirac point mass at a.
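For finite changepoint sets, both metrics are straightforward to compute; a minimal sketch (the Wasserstein helper covers only the equal-cardinality case, which is the relevant one here since ν̂ = ν with probability tending to 1; function names are ours):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite non-empty sets of reals."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    d = np.abs(A[:, None] - B[None, :])        # pairwise distance matrix
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def wasserstein_point_masses(A, B):
    """L1-Wasserstein distance between (1/m) sum_i delta_{A_i} and (1/m) sum_i delta_{B_i}
    for equal-sized sets: on the real line the optimal coupling matches sorted values."""
    A, B = np.sort(np.asarray(A, float)), np.sort(np.asarray(B, float))
    assert len(A) == len(B), "equal-cardinality case only"
    return float(np.mean(np.abs(A - B)))
```

The sorted-matching formula is the one-dimensional quantile coupling, which attains the infimum in the definition of d_W.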

Numerical studies
In this section, we examine the empirical performance of the inspect algorithm in a range of settings, and compare it with a variety of other recently proposed methods. In both single- and multiple-changepoint scenarios, the implementation of inspect requires the choice of a regularisation parameter λ > 0 in Algorithm 1 (which is called in Algorithms 2 and 4). In our experience, the theoretical choices λ = 2σ√(log(p log n)) and λ = 4σ√(log(np)) used in Theorems 3 and 5 produce consistent estimators as predicted by the theory, but are slightly conservative, and in practice we recommend the choice λ = σ√(2^{−1} log(p log n)) in both cases. Figure 2 illustrates the dependence of the performance of our algorithm on the regularisation parameter, and reveals in this case (as in the other examples that we tried) that this choice of λ is sensible.

In the implementation of our algorithm, we do not assume that the noise level σ is known, nor even that it is constant across different components. Instead, we estimate the error variance for each individual time series using the median absolute deviation of first-order differences, with scaling constant 1.05 appropriate for the normal distribution (Hampel, 1974). We then normalise each series by its estimated standard deviation and use the choices of λ given above with σ replaced by 1.

[Figure 2 parameters: n = 1000, p = 500, k = 3 (red), 10 (orange), 22 (blue) or 100 (green), z = 400, ϑ = 1, σ² = 1. For these parameters, our choice of λ is σ√(2^{−1} log(p log n)) ≈ 2.02.]

In Step 2 of Algorithm 2, we also have a choice between using S = S_1 and S = S_2. The following numerical experiment demonstrates the difference in performance of the algorithm for these two choices. We took n = 500, p = 1000, k = 30 and σ² = 1, with a single changepoint located at z = 200. Table 1 shows the angles between the oracle projection direction and the estimated projection directions using both S_1 and S_2, as the signal level ϑ varies from 0.5 to 5.0.
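The per-coordinate noise-level estimate described above (median absolute deviation of first-order differences, scaling constant 1.05) can be sketched as follows; the function name and the synthetic example are ours:

```python
import numpy as np

def normalise_rows(X, scale=1.05):
    """Estimate each row's noise standard deviation via the median
    absolute deviation of first-order differences, then divide each
    row by its estimate.

    The constant 1.05 is the scaling quoted in the text for Gaussian
    noise: it folds together the usual 1.4826 MAD factor and the
    sqrt(2) inflation caused by differencing.
    """
    X = np.asarray(X, dtype=float)
    diffs = np.diff(X, axis=1)           # differencing removes a piecewise-constant mean
    sigma_hat = scale * np.median(np.abs(diffs), axis=1)
    return X / sigma_hat[:, None], sigma_hat

rng = np.random.default_rng(0)
X = 3.0 * rng.standard_normal((5, 1000))  # true noise s.d. 3 in each row
_, sigma_hat = normalise_rows(X)
print(np.round(sigma_hat, 1))             # estimates should be close to 3
```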
We have additionally reported the benchmark performance of the naive estimator that uses the leading left singular vector of T, which illustrates that the convex optimisation algorithms significantly improve on the naive estimator by exploiting the sparsity structure. It can be seen that the further relaxation from S_1 to S_2 incurs a relatively low cost in terms of the estimation quality of the projection direction, while offering a great improvement in running time owing to the closed-form solution (cf. Proposition 9 in the online supplement). Thus, even though the use of S_1 remains a viable practical choice for offline data sets of moderate size, we use S = S_2 in the simulations that follow.

Table 1: Angles (in degrees) between the oracle projection direction v and the estimated projection directions v̂^{S_1} (using S_1), v̂^{S_2} (using S_2) and v̂_max (the leading left singular vector of T), for different choices of ϑ. Each reported value is averaged over 100 repetitions. Other simulation parameters: n = 500, p = 1000, k = 30, z = 200, σ² = 1.
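Assuming, as in the relaxation discussed above, that S_2 is the unit Frobenius ball, the closed-form estimate of the projection direction amounts to entrywise soft-thresholding of the CUSUM matrix followed by a leading-singular-vector computation. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def soft(T, lam):
    """Entrywise soft-thresholding: sgn(T) * max(|T| - lam, 0)."""
    return np.sign(T) * np.maximum(np.abs(T) - lam, 0.0)

def projection_direction_S2(T, lam):
    """Closed-form solution over the relaxed set S2 (cf. Proposition 9):
    soft-threshold the CUSUM matrix T, then take the leading left
    singular vector of the thresholded matrix."""
    M = soft(T, lam)
    if not M.any():
        raise ValueError("lam too large: soft(T, lam) is identically zero")
    U, s, Vt = np.linalg.svd(M)
    v = U[:, 0]
    return v / np.linalg.norm(v)

# Toy example: a rank-one CUSUM-like matrix with 3 signal rows plus noise.
rng = np.random.default_rng(1)
theta = np.zeros(20)
theta[:3] = 1.0
T = np.outer(theta, np.linspace(0, 1, 49)) + 0.1 * rng.standard_normal((20, 49))
v_hat = projection_direction_S2(T, lam=0.2)
```

In this toy example, most of the mass of v_hat should sit on the three signal coordinates, illustrating how thresholding exploits sparsity.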
We compare the performance of the inspect algorithm with the following recently proposed methods for high-dimensional changepoint estimation: sparsified binary segmentation (sbs) (Cho and Fryzlewicz, 2015), the double CUSUM algorithm (dc) of Cho (2016), a scan statistic-based algorithm (scan) derived from the work of Enikeeva and Harchaoui (2014), the ℓ_∞ CUSUM aggregation algorithm (agg_∞) of Jirak (2015) and the ℓ_2 CUSUM aggregation algorithm (agg_2) of Horváth and Hušková (2012). We remark that the latter three works are primarily concerned with testing for the existence of a changepoint. However, their relevant test statistics can be naturally modified into changepoint location estimators, and these can in turn be extended to multiple changepoint estimation algorithms via a wild binary segmentation scheme in a similar way to our algorithm, with the termination criterion chosen by five-fold cross-validation. Whenever tuning parameters are required to run these algorithms, we adopt the choices suggested by their authors in the relevant papers.

Single changepoint estimation
All algorithms in our simulation study are top-down algorithms, in the sense that their multiple changepoint procedures are built upon a single changepoint estimation submodule, which is used to locate all changepoints recursively via a (wild) binary segmentation scheme. It is therefore instructive first to compare their performance on the single changepoint estimation task. Our simulations were run for n, p ∈ {500, 1000, 2000}, k ∈ {3, p^{1/2}, 0.1p, p}, z = 0.4n, σ² = 1 and ϑ = 0.8, with θ ∝ (1, 2^{−1/2}, ..., k^{−1/2}, 0, ..., 0)^T ∈ R^p. For definiteness, we let the n columns of X be independent, with the leftmost z columns drawn from N_p(0, σ²I_p) and the remaining columns drawn from N_p(θ, σ²I_p). To avoid the influence of different threshold levels on the performance of the algorithms, and to focus solely on their estimation precision, we assume that the existence of a single changepoint is known a priori and make all algorithms output their estimate of its location; estimation of the number of changepoints in a multiple-changepoint setting is studied in Section 5.3 below. Table 2 compares the performance of inspect and the other competing algorithms under various parameter settings. All algorithms were run on the same data matrices, and the root mean squared estimation error over 1000 repetitions is reported. Although, in the interests of brevity, we report the root mean squared estimation error only for ϑ = 0.8, simulation results for other values of ϑ were qualitatively similar. We also remark that the four choices of the parameter k correspond to constant or logarithmic sparsity, polynomial sparsity and two levels of non-sparse settings respectively.
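The data-generating mechanism just described can be sketched as follows (the function name and seed are ours):

```python
import numpy as np

def single_change_data(n, p, k, z, vartheta, sigma=1.0, seed=0):
    """Generate X in R^{p x n} with a single mean change at time z:
    columns 1..z are N_p(0, sigma^2 I_p), columns z+1..n are
    N_p(theta, sigma^2 I_p), where theta has k non-zero coordinates
    proportional to (1, 2^{-1/2}, ..., k^{-1/2}) and ||theta||_2 = vartheta."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    theta[:k] = 1.0 / np.sqrt(np.arange(1, k + 1))
    theta *= vartheta / np.linalg.norm(theta)
    X = sigma * rng.standard_normal((p, n))
    X[:, z:] += theta[:, None]           # shift the mean after time z
    return X, theta

X, theta = single_change_data(n=500, p=1000, k=30, z=200, vartheta=0.8)
```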
In addition to comparing the practical algorithms, we also computed the changepoint estimator based on the oracle projection direction (which, of course, is typically unknown); the performance of this oracle estimator depends only on n, z, ϑ and σ² (and not on k or p), and the corresponding root mean squared errors in Table 2 were 10.0, 8.1 and 7.8 when (n, z, ϑ, σ²) = (500, 200, 0.8, 1), (1000, 400, 0.8, 1) and (2000, 800, 0.8, 1) respectively. Thus the performance of our inspect algorithm is very close to that of the oracle estimator when k is small, as predicted by our theory.
As a graphical illustration of the performance of the different methods, Figure 3 displays density estimates of their estimated changepoint locations in two different settings taken from Table 2. One difficulty in presenting such estimates with kernel density estimators is that different algorithms would require different choices of bandwidth, and these would need to be locally adaptive, owing to the relatively sharp peaks. To avoid the choice of bandwidth skewing the visual representation, we therefore use the log-concave maximum likelihood estimator for each method (e.g. Dümbgen and Rufibach, 2009; Cule, Samworth and Stewart, 2010), which is both locally adaptive and tuning-parameter free. It can be seen from Table 2 and Figure 3 that inspect has extremely competitive performance for the single changepoint estimation task. In particular, despite being designed for estimation of sparse changepoints, inspect performs relatively well even when k = p (i.e. when the signal is highly non-sparse), especially when the signal strength is relatively large.

Model misspecification
We now extend the ideas of Section 5.1 by investigating empirical performance under several other types of model misspecification. Recall that the noise matrix is W = (W_{j,t}) := X − μ, and write W_1, ..., W_n for the column vectors of W. In models M_unif and M_exp, we draw the noise entries W_{j,t} independently from the Unif[−√3 σ, √3 σ] and Exp(σ) − σ distributions respectively (both centred, with variance σ²). We note that the correct Hampel scaling constants are approximately 0.99 and 1.44 in these two cases, though we continue to use the constant 1.05 appropriate for normally distributed data. In model M_{cs,loc}(ρ), we allow the noise to have short-range cross-sectional dependence by sampling W_1, ..., W_n iid ∼ N_p(0, Σ) with Σ := (ρ^{|j−j′|})_{j,j′}. In model M_cs(ρ), we extend this to global cross-sectional dependence by sampling W_1, ..., W_n iid ∼ N_p(0, Σ) with Σ := (1 − ρ)I_p + ρ 1_p 1_p^T, where 1_p ∈ R^p is the all-one vector. In model M_temp(ρ), we consider auto-regressive AR(1) temporal dependence in the noise by first sampling W′_{j,t} iid ∼ N(0, σ²) and then setting W_{j,1} := W′_{j,1} and W_{j,t} := ρ^{1/2} W_{j,t−1} + (1 − ρ)^{1/2} W′_{j,t} for 2 ≤ t ≤ n. In M_async(L), we model asynchronous changepoint locations in the signal coordinates by drawing the changepoint location for each individual coordinate independently from the uniform distribution on {z − L, ..., z + L}. Table 3 reports the performance of the different algorithms in the parameter setting n = 2000, p = 1000, k = 32, z = 800, ϑ = 0.25, σ² = 1. It can be seen that inspect is robust to both temporal and spatial dependence structures, as well as to noise misspecification.
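Two of the noise models above can be sketched as follows; the constructions mirror M_temp(ρ) and M_{cs,loc}(ρ) as described, with function names of our choosing:

```python
import numpy as np

def ar1_noise(p, n, rho, sigma=1.0, seed=0):
    """Temporal AR(1) noise as in model M_temp(rho):
    W[:, 0] = W'[:, 0] and
    W[:, t] = sqrt(rho) * W[:, t-1] + sqrt(1 - rho) * W'[:, t],
    which keeps the stationary variance equal to sigma^2."""
    rng = np.random.default_rng(seed)
    Wp = sigma * rng.standard_normal((p, n))
    W = np.empty((p, n))
    W[:, 0] = Wp[:, 0]
    for t in range(1, n):
        W[:, t] = np.sqrt(rho) * W[:, t - 1] + np.sqrt(1 - rho) * Wp[:, t]
    return W

def local_cs_noise(p, n, rho, seed=0):
    """Short-range cross-sectional noise as in M_cs,loc(rho):
    columns iid N_p(0, Sigma) with Sigma[j, j'] = rho^{|j - j'|}."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    L = np.linalg.cholesky(Sigma)        # Sigma is positive definite for |rho| < 1
    return L @ rng.standard_normal((p, n))

W = ar1_noise(p=4, n=2000, rho=0.5)
```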

Multiple changepoint estimation
The use of the 'burn-off' parameter β in Algorithm 4 was mainly to facilitate our theoretical analysis. In our simulations, we found that taking β = 0 rarely resulted in a changepoint being estimated more than once, and we therefore recommend setting β = 0 in practice, unless prior knowledge of the distribution of the changepoints suggests otherwise. To choose ξ in the multiple changepoint estimation simulation studies, for each (n, p), we first applied inspect to 1000 data sets drawn from the null model with no changepoint, and took ξ to be the largest value of T̂_max from Algorithm 2. We also set Q = 1000. We consider the simulation setting n = 2000, p = 200, k = 40, σ² = 1 and z = (500, 1000, 1500), and define ϑ^{(i)} := ‖θ^{(i)}‖_2 to be the signal strength at the ith changepoint.
We set (ϑ^{(1)}, ϑ^{(2)}, ϑ^{(3)}) = (ϑ, 2ϑ, 3ϑ) and take ϑ ∈ {0.4, 0.6} to examine the performance of the algorithms at different signal strengths. We also considered different levels of overlap between the coordinates in which the three changes in mean structure occur: in the complete overlap case, changes occur in the same k coordinates at each changepoint; in the half overlap case, the changes at the ith changepoint occur in coordinates (i − 1)k/2 + 1, ..., (i + 1)k/2 for i = 1, 2, 3; in the no overlap case, the changes occur in disjoint sets of coordinates. Table 4 summarises the results. We report both the frequency counts of the number of changepoints detected over 100 runs (all algorithms were compared on the same set of randomly generated data matrices) and two quality measures of the locations of the changepoints. In particular, since changepoint estimation can be viewed as a special case of classification, the quality of the estimated changepoints can be measured by the Adjusted Rand Index (ARI) of the estimated segmentation against the truth (Rand, 1971; Hubert and Arabie, 1985). We report both the average ARI over all runs and the percentage of runs on which a particular method attains the largest ARI among the six. Figure 4 gives a pictorial representation of the results for one particular collection of parameter settings. Again, we find that the performance of inspect is very encouraging on all performance measures, though we remark that agg_2 is also competitive, and scan tends to output the fewest false positives.
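The ARI computation for two changepoint-based segmentations can be sketched as follows (a self-contained implementation of the Hubert and Arabie (1985) formula; function names are ours):

```python
import numpy as np
from math import comb

def segmentation_labels(changepoints, n):
    """Label each time 1..n by its segment under the given changepoints."""
    labels = np.zeros(n, dtype=int)
    for z in sorted(changepoints):
        labels[z:] += 1
    return labels

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labellings of the same items."""
    _, a_inv = np.unique(labels_a, return_inverse=True)
    _, b_inv = np.unique(labels_b, return_inverse=True)
    cont = np.zeros((a_inv.max() + 1, b_inv.max() + 1), dtype=int)
    np.add.at(cont, (a_inv, b_inv), 1)           # contingency table
    sum_ij = sum(comb(int(x), 2) for x in cont.ravel())
    sum_a = sum(comb(int(x), 2) for x in cont.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in cont.sum(axis=0))
    total = comb(int(cont.sum()), 2)
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

true = segmentation_labels([500, 1000, 1500], 2000)
est = segmentation_labels([505, 1002, 1490], 2000)
print(round(adjusted_rand_index(true, est), 3))  # close to 1 for near-perfect estimates
```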

Real data application
We study the comparative genomic hybridisation (CGH) microarray dataset from Bleakley and Vert (2011), available in the ecp R package (James and Matteson, 2015). CGH is a technique that allows detection of chromosomal copy number abnormality by comparing the fluorescence intensity levels of DNA fragments from a test sample and a reference sample. This dataset contains (test to reference) log intensity ratio measurements for 43 individuals with bladder tumour at 2215 different loci on their genome. The log intensity ratios for the first ten individuals are plotted in Figure 5. While some of the copy number variations are specific to one individual, some copy number abnormality regions (e.g. between loci 2044 and 2143) are shared across several different individuals and are more likely to be disease-related. The inspect algorithm aggregates the changes present in different individuals and estimates the start and end points of copy number changes. Owing to the large number of individual-specific copy number changes and the presence of measurement outliers, direct application of inspect with the default threshold level identifies 254 changepoints. However, practitioners can use the associated T̂_max values of the estimated changepoints to rank them in decreasing order of significance.

Extensions: temporal or spatial dependence
In this section, we explore how our method and its analysis can be extended to handle more realistic streaming data settings where our data exhibit temporal or spatial dependence. For simplicity, we focus on the single changepoint case, and assume the same mean structure for µ = E(X) as described in Section 2, in particular (2), (3), (4), (5) and (6).

Temporal dependence
A natural way of relaxing the assumption of independence of the columns of our data matrix is to assume that the noise vectors W_1, ..., W_n are stationary. Writing K(u) := cov(W_t, W_{t+u}), we assume here that W = (W_1, ..., W_n) forms a centred, stationary Gaussian process with covariance function K. As we are mainly interested in temporal dependence in this subsection, we assume that each component time series evolves independently, so that K(u) is a diagonal matrix for every u. Further, writing σ² := ‖K(0)‖_op, we assume that the dependence is short-ranged, in the sense that the covariance function satisfies a summability condition governed by some universal constant B > 0. In this case, the oracle projection direction is still v := θ/‖θ‖_2, and our inspect algorithm does not require any modification. In terms of its performance in this context, we have the following result:

Theorem 7. Suppose that σ, B > 0 are known. Let ẑ be the output of Algorithm 3 with input X and λ := σ√(8B log(np)). There exist universal constants C, C′ > 0 such that if n ≥ 12 is even, z is even and (Cσ/(ϑτ))√(kB log(np)/n) ≤ 1, then

P( (1/n)|ẑ − z| ≤ C′σ²B log n / (nϑ²) ) ≥ 1 − 12/n.

Spatial dependence
Now consider the case where we have spatial dependence between the different coordinates of the data stream. More specifically, suppose that the noise vectors satisfy W_1, ..., W_n iid ∼ N_p(0, Σ) for some positive definite matrix Σ ∈ R^{p×p}. This turns out to be a more complicated setting, in which our initial algorithm requires modification. To see this, observe that for a ∈ S^{p−1}, we now have a^T X_t ∼ N(a^T μ_t, a^T Σ a).
It follows that the oracle projection direction in this case is v_proj := argmax_{a ∈ S^{p−1}} |a^T θ|/(a^T Σ a)^{1/2} ∝ Σ^{−1}θ. If Θ̂ is an estimator of the precision matrix Θ := Σ^{−1}, and v̂ is a leading left singular vector of M̂ as computed in Step 3 of Algorithm 2, then we can estimate the oracle projection direction by v̂_proj := Θ̂v̂/‖Θ̂v̂‖_2. The sample-splitting version of this algorithm is given in Algorithm 5. Lemma 23 in the online supplement allows us to control sin∠(v̂_proj, v_proj) in terms of sin∠(v̂, v) and ‖Θ̂ − Θ‖_op, as well as the extreme eigenvalues of Θ. Since Proposition 1 does not rely on the independence of the different coordinates, it can still be used to control sin∠(v̂, v). In general, controlling ‖Θ̂ − Θ‖_op in high-dimensional cases requires assumptions of additional structure on Θ (or equivalently, on Σ). For convenience of our theoretical analysis, we assume that we have access to observations W̃_1, ..., W̃_m iid ∼ N_p(0, Σ), independent of X^{(2)}, with which we can estimate Θ. In practice, if a lower bound on τ were known, we could take W̃_1, ..., W̃_m to be scaled, disjoint first-order differences of the observations in X^{(1)} that are within n_1τ of the endpoints of the data stream; more precisely, we can let W̃_t := 2^{−1/2}(X^{(1)}_{2t} − X^{(1)}_{2t−1}) for t = 1, ..., ⌊n_1τ/2⌋ and W̃_{⌊n_1τ/2⌋+t} := 2^{−1/2}(X^{(1)}_{n_1−2t} − X^{(1)}_{n_1−2t+1}), so that m = 2⌊n_1τ/2⌋. In fact, Lemmas 24 and 25 in the online supplement indicate that, at least for certain dependence structures, the operator norm error in the estimation of Θ is often negligible by comparison with sin∠(v̂, v), so a fairly crude lower bound on τ would often suffice.
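The projection correction v̂_proj = Θ̂v̂/‖Θ̂v̂‖_2 and the endpoint-differencing construction of a surrogate noise sample can be sketched as follows (a simplified illustration with a ridge-regularised precision estimate of our choosing, not the estimator analysed in the supplement):

```python
import numpy as np

def correct_projection(v_hat, Theta_hat):
    """Adjust an estimated projection direction for spatial dependence:
    v_proj = Theta_hat v_hat / ||Theta_hat v_hat||_2, where Theta_hat
    estimates the precision matrix Sigma^{-1}."""
    w = Theta_hat @ v_hat
    return w / np.linalg.norm(w)

def precision_from_differences(X, tau_lb):
    """Crude precision estimate from scaled, disjoint first-order
    differences taken near the two ends of the series (assumed
    changepoint-free when every changepoint is at least n*tau_lb
    from either endpoint)."""
    p, n = X.shape
    m = int(n * tau_lb) // 2
    left = (X[:, 1:2 * m:2] - X[:, 0:2 * m:2]) / np.sqrt(2)
    right = (X[:, n - 2 * m + 1:n:2] - X[:, n - 2 * m:n:2]) / np.sqrt(2)
    W = np.concatenate([left, right], axis=1)
    Sigma_hat = W @ W.T / W.shape[1]
    # Ridge regularisation (our choice) keeps the inverse stable when p is large.
    return np.linalg.inv(Sigma_hat + 0.1 * np.eye(p))

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 400))
Theta_hat = precision_from_differences(X, tau_lb=0.25)
v_proj = correct_projection(np.ones(5) / np.sqrt(5), Theta_hat)
```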
Theoretical guarantees on the performance of the spatially dependent version of the inspect algorithm in illustrative examples of both local and global dependence structures are provided in Theorem 11 in the online supplement. The main message of these results is that, provided the dependence is not too strong, and we have a reasonable estimate of Θ, we attain the same rate of convergence as when there is no spatial dependence. However, Theorem 11 also quantifies the way in which this rate of convergence deteriorates as the dependence approaches the boundary of its range.
In Figure 6, we compare the performance of the vanilla inspect algorithm with that of Algorithm 5 on simulated data sets with local and global spatial dependence structures. We observe that Algorithm 5 offers improved performance across all values of λ considered, by accounting for the spatial dependence, as suggested by our theoretical arguments.
Algorithm 5: Pseudo-code for a sample-splitting variant of Algorithm 2 for spatially dependent data.

Proofs of main results
Proof of Proposition 1. We note that the matrix A defined in Section 3 has rank 1, and that its only non-zero singular value is ‖θ‖_2‖γ‖_2. By Proposition 14 in the online supplement, on the event Ω_*, the angle sin∠(v̂, v) is bounded in terms of λ, k and this singular value. By definition, ‖θ‖_2 ≥ ϑ, and by Lemma 15 in the online supplement, ‖γ‖_2 ≥ nτ/4; the claimed bound therefore holds on Ω_*. It remains to verify that P(Ω_*^c) ≤ 4(p log n)^{−1/2} for n ≥ 6, which follows from Lemma 16 in the online supplement, as desired.
If |ẑ − z_i| ≤ nρ, we say ẑ is 'identified' with z_i. Moreover, each candidate changepoint b identified by the function call wbs(s, e) in Algorithm 4′ satisfies min{b − s, e − b} ≥ nβ > 2nρ. It follows that different elements of Ẑ ∪ {0, n} cannot be identified with the same z_i, and that no element of Ẑ is identified with z_0 or z_{ν+1}; hence the second part of the event Ω_* holds. It remains to show that each element of {z_1, ..., z_ν} is identified with some element of Ẑ. To see this, note that if z_i is not identified, we can let (s*, e*) be the shortest interval such that s* + 1 ≤ z_i ≤ e* and such that (s*, e*) is a pair of arguments called by the wbs function in Algorithm 4′. By (i), the two endpoints s* and e* are identified with z_{i_1} and z_{i_2} respectively, say, for some i_1 ≤ i − 1 and i_2 ≥ i + 1. But then by (ii) a new point b will be added to Ẑ and the recursion continues on the pairs (s*, b) and (b, e*), contradicting the minimality of the pair (s*, e*).
We now prove by induction on the depth of the recursion that, on Ω_1 ∩ Ω_2 ∩ Ω_3 ∩ Ω_4, statements (i) and (ii) hold every time wbs is called in Algorithm 4′. The first time wbs is called, s = 0 and e = n, so (i) is satisfied with the unique choice i_1 = 0 and i_2 = ν + 1. This proves the base case. Now suppose wbs is called with a pair (s, e) satisfying (i), yielding indices i_1, i_2 ∈ {0, 1, ..., ν + 1} with |s − z_{i_1}| ≤ nρ and |e − z_{i_2}| ≤ nρ. To complete the inductive step, we need to show that (ii) also holds, and that if a new changepoint b is detected, then (i) holds for the pairs of arguments (s, b) and (b, e). We have two cases.

Case 1: i_2 − i_1 = 1. Then (s, e] contains no changepoint away from its endpoints, and on the event Ω_2 ∩ Ω_3 the largest test statistic does not exceed the threshold ξ, so (ii) is satisfied with no additional changepoint detected.
Case 2: i_2 − i_1 ≥ 2. On the event Ω_1, for any i* ∈ {i_1 + 1, ..., i_2 − 1}, there exists q* ∈ {1, ..., Q} such that s_{q*} ∈ J_{i*−1} and e_{q*} ∈ J_{i*}. Moreover, since min{s_{q*} − s, e − e_{q*}} ≥ nτ/3 − nρ > nβ provided that C ≥ 9 in the condition on β in the theorem, we have q* ∈ Q_{s,e}. Since there is precisely one changepoint within the segment (s_{q*}, e_{q*}], the matrix T(μ^{[q*]}) has rank 1; cf. (9). On Ω_2, we have ‖T(W^{[q*]})‖_∞ ≤ λ. Thus, Proposition 14 and Lemma 15 in the online supplement apply under the conditions of the theorem. Therefore, recalling the definition of q_0 in Algorithm 4′, on the event Ω_2 ∩ Ω_3 we obtain a lower bound on T̂^{[q_0]}_max for sufficiently large C > 0. In particular, by the condition Cρkτ^{−3} ≤ 1, we have for sufficiently large C > 0 that T̂^{[q_0]}_max > ξ. Thus (ii) is satisfied, with a new changepoint b := s_{q_0} + ẑ^{[q_0]} detected. It remains to check that (i) holds for the pairs of arguments (s, b) and (b, e), for which it suffices to show that min_{1≤i≤ν} |b − z_i| ≤ nρ. To this end, we study the behaviour of the univariate CUSUM statistics of the projected series (v̂^{[q_0]})^T X^{[q_0]}. To simplify notation, we write X̄ := (v̂^{[q_0]})^T X^{[q_0]}, with Ā and Ē denoting the CUSUM transformations of the corresponding projected mean and noise series, and T̄ := T(X̄). Since T̂^{[q_0]}_max − λ > 0, there is at least one changepoint in (s_{q_0}, e_{q_0}]. We may assume that ẑ^{[q_0]} is not equal to z_i − s_{q_0} for any i_1 + 1 ≤ i ≤ i_2 − 1, since otherwise min_{1≤i≤ν} |b − z_i| = 0 and we are done. By Lemma 20 in the online supplement, and after possibly reflecting the time direction, we may also assume that there is at least one changepoint to the left of ẑ^{[q_0]}, and that if z_{i_0} − s_{q_0} is the changepoint immediately to the left of ẑ^{[q_0]}, then the series {Ā_t : z_{i_0} − s_{q_0} ≤ t ≤ ẑ^{[q_0]}} is positive and strictly decreasing. By (25) with i_0 in place of i*, the corresponding lower bound holds on Ω_3 for sufficiently large C > 0.
Our strategy here is to characterise the magnitude of Ē_{z_{i_0}−s_{q_0}} − Ē_{ẑ^{[q_0]}} and the rate of decay of the series {Ā_t : z_{i_0} − s_{q_0} ≤ t ≤ ẑ^{[q_0]}}. This is achieved by considering three cases: in case (a) there is no changepoint to the right of ẑ^{[q_0]} in (s_{q_0}, e_{q_0}], while cases (b) and (c) distinguish two sub-cases in which such a changepoint z_{i_0+1} − s_{q_0} exists. Comparing (28) with (27), we have that θ̄ min(z_{i_0} − s_{q_0}, e_{q_0} − z_{i_0}) ≥ 0.4λρ^{−1/2}τ^{−3/2}. We apply Lemma 19 in the online supplement, with e_{q_0} − s_{q_0} and z_{i_0} − s_{q_0} taking the roles of n and z in the lemma respectively; on the event Ω_3, the resulting bound holds for sufficiently large C > 0. Hence, by Lemmas 18 and 19 in the online supplement, on the event Ω_3 ∩ Ω_4 we can bound |ẑ^{[q_0]} − (z_{i_0} − s_{q_0})|. Thus, using the condition Cρkτ^{−3} ≤ 1 again, we obtain the desired bound in case (a), for sufficiently large C, with some universal constants C′ and C′′. For case (b), we define μ̄ := (e_{q_0} − s_{q_0})^{−1} Σ_{t=1}^{e_{q_0}−s_{q_0}} μ̄_t to be the overall average of the μ̄ series, and let μ̄_L and μ̄_M be the centred averages of the μ̄ series on the segments (0, z_{i_0} − s_{q_0}] and (z_{i_0} − s_{q_0}, z_{i_0+1} − s_{q_0}] respectively. We claim that z_{i_0} − s_{q_0} ≥ nτ/15. If this were not the case, then, by (29) and the fact that Ā_{z_{i_0}−s_{q_0}} > 0, we would have μ̄_L < 0; on the other hand, an argument similar to that leading to (27) shows that √(nτ)‖θ^{(i_0)}‖_2/λ ≥ ρ^{−1/2}τ^{−3/2} ≥ C^{1/2}.
Thus, it follows from (26) that, for sufficiently large C > 0, the resulting inequality can be rearranged to give −μ̄_M ≥ (−μ̄_L)/3. This contradicts the assumption of case (b), and hence establishes the claim. We can then apply Lemma 21 in the online supplement, with Ā_t, e_{q_0} − s_{q_0}, z_{i_0} − s_{q_0}, z_{i_0+1} − s_{q_0}, −μ̄_L, −μ̄_M and τ/15 taking the roles of g(t), n, z, z′, μ_0, μ_1 and τ in the lemma respectively, to obtain the corresponding bound on the event Ω_3, where we have used (27) in the penultimate inequality and the condition ρ ≤ τ/C in the final inequality. For sufficiently large C, we therefore have ẑ^{[q_0]} − (z_{i_0} − s_{q_0}) ≤ nτ/30. Thus, we can apply Lemmas 18 and 21 in the online supplement to obtain a further bound on Ω_4, where we have used (27) in the final inequality. Since T̄_{z_{i_0}−s_{q_0}} ≤ T̄_{ẑ^{[q_0]}}, we must have the corresponding bound on Ω_4 for some universal constant C′ > 0, and hence, for sufficiently large C > 0, min_{1≤i≤ν} |b − z_i| ≤ nρ. For case (c), by Lemma 20 in the online supplement, the series (Ā_t : z_{i_0} − s_{q_0} ≤ t ≤ z_{i_0+1} − s_{q_0}) must be strictly decreasing and then strictly increasing, while remaining positive throughout. Define ζ := max{t ∈ [z_{i_0} − s_{q_0}, z_{i_0+1} − s_{q_0}] : Ā_t ≤ Ā_{z_{i_0+1}−s_{q_0}} − 2λ}. Using a very similar argument to that in case (b), we find that e_{q_0} − z_{i_0+1} ≥ nτ/15, and therefore, by Lemma 21 in the online supplement again, z_{i_0+1} − s_{q_0} − (ζ + 1) ≤ 150C^{−1/2}nτ. Now, on Ω_3, we can bound Ā_ζ, so we can apply the same argument as in case (b) with ζ taking the role of z_{i_0+1} and with τ/2 − 1/n in place of τ, to obtain that ẑ^{[q_0]} satisfies the desired bound for some universal constant C′ > 0, as required.
Similarly, using Lemma 22 in the online supplement instead of Lemma 17, the event Ω_1 defined in (24) satisfies the analogous probability bound. The proof therefore follows from that of Theorem 3.

Online supplementary material for 'High-dimensional changepoint estimation via sparse projection'
This is the online supplementary material for the main paper, hereafter referred to as the main text. We begin with several additional theoretical results, which are referred to in the main text. The subsequent subsections consist of auxiliary results needed for the proofs of our main theorems.

Additional theoretical results
Our first result is an analogue of Proposition 1 for the (computationally inefficient) estimator of the k-sparse leading left singular vector.
By (43) in the proof of Proposition 14, together with (31), we obtain the desired inequality, where we have used Lemma 15 in the final step. The result then follows from bounding ‖E‖_∞ with high probability, as in (20) of the main text.
We next derive the closed-form expression for the solution to the optimisation problem (14) in the main text. Recall the definitions of the set S_2 and the soft-thresholding function, both given just before (14) in the main text.
Proposition 9. Let T ∈ R^{p×(n−1)} and λ > 0. Then, provided soft(T, λ) ≠ 0, the optimisation problem (14) has the closed-form solution M̂ = soft(T, λ)/‖soft(T, λ)‖_2.

Proof. For R in the set R := {R ∈ R^{p×(n−1)} : ‖R‖_∞ ≤ λ}, define φ(M, R) := ⟨T − R, M⟩, and define the dual function g(R) := max_{M∈S_2} φ(M, R). Since S_2 and R are compact, convex subsets of R^{p×(n−1)} endowed with the trace inner product, and since φ is affine and continuous in both M and R, we can apply the minimax equality theorem of Fan (1953, Theorem 1). We note that the dual function g has a unique minimum over R at R^{(d)}, say, where R^{(d)}_{j,t} := sgn(T_{j,t}) min(λ, |T_{j,t}|). Since the two extreme ends of the resulting chain of inequalities are equal, M̂ necessarily attains the maximum in the primal problem, as claimed.

Proposition 10 below gives a minimax lower bound for the single changepoint estimation problem. In conjunction with Theorem 3, this confirms that the inspect algorithm attains the minimax optimal rate of estimation up to a factor of log log n.
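The closed form in Proposition 9 can be checked numerically: no feasible point of the Frobenius ball should attain a larger value of the objective M ↦ ⟨T, M⟩ − λ‖M‖_1 than the soft-thresholded and normalised matrix (a random-search sanity check, not a proof; all names are ours):

```python
import numpy as np

def soft(T, lam):
    """Entrywise soft-thresholding."""
    return np.sign(T) * np.maximum(np.abs(T) - lam, 0.0)

def objective(T, M, lam):
    """Primal objective <T, M> - lam * ||M||_1."""
    return float(np.sum(T * M) - lam * np.sum(np.abs(M)))

rng = np.random.default_rng(3)
T = rng.standard_normal((6, 9))
lam = 0.5

S = soft(T, lam)
M_hat = S / np.linalg.norm(S)          # claimed maximiser over the Frobenius ball
best = objective(T, M_hat, lam)

# Random candidates with ||M||_F <= 1 should never beat M_hat.
for _ in range(2000):
    M = rng.standard_normal((6, 9))
    M /= max(1.0, np.linalg.norm(M))
    assert objective(T, M, lam) <= best + 1e-9
```

A short calculation shows the optimal value equals ‖soft(T, λ)‖_2, which the check below confirms numerically.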
Finally in this section, we provide theoretical guarantees for the performance of our modified inspect algorithm (Algorithm 5) in cases of both local and global spatial dependence.
9 Auxiliary results

9.1 Auxiliary results for the proof of Proposition 1 in the main text

The lemma below gives a characterisation of the nuclear norm of a real matrix.
Lemma 12. For n, p ≥ 1, let V_n and V_p respectively denote the sets of n × min(n, p) and p × min(n, p) real matrices with orthonormal columns. Let A ∈ R^{p×n}. Then ‖A‖_* = max_{V∈V_p, U∈V_n} ⟨A, V U^T⟩.

Proof. Consider the singular value decomposition A = Ṽ D Ũ^T, where Ṽ ∈ R^{p×p} and Ũ ∈ R^{n×n} are orthogonal matrices and D = (D_{ij}) ∈ R^{p×n} has its entries arranged in decreasing order along the main diagonal and is zero off the main diagonal. Writing v_j and u_j for the jth columns of V and U respectively, we may expand ⟨A, V U^T⟩ = Σ_j v_j^T A u_j, and maximising each term yields the result, as desired.
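Lemma 12 can be verified numerically on a random matrix: the SVD factors attain the nuclear norm, and random orthonormal-column pairs never exceed it (an illustrative check; names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 7))                 # A in R^{p x n} with p = 4, n = 7
U_, s, Vt_ = np.linalg.svd(A, full_matrices=False)
nuclear = s.sum()                               # nuclear norm = sum of singular values

# The maximiser in the characterisation is given by the SVD factors of A.
V, U = U_, Vt_.T                                # p x 4 and n x 4, orthonormal columns
assert np.allclose(np.trace(V.T @ A @ U), nuclear)

# Random feasible pairs never exceed the nuclear norm.
for _ in range(2000):
    Qv, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    Qu, _ = np.linalg.qr(rng.standard_normal((7, 4)))
    assert np.trace(Qv.T @ A @ Qu) <= nuclear + 1e-9
```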
Next, we present a generalisation of the curvature lemma of Vu et al. (2013, Lemma 3.1).
Lemma 13. Let v ∈ S^{p−1} and u ∈ S^{n−1} be the leading left and right singular vectors of A ∈ R^{p×n} respectively, and suppose that the largest and second largest singular values of A are separated by δ > 0. Let M ∈ R^{p×n}, and suppose that either of the following two conditions holds: (a) rank(A) = 1 and ‖M‖_2 ≤ 1; (b) ‖M‖_* ≤ 1. Then ⟨A, vu^T − M⟩ ≥ (δ/2)‖vu^T − M‖_2².

Remark: We note that if v ∈ S^{p−1} and u ∈ S^{n−1} are the leading left and right singular vectors respectively of A ∈ R^{p×n}, then, since the matrix operator norm and the nuclear norm are dual norms with respect to the trace inner product, we have that max_{M:‖M‖_*≤1} ⟨A, M⟩ = ‖A‖_op = ⟨A, vu^T⟩. Thus, Lemma 13 provides a lower bound on the curvature of the function M ↦ ⟨A, M⟩ as M moves away from the maximiser of this function over S_1.
Proof. Let A = V D U^T be the singular value decomposition of A, where V ∈ R^{p×p} and U ∈ R^{n×n} are orthogonal matrices with column vectors v_1 = v, v_2, ..., v_p and u_1 = u, u_2, ..., u_n respectively, and D ∈ R^{p×n} is a rectangular diagonal matrix with non-negative entries along its main diagonal. The diagonal entries σ_i := D_{ii} are the singular values of A, and we may assume without loss of generality that σ_1 ≥ ··· ≥ σ_r > 0 are all of the positive singular values, for some r ≤ min{n, p}.
On the other hand, if condition (b) holds, then by the characterisation of the nuclear norm in Lemma 12, as well as its unitary invariance, we obtain the corresponding bound. Moreover, if ‖M‖_* ≤ 1, then σ_i ≤ 1 for all i. Using (35), (36), (37) and (38), we therefore obtain the desired conclusion.
Proposition 14. Suppose that the largest and second largest singular values of A ∈ R^{p×n} are separated by δ > 0. Let v ∈ S^{p−1}(k) and u ∈ S^{n−1}(ℓ) be leading left and right singular vectors of A respectively. Let T ∈ R^{p×n} satisfy ‖T − A‖_∞ ≤ λ for some λ > 0, and let S be a subset of p × n real matrices containing vu^T. Suppose that one of the following two conditions holds: (a) rank(A) = 1 and S ⊆ {M ∈ R^{p×n} : ‖M‖_2 ≤ 1}; (b) S ⊆ {M ∈ R^{p×n} : ‖M‖_* ≤ 1}. Then any M̂ ∈ argmax_{M∈S} {⟨T, M⟩ − λ‖M‖_1} satisfies the stated bound; furthermore, if v̂ and û are leading left and right singular vectors of M̂ respectively, then (39) holds.

Proof. Using Lemma 13, we obtain a curvature lower bound. Since M̂ maximises the objective function M ↦ ⟨T, M⟩ − λ‖M‖_1 over the set S, and since vu^T ∈ S, we have the basic inequality ⟨T, M̂⟩ − λ‖M̂‖_1 ≥ ⟨T, vu^T⟩ − λ‖vu^T‖_1. Denote S_v := {j : 1 ≤ j ≤ p, v_j ≠ 0} and S_u := {t : 1 ≤ t ≤ n, u_t ≠ 0}. From (40) and (41), and dividing through by ‖vu^T − M̂‖_2, we obtain the first desired result. Now, by the definition of the operator norm, we can bound the relevant inner product, and we claim that max{sin²∠(v̂, v), sin²∠(û, u)} satisfies (43). Let v_0 := (v + v̂)/2 and Δ := (v − v̂)/2; the penultimate step uses the fact that ‖v_0‖_2² + ‖Δ‖_2² = 1. A similar inequality holds for sin²∠(v̂, v), which establishes the claim (43). Inequality (39) now follows from (42) and (43).
The final lemma in this subsection provides bounds on different norms of the vector γ, which is proportional to each row of the CUSUM transformation of the mean matrix.
Thus, X is an Ornstein–Uhlenbeck process, and the inequality follows from the stationarity of the Ornstein–Uhlenbeck process and a union bound. Let Y = {Y(t) : t ∈ R} be a centred, continuous Gaussian process with covariance function Cov(Y(s), Y(t)) = max(1 − |s − t|, 0). Since E X(t)² = E Y(t)² = 1 for all t, and Cov(X(s), X(t)) ≥ Cov(Y(s), Y(t)), Slepian's inequality (Slepian, 1962) implies that sup_{s∈[0,1]} |Y(s)| stochastically dominates sup_{s∈[0,1]} |X(s)|. Hence it suffices to establish the required bound with Y in place of X. The process Y, known as the Slepian process, has excursion probabilities given by closed-form expressions (Slepian, 1961; Shepp, 1971) for x < u, where φ and Φ are respectively the density and distribution functions of the standard normal distribution. Hence, for u > 0, we can bound P(sup_{s∈[0,1]} |Y(s)| ≥ u) as desired.
Remark: This lemma can be viewed as a finite-sample version of the law of the iterated logarithm.

Auxiliary results for the proof of Theorem 5 in the main text
In addition to auxiliary results given in the previous subsection, the proof of Theorem 5 in the main text also requires the following two lemmas, which study the mean structure of the CUSUM transformation in the multiple changepoint setting.
Lemma 20. Suppose that 0 = z_0 < z_1 < ··· < z_ν < z_{ν+1} = n are integers and that μ ∈ R^n satisfies μ_t = μ_{t′} for all z_i < t ≤ t′ ≤ z_{i+1} and 0 ≤ i ≤ ν. Define A := T(μ) ∈ R^{n−1}, where we treat μ as a row vector. If the series (A_t : z_i + 1 ≤ t ≤ z_{i+1}) is not constantly zero, then one of the following holds: (a) i = 0 and (A_t : z_i + 1 ≤ t ≤ z_{i+1}) does not change sign and has strictly increasing absolute values; (b) i = ν and (A_t : z_i + 1 ≤ t ≤ z_{i+1}) does not change sign and has strictly decreasing absolute values; (c) 1 ≤ i ≤ ν − 1 and (A_t : z_i + 1 ≤ t ≤ z_{i+1}) is strictly monotonic; (d) 1 ≤ i ≤ ν − 1 and (A_t : z_i + 1 ≤ t ≤ z_{i+1}) does not change sign and its absolute values are strictly decreasing and then strictly increasing.
Proof. This follows from the proof of Venkatraman (1992, Lemma 2.2).
Here, we used the fact that min{r, 1 − r} ≥ τ in the final bound.

Auxiliary results for theoretical guarantees under dependence
Lemma 22 below, which is used in the proof of Theorem 7 in the main text, provides weaker conclusions than those of Lemmas 16 and 17, but under more general conditions, which in particular allow for time-dependent noise.