Narrowest-Over-Threshold Detection of Multiple Change-points and Change-point-like Features

Summary . We propose a new, generic and ﬂexible methodology for nonparametric function estimation, in which we ﬁrst estimate the number and locations of any features that may be present in the function, and then estimate the function parametrically between each pair of neighbouring detected features. Examples of features handled by our methodology include change-points in the piecewise-constant signal model, kinks in the piecewise-linear signal model, and other similar irregularities, which we also refer to as generalised change-points. Our methodology works with only minor modiﬁcations across a range of generalised change-point scenarios, and we achieve such a high degree of generality by proposing and using a new multiple generalised change-point detection device, termed Narrowest-Over-Threshold (NOT). The key ingredient of NOT is its focus on the smallest local sections of the data on which the existence of a feature is suspected. For selected scenarios, we show the consistency and near-optimality of NOT in detecting the number and locations of generalised change-points. TheNOTestimators are easy to implement and rapid to compute. Importantly, the NOT approach is easy to extend by the user to tailor to their own needs. Our methodology is implemented in the R package not .


Introduction
This paper considers the canonical univariate statistical model where the deterministic and unknown signal f t is believed to display some regularity across the index t, and the stochastic noise ε t is exactly or approximately centred at zero.Despite the simplicity of model (1), inferring information about f t remains a task of fundamental importance in modern applied statistics and data science.When the interest is in the detection of "features" in f t such as jumps or kinks, then non-linear techniques are usually required.
If f t is modelled as piecewise-constant and it is of interest to detect its change-points, several techniques are available, and we only mention a selection.For Gaussian noise ε t , both non-penalised and penalised least squares approaches are considered by Yao and Au (1989).For specific choices of penalty functions, see e.g.Yao (1988), Lavielle (2005) and Davis et al. (2006).The Gaussianity assumption on ε t is relaxed to exponential family distributions in Lee (1997), Hawkins (2001) and Frick et al. (2014).In particular, Frick et al. (2014) also provide confidence intervals for the location of the estimated changepoints.Often this penalty-type approach requires a computational cost of at least O(T 2 ).However, there are exceptions, such as the Pruned Exact Linear Time method (PELT; Killick et al. (2012a)), which achieves a linear computational cost, but requires the further assumption that change-points are separated by time intervals drawn independently from some probability distribution, a scenario in which considerations of statistical consistency are not generally possible.A nonparametric version of PELT is investigated by Haynes et al. (2017).Another general approach is based on the idea of Binary Segmentation (BS; Vostrikova, 1981), which can be viewed as a greedy approach with a limited computational cost.Its popular variants include the Circular Binary Segmentation (CBS; Olshen et al., 2004) and the Wild Binary Segmentation (WBS; Fryzlewicz, 2014).A selection of publications and software can be found in the online repository changepoint.infomaintained by Killick et al. (2012b).
More general change-point problems, in which f t is modelled as piecewise-parametric (not necessarily piecewise-constant) between "knots", the number and locations of which are unknown and need to be estimated, have attracted less interest in the literature and overwhelmingly focus on linear trend detection.Among them, we mention the approach based on the least squares principle and Wald-type tests by Bai and Perron (1998), dynamic programming using the L 0 penalty (Maidstone et al., 2017), and trend filtering (Tibshirani, 2014;Lin et al., 2017).Finally, we mention a related problem of jump regression, where the aim is to estimate the points of sharp cusps or discontinuities of a regression function.As investigated in, e.g., Wang (1995) and Xia and Qiu (2015), it proceeds by estimating the locations of features nonparametrically via wavelets or local kernel smoothing.
The aim of this work is to propose a new, generic approach to the problem of detecting an unknown number of "features" occurring at unknown locations in f t .By a feature, we mean a characteristic of f t , occurring at a location t 0 , that is detectable by considering a sufficiently large subsample of data Y t around t 0 .Examples include: change-points in f t when it is modelled as piecewise-constant, change-points in the first derivative when f t is modelled as piecewise-linear and continuous, and discontinuities in f t or its first derivative when f t is modelled as piecewise-linear but without the continuity constraint.We will provide a precise description of the type of features we are interested in later.Moving beyond f t only, our approach will also permit the detection of similar features present in some distributional aspects of ε t , for example in its variance.Since all types of features we consider describe changes in a parametric description of f t , we use the terms "feature detection" and "change-point detection" interchangeably throughout the paper.Occasionally, for precision, we will be referring to change-point detection in the piecewise-constant model as the "canonical" change-point problem, while our general feature detection problem will sometimes be referred to as a "generalised" change-point problem.
Core to our approach is a particular blend of "global" and "local" treatment of the data Y t in the search for the multiple features that may be present in f t , a combination that gives our method a multiscale character.At the first "global" stage, we randomly draw a number of subsamples (Y s+1 , . . ., Y e ) , where 0 ≤ s < e ≤ T .On each subsample, we assume, possibly erroneously, that only one feature is present and use a tailor-made contrast function derived (according to a universal recipe we provide later) from the likelihood theory to find the most likely location of the feature.We retain those subsamples for which the contrast exceeds a certain user-specified threshold, and discard the others.Amongst the retained subsamples, we search for the one drawn on the narrowest interval, i.e. one for which e − s is the smallest: it is this step that gives rise to the name Narrowest-Over-Threshold (NOT) for our methodology.The focus on the narrowest interval constitutes the "local" part of the method, and is a key ingredient of our approach which ensures that with high probability, at most one feature is present in the selected interval.This key observation gives our methodology a general character and allows it to be used, only with minor modifications, in a wide range of scenarios, including those described in the previous paragraph.Having detected the first feature, the algorithm then proceeds recursively to the left and to the right of it, and stops, on any current interval, if no contrasts can be found that exceed the threshold.
Besides its generic character, other benefits of the proposed methodology include low computational complexity, ease of implementation, accuracy in the detection of the feature locations, and the fact that it enables parametric estimation of the signal on each section delimited by a pair of neighbouring estimated features.Regarding the computational complexity, the fact that typical contrasts are computable in linear time leads to a computational complexity of O(M T ) for the entire procedure; typically, only a limited number of data subsamples, M , need to be drawn (we provide precise bounds later; with finitely many change-points, one can take M = O(log T ) in general).Moreover, the entire threshold-indexed solution path can also be computed efficiently, in typically close-to-linear time, as observed from our numerical experiments.Regarding the estimation accuracy, in the scenarios we consider theoretically, our procedure yields near-optimal rates of convergence for the estimators of feature locations.
On a broader level, our methodology promotes the idea of fitting simple models on subsets of the data (the local aspect), and then aggregating the results to obtain the overall fit (the global aspect), an idea also present in the Wild Binary Segmentation method of Fryzlewicz (2014).However, we emphasise that the way the simple models (here: models containing at most one change-point or feature) are fitted in the NOT and WBS methods are entirely different and have different aims.Unlike the WBS, the NOT methodology focuses on the narrowest intervals of the data on which it is possible to locate the feature of interest.It is this focus that enables NOT to extend beyond change-point detection for a piecewise-constant f t , the latter being the sole focus of the WBS method.The lack of the narrowest-interval focus in the WBS and BS methods means that they are not applicable to more general feature detection, and we explain the mechanics of this important phenomenon briefly in the following simple example.Consider a continuous piecewise-linear signal that has two change-points: (2) If we approximate f t using a piecewise-linear signal with only one change-point in its derivative, then the best approximation (in terms of minimising the 2 distance) will result in an estimated change-point at t = 500, which is away from the true ones at t = 350 and t = 650, as is illustrated in Figure 1.Therefore, taking the entire sample of data and searching for one of its multiple change-points by fitting, via least squares, a triangular signal with a single change-point, does not make sense.It is this issue that leads to the failure of the BS and WBS methods for signals that are not piecewise constant.On the other hand, NOT avoids this issue because of its unique feature of picking the narrowest intervals, which are likely to contain only one change-point.To understand the mechanics of this key feature, imagine that now f t is observed with noise.Through its pursuit of the narrowest intervals, NOT will ensure that, with high probability, some suitably narrow intervals around the change-points t = 350 and t = 650 are considered.More precisely, by construction, they will be narrow enough to contain only one change-point each, but wide enough for the designed contrast (see Section 2.3 for more on contrasts) to indicate the existence of the change-point within both of them.The designed contrast function will indicate the correct location of the change-point (modulo the estimation error) if only one change-point is present in the data subsample considered, unlike in the situation described earlier in which multiple change-points were included in the chosen interval.More details on this example are presented in Section C.3 of the online supplementary materials.
Note that this example is different from the canonical change-point detection problem (i.e.piecewise-constant signal with multiple change-points), where if we approximate the signal using a piecewise-constant function with only one change-point, the change-point of the fitted signal will always be among the true ones (Venkatraman, 1992).Since the latter property does not hold in most generalised change-point detection problems, this highlights the need for new methods with better localisation of the feature of interest, such as our NOT algorithm.Fang et al. (2016) independently consider a related shortest-interval idea in the context of the canonical change-point detection problem.However, they did not consider it as a springboard to more general feature detection problems, which is the key motivation behind NOT and its most valuable contribution.
The remainder of this paper is organised as follows.In Section 2, we give a mathematical description of NOT.In particular, we consider NOT in four scenarios, each with a different form of structural change in the mean and/or variance.For the development of both theory and computation, in selected scenarios, we introduce the tailor-made contrast function derived from the generalised likelihood ratio (GLR).Theoretical properties of NOT, such as its consistency and convergence rates are also provided.In Section 3, we propose to use NOT with the strengthened Schwarz Information Criterion (sSIC) and discuss its computational aspects and theoretical properties.Section 4 discusses possible extensions of NOT.A comprehensive simulation study is carried out in Section 5, where we compare NOT with the state-of-art change-point detection tools.In Section 6, we consider data examples of global temperature anomalies and London housing data.All proofs, together with details on the construction of the contrast functions, the computational aspects and extension of NOT, further discussion on model misspecification, as well as additional simulations and real data example can be found in the online supplementary materials.

Setup
To describe the main framework of NOT, we consider a simplified version of (1), where Y = (Y 1 , . . ., Y T ) is modelled through where f t is the signal, and where σ t is the noise's standard deviation at time t.To facilitate the technical presentation of our results, in Sections 2 and 3, we assume that ε t i.i.d.
We assume that (f t , σ t ) can be partitioned into q + 1 segments, with q unknown distinct change-points 0 = τ 0 < τ 1 < . . .< τ q < τ q+1 = T .Here the value of q is not prespecified and can grow with T .For each j = 1, . . ., q + 1 and for t = τ j−1 + 1, . . ., τ j , the structure of (f t , σ t ) is is modelled parametrically by a local (i.e.depending on j) real-valued d-dimensional parameter vector Θ j (with Θ j = Θ j−1 ), where d is known and typically small.To fix ideas, in the following, we assume that each segment of f t and σ t follows a polynomial.In addition, we require the minimum distance between consecutive change-points to be ≥ d for the purpose of identifiability.(Otherwise, e.g.take f t to be piecewise-linear with a known constant σ t , in which case d = 2; if we had a segment of length 1, then we would not be able to define a line based on a single point.)In other words, (f t , σ t ) can be divided into q different segments, each from the same parametric family of much simpler structure.Some commonly-encountered scenarios are listed below, where the following holds inside the j-th segment for each j = 1, . . ., q + 1: (S1) Constant variance, piecewise-constant mean: σ t = σ 0 and f t = θ j for t = τ j−1 + 1, . . ., τ j .
Since σ 0 in (S1)-(S3) acts as a nuisance parameter, in the rest of this manuscript, for simplicity we assume that its value is known.If it is unknown, then it can be estimated accurately using the Median Absolute Deviation (MAD) method (Hampel, 1974).More specifically, with i.i.d.Gaussian errors, the MAD estimator of σ 0 is defined as in Scenarios (S2) and (S3).Here Φ −1 (•) denotes the quantile function of the standard normal distribution.Note that the MAD estimator is robust to any change-points present in the underlying signal f t , due to its combination of working with the differenced data, and its use of the median.Finally, we note that a different procedure is proposed to estimate σ 0 with dependent errors; see Section 4.1 for more details.

Main idea
We now describe the main idea of NOT formally; more details can be found in Section 2.4 where the pseudo-code of the NOT algorithm is given.
In the first step, instead of directly using the entire data sample, we randomly extract subsamples, i.e. vectors (Y s+1 , . . ., Y e ) , where (s, e) is drawn uniformly from the set of pairs of indices in {0, . . ., T − 1} × {1, . . ., T } that satisfy 0 ≤ s < e ≤ T .Let (Y s+1 , . . ., Y e ; Θ) be the likelihood of Θ given (Y s+1 , . . ., Y e ) .We then compute the generalised log-likelihood ratio (GLR) statistic for all potential single change-points within the subsample and pick the maximum, that is, Note that here we also implicitly require e − s ≥ 2d, which comes from the identifiability condition, because typically we need at least d observations to determine Θ 1 , and another d observations to determine Θ 2 .If constraints are in place between Θ j and Θ j+1 for any j = 1, . . ., q (e.g. as in (S2)), the supremum in the numerator of ( 4) is taken over the set that only contains elements of form Θ 1 × Θ 2 satisfying these constraints.Otherwise, as in (S1), ( S3) and ( S4), ( 4) can be simplified to The above procedure is repeated on M randomly drawn pairs of integers (s 1 , e 1 ), . . ., (s M , e M ).
In the second step, we test all R (sm,em] (Y) for m = 1, . . ., M against a given threshold ζ T .Among those significant ones, we pick the one corresponding to the interval (s m * , e m * ] that has the smallest length.Once a change-point is found in (s m * , e m * ] (i.e.b * that maximises R b (s m * ,e m * ] (Y), a function of b), the same procedure is then repeated recursively to the left and to the right of it, until no further significant GLRs can be found.Note that in each recursive step, one could reuse the previously drawn intervals, provided that they fall entirely within each current subsegment considered.
After the process of estimating the change-points is completed, one can estimate the signals within each segment using standard methods such as least squares or maximum likelihood.Note that the estimation of knot locations in spline regression can be viewed as a multiple change-point detection problem set in the context of polynomial segments that are continuously differentiable but have discontinuous higher order derivatives at the change-points between these segments; NOT can be used for this purpose.
Admittedly, in our framework, one could also use a deterministic scheme (e.g. that in Rufibach and Walther (2010)) to pick a sufficiently rich family of intervals for multiscale inference.However, one advantage of our approach is that through the use of randomness in drawing the intervals, we avoid having to make a subjective choice of a particular fixed design.Nevertheless, with a very large number of intervals drawn, the difference in performance between the random and deterministic designs is likely to be minimal, an observation also made in Fryzlewicz (2014).

Log-likelihood ratios and contrast functions
In many applications, the GLR (4) in NOT can be simplified with the help of "contrast functions" under the setting of Gaussian noise.In particular, these constructions mainly involve taking inner products between the data and other deterministic vectors, which greatly facilitates the development of both theory and computation, especially if these deterministic vectors are mutually orthonormal.In fact, the form of these contrast functions is crucial in our theoretical development.
More precisely, for every integer triple (s, e, b) with 0 ≤ s < e ≤ T , our aim is to find In the following, we give the contrast functions corresponding to scenarios (S1) and ( S2), where the aforementioned properties are satisfied.Their details under scenarios ( S3) and (S4), as well as a comprehesive discussion on the construction, can be found in Section B of the online supplementary materials.We note that this approach recovers the CUSUM statistic in (S1), which is popular in this canonical change-point detection setting.One can view the resulting statistics as generalisations of CUSUM under other scenarios.   .

The NOT algorithm
Here we present the pseudo-code of a generic version of the NOT algorithm.The main ingredient of the NOT procedure is a contrast function C b (s,e] (•), chosen by the user, depending on the assumed nature of change-points in the data, e.g. as exemplified by our scenarios (S1) and (S2) above, and scenarios (S3) and (S4) in Section B of the online supplementary materials.In addition, some tuning parameters are needed: ζ T > 0 is the threshold with respect to which the contrast should be tested, while M is the number of the intervals drawn in the procedure.Guidance on the choice of ζ T and M is given in Section 3. In particular, there we advocate an automatic choice of ζ T by combining NOT with an information-based criterion, thus making our procedure threshold-free.
To sum up, the input include the data vector Y, the set of F M T that contains all randomly drawn sub-intervals for testing, and the global variable S for the set of estimated changepoints initialised with S = ∅.Then NOT is started recursively with (s, e] = (0, T ] and a given ζ T .Here the entire set of F M T that contains all random intervals is generated before we start running Algorithm 1.In this way, we are better able to control the computational complexity of the entire procedure.

Theoretical properties of NOT
In this section, we analyse the theoretical behaviour of the NOT algorithm in Scenarios (S1) and (S2).We use infill asymptotics, which is standard in the literature on a posteriori change-point detection.An attractive feature of our methodology is that proofs for other scenarios can in principle be constructed "at home" by the user, by following the same generic proof strategy as the one we use for these two scenarios.
First, we revisit the canonical change-point detection problem, (S1), where the signal vector f = (f 1 , . . ., f T ) is piecewise-constant.Here σ 0 is assumed to be known.Otherwise, one can plug in the MAD estimator, described in Section 2.1, without affecting the validity of our theory.For notational convenience, we set σ 0 = 1.For other values of σ 0 , our theorems are still valid with only minor adjustments to the constants therein.Explicit T being a set of M left-open and right-closed intervals, with each pair of start-and end-points drawn independently and uniformly from the set of pairs of indices in {0, . . ., T − 1} × {1, . . ., T } that satisfy the conditions outlined at the beginning of Section 2.2, S = ∅.Output: Set of estimated change-points S ⊂ {1, . . ., T }.
To start the algorithm: expressions for all the constants (i.e.C, C 1 , C 2 , C 3 ) are given in Section I.2 of the online supplementary materials.
..,q ∆ f j .Let q and τ1 , . . ., τq denote, respectively, the number and locations of change-points, sorted in increasing order, estimated by Algorithm 1 with the contrast function given by (6).Then there exist constants C, C 1 , C 2 , C 3 > 0 (not depending on T ) such that given δ In the simplest canonical case where we have finitely many changepoints with δ T ∼ T and f T ∼ 1, so the condition δ log T is always satisfied for a sufficiently large T .Theorem 1 indicates that the NOT procedure requires M = O(log T ) many random intervals for consistent detection of all the change-points, which leads to a total computational cost of O(T log T ) for the entire procedure.Furthermore, max j=1,...,q |τ j − τ j | = O p (log T ), which trails the minimax rate of O p (1) by only a logarithmic factor.In addition, we note that the NOT procedure allows for δ 1/2 T f T , a quantity that characterises the difficulty level of the problem, to be of order √ log T .As argued in Chan and Walther (2013), this is the smallest rate that permits change-point detection for any method from a minimax perspective.
Next, we revisit Scenario (S2), in which the signal is piecewise-linear and continuous.Again, we set σ 0 = 1 for notational convenience.Explicit expressions of the constants in the following theorem (i.e.C, C 1 , C 2 , C 3 ) can be found in Section I.3 of the online supplementary materials.
..,q ∆ f j .Let q and τ1 , . . ., τq denote, respectively, the number and locations of change-points, sorted in increasing order, estimated by Algorithm 1 with the contrast function given by (8).Then there exist constants In the case in which we have finitely many change-points with δ T ∼ T , we again need M = O(log T ) random intervals for consistent estimation of all the change-points, leading to the total computational cost of O(T log T ).In addition, when f T ∼ T −1 (a case in which f t is bounded), our theory indicates that the resulting change-point detection rate of NOT is O p (T 2/3 (log T ) 1/3 ), which is different from the rate of O p (T 2/3 ) derived by Raimondo (1998) by only a logarithmic factor; moreover, under additional assumptions and with a more careful but restrictive choice of ζ T , this rate can be further improved to O p (T 1/2 (log T ) 1/2 ); see Section 3.4 and Lemma 9 in the online supplementary materials for more details.Furthermore, we remark that in more general cases (i.e.number of change-points increasing with T ) in Scenario (S2), the difficulty level of the problem in Scenario (S2) can be charaterised by δ 3/2 T f T , a quantity analogous to δ 1/2 T f T in the setting of (S1).
Both Theorem 1 and Theorem 2 imply that there exists an admissible range of thresholds that would ensure consistent change-point detection.They pave the way for establishing Theorem 3 and Theorem 4 in Section 3, which promote the automatic selection of the threshold via an information criterion.
Finally, we emphasise again that the WBS will fail to estimate change-point consistently in Scenario (S2), for reasons described in Section 1.

Motivation
The success of Algorithm 1 depends on the choice of the threshold ζ T .Although Theorem 1 and Theorem 2 state that there exists ζ T that guarantee consistent estimation of the changepoints, this choice still typically depends on some unobserved quantities; furthermore, there are many more general scenarios where a theoretically optimal threshold might be difficult to derive.
Note that for a given Y and F M T , each threshold ζ T corresponds to a candidate model produced by NOT.Therefore, if we could produce a "solution path" of candidate models obtained from NOT along all possible thresholds, we could then try to select the best model along the solution path via minimising an information-based criterion.In this sense, the task of selecting the best threshold is equivalent to selecting the best model on the solution path.
) for any i = 0, 1, . . ., N − 1, and T .However, the thresholds T are unknown and depend on the data, therefore naively applying Algorithm 1 on a range of pre-specified thresholds typically does not recover the entire solution path.Moreover, from the computational point of view, repeated application of Algorithm 1 to find the solution path is not optimal either, because intuitively one would expect the solutions for ζ T to be similar for most i.These issues are circumvented by Algorithm 2, which is able to compute the entire threshold-indexed solution path quickly, thus facilitating the study of a data-driven approach to the choice of ζ T in Section 3.3.The key idea of Algorithm 2 is to make use of information from T (ζ ) iteratively for every i = 0, . . ., N − 1.The pseudo-code of Algorithm 2, as well as other relevant details, can be found in Section C.2 of the online supplementary materials.

Choice of ζ T via the strengthened Schwarz Information Criterion (sSIC)
Suppose we have T (ζ (1) ), . . ., T (ζ (N ) ) that form the NOT solution path, i.e. the collection of candidate models produced by Algorithm 2. We propose to select T (ζ (k) ) that minimises the strengthened Schwarz Information Criterion (sSIC; Liu et al. (1997) T .Further, denote by n k the total number of estimated parameters, including the locations of the change-points and free parameters in Θ 1 , . . ., Θ qk+1 (N.B. the total number of the latter can be different from the dimensionality of each Θ j multiplied by the number of segments, as e.g. in (S2)).Then the strengthened Schwarz Information Criterion (sSIC) is sSIC for some pre-given α ≥ 1, with τ0 = 0 and τqk+1 = T .When α = 1, we recover the well-known Schwarz Information Criterion (SIC).
One reason we use the sSIC here is to facilitate our theoretical development below.In fact, once we obtain the NOT solution path via Algorithm 2, other criteria, such as MBIC (Zhang and Siegmund, 2007), Minimum Description Length (MDL; Davis et al. (2006)) or Steepest Drop to Low Levels(SDLL; Fryzlewicz (2018b)) could conceivably be used for model (or equivalently, threshold) selection.

Theoretical properties of NOT with the sSIC
In this section, we analyse the theoretical behaviour of NOT with the sSIC in Scenarios (S1) and ( S2).Here we focus on the situation where the number of change-points q is fixed (i.e.does not increase with T ).This is typical for the theoretical development of information-criteron-based approaches, and reflects the fact that such approaches tend to work better in practice for signals with at most a moderate number of change-points.See also Yao (1988).Again, for notational convenience, we set σ 0 = 1.Our results below provide theoretical justifications for using NOT with the sSIC.Crucially, in contrast to Algorithm 1, here one does not need to supply a threshold.

Computational complexity
Here we elaborate on the computational complexity of Algorithm 1 (see Section 2.4) and Algorithm 2 (see Section 3.2 and Section C.2 of the online supplementary materials).For both algorithms, the task of computation can be divided into two main parts.First, we need to evaluate a chosen contrast function for all points in the M randomly picked left-open and right-closed intervals with their start-and end-points in {0, . . ., T − 1} and {1, . . ., T } respectively.In the second part, we find potential locations of the change-points for a single threshold ζ T in the case of Algorithm 1 and for all possible thresholds in the case of Algorithm 2.
Naturally, the computational complexity of the first part depends on the cost of computing the contrast function for a single interval.In all scenarios studied in this paper, this cost is linear in the length of the interval, i.e. the cost of computing {C b (s,e] (Y)} e−1 b=s+1 is O(e − s).This is explained in detail in Section C.1 of the online supplementary materials.The intervals drawn in the procedures have approximately O(T ) points on average, therefore the computational complexity of the first part of the computations is O(M T ) in a typical application.Importantly, as the calculations for one interval are completely independent of the calculations for another, it is straightforward to run these computations in an "embarrassingly parallel" manner.In addition, for the second part, as mentioned in detail in the Section C.2 of online supplementary materials, its computational complexity is typically less than O(M T ), thus bringing the total computational complexity of both Algorithm 1 and Algorithm 2 to O(M T ).
Figure 3 shows execution times for the implementation of Algorithm 2, the NOT solution path algorithm, implemented in the R package not, with the data {Y t } T t=1 being i.i.d.N (0, 1).The running times appears to scale linearly both in T (Figure 3a) and in M (Figure 3b), which provides evidence that the computational complexity of Algorithm 2 in this particular example is practically of order O(M T ).
Finally, we remark that the memory complexity of Algorithm 2 is also O(M T ), which combined with its low computational complexity implies that our approach can handle problems of size T in the range of millions.(Baranowski et al., 2016b), for various feature detection problems with the data Y t , t = 1, . . ., T being i.i.d.N (0, 1).In a single run, computations for the input of the algorithm are performed in parallel, using 8 cores of an Intel Xeon 3.6 GHz CPU with 16 GB of RAM.The computation times are averaged over 10 runs in each case.
3.6.Other practical considerations 3.6.1.Choice of M As can be seen in Theorem 1 and Theorem 2, the minimum required value for M grows with T (i.e. at O(log T ), for a fixed number of well-spaced change-points).In practice, when the number of observations is of the order of thousands, we would recommend setting M = 10000.With this value of M , the implementation of Algorithm 1 provided in the R not package (Baranowski et al., 2016b) achieves the average computation time not longer than 2 seconds in all examples in Section 5 using a single core of an Intel Xeon 3.6 GHz CPU.This can be accelerated further, as the not package allows for computing the contrast function over the intervals drawn in parallel using all available CPU cores.
However, caution must be exercised for signals with a large expected number of changepoints, for which M may need to be increased.For example, Maidstone et al. (2017) found that NOT with M = 10 5 offered better practical performance on the change-point-rich signals they considered.In the most extreme scenario where one expects change-points to occur very frequently with a large T , we would recommend picking M as large as possible to match the available computational power and applying a penalty less stringent than the sSIC.See Section F of the online supplementary materials.

Early stopping for NOT with the sSIC
If the number of change-points in the data is expected to be rather moderate, then it may not be necessary to calculate sSIC for all k.In practice, solutions on the path corresponding to very small values of ζ T contain many estimated change-points.Such solutions are unlikely to minimise (11).By considering |T (ζ we could achieve some computational gains without adversely impacting the overall performance of the methodology.As such, in all applications presented in this work we compute sSIC only for k such that T )| ≤ q max with q max = 25.

NOT under different noise types
In this section, we discuss how NOT can be extended to handle different noise types.Section 4.1 deals with dependent noise, while Section 4.2 covers heavy-tailed noise.In addition, we investigate the case of noise with slow-varying variance in Section D of the online supplementary materials.

NOT under dependent noise
When the errors ε t in model (3) are dependent with Eε t = 0 and Var(ε t ) = 1, the aforementioned NOT procedure can still be applied as a quasi-likelihood-type procedure.Conceivably, using NOT here would incur information loss.As is shown in Corollaries 1 and 2 in Scenarios (S1) and (S2), NOT is still consistent if we replace the noise's i.i.d.assumption in Theorems 1 and 2 by stationarity with short-memory.This new dependence assumption is satisfied by a large class of stationary time series models, including autoregressive moving average (ARMA) models.See also numerical examples in Section E of the online supplementary materials, where we again select the thresholds automatically via sSIC.Here we assume that σ 0 = 1.However, if not, MAD-type estimators based on the simple differencing are no longer appropriate for dependent data.We comment on this issue later.
The following corollaries give guidelines on the choice of the threshold, as well as guarantee on the performance of NOT from a theoretical perspective.
Corollary 1. Suppose Y t follow model (3) in Scenario (S1), but with {ε t } being a stationary short-memory Gaussian process, i.e. the auto-correlation function of {ε t }, denoted by Then, the conclusion of Theorem 1 still holds (with different constants).
Corollary 2. Suppose Y t follow model (3) in Scenario (S2), but with {ε t } being a stationary short-memory Gaussian process.The conclusion of Theorem 2 holds (with different constants).
In our theoretical development for the dependent noise setting, the smallest permitted threshold to be used in the NOT algorithm depends linearly on σ 0 ( ∞ k=−∞ |ρ k |) 1/2 .This quantity can also be viewed as a generalisation to the independent noise setting, where the threshold is proportional to σ 0 (since ∞ k=−∞ |ρ k | = 1).More details of its derivation is provided in Section I.6 of the online supplementary materials.
This poses a few challenges in the practical application of NOT to signals with dependent noise: (i) the (pre-)estimation of the residuals ε t in preparation for the estimation of their long-run variance; (ii) the estimation σ 0 ; and (iii) the estimation of σ 0 ( ∞ k=−∞ |ρ k |) 1/2 .These problems are known to be difficult in time series analysis in general.A possible solution is outlined below.
For (i), we have had some success with the wavelet-based method of Johnstone and Silverman (1997), which was implemented in R package wavethresh (Nason, 2016); its advantages are that it is specifically designed for dependent noise and that, being based on nonlinear wavelet shrinkage, it is particularly suited for signals with irregularities, such as (generalised) change-points.Here the Haar wavelet transform of the data is appropriate in Scenario (S1), while a transform with respect to any wavelet that annihilates linear functions is appropriate in Scenarios (S2) and (S3).Once the empirical residuals are obtained from (i), we could then estimate σ 0 in (ii) by its sample version, and estimate σ 0 ( ∞ k=−∞ |ρ k |) 1/2 in (iii) in a model-based way (e.g. using the autoregressive model with its order p chosen by an information criterion).
Another possibility to estimate change-points under dependent noise is to use selfnormalising based statistics.See, for instance, Shao and Zhang (2010), Betken (2016), Pešta and Wendler (2018) and Zhang and Lavitas (2018).These statistics could potentially be fed into our NOT approach as well.
Finally, we mention two practical ways of reducing the dependence and making the series closer to Gaussian, before applying NOT: (A) pre-average the data over non-overlapping moving windows of size h, creating a new dataset of length T /h ; the hope is that by the law of large numbers, the pre-averaged noise will be closer to Gaussian and also less serially dependent than the original noise; (B) add additional i.i.d.Gaussian noise to the data, with mean zero and suitably chosen standard deviation; this will have a similar effect as previously, i.e. it will bring the distribution of the data closer to Gaussian and reduce the serial dependence within the data.

Extension of NOT under heavy-tailed noise
NOT appears to be relatively robust under noise misspecification.As is demonstrated later in Section 5, it offers reasonable estimates when the noise is non-Gaussian but the Gaussian contrast functions are used.We now discuss how its performance can be improved further in the presence of heavy-tailed noise.
In Scenario (S1), we propose to apply the following new contrast function, defined for Y and 0 < s < b < e < T as in our NOT procedure.Here for any vector v = (v 1 , . . ., v T ) , the i-component of S (s,e] (v) is given by S (s,e] (v) i = sign v i − (e − s) −1 e t=s+1 v t and ψ b (s,e] is defined by ( 5).(For certain noise distributions, subtracting the sample median of v instead of the sample mean would appear more appropriate.)The rationale behind ( 12) is to assign Y s+1 − 1 e−s e t=s+1 Y t , . . ., Y e − 1 e−s e t=s+1 Y t (i.e.residuals for fitting a curve with no changepoint on a given interval) into two classes (±1, i.e. a two-point distribution, thus with light tails) and apply the contrast function to their ±1 labels.Empirical performance of NOT (via Algorithm 2) combined with (12) and sSIC is also illustrated in Section E of the online supplementary materials.

Settings
We consider examples following (S1)-(S4) introduced in Section 2.3, as well as an extra example satisfying (S5) σ t = σ 0 and f t is a piecewise-quadratic function of t.
We simulate data according to Equation ( 3 (e) a stationary Gaussian AR(1) process of ϕ = 0.3, with zero-mean and unit-variance.
A detailed specification of our test models can be found in Section A of the online supplementary materials.Figure 4 shows the examples of the data generated from models (M1)-(M7), as well as the estimates produced by NOT in a typical run.

Estimators
We apply Algorithm 2 to compute the NOT solution path and pick the solution minimising the sSIC introduced in Section 3.3 with α = 1 (which is equivalent to the SIC).In each simulated example, we use the contrast function designed to detect change-points in the scenario that the example follows, given in Section 2.3 and Section B of the online supplementary materials under the assumption that ε t is i.i.d.Gaussian.The resulting method is referred to simply as 'NOT'.In addition, for Scenario (S1) only, we also apply Algorithm 2 combined with (12) and the SIC, which we call 'NOT HT'.Here 'HT' stands for 'heavy tails'.The number of intervals drawn in the procedure and the maximum number of change-points for the SIC are set to M = 10000 and q max = 25, respectively.
We then compare the performance of NOT and NOT HT against the best competitors available on CRAN.To the best of our knowledge, none of the competing packages can be applied in all of Scenarios (S1)-(S5).
Note that e-cp3o, NMCD, NOT, PELT and NP-PELT can be also used for change-point detection in Scenario (S4), where change-points occur in the mean and variance of the data.In addition, for Scenario (S4), we also include the Heterogeneous SMUCE method (Pein et al., 2017) implemented in stepR (Pein et al., 2018), and the Segment Neighbourhoods method (Auger and Lawrence, 1989) implemented in changepoint (Killick and Eckley, 2014;Killick et al., 2016).We refer to them as HSMUCE and SegNeigh respectively.
Only the B&P method allows for change-point detection in piecewise-linear and piecewisequadratic signals (in particular, the WBS is not suitable for these settings as described in Sections 1 and 2.5), hence we also study the performance of the trend filtering methodology of Kim et al. (2009) termed as TF hereafter, using the implementation available from the R package genlasso (Taylor and Tibshirani, 2014), to have a broader comparison.See also Lin et al. (2017).The TF method aims to estimate a piecewise-polynomial signal from the data, not focusing on the change-point detection problem directly.Let f (T F ) t denote the TF estimate of the true signal f t , then the TF estimates of the change-points in Scenario (S2) are defined as those τ for which |2 , where > 0 is a very small number being the numerical tolerance level (more precisely, we set = 1.11 × 10 −15 in our study).In the piecewise-quadratic case, the change-points are defined as those τ for which the third order differences | We note that both B&P and TF require a substantial amount of computational resources in this study.
Finally, we remark that the tuning parameters for the competing methods are set to the values recommended by the corresponding R packages, and the R code for all simulations can be downloaded from our GitHub repository (Baranowski et al., 2016a).

Results
Here we only present the results under the setting where the noise is (a) i.i.d.standard normal in Table 1.Additional results under the other above-mentioned noise settings can be found in Section E of the online supplementary materials.
For each method, we show a frequency table for the distribution of q − q, where q is the number of the estimated change-points and q denotes the true number of change-points.We also report Monte-Carlo estimates of the Mean Squared Error of the estimated signal, given by MSE = E 1 T T t=1 (f t − ft ) 2 .For all methods but TF, ft is calculated by finding the least squares (LS) approximation of the signal of the appropriate type depending on the true f t , between each consecutive pair of estimated change-points.For TF, ft used in the definition of the MSE is the penalised least squares estimate of f t returned by the TF algorithm.
To assess the performance of each method in terms of the accuracy of the estimated locations of the change-points, we report estimates of the (scaled) Hausdorff distance where 0 = τ 0 < τ 1 < . . .τ q < τ q+1 = T and 0 = τ0 < τ1 < . . .τq < τq+1 = T denote, respectively, true and estimated locations of the change-points.From the definition above, it follows that 0 ≤ d H ≤ 1.An estimator is regarded as performing well when its d H is close to 0. However, d H would be large when the number of change-points is under-estimated or some of the estimated change-points are far away from the real ones.In addition, we also report estimates of the inverse V-measure d V defined as where V (•, •) is the V-measure (with β = 1) proposed by Rosenberg and Hirschberg (2007) for the evaluation of segmentation.An estimator is regarded as performing well when its d V is close to 0. More specifically, 0 ≤ d V ≤ 1, and a perfect estimator has d V = 0, while d V = 1 means none of the features are detected (i.e.q = 0).We find that in most of the simulated scenarios, NOT is among the most competitive methods in terms of the estimation of the number of change-points, their locations, as Table 1.Distribution of q − q for data generated according to (3) with the noise term ε t being i.i.d.N (0, 1) for various choices of f t and σ t given in Section A of the online supplementary materials and competing methods listed in Section 5. Also, the average Mean-Square Error of the resulting estimate of the signal f t , average Hausdorff distance d H , average inverse V-measure d V and average computation time in seconds using a single core of an Intel Xeon 3.6 GHz CPU with 16 GB of RAM, all calculated over 100 simulated data sets.Bold: methods with the largest empirical frequency of q − q = 0 or smallest average of d H or d V , and those within 10% of the highest or lowest accordingly.well as the true signal.Importantly, it is very fast to compute, which gives it a particular advantage over its competitors in Scenarios (S2), ( S3) and (S5).Finally, NOT with the contrast function derived under the assumption that the noise is i.i.d.Gaussian is relatively robust against the misspecification in ε t , when the truth is either correlated or heavy-tailed.

Temperature anomalies
We analyse the GISS Surface Temperature anomalies data set available from GISTEMP Team (2016), consisting of monthly global surface temperature anomalies recorded from January 1880 to June 2016.The anomaly here is defined as the difference between the average global temperature in a given month and the baseline value, being the average calculated for that time of the year over the 30-year period from 1951 to 1980; for more details see Hansen et al. (2010).This and similar anomalies series are frequently studied in the literature with a particular focus on identifying change-points in the data, see e.g.Ruggieri (2013) or James and Matteson (2015).The plot of the data (Figure 5a) indicates the presence of a linear trend with several change-points in the temperature anomalies series.The corresponding changes are not abrupt, therefore we believe that Scenario (S2) with change-points in the slope of the trend is the most appropriate here.To detect the locations of the change-points, we apply NOT (via Algorithm 2) with the contrast given by (8), combined with the SIC to determine the best model on the solution path.
The NOT estimate of the piecewise-linear trend and the corresponding empirical residuals are shown in Figure 5.We identify 8 change-points located at the following dates: March 1901, December 1910, July 1915, June 1935, April 1944, December 1946, June 1976and May 2015.Previous studies conducted on similar temperature anomalies series (observed at a yearly frequency and obtained from a different source), report change-points around 1910change-points around , 1945change-points around and 1976change-points around (see Ruggieri (2013) ) for an overview of a number of related analyses).In addition to the change-points around these dates, NOT identifies two periods, 1901-1915 and 1935-1946, with local deviations from the baseline.We also observe a long-lasting upward trend in the anomalies series starting in December 1946.Finally, NOT indicates that the slope of the trend is increasing, with the most recent change-point in May 2015.

UK House Price Index
We analyse monthly percentage changes in the UK House Price Index (HPI), which provides an overall estimate of the changes in house prices across the UK.The data and a detailed description of how the index is calculated are available online from UK Land Registry (2016).Fryzlewicz (2018a), who proposes a method for signal estimation and changepoint detection in Scenario (S1), used this data set to illustrate the performance of his methodology.We perform a similar analysis, assuming the more flexible Scenario (S4), allowing for changes both in the mean and the variance, which, we argue, leads to additional insights and better-interpretable estimates for this dataset.As in Fryzlewicz (2018a), we analyse the percentage changes in the HPI for three London boroughs, namely Hackney, Newham and Tower Hamlets, all of which are located in East London.Hackney and Tower of Hamlets border on the City of London, a major business and financial district, with the latter being home to Canary Wharf, another important financial centre.On the other hand, Newham, located to the east of Hackney and Tower Hamlets, hosted the London 2012 Olympic Games which involved large-scale investment in that borough.
Figure 6 shows monthly percentage changes in HPI for the analysed boroughs and the corresponding NOT estimates, obtained using the contrast function for Scenario (S4).As recommended in Section 3.3, we set the number of intervals drawn in the procedure to M = 10000 and choose the threshold that minimises the SIC.For better comparability, NOT is applied with the same random seed for each data series.
In contrast to Fryzlewicz (2018a), whose TGUH method estimates at least 10 changepoints in each HPI series, we detect just a few change-points in the data, facilitating the interpretation of the results.Furthermore, for all three boroughs, NOT estimates two change-points (one around March 2008 and one around September 2009) that could possibly be linked to the 2008-2009 financial crisis and its impact on the housing market.Estimated standard deviations for that period are much larger than the estimates corresponding to the other segments of piecewise-constancy, suggesting that the market is more volatile during 2008-2009, and thus in this example Scenario (S4) may be more relevant than (S1) considered in Fryzlewicz (2018a).

Fig. 1 .
Fig. 1.Best 2 approximation of the true signal (dashed) via a triangular signal with a single change-point, the location of which is fixed at the left change-point (left panel), halfway between the true change-points (middle panel) and at the right change-point (right panel).Approximation errors (in terms of squared 2 distance) are given in the top-right corners of the corresponding panels.
), (b) heuristically speaking, the value of C b (s,e] (Y) is relatively small if there is no changepoint in (s, e], (c) the formulation of C b (s,e] (Y) mainly consists of taking inner products between the data and certain contrast vectors.

Fig. 3 .
Fig.3.Execution times (in seconds) for the implementation of Algorithm 2 available in R package not(Baranowski et al., 2016b), for various feature detection problems with the data Y t , t = 1, . . ., T being i.i.d.N (0, 1).In a single run, computations for the input of the algorithm are performed in parallel, using 8 cores of an Intel Xeon 3.6 GHz CPU with 16 GB of RAM.The computation times are averaged over 10 runs in each case.

Fig. 4 .
Fig. 4.Examples of data generated from simulation models outlined in Section A. Figure4a-4g: data series Y t (thin grey), true signal f t (dashed black), ft being the least squares (LS) estimate of f t with the change-points estimated by NOT (thick red).Figure4h: centered data |Y t − ft | (thick grey), true standard deviation σ t (dashed black) and the estimated standard deviation σt between the change-points detected by NOT (thick red).
Fig. 4.Examples of data generated from simulation models outlined in Section A. Figure4a-4g: data series Y t (thin grey), true signal f t (dashed black), ft being the least squares (LS) estimate of f t with the change-points estimated by NOT (thick red).Figure4h: centered data |Y t − ft | (thick grey), true standard deviation σ t (dashed black) and the estimated standard deviation σt between the change-points detected by NOT (thick red).

Fig. 5 .
Fig. 5.Change-point analysis for the GISSTEMP data set introduced in Section 6.1.Figure5a: the data series Y t (thin grey) and ft estimated using change-points returned by NOT (thick red).Figure5b: residuals εt = Y t − ft .
Fig. 5.Change-point analysis for the GISSTEMP data set introduced in Section 6.1.Figure5a: the data series Y t (thin grey) and ft estimated using change-points returned by NOT (thick red).Figure5b: residuals εt = Y t − ft .

Fig. 6 .
Fig. 6.Change-point analysis for the monthly percentage changes in the UK House Price Index from January 1995 to May 2016. Figure 6a, 6c and 6e: the monthly percentage changes Y t and the fitted piecewise-constant mean ft , between the change-points estimated with NOT. Figure 6b, 6d and 6f: |Y t − ft | and the fitted piecewise-constant standard deviation σt , between the changepoints estimated with NOT.