Volume 76, Issue 3
Original Article

Multiscale change point inference

Klaus Frick
Interstate University of Applied Sciences of Technology, Buchs, Switzerland

Axel Munk
University of Göttingen, Göttingen, Germany
Max Planck Institute for Biophysical Chemistry, Göttingen, Germany

Hannes Sieling
University of Göttingen, Göttingen, Germany

Address for correspondence: Hannes Sieling, Institute for Mathematical Stochastics, University of Göttingen, Goldschmidtstrasse 7, 37077 Göttingen, Germany. E-mail: hsielin@math.uni-goettingen.de
First published: 09 May 2014

Summary

We introduce a new estimator, the simultaneous multiscale change point estimator SMUCE, for the change point problem in exponential family regression. An unknown step function is estimated by minimizing the number of change points over the acceptance region of a multiscale test at a level α. The probability of overestimating the true number of change points K is controlled by the asymptotic null distribution of the multiscale test statistic. Further, we derive exponential bounds for the probability of underestimating K. By balancing these quantities, α is chosen such that the probability of correctly estimating K is maximized. All results are even non-asymptotic for the normal case. On the basis of these bounds, we construct (asymptotically) honest confidence sets for the unknown step function and its change points. At the same time, we obtain exponential bounds for estimating the change point locations which, for example, yield the minimax rate O(n−1) up to a log-term. Finally, the simultaneous multiscale change point estimator achieves the optimal detection rate of vanishing signals as n→∞, even for an unbounded number of change points. We illustrate how dynamic programming techniques can be employed for efficient computation of estimators and confidence regions. The performance of the multiscale approach proposed is illustrated by simulations and in two cutting edge applications from genetic engineering and photoemission spectroscopy.

1. Introduction

Assume that we observe independent random variables Y=(Y1,…,Yn) through the exponential family regression model
Yi ∼ Fϑ(xi),  independently for i=1,…,n,   (1)
where x1<…<xn denote equidistant design points in [0,1), {Fθ}θ ∈ Θ is a one-dimensional exponential family with densities fθ and ϑ:[0,1)→Θ is a right continuous step function with an unknown number K of change points. Figs 1(a) and 1(b) depict such a step function with K=8 change points and corresponding data Y for the Gaussian family {N(θ,σ2)}θ∈R with fixed variance σ2.
Fig. 1. (a) True regression function ϑ; (b) Gaussian observations Y with n=367 and variance σ2=1; (c) estimated change point locations with confidence intervals for various values of α (y-axis); (d) SMUCE ϑ̂ with confidence bands and confidence intervals for the change point locations at α=0.4
The change point problem consists in estimating
  (a) the number K of change points of ϑ and
  (b) the change point locations and the function values (intensities) of ϑ.
Additionally, we address the more involved issue of constructing
  (c) confidence bands for the function ϑ and simultaneous confidence intervals for its change point locations.

1.1. Multiscale statistics and estimation

The goals (a)–(c) will be achieved on the basis of a new estimation and inference method for the change point problem in exponential families: the simultaneous multiscale change point estimator SMUCE. Let S denote the space of all right continuous step functions on the unit interval [0,1) with values in Θ and an arbitrary but finite number of jumps. For ϑ ∈ S we denote by J(ϑ) the ordered vector of change points and by #J(ϑ) its length, i.e. the number of change points. In a first step, SMUCE needs to solve the (non-convex) optimization problem

inf{ #J(ϑ) : ϑ ∈ S, Tn(Y,ϑ) ⩽ q },   (2)
where q is a threshold to be specified later and Tn(Y,ϑ) is a certain multiscale statistic for a candidate function ϑ ∈ S. Optimization problems of the type (2) have recently been considered in Höhenrieder (2008) for Gaussian change point regression (see also Boysen et al. (2009) for a related approach) and for volatility estimation in Davies et al. (2012). Tn in problem (2) evaluates the maximum of the local likelihood ratio statistics over all discrete intervals [i/n,j/n] on which ϑ is constant with value θ=θi,j, i.e.
Tn(Y,ϑ) = max over all 1⩽i⩽j⩽n with ϑ≡θi,j on [i/n,j/n] of { √(2 T_i^j(Y,θi,j)) − √(2 log( e n/(j−i+1) )) },   (3)
where e= exp(1) and 'log' denotes the natural logarithm. The local likelihood ratio statistic T_i^j(Y,θ0) for testing H0: θ=θ0 against H1: θ≠θ0 on the interval [i/n,j/n] is defined as
T_i^j(Y,θ0) = supθ ∈ Θ Σk=i,…,j log{ fθ(Yk)/fθ0(Yk) }.   (4)
It measures how well the data can be described locally by a constant value θ0 on the interval [i/n,j/n]. We stress that the multiscale statistic Tn does not act on all intervals [i/n,j/n]⊆[0,1] but only on those on which the candidate function ϑ is constant; see also Davies et al. (2012), Höhenrieder (2008) and Olshen et al. (2004). Thus the system of intervals appearing in equation (3) makes up the specific multiscale nature of Tn. The log-expression in equation (3) can be seen as a scale calibrating term that puts different scales on an equal footing. As argued in Dümbgen and Spokoiny (2001) and Chan and Walther (2013) this improves the power of the multiscale test over the majority of scales. Roughly speaking, from a multiscale point of view, scale calibration becomes advantageous since there are many more small intervals than large intervals.
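For the Gaussian family with unit variance, √{2 T_i^j(Y,θ)} reduces to √(j−i+1)|Ȳi,j − θ|, and Tn(Y,ϑ) can be evaluated directly. The following minimal R sketch (a plain enumeration of all sub-intervals of the constant segments, not the efficient pruned implementation of Section 3; the encoding of the candidate by segment end points and values, and the function name, are illustrative assumptions only) computes the statistic of expression (3):

# Minimal sketch: multiscale statistic T_n(Y, vartheta) of expression (3) for the
# Gaussian family with sigma = 1. The candidate step function is encoded by the
# right end points 'cp' of its constant segments (cp[length(cp)] == n) and the
# segment values 'val'. Enumerating all sub-intervals is quadratic per segment.
multiscale_stat <- function(Y, cp, val) {
  n <- length(Y)
  stat <- -Inf
  left <- 1
  for (k in seq_along(cp)) {
    right <- cp[k]
    for (i in left:right) {
      s   <- cumsum(Y[i:right])                  # partial sums on [i, right]
      len <- seq_along(s)
      lr  <- sqrt(len) * abs(s / len - val[k])   # sqrt(2 T_i^j) for N(theta, 1)
      pen <- sqrt(2 * log(exp(1) * n / len))     # scale calibration of expression (3)
      stat <- max(stat, max(lr - pen))
    }
    left <- right + 1
  }
  stat
}
# e.g. for the constant candidate: multiscale_stat(Y, cp = length(Y), val = mean(Y))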
SMUCE integrates the multiscale test on the right-hand side in equation (3) into two simultaneous estimation steps: model selection (estimation of K) and estimation of ϑ given K. The minimal value of #J in problem (2) gives the estimated number of change points, which is denoted by K̂=K̂(q). To obtain the final estimator for ϑ, first consider the set of all solutions of problem (2), given by
C(q) = { ϑ ∈ S : #J(ϑ) = K̂(q) and Tn(Y,ϑ) ⩽ q },   (5)
which constitutes a confidence set for the true regression function ϑ, as we shall discuss later. Then, SMUCE ϑ̂(q) is defined to be the constrained maximum likelihood estimator within this confidence set C(q), i.e.
ϑ̂(q) ∈ argmax over ϑ ∈ C(q) of l(Y,ϑ),   (6)
where l(Y,ϑ) denotes the log-likelihood of ϑ.
Fig. 1(d) shows an example of SMUCE (the red line) for Gaussian observations. As stressed above, the multiscale constraint on the right-hand side of problem (2) renders SMUCE sensitive to the multiscale nature of the signal ϑ. The signal in Fig. 1 is a case in point: it exhibits large and small scales simultaneously and, remarkably, SMUCE ϑ̂ recovers both equally well.

1.2. Deviation bounds and confidence sets

The threshold q in problem (2) plays a crucial role because it governs the trade-off between data fit (the right-hand side in problem (2)) and parsimony (the left-hand side in problem (2)). It has an immediate statistical interpretation. From expression (2) it follows that
P( K̂ > K ) ⩽ P( Tn(Y,ϑ) > q ).   (7)
Hence, by choosing q=q1−α to be the (1−α)-quantile of the (asymptotic) null distribution of Tn(Y,ϑ), we can (asymptotically) control the probability of overestimating the number of change points by α. In fact, we show that the null distribution of Tn(Y,ϑ) can be bounded asymptotically by a distribution which does not depend on ϑ any more (Section 2.2). It is noteworthy that for Gaussian observations this bound is even non-asymptotic (Section 2.4). Fig. 1(c) shows for different choices of α (the y-axis) the corresponding estimates of the change point locations (black dots; the vertical ticks mark the true change point locations). The number of estimated change points is monotonically increasing in α, in accordance with inequality (7), which guarantees that with probability at least 1−α SMUCE has no more jumps than the true signal ϑ. We emphasize that SMUCE is remarkably stable with respect to the choice of α: the number of change points K=8 is estimated correctly for 0.2⩽α⩽0.9. Our simulations in Section 5 confirm this stability even in non-Gaussian scenarios.
As mentioned before, the threshold q1−α for SMUCE automatically controls the error of undersmoothing, i.e. the probability of overestimating the number of change points, via expression (7). In addition, we prove an exponential inequality that bounds the error of oversmoothing, i.e. the probability of underestimating the number of change points. Any such bound necessarily must depend on the magnitude of the signal ϑ on the smallest scale, as no method can recover arbitrarily fine details for a given sample size n; see Donoho (1988) for a more rigorous argument in the context of density estimation. Our bound (see theorem 2 in Section 2.3),
P( K̂ < K ) ⩽ [an explicit exponential bound in terms of n, q, λ, Δ, K and a constant C],   (8)
reflects this fact and indeed depends only on the smallest interval length λ, the smallest absolute jump size Δ and the number of change points K of the true regression function ϑ. Here, C>0 is a known universal constant depending only on the family of distributions (Section 2.3).

As a consequence of inequalities (7) and (8), the set C(q1−α) in expression (5) constitutes an asymptotic confidence set at level 1−α, and we shall explain in Section 3.2 how confidence bands for the graph of ϑ and confidence intervals for its change points can be obtained from this. See Fig. 1(d) for an illustration.

Of course, honest (i.e. uniform) confidence sets cannot be obtained on the entire set of step functions S, as Δ and λ can become arbitrarily small. Nevertheless, we can show that simultaneously both confidence bands for ϑ and intervals for the change points are asymptotically honest with respect to a sequence of nested models S(n) ⊆ S that satisfy
[condition (9), which bounds how fast the smallest scales and jump sizes in S(n) may vanish as n→∞],   (9)
i.e. the confidence level α is kept uniformly over S(n) as n→∞ (Section 2.6). Here λn and Δn denote the smallest interval length and smallest absolute jump size in S(n) respectively.

1.3. Choice of q

Balancing the probabilities for overestimation and underestimation in inequalities (7) and (8) gives an upper bound on P(K̂ ≠ K), the probability that the number of change points is misspecified. This bound depends on n, q, λ and Δ in an explicit way and opens the door for several strategies to select q, e.g. such that the resulting lower bound on P(K̂ = K) is maximized. One may additionally incorporate prior information on Δ and λ, and we suggest a simple way to do this in Section 4. A numerical sketch of such a balancing strategy is given below.
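For illustration only, the following R sketch chooses q on a grid by minimizing the sum of the two error bounds: the overestimation probability is approximated by a Monte Carlo sample of the null distribution of Tn (cf. Section 2.2), whereas the underestimation bound of theorem 2 is passed as a user-supplied function, since its exact form (and any prior information on Δ and λ entering it) is not reproduced here; all names are hypothetical.

# Sketch: balance over- and under-estimation bounds over a grid of thresholds q.
# 'null_sample': Monte Carlo sample of the null distribution of T_n (constant signal)
# 'under_bound': function of q returning an upper bound on P(hat K < K), e.g. the
#                exponential bound of theorem 2 for given n, lambda and Delta
choose_q <- function(null_sample, under_bound,
                     q_grid = quantile(null_sample, seq(0.5, 0.99, by = 0.01))) {
  over  <- sapply(q_grid, function(q) mean(null_sample > q))  # bound on P(hat K > K)
  under <- sapply(q_grid, under_bound)                        # bound on P(hat K < K)
  unname(q_grid[which.min(over + under)])
}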

A further consequence of inequalities (7) and (8) is that, under a suitable choice of q=qn, the probability of misspecification P(K̂ ≠ K) tends to 0 and hence K̂ converges to the true number of change points K (model selection consistency), such that the underestimation error in inequality (8) vanishes exponentially fast.

Finally, we obtain explicit bounds on the precision of estimating the change point locations, which again depend on q, n, λ and Δ. For any fixed q>0 the change point locations are recovered by all estimators in C(q), including SMUCE, at the optimal rate 1/n (up to a log-factor). Moreover, these bounds can be used to derive slower rates uniformly over nested models as in condition (9) (Section 2.6).

1.4. Detection power for vanishing signals

For the case of Gaussian observations we derive the detection power of the multiscale statistic Tn in equation 3, i.e. we determine the maximal rate at which a signal may vanish with increasing n but still can be detected with probability 1, asymptotically. For the task of detecting a single constant signal against a noisy background, we obtain the optimal rate and constant (see Chan and Walther (2013), Dümbgen and Spokoiny (2001), Dümbgen and Walther (2008) and Jeng et al. (2010)). We extend this result to the case of an arbitrary number of change points, retrieving the same optimal rate but different constants (Section 2.5.). Similar results have been derived recently in Jeng et al. (2010) for sparse signals, where the estimator takes into account the explicit knowledge of sparsity. We stress that SMUCE does not rely on any sparsity assumptions yet it adapts automatically to sparse signals owing to its multiscale nature.

1.5. Implementation, simulations and applications

The applicability of dynamic programming to the change point problem has recently been the subject of research (see for example Boysen et al. (2009), Fearnhead (2006), Friedrich et al. (2008) and Harchaoui and Lévy-Leduc (2010)). SMUCE ϑ̂ can also be computed by a dynamic program owing to the restriction of the local likelihoods to the constant parts of candidate functions. This has already been observed by Höhenrieder (2008) for the multiscale constraint that was considered there. We prove that expression (6) can be rewritten as a minimization problem for a penalized cost function with a particular data-driven penalty (see lemma 1 in Section 3).

Much in the spirit of the dynamic program that was suggested in Killick et al. (2012), our implementation exploits the structure of the constraint set in expression (6) to include pruning steps. These reduce the worst case computation time O(n2) considerably in practice and make the method applicable to large data sets. Simultaneously, the algorithm returns a confidence band for the graph of ϑ as well as confidence intervals for the locations of the change points (Section 3), the latter without any additional cost. An R package (stepR) including an implementation of SMUCE is available on line (http://www.stochastik.math.uni-goettingen.de/smuce).
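As a rough orientation for readers who want to try the method, a call could look like the sketch below; the argument names of smuceR are quoted from memory and may differ between versions of stepR, so the package documentation should be consulted, and the simulated signal is purely illustrative (it is not the signal of Fig. 1).

# Illustrative use of the stepR package (argument names are assumptions; see ?smuceR).
library(stepR)
set.seed(1)
theta <- rep(c(0, 2, 0, -1), times = c(100, 30, 120, 117))  # a step signal, n = 367
y <- theta + rnorm(length(theta))                           # Gaussian observations, sigma = 1
fit <- smuceR(y, alpha = 0.4, family = "gauss")             # SMUCE at level alpha = 0.4
plot(y, pch = 16, cex = 0.4)
lines(fit, col = 2)                                         # fitted step function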

Extensive simulations reveal that SMUCE is competitive with (and indeed often outperforms) state-of-the-art methods for the change point problem, all of which have been tailor-made for specific exponential families (Section 5). Our simulation study includes the circular binary segmentation (CBS) method (Olshen et al., 2004), unbalanced Haar wavelets (Fryzlewicz, 2007), the fused lasso (Tibshirani et al., 2005) and the modified Bayes information criterion (MBIC) (Zhang and Siegmund, 2007) for Gaussian regression, the multiscale estimator in Davies et al. (2012) for piecewise constant volatility and the extended taut string method for quantile regression in Dümbgen and Kovac (2009). In our simulations we consider several risk measures, including the mean-squared error MSE and the model selection error. Moreover, we study the feasibility of our approach for different real world data sets, including two benchmark examples from genetic engineering (Lai et al., 2005) and a new example from photoemission spectroscopy (Hüfner, 2003), which amounts to Poisson change point regression. Finally, in Section 6, we briefly discuss possible extensions to serially dependent data, among others.

1.6. Literature survey and connections to existing work

The problem of detecting changes in the characteristics of a sequence of observations has a long history in statistics and related fields, dating back to the 1950s (see for example Page (1955)). In recent years, it has experienced a renaissance in the context of regression analysis due to novel applications that mainly came along with the rapid development in genetic engineering (Braun et al., 2000; Jeng et al., 2010; Lebarbier and Picard, 2011; Olshen et al., 2004; Zhang and Siegmund, 2007) and financial econometrics (see Davies et al. (2012), Inclán and Tiao (1994), Lavielle and Teyssière (2007) and Spokoiny (2009)). Owing to the widespread occurrence of change point problems in different communities and areas of applications, such as statistics (Carlstein et al., 1994), electrical engineering and signal processing (Blythe et al., 2012), mobile phone communication (Zhang et al., 2009), machine learning (Harchaoui and Lévy‐Leduc, 2008), biophysics (Hotz et al., 2012), quantum optics (Schmidt et al., 2012), econometrics and quality control (Bai and Perron, 1998) and biology (Siegmund, 2013), an exhaustive list of existing methods is beyond reach. For a selective survey, we refer the reader also to Basseville and Nikiforov (1993), Brodsky and Darkhovsky (1993), Chen and Gupta (2000), Csörgő and Horváth (1997) and Wu (2005) and the extensive list in Khodadadi and Asgharian (2008).

Our approach as outlined above can be considered as a hybrid method of two well‐established approaches to the change point problem.

Likelihood ratio and related statistics, on the one hand, are frequently employed to test for a change in the parameter of the distribution family and to construct confidence regions for change point locations. Approaches of this type date back as far as Chernoff and Zacks (1964) and Kander and Zacks (1966) and have gained considerable attention since (Dümbgen, 1991; Hinkley, 1970; Hinkley and Hinkley, 1970; Hušková and Antoch, 2003; Siegmund, 1988; Worsley, 1983, 1986; see Arias-Castro et al. (2011), Bhattacharya (1987) and Siegmund and Yakir (2000) for generalizations to the multivariate case). The likelihood ratio test was also extensively studied for sequential change point analysis (Siegmund, 1986; Siegmund and Venkatraman, 1995; Yakir and Pollak, 1998). All these methods are primarily designed to detect a predefined maximal number (mostly 1) of change points. On the other hand, if the number of change points is unknown, an additional model selection step is required, which can be achieved by proper penalization of model complexity, e.g. measured by the number of change points itself or by surrogates for it. This is often approached by maximizing a penalized likelihood (PL) function of the form
ϑ ↦ l(Y,ϑ) − pen(ϑ)
over a suitable space of functions, e.g. the space of step functions S as in this paper or functions of bounded variation (Mammen and van de Geer, 1997), etc. Here l(Y,ϑ) is the log-likelihood function. The penalty term pen(ϑ) penalizes the complexity of ϑ and prevents overfitting. It increases with the dimension of the model and serves as a model selection criterion.

Linear l0‐penalization, i.e. pen(ϑ)=ω#J(ϑ), has already been considered in Yao (1988) and Yao and Au (1989) with a BIC‐type weight ω∼ log (n). More sophisticated methods based on weighted l0‐penalties have since been further developed in Boysen et al. (2009), Braun et al. (2000), Winkler and Liebscher (2002) and Wittich et al. (2008) and more recently in Demaret et al. (2013) for higher dimensions. Model‐selection‐based l0‐penalized functionals, which are non‐linear in #J(ϑ), have been investigated in Arlot et al. (2012), Birgé and Massart (2001), Lavielle (2005), Lavielle and Moulines (2000) and Lavielle and Teyssière (2007) for change point regression. Zhang and Siegmund (2007) introduced a penalty which depends on the number of change points and additionally on their locations.

Further prominent penalization approaches include the fused lasso procedure (see Friedman et al. (2007), Tibshirani et al. (2005) and Harchaoui and Lévy‐Leduc (2010)) that uses a linear combination of the total variation and the l1‐norm penalty as a convex surrogate for the number of change points which has been primarily designed for the situation when ϑ is sparse. Recently, aggregation methods (Rigollet and Tsybakov, 2012) have been advocated for the change point regression problem as well.

Most similar in spirit to our approach are estimators which minimize target functionals under a statistical multiscale constraint. For some early references see Donoho (1995), Nemirovski (1985) and more recently Candès and Tao (2007), Davies and Kovac (2001), Davies et al. (2009) and Frick et al. (2012). In our case this target functional equals the number of change points.

The multiscale calibration in expression (3) is based on the work of Chan and Walther (2013), Dümbgen and Spokoiny (2001) and Dümbgen and Walther (2008). Multiscale penalization methods have been suggested in Kolaczyk and Nowak (2004) and Zhang and Siegmund (2007), multiscale partitioning methods including binary segmentation in Fryzlewicz (2012), Olshen et al. (2004), Sen and Srivastava (1975) and Vostrikova (1981) and recursive partitioning in Kolaczyk and Nowak (2005).

Aside from the connection to the frequentist work cited above, we claim that our analysis also provides an interface for incorporating a priori information on the true signal into the estimator (see Section 4). We stress that for minimizing the bounds in inequalities (7) and (8) on the model selection error P(K̂ ≠ K) it is not necessary to place full priors on the space of step functions S. Instead it suffices simply to specify a prior on the smallest interval length λ and the smallest absolute jump size Δ. The parameter choice strategy that is discussed in Section 4 or the limiting distribution of Tn(Y,ϑ) in Section 2, for instance, can be refined within such a Bayesian framework. This, however, will not be discussed in detail in this paper and is postponed to future work. For recent work on a Bayesian approach to the change point problem we refer to Du and Kou (2012), Fearnhead (2006), Luong et al. (2012), Rigaill et al. (2012) and the references therein.

We finally stress that there is a conceptual analogue of SMUCE in the Dantzig selector as introduced in Candès and Tao (2007) for estimating sparse signals in Gaussian high dimensional linear regression models (see James and Radchenko (2009) for an extension to exponential families). There the l1-norm of the signal is minimized subject to the constraint that the residuals are pointwise within the noise level. SMUCE, in contrast, minimizes the l0-norm of the discrete derivative of the signal subject to the constraint that the residuals are tested to contain no signal on all scales. We shall briefly address this and other relationships to recent concepts in high dimensional statistics in the discussion in Section 6. In summary, the change point problem is an 'n=p' problem and hence substantially different from high dimensional regression where 'p≫n'. As we shall show, multiscale detection of sparse signals then becomes possible without any sparsity assumption entering the estimator. Another major statistical consequence of this paper is that post model selection inference is feasible over a large range of scales, uniformly over nested models in the sense of expression (9).

2. Theory

This section summarizes our main theoretical findings. In Section 2.3 we discuss consistency of the estimated number of change points. This result follows from an exponential bound for the probability of underestimating the number of change points on the one hand. On the other hand we show how to control the probability of overestimating the number of change points by means of the limiting distribution of Tn(Y,ϑ) as n→∞ (Section 2.2). We give improved results, including a non-asymptotic bound for the probability of overestimating the number of change points, for Gaussian observations (Sections 2.4 and 2.5). In Section 2.6 we finally show that the change point locations can be recovered as fast as the sampling rate up to a log-factor and discuss how asymptotically honest confidence sets for ϑ can be constructed over a suitable sequence of nested models.

2.1. Notation and model

We shall henceforth assume that {Fθ}θ ∈ Θ is a one-dimensional, standard exponential family with ν-densities
fθ(x) = exp{θx − ψ(θ)},  θ ∈ Θ.   (10)
Here Θ ⊆ R denotes the natural parameter space. We shall assume that {Fθ}θ ∈ Θ is regular and minimal, which means that Θ is an open interval and that the cumulant transform ψ is strictly convex on Θ. We shall frequently make use of the functions
m(θ) = E(X) = ψ'(θ)  and  v(θ) = Var(X) = ψ''(θ)   (11)
for X∼Fθ. Note that m is strictly increasing and v is positive on Θ.
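Two standard examples, stated here only for concreteness: the Gaussian family with unit variance fits this framework with ψ(θ)=θ2/2 (taking ν to be the standard normal distribution), so that m(θ)=θ and v(θ)≡1, whereas the Poisson family with natural parameter θ=log(intensity) has ψ(θ)=exp(θ) and m(θ)=v(θ)=exp(θ).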

2.1.1. Observation model and step functions

We assume that Y=(Y1,…,Yn) are independent observations given by model (1), where ϑ:[0,1)→Θ is a right continuous step function, i.e.
ϑ(t) = Σk=0,…,K θk 1[τk,τk+1)(t),  t ∈ [0,1),   (12)
where 0=τ0<τ1<…<τK<τK+1=1 are the change point locations and θk ∈ Θ the corresponding intensities, such that θk ≠ θk+1 for k=0,…,K−1. The collection of step functions on [0,1) with values in Θ and an arbitrary but finite number of change points will be denoted by S. For ϑ ∈ S as in equation (12) we denote by J(ϑ)=(τ1,…,τK) the increasingly ordered vector of change points and by #J(ϑ)=K its length. We shall denote the set of step functions with K change points and change point locations restricted to the sample grid by Sn(K).

For any estimator ϑ̂ of ϑ, the estimated number of change points will be denoted by K̂=#J(ϑ̂) and the change point locations by τ̂1<…<τ̂K̂, and we set τ̂0=0 and τ̂K̂+1=1. For simplicity, for each n we restrict ourselves to estimators which have change points only at sampling points, i.e. ϑ̂ ∈ S with τ̂k ∈ {x1,…,xn} for each k. To keep the presentation simple, throughout what follows we restrict ourselves to an equidistant sampling scheme as in model (1). However, we mention that extensions to more general designs are possible.

2.1.2. Multiscale statistic

Let 1⩽i⩽j⩽n. Then, the likelihood ratio statistic T_i^j(Y,θ0) in expression (4) can be rewritten as
T_i^j(Y,θ0) = (j−i+1) supθ ∈ Θ { (θ−θ0)Ȳi,j − ψ(θ) + ψ(θ0) }.
Introducing the notation ϕ(x)= sup θ ∈ Θ {θx−ψ(θ)} for the Legendre–Fenchel conjugate of ψ and J(x,θ)=ϕ(x)−{θx−ψ(θ)} we find that
T_i^j(Y,θ0) = (j−i+1) J(Ȳi,j, θ0),
where Ȳi,j = (j−i+1)−1 Σk=i,…,j Yk. The multiscale statistic Tn(Y,ϑ) in expression (3) was defined to be the (scale-calibrated) maximum over all T_i^j(Y,θi,j) such that ϑ≡θi,j on [i/n,j/n] for some θi,j ∈ Θ. As mentioned in Section 1 we shall sometimes restrict the minimal interval length (scale) by a sequence of lower bounds cn tending to 0. To ensure that the asymptotic null distribution is non-degenerate, we assume for non-Gaussian families (see also Schmidt-Hieber et al. (2013)) that
[condition (13): a lower bound on how slowly cn may tend to 0 relative to n].   (13)
Then, the modified version of equation (3) reads
Tn(Y,ϑ;cn) = max over all i⩽j with j−i+1⩾cn n and ϑ≡θi,j on [i/n,j/n] of { √(2 T_i^j(Y,θi,j)) − √(2 log( e n/(j−i+1) )) }.   (14)

2.2. Asymptotic null distribution

We give a representation of the limiting distribution of the multiscale statistic Tn in equation (14) in terms of
M = sup 0⩽s<t⩽1 { |B(t)−B(s)|/√(t−s) − √(2 log( e/(t−s) )) },   (15)
where (B(t))t⩾0 denotes standard Brownian motion. We stress that the statistic M is finite almost surely and has a continuous distribution supported on [0,∞) (see Dümbgen et al. (2006) and Dümbgen and Spokoiny (2001)).

Theorem 1. Assume that (cn)n∈N satisfies condition (13). Then, as n→∞,

Tn(Y,ϑ;cn) converges in distribution to  max 0⩽k⩽K  sup τk⩽s<t⩽τk+1 { |B(t)−B(s)|/√(t−s) − √(2 log( e/(t−s) )) }.   (16)
Further, let M0,…,MK be independent copies of M as in expression (15). Then, the right-hand side in expression (16) is stochastically bounded from above by M and from below by
a maximum of the copies M0,…,MK, each adjusted by a term that depends only on the length τk+1−τk of the corresponding constancy interval of ϑ (the explicit expression is not reproduced here).

A proof is given in the on‐line supplement.

It is important to note that the limit distribution in expression (16) (which is the same as the lower bound) depends on the unknown regression function ϑ only through the number of change points K and the change point locations τk, i.e. the function values of ϑ do not play a role. From the upper bound in theorem 1 we obtain
lim inf n→∞ P( Tn(Y,ϑ;cn) ⩽ qα ) ⩾ P( M ⩽ qα ) = α,   (17)
with qα being the α-quantile of M. In practice the distribution of M is obtained by simulations. In Section 4 we shall see that for the Gaussian case even a non-asymptotic version of theorem 1 can be obtained, which allows for a finite sample refinement of the null distribution of Tn. As the asymptotics are rather slow, this finite sample correction is helpful even for relatively large samples, say if n is of the order of a few thousands. This is highlighted in Fig. 2, where it becomes apparent that the empirical null distributions for finite samples, obtained from simulations, are in general not supported in [0,∞). To the best of our knowledge, it is an open and challenging problem to derive tight bounds for the tails of M (see Dümbgen et al. (2006), Dümbgen and Spokoiny (2001) and Dümbgen and Walther (2008)); this is not addressed in this paper. By such bounds the probability of overestimating the number of change points could be controlled explicitly, as we shall see in the upcoming section. Moreover, we point out that the inequality in expression (17) is not sharp if the true function has at least one change point. This is because we bound Tn in inequality (17) by qα, the quantile of M, which serves as the bound for the right-hand side in expression (16). For an illustration of this, Fig. 3 shows P–P-plots of the exact null distribution of signals with two, four and 10 equidistant change points against the null distribution of a signal without change points for sample size n=500. Of course, further information on the minimal number and location of change points can be used to improve the distributional bound by M in theorem 1. We shall not pursue this further.
Fig. 2. Simulations of (a) the cumulative distribution function and (b) the density of M as in expression (15) for n=50, n=500 and n=5000 equidistant discretization points

Fig. 3. Probability–probability plots of the empirical null distribution of a signal without change points (x-axis) against signals with (a) two, (b) five and (c) 10 equidistant change points (y-axis) for n=500
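Since the distribution of M is in practice obtained by simulations, a minimal R sketch of such a Monte Carlo approximation on a grid of n points (essentially the discretized statistic underlying Fig. 2; the plain double loop is for illustration only and becomes slow for large n) is given here:

# Monte Carlo approximation of M in expression (15), restricted to a grid of n points.
simulate_M <- function(n, reps = 1000) {
  replicate(reps, {
    B <- c(0, cumsum(rnorm(n, sd = sqrt(1 / n))))   # Brownian motion at t = 0, 1/n, ..., 1
    m <- -Inf
    for (i in 0:(n - 1)) {                          # s = i/n
      t    <- (1:(n - i)) / n                       # interval lengths t - s
      incr <- abs(B[(i + 2):(n + 1)] - B[i + 1])    # |B(t) - B(s)|
      m <- max(m, max(incr / sqrt(t) - sqrt(2 * log(exp(1) / t))))
    }
    m
  })
}
# e.g. an approximate (1 - alpha)-quantile: quantile(simulate_M(500), 1 - 0.1)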

2.3. Exponential inequality for the estimated number of change points

In this section we derive explicit bounds on the probability that K̂ as defined in problem (2) underestimates the true number of change points K. In combination with the results in Section 2.2, these bounds will imply model selection consistency, i.e. P(K̂=K)→1 for a suitable sequence of thresholds q=qn in problem (2).

We first note that, with the additional constraint in expression (14) on the minimal interval length, the estimated number of change points is given by
K̂ = K̂(q) = min{ #J(ϑ) : ϑ ∈ S, Tn(Y,ϑ;cn) ⩽ q }.   (18)
Now let Δ and λ be the smallest absolute jump size and the smallest interval length of the true regression function ϑ ∈ S respectively, and assume that ϑ(t) stays in a fixed compact subset of Θ for all t ∈ [0,1). We give the aforementioned exponential upper bound on the probability that the number of change points is underestimated. The result follows from the general exponential inequality in the on-line supplement, theorem 7.10.

Theorem 2 (underestimation bound). Let q>0 and K̂(q) be defined as in expression (18) with λ⩾2cn. Then, there is a constant C>0, depending only on the family {Fθ}θ ∈ Θ, such that

P( K̂(q) < K ) ⩽ [an explicit exponential bound in terms of n, q, λ, Δ, K and C].   (19)

From theorem 7.10 and lemma 7.11 in the on-line supplement it follows that C admits an explicit expression (20) in terms of the exponential family, which gives C=1/32 for the Gaussian family and a corresponding explicit value for the Poisson family, provided that the intensity is suitably bounded in the latter case.
On the one hand, if q=qn and qn/√n→0 as n→∞, it becomes clear from theorem 2 that K̂ ⩾ K with high probability. On the other hand, it follows from theorem 1 that Tn(Y,ϑ;cn) is bounded almost surely as n→∞ if cn is as in expression (13). This in turn implies that the probability of the event {K̂ ⩽ K} tends to 1, since
P( K̂ > K ) ⩽ P( Tn(Y,ϑ;cn) > qn ) → 0   (21)
whenever qn→∞, as n→∞. Thus, we summarize in the following theorem.

Theorem 3 (model selection consistency). Let the assumptions of theorems 1 and 2 hold and additionally assume that qn→∞ and qn/√n→0 as n→∞. Then,

lim n→∞ P( K̂(qn) = K ) = 1.

Giving a non‐asymptotic bound for the probability for overestimating the true number of change points (in the spirit of expression (21)) appears to be rather difficult in general. For the Gaussian case though this is possible, as we shall show in the next section.

2.4. Gaussian observations

We now derive sharper results for the case when {Fθ}θ ∈ Θ is the Gaussian family of distributions with constant variance. In this case model (1) reads
Yi = ϑ(xi) + σɛi,  i=1,…,n,   (22)
where ɛ1,…,ɛn are independent N(0,1) random variables, σ>0 and (ϑ(x1),…,ϑ(xn)) denotes the expectation of Y. To ease the notation we assume in what follows that σ=1. For the general case replace Δ by Δ/σ.

In the Gaussian case it is possible to remove the lower bound for the smallest scales cn as in expression (13) because the strong approximation by Gaussian observations in the proof of theorem 1 becomes superfluous. We obtain the following non‐asymptotic result on the null distribution.

Theorem 4 (null distribution of Tn). For any n ∈ N and ϑ ∈ S,

P( Tn(Y,ϑ) > q ) ⩽ P( M(n) > q )  for all q ∈ R,
where M(n) is defined as M in expression (15) but with the supremum taken only over the system of discrete intervals [i/n,j/n].

In contrast with theorem 1, this result is non‐asymptotic and the inequality holds for any sample size. For this reason, we obtain the following improved upper bound for the probability of overestimating the number of change points.

Corollary 1 (overestimation bound). Let q ∈ R and K̂(q) be defined as in expression (18). Then, for any n ∈ N,

P( K̂(q) > K ) ⩽ P( M(n) > q ).

This corresponds to the ‘worst‐case scenario’ for overestimation when the true signal ϑ has no jump.

For the probability of underestimating the number of change points, we can improve theorem 2 for Gaussian observations (see theorem 7.12 in the on-line supplement) to
[a sharpened exponential bound on P(K̂(q) < K) in terms of n, q, λ and Δ].   (23)

2.5. Multiscale detection of vanishing signals for Gaussian observations

We shall now discuss the ability of SMUCE to detect vanishing changes in a signal. We begin with the problem of detecting a signal on a single interval against an unknown background.

Theorem 5.Let ϑn(t)=θ0+δnIn(t) for some θ0,θ0+δn ∈ Θ, and for some sequence of intervals In⊂[0,1] and Y be given by expression (22). Further let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0103 be bounded away from zero and assume,

  1. for signals on a large scale (i.e. urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0104), that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0105 and,
  2. for signals on a small scale (i.e. urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0106), that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0107 with ɛn, subject to urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0108 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0109.

Then,

urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0110(24)

A proof is given in the on‐line supplement.

Theorem 5 gives sufficient conditions on the signals ϑn (through the interval length |In| and the jump height δn) as well as on the thresholds qn such that the multiscale statistic Tn detects the signals with probability 1, asymptotically; this is the content of expression (24). We stress that the above result is optimal in the following sense: no test can detect signals whose rescaled jump size √(n|In|)δn falls below the corresponding √2-threshold with asymptotic power 1 (see Chan and Walther (2013), Dümbgen and Spokoiny (2001) and Jeng et al. (2010)). For the special case when qn≡qα is a fixed α-quantile of the null distribution of Tn(Y,ϑn) (or of the limiting distribution M in expression (15)), the result boils down to the findings in Chan and Walther (2013) and Dümbgen and Spokoiny (2001). In particular, aside from the optimal asymptotic power (24), the error of the first kind is bounded by α. The result in theorem 5 goes beyond that and allows us to shrink the error of the first kind to zero asymptotically, by choosing qn→∞.

We finally generalize the results in theorem 5 to the case when the signal ϑn has more than one change point. To be more precise, we formulate conditions on the smallest interval and the smallest jump in ϑn such that no change point is missed asymptotically.

Theorem 6. Let (ϑn)n∈N be a sequence in S with Kn change points and denote by Δn and λn the smallest absolute jump size and the smallest interval length of ϑn respectively. Further, assume that qn is bounded away from zero and,

  (a) for signals on large scales (i.e. lim inf λn>0), that √(λnn)Δn/qn→∞,
  (b) for signals on small scales (i.e. λn→0) with Kn bounded, that √(λnn)Δn⩾(4+ɛn)√{log(1/λn)} with ɛn√{log(1/λn)}→∞ and a corresponding growth restriction on qn, and
  (c) the same as in assumption (b), with Kn unbounded and the constant 12 instead of 4.

Then,

lim n→∞ P( K̂(qn) ⩾ Kn ) = 1.

A proof is given in the on‐line supplement.

Theorem 6 amounts to saying that the statistic Tn can detect multiple change points simultaneously at the same optimal rate (in terms of the smallest interval and jump) as a single change point. The only difference lies in the constants that bound the size of the signals that can be detected. These increase with the complexity of the problem: √2 for a single change against an unknown background, 4 for a bounded (but unknown) and 12 for an unbounded number of change points. In Jeng et al. (2010) it was shown that for step functions that exhibit certain sparsity patterns the optimal constant √2 can be achieved. It is important to note that we do not make any sparsity assumption on the true signal. Finally we mention an analogue to theorem 4.1 of Dümbgen and Walther (2008) in the context of detecting local increases and decreases of a density. As in theorem 6, only the constants and not the rates of detection change with the complexity of the alternatives.

2.6. Estimation of change point locations and simultaneous confidence sets

In this section we shall provide several results on confidence sets that are associated with SMUCE. We shall see that these are linked in a natural way to the estimation of change point locations. We generalize the set C(q) in expression (5) by replacing Tn(Y,ϑ) in expression (3) with Tn(Y,ϑ;cn) as in expression (14) and consider the set of solutions of the optimization problem
C(q) = { ϑ ∈ S : #J(ϑ) = K̂(q) and Tn(Y,ϑ;cn) ⩽ q }.   (25)
Any candidate in C(q) recovers the change point locations of the true regression function ϑ with the same rate of convergence. This rate is determined by the smallest scale cn of the interval lengths considered in the multiscale statistic Tn in expression (14) and hence equals the sampling rate up to a log-factor.

Theorem 7.Let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0123 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0124 be the set of solutions of problem (25) and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0125 a sequence in (0,1]. Further let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0126 as in expression (20). Then, for all urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0127,

math image

A proof is given in the on‐line supplement.

For a fixed signal urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0129, a sufficient condition for the right‐hand side in theorem 7 to vanish as n→∞ is
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0130

Here the constant C matters; for example in the Gaussian case C=1/32 (see Section 2.3.). This improves several results that have been obtained for other methods; for example in Harchaoui and Lévy‐Leduc (2010) for a total variation penalized estimator a  log 2(n)/n rate has been shown.

In what follows we shall apply theorem 7 to determine subclasses of S in which the change point locations are reconstructed uniformly with rate cn. These subclasses are delimited by conditions on the smallest absolute jump height Δn and on the number of change points Kn (or the smallest interval length λn, by using the relationship Kn⩽1/λn) of their members. For instance, the rate function cn=n−β with some β ∈ [0,1) implies a corresponding explicit condition on Δn and λn.
The choice β=0 gives the largest subclass but no convergence rate is guaranteed since cn=1 for all n. A value of β close to 1 implies a much smaller subclass of functions which then can be reconstructed uniformly with convergence rate arbitrarily close to the sampling rate 1/n. We finally point out that the result in theorem 7 does not presume that the number of change points is estimated correctly. If cn additionally satisfies condition (13) and if in theorem 7 q=qn→∞ slower than − log (cn), we find from theorem 3 that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0133 and it follows from theorem 7 that for n sufficiently large
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0134
The solution set of the optimization problem (25) constitutes a confidence set for the true regression function ϑ. Indeed, we find that
P( ϑ ∈ C(q) ) ⩾ P( Tn(Y,ϑ;cn) ⩽ q ) − P( K̂(q) < K ).   (26)
In particular, it follows from theorem 3 that, if q1−α is the (1−α)-quantile of M, the set C(q1−α) is an asymptotic confidence set at level 1−α.

Corollary 2. Let α ∈ (0,1) and q1−α be the (1−α)-quantile of the statistic M in expression (15). Then,

P( ϑ ∈ C(q1−α) ) ⩾ 1 − α − P( K̂(q1−α) < K ) + o(1),   (27)
with the underestimation probability bounded as in theorem 2 under the assumptions of theorem 3. Consequently we find that
lim inf n→∞ P( ϑ ∈ C(q1−α) ) ⩾ 1 − α
for any ϑ ∈ S.

We mention that for the Gaussian family (see Section 2.4.) inequality (27) even holds for any n, i.e. the o(1) term on the right‐hand side can be omitted. Thus the right‐hand side of inequality (27) gives an explicit and non‐asymptotic lower bound for the true confidence level of C(qα).

In what follows we use this result to determine classes of step functions on which confidence statements hold uniformly. As a subset of S, the confidence set C(q) is difficult to visualize in practice. Therefore, in Section 3.2 we compute a confidence band B(q)⊂[0,1]×Θ that contains the graphs of all functions in C(q), as well as disjoint confidence intervals for the change point locations, denoted by Ik(q) for k=1,…,K̂(q). For simplicity, we denote the collection (I1(q),…,IK̂(q), B(q)) by I(q) and agree on the notation
ϑ ≺ I(q)  :⟺  #J(ϑ) = K̂(q), τk ∈ Ik(q) for k=1,…,K̂(q) and graph(ϑ) ⊂ B(q).   (28)
Put differently, ϑ ≺ I(q) means that simultaneously the number of change points is estimated correctly, the change points lie within the confidence intervals and the graph is contained in the confidence band. As we shall show in Section 3.2, the confidence set C(q) and I(q) are linked by the relationship
ϑ ∈ C(q)  ⟹  ϑ ≺ I(q).   (29)
Following the terminology in Li (1989), I(q) is called asymptotically honest for a class H ⊆ S at level 1−α if
lim inf n→∞  inf ϑ ∈ H  P( ϑ ≺ I(q) ) ⩾ 1 − α.
Such a condition obviously cannot be fulfilled over the entire class S, since signals cannot be detected if they vanish too fast as n→∞. For Gaussian observations this was made precise in Section 2.5.
To overcome this difficulty, we shall relax the notion of asymptotic honesty. Let S(n) ⊆ S, n ∈ N, be a sequence of subclasses of S. Then I(q) is sequentially honest with respect to S(n) at level 1−α if
lim inf n→∞  inf ϑ ∈ S(n)  P( ϑ ≺ I(q) ) ⩾ 1 − α.

By combining expressions (26), (27) and (29) we obtain the following result about the asymptotic honesty of I(q1−α).

Corollary 3.Let α ∈ (0,1) and q1−α be the (1−α)‐quantile of the statistic M in expression (15) and assume that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0157 is a sequence of positive numbers. Define urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0158. Then I(q1−α) is sequentially honest with respect tourn:x-wiley:13697412:media:rssb12047:rssb12047-math-0159 at level 1−α, i.e.

urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0160

By estimating 1/λn we find that the confidence level α is kept uniformly over nested models urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0161, as long as urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0162. Here λn and Δn are the smallest interval length and smallest absolute jump size in urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0163 respectively.

3. Implementation

We now explain how SMUCE, i.e. the estimator ϑ̂ with maximal likelihood in the confidence set C(q), can be computed efficiently within the dynamic programming framework. In general the algorithm proposed is of complexity O(n2). We shall show, however, that in many situations the computation can be performed much faster.

Our algorithm uses dynamic programming ideas from Friedrich et al. (2008) in the context of complexity penalized M-estimation. See also Davies et al. (2012) and Höhenrieder (2008) for a special case in our context. Moreover, we include pruning steps as in Killick et al. (2012), who also provided a survey on dynamic programming in change point regression from a general point of view. We shall show that it is always possible to rewrite ϑ̂ as the solution of a minimization of a complexity penalized cost function with data-dependent penalty. For this, we shall denote the log-likelihood of ϑ ∈ S as
l(Y,ϑ) = Σi=1,…,n log fϑ(xi)(Yi).
Without restriction, we shall assume that l(Y,ϑ) > −∞ for all ϑ ∈ S.
Following Friedrich et al. (2008), we call a collection P of discrete intervals a partition if its union equals the set {1,…,n}. We denote by β(n) the collection of all partitions of {1,…,n}. For P ∈ β(n) let #P denote the number of discrete intervals in P. Hence, any discrete step function ϑ can be identified with a pair (P, θ), where
P ∈ β(n)  is a partition of {1,…,n} into intervals of consecutive indices,
θ = (θI)I∈P  with θI ∈ Θ for every I ∈ P,
and ϑ(t)=θI⇔⌈nt⌉ ∈ I. Next, we note that for a given θI ∈ Θ the negative log-likelihood on a discrete interval I is given by −Σi∈I log fθI(Yi). With this we define the costs of θI on I as
dI(θI) = −Σi∈I log fθI(Yi)  if the multiscale constraint is satisfied with value θI on all discrete sub-intervals of I, and dI(θI) = ∞ otherwise.   (30)
The minimal costs on the interval I are then defined by dI = inf θI ∈ Θ dI(θI), with θ̂I denoting a minimizing value whenever the infimum is attained. We stress that dI=∞ if and only if no θI ∈ Θ exists such that the multiscale constraint is satisfied on I. Finally, for an estimator ϑ̂ ≃ (P, θ) the overall costs are given by
d(ϑ̂) = ΣI∈P dI(θI).
In Friedrich et al. (2008) a dynamic program was designed for computing minimizers of
d(ϑ) + γ #J(ϑ),   (31)
where the minimum is taken over all step functions with change points on the sampling grid.

It is shown that the computation time amounts to O(n2), given that the minimal costs dI can be computed in O(1). We now show that each minimizer of expression (31) maximizes the likelihood over the set C(q), if γ>0 is chosen sufficiently large. This γ can be computed explicitly for any given data (Y1,…,Yn) according to the next result.

Lemma 1. Let γ be larger than an explicit threshold that can be computed from the data (Y1,…,Yn). Then, any solution of expression (31) is also a solution of expression (6).

For completeness, we briefly outline the dynamic programming approach to the minimization of expression (31) as established in Friedrich et al. (2008). Define, for r⩽n, the Bellman function by B(0)=−γ and
B(r) = the minimal value of the penalized costs in expression (31) for the data (Y1,…,Yr),
and let ϑ̂(r) denote a corresponding minimizer. Clearly, B(n) is the minimal value of expression (31) and ϑ̂(n) is a minimizer of expression (31). A key ingredient is the following recursion formula (see Friedrich et al. (2008), lemma 1):
B(p) = min 0⩽r<p { B(r) + d[r+1,p] + γ }.
Let p⩽n and assume that B(r) and ϑ̂(r) are given for all r<p⩽n. Then, compute the best previous change point position, i.e.
r*(p) ∈ argmin 0⩽r<p { B(r) + d[r+1,p] + γ },   (32)
and set B(p)=B(r*(p)) + d[r*(p)+1,p] + γ and extend ϑ̂(r*(p)) by the corresponding constant piece on [r*(p)+1,p] to obtain ϑ̂(p). With this we can iteratively compute the Bellman function B(p) and the corresponding minimizers ϑ̂(p) for p=1,…,n and eventually obtain ϑ̂(n), i.e. a minimizer of expression (31). According to lemma 1, this ϑ̂(n) solves problem (6) if γ is chosen sufficiently large.
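To make the recursion concrete, here is a self-contained and deliberately simplified R sketch for the Gaussian family with σ=1. Instead of the penalized formulation (31) it minimizes the number of segments directly (which is what a sufficiently large γ enforces by lemma 1), it checks the multiscale constraint by brute force over all sub-intervals of a candidate segment (roughly cubic time, no pruning), and it returns one feasible fit with segment means clamped to the feasible intervals rather than the exact constrained maximum likelihood solution (6); the names and the calibration follow the reconstruction of expression (3) above and are illustrative assumptions only.

# Simplified SMUCE-type dynamic program, Gaussian family with sigma = 1.
smuce_gauss_dp <- function(Y, q) {
  n  <- length(Y)
  cs <- c(0, cumsum(Y))
  bnd <- function(i, j) {                       # feasible theta for the single interval [i, j]
    len  <- j - i + 1
    half <- (q + sqrt(2 * log(exp(1) * n / len))) / sqrt(len)
    (cs[j + 1] - cs[i]) / len + c(-half, half)
  }
  B    <- c(0, rep(Inf, n))                     # B[p + 1]: minimal number of segments for Y[1:p]
  prev <- integer(n + 1)
  val  <- numeric(n + 1)
  for (p in 1:n) {
    L <- -Inf; U <- Inf                         # running intersection over all [i, j] within [r + 1, p]
    for (r in (p - 1):0) {
      for (j in (r + 1):p) {                    # add the sub-intervals starting at r + 1
        b <- bnd(r + 1, j); L <- max(L, b[1]); U <- min(U, b[2])
      }
      if (L > U) break                          # no constant value is feasible on [r + 1, p]
      if (B[r + 1] + 1 < B[p + 1]) {
        B[p + 1]    <- B[r + 1] + 1
        prev[p + 1] <- r
        val[p + 1]  <- min(max((cs[p + 1] - cs[r + 1]) / (p - r), L), U)  # clamped segment mean
      }
    }
  }
  cp <- integer(0); p <- n                      # backtrack the segment boundaries
  while (p > 0) { cp <- c(prev[p + 1], cp); p <- prev[p + 1] }
  right <- c(cp[-1], n)
  list(K = B[n + 1] - 1, rightEnds = right, values = val[right + 1])
}

For data generated as in the simulation example of Section 1.5, a call such as smuce_gauss_dp(y, q = 1) returns the estimated number of change points together with one feasible step fit; the actual implementation in stepR additionally computes the penalized solution of expression (31), the confidence band and the change point intervals.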

We note that, for a practical implementation of the dynamic program proposed, the efficient computation of the minimal costs dI is essential. We postpone this to the upcoming subsection and discuss the complexity of the algorithm first. Following Friedrich et al. (2008), the dynamic programming algorithm is of order O(n2), given that the minimal costs dI are computed in O(1) steps. Note that this does not hold true for the costs in expression (30). However, as we shall show in the next subsection, the set of all optimal costs d[r,p], 1⩽r⩽p⩽n, can be computed in O(n2) steps and hence the complete algorithm is of order O(n2) again.

In our implementation the specific structure of the costs (see expression (30)) has been employed by including several pruning steps in the dynamic program, similarly to Killick et al. (2012). Since the details are rather technical, we give only a brief explanation why the computation time of the algorithm as described below can be reduced: the speed‐ups are based on the idea to consider only such r in equation 32 that may lead to a minimal value, i.e. those r that are strictly larger than urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0210. The number of intervals, on which SMUCE is constant, is of order urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0211, instead of n2 if all intervals were considered. The number of intervals [r,p] which are needed in expression (32) is essentially of the same order. This indicates that SMUCE is much faster for signals with many detected change points than for signals with few detected change points, which has been confirmed by simulations. The pruned algorithm is implemented for the statistical software R in the package stepR (available from http://www.stochastik.math.uni-goettingen.de/smuce; the SMUCE procedure for several exponential families is available via the function smuceR).

3.1. Computation of minimal costs

Let r⩽i⩽j⩽p. Since {Fθ}θ ∈ Θ was assumed to be a regular, one-dimensional exponential family, the natural parameter space Θ is a non-empty, open interval (θ1,θ2) with −∞⩽θ1<θ2⩽∞. Moreover, the mapping θ ↦ T_i^j(Y,θ) is strictly convex on Θ and has its unique global minimum at m−1(Ȳi,j) if and only if Ȳi,j lies in the image m(Θ). In this case it follows from Nielsen (1973), theorem 6.2, that for all q>0 the set of values θ which satisfy the multiscale constraint on [i/n,j/n] is a non-empty interval
[bi,j, b̄i,j]
with bi,j < b̄i,j. In other words, bi,j and b̄i,j are the two finite solutions of the equation
T_i^j(Y,θ) = qi,j,   (33)
where qi,j denotes the local threshold implied by the constraint Tn(Y,ϑ)⩽q on the interval [i/n,j/n]. If Ȳi,j does not lie in m(Θ), then Nielsen (1973), theorem 6.2, implies that θ ↦ T_i^j(Y,θ) is monotone on Θ. Let us assume without restriction that it is increasing, which corresponds to Θ=(−∞,θ2). In this case, the infimum of θ ↦ T_i^j(Y,θ) is not attained and equation (33) has only one finite solution b̄i,j. The lower bound bi,j=−∞ then is trivial.
After computing urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0228 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0229 for all r⩽i⩽j⩽p, define urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0230 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0231. Hence, if urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0232 we obtain
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0233
Moreover, urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0234 if and only if urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0235.

To summarize, the computation of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0236 (and hence the computation of the minimal costs urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0237) reduces to finding the non‐trivial solutions of equation (33) for all r⩽i⩽j⩽p. This can either be done explicitly (as for the Gaussian family, for example) or approximately by Newton's method, say.
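As an illustration of the numerical route, the following sketch finds the two finite solutions of an equation of the form (33) for a strictly convex local objective f around its minimizer; f, theta_hat and q are generic placeholders, and bracketing with Brent's method is used instead of Newton's method purely for robustness.

```python
from scipy.optimize import brentq

def level_set_bounds(f, theta_hat, q, step=1.0, max_expand=60):
    """Sketch: find the two finite solutions of f(theta) = q around the minimizer theta_hat
    of a strictly convex function f with f(theta_hat) < q (cf. equation (33))."""
    g = lambda t: f(t) - q            # negative at theta_hat, increasing towards both sides
    d = step
    for _ in range(max_expand):       # expand to the left until the sign changes
        if g(theta_hat - d) > 0:
            break
        d *= 2.0
    lower = brentq(g, theta_hat - d, theta_hat)
    d = step
    for _ in range(max_expand):       # expand to the right until the sign changes
        if g(theta_hat + d) > 0:
            break
        d *= 2.0
    upper = brentq(g, theta_hat, theta_hat + d)
    return lower, upper

# Example: for f(theta) = (theta - 1)^2 and q = 2 the two solutions are 1 - sqrt(2) and 1 + sqrt(2)
print(level_set_bounds(lambda t: (t - 1.0) ** 2, 1.0, 2.0))
```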

Finally, we obtain that, given that the urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0238 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0239 are computed in O(1), the bounds urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0240 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0241 are computed in O(n2). This follows from the observation that for 1⩽r⩽p⩽n
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0242
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0243
which allows for iterative computation.
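One possible realization of such an iterative computation is sketched below: the aggregated bounds over [r,p] are updated from those over [r+1,p] and a running extremum over the intervals starting at r, which yields the O(n2) complexity claimed above. The matrices lower and upper of per-interval bounds are assumed given; the names are illustrative and not those of the stepR implementation.

```python
import numpy as np

def aggregate_bounds(lower, upper):
    """Sketch of an O(n^2) aggregation: for every 0 <= r <= p < n combine the per-interval
    bounds over all subintervals [i, j] of [r, p] into one lower and one upper bound.
    lower[i, j] / upper[i, j] are assumed precomputed (use -inf / +inf where no constraint
    is imposed); agg_lo[r, p] is the largest lower bound, agg_up[r, p] the smallest upper bound."""
    n = lower.shape[0]
    agg_lo = np.full((n, n), -np.inf)
    agg_up = np.full((n, n), np.inf)
    row_lo = np.full(n, -np.inf)      # running max of lower[r, r..p] for each start point r
    row_up = np.full(n, np.inf)       # running min of upper[r, r..p]
    for p in range(n):
        for r in range(p, -1, -1):
            row_lo[r] = max(row_lo[r], lower[r, p])
            row_up[r] = min(row_up[r], upper[r, p])
            if r == p:
                agg_lo[r, p], agg_up[r, p] = row_lo[r], row_up[r]
            else:
                agg_lo[r, p] = max(row_lo[r], agg_lo[r + 1, p])
                agg_up[r, p] = min(row_up[r], agg_up[r + 1, p])
    return agg_lo, agg_up
```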

3.2. Computation of confidence sets

The dynamic programming algorithm gives, in addition to the computation of SMUCE, an approximation to the solution set C(q) of problem (25) as discussed in Section 2.6. The algorithm outputs disjoint intervals urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0244 as well as a confidence band B(q)⊂[0,1]×Θ such that, for each estimator urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0245,
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0246
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0247
To make this clear let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0248 and define
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0249(34)
Then, for any estimator urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0250 that satisfies urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0251, it holds that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0252 with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0253 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0254.

Now we construct a confidence band B(q) that contains the graphs of all functions in C(q). For this, let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0255 be as above and note that for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0256 there is exactly one change point in the interval urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0257 and no change point in urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0258. First, assume that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0259. Then we obtain a lower and an upper bound for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0260 by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0261 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0262 respectively. Now let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0263. Then, the kth change point is either to the left or to the right of t and hence any feasible estimator is constant either on urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0264 or on urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0265. Thus, we obtain a lower bound by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0266 and an upper bound by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0267.

4. On the choice of the threshold parameter

The choice of the parameter q in problem (2) is crucial because it balances data fit and parsimony of the estimator. First we discuss a general recipe that takes into account prior information on the true signal ϑ. On the basis of this, a specific choice, which we found particularly suitable for our purposes, is given in the second part. Further generalizations are discussed briefly. As shown in corollary 2 for the general case, q asymptotically determines the level of significance for the confidence sets urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0268. For the Gaussian case we have shown in Section 4 that this result is even non‐asymptotic, i.e. from corollary 1 it follows that
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0269(35)
where α(q) is defined as α(q)=P(M>q). This allows us to control the probability of overestimating the number of change points. If this is considered as a measure of smoothness, inequality (35) can be interpreted as a minimal smoothness guarantee, similar in spirit to results on other multiscale regularization methods (see Donoho (1995) and Frick et al. (2012)). As argued in Section 2.6, in general it is not possible to bound the minimal number of change points without further assumptions on the true function ϑ (see also Donoho (1988) in the context of mode estimation for densities). However, we can draw a sharp bound for the probability of underestimating the number of change points from expression (23) in terms of the minimal interval length λ and the minimal feature size η2=Δ2, which gives
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0270
where we have exploited the fact that K⩽1/λ. By combining inequality (35) with the bound above we find that
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0271(36)
To optimize the bound on the probability of estimating the correct number of change points, we must balance the errors of overestimation and underestimation, i.e. we aim to maximize the right‐hand side over q. Given λ and η2=Δ2, we suggest choosing q as
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0272(37)
Explicit knowledge of the influence of λ and η in this choice paves the way to various strategies for incorporating prior information to determine q. We might, for example, use a full prior distribution on (λ,η) and minimize the posterior model selection error
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0273
In what follows we suggest a rather simple way to proceed, which we found empirically to perform quite well. We stress that there is certainly room for further improvement. Motivated by the results of Section 2.4, we suggest defining λ and η=√()Δ in dependence on n implicitly by the assumptions
  1. η*=12√{− log (λ*)} and
  2. λ*=g(Δ,n),
for some function g with values in (0,1]. According to theorem 6, the first assumption reflects the worst‐case scenario among all signals that can be recovered with probability 1 asymptotically. The second assumption corresponds to a prior belief about the true function ϑ. In the following simulations we always choose g(Δ,n)=Δ, which puts the decay of λ and Δ on an equal footing. We then come back to the approach described above and define
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0274(38)
where λ* and η* are defined by assumptions (a) and (b). Consequently, the maximizing element urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0275 picks that q which maximizes the probability bound in expression (36) of correctly estimating the number of change points. Note that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0276 does not depend on the true signal ϑ but only on the number of observations n.

Even though the motivation for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0277 is built on the assumption of Gaussian observations, simulations indicate that it also performs well for other distributions. That is why we choose urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0278 throughout all simulations, unless stated otherwise. Here, α(q) is estimated by Monte Carlo simulations with sample size n=3000. These simulations are rather expensive but need to be performed only once. For a given n, a solution of this maximization problem may then be approximated numerically by computing the right‐hand side for a range of values of q.
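Numerically, this is a one-dimensional grid search: evaluate the bound for a grid of thresholds and keep the maximizer. A minimal sketch follows, assuming a Monte Carlo estimate alpha_of_q of α(q) and a callable under_bound_of_q for the underestimation bound with the prior values of λ and Δ plugged in; both names are placeholders, not functions of the stepR package.

```python
import numpy as np

def choose_threshold(q_grid, alpha_of_q, under_bound_of_q):
    """Sketch of the rule behind expressions (36)-(38): pick the q maximizing the lower
    bound 1 - alpha(q) - underestimation bound(q) on the probability of estimating the
    correct number of change points."""
    q_grid = np.asarray(q_grid, dtype=float)
    bound = np.array([1.0 - alpha_of_q(q) - under_bound_of_q(q) for q in q_grid])
    i = int(np.argmax(bound))
    return q_grid[i], bound[i]
```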

We stress again that the general concept described above can be employed further to incorporate prior knowledge about the signal, as will be shown in Section 5.6.

5. Simulations

As mentioned in Section 1., the literature on the change point problem is vast and we now aim to compare our approach with the plethora of established methods for exponential families. All SMUCE instances computed in this section are based on the optimization problem (2), i.e. we do not restrict the interval lengths as required in Section 2 for technical reasons.

5.1. Gaussian mean regression

Recall model (22) in Section 4 with constant variance σ2 and piecewise constant mean μ, i.e. we set θ=μ/σ2 and ψ(θ)=μ2/(2σ2) in expression (10). Throughout what follows we assume that the variance σ2 is known; otherwise one may estimate it by standard methods (see for example Davies and Kovac (2001) or Dette et al. (1998)). Then, the multiscale statistic (14) evaluated at urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0279 reads
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0280
After selecting the model urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0281 according to expression (18), SMUCE becomes
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0282
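For readers who wish to reproduce this quantity, the following sketch evaluates a multiscale statistic of this form for a candidate step function under the Gaussian model; the scale calibration √{2 log(en/(j−i+1))} is our reading of expression (3), and the O(n2) loop over all intervals is kept deliberately simple.

```python
import numpy as np

def gaussian_multiscale_stat(y, theta, sigma):
    """Sketch of the multiscale statistic for Gaussian mean regression: the maximum over all
    intervals on which the candidate step function theta is constant of
    |partial sum of residuals| / (sigma * sqrt(length)) - sqrt(2 * log(e * n / length))."""
    y, th = np.asarray(y, dtype=float), np.asarray(theta, dtype=float)
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum((y - th) / sigma)))  # cumulative standardized residuals
    stat = -np.inf
    for i in range(n):
        for j in range(i, n):
            if th[i:j + 1].max() > th[i:j + 1].min():          # skip intervals containing a jump of theta
                continue
            m = j - i + 1
            local = abs(cs[j + 1] - cs[i]) / np.sqrt(m) - np.sqrt(2.0 * np.log(np.e * n / m))
            stat = max(stat, local)
    return stat
```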
In our simulation study we consider the following change point methods. A large group follows the common paradigm of maximizing a PL criterion of the form
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0283(39)
over urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0284 for k=1,…,n, where the function pen(ϑ) penalizes the complexity of the model. This includes the BIC that was introduced in Schwarz (1978) which suggests the choice pen(ϑ)={2#J(ϑ)}  log (n). As was for instance stressed in Zhang and Siegmund (2007), the formal requirements to apply the BIC are not satisfied for the change point problem. Instead Zhang and Siegmund (2007) proposed the following penalty function in expression (39), which is denoted as the MBIC:
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0285
They compared their MBIC method with the traditional BIC as well as with the methods in Olshen et al. (2004) and Fridlyand et al. (2004) by means of a comprehensive simulation study and demonstrated the superiority of their method with respect to the number of correctly estimated change points. For this reason we consider only the method of Zhang and Siegmund (2007) in our simulations.
In addition, we shall include the PL oracle as a benchmark, which is defined as follows: recall that K denotes the true number of change points. For given data Y, define ωl and ωu as the minimal and maximal element of the set
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0286
respectively. In particular, for ωm:=(ωl+ωu)/2 the penalized maximum likelihood (ML) estimator, i.e. a maximizer of expression (39) obtained with penalty pen(ϑ)=ωm#J(ϑ), has exactly K change points. For our assessment, we simulate 104 instances of data Y and compute the median ω* of the corresponding ωms. We then define the PL oracle to be a maximizer of expression (39) with pen(ϑ)=ω*#J(ϑ). Of course, PL oracles are not accessible in practice (since K and ϑ are unknown). However, they represent benchmark instances within the class of estimators given by expression (39) and penalties of the form pen(ϑ)=ω#J(ϑ). We stress again that even if SMUCE and the PL oracle have the same number of change points they are in general not equal, since the likelihood in expression (6) is maximized only over the set C(q).
Moreover, we consider the fused lasso algorithm which is based on computing solutions of
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0287(40)
where ‖·‖1 denotes the l1‐norm and ‖·‖TV the total variation seminorm (see also Harchaoui and Lévy‐Leduc (2010)). The fused lasso is not specifically designed for the change point problem. However, because of its prominent role and its application to change point problems (see for example Tibshirani and Wang (2008)), we include it in our simulations. An optimal choice of the parameters (λ1,λ2) is crucial and in our simulations we consider two fused lasso oracles FLMSE and FLcp. In 500 Monte Carlo simulations (using the true signal) we compute λ1 and λ2 such that the mean integrated squared error MISE is minimized for FLMSE and such that the frequency of correctly estimated number of change points is maximized for FLcp.

In summary, we compare SMUCE with the MBIC approach that was suggested in Zhang and Siegmund (2007), the CBS algorithm (R package available from http://cran.r-project.org/web/packages/PSCBS) proposed in Olshen et al. (2004), the fused lasso algorithm (R package available from http://cran.r-project.org/web/packages/flsa/) that was suggested in Tibshirani et al. (2005), unbalanced Haar wavelets (R package available from http://cran.r-project.org/web/packages/unbalhaar/) (Fryzlewicz, 2007) and the PL oracle as defined above. Since the CBS algorithm tends to overestimate the number of change points, Olshen et al. (2004) included a pruning step which requires the choice of an additional parameter. The choice of this parameter is not explicitly described in Olshen et al. (2004) and here we consider only the unpruned algorithm.

We follow the simulation set‐up that was considered in Olshen et al. (2004) and Zhang and Siegmund (2007). The application that they bear in mind is the analysis of array‐based comparative genomic hybridization (CGH) data. Array CGH is a technique for recording the number of copies of genomic DNA (see Kallioniemi et al. (1992)). As pointed out in Olshen et al. (2004), piecewise constant regression is a natural model for array DNA copy number data (see also Section 5.6.1.). Here, we have n=497 observations with constant variance σ2=0.04 and the true regression function has six change points at locations τi=li/n and (l1,…,l6)=(138,225,242,299,308,332) with intensities (θ0,…,θ6)=(−0.18,0.08,1.07,−0.53,0.16,−0.69,−0.16). To investigate robustness against small deviations from the model with step functions, a local trend component is included in these simulations, i.e.
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0288(41)
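For reference, the step function part of this test signal can be written down directly from the locations and intensities quoted above; the sketch below omits the local trend component of model (41) and adds Gaussian noise with σ=0.2, i.e. the a=0 scenario.

```python
import numpy as np

def cgh_test_signal(n=497,
                    breaks=(138, 225, 242, 299, 308, 332),
                    levels=(-0.18, 0.08, 1.07, -0.53, 0.16, -0.69, -0.16)):
    """Piecewise constant part of the simulation signal: levels[k] on the k-th segment."""
    theta = np.empty(n)
    edges = (0,) + tuple(breaks) + (n,)
    for level, lo, hi in zip(levels, edges[:-1], edges[1:]):
        theta[lo:hi] = level
    return theta

rng = np.random.default_rng(0)
y = cgh_test_signal() + 0.2 * rng.standard_normal(497)   # no-trend scenario with sigma = 0.2
```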

Following Zhang and Siegmund (2007) we simulate data for σ=0.2 and a=0 (no trend), a=0.01 (long trend) and a=0.025 (short trend) (Fig. 4). Moreover, we included a scenario with a smaller signal‐to‐noise ratio, i.e. σ=0.3 and a=0, and a scenario with a higher signal‐to‐noise ratio, i.e. σ=0.1 and a=0. For both scenarios we do not display results with a local trend, since we found the effect to be very similar to the results for σ=0.2. Table 1 shows the frequencies of the numbers of detected change points for all methods mentioned and the corresponding MISE and mean integrated absolute error MIAE. Moreover, in Fig. 6 we display a typical observation of model (41) with a=0.01 and b=0.1 and the aforementioned estimators. The results show that SMUCE slightly outperforms the MBIC (Zhang and Siegmund, 2007) for σ=0.2 and, in particular, appears to be less vulnerable to trends. Notably, SMUCE often performs even better than the PL oracle. For σ=0.3 SMUCE has a tendency to underestimate the number of change points by 1, whereas CBS and in particular the MBIC estimate the true number K=6 correctly with high probability. As is illustrated in Fig. 5, this is because SMUCE cannot detect all change points at level 1−α ≈ 0.55, which we have chosen following the simple rule (38) in Section 4. For further investigation, we lowered the level to 1−α=0.4 (see the last row of Table 1). Even though this improves the estimation and SMUCE now performs comparably with CBS and the PL oracle, it is still worse than the MBIC.

image
True signal (image), simulated data (•) and confidence bands (image) and confidence intervals for the change points (image) for (a) a=0, (b) a=0.01 and (c) a=0.025 and σ=0.2
image
(a) Typical example of model (41) for b=0 and σ2=0.43 and (b) change points and confidence intervals for SMUCE with α=0.1,…,0.9 (left‐hand y‐axis) and corresponding quantiles q1−α (right‐hand y‐axis)
Table 1. Frequencies of estimated number of change points and MISE by model selection for SMUCE, the PL oracle, MBIC (Zhang and Siegmund, 2007), CBS (Olshen et al., 2004) and the fused lasso oracles FLcp and FLMSE as well as the unbalanced Haar wavelets estimator (Fryzlewicz, 2007)†
Method Trend σ Results for the following numbers of change points: MISE MIAE
4 5 6 7 8
SMUCE (1−α=0.55) No 0.1 0.000 0.000 0.988 0.012 0.000 0.00019 0.00885
PL oracle No 0.1 0.000 0.000 1.000 0.000 0.000 0.00019 0.00874
MBIC (Zhang and Siegmund, 2007) No 0.1 0.000 0.000 0.964 0.031 0.005 0.00020 0.00888
CBS (Olshen et al., 2004) No 0.1 0.000 0.000 0.922 0.044 0.034 0.00023 0.00903
Unbalanced Haar wavelets (Fryzlewicz, 2007) No 0.1 0.000 0.000 0.751 0.137 0.112 0.00026 0.00926
FLcp No 0.1 0.124 0.122 0.419 0.134 0.201 0.00928 0.15821
FLMSE No 0.1 0.000 0.000 0.000 0.000 1.000 0.00042 0.00274
SMUCE (1−α=0.55) No 0.2 0.000 0.000 0.986 0.014 0.000 0.00117 0.01887
PL oracle No 0.2 0.024 0.001 0.975 0.000 0.000 0.00138 0.01915
MBIC (Zhang and Siegmund, 2007) No 0.2 0.000 0.000 0.960 0.037 0.003 0.00120 0.01894
CBS (Olshen et al., 2004) No 0.2 0.000 0.000 0.870 0.089 0.041 0.00146 0.01969
Unbalanced Haar wavelets (Fryzlewicz, 2007) No 0.2 0.000 0.000 0.637 0.222 0.141 0.00174 0.02063
FLcp No 0.2 0.184 0.162 0.219 0.174 0.261 0.08932 0.23644
FLMSE No 0.2 0.000 0.000 0.000 0.000 1.000 0.00297 0.03692
SMUCE (1−α=0.55) Long 0.2 0.000 0.000 0.825 0.171 0.004 0.00209 0.03314
PL oracle Long 0.2 0.026 0.030 0.944 0.000 0.000 0.00245 0.03452
MBIC (Zhang and Siegmund, 2007) Long 0.2 0.000 0.000 0.753 0.215 0.032 0.00214 0.03347
CBS (Olshen et al., 2004) Long 0.2 0.000 0.000 0.708 0.130 0.162 0.00266 0.03501
Unbalanced Haar wavelets (Fryzlewicz, 2007) Long 0.2 0.000 0.000 0.447 0.308 0.245 0.00279 0.03515
FLcp Long 0.2 0.078 0.112 0.219 0.215 0.376 0.08389 0.22319
FLMSE Long 0.2 0.000 0.000 0.000 0.000 1.000 0.00302 0.03782
SMUCE (1−α=0.55) Short 0.2 0.000 0.002 0.903 0.088 0.007 0.00235 0.03683
PL oracle Short 0.2 0.121 0.002 0.877 0.000 0.000 0.00325 0.03846
MBIC (Zhang and Siegmund, 2007) Short 0.2 0.000 0.000 0.878 0.107 0.015 0.00238 0.03695
CBS (Olshen et al., 2004) Short 0.2 0.000 0.000 0.675 0.182 0.143 0.00267 0.03806
Unbalanced Haar wavelets (Fryzlewicz, 2007) Short 0.2 0.000 0.000 0.602 0.225 0.173 0.00288 0.03849
FLcp Short 0.2 0.175 0.126 0.192 0.210 0.297 0.08765 0.23105
FLMSE Short 0.2 0.000 0.000 0.000 0.000 1.000 0.00331 0.04111
SMUCE (1−α=0.55) No 0.3 0.030 0.340 0.623 0.007 0.000 0.00660 0.03829
PL oracle No 0.3 0.181 0.031 0.788 0.000 0.000 0.00505 0.03447
MBIC (Zhang and Siegmund, 2007) No 0.3 0.015 0.006 0.927 0.050 0.002 0.00364 0.03123
CBS (Olshen et al., 2004) No 0.3 0.006 0.019 0.764 0.157 0.054 0.00449 0.03404
Unbalanced Haar wavelets (Fryzlewicz, 2007) No 0.3 0.008 0.004 0.602 0.244 0.142 0.00556 0.03792
FLcp No 0.3 0.038 0.059 0.088 0.115 0.700 0.08792 0.23496
FLMSE No 0.3 0.531 0.200 0.125 0.078 0.066 0.09670 0.24131
SMUCE (1−α=0.4) No 0.3 0.000 0.099 0.798 0.089 0.000 0.00468 0.03499
  • †The true signals, which are shown in Fig. 4, have six change points.

For an evaluation of FLMSE and FLcp one should account for the quite different nature of the fused lasso: the weight λ1 in expression (40) penalizes estimators with large absolute values, whereas λ2 penalizes the cumulated jump height. However, neither of them directly encourages sparsity with respect to the number of change points. That is why these estimators often incorporate many small jumps (which is well known as the staircase effect). We find that SMUCE outperforms FLMSE with respect to MISE and outperforms FLcp with respect to the frequency of correctly estimating the number of change points. The example in Fig. 6 suggests that the major features of the true signal are recovered by FLMSE; however, the estimator also contains some artificial features, which suggests that an additional filtering step must be included (see Tibshirani and Wang (2008)).

image
Example of model (41) for a=0.01, b=0.1 and σ=0.2 (image, true signal): (a) SMUCE; (b) MBIC; (c) unbalanced Haar wavelets; (d) CBS; (e) FLMSE; (f) FLcp

The unbalanced Haar estimator also has a tendency to include too many jumps, even though the effect is much smaller than for lasso‐type methods, i.e. it is much sparser with respect to the number of change points. As this estimator has been primarily designed for estimation of ϑ and not the jump locations it performs well with respect to MISE and MIAE.

Again, we note that Table 1 can be complemented by the simulation study in Zhang and Siegmund (2007) which accounts for the classical BIC (Schwarz, 1978) and the method suggested in Fridlyand et al. (2004).

5.2. Gaussian variance regression

Again, we consider normal data Yi; however, in contrast with the previous section we aim to estimate the variance urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0289. For simplicity we set μ=0. This constitutes a natural exponential family with natural parameter θ=−1/(2σ2) and ψ(θ)=− log (−2θ)/2 for the sufficient statistic urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0290, i=1,…,n. It is easily seen that the MR‐statistic in this case reads
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0291
After selecting the model urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0292 according to expression (18), SMUCE is given by
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0293
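As an illustration, the local statistic entering this construction can be evaluated for a single interval as follows; we assume here that it is the log-likelihood ratio of the segment's empirical variance against the candidate value under N(0,σ2), which is how we read the display above.

```python
import numpy as np

def variance_local_stat(y_seg, sigma0_sq):
    """Sketch of sqrt(2 T) for one interval in the variance setting: T is the log-likelihood
    ratio of the local ML estimate of sigma^2 against the candidate value sigma0_sq under the
    N(0, sigma^2) model (mu = 0 as in the text)."""
    y_seg = np.asarray(y_seg, dtype=float)
    m = len(y_seg)
    ratio = np.mean(y_seg ** 2) / sigma0_sq        # local ML estimate relative to the candidate
    T = 0.5 * m * (ratio - 1.0 - np.log(ratio))    # nonnegative by convexity
    return np.sqrt(2.0 * T)
```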

We compare our method with those of Davies et al. (2012) and Höhenrieder (2008). Similarly to SMUCE, they proposed to minimize the number of change points under a multiscale constraint. They additionally restricted their final estimator to coincide with the local ML estimator on constant segments. As pointed out by Davies et al. (2012) and Höhenrieder (2008), this may increase the number of change points detected. Following their simulation study we consider test signals σk with k=0,1,4,9,19 equidistant change points and constant values alternating from 1 to 2 (k=1), from 1 to 2 (k=4), from 1 to 2.5 (k=9) and from 1 to 3.5 (k=19). For this simulation the parameters of both procedures are chosen such that the number of changes should not be overestimated with probability 0.9. For each signal we computed both estimates in 1000 simulations. The differences between true and estimated numbers of change points as well as MISE and MIAE are shown in Table 2. Considering the number of correctly estimated change points, we find that SMUCE performs better for few changes (k=1,4,9) and worse for many changes (k=19). This may be explained by the fact that the multiscale test in Davies et al. (2012) does not include a scale calibration and is hence more sensitive on small than on large scales; see also Section 6.2. With respect to MISE and MIAE, SMUCE performs better in every scenario, interestingly even for k=19, where the method of Davies et al. (2012) performs better with respect to the estimated number of change points.

Table 2. Comparison of SMUCE and the method in Davies et al. (2012)†
Method k Results for the following differences between estimated and true numbers of change points: MISE MIAE
−3 −2 −1 0 1 2 3
SMUCE 0 0.000 0.000 0.000 0.945 0.053 0.002 0.000 0.00072 0.02040
Davies et al. (2012) 0 0.000 0.000 0.000 0.854 0.127 0.019 0.000 0.00093 0.02122
SMUCE 1 0.000 0.000 0.000 0.975 0.024 0.001 0.000 0.00653 0.04295
Davies et al. (2012) 1 0.000 0.000 0.000 0.901 0.089 0.009 0.001 0.00935 0.04648
SMUCE 4 0.000 0.000 0.000 0.997 0.003 0.000 0.000 0.02153 0.07967
Davies et al. (2012) 4 0.000 0.000 0.000 0.957 0.042 0.001 0.000 0.03378 0.09655
SMUCE 9 0.000 0.001 0.023 0.973 0.003 0.000 0.000 0.06456 0.13206
Davies et al. (2012) 9 0.000 0.000 0.009 0.968 0.023 0.000 0.000 0.11669 0.18297
SMUCE 19 0.000 0.027 0.222 0.751 0.000 0.000 0.000 0.26076 0.27468
Davies et al. (2012) 19 0.000 0.008 0.074 0.912 0.006 0.000 0.000 0.47105 0.40606
  • †Difference between the estimated and the true number of change points for k=0,1,4,9,19 change points as well as MISE and MIAE for both estimators.

5.3. Poisson regression

We consider the Poisson family of distributions with intensity μ>0. Then, θ= log (μ) and ψ(θ)= exp (θ). The multiscale statistic is computed as
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0294
For urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0295 as in expression (18), SMUCE is given by
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0296
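Analogously to the Gaussian cases, the local statistic for a single interval can be sketched as the Poisson log-likelihood ratio of the segment mean against the candidate intensity; this form is an assumption of the illustration, not a verbatim transcription of the display above.

```python
import numpy as np

def poisson_local_stat(y_seg, mu0):
    """Sketch of sqrt(2 T) for one interval under the Poisson model: T is the log-likelihood
    ratio of the segment mean against the candidate intensity mu0."""
    y_seg = np.asarray(y_seg, dtype=float)
    m, ybar = len(y_seg), y_seg.mean()
    if ybar == 0.0:
        T = m * mu0                                    # limit of the expression below as ybar -> 0
    else:
        T = m * (ybar * np.log(ybar / mu0) - ybar + mu0)
    return np.sqrt(2.0 * T)
```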

In applications (see the example from photoemission spectroscopy below), one is often faced with the problem of low count Poisson data, i.e. when the intensity μ is small. It will turn out that, in this case, data transformations towards Gaussian variables, such as variance stabilizing transformations, are not always sufficient and it pays off to take the Poisson likelihood into account in SMUCE.

In what follows we perform a simulation study where we use a signal with a low count and a spike part (Fig. 7(a)). To evaluate the performance of SMUCE we compare it with the BIC estimator and the PL oracle as described before. Moreover, we included a version of SMUCE which is based on variance stabilizing transformations of the data. For this, we applied the mean matching transformation (Brown et al., 2010) to preprocess the data. We then compute SMUCE under a Gaussian model and retransform the obtained estimator by the inverse mean matching transform. The resulting estimator is referred to as SMUCEmm. Moreover, as a benchmark, we compute the (parametric) ML estimator with K=7 change points, which is referred to as the ML oracle.

image
(a) Simulated data, (b) true signal, (c) SMUCE with confidence bands for the signal intensities (image) and confidence intervals for the change points (image), (d) SMUCEmm and (e) PL oracle

Table 3 summarizes the simulation results. As was to be expected, the standard BIC performs far from satisfactorily. We stress that SMUCE clearly outperforms SMUCEmm, which is based on Gaussian transformations. Note that SMUCEmm systematically underestimates the number of change points K=7, which highlights the difficulty of correctly capturing those parts of the signal where the intensity is low. Again, SMUCE performs almost as well as the PL oracle. To give a visual impression along with the results in Table 3, we illustrate these estimators in Fig. 7.

Table 3. Frequencies of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0297 and distance measures for SMUCE, the BIC (Schwarz, 1978) and SMUCE for variance‐stabilized signals as well as the PL oracle and ML oracle
Method Results for the following numbers of change points: MISE MIAE Kullback–Leibler
≤5 6 7 8 9
SMUCE 0.000 0.067 0.929 0.004 0.004 0.274 0.217 0.0187
BIC 0.000 0.000 0.080 0.094 0.920 0.575 0.313 0.0417
SMUCEmm 0.013 0.420 0.561 0.005 0.006 0.434 0.364 0.0418
PL oracle 0.045 0.014 0.942 0.000 0.000 0.275 0.217 0.0185
ML oracle 0.000 0.000 1.000 0.000 0.000 0.258 0.208 0.0143

5.4. Quantile regression

Finally, we extend our methodology to quantile regression. Let the observations Y1,…,Yn be given by model (1), without any assumption on the underlying distribution. For some β ∈ (0,1), we now aim to estimate the corresponding (piecewise constant) β‐quantile function, which will be denoted by ϑβ. This problem can be turned into a Bernoulli regression problem as follows: given the β‐quantile function ϑβ, define the random variables W(ϑ)=(W1,…,Wn) as
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0298
Then, W1,…,Wn are independent and identically distributed Bernoulli random variables with mean value β. Extending the idea in Section 1.1. we compute a solution of expression (6), where Tn{W(ϑβ)} denotes the multiscale statistic for Bernoulli observations which reads
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0299
with
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0300

In other words, we compute the estimate with fewest change points, such that the signs of the residuals fulfil the multiscale test for Bernoulli observations with mean β. The computation of this estimate hence results in the same type of optimization problem as treated in Section 3.1. and we can apply the methodology proposed.
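To make the reduction explicit, a sketch of the transformation and of a Bernoulli local statistic is given below; the convention Wi = 1{Yi ⩽ ϑβ(xi)} and the likelihood ratio form of the local statistic are assumptions made for this illustration.

```python
import numpy as np

def bernoulli_indicators(y, theta_beta):
    """W_i indicates whether Y_i lies below the candidate beta-quantile function; under the
    true quantile function the W_i are independent Bernoulli(beta) variables."""
    return (np.asarray(y) <= np.asarray(theta_beta)).astype(float)

def bernoulli_local_stat(w_seg, beta):
    """Sketch of sqrt(2 T) for one interval: T is the binomial log-likelihood ratio of the
    segment mean of the indicators against the nominal level beta."""
    w_seg = np.asarray(w_seg, dtype=float)
    m, wbar = len(w_seg), w_seg.mean()
    def klterm(a, b):                 # a * log(a / b), with the convention 0 * log(0) = 0
        return 0.0 if a == 0.0 else a * np.log(a / b)
    T = m * (klterm(wbar, beta) + klterm(1.0 - wbar, 1.0 - beta))
    return np.sqrt(2.0 * T)
```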

In what follows we compare this approach with a generalized taut string algorithm (Davies and Kovac, 2001), which was proposed in Dümbgen and Kovac (2009), for estimating quantile functions. The estimate is constructed in such a way that it minimizes the number of local extreme values among a specified class of functions. Here, a local extreme value is either a local maximum or a local minimum.

In contrast with SMUCE, the number of change points is not penalized. In a simulation study, Dümbgen and Kovac (2009) showed that their method is particularly suitable for detecting local extremes of a signal. We follow this idea and repeat their simulations; see also Fig. 8. The results, which also include the estimated number of change points, are shown in Table 4. It can be seen that the generalized taut string estimates the number of local extremes slightly better than SMUCE, whereas it overestimates the number of change points for n=2048 and n=4096. This may be explained by the fact that the generalized taut string is not primarily designed to have few change points but rather few local extremes.

image
(a) Block signal, (b) simulated data, (c) estimator for the median (image) and 0.1‐ and 0.9‐quantile (image) from SMUCE and (d) estimator for the median (image) and 0.1‐ and 0.9‐quantile (image) from the generalized taut string
Table 4. Comparison of SMUCE and the generalized taut string (Dümbgen and Kovac, 2009)†
Method n Results for local extreme values: Results for change points:
β=0.5 β=0.1 β=0.9 β=0.5 β=0.1 β=0.9
SMUCE 512 3 (5.9) 1 (7.9) 2 (7.4) 5 (5.8) 2 (9.1) 3 (8.3)
Generalized taut string 512 3 (6.0) 3 (6.6) 3 (6.6) 12 (2.0) 6 (4.9) 7 (4.0)
SMUCE 2048 9 (0.4) 4 (5.4) 3 (5.8) 11 (0.1) 6 (5.2) 5 (5.9)
Generalized taut string 2048 9 (0.7) 5 (4.0) 3 (5.7) 26 (15.3) 18 (7.1) 16 (5.7)
SMUCE 4096 9 (0.1) 4 (4.3) 5 (4.5) 11 (0.2) 8 (3.1) 6 (4.8)
Generalized taut string 4096 9 (0.0) 6 (3.1) 3 (5.3) 35 (24.1) 25 (13.8) 21 (9.9)
  • †Medians of the numbers of local extreme values and change points of the estimators, and mean absolute difference (in parentheses) from the true numbers of local extremes and change points. The true number of local extremes equals 9 and the true number of change points equals 11.

5.5. On the coverage of confidence sets I(q)

In Section 2.6. we gave asymptotic results on the simultaneous coverage of the confidence sets I(q) as defined in expression (28). In our simulations we choose q=q1−α to be the (1−α)‐quantile of M as in expression (15). It then follows from corollary 3 that asymptotically the simultaneous coverage is larger than 1−α. We now investigate empirically the simultaneous coverage of I(q1−α). For this, we consider the test signals that are shown in Fig. 9 for Gaussian observations with varying mean, Gaussian observations with varying variance, Poisson observations and Bernoulli observations.

image
(a) Gaussian observations with varying mean, (b) Gaussian observations with varying variance, (c) Poisson observations and (d) (binned) Bernoulli observations and SMUCE (image) with confidence bands (image) and confidence intervals for change points (image)

Table 5 summarizes the empirical coverage for various values for α and n obtained by 500 simulation runs each and the relative frequencies of correctly estimated change points, which are given in parentheses. The results show that for n=2000 the empirical coverage exceeds 1−α in all scenarios. The same is not true for smaller n (indicated by italics), since here the number of change points is misspecified quite frequently (see the numbers in parentheses). Given that K has been estimated correctly, we find that the empirical coverage of bands and intervals is in fact larger than the nominal 1−α for all simulations.

Table 5. Empirical coverage obtained from 500 simulations for the signals shown in Fig. 9†
n 1−α Gaussian observations (mean) Gaussian observations (variance) Poisson observations Bernoulli observations
1000 0.8 0.59 0.64 0.92 0.66 0.68 0.97 0.87 0.89 0.98 0.85 0.90 0.94
0.9 0.48 0.49 0.98 0.39 0.39 1.00 0.85 0.86 0.99 0.86 0.86 0.99
0.95 0.28 0.28 1.00 0.16 0.18 0.93 0.71 0.74 0.96 0.66 0.70 0.94
1500 0.8 0.84 0.90 0.93 0.87 0.88 0.98 0.92 0.95 0.96 0.93 0.97 0.96
0.9 0.73 0.74 0.98 0.72 0.74 0.97 0.95 0.97 0.98 0.96 0.97 0.99
0.95 0.55 0.56 0.98 0.45 0.47 0.98 0.92 0.93 0.99 0.89 0.90 0.99
2000 0.8 0.94 0.99 0.95 0.98 1.00 0.98 0.95 0.99 0.95 0.96 0.99 0.97
0.9 0.98 1.00 0.98 0.99 1.00 0.99 0.96 0.99 0.96 0.97 0.99 0.98
0.95 0.99 1.00 0.99 0.97 0.99 0.98 1.00 1.00 1.00 0.99 1.00 0.99
  • †For each choice of α and n we computed the simultaneous coverage of I(q), as in expression (28) (first value), the percentage of correctly estimated number of change points (second value) and the simultaneous coverage of confidence bands and intervals for the change points given urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0301 (third value).

5.6. Real data results

In this section we analyse two real data examples. The examples show the variety of possible applications for SMUCE. Moreover, we revisit the issue of choosing q as proposed in Section 4. and illustrate its applicability to the present tasks.

5.6.1. Array comparative genomic hybridization data

Array CGH data show aberrations in genomic DNA. The observations consist of the log‐ratios of normalized intensities from disease and control samples. The statistical problem at hand is to identify regions on which the ratio differs significantly from 0 (which corresponds to a gain or a loss). These are often referred to as aberration regions.

A thorough overview of the topic and a comparison of several methods is given in Lai et al. (2005). We compute SMUCE for two data sets studied in Lai et al. (2005) and more recently in Du and Kou (2012) and Tibshirani and Wang (2008). The data sets show the array CGH profile of chromosome 7 in GBM29 and chromosome 13 in GBM31 (see also again Du and Kou (2012) and Lai et al. (2005)).

By means of these two data examples we illustrate how the theory developed in Section 2. can be used for applications. As was stressed in Lai et al. (2005) many algorithms in change point detection strongly depend on the proper choice of a tuning parameter, which is often a difficult task in practice. We point out that our proposed choice of the threshold parameter q has in fact a statistically meaningful interpretation as it determines the level of the confidence set C(q). Moreover, we shall emphasize the usefulness of confidence bands and intervals for array CGH data.

We first consider the GBM29 data. To choose q according to the procedure suggested in expression (37), assumptions on λ and Δ must be imposed.

As mentioned above, log‐ratios of copy numbers may take a finite number of values which are approximately { log (1), log (3/2), log (2), log (5/2),…}. It therefore seems reasonable to assume that the smallest jump size is Δ=log(3/2).

Moreover, we choose λ⩾0.2. We stress that the final solution of SMUCE will not be restricted by these assumptions; they enter only as prior assumptions for the choice of q. If the data indicate strongly that these assumptions do not hold, SMUCE will adapt to this.

In Fig. 10(a) we depict the probability of overestimating the number of change points as a function of q (the decreasing broken curve) and the probability of underestimating the number of change points as a function of q (the increasing broken curve) under the above-stated assumptions on λ and Δ. We may interpret the plot in the following way: it provides a tool for finding jumps of minimal height Δ=log(3/2) on scales of at least λ=0.2. For the optimized q* we obtain that the number of jumps is misspecified with probability less than 0.35. For the corresponding estimate see Fig. 10.

image
(a) Probability of overestimating (image) or underestimating (image) the number of change points as a function of q (x‐axis) and their sum (image), (b) detected change points with confidence intervals for various values of α (left‐hand y‐axis) with the probability of underestimation (right‐hand y‐axis) and (c) SMUCE (image) computed for the optimal q*≈1.1 with confidence bands (image) and confidence intervals for change points (image)

Moreover, we display SMUCE for different choices of q. Fig. 10(b) shows the estimated change points with their confidence intervals. Bounds for the probability that K is overestimated can be found on the left‐hand axis, and bounds for underestimation on the right‐hand axis.

Note from Fig. 10(b) that SMUCE is quite robust with respect to q=q1−α. For α ∈ [0.2,0.7] SMUCE always detects exactly seven change points in the signal. The results show that a jump of size approximately Δ is found in the data on an interval, whose length is even slightly smaller than λ. However, SMUCE can also detect larger aberrations on smaller intervals, which makes it quite robust against wrong choices of Δ and λ.

Recall that one goal in array CGH data analysis is to determine segments on which the signal differs from 0. The confidence sets in Fig. 10(c) indicate three intervals with a signal different from 0. Moreover, as indicated by the blue arrows, the change point locations are detected very precisely. Actually, the estimator suggests one more change point in the data; however, it can be seen from the confidence bands that there is only weak evidence for the signal to be non‐zero there. Further, the confidence bands may be used to decide which segments belong to the same copy number event. In this particular example the confidence bands suggest that these three segments belong to the same copy number event, i.e. have the same mean value.

Put differently, not only is an estimator for the true signal obtained, but three regions of aberration are also detected and simultaneous confidence intervals for the signal's values on these regions are given at a level of 1−α=0.9. This is in accordance with others' findings (Du and Kou, 2012; Lai et al., 2005).

The same procedure as above is repeated for the GBM31 data, as shown in Fig. 11. For the bounds on underestimating the number of change points we assumed again that Δ⩾ log (3/2) and chose λ⩾0.025. Fig. 11 shows that, for Δ⩾ log (3/2) and the sample size of n=797, the probability of misspecification can be bounded by approximately 0.12 for the minimal length λ=0.025, which corresponds to 19 observations. Using the same reasoning as above, we identify one large region of aberration and obtain a confidence interval for the corresponding change point as well as for the signal's value. Here, the optimized q*≈1.7 in the sense of expression (38) gives α≈0.04, which yields a SMUCE estimate with one jump at high significance.

image
(a) Probability of overestimating (image) or underestimating (image) the number of change points as a function of q (x‐axis) and their sum (image), (b) detected change points with confidence intervals for various values of α (left‐hand y‐axis) with the probability of underestimation (right‐hand y‐axis) and (c) SMUCE (image) computed for the optimal q*≈1.7 with confidence bands (image) and confidence intervals for change points (image)

5.6.2. Photo‐emission spectroscopy

Electron emission from nanostructures triggered by ultrashort laser pulses has numerous applications in time‐resolved electron imaging and spectroscopy (Ropers et al., 2007). In addition, it holds promise for fundamental insight into electron correlations in microscopic volumes, including antibunching (Kiesel et al., 2002). Single‐shot measurements of the number of electrons emitted per laser pulse (Bormann et al., 2010; Herink et al., 2012) will allow for the disentanglement of various competing processes governing the electron statistics, such as classical fluctuations, Pauli blocking and space charge effects.

We investigate photo‐emission spectroscopy data, displayed in Fig. 12(c), with the SMUCE approach. The data represent a time series of electron numbers recorded from a photo‐emission spectroscopy experiment that was performed in the Ropers laboratory (Department of Biophysics, University of Göttingen; see Bormann et al. (2010)). It is customary to model photo‐emission spectroscopy data by Poisson regression with unknown intensity. This intensity is known to show long‐term fluctuations which correspond to variations in laser power and laser beam pointing; these cannot be controlled in the experiment and typically lead to an overall overdispersion effect. However, on a short timescale the interesting task is to investigate underdispersion in the distribution. Such underdispersion would indicate an electron interaction in which the emission of one (or a few) electrons decreases the likelihood of further emission events. Specifically, significant underdispersion in the single‐shot electron number histogram would evidence an anticorrelation caused by electrons being fermions that obey the Pauli exclusion principle. A piecewise constant mean, which models sudden changes in the laser intensity and thus reflects the large‐scale fluctuations, is used to segment the data for further investigation of underdispersion or overdispersion within these segments.

image
(a) Detected change points and confidence intervals for different values of α (y‐axis), (b) SMUCE with confidence bands (image), confidence intervals for the change points (image) and binned photo‐emission spectroscopy data and (c) ML estimator with 10 change points

Fig. 12(a) shows the estimated change points of SMUCE (and the corresponding confidence intervals) for α=0.05,0.1,…,0.9. We also display SMUCE with confidence bands for α=0.9 (Fig. 12(b)) and for comparison the ML estimator with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0302 change points (Fig. 12(c)). Note that the ML estimator is computed without the additional constraint urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0303, in contrast with SMUCE. Remarkably, this results in a different estimator.

We estimate the dispersion of data Y1,…,Ym by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0304, where urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0305 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0306. In Table 6, urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0307 is shown for the whole data set as well as for the segments identified by SMUCE. It can be seen that our segmentation allows us to explain the overall overdispersion to a large extent by the long‐term fluctuations. However, the results in Table 6 do not indicate significant underdispersion on any of the segments identified. This may be explained by a masking effect due to fluctuations of the emission current. Future experiments using more stable emission currents are under way.
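The precise form of the dispersion estimator is given in the display above; for orientation, the sketch below implements the standard variance-to-mean ratio for count data on one segment (equal to 1 in expectation for Poisson data), which is our assumption for what is reported in Table 6.

```python
import numpy as np

def dispersion(y_seg):
    """Variance-to-mean ratio of a segment of counts (a sketch of the dispersion estimate)."""
    y_seg = np.asarray(y_seg, dtype=float)
    return y_seg.var(ddof=1) / y_seg.mean()
```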

Table 6. Dispersion estimator urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0308 of the whole data set and on the segments identified by SMUCE
Segment urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0309
Overall 1.02
1 0.98
2 1.02
3 0.98
4 1.04
5 1.01
6 1.04
7 0.98
8 1.03
9 0.99
10 0.98
11 1.05

6. Discussion

6.1. Dependent data

So far the theoretical justification for SMUCE relies on the independence of the data in model (1) (see Section 2.), as for example the optimal power results in Section 2.5. We claim, however, that SMUCE as introduced in this paper can be extended to piecewise constant regression problems with serially dependent data. A comprehensive discussion is beyond the scope of this paper and will be addressed in future work. Here, we confine ourselves to the case of a Gaussian moving average process of order 1; a similar strategy has been applied in Hotz et al. (2012) for m‐dependent data.

6.1.1. Example 1

For a piecewise constant function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0310 we consider the moving average MA(1) model
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0311
where β<1 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0312. We aim to adapt SMUCE to this situation. Following the local likelihood approach underlying the multiscale constraint in problem (2) one simply might replace the local statistic urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0313 for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0314 in expression (3) by the (modified) local statistics
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0315(42)

This is motivated by the fact that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0316. Under the null hypothesis the local statistics urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0317 then marginally have a urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0318‐distribution, like urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0319 in expression (4) for independent Gaussian observations.
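The following sketch shows one way such a modified local statistic can be standardized; the variance formula is derived from the MA(1) model above, and since the exact form of expression (42) is not reproduced here the function is an illustration rather than the formula of the paper.

```python
import numpy as np

def ma1_local_stat(y_seg, theta0, beta, sigma):
    """Sketch of a modified local statistic in the spirit of expression (42): the partial sum
    of residuals is standardized by the exact standard deviation of a sum of m consecutive
    MA(1) variables, so that it is marginally standard normal under the null hypothesis."""
    r = np.asarray(y_seg, dtype=float) - theta0
    m = len(r)
    # Var(sum_{k=1}^m (eps_k + beta * eps_{k-1})) = sigma^2 * (1 + beta^2 + (m - 1) * (1 + beta)^2)
    sd = sigma * np.sqrt(1.0 + beta ** 2 + (m - 1) * (1.0 + beta) ** 2)
    return abs(r.sum()) / sd
```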

To control the overestimation error as in Section 2.3., we now must compute the null distribution of
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0320

For this, we used Monte Carlo simulations for a sample size of n=500. We reconsider the test signal from Section 5.1. with σ=0.2 and a=0. The empirical null distribution of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0321 and a probability–probability plot of the null distribution of Tn against urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0322 are shown in Fig. 13. For β=0.1 and β=0.3, which correspond to correlations of ρ=0.1 and ρ=0.27, we ran 1000 simulations each. We computed the modified SMUCE, based on the local statistics (42), and SMUCE for independent Gaussian observations. For both procedures we chose q to be the 0.75‐quantile of the null distribution. The results are shown in Table 7. For β=0.1 both procedures perform similarly, which indicates that SMUCE is robust against such weak dependences, whereas for β=0.3 the modified version performs much better with respect to the estimated number of change points.

image
(a) Empirical cumulative distribution function for dependent observations with β=0.3 and (b) probability–probability plot against the null distribution for independent observations
Table 7. Frequencies of estimated numbers of change points and MISE by model selection for the modified SMUCE and SMUCE
Method β Results for the following MISE MIAE
numbers of change points:
5 6 7 8 9
Modified SMUCE 0.1 0.02 0.98 0.00 0.00 0.00 0.00154 0.02104
SMUCE 0.1 0.00 0.95 0.04 0.00 0.00 0.00142 0.02117
Modified SMUCE 0.3 0.27 0.73 0.00 0.00 0.00 0.00435 0.03084
SMUCE 0.3 0.00 0.29 0.34 0.24 0.13 0.00277 0.03229

The example illustrates that SMUCE as in problem (2) can be successfully applied to the case of dependent data after an adjustment of the underlying multiscale statistic Tn to the dependence structure. The asymptotic null distribution of this modified multiscale statistic is certainly not obvious and is postponed to future work.

6.2. Scale calibration of Tn

The penalization of different scales as in expression (3) is borrowed from Dümbgen and Spokoiny (2001) and calibrates the number of intervals on a given scale. This prevents the small intervals from dominating the statistic. For this, we might also consider the statistic

urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0323
which is finite almost surely as n→∞ (see again Dümbgen and Spokoiny (2001), theorem 6.1, or Schmidt‐Hieber et al. (2013)). A multiscale statistic without scale calibration
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0324
was for example considered in Davies et al. (2012). We illustrate the calibration effect of the statistics Tn, as in expression (3), urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0325 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0326 in Fig. 14. The graphic shows the frequencies at which the corresponding 0.75‐quantiles of the statistics Tn, urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0327 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0328 are exceeded at a certain scale (scales are displayed on the x‐axis). It can be seen that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0329 puts much emphasis on small scales, whereas the penalized statistics Tn and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0330 distribute the exceedances more uniformly over the scales. For our purposes this calibration is beneficial in two ways: first, it is required to obtain the optimal detection rates in theorems 6 and 7, as shown in Chan and Walther (2013); second, the asymptotic behaviour is determined by a process of the type (16) and not by an extreme value limit, as would be expected in the uncalibrated case, where the maximum is attained at scales of magnitude  log (n) with high probability (see Kabluchko and Munk (2009), theorem 3.1, and the proof of theorem 1.1), in accordance with Fig. 14.
image
Frequencies of violations of the multiscale constraint for the various multiscale statistics Tn (image), urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0331 (image) and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0332 (image) obtained from 10000 simulations on certain scales (the scales are on the x‐axis)

6.3. SMUCE from a linear models perspective

For normal mean regression we may rewrite the change point regression model (22) as a linear model
urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0334
where βi=ϑi−ϑi−1 denotes the jump heights. If we add a vector of 1s and a coefficient β0 to define the offset of the function, then X is an n×n upper triangular matrix with entries Xi,j equal to 1 if i⩽j, and 0 otherwise. Hence, in the terminology of high dimensional linear models, we have an n=p problem, in contrast with the p≫n situation which has received enormous attention during the last two decades. If we rescale by 1/√n, then we find that XTX/n=min(i,j)/n tends to the covariance function of a standard Brownian motion. From this limiting covariance it becomes immediately clear that assumptions like the restricted isometry property and related conditions (see Bühlmann and van de Geer (2011), Candès and Tao (2007) and Meinshausen and Yu (2009)) fail without additional restrictions, e.g. an s‐sparseness (s≪p) assumption on the jump locations. For a thorough discussion see Boysen et al. (2009) or the appendix in Harchaoui and Lévy‐Leduc (2010). Roughly speaking, these assumptions guarantee that estimators which are based on minimizing l0(β), i.e. the number of jumps, can be obtained via the l1(β) surrogate with large probability. This is not so in our set‐up, where the number of jumps can be arbitrarily large. This may be taken as a rough explanation for the empirical observation that total variation and l1‐penalization methods do not perform competitively in the multiscale framework discussed in this paper for estimating the location and number of change points, as they build in too many small jumps. SMUCE employs a weaker notion of sparsity, i.e. s=n=p.
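The limiting covariance claim is easy to verify numerically: the snippet below constructs the design matrix described above for a small n and checks that XTX/n equals min(i,j)/n, the covariance of a standard Brownian motion on the grid.

```python
import numpy as np

n = 6
X = np.triu(np.ones((n, n)))                  # X[i, j] = 1 for i <= j, as described above
G = X.T @ X / n
i, j = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing="ij")
assert np.allclose(G, np.minimum(i, j) / n)   # Brownian-motion covariance min(s, t) on the grid
```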

6.4. Risk measures

SMUCE aims to maximize the probability of correctly specifying the number of jumps urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0335 uniformly over sequences of models such that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0336 tends to 0 not as fast as  log (n)/n. This is conceptually very different from optimizing urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0337 with respect to convex risk measures such as the MSE and related concepts. The latter measures do not primarily target the jump locations and the number of jumps. Therefore, we argue that in those applications where the primary focus is on the jump locations SMUCE may be advantageous. In fact, maximizing the probability of correctly estimating the number of jumps, as SMUCE advocates, has some analogy with risk measures for variable selection problems, which have been shown to perform successfully in high dimensional models. This includes the false discovery rate (Benjamini and Hochberg, 1995) and related ideas (see for example Genovese and Wasserman (2004)). Whereas the classical false discovery rate aims to minimize the expected relative number of wrongly selected change points, SMUCE can at the same time give a guarantee that the true change points will be detected with large probability and hence also controls the false acceptance rate.

6.5. Computational costs

Killick et al. (2012) showed that their pruned exact linear time method leads to an algorithm whose expected complexity is linear in n in some cases. As stressed in Section 3., our algorithm includes similar pruning steps. Owing to the complicated structure of the cost functional, however, it seems impossible to prove such a result for the computation of SMUCE. The computational cost can, of course, be reduced significantly further if, for example, only intervals of dyadic lengths are incorporated in the multiscale statistic. Since the dynamic programming approach leads to a recursive computation, SMUCE can be updated in linear time if applied to sequential data. Another interesting strategy to reduce the computational costs could be adapted from Rivera and Walther (2012) and Walther (2010), who suggested restricting the multiscale constraint to a specific system of intervals of size urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0338 which still guarantees optimal detection.

6.6. Choice of α

We have offered a strategy for selecting the threshold q=qα, and hence the confidence level α, in a sensible way that minimizes urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0339 by balancing the probabilities of overestimation and underestimation of K simultaneously. This is based on the inequalities in Section 4, which depend on λ, Δ and n. As indicated in Figs 1, 10, 11 and 12, this can be used to consider the evolution of SMUCE depending on α as a universal 'objective' smoothing parameter. The features (jumps) of each SMUCE given α may then be regarded as 'present with certain confidence', similar in spirit to ideas underlying the SiZer algorithm (see Chaudhuri and Marron (1999, 2000)). It is striking that in many simulations we found that features (jumps) remain persistent over a large range of levels α. Of course, other strategies to balance urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0340 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0341 are of interest, e.g. if one of these probabilities is considered as less important. For a first screening of jumps, urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0342 is the less serious error and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0343 should be minimized primarily. This can be achieved by optimizing the convex combination urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0344 for a weight δ close to 1 along the lines described in Section 4.

Acknowledgements

Klaus Frick, Axel Munk and Hannes Sieling were supported by Deutsche Forschungsgemeinschaft–Schweizerischer Nationalfonds grant FOR 916. Axel Munk was also supported by CRC 803, CRC 755 and the Volkswagen Foundation. This paper benefited from discussions with colleagues. We specifically acknowledge L. D. Brown, T. Cai, L. Davies, L. Dümbgen, E. George, C. Holmes, T. Hotz, S. Kou, O. Lepski, R. Samworth, D. Siegmund, A. Tsybakov and G. Walther. Various helpful comments and suggestions of the screeners and reviewers for the journal are gratefully acknowledged.

    Discussion on the paper by Frick, Munk and Sieling

    Idris A. Eckley (Lancaster University)

    We owe thanks to Frick, Munk and Sieling for a most interesting and stimulating paper which so thoroughly addresses the problem at hand. Not only have they explored the area comprehensively from a theoretical perspective, but they also address several important practical considerations and make open source code available implementing the proposed methodology so that others may explore its potential. I commend them for so doing.

    To give some context for those who are less familiar with change point analysis, this area has been undergoing a revival in recent years. The revival is not limited to the statistics community but extends to areas as diverse as the environmental sciences, econometrics, biology, geosciences and linguistics. Work developing the field of change point analysis is therefore of potentially significant importance in many different disciplines, and the contribution of this paper is very timely.

    Within change point analysis one of the key topics is the development of accurate and efficient methods for multiple change point detection. Typically such search methods focus on the identification of the number and location of multiple change points. Historically such approaches have been either fast but approximate (e.g. the binary segmentation approach that was proposed by Scott and Knott (1974) and related variants) or exact but slow (e.g. the segment neighbourhood algorithm that was proposed by Auger and Lawrence (1989)). More recently various researchers have focused on developing more computationally efficient exact search methods; see for example Rigaill (2010) or Killick et al. (2012).

    In this paper the authors first and foremost consider the question of search accuracy, focusing on a particular class of problem: that of the one‐dimensional exponential family. In so doing, within the frequentist framework, they go a significant step further: not only does their approach provide information on the change point locations and parameter estimates, but they also derive some powerful results which enable the quantification of uncertainty (in time) of the change point locations and provide confidence bands for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0345. Moreover, en route they establish several key results, including
    • (a) the (asymptotic) probability of overestimating the number of change points and
    • (b) the probability of underestimating the number of change points for any sample size.

    Taken together, these results are particularly attractive since they mean that, under a suitable choice of q, the probability of misspecification urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0346 tends to 0.

    Naturally a few questions came to mind on reading the paper. In Section 1.3 you mention that potentially several strategies exist to select the threshold q and outline an approach in Section 4 which takes into account prior information about the true signal ϑ. Specifically the approach depends on λ, the minimal interval length, and η, a minimal feature size. In a great many contexts this seems a very pragmatic approach. However, what other approaches to threshold selection did you consider and what was the motivation for choosing the approach described over these others?

    As in several other change point search methods, the penalization approach assumes intersegment independence, i.e. that the parameter value in one segment is independent of those in other segments. How could SMUCE be adapted to deal with intersegment dependence?

    Finally, I have a question motivated by some recent applications which we have encountered. In many cases one might see changes in multiple parameters (e.g. change in mean and variance) occurring in a time series. Indeed, a priori, a practitioner might not know whether they are dealing with a single‐ or multiple‐parameter change point problem. Consider by way of example the simulated data displayed in Fig. 15(a). In particular my concern here is that a practitioner may inadvertently assume from casually eyeballing the data that only one parameter is changing when in fact multiple parameters are changing. In so doing, they might then (mis)apply the single‐parameter SMUCE method and consequently falsely infer the change point structure within the time series (the full horizontal line within Fig. 15(b)). The truth is actually quite different from this since the data contain changes in both mean and variance (see the vertical lines in Fig. 15(b)). Hence, what scope is there to extend SMUCE, at least methodologically, to address such multi‐parameter scenarios?

    Fig. 15. (a) Simulated time series containing change points and (b) results of a SMUCE change‐in‐mean analysis with the true change point locations indicated (change in mean and change in variance)

    In summary, the work which Frick, Munk and Sieling present is that most powerful of combinations: both mathematically appealing and of considerable applied value. In particular it addresses several key change point facets which practitioners have been seeking for some time. As with any good paper read to the Society, this contains much thought‐provoking material and highlights several interesting avenues for future research. I therefore have great pleasure in proposing the vote of thanks.

    Arne Kovac (University of Bristol)

    Maximizing simplicity under multiresolution constraints is an extremely powerful technique in non‐parametric settings such as regression or image analysis. As such it is a pleasure to welcome this paper to the Society. Previously, this concept was in particular exploited in the case where simplicity is measured by the number of local extreme values. There, consistency of the number and locations of extreme values and rates of convergence can be derived (Davies and Kovac, 1997), as well as confidence bounds (Davies et al., 2009). In this paper the authors look at the situation where simplicity is measured by the number of break points in estimating a piecewise constant function. This problem was studied before by Höhenrieder (2010), and his solution shows remarkable similarity to the SMUCE approach.

    Frick, Munk and Sieling consider the observations {Yi} from an exponential family
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0347
    At first sight this provides a general framework and makes the methodology applicable to a wide range of situations, e.g. Gaussian mean and variance regression and Poisson regression. However, I argue here that this model is unnecessarily restrictive. A more direct approach allows a more general framework which includes some relevant special cases that cannot be dealt with by exponential families while avoiding some technical complications created by the assumption of an exponential family.
    The authors consider an exponential family with density fθ(x) = exp{θx − ψ(θ)}. The segmentation is based on the local likelihood ratio statistic, which results in urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0348. If my calculations are correct then, for large j−i+1,
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0349
    This implies that the multiscale constraints will be of the form
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0350
    with the qn chosen so that
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0351
    Forget that the Yl come from an exponential family and suppose now only that they have a moment‐generating function  exp {ψ(λ)}, i.e.
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0352
    Then for all λ
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0353
    Minimizing over λ with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0354 gives
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0355
    for any urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0356. The bounds qn will satisfy to first order
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0357
    leading to exp urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0358 or yn ≈ √{2 log(n)}. For this to hold, {log(n)}^3/(j−i+1) → 0.

    Putting this together, the results obtained by the authors hold for any Yl with a moment‐generating function, as long as the multiscale constraints apply to sums of the Yl.
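    The displayed bounds in this contribution are not reproduced above; for orientation, a standard Cramér–Chernoff inequality of the kind on which such an argument rests reads as follows (a reconstruction under the stated moment‐generating‐function assumption, not necessarily the discussant's exact display).

```latex
% For independent Y_l with E exp(lambda Y_l) = exp{psi(lambda)} and m = j - i + 1,
\Pr\!\Bigl(\sum_{l=i}^{j} Y_l \ge m\,a\Bigr)
  \;\le\; \exp\bigl\{-m\,\psi^{*}(a)\bigr\},
\qquad
\psi^{*}(a) \;=\; \sup_{\lambda>0}\bigl\{\lambda a - \psi(\lambda)\bigr\},
\quad a > \psi'(0).
```

    Expanding ψ* around the mean ψ′(0) gives a Gaussian‐type tail exp{−y²/(2σ²)} for deviations of order y/√m, which is consistent with the first‐order threshold yn ≈ √{2 log(n)} above.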

    The authors point out that the bounds depend on θ ‘only through the number of change points K and the change point locations τk’. But why is there this restriction? No such restriction was made in Höhenrieder (2010). The three main examples that the authors give are families closed under convolution. For the Poisson case, and following Höhenrieder (2010), we could use the constraints
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0359
    with the αn determined by simulations and asymptotically
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0360
    This would allow us to decide how large a jump would have to be to be detected on a given interval, irrespective of the length of the interval. It is reasonably clear how this could be extended to the case where the random variables have a moment‐generating function. Put
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0361
    and define xn by
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0362
    Are there any theoretical reasons for supposing that this form of segmentation is worse than that proposed by the authors?

    Finally, a word on scale calibration: whatever method is used to determine the upper bounds qn, it cannot be that one set of bounds is uniformly lower than the other if both have been calibrated for the same α. The decision must be made by the data owner. Indeed, it is possible to adjust the bounds to accommodate any requirements that the owner may have.

    I found this paper interesting to read and therefore it gives me great pleasure to second the vote of thanks for this paper.

    The vote of thanks was passed by acclamation.

    Frank Critchley (The Open University, Milton Keynes)

    In warmly welcoming this paper, I would like to ask two questions and to indicate two possible types of extension.

    Scale calibration works well as a form of first‐order correction. Given this, (when) might there be scope for some higher order analogue—e.g. adjusting for variance as well as bias?

    Intuitively, it seems that sensitivity to prior specification of (λ,Δ) will typically be high. Is this your experience and, if so, what guidelines might you suggest? In particular, can your methodology be adapted to be more robust to the choice of prior, or is such sensitivity intrinsic?

    Concerning possible extensions, context‐specific considerations may suggest that, if changes are present, they will follow a certain pattern. Two generic examples come to mind:
    • (a) monotonicity (in either direction)—reflecting, for example, in a reliability context, ‘things can only get worse’;
    • (b) changes with alternating signs—reflecting, for example, compensating departures from equilibrium.

    Can procedures be formulated to test for such patterns (say, within the general alternative of any pattern)? Relatedly, can diagnostics be developed to check (potentially, on line) for departures from them?

    Secondly, might there be scope to adapt your methodology for use in model checking—most obviously, by applying it with the {Yi} as (regression) residuals? This requires, of course, that (at least mild) dependence among the {Yi} be permitted—raising the related question of extensions towards, say, time series and/or spatial data. To this end, (analogues or extensions of) Theil's best linear unbiased scalar residuals may have a role to play, these being uncorrelated and homoscedastic under the model.

    All this is to thank you for a very stimulating and well‐written paper.

    Yining Chen, Rajen D. Shah and Richard J. Samworth (University of Cambridge)

    We congratulate the authors for their stimulating paper; our comments focus on possible avenues for further development.

    A Gaussian quasi‐SMUCE (GQSMUCE)

    In the paper it is assumed that Y comes from a known exponential family, which may not be realistic. But consider the setting
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0363
    where the ɛi are independent and identically distributed with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0364 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0365. The number of change points can still be estimated by solving
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0366
    with
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0367
    Scrutiny of the proof of theorem 1 shows that the result continues to hold for the Gaussian quasi‐likelihood estimator GQSMUCE above provided that there exists s0>0 such that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0368 for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0369. An analogue of theorem 2 can also be proved, establishing model selection consistency.
    To examine the performance of GQSMUCE in a non‐Gaussian setting, and similarly to Section 5.1, we let
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0370(43)
    where ɛ1,…,ɛ497 have a shifted and scaled beta(2,2) distribution with zero mean and unit variance. Results are summarized in Table 8. We see that GQSMUCE outperforms circular binary segmentation (CBS) (Olshen et al., 2004) at the lower noise levels σ=0.1 and σ=0.2 but tends to underestimate the number of change points when σ=0.3. These findings are qualitatively similar to the results in Section 5.1.
    Table 8. Relative frequencies of estimated numbers of change points by model selection for GQSMUCE and CBS (Olshen et al., 2004) in 500 Monte Carlo simulations†

    Method                      σ      Frequencies for the following numbers of change points:
                                       4       5       6       7       8
    GQSMUCE (1−α=0.55)          0.1    0.000   0.000   0.988   0.012   0.000
    CBS (Olshen et al., 2004)   0.1    0.000   0.000   0.924   0.036   0.040
    GQSMUCE (1−α=0.55)          0.2    0.000   0.000   0.994   0.006   0.000
    CBS (Olshen et al., 2004)   0.2    0.000   0.000   0.872   0.100   0.028
    GQSMUCE (1−α=0.55)          0.3    0.012   0.248   0.772   0.018   0.000
    CBS (Olshen et al., 2004)   0.3    0.000   0.010   0.806   0.148   0.036
    †The true signals have six change points.
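    The error distribution used in display (43) can be reproduced as follows (our notation; whether the standardized errors are then multiplied by the noise level σ, as the values σ=0.1, 0.2, 0.3 suggest, is an assumption).

```python
import numpy as np

rng = np.random.default_rng(1)

def scaled_beta22_errors(size, sigma=0.2):
    """Shifted and scaled beta(2,2) errors: a beta(2,2) variable has mean 1/2
    and variance 1/20, so (B - 1/2) * sqrt(20) has zero mean and unit variance;
    multiplication by sigma is an assumed final step."""
    b = rng.beta(2.0, 2.0, size=size)
    return sigma * (b - 0.5) * np.sqrt(20.0)

eps = scaled_beta22_errors(497, sigma=0.2)
print(eps.mean(), eps.std())     # approximately 0 and sigma
```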

    More generally, we believe that multiscale methods for change point inference (or appropriately defined ‘regions of interest’ in multivariate settings) offer great potential even with more complex data‐generating mechanisms, and we await future methodological, theoretical and computational developments with interest.

    Coverage of confidence sets

    One attractive feature of SMUCE is the fact that confidence sets can be produced for ϑ. However, Table 5 shows, in the Gaussian example with unknown mean, that even when the sample size is as large as 1500 a nominal 95% confidence set has only 55% coverage; even more strikingly, a nominal 80% confidence set has 84% coverage!

    This phenomenon, where larger nominal coverage may reduce actual coverage, is caused by the choice of 1−α determining not only the nominal coverage but also urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0371, the estimated number of change points.

    As an alternative, consider the confidence set
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0372
    where q* can be chosen as suggested in Section 4, for example. We compare this approach (SMUCE2) with that proposed in the paper for the simulation setting (43), where here we take ɛi ∼ N(0, 0.05); results are presented in Table 9. As well as giving better coverage here, the new confidence sets have the reassuring property that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0373 for α′ ⩽ α.
    Table 9. Empirical coverage and probability of correctly estimating the number of change points K obtained from 500 simulations

            Results for SMUCE                 Results for SMUCE2
    α       Coverage      P{K̂=K}             Coverage      P{K̂=K}
    0.90    0.874         0.978               0.882         0.986
    0.95    0.874         0.914               0.944         0.986
    0.99    0.728         0.738               0.974         0.986

    Federico Crudu (Pontificia Universidad Católica de Valparaíso), Emilio Porcu (Universidad Federico Santa María, Valparaíso) and Moreno Bevilacqua (Universidad de Valparaíso)

    We congratulate the authors for this excellent paper, which studies the change point problem in the case of exponential family regression. Their proposal is a multiscale change point estimator that can estimate the number of change points, their locations and the value of the mean function ϑ. In addition to that, they provide confidence bands for the function ϑ and confidence intervals for the change points.

    The authors provide a thorough analysis of the problem. However, we find that some points need further consideration. First the estimator is designed to work in the case of exponential families. It would be interesting to understand how restrictive this assumption is and therefore what happens to the estimator in the case of misspecification. Second, as the estimator performs remarkably well, it would be worthwhile investigating the out‐of‐sample potential of this estimator and its performance in the case of missing data.

    A stimulating point is the extension of the proposed methodology to georeferenced data. It would probably make sense to speak about isopleth changes, so that interest would be in detecting
    • (a) the number of change isopleths of ϑ and
    • (b) the change isopleth locations and the function values (intensities) of ϑ.

    It seems that such an extension is not obvious for many reasons: for instance, we do not see any way to obtain the analogue of J(ϑ) (in space any ordering is arbitrary). This might be a good challenge for future research.

    Alessio Farcomeni (Sapienza University of Rome)

    First, I congratulate the authors on a very accurate and clearly written work. Their results are impressive, and the approach is undoubtedly useful and computationally efficient. Their contribution goes beyond what is neatly written in this paper. For instance, three major generalizations may be obtained with reasonable effort.
    • (a) Panel data: in many cases, we have independent replicates of Y (for example, we may have repeated measures on m units at n occasions). In this case the local likelihood ratio statistics would be
      urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0377
      and similar adjustments can be made to the definitions of C(q), urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0378, etc. The use of independent replicates of Y may lead to shorter confidence intervals around change point location estimates.
    • (b) Multivariate data: suppose now that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0379, with p>1, arising for instance from a p‐dimensional continuous exponential family. Computation of local likelihood ratio statistics seems to be straightforward in this case also. When p>1, it may also be interesting to derive dimension‐specific likelihood ratio statistics, in which case the choice of α may be driven by ideas taken from the multiple‐testing literature (reviewed for example in Farcomeni (1991)). Finally, issues with multivariate outliers may arise, suggesting the use of robust estimation methods (e.g. Cuesta‐Albertos et al. (2008)).
    • (c) General right continuous functions: the right continuous step function θ can be offset by any known continuous function with once again minor adjustments to SMUCE. It may be of interest to explore whether an unknown function within this class can be estimated with a performance as good as that of SMUCE. Some additional restrictions would perhaps be needed for consistency (e.g. that the oscillation within each interval is smaller than Δ). Similarly, if Xi is a vector of covariates, it would be interesting to estimate θ and β in the model
      urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0380
      where g(·) is a known link function and m⩾1.

    Piotr Fryzlewicz (London School of Economics and Political Science)

    I congratulate the authors for their thought‐provoking paper.

    Revisiting the results of Table 1, it is interesting to note that the authors exhibit the performance of SMUCE for two particular values of the tuning parameter: α=0.45 and α=0.6, the latter presumably chosen to improve the performance in the high variance setting of σ=0.3. In this note, we apply the wild binary segmentation (WBS) technique of Fryzlewicz (2007) to the same example, including a version that requires no choice of any tuning parameters on the part of the user.

    More specifically, we apply the WBS algorithm to the trend‐free signal from Table 1, with σ=0.1 and σ=0.3. The WBS algorithm uses the default value of 5000 random draws (each execution took an average of 1.20 s on a standard personal computer). In the thresholding stopping rule, we use the threshold urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0381, where urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0382 is the median absolute deviation estimator of σ suitable for independent and identically distributed Gaussian noise, n is the sample size and the constant c is selected manually. In the fully automatic stopping rule based on an information criterion, we propose a new criterion termed the ‘strengthened Schwarz information criterion’ sSIC, which can be shown to be consistent for the number and locations of change points when coupled with WBS; it works by replacing the SIC penalty of log(n) × number of change points by {log(n)}^γ × number of change points, for γ>1, where we use the default value of γ=1.01 to remain close to SIC.
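    A minimal sketch of these two ingredients is given below (our reconstruction, not the discussant's code): the median‐absolute‐deviation estimate of σ is taken from first differences, the usual choice for independent Gaussian noise around a piecewise constant signal, and the sSIC is written with a Gaussian log‐likelihood term; both of these details are assumptions.

```python
import numpy as np

def mad_sigma(y):
    """MAD estimate of sigma from first differences (Gaussian consistency
    factor 0.6745; the sqrt(2) accounts for the differencing)."""
    d = np.diff(y)
    return np.median(np.abs(d - np.median(d))) / (0.6745 * np.sqrt(2.0))

def wbs_threshold(y, c=1.3):
    """Threshold c * sigma_hat * sqrt(2 log n) for the thresholding rule."""
    n = len(y)
    return c * mad_sigma(y) * np.sqrt(2.0 * np.log(n))

def ssic(rss, n, k, gamma=1.01):
    """Strengthened SIC: Gaussian likelihood term plus a penalty of
    log(n)**gamma per change point (a sketch)."""
    return 0.5 * n * np.log(rss / n) + k * np.log(n) ** gamma

# toy usage on a single-jump signal
rng = np.random.default_rng(2)
y = np.r_[np.zeros(250), np.ones(250)] + 0.3 * rng.normal(size=500)
print(wbs_threshold(y, c=1.3), ssic(np.sum((y - y.mean())**2), len(y), k=0))
```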

    Table 10 shows the results. It is clear that SMUCE, even with an optimally chosen tuning parameter, struggles in the higher variance setting, where it is comfortably outperformed by WBS. WBS sSIC performs very well in both settings, and we emphasize again that it is a completely automatic procedure in which no tuning parameters need to be chosen. For the case σ=0.3, WBS sSIC would also have come out on top in Table 1 and, for the case σ=0.1, very close to the top.

    Table 10. Distributions of the estimated numbers of change points for SMUCE and WBS, expressed as percentages†

    Method              σ      Results (%) for the following numbers of change points:
                               4      5      6      7      8
    SMUCE, α=0.45       0.1    0      0      99     1      0
    WBS, c=1.3          0.1    0      0      97     2      1
    WBS, c=1.4          0.1    0      0      99     1      0
    WBS, c ∈ [1.5,2]    0.1    0      0      100    0      0
    WBS sSIC            0.1    0      0      97     2      1
    SMUCE, α=0.45       0.3    3      34     62     1      0
    SMUCE, α=0.6        0.3    0      10     80     9      0
    WBS, c=1.3          0.3    0      3      94     2      1
    WBS, c=1.4          0.3    2      7      90     1      0
    WBS sSIC            0.3    0      1      95     3      1
    †Results for SMUCE were taken from the paper and rounded to the nearest integer. Results for WBS are based on 100 simulation runs, with the random seed in R set to the arbitrary value of 1 before the first simulation run, for reproducibility.

    Finally, we note that the unbalanced Haar method of Fryzlewicz (2004), which was included by the authors in the same simulation study, is not a consistent change point detector. Of the three methods studied (SMUCE, unbalanced Haar wavelets and WBS), users will be best off applying the WBS method in this context.

    Oliver Linton (University of Cambridge) and Myung Hwan Seo (London School of Economics and Political Science)

    Time series analysis usually begins with some decomposition into trend, seasonal component and remainder terms. This paper focuses on a very special but important case where the trend is piecewise constant, there is no seasonal component and the remainder term is independent and identically distributed. In this case, it obtains powerful results whereby the locations of change points can be estimated freely and confidence intervals for the trend can be obtained. More generally, we may need to allow for periodic components, such as in Linton and Vogt (2005), and more general trend functions that are not necessarily constant between knot points; this may raise further issues. For instance, if the original sequence contains a continuous piecewise linear trend, then the first differenced series can be fitted into the proposed method in conjunction with the discussion on moving average MA(1) processes in Section 6.1. However, in the case of a discontinuous piecewise linear trend, more care should be exercised because of the jump points at trend breaks.

    It will also be interesting to explore the consequences when the true data‐generating process ϑ(x) is a continuous function, and we interpret the estimated step functions as the best approximation to the unknown true function in the mean‐squared‐error sense. If the number of changes in the estimation is fixed over the sample size n, cube‐root‐type asymptotics are expected as in, for example, Koo and Seo (2013). Theorem 6 seems to allude to some expected result such as consistency to a continuous function ϑ. In other words, what can be included as the limit of ϑn? Certainly, the limit will not be an element of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0383 as an infinite number of jumps cannot be allowed in a step function with bounded domain. In the meantime, it was not clear whether some results of the paper include the case of no change point.

    In economics and finance, classification of time series into different regimes is a common activity. For example, Phillips and Yu (2010) proposed statistics for classifying a price time series into a ‘rational’ part and a ‘bubble’ part. Their method is based on recursive change point testing. In the notation of this paper the function ϑ(t) is either consistent with a first‐difference stationary process or with an explosive process according to which regime the process is in. The issue is to identify the bubble regimes urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0384 from the data as well as to track lead–lag relationships in different asset prices, i.e. where did the bubble first occur? This leads to the question of whether the authors' methods can be generalized to multivariate dependent settings where the parameter ϑ(t) controls the type of non‐stationarity in the series. Another issue, for applications in finance in particular, is the presence of heavy tails, and the sort of quantile methods discussed in Section 5.4 looks promising in this regard.

    Gift Nyamundanda and Kevin Hayes (University of Limerick)

    We extend our congratulations to the authors for their paper, which presents a fresh approach to change point detection, specifically change point discovery. The provision of the probability of overestimating or underestimating the true number of change points and the uncertainty associated with the estimation of the location of these change points is a key attraction of the SMUCE method. However, a concern throughout is what the actual advantages of SMUCE are over rival approaches. Although the authors identify the modified Bayesian information criterion and circular binary segmentation as potential competitors, the literature offers many alternative change point methodologies that should be considered. For example, binary segmentation (Chen and Gupta, 1999) and the approach of minimizing a certain cost function by Killick et al. (2012) are designed to detect changes in both the mean and the variance structure. Likewise, the product partition model (PPM) due to Barry and Hartigan (2000), implemented in the R package bcp (Erdman and Emerson, 2012), detects changes in both the mean and the variance structure. Furthermore, the posterior distribution of the partition in the PPM provides the uncertainties associated with the estimated locations of change points. The prior distribution of the partition in the PPM plays a similar role to that of the parameter α in SMUCE; both control the expected number of change points through input parameters. The PPM also provides a posterior distribution on k, the number of estimated change points, which is informative regarding the overspecification or underspecification of change points in the final fit. Also, the authors did not provide a convincing argument for why SMUCE would be a better approach for detecting distributional changes that are due to changes in the variance than existing approaches.

    The choice of the threshold parameter q is extremely important since it balances data fit and parsimony of SMUCE. It is associated with the parameter α that controls the probability of overestimating the true number of change points. Choosing q depends on the interval length parameter λ and jump size parameter Δ. It would be useful if the authors could report on the effects of misspecification of these parameters (λ and Δ) on the number of change points and, in turn, on the power of detecting change points. In Section 5.6 there are results of real data analysis in which these parameters were set to λ⩾0.2 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0385, and λ⩾0.025 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0386. Perhaps a better way to choose q here would be to define a full prior distribution over λ and Δ to take into account the uncertainties associated with these two parameters.

    K. Szajowski (Wrocław University of Technology)

    Multiscale methods

    In recent years there has been a steady increase in references to modelling phenomena at several scales simultaneously. In the Web of Knowledge the number of reported articles with the phrase multiscale modelling in their titles has risen from 50 in 2001 to more than 300 in 2009. An overview of such models can be found in Horstemeyer (2012). A special interdisciplinary journal, Multiscale Modeling and Simulation, focusing on the fundamental modelling and computational principles underlying various multiscale methods, was founded by the Society for Industrial and Applied Mathematics in 2003. Multiscale modelling has become key to obtaining more precise and accurate predictive tools. This paper is a significant contribution to the communication between mathematicians and representatives of the sciences using this approach in mathematical modelling. The change point problem for such models is a new point of view in this area. However, the idea of sequential detection of multiple disorders has been considered not only for simple sequences of observations.

    Multiple‐change‐point analysis

    In various applications the modelled signal can have an unobservable parameter. If the unobservable parameter is modelled as a hidden process with rare changes of state, then the signal and the hidden parameter process together form a multiscale model with ‘disorders’ (see Bojdecki (2010)). The problem of detecting multiple disorders for processes having a Markovian structure, when the hidden parameter process is uncovered, is the subject of Szajowski (2011). It is a generalization of the results of Yoshida (2008) (see also Sarnowski and Szajowski (2012, 2005)). A short description is as follows. A random sequence whose segments are homogeneous Markov processes is registered. Each segment has its own transition probability law, and the length of the segment is unknown and random. The transition probabilities of each process are known and a joint a priori distribution of the disorder moments is given. The detection of the disorder is rarely precise. The decision maker accepts some deviation in the estimation of the disorder moment. In the models taken into account, the aim is to indicate the change points with fixed, bounded error with maximal probability.

    How the required precision changes the optimal decisions is an interesting problem. This question is considered in Ochman‐Gozdek et al. (1988). The case with different precisions for overestimation and underestimation of this point is analysed, including the situation where the disorder does not appear with positive probability. The observed sequence, when the change point is known, has Markov properties. The results explain the structure of the optimal detector in various circumstances. The problem is reformulated as optimal stopping of the observed sequences. A detailed analysis of the problem is presented to show the form of the optimal decision function.

    These results show that the optimum detection procedures are seldom easy to use. They convince me that the methods proposed by Frick and his colleagues are a very good option.

    The following contributions were received in writing after the meeting.

    John Aston (University of Cambridge) and Claudia Kirch (Karlsruhe Institute of Technology)

    We congratulate the authors on a very interesting paper which introduces a new way to examine change point settings. We would like to take the opportunity to point out several possible extensions for future work that have been successfully included in simpler at‐most‐one‐change situations.

    Frequently in statistics, procedures based on Gaussian assumptions can be used more generally and yield asymptotically correct results. This includes the Whittle likelihood for non‐normal data, to give an example outside change point analysis, but also the simple cumulative sum chart and its variations obtained as a likelihood ratio statistic for the much simpler change point problem where at most one mean change in independent and identically distributed normal random variables is expected. In the latter case, results remain true (Csörgő and Horváth (1997), theorem 1.4.2) if a strong invariance principle and some asymptotic independence assumption hold (for the actual likelihood ratio statistic), or if a functional central limit theorem holds (for simpler weight functions (Csörgő and Horváth (1997), theorem 2.1.1)). In the same spirit, extensions to misspecifications with respect to the independence between observations are possible by using not only parametric models but also weak dependence concepts under which strong invariance principles or at least functional central limit theorems hold (Csörgő and Horváth (1997), section 4.1). Typically, the variance then needs to be replaced by the long‐run variance, with all the problems arising from the difficulty of estimating this. It would be rather interesting to see whether similar results can be obtained for the proposed procedure on the basis of the normal likelihood.
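    For readers less familiar with this classical statistic, a textbook version of the cumulative sum chart for the at‐most‐one‐change‐in‐mean problem is sketched below (a generic illustration, not the multiscale statistic of the paper).

```python
import numpy as np

def cusum_statistic(y):
    """Standardised CUSUM statistic max_k |S_k - (k/n) S_n| / (sigma_hat sqrt(n))
    for testing a single change in mean of i.i.d. observations."""
    n = len(y)
    s = np.cumsum(y)
    k = np.arange(1, n + 1)
    sigma_hat = np.std(y, ddof=1)          # crude; inflated under a real change
    return np.max(np.abs(s - k / n * s[-1])) / (sigma_hat * np.sqrt(n))

rng = np.random.default_rng(3)
print(cusum_statistic(rng.normal(size=300)))                               # no change
print(cusum_statistic(np.r_[np.zeros(150), np.ones(150)] + rng.normal(size=300)))
```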

    Similarly, it would be interesting to see how to extend the proposed procedure to detect more complicated changes, e.g. in regression or parametric time series models. Similarly to the role of estimating functions in parameter estimation, corresponding functionals of the time series and the estimated parameters can transform this problem into a possibly multivariate mean problem. For example, Hušková et al. (2007) used several variations of the score function (for normal errors) to detect changes in linear auto‐regressive time series, whereas Kirch and Tadjuidje Kamgaing (2012) and Franke et al. (2012) used least squares scores in non‐linear auto‐regressive time series. In those cases, it is of interest to allow for an additional time series structure of the errors even in the parametric model as this allows for misspecification with respect to the model (Kirch and Tadjuidje Kamgaing, 2012).

    In general, it is of interest to consider the theoretical robustness of the proposed test procedures to violations of the assumptions on the underlying error distribution, model assumptions or a combination of both. Kirch and Tadjuidje Kamgaing (2014) give general regularity conditions on the truly possibly misspecified underlying time series under which the standard asymptotics for change point tests based on scores or estimating functions remain true under the null hypothesis, in addition to regularity conditions under which such misspecified tests still have asymptotic power 1.

    Jérémie Bigot (Institut Supérieur de l’Aéronautique et de l’Espace, Toulouse)

    I congratulate the authors for a stimulating paper and their innovative contribution on change point inference.

    One of the key features of SMUCE, which explains its good performance, is the use of a multiscale test statistic to estimate the number of change points of an unknown step function. Here, I briefly compare the numerical performance of SMUCE with another estimator proposed in Bigot (1993) that is also based on the combination of test statistics at various scales.

    To explain briefly the approach followed in Bigot (1993), consider the standard white noise model
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0387(44)
    where f is a piecewise smooth function, and W is standard Brownian motion. Let also ψ(x)=∂θ(x)/∂x be a wavelet with one vanishing moment where θ is a smooth function with fast decay, and define the continuous wavelet transform of Y at scale s>0 as
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0388
    A wavelet maximum is any point urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0389 in the time–scale space such that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0390 has a local maximum at x=m(s). Following Bigot (1993), a local maximum urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0391 at scale s is said to be significant if urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0392 where
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0393
    The choice of the threshold λn is derived from a testing procedure where the null hypothesis is that f has at least one jump point. To combine this testing procedure at various scales s, we consider the so‐called structural intensity Gm defined as
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0394(45)
    where urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0395 is the number of significant wavelet maxima urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0396 at scale s, [s0,s1] is a range of testing scales and C is a normalizing constant. From Fig. 16, it can be seen that, for small values of s, the significant wavelet maxima urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0397 are all near the jumps of f. Moreover, the locations of the significant modes of Gm correspond to satisfactory estimators of the jumps of f.

    We applied this method (called StructInt) to 1000 runs from model (41). For the trend‐free signal, it is clear that SMUCE performs better than StructInt, especially for σ=0.3. This is because the multiscale nature of the testing procedure in SMUCE fully exploits the assumption that the unknown function f is piecewise constant, whereas StructInt is suited to detecting jumps in a piecewise smooth signal that is not necessarily a step function. Consequently, StructInt is more robust than SMUCE to deviations from the step function model, as shown by the results in Table 11 in the case of a signal with a trend. Finally, these numerical experiments confirm the benefits of combining multiscale test statistics for change point detection of a piecewise smooth regression function.

    Fig. 16. Example of multiscale jump detection: (a) example of model (41) with no trend and σ=0.1; (b) locations of significant wavelet maxima at various scales from s0=2^−3 to s1=2^−7 (the vertical axis corresponds to the negative logarithmic scale −log2(s)); (c) structural intensity Gm of the wavelet maxima
    Table 11. Frequencies of estimated numbers of change points for SMUCE and the method based on structural intensity from Bigot (1993)†

    Method               Trend   σ      Frequencies for the following numbers of change points:
                                        4       5       6       7       8
    SMUCE (1−α=0.55)     No      0.1    0       0       0.988   0.012   0
    StructInt            No      0.1    0       0       0.970   0.029   0.001
    SMUCE (1−α=0.55)     No      0.2    0       0       0.986   0.014   0
    StructInt            No      0.2    0       0.026   0.926   0.047   0.001
    SMUCE (1−α=0.4)      No      0.3    0       0.099   0.798   0.089   0
    StructInt            No      0.3    0.132   0.476   0.380   0.012   0
    SMUCE (1−α=0.55)     Long    0.2    0       0       0.825   0.171   0.004
    StructInt            Long    0.2    0       0.033   0.931   0.036   0
    SMUCE (1−α=0.55)     Short   0.2    0       0.002   0.903   0.088   0.007
    StructInt            Short   0.2    0       0.025   0.923   0.050   0.002
    †The results for StructInt are based on 1000 simulation runs from model (41), and the results for SMUCE are taken from Table 1.

    D. S. Coad (Queen Mary University of London)

    I congratulate the authors on this fascinating paper, which provides a hybrid method based on a likelihood ratio test and a model selection step for the change point problem in exponential family regression. Asymptotic arguments are developed for maximizing the probability of correctly estimating the number of change points. The computations may be carried out efficiently by using dynamic programming. I have no doubt that the method will be generalized further.

    The underlying model (1) assumes that the data follow a one‐dimensional exponential family. Several examples of different distributions are used to compare the performance of the hybrid method with some existing methods that have been proposed in the literature. A natural extension of the method is to the multi‐dimensional case (see Lévy‐Leduc and Roueff (2005) and Xie et al. (2009)). Given that a detailed study of a suitable vector of multiscale statistics may be impractical, because, for example, of potential sparseness in the data, an alternative approach may be to consider low dimensional statistics instead. That way, it may be possible both to study the asymptotic properties of the method and to keep the necessary computations feasible.

    Although most of the paper focuses on independent data, it is briefly indicated in example 1 how the method could be extended to serially correlated data. In particular, a Gaussian moving average process of order 1 is considered. It would be interesting to know whether the arguments could be extended further to m‐dependent data and, if so, where the main challenges would lie, both theoretically and computationally. Further, since one of the conclusions from the simulations for example 1 is that the modified multiscale statistic performs much better than the unmodified version for stronger dependences, perhaps the difference between the two could be even greater for m‐dependent data. Presumably, the study of heavy‐tailed data is an open problem.

    Laurie Davies (University of Duisburg‐Essen)

    In Davies et al. (2012) the problem of minimizing the number of intervals where the regression function is monotone increasing or monotone decreasing was considered. The subject of the present paper is the simpler problem of minimizing the number of intervals where the function is constant. In the Gaussian case the precision with which the jump points can be located improves from O[{log(n)/n}^{1/5}] to the authors' urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0398, although 2(2+√2)^2 ≈ 23.3 is a more accurate constant. In Höhenrieder (2010) the case of piecewise constant functions was treated in the context of volatility estimation. The authors' approach, derived independently, shows similarities and differences. One difference is that they use a local likelihood test to segment the data, whereas the segmentation in Höhenrieder (2010) is based on partial sums and bounds of the form
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0399(46)
    for all intervals. I suspect that for large j−i+1 the local likelihood test also leads to partial sums. It requires, however, restrictions on the sizes of the intervals. Does the authors' segmentation method have any advantage in the Gaussian variance case to compensate for this?

    The method of Höhenrieder (2010) can be adapted to t‐distributions (see Banasiak (2010) and also Davies et al. (2012)) while maintaining the algorithmic complexity O(n^2). If Gaussian segmentation is applied to the Standard and Poor's data (21717 non‐zero observations), the results for the positive returns are 43 and 29 intervals for the Gauss and t7‐models respectively. The Kolmogorov distances of the residuals from the respective models are 0.048 and 0.016 respectively. For the negative returns the numbers are 50 and 26 for the Gauss and t5‐models, with distances 0.069 and 0.023. Models based on t‐distributions are clearly superior for this data set. Can the local likelihood approach be extended to t‐ and other distributions, or is it essentially restricted to exponential families?

    After segmentation the authors maximize the likelihood given the minimum number of intervals. Efficiency under the model is not the only consideration. In Höhenrieder (2010) and Davies et al. (1964) the sum Σi (Yi − θi)^2 was minimized instead of the likelihood being maximized. This explains the inferior mean integrated squared error in Table 2, the compensation being the applicability to t‐distributions, which was regarded as much more important in that context. A correct calibration of the method of Davies et al. (1964) leads to the following numbers in the results given for Davies et al. (1964) in Table 2: 0.910, 0.938, 0.973, 0.973 and 0.855.

    Farida Enikeeva and Zaid Harchaoui (Institut National de Recherche en Informatique et Automatique, Grenoble)

    It is a great pleasure for us to comment on such an excellent paper. We shall name only a few among many contributions of the paper. First, the consistency of the estimator of K is established without any assumption on the number of jumps. The paper also gives a non‐asymptotic exponential bound for the deviation of this estimator in the Gaussian case, and confidence bounds for θ(t) in a general case. In addition, the paper provides detailed explanations on implementation and calibration of the test. All in all, the paper is a substantial contribution not only to statistics but also potentially to a large range of applications.

    The paper opens up many interesting prospects; we discuss only a few.
    • (a) Can the SMUCE approach be generalized to the case of indirect observations, e.g. in the setting of Boysen et al. (2009)? It would be interesting to see corresponding convergence rates, and whether the approach would enjoy optimality properties.
    • (b) Theoretical results are presented in terms of the minimal length of the interval and minimal jump size. However, if the signal jump is sufficiently large, the change could potentially be detected on a short time interval. Conversely, if one observes a small change on a sufficiently long time interval, then it should be possible to detect the change. Is it possible to give a condition on the signal detection that would take into account the behaviour of contiguous changes?
    • (c) Another interesting extension is to multivariate or even high dimensional parameter θ. The method of Harchaoui and Lévy‐Leduc (2009) was extended to the multivariate setting in Vert and Bleakley (2010), allowing the detection of common change points in a large number of sequences for the purpose of segmentation of comparative genomic hybridization data (Picard et al., 1997). Convergence is then analysed in a double‐asymptotic framework. It would be interesting to have the authors' insights on the possibility of extending their tools to multivariate, high dimensional observations.

    On the application side, the approach could be used for genomic data obtained by next generation sequencing techniques, with several challenges to tackle. First, one then works in a high dimensional setting, because of the constantly growing number of sequences in databases. Second, genomic sequences could contain millions of observations with hundreds of change points. When applied to the raw next generation sequencing data (e.g. 454 pyrosequencing flow data), we think that SMUCE could help to solve important problems like correction of erroneous sequences or detection of single‐nucleotide polymorphisms.

    Paul Fearnhead (Lancaster University)

    I congratulate the authors on this very interesting paper. One aspect that I particularly liked about it was that it was comprehensive, in terms of developing a method for detecting change points, the associated theory for the method and efficient approaches to implementing the method in practice. Furthermore the paper did not just consider point estimates for the location of change points, but also how to construct confidence intervals for their location, and confidence bands for the underlying function. It is this latter aspect of the paper that I would like to comment on.

    One issue with constructing confidence intervals for change point problems is the difficulty in defining what such an interval should be when there is uncertainty over how many change points there are. As can be seen in Fig. 1(c), the likely location of change points can vary substantially as their number changes. The approach in the paper seems to circumvent this by working with asymptotic confidence intervals, and using the fact that the method gives a consistent estimate of the number of change points. As a result, asymptotically you know the number of change points in the data, and hence this issue vanishes. This asymptotic behaviour, however, is different from that in many real life applications where there is substantial uncertainty over the number of change points. So how meaningful are the confidence intervals in practice? How is a practitioner to interpret a figure like Fig. 1(c), particularly as the confidence level is inherently linked with the number of change points detected?

    These issues seem particularly prevalent in applications with both large numbers of observations and many change points. Would pointwise confidence bands for the underlying function be more appropriate than simultaneous confidence bands in these cases? In situations where there is substantial uncertainty about the number of change points, are there any real alternatives to Bayesian approaches (e.g. Yao (2005), Barry and Hartigan (2000), Fearnhead and Liu (2009) and Fearnhead and Vasileiou (2006)) if you want to quantify the full uncertainty of any estimates?

    Cheng‐Der Fuh and Huei‐Wen Teng (National Central University, Jhongli City)

    We congratulate Professor Frick, Professor Munk and Dr Sieling for providing a comprehensive and challenging paper for multiple‐change‐point detection. There remain interesting problems that have not been addressed or have not yet been completely solved, which deserve further consideration.
    • (a) Besides the moving average MA(1) model mentioned in the paper for dependent data, another interesting model is the hidden Markov model. We assume that Xn is a Markov chain on a finite state space D={1,…,d}, with initial distribution P(X0=x0)=πθ(x0) and transition probability urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0400. The observation Yn at time n is continuous with density function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0401 (all parameterized by θ). Let K denote the number of change points, τ1<τ2<…<τK the change points and θ1,…,θK the associated parameters. The problem of interest is that the state transition probability and the observation model conditioned on the state undergo a change from θi to θi+1 at the change point τi for i=1,…,K. In particular, the state transition from urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0402 to urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0403 is described by θi−1, whereas the transition from urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0404 to urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0405 is described by θi. Similarly, the observation model urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0406 is described by θi−1, whereas urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0407 is described by θi. This model is suitable for a setting where the observation is causally related to the state and, hence, a change in the state transition leads to a change in the observation density. In the case of K=1, the joint density of Y={Y1,…,Yn} is given as
      urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0408
      where
      urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0409
      The joint density for general K can be defined in a similar way. Another setting of change point detection is that the change point is defined as the change of states for the underlying Markov chain Xn. These models are called regime switching models when the Markov chain is recurrent or are called change point models when the Markov chain is not recurrent.
    • (b) In finance applications, the aim is to incorporate both the stock and the option prices to identify the regime of volatility. Specifically, the option price is modelled via a variance stabilizing transform  log (yij)= log {Cij(Θ)}+ɛij, where Cij(Θ) is the model price of the option with i indicating the type of the option and j indicating the strike price. Here, the error ɛij comes from market friction or model discrepancy. It is a challenging task to form change point detection using implied volatility calibrated from option prices, the volatility of the option price calculated from the error ɛij and the historical volatility of the stock price.

    It is interesting to see whether the simultaneous multiscale change point estimator proposed in the paper can be applied to these two problems. Alternatively, a Bayesian approach focuses on the posterior distribution of the parameters, in which Markov chain Monte Carlo methods can be implemented to sample the parameter of interest having the desired distribution.
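    As a concrete illustration of the single‐change (K=1) hidden Markov likelihood sketched in point (a) above, the following forward‐algorithm computation switches the transition matrix at the change point; Gaussian emissions and the exact indexing convention at the change point are assumptions made purely for illustration.

```python
import numpy as np

def hmm_change_loglik(y, A_pre, A_post, tau, means, sigma=1.0, pi0=None):
    """Log-likelihood of y under a finite-state hidden Markov model whose
    transition matrix switches from A_pre to A_post at time tau
    (scaled forward recursion; Gaussian emissions are a hypothetical choice)."""
    means = np.asarray(means, dtype=float)
    d = len(means)
    if pi0 is None:
        pi0 = np.full(d, 1.0 / d)

    def emission(t):
        return np.exp(-(y[t] - means) ** 2 / (2.0 * sigma ** 2)) / (
            np.sqrt(2.0 * np.pi) * sigma)

    alpha = pi0 * emission(0)
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(y)):
        A = A_pre if t < tau else A_post       # regime switch at the change point
        alpha = (alpha @ A) * emission(t)
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

A_pre = np.array([[0.95, 0.05], [0.05, 0.95]])
A_post = np.array([[0.50, 0.50], [0.50, 0.50]])
rng = np.random.default_rng(5)
y = rng.normal(size=200)                        # placeholder observations
print(hmm_change_loglik(y, A_pre, A_post, tau=100, means=[0.0, 2.0]))
```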

    Dario Gasbarra and Elja Arjas (University of Helsinki)

    The paper consistently follows the frequentist statistical paradigm. Thus, for example, the number K of change points is viewed as a parameter, and consideration of different values for K is interpreted as a problem of model selection. A natural alternative, particularly in ‘dynamic’ applications where the data consist of noisy measurements of a signal over time, is to think of the signal as a realization of a latent stochastic process. Then the mode of estimation changes from an optimization problem to a problem formulated in terms of probabilities. Adopting a hierarchical Bayesian approach to the problem similarly allows parameter uncertainty to be quantified probabilistically. As an illustration, we computed the posterior probabilities P(K=k|data) for the ‘no‐trend’ signal (see Fig. 4(a)), using data corrupted by Gaussian noise as in the paper. For this, we applied a slightly modified version of the Gibbs sampler algorithm for marked point processes of Arjas and Gasbarra (1994).

    Quite a vague prior distribution was assumed, with non‐informative scale invariant priors π(σ2)∝σ−2 and π(λ)∝λ−1 respectively for the variance of the Gaussian error terms and for the Poisson intensity governing the number and the positions of the change points. For the signal levels, we adopted a Gaussian urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0410 prior for the initial level, and a conditionally independent and identically distributed Gaussian prior urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0411 for the successive jumps, assuming a scale invariant non‐informative prior also for η2.

    With these model and prior specifications the posterior probability P(K=k| data) for the number of change points had its maximum value 0.41 at k=6, which was the true number used in the simulation. By constraining the jump sizes of the signal to lie above the threshold 0.15, we also obtained the constrained conditional probability P(K=6| data, constraint) = 0.78.

    The 0.95 credible interval for σ was (0.19, 0.21), the correct value being σ=0.2. Additional results, and the computer code that was used, are available from the first author.

    It would be interesting to see a systematic comparison of the performance of the various methods for solving change point problems based on the two complementary statistical paradigms.

    M. Hušková (Charles University in Prague)

    I congratulate the authors for their excellent work, which is important for applications. They consider a relatively simple model with multiple changes (independent observations; a one‐dimensional exponential family for single observations), but the paper is quite complex—it covers motivation, theoretical results, a deep discussion of computational aspects, simulations, applications to real data sets, discussions of single steps and discussions of possible extensions.

    The disadvantage is that the paper is too long (but I have no suggestion for how to shorten it), and it is difficult to extract the algorithms if one wants to apply the procedures without reading the whole paper in detail—some kind of algorithm outline or pseudocode would be useful.

    I would like to bring your attention to Antoch and Jarušková (2013), where a different model and different procedures are considered. It contains both theoretical results and algorithms with useful pseudocode.

    Jiashun Jin (Carnegie Mellon University, Pittsburgh) and Zheng Tracy Ke (Princeton University)

    We congratulate the authors on a very interesting paper. The paper sheds light on a problem of great interest, and the theory and methods developed are potentially useful in many applications.

    In Ke et al. (2009) we have investigated the problem from the variable selection perspective. Consider a linear model
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0412(47)
where X(i,j)=1{j⩽i}, 1⩽i,j⩽n, and β is a sparse vector containing a small fraction of non‐zeros (i.e. ‘jumps’). In effect, Xβ is a blockwise constant vector, so model (47) is a change point model. We are interested in identifying all jumps.
Fixing ϑ ∈ (0,1) and r>0, and letting ε_n=n^{−ϑ} and τ_n=√{2r log(n)}, we consider a ‘rare‐and‐weak’ setting where
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0413(48)
    (νa is a point mass at a). See Ke et al. (2009) for a more general case and see Donoho and Jin (2009), (Jin 2009) and Ke et al. (2009) for the subtlety of model (48). The minimax Hamming selection error is then
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0414
    We showed in Ke et al. (2009) that
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0415
    where Ln is a generic multi‐ log (n) term. This implies a watershed phenomenon (which was also found in Donoho and Jin (2009), Jin (2003), Donoho and Tsaig (1998) and Ingster et al. (2003), but in different settings), which can be captured by a so‐called phase diagram; Fig. 17.
    image
Phase diagram for the change point model: in the two‐dimensional phase space {(ϑ,r):0<ϑ<1,r>0}, the curve r=ρ0(ϑ) separates the phase space into two subregions, where ρ0(ϑ)=max{4(1−ϑ), (4−10ϑ)+2√[(2−5ϑ)2ϑ2]+}; for (ϑ,r) in the interior of the region above the curve it is possible to identify all jumps (say, by using CASE) with high probability; for (ϑ,r) in the interior of the region below the curve, 1≪Hamm*n(ϑ,r)≪n, and it is impossible to identify all jumps, but it is possible to identify most of them (say, by using CASE)
    In Ke et al. (2013), we developed a two‐stage ‘screen‐and‐clean’ procedure called CASE which achieves the minimax rate. Define an n×1 vector urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0416 by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0417 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0418 The covariance matrix of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0419 is denoted as H, which can be easily calculated by basic statistics. For tuning integers m,l⩾1 (m is small) and tuning parameters u,v,t>0, the stages are as follows.
    • (a) Screening stage: let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0420 be the set of currently retained nodes (initialize with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0421). For k=1,…,m and i=1,…,nk+1, let Aik={i,i+1,…,i+k−1}, and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0422. We update with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0423 if urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0424 (for short, A=Aik and D=Dik), and keep urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0425 unchanged otherwise; urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0426 is the subvector of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0427 restricted to indices in A (HA,A is defined similarly; see Ke et al. (2013)). Let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0428 be the set of all retained indices at the end of the screening stage.
    • (b) Cleaning stage: urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0429 partitions uniquely to B1∪…∪BN such that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0430 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0431. For each Bk, let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0432, and compute urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0433, subject to the constraint either μj=0 or urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0434, for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0435, or μj=0, for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0436.

    The CASE estimate is given by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0437, and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0438 for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0439. For simplicity, the cleaning stage is slightly different from that in Ke et al. (2013); see the details therein.
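To make the two-stage idea concrete, the following is a deliberately crude, single-scale caricature of a screen-and-clean selector. It is not the CASE procedure of Ke et al.; the differencing, the thresholds u and v, the block gap and the per-block penalty are illustrative assumptions only. Screening keeps locations where the first difference is large; cleaning re-fits a small number of jumps within each block of retained locations by penalized least squares.

```python
import numpy as np
from itertools import combinations

def screen_and_clean(y, sigma, u=1.0, v=2.5, max_jumps_per_block=2, gap=3):
    """Crude screen-and-clean sketch (not CASE): screen by large first differences,
    then clean each block of candidates by penalized least squares."""
    n = len(y)
    d = np.diff(y)                                  # a jump starting at position i shifts d[i-1]
    thresh = u * sigma * np.sqrt(2 * np.log(n))     # screening threshold (assumption)
    retained = (np.where(np.abs(d) > thresh)[0] + 1).tolist()

    # group retained candidates into blocks separated by more than `gap` positions
    blocks, current = [], []
    for i in retained:
        if current and i - current[-1] > gap:
            blocks.append(current)
            current = []
        current.append(i)
    if current:
        blocks.append(current)

    # cleaning: within each block keep the subset of jumps minimizing RSS + penalty
    selected = []
    for block in blocks:
        best, best_cost = (), np.inf
        for r in range(min(max_jumps_per_block, len(block)) + 1):
            for subset in combinations(block, r):
                cps = [0, *sorted(subset), n]
                rss = sum(np.sum((y[a:b] - y[a:b].mean()) ** 2)
                          for a, b in zip(cps[:-1], cps[1:]))
                cost = rss + v * sigma ** 2 * np.log(n) * r
                if cost < best_cost:
                    best, best_cost = subset, cost
        selected.extend(best)
    return sorted(selected)

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0, 1, 80), rng.normal(5, 1, 80), rng.normal(0, 1, 80)])
print(screen_and_clean(y, sigma=1.0))   # in this toy run, ideally jumps near 80 and 160
```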

    Rebecca Killick (Lancaster University)

    The authors are to be congratulated on a valuable contribution to the change point literature and commended for making their algorithm available within the stepR package (Hotz and Sieling, 1970). The automatic penalty selection and construction of confidence intervals for change point locations are of particular interest.

    Firstly I have a question on the automatic penalty selection. In Section 3 the authors reframe SMUCE as a dynamic program for ease of computation. By lemma 1 in Section 3 the solution of the dynamic program is also the solution of the original problem if γ is chosen larger than urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0440 The choice of γ is not mentioned in the paper as the authors instead focus on the choice of q. Given urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0441 as indicated in the paper, what γ was used in the simulations and, more importantly, what γ is preprogrammed into the stepR package?

Secondly, whereas the presentation in the paper is restricted to SMUCE, I believe that the concept of confidence intervals for the change point locations can be extended to other search algorithms, in particular PELT (Killick et al., 2012). The confidence intervals for the change point locations in SMUCE are constructed by considering all sets of solutions where the test statistic Tn(Y,ϑ) is less than the threshold q (equation (5)). In contrast, the PELT algorithm keeps all change point locations that are within the penalty value of the maximum to prune the search. The same idea of confidence could be applied to these change point locations as their test statistics are close to the maximum and are thus also likely candidates for a change point. Obviously the key question is what theory there is to support this criterion as a way of constructing a confidence interval.
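As a minimal numerical sketch of this idea (assuming, for illustration, a single change in mean and an L2 cost; the function below is our own and is not part of the changepoint package), one can retain every candidate location whose segmentation cost lies within the penalty of the optimum and read the retained set as a crude confidence region:

```python
import numpy as np

def within_penalty_candidates(y, penalty):
    """For a single change in mean with an L2 cost, return every split point whose
    segmentation cost lies within `penalty` of the optimum."""
    n = len(y)
    costs = np.full(n, np.inf)
    for t in range(1, n):                          # candidate change after position t
        left, right = y[:t], y[t:]
        costs[t] = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
    best = costs.min()
    return np.where(costs <= best + penalty)[0]

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
print(within_penalty_candidates(y, penalty=2 * np.log(len(y))))
```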

    Initial simulations using this method show desired properties such as the following:
    • (a) as you increase the penalty (i.e. increase your expected confidence in a change point) you become more uncertain about the proposed locations;
    • (b) for a given penalty value, the larger the change, the smaller the confidence interval;
    • (c) the coverage does not depend on the size of the change;
    • (d) the longer the interval between changes, the smaller the confidence interval.

    Fig. 18 gives an example of this last point by using the changepoint package (Killick and Eckley, 2014). In the simulations the coverage of both SMUCE and PELT was larger than 99% using default values. The theory behind this conjecture needs to be thoroughly treated but at least empirically this seems promising.

    image
    Simulated data with two changes in mean at 100 and 200 with (a) PELT and (b) SMUCE change point estimates and confidence intervals

    Dehan Kong and Qiang Sun (University of North Carolina at Chapel Hill)

We congratulate the authors for their thought‐provoking and fascinating work on a challenging topic in multiscale change point inference. They introduce a new estimator, the simultaneous multiscale change point estimator SMUCE, for the change point problem in exponential family regression, which can provide honest confidence sets for the unknown step function and its change points and achieves the optimal detection rate of vanishing signals as n→∞. This work is a substantial contribution to multiscale change point problems by providing a solid inference tool.

However, the SMUCE method depends highly on the choice of the tuning parameter α, or q equivalently. The authors propose to select q by maximizing g(q)=1−α(q)−β(q,η,λ), the lower bound of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0442. This only works properly if max g(q) is close to 1, which happens when n is fairly large. It needs to be noted that, in general, the maximizer of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0443 and g(q) can be quite different. Thus the finite‐sample performance is not guaranteed to be optimal. An illustrative example is to take urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0444 whereas urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0445. This phenomenon is also illustrated by the simulation results of the paper. In Section 5.1, the authors obtain an optimal q=q1−α(M) with 1−α=0.55, where q1−α(M) is the (1−α)‐quantile of the distribution of M. This result indicates that max g(q)⩽0.55. However, from the simulation setting where σ=0.3, we can see that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0446 when 1−α=0.55, and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0447 when 1−α=0.4. The maximizer of f(q) should be quite different from q=q0.55(M).

    Moreover, we note that the selection of q or α does not depend on the error standard deviation σ, which results in the same α as chosen in Section 5.1. However, it is expected that, when σ increases, we would tend to underestimate the number of change points by using the same α, which is illustrated by the simulation results in Section 5.1. A better tuning method is needed.

    In conclusion, the tuning method proposed is not adapted to σ and not optimal given limited sample size. As a remedy, we suggest the use of cross‐validation for tuning purposes.
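A minimal sketch of what such a cross-validation could look like is given below. The interleaved odd–even split, the grid of thresholds and the stand-in segmentation (a simple CUSUM binary segmentation, not SMUCE) are all illustrative assumptions.

```python
import numpy as np

def fit_step(y, q):
    """Stand-in segmentation (NOT SMUCE): recursive binary segmentation that splits
    whenever the maximal CUSUM statistic exceeds the threshold q."""
    n = len(y)
    if n < 2:
        return y.copy()
    t = np.arange(1, n)
    s = np.cumsum(y)
    cusum = np.abs(s[:-1] - t * s[-1] / n) / np.sqrt(t * (n - t) / n)
    k = int(cusum.argmax())
    if cusum[k] <= q:
        return np.full(n, y.mean())
    return np.concatenate([fit_step(y[:k + 1], q), fit_step(y[k + 1:], q)])

def cv_choose_q(y, grid):
    """Interleaved cross-validation: fit on the odd-indexed points, predict the
    even-indexed points by the nearest fitted value, and minimize the test error."""
    train, test = y[1::2], y[0::2]
    errors = []
    for q in grid:
        fitted = fit_step(train, q)
        idx = np.minimum(np.arange(len(test)), len(fitted) - 1)
        errors.append(np.mean((test - fitted[idx]) ** 2))
    return grid[int(np.argmin(errors))]

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 2, 150), rng.normal(2, 2, 150)])
print(cv_choose_q(y, grid=np.linspace(1.0, 4.0, 13)))
```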

    Han Liu (Princeton University)

    I congratulate the authors for making an important contribution to the problem of change point inference. I believe that their method will have many interesting extensions beyond what has been presented in the paper.

In what follows, we consider the change point inference problem in a slightly different varying‐coefficient generalized linear model and construct a SMUCE‐type estimator based on local composite likelihood. Unlike SMUCE, this estimator does not depend on the explicit parametric form of the exponential family distribution.

    A varying‐coefficient generalized linear model

    Let {Fθ}θ ∈ Θ be the natural exponential distribution family with densities fθ (if the sample Y is continuous, we use Lebesgue measure as reference measure; if Y is discrete, we use counting measure as reference measure). fθ can be represented as
f_θ(x) = h(x) exp{θx − A(θ)}, with A(θ) the log‐normalizing constant, (49)
where h(x) is a positive function that does not depend on θ. Setting h(x)=1/x!, h(x)=exp(−x) and h(x)=exp{−x²/(2σ²)}, we obtain the Poisson, exponential and Gaussian distributions respectively. To make model (49) identifiable, we require h(·) to be a density function, i.e.
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0449
    For change point inference, we consider a varying‐coefficient generalized linear model
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0450
    where ϑ: [0,1)→Θ is a right continuous step function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0451 defined in the paper. Our goal is to infer the change points τ1,…,τK.

    A local composite likelihood ratio statistic

    Under model (49), the local likelihood ratio statistic urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0452 exploited by SMUCE is
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0453
    which requires knowledge of h(·). Here we propose an alternative local composite likelihood ratio statistic which is invariant to h(·) within the whole natural exponential family.
    Our method is based on a pairwise conditioning argument used in Ning and Liu (2011). More specifically, let
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0454
    the conditional density of (Yi, Yj) given their order statistics (Y(1), Y(2)) is
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0455
    We see that the unknown function h(·) is eliminated by the conditioning procedure.
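One way to see the cancellation explicitly (our own computation, written for the representation (49) and allowing the two observations to carry possibly different natural parameters θ_i and θ_j) is
\[
P\bigl\{Y_i=y_{(1)},\,Y_j=y_{(2)}\mid (Y_{(1)},Y_{(2)})=(y_{(1)},y_{(2)})\bigr\}
=\frac{f_{\theta_i}(y_{(1)})\,f_{\theta_j}(y_{(2)})}{f_{\theta_i}(y_{(1)})\,f_{\theta_j}(y_{(2)})+f_{\theta_i}(y_{(2)})\,f_{\theta_j}(y_{(1)})}
=\frac{1}{1+\exp\{(\theta_i-\theta_j)(y_{(2)}-y_{(1)})\}},
\]
since both h(·) and the normalizing constants cancel from numerator and denominator.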
    On a discrete interval [i/n, j/n] such that ϑ is constant with value θ, we have
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0456
    where
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0457
    Let urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0458. Multiplying the density functions for all possible combinations of pairs that fall in the interval [i/n,j/n], we define the local composite likelihood ratio statistic as
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0459
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0460 is a function of U‐statistics and can be plugged into SMUCE for change point inference. It is interesting to understand its theoretical properties.

    Robert Maidstone and Benjamin Pickering (Lancaster University)

    We congratulate the authors on a stimulating contribution to the literature. Our discussion explores the performance of SMUCE in the case of large data sets with an increasing number of change points.

Examples of ‘big data’ are becoming increasingly frequent in industrial applications. It is important that any novel change point detection method can function efficiently in these cases. To assess the effectiveness of SMUCE in such scenarios we considered detecting changes in mean for Gaussian data. We examined data sets of varying size, from n=2000 to n=100000, containing m equidistant change points, where m ∈ {10, ⌊√n⌋, n/10}. Change point estimates were computed by using three competing methods: SMUCE, PELT (Killick et al., 2012) and binary segmentation (Scott and Knott, 1974). Computational times were also recorded. We implemented the methods by using the default values in the stepR and changepoint (Killick and Eckley, 2012) packages.

The average computational times for each method are displayed in Fig. 19. Where the number of change points is kept constant, SMUCE is considerably slower than the other two methods and its computation time is distinctly non‐linear in n. For the other two cases, where the number of change points increases with n, the computational performance of SMUCE improves. The empirical simulations illustrate that its computation time is then at least nearly linear, though SMUCE remains significantly slower than PELT. This reinforces the authors' comments that SMUCE cannot be proved to be urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0461 in computation. To reduce computational time the stepR package approximates the SMUCE algorithm by considering only intervals of dyadic lengths.

    image
    Computational time of SMUCE (image), PELT (image) and binary segmentation (image), averaged over 50 replications: (a) 10 change points; (b) ⌊√n⌋ change points; (c) n/10 change points

The quality of the change point estimates was evaluated by calculating the V‐measure (Rosenberg and Hirschberg, 2007) for each method in each scenario. This is a measure on [0, 1] of the similarity between two different segmentations of a data set. A larger V‐measure indicates a more accurate segmentation of the data, with a V‐measure of 1 indicating perfect segmentation. Fig. 20 compares the average V‐measures of the three methods as n increases. Although the accuracy of SMUCE surpasses that of binary segmentation, it is outperformed by PELT for scenarios where the change point densities are urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0462 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0463, significantly so for the latter. In addition, the accuracy of SMUCE appears to decrease as the density of change points increases. These features are due to the approximation made in the stepR package.
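To make the comparison metric concrete, a V-measure between two segmentations can be computed by converting each set of change points into per-position segment labels; the following minimal sketch uses scikit-learn's v_measure_score with arbitrary example change points.

```python
import numpy as np
from sklearn.metrics import v_measure_score

def segmentation_labels(n, change_points):
    """Label each of the n positions by the segment it falls in."""
    labels = np.zeros(n, dtype=int)
    for cp in sorted(change_points):
        labels[cp:] += 1
    return labels

true_cps = [100, 200]      # illustrative values only
est_cps = [98, 205]
print(v_measure_score(segmentation_labels(300, true_cps),
                      segmentation_labels(300, est_cps)))
```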

    Our simulations suggest that SMUCE begins to experience difficulties as the density of change points increases with n, as a consequence of the approximations made. We would be interested to know whether an exact implementation is possible without compromising computational cost.

    image
    V‐measures of the segmentations produced by SMUCE (image), PELT (image) and binary segmentation (image) as n increases, averaged over 50 replications: (a) 10 change points; (b) ⌊√n⌋ change points; (c) n/10 change points

    Jorge Mateu (University Jaume I, Castellón)

    The authors are to be congratulated on a valuable and thought‐provoking contribution on the change point problem. As they note, this problem can be encountered in a wide variety of practical examples and scientific fields, incorporating and sharing ideas from multiscale methods, dynamic programming and exponential families. I would like to comment on the problem of dependent data as appears in Section 6.

    Gibbs point processes are applied to model point patterns with interacting objects (Ripley, 1973; Stoyan and Stoyan, 1975; Stoyan et al., 1995). Often, available data contain information about both locations and characteristics of objects, and interactions are expected to depend on both of them. Therefore, a suitable model for the data is a marked Gibbs point process, where the marks are the quantities connected to the points.

However, marked point processes are quite difficult to deal with, particularly in terms of simulation and estimation. The likelihood function usually depends on an unknown scaling factor which is intractable, preventing direct use of the maximum likelihood estimation method. A more feasible possibility is to apply computer‐intensive Markov chain Monte Carlo maximum likelihood, which simulates ergodic Markov chains whose equilibrium distributions belong to the model. Simulation procedures for marked Gibbs processes are also based on running a sufficiently long Markov chain until it reaches equilibrium, i.e. until the Markov chain has reached the point process distribution. Markov chain Monte Carlo methods have become increasingly popular in statistical computation. However, iterative simulation presents one problem beyond those of traditional statistical methods: one must decide when to stop the Markov chain or, more precisely, one must judge how close the algorithm is to convergence after a finite number of iterations. Classical methods for monitoring convergence are not fully implemented.

One strategy is to compute the energy marking (Stoyan et al., 1995) for each simulated point pattern obtained after each iteration of the Markov chain. This provides a collection of serially dependent data in the form of a kind of time series. Convergence of this time series to equilibrium is essential. Analysing multiscale change points in these serially dependent data is directly related to checking for equilibrium. Thus we claim that an adaptation of the SMUCE method to detect change points (how many and where) along the Markov chain of the energy marking is worth trying, as a statistical tool for investigating the convergence of such chains in marked point processes.

    Dao Nguyen (University of Michigan, Ann Arbor)

I congratulate the authors on their contribution in an interesting and stimulating paper. One of the most challenging problems in change point inference is finding the number K of change points and treating it formally in applications. In this regard, the authors combine likelihood ratio and penalized likelihood approaches, calibrating the scales at level α and casting the choice of K as a maximization under multiscale constraints. One of the strong points of this approach is that, by using penalized likelihood, it allows (at least in theory) a large class of functions to be treated simultaneously. However, it also introduces extra arbitrariness and the optimization generally appears slow and time consuming. From my perspective, it will be important to make sure that the method competes effectively with other (non‐penalized) approaches. Using the same setting for the comparative genomic hybridization data set as in the paper, Table 12 shows some simple empirical comparisons of the SMUCE approach with CumSeg of Muggeo and Adelfio (2000), UnbaHaar of Fryzlewicz (2004) and circular binary segmentation, CBS, of Olshen et al. (2004).

    Table 12. Empirical comparison on 1000 Monte Carlo simulations
Method      Trend   σ     Numbers of estimated change points     Mean‐squared   Mean absolute   Time (s)
                          ⩽5      6       7       ⩾8             error          error
SMUCE       No      0.1   0       951     49      0              0.000761       0.4492          493
CumSeg      No      0.1   0       960     36      4              0.1543         0.780           105
UnbaHaar    No      0.1   0       857     39      104            0.000506       0.455           338
CBS         No      0.1   0       825     111     64             0.000828       0.4293          94

    From Table 12, I have no reason to prefer SMUCE to the other approaches when computational power and time are critical. I suggest that further considerations about the computational performance of this approach should be taken into account.

Another appealing feature of the method is the ability to derive error bounds for the change point locations through the selection of a threshold q. However, the selection strategy is rather ad hoc in the sense that it is based on the prior and on pilot Monte Carlo simulations. Indeed, the motivation for the choice of the parameter q is not convincing to me, as it assumes simple Gaussian observations and extrapolates to other models only on the basis of empirical evidence. It is not even clear why values of q that maximize a minorant of the function of interest, rather than the function itself, should be useful. To become of more general interest, the choices of the threshold q, of the priors for λ and η, and of the class of functions f used in the maximization of the likelihood will need to be further justified.

I recognize the beauty of this hybrid approach, and, although it seems to achieve good results in terms of accuracy, generalization of this approach is not straightforward. It would be interesting to investigate its performance for missing data or longitudinal data. In my opinion, the approach pursued by the authors looks promising and worthy of further exploration.

    Richard Nickl (University of Cambridge)

    The theory of non‐parametric inference is intricate and not as straightforward as in the classical ‘parametric’, finite dimensional, situation. This is particularly true for non‐parametric confidence sets for ‘adaptive–automatic’ procedures such as the methods proposed here. In Section 1.2 the authors briefly discuss this matter and reference to Donoho (2001) is made, who showed that for certain functionals honest frequentist inference is impossible if the model is ‘strongly’ non‐parametric in the sense that no qualitative assumptions on the underlying probability measures are made. This has long been known and is even true for estimating the simplest mean functional Ep(X) (Bahadur and Savage, 1956). These negative results apply also to the non‐parametric change point detection context of this paper, but perhaps not in a specifically relevant way: the confidence bounds derived by the authors are valid only for step functions that have minimal
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0464(50)
    where n is sample size. In other words the method proposed only works if one excludes functions from consideration that have ‘small jumps on small intervals’, where ‘small’ is quantified by condition (50). Moreover, for ‘uniform coverage’ (honesty) of the confidence sets proposed we require a uniform constant C in this bound so that the ‘limit set’ of the step functions urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0465 covered for a fixed sample size n is, as n→∞, actually not strongly non‐parametric in the sense of Donoho (2001), whose results hence say nothing in particular about necessity of the condition (50) employed by the present authors.
    Even for constrained non‐parametric models like the one considered here it was shown by Low (2005) that in some situations honest inference comes at the price of ‘worst‐case performance’ of the procedure. More recently ‘detection boundaries’ have been derived which give necessary and sufficient conditions under which honest optimal confidence sets exist in function estimation or high dimensional regression; see Giné and Nickl (2007), Hoffmann and Nickl (2008), Bull (1994), Nickl and van de Geer (2007) and also Szabo et al. (1986) for the Bayesian setting. Translated into the present setting one can thus ask whether the condition
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0466(51)
    employed by the authors is the minimal condition for the existence of honest inference procedures on change points, or whether it can be essentially weakened. One could tackle the problem for instance by considering the hypothesis testing problem
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0467

    Comparing the likelihood ratio between a uniform zero signal and a small perturbation, and using lower bound arguments as in Hoffmann and Nickl (2008), one can derive a lower bound in the sense that for Δ, λ too small no test can distinguish between these two hypotheses. This would give a sound theoretical approach to clarify whether improvements on the authors' method are still to be expected.
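To indicate the flavour of such a lower bound in the simplest case (our own back-of-the-envelope computation for Gaussian errors with variance σ² and a single bump of height Δ on an interval I of length λ; the genuine argument in Hoffmann and Nickl (2008) uses a richer family of alternatives), the log-likelihood ratio satisfies
\[
\log\frac{\mathrm d P_{\Delta,\lambda}}{\mathrm d P_{0}}(Y)
 \;=\; \frac{\Delta}{\sigma^{2}}\sum_{i/n\in I} Y_i \;-\; \frac{n\lambda\,\Delta^{2}}{2\sigma^{2}}
 \;\sim\; N\!\Bigl(-\frac{n\lambda\,\Delta^{2}}{2\sigma^{2}},\,\frac{n\lambda\,\Delta^{2}}{\sigma^{2}}\Bigr)
 \quad\text{under }P_{0},
\]
so that, whenever nλΔ²/σ² remains bounded, the two measures are mutually contiguous and no test can separate the hypotheses consistently. The question is then by how large a (logarithmic) factor condition (51) exceeds this elementary two-point requirement.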

Rui Song (North Carolina State University, Raleigh), and Michael R. Kosorok and Jason P. Fine (University of North Carolina, Chapel Hill)

The authors are to be congratulated for their thoughtful paper on multiscale change point inference. Hypothesis testing for change point models is notoriously challenging, since, under the null hypothesis, the location of the change point (or change points) is no longer identifiable. The asymptotic properties of the classical likelihood ratio, Wald and score tests are thus non‐standard. Furthermore, likelihood ratio tests do not have the Bahadur efficiency property since the regularity conditions no longer hold. Constructing optimal tests for the existence of the change point is thus rather demanding. In the paper, the authors have derived a test having ‘optimal’ asymptotic power, as stated in theorem 5. The current definition of optimality appears to be somewhat different from the usual definition. It would be worthwhile to discuss whether the proposed optimal testing procedure has any optimality properties in the traditional sense, such as the optimality illustrated in Song et al. (2013).

    The change point problems discussed in this paper deal primarily with univariate random variables. In regression settings, such as in multiple linear regression, the regression coefficients may change with the value of another variable, which we call a change point based on thresholding a covariate. This set‐up was considered in detail in chapter 14 of Kosorok (2012). The extension of the proposed multiscale inference to such change point models remains unclear but would be quite useful in expanding the scope of application. We would appreciate comments from the authors on the difficulties in such extensions.

    In Section 1.6, the authors claimed that ‘our analysis also provides an interface for incorporating a priori information on the true signal into the estimator’. We would like to make a connection with Song et al. (2013), where an asymptotically optimal likelihood ratio test for the existence of the change point was derived employing a prior distribution for the change point. In particular, the weighted average power for the proposed test with respect to the specified prior was shown to be optimal. It would be interesting to explore the theoretical advantages in incorporating a priori information in multiscale change point inference and to compare the resulting inferences with those obtained by using classical approaches like those in Song et al. (2013).

    Alexandre B. Tsybakov (Centre de Recherche en Economie et Statistique–Ecole Nationale de la Statistique et de l’Administration Economique, Malakoff)

The paper by Klaus Frick, Axel Munk and Hannes Sieling provides a very interesting and thought‐provoking contribution demonstrating the power of the multiscale approach in statistics. The method suggested (SMUCE) shows excellent performance in simulations and the authors provide an original way of constructing confidence intervals. The theory developed in the paper focuses on the correct estimation of the number of change points. However, it remains unclear whether SMUCE achieves correct model selection (recovery of the set of change points), i.e. do we have urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0468? At the same time, correct model selection is granted for simple procedures such as thresholding of Y_i−Y_{i−1} or related techniques such as the fused lasso. The conditions on the jump size Δ and on the distance between jumps λ are crucial. Assuming for simplicity Gaussian observations, for correct model selection by these simple methods it is enough to have Δ⩾c√{log(n)/n} for some constant c>0, and no condition is needed on λ. This seems not to be so for SMUCE. For example, if Δ∼σ√{log(n)/n} the underestimation bound of theorem 2 is meaningful only for large λ: not for λ comparable with log(n)^q/n for q>0. Also, in the simulations, the situation is very favourable; the vector of differences θ_i−θ_{i−1} is extremely sparse (λ is large) and Δ∼σ. In view of this, it is not clear what the remark in Section 6.3, that SMUCE employs a weaker notion of sparsity (than the l1‐based methods), i.e. s=n, means. Since the question here is about sparsity of the differences θ_i−θ_{i−1}, the condition s=n is equivalent to λ=1/n, which is prohibited by the theory in the paper. In contrast, for the l1‐based methods, correct model selection is possible for any s, including s=n (consider, for example, soft thresholding of Y_i−Y_{i−1}).

However, the simulations in the paper, as well as those in Rigollet and Tsybakov (2012), lead to the conclusion that fused lasso techniques are not very efficient for piecewise constant signals. Moreover, in the study of Rigollet and Tsybakov (2012), the fused lasso turns out to be the least efficient among several fused sparsity methods and the leadership goes to the fused exponentially weighted (EW) aggregate. The regression setting in Rigollet and Tsybakov (2012) covers the Gaussian case of the paper being discussed. Recently, Arias‐Castro and Lounici (2012) proved that the EW aggregate achieves correct model selection under much weaker assumptions than the lasso. This confirms the striking improvement over the lasso that is observed for the EW techniques in simulations (Rigollet and Tsybakov, 2012). Therefore, it would be interesting to compare, in terms of model selection performance, SMUCE with the fused EW aggregate, and not only with the fused lasso.

    Guenther Walther (Stanford University)

The multiscale test that is considered in the paper requires the computation of local test statistics on all intervals [i/n,j/n] for 1⩽i⩽j⩽n. A straightforward implementation will thus result in an O(n²) algorithm, making computation infeasible for problems of moderate size. The authors address this by introducing methodology based on dynamic programming, which results in an improved computation time in many cases. Alternatively, it may be promising to evaluate the statistic on only a sparse approximating set of intervals. Rivera and Walther (2004) introduced such an approximating set for the closely related problem of constructing a multiscale likelihood ratio statistic for densities and intensities. The idea is that, after considering an interval [i/n,j/n] with large j−i, not much is gained by also looking at, say, [i/n,(j+1)/n]. It turns out that it is possible to construct an approximating set of intervals that is sufficiently sparse to allow computation in O{n log(n)} time, but which is still sufficiently rich to allow optimal detection of jumps.
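A minimal sketch of such a sparse system (our own simplified construction, loosely in the spirit of Rivera and Walther: interval lengths on a geometric grid, left end points on a grid whose spacing is proportional to the length; the constants are arbitrary):

```python
import numpy as np

def sparse_intervals(n, delta=0.25):
    """Sparse approximating system of intervals on {0, ..., n}: lengths grow
    geometrically and, at each scale, left end points are spaced proportionally
    to the length."""
    intervals = set()
    length = 2
    while length <= n:
        step = max(1, int(delta * length))      # spacing proportional to the scale
        for start in range(0, n - length + 1, step):
            intervals.add((start, start + length))
        length = int(np.ceil(length * (1 + delta)))
    return sorted(intervals)

ivs = sparse_intervals(10_000)
print(len(ivs))    # far fewer intervals than the roughly n**2/2 of the full system
```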

Another advantage in employing such an approximating set is that it considerably simplifies the theoretical treatment of the multiscale statistic. It is notoriously difficult to establish theoretical results about the null distribution of the multiscale statistic such as theorem 1. Even if one is not interested in the limiting distribution because the critical value is obtained by simulation (an option which is made feasible by the O{n log (n)} algorithm described above!), it is still necessary to show that the null distribution is Op(1) to establish optimality results such as theorem 5. The standard method of proof introduced in Dümbgen and Spokoiny (2001) requires establishing two exponential inequalities: first, one needs to establish sub‐Gaussian tails for the local statistic, which is often straightforward. The second exponential inequality, however, concerns the change between local test statistics, and this inequality is often very difficult to derive. Rivera and Walther (2004) showed that, if one employs a sparse approximating set, then the Op(1) result follows directly from the sub‐Gaussian tail property of the local statistics, which is typically easy to obtain.

    It appears that sparse approximating sets can similarly be constructed for relevant multivariate problems; see Walther (2000). The advantages of these sets for computation as well as theoretical analysis suggest that these approximating sets may play an important role for these types of problem in general.

    Chun Yip Yau (Chinese University of Hong Kong)

    This paper presents important advances in change point analysis. The ingenious use of the minimization constraint Tn(Y,ϑ)⩽q provides a solution for the construction of simultaneous confidence intervals for change points. We have the following points to make about this interesting paper.
    • (a) In this paper the multiscale statistic urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0469 is considered. By showing that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0470 is close to urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0471, the asymptotic distribution of the test statistics and large deviation bounds are derived (see lemmas 7.2 and 7.5 of the on‐line supplement). Since the sample mean urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0472 is a sufficient statistic for θ under the assumed one‐dimensional exponential family model, we may alternatively define the test statistics as
      urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0473(52)
Defining the test statistics by equation (52) is likely to have the following advantages.
      1. It is computationally more efficient as it avoids maximizing the likelihood function required in computing urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0474.
      2. It has better approximation to the null distribution and has sharper large deviation probability bounds, and hence more accurate confidence bands.
      3. It can be readily extended to dependent data by a proper standardization of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0475. The results in Lai et al. (2004) about large deviation of self‐normalized processes may be useful in establishing theoretical results analogous to the current method.
    • (b) The widely used likelihood ratio statistic
      urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0476
      seems to be more directly addressing the change point problem. Would it give more efficient results if the test statistic is defined in terms of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0477 rather than urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0478? The computational burden using urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0479 and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0480 is the same since both reduce to the evaluation of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0481. Also, the probabilistic properties of urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0482 have been well studied (e.g. Csörgő and Horváth (2000)).
    • (c) The simultaneous confidence band is conservative as the large deviation probability bounds are not sharp. In view of theorem 4, the accuracy appears to be decreasing with the number of change points. Therefore, if we are interested only in constructing confidence intervals for the change points, can we achieve higher accuracy by doing it locally for each of the change points separately? That is, once the set of estimated change points urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0483 has been obtained, the procedure in Section 3.2 is applied on urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0484 to obtain a confidence interval for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0485. As the estimated change points are independent of each other, a simultaneous confidence set for all change points can be calibrated from the urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0486 intervals.
• (d) Lemma 1 requires the penalty γ to be of order O{n² log(n)}, which is much higher than the order of some common penalties such as the Bayes information criterion's O{log(n)} and the Akaike information criterion's O(1), and is even greater than the cost function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0487 of order O(n). Why is such a strong penalty required?

    Zhou Zhou (University of Toronto)

    I congratulate Professor Frick, Professor Munk and Dr Sieling for this very stimulating paper. They propose to estimate a piecewise constant trend function via minimizing the model complexity subject to constraints on a multiscale goodness‐of‐fit statistic Tn. It is interesting that the tuning parameter q in the estimation is directly related to an upper bound for the probability of overestimating the number of change points. For independent observations belonging to a one‐dimensional exponential family, the authors comprehensively studied the properties of the change point estimator proposed, many of which are optimal under various criteria.

Here I shall discuss the possible effects on SMUCE of heteroscedasticity and of changes in nuisance parameters or higher order structures of the data. The rationale is that, when the data‐generating mechanism of a system changes, often multiple parameters of the system will change simultaneously. For instance, changes in the mean are often accompanied by changes in the variability. In the photo‐emission spectroscopy example discussed in the paper, in addition to changes in the marginal intensities, the autocovariance structure may change, which is actually scientifically interesting. To facilitate discussion, consider the following model for the observed data Yi:
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0488(53)
where ti=i/n. Here ϑ(·) is the primary parameter function whose changes are of interest, θ(·) are urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0489‐valued nuisance functions whose behaviours are not of interest in the current study and {Hi} is a stochastic process which depicts the temporal copula structure of the data and does not influence the marginal distributions. In particular, the Hi are assumed to be marginally uniform(0,1) distributed. A natural question to ask is: will SMUCE be robust to changes in θ(·) or {Hi}? For instance, in the GBM31 data studied in the paper, it seems that the suspected change point at around observation 700 for α⩾0.4 (Fig. 11) in fact depicts a decrease in the marginal variance of the sequence. Hence, if the answer to the latter question is negative, it is then of interest to know whether there is a modification of SMUCE which makes it robust to changes in the nuisance parameters and/or higher order structures of the series.

    The authors replied later, in writing, as follows.

    First, we thank all the discussants for their numerous inspiring comments, suggestions and thought‐provoking questions. Many of these comments pave the way to extensions of SMUCE reaching far beyond initial expectations and will open up new research in this area.

    We shall not be able to address all the comments in detail in this rejoinder. Indeed, some are quite challenging and deserve a more thorough analysis. In what follows, we primarily focus on those issues which we identify as common themes of the contributions.

    In our presentation, we confined ourselves to independent observations from a one‐dimensional exponential family to present the basic ideas as clearly and concisely as possible. We agree that extensions to more general models are necessary and important and we are grateful for the many fruitful suggestions. Some of them will be addressed in what follows, some have already been elaborated on by discussants and some seem to be very challenging to us and are under way.

    Other distributions and statistics

    As pointed out by many discussants (Crudu, Porcu and Bevilacqua, Davies, Hušková, Kovac, and Linton and Seo) extensions of SMUCE to other distributions are possible, e.g. under Cramér‐type conditions on the moment‐generating function. To control the overestimation error an exponential deviation bound is required for SMUCE and many tools are available to achieve this.

    However, a word of caution must be stated. Not all results carry over in a straightforward manner. A substantial difficulty for a thorough theory for the scale‐calibrated multiscale statistic Tn is the uniformity of a distributional bound for its limiting null distribution. As also pointed out by Walther, this can become quite involved and he offers an interesting simplification, which will be addressed later in more detail. The multiscale statistic Tn that is used in SMUCE balances all scales simultaneously and hence achieves good overall performance (see Chan and Walther (2013)). This requires mathematically quite a different treatment from the uncalibrated statistic (which differs from Tn by a log‐calibration term; see equation (3) in the paper)
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0490(54)
as considered by Davies and Kovac in their comments. For example, with Gaussian errors urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0491 is completely determined by its small‐scale behaviour. More precisely, only terms of scales of the order j−i∼log(n) matter for its Gumbel extreme value limit (Kabluchko and Munk, 1994). Further, asymptotically τ_{k+1}−τ_k does not enter as it does for the distributional limit of SMUCE. One may argue that large scales need not be taken into account in statistic (54), as they do not contribute to the maximum with high probability.

Chen, Shah and Samworth, and Yao sketch an interesting and quite general approach to extend SMUCE beyond exponential families by using the (pseudo‐)Gaussian likelihood in a generalized additive noise model. This offers a variety of possibilities and simplifies theory while allowing us to control computational cost. However, when it comes to statistical efficiency, it must be checked carefully how estimation accuracy and coverage of the associated confidence set are affected. The simulation in Section 5.3 shows the benefit of using the exact model‐based likelihood ratios (instead of Gaussian surrogates) for Poisson observations. In particular, in low count regions of the signal (where the normal approximation fails) change points will not be identified well, which results in a systematic underestimation of the number of change points (see Table 3 in the paper).

    So far, SMUCE has been extended to heavy‐tailed distributions by us in the context of quantile regression in Section 5.4. Addressing Aston and Kirch, and Coad's comments we mention that this provides a general and robust modification of SMUCE as it transforms the problem into a Bernoulli regression problem. However, we agree with Linton and Seo that it is worth developing more sophisticated theory and test statistics, e.g. specifically tailored to the tail behaviour of the error. In particular, for applications in finance, this might be useful as pointed out by Davies, and Linton and Seo.

    Dependent data

    A challenging and practically important task is to extend SMUCE to dependent data as discussed by Aston and Kirch, Coad, Linton and Seo, Mateu, and Yau among others. Indeed, there seem to be several possibilities to extend SMUCE to dependent data. Statistically most efficient appears to be an imitation of the likelihood approach as advocated in the present paper. However, this might be practically limited for several reasons, e.g. computational, or by the difficulty in simultaneously estimating the dependence structure and the piecewise constant signal.

A simple modification for dependent data is based on standardizing each local likelihood ratio statistic urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0492 by its variance as illustrated for the case of a moving average MA(1) process in Section 6.1.1. Coad pointed towards m‐dependence. In fact, this has been elaborated in parallel to this paper in the context of estimating the open and closed states from ion channel recordings in Hotz et al. (1970). The method that was presented there requires knowledge of m and a reliable estimate of the dependence structure. A simple bound is given which shows how to recalibrate the parameter q to control the overestimation error α. A particular appeal of this ad hoc modification might be that the computation of the estimate is essentially the same as for SMUCE with independent observations. However, a thorough theory and implementation of more efficient methods for dependent data, e.g. likelihood‐based multiscale statistics, have not been explored yet and seem an interesting challenge for the future.
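As a minimal sketch of this variance standardization (our own illustration; it assumes the MA(1) parameters are known or have been estimated beforehand, as in the recalibration just described):

```python
import numpy as np

def local_z_ma1(y, i, j, theta0, a, sigma):
    """Local mean statistic on {i, ..., j}, standardized for MA(1) errors
    e_t = xi_t + a * xi_{t-1} with Var(xi_t) = sigma**2."""
    seg = y[i:j + 1]
    ell = len(seg)
    # Var(sum of ell consecutive MA(1) errors) = sigma^2 * (ell*(1 + a^2) + 2*(ell - 1)*a)
    var_sum = sigma ** 2 * (ell * (1 + a ** 2) + 2 * (ell - 1) * a)
    return (seg.sum() - ell * theta0) / np.sqrt(var_sum)

rng = np.random.default_rng(4)
xi = rng.normal(0, 1, 501)
y = xi[1:] + 0.5 * xi[:-1]          # MA(1) noise around theta0 = 0
print(local_z_ma1(y, 100, 200, theta0=0.0, a=0.5, sigma=1.0))
```

The same template applies to general m-dependence by replacing the variance formula for the segment sum.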

    Model extensions: multivariate and indirect measurements

The general methodology underlying SMUCE may be used for local statistics, different from likelihood ratios. This offers extensions to other settings and we are grateful for the numerous contributions in this direction. Liu made an interesting proposal: the use of a local composite likelihood ratio statistic to apply the methodology underlying SMUCE to a varying‐coefficient generalized linear model. We thank Farcomeni who proposes a novel modification for panel data regression and multivariate data, as questioned by Coad, and Enikeeva and Harchaoui, who also pointed towards indirect observation. In this case the situation changes completely for two reasons. First, asymptotic theory becomes regular for piecewise continuous change point regression (including piecewise constant) and the maximum likelihood estimator will be Cramér–Rao efficient; see Frick, Hohage and Munk (2008). This is not so for the direct case that is treated in our paper, as mentioned by Song, Kosorok and Fine, who pointed out the non‐regularity of testing for a change point; see also Ibragimov and Khas'minskij (1981), Korostelev and Korosteleva (2002) and Antoch and Hušková (2000) for a similar phenomenon in estimation. From this perspective, indirect regression becomes even simpler. In contrast, computationally the problem changes drastically and the resulting estimation problem for the change points becomes a non‐linear and often non‐convex optimization problem, which is difficult to solve, in general (Frick, Hohage and Munk, 2008). In response to Song, Kosorok and Fine we also mention that in the current change point setting the number of change points is unknown and arbitrary, which makes this problem ‘non‐parametric’ of intermediate dimension (neither parametric p≪n nor high dimensional p≫n; rather n=p). This is very different from classical settings as the model dimension grows with the number of observations and different (asymptotic) efficiency concepts must be used (see theorem 5, theorem 6 and the comments by Jin and Ke, and Nickl).

    Time dynamic models

Currently, the theory underlying SMUCE assumes a regression model with a deterministic (unknown) function ϑ ∈ S. In this model, per se, the segments are not ‘intersegment dependent’, using Eckley's terminology. As pointed out by Eckley, Fuh and Teng, and Szajowski it is often more reasonable to assume ‘jump dynamics’ and to model the signal as a random process itself, and we fully agree. Probably the most prominent and simple model in the change point context is a hidden Markov model (HMM) as described by Fuh and Teng. In Hotz et al. (1970) we performed simulations for the pathwise reconstruction of SMUCE in an HMM with two hidden states (without making use of the HMM assumption) and it performs remarkably well. Roughly speaking, the reason is that, conditioned on the observations, the reconstruction still obeys all recovery properties that are inherited within the (conditional) regression model. Interestingly, the dynamic program underlying SMUCE has an analogue in the celebrated Viterbi algorithm, although pathwise the reconstructions are different, of course. However, in contrast with any HMM‐based estimation method for the most probable path, SMUCE neither incorporates any assumption on the number (and values) of states nor assumes a Markovian structure for the changes between the states. Indeed, we agree with Mateu that it would be of great interest if this could be incorporated in a sensible way. In fact, first simulations and examples show very promising results.

    This issue is related to the Bayesian viewpoint as stressed by Fearnhead, Gasbarra and Arjas, and Nyamundanda and Hayes, which we shall address in more detail later. In both worlds we assume that the parameter ϑ is random. Finally, it depends on the application whether we may assume dynamics for the change points (as in an HMM or state space model). This gives us also the possibility for prediction—which SMUCE does not offer so far.

    Beyond piecewise constant signals

    In Section 5.1 we included a simulation scenario with a deterministic trend to assess the robustness of SMUCE against violations of constant segments. These results are complemented by the simulations that were provided by Bigot, who offers an interesting procedure which does not rely on the piecewise constant assumption of the signal as SMUCE does.

Conversely, it is of interest to investigate how well SMUCE approximates a signal which is not piecewise constant, as addressed in the remarks of Farcomeni, and Linton and Seo. We have not yet developed a theory for the approximation properties of SMUCE for smooth functions. Nevertheless, we conjecture that it can be shown that SMUCE converges to the true signal at the minimax rate O(n^{−α/(2α+1)}) as long as the signal is not too smooth (i.e. if ϑ is in the approximation space A_α, 0<α⩽1, in the terminology of Boysen et al. (2009)). The case α=1 gives O(n^{−1/3}), supporting Linton and Seo's conjecture. For smoother signals (e.g. ϑ having more than one derivative) this rate will not improve further, owing to the locally constant reconstruction provided by SMUCE. For an illustration of such an approximation see Fig. 21.

    image
    (a) Doppler signal (Donoho and Johnstone, 2013), (b) Gaussian observations with σ=0.4 and (c) SMUCE

Farcomeni, and Linton and Seo suggest that SMUCE can be employed for more general right continuous piecewise parametric functions, with which we basically agree. We do not see any conceptual limitations. For a fast implementation, however, the optimal local costs must be computed efficiently (see Section 3.1), which may be laborious and must be done case by case. Fig. 22 shows an extension to piecewise linear functions with its associated confidence intervals for the change point locations and confidence bands.

    image
    (a) Poisson data with piecewise linear mean function ϑ (——–) and (b) confidence bands for ϑ (image) and confidence intervals for the change point location (image) for α=0.1

    Multiparametric regression

    A challenging topic for further research is to generalize SMUCE to multiparametric models. As pointed out by Eckley, Enikeeva and Harchaoui, Farcomeni, Nyamundanda and Hayes, Song, Kosorok and Fine, and Zhou this is highly relevant from a practical perspective. We shall discuss this briefly for Gaussian observations where we allow additionally changes in the variance. Assume that the main parameter is the mean, but the variance is expected to change from segment to segment as considered by Eckley and Zhou. Then it may be considered as a nuisance parameter and the local Gauss statistic in Tn (see Section 5.1) can be replaced by the local t‐test
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0493
    with a local variance estimate
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0494
Under the null hypothesis urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0495 is t distributed with j−i degrees of freedom. The proper scale calibration term for the multiscale statistic must be computed in this case, which is not straightforward at all and is an interesting topic for future investigation. The computation of the estimate, however, is the same as for the case with constant variance as treated in the paper.
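A minimal sketch of such a local t-statistic on a single interval (our own illustration of the statistic described above, not code from the stepR package):

```python
import numpy as np

def local_t(y, i, j, theta0):
    """Local t-statistic for H_0: mean = theta0 on the segment y_i, ..., y_j,
    with the variance treated as an unknown nuisance parameter; for Gaussian data
    it is t-distributed with j - i degrees of freedom under H_0."""
    seg = y[i:j + 1]
    m = len(seg)
    s2 = seg.var(ddof=1)                       # local variance estimate
    return np.sqrt(m) * (seg.mean() - theta0) / np.sqrt(s2)

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 2, 50)])   # variance changes too
print(local_t(y, 0, 49, theta0=0.0), local_t(y, 50, 99, theta0=0.0))
```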

    Multi‐dimensional models

    A challenging problem is to generalize SMUCE to data sampled in a d‐dimensional domain Ω (the unit cube [0,1]d for simplicity, say)
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0496
    Here ϑ denotes a piecewise constant function on Ω, i.e.
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0497

    This is a simplified version of the problem that is raised by Crudu, Porcu and Bevilacqua in connection with georeferenced data and has a long history in function estimation and statistical imaging, in general. Farcomeni's comment targets much the same issue.

    The challenge in extending the SMUCE methodology to d⩾2 is twofold:
    • (a) finding a statistically efficient and computationally feasible generalization of the multiscale statistic Tn(Y,ϑ) and
    • (b) a reasonable substitute for the objective function #J(ϑ).

    The latter reflects the fact that the piecewise constant (surface) segmentation problem is well known to be NP hard already for d=2.

Addressing the first issue, multiscale statistics for multi‐dimensional regression have proved useful for various detection problems so far. See for example Glaz et al. (2012) for several scan statistics in the multi‐dimensional case, Walther (2000) for the construction of a system of approximating sets which can be computed in O{n log^d(n)} steps, Addario‐Berry et al. (2010) for a combinatorial characterization of detectable sets which does not rely on the underlying space, or Arias‐Castro et al. (2006) for detecting a curve in a lattice (d=2). Addressing the second issue, in statistical function estimation theory and in mathematical imaging a vast body of the literature concerns the development of computationally feasible surrogate functionals and their associated theory. The minimization of such a functional under an appropriate multiscale statistic Tn in imaging is relatively new, however, and various methods in the literature can be viewed in a unifying way from this point of view (see Frick et al. (2006) and Frick, Marnitz and Munk (2008), and Section 1.6 of the paper for a brief overview). Here, we briefly sketch another approach when urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0498 is considered as a subset of the space BV(Ω), the functions of bounded variation. The jump set Dϑ of a function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0499 is the collection of points x ∈ Ω at which left and right limits of ϑ exist (in a particular sense) but do not coincide. The jump set is rectifiable and has finite Hausdorff measure urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0500. We thus suggest, for the case d>1, substituting the objective functional #J(ϑ) by the (d−1)‐dimensional Hausdorff measure of the jump set: urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0501. It is important to note that this definition is consistent with the case d=1 since then Dϑ is the set of change points of ϑ and urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0502 is the counting measure, thus: urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0503. In the case d=2, for example, urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0504 measures the perimeter of the jump set Dϑ. Summarizing, one way of generalizing SMUCE to a d‐dimensional domain is given by the optimization problem
minimize H^{d−1}(D_ϑ) over ϑ ∈ BV(Ω) subject to T_n(Y,ϑ) ⩽ q.
    For d=2 this amounts to minimizing the perimeter of the jump set of ϑ under a multiscale constraint. The generalized optimization problem for the case d>1, however, cannot be tackled by a dynamic programming approach anymore. One possibility might be an Ambrosio–Tortorelli‐type approximation (Ambrosio and Tortorelli, 1990) of the optimization problem as it is used in connection with the Mumford–Shah energy functional.

    Nature of SMUCE, error control and choice of threshold

    A major characteristic of SMUCE is the possibility of controlling the uncertainty for the estimate obtained and the number of jumps obtained. SMUCE has been designed such that much emphasis is put on controlling the error of overestimating the number of change points. This is guaranteed as SMUCE has the minimal number of change points within the set {ϑ:Tn(Y,ϑ)⩽q}. In fact, the overestimation control urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0506 in inequality (7) can even be refined to
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0507
    see Sieling (2007). Hence, the error of overestimating K by at least s jumps converges exponentially fast to 0 as s increases. This shows that SMUCE protects much more strongly against overestimation than the initial estimate urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0508 suggests, which is confirmed by all simulation studies performed by us and by the discussants (see for example Chen, Shah and Samworth).

    In summary, first and foremost, the parameter q is determined by the level of significance α, which controls the probability of overestimating K. As shown in theorem 1, this probability is determined by the null distribution, which in turn can be uniformly bounded in distribution by M in expression (15), i.e. M does not depend on ϑ. This has been questioned by Kovac. In addition, however, the exact as well as the asymptotic distribution depends on ϑ in a specific way (see expression (16)), which allows a refined determination of α if prior knowledge on ϑ is available. These refinements in turn yield improved detection power. Notably, ϑ enters only through the differences of consecutive jump locations τk+1−τk, similarly to Zhang and Siegmund's (2007) refined Bayes information criterion type of penalty.
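
    To illustrate how such a threshold can be obtained in practice, the following sketch approximates the null distribution of the Gaussian multiscale statistic by Monte Carlo simulation and reads off its (1−α)-quantile. The penalized form of the local statistics below reflects our reading of the Gaussian case with σ=1 and a constant signal; the exact penalty and the quantiles that SMUCE actually uses should be taken from the paper and the stepR package.

        ## Monte Carlo sketch of the null distribution of the Gaussian multiscale
        ## statistic T_n and of its (1 - alpha)-quantile q (sigma = 1, constant signal).
        ## The penalty below is our reading of the penalized local statistics; the
        ## quantiles used by SMUCE in practice are provided by the stepR package.
        multiscale_stat <- function(y) {
          n <- length(y)
          S <- c(0, cumsum(y))                                  # partial sums
          stat <- -Inf
          for (len in 1:n) {
            pen  <- sqrt(2 * log(exp(1) * n / len))             # scale penalty
            sums <- S[(len + 1):(n + 1)] - S[1:(n + 1 - len)]   # all sums over intervals of this length
            stat <- max(stat, max(abs(sums)) / sqrt(len) - pen)
          }
          stat
        }

        set.seed(1)
        Tnull <- replicate(500, multiscale_stat(rnorm(500)))    # pure-noise samples of T_n
        quantile(Tnull, 0.9)                                    # approximate q for alpha = 0.1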

    In his simulation study (accompanying our results in Section 5.1) Fryzlewicz compared SMUCE with wild binary segmentation (WBS) (an R package is available from http://cran.r-project.org/web/packages/wbs). The results suggest that, particularly for high noise levels, SMUCE (with its standard settings in stepR) has a tendency to underestimate the true number of jumps. This can be explained by the strict requirement that SMUCE gives a significance guarantee against overestimation, which becomes notoriously difficult for large variances. However, we do not fully agree with Fryzlewicz's conclusion concerning the superiority of the completely automatic WBS (sSIC) for small signal-to-noise ratios. The situation seems to be more complex and will depend on the signal itself. From this point of view, we stress that the possibility of choosing the level of significance α (as for any testing procedure) is a valuable tool for data analysis. Therefore, we argue against the fully automatic ‘blind’ use of SMUCE (and of any other segmentation method).

    We illustrate this by considering a simple signal with one change point:
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0509(55)
    In this scenario we compared SMUCE with WBS in 1000 simulations with independent and identically distributed Gaussian noise for σ=1, σ=2 and σ=3. WBS (sSIC) was applied with its default settings and SMUCE was computed for two different levels of significance, α=0.1 and α=0.5. The results in Table 13 suggest that for small and large variances SMUCE at level α=0.1 performs comparably with WBS. However, for large variances (σ=2 and σ=3) and the less conservative choice α=0.5, SMUCE outperforms WBS considerably.
    Table 13. Relative frequencies (%) of the estimated number of change points by SMUCE and WBS

                            σ=1                 σ=2                 σ=3
                        0     1    ⩾2       0     1    ⩾2       0     1    ⩾2
    SMUCE (α=0.1)      3.1  96.6   0.3    66.9  32.9   0.2    85.3  14.7   0.0
    SMUCE (α=0.5)      0.3  89.8   9.9    24.9  70.2   4.5    47.2  48.8   4.0
    WBS (sSIC)         1.1  96.0   2.9    61.2  35.2   3.6    85.1  11.8   3.1
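
    A sketch of how such a comparison can be run is given below, assuming that the interfaces smuceR() from the stepR package and wbs()/changepoints() from the wbs package behave as shown; the one-jump signal used here is a stand-in for the signal in equation (55), not its exact specification.

        ## Sketch of a Table 13-type comparison.  The calls to smuceR(), wbs() and
        ## changepoints() and the accessors used below are assumptions about the
        ## stepR and wbs interfaces; check against the package documentation.
        library(stepR)   # provides smuceR()
        library(wbs)     # provides wbs() and changepoints()

        n     <- 500
        theta <- c(rep(0, n / 2), rep(1, n / 2))   # hypothetical one-jump signal (stand-in for (55))

        one_run <- function(sigma, alpha) {
          y    <- theta + rnorm(n, sd = sigma)
          fit  <- smuceR(y, alpha = alpha)                              # SMUCE at level alpha
          K_sm <- nrow(fit) - 1                                         # number of segments minus one
          K_wb <- changepoints(wbs(y))$no.cpt.ic[["ssic.penalty"]]      # WBS with sSIC
          c(smuce = K_sm, wbs = K_wb)
        }

        set.seed(1)
        res <- replicate(1000, one_run(sigma = 2, alpha = 0.5))
        table(res["smuce", ])   # frequencies of the estimated number of change points
        table(res["wbs", ])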

    This suggests relaxing the current use of the familywise error rate, which is controlled by the multiscale statistic Tn. Farcomeni addresses this as he relates SMUCE to recent developments in multiple testing. In many applications (e.g. genetic screening) an approach based on the false discovery rate might be more appropriate (see also Siegmund et al. (2011)); a false discovery rate variant is currently under investigation.

    In any case, it might be attractive to use SMUCE as an accompanying tool for any segmentation method, as the output of such a method can be used as an input for the multiscale statistic Tn. This allows us to detect, via the corresponding violator plot, those regions which are rejected as being constant at level α (see Fig. 24 for such a violator plot in the case of the maximum likelihood solutions with varying numbers of jumps).

    Finally, we are grateful to Kong and Sun for giving us the opportunity to clarify an issue: all exponential bounds do depend on σ in the normal case. Indeed, σ enters through the worst-case signal-to-noise ratio Δ/σ; see the end of the first paragraph in Section 2.4. Accordingly, the multiscale statistic Tn itself depends on the (estimated) variance, as explained at the beginning of Section 5.1.

    A Bayesian view

    Addressing Gasbarra and Arjas's comment: based on theorem 1, refined bounds for SMUCE can be found if prior information on the change point locations τ1,…,τK is available. Theorem 1 shows that only a prior Γ on K and on the increments τK−τK−1,…,τ1−τ0 (and not on the entire space urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0510) is needed to determine the a posteriori threshold q such that
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0511
    With this choice of q the a posteriori probability of overestimation is controlled in the sense that
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0512
    In addition, the exponential bounds for the probability of underestimation can be used to control the probability of estimating the number of change points correctly by incorporating prior information on the true signal. For simplicity, the results in Section 2.3 have been given only in terms of the minimal interval length λ and smallest jump height Δ. Accordingly, only prior information on λ and Δ is included in the present strategy for choosing q in Section 4.
    The price for this simplicity is that this choice of q can become quite conservative, as pointed out by Kong and Sun, and by Nguyen. From the proof of theorem 7.12 in the on-line supplementary material a similar inequality can be derived in which all jump heights θk−θk−1 and all interval lengths τi−τi−1 enter. More precisely, in theorem 2 the right-hand side can be replaced by
    math image
    where Δk=θk−θk−1 and λk=min{τk−τk−1, τk+1−τk}. These bounds are less conservative; however, more detailed prior information is required for their application. As asked by Critchley, and by Nyamundanda and Hayes, we investigate the sensitivity of the current (simple) exponential bounds to changes in λ and Δ. This is illustrated in Fig. 23, where we display α-level contour plots for the exponential bounds in theorem 2. They reveal a qualitatively similar behaviour independently of α. These contour plots also have another interpretation: for fixed n the bounds may also be seen as a guarantee of detecting changes of a certain size at a given level α.
    Fig. 23. Contour plots of the exponential bounds for the probability of underestimation (see theorem 2) as a function of λ (x-axis) and Δ (y-axis) for (a) α=0.1 and (b) α=0.9, with n=1000
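
    Contour plots of this kind can be generated along the following lines; the function under_bound is a hypothetical stand-in for the exponential bound of theorem 2 (which depends on n, λ, Δ and σ), so the expression from the paper must be substituted before drawing any conclusions.

        ## Sketch of how alpha-level contours as in Fig. 23 can be drawn.  The bound
        ## `under_bound` is a hypothetical placeholder, NOT the bound of theorem 2.
        under_bound <- function(lambda, Delta, n, sigma = 1) {
          pmin(1, 2 * exp(-n * lambda * Delta^2 / (8 * sigma^2)))   # placeholder only
        }

        n      <- 1000
        lambda <- seq(0.01, 0.5, length.out = 100)
        Delta  <- seq(0.10, 2.0, length.out = 100)
        B      <- outer(lambda, Delta, under_bound, n = n)

        contour(lambda, Delta, B, levels = c(0.1, 0.9),
                xlab = expression(lambda), ylab = expression(Delta))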

    Nyamundanda and Hayes, as well as Gasbarra and Arjas, pointed out that full priors on the signal ϑ can be used to increase efficiency, and we fully agree. Misspecification then becomes an issue, however. A distinct difference between SMUCE and a fully Bayesian method seems to us to be that misspecification of the prior, if it is used as we suggest, will affect only the underestimation bound but not the overestimation bound. In contrast, misspecification of a prior on the segment lengths will degrade the overestimation bound and the interpretation of α. This is why we argue that it may be an interesting strategy to employ prior information primarily for controlling the underestimation error.

    The multiscale statistic Tn for model checking

    Critchley raised the important issue of how SMUCE is related to model checking. In fact, the multiscale test Tn can be regarded from this point of view as a multiscale analogue of local-likelihood-based methods, i.e. residual-based model checking in the normal case (Boysen et al., 2009). Expanding the example of Fig. 1 in the paper, any candidate function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0514 might be accompanied by a ‘violator plot’ of those intervals where the reconstruction of the signal gives rise to local rejection by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0515 exceeding the threshold q. This is illustrated in Fig. 24, where we display these regions (blue) for several candidate functions (red). These are the (unconstrained) maximum likelihood estimates for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0516 numbers of change points. For clarity we display only violators on intervals of dyadic lengths. The colour coding (levels of blue) has been chosen according to the number of overlapping intervals at a pixel (more violators correspond to darker blue). In this example the SMUCE solution (which is not displayed) almost coincides with the unconstrained maximum likelihood estimate for urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0517.

    Fig. 24. Data from example 1 together with candidate functions urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0333 and the regions in which the multiscale constraint is violated at level α=0.1; the true signal has K=8 change points
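
    A sketch of how such violator intervals can be computed for a given candidate step function is shown below; the penalized Gaussian local statistic reflects our reading of Section 2 (σ assumed known), and, as in Fig. 24, only intervals of dyadic lengths are checked.

        ## Sketch of a 'violator plot' computation: flag dyadic-length intervals on
        ## which the candidate step function theta.hat is constant but the penalized
        ## Gaussian local statistic exceeds the threshold q (sigma assumed known).
        violators <- function(y, theta.hat, q, sigma = 1) {
          n   <- length(y)
          S   <- c(0, cumsum(y - theta.hat))                 # cumulative sums of residuals
          out <- NULL
          for (len in 2^(0:floor(log2(n)))) {                # dyadic interval lengths
            pen <- sqrt(2 * log(exp(1) * n / len))
            for (i in 1:(n - len + 1)) {
              j <- i + len - 1
              if (length(unique(theta.hat[i:j])) > 1) next   # theta.hat not constant here
              stat <- abs(S[j + 1] - S[i]) / (sigma * sqrt(len)) - pen
              if (stat > q) out <- rbind(out, c(left = i, right = j))
            }
          }
          out   # matrix of violating intervals (or NULL if none)
        }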

    Model selection performance and optimal detection

    Jin and Ke highlight an interesting connection to what they call the ‘rare and weak’ setting in the more general context of a linear model Y=Xβ+ε (see also our Section 6.3 in the paper). Now βj=ϑj−ϑj−1 are considered as random. Furthermore, in our terminology, a scale corresponds to the length of a substring of the vector β. It is difficult to relate results in their model to ours rigorously, but some rough comparison can be done at this stage. If we relate the expected number of change points in their model with K in ours, we obtain E[#{βj:βj≠0}]=n^{1−ϑ}=K, which shows that 0<K<∞ if and only if ϑ=1. In this case, for any r, Jin and Ke are in the exact recovery setting (see their Fig. 17). As long as r⩽6+2√10≈12.325 the minimax detection rate for the Hamming selection error is urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0518 (up to log-terms). Hence, for ϑ=1, r can be arbitrarily small to obtain full recovery. Then, their τn=√{2r log(n)} might be related to our λ and the expected jump size E[βj]=√{2r log(n)}/n would correspond to Δ (although this is the worst-case jump size and not the average size). From the argument below theorem 7 we would obtain full model consistency in probability (i.e. we estimate all change points correctly and not only their number; the latter we may call weak model consistency) as long as the right-hand side in theorem 7 vanishes. If we fix cn=c>0 (as this bound is non-asymptotic), we obtain full model consistency in probability. For this, we require only Δ>c√{log(n)/n} for any c>0. Note that the assertion in theorem 7 holds uniformly over the set urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0519, which includes SMUCE. As the optimal estimation rate for Δ (if fixed) is known to be of order 1/√n, the previous estimate yields that SMUCE loses a log(n)-factor.

    Hence, we find that the difference between the rare and weak setting and our model is of order n^{−1/2}, which might be explained by parameter ‘averaging’ in the first model and by the additional sampling error in ours. It would be interesting to compare the CASE estimator empirically with SMUCE in the change point setting. We believe that a similar comment applies to Tsybakov; it seems that here also the sampling error matters. He points out that, for full model selection consistency (in the above sense) in the framework of variable selection, no condition on λ is needed.

    Indeed, in our model, choosing ϑ with change point locations between sampling points shows that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0520 is 0 for any n owing to the sampling error.

    However, we might restrict the possible change points to the locations of the coefficients β to circumvent this difficulty.

    If we express SMUCE analogously in the linear model representation, it can be regarded as a multiscale generalization of the Dantzig selector (see Frick et al. (2006)):
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0521
    Now, we may rewrite the linear model urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0522 where urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0523 is a centred Gaussian vector with covariance matrix A, the inverse of X^T X, which has entries (X^T X)i,j=min(i,j). A is an n×n tridiagonal band matrix with diagonal elements ai,i=2 for 1⩽i⩽n−1, an,n=1, and all subdiagonal and superdiagonal elements equal to −1; i.e. the error vector is 2-dependent. From this representation a subtle, but important, distinction from the Gaussian sequence model becomes obvious: the error variance is still σ2 and not σ2/n. Hence, the threshold for full model consistency of hard thresholding selection becomes (see Tsybakov (2013))
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0524
    which is in our notation Δ>2σ√{2 log(n)}. For SMUCE we even obtain model consistency as long as Δ>c√{log(n)/n}, for all c>0.
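
    The structure of this Gram matrix and of its inverse can be checked numerically; in the sketch below, taking X to be the upper-triangular matrix of ones is one possible choice of design for which X^T X has entries min(i,j).

        ## Numerical check: for X the upper-triangular matrix of ones, the Gram matrix
        ## X^T X has entries min(i, j), and its inverse A is tridiagonal with diagonal
        ## (2, ..., 2, 1) and off-diagonal entries -1, as stated above.
        n <- 6
        X <- outer(1:n, 1:n, function(i, j) as.numeric(i <= j))   # column j: ones in rows 1..j
        M <- t(X) %*% X
        all(M == outer(1:n, 1:n, pmin))    # TRUE: (X^T X)_{ij} = min(i, j)

        A <- round(solve(M), 10)
        A                                  # tridiagonal: 2 on the diagonal, a_{nn} = 1, -1 off-diagonal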

    We found it quite difficult to understand how the scales enter the linear model representation Y=Xβ+ε when we interpret this in terms of the sampling model as in Section 2.1.1. In our model the jump is embedded in the function ϑ, and this is independent of n. For example, to express in the linear model representation above that a change point lies in the middle of the interval [0,1] requires a sequence of parameter vectors β(n) whose ⌈n/2⌉th entry is non-zero and whose other entries are 0.

    Confidence sets

    Multiscale statistics have been used to obtain simultaneous confidence statements about qualitative features of the regression function: see Dümbgen and Spokoiny (1988), Dümbgen and Walther (1995) and Schmidt-Hieber et al. (2013) and also the references given by Nickl. Similarly, the multiscale statistic Tn can be employed to construct intervals such that, simultaneously with probability 1−α, each of the intervals contains at least one change point. The confidence intervals that we introduced in Section 3.2 are constructed under the additional restriction that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0525 equals the true number of change points. This yields urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0526 disjoint confidence intervals for the change point locations. If q=q1−α is the (1−α)-quantile of M, the coverage of these intervals is bounded by
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0527
    as shown in Section 2.6. Recall that here urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0528 necessarily depends on λ and Δ. In particular, this implies that confidence intervals cannot be provided at any prescribed level; rather, the error level is bounded by urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0529, which serves as an a posteriori confidence level given the data. This can be much less than 1−α and is in accordance with Fearnhead's comment that the interpretation of these intervals is critical if urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0530.
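
    In practice, confidence intervals for the change point locations and a confidence band for ϑ can be obtained roughly as follows; the calls to smuceR(), jumpint() and confband() are our reading of the stepR interface and should be checked against the package documentation.

        ## Sketch: SMUCE with confidence statements, assuming the stepR interface
        ## smuceR()/jumpint()/confband() behaves as shown here (check the package docs).
        library(stepR)
        set.seed(1)
        y   <- c(rnorm(150, mean = 0), rnorm(150, mean = 1))   # one change point after 150 observations
        fit <- smuceR(y, alpha = 0.4, confband = TRUE)
        jumpint(fit)    # disjoint confidence intervals for the change point locations
        confband(fit)   # simultaneous confidence band for the step function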

    Towards more accurate bands

    An appealing approach to increase the actual coverage of bands was provided by Chen, Shah and Samworth and similarly by Yau. They proposed building the confidence set based on urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0531, i.e.
    urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0532
    However, confidence bands based on this approach can be quite wide. To illustrate this, assume that urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0533 Let the multiscale constraint be fulfilled by a function urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0534 with urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0535 change points, i.e. urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0536 By adding one additional arbitrary change point to urn:x-wiley:13697412:media:rssb12047:rssb12047-math-0537 we obtain a class of functions that all lie in C(q*,q1−α). Therefore, intervals that do not contain a change point will not be detected by this approach with high probability. Such intervals, however, are essential for the confidence bands based on SMUCE.

    Killick pointed out that the construction of confidence intervals can be adapted to PELT. This appealing approach seems to work quite well empirically and it is certainly interesting to develop a theory for this approach.

    Nickl, and Enikeeva and Harchaoui, raise the challenging question of the (asymptotic) optimality of SMUCE. For weak signals on a long interval, i.e. when λ>0 is fixed, it follows from theorems 5, part (a), and 6, part (a), that changes can be detected as long as Δ=o(n^{−1/2}). If inference concerns the detection of a jump on scales which asymptotically shrink to 0, the answer is provided for the Gaussian case by theorems 5 and 6 (and the discussion in between), which show that SMUCE indeed achieves the minimax detection bound in the testing sense for a simple change point alternative. For more complex signals with an unbounded number of change points we do not know whether the constant given in theorem 6 is optimal, nor do we know this for non-Gaussian errors. Whether condition (9) is minimal for an (asymptotically) honest inference procedure on change points is a challenging issue and worth further investigation.

    Computational issues

    We are very grateful to Maidstone and Pickering, and to Nguyen, for their run time comparison. They pointed out that in the R package stepR only intervals of dyadic lengths are considered for n>1000 observations, to speed up computations. It is important to note that this is not merely an approximation to SMUCE, as the simulated (and asymptotic) quantiles for this reduced multiscale statistic change as well. In fact, in view of the big data challenge in many applications nowadays, it is of great practical interest to consider even smaller systems of intervals to speed up computations further. Such a system was suggested in Walther (2000) and Rivera and Walther (2004), where it was shown to achieve optimal detection rates in their context. As mentioned in Walther's comment, this reduced system is of order O{n log(n)}. By considering SMUCE for such a reduced system it seems possible to reduce the complexity of the dynamic program. This, however, is not straightforward, and it is important to distinguish between the complexity of evaluating the multiscale statistic for testing, as addressed by Walther, and the complexity of the dynamic program, which solves the corresponding constrained optimization problem to obtain SMUCE.
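
    The effect of restricting the interval system is easy to quantify; the following sketch counts the number of intervals in the full system and in the system of all intervals of dyadic lengths.

        ## Sizes of interval systems: all intervals versus intervals of dyadic lengths
        ## (the restriction used by stepR for n > 1000), O(n^2) versus O(n log n).
        n <- 5000
        all_intervals  <- n * (n + 1) / 2                          # every interval [i, j]
        dyadic_lengths <- sum(sapply(2^(0:floor(log2(n))),         # lengths 1, 2, 4, ...
                                     function(len) n - len + 1))
        c(all = all_intervals, dyadic = dyadic_lengths)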

    Responding to Hušková, pseudocode for the algorithm is given in Futschik et al. (2008). We are grateful to Killick, who addresses the connection between SMUCE as a constrained optimization problem and its formulation as a penalized optimization problem. To shed some light on this connection we showed how SMUCE can be rewritten in terms of a certain penalized cost functional. This was done mainly for expository reasons and we had no practical consequences in mind, beyond illustrating that dynamic programming is applicable (see lemma 1). The actual implementation in the R package stepR does not rely on lemma 1 and therefore we do not need to choose γ in practice. The only parameter to choose is q, i.e. the level of significance α.

    Again, we express our deep thanks to all the discussants. We also thank E. Arias-Castro, L. D. Brown, M. Diehn, L. Dümbgen, A. Futschik, M. Hušková, O. Lepski, F. Pein, R. Samworth, D. Siegmund, I. Tecuapetla, A. Tsybakov and G. Walther for additional comments and discussions during the process of writing the paper and the rejoinder. Special thanks go to T. Hotz, who contributed significantly to the implementation in stepR. We are very grateful for the hospitality of the Newton Institute, Cambridge, where parts of this rejoinder were written, and to the Royal Statistical Society and Series B for hosting the discussion and for the opportunity to present our work. We acknowledge support of Deutsche Forschungsgemeinschaft grants FOR 916, CRC 755 and CRC 803.
