Multiscale change point inference
Summary
We introduce a new estimator, the simultaneous multiscale change point estimator SMUCE, for the change point problem in exponential family regression. An unknown step function is estimated by minimizing the number of change points over the acceptance region of a multiscale test at a level α. The probability of overestimating the true number of change points K is controlled by the asymptotic null distribution of the multiscale test statistic. Further, we derive exponential bounds for the probability of underestimating K. By balancing these quantities, α will be chosen such that the probability of correctly estimating K is maximized. All results are even non-asymptotic for the normal case. On the basis of these bounds, we construct (asymptotically) honest confidence sets for the unknown step function and its change points. At the same time, we obtain exponential bounds for estimating the change point locations which, for example, yield the minimax rate 1/n up to a log-term. Finally, the simultaneous multiscale change point estimator achieves the optimal detection rate of vanishing signals as n→∞, even for an unbounded number of change points. We illustrate how dynamic programming techniques can be employed for efficient computation of estimators and confidence regions. The performance of the multiscale approach proposed is illustrated by simulations and in two cutting-edge applications from genetic engineering and photoemission spectroscopy.
1. Introduction
Suppose we observe independent random variables
(1)  Y_i ~ F_{ϑ(x_i)},  i = 1, …, n,
with equidistant sampling points x_i ∈ [0,1), where {F_θ : θ ∈ Θ} is a one-dimensional exponential family and ϑ is a right continuous step function with an unknown number K of change points. Figs 1(a) and 1(b) depict such a step function with K=8 change points and corresponding data Y for the Gaussian family with fixed variance σ².

Fig. 1(d) additionally shows the SMUCE reconstruction with confidence bands and confidence intervals for the change point locations at α = 0.4.
Given data Y from model (1), our aims are to infer:
- (a) the number of change points of ϑ,
- (b) the change point locations and the function values (intensities) of ϑ and
- (c) confidence bands for the function ϑ and simultaneous confidence intervals for its change point locations.
1.1. Multiscale statistics and estimation
The goals (a)–(c) will be achieved on the basis of a new estimation and inference method for the change point problem in exponential families: the simultaneous multiscale change point estimator SMUCE. Let 𝒮 denote the space of all right continuous step functions with an arbitrary but finite number of jumps on the unit interval [0,1) with values in Θ. For ϑ ∈ 𝒮 we denote by J(ϑ) the ordered vector of change points and by #J(ϑ) its length, i.e. the number of change points. In a first step, SMUCE needs to solve the (non-convex) optimization problem
(2)  min_{ϑ ∈ 𝒮} #J(ϑ)  subject to  Tn(Y, ϑ) ⩽ q,
for a given threshold q > 0. Optimization problems of the type (2) have been recently considered in Höhenrieder (2008) for Gaussian change point regression (see also Boysen et al. (2009) for a related approach) and for volatility estimation in Davies et al. (2012). The multiscale statistic Tn in problem (2) evaluates the maximum of the local likelihood ratio statistics over all discrete intervals [i/n, j/n] on which ϑ is constant with value θ=θi,j, i.e.
(3)  Tn(Y, ϑ) = max_{1⩽i⩽j⩽n : ϑ ≡ θi,j on [i/n, j/n]} { √{2 Tij(Y, θi,j)} − √{2 log( e n/(j−i+1) )} },
where the local likelihood ratio statistic Tij(Y, θ0) for testing H0:θ=θ0 against H1:θ≠θ0 on the interval [i/n,j/n] is defined as
(4)  Tij(Y, θ0) = log{ sup_{θ ∈ Θ} ∏_{k=i}^{j} f_θ(Y_k) / ∏_{k=i}^{j} f_{θ0}(Y_k) }.
To obtain the final estimator for ϑ, first consider the set of all solutions of problem (2), given by
(5)  C(q) = {ϑ ∈ 𝒮 : #J(ϑ) = K̂(q) and Tn(Y, ϑ) ⩽ q},
where K̂(q) denotes the minimal value of problem (2). SMUCE is then defined to be the constrained maximum likelihood estimator within this confidence set C(q), i.e.
(6)  ϑ̂(q) = argmax_{ϑ ∈ C(q)} l(Y, ϑ),
with l(Y, ϑ) the (log-)likelihood of ϑ.
Fig. 1(c) depicts the resulting estimate for the data in Fig. 1(b); although the true signal contains segments on rather different scales, SMUCE recovers them both equally well.
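To make the multiscale constraint concrete, the following is a minimal sketch (in R, the language of the accompanying stepR package, but independent of it) of the Gaussian version of the statistic in expression (3): for a candidate step function, given by the right end points of its segments and the corresponding values, it scans all discrete intervals on which the candidate is constant and returns the largest penalized local likelihood ratio. The function name and interface are illustrative and not part of any package.

```r
## Minimal sketch of T_n(Y, vartheta) for the Gaussian family with known sigma.
## `ends` are the right end points (indices) of the constant segments of the
## candidate step function and `values` its values on those segments.
multiscale_stat <- function(y, ends, values, sigma = 1) {
  n <- length(y)
  starts <- c(1L, head(ends, -1L) + 1L)
  stat <- -Inf
  for (k in seq_along(ends)) {
    for (i in starts[k]:ends[k]) {
      csum <- cumsum(y[i:ends[k]] - values[k])   # partial sums of residuals on [i, j]
      len  <- seq_along(csum)                    # interval lengths j - i + 1
      lr   <- csum^2 / (len * sigma^2)           # 2 * T_i^j for the Gaussian family
      pen  <- sqrt(2 * log(exp(1) * n / len))    # scale calibration
      stat <- max(stat, max(sqrt(lr) - pen))
    }
  }
  stat
}
## e.g. for data without change point and the candidate vartheta = 0:
# set.seed(1); y <- rnorm(200); multiscale_stat(y, ends = length(y), values = 0)
```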
1.2. Deviation bounds and confidence sets
The threshold q in problem (2) plays a crucial role because it governs the trade-off between data fit (the right-hand side in problem (2)) and parsimony (the left-hand side in problem (2)). It has an immediate statistical interpretation. From expression (2) it follows that
(7)  P{K̂(q) > K} ⩽ P{Tn(Y, ϑ) > q},
and in addition we derive an exponential bound for the probability of underestimating K,
(8)
As a consequence of inequalities (7) and (8), the set C(q1−α) in expression (5) constitutes an asymptotic confidence set for ϑ at level 1−α, and we shall explain in Section 3.2 how confidence bands for the graph of ϑ and confidence intervals for its change points can be obtained from it. See Fig. 1(d) for an illustration.
Uniform coverage over the whole class 𝒮 cannot be achieved, as Δ and λ can become arbitrarily small. Nevertheless, we can show that simultaneously both confidence bands for ϑ and intervals for the change points are asymptotically honest with respect to a sequence of nested models that satisfy
(9)
as n→∞ (Section 2.6). Here λn and Δn denote the smallest interval length and smallest absolute jump size within the nth model respectively.
1.3. Choice of q
Balancing the probabilities for overestimation and underestimation in inequalities (7) and (8) gives an upper bound on P{K̂(q) ≠ K}, the probability that the number of change points is misspecified. This bound depends on n, q, λ and Δ in an explicit way and opens the door for several strategies to select q, e.g. such that the resulting bound on P{K̂(q) = K} is maximized. One may additionally incorporate prior information on Δ and λ, and we suggest a simple way to do this in Section 4.
A further consequence of inequalities (7) and (8) is that, under a suitable choice of q=qn, the probability of misspecification P{K̂(qn) ≠ K} tends to 0 and hence K̂(qn) converges to the true number of change points K (model selection consistency), such that the underestimation error in inequality (8) vanishes exponentially fast.
Finally, we obtain explicit bounds on the precision of estimating the change point locations which again depend on q, n, λ and Δ. For any fixed q>0 the change point locations are recovered by all estimators in C(q), including SMUCE, at the optimal rate 1/n (up to a log-factor). Moreover, these bounds can be used to derive slower rates uniformly over nested models as in condition (9) (Section 2.6).
1.4. Detection power for vanishing signals
For the case of Gaussian observations we derive the detection power of the multiscale statistic Tn in expression (3), i.e. we determine the maximal rate at which a signal may vanish with increasing n while still being detectable with probability 1, asymptotically. For the task of detecting a single constant signal against a noisy background, we obtain the optimal rate and constant (see Chan and Walther (2013), Dümbgen and Spokoiny (2001), Dümbgen and Walther (2008) and Jeng et al. (2010)). We extend this result to the case of an arbitrary number of change points, retrieving the same optimal rate but different constants (Section 2.5). Similar results have been derived recently in Jeng et al. (2010) for sparse signals, where the estimator takes into account the explicit knowledge of sparsity. We stress that SMUCE does not rely on any sparsity assumptions, yet it adapts automatically to sparse signals owing to its multiscale nature.
1.5. Implementation, simulations and applications
The applicability of dynamic programming to the change point problem has been the subject of research recently (see for example Boysen et al. (2009), Fearnhead (2006), Friedrich et al. (2008) and Harchaoui and Lévy‐Leduc (2010)). SMUCE
can also be computed by a dynamic program owing to the restriction of the local likelihoods to the constant parts of candidate functions. This has already been observed by Höhenrieder (2008) for the multiscale constraint that was considered there. We prove that expression (6) can be rewritten into a minimization problem of a penalized cost function with a particular data‐driven penalty (see lemma 1 in Section 3.).
Much in the spirit of the dynamic program that was suggested in Killick et al. (2012), our implementation exploits the structure of the constraint set in expression (6) to include pruning steps. These reduce the worst case computation time
considerably in practice and make it applicable to large data sets. Simultaneously, the algorithm returns a confidence band for the graph of ϑ as well as confidence intervals for the location of the change points (Section 3.), the latter without any additional cost. An R package (stepR) including an implementation of SMUCE is available on line (http://www.stochastik.math.uni-goettingen.de/smuce).
Extensive simulations reveal that SMUCE is competitive with (and indeed often outperforms) state-of-the-art methods for the change point problem, all of which have been tailor-made to specific exponential families (Section 5). Our simulation study includes the circular binary segmentation (CBS) method (Olshen et al., 2004), unbalanced Haar wavelets (Fryzlewicz, 2007), the fused lasso (Tibshirani et al., 2005) and the modified Bayes information criterion (MBIC) (Zhang and Siegmund, 2007) for Gaussian regression, the multiscale estimator in Davies et al. (2012) for piecewise constant volatility and the extended taut string method for quantile regression in Dümbgen and Kovac (2009). In our simulations we consider several risk measures, including the mean-squared error (MSE) and the model selection error. Moreover, we study the feasibility of our approach for different real world data sets, including two benchmark examples from genetic engineering (Lai et al., 2005) and a new example from photoemission spectroscopy (Hüfner, 2003), which amounts to Poisson change point regression. Finally, in Section 6, we briefly discuss possible extensions to serially dependent data, among others.
1.6. Literature survey and connections to existing work
The problem of detecting changes in the characteristics of a sequence of observations has a long history in statistics and related fields, dating back to the 1950s (see for example Page (1955)). In recent years, it has experienced a renaissance in the context of regression analysis due to novel applications that mainly came along with the rapid development in genetic engineering (Braun et al., 2000; Jeng et al., 2010; Lebarbier and Picard, 2011; Olshen et al., 2004; Zhang and Siegmund, 2007) and financial econometrics (see Davies et al. (2012), Inclán and Tiao (1994), Lavielle and Teyssière (2007) and Spokoiny (2009)). Owing to the widespread occurrence of change point problems in different communities and areas of applications, such as statistics (Carlstein et al., 1994), electrical engineering and signal processing (Blythe et al., 2012), mobile phone communication (Zhang et al., 2009), machine learning (Harchaoui and Lévy‐Leduc, 2008), biophysics (Hotz et al., 2012), quantum optics (Schmidt et al., 2012), econometrics and quality control (Bai and Perron, 1998) and biology (Siegmund, 2013), an exhaustive list of existing methods is beyond reach. For a selective survey, we refer the reader also to Basseville and Nikiforov (1993), Brodsky and Darkhovsky (1993), Chen and Gupta (2000), Csörgő and Horváth (1997) and Wu (2005) and the extensive list in Khodadadi and Asgharian (2008).
Our approach as outlined above can be considered as a hybrid method of two well-established approaches to the change point problem. The first consists of penalized likelihood estimators of the form
argmin_ϑ { −l(Y, ϑ) + pen(ϑ) },
where the minimum is taken over a suitable class of candidate functions, e.g. step functions as in this paper or functions of bounded variation (Mammen and van de Geer, 1997), etc. Here l(Y,ϑ) is the (log-) likelihood function. The penalty term pen(ϑ) penalizes the complexity of ϑ and prevents overfitting. It increases with the dimension of the model and serves as a model selection criterion.
Linear l0‐penalization, i.e. pen(ϑ)=ω#J(ϑ), has already been considered in Yao (1988) and Yao and Au (1989) with a BIC‐type weight ω∼ log (n). More sophisticated methods based on weighted l0‐penalties have since been further developed in Boysen et al. (2009), Braun et al. (2000), Winkler and Liebscher (2002) and Wittich et al. (2008) and more recently in Demaret et al. (2013) for higher dimensions. Model‐selection‐based l0‐penalized functionals, which are non‐linear in #J(ϑ), have been investigated in Arlot et al. (2012), Birgé and Massart (2001), Lavielle (2005), Lavielle and Moulines (2000) and Lavielle and Teyssière (2007) for change point regression. Zhang and Siegmund (2007) introduced a penalty which depends on the number of change points and additionally on their locations.
Further prominent penalization approaches include the fused lasso procedure (see Friedman et al. (2007), Tibshirani et al. (2005) and Harchaoui and Lévy‐Leduc (2010)) that uses a linear combination of the total variation and the l1‐norm penalty as a convex surrogate for the number of change points which has been primarily designed for the situation when ϑ is sparse. Recently, aggregation methods (Rigollet and Tsybakov, 2012) have been advocated for the change point regression problem as well.
Most similar in spirit to our approach are estimators which minimize target functionals under a statistical multiscale constraint. For some early references see Donoho (1995), Nemirovski (1985) and more recently Candès and Tao (2007), Davies and Kovac (2001), Davies et al. (2009) and Frick et al. (2012). In our case this target functional equals the number of change points.
The multiscale calibration in expression (3) is based on the work of Chan and Walther (2013), Dümbgen and Spokoiny (2001) and Dümbgen and Walther (2008). Multiscale penalization methods have been suggested in Kolaczyk and Nowak (2004) and Zhang and Siegmund (2007), multiscale partitioning methods including binary segmentation in Fryzlewicz (2012), Olshen et al. (2004), Sen and Srivastava (1975) and Vostrikova (1981) and recursive partitioning in Kolaczyk and Nowak (2005).
Aside from the connection to frequentists' work cited above, we claim that our analysis also provides an interface for incorporating a priori information on the true signal into the estimator (see Section 4). We stress that, for minimizing the bounds in inequalities (7) and (8) on the model selection error, it is not necessary to include full priors on the space of step functions. Instead it suffices simply to specify a prior on the smallest interval length λ and the smallest absolute jump size Δ. The parameter choice strategy that is discussed in Section 4 or the limiting distribution of Tn(Y,ϑ) in Section 2, for instance, can be refined within such a Bayesian framework. This, however, will not be discussed in this paper in detail and is postponed to future work. For recent work on a Bayesian approach to the change point problem we refer to Du and Kou (2012), Fearnhead (2006), Luong et al. (2012), Rigaill et al. (2012) and the references therein.
We finally stress that there is a conceptual analogy between SMUCE and the Dantzig selector as introduced in Candès and Tao (2007) for estimating sparse signals in Gaussian high dimensional linear regression models (see James and Radchenko (2009) for an extension to exponential families). There the l1-norm of the signal is minimized subject to the constraint that the residuals are pointwise within the noise level. SMUCE, in contrast, minimizes the l0-norm of the discrete derivative of the signal subject to the constraint that the residuals are tested to contain no signal on all scales. We shall briefly address this and other relationships to recent concepts in high dimensional statistics in a discussion in Section 6. In summary, the change point problem is an 'n=p' problem and hence substantially different from high dimensional regression where 'p≫n'. As we shall show, multiscale detection of sparse signals then becomes possible without any sparsity assumption entering the estimator. Another major statistical consequence of this paper is that post model selection inference is doable over a large range of scales uniformly over nested models in the sense of expression (9).
2. Theory
This section summarizes our main theoretical findings. In Section 2.3 we discuss consistency of the estimated number of change points. This result follows, on the one hand, from an exponential bound for the probability of underestimating the number of change points. On the other hand we show how to control the probability of overestimating the number of change points by means of the limiting distribution of Tn(Y,ϑ) as n→∞ (Section 2.2). We give improved results, including a non-asymptotic bound for the probability of overestimating the number of change points, for Gaussian observations (Sections 2.4 and 2.5). In Section 2.6 we finally show that the change point locations can be recovered as fast as the sampling rate up to a log-factor and discuss how asymptotically honest confidence sets for ϑ can be constructed over a suitable sequence of nested models.
2.1. Notation and model
Throughout this paper, {F_θ : θ ∈ Θ} is a one-dimensional, standard exponential family with ν-densities
(10)  f_θ(y) = exp{θ y − ψ(θ)},  θ ∈ Θ,
where Θ ⊆ ℝ denotes the natural parameter space. We shall assume that the family is regular and minimal, which means that Θ is an open interval and that the cumulant transform ψ is strictly convex on Θ. We shall frequently make use of the functions
(11)
2.1.1. Observation model and step functions
(12)  ϑ(t) = Σ_{k=0}^{K} θ_k 1_{[τ_k, τ_{k+1})}(t),  with 0 = τ_0 < τ_1 < ⋯ < τ_K < τ_{K+1} = 1 and θ_k ∈ Θ.
For ϑ as in expression (12) we denote by J(ϑ)=(τ1,…,τK) the increasingly ordered vector of change points and by #J(ϑ) = K its length. We shall further consider the subset of step functions with K change points whose change point locations are restricted to the sample grid.
For any estimator ϑ̂ of ϑ, the estimated number of change points will be denoted by K̂ = #J(ϑ̂) and the change point locations by τ̂_1 < ⋯ < τ̂_K̂, and we set τ̂_0 = 0 and τ̂_{K̂+1} = 1. For simplicity, we restrict ourselves to estimators which have change points only at sampling points. To keep the presentation simple, throughout what follows we restrict ourselves to an equidistant sampling scheme as in model (1). However, we mention that extensions to more general designs are possible.
2.1.2. Multiscale statistic
The local likelihood ratio statistic Tij(Y, θ0) in expression (4) can be rewritten as
Tij(Y, θ0) = sup_{θ ∈ Θ} Σ_{k=i}^{j} {(θ − θ0) Y_k − ψ(θ) + ψ(θ0)}.
The multiscale statistic Tn(Y,ϑ) in expression (3) was defined to be the (scale-calibrated) maximum over all statistics Tij(Y, θ) such that ϑ ≡ θ on [i/n, j/n] for some θ ∈ Θ. As mentioned in Section 1, we shall sometimes restrict the minimal interval length (scale) by a sequence of lower bounds cn tending to 0. To ensure that the asymptotic null distribution is non-degenerate, we assume for non-Gaussian families (see also Schmidt-Hieber et al. (2013))
(13)
(14)
2.2. Asymptotic null distribution
The asymptotic null distribution of the multiscale statistic is characterized by a statistic M, which is defined through
(15)
Theorem 1. Assume that cn satisfies condition (13). Then,
(16)
A proof is given in the on‐line supplement.
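As an illustration of how the null distribution can be accessed in practice, the following hedged sketch approximates the finite-sample Gaussian null distribution of Tn by Monte Carlo simulation and extracts a (1−α)-quantile; it re-uses the illustrative multiscale_stat() helper from Section 1.1 and is not the procedure used for the figures and tables of the paper.

```r
## Monte Carlo approximation of the Gaussian null distribution of T_n and of
## its (1 - alpha)-quantile. Under the null the candidate equals the true
## (here constant) signal, so the statistic is evaluated at vartheta = 0.
simulate_null_quantile <- function(n, alpha = 0.1, reps = 500) {
  sims <- replicate(reps, multiscale_stat(rnorm(n), ends = n, values = 0))
  quantile(sims, 1 - alpha)
}
# simulate_null_quantile(100, alpha = 0.1)  # keep n modest: the helper is O(n^2) per run
```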
(17)
Fig. 2: simulated null distribution of the multiscale statistic for several sample sizes, including n=500 and n=5000 equidistant discretization points.
2.3. Exponential inequality for the estimated number of change points
In this section we derive explicit bounds on the probability that K̂(q), as defined in problem (2), underestimates the true number of change points K. In combination with the results in Section 2.2, these bounds will imply model selection consistency, i.e. P{K̂(qn) = K} → 1, for a suitable sequence of thresholds qn in problem (2).
Let
(18)  Δ = min_{k=1,…,K} |θ_k − θ_{k−1}|  and  λ = min_{k=0,…,K} (τ_{k+1} − τ_k)
denote the smallest absolute jump size and the smallest interval length of ϑ respectively, and assume that ϑ(t) lies in a fixed compact subset of Θ for all t ∈ [0,1]. We give the aforementioned exponential upper bound on the probability that the number of change points is underestimated. The result follows from the general exponential inequality in the on-line supplement, theorem 7.10.
Theorem 2 (underestimation bound). Let Δ and λ be defined as in expression (18) with λ⩾2cn. Then there is a constant C > 0, depending only on the exponential family, such that
(19)
(20)
For instance, C = 1/32 for the Gaussian family; a corresponding bound also holds for the Poisson family, given an additional condition in the latter case.
Roughly speaking, theorem 2 guarantees that for suitable thresholds qn the number of change points is not underestimated with high probability. On the other hand, it follows from theorem 1 that Tn(Y,ϑ;cn) is bounded almost surely as n→∞ if cn is as in expression (13). This in turn implies that the probability of the event {K̂(qn) ⩽ K} tends to 1, since
(21)  P{K̂(qn) > K} ⩽ P{Tn(Y, ϑ; cn) > qn}.
Theorem 3 (model selection consistency). Let the assumptions of theorems 1 and 2 hold and additionally assume that qn→∞ and qn/√n→0 as n→∞. Then,
P{K̂(qn) = K} → 1  as n → ∞.
Giving a non‐asymptotic bound for the probability for overestimating the true number of change points (in the spirit of expression (21)) appears to be rather difficult in general. For the Gaussian case though this is possible, as we shall show in the next section.
2.4. Gaussian observations
In this section we assume that the underlying family is the Gaussian family of distributions with constant variance. In this case model (1) reads
(22)  Y_i = ϑ(x_i) + σ ε_i,  i = 1, …, n,
where ε_1, …, ε_n are independent standard normal random variables, σ>0 and ϑ denotes the expectation of Y. To ease the notation we assume in what follows that σ=1. For the general case replace Δ by Δ/σ.
In the Gaussian case it is possible to remove the lower bound for the smallest scales cn as in expression (13) because the strong approximation by Gaussian observations in the proof of theorem 1 becomes superfluous. We obtain the following non‐asymptotic result on the null distribution.
Theorem 4 (null distribution of Tn). For any sample size n and any threshold q,

In contrast with theorem 1, this result is non‐asymptotic and the inequality holds for any sample size. For this reason, we obtain the following improved upper bound for the probability of overestimating the number of change points.
Corollary 1 (overestimation bound). Let Δ and λ be defined as in expression (18). Then, for any q > 0,

This corresponds to the ‘worst‐case scenario’ for overestimation when the true signal ϑ has no jump.
(23)
2.5. Multiscale detection of vanishing signals for Gaussian observations
We shall now discuss the ability of SMUCE to detect vanishing changes in a signal. We begin with the problem of detecting a signal on a single interval against an unknown background.
Theorem 5. Let ϑn(t)=θ0+δn 1_{In}(t) for some θ0, θ0+δn ∈ Θ and some sequence of intervals In ⊂ [0,1], and let Y be given by expression (22). Further let qn be bounded away from zero and assume,
- for signals on a large scale (i.e. lim inf |In| > 0), that √(n|In|) δn/qn → ∞ and,
- for signals on a small scale (i.e. |In| → 0), that √(n|In|) δn ⩾ (√2+ɛn) √{log(1/|In|)}, with ɛn subject to ɛn√{log(1/|In|)} → ∞ and ɛn√{log(1/|In|)} − qn → ∞.
Then,
(24)
A proof is given in the on-line supplement.
Theorem 5 gives sufficient conditions on the signals ϑn (through the interval length |In| and the jump height δn) as well as on the thresholds qn such that the multiscale statistic Tn detects the signals with probability 1, asymptotically; put differently, this means that expression (24) holds. We stress that the above result is optimal in the following sense: no test can detect signals that fall below this detection boundary with asymptotic power 1 (see Chan and Walther (2013), Dümbgen and Spokoiny (2001) and Jeng et al. (2010)). For the special case, when qn≡qα is a fixed α-quantile of the null distribution of Tn(Y,ϑn) (or of the limiting distribution M in expression (15)), the result boils down to the findings in Chan and Walther (2013) and Dümbgen and Spokoiny (2001). In particular, aside from the optimal asymptotic power (24), the error of the first kind is bounded by α. The result in theorem 5 goes beyond that and allows us to shrink the error of the first kind to zero asymptotically, by choosing qn→∞.
We finally generalize the results in theorem 5 to the case when ϑn has more than one change point. To be more precise, we formulate conditions on the smallest interval and the smallest jump in ϑn such that no change point is missed asymptotically.
Theorem 6. Let ϑn be a sequence of step functions in 𝒮 with Kn change points and denote by Δn and λn the smallest absolute jump size and smallest interval length in ϑn respectively. Further, assume that qn is bounded away from zero and,
- for signals on large scales (i.e. lim inf λn>0), that √(λnn)Δn/qn→∞,
- for signals on small scales (i.e. λn→0) with Kn bounded, that √(λnn)Δn⩾(4+ɛn)×√log(1/λn) with ɛn√log(1/λn)→∞, and
- the same as in assumption (b), with Kn unbounded and the constant 12 instead of 4.
Then,

A proof is given in the on‐line supplement.
Theorem 6 amounts to saying that the statistic Tn can detect multiple change points simultaneously at the same optimal rate (in terms of the smallest interval and jump) as a single change point. The only difference lies in the constants that bound the size of the signals that can be detected. These increase with the complexity of the problem: √2 for a single change against an unknown background, 4 for a bounded (but unknown) and 12 for an unbounded number of change points. In Jeng et al. (2010) it was shown that for step functions that exhibit certain sparsity patterns the optimal constant √2 can be achieved. It is important to note that we do not make any sparsity assumption on the true signal. Finally we mention an analogue to theorem 4.1 of Dümbgen and Walther (2008) in the context of detecting local increases and decreases of a density. As in theorem 6, only the constants and not the rates of detection change with the complexity of the alternatives.
2.6. Estimation of change point locations and simultaneous confidence sets
We modify the confidence set C(q) in expression (5) by replacing Tn(Y,ϑ) in expression (3) with Tn(Y,ϑ;cn) as in expression (14) and consider the set of solutions of the optimization problem
(25)
Any estimator in this set recovers the change point locations of the true regression function ϑ with the same rate of convergence. This rate is determined by the smallest scale cn for the considered interval lengths in the multiscale statistic Tn in expression (14) and hence equals the sampling rate up to a log-factor.
Theorem 7. Let q > 0, let C(q; cn) denote the set of solutions of problem (25) and let cn be a sequence in (0,1]. Further, let C be as in expression (20). Then, for all estimators in C(q; cn),

A proof is given in the on‐line supplement.
In particular, a sufficient condition for the right-hand side in theorem 7 to vanish as n→∞ is

Here the constant C matters; for example in the Gaussian case C=1/32 (see Section 2.3). This improves several results that have been obtained for other methods; for example, in Harchaoui and Lévy-Leduc (2010) a log²(n)/n rate has been shown for a total variation penalized estimator.
Theorem 7 can further be used to identify sequences of subclasses of step functions in which the change point locations are reconstructed uniformly with rate cn. These subclasses are delimited by conditions on the smallest absolute jump height Δn and on the number of change points Kn (or the smallest interval lengths λn, by using the relationship Kn⩽1/λn) of their members. For instance, the rate function cn=n^{−β} with some β ∈ [0,1) implies the condition

and it follows from theorem 7 that, for n sufficiently large,

(26)
We now show that this set is an asymptotic confidence set at level 1−α.
Corollary 2. Let α ∈ (0,1) and let q1−α be the (1−α)-quantile of the statistic M in expression (15). Then,
(27)
with cn as in theorem 3. Consequently we find that
lim inf_{n→∞} P{ϑ ∈ C(q1−α)} ⩾ 1 − α.
We mention that for the Gaussian family (see Section 2.4.) inequality (27) even holds for any n, i.e. the o(1) term on the right‐hand side can be omitted. Thus the right‐hand side of inequality (27) gives an explicit and non‐asymptotic lower bound for the true confidence level of C(qα).
Containing a multitude of step functions, the confidence set C(q) is difficult to visualize in practice. Therefore, in Section 3.2 we compute a confidence band B(q)⊂[0,1]×Θ that contains the graphs of all functions in C(q), as well as disjoint confidence intervals for the change point locations, one for each k = 1, …, K̂(q). For simplicity, we denote the collection of these intervals by I(q) and agree on the notation
(28)
The band B(q) and the intervals I(q) are linked by the relationship
(29)
We call I(q) honest for a class of signals at level 1−α if

Honesty cannot hold over the entire class 𝒮, since signals cannot be detected if they vanish too fast as n→∞. For Gaussian observations this was made precise in Section 2.5.
Let S(n), n ∈ ℕ, be a sequence of subclasses of 𝒮. Then I(q) is called sequentially honest with respect to S(n) at level 1−α if

By combining expressions (26), (27) and (29) we obtain the following result about the asymptotic honesty of I(q1−α).
Corollary 3. Let α ∈ (0,1), let q1−α be the (1−α)-quantile of the statistic M in expression (15) and assume that (λn, Δn)_{n ∈ ℕ} is a sequence of positive numbers satisfying condition (9). Define S(n) as the class of step functions in 𝒮 with smallest interval length at least λn and smallest absolute jump size at least Δn. Then I(q1−α) is sequentially honest with respect to S(n) at level 1−α, i.e.

By estimating 1/λ⩽n, we find that the confidence level α is kept uniformly over the nested models S(n), as long as λn and Δn do not vanish too quickly (cf. condition (9)). Here λn and Δn are the smallest interval length and smallest absolute jump size in S(n) respectively.
3. Implementation
We now explain how SMUCE, i.e. the estimator ϑ̂(q) with maximal likelihood in the confidence set C(q), can be computed efficiently within the dynamic programming framework. In general the algorithm proposed is of complexity O(n²). We shall show, however, that in many situations the computation can be performed much faster.
We first rewrite SMUCE as a solution of a minimization of a complexity-penalized cost function with data-dependent penalty. For this, we shall denote the log-likelihood of ϑ ≡ θ on the observations Yi, …, Yj as

for all θ ∈ Θ.
We call a collection 𝒫 of discrete intervals a partition if its union equals the set {1,…,n}. We denote by β(n) the collection of all partitions of {1,…,n}. For a partition 𝒫 ∈ β(n), let |𝒫| denote the number of discrete intervals in 𝒫. Hence, any discrete step function can be identified with a pair (𝒫, {θI}I∈𝒫), where 𝒫 ∈ β(n) is a partition and θI ∈ Θ is the value taken on the interval I ∈ 𝒫. With this we define the costs of θI on I as
(30)
where we agree that the cost is ∞ if the constraint cannot be met; we stress that the cost of I is infinite if and only if no θI ∈ Θ exists such that the multiscale constraint is satisfied on I. Finally, for an estimator identified with (𝒫, {θI}I∈𝒫) the overall costs are given by

(31)
It is shown below that the computation time amounts to O(n²), given that the minimal costs can be computed in O(1) steps. We now show that each minimizer of expression (31) maximizes the likelihood over the set C(q), if γ>0 is chosen sufficiently large. This γ can be computed explicitly for any given data (Y1,…,Yn) according to the next result.
Lemma 1. Let γ exceed an explicit, data-dependent threshold. Then, any solution of expression (31) is also a solution of expression (6).
For p = 1, …, n let B(p) denote the minimal value of expression (31) restricted to the first p observations and let ϑ̂p be a corresponding minimizer. Clearly, B(n) is the minimal value of expression (31) and ϑ̂n is a minimizer of expression (31). A key ingredient is the following recursion formula (see Friedrich et al. (2008), lemma 1):

Assume that the values B(r) and the minimizers ϑ̂r are given for all r<p⩽n. Then, compute the best previous change point position, i.e.
(32)
and the corresponding minimal value B(p). With this we can iteratively compute the Bellman function B(p) and the corresponding minimizers ϑ̂p for p=1,…,n and eventually obtain ϑ̂n, i.e. a minimizer of expression (31). According to lemma 1, this minimizer solves problem (6) if γ is chosen sufficiently large.
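The recursion can be summarized by the following generic sketch of the Bellman iteration (32). It is a plain, unpruned implementation under the assumption that a user-supplied function returns the minimal cost of fitting one constant segment to a block of observations (∞ if the multiscale constraint cannot be satisfied there); segment_dp and cost_fn are illustrative names, and the penalty γ is charged per segment, which differs from γ·#J(ϑ) only by the additive constant γ.

```r
## Generic dynamic program for expression (31): B(p) = min_r { B(r) + gamma + d_[r+1, p] }.
## cost_fn(i, j) must return the minimal cost of one constant segment on i..j
## (Inf if infeasible). Returns the optimal value and the segment end points.
segment_dp <- function(n, cost_fn, gamma) {
  B    <- c(0, rep(Inf, n))   # B[p + 1] stores the Bellman value B(p), with B(0) = 0
  prev <- integer(n)          # best previous change point position for each p
  for (p in 1:n) {
    for (r in 0:(p - 1)) {
      val <- B[r + 1] + gamma + cost_fn(r + 1, p)
      if (val < B[p + 1]) { B[p + 1] <- val; prev[p] <- r }
    }
  }
  ends <- integer(0); p <- n  # backtrack the right end points of the segments
  while (p > 0) { ends <- c(p, ends); p <- prev[p] }
  list(value = B[n + 1], ends = ends)
}
## Toy usage with plain least-squares costs and a BIC-type penalty:
# y <- c(rnorm(50), rnorm(50, mean = 3))
# segment_dp(length(y), function(i, j) sum((y[i:j] - mean(y[i:j]))^2), 2 * log(length(y)))$ends
```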
We note that, for a practical implementation of the dynamic program proposed, the efficient computation of the minimal costs in expression (30) is essential. We postpone this to the upcoming subsection and discuss the complexity of the algorithm first. Following Friedrich et al. (2008), the dynamic programming algorithm is of order O(n²), given that the minimal costs are computed in O(1) steps. Note that this does not hold true for the costs in expression (30). However, as we shall show in the next subsection, the set of all optimal costs can be computed in O(n²) steps and hence the complete algorithm is of order O(n²) again.
In our implementation the specific structure of the costs (see expression (30)) has been exploited by including several pruning steps in the dynamic program, similarly to Killick et al. (2012). Since the details are rather technical, we give only a brief explanation of why the computation time of the algorithm as described below can be reduced: the speed-ups are based on the idea of considering only such r in expression (32) that may lead to a minimal value, i.e. those r beyond a data-driven cut-off. The number of intervals on which SMUCE is constant is then typically much smaller than the n² intervals that would have to be considered otherwise, and the number of intervals [r,p] which are needed in expression (32) is essentially of the same order. This indicates that SMUCE is much faster for signals with many detected change points than for signals with few detected change points, which has been confirmed by simulations. The pruned algorithm is implemented for the statistical software R in the package stepR (available from http://www.stochastik.math.uni-goettingen.de/smuce; the SMUCE procedure for several exponential families is available via the function smuceR).
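For completeness, here is an illustrative call of the package mentioned above; only the existence of the function smuceR is taken from the text, whereas the simulated toy data, the default arguments and the plotting method used below are assumptions that should be checked against the package documentation.

```r
## Hypothetical usage sketch of the stepR package (interface details assumed).
# install.packages("stepR")
library(stepR)

set.seed(1)
y   <- rnorm(300) + rep(c(0, 2, 0), c(120, 60, 120))  # toy Gaussian step signal
fit <- smuceR(y)                                       # SMUCE with the package defaults

plot(y, pch = 16, col = "grey", xlab = "index", ylab = "y")
lines(fit, col = "red", lwd = 2)                       # plotting method assumed to exist
```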
3.1. Computation of minimal costs
For fixed observations, the function θ ↦ −l(θ) is strictly convex on Θ and has a unique global minimum at the local maximum likelihood estimator if and only if this estimator lies in Θ. In this case it follows from Nielsen (1973), theorem 6.2, that for all q>0 the set of parameters satisfying the local constraint is a non-empty interval; in other words, its end points are the two finite solutions of the equation
(33)
If the local maximum likelihood estimator does not lie in Θ, then Nielsen (1973), theorem 6.2, implies that either the lower or the upper end point is infinite. Let us assume without restriction that Θ=(−∞,θ2) and that the infimum is approached as θ → −∞. In this case, the infimum of θ ↦ −l(θ) is not attained and equation (33) has only one finite solution, which constitutes the upper bound. The lower bound then is trivial.
Given the local bounds for all subintervals r⩽i⩽j⩽p, define their largest lower bound and their smallest upper bound. Hence, if the largest lower bound does not exceed the smallest upper bound, we obtain the minimal costs on [r,p] by minimizing −l over the resulting interval; the multiscale constraint can be satisfied on [r,p] if and only if this interval is non-empty.
To summarize, the computation of the local bounds (and hence the computation of the minimal costs) reduces to finding the non-trivial solutions of equation (33) for all r⩽i⩽j⩽p. This can either be done explicitly (as for the Gaussian family, for example) or approximately, by Newton's method say.
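For the Gaussian family with known variance the two solutions of equation (33) are available in closed form; the following hedged sketch (with an illustrative function name) computes the resulting lower and upper bounds for the feasible values of θ on a single discrete interval.

```r
## Explicit solutions of equation (33) for the Gaussian family: theta is feasible
## on the interval i..j if and only if it lies within `half` of the local mean,
## where half = sigma * (q + sqrt(2 log(e n / len))) / sqrt(len).
gauss_local_bounds <- function(y, i, j, n, q, sigma = 1) {
  len  <- j - i + 1
  m    <- mean(y[i:j])
  half <- sigma * (q + sqrt(2 * log(exp(1) * n / len))) / sqrt(len)
  c(lower = m - half, upper = m + half)
}
# gauss_local_bounds(rnorm(100), i = 11, j = 30, n = 100, q = 1.5)
```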
Given that the solutions of equation (33) for a single interval are computed in O(1) steps, the combined lower and upper bounds for all intervals are computed in O(n²) steps. This follows from the observation that for 1⩽r⩽p⩽n the bounds can be updated recursively when passing from p−1 to p.
3.2. Computation of confidence sets
We finally describe how to compute the confidence intervals I(q) for the change point locations as well as a confidence band B(q)⊂[0,1]×Θ such that, for each estimator in C(q), its change points are contained in the respective intervals and its graph lies within the band. To this end, consider for each k the left-most and right-most possible locations of the kth change point among all functions in C(q) and define
(34)
Then, for every function in C(q), its kth change point is contained in the kth of these intervals, for each k = 1, …, K̂(q).
Now we construct a confidence band B(q) that contains the graphs of all functions in C(q). For this, let the intervals in I(q) be as above and note that every function in C(q) has exactly one change point in each of these intervals and no change point in between. First, assume that t lies between two consecutive intervals of I(q). Then we obtain a lower and an upper bound for the possible function values at t from the smallest and the largest feasible constant value on the corresponding segment respectively. Now let t lie within the kth interval of I(q). Then the kth change point is either to the left or to the right of t and hence any feasible estimator is constant either on the segment adjoining the interval from the left (extended up to t) or on the segment adjoining it from the right. Thus, we obtain a lower bound by taking the minimum and an upper bound by taking the maximum over the feasible values in both cases.
4. On the choice of the threshold parameter
Recall that the probability of overestimating the number of change points is controlled by the null distribution of the multiscale statistic. For the Gaussian case we have shown in Section 2.4 that this result is even non-asymptotic, i.e. from corollary 1 it follows that
(35)
(36)
(37)
- η*=12√{−log(λ*)} and
- √λ*=g(Δ,n),
(38)
The rule (38) picks that q which maximizes the probability bound in expression (36) of correctly estimating the number of change points. Note that this choice does not depend on the true signal ϑ but only on the number of observations n.
Even though the motivation for this choice is built on the assumption of Gaussian observations, simulations indicate that it performs well for other distributions also. That is why we use it, unless stated differently, throughout all simulations. There α(q) is estimated by Monte Carlo simulations with sample size n=3000. These simulations are rather expensive but need to be performed only once. For a given n, a solution of equation (38) may then be approximated numerically by computing the right-hand side for a range of values for q.
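The resulting balancing rule can be summarized by the following generic sketch; alpha_fn and beta_fn are placeholders for a Monte Carlo estimate of the overestimation probability and an assumed upper bound on the underestimation probability (e.g. from theorem 2 with prior values for Δ and λ), both vectorized over the grid of candidate thresholds.

```r
## Pick the threshold q that maximizes the implied lower bound on P(hat K = K).
choose_q <- function(q_grid, alpha_fn, beta_fn) {
  lower_bound <- 1 - alpha_fn(q_grid) - beta_fn(q_grid)
  q_grid[which.max(lower_bound)]
}
```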
We stress again that the general balancing concept underlying equation (38) can be employed further to incorporate prior knowledge of the signal, as will be shown in Section 5.6.
5. Simulations
As mentioned in Section 1., the literature on the change point problem is vast and we shall now aim for comparing our approach within the plethora of established methods for exponential families. All SMUCE instances computed in this section are based on the optimization problem (2), i.e. we do not restrict the interval lengths, as required in Section 2. for technical reasons.
5.1. Gaussian mean regression
In this case the multiscale statistic reads

according to expression (18), SMUCE becomes

(39)
for k=1,…,n, where the function pen(ϑ) penalizes the complexity of the model. This includes the BIC that was introduced in Schwarz (1978) which suggests the choice pen(ϑ)={2#J(ϑ)} log (n). As was for instance stressed in Zhang and Siegmund (2007), the formal requirements to apply the BIC are not satisfied for the change point problem. Instead Zhang and Siegmund (2007) proposed the following penalty function in expression (39), which is denoted as the MBIC:


(40)
In summary, we compare SMUCE with the MBIC approach that was suggested in Zhang and Siegmund (2007), the CBS algorithm (R package available from http://cran.r-project.org/web/packages/PSCBS) proposed in Olshen et al. (2004), the fused lasso algorithm (R package available from http://cran.r-project.org/web/packages/flsa/) that was suggested in Tibshirani et al. (2005), unbalanced Haar wavelets (R package available from http://cran.r-project.org/web/packages/unbalhaar/) (Fryzlewicz, 2007) and the PL oracle as defined above. Since the CBS algorithm tends to overestimate the number of change points, Olshen et al. (2004) included a pruning step which requires the choice of an additional parameter. The choice of this parameter is not explicitly described in Olshen et al. (2004) and here we consider only the unpruned algorithm.
(41)
Following Zhang and Siegmund (2007) we simulate data for σ=0.2 and a=0 (no trend), a=0.01 (long trend) and a=0.025 (short trend) (Fig. 4). Moreover, we included a scenario with a smaller signal-to-noise ratio, i.e. σ=0.3 and a=0, and a scenario with a higher signal-to-noise ratio, i.e. σ=0.1 and a=0. For both scenarios we do not display results with a local trend, since we found the effect to be very similar to the results with σ=0.2. Table 1 shows the frequencies of the number of detected change points for all methods mentioned and the corresponding MISE and mean integrated absolute error (MIAE). Moreover, in Fig. 5 we display a typical observation of model (41) with a=0.1 and b=0.1 and the aforementioned estimators. The results show that SMUCE outperforms the MBIC (Zhang and Siegmund, 2007) slightly for σ=0.2 and, in particular, appears to be less vulnerable to trends. Notably, SMUCE often performs even better than the PL oracle. For σ=0.3 SMUCE has a tendency to underestimate the number of change points by 1, whereas CBS and in particular the MBIC estimate the true number K=6 correctly with high probability. As is illustrated in Fig. 5 this is because SMUCE cannot detect all change points at level 1−α ≈ 0.55, as we have chosen it following the simple rule (38) in Section 4. For further investigation, we lowered the level to 1−α=0.4 (see the last row in Table 1). Even though this improves estimation and SMUCE now performs comparably with CBS and the PL oracle, it is still worse than the MBIC.

Fig. 4: true signals, simulated data and SMUCE reconstructions with confidence bands and confidence intervals for the change points, for (a) a=0, (b) a=0.01 and (c) a=0.025, with σ=0.2.

| Method | Trend | σ | K̂⩽4 | K̂=5 | K̂=6 | K̂=7 | K̂⩾8 | MISE | MIAE |
|---|---|---|---|---|---|---|---|---|---|
| SMUCE (1−α=0.55) | No | 0.1 | 0.000 | 0.000 | 0.988 | 0.012 | 0.000 | 0.00019 | 0.00885 |
| PL oracle | No | 0.1 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.00019 | 0.00874 |
| MBIC (Zhang and Siegmund, 2007) | No | 0.1 | 0.000 | 0.000 | 0.964 | 0.031 | 0.005 | 0.00020 | 0.00888 |
| CBS (Olshen et al., 2004) | No | 0.1 | 0.000 | 0.000 | 0.922 | 0.044 | 0.034 | 0.00023 | 0.00903 |
| Unbalanced Haar wavelets (Fryzlewicz, 2007) | No | 0.1 | 0.000 | 0.000 | 0.751 | 0.137 | 0.112 | 0.00026 | 0.00926 |
| FLcp | No | 0.1 | 0.124 | 0.122 | 0.419 | 0.134 | 0.201 | 0.00928 | 0.15821 |
| FLMSE | No | 0.1 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.00042 | 0.00274 |
| SMUCE (1−α=0.55) | No | 0.2 | 0.000 | 0.000 | 0.986 | 0.014 | 0.000 | 0.00117 | 0.01887 |
| PL oracle | No | 0.2 | 0.024 | 0.001 | 0.975 | 0.000 | 0.000 | 0.00138 | 0.01915 |
| MBIC (Zhang and Siegmund, 2007) | No | 0.2 | 0.000 | 0.000 | 0.960 | 0.037 | 0.003 | 0.00120 | 0.01894 |
| CBS (Olshen et al., 2004) | No | 0.2 | 0.000 | 0.000 | 0.870 | 0.089 | 0.041 | 0.00146 | 0.01969 |
| Unbalanced Haar wavelets (Fryzlewicz, 2007) | No | 0.2 | 0.000 | 0.000 | 0.637 | 0.222 | 0.141 | 0.00174 | 0.02063 |
| FLcp | No | 0.2 | 0.184 | 0.162 | 0.219 | 0.174 | 0.261 | 0.08932 | 0.23644 |
| FLMSE | No | 0.2 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.00297 | 0.03692 |
| SMUCE (1−α=0.55) | Long | 0.2 | 0.000 | 0.000 | 0.825 | 0.171 | 0.004 | 0.00209 | 0.03314 |
| PL oracle | Long | 0.2 | 0.026 | 0.030 | 0.944 | 0.000 | 0.000 | 0.00245 | 0.03452 |
| MBIC (Zhang and Siegmund, 2007) | Long | 0.2 | 0.000 | 0.000 | 0.753 | 0.215 | 0.032 | 0.00214 | 0.03347 |
| CBS (Olshen et al., 2004) | Long | 0.2 | 0.000 | 0.000 | 0.708 | 0.130 | 0.162 | 0.00266 | 0.03501 |
| Unbalanced Haar wavelets (Fryzlewicz, 2007) | Long | 0.2 | 0.000 | 0.000 | 0.447 | 0.308 | 0.245 | 0.00279 | 0.03515 |
| FLcp | Long | 0.2 | 0.078 | 0.112 | 0.219 | 0.215 | 0.376 | 0.08389 | 0.22319 |
| FLMSE | Long | 0.2 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.00302 | 0.03782 |
| SMUCE (1−α=0.55) | Short | 0.2 | 0.000 | 0.002 | 0.903 | 0.088 | 0.007 | 0.00235 | 0.03683 |
| PL oracle | Short | 0.2 | 0.121 | 0.002 | 0.877 | 0.000 | 0.000 | 0.00325 | 0.03846 |
| MBIC (Zhang and Siegmund, 2007) | Short | 0.2 | 0.000 | 0.000 | 0.878 | 0.107 | 0.015 | 0.00238 | 0.03695 |
| CBS (Olshen et al., 2004) | Short | 0.2 | 0.000 | 0.000 | 0.675 | 0.182 | 0.143 | 0.00267 | 0.03806 |
| Unbalanced Haar wavelets (Fryzlewicz, 2007) | Short | 0.2 | 0.000 | 0.000 | 0.602 | 0.225 | 0.173 | 0.00288 | 0.03849 |
| FLcp | Short | 0.2 | 0.175 | 0.126 | 0.192 | 0.210 | 0.297 | 0.08765 | 0.23105 |
| FLMSE | Short | 0.2 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.00331 | 0.04111 |
| SMUCE (1−α=0.55) | No | 0.3 | 0.030 | 0.340 | 0.623 | 0.007 | 0.000 | 0.00660 | 0.03829 |
| PL oracle | No | 0.3 | 0.181 | 0.031 | 0.788 | 0.000 | 0.000 | 0.00505 | 0.03447 |
| MBIC (Zhang and Siegmund, 2007) | No | 0.3 | 0.015 | 0.006 | 0.927 | 0.050 | 0.002 | 0.00364 | 0.03123 |
| CBS (Olshen et al., 2004) | No | 0.3 | 0.006 | 0.019 | 0.764 | 0.157 | 0.054 | 0.00449 | 0.03404 |
| Unbalanced Haar wavelets (Fryzlewicz, 2007) | No | 0.3 | 0.008 | 0.004 | 0.602 | 0.244 | 0.142 | 0.00556 | 0.03792 |
| FLcp | No | 0.3 | 0.038 | 0.059 | 0.088 | 0.115 | 0.700 | 0.08792 | 0.23496 |
| FLMSE | No | 0.3 | 0.531 | 0.200 | 0.125 | 0.078 | 0.066 | 0.09670 | 0.24131 |
| SMUCE (1−α=0.4) | No | 0.3 | 0.000 | 0.099 | 0.798 | 0.089 | 0.000 | 0.00468 | 0.03499 |
- †The true signals, which are shown in Fig. 4, have six change points.
For an evaluation of FLMSE and FLcp one should account for the quite different nature of the fused lasso: the weight λ1 in expression (40) penalizes estimators with large absolute values, whereas λ2 penalizes the cumulated jump height. However, none of them encourages directly sparsity with respect to the number of change points. That is why these estimators often incorporate many small jumps (which is well known as the staircase effect). In comparison with SMUCE we find that SMUCE outperforms FLMSE with respect to MISE and it outperforms FLcp with respect to the frequency of correctly estimating the number of change points. The example in Fig. 6 suggests that the major features of the true signal are recovered by FLMSE. But, additionally, there are also some artificial features in the estimator which suggest that an additional filtering step must be included (see Tibshirani and Wang (2008)).

Fig. 5: a typical data set with the true signal and the reconstructions by (a) SMUCE, (b) the MBIC, (c) unbalanced Haar wavelets, (d) CBS, (e) FLMSE and (f) FLcp.
The unbalanced Haar estimator also has a tendency to include too many jumps, even though the effect is much smaller than for lasso-type methods, i.e. it is much sparser with respect to the number of change points. As this estimator has been primarily designed for estimation of ϑ and not of the jump locations, it performs well with respect to MISE and MIAE.
Again, we note that Table 1 can be complemented by the simulation study in Zhang and Siegmund (2007) which accounts for the classical BIC (Schwarz, 1978) and the method suggested in Fridlyand et al. (2004).
5.2. Gaussian variance regression
We now consider Gaussian observations with fixed mean μ and piecewise constant variance. For simplicity we set μ=0. This constitutes a natural exponential family with natural parameter θ=−(2σ²)^{−1} and ψ(θ)=−log(−2θ)/2 for the sufficient statistic Yi², i=1,…,n. It is easily seen that the multiscale statistic in this case reads

and, with the notation according to expression (18), SMUCE is given by

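As an illustration of the local statistic underlying this family, the following hedged sketch evaluates the Gaussian-variance local log-likelihood ratio on a single discrete interval; the function name is illustrative and not part of the stepR package.

```r
## Local log-likelihood ratio for the Gaussian variance family (mean 0):
## with m2 the local second moment, it equals (len / 2) * (m2/s0 - 1 - log(m2/s0)),
## where s0 is the hypothesized variance sigma0^2.
gaussvar_lr <- function(y, i, j, sigma0sq) {
  len <- j - i + 1
  m2  <- mean(y[i:j]^2)               # local maximum likelihood estimate of the variance
  0.5 * len * (m2 / sigma0sq - 1 - log(m2 / sigma0sq))
}
# gaussvar_lr(rnorm(100, sd = 2), i = 1, j = 50, sigma0sq = 1)
```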
We compare our method with those of Davies et al. (2012) and Höhenrieder (2008). Similarly to SMUCE, they proposed to minimize the number of change points under a multiscale constraint. They additionally restricted their final estimator to coincide with the local maximum likelihood estimator on constant segments. As pointed out by Davies et al. (2012) and Höhenrieder (2008), this may increase the number of change points detected. Following their simulation study we consider test signals σk with k=0,1,4,9,19 equidistant change points and constant values alternating from 1 to 2 (k=1), from 1 to 2 (k=4), from 1 to 2.5 (k=9) and from 1 to 3.5 (k=19). For this simulation the parameters of both procedures are chosen such that the number of changes should not be overestimated with probability 0.9. For each signal we computed both estimates in 1000 simulations. The differences between true and estimated numbers of change points as well as MISE and MIAE are shown in Table 2. Considering the number of correctly estimated change points, SMUCE performs better for few changes (k=1,4,9) and worse for many changes (k=19). This may be explained by the fact that the multiscale test in Davies et al. (2012) does not include a scale calibration and is hence more sensitive on small than on larger scales; see also Section 6.1.1. With respect to MISE and MIAE, SMUCE outperforms its competitor in every scenario, interestingly even for k=19, where the method of Davies et al. (2012) performs better with respect to the estimated number of change points.
| Method | k | K̂−K=−3 | −2 | −1 | 0 | 1 | 2 | 3 | MISE | MIAE |
|---|---|---|---|---|---|---|---|---|---|---|
| SMUCE | 0 | 0.000 | 0.000 | 0.000 | 0.945 | 0.053 | 0.002 | 0.000 | 0.00072 | 0.02040 |
| Davies et al. (2012) | 0 | 0.000 | 0.000 | 0.000 | 0.854 | 0.127 | 0.019 | 0.000 | 0.00093 | 0.02122 |
| SMUCE | 1 | 0.000 | 0.000 | 0.000 | 0.975 | 0.024 | 0.001 | 0.000 | 0.00653 | 0.04295 |
| Davies et al. (2012) | 1 | 0.000 | 0.000 | 0.000 | 0.901 | 0.089 | 0.009 | 0.001 | 0.00935 | 0.04648 |
| SMUCE | 4 | 0.000 | 0.000 | 0.000 | 0.997 | 0.003 | 0.000 | 0.000 | 0.02153 | 0.07967 |
| Davies et al. (2012) | 4 | 0.000 | 0.000 | 0.000 | 0.957 | 0.042 | 0.001 | 0.000 | 0.03378 | 0.09655 |
| SMUCE | 9 | 0.000 | 0.001 | 0.023 | 0.973 | 0.003 | 0.000 | 0.000 | 0.06456 | 0.13206 |
| Davies et al. (2012) | 9 | 0.000 | 0.000 | 0.009 | 0.968 | 0.023 | 0.000 | 0.000 | 0.11669 | 0.18297 |
| SMUCE | 19 | 0.000 | 0.027 | 0.222 | 0.751 | 0.000 | 0.000 | 0.000 | 0.26076 | 0.27468 |
| Davies et al. (2012) | 19 | 0.000 | 0.008 | 0.074 | 0.912 | 0.006 | 0.000 | 0.000 | 0.47105 | 0.40606 |
- †Relative frequencies of the differences between the estimated and the true number of change points for k=0,1,4,9,19 change points as well as MISE and MIAE for both estimators.
5.3. Poisson regression
For Poisson observations the multiscale statistic reads

and, with the notation as in expression (18), SMUCE is given by

In applications (see the example from photoemission spectroscopy below), one is often faced with the problem of low count Poisson data, i.e. when the intensity μ is small. It will turn out that, in this case, data transformation towards Gaussian variables such as variance stabilizing transformations are not always sufficient and it pays off to take into account the Poisson likelihood in SMUCE.
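For reference, the Poisson local log-likelihood ratio that enters the multiscale constraint can be written down directly; the following is a minimal sketch with an illustrative function name.

```r
## Poisson local log-likelihood ratio for testing the intensity mu0 on y[i:j]
## (with the usual convention 0 * log 0 = 0 when the local sum is zero).
poisson_lr <- function(y, i, j, mu0) {
  len <- j - i + 1
  s   <- sum(y[i:j])
  sup_term <- if (s > 0) s * log(s / (len * mu0)) else 0
  sup_term - (s - len * mu0)
}
# poisson_lr(rpois(40, lambda = 3), i = 1, j = 40, mu0 = 2)
```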
In what follows we perform a simulation study where we use a signal with a low count and a spike part (Fig. 7(a)). To evaluate the performance of SMUCE we compare it with the BIC estimator and the PL oracle as described before. Moreover, we included a version of SMUCE which is based on variance stabilizing transformations of the data. For this, we applied the mean matching transformation (Brown et al., 2010) to preprocess the data. We then compute SMUCE under a Gaussian model and retransform the obtained estimator by the inverse mean matching transform. The resulting estimator is referred to as SMUCEmm. Moreover, as a benchmark, we compute the (parametric) ML estimator with K=7 change points, which is referred to as the ML oracle.

Fig. 7: true signal and data together with reconstructions, including SMUCE with confidence bands and confidence intervals for the change points, (d) SMUCEmm and (e) the PL oracle.
Table 3 summarizes the simulation results. As is to be expected, the standard BIC performs far from satisfactorily. We stress that SMUCE clearly outperforms SMUCEmm, which is based on Gaussian transformations. Note that SMUCEmm systematically underestimates the number of change points K=7, which highlights the difficulty of capturing correctly those parts of the signal where the intensity is low. Again, SMUCE performs almost as well as the PL oracle. To complement the results of Table 3 with a visual impression, we illustrate these estimators in Fig. 7.
Table 3: frequencies of the estimated numbers of change points and distance measures for SMUCE, the BIC (Schwarz, 1978) and SMUCE for variance-stabilized signals (SMUCEmm) as well as the PL oracle and ML oracle.
| Method | K̂≤5 | K̂=6 | K̂=7 | K̂=8 | K̂≥9 | MISE | MIAE | Kullback–Leibler |
|---|---|---|---|---|---|---|---|---|
| SMUCE | 0.000 | 0.067 | 0.929 | 0.004 | 0.004 | 0.274 | 0.217 | 0.0187 |
| BIC | 0.000 | 0.000 | 0.080 | 0.094 | 0.920 | 0.575 | 0.313 | 0.0417 |
| SMUCEmm | 0.013 | 0.420 | 0.561 | 0.005 | 0.006 | 0.434 | 0.364 | 0.0418 |
| PL oracle | 0.045 | 0.014 | 0.942 | 0.000 | 0.000 | 0.275 | 0.217 | 0.0185 |
| ML oracle | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.258 | 0.208 | 0.0143 |
5.4. Quantile regression



In other words, we compute the estimate with fewest change points, such that the signs of the residuals fulfil the multiscale test for Bernoulli observations with mean β. The computation of this estimate hence results in the same type of optimization problem as treated in Section 3.1. and we can apply the methodology proposed.
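The Bernoulli local likelihood ratio used for the sign residuals can be sketched as follows; z is assumed to contain the indicators 1{Y_i ⩽ candidate quantile value}, and the function name is illustrative.

```r
## Bernoulli local log-likelihood ratio for testing success probability beta on z[i:j].
bernoulli_lr <- function(z, i, j, beta) {
  m  <- j - i + 1
  s  <- sum(z[i:j])
  ll <- function(p) {
    (if (s > 0) s * log(p) else 0) + (if (s < m) (m - s) * log(1 - p) else 0)
  }
  ll(s / m) - ll(beta)   # supremum over p minus the value at p = beta
}
# z <- as.integer(rnorm(60) <= 0); bernoulli_lr(z, i = 1, j = 60, beta = 0.5)
```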
In what follows we compare this approach with a generalized taut string algorithm (Davies and Kovac, 2001), which was proposed in Dümbgen and Kovac (2009), for estimating quantile functions. The estimate is constructed in such a way that it minimizes the number of local extreme values among a specified class of functions. Here, a local extreme value is either a local maximum or a local minimum.
In contrast with SMUCE, the number of change points is not penalized there. In a simulation study Dümbgen and Kovac (2009) showed that their method is particularly suitable for detecting local extremes of a signal. We follow this idea and repeat their simulations; see also Fig. 8. The results, which also include the estimated number of change points, are shown in Table 4. It can be seen that the generalized taut string estimates the number of local extremes slightly better than SMUCE, whereas it overestimates the number of change points for n=2048 and n=4096. This may be explained by the fact that the generalized taut string is designed primarily to have few local extremes rather than few change points.

Fig. 8: estimators for the median and for the 0.1- and 0.9-quantiles obtained from SMUCE and from the generalized taut string.
| Method | n | Local extremes (β=0.5) | Local extremes (β=0.1) | Local extremes (β=0.9) | Change points (β=0.5) | Change points (β=0.1) | Change points (β=0.9) |
|---|---|---|---|---|---|---|---|
| SMUCE | 512 | 3 (5.9) | 1 (7.9) | 2 (7.4) | 5 (5.8) | 2 (9.1) | 3 (8.3) |
| Generalized taut string | 512 | 3 (6.0) | 3 (6.6) | 3 (6.6) | 12 (2.0) | 6 (4.9) | 7 (4.0) |
| SMUCE | 2048 | 9 (0.4) | 4 (5.4) | 3 (5.8) | 11 (0.1) | 6 (5.2) | 5 (5.9) |
| Generalized taut string | 2048 | 9 (0.7) | 5 (4.0) | 3 (5.7) | 26 (15.3) | 18 (7.1) | 16 (5.7) |
| SMUCE | 4096 | 9 (0.1) | 4 (4.3) | 5 (4.5) | 11 (0.2) | 8 (3.1) | 6 (4.8) |
| Generalized taut string | 4096 | 9 (0.0) | 6 (3.1) | 3 (5.3) | 35 (24.1) | 25 (13.8) | 21 (9.9) |
- †Medians of local extreme values and numbers of change points of the estimators and mean absolute difference (in parentheses) to the true numbers of local extremes and change points. The true number of local extremes equals 9 and the true number of change points equals 11.
5.5. On the coverage of confidence sets I(q)
In Section 2.6. we gave asymptotic results on the simultaneous coverage of the confidence sets I(q) as defined in expression (28). In our simulations we choose q=q1−α to be the (1−α)‐quantile of M as in expression (15). It then follows from corollary 3 that asymptotically the simultaneous coverage is larger than 1−α. We now investigate empirically the simultaneous coverage of I(q1−α). For this, we consider the test signals that are shown in Fig. 9 for Gaussian observations with varying mean, Gaussian observations with varying variance, Poisson observations and Bernoulli observations.

Fig. 9: test signals and SMUCE reconstructions with confidence bands and confidence intervals for change points.
Table 5 summarizes the empirical coverage for various values for α and n obtained by 500 simulation runs each and the relative frequencies of correctly estimated change points, which are given in parentheses. The results show that for n=2000 the empirical coverage exceeds 1−α in all scenarios. The same is not true for smaller n (indicated by italics), since here the number of change points is misspecified quite frequently (see the numbers in parentheses). Given that K has been estimated correctly, we find that the empirical coverage of bands and intervals is in fact larger than the nominal 1−α for all simulations.
| n | 1−α | Gaussian (mean) | | | Gaussian (variance) | | | Poisson | | | Bernoulli | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 0.8 | 0.59 | 0.64 | 0.92 | 0.66 | 0.68 | 0.97 | 0.87 | 0.89 | 0.98 | 0.85 | 0.90 | 0.94 |
| 0.9 | 0.48 | 0.49 | 0.98 | 0.39 | 0.39 | 1.00 | 0.85 | 0.86 | 0.99 | 0.86 | 0.86 | 0.99 | |
| 0.95 | 0.28 | 0.28 | 1.00 | 0.16 | 0.18 | 0.93 | 0.71 | 0.74 | 0.96 | 0.66 | 0.70 | 0.94 | |
| 1500 | 0.8 | 0.84 | 0.90 | 0.93 | 0.87 | 0.88 | 0.98 | 0.92 | 0.95 | 0.96 | 0.93 | 0.97 | 0.96 |
| 0.9 | 0.73 | 0.74 | 0.98 | 0.72 | 0.74 | 0.97 | 0.95 | 0.97 | 0.98 | 0.96 | 0.97 | 0.99 | |
| 0.95 | 0.55 | 0.56 | 0.98 | 0.45 | 0.47 | 0.98 | 0.92 | 0.93 | 0.99 | 0.89 | 0.90 | 0.99 | |
| 2000 | 0.8 | 0.94 | 0.99 | 0.95 | 0.98 | 1.00 | 0.98 | 0.95 | 0.99 | 0.95 | 0.96 | 0.99 | 0.97 |
| 0.9 | 0.98 | 1.00 | 0.98 | 0.99 | 1.00 | 0.99 | 0.96 | 0.99 | 0.96 | 0.97 | 0.99 | 0.98 | |
| 0.95 | 0.99 | 1.00 | 0.99 | 0.97 | 0.99 | 0.98 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | |
- †For each choice of α and n we computed the simultaneous coverage of I(q) as in expression (28) (first value), the percentage of correctly estimated numbers of change points (second value) and the simultaneous coverage of confidence bands and intervals for the change points given that the number of change points was estimated correctly (third value).
5.6. Real data results
In this section we analyse two real data examples. The examples show the variety of possible applications for SMUCE. Moreover, we revisit the issue of choosing q as proposed in Section 4. and illustrate its applicability to the present tasks.
5.6.1. Array comparative genomic hybridization data
Array CGH data show aberrations in genomic DNA. The observations consist of the log‐ratios of normalized intensities from disease and control samples. The statistical problem at hand is to identify regions on which the ratio differs significantly from 0 (which corresponds to a gain or a loss). These are often referred to as aberration regions.
A thorough overview of the topic and a comparison of several methods is given in Lai et al. (2005). We compute SMUCE for two data sets studied in Lai et al. (2005) and more recently in Du and Kou (2012) and Tibshirani and Wang (2008). The data sets show the array CGH profile of chromosome 7 in GBM29 and chromosome 13 in GBM31 (see also again Du and Kou (2012) and Lai et al. (2005)).
By means of these two data examples we illustrate how the theory developed in Section 2. can be used for applications. As was stressed in Lai et al. (2005) many algorithms in change point detection strongly depend on the proper choice of a tuning parameter, which is often a difficult task in practice. We point out that our proposed choice of the threshold parameter q has in fact a statistically meaningful interpretation as it determines the level of the confidence set C(q). Moreover, we shall emphasize the usefulness of confidence bands and intervals for array CGH data.
We first consider the GBM29 data. To choose q according to the procedure suggested in expression (37), assumptions on λ and Δ must be imposed.
As mentioned above, log-ratios of copy numbers may take a finite number of values which are approximately {log(1), log(3/2), log(2), log(5/2),…}. It therefore seems reasonable to assume that the smallest jump size is Δ=log(3/2).
Moreover, we choose λ⩾0.2. We stress that the final solution of SMUCE will not be restricted to these assumptions. They enter as prior assumptions for the choice of q. If the data indicate strongly that these assumptions do not hold SMUCE will adapt to this.
In Fig. 10(a) we depict the probability of overestimating the number of change points as a function of q (the decreasing broken curve) and the probability of underestimating the number of change points as a function of q (the increasing broken curve) under the above stated assumptions on λ and Δ. We may interpret the plot in the following way. It provides a tool for finding jumps of minimal height Δ=log(3/2) on scales of at least λ=0.2. For the optimized q* we obtain that the number of jumps is misspecified with probability less than 0.35. For the corresponding estimate see Fig. 10.

Fig. 10: (a) probabilities of overestimating or underestimating the number of change points in dependence on q (x-axis) and their sum, (b) detected change points with confidence intervals for various values of α (left-hand y-axis) with the probability of underestimation (right-hand y-axis) and (c) SMUCE computed for the optimal q*≈1.1 with confidence bands and confidence intervals for change points.
Moreover, we display SMUCE for different choices of q. Fig. 10(b) shows the estimated change points with their confidence intervals. Bounds for the probability that K is overestimated can be found on the left‐hand axis, and bounds for underestimation on the right‐hand axis.
Note from Fig. 10(b) that SMUCE is quite robust with respect to q=q1−α. For α ∈ [0.2,0.7] SMUCE always detects exactly seven change points in the signal. The results show that a jump of size approximately Δ is found in the data on an interval, whose length is even slightly smaller than λ. However, SMUCE can also detect larger aberrations on smaller intervals, which makes it quite robust against wrong choices of Δ and λ.
Recall that one goal in array CGH data analysis is to determine segments on which the signal differs from 0. The confidence sets in Fig. 10(c) indicate three intervals with signal different from 0. Moreover, as indicated by the blue arrows, the change point locations are detected very precisely. The estimator actually suggests one more change point in the data; however, the confidence bands show that there is only weak evidence for the signal to be non‐zero there. Further, the confidence bands may be used to decide which segments belong to the same copy number event. In this particular example they suggest that the three segments belong to the same copy number event, i.e. have the same mean value.
Put differently, not only is an estimator for the true signal obtained, but three regions of aberration are also detected, together with simultaneous confidence intervals for the signal's value on these regions at level 1−α=0.9. This is in accordance with others' findings (Du and Kou, 2012; Lai et al., 2005).
The same procedure as above is repeated for the GBM31 data as shown in Fig. 11. For the bounds on underestimating the number of change points we assumed again that Δ⩾ log (3/2) and chose λ⩾0.025. Fig. 11 shows that, for Δ⩾ log (3/2) and the sample size n=797, the probability of misspecification can be bounded by approximately 0.12 for the minimal length λ=0.025, which corresponds to 19 observations. Using the same reasoning as above we identify one large region of aberration and obtain a confidence interval for the corresponding change point as well as for the signal's value. Here, the optimized q*≈1.7 in the sense of expression (38) gives α≈0.04, which yields SMUCE with one jump with high significance.

Fig. 11. (a) Probabilities of overestimating or underestimating the number of change points in dependence on q (x‐axis) and their sum, (b) detected change points with confidence intervals for various values of α (left‐hand y‐axis) with the probability of underestimation (right‐hand y‐axis) and (c) SMUCE computed for the optimal q*≈1.7 with confidence bands and confidence intervals for change points
5.6.2. Photo‐emission spectroscopy
Electron emission from nanostructures triggered by ultrashort laser pulses has numerous applications in time‐resolved electron imaging and spectroscopy (Ropers et al., 2007). In addition, it holds promise for fundamental insight into electron correlations in microscopic volumes, including antibunching (Kiesel et al., 2002). Single‐shot measurements of the number of electrons emitted per laser pulse (Bormann et al., 2010; Herink et al., 2012) will allow for the disentanglement of various competing processes governing the electron statistics, such as classical fluctuations, Pauli blocking and space charge effects.
We apply the SMUCE approach to photo‐emission spectroscopy data displayed in Fig. 12(c): a time series of electron numbers recorded from a photo‐emission spectroscopy experiment that was performed in the Ropers laboratory (Department of Biophysics, University of Göttingen; see Bormann et al. (2010)). It is customary to model photo‐emission spectroscopy data by Poisson regression with unknown intensity. This intensity is known to show long‐term fluctuations which correspond to variations in laser power and laser beam pointing; these cannot be controlled in the experiment and typically lead to an overall overdispersion effect. On a short timescale, however, the interesting task is to investigate underdispersion in the distribution. Such underdispersion would indicate an electron interaction in which the emission of one (or a few) electrons decreases the likelihood of further emission events. Specifically, significant underdispersion in the single‐shot electron number histogram would evidence an anticorrelation caused by electrons being fermions that obey the Pauli exclusion principle. To reflect the large‐scale fluctuations, a piecewise constant mean that models sudden changes in the laser intensity is used to segment the data; underdispersion or overdispersion can then be investigated within these segments.

Fig. 12. (a) Estimated change points with confidence intervals for various values of α, (b) SMUCE with confidence bands, confidence intervals for the change points and binned photo‐emission spectroscopy data and (c) ML estimator with 10 change points
Fig. 12(a) shows the estimated change points of SMUCE (and the corresponding confidence intervals) for α=0.05,0.1,…,0.9. We also display SMUCE with confidence bands for α=0.9 (Fig. 12(b)) and, for comparison, the ML estimator with 10 change points (Fig. 12(c)). Note that the ML estimator is computed without the additional multiscale constraint, in contrast with SMUCE. Remarkably, this results in a different estimator.
We estimate the dispersion of data Y1,…,Ym by the variance‐to‐mean ratio d̂ = s²/Ȳ, where Ȳ = m⁻¹Σi Yi and s² = (m−1)⁻¹Σi (Yi−Ȳ)². In Table 6, d̂ is shown for the whole data set as well as for the segments that are identified by SMUCE. It can be seen that our segmentation allows us to explain the overall overdispersion to a large extent by the long‐term fluctuations. However, the results in Table 6 do not indicate significant underdispersion on any of the segments identified. This may be explained by a masking effect due to fluctuations of the emission current. Future experiments using more stable emission currents are under way.
Table 6. Dispersion d̂ of the whole data set and on the segments identified by SMUCE
| Segment | d̂ |
|---|---|
| Overall | 1.02 |
| 1 | 0.98 |
| 2 | 1.02 |
| 3 | 0.98 |
| 4 | 1.04 |
| 5 | 1.01 |
| 6 | 1.04 |
| 7 | 0.98 |
| 8 | 1.03 |
| 9 | 0.99 |
| 10 | 0.98 |
| 11 | 1.05 |
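As an illustrative sketch (in Python, with hypothetical helper names; the actual segmentation would be provided by SMUCE), the variance‐to‐mean ratio can be computed for the whole series and for each segment as follows. On a toy Poisson series whose mean jumps half‐way through, the overall index is inflated by the unmodelled mean change, whereas the per‐segment indices stay near 1, which is the effect described above.

```python
import numpy as np

def dispersion_index(y):
    """Sample dispersion index s^2 / ybar of counts y (equals 1 for ideal Poisson data)."""
    y = np.asarray(y, dtype=float)
    return y.var(ddof=1) / y.mean()

def segment_dispersion(y, change_points):
    """Dispersion index for the whole series and for each segment.

    `change_points` are indices at which a new segment starts (0 < cp < len(y));
    in practice these would be the change points estimated by SMUCE.
    """
    bounds = [0] + sorted(change_points) + [len(y)]
    per_segment = [dispersion_index(y[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
    return dispersion_index(y), per_segment

# toy example: Poisson counts whose mean jumps half-way through the series
rng = np.random.default_rng(0)
y = np.concatenate([rng.poisson(5.0, 500), rng.poisson(9.0, 500)])
overall, per_seg = segment_dispersion(y, [500])
print(f"overall {overall:.2f}, per segment {[round(d, 2) for d in per_seg]}")
```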
6. Discussion
6.1. Dependent data
So far the theoretical justification for SMUCE relies on the independence of the data in model (1) (see Section 2), as do, for example, the optimal power results in Section 2.5. We claim, however, that SMUCE as introduced in this paper can be extended to piecewise constant regression problems with serially dependent data. A comprehensive discussion is beyond the scope of this paper and will be addressed in future work. Here, we confine ourselves to the case of a Gaussian moving average process of order 1; a similar strategy has been applied in Hotz et al. (2012) for m‐dependent data.
6.1.1. Example 1
We consider a Gaussian moving average MA(1) model for the errors in model (1) and aim to adapt SMUCE to this situation. Following the local likelihood approach underlying the multiscale constraint in problem (2), one might simply replace the local statistics in expression (3) by modified local statistics as in equation (42); the modification is motivated by the covariance structure of the MA(1) errors. Under the null hypothesis the modified local statistics then marginally have the same distribution as the statistics in expression (4) for independent Gaussian observations.
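The kind of variance calculation that motivates such a modification can be made concrete (this is an illustration only, not the exact modified statistic of equation (42)): for MA(1) errors of the form εi = ηi + βηi−1 with var(ηi)=σ², a parameterization consistent with the correlations ρ=β/(1+β²) quoted below, the variance of a sum of k consecutive observations is σ²{k(1+β)²−2β} ≈ kσ²(1+β)² for large k, so the independent‐case standardization understates the noise level by roughly the factor 1+β. A short Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, beta, k, reps = 1.0, 0.3, 50, 20000

# simulate sums of k consecutive MA(1) observations, eps_i = eta_i + beta * eta_{i-1}
eta = rng.normal(0.0, sigma, size=(reps, k + 1))
eps = eta[:, 1:] + beta * eta[:, :-1]
empirical = eps.sum(axis=1).var()

theoretical = sigma**2 * (k * (1 + beta) ** 2 - 2 * beta)   # exact variance of the sum
print(f"empirical {empirical:.1f}, exact {theoretical:.1f}, naive k*sigma^2 = {k * sigma**2:.1f}")
```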

To apply the modified procedure, quantiles of the null distribution of the modified multiscale statistic are required. For this, we used Monte Carlo simulations for a sample size of n=500. We reconsider the test signal from Section 5.1 with σ=0.2 and a=0. The empirical null distribution of the modified statistic and a probability–probability plot of the null distribution of Tn against it are shown in Fig. 13. For β=0.1 and β=0.3, which correspond to correlations of ρ=0.1 and ρ=0.27, we ran 1000 simulations each. We computed the modified SMUCE, based on the local statistics in equation (42), and SMUCE for independent Gaussian observations. For both procedures we chose q to be the 0.75‐quantile of the respective null distribution. The results are shown in Table 7. For β=0.1 both procedures perform similarly, which indicates that SMUCE is robust to such weak dependence, whereas for β=0.3 the modified version performs much better with respect to the estimated number of change points.
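As an illustration of how such null quantiles can be approximated by simulation, the following sketch computes, for independent standard Gaussian noise, a simplified penalized multiscale statistic of the form max over all intervals of |partial sum|/√length − √{2 log (en/length)} (a simplified stand‐in for Tn, in the spirit of the Dümbgen–Spokoiny calibration discussed in Section 6.2) and reports the empirical 0.75‐quantile; for the MA(1) case the correlated noise would be simulated and the modified local statistics used instead.

```python
import numpy as np

def multiscale_stat(y):
    """Simplified penalized multiscale statistic for standardized noise y."""
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum(y)))
    best = -np.inf
    for length in range(1, n + 1):
        sums = cs[length:] - cs[:-length]                   # all sums of `length` consecutive points
        local = np.abs(sums) / np.sqrt(length)              # absolute standardized partial sum
        penalty = np.sqrt(2.0 * np.log(np.e * n / length))  # scale calibration term
        best = max(best, local.max() - penalty)
    return best

rng = np.random.default_rng(2)
n, reps = 500, 200          # a small number of repetitions to keep the sketch fast
stats = [multiscale_stat(rng.standard_normal(n)) for _ in range(reps)]
print("estimated 0.75-quantile of the null distribution:", np.quantile(stats, 0.75))
```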

Table 7. Frequencies of the estimated numbers of change points, MISE and MIAE for the modified SMUCE and SMUCE under MA(1) dependence

| Method | β | 5 | 6 | 7 | 8 | ≥9 | MISE | MIAE |
|---|---|---|---|---|---|---|---|---|
| Modified SMUCE | 0.1 | 0.02 | 0.98 | 0.00 | 0.00 | 0.00 | 0.00154 | 0.02104 |
| SMUCE | 0.1 | 0.00 | 0.95 | 0.04 | 0.00 | 0.00 | 0.00142 | 0.02117 |
| Modified SMUCE | 0.3 | 0.27 | 0.73 | 0.00 | 0.00 | 0.00 | 0.00435 | 0.03084 |
| SMUCE | 0.3 | 0.00 | 0.29 | 0.34 | 0.24 | 0.13 | 0.00277 | 0.03229 |
The example illustrates that SMUCE as in problem (2) can be successfully applied to dependent data after an adjustment of the underlying multiscale statistic Tn to the dependence structure. The asymptotic null distribution of this modified multiscale statistic is certainly not obvious; its derivation is postponed to future work.
6.2. Scale calibration of Tn
The penalization of different scales as in expression (3) is borrowed from Dümbgen and Spokoiny (2001) and calibrates the number of intervals on a given scale. This prevents the small intervals from dominating the statistic. For comparison, we also consider an alternatively calibrated version of the statistic and the uncalibrated statistic in Fig. 14. The graphic shows the frequencies at which the corresponding 0.75‐quantiles of the statistic Tn, its alternatively calibrated version and its uncalibrated version are exceeded at a certain scale (scales are displayed on the x‐axis). It can be seen that the uncalibrated statistic puts much emphasis on small scales, whereas the penalized statistics distribute the scales more uniformly. For our purposes this calibration is beneficial in two ways: first, it is required to obtain the optimal detection rates in theorem 6, as shown in Chan and Walther (2013); second, the asymptotic behaviour is determined by a process of the type (16) and not by an extreme value limit, as would be expected in the uncalibrated case, where the maximum is attained at scales of magnitude log (n) with high probability (see Kabluchko and Munk (2009), theorem 3.1, and the proof of theorem 1.1), in accordance with Fig. 14.

Fig. 14. Frequencies at which the 0.75‐quantiles of Tn, its alternatively calibrated version and its uncalibrated version are exceeded, obtained from 10000 simulations, on certain scales (the scales are on the x‐axis)
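A minimal simulation in the spirit of Fig. 14, reusing the simplified Gaussian statistic sketched in Section 6.1.1 (so only illustrative): it records, with and without the additive penalty, the interval length at which the maximum is attained under the null; without the calibration the maximizing scale concentrates on short intervals.

```python
import numpy as np

def argmax_scale(y, penalized):
    """Interval length at which the (un)penalized local statistic attains its maximum."""
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum(y)))
    best_val, best_len = -np.inf, 0
    for length in range(1, n + 1):
        local = np.abs(cs[length:] - cs[:-length]) / np.sqrt(length)
        val = local.max() - (np.sqrt(2 * np.log(np.e * n / length)) if penalized else 0.0)
        if val > best_val:
            best_val, best_len = val, length
    return best_len

rng = np.random.default_rng(3)
n, reps = 500, 200
uncal = [argmax_scale(rng.standard_normal(n), penalized=False) for _ in range(reps)]
cal = [argmax_scale(rng.standard_normal(n), penalized=True) for _ in range(reps)]
print("median maximizing length, uncalibrated:", int(np.median(uncal)))
print("median maximizing length, calibrated:  ", int(np.median(cal)))
```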
6.3. SMUCE from a linear models perspective

6.4. Risk measures
SMUCE aims to maximize the probability of correctly specifying the number of jumps, uniformly over sequences of models in which the signal does not vanish as fast as log (n)/n. This is conceptually very different from optimizing the estimator with respect to convex risk measures such as the MSE and related concepts, which do not primarily target the jump locations and number of jumps. Therefore, we argue that in applications where the primary focus is on the jump locations SMUCE may be advantageous. In fact, maximizing the probability of correctly estimating the number of jumps, as SMUCE advocates, has some analogy with risk measures for variable selection problems, which have been shown to perform successfully in high dimensional models. These include the false discovery rate (Benjamini and Hochberg, 1995) and related ideas (see for example Genovese and Wasserman (2004)). Whereas classical false discovery rates aim to control the expected relative number of wrongly selected change points, SMUCE can at the same time give a guarantee that the true change points will be detected with large probability and hence controls the false acceptance rate as well.
6.5. Computational costs
Killick et al. (2012) showed that their pruned exact linear time method leads to an algorithm whose expected complexity is linear in n in some cases. As stressed in Section 3, our algorithm includes similar pruning steps. Owing to the complicated structure of the cost functional, however, it seems impossible to prove such a result for the computation of SMUCE. The computational cost can, of course, be reduced significantly further if, for example, only intervals of dyadic lengths are incorporated in the multiscale statistic. Since the dynamic programming approach leads to a recursive computation, SMUCE can be updated in linear time if applied to sequential data. Another interesting strategy to reduce the computational costs could be adapted from Rivera and Walther (2012) and Walther (2010), who suggested restricting the multiscale constraint to a specific, much smaller system of intervals which still guarantees optimal detection.
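To quantify the reduction obtained by keeping only dyadic lengths, a small counting sketch (dyadic lengths are used here purely as an example of thinning the interval system): the full system contains about n²/2 intervals, whereas dyadic lengths leave on the order of n log₂ n.

```python
import math

def count_intervals(n, dyadic_only=False):
    """Number of discrete intervals in a series of length n, optionally restricted to dyadic lengths."""
    lengths = (2 ** k for k in range(int(math.log2(n)) + 1)) if dyadic_only else range(1, n + 1)
    return sum(n - length + 1 for length in lengths if length <= n)

n = 4096
print("all intervals:  ", count_intervals(n))                     # ~ n^2 / 2
print("dyadic lengths: ", count_intervals(n, dyadic_only=True))   # ~ n log2(n)
```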
6.6. Choice of α
We have offered a strategy to select the threshold q=qα, and hence the confidence level α, in a sensible way: the probability of misspecifying K is minimized by balancing the probabilities of overestimation and underestimation of K simultaneously. This is based on the inequalities in Section 4, which depend on λ, Δ and n. As indicated in Figs 1, 10, 11 and 12, one can also consider the evolution of SMUCE depending on α as a universal ‘objective’ smoothing parameter. The features (jumps) of each SMUCE given α may then be regarded as ‘present with certain confidence’, similar in spirit to the ideas underlying the SiZer algorithm (see Chaudhuri and Marron (1999, 2000)). It is striking that in many simulations we found that features (jumps) remain persistent over a large range of levels α. Of course, other strategies to balance the probabilities of overestimation and underestimation are of interest, e.g. if one of these probabilities is considered less important. For a first screening of jumps, overestimation is the less serious error and the probability of underestimation should be minimized primarily. This can be achieved by optimizing a convex combination of the two error probabilities, with weight δ on the probability of underestimation, for δ close to 1, along the lines described in Section 4.
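A schematic sketch of this balancing step: given a bound over(q) for the probability of overestimation that decreases in q and a bound under(q) for underestimation that increases in q (the curves below are purely hypothetical placeholders for the bounds of Section 4), the weighted criterion δ·under(q)+(1−δ)·over(q) is minimized over a grid of thresholds; a weight δ close to 1 pushes the choice towards smaller q, i.e. towards guarding against missed jumps.

```python
import numpy as np

# Hypothetical monotone bounds on the two error probabilities as functions of the
# threshold q; in practice these come from the inequalities of Section 4
# (depending on lambda, Delta and n), not from the toy curves used here.
def prob_over(q):
    """Placeholder bound on P(overestimating K); decreasing in q."""
    return np.exp(-q)

def prob_under(q):
    """Placeholder bound on P(underestimating K); increasing in q."""
    return 1.0 - np.exp(-0.3 * q)

def balanced_threshold(delta, q_grid=np.linspace(0.0, 5.0, 501)):
    """Threshold minimizing delta * P(under) + (1 - delta) * P(over) over a grid."""
    risk = delta * prob_under(q_grid) + (1.0 - delta) * prob_over(q_grid)
    return q_grid[np.argmin(risk)]

print("balanced choice (delta = 0.5):", balanced_threshold(0.5))
print("screening choice (delta = 0.9):", balanced_threshold(0.9))
```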
Acknowledgements
Klaus Frick, Axel Munk and Hannes Sieling were supported by Deutsche Forschungsgemeinschaft–Schweizerischer Nationalfonds grant FOR 916. Axel Munk was also supported by CRC 803, CRC 755 and the Volkswagen Foundation. This paper benefited from discussions with colleagues. We specifically acknowledge L. D. Brown, T. Cai, L. Davies, L. Dümbgen, E. George, C. Holmes, T. Hotz, S. Kou, O. Lepski, R. Samworth, D. Siegmund, A. Tsybakov and G. Walther. Various helpful comments and suggestions of the screeners and reviewers for the journal are gratefully acknowledged.
References
Discussion on the paper by Frick, Munk and Sieling
Idris A. Eckley (Lancaster University)
We owe thanks to Frick, Munk and Sieling for a most interesting and stimulating paper which so thoroughly addresses the problem at hand. Not only have they explored the area comprehensively from a theoretical perspective, but they also address several important practical considerations and make open source code available implementing the proposed methodology so that others may explore its potential. I commend them for so doing.
To give some context for those who are less familiar with change point analysis, this area has been undergoing a revival in recent years, not just within the statistics community but also in areas as diverse as the environmental sciences, econometrics, biology, geosciences and linguistics. Work developing the field of change point analysis is therefore of potentially significant importance in many different disciplines, and the contribution of this paper is very timely.
Within change point analysis one of the key topics is the development of accurate and efficient methods for multiple change point detection. Typically such search methods focus on the identification of the number and location of multiple change points. Historically such approaches have been either fast but approximate (e.g. the binary segmentation approach that was proposed by Scott and Knott (1974) and related variants) or exact but slow (e.g. the segment neighbourhood algorithm that was proposed by Auger and Lawrence (1989)). More recently various researchers have focused on developing more computationally efficient exact search methods; see for example Rigaill (2010) or Killick et al. (2012).
Moreover, en route they establish several key results, including
- (a) the (asymptotic) probability of overestimating the number of change points and
- (b) the probability of underestimating the number of change points for any sample size.
Taken together, these results are particularly attractive since they mean that, under a suitable choice of q, the probability of misspecifying the number of change points tends to 0.
Naturally a few questions came to mind on reading the paper. In Section 1.3 you mention that potentially several strategies exist to select the threshold q and outline an approach in Section 4 which takes into account prior information about the true signal ϑ. Specifically the approach depends on λ, the minimal interval length, and η, a minimal feature size. In a great many contexts this seems a very pragmatic approach. However, what other approaches to threshold selection did you consider and what was the motivation for choosing the approach described over these others?
As in several other change point search methods, the penalization approach assumes intersegment independence, i.e. that the parameter value in one segment is independent of those in other segments. How could SMUCE be adapted to deal with intersegment dependence?
Finally, I have a question motivated by some recent applications which we have encountered. In many cases one might see changes in multiple parameters (e.g. change in mean and variance) occurring in a time series. Indeed, a priori, a practitioner might not know whether they are dealing with a single‐ or multiple‐parameter change point problem. Consider by way of example the simulated data displayed in Fig. 15(a). In particular my concern here is that a practitioner may inadvertently assume from casually eyeballing the data that only one parameter is changing when in fact multiple parameters are changing. In so doing, they might then (mis)apply the single‐parameter SMUCE method and consequently falsely infer the change point structure within the time series (the full horizontal line within Fig. 15(b)). The truth is actually quite different from this since the data contain changes in both mean and variance (see the vertical lines in Fig. 15(b)). Hence, what scope is there to extend SMUCE, at least methodologically, to address such multi‐parameter scenarios?

Fig. 15. (a) Simulated data and (b) inferred change point structure with true change point locations (change in mean; change in variance)
In summary, the work which Frick, Munk and Sieling present is that most powerful of combinations: both mathematically appealing and of considerable applied value. In particular it addresses several key change point facets which practitioners have been seeking for some time. As with any good paper read to the Society, this contains much thought‐provoking material and highlights several interesting avenues for future research. I therefore have great pleasure in proposing the vote of thanks.
Arne Kovac (University of Bristol)
Maximizing simplicity under multiresolution constraints is an extremely powerful technique in non‐parametric settings such as regression or image analysis. As such it is a pleasure to welcome this paper to the Society. Previously, this concept was exploited in particular in the case where simplicity is measured by the number of local extreme values. There, consistency of the number and locations of extreme values and rates of convergence can be derived (Davies and Kovac, 1997), as well as confidence bounds (Davies et al., 2009). In this paper the authors look at the situation where simplicity is measured by the number of break points in estimating a piecewise constant function. This problem was studied before by Höhenrieder (2010) and his solution to the problem shows remarkable similarity to the SMUCE approach.

If my calculations are correct then, for large j−i+1, the bounds qn will satisfy to first order yn≈√{2 log (n)}; for this to hold, (j−i+1)/{log (n)}³→0. Putting this together, the results obtained by the authors hold for any Yl with a moment‐generating function, as long as the multiscale constraints apply to sums of the Yl.
Finally a word on scale calibration: whatever method is used to determine the upper bounds qn it cannot be that one set of bounds is uniformly lower than the other if both have been calibrated for the same α. The decision must be made by the data owner. Indeed it is possible to adjust the bounds to accommodate any requirements that the owner may have.
I found this paper interesting to read and therefore it gives me great pleasure to second the vote of thanks for this paper.
The vote of thanks was passed by acclamation.
Frank Critchley (The Open University, Milton Keynes)
In warmly welcoming this paper, I would like to ask two questions and to indicate two possible types of extension.
Scale calibration works well as a form of first‐order correction. Given this, (when) might there be scope for some higher order analogue—e.g. adjusting for variance as well as bias?
Intuitively, it seems that sensitivity to prior specification of (λ,Δ) will typically be high. Is this your experience and, if so, what guidelines might you suggest? In particular, can your methodology be adapted to be more robust to the choice of prior, or is such sensitivity intrinsic?
Turning to possible extensions, in some applications the successive changes may be expected to follow a particular pattern, such as
- (a) monotonicity (in either direction)—reflecting, for example, in a reliability context, ‘things can only get worse’;
- (b) changes with alternating signs—reflecting, for example, compensating departures from equilibrium.
Can procedures be formulated to test for such patterns (say, within the general alternative of any pattern)? Relatedly, can diagnostics be developed to check (potentially, on line) for departures from them?
Secondly, might there be scope to adapt your methodology for use in model checking—most obviously, by applying it with the {Yi} as (regression) residuals? This requires, of course, that (at least mild) dependence among the {Yi} be permitted—raising the related question of extensions towards, say, time series and/or spatial data. To this end, (analogues or extensions of) Theil's best linear unbiased scalar residuals may have a role to play, these being uncorrelated and homoscedastic under the model.
All this is to thank you for a very stimulating and well‐written paper.
Yining Chen, Rajen D. Shah and Richard J. Samworth (University of Cambridge)
We congratulate the authors for their stimulating paper; our comments focus on possible avenues for further development.
A Gaussian quasi‐SMUCE (GQSMUCE)

The number of change points can still be estimated by solving an analogous optimization problem. An analogue of theorem 2 can also be proved, establishing model selection consistency.
Frequencies of the estimated numbers of change points for GQSMUCE and CBS†

| Method | σ | ⩽4 | 5 | 6 | 7 | ⩾8 |
|---|---|---|---|---|---|---|
| GQSMUCE (1−α=0.55) | 0.1 | 0.000 | 0.000 | 0.988 | 0.012 | 0.000 |
| CBS (Olshen et al., 2004) | 0.1 | 0.000 | 0.000 | 0.924 | 0.036 | 0.040 |
| GQSMUCE (1−α=0.55) | 0.2 | 0.000 | 0.000 | 0.994 | 0.006 | 0.000 |
| CBS (Olshen et al., 2004) | 0.2 | 0.000 | 0.000 | 0.872 | 0.100 | 0.028 |
| GQSMUCE (1−α=0.55) | 0.3 | 0.012 | 0.248 | 0.772 | 0.018 | 0.000 |
| CBS (Olshen et al., 2004) | 0.3 | 0.000 | 0.010 | 0.806 | 0.148 | 0.036 |
- †The true signals have six change points.
More generally, we believe that multiscale methods for change point inference (or appropriately defined ‘regions of interest’ in multivariate settings) offer great potential even with more complex data‐generating mechanisms, and we await future methodological, theoretical and computational developments with interest.
Coverage of confidence sets
One attractive feature of SMUCE is the fact that confidence sets can be produced for ϑ. However, Table 5 shows in the Gaussian example with unknown mean that, even when the sample size is as large as 1500, a nominal 95% confidence set has only 55% coverage; even more strikingly, a nominal 80% coverage set has 84% coverage!
This phenomenon, where larger nominal coverage may reduce actual coverage, is caused by the choice of 1−α determining not only the nominal coverage but also the estimated number of change points. The modified procedure SMUCE2 in the table below addresses this by taking into account all levels α′⩽α.
| α | SMUCE: coverage of C(q1−α) | SMUCE: P{K̂(q1−α)=K} | SMUCE2: coverage | SMUCE2: P{K̂=K} |
|---|---|---|---|---|
| 0.90 | 0.874 | 0.978 | 0.882 | 0.986 |
| 0.95 | 0.874 | 0.914 | 0.944 | 0.986 |
| 0.99 | 0.728 | 0.738 | 0.974 | 0.986 |
Federico Crudu (Pontificia Universidad Católica de Valparaíso), Emilio Porcu (Universidad Federico Santa Maria, Valparaíso) and Moreno Bevilacqua (Universidad de Valparaíso)
We congratulate the authors for this excellent paper, which studies the change point problem in the case of exponential family regression. Their proposal is a multiscale change point estimator that can estimate the number of change points, their locations and the value of the mean function ϑ. In addition to that, they provide confidence bands for the function ϑ and confidence intervals for the change points.
The authors provide a thorough analysis of the problem. However, we find that some points need further consideration. First the estimator is designed to work in the case of exponential families. It would be interesting to understand how restrictive this assumption is and therefore what happens to the estimator in the case of misspecification. Second, as the estimator performs remarkably well, it would be worthwhile investigating the out‐of‐sample potential of this estimator and its performance in the case of missing data.
Finally, it would be interesting to know whether SMUCE can be extended to spatially indexed data, where the analogous goals would be to estimate
- (a) the number of change isopleths of ϑ and
- (b) the change isopleth locations and the function values (intensities) of ϑ.
It seems that such an extension is not obvious, for many reasons: for instance, we do not see any way to obtain the analogue of J(ϑ) (in space any ordering is arbitrary). This might be a good challenge for future research.
Alessio Farcomeni (Sapienza University of Rome)
- (a) Panel data: in many cases, we have independent replicates of Y (for example, we may have repeated measures on m units at n occasions). In this case the local likelihood ratio statistics would be modified accordingly, and similar adjustments can be made to the definitions of C(q), the estimator, etc. The use of independent replicates of Y may lead to shorter confidence intervals around change point location estimates.
- (b) Multivariate data: suppose now that the observations are p‐dimensional with p>1, arising for instance from a p‐dimensional continuous exponential family. Computation of local likelihood ratio statistics seems to be straightforward in this case also. When p>1, it may also be interesting to derive dimension‐specific likelihood ratio statistics, in which case the choice of α may be driven by ideas taken from the multiple‐testing literature (reviewed for example in Farcomeni (1991)). Finally, issues with multivariate outliers may arise, suggesting the use of robust estimation methods (e.g. Cuesta‐Albertos et al. (2008)).
- (c) General right continuous functions: the right continuous step function θ can be offset by any known continuous function with, once again, minor adjustments to SMUCE. It may be of interest to explore whether an unknown function within this class can be estimated with a performance as good as that of SMUCE. Some additional restrictions would perhaps be needed for consistency (e.g. that the oscillation within each interval is smaller than Δ). Similarly, if Xi is a vector of covariates, it would be interesting to estimate θ and β in a generalized regression model in which g(·) is a known link function and m⩾1.

Piotr Fryzlewicz (London School of Economics and Political Science)
I congratulate the authors for their thought‐provoking paper.
Revisiting the results of Table 1, it is interesting to note that the authors exhibit the performance of SMUCE for two particular values of the tuning parameter, α=0.45 and α=0.6, the latter presumably chosen to improve the performance in the high variance setting of σ=0.3. In this note, we apply the wild binary segmentation (WBS) technique of Fryzlewicz (2007) to the same example, including a version that requires no choice of any tuning parameters on the part of the user.
More specifically, we apply the WBS algorithm to the trend‐free signal from Table 1, with σ=0.1 and σ=0.3. The WBS algorithm uses the default value of 5000 random draws (each execution took an average of 1.20 s on a standard personal computer). In the thresholding stopping rule, we use the threshold cσ̂√{2 log (n)}, where σ̂ is the median absolute deviation estimator of σ suitable for independent and identically distributed Gaussian noise, n is the sample size and the constant c is selected manually. In the fully automatic stopping rule based on an information criterion, we propose a new criterion termed the ‘strengthened Schwarz information criterion’ sSIC, which can be shown to be consistent for the number and locations of change points when coupled with WBS; it works by replacing the SIC penalty of log (n) × number of change points by {log (n)}^γ × number of change points, for γ>1, where we use the default value of γ=1.01 to remain close to SIC.
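For concreteness, a small sketch of this kind of threshold computation (the particular median‐absolute‐deviation variant based on first differences is an assumption here; the exact estimator may differ): the noise level σ̂ is estimated robustly and the threshold is taken as cσ̂√{2 log (n)}.

```python
import numpy as np

def mad_sigma(y):
    """MAD estimate of the noise s.d. from first differences (assumed variant;
    dividing by sqrt(2) accounts for differencing, 0.6745 makes the estimator
    consistent for Gaussian noise)."""
    d = np.diff(np.asarray(y, dtype=float))
    return np.median(np.abs(d - np.median(d))) / (np.sqrt(2.0) * 0.6745)

def wbs_style_threshold(y, c=1.3):
    """Threshold c * sigma_hat * sqrt(2 log n), as described in the comment."""
    n = len(y)
    return c * mad_sigma(y) * np.sqrt(2.0 * np.log(n))

rng = np.random.default_rng(4)
y = np.repeat([0.0, 1.0, 0.3], 500) + rng.normal(0.0, 0.3, 1500)
print(f"sigma_hat = {mad_sigma(y):.3f}, threshold = {wbs_style_threshold(y):.3f}")
```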
Table 10 shows the results. It is clear that SMUCE, even with an optimally chosen tuning parameter, struggles in the higher variance setting, where it is comfortably outperformed by WBS. WBS sSIC performs very well in both settings, and we emphasize again that it is a completely automatic procedure in which no tuning parameters need to be chosen. For the case σ=0.3, WBS sSIC would have come on top also in Table 1, and, for the case σ=0.1, very close to the top.
Table 10. Frequencies (%) of the estimated numbers of change points for SMUCE and WBS†

| Method | σ | ⩽4 | 5 | 6 | 7 | ⩾8 |
|---|---|---|---|---|---|---|
| SMUCE, α=0.45 | 0.1 | 0 | 0 | 99 | 1 | 0 |
| WBS, c=1.3 | 0.1 | 0 | 0 | 97 | 2 | 1 |
| WBS, c=1.4 | 0.1 | 0 | 0 | 99 | 1 | 0 |
| WBS, c ∈ [1.5,2] | 0.1 | 0 | 0 | 100 | 0 | 0 |
| WBS sSIC | 0.1 | 0 | 0 | 97 | 2 | 1 |
| SMUCE, α=0.45 | 0.3 | 3 | 34 | 62 | 1 | 0 |
| SMUCE, α=0.6 | 0.3 | 0 | 10 | 80 | 9 | 0 |
| WBS, c=1.3 | 0.3 | 0 | 3 | 94 | 2 | 1 |
| WBS, c=1.4 | 0.3 | 2 | 7 | 90 | 1 | 0 |
| WBS sSIC | 0.3 | 0 | 1 | 95 | 3 | 1 |
- †Results for SMUCE were taken from the paper and averaged to the nearest integer. Results for WBS are based on 100 simulation runs, with the random seed in R set to the arbitrary value of 1 before the first simulation run, for reproducibility.
Finally, we note that the unbalanced Haar method of Fryzlewicz (2004), which was included by the authors in the same simulation study, is not a consistent change point detector. Out of the three methods studied (SMUCE, unbalanced Haar wavelets and WBS), users will be best off applying the WBS method in this context.
Oliver Linton (University of Cambridge) and Myung Hwan Seo (London School of Economics and Political Science)
Time series analysis usually begins with some decomposition into trend, seasonal component and remainder terms. This paper focuses on a very special but important case where the trend is piecewise constant, there is no seasonal component and the remainder term is independent and identically distributed. In this case, it obtains powerful results whereby the location of change points can be estimated freely and confidence intervals for the trend can be obtained. More generally, we may need to allow for periodic components, such as in Linton and Vogt (2005), and more general trend functions that are not necessarily constant between knot points; this may raise further issues. For instance, if the original sequence contains a continuous piecewise linear trend, then the first differenced series can be fitted into the proposed method in conjunction with the discussion on moving average MA(1) processes in Section 6.1. However, in the case of a discontinuous piecewise linear trend, more care should be exercised because of the jump points at trend breaks.
It would also be interesting to explore the consequences when the true data‐generating process ϑ(x) is a continuous function, and we interpret the estimated step functions as the best approximation to the unknown true function in the mean‐squared‐error sense. If the number of changes in the estimation is fixed as the sample size n grows, cube‐root‐type asymptotics are expected, as in for example Koo and Seo (2013). Theorem 6 seems to allude to some expected result such as consistency to a continuous function ϑ. In other words, what can be included as the limit of ϑn? Certainly, the limit will not be an element of the space of step functions, as an infinite number of jumps cannot be allowed in a step function with bounded domain. In the meantime, it was not clear whether some results of the paper include the case of no change point.
In economics and finance, classification of time series into different regimes is a common activity. For example, Phillips and Yu (2010) proposed statistics for classifying a price time series into a ‘rational’ part and a ‘bubble’ part. Their method is based on recursive change point testing. In the notation of this paper the function ϑ(t) is either consistent with a first‐difference stationary process or with an explosive process, according to which regime the process is in. The issue is to identify the bubble regimes from the data as well as to track lead–lag relationships in different asset prices, i.e. where did the bubble first occur? This leads to the question of whether the authors' methods can be generalized to multivariate dependent settings where the parameter ϑ(t) controls the type of non‐stationarity in the series. Another issue, for applications in finance in particular, is the presence of heavy tails, and the sort of quantile methods discussed in Section 5.4 look promising in this regard.
Gift Nyamundanda and Kevin Hayes (University of Limerick)
We extend our congratulations to the authors for their paper which presents a fresh approach to change point detection, specifically, change point discovery. The provision of the probability of overestimating or underestimating the true number of change points and the uncertainty associated with the estimation of the location of these change points is a key attraction of the SMUCE method. However, a concern throughout is in the actual advantages of SMUCE over rival approaches. Although the authors identify the modified Bayesian information criterion and circular binary segmentation as potential competitors, the literature offers many alternative change point methodologies that should be considered. For example, binary segmentation (Chen and Gupta, 1999) and an approach of minimizing a certain cost function by Killick et al. (2012) are designed to detect changes in both the mean and the variance structure. Likewise, the product partition model (PPM) due to Barry and Hartigan (2000), implemented in the R package bcp (Erdman and Emerson, 2012), detects changes in both the mean and the variance structure. Furthermore, the posterior distribution of the partition in the PPM provides the uncertainties associated with the estimated locations of change points. The prior distribution of the partition in the PPM satisfies a similar role to that of the parameter α in SMUCE; both controlling the expected number of change points by using input parameters. The PPM also provides a posterior distribution on k, the number of estimated change points, which is informative regarding the overspecification or underspecification of change points in the final fit. Also, the authors did not provide a convincing argument why SMUCE can be a better approach for detecting distributional changes due to changes in the variance as compared with current existing approaches.
The choice of the threshold parameter q is extremely important since it balances the data fit and parsimony of SMUCE. It is associated with the parameter α that controls the probability of overestimating the true number of change points. Choosing q depends on the interval length parameter λ and the jump size parameter Δ. It would be useful if the authors could report on the effects of misspecification of these parameters (λ and Δ) on the number of change points and, in turn, on the power of detecting change points. In Section 5.6 there are results of real data analyses in which these parameters were set to λ⩾0.2 and Δ⩾log(3/2), and λ⩾0.025 and Δ⩾log(3/2). Perhaps a better way to choose q here would be to define a full prior distribution over λ and Δ to take into account the uncertainties associated with these two parameters.
K. Szajowski (Wrocław University of Technology)
Multiscale methods
In recent years, we have seen a steady increase in references to modelling phenomena at several scales simultaneously. In the Web of Knowledge the number of reported articles with the phrase ‘multiscale modelling’ in their titles has risen from 50 in 2001 to more than 300 in 2009. An overview of such models can be found in Horstemeyer (2012). A special interdisciplinary journal, Multiscale Modeling and Simulation, focusing on the fundamental modelling and computational principles underlying various multiscale methods, was founded by the Society for Industrial and Applied Mathematics in 2003. Multiscale modelling has become key to obtaining more precise and accurate predictive tools. This paper is a significant contribution to the communication between mathematicians and representatives of the sciences using this approach in mathematical modelling. The change point problem for such models offers a new point of view in this area. However, the idea of sequential multiple-disorder detection has been considered not only for simple sequences of observations.
Multiple‐change‐point analysis
In various applications the modelled signal can have an unobservable parameter. If the unobservable parameter is modelled as a hidden process with rare changes of state, then the signal and the hidden parameter process form a multiscale system with ‘disorders’ (see Bojdecki (2010)). The problem of multiple‐disorder detection for processes having Markovian structure, when the hidden parameter process is uncovered, is the subject of Szajowski (2011). It is a generalization of the results of Yoshida (2008) (see also Sarnowski and Szajowski (2012, 2005)). A short description is as follows. A random sequence is registered whose segments are homogeneous Markov processes; each segment has its own transition probability law and the length of the segment is unknown and random. The transition probabilities of each process are known and a joint a priori distribution of the disorder moments is given. The detection of the disorders is rarely precise; the decision maker accepts some deviation in the estimation of the disorder moment. In the models taken into account the aim is to indicate the change points, with fixed bounded error, with maximal probability.
It is an interesting problem how the required precision changes the optimal decisions. This question is considered in Ochman‐Gozdek et al. (1988). The case with different precision for overestimation and underestimation of this point is analysed, including the situation when the disorder does not appear with positive probability. The observed sequence, when the change point is known, has Markov properties. The results explain the structure of the optimal detector in various circumstances. The problem is reformulated as optimal stopping of the observed sequences. A detailed analysis of the problem is presented to show the form of the optimal decision function.
These results show that the optimum detection procedures are seldom easy to use. They convince me that the methods proposed by Frick and his colleagues are a very good option.
The following contributions were received in writing after the meeting.
John Aston (University of Cambridge) and Claudia Kirch (Karlsruhe Institute of Technology)
We congratulate the authors on a very interesting paper which introduces a new way to examine change point settings. We would like to take the opportunity to point out several possible extensions for future work that have been successfully included in simpler at‐most‐one‐change situations.
Frequently in statistics, procedures based on Gaussian assumptions can be used more generally and yield asymptotically correct results. This includes the Whittle likelihood for non‐normal data, to give an example outside change point analysis, but also the simple cumulative sum chart and its variations obtained as a likelihood ratio statistic for the much simpler change point problem where at most one mean change in independent and identically distributed normal random variables is expected. In the latter case, results remain true if a strong invariance principle and some asymptotic independence assumption hold (for the actual likelihood ratio statistic; Csörgő and Horváth (1997), theorem 1.4.2), or if a functional central limit theorem holds (for simpler weight functions; Csörgő and Horváth (1997), theorem 2.1.1). In the same spirit, extensions to misspecifications with respect to the independence between observations are possible by using not only parametric models but also weak dependence concepts under which strong invariance principles or at least functional central limit theorems hold (Csörgő and Horváth (1997), section 4.1). Typically, the variance then needs to be replaced by the long‐run variance, with all the problems arising from the difficulty of estimating this. It would be rather interesting to see whether similar results can be obtained for the proposed procedure on the basis of the normal likelihood.
Similarly, it would be interesting to see how to extend the proposed procedure to detect more complicated changes, e.g. in regression or parametric time series models. Similarly to the role of estimating functions in parameter estimation, corresponding functionals of the time series and the estimated parameters can transform this problem into a possibly multivariate mean problem. For example, Hušková et al. (2007) used several variations of the score function (for normal errors) to detect changes in linear auto‐regressive time series, whereas Kirch and Tadjuidje Kamgaing (2012) and Franke et al. (2012) used least squares scores in non‐linear auto‐regressive time series. In those cases, it is of interest to allow for an additional time series structure of the errors even in the parametric model as this allows for misspecification with respect to the model (Kirch and Tadjuidje Kamgaing, 2012).
In general, it is of interest to consider the theoretical robustness of the proposed test procedures to violations of the assumptions on the underlying error distribution, model assumptions or a combination of both. Kirch and Tadjuidje Kamgaing (2014) give general regularity conditions on the truly possibly misspecified underlying time series under which the standard asymptotics for change point tests based on scores or estimating functions remain true under the null hypothesis, in addition to regularity conditions under which such misspecified tests still have asymptotic power 1.
Jérémie Bigot (Institut Supérieur de l’Aéronautique et de l’Espace, Toulouse)
I congratulate the authors for a stimulating paper and their innovative contribution on change point inference.
One of the key features of SMUCE, which explains its good performance, is the use of a multiscale test statistic to estimate the number of change points of an unknown step function. Here, I briefly compare the numerical performance of SMUCE with another estimator proposed in Bigot (1993) that is also based on the combination of test statistics at various scales.
Expression (44) defines wavelet maxima in the time–scale space: locations m(s) at which the wavelet transform has a local maximum at x=m(s). Following Bigot (1993), a local maximum at scale s is said to be significant if it exceeds a threshold involving expression (45), in which the number of significant wavelet maxima at scale s appears, [s0,s1] is a range of testing scales and C is a normalizing constant. From Fig. 16, it can be seen that, for small values of s, the significant wavelet maxima are all near the jumps of f. Moreover, the locations of the significant modes of Gm correspond to satisfactory estimators of the jumps of f.
We applied this method (called StructInt) to 1000 runs from model (41). For the trend‐free signal, it is clear that SMUCE performs better than StructInt, especially for σ=0.3. This is because the multiscale nature of the testing procedure in SMUCE fully exploits the assumption that the unknown function f is piecewise constant, whereas StructInt is suited to detecting jumps in a piecewise smooth signal that is not necessarily a step function. Therefore, StructInt is more robust than SMUCE to deviations from the step function model, as shown by the results in Table 11 in the case of a signal with a trend. Finally, these numerical experiments confirm the benefits of combining multiscale test statistics for change point detection of a piecewise smooth regression function.

Table 11. Frequencies of the estimated numbers of change points for SMUCE and StructInt†

| Method | Trend | σ | ⩽4 | 5 | 6 | 7 | ⩾8 |
|---|---|---|---|---|---|---|---|
| SMUCE (1−α=0.55) | No | 0.1 | 0 | 0 | 0.988 | 0.012 | 0 |
| StructInt | No | 0.1 | 0 | 0 | 0.970 | 0.029 | 0.001 |
| SMUCE (1−α=0.55) | No | 0.2 | 0 | 0 | 0.986 | 0.014 | 0 |
| StructInt | No | 0.2 | 0 | 0.026 | 0.926 | 0.047 | 0.001 |
| SMUCE (1−α=0.4) | No | 0.3 | 0 | 0.099 | 0.798 | 0.089 | 0 |
| StructInt | No | 0.3 | 0.132 | 0.476 | 0.380 | 0.012 | 0 |
| SMUCE (1−α=0.55) | Long | 0.2 | 0 | 0 | 0.825 | 0.171 | 0.004 |
| StructInt | Long | 0.2 | 0 | 0.033 | 0.931 | 0.036 | 0 |
| SMUCE (1−α=0.55) | Short | 0.2 | 0 | 0.002 | 0.903 | 0.088 | 0.007 |
| StructInt | Short | 0.2 | 0 | 0.025 | 0.923 | 0.050 | 0.002 |
- †The results for StructInt are based on 1000 simulation runs from model (41), and the results for SMUCE are taken from Table 1.
D. S. Coad (Queen Mary University of London)
I congratulate the authors on this fascinating paper, which provides a hybrid method based on a likelihood ratio test and a model selection step for the change point problem in exponential family regression. Asymptotic arguments are developed for maximizing the probability of correctly estimating the number of change points. The computations may be carried out efficiently by using dynamic programming. I have no doubt that the method will be generalized further.
The underlying model (1) assumes that the data follow a one‐dimensional exponential family. Several examples of different distributions are used to compare the performance of the hybrid method with some existing methods that have been proposed in the literature. A natural extension of the method is to the multi‐dimensional case (see Lévy‐Leduc and Roueff (2005) and Xie et al. (2009)). Given that a detailed study of a suitable vector of multiscale statistics may be impractical, because, for example, of potential sparseness in the data, an alternative approach may be to consider low dimensional statistics instead. That way, it may be possible both to study the asymptotic properties of the method and to keep the necessary computations feasible.
Although most of the paper focuses on independent data, it is briefly indicated in example 1 how the method could be extended to serially correlated data. In particular, a Gaussian moving average process of order 1 is considered. It would be interesting to know whether the arguments could be extended further to m‐dependent data and, if so, where the main challenges would lie, both theoretically and computationally. Further, since one of the conclusions from the simulations for example 1 is that the modified multiscale statistic performs much better than the unmodified version for stronger dependence, perhaps the difference between the two could be even greater for m‐dependent data. Presumably, the study of heavy‐tailed data is an open problem.
Laurie Davies (University of Duisburg‐Essen)
although 2(2+√2)²≈23.3 is a more accurate constant. In Höhenrieder (2010) the case of piecewise constant functions was treated in the context of volatility estimation. The authors' approach, derived independently, shows similarities and differences. One difference is that they use a local likelihood test to segment the data whereas the segmentation in Höhenrieder (2010) is based on partial sums and bounds of the form
(46). The method of Höhenrieder (2010) can be adapted to t‐distributions (see Banasiak (2010) and also Davies et al. (2012)) while maintaining the algorithmic complexity O(n²). If the segmentation is applied to the Standard and Poor's data (21717 non‐zero observations), the results for the positive returns are 43 and 29 intervals for the Gauss and t7‐models respectively. The Kolmogorov distances of the residuals from the respective models are 0.048 and 0.016. For the negative returns the numbers are 50 and 26 for the Gauss and t5‐models, with distances 0.069 and 0.023. Models based on t‐distributions are clearly superior for this data set. Can the local likelihood approach be extended to t‐ and other distributions or is it essentially restricted to exponential families?
After segmentation the authors maximize the likelihood given the minimum number of intervals. Efficiency under the model is not the only consideration. In Höhenrieder (2010) and Davies et al. (1964) the sum Σi (Yi−θi)2 was minimized instead of the likelihood being maximized. This explains the inferior mean integrated squared error in Table 2, the compensation being the applicability to t‐distributions which was regarded as being much more important in the context. A correct calibration of the method of Davies et al. (1964) leads to the following numbers in the results given for Davies et al. (1964) in Table 2: 0.910, 0.938, 0.973, 0.973 and 0.855.
Farida Enikeeva and Zaid Harchaoui (Institut National de Recherche en Informatique et Automatique, Grenoble)
It is a great pleasure for us to comment on such an excellent paper. We shall name only a few among many contributions of the paper. First, the consistency of the estimator of K is established without any assumption on the number of jumps. The paper also gives a non‐asymptotic exponential bound for the deviation of this estimator in the Gaussian case, and confidence bounds for θ(t) in a general case. In addition, the paper provides detailed explanations on implementation and calibration of the test. All in all, the paper is a substantial contribution not only to statistics but also potentially to a large range of applications.
- (a) Can the SMUCE approach be generalized to the case of indirect observations, e.g. in the setting of Boysen et al. (2009)? It would be interesting to see corresponding convergence rates, and whether the approach would enjoy optimality properties.
- (b) Theoretical results are presented in terms of the minimal length of the interval and minimal jump size. However, if the signal jump is sufficiently large, the change could potentially be detected on a short time interval. Conversely, if one observes a small change on a sufficiently long time interval, then it should be possible to detect the change. Is it possible to give a condition on the signal detection that would take into account the behaviour of contiguous changes?
- (c) Another interesting extension is to multivariate or even high dimensional parameter θ. The method of Harchaoui and Lévy‐Leduc (2009) was extended to the multivariate setting in Vert and Bleakley (2010), allowing the detection of common change points in a large number of sequences for the purpose of segmentation of comparative genomic hybridization data (Picard et al., 1997). Convergence is then analysed in a double‐asymptotic framework. It would be interesting to have the authors' insights on the possibility of extending their tools to multivariate, high dimensional observations.
On the application side, the approach could be used for genomic data obtained by next generation sequencing techniques, with several challenges to tackle. First, one then works in a high dimensional setting, because of the constantly growing number of sequences in databases. Second, genomic sequences could contain millions of observations with hundreds of change points. When applied to the raw next generation sequencing data (e.g. 454 pyrosequencing flow data), we think that SMUCE could help to solve important problems like correction of erroneous sequences or detection of single‐nucleotide polymorphisms.
Paul Fearnhead (Lancaster University)
I congratulate the authors on this very interesting paper. One aspect that I particularly liked about it was that it was comprehensive, in terms of developing a method for detecting change points, the associated theory for the method and efficient approaches to implementing the method in practice. Furthermore the paper did not just consider point estimates for the location of change points, but also how to construct confidence intervals for their location, and confidence bands for the underlying function. It is this latter aspect of the paper that I would like to comment on.
One issue with constructing confidence intervals for change point problems is the difficulty in defining what such an interval should be when there is uncertainty over how many change points there are. As can be seen in Fig. 1(c), the likely location of change points can vary substantially as you change their number. The approach in the paper seems to circumvent this by working with asymptotic confidence intervals, and using the fact that the method gives a consistent estimate of the number of change points. As a result, asymptotically you know the number of change points in the data, and hence this issue vanishes. This asymptotic behaviour, however, is different from that in many real life applications where there is substantial uncertainty over the number of change points. So how meaningful are the confidence intervals in practice? How is a practitioner to interpret a figure like Fig. 1(c), particularly as the confidence level is inherently linked with the number of change points detected?
These issues seem particularly prevalent in applications with both large numbers of observations and many change points. Would pointwise confidence bands for the underlying function be more appropriate than simultaneous confidence bands in these cases? In situations where there is substantial uncertainty about the number of change points, are there any real alternatives to Bayesian approaches (e.g. Yao (2005), Barry and Hartigan (2000), Fearnhead and Liu (2009) and Fearnhead and Vasileiou (2006)) if you want to quantify the full uncertainty of any estimates?
Cheng‐Der Fuh and Huei‐Wen Teng (National Central University, Jhongli City)
- (a) Besides the moving average MA(1) model mentioned in the paper for dependent data, another interesting model is the hidden Markov model. We assume that Xn is a Markov chain on a finite state space D={1,…,d}, with initial distribution P(X0=x0)=πθ(x0) and a transition probability, and that the observation Yn at time n is continuous with a density function conditional on the state (all parameterized by θ). Let K denote the number of change points, τ1<τ2<…<τK be the change points and θ1,…,θK be the associated parameters. The problem of interest is that the state transition probability and the observation model conditioned on the state undergo a change from θi to θi+1 at the change point τi for i=1,…,K; both the state transitions and the observation model between consecutive change points are described by the corresponding segment parameter. This model is suitable for a setting where the observation is causally related to the state and, hence, a change in the state transition leads to a change in the observation density. In the case of K=1, the joint density of Y={Y1,…,Yn} can be written out explicitly, and the joint density for general K can be defined in a similar way. Another setting of change point detection is one in which the change point is defined as a change of state of the underlying Markov chain Xn. These models are called regime switching models when the Markov chain is recurrent, and change point models when it is not.
- (b) In finance applications, the key feature is to incorporate both the stock and the option prices to identify the volatility regime. Specifically, the option price is modelled via a variance stabilizing transform log (yij)= log {Cij(Θ)}+ɛij, where Cij(Θ) is the model price of the option, with i indicating the type of the option and j the strike price. Here, the error ɛij comes from market friction or model discrepancy. It is a challenging task to perform change point detection using the implied volatility calibrated from option prices, the volatility of the option price calculated from the error ɛij and the historical volatility of the stock price.
It would be interesting to see whether the simultaneous multiscale change point estimator proposed in the paper can be applied to these two problems. Alternatively, a Bayesian approach focuses on the posterior distribution of the parameters, in which case Markov chain Monte Carlo methods can be implemented to sample the parameters of interest from the desired distribution.
Dario Gasbarra and Elja Arjas (University of Helsinki)
The paper follows consistently the frequentist statistical paradigm. Thus, for example, the number K of change points is viewed as a parameter, and consideration of different values for K is interpreted as a problem of model selection. A natural alternative, particularly in ‘dynamic’ applications where the data consist of noisy measurements of a signal over time, is to think of the signal as a realization of a latent stochastic process. Then the mode of estimation changes from an optimization problem to a problem formulated in terms of probabilities. Adopting a hierarchical Bayesian approach to the problem allows us similarly to quantify probabilistically also parameter uncertainty. As an illustration, we computed the posterior probabilities P(K=k|data) for the ‘no‐trend’ signal (see Fig. 4(a)), using data corrupted by Gaussian noise as in the paper. For this, we applied a slightly modified version of the Gibbs sampler algorithm for marked point processes of Arjas and Gasbarra (1994).
Quite a vague prior distribution was assumed, with non‐informative scale invariant priors π(σ2)∝σ−2 and π(λ)∝λ−1 respectively for the variance of the Gaussian error terms and for the Poisson intensity governing the number and the positions of the change points. For the signal levels, we adopted a Gaussian prior for the initial level, and a conditionally independent and identically distributed Gaussian prior for the successive jumps, assuming a scale invariant non‐informative prior also for η2.
With these model and prior specifications the posterior probability P(K=k| data) for the number of change points had its maximum value 0.41 at k=6, which was the true number used in the simulation. By constraining the jump sizes of the signal above the threshold 0.15, we obtained also the constrained conditional probability P(K=6| data, constraint) = 0.78.
The 0.95 credible interval for σ was (0.19, 0.21), the correct value being σ=0.2. Additional results, and the computer code that was used, are available from the first author.
It would be interesting to see a systematic comparison of the performance of the various methods for solving change point problems based on the two complementary statistical paradigms.
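The posterior probabilities P(K=k|data) reported above were obtained with a marked point process Gibbs sampler. As a toy stand-in only, the quantity itself can be computed exactly by enumeration for very small data sets under a much simplified, fully conjugate model: known error variance, independent N(0, η²) segment levels and uniform priors on K and on the change point configurations. None of these simplifications are part of the Arjas–Gasbarra approach; the sketch merely illustrates what P(K=k|data) is.

```python
import numpy as np
from itertools import combinations
from math import comb

def log_marginal_segment(y, sigma2, eta2):
    """Marginal likelihood of one segment: y | mu ~ N(mu, sigma2 I), mu ~ N(0, eta2).
    Closed form via the Sherman-Morrison identity."""
    m, s, ss = len(y), y.sum(), (y ** 2).sum()
    logdet = (m - 1) * np.log(sigma2) + np.log(sigma2 + m * eta2)
    quad = (ss - eta2 * s ** 2 / (sigma2 + m * eta2)) / sigma2
    return -0.5 * (m * np.log(2 * np.pi) + logdet + quad)

def posterior_K(y, kmax, sigma2, eta2):
    """Toy posterior P(K = k | y), exact by enumeration (tiny n only)."""
    n, logpost = len(y), {}
    for k in range(kmax + 1):
        log_prior_config = -np.log(comb(n - 1, k))          # uniform over configurations
        terms = [log_prior_config
                 + sum(log_marginal_segment(y[a:b], sigma2, eta2)
                       for a, b in zip((0, *cps, n)[:-1], (0, *cps, n)[1:]))
                 for cps in combinations(range(1, n), k)]
        logpost[k] = np.logaddexp.reduce(terms)
    z = np.logaddexp.reduce(list(logpost.values()))
    return {k: float(np.exp(v - z)) for k, v in logpost.items()}

rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0.0, 0.2, 6), rng.normal(1.0, 0.2, 6)])
print(posterior_K(y, kmax=2, sigma2=0.04, eta2=1.0))
```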
M. Hušková (Charles University in Prague)
I congratulate the authors for excellent work, which is important for applications. They consider a relatively simple model with multiple changes (independent observations; one‐dimensional exponential family for single observations) but the paper is quite complex—it covers motivations, theoretical results, deep discussion of computational aspects, simulations, application to real data sets, discussions on single steps and discussions on possible extensions.
The disadvantage is that the paper is too long (but I have no suggestion how to shorten it) and it is difficult to extract the algorithms if one wants to apply the procedures without reading the whole paper in detail; algorithm listings or pseudocode would be useful.
I would like to bring your attention to Antoch and Jarušková (2013), where a different model is considered, and different procedures. It contains both theoretical results and algorithms with useful pseudocode.
Jiashun Jin (Carnegie Mellon University, Pittsburgh) and Zheng Tracy Ke (Princeton University)
We congratulate the authors on a very interesting paper. The paper sheds light on a problem of great interest, and the theory and methods developed are potentially useful in many applications.
(47)
(48)


by
and
The covariance matrix of
is denoted as H, which can be easily calculated by basic statistics. For tuning integers m,l⩾1 (m is small) and tuning parameters u,v,t>0, the stages are as follows.
- (a) Screening stage: let
be the set of currently retained nodes (initialize with
). For k=1,…,m and i=1,…,n−k+1, let Aik={i,i+1,…,i+k−1}, and
. We update with
if
(for short, A=Aik and D=Dik), and keep
unchanged otherwise;
is the subvector of
restricted to indices in A (HA,A is defined similarly; see Ke et al. (2013)). Let
be the set of all retained indices at the end of the screening stage.
- (b) Cleaning stage:
partitions uniquely to B1∪…∪BN such that
and
. For each Bk, let
, and compute
, subject to the constraint either μj=0 or
, for
, or μj=0, for
.
The CASE estimate is given by
, and
for
. For simplicity, the cleaning stage is slightly different from that in Ke et al. (2013); see the details therein.
Rebecca Killick (Lancaster University)
The authors are to be congratulated on a valuable contribution to the change point literature and commended for making their algorithm available within the stepR package (Hotz and Sieling, 1970). The automatic penalty selection and construction of confidence intervals for change point locations are of particular interest.
Firstly I have a question on the automatic penalty selection. In Section 3 the authors reframe SMUCE as a dynamic program for ease of computation. By lemma 1 in Section 3 the solution of the dynamic program is also the solution of the original problem if γ is chosen larger than
The choice of γ is not mentioned in the paper as the authors instead focus on the choice of q. Given
as indicated in the paper, what γ was used in the simulations and, more importantly, what γ is preprogrammed into the stepR package?
Secondly, whereas the presentation in the paper is restricted to SMUCE, I believe that the concept of confidence intervals for the change point locations can be extended to other search algorithms, in particular PELT (Killick et al., 2012). The confidence intervals for the change point locations in SMUCE are constructed by considering all sets of solutions where the test statistic Tn(Y,ϑ) is less than the threshold q (equation (5)). In contrast, the PELT algorithm keeps all change point locations that are within the penalty value of the maximum to prune the search. The same idea of confidence could be applied to these change point locations as their test statistics are close to the maximum and are thus also likely candidates for a change point. Obviously the key question is what theory is there to support this criterion as a way of constructing a confidence interval?
Empirically, I conjecture the following:
- (a) as you increase the penalty (i.e. increase your expected confidence in a change point) you become more uncertain about the proposed locations;
- (b) for a given penalty value, the larger the change, the smaller the confidence interval;
- (c) the coverage does not depend on the size of the change;
- (d) the longer the interval between changes, the smaller the confidence interval.
Fig. 18 gives an example of this last point by using the changepoint package (Killick and Eckley, 2014). In the simulations the coverage of both SMUCE and PELT was larger than 99% using default values. The theory behind this conjecture needs to be thoroughly treated but at least empirically this seems promising.
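The idea of reading a confidence statement off all candidate locations whose cost lies within the penalty of the optimum can be made concrete for a single change in mean. The sketch below is a toy version of that idea, not the PELT algorithm or its pruning rule, and the penalty value in the example is an arbitrary illustrative choice.

```python
import numpy as np

def penalty_window_interval(y, pen):
    """For one change in mean, compute the squared-error cost of every split and
    return the best split together with all splits whose cost is within `pen`
    of the optimum -- a crude 'confidence window' in the spirit described above."""
    n = len(y)
    cost = np.full(n, np.inf)
    for t in range(1, n):
        left, right = y[:t], y[t:]
        cost[t] = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    best = int(np.argmin(cost[1:]) + 1)
    window = np.where(cost <= cost[best] + pen)[0]
    return best, window

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])
tau_hat, window = penalty_window_interval(y, pen=2 * np.log(len(y)))
print(tau_hat, window.min(), window.max())
```

Re-running the example with a larger jump shrinks the window, in line with observation (b) above.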

Dehan Kong and Qiang Sun (University of North Carolina at Chapel Hill)
We congratulate the authors for their thought‐provoking and fascinating work on a challenging topic in multiscale change point inference. They introduce a new estimator, the simultaneous multiscale change point estimator SMUCE, for the change point problem in exponential family regression, which can provide honest confidence sets for the unknown step function and its change points and achieves the optimal detection rate of vanishing signals as n→∞. This work is a substantial contribution to multiscale change point problems by providing a solid inference tool.
However, the SMUCE method depends highly on the choice of the tuning parameter α, or q equivalently. The authors propose to select q by maximizing g(q)=1−α(q)−β(q,η,λ), the lower bound of
. This only works properly if max g(q) is close to 1, which happens when n is fairly large. It needs to be noted that, in general, the maximizer of
and g(q) can be quite different. Thus finite‐sample performance is not guaranteed to be optimal. An illustrative example is to take
whereas
. This phenomenon is also illustrated by the simulation results of the paper. In Section 5.1, the authors obtain an optimal q=q1−α(M) with 1−α=0.55, where q1−α(M) is the (1−α)‐quantile of the distribution of M. This result indicates that max g(q)⩽0.55. However, from the simulation setting where σ=0.3, we can see that
when 1−α=0.55, and
when 1−α=0.4. The maximizer of f(q) should be quite different from q=q0.55(M).
Moreover, we note that the selection of q or α does not depend on the error standard deviation σ, which results in the same α as chosen in Section 5.1. However, it is expected that, when σ increases, we would tend to underestimate the number of change points by using the same α, which is illustrated by the simulation results in Section 5.1. A better tuning method is needed.
In conclusion, the tuning method proposed is not adapted to σ and not optimal given limited sample size. As a remedy, we suggest the use of cross‐validation for tuning purposes.
Han Liu (Princeton University)
I congratulate the authors for making an important contribution to the problem of change point inference. I believe that their method will have many interesting extensions beyond what has been presented in the paper.
In what follows, we consider the change point inference problem in a slightly different varying‐coefficient generalized linear model and construct a SMUCE‐type estimator based on local composite likelihood. Unlike SMUCE, this estimator does not depend on the explicit parametric form of the exponential family distribution.
A varying‐coefficient generalized linear model
(49)

defined in the paper. Our goal is to infer the change points τ1,…,τK.
A local composite likelihood ratio statistic
exploited by SMUCE is





. Multiplying the density functions for all possible combinations of pairs that fall in the interval [i/n,j/n], we define the local composite likelihood ratio statistic as

is a function of U‐statistics and can be plugged into SMUCE for change point inference. It is interesting to understand its theoretical properties.
Robert Maidstone and Benjamin Pickering (Lancaster University)
We congratulate the authors on a stimulating contribution to the literature. Our discussion explores the performance of SMUCE in the case of large data sets with an increasing number of change points.
Examples of ‘big data’ are becoming increasingly frequent in industrial applications. It is important that any novel change point detection method can function efficiently in these cases. To assess the effectiveness of SMUCE in such scenarios we considered detecting changes in mean for Gaussian data. We examined data sets of varying size, from n=2000 to n=100000, containing m equidistant change points, where m={10,⌊√n⌋,n/10}. Change point estimates were computed by using three competing methods: SMUCE, PELT (Killick et al., 2012) and binary segmentation (Scott and Knott, 1974). Computational times were also recorded. We implemented the methods by using the default values in the stepR and changepoint (Killick and Eckley, 2012) packages.
The average computational times for each method are displayed in Fig. 19. Where the number of change points is kept constant, SMUCE is considerably slower than the other two methods and is distinctly non‐linear. For the other two cases, where the number of change points increases with n, the computational performance of SMUCE improves. The empirical simulations illustrate that SMUCE is at least near linear, though still significantly slower than PELT. This reinforces the authors' comments that SMUCE cannot be proved to be
in computation. To reduce computational time the stepR package approximates the SMUCE algorithm by considering only intervals of dyadic lengths.
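A minimal harness for experiments of this kind is sketched below: it generates Gaussian signals with m equidistant change points and times a detector on them. The detector is a naive binary segmentation written only to make the scaffold self-contained; the comparison reported in this comment used SMUCE, PELT and binary segmentation as implemented in the stepR and changepoint R packages, and the penalty and jump size below are illustrative choices.

```python
import time
import numpy as np

def make_signal(n, m, jump=1.0, sigma=1.0, seed=0):
    """Gaussian mean-shift signal of length n with m equidistant change points
    (levels alternating between 0 and `jump`)."""
    rng = np.random.default_rng(seed)
    seg_lengths = np.diff(np.linspace(0, n, m + 2).astype(int))
    means = np.repeat(np.arange(m + 1) % 2 * jump, seg_lengths)
    return means + rng.normal(0.0, sigma, n)

def binseg(y, pen):
    """Tiny binary segmentation for changes in mean (stand-in detector)."""
    def best_split(a, b):
        seg = y[a:b]
        ns = len(seg)
        if ns < 2:
            return None, 0.0
        s, k = np.cumsum(seg), np.arange(1, ns)
        gain = ns * (s[:-1] - k / ns * s[-1]) ** 2 / (k * (ns - k))   # RSS reduction
        j = int(np.argmax(gain))
        return a + j + 1, float(gain[j])
    cps, stack = [], [(0, len(y))]
    while stack:
        a, b = stack.pop()
        split, gain = best_split(a, b)
        if split is not None and gain > pen:
            cps.append(split)
            stack += [(a, split), (split, b)]
    return sorted(cps)

for n, m in [(2_000, 10), (20_000, 141), (20_000, 2_000)]:   # 141 = floor(sqrt(20000))
    y = make_signal(n, m)
    t0 = time.perf_counter()
    found = binseg(y, pen=3 * np.log(n))
    print(n, m, len(found), f"{time.perf_counter() - t0:.3f}s")
```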

Fig. 19. Average computational times for SMUCE, PELT and binary segmentation, averaged over 50 replications: (a) 10 change points; (b) ⌊√n⌋ change points; (c) n/10 change points
The quality of the change point estimates was evaluated by calculating the V‐measure (Rosenberg and Hirschberg, 2007) for each method in each scenario. This is a measure on [0, 1] of the similarity between two different segmentations of a data set. A larger V‐measure indicates a more accurate segmentation of the data, with a V‐measure of 1 indicating perfect segmentation. Fig. 20 compares the average V‐measures of the three methods as n increases. Although the accuracy of SMUCE surpasses that of binary segmentation, it is outperformed by PELT for the scenarios with ⌊√n⌋ and n/10 change points, significantly so for the latter. In addition, the accuracy of SMUCE appears to decrease as the density of change points increases. These features are due to the approximation made in the stepR package.
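The V-measure itself is easy to evaluate by treating the two segmentations as clusterings of the observation indices; a small sketch using scikit-learn (the change point positions in the example are made up for illustration):

```python
import numpy as np
from sklearn.metrics import v_measure_score

def segmentation_labels(n, change_points):
    """Per-observation segment labels induced by a sorted list of change points."""
    labels = np.zeros(n, dtype=int)
    for k, cp in enumerate(sorted(change_points), start=1):
        labels[cp:] = k
    return labels

n, true_cps, est_cps = 400, [100, 200, 300], [98, 205, 300]
print(v_measure_score(segmentation_labels(n, true_cps),
                      segmentation_labels(n, est_cps)))
```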
Our simulations suggest that SMUCE begins to experience difficulties as the density of change points increases with n, as a consequence of the approximations made. We would be interested to know whether an exact implementation is possible without compromising computational cost.

Fig. 20. Average V‐measures for SMUCE, PELT and binary segmentation as n increases, averaged over 50 replications: (a) 10 change points; (b) ⌊√n⌋ change points; (c) n/10 change points
Jorge Mateu (University Jaume I, Castellón)
The authors are to be congratulated on a valuable and thought‐provoking contribution on the change point problem. As they note, this problem can be encountered in a wide variety of practical examples and scientific fields, incorporating and sharing ideas from multiscale methods, dynamic programming and exponential families. I would like to comment on the problem of dependent data as appears in Section 6.
Gibbs point processes are applied to model point patterns with interacting objects (Ripley, 1973; Stoyan and Stoyan, 1975; Stoyan et al., 1995). Often, available data contain information about both locations and characteristics of objects, and interactions are expected to depend on both of them. Therefore, a suitable model for the data is a marked Gibbs point process, where the marks are the quantities connected to the points.
However, marked point processes are quite difficult to deal with, particularly in terms of simulation and estimation. The likelihood function usually depends on an unknown normalizing constant which is intractable, preventing direct use of the maximum likelihood estimation method. A more feasible possibility is to apply computer‐intensive Markov chain Monte Carlo maximum likelihood, which simulates ergodic Markov chains whose equilibrium distributions are those of the model. Simulation procedures for marked Gibbs processes are also based on running a sufficiently long Markov chain until it reaches equilibrium, i.e. until the Markov chain has reached the point process distribution. Markov chain Monte Carlo methods have become increasingly popular in statistical computation. However, iterative simulation presents one problem beyond those of traditional statistical methods: one must decide when to stop the Markov chain or, more precisely, one must judge how close the algorithm is to convergence after a finite number of iterations. Classical methods for monitoring convergence are not fully implemented.
One strategy is to compute the energy marking (Stoyan et al., 1995) over each simulated point pattern obtained after each iteration of the Markov chain. This provides a collection of serially dependent data in the form of a kind of time series, and convergence to equilibrium of this time series is essential. Analysing multiscale change points for these serially dependent data is directly related to checking for equilibrium. Thus we claim that an adaptation of the SMUCE method to detect change points (how many and where) over the Markov chain of the energy marking is worth pursuing as a statistical tool for investigating the convergence of such chains in marked point processes.
Dao Nguyen (University of Michigan, Ann Arbor)
I congratulate the authors for their contribution in an interesting and stimulating paper. One of the most challenging problems in change point inference consists of finding the number K of change points and its formal treatment in applications. In this regard, the authors combine likelihood ratio and penalized likelihood approaches by calibrating the scale at level α and treating the choice of K as a maximization problem under multiscale constraints. One of the strong points of this approach is that, by using penalized likelihood, it allows (at least in theory) a large class of functions to be treated simultaneously. However, it also introduces extra arbitrariness and the optimization generally appears slow and time consuming. From my perspective, it will be important to make sure that the method competes effectively with the other (non‐penalized) approaches. Using the same setting for the comparative genomic hybridization data set as in the paper, Table 12 shows some simple empirical comparisons with the SMUCE approach: CumSeg of Muggeo and Adelfio (2000), UnbaHaar of Fryzlewicz (2004) and circular binary segmentation, CBS, of Olshen et al. (2004).
| Method | Trend | σ | K⩽5 | K=6 | K=7 | K⩾8 | Mean‐squared error | Mean absolute error | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| SMUCE | No | 0.1 | 0 | 951 | 49 | 0 | 0.000761 | 0.4492 | 493 |
| CumSeg | No | 0.1 | 0 | 960 | 36 | 4 | 0.1543 | 0.780 | 105 |
| UnbaHaar | No | 0.1 | 0 | 857 | 39 | 104 | 0.000506 | 0.455 | 338 |
| CBS | No | 0.1 | 0 | 825 | 111 | 64 | 0.000828 | 0.4293 | 94 |

(Columns K⩽5 to K⩾8 give the results for the respective numbers of change points.)
From Table 12, I have no reason to prefer SMUCE to the other approaches when computational power and time are critical. I suggest that further considerations about the computational performance of this approach should be taken into account.
Another appealing feature of the method is the ability to derive error bounds for the change point locations through the selection of a threshold q. However, the selection strategy is rather ad hoc in the sense that it is based on the prior and pilot Monte Carlo simulations. Indeed, the motivation for the choice of the parameter q is not convincing to me as it assumes simple Gaussian observations and extrapolates to other models only on the basis of empirical evidence. It is not even clear why values of q that maximize a minorant of the function of interest, rather than the function itself, would be useful. To become of more general interest, the choices for the threshold q, the prior of λ and η, and the class of functions f used in the maximization of the likelihood will need to be further justified.
I recognize the beauty of this hybrid approach, and, although it seems to achieve good results in terms of accuracy, generalization of this approach is not straightforward. It would be interesting to investigate its performance for missing data or longitudinal data. In my opinion, the approach pursued by the authors looks promising and worthy of further exploration.
Richard Nickl (University of Cambridge)
(50)
covered for a fixed sample size n is, as n→∞, actually not strongly non‐parametric in the sense of Donoho (2001), whose results hence say nothing in particular about necessity of the condition (50) employed by the present authors.
(51)
Comparing the likelihood ratio between a uniform zero signal and a small perturbation, and using lower bound arguments as in Hoffmann and Nickl (2008), one can derive a lower bound in the sense that for Δ, λ too small no test can distinguish between these two hypotheses. This would give a sound theoretical approach to clarify whether improvements on the authors' method are still to be expected.
Rui Song (North Carolina State University, Raleigh), Michael R. Kosorok and Jason P. Fine (University of North Carolina, Chapel Hill)
The authors are to be congratulated for their thoughtful paper on multiscale change point inference. Hypothesis testing for change point models is notoriously challenging, since, under the null hypothesis, the location of the change point (or change points) is no longer identifiable. The asymptotic properties of the classical likelihood ratio, Wald and score tests are thus non‐standard. Furthermore, likelihood ratio tests do not have the Bahadur efficiency property since the regularity conditions no longer hold. Constructing optimal tests for the existence of the change point is thus rather demanding. In the paper, the authors have derived a test having ‘optimal’ asymptotic power, as stated in theorem 5. The current definition of optimality appears to be somewhat different from the usual definition. It would be worthwhile to discuss whether the proposed optimal testing procedure has any optimality properties in the traditional sense, such as the optimality illustrated in Song et al. (2013).
The change point problems discussed in this paper deal primarily with univariate random variables. In regression settings, such as in multiple linear regression, the regression coefficients may change with the value of another variable, which we call a change point based on thresholding a covariate. This set‐up was considered in detail in chapter 14 of Kosorok (2012). The extension of the proposed multiscale inference to such change point models remains unclear but would be quite useful in expanding the scope of application. We would appreciate comments from the authors on the difficulties in such extensions.
In Section 1.6, the authors claimed that ‘our analysis also provides an interface for incorporating a priori information on the true signal into the estimator’. We would like to make a connection with Song et al. (2013), where an asymptotically optimal likelihood ratio test for the existence of the change point was derived employing a prior distribution for the change point. In particular, the weighted average power for the proposed test with respect to the specified prior was shown to be optimal. It would be interesting to explore the theoretical advantages in incorporating a priori information in multiscale change point inference and to compare the resulting inferences with those obtained by using classical approaches like those in Song et al. (2013).
Alexandre B. Tsybakov (Centre de Recherche en Economie et Statistique–Ecole Nationale de la Statistique et de l’Administration Economique, Malakoff)
The paper by Klaus Frick, Axel Munk and Hannes Sieling provides a very interesting and thought‐provoking contribution demonstrating the power of the multiscale approach in statistics. The method suggested (SMUCE) shows excellent performance in simulations and the authors provide an original way of constructing confidence intervals. The theory developed in the paper focuses on the correct estimation of the number of change points. However, it remains unclear whether SMUCE achieves correct model selection (recovery of the set of change points), i.e. do we have
At the same time, correct model selection is granted for simple procedures such as thresholding of Yi−Yi−1 or related techniques such as the fused lasso. The conditions on the jump size Δ and on the distance between jumps λ are crucial. Assuming for simplicity Gaussian observations, for correct model selection by these simple methods it is enough to have Δ⩾cσ√{ log (n)/n} for some constant c>0 and no condition is needed on λ. This seems not to be so for SMUCE. For example, if Δ∼σ√{ log (n)/n} the underestimation bound of theorem 2 is meaningful only for large λ: not for λ comparable with log (n)q/n for q>0. Also, in the simulations, the situation is very favourable; the vector of differences θi−θi−1 is extremely sparse (λ is large) and Δ∼σ. In view of this, it is not clear what is meant by the remark in Section 6.3 that SMUCE employs a weaker notion of sparsity (than the l1‐based methods), i.e. s=n. Since the question here is about sparsity of the differences θi−θi−1, the condition s=n is equivalent to λ=1/n, which is prohibited by the theory in the paper. In contrast, for the l1‐based methods, correct model selection is possible for any s, including s=n (consider, for example, soft thresholding of Yi−Yi−1).
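For reference, a toy version of the simple difference-thresholding procedure mentioned above is sketched below. The threshold is taken of order σ√log(n), the natural scale for the noise in the successive differences themselves; how this relates to the Δ⩾cσ√{log(n)/n} condition quoted in the comment depends on the normalization of the model (see also the authors' rejoinder), and the constant c is an arbitrary illustrative choice.

```python
import numpy as np

def diff_threshold_changepoints(y, sigma, c=2.5):
    """Declare a change point wherever |Y_i - Y_{i-1}| exceeds c*sigma*sqrt(log n).
    Toy version of the 'simple procedure' discussed above; c is illustrative."""
    thr = c * sigma * np.sqrt(np.log(len(y)))
    return np.where(np.abs(np.diff(y)) > thr)[0] + 1   # indices starting a new segment

rng = np.random.default_rng(0)
theta = np.repeat([0.0, 2.0, 4.0], [150, 100, 150])
y = theta + rng.normal(0.0, 0.2, theta.size)
print(diff_threshold_changepoints(y, sigma=0.2))       # typically reports 150 and 250
```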
However, the simulations in the paper, as well as those in Rigollet and Tsybakov (2012), lead to the conclusion that fused lasso techniques are not very efficient for piecewise constant signals. Moreover, in the study of Rigollet and Tsybakov (2012), the fused lasso turns out to be the least efficient among several fused sparsity methods, and the best performance is achieved by the fused exponentially weighted (EW) aggregate. The regression setting in Rigollet and Tsybakov (2012) covers the Gaussian case of the paper being discussed. Recently, Arias‐Castro and Lounici (2012) proved that the EW aggregate achieves correct model selection under much weaker assumptions than the lasso. This confirms the striking improvement over the lasso that is observed for the EW techniques in simulations (Rigollet and Tsybakov, 2012). Therefore, it would be interesting to compare, in terms of model selection performance, SMUCE with the fused EW aggregate, and not only with the fused lasso.
Guenther Walther (Stanford University)
The multiscale test that is considered in the paper requires the computation of local test statistics on all intervals [i/n,j/n] for 1⩽i⩽j⩽n. A straightforward implementation will thus result in an O(n^2) algorithm, making computation infeasible for problems of moderate size. The authors address this by introducing methodology based on dynamic programming, which results in an improved computation time in many cases. Alternatively, it may be promising to evaluate the statistic on only a sparse approximating set of intervals. Rivera and Walther (2004) introduced such an approximating set for the closely related problem of constructing a multiscale likelihood ratio statistic for densities and intensities. The idea is that, after considering an interval [i/n,j/n] with large j−i, not much is gained by also looking at, say, [i/n,(j+1)/n]. It turns out that it is possible to construct an approximating set of intervals that is sufficiently sparse to allow computation in O{n log (n)} time, but which is still sufficiently rich to allow optimal detection of jumps.
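Two systems that are much sparser than the full O(n^2) collection of intervals can be enumerated directly: all intervals of dyadic length, which is the reduction used in stepR and has O(n log n) members, and a further thinning of the start points proportional to the length. The particular thinning below is an illustrative choice and not the specific approximating set of Rivera and Walther.

```python
import numpy as np

def dyadic_length_intervals(n):
    """All intervals [i, i + l) with l a power of 2: O(n log n) of them."""
    out, l = [], 1
    while l <= n:
        out += [(i, i + l) for i in range(0, n - l + 1)]
        l *= 2
    return out

def thinned_intervals(n, frac=0.5):
    """Additionally restrict the start points to a grid of spacing frac*l per
    scale l (illustrative spacing, not the Rivera-Walther construction)."""
    out, l = [], 2
    while l <= n:
        step = max(1, int(frac * l))
        out += [(i, i + l) for i in range(0, n - l + 1, step)]
        l *= 2
    return out

for n in (1_000, 10_000):
    print(n, n * (n + 1) // 2, len(dyadic_length_intervals(n)), len(thinned_intervals(n)))
```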
Another advantage in employing such an approximating set is that it considerably simplifies the theoretical treatment of the multiscale statistic. It is notoriously difficult to establish theoretical results about the null distribution of the multiscale statistic such as theorem 1. Even if one is not interested in the limiting distribution because the critical value is obtained by simulation (an option which is made feasible by the O{n log (n)} algorithm described above!), it is still necessary to show that the null distribution is Op(1) to establish optimality results such as theorem 5. The standard method of proof introduced in Dümbgen and Spokoiny (2001) requires establishing two exponential inequalities: first, one needs to establish sub‐Gaussian tails for the local statistic, which is often straightforward. The second exponential inequality, however, concerns the change between local test statistics, and this inequality is often very difficult to derive. Rivera and Walther (2004) showed that, if one employs a sparse approximating set, then the Op(1) result follows directly from the sub‐Gaussian tail property of the local statistics, which is typically easy to obtain.
It appears that sparse approximating sets can similarly be constructed for relevant multivariate problems; see Walther (2000). The advantages of these sets for computation as well as theoretical analysis suggest that these approximating sets may play an important role for these types of problem in general.
Chun Yip Yau (Chinese University of Hong Kong)
- (a) In this paper the multiscale statistic
is considered. By showing that
is close to
, the asymptotic distribution of the test statistics and large deviation bounds are derived (see lemmas 7.2 and 7.5 of the on‐line supplement). Since the sample mean
is a sufficient statistic for θ under the assumed one‐dimensional exponential family model, we may alternatively define the test statistics as
(52)
Defining the test statistics by equation (52) is likely to have the following advantages.
- It is computationally more efficient as it avoids maximizing the likelihood function required in computing
.
- It has better approximation to the null distribution and has sharper large deviation probability bounds, and hence more accurate confidence bands.
- It can be readily extended to dependent data by a proper standardization of
. The results in Lai et al. (2004) about large deviation of self‐normalized processes may be useful in establishing theoretical results analogous to the current method.
- (b) The widely used likelihood ratio statistic
seems to be more directly addressing the change point problem. Would it give more efficient results if the test statistic is defined in terms of

rather than
? The computational burden using
and
is the same since both reduce to the evaluation of
. Also, the probabilistic properties of
have been well studied (e.g. Csörgő and Horváth (2000)).
- (c) The simultaneous confidence band is conservative as the large deviation probability bounds are not sharp. In view of theorem 4, the accuracy appears to be decreasing with the number of change points. Therefore, if we are interested only in constructing confidence intervals for the change points, can we achieve higher accuracy by doing it locally for each of the change points separately? That is, once the set of estimated change points
has been obtained, the procedure in Section 3.2 is applied on
to obtain a confidence interval for
. As the estimated change points are independent of each other, a simultaneous confidence set for all change points can be calibrated from the
intervals.
- (d) Lemma 1 requires the penalty γ to be of order O{n^2 log (n)}, which is much higher than the order of some common penalties such as the Bayes information criterion's O{ log (n)} and the Akaike information criterion's O(1), and is even greater than the cost function
of order O(n). Why is such a strong penalty required?
Zhou Zhou (University of Toronto)
I congratulate Professor Frick, Professor Munk and Dr Sieling for this very stimulating paper. They propose to estimate a piecewise constant trend function via minimizing the model complexity subject to constraints on a multiscale goodness‐of‐fit statistic Tn. It is interesting that the tuning parameter q in the estimation is directly related to an upper bound for the probability of overestimating the number of change points. For independent observations belonging to a one‐dimensional exponential family, the authors comprehensively studied the properties of the change point estimator proposed, many of which are optimal under various criteria.
(53)
‐valued nuisance functions whose behaviours are not of interest in the current study and {Hi} is a stochastic process which depicts the temporal copula structure of the data and does not influence the marginal distributions. In particular, the Hi are assumed to be marginally uniform(0,1) distributed. A natural question to ask is whether SMUCE will be robust to changes in θ(·) or {Hi}. For instance, in the GBM31 data studied in the paper, it seems that the suspected change point at around observation 700 for α⩾0.4 (Fig. 11) in fact depicts a decrease in the marginal variance of the sequence. Hence, if the answer to the latter question is negative, it is then of interest to know whether there is a modification of SMUCE which makes it robust to changes in the nuisance parameters and/or higher order structures of the series.
The authors replied later, in writing, as follows.
First, we thank all the discussants for their numerous inspiring comments, suggestions and thought‐provoking questions. Many of these comments pave the way to extensions of SMUCE reaching far beyond initial expectations and will open up new research in this area.
We shall not be able to address all the comments in detail in this rejoinder. Indeed, some are quite challenging and deserve a more thorough analysis. In what follows, we primarily focus on those issues which we identify as common themes of the contributions.
In our presentation, we confined ourselves to independent observations from a one‐dimensional exponential family to present the basic ideas as clearly and concisely as possible. We agree that extensions to more general models are necessary and important and we are grateful for the many fruitful suggestions. Some of them will be addressed in what follows, some have already been elaborated on by discussants and some seem to be very challenging to us and are under way.
Other distributions and statistics
As pointed out by many discussants (Crudu, Porcu and Bevilacqua, Davies, Hušková, Kovac, and Linton and Seo) extensions of SMUCE to other distributions are possible, e.g. under Cramér‐type conditions on the moment‐generating function. To control the overestimation error an exponential deviation bound is required for SMUCE and many tools are available to achieve this.
(54)
is completely determined by its small‐scale behaviour. More precisely, only terms of scales of the order j−i∼ log (n) matter for its Gumbel extreme value limit (Kabluchko and Munk, 1994). Further, asymptotically τk+1−τk does not enter as it does for the distributional limit of SMUCE. One may argue that large scales need not be included in statistic (54) as they do not contribute to the maximum with high probability.
Chen, Shah and Samworth, and Yao sketch an interesting and quite general approach to extend SMUCE beyond exponential families by using the (pseudo‐)Gaussian likelihood in a generalized additive noise model. This offers a variety of possibilities and simplifies theory while allowing us to control computational cost. However, when it comes to statistical efficiency, it must be checked carefully how estimation accuracy and coverage of the associated confidence set are affected. The simulation in Section 5.3 shows the benefit of using the exact model‐based likelihood ratios (instead of Gaussian surrogates) for Poisson observations. In particular for low count regions of the signal (where the normal approximation fails) change points will not be identified well, which results in a systematic underestimation of the number of the change points (see Table 3 in the paper).
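A small numerical illustration of this point: on a single interval, the exact Poisson likelihood ratio statistic for a constant intensity θ0 and a Gaussian surrogate standardized by the null intensity (one common choice, assumed here for definiteness) typically differ markedly in the low-count regime and much less for large counts.

```python
import numpy as np

def poisson_lr(y, theta0):
    """Exact 2*log likelihood ratio for H0: theta = theta0 on an interval,
    with the MLE ybar plugged in (convention 0*log 0 = 0)."""
    m, ybar = len(y), y.mean()
    if ybar == 0:
        return 2 * m * theta0
    return 2 * m * (ybar * np.log(ybar / theta0) - ybar + theta0)

def gaussian_surrogate_lr(y, theta0):
    """Gaussian surrogate statistic with the null variance theta0 plugged in."""
    m, ybar = len(y), y.mean()
    return m * (ybar - theta0) ** 2 / theta0

rng = np.random.default_rng(2)
for theta0, theta1 in [(0.2, 0.8), (20.0, 23.0)]:      # low-count versus high-count regime
    y = rng.poisson(theta1, size=30).astype(float)
    print(theta0, round(poisson_lr(y, theta0), 2), round(gaussian_surrogate_lr(y, theta0), 2))
```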
So far, SMUCE has been extended to heavy‐tailed distributions by us in the context of quantile regression in Section 5.4. Addressing Aston and Kirch, and Coad's comments we mention that this provides a general and robust modification of SMUCE as it transforms the problem into a Bernoulli regression problem. However, we agree with Linton and Seo that it is worth developing more sophisticated theory and test statistics, e.g. specifically tailored to the tail behaviour of the error. In particular, for applications in finance, this might be useful as pointed out by Davies, and Linton and Seo.
Dependent data
A challenging and practically important task is to extend SMUCE to dependent data as discussed by Aston and Kirch, Coad, Linton and Seo, Mateu, and Yau among others. Indeed, there seem to be several possibilities to extend SMUCE to dependent data. Statistically most efficient appears to be an imitation of the likelihood approach as advocated in the present paper. However, this might be practically limited for several reasons, e.g. computational, or by the difficulty in simultaneously estimating the dependence structure and the piecewise constant signal.
A simple modification for dependent data is based on standardizing each local likelihood ratio statistic by its variance as illustrated for the case of a moving average MA(1) process in Section 6.1.1. Coad pointed towards m‐dependence. In fact, this has been elaborated parallel to this paper in the context of estimating the opening and closing states from ion channel recordings in Hotz et al. (1970). The method that was presented there requires knowledge of m and a reliable estimate of the dependence structure. A simple bound is given which shows how to recalibrate the parameter q to control the overestimation error α. A particular appeal of this ad hoc modification might be that the computation of the estimate is essentially the same as for SMUCE with independent observations. However, a thorough theory and implementation for more efficient methods for dependent data, e.g. likelihood‐based multiscale statistics, has not been explored yet and seems an interesting challenge for the future.
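A sketch of this kind of recalibration: for MA(1) noise X_t = ε_t + a ε_{t−1}, the variance of a local mean over k observations is {kγ0 + 2(k−1)γ1}/k², with γ0 = σ²(1+a²) and γ1 = σ²a, and a local statistic can be standardized by this quantity instead of σ²/k. The unpenalized z-statistic below is purely illustrative; it is not the implementation of Section 6.1.1.

```python
import numpy as np

def ma1_mean_variance(k, a, sigma2=1.0):
    """Variance of the mean of k consecutive observations of X_t = eps_t + a*eps_{t-1}."""
    gamma0, gamma1 = sigma2 * (1 + a ** 2), sigma2 * a
    return (k * gamma0 + 2 * (k - 1) * gamma1) / k ** 2

def local_z_statistic(y, i, j, mu0, a, sigma2=1.0):
    """Test H0: mean = mu0 on y[i:j], standardized with the MA(1) variance of the mean."""
    seg = y[i:j]
    return (seg.mean() - mu0) / np.sqrt(ma1_mean_variance(len(seg), a, sigma2))

rng = np.random.default_rng(3)
eps = rng.normal(0.0, 1.0, 501)
x = eps[1:] + 0.5 * eps[:-1]                 # MA(1) noise with a = 0.5
print(local_z_statistic(x, 100, 200, 0.0, a=0.5))
```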
Model extensions: multivariate and indirect measurements
The general methodology underlying SMUCE may be used for local statistics, different from likelihood ratios. This offers extensions to other settings and we are grateful for the numerous contributions in this direction. Liu made an interesting proposal: the use of a local composite likelihood ratio statistic to apply the methodology underlying SMUCE to a varying‐coefficient generalized linear model. We thank Farcomeni who proposes a novel modification for panel data regression and multivariate data, as questioned by Coad, and Enikeeva and Harchaoui, who also pointed towards indirect observation. In this case the situation changes completely for two reasons. First, asymptotic theory becomes regular for piecewise continuous change point regression (including piecewise constant) and the maximum likelihood will be Cramér–Rao efficient; see Frick, Hohage and Munk (2008). This is not so for the direct case that is treated in our paper as mentioned by Song, Kosorok and Fine who pointed out the non‐regularity of testing for a change point; see also Ibragimov and Khas'minskij (1981), Korostelev and Korosteleva (2002) and Antoch and Hušková (2000) for a similar phenomenon in estimation. From this perspective, indirect regression becomes even simpler. In contrast, computationally the problem changes drastically and the resulting estimation problems for the change points become a non‐linear and often non‐convex optimization problem, which is difficult to solve, in general (Frick, Hohage and Munk, 2008). In response to Song, Kosorok and Fine we also mention that in the current change point setting the number of change points is unknown and arbitrary, which makes this problem ‘non‐parametric’ of intermediate dimension (neither parametric p<<n nor high dimensional p>>n; rather n=p). This is very different from classical settings as the model dimension grows with the number of observations and different (asymptotic) efficiency concepts must be used (see theorem 5, theorem 6 and the comments by Jin and Ke, and Nickl).
Time dynamic models
Currently, theory underlying SMUCE assumes a regression model with a deterministic (unknown) function ϑ ∈ S. In this model, per se, the segments are not ‘intersegment dependent’, using Eckley's terminology. As pointed out by Eckley, Fuh and Teng, and Szajowski it is often more reasonable to assume ‘jump dynamics’ and to model this as a random process itself—and we fully agree. Probably, the most prominent and simple model in the change point context is a hidden Markov model (HMM) as described by Fuh and Teng. In Hotz et al. (1970) we performed simulations for the pathwise reconstruction of SMUCE in an HMM with two hidden states (without making use of the HMM assumption) and it performs remarkably well. Roughly speaking, the reason is that conditioned on the observations the reconstruction still obeys all recovery properties that are inherited within the (conditional) regression model. Interestingly, there is a dynamic programming analogue of SMUCE with the celebrated Viterbi algorithm, although pathwise the reconstructions are different, of course. However, in contrast with any HMM‐based estimation method for the most probable path, SMUCE neither incorporates any assumption on the number (and values) of states nor assumes a Markovian structure for the changes between the states. Indeed, we agree with Mateu that it would be of great interest if this could be incorporated in a sensible way. In fact, first simulations and examples show very promising results.
This issue is related to the Bayesian viewpoint as stressed by Fearnhead, Gasbarra and Arjas, and Nyamundanda and Hayes, which we shall address in more detail later. In both worlds we assume that the parameter ϑ is random. Finally, it depends on the application whether we may assume dynamics for the change points (as in an HMM or state space model). This gives us also the possibility for prediction—which SMUCE does not offer so far.
Beyond piecewise constant signals
In Section 5.1 we included a simulation scenario with a deterministic trend to assess the robustness of SMUCE against violations of constant segments. These results are complemented by the simulations that were provided by Bigot, who offers an interesting procedure which does not rely on the piecewise constant assumption of the signal as SMUCE does.
The other way round, it is of interest to investigate how well SMUCE approximates a signal which is not piecewise constant as addressed in the remarks of Farcomeni, and Linton and Seo. We have not yet developed a theory for the approximating properties of SMUCE to smooth functions. Nevertheless, we conjecture that it can be shown that SMUCE will converge at the minimax rate O(n−α/(2α+1)) to the true signal as long as it is not too smooth (i.e. if ϑ is in the approximation space Aα;0<α⩽1 in the terminology of Boysen et al. (2009)). The case α=1 gives O(n−1/3), supporting Linton and Seo's conjecture. For smoother signals (e.g. ϑ having more than one derivative) this rate will not improve further, according to the local constant reconstruction provided by SMUCE. For an illustration of such an approximation see Fig. 21.

Farcomeni, and Linton and Seo suggest that SMUCE can be employed for more general right continuous piecewise parametric functions, with which we basically agree. We do not see any conceptual limitations. For a fast implementation, however, the optimal local costs must be computed efficiently (see Section 3.1), which may be laborious and must be done case by case. Fig. 22 shows an extension to piecewise linear functions and its associated confidence intervals for the change point locations and confidence bands.

Fig. 22. Extension to piecewise linear functions, with confidence bands and confidence intervals for the change point locations, for α=0.1
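The computational point can be made concrete: for piecewise linear fits the dynamic program needs the optimal local cost, i.e. the residual sum of squares of a straight-line fit, on every candidate interval, and with cumulative sums this can be evaluated in O(1) per interval after an O(n) precomputation. The class below is an illustrative sketch of that device, not code from the stepR implementation.

```python
import numpy as np

class LinearSegmentCost:
    """O(1) least-squares cost of fitting a straight line on y[a:b], after an
    O(n) precomputation of cumulative sums."""
    def __init__(self, y):
        x = np.arange(len(y), dtype=float)
        z = np.zeros(1)
        self.cx = np.concatenate([z, np.cumsum(x)])
        self.cy = np.concatenate([z, np.cumsum(y)])
        self.cxx = np.concatenate([z, np.cumsum(x * x)])
        self.cyy = np.concatenate([z, np.cumsum(y * y)])
        self.cxy = np.concatenate([z, np.cumsum(x * y)])

    def cost(self, a, b):
        """Residual sum of squares of the best line on y[a:b]."""
        m = b - a
        sx, sy = self.cx[b] - self.cx[a], self.cy[b] - self.cy[a]
        sxx, syy = self.cxx[b] - self.cxx[a], self.cyy[b] - self.cyy[a]
        sxy = self.cxy[b] - self.cxy[a]
        vxx, vyy, vxy = sxx - sx * sx / m, syy - sy * sy / m, sxy - sx * sy / m
        return vyy if vxx == 0 else vyy - vxy * vxy / vxx

y = 0.1 * np.arange(50) + np.random.default_rng(6).normal(0.0, 0.05, 50)
lc = LinearSegmentCost(y)
print(lc.cost(0, 50), lc.cost(10, 30))
```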
Multiparametric regression


is t distributed with j−i degrees of freedom. The proper scale calibration term for the multiscale statistic must be computed in this case, which is not straightforward at all and is an interesting topic for future investigation. The computation of the estimate, however, is the same as for the case with constant variance as treated in the paper.
Multi‐dimensional models


This is a simplified version of the problem that is raised by Crudu, Porcu and Bevilacqua in connection with georeferenced data and has a long history in function estimation and statistical imaging, in general. Farcomeni's comment targets much the same issue. Generalizing SMUCE to this setting raises two main issues:
- (a) finding a statistically efficient and computationally feasible generalization of the multiscale statistic Tn(Y,ϑ) and
- (b) a reasonable substitute for the objective function #J(ϑ).
The latter reflects the fact that the piecewise constant (surface) segmentation problem is well known to be NP hard already for d=2.
is considered as a subset of the space BV(Ω) of functions of bounded variation. The jump set Dϑ of a function ϑ ∈ BV(Ω) is the collection of points x ∈ Ω at which left and right limits of ϑ exist (in a particular sense) but do not coincide. The jump set is rectifiable and has finite Hausdorff measure H^{d−1}(Dϑ). We thus suggest, for the case d>1, substituting the objective functional #J(ϑ) by the (d−1)‐dimensional Hausdorff measure of the jump set, H^{d−1}(Dϑ). It is important to note that this definition is consistent with the case d=1 since then Dϑ is the set of change points of ϑ and H^0 is the counting measure, thus H^0(Dϑ)=#J(ϑ). In the case d=2, for example, H^1(Dϑ) measures the perimeter of the jump set Dϑ. Summarizing, one way of generalizing SMUCE to a d‐dimensional domain is given by the optimization problem

minimize H^{d−1}(Dϑ) over ϑ, subject to Tn(Y,ϑ) ⩽ q.
Nature of SMUCE, error control and choice of threshold
in inequality (7) can even be refined to

suggests and is confirmed by all simulation studies performed by us and the discussants (see for example Chen, Shah and Samworth).
In summary, first and foremost, the parameter q is to determine the level of significance α to control the probability of overestimating K. As shown in theorem 1, this is determined by the null distribution, which further can be uniformly bounded in distribution by M in expression (15), i.e. M does not depend on ϑ. This has been questioned by Kovac. However, in addition, the exact as well as the asymptotic distribution depends on ϑ in a specific way (see expression (16)) which allows a refined determination of α if prior knowledge on ϑ is available. These refinements in turn yield improved detection power. Notably, ϑ enters only through the differences in the jump locations τk+1−τk, similarly to Zhang and Siegmund's (2007) refined Bayes information criterion type of penalty.
In his simulation study Fryzlewicz (accompanying our results in Section 5.1) compared SMUCE with wild binary segmentation (WBS) (an R package is available from http://cran.r-project.org/web/packages/wbs). The results suggest that particularly for high noise levels SMUCE (with its standard settings in stepR) has a tendency to underestimate the true number of jumps. This can be explained by the strict requirement that SMUCE gives a significance guarantee against overestimation, which becomes notoriously difficult for large variances. However, we do not fully agree with Fryzlewicz's conclusion concerning the superiority of the completely automatic WBS (sSIC) for small signal‐to‐noise ratios. The situation seems to be more complex and will depend on the signal itself. From this point of view, we stress that the possibility of choosing the level of significance α (as for any testing procedure) is a valuable tool for the data analysis. Therefore, we argue against the fully automatic ‘blind’ use of SMUCE (and any other segmentation method).
(55)

| Method | σ=1: 0 | σ=1: 1 | σ=1: ⩾2 | σ=2: 0 | σ=2: 1 | σ=2: ⩾2 | σ=3: 0 | σ=3: 1 | σ=3: ⩾2 |
|---|---|---|---|---|---|---|---|---|---|
| SMUCE (α=0.1) | 3.1 | 96.6 | 0.3 | 66.9 | 32.9 | 0.2 | 85.3 | 14.7 | 0 |
| SMUCE (α=0.5) | 0.3 | 89.8 | 9.9 | 24.9 | 70.2 | 4.5 | 47.2 | 48.8 | 4 |
| WBS (sSIC) | 1.1 | 96 | 2.9 | 61.2 | 35.2 | 3.6 | 85.1 | 11.8 | 3.1 |

(For each σ, the columns give the frequencies of the outcomes 0, 1 and ⩾2.)
This suggests relaxing the current use of the familywise error rate which is controlled by the multiscale statistic Tn. Farcomeni addresses this as he relates SMUCE to recent developments in multiple testing. In many applications (e.g. genetic screening) an approach based on the false discovery rate might be more appropriate (see also Siegmund et al. (2011)) and a false discovery rate variant is under current investigation.
In any case, it might be attractive to use SMUCE as an accompanying tool for any segmentation method as this can be used as an input for the multiscale statistic Tn. This allows us to detect, via the corresponding violator plot, regions which are rejected as constant at level α (see Fig. 4 for such a violator plot in the case of the maximum likelihood estimation solutions with varying numbers of jumps).
Finally, we are grateful to Kong and Sun for giving us the possibility of clarifying an issue: all exponential bounds do depend on σ in the normal case. Indeed, σ enters through the worst‐case signal‐to‐noise ratio Δ/σ; see the end of the first paragraph in Section 2.4. Accordingly the multiscale statistic Tn itself depends on the (estimated) variance as explained at the beginning of Section 5.1.
A Bayesian view
) is needed to determine the a posteriori threshold q such that




Nyamundanda and Hayes as well as Gasbarra and Arjas pointed out that full priors on the signal ϑ can be used to increase efficiency—and we fully agree. Misspecification is an issue then, however. A distinct difference between SMUCE and a fully Bayesian method seems to us that misspecification of the prior, if it is used as we suggest, will affect only the underestimation bound but not the overestimation bound. In contrast, misspecification of a prior on the segment length will degrade the overestimation bound and interpretation of α. This is why we argue that it may be an interesting strategy to employ prior information primarily for controlling the underestimation error.
The multiscale statistic Tn for model checking
Critchley raised the important issue of how SMUCE is related to model checking. In fact, the multiscale test Tn can be regarded from this point of view as a multiscale analogue of local‐likelihood‐based methods, i.e. residual‐based model checking in the normal case (Boysen et al., 2009). Expanding the example of Fig. 1 in the paper, any candidate function
might be accompanied by a ‘violator plot’ of those intervals, where the reconstruction of the signal gives rise to local rejection by
exceeding the threshold q. This is illustrated in Fig. 24 where we display those regions (blue) for several candidate functions (red). These are the (unconstrained) maximum likelihood estimates for
number of change points. For clarity we display only violators on intervals of dyadic lengths. The colour coding (levels of blue) has been chosen according to the number of overlapping intervals at a pixel (more violators correspond to darker blue). In this example the SMUCE solution (which is not displayed) almost coincides with the unconstrained maximum likelihood estimate for
.

Fig. 24. Candidate functions and regions in which the multiscale constraint is violated at level α=0.1 (true signal with K=8)
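A violator plot of this kind can be produced by scanning, for a given candidate function, all intervals (here of dyadic length, as in Fig. 24) on which the candidate is constant and on which a local statistic exceeds the threshold. The sketch below uses an unpenalized Gaussian local statistic, i.e. it omits the scale calibration term of the multiscale statistic, and the threshold in the example is arbitrary.

```python
import numpy as np

def violator_intervals(y, candidate, q, sigma):
    """Dyadic-length intervals on which `candidate` is constant and the
    (unpenalized) local Gaussian likelihood ratio statistic exceeds q."""
    n, out, l = len(y), [], 2
    while l <= n:
        for i in range(0, n - l + 1):
            seg_val = candidate[i:i + l]
            if np.ptp(seg_val) > 0:          # candidate not constant on this interval
                continue
            stat = l * (y[i:i + l].mean() - seg_val[0]) ** 2 / (2 * sigma ** 2)
            if stat > q:
                out.append((i, i + l))
        l *= 2
    return out

rng = np.random.default_rng(4)
truth = np.repeat([0.0, 1.0, 0.0], [60, 40, 60])
y = truth + rng.normal(0.0, 0.3, truth.size)
flat = np.zeros_like(truth)                  # candidate that ignores the bump
print(len(violator_intervals(y, flat, q=4.0, sigma=0.3)))
```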
Model selection performance and optimal detection
Jin and Ke highlight an interesting connection to what they call the ‘rare and weak’ setting in a more general context of a linear model Y=Xβ+ɛ (see also our Section 6.3 in the paper). Now βj=ϑj−ϑj−1 are considered as random. Furthermore, in our terminology, a scale corresponds to the length of a substring of the vector β. It is difficult to relate results in their model to ours rigorously but some rough comparison can be done at this stage. If we relate the expected number of change points in their model with K in ours, we obtain E[#{βj:βj≠0}]=n^{1−ϑ}=K, which shows that 0<K<∞ if and only if ϑ=1. In this case, for any r Jin and Ke are in the exact recovery setting (see their Fig. 17). As long as r⩽6+2√10∼12.325 the minimax detection rate for the Hamming selection error is
(up to log‐terms). Hence, for ϑ=1, r can be arbitrarily small to obtain full recovery. Then, their τn=√{2r log (n)} might be related to our λ and the expected jump size E[βj]=√{2r log (n)}/n would correspond to Δ (although this is the worst‐case jump size and not the average size). From the argument below theorem 7 we would obtain full model consistency in probability (i.e. we estimate all change points correctly and not only their number, which we may call weak model consistency) as long as we have that the right‐hand side in theorem 7 vanishes. If we fix cn=c>0 (as this bound is non‐asymptotic), we obtain full model consistency in probability. For this, we require only Δ>c√{ log (n)/n} for any c>0. Note that the assertion in theorem 7 is uniformly over the set
which includes SMUCE. As the optimal estimation rate for Δ (if fixed) is known to be of order 1/√n, the previous estimate yields that SMUCE loses a log (n)‐factor.
Hence, we find that the difference in the rare and weak setting of our model is of order n^{−1/2}, which might be explained by parameter ‘averaging’ in the first model and by the additional sampling error in ours. It would be interesting to compare empirically in the change point setting the CASE estimator with SMUCE. We believe that a similar comment applies to Tsybakov; it seems that here also the sampling error matters. He points out that for full model selection consistency (in the above sense) in the framework of variable selection no condition on λ is needed.
Indeed, in our model, choosing ϑ with change point locations between sampling points shows that the probability of exact recovery of the change points is 0 for any n owing to the sampling error. However, we might restrict the possible change points to the locations of the coefficients β to circumvent this difficulty.

where
is a centred Gaussian vector with covariance A, the inverse of X^TX, where (X^TX)_{i,j}=min(i,j). A is an n×n tridiagonal band matrix with diagonal elements a_{i,i}=2 for 1⩽i⩽n−1, a_{n,n}=1 and all subdiagonal and superdiagonal elements −1, i.e. the error vector is 2‐dependent. From this representation a subtle but important distinction from the Gaussian sequence model becomes obvious: the error variance is still σ2 and not σ2/n. Hence, the threshold for full model consistency of hard thresholding selection becomes (see Tsybakov (2013))

We found it quite difficult to understand how the scales enter the linear model representation Y=Xβ+ɛ when we interpret this in terms of the sampling model as in Section 2.1.1. In our model the jump is embedded in the function ϑ, and this is independent of n. For example, to express in the linear model representation above that a change point is in the middle of the interval [0,1] requires a sequence of parameter vectors β(n), where the ⌈n/2⌉‐entry is non‐zero and all others are 0.
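The covariance structure invoked above is easy to check numerically: the inverse of the matrix with entries min(i,j) is indeed the stated tridiagonal matrix.

```python
import numpy as np

n = 8
G = np.minimum.outer(np.arange(1, n + 1), np.arange(1, n + 1))   # (X^T X)_{ij} = min(i, j)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[-1, -1] = 1                                                    # a_{n,n} = 1
print(np.allclose(np.linalg.inv(G), A))                          # True
```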
Confidence sets
equals the true number of change points. This yields
disjoint confidence intervals for the change point locations. If q=q1−α is the (1−α)‐quantile of M the coverage of these intervals is bounded by

necessarily depends on λ and Δ. In particular, this implies that confidence intervals cannot be provided at any level but the error level is bounded by
, which serves as an a posteriori confidence level given the data. This can be much less than 1−α and is in accordance with Fearnhead's comment, that the interpretation of these intervals is critical, if 
Towards more accurate bands
, i.e.

Let the multiscale constraint be fulfilled by a function
with
change points, i.e.
By adding one additional arbitrary change point to
we obtain a class of functions that all lie in C(q*,q1−α). Therefore, intervals that do not contain a change point will not be detected by this approach with high probability. Such intervals, however, are essential for the confidence bands based on SMUCE.
Killick pointed out that the construction of confidence intervals can be adapted to PELT. This appealing approach seems to work quite well empirically and it is certainly interesting to develop a theory for this approach.
Nickl, and Enikeeva and Harchaoui raise the challenging question concerning (asymptotic) optimality of SMUCE. For weak signals on a long interval, i.e. λ>0 is fixed, from theorems 5, part (a), and 6, part (a), it follows that changes can be detected as long as Δ=o(n−1/2). If inference concerns detection of a jump on scales which asymptotically shrink to 0 the answer is provided for the Gaussian case by theorems 5 and 6 (and the discussion between), which shows that SMUCE indeed achieves the minimax detection bound in the testing sense for a simple change point alternative. For more complex signals with an unbounded number of change points we do not know whether the constant that is given in theorem 6 is optimal, and also not for non‐Gaussian error. Whether condition (9) is minimal for an (asymptotically) honest inference procedure on change points is a challenging issue and worth being investigated further.
Computational issues
We are very grateful to Maidstone and Pickering, and Nguyen for their run time comparison. They pointed out that in the R‐package stepR only intervals of dyadic lengths are considered for n>1000 observations, to speed up computations. It is important to note that this is not merely an approximation to SMUCE, as also the simulated (and asymptotic) quantiles for this reduced multiscale statistic change. In fact, in spite of the big data challenge in many applications nowadays, it is of great practical interest to consider even smaller systems of intervals to speed up computations further. Such a system was suggested in Walther (2000) and Rivera and Walther (2004) and it was shown that it achieves optimal detection rates in their context. As mentioned in Walther's comment, this reduced system is of order O{n log (n)}. By considering SMUCE for such a reduced system it seems possible to reduce the complexity of the dynamic program. This, however, is not straightforward and it is important to distinguish between the complexity of evaluating the multiscale statistic for testing, as addressed by Walther, and the complexity of the dynamic program, which solves the corresponding constrained optimization problem to obtain SMUCE.
Responding to Hušková, pseudocode for the algorithm is given in Futschik et al. (2008). We are grateful to Killick who addresses the connection between SMUCE as a constrained optimization problem and its formulation as a penalized optimization problem. To shed some light on this connection we showed how SMUCE can be rewritten with a certain penalized cost functional. This was done mainly for expository reasons and we had no practical consequences in mind, beyond illustrating that dynamic programming is applicable (see lemma 1). The actual implementation in the R package stepR does not rely on lemma 1 and therefore we do not need to choose γ in practice. The only parameter to choose is q, i.e. the level of significance α.
Again, we express our deep thanks to all the discussants. We thank also E. Arias‐Castro, L. D. Brown, M. Diehn, L. Dümbgen, A. Futschik, M. Hušková, O. Lepski, F. Pein, R. Samworth, D. Siegmund, I. Tecuapetla, A. Tsybakov and G. Walther for additional comments and discussions during the process of writing the paper and the rejoinder. Special thanks go to T. Hotz, who contributed significantly to the implementation in stepR. We are very grateful for the hospitality of the Newton Institute, Cambridge, where parts of this rejoinder were written, the Royal Statistical Society and Series B for hosting the discussion and the opportunity to present our work. We acknowledge support of Deutsche Forschungsgemeinschaft grants FOR 916, CRC 755 and CRC 803.
References in the discussion