Multiscale inference and long-run variance estimation in nonparametric regression with time series errors

In this paper, we develop new multiscale methods to test qualitative hypotheses about the regression function m in a nonparametric regression model with fixed design points and time series errors. In time series applications, m represents a nonparametric time trend. Practitioners are often interested in whether the trend m has certain shape properties: for example, whether m is constant or whether it is increasing or decreasing in certain time regions. Our multiscale methods allow us to test for such shape properties of the trend m. To implement the methods, we require an estimator of the long-run variance of the error process. We propose a new difference-based estimator of the long-run error variance for the case that the error terms form an AR(∞) process. In the technical part of the paper, we derive asymptotic theory for the proposed multiscale test and for the estimator of the long-run error variance. The theory is complemented by a simulation study and an empirical application to climate data.


Introduction
The analysis of time trends is an important aspect of many time series applications. In a wide range of situations, practitioners are particularly interested in certain shape properties of the trend. They raise questions such as the following: does the observed time series have a trend at all? If so, is the trend increasing or decreasing in certain time intervals? Can one identify the intervals of increase and decrease? As an example, consider the time series plotted in Fig. 1, which shows the yearly mean temperature in central England from 1659 to 2017. Climatologists are very much interested in learning about the trending behaviour of temperature time series like this; see for example Benner (1999) and Rahmstorf et al. (2017). Among other things, they would like to know whether there is an upward trend in the central England mean temperature towards the end of the sample, as visual inspection might suggest.
In this paper, we develop new methods to test for certain shape properties of a non-parametric time trend. In particular, we construct a multiscale test which enables us to identify local increases and decreases of the trend function. We develop our test in the context of the following model setting: we observe a time series {Y_{t,T} : 1 ≤ t ≤ T} of the form

Y_{t,T} = m(t/T) + ε_t.    (1.1)

As usual in non-parametric regression, we let the function m depend on rescaled time t/T rather than on real time t. A detailed description of model (1.1) is provided in Section 2. Our multiscale test is developed step by step in Section 3. Roughly speaking, the procedure can be outlined as follows: let H_0(u, h) be the hypothesis that m is constant in the time window [u − h, u + h] ⊆ [0, 1], where u is the midpoint and 2h the size of the window. In a first step, we set up a test statistic ψ̂_T(u, h) for the hypothesis H_0(u, h). In a second step, we aggregate the statistics ψ̂_T(u, h) for a large number of time windows [u − h, u + h]. We thereby construct a multiscale statistic which enables us to test the hypothesis H_0(u, h) simultaneously for many time windows [u − h, u + h]. In the technical part of the paper, we derive the theoretical properties of the resulting multiscale test. To do so, we develop a proof strategy which combines strong approximation results for dependent processes with anticoncentration bounds for Gaussian random vectors. This strategy is of interest in itself and may be applied to other multiscale test problems for dependent data. As our theoretical analysis shows, our multiscale test is a rigorous level α test of the overall null hypothesis H_0 that H_0(u, h) is simultaneously fulfilled for all time windows [u − h, u + h] under consideration.
Moreover, for a given level of significance α ∈ (0, 1), the test enables us to make simultaneous confidence statements of the following form: we can claim, with statistical confidence 1 − α, that there is an increase or decrease in the trend m on all time windows [u − h, u + h] for which the hypothesis H_0(u, h) is rejected. Hence, the test enables us to identify, with a prespecified statistical confidence, time intervals where the trend m is increasing or decreasing.
For independent data, multiscale tests have been developed in a variety of contexts in recent years. In the regression context, Chaudhuri and Marron (1999, 2000) introduced the so-called SiZer method which has been extended in various directions; see for example Hannig and Marron (2006) where a refined distribution theory for SiZer is derived. Hall and Heckman (2000) constructed a multiscale test on monotonicity of a regression function. Dümbgen and Spokoiny (2001) developed a multiscale approach which works with additively corrected supremum statistics and derived theoretical results in the context of a continuous Gaussian white noise model. Rank-based multiscale tests for non-parametric regression were proposed in Dümbgen (2002) and Rohde (2008). More recently, Proksch et al. (2018) have constructed multiscale tests for inverse regression models. In the context of density estimation, multiscale tests have been investigated in Dümbgen and Walther (2008), Rufibach and Walther (2010), Schmidt-Hieber et al. (2013) and Eckle et al. (2017) among others.
Whereas a large number of multiscale tests for independent data have been developed in recent years, multiscale tests for dependent data are much rarer. Most notably, there are some extensions of the SiZer approach to a time series context. Park et al. (2004) and Rondonotti et al. (2007) introduced SiZer methods for dependent data which can be used to find local increases or decreases of a trend and which may thus be regarded as an alternative to our multiscale test. However, these SiZer methods are mainly designed for data exploration rather than for rigorous statistical inference. Our multiscale method, in contrast, is a rigorous level α test of the hypothesis H 0 which enables us to make simultaneous confidence statements about the time intervals where the trend m is increasing or decreasing. Some theoretical results for dependent SiZer methods were derived in Park et al. (2009), but only under quite a severe restriction: only time windows [u − h, u + h] with window sizes or scales h are taken into account that remain bounded away from zero as the sample size T grows. Scales h that converge to 0 as T increases are excluded. This effectively means that only large time windows [u − h, u + h] are taken into consideration. Our theory, in contrast, enables us to consider simultaneously scales h of fixed size and scales h that converge to 0 at various rates. We can thus take into account time windows of many sizes. In Section 3.4, we compare our approach with SiZer methods for dependent data in more detail.
Our multiscale approach is also related to wavelet-based methods: similar to the wavelet-based methods, it takes into account different locations u and resolution levels or scales h simultaneously. However, whereas our multiscale approach is designed to test for local increases and decreases of a non-parametric trend, wavelet methods are commonly used for other purposes. Among other things, they are employed for estimating or reconstructing non-parametric regression curves (see for example Donoho et al. (1995) or Von Sachs and MacGibbon (2000)) and for change point detection (see for example Cho and Fryzlewicz (2012)).
The test statistic of our multiscale method depends on the long-run error variance σ² = Σ_{l=−∞}^{∞} cov(ε_0, ε_l), which is usually unknown in practice. To carry out our multiscale test, we thus require an estimator of σ². Indeed, such an estimator is required for virtually all inferential procedures in the context of model (1.1). Hence, the problem of estimating σ² in model (1.1) is of broader interest and has received considerable attention in the literature; see Müller and Stadtmüller (1988), Herrmann et al. (1992) and Hall and Van Keilegom (2003) among many others. In Section 4, we introduce a new difference-based estimator of σ² for the case that {ε_t} belongs to the class of auto-regressive AR(∞) processes. This estimator improves on existing methods in several respects.
The methodological and theoretical analysis of the paper is complemented by a simulation study in Section 5 and two empirical applications in Section 6. In the simulation study, we examine the finite sample properties of our multiscale test and compare it with the dependent SiZer methods that were introduced in Park et al. (2004) and Rondonotti et al. (2007). Moreover, we investigate the small sample performance of our estimator of σ 2 and compare it with the estimator of Hall and Van Keilegom (2003). In Section 6, we use our methods to analyse the temperature data from Fig. 1 as well as a sample of global temperature data. The data that are analysed in the paper and the computer code that was used for the simulations and the analysis of the empirical data can be obtained from https://rss.onlinelibrary.wiley.com/hub/journal/14679868/seriesb-datasets.

The model
We now describe in detail the model setting which was briefly outlined in Section 1. We observe a time series {Y_{t,T} : 1 ≤ t ≤ T} of length T which satisfies the non-parametric regression equation

Y_{t,T} = m(t/T) + ε_t.    (2.1)

Here, m is an unknown non-parametric function defined on [0, 1] and {ε_t : 1 ≤ t ≤ T} is a zero-mean stationary error process. For simplicity, we restrict attention to equidistant design points x_t = t/T. However, our methods and theory can also be carried over to non-equidistant designs. The stationary error process {ε_t} is assumed to have the following properties.
Conditions 1-3 are fulfilled by a wide range of stationary processes {ε_t}. As a first example, consider linear processes of the form ε_t = Σ_{i=0}^{∞} c_i η_{t−i} with ‖ε_t‖_q < ∞, where the coefficients c_i are absolutely summable and the η_t are IID innovations with E[η_t] = 0 and ‖η_t‖_q < ∞. Trivially, conditions 1 and 2 are fulfilled in this case. Moreover, if |c_i| = O(ρ^i) for some ρ ∈ (0, 1), then condition 3 is easily seen to be satisfied as well. As a special case, consider an auto-regressive moving average ARMA(p, r) process of the form

ε_t = Σ_{j=1}^{p} a_j ε_{t−j} + η_t + Σ_{j=1}^{r} b_j η_{t−j},

where a_1, ..., a_p and b_1, ..., b_r are real-valued parameters. As before, we let η_t be IID innovations with E[η_t] = 0 and ‖η_t‖_q < ∞. Moreover, as usual, we suppose that the complex polynomials A(z) = 1 − Σ_{j=1}^{p} a_j z^j and B(z) = 1 + Σ_{j=1}^{r} b_j z^j do not have any roots in common. If A(z) does not have any roots inside the unit disc, then the ARMA process {ε_t} is stationary and causal. Specifically, it has the representation ε_t = Σ_{i=0}^{∞} c_i η_{t−i} with |c_i| = O(ρ^i) for some ρ ∈ (0, 1), implying that conditions 1-3 are fulfilled. The results in Wu and Shao (2004) show that condition 3 (as well as the other two conditions) is fulfilled not only for linear time series processes but also for a variety of non-linear processes.
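The ARMA example can be checked numerically. The following sketch uses hypothetical parameter values a = 0.5 and b = 0.3 for an ARMA(1, 1) process, computes its causal MA(∞) coefficients, verifies their geometric decay and evaluates the long-run variance of the resulting linear process:

```python
import numpy as np

# Hypothetical ARMA(1, 1) error process eps_t = a*eps_{t-1} + eta_t + b*eta_{t-1}
# with a = 0.5 and b = 0.3, so that A(z) = 1 - a z has no root in the unit disc
# and the process is stationary and causal.
a, b = 0.5, 0.3

# Causal MA(infinity) coefficients: c_0 = 1 and c_i = (a + b) * a**(i - 1) for
# i >= 1, which decay at the geometric rate O(rho^i) with rho = a.
c = np.array([1.0] + [(a + b) * a ** (i - 1) for i in range(1, 40)])
assert all(abs(c[i]) <= a * abs(c[i - 1]) + 1e-12 for i in range(2, 40))

# Long-run variance of the linear process eps_t = sum_i c_i eta_{t-i} with
# var(eta_t) = 1: sigma^2 = (sum_i c_i)^2, here ((1 + b)/(1 - a))^2 = 6.76.
sigma2 = float(c.sum()) ** 2
print(round(sigma2, 3))   # → 6.76
```

The closed form (Σ_i c_i)² for the long-run variance of a linear process follows from σ² = Σ_l γ(l); the truncation at 40 coefficients is harmless here because of the geometric decay.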

The multiscale test
In this section, we introduce our multiscale method to test for local increases and decreases of the trend function m and analyse its theoretical properties. We assume throughout that m is continuously differentiable on [0, 1]. The overall null hypothesis H_0 is that the hypothesis H_0(u, h) holds for all (u, h) ∈ G_T, where G_T is some large set of points (u, h). The details on the set G_T are discussed at the end of Section 3.1. Note that G_T in general depends on the sample size T, implying that the null hypothesis H_0 = H_0,T depends on T as well. We thus consider a sequence of null hypotheses {H_0,T : T = 1, 2, ...} as T increases. For simplicity of notation, however, we suppress the dependence of H_0 on T. In Sections 3.1 and 3.2, we construct the multiscale test of the hypothesis H_0 step by step. The theoretical properties of the test are analysed in Section 3.3.

Construction of the multiscale statistic
We first construct a test statistic for the hypothesis H_0(u, h), where [u − h, u + h] is a given interval. To do so, we consider the kernel average

ψ̂_T(u, h) = Σ_{t=1}^{T} w_{t,T}(u, h) Y_{t,T},

where w_{t,T}(u, h) is a kernel weight and h is the bandwidth. To avoid boundary issues, we work with a local linear weighting scheme. In particular, we set

w_{t,T}(u, h) = Λ_{t,T}(u, h) / {Σ_{t=1}^{T} Λ_{t,T}(u, h)²}^{1/2}    (3.1)

with Λ_{t,T}(u, h) = K((t/T − u)/h) [S_{T,0}(u, h) (t/T − u)/h − S_{T,1}(u, h)], where S_{T,l}(u, h) = (Th)^{−1} Σ_{t=1}^{T} K((t/T − u)/h) {(t/T − u)/h}^l for l = 0, 1, 2 and K is a kernel function with the following properties.
Condition 4. The kernel K is non-negative, symmetric about zero and integrates to 1. Moreover, it has compact support [−1, 1] and is Lipschitz continuous, i.e. |K.v/ − K.w/| C|v − w| for any v, w ∈ R and some constant C > 0.
The kernel average ψ̂_T(u, h) is nothing other than a rescaled local linear estimator of the derivative m′(u) with bandwidth h. Alternatively to the local linear weights defined in equation (3.1), we could work with weights w_{t,T}(u, h) proportional to K′((t/T − u)/h), where the kernel function K is assumed to be differentiable and K′ is its derivative. However, we prefer to use local linear weights as these have superior theoretical properties at the boundary.
A test statistic for the hypothesis H_0(u, h) is given by the normalized kernel average ψ̂_T(u, h)/σ̂, where σ̂² is an estimator of the long-run variance σ² = Σ_{l=−∞}^{∞} cov(ε_0, ε_l) of the error process {ε_t}. The problem of estimating σ² is discussed in detail in Section 4. For the time being, we suppose that σ̂² is an estimator with reasonable theoretical properties. Specifically, we assume that σ̂² = σ² + o_p(ρ_T) with ρ_T = o{1/log(T)}. This is a fairly weak condition which is in particular satisfied by the estimator of σ² analysed in Section 4. The kernel weights w_{t,T}(u, h) are chosen such that, in the case of independent errors ε_t, var{ψ̂_T(u, h)} = σ² for any location u and bandwidth h, where the long-run error variance σ² simplifies to σ² = var(ε_t). In the more general case that the error terms satisfy the weak dependence conditions from Section 2, var{ψ̂_T(u, h)} = σ² + o(1) for any u and h under consideration. Hence, for sufficiently large sample sizes T, the test statistic ψ̂_T(u, h)/σ̂ has approximately unit variance.
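The construction of ψ̂_T(u, h) can be sketched in code. The Epanechnikov kernel and the exact form of the local linear slope weights below are illustrative choices (the paper's normalisation may be presented differently), but the two defining properties used in the text are enforced explicitly: the weights sum to zero, so ψ̂_T(u, h) vanishes when m is constant on the window, and their squared norm is 1, so the statistic has unit variance under independent standard errors:

```python
import numpy as np

def epanechnikov(v):
    # Kernel satisfying condition 4: non-negative, symmetric about zero,
    # support [-1, 1], Lipschitz continuous and integrating to 1.
    return np.where(np.abs(v) <= 1, 0.75 * (1 - v ** 2), 0.0)

def local_linear_weights(T, u, h):
    # Illustrative local linear slope weights on the window [u - h, u + h]:
    # they sum to zero and are scaled so that sum(w^2) = 1.
    x = (np.arange(1, T + 1) / T - u) / h
    k = epanechnikov(x)
    lam = k * (np.mean(k) * x - np.mean(k * x))
    return lam / np.sqrt(np.sum(lam ** 2))

T, u, h = 500, 0.5, 0.1
w = local_linear_weights(T, u, h)
assert abs(w.sum()) < 1e-8 and abs((w ** 2).sum() - 1.0) < 1e-8

y_flat = np.full(T, 2.0)                      # constant trend: H_0(u, h) holds
y_up = 2.0 + 3.0 * np.arange(1, T + 1) / T    # strictly increasing trend
psi_flat = float(w @ y_flat)                  # kernel average psi_hat_T(u, h)
psi_up = float(w @ y_up)
print(abs(psi_flat) < 1e-8, psi_up > 0)       # zero under H_0, positive slope
```

The constant trend yields a kernel average of (numerically) zero, while the increasing trend produces a strictly positive statistic, which is what the test exploits.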
We now combine the test statistics ψ̂_T(u, h)/σ̂ for a wide range of locations u and bandwidths or scales h. There are different ways to do so, leading to different types of multiscale statistics. Our multiscale statistic is defined as

Ψ̂_T = max_{(u,h)∈G_T} { |ψ̂_T(u, h)/σ̂| − λ(h) },

where λ(h) = √[2 log{1/(2h)}] and G_T is the set of points (u, h) that are taken into consideration. The details on the set G_T are given below. As can be seen, the statistic Ψ̂_T does not simply aggregate the individual statistics ψ̂_T(u, h)/σ̂ by taking the supremum over all points (u, h) ∈ G_T as in more traditional multiscale approaches. We rather calibrate the statistics ψ̂_T(u, h)/σ̂ that correspond to the bandwidth h by subtracting the additive correction term λ(h). This approach was pioneered by Dümbgen and Spokoiny (2001) and has been used in numerous other studies since then; see for example Dümbgen (2002), Rohde (2008), Dümbgen and Walther (2008), Rufibach and Walther (2010), Schmidt-Hieber et al. (2013) and Eckle et al. (2017).
To see the heuristic idea behind the additive correction λ(h), consider for a moment the uncorrected statistic Ψ̂_{T,uncorrected} = max_{(u,h)∈G_T} |ψ̂_T(u, h)/σ̂| and suppose that the hypothesis H_0(u, h) is true for all (u, h) ∈ G_T. For simplicity, assume that the errors ε_t are IID normally distributed and neglect the estimation error in σ̂, i.e. set σ̂ = σ. Moreover, suppose that the set G_T consists of only the points (u_k, h_l) = ((2k − 1)h_l, h_l) with k = 1, ..., 1/(2h_l) and l = 1, ..., L. In this case, we can write

Ψ̂_{T,uncorrected} = max_{1≤l≤L} max_{1≤k≤1/(2h_l)} |ψ̂_T(u_k, h_l)/σ|.

Under our simplifying assumptions, the statistics ψ̂_T(u_k, h_l)/σ with k = 1, ..., 1/(2h_l) are independent and standard normal for any given bandwidth h_l. Since the maximum over 1/(2h) independent standard normal random variables is λ(h) + o_p(1) as h → 0, we obtain that max_k ψ̂_T(u_k, h_l)/σ is approximately of size λ(h_l) for small bandwidths h_l. As λ(h) → ∞ for h → 0, this implies that max_k ψ̂_T(u_k, h_l)/σ tends to be much larger for small than for large bandwidths h_l. As a result, the stochastic behaviour of the uncorrected statistic Ψ̂_{T,uncorrected} tends to be dominated by the statistics ψ̂_T(u_k, h_l) corresponding to small bandwidths h_l. The additively corrected statistic Ψ̂_T, in contrast, puts the statistics ψ̂_T(u_k, h_l) corresponding to different bandwidths h_l on a more equal footing, thus counteracting the dominance of small bandwidth values.

The multiscale statistic Ψ̂_T simultaneously takes into account all locations u and bandwidths h with (u, h) ∈ G_T. Throughout the paper, we suppose that G_T is some subset of G_T^full = {(u, h) : u = t/T for some 1 ≤ t ≤ T and h ∈ [h_min, h_max]}, where h_min and h_max denote some minimal and maximal bandwidth value respectively. For our theory to work, we require the following conditions to hold.
Condition 5. |G_T| = O(T^θ) for some arbitrarily large but fixed constant θ > 0, where |G_T| denotes the cardinality of G_T.
Condition 6. h_min ≫ T^{−(1−2/q)} log(T), i.e. h_min/{T^{−(1−2/q)} log(T)} → ∞, with q > 4 defined in condition 2, and h_max < 1/2.

According to condition 5, the number of points (u, h) in G_T should not grow faster than T^θ for some arbitrarily large but fixed θ > 0. This is a fairly weak restriction, as it allows the set G_T to be extremely large compared with the sample size T. For example, we may work with the set G_T = {(u, h) : u = t/T for some 1 ≤ t ≤ T and h ∈ [h_min, h_max] with h = t/T for some 1 ≤ t ≤ T}, which contains more than enough points (u, h) for most practical applications. Condition 6 imposes some restrictions on the minimal and maximal bandwidths h_min and h_max. These restrictions are fairly weak, allowing us to choose the bandwidth window [h_min, h_max] extremely large. The lower bound on h_min depends on the parameter q defined in condition 2, which specifies the number of existing moments of the error terms ε_t. As we can see, we can choose h_min to be of the order T^{−1/2} for any q > 4. Hence, we can let h_min converge to 0 very quickly even if only the first few moments of the error terms ε_t exist. If all moments exist (i.e. q = ∞), h_min may converge to 0 almost as quickly as T^{−1} log(T). Furthermore, the maximal bandwidth h_max is not even required to converge to 0, which implies that we can pick it very large.
Remark 1. The above construction of the multiscale statistic can easily be adapted to hypotheses other than H_0. To do so, we simply need to replace the kernel weights w_{t,T}(u, h) defined in equation (3.1) by appropriate versions which are suited to testing the hypothesis of interest. For example, if we want to test for local convexity or concavity of m, we may define the kernel weights w_{t,T}(u, h) such that the kernel average ψ̂_T(u, h) is a (rescaled) estimator of the second derivative of m at location u with bandwidth h.
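The construction of this section can be summarized in a short sketch: compute ψ̂_T(u, h)/σ̂ on a grid G_T, subtract the correction λ(h) = √[2 log{1/(2h)}], and take the maximum. The kernel, the toy grid, the example trend and the treatment of σ as known are all illustrative assumptions:

```python
import numpy as np

def local_linear_weights(T, u, h):
    # Illustrative local linear slope weights: sum to zero, squared norm 1.
    x = (np.arange(1, T + 1) / T - u) / h
    k = np.where(np.abs(x) <= 1, 0.75 * (1 - x ** 2), 0.0)   # Epanechnikov
    lam = k * (np.mean(k) * x - np.mean(k * x))
    return lam / np.sqrt(np.sum(lam ** 2))

def multiscale_stat(y, grid, sigma_hat):
    # Psi_hat_T = max over (u, h) in G_T of |psi_hat_T(u, h)/sigma_hat| - lambda(h)
    T = len(y)
    return max(
        abs(local_linear_weights(T, u, h) @ y) / sigma_hat
        - np.sqrt(2 * np.log(1 / (2 * h)))
        for (u, h) in grid
    )

T = 800
grid = [(u, h) for h in (0.05, 0.1, 0.2)
        for u in np.arange(h, 1 - h + 1e-12, h / 2)]         # toy G_T

rng = np.random.default_rng(2)
noise = rng.standard_normal(T)
jump_trend = np.where(np.arange(1, T + 1) / T > 0.5, 4.0, 0.0)

stat_null = multiscale_stat(noise, grid, sigma_hat=1.0)      # m constant
stat_alt = multiscale_stat(jump_trend + noise, grid, sigma_hat=1.0)
print(stat_null < stat_alt)   # the strong increase near u = 0.5 is picked up
```

Under the constant trend the statistic stays moderate, while the windows straddling the sharp rise at u = 0.5 produce a very large corrected statistic, so the maximum clearly separates the two cases.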

The test procedure
To formulate a test for the null hypothesis H_0, we still need to specify a critical value. To do so, we define the Gaussian statistic

Φ_T = max_{(u,h)∈G_T} { |φ_T(u, h)/σ| − λ(h) },

where φ_T(u, h) = Σ_{t=1}^{T} w_{t,T}(u, h) σ Z_t and the Z_t are independent standard normal random variables. The statistic Φ_T can be regarded as a Gaussian version of the test statistic Ψ̂_T under the null hypothesis H_0. Let q_T(α) be the (1 − α)-quantile of Φ_T. Importantly, the quantile q_T(α) can be computed by Monte Carlo simulations and can thus be regarded as known. Our multiscale test is now defined as follows: for a given level of significance α ∈ (0, 1), we reject the overall null hypothesis H_0 if Ψ̂_T > q_T(α). In particular, for any (u, h) ∈ G_T, we reject H_0(u, h) if the (corrected) test statistic |ψ̂_T(u, h)/σ̂| − λ(h) lies above the critical value q_T(α), i.e. if |ψ̂_T(u, h)/σ̂| > q_T(α) + λ(h).
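The Monte Carlo computation of q_T(α) can be sketched as follows; the grid, kernel, sample size and number of simulation draws are illustrative choices:

```python
import numpy as np

def local_linear_weights(T, u, h):
    # Illustrative local linear slope weights (sum to zero, squared norm 1).
    x = (np.arange(1, T + 1) / T - u) / h
    k = np.where(np.abs(x) <= 1, 0.75 * (1 - x ** 2), 0.0)
    lam = k * (np.mean(k) * x - np.mean(k * x))
    return lam / np.sqrt(np.sum(lam ** 2))

T, alpha, n_sim = 500, 0.05, 2000
grid = [(u, h) for h in (0.05, 0.1, 0.2)
        for u in np.arange(h, 1 - h + 1e-12, h / 2)]

# Stack the weight vectors: row (u, h) of W holds w_{.,T}(u, h).
W = np.stack([local_linear_weights(T, u, h) for (u, h) in grid])
lam_corr = np.array([np.sqrt(2 * np.log(1 / (2 * h))) for (_, h) in grid])

# phi_T(u, h) = sum_t w_{t,T}(u, h) * sigma * Z_t, so phi_T(u, h)/sigma = W @ Z
# and sigma cancels: the null distribution of Phi_T is parameter free.
rng = np.random.default_rng(3)
Z = rng.standard_normal((n_sim, T))
phi_max = (np.abs(Z @ W.T) - lam_corr).max(axis=1)   # n_sim draws of Phi_T
q_alpha = float(np.quantile(phi_max, 1 - alpha))     # Monte Carlo q_T(alpha)
print(q_alpha > 0)
```

The observed multiscale statistic Ψ̂_T is then compared with q_alpha; note that σ drops out of the simulation, which is why q_T(α) can be treated as known.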

Theoretical properties of the test
To examine the theoretical properties of our multiscale test, we introduce the auxiliary multiscale statistic

Φ̂_T = max_{(u,h)∈G_T} [ |{ψ̂_T(u, h) − E[ψ̂_T(u, h)]}/σ̂| − λ(h) ],

which coincides with Ψ̂_T under the null hypothesis H_0. The following result is central to the theoretical analysis of our multiscale test. According to it, the (known) quantile q_T(α) of the Gaussian statistic Φ_T defined in Section 3.2 can be used as a proxy for the (1 − α)-quantile of the multiscale statistic Φ̂_T.
Theorem 1. Let conditions 1-6 be fulfilled and assume that σ̂² = σ² + o_p(ρ_T) with ρ_T = o{1/log(T)}. Then

P{Φ̂_T ≤ q_T(α)} = (1 − α) + o(1).

A full proof of theorem 1 is given in the on-line supplementary material. Here we briefly outline the proof strategy, which splits up into two main steps. In the first step, we replace the statistic Φ̂_T for each T ≥ 1 by a statistic Φ̃_T with the same distribution as Φ̂_T and the property that

Φ̃_T − Φ_T = o_p(δ_T),    (3.6)

where δ_T = o(1) and the Gaussian statistic Φ_T is defined in Section 3.2. We thus replace the statistic Φ̂_T by an identically distributed version which is close to a Gaussian statistic whose distribution is known. To do so, we make use of strong approximation theory for dependent processes as derived in Berkes et al. (2014). In the second step, we show that

P(Φ̃_T ≤ x) = P(Φ_T ≤ x) + o(1) uniformly over x ∈ ℝ,    (3.7)

which immediately implies the statement of theorem 1. Importantly, the convergence result (3.6) is not sufficient for establishing the result in equation (3.7). Put differently, the fact that Φ̃_T can be approximated by Φ_T in the sense that Φ̃_T − Φ_T = o_p(δ_T) does not imply that the distribution of Φ̃_T is close to that of Φ_T in the sense of equation (3.7). For equation (3.7) to hold, we additionally require that the distribution of Φ_T has some sort of continuity property. Specifically, we prove that

sup_{x∈ℝ} P(|Φ_T − x| ≤ δ_T) = o(1).    (3.8)

Anticoncentration bounds for Gaussian random vectors as derived in Chernozhukov et al. (2015) are the main tool for verifying the result in equation (3.8). The claim (3.7) can be proved by using equation (3.6) together with equation (3.8), which in turn yields theorem 1.
The main idea of our proof strategy is to combine strong approximation theory with anticoncentration bounds for Gaussian random vectors to show that the quantiles of the multiscale statistic Φ̂_T can be proxied by those of a Gaussian analogue. This strategy is quite general in nature and may be applied to other multiscale problems for dependent data. Strong approximation theory has also been used to investigate multiscale tests for independent data; see for example Schmidt-Hieber et al. (2013). However, it has not been combined with anticoncentration results to approximate the quantiles of the multiscale statistic. As an alternative to strong approximation theory, Eckle et al. (2017) and Proksch et al. (2018) have recently used Gaussian approximation results derived in Chernozhukov et al. (2014, 2017) to analyse multiscale tests for independent data. Even though it might be possible to adapt these techniques to the case of dependent data, this is not trivial at all, as part of the technical arguments and the Gaussian approximation tools strongly rely on the assumption of independence.

We now investigate the theoretical properties of our multiscale test with the help of theorem 1. The first result is an immediate consequence of theorem 1. It says that the test has the correct (asymptotic) size.
Proposition 1. Let the conditions of theorem 1 be satisfied. Under the null hypothesis H_0, it holds that

P{Ψ̂_T > q_T(α)} = α + o(1).

The second result characterizes the power of the multiscale test against local alternatives. To formulate it, we consider any sequence of functions m = m_T with the following property: there exists (u, h) ∈ G_T with [u − h, u + h] ⊆ [0, 1] such that

|m_T′(w)| ≥ c_T √{log(T)/(Th³)} for all w ∈ [u − h, u + h],    (3.9)

where {c_T} is any sequence of positive numbers with c_T → ∞. Alternatively to expression (3.9), we may also assume that

Proposition 2. Let the conditions of theorem 1 be satisfied and consider any sequence of functions m_T with the property (3.9). Then

P{Ψ̂_T > q_T(α)} → 1.

According to proposition 2, our test has asymptotic power 1 against local alternatives of the form (3.9). The proof can be found in the on-line supplementary material.
The next result formally shows that we can make simultaneous confidence statements about the time intervals where the trend m is increasing or decreasing. To formulate it, we define

Π_T^± = { I_{u,h} : (u, h) ∈ G_T and |ψ̂_T(u, h)/σ̂| > q_T(α) + λ(h) },
Π_T^+ = { I_{u,h} : (u, h) ∈ G_T and ψ̂_T(u, h)/σ̂ > q_T(α) + λ(h) },
Π_T^− = { I_{u,h} : (u, h) ∈ G_T and ψ̂_T(u, h)/σ̂ < −{q_T(α) + λ(h)} },

where I_{u,h} = [u − h, u + h]. The object Π_T^± can be interpreted as follows: our multiscale test rejects the hypothesis H_0(u, h) for precisely those intervals I_{u,h} that belong to Π_T^±. If, for instance, I_{u,h} ∈ Π_T^+, then our test rejects H_0(u, h) and indicates an increase in the trend m on the interval I_{u,h}, taking into account the positive sign of the statistic ψ̂_T(u, h)/σ̂. Hence, Π_T^+ is the collection of time intervals I_{u,h} for which our test indicates an increase in the trend m. Likewise, Π_T^− is the collection of intervals for which the test indicates a decrease. Note that Π_T^± (as well as Π_T^+ and Π_T^−) is a random collection of intervals: whether our test rejects H_0(u, h) for some (u, h) depends on the realization of the random vector (Y_{1,T}, ..., Y_{T,T}). Hence, whether an interval I_{u,h} belongs to Π_T^± depends on this realization as well. Having defined the objects Π_T^±, Π_T^+ and Π_T^−, we now consider the events

E_T^l = { for each I_{u,h} ∈ Π_T^l, there exists a subinterval J_{u,h} ⊆ I_{u,h} with m being a non-constant (increasing, decreasing) function on J_{u,h} }

for l = ± (+, −). We can make the following formal statement about the events E_T^±, E_T^+ and E_T^−, whose proof is given in the on-line supplement.

Proposition 3. Let the conditions of theorem 1 be fulfilled. Then, for l ∈ {±, +, −}, it holds that P(E_T^l) ≥ 1 − α + o(1).

According to proposition 3, we can make simultaneous confidence statements of the following form: with (asymptotic) probability 1 − α or greater, the trend function m is non-constant (increasing, decreasing) on each interval I_{u,h} ∈ Π_T^± (Π_T^+, Π_T^−). Hence, our multiscale procedure enables us to identify, with prespecified confidence, time intervals where there is an increase or decrease in the trend m.
Remark 2. Our technical arguments enable us to say, with asymptotic confidence 1 − α or greater, that m′(v) ≠ 0 for some v ∈ I_{u,h}. However, we cannot say whether m′(v) > 0 or m′(v) < 0, i.e. we cannot make confidence statements about the sign. Crudely speaking, the problem is that the local linear weights w_{t,T}(u, h) behave quite differently at boundary points (u, h) with I_{u,h} ⊄ [0, 1]. As a consequence, we can include such boundary points in the collection Π_T^± but not in Π_T^+ or Π_T^−.

Remark 3. The statement of proposition 3 suggests that we graphically present the results of our multiscale test by plotting the intervals I_{u,h} ∈ Π_T^l for l ∈ {±, +, −}, i.e. by plotting the intervals where (with asymptotic confidence 1 − α or greater) our test detects a violation of the null hypothesis. The drawback of this graphical presentation is that the number of intervals in Π_T^l is often quite large. To obtain a better graphical summary of the results, we replace Π_T^l by a subset Π_T^{l,min} which is constructed as follows: as in Dümbgen (2002), we call an interval I_{u,h} ∈ Π_T^l minimal if there is no other interval I_{u′,h′} ∈ Π_T^l with I_{u′,h′} ⊂ I_{u,h}. Let Π_T^{l,min} be the set of all minimal intervals in Π_T^l for l ∈ {±, +, −} and define the events E_T^{l,min} analogously to E_T^l with Π_T^{l,min} in place of Π_T^l. It is easily seen that E_T^l = E_T^{l,min} for l ∈ {±, +, −}. Hence, by proposition 3, it holds that P(E_T^{l,min}) ≥ 1 − α + o(1). This suggests that we plot the minimal intervals in Π_T^{l,min} rather than the whole collection of intervals Π_T^l as a graphical summary of the test results. We in particular use this way of presenting the test results in our application in Section 6.
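The minimal-interval summary reduces to a simple set operation; the rejected intervals I_{u,h} = [u − h, u + h] in this sketch are hypothetical:

```python
# An interval I in the collection is minimal if no other interval of the
# collection is strictly contained in it (Duembgen (2002) style summary).

def minimal_intervals(intervals):
    def strictly_contains(outer, inner):
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]
    return [I for I in intervals
            if not any(strictly_contains(I, J) for J in intervals)]

# Hypothetical collection of rejected intervals [u - h, u + h].
rejected = [(0.1, 0.5), (0.2, 0.4), (0.25, 0.35), (0.55, 0.9), (0.6, 0.8)]
print(minimal_intervals(rejected))   # → [(0.25, 0.35), (0.6, 0.8)]
```

The nested intervals (0.1, 0.5), (0.2, 0.4) and (0.55, 0.9) are discarded because each contains a smaller rejected interval, which is exactly the pruning used for the plots in Section 6.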
Proposition 3 enables us to make confidence statements for a fixed level of significance α ∈ .0, 1/. In some situations, we may be interested in letting α = α T ∈ .0, 1/ → 0 as T → ∞. This situation is considered in the following corollary to proposition 3, whose proof can be found in the on-line supplementary material.
Corollary 1. Let the conditions of theorem 1 be fulfilled and let α = α_T ∈ (0, 1) → 0 as T → ∞. Then P(E_T^l) → 1 for l ∈ {±, +, −}.

Corollary 1 can be interpreted as a consistency result: if we let the level of significance α = α_T go to 0, then the event E_T^± (E_T^+, E_T^−) occurs with probability tending to 1, i.e. the trend m is non-constant (increasing, decreasing) on each interval I_{u,h} ∈ Π_T^± (Π_T^+, Π_T^−) with probability tending to 1.

Comparison with SiZer methods
As already mentioned in Section 1, some SiZer methods for dependent data have been introduced in Park et al. (2004) and Rondonotti et al. (2007); we refer to them as dependent SiZer for short. Informally speaking, both our approach and dependent SiZer are methods to test for local increases and decreases of a non-parametric trend function m. The formal problem is to test the hypothesis H_0(u, h) simultaneously for all (u, h) ∈ G_T, where, in this section, we let G_T = U_T × H_T with U_T being the set of locations and H_T the set of bandwidths or scales. In what follows, we compare our approach with dependent SiZer and point out the most important differences.
Dependent SiZer is based on the statistics ŝ_T(u, h) = m̂′(u, h)/sd{m̂′(u, h)}, where m̂′(u, h) is a local linear kernel estimator of the derivative m′(u) with bandwidth h and sd{m̂′(u, h)} is an estimator of its standard deviation. The statistic ŝ_T(u, h) parallels the statistic ψ̂_T(u, h)/σ̂ in our approach. In particular, both can be regarded as test statistics for the hypothesis H_0(u, h). There are two versions of dependent SiZer, as follows.
(a) The global version aggregates the individual statistics ŝ_T(u, h) into the overall statistic Ŝ_T = max_{(u,h)∈U_T×H_T} |ŝ_T(u, h)| and carries out a test based on Ŝ_T.
(b) The rowwise version considers each scale h ∈ H_T separately. In particular, for each bandwidth h ∈ H_T, a test is carried out based on the statistic Ŝ_T(h) = max_{u∈U_T} |ŝ_T(u, h)|. A rowwise analogue of our approach would be obtained by carrying out a test for each scale h ∈ H_T separately based on the statistic Ψ̂_T(h) = max_{u∈U_T} |ψ̂_T(u, h)/σ̂|. Note that we can drop the correction term λ(h) in this case, as it is a fixed constant when only a single bandwidth h is taken into account.
In practice, SiZer is commonly implemented in its rowwise form. The main reason is that it has more power than the global version by construction. However, this gain in power comes at a cost: rowwise SiZer carries out a test separately for each scale h ∈ H_T, thus ignoring the simultaneous nature of the test problem across scales h. Hence, it is not a rigorous level α test of the null hypothesis H_0. For this reason, we focus on global SiZer in the rest of this section. Even though related, our methods and theory are markedly different from those of the SiZer approach. The main differences are as follows.
(a) Theory for SiZer is derived under the assumption that H_T ⊆ H for all T, where H is a compact subset of (0, ∞). As already pointed out on page 420 of Chaudhuri and Marron (2000), this is quite a severe restriction: only bandwidths h are taken into account that remain bounded away from 0 as the sample size T increases. Bandwidths h that converge to 0 are excluded. Our theory, in contrast, enables us to consider simultaneously bandwidths h of fixed size and bandwidths h that converge to 0 at different rates. To achieve this, we come up with a proof strategy which is very different from that in the SiZer literature: as proven in Chaudhuri and Marron (2000) for the IID data case and in Park et al. (2009) for the dependent data case, Ŝ_T weakly converges to some limit S under the overall null hypothesis H_0. This is the central technical result on which the theoretical properties of SiZer are based. In contrast, our proof strategy (which combines strong approximation theory with anticoncentration bounds as outlined in Section 3.3) does not even require the statistic Ψ̂_T to have a weak limit and is thus not restricted by the limitations of classic weak convergence theory.
(b) There are different ways to combine the test statistics Ŝ_T(h) = max_{u∈U_T} |ŝ_T(u, h)| for different scales h ∈ H_T. One way is to take their maximum, which leads to the SiZer statistic Ŝ_T = max_{h∈H_T} Ŝ_T(h). We could proceed analogously and consider the statistic Ψ̂_{T,uncorrected} = max_{h∈H_T} Ψ̂_T(h) = max_{(u,h)∈U_T×H_T} |ψ̂_T(u, h)/σ̂|. However, as argued in Dümbgen and Spokoiny (2001) and as discussed in Section 3.1, this aggregation scheme is not optimal when the set H_T contains scales h of many different rates. Following the lead of Dümbgen and Spokoiny (2001), we instead consider the additively corrected statistic Ψ̂_T = max_{(u,h)∈G_T} { |ψ̂_T(u, h)/σ̂| − λ(h) }. Hence, even though related, our multiscale test statistic Ψ̂_T differs from the SiZer statistic Ŝ_T in important ways.
(c) The main complication in carrying out both our multiscale test and SiZer is to determine the critical values, i.e. the quantiles of the test statistics Ψ̂_T and Ŝ_T under H_0. To approximate the quantiles, we proceed quite differently from the SiZer literature. The quantiles of the SiZer statistic Ŝ_T can be approximated by those of the weak limit S. Usually, however, the quantiles of S cannot be determined analytically but must be approximated themselves (e.g. by the bootstrap procedures of Chaudhuri and Marron (1999, 2000)). Alternatively, the quantiles of Ŝ_T can be approximated by procedures based on extreme value theory (as proposed in Hannig and Marron (2006) and Park et al. (2009)). In our approach, the quantiles of Ψ̂_T under H_0 are approximated by those of a suitably constructed Gaussian analogue of Ψ̂_T. It is far from obvious that this Gaussian approximation is valid when the data are dependent. To see this, deep strong approximation theory for dependent data (as derived in Berkes et al. (2014)) is needed. It is important to note that our Gaussian approximation procedure is not the same as the bootstrap procedures proposed in Chaudhuri and Marron (1999, 2000). Both procedures can of course be regarded as resampling methods. However, the resampling is done in quite a different way in our case.

Estimation of the long-run error variance
In this section, we discuss how to estimate the long-run variance σ² = Σ_{l=−∞}^{∞} cov(ε_0, ε_l) of the error terms in model (2.1). There are two broad classes of estimators: residual- and difference-based estimators. In residual-based approaches, σ² is estimated from the residuals ε̂_t = Y_{t,T} − m̂_h(t/T), where m̂_h is a non-parametric estimator of m with the bandwidth or smoothing parameter h. Difference-based methods proceed by estimating σ² from the lth differences Y_{t,T} − Y_{t−l,T} of the observed time series {Y_{t,T}} for certain orders l. In what follows, we focus attention on difference-based methods, as these do not involve a non-parametric estimator of the function m and thus do not require us to specify a bandwidth h for the estimation of m.
So far, we have assumed that {ε_t} is a general stationary error process which fulfils the weak dependence condition 3. Estimating the long-run error variance σ² in model (2.1) under general weak dependence conditions is a notoriously difficult problem. Estimators of σ² often tend to be quite imprecise. To circumvent this issue in practice, it may be beneficial to impose a time series model on the error process {ε_t}. Estimating σ² under the restrictions of such a model may of course create some misspecification bias. However, as long as the model gives a reasonable approximation to the true error process, the estimates of σ² that are produced can be expected to be fairly reliable even though they are a little biased.
Estimators of the long-run error variance σ 2 in model (2.1) have been developed for different kinds of error models. Various researchers have analysed the case of moving average MA(m) or, more generally, m-dependent error terms. Difference-based estimators of σ 2 for this case were proposed in Müller and Stadtmüller (1988), Herrmann et al. (1992) and Tecuapetla-Gómez and Munk (2017) among others. Presumably the most widely used error model in practice is an AR(p) process. Residual-based methods to estimate σ 2 in model (2.1) with AR(p) errors can be found for example in Truong (1991), Shao and Yang (2011) and Qiu et al. (2013). A difference-based method was proposed in Hall and Van Keilegom (2003).
We consider the class of AR(∞) processes as an error model, which is quite a large and important subclass of linear time series processes. Formally speaking, we let {ε_t} be a process of the form

ε_t = Σ_{j=1}^{∞} a_j ε_{t−j} + η_t,    (4.1)

where a_1, a_2, a_3, … are unknown coefficients and η_t are IID with E[η_t] = 0 and E[η_t²] = ν². We assume that A(z) := 1 − Σ_{j=1}^{∞} a_j z^j ≠ 0 for all complex numbers |z| ≤ 1 + δ with some small δ > 0, which has the following implications: (a) {ε_t} is stationary and causal; (b) the coefficients a_j decay to 0 exponentially fast, i.e. |a_j| ≤ Cξ^j with some C > 0 and ξ ∈ (0, 1); (c) {ε_t} has an MA(∞) representation of the form ε_t = Σ_{k=0}^{∞} c_k η_{t−k}. The coefficients c_k can be computed iteratively from the equations

c_k − Σ_{j=1}^{k} a_j c_{k−j} = b_k    (4.2)

for k = 0, 1, 2, …, where b_0 = 1 and b_k = 0 for k > 0. Moreover, they decay to 0 exponentially fast, i.e. |c_k| ≤ Cξ^k with some C > 0 and ξ ∈ (0, 1). Notably, the error model (4.1) nests AR(p*) processes of any finite order p* as a special case: if a_{p*} ≠ 0 and a_j = 0 for all j > p*, then {ε_t} is an AR process of order p*. In what follows, we let p* ∈ N ∪ {∞} denote the true AR order of {ε_t}, which may be finite or infinite. We can thus rewrite the process (4.1) as

ε_t = Σ_{j=1}^{p*} a_j ε_{t−j} + η_t,    (4.3)

where the AR order p* is treated as unknown.
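As a quick illustration, the recursion (4.2) can be implemented in a few lines. The following is a sketch (the function name `ma_coefficients` is ours); AR coefficients beyond the given order are taken to be zero.

```python
import numpy as np

def ma_coefficients(a, K):
    """First K+1 MA(infinity) coefficients c_0, ..., c_K of the AR process
    eps_t = sum_j a_j * eps_{t-j} + eta_t, via the recursion from (4.2):
    c_0 = 1 and c_k = sum_{j=1}^{k} a_j * c_{k-j} (a_j = 0 beyond len(a))."""
    a = np.asarray(a, dtype=float)
    c = np.zeros(K + 1)
    c[0] = 1.0
    for k in range(1, K + 1):
        jmax = min(k, len(a))
        c[k] = sum(a[j - 1] * c[k - j] for j in range(1, jmax + 1))
    return c

# For an AR(1) process with coefficient a_1, the recursion gives c_k = a_1**k.
c = ma_coefficients([0.5], 5)
```

For AR(1) errors with a_1 = 0.5 this returns the geometric sequence 1, 0.5, 0.25, …, consistent with the exponential decay of the c_k noted above.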
We now construct a difference-based estimator of σ² for the case that {ε_t} is an AR(p*) process of the form (4.3). To do so, we fit AR(p)-type models to {ε_t}, where we distinguish between the following two cases (referred to as case A and case B).
(a) We do not know the precise AR order p* but we know an upper bound on it. In this case, p is a fixed natural number with p ≥ p* (case A). (b) We know neither p* nor an upper bound on it. In this case, we let p = p_T → ∞ as T → ∞, where formal conditions on the growth of p = p_T are specified later (case B).
To simplify the notation, we let Δ_l Z_t = Z_t − Z_{t−l} denote the lth differences of a general time series {Z_t}. Our estimation method relies on the following simple observation: if {ε_t} is an AR(p*) process of the form (4.3), then the time series {Δ_q ε_t} of the differences Δ_q ε_t = ε_t − ε_{t−q} is an ARMA(p*, q) process of the form

Δ_q ε_t = Σ_{j=1}^{p*} a_j Δ_q ε_{t−j} + η_t − η_{t−q}.    (4.4)

As m is Lipschitz, the differences Δ_q ε_t of the unobserved error process are close to the differences Δ_q Y_{t,T} of the observed time series in the sense that

Δ_q Y_{t,T} = Δ_q ε_t + O(q/T).    (4.5)

Taken together, equations (4.4) and (4.5) imply that the differenced time series {Δ_q Y_{t,T}} is approximately an ARMA(p*, q) process of the form (4.4). It is precisely this point which is exploited by our estimation method. We first describe our procedure to estimate the AR parameters a_j. For any q ≥ 1, the ARMA(p*, q) process {Δ_q ε_t} satisfies the Yule–Walker-type equations

γ_q(l) = Σ_{j=1}^{p*} a_j γ_q(l − j) − ν² c_{q−l}    for l ≥ 1,    (4.6)

where γ_q(l) = cov(Δ_q ε_t, Δ_q ε_{t−l}) and c_k are the coefficients from the MA(∞) expansion of {ε_t}.
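A stylised numerical version of this pilot step might look as follows. This is a sketch under our naming (`pilot_ar_estimator`): it solves the Yule–Walker system from the sample autocovariances of the qth differences while ignoring the ν²c_q correction term, as the pilot estimator ã_q in equation (4.8) does.

```python
import numpy as np

def pilot_ar_estimator(y, p, q):
    """Sketch of the pilot step: take q-th differences of the observed series,
    compute their sample autocovariances gamma_q(l), and solve the
    Yule-Walker system Gamma_q a = gamma_q, dropping the nu^2 * c_q term
    (which is approximately zero for large q)."""
    d = y[q:] - y[:-q]                      # q-th differences Delta_q Y_t
    n = len(d)
    gamma = np.array([np.dot(d[l:], d[:n - l]) / n for l in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, gamma[1:p + 1])

# demo: AR(1) errors with a_1 = 0.5 (no trend, purely for illustration)
rng = np.random.default_rng(1)
eta = rng.standard_normal(50500)
eps = np.zeros(50500)
for t in range(1, 50500):
    eps[t] = 0.5 * eps[t - 1] + eta[t]
a_tilde = pilot_ar_estimator(eps[500:], p=1, q=25)
```

With a long simulated sample the estimate lands close to the true coefficient, illustrating that dropping the ν²c_q term is harmless for large q.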
The estimator ã_q depends on the tuning parameter q, i.e. on the order of the differences Δ_q Y_{t,T}. An appropriate choice of q needs to take care of the following two points.
(a) q should be chosen sufficiently large to ensure that the vector c_q = (c_{q−1}, …, c_{q−p})^T is close to zero. As we have already seen, the constants c_k decay to 0 exponentially fast and can be computed from the recursive equations (4.2) for given parameters a_1, a_2, a_3, … In the special case of an AR(1) process, for example, one can readily calculate that c_k ≤ 0.0035 for any k ≥ 20 and any |a_1| ≤ 0.75. Hence, if we have an AR(1) model for the errors ε_t and the error process is not too persistent, choosing q ≥ 20 should make sure that c_q is close to 0. Generally speaking, the recursive equations (4.2) can be used to obtain some idea of the values of q for which the vector c_q can be expected to be approximately 0. (b) q should not be chosen too large, to ensure that the trend m is appropriately eliminated by taking qth differences.
As long as the trend m is not very strong, the two requirements (a) and (b) can be fulfilled without much difficulty. For example, by choosing q = 20 in the AR(1) case just discussed, we not only take care of point (a) but also make sure that moderate trends m are differenced out appropriately.
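The AR(1) bound quoted in (a) is easy to verify numerically, since c_k = a_1^k for an AR(1) process, so the worst case over |a_1| ≤ 0.75 and k ≥ 20 is attained at the boundary:

```python
# For an AR(1) process, c_k = a_1**k, so the largest |c_k| with k >= 20
# over all |a_1| <= 0.75 is attained at k = 20 and |a_1| = 0.75.
worst_case = 0.75 ** 20
print(worst_case < 0.0035)  # True: about 0.00317
```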
When the trend m is very pronounced, in contrast, even moderate values of q may be too large to eliminate the trend appropriately. As a result, the estimator ã_q will have a strong bias. To reduce this bias, we refine our estimation procedure as follows: by solving the recursive equations (4.2) with a replaced by ã_q, we can compute estimators ĉ_k of the coefficients c_k and thus estimators ĉ_r of the vectors c_r for any r ≥ 1. Moreover, the innovation variance ν² can be estimated by ν̂² = (2T)^{−1} Σ_{t=p+2}^{T} r̂²_{t,T}, where r̂_{t,T} = Δ_1 Y_{t,T} − Σ_{j=1}^{p} ã_j Δ_1 Y_{t−j,T} and ã_j is the jth entry of the vector ã_q. Plugging the expressions Γ̂_r, γ̂_r, ĉ_r and ν̂² into equation (4.7), we can estimate a by

â_r = Γ̂_r^{−1}(γ̂_r + ν̂² ĉ_r),    (4.9)

where r is a much smaller differencing order than q. Specifically, in case A, we can choose r to be any fixed number r ≥ 1. Unlike q, the parameter r thus remains bounded as T increases. In case B, our theory enables us to choose any number r with r ≤ (1 + δ)p for some small δ > 0.
Since q/p → ∞, it holds that q/r → ∞ as well, which means that r is of smaller order than q.
Hence, in both case A and case B, the estimator â_r is based on a differencing order r that is much smaller than q; only the pilot estimator ã_q relies on differences of the larger order q. As a consequence, â_r should eliminate the trend m more appropriately and should thus be less biased than the pilot estimator ã_q. To make the method more robust against estimation errors in ĉ_r, we finally average the estimators â_r for a few values of r. In particular, we define

â = (r̄ − r̲ + 1)^{−1} Σ_{r=r̲}^{r̄} â_r,    (4.10)

where r̲ and r̄ are chosen as follows: in case A, we let r̲ and r̄ be small natural numbers. In case B, we set r̲ = (1 − δ)p for some small δ > 0 and choose r̄ such that r̄ − r̲ remains bounded. For ease of notation, we suppress the dependence of â on the parameters r̲ and r̄. Once â = (â_1, …, â_p)^T has been computed, the long-run variance σ² can be estimated by

σ̂² = ν̂²/(1 − Σ_{j=1}^{p} â_j)²,    (4.11)

where ν̂² is the estimator of the innovation variance ν² introduced above and we make use of the fact that σ² = ν²/(1 − Σ_{j=1}^{p*} a_j)² for the AR(p*) process {ε_t}.
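In code, the plug-in step (4.11) is a one-liner; a minimal sketch under our naming, assuming AR coefficient and innovation variance estimates are already available:

```python
import numpy as np

def long_run_variance(a_hat, nu2_hat):
    """Plug-in estimator from equation (4.11):
    sigma^2_hat = nu^2_hat / (1 - sum_j a_hat_j)^2."""
    return nu2_hat / (1.0 - np.sum(a_hat)) ** 2

# AR(1) with a_1 = 0.5 and nu^2 = 1: sigma^2 = 1 / (1 - 0.5)^2 = 4
sigma2_hat = long_run_variance([0.5], 1.0)
```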
We briefly compare the estimator â with competing methods. Presumably closest to our approach is that of Hall and Van Keilegom (2003), which was designed for AR(p*) processes of known finite order p*. To compare the two methods, we thus assume that p* is known and set p = p*. The two main advantages of our method are as follows.
(a) Our estimator produces accurate estimation results even when the AR process {ε_t} is quite persistent, i.e. even when the AR polynomial A(z) = 1 − Σ_{j=1}^{p*} a_j z^j has a root that is close to the unit circle. The estimator of Hall and Van Keilegom (2003), in contrast, may have very high variance and may thus produce unreliable results when the AR polynomial A(z) is close to having a unit root. This difference in behaviour can be explained as follows: our pilot estimator ã_q = (ã_1, …, ã_{p*})^T has the property that the estimated AR polynomial Ã(z) = 1 − Σ_{j=1}^{p*} ã_j z^j has no root inside the closed unit disc, i.e. Ã(z) ≠ 0 for all complex numbers z with |z| ≤ 1. (More precisely, Ã(z) ≠ 0 for all z with |z| ≤ 1 whenever the covariance matrix (γ̂_q(i − j) : 1 ≤ i, j ≤ p* + 1) is non-singular. Moreover, (γ̂_q(i − j) : 1 ≤ i, j ≤ p* + 1) is non-singular whenever γ̂_q(0) > 0, which is the generic case.) Hence, the fitted AR model with the coefficients ã_q is guaranteed to be stationary and causal. Even though this may seem to be a minor technical detail, it has a huge effect on the performance of the estimator ã_q: it keeps the estimator stable even when the AR process is very persistent and the AR polynomial A(z) almost has a unit root. This in turn results in reliable behaviour of the estimator â in the case of high persistence. The estimator of Hall and Van Keilegom (2003), in contrast, may produce non-causal results when the AR polynomial A(z) is close to having a unit root. As a consequence, it may have unnecessarily high variance in the case of high persistence. We illustrate this difference between the estimators by the simulation exercises in Section 5.2. A striking example is Fig. 6 there, which presents the simulation results for the case of an AR(1) process ε_t = a_1 ε_{t−1} + η_t with a_1 = −0.95 and clearly shows the much better performance of our method.
(b) Both our pilot estimator ã_q and the estimator of Hall and Van Keilegom (2003) tend to have a substantial bias when the trend m is pronounced. Our estimator â reduces this bias considerably, as demonstrated in the simulations of Section 5.2. Unlike the estimator of Hall and Van Keilegom (2003), it thus produces accurate results even in the presence of a very strong trend.
We close this section by deriving some basic asymptotic properties of the estimators ã_q, â and σ̂². To formulate the following result, we use the shorthand v_T ≪ w_T, which means that v_T/w_T → 0 as T → ∞.
Proposition 4. Let m be Lipschitz continuous and suppose that {ε_t} is an AR(p*) process of the form (4.3) with the following properties: A(z) ≠ 0 for all |z| ≤ 1 + δ with some small δ > 0, and the innovations η_t have a finite fourth moment. Assume that p, q, r̲ and r̄ satisfy the following conditions: in case A, p, r̲ and r̄ are fixed natural numbers and log(T) ≪ q ≪ √T; in case B, C log(T) ≤ p ≪ min{T^{1/5}, q} for some sufficiently large C, q ≪ √T, r̄ ≤ (1 + δ)p for some small δ > 0 and r̄ − r̲ remains bounded. Under these conditions, we obtain the convergence rates of ã_q, â and σ̂². The proof is provided in the on-line supplementary material. As we can see, the convergence rate of the second-step estimator â is somewhat slower than that of the pilot estimator ã_q. Hence, from an asymptotic perspective, there is no gain from using the second-step estimator. Nevertheless, in finite samples, the estimator â vastly outperforms ã_q, since it considerably reduces the bias of the latter.

Small sample properties of the multiscale test
In what follows, we investigate the performance of our multiscale test and compare it with the dependent SiZer methods from Park et al. (2004, 2009) and Rondonotti et al. (2007). We consider four tests: our multiscale test T_MS, its uncorrected version T_UC, its rowwise version T_RW and the dependent SiZer test T_SiZer of Park et al. (2004, 2009) and Rondonotti et al. (2007). We do not consider a global version of dependent SiZer, as such a version was not fully developed in Park et al. (2004, 2009) and Rondonotti et al. (2007).
The simulation set-up is as follows: we generate data from the model Y_{t,T} = m(t/T) + ε_t for different trends m, error processes {ε_t} and sample sizes T. The error terms have the AR(1) structure ε_t = a_1 ε_{t−1} + η_t, where a_1 ∈ {−0.9, −0.5, −0.25, 0.25, 0.5, 0.9}, the η_t are IID standard normal and the AR order p* = 1 is treated as known. To simulate data under the null, we let m be a constant function; in particular, we set m = 0 without loss of generality. To generate data under the alternative, we consider different non-constant trend functions which are specified below. For each model specification, we simulate S = 1000 data samples and carry out the tests T_MS, T_UC, T_RW and T_SiZer for each simulated sample.
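A minimal sketch of this data-generating process (function name and burn-in length are our choices; the burn-in makes the AR(1) errors approximately stationary):

```python
import numpy as np

def simulate_sample(m, a1, T, burn_in=500, rng=None):
    """One draw from Y_{t,T} = m(t/T) + eps_t with AR(1) errors
    eps_t = a1 * eps_{t-1} + eta_t and eta_t iid N(0, 1)."""
    rng = np.random.default_rng(rng)
    eta = rng.standard_normal(T + burn_in)
    eps = np.zeros(T + burn_in)
    for t in range(1, T + burn_in):
        eps[t] = a1 * eps[t - 1] + eta[t]
    t_grid = np.arange(1, T + 1) / T
    return m(t_grid) + eps[burn_in:]

# null model: constant trend m = 0
y = simulate_sample(lambda u: np.zeros_like(u), a1=0.5, T=500, rng=0)
```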
To implement our multiscale test T_MS, we choose K to be an Epanechnikov kernel and let U_T = {u ∈ [0, 1] : u = 5l/T for some l ∈ N} and H_T = {h ∈ [log(T)/T, 1/4] : h = 5l/T for some l ∈ N}. We thus take into account all locations u on an equidistant grid U_T with step length 5/T and all bandwidths h = 5/T, 10/T, 15/T, … with log(T)/T ≤ h ≤ 1/4. Note that the lower bound log(T)/T is motivated by condition 6, which requires that log(T)/T ≪ h_min (given that all moments of ε_t exist). As a robustness check, we have rerun the simulations for other grids; as the results are very similar, we do not report them here. To estimate the long-run error variance σ², we apply the procedure from Section 4 with r̲ = 1 and r̄ = 10 and the following choices of q: for a_1 ∈ {−0.5, −0.25, 0.25, 0.5}, we set q = 25. As already discussed in Section 4, this should be an appropriate choice for AR(1) errors that are not too strongly correlated, in particular for a_1 ∈ {−0.5, −0.25, 0.25, 0.5}. When the errors are very strongly correlated, larger values of q are required to produce precise estimates of σ². In the case of AR(1) errors with a_1 ∈ {−0.9, 0.9}, we thus set q = 50. The dependence of our long-run variance estimator on the tuning parameters q, r̲ and r̄ is explored more systematically in Section 5.2. To compute the critical values of the multiscale test T_MS, we simulate 5000 values of the statistic Φ_T defined in Section 3.2 and compute their empirical (1 − α)-quantile q_T(α). The uncorrected and rowwise versions T_UC and T_RW of our multiscale test are implemented analogously. The SiZer test is implemented as described in Park et al. (2009); the details are summarized in section S.3 of the on-line supplementary material.
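The location-scale grid just described is straightforward to construct; a sketch (our function name):

```python
import numpy as np

def location_scale_grid(T):
    """Grid used in the simulations: locations u = 5l/T in [0, 1] and
    bandwidths h = 5l/T with log(T)/T <= h <= 1/4."""
    step = 5 / T
    U = np.arange(step, 1 + 1e-12, step)
    H = np.arange(step, 0.25 + 1e-12, step)
    H = H[H >= np.log(T) / T]
    return U, H

U, H = location_scale_grid(1000)
```

For T = 1000, the smallest admissible bandwidth is 10/T = 0.01 (since 5/T = 0.005 lies below log(1000)/1000 ≈ 0.0069) and the largest is 1/4.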

Size simulations
The first part of our simulation study investigates the size properties of the four tests T_MS, T_UC, T_RW and T_SiZer under the null that the trend m is constant. To start with, we focus on the multiscale test T_MS. Table 1 reports the actual size of T_MS for the AR parameters a_1 ∈ {−0.5, −0.25, 0.25, 0.5}, which is computed as the number of simulations in which T_MS rejects the null divided by the total number of simulations. As can be seen, the actual size of the multiscale test T_MS is fairly close to the nominal target α for all the AR parameters and sample sizes considered. Hence, the test has approximately the correct size.
In Table 1, we have explored the size of T_MS when the errors are moderately auto-correlated. The case of strongly auto-correlated errors is investigated in Table 2, where we consider AR(1) errors with a_1 ∈ {−0.9, 0.9}. We first discuss the results for the positive AR parameter a_1 = 0.9. As can be seen, the size numbers are substantially downwardly biased for small sample sizes, in particular for T = 250 and T = 500. As the sample size increases, this downward bias diminishes and the size numbers stabilize around their target α. In particular, for T ≥ 1000, the size numbers give a decent approximation to α. An analogous picture arises for the negative AR parameter a_1 = −0.9. The size numbers, however, are upwardly rather than downwardly biased for small sample sizes T, and the size distortions appear to vanish a little more slowly as T increases. To summarize, in the case of strongly auto-correlated errors, our multiscale test has good size properties only for sufficiently large sample sizes. This is not very surprising: statistical inference in the presence of strongly auto-correlated data is a very difficult problem in general, and satisfactory results can only be expected for fairly large sample sizes.
We next compare our multiscale test T_MS with T_UC, T_RW and T_SiZer in terms of size. There is an important difference between T_MS and T_UC on the one hand and T_RW and T_SiZer on the other. T_MS and its uncorrected version T_UC are global test procedures: they test H_0(u, h) simultaneously for all locations u ∈ U_T and scales h ∈ H_T. Hence, they control the size simultaneously over both locations u and scales h. The methods T_RW and T_SiZer, in contrast, are rowwise (or scalewise) in nature: they test the hypothesis H_0(u, h) simultaneously for all u ∈ U_T but separately for each scale h ∈ H_T. Hence, they control the size for each scale h ∈ H_T separately. We conduct some simulation exercises to illustrate this important distinction. To keep the simulation study to a reasonable length, we restrict attention to the significance level α = 0.05 and the AR parameters a_1 ∈ {−0.5, 0.5}. To simplify the implementation of T_SiZer, we assume that the autocovariance function of the error process, and thus the long-run error variance σ², are known. To keep the comparison fair, we treat σ² as known also when implementing T_MS, T_UC and T_RW. Moreover, we use exactly the same location-scale grid for all four methods. To achieve this, we start off with the grid G_T = U_T × H_T with U_T and H_T defined above. We then follow Rondonotti et al. (2007) and restrict attention to those points (u, h) ∈ G_T for which the effective sample size ESS*(u, h) for correlated data is not smaller than 5. This yields the grid G*_T = {(u, h) ∈ G_T : ESS*(u, h) ≥ 5}. A definition of ESS*(u, h) is given in section S.3 of the on-line supplement.
For our simulation exercises, we distinguish between global and rowwise (or scalewise) size: global size is defined as the percentage of simulations in which the test under consideration rejects H_0(u, h) for some (u, h) ∈ G*_T. Hence, it is identical to the size as computed in Tables 1 and 2. Rowwise size for scale h* ∈ H_T, in contrast, is the percentage of simulations in which the test rejects H_0(u, h*) for some (u, h*) ∈ G*_T. Table 3 reports the global size of the four tests. As can be seen, the size numbers of our multiscale test T_MS and its uncorrected version T_UC are reasonably close to the target α = 0.05. The global size numbers of the rowwise methods T_RW and T_SiZer, in contrast, are much larger than the target α = 0.05. Since the number of scales h in the grid G*_T increases with T, they even move away from α as the sample size T increases. To summarize, as expected, the global tests T_MS and T_UC hold the size reasonably well, whereas the rowwise methods T_RW and T_SiZer are much too liberal. Fig. 2 reports the rowwise size of the four tests by so-called parallel co-ordinate plots (Inselberg, 1985) for the sample size T = 1000. Each curve in Fig. 2 specifies the rowwise size of one of the tests for the scales h under consideration. As can be seen, the rowwise version T_RW of our multiscale test holds the size quite accurately across scales. The rowwise size of T_SiZer also gives an acceptable approximation to the target α = 5%, even though the size numbers are upwardly biased quite considerably. The global tests T_MS and T_UC, in contrast, have a rowwise size that is much smaller than the target α = 5%, which reflects the fact that they control global rather than rowwise size.

Power comparisons
In the second part of our simulation study, we compare the tests T_MS, T_UC, T_RW and T_SiZer in terms of power. As above, we use the location-scale grid G*_T and treat the autocovariance function of the error terms as known when implementing the tests. Moreover, we restrict attention to the significance level α = 0.05 and the AR parameters a_1 ∈ {−0.5, 0.5}. Our simulation exercises investigate the ability of the four tests to detect local increases in the trend m. (The same could of course be done for local decreases.) The tests indicate a local increase in m according to the following decision rules: for each (u, h) ∈ G*_T, T_MS indicates an increase on I_{u,h} = [u − h, u + h] if and only if ψ̂_T(u, h)/σ > q_T(α) + λ(h); the tests T_UC, T_RW and T_SiZer indicate an increase on I_{u,h} if ψ̂_T(u, h)/σ exceeds q_T^{UC}(α), q_T^{RW}(α, h) and q_T^{SiZer}(α, h) respectively, where q_T^{UC}(α), q_T^{RW}(α, h) and q_T^{SiZer}(α, h) are the critical values of T_UC, T_RW and T_SiZer. Note that the critical values of T_RW and T_SiZer depend on the scale h, as these are rowwise procedures. To be able to make systematic power comparisons, we consider a very simple trend function m; more complicated signals m are analysed in section S.3 of the on-line supplementary material. The trend function that we consider here is defined as m(u) = c · 1(u ∈ [0.45, 0.55]) [1 − {(u − 0.5)/0.05}²]², where c = 0.45 in the AR case with a_1 = −0.5 and c = 1.3 in the case with a_1 = 0.5. The function m is increasing on I_+ = (0.45, 0.5), decreasing on I_− = (0.5, 0.55) and constant elsewhere. Figs 3(a) and 3(d) give a graphical illustration of m, where the grey line in the background is the time series path of a representative simulated data sample. As can be seen, m is a small bump around u = 0.5, where c determines the height of the bump. The constant c is chosen such that the bump is difficult but not impossible to detect for the four tests. Restricting attention to increases in m, we distinguish between the following types of power for the tests T_j with j ∈ {MS, UC, RW, SiZer}: (a) global power: the percentage of simulation runs in which the test T_j indicates an increase on some interval where m is indeed increasing, i.e. on some I_{u,h} with I_{u,h} ∩ I_+ ≠ ∅; (b) spurious global power: the percentage of simulation runs in which T_j indicates an increase on some interval where m is not increasing, i.e. on some I_{u,h} with I_{u,h} ∩ I_+ = ∅; (c) rowwise power on scale h*: the percentage of simulation runs in which T_j indicates an increase on some I_{u,h*} with I_{u,h*} ∩ I_+ ≠ ∅; (d) spurious rowwise power on scale h*: the percentage of simulation runs in which T_j indicates an increase on some interval where m is not increasing, i.e. on some I_{u,h*} with I_{u,h*} ∩ I_+ = ∅.
Table 4 reports the global power and global spurious power of the four tests. As can be seen, our multiscale test T_MS has higher power than the uncorrected version T_UC. This confirms the theoretical optimality theory in Dümbgen and Spokoiny (2001) (see also Dümbgen and Walther (2008) and Rufibach and Walther (2010)), according to which the aggregation scheme of T_MS with its additive correction term should yield better power properties than the simpler scheme of T_UC. As expected, the rowwise methods T_RW and T_SiZer have substantially more power than the global tests. Indeed, T_SiZer is even a little more powerful than T_RW, which is presumably because it is somewhat too liberal in terms of rowwise size, as observed in Fig. 2. The higher power of the rowwise procedures comes at some cost: their spurious global power is much higher than that of the global tests. For the sample size T = 1000 and the AR parameter a_1 = −0.5, for example, T_SiZer spuriously finds an increase in the trend m in more than 28% of the simulations, and T_RW in more than 15%. The multiscale test T_MS (as well as its uncorrected version T_UC), in contrast, controls the probability of finding a spurious increase. In particular, as implied by proposition 3, its spurious global power is below 100α% = 5%. Fig. 3 gives a more detailed picture of the power properties of the four tests for the sample size T = 1000. The parallel co-ordinate plots of Fig. 3 show how power and spurious power are distributed across scales h. Let us first have a look at the rowwise methods. As can be seen, T_SiZer is more powerful than T_RW on all scales under consideration.
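The bump trend m used in these power comparisons is simple to reproduce (a sketch; the name `bump_trend` is ours):

```python
import numpy as np

def bump_trend(u, c):
    """m(u) = c * 1(u in [0.45, 0.55]) * (1 - ((u - 0.5)/0.05)**2)**2."""
    u = np.asarray(u, dtype=float)
    inside = (u >= 0.45) & (u <= 0.55)
    return np.where(inside, c * (1 - ((u - 0.5) / 0.05) ** 2) ** 2, 0.0)

peak = float(bump_trend(0.5, c=1.3))  # height of the bump at its centre
```

The bump smoothly rises from 0 at u = 0.45 to its maximum c at u = 0.5 and falls back to 0 at u = 0.55, matching the intervals I_+ and I_− described above.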
As already mentioned when discussing the global power results, this is presumably because T_SiZer is a little too liberal in terms of rowwise size. Comparing the power curves of the two global methods gives an interesting insight: our multiscale test T_MS has substantially more power than the uncorrected version T_UC on medium and large scales. On small scales, in contrast, it is slightly less powerful than T_UC. This again illustrates the theoretical optimality theory in Dümbgen and Spokoiny (2001), which suggests that, asymptotically, the multiscale test T_MS should be as powerful as T_UC on small scales but more powerful on large scales. This is essentially what we see in Figs 3(b) and 3(e). Of course, T_MS does not have exactly as much power as T_UC on fine scales. However, the loss of power on fine scales is very small compared with the gain in power on larger scales (which is also reflected by the fact that T_MS has more global power than T_UC).
The main findings of our simulation exercises can be summarized as follows: if we are interested in an exploratory data tool for finding local increases or decreases of a trend, the rowwise methods T_RW and T_SiZer both do a good job. However, if we want to conduct rigorous statistical inference simultaneously across locations and scales, we need to opt for a global method. Our simulation exercises have demonstrated that our multiscale test T_MS is a global method which enjoys good size and power properties. In particular, as predicted by the theory, it is a more effective test than the uncorrected version T_UC.

Small sample properties of the long-run variance estimator
In the final part of our simulation study, we analyse the estimators of the AR parameters and of the long-run error variance from Section 4 and compare them with the estimators of Hall and Van Keilegom (2003). We simulate data from the model Y_{t,T} = m(t/T) + ε_t, where {ε_t} is an AR(1) process of the form ε_t = a_1 ε_{t−1} + η_t. We consider the AR parameters a_1 ∈ {−0.95, −0.75, −0.5, −0.25, 0.25, 0.5, 0.75, 0.95} and let η_t be IID standard normal innovation terms. Throughout the simulation study, the AR order p* = 1 is treated as known. We report our findings for the sample size T = 500; the results for other sample sizes are very similar. For simplicity, m is chosen to be a linear function of the form m(u) = βu with slope parameter β. For each value of a_1, we consider two slopes β: one corresponding to a moderate and one to a pronounced trend m. In particular, we let β = s_β √var(ε_t) with s_β ∈ {1, 10}. When s_β = 1, the slope β is equal to the standard deviation √var(ε_t) of the error process, which yields a moderate trend m. When s_β = 10, in contrast, the slope β is 10 times as large as √var(ε_t), which results in quite a pronounced trend m.
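The slope calibration is a small computation, using the standard AR(1) identity var(ε_t) = ν²/(1 − a_1²); a sketch under our naming:

```python
import numpy as np

def slope(a1, s_beta, nu2=1.0):
    """beta = s_beta * sqrt(var(eps_t)), where var(eps_t) = nu^2 / (1 - a1^2)
    for a stationary AR(1) process with innovation variance nu^2."""
    return s_beta * np.sqrt(nu2 / (1.0 - a1 ** 2))

beta_moderate = slope(a1=0.5, s_beta=1)   # slope equal to the sd of the errors
beta_strong = slope(a1=0.5, s_beta=10)    # slope 10 times the sd
```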
For each model specification, we generate S = 1000 data samples and compute the following quantities for each simulated sample: (a) the pilot estimator ã_q from equation (4.8) with the tuning parameter q, the estimator â from equation (4.10) with the tuning parameters (r̲, r̄) and the long-run variance estimator σ̂² from equation (4.11); (b) the estimators of a_1 and σ² from Hall and Van Keilegom (2003), which are denoted by â_HvK and σ̂²_HvK (the estimator â_HvK is computed as described in Section 2.2 of Hall and Van Keilegom (2003) and σ̂²_HvK as defined at the bottom of page 447 in Section 2.3 there; the estimator â_HvK, as well as σ̂²_HvK, depends on two tuning parameters which we denote by m_1 and m_2 as in Hall and Van Keilegom (2003)); (c) oracle estimators â_oracle and σ̂²_oracle of a_1 and σ², which are constructed under the assumption that the error process {ε_t} is observed (for each simulation run, we compute â_oracle as the maximum likelihood estimator of a_1 from the time series of simulated error terms ε_1, …, ε_T; we then calculate the residuals r_t = ε_t − â_oracle ε_{t−1} and estimate the innovation variance ν² = E[η_t²] by ν̂²_oracle = (T − 1)^{−1} Σ_{t=2}^{T} r_t²; finally, we set σ̂²_oracle = ν̂²_oracle/(1 − â_oracle)²). Throughout this section, we set q = 25, (r̲, r̄) = (1, 10) and (m_1, m_2) = (20, 30). In particular, we choose q to lie between m_1 and m_2 to make the tuning parameters of the estimators ã_q and â_HvK comparable. To assess how sensitive our estimators are to the choice of q and (r̲, r̄), we carry out robustness checks, considering a range of values for q and (r̲, r̄). In addition, we vary the tuning parameters m_1 and m_2 of the estimators from Hall and Van Keilegom (2003) to make sure that the results of our comparison study are not driven by the particular choice of any of the tuning parameters involved. The results of our robustness checks are reported in section S.3 of the on-line supplement.
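The oracle benchmark in (c) is straightforward to mimic. The sketch below replaces the exact maximum likelihood estimator with the conditional least-squares fit (an assumption on our part; the two are asymptotically equivalent for a causal AR(1) process):

```python
import numpy as np

def oracle_estimators(eps):
    """Oracle AR(1) fit from the (in practice unobserved) errors eps_1..eps_T.
    Conditional least squares replaces the exact MLE here; then
    sigma^2_oracle = nu^2_oracle / (1 - a_oracle)^2."""
    a_oracle = np.sum(eps[1:] * eps[:-1]) / np.sum(eps[:-1] ** 2)
    resid = eps[1:] - a_oracle * eps[:-1]
    nu2_oracle = np.mean(resid ** 2)
    return a_oracle, nu2_oracle / (1.0 - a_oracle) ** 2

# demo: fit to a simulated AR(1) error series with a_1 = 0.5
rng = np.random.default_rng(0)
eta = rng.standard_normal(20500)
eps = np.zeros(20500)
for t in range(1, 20500):
    eps[t] = 0.5 * eps[t - 1] + eta[t]
a_hat, sigma2_hat = oracle_estimators(eps[500:])
```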
They show that the results of our comparison study are robust to different choices of the parameters q, (r̲, r̄) and (m_1, m_2). For each estimator â, â_HvK, â_oracle, σ̂², σ̂²_HvK and σ̂²_oracle, and for each model specification, the simulation output consists of a vector of length S = 1000 which contains the 1000 simulated values of the respective estimator. Figs 4 and 5 present mean squared error (MSE) values computed from these simulation outputs; note that the scaling of the axes is different across simulation scenarios to obtain a reasonable graphical presentation. In addition to the MSE values that are presented in Figs 4 and 5, we depict histograms of the 1000 simulated values that are produced by the estimators â, â_HvK, â_oracle, σ̂², σ̂²_HvK and σ̂²_oracle for two specific simulation scenarios in Figs 6 and 7. The main findings can be summarized as follows.
(a) In the simulation scenarios with a moderate trend (s_β = 1), the estimators â_HvK and σ̂²_HvK of Hall and Van Keilegom (2003) exhibit a performance similar to that of our estimators â and σ̂² as long as the AR parameter a_1 is not too close to −1. For strongly negative values of a_1 (in particular, for a_1 = −0.75 and a_1 = −0.95), these estimators perform much worse than ours. This can clearly be seen from the much larger MSE values of the estimators â_HvK and σ̂²_HvK for a_1 = −0.75 and a_1 = −0.95 in Fig. 4. Fig. 6 gives some further insights into what is happening here. It shows the histograms of the simulated values that are produced by the estimators â, â_HvK and â_oracle and the corresponding long-run variance estimators in the scenario with a_1 = −0.95 and s_β = 1. As can be seen, the estimator â_HvK does not obey the causality restriction |a_1| < 1 but frequently takes values that are substantially smaller than −1. This results in a very large spread of the histogram and thus in a disastrous performance of the estimator. A similar point applies to the histogram of the long-run variance estimator σ̂²_HvK. Our estimators â and σ̂², in contrast, exhibit stable behaviour in this case.
Interestingly, the estimator â_HvK (as well as the corresponding long-run variance estimator σ̂²_HvK) performs much worse than ours for large negative values but not for large positive values of a_1. This can be explained as follows: in the special case of an AR(1) process, the estimator â_HvK may produce estimates that are smaller than −1, but it cannot become larger than 1. This can easily be seen on inspecting the definition of the estimator. Hence, for large positive values of a_1, the estimator â_HvK performs well, as it satisfies the causality restriction that the estimated AR parameter should be smaller than 1. (b) In the simulation scenarios with a pronounced trend (s_β = 10), the estimators of Hall and Van Keilegom (2003) are clearly outperformed by ours for most of the AR parameters a_1 under consideration. In particular, their MSE values reported in Fig. 5 are much larger than the values that are produced by our estimators for most parameter values a_1. The reason is as follows: the Hall and Van Keilegom estimators have a strong bias, since the pronounced trend with s_β = 10 is not eliminated appropriately by the underlying differencing methods. This point is illustrated by Fig. 7, which shows histograms of the simulated values for the estimators â, â_HvK and â_oracle and the corresponding long-run variance estimators in the scenario with a_1 = 0.25 and s_β = 10. As can be seen, the histogram that is produced by our estimator â is centred near the true value a_1 = 0.25, whereas that of â_HvK is strongly biased upwards. A similar picture arises for the long-run variance estimators σ̂² and σ̂²_HvK. Whereas the methods of Hall and Van Keilegom (2003) perform much worse than ours for negative and moderately positive values of a_1, the performance (in terms of MSE) is fairly similar for large values of a_1. This can be explained as follows: when the trend m is not eliminated appropriately by taking differences, this creates spurious persistence in the data.
Hence, the estimatorâ HvK tends to overestimate the AR parameter a 1 , i.e.â HvK tends to be larger in absolute value than a 1 . Very loosely speaking, when the parameter a 1 is close to 1, say a 1 = 0:95, there is not much room for overestimation sinceâ HvK cannot become larger than 1. Consequently, the effect of not eliminating the trend appropriately has a much smaller effect onâ HvK for large positive values of a 1 .
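The spurious-persistence mechanism can be illustrated with a minimal numerical sketch. The trend function, sample size and parameter values below are illustrative choices (not those of the simulation study), and the naive lag-1 Yule–Walker estimate stands in for any estimator that fails to remove the trend:

```python
import numpy as np

rng = np.random.default_rng(0)
T, a1 = 500, 0.25

# AR(1) errors with unit-variance IID innovations
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = a1 * eps[t - 1] + rng.standard_normal()

# Pronounced smooth trend added on top of the errors
u = np.arange(1, T + 1) / T
Y = 10 * np.sin(np.pi * u) + eps

def yw_ar1(x):
    """Lag-1 Yule-Walker estimate of the AR(1) coefficient."""
    x = x - x.mean()
    return np.dot(x[1:], x[:-1]) / np.dot(x, x)

a_oracle = yw_ar1(eps)  # trend known and removed exactly
a_naive = yw_ar1(Y)     # trend ignored: spurious persistence
print(a_oracle, a_naive)
```

Because the smooth trend dominates the sample autocovariances, the naive estimate is pushed far above the true value a_1 = 0.25, while the oracle estimate stays close to it.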

Application
The analysis of time trends in long temperature records is an important task in climatology. Information on the shape of the trend is needed to understand long-term climate variability better. In what follows, we use our multiscale test T_MS to analyse two long-term temperature records. Throughout the section, we set the level of significance to α = 0.05 and implement the multiscale test in exactly the same way as in the simulation study of Section 5.

Analysis of the central England temperature record
The central England temperature record is the longest instrumental temperature time series in the world. The data are publicly available on the web page of the UK Met Office. A detailed description of the data can be found in Parker et al. (1992). For our analysis, we use the data set of yearly mean temperatures, which consists of T = 359 observations Y_{t,T} covering the years 1659 to 2017. A plot of the time series is given in Fig. 8(a). We assume that the temperature data Y_{t,T} follow the non-parametric trend model Y_{t,T} = m(t/T) + ε_t, where m is the unknown time trend of interest. The error process {ε_t} is supposed to have the AR(p*) structure ε_t = Σ_{j=1}^{p*} a_j ε_{t−j} + η_t, where the η_t are IID innovations with mean 0 and variance ν². As pointed out in Mudelsee (2010) among others, this is the most widely used error model for discrete climate time series. We select the AR order p* by the Bayesian information criterion (BIC), which yields p* = 2. More precisely, we proceed as follows: we estimate the AR parameters and the corresponding variance of the innovation terms for different AR orders by the methods from Section 4 and then choose p* as the minimizer of the BIC. As a robustness check, we have repeated this procedure for a wide range of the tuning parameters q and (r̲, r̄), which produces the value p* = 2 throughout. Moreover, we have considered other information criteria such as the final prediction error criterion, the Akaike information criterion and a bias-corrected version thereof, which give the AR order p* = 2 for almost all values of q and (r̲, r̄). Given the AR order p* = 2, we estimate the AR(2) parameters a = (a_1, a_2) and the long-run error variance σ² by the procedures from Section 4 with q = 25 and (r̲, r̄) = (1, 10). This gives the estimates â_1 = 0.164, â_2 = 0.175 and σ̂² = 0.737.
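The order-selection step can be sketched generically as follows. This is an illustrative implementation, assuming a plain OLS fit of the AR(p) model to a demeaned series and the standard criterion BIC(p) = n log ν̂² + p log n; it is not the difference-based procedure of Section 4, and the function names are hypothetical:

```python
import numpy as np

def fit_ar_ols(x, p):
    """OLS fit of x_t = a_1 x_{t-1} + ... + a_p x_{t-p} + eta_t.
    Returns the coefficient vector and the innovation-variance estimate."""
    X = np.column_stack([x[p - j - 1 : len(x) - j - 1] for j in range(p)])
    y = x[p:]
    coefs = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coefs
    return coefs, resid.var()

def bic_order(x, p_max=6):
    """AR order minimizing BIC(p) = n * log(nu2_hat) + p * log(n)."""
    x = x - x.mean()
    n = len(x)
    bics = [n * np.log(fit_ar_ols(x, p)[1]) + p * np.log(n)
            for p in range(1, p_max + 1)]
    return 1 + int(np.argmin(bics))
```

Applied to a series whose errors are genuinely AR(2), the criterion should recover an order close to 2; the same scan over p is then repeated for different tuning-parameter choices as a robustness check.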
With the help of our multiscale method, we now test the null hypothesis H_0 that m is constant on all intervals [u − h, u + h] with (u, h) ∈ G_T^*, where the grid G_T^* is defined in the same way as in Section 5. The results are presented in Fig. 8. Fig. 8(b) depicts the minimal intervals in the set Π_T^+ produced by our multiscale test T_MS. The set of intervals Π_T^− is empty in the present case. According to Proposition 3, we can make the following simultaneous confidence statement about the collection of minimal intervals plotted in Fig. 8(b): we can claim, with confidence of about 95%, that the trend m has some increase on each minimal interval. More specifically, we can claim with this confidence that there has been some upward movement in the trend both in the period from around 1680 to 1740 and in the period from about 1870 onwards. Hence, our test in particular provides evidence that there has been some warming trend over approximately the last 150 years. In contrast, as the set Π_T^− is empty, there is no evidence of any downward movement of the trend. Fig. 8(c) presents the SiZer map produced by our multiscale test T_MS. For comparison, the SiZer map of the dependent SiZer test T_SiZer is shown in Fig. 8(d). To produce Fig. 8(d), we have implemented SiZer as described in Section S.3 of the on-line supplement, where the autocovariance function of the errors {ε_t} is estimated with the help of our procedures from Section 4 under the assumption that {ε_t} is an AR(2) process. The SiZer maps of Figs 8(c) and 8(d) are to be read as follows: each pixel of the map corresponds to a location-scale point (u, h) or, put differently, to a time interval [u − h, u + h]. The pixel (u, h) is coloured dark grey if the test indicates an increase in the trend m on the interval [u − h, u + h], white if the test indicates a decrease, and light grey if the test does not reject the null hypothesis that m is constant on [u − h, u + h].
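The colour rule for reading the maps can be written down directly. The following is a schematic sketch only: the function name and signature are hypothetical, and the multiscale critical-value logic of the actual test is abstracted into a single threshold `crit`:

```python
def sizer_pixel(stat, crit):
    """Map the local test statistic for the interval [u-h, u+h] to a pixel
    colour: dark grey = significant increase, white = significant decrease,
    light grey = H0 of a locally constant trend not rejected."""
    if stat > crit:
        return "dark grey"
    if stat < -crit:
        return "white"
    return "light grey"
```

Evaluating such a rule over the whole grid of location-scale points (u, h) yields the pixel map; the confidence statement of the paper then attaches simultaneously to all dark grey pixels.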
As can be seen, the two SiZer maps in Figs 8(c) and 8(d) have a similar structure. Both our multiscale test and SiZer indicate increases in the trend m during a short time period around 1700 and towards the end of the sample. However, in contrast with SiZer, our method enables us to make formal confidence statements about the regions of dark grey pixels in the SiZer map. In particular, as the set of dark grey pixels in Fig. 8(c) exactly corresponds to the collection of intervals Π_T^+, we can claim, with confidence of about 95%, that the trend m has an increase on each time interval represented by a dark grey pixel in Fig. 8(c).

Analysis of global temperature data
We next analyse a data set which consists of annual global temperature anomalies from 1850 onwards. The data are plotted in Fig. 9(a) and are described in detail in Morice et al. (2012). They are publicly available on the web page https://cdiac.ess-dive.lbl.gov/trends/temp/jonescru/jones.html. As before, we assume that the data come from the model Y_{t,T} = m(t/T) + ε_t, where m is the trend and {ε_t} the noise process. We apply our multiscale methods to test the null hypothesis H_0 that m is constant on all time intervals [u − h, u + h] with (u, h) ∈ G_T, where the grid G_T is defined as in Section 5. We compare our results with those obtained by Wu et al. (2001), who developed a method for testing the hypothesis that m is constant on [0, 1] against the alternative that m is an arbitrary monotonic function. For comparability, we use exactly the same data as in Wu et al. (2001), in particular the yearly temperature anomalies from 1856 to 1998. Moreover, we use their estimate of the long-run error variance σ², which amounts to 0.01558. As we do not have an estimate of the autocovariance function of the error process available from Wu et al. (2001), we do not consider dependent SiZer in the application example at hand.
The results produced by our multiscale test are reported in Fig. 9: Fig. 9(a) shows the observed temperature time series (in degrees centigrade), Fig. 9(b) the minimal intervals in the set Π_T^+ produced by the test (these are [1905, 1935], [1915, 1945], [1920, 1950] and [1965, 1995]) and Fig. 9(c) the SiZer map of the test. As can be clearly seen from both Fig. 9(b) and Fig. 9(c), the test indicates an increase in the trend m during the first half of the 20th century followed by another increase during the second half. These findings are in line with those of Wu et al. (2001), who rejected the null hypothesis that m is constant. In contrast with the test of Wu et al. (2001), however, our multiscale method not only enables us to test whether the null is violated; it also enables us to make formal confidence statements about where violations occur, i.e. about where the trend m is increasing. In particular, we can claim, with confidence of about 95%, that the trend has an increase on each interval plotted in Fig. 9(b).