Volume 82, Issue 3
Original Article
Open Access

Robust testing in generalized linear models by sign flipping score contributions

Jesse Hemerik

Corresponding Author

University of Oslo, Norway

Address for correspondence: Jesse Hemerik, Oslo Centre for Biostatistics and Epidemiology, PO Box 1122 Blindern, 0317 Oslo, Norway. E‐mail: jesse.hemerik@medisin.uio.no
Jelle J. Goeman

Leiden University Medical Center, The Netherlands

First published: 11 May 2020

Summary

Generalized linear models are often misspecified because of overdispersion, heteroscedasticity and ignored nuisance variables. Existing quasi‐likelihood methods for testing in misspecified models often do not provide satisfactory type I error rate control. We provide a novel semiparametric test, based on sign flipping individual score contributions. The parameter tested is allowed to be multi‐dimensional and even high dimensional. Our test is often robust against the mentioned forms of misspecification and provides better type I error control than its competitors. When nuisance parameters are estimated, our basic test becomes conservative. We show how to take nuisance estimation into account to obtain an asymptotically exact test. Our proposed test is asymptotically equivalent to its parametric counterpart.

1 Introduction

We consider the problem of testing hypotheses about parameters in potentially misspecified generalized linear models (GLMs). The types of misspecification that we consider include overdispersion and heteroscedasticity. When the model is misspecified, traditional parametric tests tend to lose their properties, e.g. because they estimate the Fisher information under incorrect assumptions. By a parametric test we mean a test which fully relies on an assumed parametric model (Pesarin, 2015) to compute the null distribution of the test statistic.

When a parametric model to be tested is potentially misspecified, the most obvious approach is to extend the model with more parameters, e.g. to add an overdispersion parameter. However, such approaches still require assumptions, e.g. that the overdispersion is constant. Hence a fully parametric approach is not always the best option.

Another well‐known approach to testing in possibly misspecified GLMs is to use a Wald‐type test, where a sandwich estimate of the variance of the coefficient estimate is used. The sandwich estimate corrects for the potentially misspecified variance. As long as the linear predictor and link are correct, such a test is asymptotically exact under mild assumptions. We call a test asymptotically exact if its rejection probability is asymptotically known under the null hypothesis. For small samples, however, sandwich estimates often perform poorly and the test can be very liberal (Boos, 1992; Freedman, 2006; Maas and Hox, 2004; Kauermann and Carroll, 2000).

Recent decades have seen an increase in the use of permutation approaches for various testing problems (Tusher et al., 2001; Pesarin, 2001; Chung and Romano, 2013; Pauly et al., 2015; Winkler et al., 2016; Hemerik and Goeman, 2018a; Ganong and Jäger, 2018). These methods are useful since they require few parametric assumptions. Especially when multiple hypotheses are tested, permutation methods are often powerful since they can take into account the dependence structure in the data (Westfall and Young, 1993; Hemerik and Goeman, 2018b; Hemerik et al., 2019). In the past, permutation methods have already been used to test in linear models (Winkler et al. (2014) and references therein). Rather than permutations, sometimes other transformations are used, such as rotations (Solari et al., 2014) and sign flipping of residuals (Winkler et al., 2014). The existing permutation tests for GLMs, however, are limited to models with identity link function.

Like some existing methods for testing in linear models, this paper presents a sign flipping approach. Our approach is new, however, since, rather than flipping residuals, we flip individual score contributions (note that the score, the derivative of the log‐likelihood, is a sum of n individual score contributions). Moreover, we allow testing in a wide range of models, not only regression models with identity link. Under mild assumptions, the only requirement for the test to be asymptotically exact is that the individual score contributions have mean 0. Consequently, if the link function is correct, our method is often robust against several types of model misspecification, such as arbitrary overdispersion, heteroscedasticity and, in some cases, ignored nuisance parameters.

The main reason for this robustness is that we do not need to estimate the variance of the score: the Fisher information. Rather, we perform a permutation‐type test based on the score contributions where, rather than permutation, we use sign flipping. Intuitively, the advantage of this approach over explicitly estimating the variance is as follows: if the score contributions are independent and perfectly symmetric around zero under the null, then our test is exact for small n, even if the score contributions have misspecified variances and shapes (Pesarin and Salmaso, 2010a). A parametric test, in contrast, is then usually not exact.

In case nuisance parameters are estimated, the individual score contributions become dependent and our basic sign flipping test is no longer asymptotically exact. To deal with this problem, we consider the effective score, which is less dependent on the nuisance estimate than is the basic score (Hall and Mathiason, 1990; Marohn, 2002). In this case we need slightly more assumptions: the variance misspecification is not always allowed to depend on the covariates. The resulting test is asymptotically exact.

The methods in this paper have been implemented in the R package flipscores (Hemerik et al., 2018), which is available in the Comprehensive R Archive Network.

In Section 2 we consider the scenario that no nuisance effects need to be estimated. In Section 3 we show how the estimation of nuisance effects can be taken into account. Section 4 provides tests of hypotheses about parameters of more than one dimension. Section 5 contains simulations and Section 6 an analysis of real data.

The programs that were used to analyse the data can be obtained from

https://rss.onlinelibrary.wiley.com/hub/journal/14679868/series-b-datasets.

2 Models with known nuisance parameters

Consider random variables ν1,…,νn, which satisfy assumption 1 below. These will often be individual score contributions (see Section 3, Rao (1948) or Hall and Mathiason (1990), page 86), but the results in Section 2.1 hold for any random variables satisfying this assumption.

Assumption 1. The random variables νi, i ∈ ℕ, are independent of each other, have finite variances and satisfy the following condition. For every ε > 0,

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}E\big(\nu_i^2\,1\{|\nu_i|/\sqrt{n}>\varepsilon\}\big)=0.$$

Further, as n → ∞, $s_n^2:=(1/n)\sum_{i=1}^{n}\mathrm{var}(\nu_i)\to s^2$ for some constant s² > 0.

Throughout Section 2, we consider any null hypothesis H0 which implies that E(νi) = 0 for all 1 ⩽ i ⩽ n. If ν1,…,νn are score contributions and H0 is a point hypothesis, then, under mild assumptions, E(νi) = 0 is satisfied under H0.

A key assumption throughout Section 2 is that the νi, i ∈ ℕ, are independent. As soon as nuisance parameters need to be estimated, however, score contributions become dependent. Section 3 is devoted to dealing with estimated nuisance.

2.1 Basic sign flipping test

Let α ∈ [0,1). For any a ∈ ℝ, let ⌈a⌉ be the smallest integer that is larger than or equal to a and let ⌊a⌋ be the largest integer that is at most a. Given values $T^n_1,\dots,T^n_w\in\mathbb{R}$, we let $T^n_{(1)}\leqslant\dots\leqslant T^n_{(w)}$ be the sorted values and write $T^n_{[1-\alpha]}=T^n_{(\lceil(1-\alpha)w\rceil)}$.

Throughout this paper, w ∈ {2,3,…} denotes the number of random sign flipping transformations to be used. Define g1 = (1,…,1) ∈ ℝ^n and for every 2 ⩽ j ⩽ w let gj = (gj1,…,gjn) be independent and uniformly distributed on {−1,1}^n. Throughout the rest of Section 2, for every 1 ⩽ j ⩽ w, we let
$$T^n_j=n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu_i.$$
We now state that the basic sign flipping test is asymptotically exact for the point null hypothesis H0 that implies that E(νi) = 0, 1 ⩽ i ⩽ n. All proofs are in the appendices.

Theorem 1. Suppose that assumption 1 holds. Consider the test that rejects H0 if and only if $T^n_1>T^n_{[1-\alpha]}$. Then, as n → ∞, the probability of rejection of this test converges to ⌊αw⌋/w ⩽ α under H0. Moreover, the statistics $T^n_1,\dots,T^n_w$ are asymptotically normal and independent with mean 0 and common variance $\lim_{n\to\infty}s_n^2$ under H0.
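The test of theorem 1 can be sketched in a few lines of Python. This is a minimal illustration of the recipe (flip, sort, compare with the ⌈(1−α)w⌉-th order statistic); the function name is ours and this is not the authors' flipscores implementation:

```python
import math
import random

def sign_flip_test(nu, w=1000, alpha=0.05, seed=0):
    """Basic sign-flipping test of theorem 1 (illustrative sketch).

    nu : score contributions nu_1, ..., nu_n, assumed to have mean 0 under H0.
    w  : number of sign-flipping transformations; g_1 is the identity flip.
    Returns (reject, T), where T[0] is the observed statistic T_1^n.
    """
    rng = random.Random(seed)
    n = len(nu)
    # T_j^n = n^{-1/2} * sum_i g_ji * nu_i, with g_1 = (1, ..., 1)
    T = [sum(nu) / math.sqrt(n)]
    for _ in range(w - 1):
        T.append(sum(rng.choice((-1, 1)) * v for v in nu) / math.sqrt(n))
    # reject when T_1^n exceeds the ceil((1 - alpha) * w)-th order statistic
    threshold = sorted(T)[math.ceil((1 - alpha) * w) - 1]
    return T[0] > threshold, T
```

Note that no variance estimate appears anywhere: the flipped statistics themselves provide the reference distribution.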

We now provide an extension of theorem 1 to interval hypotheses. The proof is a straightforward adaptation of the proof of theorem 1.

Corollary 1. (interval hypotheses) Suppose that assumption 1 holds. Consider a null hypothesis H which implies that E(νi) ⩽ 0 for all 1 ⩽ i ⩽ n. Then, for every ε > 0, there is an N ∈ ℕ such that, under H, for every n > N, $P(T^n_1>T^n_{(\lceil(1-\alpha)w\rceil)})$ is at most ⌊αw⌋/w + ε. Similarly, if H implies that E(νi) ⩾ 0 for all 1 ⩽ i ⩽ n, then there is an N ∈ ℕ such that, under H, for every n > N, $P(T^n_1<T^n_{(\lfloor\alpha w\rfloor+1)})$ is at most ⌊αw⌋/w + ε.

The following corollary extends theorem 1 to two‐sided tests. The proof is analogous to that of theorem 1.

Corollary 2. (two‐sided test) Suppose that assumption 1 holds. Consider α1, α2 ∈ {0/w, 1/w, …, (w−1)/w}. Under H0, as n → ∞,

$$P\big\{\big(T^n_1<T^n_{(\lfloor\alpha_1 w\rfloor+1)}\big)\cup\big(T^n_1>T^n_{(\lceil(1-\alpha_2)w\rceil)}\big)\big\}\to\alpha_1+\alpha_2.$$

Note that our test does not rely on an approximate symmetry assumption (as for example that of Canay et al. (2017) does). Indeed, even if the scores are very skewed, asymptotically the test of theorem 1 is exact. However, if the νi are symmetric, then even for small n the size is always at most α, as noted in the following proposition. A special case of this result has already been discussed in Fisher (1935), section 21, where every element of {1,−1}^n is used once.

Proposition 1. Suppose that ν1,…,νn are independent and continuous and that under H0, for each 1 ⩽ i ⩽ n, $\nu_i\stackrel{d}{=}-\nu_i$. Then the size of the test of theorem 1 is at most α for any n ∈ ℕ. Moreover, if g2,…,gw are uniformly drawn from {1,−1}^n∖{(1,…,1)} without replacement (so that only g1 takes the value (1,…,1)), then the probability of rejection under H0 is exactly ⌊αw⌋/w. (Note that w cannot exceed 2^n then.)

If the gj are drawn with replacement or the νi are discrete, then under H0 the probability of rejection of the test of proposition 1 is (slightly) smaller than ⌊αw⌋/w for finite n, because of the possibility of ties among the test statistics $T^n_j$, 1 ⩽ j ⩽ w. Otherwise the rejection probability under H0 is ⌊αw⌋/w.

When the rejection probability under H0 is ⌊αw⌋/w, it can be advantageous to take w such that α is a multiple of 1/w, to exhaust the nominal level.
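For small n the full group of 2^n sign vectors can be enumerated, as in the Fisher (1935)-type version used in proposition 1. A minimal sketch (illustrative code, not the flipscores implementation):

```python
import itertools
import math

def exact_sign_flip_pvalue(nu):
    """One-sided p-value from the full group of 2^n sign flips.

    Feasible only for small n. With every flip used exactly once and
    continuous scores, the rejection probability is an exact multiple of
    1/2^n, so choosing alpha a multiple of 1/w exhausts the nominal level.
    """
    n = len(nu)
    t_obs = sum(nu) / math.sqrt(n)
    t_all = [sum(g * v for g, v in zip(signs, nu)) / math.sqrt(n)
             for signs in itertools.product((1, -1), repeat=n)]
    # fraction of flipped statistics at least as large as the observed one
    return sum(t >= t_obs for t in t_all) / len(t_all)
```

For example, for ν = (1, 2, 3) only the identity flip attains T = 6/√3, so the p-value is 1/8.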

In theorem 1, we did not assume continuity of the observations νi. There, under the mild assumption 1, for n → ∞, $P(T^n_j=T^n_k)\to 0$ for any 1 ⩽ j < k ⩽ w, regardless of the distribution of the νi. This allows the use of theorem 1 for discrete GLMs.

2.2 Robustness

As a main example we consider the exponential family, i.e. suppose that independent variables Y1,…,Yn have densities of the form
$$f(y_i;\eta_i)=\exp\left\{\frac{y_i\eta_i-b(\eta_i)}{a_i}+c(y_i)\right\},$$
where ηi = xiβ + zi′γ, with xi, β ∈ ℝ and zi, γ ∈ ℝ^m for some m ∈ ℕ. Here β is the coefficient of interest and at present we assume that the other coefficients γ are known. The canonical link function g satisfies ηi = g(μi), where $g^{-1}(\eta_i)=\mu_i=E(y_i)=b'(\eta_i)$ and $a_i=\mathrm{var}(y_i)/b''(\eta_i)$ (Agresti, 2015). For H0: β = β0, the score $\sum_{i=1}^{n}\nu_i=\sum_{i=1}^{n}\partial\log\{f(y_i;\eta_i)\}/\partial\beta\,\big|_{\beta=\beta_0}$ is
$$\sum_{i=1}^{n}\frac{x_i\{y_i-b'(\eta_i)\}}{a_i}\bigg|_{\beta=\beta_0}=\sum_{i=1}^{n}\frac{x_i\{y_i-E(y_i)\}}{a_i}\bigg|_{\beta=\beta_0}.$$
For example, the Poisson model has log‐link function g, b(ηi) = exp(ηi), ai = 1 and c(yi) = −log(yi!). Hence $E(y_i)=b'(\eta_i)=\exp(\eta_i)$. Thus the score function is
$$\sum_{i=1}^{n}x_i(y_i-\mu_i)\bigg|_{\beta=\beta_0}=\sum_{i=1}^{n}x_i\{y_i-\exp(x_i\beta_0+z_i'\gamma)\}.$$
For the normal distribution, ai=σ2, so the score is
$$\sum_{i=1}^{n}\frac{x_i(y_i-\eta_i)}{\sigma^2}\bigg|_{\beta=\beta_0}.$$
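For instance, the Poisson score contributions can be computed directly. The sketch below assumes a scalar covariate of interest x_i and a single known nuisance covariate z_i; the function name is illustrative:

```python
import math

def poisson_score_contributions(x, y, z, beta0, gamma):
    """Contributions nu_i = x_i * {y_i - exp(x_i*beta0 + z_i*gamma)}
    to the score of a Poisson GLM with log link, evaluated at beta = beta0."""
    return [xi * (yi - math.exp(xi * beta0 + zi * gamma))
            for xi, yi, zi in zip(x, y, z)]
```

Under H0 these contributions have mean 0 whenever the mean structure is correct, even if var(yi) ≠ μi, which is exactly what theorem 1 exploits.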

Apart from some mild assumptions, the main assumption that is made in theorem 1 is that E ( ν i ) = 0 , i=1,…,n. This is satisfied as soon as μ i | β = β 0 is the true expected value of Yi. Then the test is asymptotically exact even if the ai are misspecified, i.e. if the variance or distributional shape of Yi is misspecified. The ai are even allowed to be misspecified by a factor which depends on the covariates, as long as assumption 1 holds.

As a concrete example, consider the normal model with identity link function, which assumes that var(Y1)=…=var(Yn). If the real distribution is heteroscedastic, then the test will still be exact for finite n, since the νi are symmetric. The parametric test, however, loses its properties, e.g. because the estimated variance does not have the assumed χ2‐distribution. In Section 5 it is illustrated that our approach can be much more robust against heteroscedasticity than a parametric test.

Another example is the situation where the model is Poisson, i.e. var(Yi)=μi is assumed, but in reality var(Yi)>μi, which is a form of overdispersion which occurs very often in practice. Then the parametric score test underestimates the Fisher information and is anticonservative. To take the overdispersion into account it could be explicitly estimated. However, if the overdispersion factor is not constant, but depends on the covariates, then again the parametric test loses its properties. Theorem 1, however, often still applies, so an asymptotically exact test is obtained.

Further, note that if E(Yi) depends on a nuisance variable $Z_i^l$ which is latent and ignored, where $Z_i^l$ is independent of Xi, then the test may still be valid. The reason is that, marginally over $Z_i^l$, E(Yi) may still be computed correctly (see, for example, Section 5.2). Such latent nuisance variables will increase the variance of Yi, however, which can pose a problem for the classical parametric score test, which needs to compute the Fisher information. When the latent variable is not independent of Xi, this usually does pose a problem for our test (even as n → ∞), since E(Yi − μi) becomes dependent on Xi under H0.

3 Taking into account nuisance estimation

Consider independent and identically distributed (IID) pairs (Xi, Yi), i = 1,…,n, where Xi is some covariate vector and Yi ∈ ℝ has distribution $P_{\beta,\gamma_0,X_i}$, which depends on the parameter of interest β ∈ ℝ and an unknown nuisance parameter $\gamma_0$, which lies in a set $\mathcal{G}\subseteq\mathbb{R}^{k-1}$, where k is the total number of modelled parameters. We shall discuss the issues arising from estimating $\gamma_0$ and propose a solution, which enables us to obtain an asymptotically exact test based on score flipping. In this paper, the above model is the model that is considered by the user. It is the model that is used to compute the scores. We consider this model to be correct, unless explicitly stated otherwise, e.g. in Section 3.2. The parameter $\gamma_0$ is part of the model that is considered by the user, so it is always modelled and estimated. For example, $\gamma_0$ never represents ignored overdispersion.

We consider the null hypothesis H0: β = β0 ∈ ℝ. Generalizations to interval hypotheses and two‐sided tests can be obtained as in corollaries 1 and 2. The case that the parameter of interest is multi‐dimensional is considered in Section 4.

Suppose that, for all $\gamma\in\mathcal{G}$, $P_{\beta,\gamma,X_i}$ has a density $f_{\beta,\gamma,X_i}$ around β0 with respect to some dominating measure. For 1 ⩽ i ⩽ n write
$$\nu_{\gamma,i}=\frac{\partial}{\partial\beta}\log\{f_{\beta,\gamma,X_i}(Y_i)\}\bigg|_{\beta=\beta_0},$$
where we assume that the derivative exists. The value $\nu_{\gamma,i}$ is the score contribution for the ith observation. Under H0, $E(\nu_{\gamma_0,i})=0$, i = 1,…,n. The score for all n observations simultaneously is $n^{1/2}S_\gamma$, where
$$S_\gamma=n^{-1/2}\sum_{i=1}^{n}\nu_{\gamma,i}.$$
Assume that $\hat\gamma$ is a √n‐consistent estimate of $\gamma_0$, taking values in $\mathcal{G}$. For every 1 ⩽ i ⩽ n, let
$$\nu^{(k-1)}_{\hat\gamma,i}=\frac{\partial}{\partial\gamma}\log\{f_{\beta_0,\gamma,X_i}(Y_i)\}\bigg|_{\gamma=\hat\gamma}\in\mathbb{R}^{k-1}$$
denote the (k−1)‐vector of score contributions for the nuisance parameters, which is assumed to exist. Let
$$S^{(k-1)}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^{n}\nu^{(k-1)}_{\hat\gamma,i}\in\mathbb{R}^{k-1}$$
be the vector of nuisance scores.
For 1 ⩽ j ⩽ w, let the superscript j denote that gj has been applied:
$$S^j_{\hat\gamma}=n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu_{\hat\gamma,i},\qquad S^{(k-1),j}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu^{(k-1)}_{\hat\gamma,i}.$$

3.1 Asymptotically exact test

When the nuisance parameter γ 0 is unknown, it needs to be estimated, which is typically done by maximizing the likelihood of the data under the null hypothesis. The distribution of S γ ^ can be substantially different from that of S γ 0 : the score based on the true nuisance parameters. Indeed, under the null hypothesis, the asymptotic variance of S γ ^ is not the Fisher information, but the effective Fisher information (Rippon and Rayner (2010), Rayner (1997), Hall and Mathiason (1990), Marohn (2002) and Cox and Hinkley (1979), section 9.3), which is also the asymptotic variance of the effective score, which is defined below. The effective information is smaller than the information, given that the score for the parameter of interest and the nuisance score are correlated. Intuitively, the reason is that the nuisance variable will be used to explain part of the apparent effect of the variable of interest, also asymptotically.

The estimation of $\gamma_0$ makes the summands $\nu_{\hat\gamma,1},\dots,\nu_{\hat\gamma,n}$ underlying $S_{\hat\gamma}$ correlated, in such a way that $\mathrm{var}(S_{\hat\gamma})<\mathrm{var}(S_{\gamma_0})$ (if the score is correlated with the nuisance score). Note, however, that, after random flipping, the summands are not correlated anymore. This means that the variance of $S_{\hat\gamma}$ is asymptotically smaller than the variance of $S^j_{\hat\gamma}$, 2 ⩽ j ⩽ w (see the proof of theorem 2 in Appendix B.3). Hence, using $\nu_{\hat\gamma,1},\dots,\nu_{\hat\gamma,n}$ in the test of theorem 1 can lead to a conservative test, even as n → ∞.

To make the test asymptotically exact again, we would like to adapt the individual scores such that they are less dependent on the random variation of γ ^ . We do this by considering the so‐called effective score, which ‘is “less dependent” on the nuisance parameter than the usual score statistic’ (Marohn (2002), page 344).

The effective score $S^*_{\hat\gamma}$ and the underlying summands $\nu^*_{\hat\gamma,i}$, i = 1,…,n (which we assume have non‐zero variance for $\hat\gamma=\gamma_0$), are defined as
$$S^*_{\hat\gamma}=S_{\hat\gamma}-\hat{I}_{12}\hat{I}_{22}^{-1}S^{(k-1)}_{\hat\gamma},\qquad \nu^*_{\hat\gamma,i}=\nu_{\hat\gamma,i}-\hat{I}_{12}\hat{I}_{22}^{-1}\nu^{(k-1)}_{\hat\gamma,i},$$
so that
$$S^*_{\hat\gamma}=n^{-1/2}\sum_{i=1}^{n}\nu^*_{\hat\gamma,i}.$$
Here
$$\hat{I}=\begin{pmatrix}\hat{I}_{11}&\hat{I}_{12}\\ \hat{I}_{12}'&\hat{I}_{22}\end{pmatrix},$$
with $\hat{I}_{11}\in\mathbb{R}$ and the (k−1)×(k−1) matrix $\hat{I}_{22}$ assumed invertible, is a consistent estimate of the population Fisher information $I$, which is assumed to exist and is the variance of $(\nu_{\gamma_0,i},\nu^{(k-1)}_{\gamma_0,i})$ marginally over Xi, under H0. The matrix $I$ is assumed to be continuous in the parameters. In GLMs, typically $\hat{I}=n^{-1}X'\hat{W}X$, where X is the design matrix and $\hat{W}$ the estimated weight matrix (Agresti (2015), page 126). Further, for 1 ⩽ j ⩽ w we write
$$S^{*j}_{\hat\gamma}=S^j_{\hat\gamma}-\hat{I}_{12}\hat{I}_{22}^{-1}S^{(k-1),j}_{\hat\gamma}.$$
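In matrix form, computing the effective score contributions amounts to one linear solve per data set. A numpy-based sketch (illustrative helper, assuming a symmetric Î supplied with the parameter of interest ordered first):

```python
import numpy as np

def effective_score_contributions(nu, nu_nuis, I_hat):
    """nu*_i = nu_i - I12 @ inv(I22) @ nu_nuis_i for each observation i.

    nu      : (n,) score contributions for the parameter of interest
    nu_nuis : (n, k-1) nuisance score contributions
    I_hat   : (k, k) symmetric estimated Fisher information, with the
              first row/column corresponding to beta
    """
    I12 = I_hat[0, 1:]                      # 1 x (k-1) block, as a vector
    I22 = I_hat[1:, 1:]                     # (k-1) x (k-1) block
    # inv(I22) @ nu_nuis_i for every i, then the inner product with I12
    proj = np.linalg.solve(I22, np.asarray(nu_nuis, float).T).T @ I12
    return np.asarray(nu, float) - proj
```

The flipped statistics are then built from these contributions exactly as in the basic test, with each ν*_i multiplied by its sign g_ji.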

As discussed, $S_{\hat\gamma}$ is not generally asymptotically equivalent to $S_{\gamma_0}$. The effective score $S^*_{\gamma_0}$ (based on $\hat{I}=I$), however, is the residual from the projection of the score $S_{\gamma_0}$ on the space spanned by the nuisance scores. Hence $S^*_{\gamma_0}$ is uncorrelated with the nuisance scores $S^{(k-1)}_{\gamma_0}$ (Marohn (2002), page 344). Correspondingly, as noted in the proof of theorem 2, under mild regularity assumptions $S^*_{\hat\gamma}=S^*_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)$, i.e. asymptotically the effective score is not affected by the nuisance estimate.

If $\hat\gamma$ is the maximum likelihood estimate under H0, then $S^{(k-1)}_{\hat\gamma}=0$, so $S^*_{\hat\gamma}=S_{\hat\gamma}$. The summands $\nu^*_{\hat\gamma,i}$ and $\nu_{\hat\gamma,i}$ are different, however, and the key point is that $S^*_{\hat\gamma}=S^*_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)$.

Like Marohn (2002), we assume that, if ξ ∈ ℝ and $\beta=\beta_n=\beta_0+n^{-1/2}\xi$, then
$$S_{\hat\gamma}=S_{\gamma_0}-I_{12}\sqrt{n}(\hat\gamma-\gamma_0)+o_{P_{\beta_n,\gamma_0}}(1),\qquad S^{(k-1)}_{\hat\gamma}=S^{(k-1)}_{\gamma_0}-I_{22}\sqrt{n}(\hat\gamma-\gamma_0)+o_{P_{\beta_n,\gamma_0}}(1),$$
which is satisfied under mild assumptions such as continuous second‐order derivatives.

Theorem 2. Consider the test of theorem 1 with $T^n_j=S^{*j}_{\hat\gamma}$, 1 ⩽ j ⩽ w. As n → ∞, under H0 the probability of rejection converges to ⌊αw⌋/w ⩽ α.

The test of theorem 2 has a parametric counterpart, which uses that, under H0, $S^*_{\hat\gamma}$ is asymptotically normal with zero mean and known variance: the effective information (Marohn (2002), page 341). This test is asymptotically equivalent to the test of theorem 2, as the following proposition says.

Proposition 2. Let ξ ∈ ℝ and suppose that the true parameter satisfies $\beta=\beta_n=\beta_0+n^{-1/2}\xi$. As in theorem 2, let $T^n_j=S^{*j}_{\hat\gamma}$, 1 ⩽ j ⩽ w. Define $\phi_{n,w}=1\{T^n_1>T^n_{[1-\alpha]}\}$ to be the test of theorem 2. Let $\phi_n$ be the parametric test $1\{T^n_1>\sigma_0\Phi^{-1}(1-\alpha)\}$, where $\sigma_0^2\in\mathbb{R}$ is the effective Fisher information and Φ the cumulative distribution function of the standard normal distribution. Then $\lim_{w\to\infty}\liminf_{n\to\infty}P(\phi_{n,w}=\phi_n)=1$.

3.2 Robustness

In Section 2.2 it was explained that the test of theorem 1 is often robust against misspecification of the variance of the score. The test of theorem 2 is also robust against certain forms of variance misspecification. An example is the case that $S_{\hat\gamma}$ and $S^{(k-1)}_{\hat\gamma}$ are misspecified by the same factor; see proposition 3. This happens in particular if the variance is misspecified by a factor which is independent of the covariates.

Proposition 3. Suppose that $\hat{I}=n^{-1}X'\hat{W}X$, where X is an n×k design matrix with IID rows and $\hat{W}$ a weight matrix. Consider a misspecification factor c1 > 0 and misspecified scores

$$\tilde\nu_{\hat\gamma,i}=c_1\nu_{\hat\gamma,i},\qquad \tilde\nu^{(k-1)}_{\hat\gamma,i}=c_1\nu^{(k-1)}_{\hat\gamma,i},\qquad i=1,\dots,n.$$
Further, for c2 > 0 consider the misspecified weight matrix $\tilde{W}=c_2\hat{W}$. Let $\tilde{I}=n^{-1}X'\tilde{W}X$ be the misspecified average Fisher information. Let $\tilde\nu^*_{\hat\gamma,i}=\tilde\nu_{\hat\gamma,i}-\tilde{I}_{12}\tilde{I}_{22}^{-1}\tilde\nu^{(k-1)}_{\hat\gamma,i}$ be the misspecified effective scores, i = 1,…,n. Consider the test of theorem 2, with $S^{*j}_{\hat\gamma}$, j = 1,…,w, replaced by the misspecified effective score
$$\tilde{S}^{*j}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^{n}g_{ji}\tilde\nu^*_{\hat\gamma,i}.$$
Under H0, as n → ∞, the probability of rejection of this test converges to ⌊αw⌋/w ⩽ α.

Proposition 3 is useful, since it tells us that if in a GLM var(Yi) is misspecified by a constant, such that $\hat{W}$ and the scores are misspecified by a constant, then the resulting test is still asymptotically exact. In proposition 3 we assume that the misspecification factors of the weights and the scores are the same for all observations. This is satisfied for example when the model is binomial or Poisson, but the true distribution is respectively quasi‐binomial or quasi‐Poisson. Moreover, in practice the test can be very robust against heteroscedasticity (see Section 5). The variance misspecification is not generally allowed to depend on the covariates, since then $S_{\hat\gamma}$ and $S^{(k-1)}_{\hat\gamma}$ can be misspecified by different factors asymptotically. There are exceptions, however; see Sections 3.3 and 5.

When there are estimated nuisance parameters, we can sometimes nevertheless decide to use the test of theorem 1 with the basic scores $\nu_{\hat\gamma,i}$ plugged in (rather than using effective scores). Indeed, this test has been shown to be very robust to misspecification, as long as $E(\nu_{\hat\gamma,i})=0$, 1 ⩽ i ⩽ n. It is asymptotically conservative if the score $S_{\gamma_0}$ is correlated with the nuisance scores $S^{(k-1)}_{\gamma_0}$, i.e. when $I_{12}\neq 0$. Hence, when using this test, it can be useful to redefine the covariates such that $I_{12}=0$ (as in Cox and Reid (1987)). When $\hat{W}=bI$, b > 0, this means ensuring that the nuisance covariates are orthogonal to the covariate of interest. If the model is potentially misspecified, then the weights and hence $I_{12}$ are not asymptotically known, but the user could substitute a best guess for the weights.

3.3 An example

As discussed, the test of theorem 2 is not generally asymptotically exact if the variance misspecification depends on the covariates. An important exception is the case where the model is
$$Y_i\sim N(\gamma_0+\beta X_i,\,\sigma^2),\qquad i=1,\dots,n,\qquad(1)$$
where γ0 is the unknown intercept and Xi ∈ ℝ. If the null hypothesis is H0: β = β0, then γ0 is a nuisance parameter that needs to be estimated. (We do not need to know σ and can simply substitute 1 for it.) Hence, we compute the effective score. For 1 ⩽ i ⩽ n,
$$\nu_{\hat\gamma,i}=x_i(y_i-\hat\mu_i)/\sigma^2,\qquad \nu^{(k-1)}_{\hat\gamma,i}=(y_i-\hat\mu_i)/\sigma^2.$$
We can consistently estimate $I_{12}I_{22}^{-1}$ by $\bar{x}=(1/n)\sum_{i=1}^{n}x_i$, so that the effective score contributions are
$$\nu^*_{\hat\gamma,i}=(x_i-\bar{x})(y_i-\hat\mu_i)/\sigma^2.$$

Thus, the effective score contributions are exactly the basic score contributions after centring x1,…,xn at 0. Similarly, if x1,…,xn are already centred, then $\nu_{\hat\gamma,i}$ and $\nu^*_{\hat\gamma,i}$ coincide, since then $\hat{I}_{12}=0$.
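This equivalence is easy to verify numerically. A small sketch for model (1), with σ² set to 1 as the text allows (the function name is ours):

```python
import numpy as np

def effective_scores_linear(x, y, mu_hat, sigma2=1.0):
    """Effective score contributions (x_i - xbar)(y_i - mu_hat_i)/sigma^2
    for the normal model (1) with an estimated intercept."""
    x = np.asarray(x, float)
    resid = np.asarray(y, float) - np.asarray(mu_hat, float)
    # subtracting the mean of x implements I12 * inv(I22) = xbar
    return (x - x.mean()) * resid / sigma2
```

With x already centred, x.mean() is 0 and these coincide with the basic score contributions x_i(y_i − μ̂_i)/σ².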

The test of theorem 2 is not always asymptotically exact if $S_{\hat\gamma}$ and $S^{(k-1)}_{\hat\gamma}$ are misspecified by different factors. However, if $\hat{I}_{12}=0$, then this does not apply anymore. The test of theorem 2 then remains asymptotically exact and reduces to the test based on the basic score. For model (1), this means that, even if the misspecification of var(Yi) depends on Xi, we obtain an asymptotically exact test.

A particular case where this principle applies is the generalized Behrens–Fisher problem, where the aim is to test equality of the means μ1 and μ2 of two populations (or to test whether μ1 ⩽ μ2 or μ1 ⩾ μ2). In this problem, it is assumed only that two independent samples from these populations are available, without making other assumptions such as equal variances. It is well known that this problem has no exact solution with good power under normality (Pesarin and Salmaso, 2010a; Lehmann and Romano, 2005). Under mild assumptions, we obtain an asymptotically exact test for this problem. Pesarin and Salmaso (2010a) have already suggested sign flipping residuals to solve this problem. This is equivalent to flipping scores in our linear model (1) if |x1| = … = |xn|.

4 Multi‐dimensional parameter of interest

Until now we have considered hypotheses about a one‐dimensional parameter β ∈ ℝ. Here we extend our results to hypotheses about a multi‐dimensional parameter β ∈ ℝ^d, d ∈ ℕ. Our tests are defined even if d > n, but in the theoretical results that follow we consider d fixed and let n increase to ∞. The extension to multi‐dimensional β shares important characteristics with the test for a one‐dimensional parameter, such as robustness and asymptotic equivalence with the parametric score test.

4.1 Asymptotically exact test

Our tests below are related to the existing non‐parametric combination methodology (Pesarin, 2001; Pesarin and Salmaso, 2010a,b). This is a very general permutation‐based methodology that allows combining test statistics for many hypotheses into a single test of the intersection hypothesis. Non‐parametric combination methods can be extended to the score flipping framework. Our tests below could be considered a special case of such an extension of the non‐parametric combination methodology. This special case has certain power optimality properties, which are discussed below.

The parametric score test has a classical extension to a hypothesis on a multi‐dimensional parameter, H0: β = β0 ∈ ℝ^d (Rao, 1948). We shall extend our test in an analogous way. We first assume that the nuisance $\gamma_0\in\mathbb{R}^{k-d}$ is known. Since β ∈ ℝ^d, the score is $S_{\gamma_0}=n^{-1/2}\sum_{i=1}^{n}\nu_{\gamma_0,i}\in\mathbb{R}^d$, where
$$\nu_{\gamma_0,i}=\frac{\partial}{\partial\beta}\log\{f_{\beta,\gamma_0,X_i}(Y_i)\}\bigg|_{\beta=\beta_0}\in\mathbb{R}^d,$$
1 ⩽ i ⩽ n, which are now d‐vectors. We assume that the derivatives exist. About the elements of $\nu_{\gamma,i}$ (and the nuisance scores that are considered later) we make assumptions analogous to the earlier assumptions about the one‐dimensional score contributions.
Let $\hat{I}_{11}$ be a consistent estimate of $I_{11}$: the d×d Fisher information for β ∈ ℝ^d. Rao's classical statistic for testing H0: β = β0 ∈ ℝ^d is
$$S_{\gamma_0}'\hat{I}_{11}^{-1}S_{\gamma_0}=\bigg(n^{-1/2}\sum_{i=1}^{n}\nu_{\gamma_0,i}\bigg)'\hat{I}_{11}^{-1}\bigg(n^{-1/2}\sum_{i=1}^{n}\nu_{\gamma_0,i}\bigg),$$
which asymptotically has a $\chi^2_d$‐distribution under H0.

Instead of requiring a matrix $\hat{I}^{-1}$ which converges to the inverse of the Fisher information, in our test that follows we allow replacement of the Fisher information by any random matrix $\hat{V}$ converging to some non‐zero matrix V, i.e. we do not require the Fisher information to be asymptotically known, just like in the one‐dimensional case. The matrix V can be any matrix of preference, including $I_{11}^{-1}$ (if $I_{11}$ is invertible), or we can take $\hat{V}=V=I$. We shall discuss various choices of V shortly.

Theorem 3. The result of theorem 1 still applies if for 1 ⩽ j ⩽ w we define

$$T^n_j=\bigg(n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu_{\gamma_0,i}\bigg)'\hat{V}\bigg(n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu_{\gamma_0,i}\bigg).$$
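In code, each flipped statistic of theorem 3 is a quadratic form in the sign-weighted score sum. A numpy sketch (illustrative; V̂ is any chosen positive semidefinite d×d matrix):

```python
import numpy as np

def flip_statistic(nu, g, V_hat):
    """T_j^n = s' V_hat s with s = n^{-1/2} * sum_i g_i * nu_i.

    nu : (n, d) array of d-dimensional score contributions
    g  : (n,) array of signs in {-1, +1}; all +1 gives the observed statistic
    """
    nu = np.asarray(nu, float)
    s = (np.asarray(g, float)[:, None] * nu).sum(axis=0) / np.sqrt(nu.shape[0])
    return float(s @ np.asarray(V_hat, float) @ s)
```

The observed statistic uses g = (1,…,1); repeating the computation for w − 1 random sign vectors and comparing with the ⌈(1−α)w⌉-th order statistic gives the test.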

In the case that the nuisance parameter $\gamma_0$ is unknown and we have a √n‐consistent estimate $\hat\gamma$, we can use the same test, but with effective scores instead of basic scores plugged in. See theorem 4. For multi‐dimensional β, the effective score contributions are
$$\nu^*_{\hat\gamma,i}=\nu_{\hat\gamma,i}-\hat{I}_{12}\hat{I}_{22}^{-1}\nu^{(k-d)}_{\hat\gamma,i}\in\mathbb{R}^d,$$
1 ⩽ i ⩽ n, where
$$\nu^{(k-d)}_{\hat\gamma,i}=\frac{\partial}{\partial\gamma}\log\{f_{\beta_0,\gamma,X_i}(Y_i)\}\bigg|_{\gamma=\hat\gamma}\in\mathbb{R}^{k-d}.$$
Here $\hat{I}_{12}$ and $\hat{I}_{22}$ are d×(k−d) and (k−d)×(k−d) matrices respectively.

Theorem 4. (unknown nuisance) The result of theorem 1 still applies if for 1 ⩽ j ⩽ w we define

$$T^n_j=\bigg(n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu^*_{\hat\gamma,i}\bigg)'\hat{V}\bigg(n^{-1/2}\sum_{i=1}^{n}g_{ji}\nu^*_{\hat\gamma,i}\bigg).$$

The test of theorem 4 is asymptotically equivalent to its parametric counterpart, as proposition 4 states. In particular, if we take $\hat{V}=(\hat{I}^*)^{-1}$, where $(\hat{I}^*)^{-1}$ is a consistent estimate of the inverse of the effective Fisher information, then the test of theorem 4 is asymptotically equivalent to the parametric score test (Hall and Mathiason (1990), page 86).

Proposition 4. (equivalence with parametric counterpart) Define $T^n_j$ as in theorem 4, 1 ⩽ j ⩽ w. Let ξ ∈ ℝ^d and suppose that the true value of the parameter of interest is $\beta=\beta_n=\beta_0+n^{-1/2}\xi$. Let $\phi_{n,w}=1\{T^n_1>T^n_{[1-\alpha]}\}$. This is the test of theorem 4. Let $\phi_n$ be the parametric test $1\{T^n_1>q_\alpha\}$, where $q_\alpha$ is the (1−α)‐quantile of the distribution to which $T^n_1$ converges as n → ∞ under β = β0. (This is the $\chi^2_d$‐distribution if V is the inverse of the effective information matrix $I^*$.) Then $\lim_{w\to\infty}\liminf_{n\to\infty}P(\phi_{n,w}=\phi_n)=1$.

We have seen that the test of theorem 1 is often robust against overdispersion and heteroscedasticity: as long as the score contributions have mean 0, the test is asymptotically exact, under very mild assumptions. Moreover, it is not required to estimate the Fisher information. The same applies to the multi‐dimensional extension in theorem 3.

The test that takes into account nuisance estimation (theorem 4) uses effective scores, so it does need to estimate the information. However, as in the one‐dimensional case, it can be seen that the test remains valid if the information matrix is asymptotically misspecified by a constant (as in proposition 3). Additional robustness is illustrated with simulations in Section 5.5.

4.2 Connection with the global test

The test of theorem 3 is related to the global test, which was developed in Goeman et al. (2004, 2006, 2011). We can combine the global test with the score flipping approach. In certain cases, the resulting test coincides with the test of theorem 3.

The global test is a parametric test of H0. For the test to be defined, it is not required that d ⩽ n. For GLMs with canonical link function, the test statistic of the global test is
$$S_{\gamma_0}'\,\Sigma\,S_{\gamma_0},\qquad(2)$$
with Σ a freely chosen positive (semi)definite d×d matrix (Goeman et al., 2006, 2011). The choice of Σ influences the power properties.

When V ^ = Σ , statistic (2) coincides with the statistic of theorem 3. Thus, it immediately follows from our results that the global test can be combined with our sign flipping approach, leading to a test which becomes asymptotically exact as n→∞ and asymptotically equivalent to its parametric counterpart: the original global test (by proposition 4). Combining the global test with sign flipping is useful in the light of our robustness results. Moreover, the sign flipping variant can be combined with existing permutation‐based multiple‐testing methodology (Westfall and Young, 1993; Hemerik and Goeman, 2018b; Hemerik et al., 2019).

Goeman et al. (2006) provided results on the power properties of the global test as depending on the choice of Σ. Since the global test is asymptotically equivalent to its sign flipping counterpart, these results can be used as recommendations on the choice of $\hat V$ in theorem 3. In particular, according to Goeman et al. (2006), section 8, taking $\hat V=I$ leads to good power if we expect that a relatively large share of the variance of Y is explained by the large-variance principal components of the design matrix. If this is not so, taking $\hat V$ to be an estimate of the inverse of the Fisher information (if invertible) can provide better power. In general, the global test has optimal power on average (over β) in a neighbourhood of $\beta_0$ that depends on Σ (Goeman et al., 2006). Hence the same holds asymptotically for the test of theorem 3, for GLMs with canonical link.

5 Simulations

To compare the tests in this paper with each other and existing tests, we applied them to simulated data. In particular we considered scenarios where the model was misspecified. Simulations with a multi‐dimensional parameter of interest are in Section 5.5.

5.1 Overdispersion, heteroscedasticity and estimated nuisance

In Sections 5.1 and 5.2 the model assumed was Poisson, but in fact Y1,…,Yn were drawn from a negative binomial distribution.

The covariates $X, Z, Z^l\in\mathbb{R}$ were drawn from a multivariate normal distribution with zero mean and var(X)=var(Z)=var(Z^l)=1. (For non-zero means, similar simulation results were obtained as below.) The response satisfied
$$\log\{E(Y_i)\}=\log(\mu_i)=\eta_i=\gamma_{00}+\beta X_i+\gamma_0 Z_i+\gamma_0^l Z_i^l.$$

The null hypothesis was $H_0:\beta=0$. In Section 5.1 we took $\gamma_0^l=0$. The coefficient $\gamma_0$ and the intercept were nuisance parameters that were estimated by maximum likelihood under $H_0$. We took $\gamma_0=1$, $\rho(X_i,Z_i)=0.5$, $\rho(Z_i^l,Z_i)=0$ and $\rho(Z_i^l,X_i)=0$. We took the dispersion parameter of the negative binomial distribution to be 1, so that $\operatorname{var}(Y_i)=\mu_i+\mu_i^2$.

The model assumed, however, was Poisson, i.e. var(Yi)=μi was assumed. Thus the true variance was larger than the assumed variance and the variance misspecification factor depended on μi, i.e. on the covariate Zi. The assumed log‐link function was correct and in Section 5.1 the linear predictor was correct as well.
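This data-generating mechanism can be sketched as follows (Python with NumPy; the intercept is set to 0 here for simplicity and all names are ours). The negative binomial parameterization is chosen so that $E(Y_i)=\mu_i$ and $\operatorname{var}(Y_i)=\mu_i+\mu_i^2$, matching a dispersion parameter of 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Correlated covariates: corr(X, Z) = 0.5, Z^l independent of both
cov = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
X, Z, Zl = rng.multivariate_normal(np.zeros(3), cov, size=n).T
beta, gamma0, gamma0l = 0.0, 1.0, 0.0        # null setting of Section 5.1
mu = np.exp(beta * X + gamma0 * Z + gamma0l * Zl)
# Negative binomial with dispersion 1: var(Y_i) = mu_i + mu_i^2.
# NumPy parameterization: r successes with success probability p,
# mean r(1 - p)/p, so r = 1 and p = 1/(1 + mu) gives mean mu.
Y = rng.negative_binomial(1, 1.0 / (1.0 + mu))
```

A Poisson analysis of `Y` then assumes $\operatorname{var}(Y_i)=\mu_i$, so the variance is underestimated by the factor $1+\mu_i$, which depends on $Z_i$.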

In Fig. 1 the estimated rejection probabilities of four tests under $H_0:\beta=0$ are compared, based on 5000 repeated simulations. In all simulations the tests were two sided.

Fig. 1. Estimated rejection probabilities for four tests under misspecified variance and estimated nuisance (the null hypothesis was $H_0:\beta=0$; curves: parametric, sandwich, flip basic, flip effective): (a) β=0, n=50; (b) β=0, n=200

One of the tests that was considered was the parametric score test. Since the model assumed was Poisson, the computed Fisher information was too small and the test was anticonservative.

We also applied a Wald test, where we used a sandwich estimate (Agresti (2015), page 280) of the variance of β ^ , to correct for the misspecified variance function. We used the R package gee (Carey et al., 2019) for this (available on the Comprehensive R Archive Network), specifying blocks of size 1. As can be seen in Fig. 1, this test was quite anticonservative (especially for small α, e.g. α=0.01). This was in particular due to the estimation error of the sandwich (Boos, 1992; Freedman, 2006; Maas and Hox, 2004; Kauermann and Carroll, 2000).

Further, we applied the sign flipping test based on the basic scores $\nu_{\hat\gamma,i}$. Because of the estimation of $\gamma_0$, the variance of the score was shrunk and the test was conservative, as explained in Section 3.1. In the simulations under $H_0$ we took w=200. Taking w larger led to a very similar level (see also Marriott (1979)). In the power simulations we took w=1000.

Finally, we used the sign flipping test of theorem 2, which is based on the effective scores $\nu^*_{\hat\gamma,i}$. In Section 3.2 it has already been shown that this test is asymptotically exact under constant variance misspecification. Here, however, the variance misspecification factor was $1+\mu_i$ (i.e. it depended on $Z_i$). Nevertheless the rejection probability under $H_0$ was approximately α. This illustrates that the test has some additional robustness, which we have not shown theoretically.

5.2 Ignored nuisance

The same simulations were performed as in Section 5.1, but with $\gamma_0^l=1$. Since $\gamma_0^l=0$ was assumed, $Z_i^l$ represented an ignored latent variable. Fig. 2 shows similar results to those in Fig. 1. The parametric test was even more anticonservative than in Section 5.1. The reason is that the introduction of $Z_i^l$ increased the variance of $Y_i$, so the variance of the score was even more misspecified than in Section 5.1.

Fig. 2. Estimated rejection probabilities for four tests under misspecified variance, estimated nuisance and ignored nuisance (the null hypothesis was $H_0:\beta=0$; curves: parametric, sandwich, flip basic, flip effective): (a) β=0, n=50; (b) β=0, n=200

The test of theorem 2 was still nearly exact for n=200, even though $\mu_i$ was misspecified. (Even marginally over $Z_i^l$, $\mu_i$ was misspecified. Possibly the estimation of the intercept corrected for the misspecification.)

A conclusion from the simulations of Sections 5.1 and 5.2 is that the sandwich‐based approach should not always be seen as the most reliable way of testing models with misspecified variance functions. Indeed, in our simulations the test of theorem 2 was substantially less anticonservative (while having similar power; see Section 5.3).

5.3 Power

For a meaningful power comparison of the four tests, we considered the scenario where the model assumed was correct, i.e. the data distribution was Poisson and $\gamma_0^l$ was 0: Fig. 3. The estimated probabilities are based on $2\times10^4$ simulation loops.

Fig. 3. Power comparison of four two-sided tests under the correct model, with estimated nuisance (the null hypothesis was $H_0:\beta=0$; curves: parametric, sandwich, flip basic, flip effective): (a) β=0.2, n=50; (b) β=0.2, n=200

Since the model was correct, asymptotically there was no better choice than the parametric test. The sign flipping test of theorem 2 had very similar power. The basic sign flipping test was again conservative because of the estimation of $\gamma_0$. The sandwich-based test had the most power but was anticonservative (the null behaviour is not shown).

5.4 Strong heteroscedasticity

When a Gaussian linear model is considered with $Y_i\sim N(\beta x_i,\sigma^2)$, $x_1=\dots=x_n=1$ and $H_0:\beta=0$, the score contributions are $\nu_i=x_i(Y_i-0)/\sigma^2=Y_i/\sigma^2$. Thus the test of theorem 1 simply flips the observations $Y_i$, $1\le i\le n$. The parametric counterpart of this test is the one-sample t-test. The t-test needs to estimate the nuisance parameter $\sigma^2$; the sign flipping test does not (simply substitute σ=1).
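In this one-sample setting, the sign flipping test amounts to randomly flipping the signs of the observations; for a small sample one can even enumerate all $2^n$ sign patterns, which makes the test exact under symmetry. A minimal sketch (Python; the function name and interface are ours):

```python
import itertools
import numpy as np

def exact_flip_pvalue(y):
    """Exact sign-flipping p-value for H0: E(Y_i) = 0 with symmetric errors,
    enumerating all 2^n sign patterns (feasible only for small n)."""
    y = np.asarray(y, dtype=float)
    t_obs = abs(y.sum())
    count = 0
    total = 0
    for signs in itertools.product([-1.0, 1.0], repeat=len(y)):
        total += 1
        if abs(np.dot(signs, y)) >= t_obs:
            count += 1
    return count / total
```

For example, `exact_flip_pvalue([1, 2, 4])` compares $|1+2+4|=7$ with all eight signed sums and returns 2/8 = 0.25, since only the all-plus and all-minus patterns are as extreme.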

We simulated strongly heteroscedastic data: we took $Y_i\sim N(\beta x_i,\sigma_i^2)$, with $\sigma_i=\exp(i)$, $1\le i\le n=10$. Consequently the t-statistic did not have the assumed distribution and under $H_0$ the rejection probability of the t-test was far from the nominal level for most α: Fig. 4(a). The sign flipping test did not need to estimate the variance. In this setting the test has rejection probability exactly $\lfloor\alpha w\rfloor/w$ if the transformations $g_1,\dots,g_w$ are drawn without replacement, since the observations are symmetric; see proposition 1. (We drew $g_1,\dots,g_w$ with replacement for convenience, but this gives almost the same test as drawing without replacement, because of the small probability of ties.)

Fig. 4. Comparison of the one-sample t-test and the sign flipping test (the null hypothesis was $H_0:\mu=0$): (a) μ=0, strong heteroscedasticity; (b) μ=0.5, correct model

For a meaningful power comparison, we considered the correct homoscedastic model with $\sigma_1=\dots=\sigma_{10}=1$. Fig. 4(b), based on $10^5$ repeated simulations, shows that the tests had virtually the same power.

5.5 Multi‐dimensional parameter of interest

We considered the same setting as in Section 5.2, except that β and the estimated nuisance parameter $\gamma_0=(0.5,0.2,0,0,0)$ were five dimensional (so $X_i,Z_i\in\mathbb{R}^5$). All corresponding covariates were correlated (ρ=0.5). There was an ignored nuisance covariate as before ($\gamma_0^l=0.5$), which was uncorrelated with the other covariates. Thus there were in total 11 covariates. We took the overdispersion such that $\operatorname{var}(Y_i)=\mu_i+0.5\mu_i^2$, i.e. the overdispersion again depended on the covariates (heteroscedasticity).

Instead of the basic score test for a one-dimensional parameter we now used the multi-dimensional extension in theorem 3. Similarly, instead of the test of theorem 2 based on effective scores, we used the multi-dimensional extension in theorem 4. We took $\hat V=V$ to be the identity matrix.
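The multi-dimensional flip statistic with identity matrix, i.e. the quadratic form $S^\top\hat V S$ with $\hat V=I$ evaluated at each sign-flipped score sum, can be sketched as follows (Python with NumPy). This is the basic theorem-3-style statistic, not the full effective-score version; names and defaults are ours.

```python
import numpy as np

def flip_quadratic_test(nu, w=500, rng=None):
    """Sign-flipping test for a d-dimensional parameter based on the
    quadratic form S' V S with V the identity (cf. the global test with
    Sigma = I). nu is an (n, d) array of score contributions."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = nu.shape
    flips = rng.choice([-1, 1], size=(w, n))
    flips[0] = 1                           # identity flip, always included
    S = flips @ nu / np.sqrt(n)            # (w, d): flipped score sums
    stats = np.einsum('wd,wd->w', S, S)    # S' I S = squared Euclidean norm
    return np.mean(stats >= stats[0])      # p-value
```

The choice $V=I$ requires no matrix inversion, which is one reason the flip statistic avoids the estimation error that affects the sandwich approach below.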

In Sections 5.1 and 5.2 we compared our tests with a Wald test based on a sandwich estimate of $\operatorname{var}(\hat\beta)$. Here we proceeded analogously, using a sandwich estimate of the 5×5 matrix $\operatorname{var}(\hat\beta)$ in the multi-dimensional Wald test. This test uses that $\hat\beta^\top\{\widehat{\operatorname{var}}(\hat\beta)\}^{-1}\hat\beta$ asymptotically has a $\chi_d^2$-distribution under the null hypothesis $H_0:\beta=0$.

The results under $H_0$ are shown in Fig. 5, where each plot is based on $10^4$ simulation loops. They are comparable with those in Section 5.2, except that the sandwich-based method is now even more anticonservative. This is because $\operatorname{var}(\hat\beta)$ is now a 5×5 matrix, which is difficult to estimate accurately. For n=50 and α=0.01, the rejection probability of the sandwich-based method was 0.27 instead of the required 0.01.

Fig. 5. Estimated rejection probabilities under the null hypothesis (the model was misspecified because of overdispersion, heteroscedasticity and ignored nuisance; curves: parametric, sandwich, flip basic, flip effective): (a) β=0, n=50; (b) β=0, n=200

For a meaningful power comparison of the four tests, we again considered the scenario where the model assumed was correct, i.e. the data distribution was Poisson and $\gamma_0^l$ was 0. See Fig. 6, where each plot is based on $10^4$ simulation loops. As usual, the sign flipping test based on basic scores had low power due to nuisance estimation. The power of the sign flipping test based on effective scores was comparable with that of the parametric score test. As in Section 5.3, the test based on a sandwich estimate was the most powerful, but this has limited meaning, since it was also quite anticonservative under the correct model (the plot is not shown).

Fig. 6. Power comparison under the correct model (the null hypothesis was $H_0:\beta=0$; curves: parametric, sandwich, flip basic, flip effective): (a) β=(0.2,0,0,0,0), n=50; (b) β=(0.2,0,0,0,0), n=200

To conclude, sign flipping provided much more reliable type I error control than the sandwich approach, while giving satisfactory power (comparable with that of the parametric test, under the correct model).

6 Data analysis

We analysed the data set warpbreaks. These data are used in the example code of the gee R package, available on the Comprehensive R Archive Network. The data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. There are 54 observations of three variables: the number of breaks, the type of wool (A or B) and the tension (low, medium or high). For each of the six possible combinations of wool and tension, there are nine observations. Using various methods, we tested whether the number of breaks depends on the type of wool.

We first considered a basic Poisson model with
$$\log(\mu_i)=\gamma_1+\beta\,1\{\text{wool}=\mathrm{B}\}+\gamma_2\,1\{\text{tension}=\mathrm{M}\}+\gamma_3\,1\{\text{tension}=\mathrm{H}\}.$$
The $\gamma_i$, $1\le i\le 3$, were nuisance parameters that were estimated by using maximum likelihood. We first tested $H_0:\beta=0$ using the parametric score test, obtaining a p-value of $6.29\times10^{-5}$. (All the tests that were performed were two sided.)

However, the data were clearly overdispersed: for each combination of wool and tension, the empirical variance of the nine observations was substantially larger than the empirical mean. Thus the p‐value based on the parametric test had limited meaning. Fitting a quasi‐Poisson model, which assumes constant overdispersion, gave a p‐value of 0.059.

As in Section 5, we also applied a Wald test, where we used a sandwich estimate (Agresti (2015), page 280) of the variance of β ^ , to correct for the misspecified variance function. This resulted in a p‐value of 0.048.

Further, we used the sign flipping test based on the basic scores $\nu_{\hat\gamma,i}$, i=1,…,54 (still using the basic Poisson model). We took $w=10^6$. This resulted in a p-value of 0.113. This test is quite robust to model misspecification, but we know that it tends to be conservative when the score is correlated with the nuisance scores, as was the case here.

Finally, we performed the test of theorem 2 based on the effective score. This test is asymptotically exact under the correct model and has been shown to be robust against several forms of variance misspecification. It provided a p‐value of 0.065.

On the basis of this evidence, when maintaining a significance level of 0.05, it seems that we cannot reject $H_0$. Indeed, only the sandwich-based test provided a p-value below 0.05, but this test is often anticonservative, as discussed in Section 5.1.

7 Discussion

We have proposed a test which relies on the assumption that individual score distributions are independent and have mean 0 (in the case of a point hypothesis) under the null. If the score contributions are misspecified because of overdispersion, heteroscedasticity or ignored nuisance covariates, then the traditional parametric tests lose their properties. The sign flipping test is often robust to these types of misspecification and can still be asymptotically exact.

When nuisance parameters are estimated, the basic score contributions become dependent. If a nuisance score is correlated with the score of the parameter of interest, the estimation reduces the variance of the score, so the sign flipping test becomes conservative. As a solution we propose to use the effective score, which is asymptotically the part of the score that is orthogonal to the nuisance score. The effective score is asymptotically unaffected by the nuisance estimation, so we again obtain an asymptotically exact test. We have proved that this is still the case when the scores and the Fisher information are misspecified by a constant, and simulations illustrate additional robustness.

When the parameter of interest is multi‐dimensional, our test statistic involves a freely chosen matrix, which influences the power properties. If this matrix is taken to be the inverse of the effective Fisher information and the model assumed is correct, then our test is asymptotically equivalent to the parametric score test. Under the correct model, in certain situations our test is asymptotically equivalent to the global test (Goeman et al., 2006), which is popular for testing hypotheses about high dimensional parameters.

Acknowledgement

We thank two referees for comments that helped to improve the paper.

Appendix A: A lemma

Lemma 1. Suppose that, for n→∞, a vector $T^n=(T_1^n,\dots,T_w^n)$ converges in distribution to a vector $T$ of IID continuous variables. Then $P(T_1^n>T_{[1-\alpha]}^n)\to\lfloor\alpha w\rfloor/w$.

Proof. Note that $P(T_1^n>T_{[1-\alpha]}^n)=P(T^n\in A)$, where
$$A=\{(t_1,\dots,t_w)\in\mathbb{R}^w:|\{2\le j\le w:t_j<t_1\}|\ge\lceil(1-\alpha)w\rceil\}.$$

Let ∂A be the boundary of A, i.e. the set of discontinuity points of $1_A$. If t∈∂A, then $t_i=t_j$ for some $1\le i<j\le w$. It follows that $P(T\in\partial A)=0$. Since $1_A$ is continuous on $(\partial A)^c$, it follows from the continuous mapping theorem (Van der Vaart (1998), theorem 2.3) that
$$1_A(T^n)\overset{d}{\to}1_A(T).$$
The elements of T are IID draws from the same distribution. Hence it follows from the Monte Carlo testing principle (Lehmann and Romano, 2005) that, under $H_0$, $P(T\in A)=\lfloor\alpha w\rfloor/w$. Thus $P(T^n\in A)\to\lfloor\alpha w\rfloor/w$.
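The level $\lfloor\alpha w\rfloor/w$ for IID continuous statistics is easy to check numerically. In the simulation sketch below we read $T_{[1-\alpha]}^n$ as the $\lceil(1-\alpha)w\rceil$-th smallest of $T_1^n,\dots,T_w^n$ (our reading of the notation); the parameter values and seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
w, alpha, reps = 200, 0.05, 20000
# For each repetition, draw w IID continuous statistics and reject when
# T_1 exceeds the ceil((1 - alpha) * w)-th smallest of all w values.
k = int(np.ceil((1 - alpha) * w))            # k = 190
T = rng.standard_normal((reps, w))
thresholds = np.sort(T, axis=1)[:, k - 1]    # k-th smallest per repetition
rate = np.mean(T[:, 0] > thresholds)
# Expected rejection rate: floor(alpha * w) / w = 10 / 200 = 0.05
```

Since the rank of $T_1$ among the $w$ values is uniform, rejection occurs exactly when that rank exceeds $\lceil(1-\alpha)w\rceil$, which has probability $(w-\lceil(1-\alpha)w\rceil)/w=\lfloor\alpha w\rfloor/w$.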

Appendix B: Proofs of the results

B.1. Proof of theorem 1

Suppose that $H_0$ holds. We shall show that $T^n=(T_1^n,\dots,T_w^n)$ converges in distribution to a multivariate normal distribution with mean 0 and covariance matrix $\lim_{n\to\infty}s_n^2 I$, where I is the w×w identity matrix. It then follows from lemma 1 that $P(T_1^n>T_{[1-\alpha]}^n)\to\lfloor\alpha w\rfloor/w$.

Under $H_0$, for each $1\le j\le w$, $E(T_j^n)=0$. For every $1\le j\le w$, $\operatorname{var}(T_j^n)=n^{-1}\sum_{i=1}^n\operatorname{var}(\nu_i)=s_n^2$. Let $Q^n$ be the covariance matrix of $T^n$. $Q^n$ has 0s off the diagonal. Indeed, for $1\le j<k\le w$,
$$\operatorname{cov}(T_j^n,T_k^n)=\operatorname{cov}\Bigl(n^{-1/2}\sum_{i=1}^n g_{ji}\nu_i,\;n^{-1/2}\sum_{i=1}^n g_{ki}\nu_i\Bigr)=0,$$
since the $g_{ki}$, $2\le k\le w$, are independent with mean 0. Hence $Q^n$ converges to $\lim_{n\to\infty}s_n^2 I$. Note that $T^n$ is a sum of n vectors. By the multivariate Lindeberg–Feller central limit theorem (Van der Vaart, 1998), $T^n$ converges in distribution to a multivariate normal distribution with mean vector 0 and covariance matrix $\lim_{n\to\infty}s_n^2 I$.

We have shown that $T^n$ converges in distribution to a vector T, say, of IID normal random variables. It now follows from lemma 1 that $P(T_1^n>T_{[1-\alpha]}^n)\to\lfloor\alpha w\rfloor/w$.

B.2. Proof of proposition 1

Note that
$$(\nu_1,\dots,\nu_n)\overset{d}{=}(g_{j1}\nu_1,\dots,g_{jn}\nu_n)$$
for every 1⩽jw. This means that the test becomes a basic random‐transformation test and the results follow from the proof of theorem 2 in Hemerik and Goeman (2018b).

B.3. Proof of theorem 2

Suppose that H0 holds. Note that
$$S^*_{\hat\gamma}=S_{\hat\gamma}-\hat I_{12}\hat I_{22}^{-1}S^{(k-1)}_{\hat\gamma}=S_{\hat\gamma}-I_{12}I_{22}^{-1}S^{(k-1)}_{\hat\gamma}+o_{P_{\beta_0,\gamma_0}}(1)=S_{\gamma_0}-I_{12}\sqrt{n}(\hat\gamma-\gamma_0)-I_{12}I_{22}^{-1}\bigl\{S^{(k-1)}_{\gamma_0}-I_{22}\sqrt{n}(\hat\gamma-\gamma_0)\bigr\}+o_{P_{\beta_0,\gamma_0}}(1)=S^*_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1).$$
Let $2\le j\le w$ and
$$S^{j+}_{\gamma}=n^{-1/2}\sum_{i=1}^n 1\{g_{ji}=1\}\nu_{\gamma,i},\qquad S^{j-}_{\gamma}=n^{-1/2}\sum_{i=1}^n 1\{g_{ji}=-1\}\nu_{\gamma,i}.$$
Note that
$$S^j_{\hat\gamma}=S^{j+}_{\hat\gamma}-S^{j-}_{\hat\gamma}=\bigl\{S^{j+}_{\gamma_0}-\tfrac{1}{2}\sqrt{n}\,I_{12}(\hat\gamma-\gamma_0)\bigr\}-\bigl\{S^{j-}_{\gamma_0}-\tfrac{1}{2}\sqrt{n}\,I_{12}(\hat\gamma-\gamma_0)\bigr\}+o_{P_{\beta_0,\gamma_0}}(1)=S^{j+}_{\gamma_0}-S^{j-}_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)=S^j_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1).$$
The intuitive reason why $S^j_{\hat\gamma}=S^j_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)$ is that the estimation of $\hat\gamma$ does not cause the summands underlying $S^j_{\hat\gamma}$ to be correlated. Similarly we find that $S^{(k-1),j}_{\hat\gamma}=S^{(k-1),j}_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)$ and conclude that $S^{*j}_{\hat\gamma}=S^{*j}_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)$.

Let $T^n$ be as in the proof of theorem 1, with $\nu_i$ replaced by $\nu^*_{\gamma_0,i}$. Suppose that $H_0$ holds and $\hat I=I$, so that the summands underlying $T_j^n$ are independent. For every $1\le i\le n$, $E(\nu^*_{\gamma_0,i})=0$. The elements of $T^n$ are uncorrelated and have common variance $\operatorname{var}(\nu^*_{\gamma_0,1})$. By the multivariate central limit theorem (Van der Vaart, 1998; Greene, 2012), $T^n$ converges in distribution to $N\{0,\operatorname{var}(\nu^*_{\gamma_0,1})I\}$. We supposed that $\hat I=I$ to use the central limit theorem, but the asymptotic distribution of $T^n$ is the same if $\hat I$ is any consistent estimator of $I$.

Let $\hat T^n$ be as in the proof of theorem 1, with $\nu_i$ replaced by $\nu^*_{\hat\gamma,i}$. For every $1\le j\le w$, $S^{*j}_{\hat\gamma}=S^{*j}_{\gamma_0}+o_{P_{\beta_0,\gamma_0}}(1)$. Thus $\hat T^n$ and $T^n$ are asymptotically equivalent. The result now follows from lemma 1.

B.4. Proof of proposition 2

For $2\le j\le w$ consider
$$S^{*j+}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^n 1\{g_{ji}=1\}\nu^*_{\hat\gamma,i},\qquad S^{*j-}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^n 1\{g_{ji}=-1\}\nu^*_{\hat\gamma,i},$$
$$S^{(k-1),j+}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^n 1\{g_{ji}=1\}\nu^{(k-1)}_{\hat\gamma,i},\qquad S^{(k-1),j-}_{\hat\gamma}=n^{-1/2}\sum_{i=1}^n 1\{g_{ji}=-1\}\nu^{(k-1)}_{\hat\gamma,i}.$$
We have
$$S^{*j+}_{\hat\gamma}=S^{j+}_{\hat\gamma}-\hat I_{12}\hat I_{22}^{-1}S^{(k-1),j+}_{\hat\gamma}=S^{j+}_{\gamma_0}-\tfrac{1}{2}\sqrt{n}\,I_{12}(\hat\gamma-\gamma_0)-I_{12}I_{22}^{-1}\bigl\{S^{(k-1),j+}_{\gamma_0}-\tfrac{1}{2}\sqrt{n}\,I_{22}(\hat\gamma-\gamma_0)\bigr\}+o_{P_{\beta_n,\gamma_0}}(1)=S^{*j+}_{\gamma_0}+o_{P_{\beta_n,\gamma_0}}(1)$$
and analogously $S^{*j-}_{\hat\gamma}=S^{*j-}_{\gamma_0}+o_{P_{\beta_n,\gamma_0}}(1)$. By Marohn (2002), page 341, for $2\le j\le w$, $S^{*j+}_{\gamma_0}$ and $S^{*j-}_{\gamma_0}$ have an asymptotic $N(\tfrac{1}{2}\xi\sigma_0^2,\tfrac{1}{2}\sigma_0^2)$ distribution. Since they are independent, it follows that $T_j^n=S^{*j+}_{\hat\gamma}-S^{*j-}_{\hat\gamma}$ has an asymptotic $N(0,\sigma_0^2)$ distribution, $2\le j\le w$. With the multivariate central limit theorem we find that $(T_2^n,\dots,T_w^n)$ converges in distribution to a vector of w−1 IID $N(0,\sigma_0^2)$ variables as n→∞.
Let ε>0. Let $(T_1,\dots,T_w)$ have the asymptotic distribution of $(T_1^n,\dots,T_w^n)$. Let $(T_1',\dots,T_w')$ be a vector of w IID $N(0,\sigma_0^2)$ variables. Apart from the first element, these two vectors have the same distribution. For w∈{2,3,…}, define $T_{[1-\alpha]}^{[w]}$ like $T_{[1-\alpha]}^n$, but based on the values $T_1,\dots,T_w$ instead of $T_1^n,\dots,T_w^n$. Also define $T_{[1-\alpha]}^{[[w]]}$ like $T_{[1-\alpha]}^n$, but based on the values $T_1',\dots,T_w'$. Note that, as w→∞, the empirical quantile $T_{[1-\alpha]}^{[[w]]}$ converges in distribution to the constant $\sigma_0\Phi^{-1}(1-\alpha)$. Further note that, for w→∞, $T_{[1-\alpha]}^{[[w]]}-T_{[1-\alpha]}^{[w]}$ converges in distribution to 0. Thus there is a $W\in\mathbb{N}$ such that, for all w>W,
$$P\bigl(|T_{[1-\alpha]}^{[w]}-\sigma_0\Phi^{-1}(1-\alpha)|<\epsilon\bigr)>1-\epsilon.\qquad(3)$$
Since the distribution of $(T_1^n,\dots,T_w^n)$ converges to the distribution of $(T_1,\dots,T_w)$ as n→∞,
$$T_{[1-\alpha]}^n\overset{d}{\to}T_{[1-\alpha]}^{[w]}\qquad(4)$$
as n→∞. Since in the present proof w is not fixed, we shall write $T_{[1-\alpha]}^n=T_{[1-\alpha]}^{n,w}$. By results (3) and (4), for w>W, $\liminf_{n\to\infty}P\{|T_{[1-\alpha]}^{n,w}-\sigma_0\Phi^{-1}(1-\alpha)|<\epsilon\}>1-\epsilon$. Thus $\lim_{w\to\infty}\liminf_{n\to\infty}P\{|T_{[1-\alpha]}^{n,w}-\sigma_0\Phi^{-1}(1-\alpha)|<\epsilon\}=1$.

The distribution of $T_1^n$, which does not depend on w, converges to a continuous distribution as n→∞. It follows that, for every ε′>0, there is a W such that there is an N such that, for all w>W and n>N, $E(|1\{T_1^n>T_{[1-\alpha]}^{n,w}\}-1\{T_1^n>\sigma_0\Phi^{-1}(1-\alpha)\}|)<\epsilon'$. This means that $\lim_{w\to\infty}\liminf_{n\to\infty}E(|1\{T_1^n>T_{[1-\alpha]}^{n,w}\}-1\{T_1^n>\sigma_0\Phi^{-1}(1-\alpha)\}|)=0$, as was to be shown.

B.5. Proof of proposition 3

For every $1\le j\le w$ we have
$$\tilde S^{*j}_{\hat\gamma}=c_1S^j_{\hat\gamma}-c_2\hat I_{12}\,c_2^{-1}\hat I_{22}^{-1}\,c_1S^{(k-1),j}_{\hat\gamma}=c_1S^{*j}_{\hat\gamma}.$$
Hence the test is identical to that of theorem 2, since that test is unchanged if all $T_j^n$, $1\le j\le w$, are multiplied by the same constant.
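This scale invariance is easy to verify numerically: multiplying all score contributions by $c_1$ and all information blocks by $c_2$ multiplies every $T_j^n$ by $c_1$ and leaves the flip p-value unchanged. A small sketch (Python with NumPy, scalar information blocks, our own setup):

```python
import numpy as np

rng = np.random.default_rng(3)
n, w = 50, 200
score = rng.standard_normal(n)           # score contributions for beta
nuis = rng.standard_normal(n)            # nuisance score contributions
i12, i22 = 0.7, 2.0                      # information blocks (scalars here)
flips = rng.choice([-1, 1], size=(w, n))
flips[0] = 1                             # identity flip, always included

def pvalue(sc, nu, a12, a22):
    eff = sc - (a12 / a22) * nu          # effective score contributions
    t = np.abs(flips @ eff) / np.sqrt(n)
    return np.mean(t >= t[0])

c1, c2 = 3.0, 5.0                        # constant misspecification factors
p_base = pvalue(score, nuis, i12, i22)
p_scaled = pvalue(c1 * score, c1 * nuis, c2 * i12, c2 * i22)
# p_base and p_scaled coincide: every flip statistic is multiplied by c1 > 0,
# so the ordering of the statistics, and hence the p-value, is unchanged
```

The cancellation $c_2\hat I_{12}(c_2\hat I_{22})^{-1}=\hat I_{12}\hat I_{22}^{-1}$ is visible in the `a12 / a22` ratio, which is the same in both calls.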

B.6. Proof of theorem 3

Suppose that $H_0$ holds. Consider the d×w matrix
$$\Bigl(n^{-1/2}\sum_{i=1}^n g_{ji}\nu_{\gamma_0,i}\Bigr)_{1\le j\le w}.\qquad(5)$$
It follows from the multivariate central limit theorem (Van der Vaart, 1998) that, as n→∞, this matrix converges in distribution to a matrix with identically distributed columns which are independent of each other. Note that, for every $1\le j\le w$, $T_j^n$ is a function of the jth column of matrix (5). Thus, with the continuous mapping theorem (Van der Vaart (1998), theorem 2.3) it follows that $(T_1^n,\dots,T_w^n)$ also converges in distribution to a vector with continuous IID elements. The result now follows from lemma 1.

B.7. Proof of theorem 4

Consider the case $\hat\gamma=\gamma_0$. As in the proof of theorem 3, under $H_0$, $(T_1^n,\dots,T_w^n)$ converges in distribution to a vector of w IID variables. As in the proof of theorem 2, the same is true if we take $\hat\gamma$ to be a different $\sqrt{n}$-consistent estimator of $\gamma_0$. (Again, the reason is that the effective score based on $\hat\gamma$ is asymptotically equivalent to the effective score based on $\gamma_0$.) The result now follows from lemma 1 again.

B.8. Proof of proposition 4

By Hall and Mathiason (1990), $n^{-1/2}\sum_{i=1}^n\nu^*_{\hat\gamma,i}$ has an asymptotic $N(0,I^*)$ distribution under $\beta=\beta_0$. Analogously to the one-dimensional case at proposition 2, for $2\le j\le w$, the vector $n^{-1/2}\sum_{i=1}^n g_{ji}\nu^*_{\hat\gamma,i}$ is asymptotically the difference of two mutually independent $N(\tfrac{1}{2}I^*\xi,\tfrac{1}{2}I^*)$ vectors (Hall and Mathiason, 1990), so it also has an asymptotic $N(0,I^*)$ distribution (under $\beta=\beta_n$). As in the proof of theorem 3, by the multivariate central limit theorem, the d×(w−1) matrix $(n^{-1/2}\sum_{i=1}^n g_{ji}\nu^*_{\hat\gamma,i})_{2\le j\le w}$ converges to a matrix with w−1 independent $N(0,I^*)$ columns as n→∞. Hence, by the continuous mapping theorem, as n→∞, $(T_2^n,\dots,T_w^n)$ converges in distribution to a vector of w−1 IID variables (under $\beta=\beta_n$), which follow the asymptotic distribution which $T_1^n$ has under $\beta=\beta_0$.

The result now follows as at the end of the proof of proposition 2.