Robust testing in generalized linear models by sign flipping score contributions

Generalized linear models are often misspecified because of overdispersion, heteroscedasticity and ignored nuisance variables. Existing quasi‐likelihood methods for testing in misspecified models often do not provide satisfactory type I error rate control. We provide a novel semiparametric test, based on sign flipping individual score contributions. The parameter tested is allowed to be multi‐dimensional and even high dimensional. Our test is often robust against the mentioned forms of misspecification and provides better type I error control than its competitors. When nuisance parameters are estimated, our basic test becomes conservative. We show how to take nuisance estimation into account to obtain an asymptotically exact test. Our proposed test is asymptotically equivalent to its parametric counterpart.


Introduction
We consider the problem of testing hypotheses about parameters in potentially misspecified generalized linear models (GLMs). The types of misspecification that we consider include overdispersion and heteroscedasticity. When the model is misspecified, traditional parametric tests tend to lose their properties, e.g. because they estimate the Fisher information under incorrect assumptions. By a parametric test we mean a test which fully relies on an assumed parametric model (Pesarin, 2015) to compute the null distribution of the test statistic. When a parametric model to be tested is potentially misspecified, the most obvious approach is to extend the model with more parameters, e.g. to add an overdispersion parameter. However, such approaches still require assumptions, e.g. that the overdispersion is constant. Hence a fully parametric approach is not always the best option.
Another well-known approach to testing in possibly misspecified GLMs is to use a Wald-type test, where a sandwich estimate of the variance of the coefficient estimate is used. The sandwich estimate corrects for the potentially misspecified variance. As long as the linear predictor and link are correct, such a test is asymptotically exact under mild assumptions. We call a test asymptotically exact if its rejection probability is asymptotically known under the null hypothesis. For small samples, however, sandwich estimates often perform poorly and the test can be very liberal (Boos, 1992; Freedman, 2006; Maas and Hox, 2004; Kauermann and Carroll, 2000).
Recent decades have seen an increase in the use of permutation approaches for various testing problems (Tusher et al., 2001; Pesarin, 2001; Chung and Romano, 2013; Pauly et al., 2015; Winkler et al., 2016; Hemerik and Goeman, 2018a; Ganong and Jäger, 2018). These methods are useful since they require few parametric assumptions. Especially when multiple hypotheses are tested, permutation methods are often powerful since they can take into account the dependence structure in the data (Westfall and Young, 1993; Hemerik and Goeman, 2018b; Hemerik et al., 2019). In the past, permutation methods have already been used for testing in linear models (Winkler et al. (2014) and references therein). Rather than permutations, sometimes other transformations are used, such as rotations (Solari et al., 2014) and sign flipping of residuals (Winkler et al., 2014). The existing permutation tests for GLMs, however, are limited to models with identity link function.
Like some existing methods for testing in linear models, this paper presents a sign flipping approach. Our approach is new, however, since, rather than flipping residuals, we flip individual score contributions (note that the score, the derivative of the log-likelihood, is a sum of n individual score contributions). Moreover, we allow testing in a wide range of models, not only regression models with identity link. Under mild assumptions, the only requirement for the test to be asymptotically exact is that the individual score contributions have mean 0. Consequently, if the link function is correct, our method is often robust against several types of model misspecification, such as arbitrary overdispersion, heteroscedasticity and, in some cases, ignored nuisance parameters.
The main reason for this robustness is that we do not need to estimate the variance of the score: the Fisher information. Rather, we perform a permutation-type test based on the score contributions where, rather than permutation, we use sign flipping. Intuitively, the advantage of this approach over explicitly estimating the variance is as follows: if the score contributions are independent and perfectly symmetric around zero under the null, then our test is exact for small n, even if the score contributions have misspecified variances and shapes (Pesarin and Salmaso, 2010a). A parametric test, in contrast, is then usually not exact.
When nuisance parameters are estimated, the individual score contributions become dependent and our basic sign flipping test is no longer asymptotically exact. To deal with this problem, we consider the effective score, which depends less on the nuisance estimate than the basic score does (Hall and Mathiason, 1990; Marohn, 2002). In this case we need slightly stronger assumptions: the variance misspecification is not always allowed to depend on the covariates. The resulting test is asymptotically exact.
The methods in this paper have been implemented in the R package flipscores (Hemerik et al., 2018), which is available in the Comprehensive R Archive Network.
In Section 2 we consider the scenario that no nuisance effects need to be estimated. In Section 3 we show how the estimation of nuisance effects can be taken into account. Section 4 provides tests of hypotheses about parameters of more than one dimension. Section 5 contains simulations and Section 6 an analysis of real data.
The programs that were used to analyse the data can be obtained from https://rss.onlinelibrary.wiley.com/hub/journal/14679868/seriesb-datasets.

Models with known nuisance parameters
Consider random variables ν_1, …, ν_n, which satisfy assumption 1 below. These will often be individual score contributions (see Section 3, Rao (1948) or Hall and Mathiason (1990), page 86), but the results in Section 2.1 hold for any random variables satisfying this assumption.
Assumption 1. The random variables ν_i, i ∈ ℕ, are independent of each other, have finite variances and satisfy the following Lindeberg-type condition: for every ε > 0, (1/n) Σ_{i=1}^n E[ν_i² 1{|ν_i| > ε√n}] → 0 as n → ∞. Further, as n → ∞, s_n² := (1/n) Σ_{i=1}^n var(ν_i) → s² for some constant s² > 0. Throughout Section 2, we consider any null hypothesis H_0 which implies that E(ν_i) = 0 for all 1 ≤ i ≤ n. If ν_1, …, ν_n are score contributions and H_0 is a point hypothesis, then, under mild assumptions, E(ν_i) = 0 is satisfied under H_0.
A key assumption throughout Section 2 is that the ν i , i ∈ N, are independent. As soon as nuisance parameters need to be estimated, however, score contributions become dependent. Section 3 is devoted to dealing with estimated nuisance.

Basic sign flipping test
Let α ∈ [0, 1). For any a ∈ ℝ, let ⌈a⌉ be the smallest integer that is larger than or equal to a and let ⌊a⌋ be the largest integer that is at most a. Given values T_n^1, …, T_n^w ∈ ℝ, we let T_n^(1) ≤ … ≤ T_n^(w) be the sorted values and write T_n^[1−α] = T_n^(⌈(1−α)w⌉). Throughout this paper, w ∈ {2, 3, …} denotes the number of random sign flipping transformations to be used. Define g_1 = (1, …, 1) ∈ ℝ^n and for every 2 ≤ j ≤ w let g_j = (g_{j1}, …, g_{jn}) be independent and uniformly distributed on {−1, 1}^n. Throughout the rest of Section 2, for every 1 ≤ j ≤ w, we let T_n^j = n^{−1/2} Σ_{i=1}^n g_{ji} ν_i. We now state that the basic sign flipping test is asymptotically exact for the point null hypothesis H_0 that implies that E(ν_i) = 0, 1 ≤ i ≤ n. All proofs are in the appendices.
Theorem 1. Suppose that assumption 1 holds. Consider the test that rejects H_0 if and only if T_n^1 > T_n^[1−α]. Then, as n → ∞, the probability of rejection of this test converges to ⌊αw⌋/w ≤ α under H_0. Moreover, the statistics T_n^1, …, T_n^w are asymptotically normal and independent with mean 0 and common variance lim_{n→∞} s_n² under H_0. We now provide an extension of theorem 1 to interval hypotheses. The proof is a straightforward adaptation of the proof of theorem 1.
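For concreteness, the test of theorem 1 can be sketched in a few lines of code. This is a minimal illustration, not the flipscores implementation; the function name and defaults are ours. It assumes the contributions ν_i are supplied directly and uses w flips, with g_1 the identity.

```python
import math
import random

def sign_flip_test(nu, w=200, alpha=0.05, seed=0):
    """One-sided sign-flipping test of theorem 1 (sketch).

    nu: score contributions nu_1, ..., nu_n with E(nu_i) = 0 under H0.
    Rejects when T^1 = n^{-1/2} * sum(nu) exceeds the ceil((1-alpha)*w)-th
    smallest of the w statistics T^1, ..., T^w.
    """
    rng = random.Random(seed)
    n = len(nu)
    stats = [sum(nu) / math.sqrt(n)]          # g_1 = (1, ..., 1)
    for _ in range(w - 1):                    # g_2, ..., g_w random
        stats.append(sum(rng.choice((-1, 1)) * v for v in nu) / math.sqrt(n))
    threshold = sorted(stats)[math.ceil((1 - alpha) * w) - 1]
    return stats[0] > threshold
```

Under H_0 the rejection probability is close to ⌊αw⌋/w; under a positive shift of the contributions the observed statistic tends to dominate the flipped ones.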
Corollary 1 (interval hypotheses). Suppose that assumption 1 holds. Consider a null hypothesis H which implies that E(ν_i) ≤ 0 for all 1 ≤ i ≤ n. Then, for every ε > 0, there is an N ∈ ℕ such that, under H, for every n > N, P(T_n^1 > T_n^(⌈(1−α)w⌉)) is at most ⌊αw⌋/w + ε. Similarly, if H implies that E(ν_i) ≥ 0 for all 1 ≤ i ≤ n, then there is an N ∈ ℕ such that, under H, for every n > N, P(T_n^1 < T_n^(⌊αw⌋+1)) is at most ⌊αw⌋/w + ε. The following corollary extends theorem 1 to two-sided tests. The proof is analogous to that of theorem 1.
Corollary 2 (two-sided test). Suppose that assumption 1 holds. Consider α_1, α_2 ∈ {0/w, 1/w, …, (w−1)/w}. Under H_0, as n → ∞, P{(T_n^1 < T_n^(α_1 w + 1)) ∪ (T_n^1 > T_n^((1−α_2)w))} → α_1 + α_2. Note that our test does not rely on an approximate symmetry assumption (as for example that of Canay et al. (2017) does). Indeed, even if the scores are very skewed, asymptotically the test of theorem 1 is exact. However, if the ν_i are symmetric, then even for small n the size is always at most α, as noted in the following proposition. A special case of this result has already been discussed in Fisher (1935), section 21, where every element of {1, −1}^n is used once.
Proposition 1. Suppose that ν_1, …, ν_n are independent and continuous and that under H_0, for each 1 ≤ i ≤ n, ν_i =_d −ν_i. Then the size of the test of theorem 1 is at most α for any n ∈ ℕ. Moreover, if g_2, …, g_w are uniformly drawn from {1, −1}^n \ {(1, …, 1)} without replacement (so that only g_1 takes the value (1, …, 1)), then the probability of rejection under H_0 is exactly ⌊αw⌋/w. (Note that w cannot exceed 2^n then.)
If the g_j are drawn with replacement or the ν_i are discrete, then under H_0 the probability of rejection of the test of proposition 1 is (slightly) smaller than ⌊αw⌋/w for finite n, because of the possibility of ties among the test statistics T_n^j, 1 ≤ j ≤ w. Otherwise the rejection probability under H_0 is ⌊αw⌋/w.
When the rejection probability under H_0 is ⌊αw⌋/w, it can be advantageous to take w such that α is a multiple of 1/w, to exhaust the nominal level.
In theorem 1, we did not assume continuity of the observations ν_i. There, under the mild assumption 1, for n → ∞, P(T_n^j = T_n^k) → 0 for any 1 ≤ j < k ≤ w, regardless of the distribution of the ν_i. This allows the use of theorem 1 for discrete GLMs.

Robustness
As a main example we consider the exponential family, i.e. suppose that independent variables Y_1, …, Y_n have densities of the form f(y_i) = exp[{y_i η_i − b(η_i)}/a_i + c(y_i)]. Here β is the coefficient of interest and at present we assume that the other coefficients γ are known. The canonical link function g satisfies η_i = g(μ_i), where g^{−1}(η_i) = μ_i = E(y_i) = b′(η_i) and a_i = var(y_i)/b″(η_i) (Agresti, 2015).
For example, the Poisson model has the log-link function g, b(η_i) = exp(η_i), a_i = 1 and c(y_i) = −log(y_i!). Hence E(y_i) = b′(η_i) = exp(η_i). Thus the score contribution for observation i is ν_i = x_i{y_i − exp(η_i)}. For the normal distribution, a_i = σ², so the score contribution is ν_i = x_i(y_i − μ_i)/σ². Apart from some mild assumptions, the main assumption that is made in theorem 1 is that E(ν_i) = 0, i = 1, …, n. This is satisfied as soon as μ_i|_{β=β_0} is the true expected value of Y_i. Then the test is asymptotically exact even if the a_i are misspecified, i.e. if the variance or distributional shape of Y_i is misspecified. The a_i are even allowed to be misspecified by a factor which depends on the covariates, as long as assumption 1 holds.
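As a concrete sketch (the helper name is ours, assuming a single covariate and no nuisance parameters), the Poisson score contributions at β = β_0 can be computed directly:

```python
import math

def poisson_score_contributions(x, y, beta0):
    """nu_i = x_i * (y_i - exp(beta0 * x_i)): score contributions of a
    Poisson model with log link, eta_i = beta * x_i, evaluated at beta0."""
    return [xi * (yi - math.exp(beta0 * xi)) for xi, yi in zip(x, y)]
```

These contributions have mean 0 under H_0 whenever exp(β_0 x_i) is the true mean of Y_i, regardless of whether var(Y_i) equals μ_i.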
As a concrete example, consider the normal model with identity link function, which assumes that var.Y 1 / = : : : = var.Y n /. If the real distribution is heteroscedastic, then the test will still be exact for finite n, since the ν i are symmetric. The parametric test, however, loses its properties, e.g. because the estimated variance does not have the assumed χ 2 -distribution. In Section 5 it is illustrated that our approach can be much more robust against heteroscedasticity than a parametric test.
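A small Monte Carlo sketch (with a hypothetical helper, not the flipscores code) illustrates this robustness: with mean-zero responses whose variance depends on x_i, the sign-flipping test based on ν_i = x_i y_i still holds its level, because the contributions remain symmetric around 0.

```python
import math
import random

def flip_test_pvalue(nu, w, rng):
    """One-sided p-value: proportion of flipped statistics >= observed."""
    n = len(nu)
    t_obs = sum(nu) / math.sqrt(n)
    stats = [t_obs]
    for _ in range(w - 1):
        stats.append(sum(rng.choice((-1, 1)) * v for v in nu) / math.sqrt(n))
    return sum(t >= t_obs for t in stats) / w

rng = random.Random(0)
n, reps, rejections = 40, 300, 0
for _ in range(reps):
    x = [rng.gauss(0, 1) for _ in range(n)]
    # heteroscedastic errors: sd grows with |x_i|, so the homoscedastic
    # normal model is misspecified, but E(Y_i) = 0 is still correct
    y = [rng.gauss(0, 1 + abs(xi)) for xi in x]
    nu = [xi * yi for xi, yi in zip(x, y)]  # score contributions at beta0 = 0
    rejections += flip_test_pvalue(nu, w=100, rng=rng) <= 0.05
print(rejections / reps)  # stays near the nominal level 0.05
```

The parametric test, by contrast, would plug a single variance estimate into the Fisher information here.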
Another example is the situation where the model is Poisson, i.e. var(Y_i) = μ_i is assumed, but in reality var(Y_i) > μ_i, a form of overdispersion that occurs very often in practice. Then the parametric score test underestimates the Fisher information and is anticonservative. To take the overdispersion into account, it could be explicitly estimated. However, if the overdispersion factor is not constant, but depends on the covariates, then again the parametric test loses its properties. Theorem 1, however, often still applies, so an asymptotically exact test is obtained.
Further, note that if E(Y_i) depends on a nuisance variable Z_i^l which is latent and ignored, where Z_i^l is independent of X_i, then the test may still be valid. The reason is that, marginally over Z_i^l, E(Y_i) may still be computed correctly (see, for example, Section 5.2). Such latent nuisance variables will increase the variance of Y_i, however, which can pose a problem for the classical parametric score test, which needs to compute the Fisher information. When the latent variable is not independent of X_i, this usually does pose a problem for our test (even as n → ∞), since E(Y_i − μ_i) becomes dependent on X_i under H_0.

Taking into account nuisance estimation
Consider independent and identically distributed (IID) pairs (X_i, Y_i), i = 1, …, n, where X_i is some covariate vector and Y_i ∈ ℝ has distribution P_{β,γ_0,X_i}, which depends on the parameter of interest β ∈ ℝ and an unknown nuisance parameter γ_0, which lies in a set G ⊆ ℝ^{k−1}, where k is the total number of modelled parameters. We shall discuss the issues arising from estimating γ_0 and propose a solution, which enables us to obtain an asymptotically exact test based on score flipping. In this paper, the above model is the model that is considered by the user. It is the model that is used to compute the scores. We consider this model to be correct, unless explicitly stated otherwise, e.g. in Section 3.2. The parameter γ_0 is part of the model that is considered by the user, so it is always modelled and estimated. For example, γ_0 never represents ignored overdispersion.
We consider the null hypothesis H 0 : β = β 0 ∈ R. Generalizations to interval hypotheses and two-sided tests can be obtained as in corollaries 1 and 2. The case that the parameter of interest is multi-dimensional is considered in Section 4.
Suppose that, for all γ ∈ G, P_{β,γ,X_i} has a density f_{β,γ,X_i} around β_0 with respect to some dominating measure. For 1 ≤ i ≤ n write ν_{γ,i} = (∂/∂β) log f_{β,γ,X_i}(Y_i)|_{β=β_0}, where we assume that the derivative exists. The value ν_{γ,i} is the score for the ith observation. Under H_0, E(ν_{γ_0,i}) = 0, i = 1, …, n. The score for all n observations simultaneously is n^{1/2} S_γ, where S_γ = n^{−1/2} Σ_{i=1}^n ν_{γ,i}. Assume that γ̂ is a √n-consistent estimate of γ_0, taking values in G. For every 1 ≤ i ≤ n, let ν_{γ,i}^{(k−1)} = (∂/∂γ) log f_{β_0,γ,X_i}(Y_i) denote the (k−1)-vector of score contributions for the nuisance parameters, which is assumed to exist. Let S_γ^{(k−1)} = n^{−1/2} Σ_{i=1}^n ν_{γ,i}^{(k−1)} be the vector of nuisance scores. For 1 ≤ j ≤ w, let the superscript j denote that g_j has been applied: S_γ^j = n^{−1/2} Σ_{i=1}^n g_{ji} ν_{γ,i}.

Asymptotically exact test
When the nuisance parameter γ_0 is unknown, it needs to be estimated, which is typically done by maximizing the likelihood of the data under the null hypothesis. The distribution of S_γ̂ can be substantially different from that of S_{γ_0}, the score based on the true nuisance parameters. Indeed, under the null hypothesis, the asymptotic variance of S_γ̂ is not the Fisher information, but the effective Fisher information (Rippon and Rayner (2010), Rayner (1997), Hall and Mathiason (1990), Marohn (2002) and Cox and Hinkley (1979), section 9.3), which is also the asymptotic variance of the effective score, which is defined below. The effective information is smaller than the information, given that the score for the parameter of interest and the nuisance score are correlated. Intuitively, the reason is that the nuisance variable will be used to explain part of the apparent effect of the variable of interest, also asymptotically. The estimation of γ_0 makes the summands ν_{γ̂,1}, …, ν_{γ̂,n} underlying S_γ̂ correlated, in such a way that var(S_γ̂) < var(S_{γ_0}) (if the score is correlated with the nuisance score). Note, however, that, after random flipping, the summands are not correlated any more. This means that the variance of S_γ̂ is asymptotically smaller than the variance of S_γ̂^j, 2 ≤ j ≤ w (see the proof of theorem 2 in Appendix B.3). Hence, using ν_{γ̂,1}, …, ν_{γ̂,n} in the test of theorem 1 can lead to a conservative test, even as n → ∞.
To make the test asymptotically exact again, we would like to adapt the individual scores such that they are less dependent on the random variation ofγ. We do this by considering the so-called effective score, which 'is "less dependent" on the nuisance parameter than the usual score statistic' (Marohn (2002), page 344).
The effective score S*_γ and the underlying summands ν*_{γ,i}, i = 1, …, n (which we assume have non-zero variance for γ = γ_0), are defined as ν*_{γ,i} = ν_{γ,i} − Î_12 Î_22^{−1} ν_{γ,i}^{(k−1)} and S*_γ = n^{−1/2} Σ_{i=1}^n ν*_{γ,i}. Here Î, with entries Î_11 ∈ ℝ and the (k−1) × (k−1) matrix Î_22, assumed invertible, is a consistent estimate of the population Fisher information I, which is assumed to exist and is the variance of (ν_{γ_0,i}, ν_{γ_0,i}^{(k−1)}) marginally over X_i, under H_0. The matrix I is assumed to be continuous in the parameters. In GLMs, typically Î = n^{−1} X^T Ŵ X, where X is the design matrix and Ŵ the estimated weight matrix (Agresti (2015), page 126). Further, for 1 ≤ j ≤ w we write S*_γ^j = n^{−1/2} Σ_{i=1}^n g_{ji} ν*_{γ,i}. As discussed, S_γ̂ is not generally asymptotically equivalent to S_{γ_0}. The effective score S*_{γ_0} (based on Î = I), however, is the residual from the projection of the score S_{γ_0} on the space spanned by the nuisance scores. Hence S*_{γ_0} is uncorrelated with the nuisance scores S_{γ_0}^{(k−1)} (Marohn (2002), page 344). Correspondingly, as noted in the proof of theorem 2, under mild regularity assumptions S*_γ̂ = S*_{γ_0} + o_{P_{β_0,γ_0}}(1), i.e. asymptotically the effective score is not affected by the nuisance estimate.
If γ̂ is the maximum likelihood estimate under H_0, then S_γ̂^{(k−1)} = 0, so that S*_γ̂ = S_γ̂. The summands ν*_{γ̂,i} and ν_{γ̂,i} are different, however, and the key point is that S*_γ̂^j = S*_{γ_0}^j + o_{P_{β_0,γ_0}}(1) for every 1 ≤ j ≤ w, which is satisfied under mild assumptions such as continuous second-order derivatives.
Theorem 2. Consider the test of theorem 1 with T_n^j = S*_γ̂^j, 1 ≤ j ≤ w. As n → ∞, under H_0 the probability of rejection converges to ⌊αw⌋/w ≤ α.
The test of theorem 2 has a parametric counterpart, which uses that, under H_0, S*_γ̂ is asymptotically normal with zero mean and known variance: the effective information (Marohn (2002), page 341). This test is asymptotically equivalent to the test of theorem 2, as the following proposition states.

Robustness
In Section 2.2 it was explained that the test of theorem 1 is often robust against misspecification of the variance of the score. The test of theorem 2 is also robust against certain forms of variance misspecification. An example is the case that S_γ̂ and S_γ̂^{(k−1)} are misspecified by the same factor; see proposition 3. This happens in particular if the variance is misspecified by a factor which is independent of the covariates.
Proposition 3. Suppose that Î = n^{−1} X^T Ŵ X, where X is an n × k design matrix with IID rows and Ŵ a weight matrix. Consider a misspecification factor c_1 > 0 and misspecified scores ν̃_{γ,i} = c_1 ν_{γ,i} and ν̃_{γ,i}^{(k−1)} = c_1 ν_{γ,i}^{(k−1)}, 1 ≤ i ≤ n. Further, for c_2 > 0 consider the misspecified weight matrix W̃ = c_2 Ŵ. Let Ĩ = n^{−1} X^T W̃ X be the misspecified average Fisher information. Then the test of theorem 2 based on these misspecified quantities is still asymptotically exact.
Proposition 3 is useful, since it tells us that if in a GLM var(Y_i) is misspecified by a constant, such that Ŵ and the scores are misspecified by a constant, the resulting test is still asymptotically exact. In proposition 3 we assume that the misspecification factors of the weights and the scores are the same for all observations. This is satisfied for example when the model is binomial or Poisson, but the true distribution is respectively quasi-binomial or quasi-Poisson. Moreover, in practice the test can be very robust against heteroscedasticity (see Section 5). The variance misspecification is not generally allowed to depend on the covariates, since then S_γ̂ and S_γ̂^{(k−1)} can be misspecified by different factors asymptotically. There are exceptions, however; see Sections 3.3 and 5.
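The invariance behind proposition 3 is easy to see computationally: rescaling all score contributions by a common positive constant rescales every flipped statistic by the same constant, so the rejection decision is unchanged. A toy sketch (our own notation, with a fixed small set of flips):

```python
import math

def flip_decision(nu, flips, alpha):
    """Reject iff the observed statistic (first flip = identity) exceeds
    the ceil((1 - alpha) * w)-th smallest flipped statistic."""
    n, w = len(nu), len(flips)
    stats = [sum(g * v for g, v in zip(signs, nu)) / math.sqrt(n)
             for signs in flips]
    return stats[0] > sorted(stats)[math.ceil((1 - alpha) * w) - 1]

nu = [0.4, -1.2, 2.5, 0.3, -0.7]
flips = [(1, 1, 1, 1, 1), (1, -1, 1, -1, 1),
         (-1, 1, 1, 1, -1), (1, 1, -1, -1, 1)]
# multiplying every contribution by c1 > 0 leaves the decision unchanged
assert flip_decision(nu, flips, 0.25) == flip_decision([3.7 * v for v in nu], flips, 0.25)
```

The same argument covers the effective scores, provided the scores and the nuisance scores are scaled by the same factor.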
When there are estimated nuisance parameters, we can sometimes nevertheless decide to use the test of theorem 1 with the basic scores ν_{γ̂,i} plugged in (rather than using effective scores). Indeed, this test has been shown to be very robust to misspecification, as long as E(ν_{γ̂,i}) = 0, 1 ≤ i ≤ n. It is asymptotically conservative if the score S_{γ_0} is correlated with the nuisance scores S_{γ_0}^{(k−1)}, i.e. when I_12 ≠ 0. Hence, when using this test, it can be useful to redefine the covariates such that I_12 = 0 (as in Cox and Reid (1987)). When Ŵ = bI with b > 0, this means ensuring that the nuisance covariates are orthogonal to the covariate of interest. If the model is potentially misspecified, then the weights, and hence I_12, are not asymptotically known, but the user could substitute a best guess for the weights.

An example
As discussed, the test of theorem 2 is not generally asymptotically exact if the variance misspecification depends on the covariates. An important exception is the case where the model is
Y_i = γ_0 + βX_i + ε_i, with ε_i ∼ N(0, σ²),     (1)
where γ_0 is the unknown intercept and X_i ∈ ℝ. If the null hypothesis is H_0: β = β_0, then γ_0 is a nuisance parameter that needs to be estimated. (We do not need to know σ and can simply substitute 1 for it.) Hence, we compute the effective score. For 1 ≤ i ≤ n, ν_{γ,i} = x_i(y_i − μ_i)/σ² and ν_{γ,i}^{(k−1)} = (y_i − μ_i)/σ². We can consistently estimate I_12 I_22^{−1} by x̄ = (1/n) Σ_{i=1}^n x_i, so that the effective score contributions are ν*_{γ̂,i} = (x_i − x̄)(y_i − μ̂_i)/σ². Thus, the effective score contributions are exactly the basic score contributions after centring x_1, …, x_n at 0. Similarly, if x_1, …, x_n are already centred, then ν_{γ̂,i} and ν*_{γ̂,i} coincide, since then I_12 = 0.
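In code, the centring step of this example looks as follows (a sketch with our own function name, taking σ = 1 and estimating the intercept by its maximum likelihood estimate under H_0, the mean of y_i − β_0 x_i):

```python
def effective_score_contributions(x, y, beta0):
    """Effective score contributions for y_i = gamma0 + beta * x_i + eps_i,
    testing H0: beta = beta0 with unknown intercept gamma0 (sigma = 1).
    Estimating I12 * I22^{-1} by xbar centres the covariate."""
    n = len(x)
    xbar = sum(x) / n
    gamma_hat = sum(yi - beta0 * xi for xi, yi in zip(x, y)) / n  # MLE under H0
    return [(xi - xbar) * (yi - gamma_hat - beta0 * xi)
            for xi, yi in zip(x, y)]
```

Because the covariate is centred, these contributions are uncorrelated with the nuisance score, which is what makes the flipped statistics match the observed one asymptotically.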
The test of theorem 2 is not always asymptotically exact if S_γ̂ and S_γ̂^{(k−1)} are misspecified by different factors. However, if Î_12 = 0, then this concern no longer applies. The test of theorem 2 then remains asymptotically exact and reduces to the test based on the basic score. For model (1), this means that, even if the misspecification of var(Y_i) depends on X_i, we obtain an asymptotically exact test.
A particular case where this principle applies is the generalized Behrens-Fisher problem, where the aim is to test equality of the means μ_1 and μ_2 of two populations (or to test whether μ_1 ≤ μ_2 or μ_1 ≥ μ_2). In this problem, it is assumed only that two independent samples from these populations are available, without making other assumptions such as equal variances. It is well known that this problem has no exact solution with good power under normality (Pesarin and Salmaso, 2010a; Lehmann and Romano, 2005). Under mild assumptions, we obtain an asymptotically exact test for this problem. Pesarin and Salmaso (2010a) have already suggested sign flipping residuals to solve this problem. This is equivalent to flipping scores in our linear model (1) if |x_1| = … = |x_n|.

Multi-dimensional parameter of interest
Until now we have considered hypotheses about a one-dimensional parameter β ∈ R. Here we extend our results to hypotheses about a multi-dimensional parameter β ∈ R d , d ∈ N. Our tests are defined even if d > n, but in the theoretical results that follow we consider d fixed and let n increase to ∞. The extension to multi-dimensional β shares important characteristics with the test for a one-dimensional parameter, such as robustness and asymptotic equivalence with the parametric score test.

Asymptotically exact test
Our tests below are related to the existing non-parametric combination methodology (Pesarin, 2001;Pesarin and Salmaso, 2010a,b). This is a very general permutation-based methodology that allows combining test statistics for many hypotheses into a single test of the intersection hypothesis. Non-parametric combination methods can be extended to the score flipping framework. Our tests below could be considered a special case of such an extension of the non-parametric combination methodology. This special case has certain power optimality properties, which are discussed below.
The parametric score test has a classical extension to a hypothesis on a multi-dimensional parameter, H_0: β = β_0 ∈ ℝ^d (Rao, 1948). We shall extend our test in an analogous way. We first assume that the nuisance γ_0 ∈ ℝ^{k−d} is known. Since β ∈ ℝ^d, the score contributions are ν_{γ,i} = (∂/∂β) log f_{β,γ,X_i}(Y_i)|_{β=β_0}, 1 ≤ i ≤ n, which are now d-vectors. We assume that the derivatives exist. About the elements of ν_{γ,i} (and the nuisance scores that are considered later) we make the assumptions which are analogous to the earlier assumptions about ν_{γ,i}. Let Î_11 be a consistent estimate of I_11, the d × d Fisher information for β ∈ ℝ^d. Rao's classical statistic for testing H_0: β = β_0 ∈ ℝ^d is S_γ^T Î_11^{−1} S_γ, which asymptotically has a χ_d²-distribution under H_0.
Instead of requiring a matrix Î^{−1} which converges to the inverse of the Fisher information, in our test that follows we allow replacement of the Fisher information by any random matrix V̂ converging to some non-zero matrix V, i.e. we do not require the Fisher information to be asymptotically known, just as in the one-dimensional case. The matrix V can be any matrix of preference, including I_11^{−1} (if I_11 is invertible), or we can take V̂ = V = I. We shall discuss various choices of V shortly.
Theorem 3. The result of theorem 1 still applies if for 1 ≤ j ≤ w we define T_n^j = (S_γ^j)^T V̂ S_γ^j.
In the case that the nuisance parameter γ_0 is unknown and we have a √n-consistent estimate γ̂, we can use the same test, but with effective scores instead of basic scores plugged in; see theorem 4. For multi-dimensional β, the effective score contributions are ν*_{γ,i} = ν_{γ,i} − Î_12^T Î_22^{−1} ν_{γ,i}^{(k−d)}. Here Î_12 and Î_22 are (k−d) × d and (k−d) × (k−d) matrices respectively.
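The quadratic-form statistic of theorem 3 is straightforward to compute; here is a dependency-free sketch (the helper name is ours, intended for small d):

```python
import math

def flip_quadratic_statistic(nu_vectors, signs, V):
    """T = S^T V S with S = n^{-1/2} * sum_i g_i * nu_i, where each nu_i is
    a d-vector of score contributions and signs holds the flips g_i."""
    n, d = len(nu_vectors), len(nu_vectors[0])
    S = [sum(g * v[r] for g, v in zip(signs, nu_vectors)) / math.sqrt(n)
         for r in range(d)]
    return sum(S[r] * V[r][s] * S[s] for r in range(d) for s in range(d))
```

With V the identity, the identity flip (1, 1) applied to contributions (1, 0) and (0, 1) gives S = (1/√2, 1/√2) and T = 1.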
Theorem 4 (unknown nuisance). The result of theorem 1 still applies if for 1 ≤ j ≤ w we define T_n^j = (S*_γ̂^j)^T V̂ S*_γ̂^j.
The test of theorem 4 is asymptotically equivalent to its parametric counterpart, as proposition 4 states. In particular, if we take V̂ = (Î*)^{−1}, where (Î*)^{−1} is a consistent estimate of the inverse of the effective Fisher information, then the test of theorem 4 is asymptotically equivalent to the parametric score test (Hall and Mathiason (1990), page 86).
Proposition 4 (equivalence with parametric counterpart). Define T_n^j as in theorem 4, 1 ≤ j ≤ w. Let ξ ∈ ℝ^d and suppose that the true value of the parameter of interest is β = β_n = β_0 + n^{−1/2} ξ. Let φ_{n,w} = 1{T_n^1 > T_n^[1−α]}. This is the test of theorem 4. Let φ_n be the parametric test 1{T_n^1 > q_α}, where q_α is the (1−α)-quantile of the distribution to which T_n^1 converges as n → ∞ under β = β_0. (This is the χ_d²-distribution if V is the inverse of the effective information matrix I*.) Then lim_{w→∞} lim inf_{n→∞} P(φ_{n,w} = φ_n) = 1.
We have seen that the test of theorem 1 is often robust against overdispersion and heteroscedasticity: as long as the score contributions have mean 0, the test is asymptotically exact, under very mild assumptions. Moreover, it is not required to estimate the Fisher information. The same applies to the multi-dimensional extension in theorem 3.
The test that takes into account nuisance estimation (theorem 4) uses effective scores, so it does need to estimate the information. However, as in the one-dimensional case, it can be seen that the test remains valid if the information matrix is asymptotically misspecified by a constant (as in proposition 3). Additional robustness is illustrated with simulations in Section 5.5.

Connection with the global test
The test of theorem 3 is related to the global test, which was developed in Goeman et al. (2004, 2006, 2011). We can combine the global test with the score flipping approach. In certain cases, the resulting test coincides with the test of theorem 3.
The global test is a parametric test of H_0. For the test to be defined, it is not required that d ≤ n. For GLMs with canonical link function, the test statistic of the global test is
S_γ^T Σ S_γ,     (2)
with Σ a freely chosen positive (semi)definite d × d matrix (Goeman et al., 2006, 2011). The choice of Σ influences the power properties. When V̂ = Σ, statistic (2) coincides with the statistic of theorem 3. Thus, it immediately follows from our results that the global test can be combined with our sign flipping approach, leading to a test which becomes asymptotically exact as n → ∞ and asymptotically equivalent to its parametric counterpart: the original global test (by proposition 4). Combining the global test with sign flipping is useful in the light of our robustness results. Moreover, the sign flipping variant can be combined with existing permutation-based multiple-testing methodology (Westfall and Young, 1993; Hemerik and Goeman, 2018b; Hemerik et al., 2019). Goeman et al. (2006) provided results on the power properties of the global test as depending on the choice of Σ. Since the global test is asymptotically equivalent to its sign flipping counterpart, these results can be used as recommendations on the choice of V̂ in theorem 3. In particular, according to Goeman et al. (2006), section 8, taking V̂ = I leads to good power if we expect that a relatively large part of the variance of Y is explained by the large-variance principal components of the design matrix. If this is not so, taking V̂ to be an estimate of the inverse of the Fisher information (if invertible) can provide better power. In general, the global test has optimal power on average (over β) in a neighbourhood of β_0 that depends on Σ (Goeman et al., 2006). Hence the same holds asymptotically for the test of theorem 3, for GLMs with canonical link.

Simulations
To compare the tests in this paper with each other and existing tests, we applied them to simulated data. In particular we considered scenarios where the model was misspecified. Simulations with a multi-dimensional parameter of interest are in Section 5.5.

Overdispersion, heteroscedasticity and estimated nuisance
In Sections 5.1 and 5.2 the model assumed was Poisson, but in fact Y 1 , : : : , Y n were drawn from a negative binomial distribution.
The covariates X, Z, Z^l ∈ ℝ were drawn from a multivariate normal distribution with zero mean and var(X) = var(Z) = var(Z^l) = 1. (For non-zero means, similar simulation results were obtained.) The response satisfied E(Y_i) = μ_i = exp(βX_i + γ_0 Z_i + γ_0^l Z_i^l). The null hypothesis was H_0: β = 0. In Section 5.1 we took γ_0^l = 0. The coefficient γ_0 and the intercept (with true value 0) were nuisance parameters that were estimated by maximum likelihood under H_0. We took γ_0 = 1 and ρ(X_i, Z_i) = 0.5, ρ(Z_i^l, Z_i) = 0 and ρ(Z_i^l, X_i) = 0. We took the dispersion parameter of the negative binomial distribution to be 1, so that var(Y_i) = μ_i + μ_i². The model assumed, however, was Poisson, i.e. var(Y_i) = μ_i was assumed. Thus the true variance was larger than the assumed variance and the variance misspecification factor depended on μ_i, i.e. on the covariate Z_i. The assumed log-link function was correct and in Section 5.1 the linear predictor was correct as well.
In Fig. 1 the estimated rejection probabilities of four tests under H_0: β = 0 are compared, based on 5000 repeated simulations. In all simulations the tests were two-sided.
One of the tests that was considered was the parametric score test. Since the model assumed was Poisson, the computed Fisher information was too small and the test was anticonservative.
We also applied a Wald test, where we used a sandwich estimate (Agresti (2015), page 280) of the variance of β̂, to correct for the misspecified variance function. We used the R package gee (Carey et al., 2019) for this (available on the Comprehensive R Archive Network), specifying blocks of size 1. As can be seen in Fig. 1, this test was quite anticonservative (especially for small α, e.g. α = 0.01). This was in particular due to the estimation error of the sandwich (Boos, 1992; Freedman, 2006; Maas and Hox, 2004; Kauermann and Carroll, 2000).
Further, we applied the sign flipping test based on the basic scores ν_{γ̂,i}. Because of the estimation of γ_0, the variance of the score was shrunk and the test was conservative, as explained in Section 3.1. In the simulations under H_0 we took w = 200. Taking w larger led to a very similar level (see also Marriott (1979)). In the power simulations we took w = 1000. Finally, we used the sign flipping test of theorem 2, which is based on the effective scores ν*_{γ̂,i}. In Section 3.2 it has already been shown that this test is asymptotically exact under constant variance misspecification. Here, however, the variance misspecification factor was 1 + μ_i (i.e. it depended on Z_i). Nevertheless the rejection probability under H_0 was approximately α. This illustrates that the test has some additional robustness, which we have not theoretically shown.
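The basic sign flipping test described above can be sketched as follows, for the Poisson log-linear model. This is our own minimal Python illustration, not the paper's implementation: the function names and the damped Newton fit are ours; the score contribution for β is ν_i = x_i(y_i − μ̂_i), with μ̂ the maximum likelihood fit under H_0 (nuisance part only).

```python
import numpy as np

def poisson_fit(W, y, iters=50):
    """ML fit of a Poisson log-linear model by damped Newton-Raphson."""
    coef = np.zeros(W.shape[1])
    coef[0] = np.log(y.mean() + 0.5)      # assumes column 0 is the intercept
    for _ in range(iters):
        mu = np.exp(W @ coef)
        step = np.linalg.solve(W.T @ (W * mu[:, None]), W.T @ (y - mu))
        ll = y @ (W @ coef) - mu.sum()    # log-likelihood up to a constant
        t = 1.0
        while t > 1e-8:                   # halve the step until ll improves
            new = coef + t * step
            eta = W @ new
            if eta.max() < 30.0 and y @ eta - np.exp(eta).sum() >= ll:
                coef = new
                break
            t *= 0.5
    return coef

def basic_signflip_test(y, x, W_nuis, w=200, seed=0):
    """Sign-flip the basic score contributions nu_i = x_i (y_i - mu0_i),
    where mu0 is fitted by ML under H0 (beta = 0, nuisance only)."""
    rng = np.random.default_rng(seed)
    mu0 = np.exp(W_nuis @ poisson_fit(W_nuis, y))
    nu = x * (y - mu0)
    t_obs = abs(nu.sum())                          # g_1 = identity is kept
    flips = rng.choice([-1.0, 1.0], size=(w - 1, len(y)))
    return (1 + np.sum(np.abs(flips @ nu) >= t_obs)) / w
```

Because μ̂ is estimated, the flipped contributions are slightly dependent, which is exactly the source of the conservativeness discussed in Section 3.1.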

Ignored nuisance
The same simulations were performed as in Section 5.1, but with γ^l_0 = 1. Since γ^l_0 = 0 was assumed, Z^l_i represented an ignored latent variable. Fig. 2 shows similar results to those in Fig. 1. The parametric test was even more anticonservative than in Section 5.1. The reason is that the introduction of Z^l_i increased the variance of Y_i, so the variance of the score was even more misspecified than in Section 5.1.
The test of theorem 2 was still nearly exact for n = 200, even though μ_i was misspecified. (Even marginally over Z^l_i, μ_i was misspecified. Possibly the estimation of the intercept corrected for the misspecification.) A conclusion from the simulations of Sections 5.1 and 5.2 is that the sandwich-based approach should not always be seen as the most reliable way of testing models with misspecified variance functions. Indeed, in our simulations the test of theorem 2 was substantially less anticonservative (while having similar power; see Section 5.3).

Power
For a meaningful power comparison of the four tests, we considered the scenario where the model assumed was correct, i.e. the data distribution was Poisson and γ^l_0 was 0; see Fig. 3. The estimated probabilities are based on 2 × 10^4 simulation loops.
Since the model was correct, asymptotically there was no better choice than the parametric test. The sign flipping test of theorem 2 had very similar power. The basic sign flipping test was again conservative because of the estimation of γ 0 . The sandwich-based test had the most power but was anticonservative (the null behaviour is not shown).

Strong heteroscedasticity
When a Gaussian linear model is considered with Y_i ∼ N(βx_i, σ²), x_1 = … = x_n = 1 and H_0: β = 0, the score contribution of observation i is x_i(Y_i − βx_i)/σ² = Y_i/σ² under H_0, i.e. proportional to Y_i. Thus the test of theorem 1 simply flips the observations Y_i, 1 ≤ i ≤ n. The parametric counterpart of this test is the one-sample t-test.
The t-test needs to estimate the nuisance parameter σ²; the sign flipping test does not (simply substitute σ = 1). We simulated strongly heteroscedastic data: we took Y_i ∼ N(βx_i, σ²_i), with σ_i = exp(i), 1 ≤ i ≤ n = 10. Consequently the t-statistic did not have the distribution assumed and under H_0 the rejection probability of the t-test was far from the nominal level for most α; see Fig. 4(a). The sign flipping test did not need to estimate the variance. In this setting the test has rejection probability ⌊αw⌋/w exactly if the transformations g_1, …, g_w are drawn without replacement, since the observations are symmetric; see proposition 1. (We drew g_1, …, g_w with replacement for convenience, but this gives almost the same test as drawing without replacement, because of the small probability of ties.) For a meaningful power comparison, we considered the correct homoscedastic model with σ_1 = … = σ_10 = 1. Fig. 4(b), based on 10^5 repeated simulations, shows that the tests had virtually the same power.
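The exactness of the sign flipping test under symmetric heteroscedastic errors can be checked empirically. The following is a minimal Python sketch (our own, not the paper's code): with x_i = 1 and σ set to 1, flipping scores amounts to flipping the observations Y_i themselves, and the empirical level stays near α despite the wildly unequal variances.

```python
import numpy as np

def signflip_pvalue(y, w=200, rng=None):
    """One-sample sign-flipping test of H0: E(Y_i) = 0 for symmetric Y_i.
    The scores are the observations, so flipping scores flips the Y_i."""
    rng = rng if rng is not None else np.random.default_rng()
    t_obs = abs(y.sum())                       # identity flip is included
    flips = rng.choice([-1.0, 1.0], size=(w - 1, len(y)))
    return (1 + np.sum(np.abs(flips @ y) >= t_obs)) / w

# Strongly heteroscedastic null data: sigma_i = exp(i), i = 1, ..., 10.
rng = np.random.default_rng(0)
n, alpha, reps = 10, 0.05, 2000
sigma = np.exp(np.arange(1, n + 1))
rej = np.mean([signflip_pvalue(rng.normal(0.0, sigma), w=200, rng=rng) <= alpha
               for _ in range(reps)])
# rej stays close to alpha even though the variances differ by a factor exp(9)
```

The t-test, by contrast, relies on a pooled variance estimate that has no meaning under this degree of heteroscedasticity.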
Multi-dimensional parameter
Instead of the basic score test for a one-dimensional parameter we now used the multi-dimensional extension in theorem 3. Similarly, instead of the test of theorem 2 based on effective scores, we used the multi-dimensional extension in theorem 4. We took V̂ = V to be the identity matrix.
In Sections 5.1 and 5.2 we compared our tests with a Wald test based on a sandwich estimate of var(β̂). Here we proceeded analogously, using a sandwich estimate of the 5 × 5 matrix var(β̂) in the multi-dimensional Wald test. This test uses that β̂⊤ var(β̂)^{-1} β̂ asymptotically has a χ²_d distribution under the null hypothesis H_0: β = 0.
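For concreteness, the sandwich Wald statistic used as the competitor can be sketched as follows. This is a minimal Python illustration (the function name and layout are ours, not the gee package's): the bread is the model-based Fisher information A and the meat B is the sum of outer products of the per-observation score contributions.

```python
import numpy as np

def sandwich_wald_stat(W, y, coef, beta_idx):
    """Wald statistic for H0: beta = 0 in a Poisson log-linear model,
    using the robust sandwich covariance A^{-1} B A^{-1}."""
    mu = np.exp(W @ coef)
    A = W.T @ (W * mu[:, None])          # bread: Fisher information
    U = W * (y - mu)[:, None]            # row i = score contribution of obs. i
    Ainv = np.linalg.inv(A)
    V = Ainv @ (U.T @ U) @ Ainv          # sandwich estimate of var(coef_hat)
    b = coef[beta_idx]
    Vb = V[np.ix_(beta_idx, beta_idx)]
    return float(b @ np.linalg.solve(Vb, b))   # compare with chi^2_d quantiles
```

With a 5 × 5 block Vb, the estimation error in the meat B is exactly what makes this statistic anticonservative in small samples, as Fig. 5 illustrates.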
The results under H_0 are shown in Fig. 5, where each plot is based on 10^4 simulation loops. They are comparable with those in Section 5.2, except that the sandwich-based method is now even more anticonservative. This is because var(β̂) is now a 5 × 5 matrix, which is difficult to estimate accurately. For n = 50 and α = 0.01, the rejection probability of the sandwich-based method was 0.27 instead of the required 0.01.
For a meaningful power comparison of the four tests, we again considered the scenario where the model assumed was correct, i.e. the data distribution was Poisson and γ^l_0 was 0. See Fig. 6, where each plot is based on 10^4 simulation loops. As usual, the sign flipping test based on basic scores had low power due to nuisance estimation. The power of the sign flipping test based on effective scores was comparable with that of the parametric score test. As in Section 5.3, the test based on a sandwich estimate was the most powerful, but this has limited meaning, since it was also quite anticonservative under the correct model (the plot is not shown).
To conclude, sign flipping provided much more reliable type I error control than the sandwich approach, while giving satisfactory power (comparable with that of the parametric test, under the correct model).

Data analysis
We analysed the data set warpbreaks. These data are used in the example code of the gee R package, available on the Comprehensive R Archive Network. The data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. There are 54 observations of three variables: the number of breaks, the type of wool (A or B) and the tension (low, medium or high). For each of the six possible combinations of wool and tension, there are nine observations. Using various methods, we tested whether the number of breaks depends on the type of wool. We first considered a basic Poisson model with log μ_i = γ_1 + γ_2 M_i + γ_3 H_i + βW_i, where M_i and H_i indicate medium and high tension and W_i indicates wool type B. The γ_i, 1 ≤ i ≤ 3, were nuisance parameters that were estimated by using maximum likelihood. We first tested H_0: β = 0 using the parametric score test, obtaining a p-value of 6.29 × 10^{-5}. (All the tests that were performed were two sided.) However, the data were clearly overdispersed: for each combination of wool and tension, the empirical variance of the nine observations was substantially larger than the empirical mean. Thus the p-value based on the parametric test had limited meaning. Fitting a quasi-Poisson model, which assumes constant overdispersion, gave a p-value of 0.059.
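The overdispersion diagnostic used here, comparing the empirical mean and variance within each wool-by-tension cell, can be sketched as follows. Since we cannot reproduce the warpbreaks data in this illustration, the cell means below are synthetic stand-ins (illustrative magnitudes only, drawn from an overdispersed gamma-Poisson mixture), not the actual counts.

```python
import numpy as np

# Synthetic stand-in for the six wool-by-tension cells (nine counts each);
# the cell means are illustrative, not the real warpbreaks values.
rng = np.random.default_rng(0)
cell_means = [45, 28, 22, 30, 27, 19]
cells = [rng.poisson(m * rng.gamma(5.0, 0.2, size=9)) for m in cell_means]

for counts in cells:
    # Under a correct Poisson model the two numbers should be comparable;
    # a variance far above the mean signals overdispersion.
    print(f"mean = {counts.mean():6.2f}  variance = {counts.var(ddof=1):8.2f}")
```

When the within-cell variance dwarfs the mean in every cell, as for warpbreaks, a plain Poisson score test computes a Fisher information that is far too small.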
As in Section 5, we also applied a Wald test, where we used a sandwich estimate (Agresti (2015), page 280) of the variance of β̂, to correct for the misspecified variance function. This resulted in a p-value of 0.048.
Further, we used the sign flipping test based on the basic scores ν_{γ̂,i}, i = 1, …, 54 (still using the basic Poisson model). We took w = 10^6. This resulted in a p-value of 0.113. This test is quite robust to model misspecification, but we know that it tends to be conservative when the score is correlated with the nuisance scores, as was the case here.
Finally, we performed the test of theorem 2 based on the effective score. This test is asymptotically exact under the correct model and has been shown to be robust against several forms of variance misspecification. It provided a p-value of 0.065.
On the basis of this evidence, when maintaining a significance level of 0.05, it seems that we cannot reject H_0. Indeed, only the sandwich-based test provided a p-value below 0.05, but this test is often anticonservative, as discussed in Section 5.1.

Discussion
We have proposed a test which relies on the assumption that individual score distributions are independent and have mean 0 (in the case of a point hypothesis) under the null. If the score contributions are misspecified because of overdispersion, heteroscedasticity or ignored nuisance covariates, then the traditional parametric tests lose their properties. The sign flipping test is often robust to these types of misspecification and can still be asymptotically exact.
When nuisance parameters are estimated, the basic score contributions become dependent. If a nuisance score is correlated with the score of the parameter of interest, the estimation reduces the variance of the score, so the sign flipping test becomes conservative. As a solution we propose to use the effective score, which is asymptotically the part of the score that is orthogonal to the nuisance score. The effective score is asymptotically unaffected by the nuisance estimation, so we again obtain an asymptotically exact test. We have proved that this is still the case when the scores and the Fisher information are misspecified by a constant, and simulations illustrate additional robustness.
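The orthogonalization underlying the effective score can be illustrated empirically. The sketch below is our own construction, not the paper's estimator: it replaces the Fisher-information-based projection with its least-squares analogue, residualizing the score contributions for β on the nuisance score contributions so that the remaining part is orthogonal to the nuisance direction.

```python
import numpy as np

def effective_scores(nu_beta, nu_gamma):
    """Empirical analogue of the effective score: residualize the score
    contributions for beta (length n) on the nuisance score contributions
    (n x q matrix), leaving the component orthogonal to the nuisance."""
    coef, *_ = np.linalg.lstsq(nu_gamma, nu_beta, rcond=None)
    return nu_beta - nu_gamma @ coef
```

Sign flipping these residualized contributions is, informally, why nuisance estimation no longer shrinks the variance of the flipped statistic.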
When the parameter of interest is multi-dimensional, our test statistic involves a freely chosen matrix, which influences the power properties. If this matrix is taken to be the inverse of the effective Fisher information and the model assumed is correct, then our test is asymptotically equivalent to the parametric score test. Under the correct model, in certain situations our test is asymptotically equivalent to the global test (Goeman et al., 2006), which is popular for testing hypotheses about high dimensional parameters.

B.3. Proof of theorem 2
Suppose that H_0 holds. Let 2 ≤ j ≤ w. The intuitive reason why S_j(γ̂) = S_j(γ_0) + o_{P_{β_0,γ_0}}(1) is that the estimation of γ̂ does not cause the summands underlying S_j(γ̂) to be correlated. Similarly we find that S^{(k−1),j}(γ̂) = S^{(k−1),j}(γ_0) + o_{P_{β_0,γ_0}}(1) and conclude that S*_j(γ̂) = S*_j(γ_0) + o_{P_{β_0,γ_0}}(1). Let T_n be as in the proof of theorem 1, with ν_i replaced by ν*_{γ_0,i}. Suppose that H_0 holds and Î = I, so that the summands underlying T_n^j are independent. For every 1 ≤ i ≤ n, E(ν*_{γ_0,i}) = 0. The elements of T_n are uncorrelated and have common variance var(ν*_{γ_0,1}). By the multivariate central limit theorem (Van der Vaart, 1998; Greene, 2012), T_n converges in distribution to N{0, var(ν*_{γ_0,1})I}. We supposed that Î = I to use the central limit theorem, but the asymptotic distribution of T_n is the same if Î is any consistent estimator of I. Let T̂_n be as in the proof of theorem 1, with ν_i replaced by ν*_{γ̂,i}. For every 1 ≤ j ≤ w, S*_j(γ̂) = S*_j(γ_0) + o_{P_{β_0,γ_0}}(1). Thus T̂_n and T_n are asymptotically equivalent. The result now follows from lemma 1.
Since the distribution of (T_n^1, …, T_n^w) converges to the distribution of (T^1, …, T^w) as n → ∞, T_n^{[1−α]} →_d T^{[1−α]} (4) as n → ∞. Since in the present proof w is not fixed, we shall write T_n^{[1−α]} = T_{n,w}^{[1−α]}. The distribution of T_n^1, which does not depend on w, converges to a continuous distribution as n → ∞. It follows that, for every ε > 0, there is a W such that there is an N such that, for all w > W and n > N, by results (3) and (4), lim inf_{n→∞} P{|T_{n,w}^{[1−α]} − σ_0 Φ^{−1}(1 − α)| < ε} > 1 − ε. Thus lim_{w→∞} lim inf_{n→∞} P{|T_{n,w}^{[1−α]} − σ_0 Φ^{−1}(1 − α)| < ε} = 1. Finally, note that S^{(k−1),j}(γ̂) = c_1 S*_j(γ̂). Hence the test is identical to that of theorem 2, since that test is unchanged if all T_n^j, 1 ≤ j ≤ w, are multiplied by the same constant.

B.6. Proof of theorem 3
Suppose that H_0 holds. Consider the d × w matrix (n^{−1/2} Σ_{i=1}^n g_{ji} ν_{γ_0,i})_{1≤j≤w}. (5) It follows from the multivariate central limit theorem (Van der Vaart, 1998) that, as n → ∞, this matrix converges in distribution to a matrix with identically distributed columns which are independent of each other. Note that, for every 1 ≤ j ≤ w, T_n^j is a function of the jth column of matrix (5). Thus, with the continuous mapping theorem (Van der Vaart (1998), theorem 2.3), it follows that (T_n^1, …, T_n^w) also converges in distribution to a vector with continuous IID elements. The result now follows from lemma 1.

B.7. Proof of theorem 4
Consider the case γ̂ = γ_0. As in the proof of theorem 3, under H_0, (T_n^1, …, T_n^w) converges in distribution to a vector of w IID variables. As in the proof of theorem 2, the same is true if we take γ̂ to be a different √n-consistent estimator of γ_0. (Again, the reason is that the effective score based on γ̂ is asymptotically equivalent to the effective score based on γ_0.) The result now follows from lemma 1 again.
Analogously to the one-dimensional case at proposition 2, for 2 ≤ j ≤ w, the vector n^{−1/2} Σ_{i=1}^n g_{ji} ν*_{γ̂,i} is asymptotically the difference of two mutually independent N(½I*ξ, ½I*) vectors (Hall and Mathiason, 1990), so it also has an asymptotic N(0, I*) distribution (under β = β_n). As in the proof of theorem 3, by the multivariate central limit theorem, the d × (w − 1) matrix (n^{−1/2} Σ_{i=1}^n g_{ji} ν*_{γ̂,i})_{2≤j≤w} converges to a matrix with w − 1 independent N(0, I*) columns as n → ∞. Hence, by the continuous mapping theorem, as n → ∞, (T_n^2, …, T_n^w) converges in distribution to a vector of w − 1 IID variables (under β = β_n), which follow the asymptotic distribution which T_n^1 has under β = β_0. The result now follows as at the end of the proof of proposition 2.