Robust testing in generalized linear models by sign flipping score contributions
Summary
Generalized linear models are often misspecified because of overdispersion, heteroscedasticity and ignored nuisance variables. Existing quasi‐likelihood methods for testing in misspecified models often do not provide satisfactory type I error rate control. We provide a novel semiparametric test, based on sign flipping individual score contributions. The parameter tested is allowed to be multi‐dimensional and even high dimensional. Our test is often robust against the mentioned forms of misspecification and provides better type I error control than its competitors. When nuisance parameters are estimated, our basic test becomes conservative. We show how to take nuisance estimation into account to obtain an asymptotically exact test. Our proposed test is asymptotically equivalent to its parametric counterpart.
1 Introduction
We consider the problem of testing hypotheses about parameters in potentially misspecified generalized linear models (GLMs). The types of misspecification that we consider include overdispersion and heteroscedasticity. When the model is misspecified, traditional parametric tests tend to lose their properties, e.g. because they estimate the Fisher information under incorrect assumptions. By a parametric test we mean a test which fully relies on an assumed parametric model (Pesarin, 2015) to compute the null distribution of the test statistic.
When a parametric model to be tested is potentially misspecified, the most obvious approach is to extend the model with more parameters, e.g. to add an overdispersion parameter. However, such approaches still require assumptions, e.g. that the overdispersion is constant. Hence a fully parametric approach is not always the best option.
Another well‐known approach to testing in possibly misspecified GLMs is to use a Wald‐type test, where a sandwich estimate of the variance of the coefficient estimate is used. The sandwich estimate corrects for the potentially misspecified variance. As long as the linear predictor and link are correct, such a test is asymptotically exact under mild assumptions. We call a test asymptotically exact if its rejection probability is asymptotically known under the null hypothesis. For small samples, however, sandwich estimates often perform poorly and the test can be very liberal (Boos, 1992; Freedman, 2006; Maas and Hox, 2004; Kauermann and Carroll, 2000).
Recent decades have seen an increase in the use of permutation approaches for various testing problems (Tusher et al., 2001; Pesarin, 2001; Chung and Romano, 2013; Pauly et al., 2015; Winkler et al., 2016; Hemerik and Goeman, 2018a; Ganong and Jäger, 2018). These methods are useful since they require few parametric assumptions. Especially when multiple hypotheses are tested, permutation methods are often powerful since they can take into account the dependence structure in the data (Westfall and Young, 1993; Hemerik and Goeman, 2018b; Hemerik et al., 2019). In the past, permutation methods have already been used to test in linear models (Winkler et al. (2014) and references therein). Rather than permutations, sometimes other transformations are used, such as rotations (Solari et al., 2014) and sign flipping of residuals (Winkler et al., 2014). The existing permutation tests for GLMs, however, are limited to models with identity link function.
Like some existing methods for testing in linear models, this paper presents a sign flipping approach. Our approach is new, however, since, rather than flipping residuals, we flip individual score contributions (note that the score, the derivative of the log‐likelihood, is a sum of n individual score contributions). Moreover, we allow testing in a wide range of models, not only regression models with identity link. Under mild assumptions, the only requirement for the test to be asymptotically exact is that the individual score contributions have mean 0. Consequently, if the link function is correct, our method is often robust against several types of model misspecification, such as arbitrary overdispersion, heteroscedasticity and, in some cases, ignored nuisance parameters.
The main reason for this robustness is that we do not need to estimate the variance of the score: the Fisher information. Rather, we perform a permutation‐type test based on the score contributions where, rather than permutation, we use sign flipping. Intuitively, the advantage of this approach over explicitly estimating the variance is as follows: if the score contributions are independent and perfectly symmetric around zero under the null, then our test is exact for small n, even if the score contributions have misspecified variances and shapes (Pesarin and Salmaso, 2010a). A parametric test, in contrast, is then usually not exact.
In case nuisance parameters are estimated, the individual score contributions become dependent and our basic sign flipping test is no longer asymptotically exact. To deal with this problem, we consider the effective score, which is less dependent on the nuisance estimate than is the basic score (Hall and Mathiason, 1990; Marohn, 2002). In this case we need slightly more assumptions: the variance misspecification is not always allowed to depend on the covariates. The resulting test is asymptotically exact.
The methods in this paper have been implemented in the R package flipscores (Hemerik et al., 2018), which is available in the Comprehensive R Archive Network.
In Section 2 we consider the scenario that no nuisance effects need to be estimated. In Section 3 we show how the estimation of nuisance effects can be taken into account. Section 4 provides tests of hypotheses about parameters of more than one dimension. Section 5 contains simulations and Section 6 an analysis of real data.
The programs that were used to analyse the data can be obtained from
https://rss.onlinelibrary.wiley.com/hub/journal/14679868/series-b-datasets.
2 Models with known nuisance parameters
Consider random variables ν1,…,νn, which satisfy assumption 1 below. These will often be individual score contributions (see Section 3, Rao (1948) or Hall and Mathiason (1990), page 86), but the results in Section 2.1 hold for any random variables satisfying this assumption.
Assumption 1. The random variables νi, 1⩽i⩽n, are independent of each other, have finite variances and satisfy the following Lindeberg condition. Write sn² = var(ν1)+…+var(νn); then, for every ε>0, (1/sn²) E[ν1² 1{|ν1|⩾εsn}]+…+(1/sn²) E[νn² 1{|νn|⩾εsn}] → 0 as n→∞.
Throughout Section 2, we consider any null hypothesis H0 which implies that E(νi)=0 for all 1⩽i⩽n. If ν1,…,νn are score contributions and H0 is a point hypothesis, then, under mild assumptions, E(νi)=0 is satisfied under H0.
A key assumption throughout Section 2 is that the νi, 1⩽i⩽n, are independent. As soon as nuisance parameters need to be estimated, however, score contributions become dependent. Section 3 is devoted to dealing with estimated nuisance.
2.1 Basic sign flipping test
Let α ∈ [0,1). For any a ∈ ℝ, let ⌈a⌉ be the smallest integer that is larger than or equal to a and let ⌊a⌋ be the largest integer that is at most a. Let g1 = (1,…,1) and let g2,…,gw be drawn independently and uniformly from {1,−1}ⁿ. For 1⩽j⩽w, writing gj = (gj1,…,gjn), define the sign flipped statistic Tj = gj1ν1+…+gjnνn. Given the values T1,…,Tw, we let T(1)⩽…⩽T(w) be the sorted values.
Theorem 1. Suppose that assumption 1 holds. Consider the test that rejects H0 if and only if T1 > T(⌈(1−α)w⌉). Then, as n→∞, the probability of rejection of this test converges to ⌊αw⌋/w⩽α under H0. Moreover, the statistics T1,…,Tw are asymptotically normal and independent with mean 0 and common variance under H0.
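The test of theorem 1 is simple to implement. The following sketch (ours, in Python with NumPy rather than the authors' flipscores R package; the function name is hypothetical) draws w−1 random sign flipping vectors, computes the flipped sums of the score contributions and compares the observed sum with the ⌈(1−α)w⌉th smallest value:

```python
import numpy as np

def basic_flip_test(nu, w=200, alpha=0.05, rng=None):
    """One-sided sign flipping test of H0: E(nu_i) = 0.

    nu : array of n individual score contributions.
    Returns (reject, T), where T[0] is the observed statistic T1.
    """
    rng = np.random.default_rng(rng)
    n = len(nu)
    # g_1 = (1, ..., 1); g_2, ..., g_w drawn uniformly from {1, -1}^n
    G = rng.choice([-1.0, 1.0], size=(w, n))
    G[0] = 1.0
    T = G @ nu                         # T_j = g_j1*nu_1 + ... + g_jn*nu_n
    k = w - int(np.floor(alpha * w))   # equals ceil((1 - alpha) * w)
    reject = T[0] > np.sort(T)[k - 1]  # compare T1 with the k-th smallest T_j
    return reject, T

# Example: mean-zero contributions, as under H0
rng = np.random.default_rng(1)
nu = rng.standard_normal(100)
reject, T = basic_flip_test(nu, w=200, alpha=0.05, rng=0)
```

Note that only signs are randomized; no variance estimate enters the procedure, which is the source of the robustness discussed in Section 2.2.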
We now provide an extension of theorem 1 to interval hypotheses. The proof is a straightforward adaptation of the proof of theorem 1.
Corollary 1 (interval hypotheses). Suppose that assumption 1 holds. Consider a null hypothesis H′ which implies that E(νi)⩽0 for all 1⩽i⩽n. Then, for every ε>0, there is an N such that, under H′, for every n>N, the rejection probability of the test of theorem 1 is at most ⌊αw⌋/w+ε. Similarly, if H′ implies that E(νi)⩾0 for all 1⩽i⩽n, then there is an N such that, under H′, for every n>N, the rejection probability of the analogous left‐tailed test is at most ⌊αw⌋/w+ε.
The following corollary extends theorem 1 to two‐sided tests. The proof is analogous to that of theorem 1.
Corollary 2 (two‐sided test). Suppose that assumption 1 holds. Consider the test that rejects H0 if and only if |T1| > |T|(⌈(1−α)w⌉), where |T|(1)⩽…⩽|T|(w) denote the sorted values of |T1|,…,|Tw|. Under H0, as n→∞, the probability of rejection converges to ⌊αw⌋/w⩽α.
Note that our test does not rely on an approximate symmetry assumption (as for example that of Canay et al. (2017) does). Indeed, even if the scores are very skewed, asymptotically the test of theorem 1 is exact. However, if the νi are symmetric, then even for small n the size is always at most α, as noted in the following proposition. A special case of this result has already been discussed in Fisher (1935), section 21, where every element of {1,−1}n is used once.
Proposition 1. Suppose that ν1,…,νn are independent and continuous and that, under H0, νi and −νi have the same distribution for each 1⩽i⩽n. Then the size of the test of theorem 1 is at most α for any n. Moreover, if g2,…,gw are uniformly drawn from {1,−1}ⁿ∖{(1,…,1)} without replacement (so that only g1 takes the value (1,…,1)), then the probability of rejection under H0 is exactly ⌊αw⌋/w. (Note that w cannot exceed 2ⁿ then.)
If the gj are drawn with replacement or the νi are discrete, then under H0 the probability of rejection of the test of proposition 1 is (slightly) smaller than ⌊αw⌋/w for finite n, because of the possibility of ties among the test statistics Tj, 1⩽j⩽w. Otherwise the rejection probability under H0 is ⌊αw⌋/w.
When the rejection probability under H0 is ⌊αw⌋/w, it can be advantageous to take w such that α is a multiple of 1/w, to exhaust the nominal level. For example, with α=0.05 and w=200 the level is 10/200=0.05 exactly, whereas w=150 gives ⌊7.5⌋/150=7/150≈0.047.
In theorem 1, we did not assume continuity of the observations νi. There, under the mild assumption 1, P(Tj=Tk)→0 as n→∞ for any 1⩽j<k⩽w, regardless of the distribution of the νi. This allows the use of theorem 1 for discrete GLMs.
2.2 Robustness
Apart from some mild assumptions, the main assumption that is made in theorem 1 is that E(νi)=0, i=1,…,n. When the νi are score contributions, this is satisfied as soon as the modelled mean of Yi is the true expected value of Yi. Then the test is asymptotically exact even if the variance or distributional shape of Yi is misspecified. The variances are even allowed to be misspecified by a factor which depends on the covariates, as long as assumption 1 holds.
As a concrete example, consider the normal model with identity link function, which assumes that var(Y1)=…=var(Yn). If the real distribution is heteroscedastic, then the test will still be exact for finite n, since the νi are symmetric. The parametric test, however, loses its properties, e.g. because the estimated variance does not have the assumed χ2‐distribution. In Section 5 it is illustrated that our approach can be much more robust against heteroscedasticity than a parametric test.
Another example is the situation where the model is Poisson, i.e. var(Yi)=μi is assumed, but in reality var(Yi)>μi, which is a form of overdispersion which occurs very often in practice. Then the parametric score test underestimates the Fisher information and is anticonservative. To take the overdispersion into account it could be explicitly estimated. However, if the overdispersion factor is not constant, but depends on the covariates, then again the parametric test loses its properties. Theorem 1, however, often still applies, so an asymptotically exact test is obtained.
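To make this concrete, the sketch below (our illustration; the simulation settings are not taken from the paper) computes the Poisson score contributions νi = xi(yi − μi) at the hypothesised parameter for counts that are in fact negative binomial with the same mean. Their empirical mean remains close to 0, which is all that theorem 1 requires:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
beta0 = 0.0                         # hypothesised value under H0
mu = np.exp(x * beta0)              # Poisson mean under H0 (here: 1)
# Overdispersed counts: negative binomial with the same mean mu,
# but variance mu * (1 + mu) > mu (dispersion parameter 1)
y = rng.negative_binomial(n=1, p=1.0 / (1.0 + mu))
# Score contributions of the Poisson log-likelihood at beta0
nu = x * (y - mu)
# The empirical mean is near 0 despite the overdispersion
print(abs(nu.mean()))
```

A parametric Poisson score test would divide by a Fisher information computed from var(Yi)=μi, which is here too small; the flip test never uses that quantity.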
Further, note that, if the mean of Yi depends on a nuisance variable which is latent and ignored, and this latent variable is independent of Xi, then the test may still be valid. The reason is that, marginally over the latent variable, E(νi) may still be computed correctly (see, for example, Section 5.2). Such latent nuisance variables will increase the variance of Yi, however, which can pose a problem for the classical parametric score test, which needs to compute the Fisher information. When the latent variable is not independent of Xi, this usually does pose a problem for our test (even as n→∞), since E(νi) need no longer be 0 under H0.
3 Taking into account nuisance estimation
Consider independent and identically distributed (IID) pairs (Xi,Yi), i=1,…,n, where Xi is some covariate vector and Yi has a distribution which depends on the parameter of interest β and on an unknown nuisance parameter γ, which lies in a set Γ ⊆ ℝᵏ⁻¹, where k is the total number of modelled parameters. We shall discuss the issues arising from estimating γ and propose a solution, which enables us to obtain an asymptotically exact test based on score flipping. In this paper, the above model is the model that is considered by the user. It is the model that is used to compute the scores. We consider this model to be correct, unless explicitly stated otherwise, e.g. in Section 3.2. The parameter γ is part of the model that is considered by the user, so it is always modelled and estimated. For example, γ never represents ignored overdispersion.
We consider the point null hypothesis H0: β=0; testing other fixed values proceeds analogously. Generalizations to interval hypotheses and two‐sided tests can be obtained as in corollaries 1 and 2. The case that the parameter of interest is multi‐dimensional is considered in Section 4.
3.1 Asymptotically exact test
When the nuisance parameter is unknown, it needs to be estimated, which is typically done by maximizing the likelihood of the data under the null hypothesis. The distribution of the score evaluated at this estimate can be substantially different from that of the score based on the true nuisance parameters. Indeed, under the null hypothesis, the asymptotic variance of the score at the estimate is not the Fisher information, but the effective Fisher information (Rippon and Rayner (2010), Rayner (1997), Hall and Mathiason (1990), Marohn (2002) and Cox and Hinkley (1979), section 9.3), which is also the asymptotic variance of the effective score, which is defined below. The effective information is smaller than the information, given that the score for the parameter of interest and the nuisance score are correlated. Intuitively, the reason is that the nuisance variable will be used to explain part of the apparent effect of the variable of interest, also asymptotically.
The estimation of the nuisance parameter makes the summands underlying the score correlated, in such a way that the variance of their sum shrinks (if the score is correlated with the nuisance score). Note, however, that, after random flipping, the summands are not correlated anymore. This means that the variance of T1 is asymptotically smaller than the variance of Tj, 2⩽j⩽w (see the proof of theorem 2 in Appendix B.3). Hence, plugging the score contributions at the estimated nuisance parameter into the test of theorem 1 can lead to a conservative test, even as n→∞.
To make the test asymptotically exact again, we would like to adapt the individual scores such that they are less dependent on the random variation of the nuisance estimate. We do this by considering the so‐called effective score, which ‘is “less dependent” on the nuisance parameter than the usual score statistic’ (Marohn (2002), page 344).
As discussed, the score at the estimated nuisance parameter is not generally asymptotically equivalent to the score at the true nuisance parameter. The effective score, however, is the residual from the projection of the score on the space spanned by the nuisance scores. Hence the effective score is uncorrelated with the nuisance scores (Marohn (2002), page 344). Correspondingly, as noted in the proof of theorem 2, under mild regularity assumptions the effective score at the estimated nuisance parameter is asymptotically equivalent to the effective score at the true nuisance parameter, i.e. asymptotically the effective score is not affected by the nuisance estimate.
If the nuisance estimate is the maximum likelihood estimate under H0, then the nuisance score vanishes there, so the total effective score coincides with the total basic score. The individual summands of the two scores are different, however, and the key point is that their sign flipped sums differ.
Theorem 2. Consider the test of theorem 1 with the statistics Tj, 1⩽j⩽w, computed from the estimated effective score contributions. As n→∞, under H0 the probability of rejection converges to ⌊αw⌋/w⩽α.
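The projection that defines the effective score can be sketched as follows, under the simplifying assumption (ours, for illustration) of a Gaussian linear model with identity link, where each nuisance score contribution is the nuisance covariate vector times the residual. All names are illustrative:

```python
import numpy as np

def effective_score_contributions(x, Z, y, beta0=0.0):
    """Effective score contributions for beta in the Gaussian linear model
    y = x*beta + Z@gamma + noise, tested at H0: beta = beta0.

    Sketch: gamma is estimated by least squares under H0, and the score
    contributions x_i * r_i are residualised on the nuisance contributions
    Z_i * r_i, which here amounts to residualising x on the columns of Z.
    """
    # Estimate the nuisance parameter under H0 and form residuals
    gamma_hat = np.linalg.lstsq(Z, y - x * beta0, rcond=None)[0]
    r = y - x * beta0 - Z @ gamma_hat
    # Project x on the nuisance design; the effective contributions use
    # the part of x that is orthogonal to Z
    pi = np.linalg.lstsq(Z, x, rcond=None)[0]
    x_eff = x - Z @ pi
    return x_eff * r

rng = np.random.default_rng(3)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + nuisance
x = 0.5 * Z[:, 1] + rng.normal(size=n)   # correlated with the nuisance
y = Z @ np.array([1.0, 1.0]) + rng.normal(size=n)      # beta = 0 is true
nu_eff = effective_score_contributions(x, Z, y)
```

These nu_eff can then be flipped exactly as in theorem 1; because they are (asymptotically) insensitive to the nuisance estimate, the conservativeness of the basic test disappears.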
The test of theorem 2 has a parametric counterpart, which uses that, under H0, the suitably scaled effective score is asymptotically normal with zero mean and known variance: the effective information (Marohn (2002), page 341). This test is asymptotically equivalent to the test of theorem 2, as the following proposition says.
Proposition 2. Suppose that the true parameter satisfies H0. As in theorem 2, let the statistics Tj, 1⩽j⩽w, be computed from the estimated effective score contributions. Let φ be the test of theorem 2 and let φ′ be the parametric test that rejects when the effective score, standardized by the effective Fisher information, exceeds the (1−α)‐quantile Φ⁻¹(1−α) of the standard normal distribution, Φ being the cumulative distribution function of the standard normal distribution. Then the probability that φ and φ′ reach different conclusions converges to 0 as n→∞.
3.2 Robustness
In Section 2.2 it was explained that the test of theorem 1 is often robust against misspecification of the variance of the score. The test of theorem 2 is also robust against certain forms of variance misspecification. An example is the case that the weights and the scores are misspecified by the same factor; see proposition 3. This happens in particular if the variance is misspecified by a factor which is independent of the covariates.
Proposition 3. Suppose that the score vector has the generalized linear model form based on an n×k design matrix X with IID rows and a weight matrix W. Consider a misspecification factor c1>0 and suppose that the weights and the individual score contributions are both misspecified by the factor c1. Then the test of theorem 2 computed from these misspecified quantities is still asymptotically exact under H0.
Proposition 3 is useful, since it tells us that, if in a GLM var(Yi) is misspecified by a constant, such that the weights and the scores are misspecified by a constant, the resulting test is still asymptotically exact. In proposition 3 we assume that the misspecification factors of the weights and the scores are the same for all observations. This is satisfied for example when the model is binomial or Poisson, but the true distribution is respectively quasi‐binomial or quasi‐Poisson. Moreover, in practice the test can be very robust against heteroscedasticity (see Section 5). The variance misspecification is not generally allowed to depend on the covariates, since then the weights and the scores can be misspecified by different factors asymptotically. There are exceptions, however; see Sections 3.3 and 5.
When there are estimated nuisance parameters, we can sometimes nevertheless decide to use the test of theorem 1 with the basic scores plugged in (rather than using effective scores). Indeed, this test has been shown to be very robust to misspecification, as long as E(νi)=0, 1⩽i⩽n. It is asymptotically conservative if the score is correlated with the nuisance scores, i.e. when the effective score differs from the basic score. Hence, when using this test, it can be useful to redefine the covariates such that the score of interest is orthogonal to the nuisance scores (as in Cox and Reid (1987)). When the weight matrix is proportional to the identity, this means ensuring that the nuisance covariates are orthogonal to the covariate of interest. If the model is potentially misspecified, then the weights, and hence the required orthogonalization, are not asymptotically known, but the user could substitute a best guess for the weights.
3.3 An example
As an example, consider a linear model (1) with an intercept as the only nuisance parameter. Then the effective score contributions are exactly the basic score contributions after centring x1,…,xn at 0. Similarly, if x1,…,xn are already centred, then the basic and the effective score contributions coincide, since then the sample mean of x1,…,xn is 0.
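This centring equivalence, together with the fact from Section 3.1 that the total effective and basic scores coincide at the maximum likelihood estimate while their individual contributions differ, is easy to verify numerically. A sketch (ours) for a Gaussian model with only an intercept as nuisance:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(loc=2.0, size=n)   # deliberately not centred
y = 1.0 + rng.normal(size=n)      # intercept-only truth, beta = 0
r = y - y.mean()                  # residuals at the H0 maximum likelihood fit
nu_basic = x * r                  # basic score contributions
nu_eff = (x - x.mean()) * r       # effective contributions: centred x
# Totals coincide (the residuals sum to 0), individual contributions do not
print(np.isclose(nu_basic.sum(), nu_eff.sum()), np.allclose(nu_basic, nu_eff))
```

Flipping nu_basic is conservative here because x correlates with the intercept column; flipping nu_eff is not.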
The test of theorem 2 is not always asymptotically exact if the weights and the scores are misspecified by different factors. However, if the covariates x1,…,xn are centred, so that the effective score contributions coincide with the basic score contributions, then this concern does not apply anymore. The test of theorem 2 then remains asymptotically exact and reduces to the test based on the basic score. For model (1), this means that, even if the misspecification of var(Yi) depends on Xi, we obtain an asymptotically exact test.
A particular case where this principle applies is the generalized Behrens–Fisher problem, where the aim is to test equality of the means μ1 and μ2 of two populations (or to test whether μ1⩽μ2 or μ1⩾μ2). In this problem, it is assumed only that two independent samples from these populations are available, without making other assumptions such as equal variances. It is well known that this problem has no exact solution with good power under normality (Pesarin and Salmaso, 2010a; Lehmann and Romano, 2005). Under mild assumptions, we obtain an asymptotically exact test for this problem. Pesarin and Salmaso (2010a) have already suggested sign flipping residuals to solve this problem. This is equivalent to flipping scores in our linear model (1) if the covariates are centred.
4 Multi‐dimensional parameter of interest
Until now we have considered hypotheses about a one‐dimensional parameter. Here we extend our results to hypotheses about a multi‐dimensional parameter β of dimension d⩾1. Our tests are defined even if d>n, but in the theoretical results that follow we consider d fixed and let n increase to ∞. The extension to multi‐dimensional β shares important characteristics with the test for a one‐dimensional parameter, such as robustness and asymptotic equivalence with the parametric score test.
4.1 Asymptotically exact test
Our tests below are related to the existing non‐parametric combination methodology (Pesarin, 2001; Pesarin and Salmaso, 2010a,b). This is a very general permutation‐based methodology that allows combining test statistics for many hypotheses into a single test of the intersection hypothesis. Non‐parametric combination methods can be extended to the score flipping framework. Our tests below could be considered a special case of such an extension of the non‐parametric combination methodology. This special case has certain power optimality properties, which are discussed below.
Instead of requiring a matrix which converges to the inverse of the Fisher information, in our test that follows we allow replacement of the Fisher information by any random matrix converging to some non‐zero matrix V, i.e. we do not require the Fisher information to be asymptotically known, just like in the one‐dimensional case. The matrix V can be any matrix of preference, including the inverse of the Fisher information (if it is invertible), or we can take V to be the identity matrix. We shall discuss various choices of V shortly.
Theorem 3. The result of theorem 1 still applies if for 1⩽j⩽w we define Tj as the quadratic form Sj⊤V̂Sj, where Sj = gj1ν1+…+gjnνn is the sign flipped sum of the d‐dimensional score contributions and V̂ converges in probability to V.
Theorem 4 (unknown nuisance). The result of theorem 1 still applies if for 1⩽j⩽w we define Tj as in theorem 3, with the νi replaced by the estimated d‐dimensional effective score contributions.
The test of theorem 4 is asymptotically equivalent to its parametric counterpart, as proposition 4 states. In particular, if we take V̂ to be a consistent estimate of the inverse of the effective Fisher information, then the test of theorem 4 is asymptotically equivalent to the parametric score test (Hall and Mathiason (1990), page 86).
Proposition 4 (equivalence with parametric counterpart). Define Tj as in theorem 4, 1⩽j⩽w, and suppose that the true value of the parameter of interest satisfies the null hypothesis. Let φ be the test of theorem 4. Let φ′ be the parametric test that rejects when T1 exceeds qα, where qα is the (1−α)‐quantile of the distribution to which T1 converges as n→∞ under the null hypothesis. (This is the χ² distribution with d degrees of freedom if V is the inverse of the effective information matrix.) Then the probability that φ and φ′ reach different conclusions converges to 0 as n→∞.
We have seen that the test of theorem 1 is often robust against overdispersion and heteroscedasticity: as long as the score contributions have mean 0, the test is asymptotically exact, under very mild assumptions. Moreover, it is not required to estimate the Fisher information. The same applies to the multi‐dimensional extension in theorem 3.
The test that takes into account nuisance estimation (theorem 4) uses effective scores, so it does need to estimate the information. However, as in the one‐dimensional case, it can be seen that the test remains valid if the information matrix is asymptotically misspecified by a constant (as in proposition 3). Additional robustness is illustrated with simulations in Section 5.5.
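A sketch of the multi‐dimensional test (ours; function and variable names hypothetical), using the quadratic form of the flipped score sums with V taken as the identity matrix, as in the simulations of Section 5.5:

```python
import numpy as np

def flip_test_multi(nu, w=200, alpha=0.05, V=None, rng=None):
    """Sign flipping test for a d-dimensional parameter.

    nu : (n, d) array of d-dimensional score contributions.
    Uses the quadratic form T_j = S_j^T V S_j with S_j the flipped score sum.
    """
    rng = np.random.default_rng(rng)
    n, d = nu.shape
    V = np.eye(d) if V is None else V
    G = rng.choice([-1.0, 1.0], size=(w, n))
    G[0] = 1.0
    S = G @ nu                               # (w, d): flipped score sums
    T = np.einsum('jk,kl,jl->j', S, V, S)    # quadratic forms S_j^T V S_j
    k = w - int(np.floor(alpha * w))         # equals ceil((1 - alpha) * w)
    return T[0] > np.sort(T)[k - 1]

rng = np.random.default_rng(5)
nu0 = rng.normal(size=(300, 3))   # mean-zero contributions: H0 true
nu1 = nu0 + 0.3                   # shifted contributions: H0 false
rejected = flip_test_multi(nu1, rng=0)
```

No inverse of an estimated d×d information matrix is needed when V=I, which is relevant for the d>n case mentioned above.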
4.2 Connection with the global test
The test of theorem 3 is related to the global test, which was developed in Goeman et al. (2004, 2006, 2011). We can combine the global test with the score flipping approach. In certain cases, the resulting test coincides with the test of theorem 3.
For GLMs with canonical link, statistic (2) coincides with the statistic of theorem 3. Thus, it immediately follows from our results that the global test can be combined with our sign flipping approach, leading to a test which becomes asymptotically exact as n→∞ and asymptotically equivalent to its parametric counterpart: the original global test (by proposition 4). Combining the global test with sign flipping is useful in the light of our robustness results. Moreover, the sign flipping variant can be combined with existing permutation‐based multiple‐testing methodology (Westfall and Young, 1993; Hemerik and Goeman, 2018b; Hemerik et al., 2019).
Goeman et al. (2006) provided results on the power properties of the global test as depending on the choice of Σ. Since the global test is asymptotically equivalent to its sign flipping counterpart, these results can be used as recommendations on the choice of the matrix in theorem 3. In particular, according to Goeman et al. (2006), section 8, taking the identity matrix leads to good power if we expect that relatively much of the variance of Y is explained by the large variance principal components of the design matrix. If this is not so, taking an estimate of the inverse of the Fisher information (if invertible) can provide better power. In general, the global test has optimal power on average (over β) in a neighbourhood of the null hypothesis that depends on Σ (Goeman et al., 2006). Hence the same holds asymptotically for the test of theorem 3, for GLMs with canonical link.
5 Simulations
To compare the tests in this paper with each other and existing tests, we applied them to simulated data. In particular we considered scenarios where the model was misspecified. Simulations with a multi‐dimensional parameter of interest are in Section 5.5.
5.1 Overdispersion, heteroscedasticity and estimated nuisance
In Sections 5.1 and 5.2 the model assumed was Poisson, but in fact Y1,…,Yn were drawn from a negative binomial distribution.
The null hypothesis was H0: β=0. In Section 5.1 the linear predictor contained the covariate of interest Xi and a nuisance covariate Zi. The coefficient γ0 of Zi and the intercept 0 were nuisance parameters that were estimated by maximum likelihood under H0. We took γ0=1 and ρ(Xi,Zi)=0.5. We took the dispersion parameter of the negative binomial distribution to be 1, so that var(Yi)=μi(1+μi).
The model assumed, however, was Poisson, i.e. var(Yi)=μi was assumed. Thus the true variance was larger than the assumed variance and the variance misspecification factor depended on μi, i.e. on the covariate Zi. The assumed log‐link function was correct and in Section 5.1 the linear predictor was correct as well.
In Fig. 1 the estimated rejection probabilities of four tests under H0 are compared, based on 5000 repeated simulations. In all simulations the tests were two sided.

[Fig. 1. Estimated rejection probabilities under H0 of four tests (parametric; sandwich; flip basic; flip effective): (a) β=0, n=50; (b) β=0, n=200.]
One of the tests that was considered was the parametric score test. Since the model assumed was Poisson, the computed Fisher information was too small and the test was anticonservative.
We also applied a Wald test, where we used a sandwich estimate (Agresti (2015), page 280) of the variance of , to correct for the misspecified variance function. We used the R package gee (Carey et al., 2019) for this (available on the Comprehensive R Archive Network), specifying blocks of size 1. As can be seen in Fig. 1, this test was quite anticonservative (especially for small α, e.g. α=0.01). This was in particular due to the estimation error of the sandwich (Boos, 1992; Freedman, 2006; Maas and Hox, 2004; Kauermann and Carroll, 2000).
Further, we applied the sign flipping test based on the basic scores. Because of the estimation of γ0, the variance of the score was shrunk and the test was conservative, as explained in Section 3.1. In the simulations under H0 we took w=200. Taking w larger led to a very similar level (see also Marriott (1979)). In the power simulations we took w=1000.
Finally, we used the sign flipping test of theorem 2, which is based on the effective scores. In Section 3.2 it has already been shown that this test is asymptotically exact under constant variance misspecification. Here, however, the variance misspecification factor was 1+μi (i.e. it depended on Zi). Nevertheless the rejection probability under H0 was approximately α. This illustrates that the test has some additional robustness, which we have not theoretically shown.
5.2 Ignored nuisance
The same simulations were performed as in Section 5.1, but with an additional covariate in the true linear predictor. Since the model of Section 5.1 was still the model assumed, this covariate represented an ignored latent variable. Fig. 2 shows similar results to those in Fig. 1. The parametric test was even more anticonservative than in Section 5.1. The reason is that the introduction of the latent variable increased the variance of Yi, so the variance of the score was even more misspecified than in Section 5.1.

[Fig. 2. Estimated rejection probabilities under H0 of four tests (parametric; sandwich; flip basic; flip effective): (a) β=0, n=50; (b) β=0, n=200.]
The test of theorem 2 was still nearly exact for n=200, even though μi was misspecified. (Even marginally over the latent variable, μi was misspecified. Possibly the estimation of the intercept corrected for the misspecification.)
A conclusion from the simulations of Sections 5.1 and 5.2 is that the sandwich‐based approach should not always be seen as the most reliable way of testing models with misspecified variance functions. Indeed, in our simulations the test of theorem 2 was substantially less anticonservative (while having similar power; see Section 5.3).
5.3 Power
For a meaningful power comparison of the four tests, we considered the scenario where the model assumed was correct, i.e. the data distribution was Poisson and the latent effect was 0 (Fig. 3). The estimated probabilities are based on 2×10⁴ simulation loops.

[Fig. 3. Estimated power of four tests (parametric; sandwich; flip basic; flip effective): (a) β=0.2, n=50; (b) β=0.2, n=200.]
Since the model was correct, asymptotically there was no better choice than the parametric test. The sign flipping test of theorem 2 had very similar power. The basic sign flipping test was again conservative because of the estimation of the nuisance parameters. The sandwich‐based test had the most power but was anticonservative (the null behaviour is not shown).
5.4 Strong heteroscedasticity
When a Gaussian linear model with mean μ and variance σ² is considered, with x1=…=xn=1 and null hypothesis H0: μ=0, the score contributions under H0 are νi=xi(Yi−0)/σ²=Yi/σ². Thus the test of theorem 1 simply flips the observations Yi, 1⩽i⩽n. The parametric counterpart of this test is the one‐sample t‐test. The t‐test needs to estimate the nuisance parameter σ²; the sign flipping test does not (simply substitute σ=1).
We simulated strongly heteroscedastic data: we took Yi ∼ N(0, σi²) with σi= exp (i), 1⩽i⩽n=10. Consequently the t‐statistic did not have the distribution assumed and under H0 the rejection probability of the t‐test was far from the nominal level for most α: Fig. 4(a). The sign flipping test did not need to estimate the variance. In this setting the test has rejection probability ⌊αw⌋/w exactly if the transformations g1,…,gw are drawn without replacement, since the observations are symmetric; see proposition 1. (We drew g1,…,gw with replacement for convenience, but this gives almost the same test as drawing without replacement, because of the small probability of ties.)

[Fig. 4. Estimated rejection probabilities of the t‐test and the sign flipping test (the null hypothesis was H0: μ=0): (a) μ=0, strong heteroscedasticity; (b) μ=0.5, correct model.]
For a meaningful power comparison, we considered the correct homoscedastic model with σ1=…=σ10=1. Fig. 4(b), based on 10⁵ repeated simulations, shows that the tests had virtually the same power.
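The one‐sample setting of this section is easy to reproduce. In the sketch below (ours; σi = exp(i) as in the text), flipping the observations themselves is the sign flipping score test, since νi is proportional to Yi:

```python
import numpy as np

def flip_one_sample(y, w=500, alpha=0.05, rng=None):
    """Two-sided sign flipping one-sample test of H0: mu = 0.

    Flipping the observations y_i is equivalent to flipping the score
    contributions y_i / sigma^2 of the Gaussian model.
    """
    rng = np.random.default_rng(rng)
    G = rng.choice([-1.0, 1.0], size=(w, len(y)))
    G[0] = 1.0
    T = np.abs(G @ y)                  # two-sided statistics |T_j|
    k = w - int(np.floor(alpha * w))   # equals ceil((1 - alpha) * w)
    return T[0] > np.sort(T)[k - 1]

# Strongly heteroscedastic data as in the text: sigma_i = exp(i), n = 10
rng = np.random.default_rng(6)
sigma = np.exp(np.arange(1, 11))
y = rng.normal(size=10) * sigma        # H0: mu = 0 is true
level_ok = flip_one_sample(y, rng=0)
```

Each Yi is symmetric about 0 under H0, so by proposition 1 the level is controlled for this small n, whereas the t‐statistic's distribution is far from the assumed one.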
5.5 Multi‐dimensional parameter of interest
We considered the same setting as in Section 5.2, except that β and the estimated nuisance parameter were five dimensional (so d=5). All corresponding covariates were correlated (ρ=0.5). There was an ignored nuisance covariate as before, which was uncorrelated with the other covariates. Thus there were in total 11 covariates. We took the overdispersion such that var(Yi)=μi(1+μi), i.e. the overdispersion again depended on the covariates (heteroscedasticity).
Instead of the basic score test for a one‐dimensional parameter we now used the multi‐dimensional extension in theorem 3. Similarly, instead of the test of theorem 2 based on effective scores, we used the multi‐dimensional extension in theorem 4. We took the matrix V to be the identity matrix.
In Sections 5.1 and 5.2 we compared our tests with a Wald test based on a sandwich estimate of the variance of the estimate of β. Here we proceeded analogously, using a sandwich estimate of the 5×5 covariance matrix in the multi‐dimensional Wald test. This test uses that the Wald statistic asymptotically has a χ² distribution with 5 degrees of freedom under the null hypothesis H0: β=0.
The results under H0 are shown in Fig. 5, where each plot is based on 10⁴ simulation loops. They are comparable with those in Section 5.2, except that the sandwich‐based method is now even more anticonservative. This is because the sandwich estimate is now a 5×5 matrix, which is difficult to estimate accurately. For n=50 and α=0.01, the rejection probability of the sandwich‐based method was 0.27 instead of the required 0.01.

[Fig. 5. Estimated rejection probabilities under H0 of four tests (parametric; sandwich; flip basic; flip effective): (a) β=0, n=50; (b) β=0, n=200.]
For a meaningful power comparison of the four tests, we again considered the scenario where the model assumed was correct, i.e. the data distribution was Poisson and the latent effect was 0. See Fig. 6, where each plot is based on 10⁴ simulation loops. As usual, the sign flipping test based on basic scores had low power due to nuisance estimation. The power of the sign flipping test based on effective scores was comparable with that of the parametric score test. As in Section 5.3, the test based on a sandwich estimate was the most powerful, but this has limited meaning, since it was also quite anticonservative under the correct model (the plot is not shown).

Fig. 6. Power of the four tests (parametric; sandwich; flip basic; flip effective): (a) n=50; (b) n=200
To conclude, sign flipping provided much more reliable type I error control than the sandwich approach, while giving satisfactory power (comparable with that of the parametric test, under the correct model).
6 Data analysis
We analysed the data set warpbreaks. These data are used in the example code of the gee R package, available on the Comprehensive R Archive Network. The data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. There are 54 observations of three variables: the number of breaks, the type of wool (A or B) and the tension (low, medium or high). For each of the six possible combinations of wool and tension, there are nine observations. Using various methods, we tested whether the number of breaks depends on the type of wool.
However, the data were clearly overdispersed: for each combination of wool and tension, the empirical variance of the nine observations was substantially larger than the empirical mean. Thus the p‐value based on the parametric test had limited meaning. Fitting a quasi‐Poisson model, which assumes constant overdispersion, gave a p‐value of 0.059.
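The constant‐overdispersion assumption of the quasi‐Poisson model can be made concrete with the Pearson dispersion estimate, the Pearson chi‐squared statistic divided by the residual degrees of freedom; quasi‐Poisson inference simply scales the model‐based covariance by this quantity. A minimal sketch with toy counts (not the warpbreaks data):

```python
import numpy as np

def pearson_dispersion(y, mu, n_params):
    """Quasi-Poisson dispersion estimate: Pearson chi-square over residual d.f."""
    return float(np.sum((y - mu) ** 2 / mu) / (len(y) - n_params))

# Toy counts whose variance clearly exceeds their mean (overdispersion):
y = np.array([0.0, 10.0, 1.0, 9.0, 0.0, 10.0])
mu = np.full(6, y.mean())                      # fitted means, intercept-only model
phi = pearson_dispersion(y, mu, n_params=1)    # phi > 1 indicates overdispersion
```

A value well above 1, as for the warpbreaks counts, signals that Poisson standard errors are too small by roughly a factor √φ.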
As in Section 5, we also applied a Wald test, using a sandwich estimate (Agresti (2015), page 280) of the variance of the coefficient estimate, to correct for the misspecified variance function. This resulted in a p‐value of 0.048.
Further, we used the sign flipping test based on the 54 basic score contributions (still using the basic Poisson model). We took w=10⁶ sign flipping maps. This resulted in a p‐value of 0.113. This test is quite robust to model misspecification, but we know that it tends to be conservative when the score of interest is correlated with the nuisance scores, as was the case here.
Finally, we performed the test of theorem 2 based on the effective score. This test is asymptotically exact under the correct model and has been shown to be robust against several forms of variance misspecification. It provided a p‐value of 0.065.
On the basis of this evidence, at a significance level of 0.05, it seems that we cannot reject H0. Indeed, only the sandwich‐based test provided a p‐value below 0.05, but that test is often anticonservative, as discussed in Section 5.1.
7 Discussion
We have proposed a test which relies on the assumption that, under the null hypothesis, the individual score contributions are independent and have mean 0 (in the case of a point hypothesis). If the score contributions are misspecified because of overdispersion, heteroscedasticity or ignored nuisance covariates, then traditional parametric tests lose their properties. The sign flipping test is often robust to these types of misspecification and can still be asymptotically exact.
When nuisance parameters are estimated, the basic score contributions become dependent. If a nuisance score is correlated with the score of the parameter of interest, the estimation reduces the variance of the score, so the sign flipping test becomes conservative. As a solution we propose to use the effective score, which is asymptotically the part of the score that is orthogonal to the nuisance score. The effective score is asymptotically unaffected by the nuisance estimation, so we again obtain an asymptotically exact test. We have proved that this is still the case when the scores and the Fisher information are misspecified by a constant, and simulations illustrate additional robustness.
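The orthogonalization underlying the effective score can be mimicked empirically: regress the score contributions of interest on the nuisance score contributions and keep the residuals. The following is only a sketch of that idea with made‐up inputs; the construction in the paper uses blocks of the Fisher information rather than this sample‐level least‐squares projection.

```python
import numpy as np

def empirical_effective_scores(nu_beta, nu_gamma):
    """Residualize scores of interest against nuisance score contributions.

    nu_beta : (n, d) score contributions of the tested parameter
    nu_gamma: (n, k) nuisance score contributions
    Returns contributions empirically orthogonal to the nuisance scores.
    """
    coef, *_ = np.linalg.lstsq(nu_gamma, nu_beta, rcond=None)
    return nu_beta - nu_gamma @ coef

rng = np.random.default_rng(2)
nu_gamma = rng.standard_normal((100, 3))
# scores of interest deliberately correlated with the nuisance scores:
nu_beta = nu_gamma @ rng.standard_normal((3, 2)) + rng.standard_normal((100, 2))
eff = empirical_effective_scores(nu_beta, nu_gamma)
```

After the projection, the contributions carry only the part of the score of interest that the nuisance estimation cannot absorb, which is why sign flipping them avoids the conservativeness of the basic test.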
When the parameter of interest is multi‐dimensional, our test statistic involves a freely chosen matrix, which influences the power properties. If this matrix is taken to be the inverse of the effective Fisher information and the model assumed is correct, then our test is asymptotically equivalent to the parametric score test. Under the correct model, in certain situations our test is asymptotically equivalent to the global test (Goeman et al., 2006), which is popular for testing hypotheses about high dimensional parameters.
Acknowledgement
We thank two referees for comments that helped to improve the paper.
Appendix A: A lemma
Lemma 1. Suppose that, for n→∞, a vector converges in distribution to a vector T of IID continuous variables. Then
Proof. Note that , where
Let ∂A be the boundary of A, i.e. the set of discontinuity points of . If t ∈ ∂A, then ti=tj for some 1⩽i<j⩽w. It follows that . Since is continuous on (∂A)c, it follows from the continuous mapping theorem (Van der Vaart (1998), theorem 2.3) that
The elements of T are IID draws from the same distribution. Hence it follows from the Monte Carlo testing principle (Lehmann and Romano, 2005) that, under H0, . Thus
Appendix B: Proofs of the results
B.1. Proof of theorem 1
Suppose that H0 holds. We shall show that converges in distribution to a multivariate normal distribution with mean 0 and variance , where I is the w×w identity matrix. It then follows from lemma 1 that .
We have shown that converges in distribution to a vector T, say, of IID normal random variables. It now follows from lemma 1 that
B.2. Proof of proposition 1
B.3. Proof of theorem 2
Let be as in the proof of theorem 1, with νi replaced by . Suppose that H0 holds and , so that the summands underlying are independent. For every 1⩽i⩽n, . The elements of are uncorrelated and have common variance . By the multivariate central limit theorem (Van der Vaart, 1998; Greene, 2012), converges in distribution to . We supposed that in order to use the central limit theorem, but the asymptotic distribution of is the same if is any consistent estimator of .
Let be as in the proof of theorem 1, with νi replaced by . For every 1⩽j⩽w, . Thus and are asymptotically equivalent. The result now follows from lemma 1.
B.4. Proof of proposition 2
The distribution of , which does not depend on w, converges to a continuous distribution as n→∞. It follows that, for every ε″>0, there are W′ and N such that, for all w>W′ and n>N, . This means that , as was to be shown.
B.5. Proof of proposition 3
B.6. Proof of theorem 3
B.7. Proof of theorem 4
Consider the case . As in the proof of theorem 3, under H0, converges in distribution to a vector of w IID variables. As in the proof of theorem 2, the same is true if we take to be a different √n‐consistent estimator of . (Again, the reason is that the effective score based on is asymptotically equivalent to the effective score based on .) The result now follows from lemma 1 again.
B.8. Proof of proposition 4
By Hall and Mathiason (1990), has an asymptotic distribution under . Analogously to the one‐dimensional case of proposition 2, for 2⩽j⩽w, the vector is asymptotically the difference of two mutually independent vectors (Hall and Mathiason, 1990), so it also has an asymptotic distribution (under ). As in the proof of theorem 3, by the multivariate central limit theorem, the d×(w−1) matrix converges to a matrix with w−1 independent columns as n→∞. Hence, by the continuous mapping theorem, as n→∞, converges in distribution to a vector of w−1 IID variables (under ), which follow the asymptotic distribution that holds under .
The result now follows as at the end of the proof of proposition 2.




