Goodness-of-fit testing in high-dimensional generalized linear models

We propose a family of tests to assess the goodness-of-fit of a high-dimensional generalized linear model. Our framework is flexible and may be used to construct an omnibus test, or tests directed at specific non-linearities and interaction effects, or tests of the significance of groups of variables. The methodology is based on extracting left-over signal in the residuals from an initial fit of a generalized linear model. This can be achieved by predicting this signal from the residuals using modern flexible regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals and our test statistic has a Gaussian limiting distribution, translating to asymptotic control of type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness-of-fit in logistic regression models. Software implementing the methodology is available in the R package 'GRPtests'.


Introduction
In recent years, there has been substantial progress in developing methodology for estimation in generalized linear models in high-dimensional settings, where the number of covariates in the model may be much larger than the number of observations. A standard technique for estimation is the Lasso for generalized linear models (Park and Hastie, 2007), which has a fast implementation in the R package glmnet (Friedman et al., 2010) and is widely used. The Lasso enjoys good empirical and theoretical properties for estimation and variable selection, provided that we are searching for a sparse approximation to the regression coefficients in the generalized linear model.
Once a generalized linear model has been fitted to the high-dimensional data, it is important to assess the quality of the fit. Literature on testing goodness-of-fit in low-dimensional settings is extensive: we refer to Section 1.2 below for an overview. However, the methods typically rely on properties that only hold in low-dimensional settings, such as asymptotic linearity and normality of the maximum likelihood estimator. These may fail to hold with an increasing number of covariates in the model; as a consequence, it is typically not possible to extend these approaches in an obvious way to the high-dimensional setting. This motivates us to develop a new method that may be used for detecting misspecification in the fit of a (potentially high-dimensional) generalized linear model.
To fix ideas, suppose we have data (x_i, Y_i)_{i=1}^n formed of feature vectors x_i ∈ R^p and univariate responses Y_i ∈ Y ⊆ R. Let us write X = [x_1, …, x_n]^T = [X_1, …, X_p] = (X_{ij})_{1≤i≤n, 1≤j≤p} for the n × p design matrix and Y = (Y_1, …, Y_n)^T for the vector of responses. Consider a generalized linear model (McCullagh and Nelder, 1989) for the data. Specifically, consider the setting where the Y_i are independent conditional on X, and the conditional distribution of Y_i depends on X only through the linear combination x_i^T β_0 for some coefficient vector β_0 ∈ R^p. In particular, for the conditional expectation, this implies a structure of the form E(Y_i | x_i = x) =: m_0(x) = µ(x^T β_0); the function µ(·) is a known inverse link function and β_0 is unknown. Moreover, we assume that var(Y_i | x_i = x) = µ′(x^T β_0). This structure of the variance arises in generalized linear models derived from exponential families with canonical link functions, such as logistic regression or Poisson log-linear models.
We will focus on the detection of misspecification in the conditional mean function. In a low-dimensional setting, we understand that the model is misspecified in the conditional mean when there does not exist a β_0 ∈ R^p such that m_0(x) = µ(x^T β_0). In a high-dimensional setting where p ≥ n, this concept becomes more complicated at first sight; for example, with fixed design points x_1, …, x_n, there always exists β_0 ∈ R^p such that m_0(x_i) = µ(x_i^T β_0) for all i = 1, …, n, meaning that the model can never be misspecified. However, in a high-dimensional setting, it is impossible to estimate β_0 consistently without additional structural assumptions. An assumption that is often used, and which we adopt in this paper, is sparsity of the model. Therefore, we address the question of whether a sparse model fits well to the observations, or whether a (sparse) non-linear model is more appropriate. If we restrict ourselves to sparse models, then misspecification can happen in the same way as in low-dimensional settings, even for fixed design. Some of the most important types of misspecification that are of interest in applications are missing nonlinear terms such as quadratic effects or interaction terms. Examples of generalized linear models that may be covered by our framework include logistic regression, Poisson regression, robust regression (Huber loss, Cauchy loss) and linear regression.

Overview of our contributions
We now briefly outline our strategy for goodness-of-fit testing; a more detailed description is given in Section 2. Let β̂ be an estimate of β_0 derived from a Lasso-penalised generalized linear model (GLM Lasso) fit. Our starting point is the vector R of Pearson residuals, with i-th coordinate R_i := (Y_i − µ(x_i^T β̂)) / √(µ′(x_i^T β̂)), i = 1, …, n.
Now consider taking as a test statistic the scalar product w^T R, for some (fixed) unit vector w ∈ R^n.
If the generalized linear model were correct, then w^T R would be approximately an average of zero-mean random variables, and under reasonable conditions, should converge to a centred Gaussian random variable. On the other hand, if the model were misspecified, the residuals would contain some signal, and were w to be positively correlated with this signal, the lack of fit should be exposed by the test statistic taking a large value.
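To make this concrete, here is a minimal numerical sketch (in Python with numpy, rather than the paper's R implementation) of the null behaviour of w^T R: we simulate a logistic regression model, compute Pearson residuals, and form the statistic for a fixed unit direction w chosen independently of Y. For simplicity we plug in the true β_0 where the method would use an estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta0 = np.zeros(p)
beta0[:3] = 1.0

mu = lambda t: 1.0 / (1.0 + np.exp(-t))   # inverse logit link
eta = X @ beta0
Y = rng.binomial(1, mu(eta))

# Pearson residuals; here the true beta0 stands in for an estimate
fitted = mu(eta)
R = (Y - fitted) / np.sqrt(fitted * (1 - fitted))  # variance is mu'(eta)

# fixed unit direction w, chosen independently of Y
w = rng.normal(size=n)
w /= np.linalg.norm(w)
T = w @ R   # approximately N(0, 1) under the null
print(round(float(T), 3))
```

Repeating this over many simulated datasets, the histogram of T is close to standard normal, which is the basis for calibration under the null.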
In the alternative setting, the signal in the residuals may be picked up by more flexible regression methods, such as random forests (Breiman, 2001) or boosted trees (Chen and Guestrin, 2016). However, using such flexible regressions to inform the choice of w directly would make w strongly dependent on R even under the null; as such, calibration of the resulting test statistic would be problematic. Our approach is therefore to construct w based on an independent auxiliary dataset (X_A, Y_A) (e.g. derived through sample splitting) in the following way. We first perform a GLM Lasso fit on the auxiliary dataset to obtain an additional set of residuals. Regressing these residuals back on to the explanatory variables X_A using a flexible regression method, we obtain an estimated regression function f̂ : R^p → R that aims to predict the signal in the residuals; we refer to the n-fold concatenation of such an f̂ as a residual prediction function f̂ : R^{n×p} → R^n. We may then choose w proportional to f̂(X) to give a direction w independent of Y.
One important issue that arises in the high-dimensional setting is that although the components of R are close to zero-mean under the null, their bias can drive a substantial shift in the mean of w^T R. To prevent this, we replace w with the residuals from a particular weighted (square-root) Lasso regression of w on to X. This final step ensures that w is almost orthogonal to the bias in the residuals and, as a consequence, the limiting distribution under the null is a centred Gaussian. A notable feature of our construction is that the asymptotic null distribution is essentially invariant to the residual prediction method used. This can therefore be as flexible as needed to detect the type of mean misspecification we would like to uncover.
We provide a software implementation of our methodology in the R package GRPtests (Janková et al., 2019).

Related literature
High dimensions. Our work is related to that of Shah and Bühlmann (2018) who study goodness-of-fit tests for the linear model. They consider test statistics based on a proxy for the prediction error of a flexible regression method applied to the scaled residuals following a square-root Lasso fit to the data. It is shown that when a Gaussian linear model holds, these residuals depend only weakly on the unknown regression coefficients, motivating calibration of the tests via a parametric bootstrap. As there is no analogue of this result for other generalized linear models, it does not seem possible to extend this approach to our more general setting. Our methodology shares the idea of 'predicting' the residuals but, even when we specialize our approach to the Gaussian linear model, differs substantially in the construction of test statistics and the form of calibration.
In recent years, there has been much work on inference and testing in high-dimensional generalized linear models, particularly for the linear model. The work on significance testing includes flexible approaches based on (multiple) sample-splitting (Wasserman and Roeder, 2009; Meinshausen et al., 2009; Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013) which may be combined with other methods. Another line of work, initiated by Zhang and Zhang (2014), proposes a method of de-biasing the Lasso that can be used for testing significance of variables in the linear regression. van de Geer et al. (2014) extend the methodology to generalized linear models; further developments include Javanmard and Montanari (2014), Dezeure et al. (2017) and Yu et al. (2018); see also Belloni et al. (2014). General frameworks for testing low-dimensional hypotheses about the parameter can be based for example on Neyman orthogonality conditions (Chernozhukov et al., 2015, 2018) or on a profile likelihood testing framework (Ning et al., 2017). In recent work, Zhu and Bradic (2017) propose a method for testing more general hypotheses about the parameter vector, such as the sparsity level of the model parameter and minimum signal strength. Javanmard and Lee (2017) suggest a procedure to test similar hypotheses about the parameter in linear or logistic regression.
Low dimensions. There are numerous methods for testing goodness-of-fit of a model in low-dimensional settings, especially for the case of logistic regression, which is one of the focuses of this work. The most standard tests are residual deviance and Pearson's chi-squared tests; however, they behave unsatisfactorily if the data contain only a small number of observations for each pattern of covariate values. There have been a number of strategies to circumvent this difficulty, mainly based on grouping strategies, residual smoothing or modifications of Pearson's chi-squared test. Hosmer and Lemeshow (1980) proposed two methods of grouping based on ranked estimated logistic probabilities that form groups of equal numbers of subjects. The disadvantage of these tests (as noted in Le Cessie and Van Houwelingen (1991)) is that as they are based on a grouping strategy in the space of responses, they lack power to detect departures from the model in regions of the covariate space that yield the same estimated probabilities. For example, a model with a quadratic term may have very different covariate values with the same estimated probability. Tsiatis (1980) circumvents the difficulties faced by Hosmer-Lemeshow tests using a grouping strategy in the covariate space. However, different partitions of the space of covariates may still lead to substantially different conclusions.
Le Cessie and Van Houwelingen (1991) introduced a test based on residual smoothing using nonparametric kernel methods. Smoothed residuals replace each residual with a weighted average of itself and other residuals that are close in the covariate space. If residuals close to each other are strongly correlated, smoothing does not affect the magnitude of the residuals strongly, while if they are not correlated, smoothing will shrink the residuals towards zero. Su and Wei (1991) proposed a goodness-of-fit test for the generalized linear model based on a cumulative sum of residuals, which was later adapted by Lin et al. (2002); a weighted version was proposed in Hosmer and Hjort (2002). Another approach based on modifications of Pearson's chi-squared test was studied in Osius and Rojek (1992) and Farrington (1996), who derived large-sample normal approximations for Pearson's chi-squared test statistic.

Organization and notation
The rest of the paper is organised as follows. In Section 2 we motivate and present our goodness-of-fit testing methodology. In Section 3, we study its theoretical properties, providing guarantees on the type I error and power. In Section 4, we illustrate the empirical performance of the method on simulated and semi-real genomics data. A brief discussion is given in Section 5. Proofs are deferred to Section 6 and Appendix A.
For a vector x ∈ R^d, we let x_j denote its j-th entry and write ‖x‖_p := (Σ_{j=1}^d |x_j|^p)^{1/p} for p ∈ N, ‖x‖_∞ := max_{j=1,…,d} |x_j| and ‖x‖_0 for the number of non-zero entries of x. For a matrix A ∈ R^{n×p}, we use the notation A_{ij} or (A)_{ij} for its (i, j)-th entry, A_j to denote its j-th column and we let ‖A‖_∞ := max_{i,j} |A_{ij}|. Letting G ⊆ {1, …, p}, we denote by A_G the matrix containing only the columns of A whose indices are in G, and by A_{−G} the columns of A whose indices are in the complement of G. We use Λ_min(A) and Λ_max(A) to denote the minimum and maximum eigenvalues of a square matrix A.
For sequences of random variables X_n, Y_n, we write X_n = O_P(Y_n) if X_n/Y_n is bounded in probability and X_n = o_P(1) if X_n converges to zero in probability. We write a ≲ b to mean that there exists C > 0, which may depend on other quantities designated as constants in our assumptions, such that a ≤ Cb. If a ≲ b and b ≲ a, we write a ≍ b. Finally, for a function f : R → R and a vector z = (z_1, …, z_n) ∈ R^n, we will use f(z) to denote the coordinate-wise application of f to z, that is f(z) = (f(z_1), …, f(z_n)).

Methodology: Generalized Residual Prediction tests
As discussed in Section 1.1, our Generalized Residual Prediction (GRP) testing methodology relies on an initial fit of the Lasso for generalized linear models, which is defined by β̂ := argmin_{β ∈ R^p} { (1/n) Σ_{i=1}^n ρ(Y_i, x_i^T β) + λ‖β‖_1 }. Here ρ : Y × R → R is a loss function, usually derived from the negative log-likelihood associated with the model. Our general framework for goodness-of-fit testing will also assume we have available an auxiliary dataset (X_A, Y_A) ∈ R^{n_A × p} × Y^{n_A} independent of (X, Y), sharing the same conditional distribution structure as that of (X, Y). In the rest of the paper, we take n_A = n for simplicity, although this is not needed for our procedures. Consider the Pearson-type residuals R_i := (Y_i − µ(x_i^T β̂)) / √(µ′(x_i^T β̃)), i = 1, …, n.
Here β̃ ∈ R^p is an additional estimate of β_0 that may be computed using the auxiliary dataset, or in certain circumstances may be taken as β̂ itself: we discuss these two cases in the following sections. Given the vector R of residuals, the basic form of our test statistic is w^T R; here w ∈ R^n is a direction typically derived using the auxiliary dataset. We describe in detail the construction of such a w in Section 2.1, where the goal is general goodness-of-fit testing. A further modification of the method can allow us to use multiple directions w to test simultaneously for different departures from the null or to aggregate over different directions derived using flexible regression methods with different tuning parameters. Given a set W ⊆ R^n of direction vectors w, our proposed test statistic then takes the form sup_{w ∈ W} w^T R.
We illustrate the use of this more general form of our test statistic for testing the significance of a group of variables. Such a problem may not immediately seem like goodness-of-fit testing, but is equivalent to testing the adequacy of a model not involving the group of variables in question. We explain how this may be addressed by our framework in Section 2.2 and describe a wild bootstrap procedure (Chernozhukov et al., 2013) to approximate the distribution of the test statistic under the null.

Goodness-of-fit testing
To motivate our general procedure for goodness-of-fit testing, consider the vector R_ora of Pearson residuals with an oracle variance scaling, whose i-th component is given by R_{ora,i} := (Y_i − µ(x_i^T β̂)) / D_{β_0,ii}, where D²_{β_0,ii} := µ′(x_i^T β_0). We may decompose the residuals into noise and estimation error terms by writing R_{ora,i} = ε_i + r_i, (1) where ε_i := (Y_i − µ(x_i^T β_0))/D_{β_0,ii} and r_i := (µ(x_i^T β_0) − µ(x_i^T β̂))/D_{β_0,ii}. If the generalized linear model is correct, then E(ε_i | x_i) = 0 and Var(ε_i | x_i) = 1. Turning to the remainder term r_i, a first-order Taylor expansion of µ yields the approximation r_i ≈ −D_{β_0,ii} x_i^T (β̂ − β_0). (2) Writing D_{β_0} for the diagonal matrix with entries D_{β_0,ii} for i = 1, …, n, and ε := (ε_1, …, ε_n)^T, we obtain the decomposition R_ora ≈ ε − D_{β_0} X(β̂ − β_0). Consider a unit vector w ∈ R^n; as discussed in Section 1.1, this will typically be constructed from an application of a residual prediction method on the auxiliary data. Our oracle test statistic w^T R_ora then satisfies w^T R_ora ≈ w^T ε + δ, where δ := −w^T D_{β_0} X(β̂ − β_0). (3) Under suitable conditions on w and on the moments of the errors, the Berry-Esseen theorem should ensure that the pivot term w^T ε is well approximated by a standardised Gaussian random variable.
To keep the remainder term δ in (3) under control, we can leverage the fact that, under the null, we can expect ‖β̂ − β_0‖_1 to be small. If w satisfies the near-orthogonality condition ‖X^T D_{β_0} w‖_∞ ≤ C√(log p) (4) for some C > 0, then Hölder's inequality yields |δ| ≤ C√(log p) ‖β̂ − β_0‖_1, which asymptotically vanishes under suitable conditions on the sparsity of β_0.
To guarantee the near-orthogonality condition (4), we may use the square-root Lasso (Belloni et al., 2011; Sun and Zhang, 2012): for λ_sq > 0, let β̂_ora-sq := argmin_{b ∈ R^p} { ‖f̂(X) − D_{β_0} X b‖_2 / √n + λ_sq ‖b‖_1 }. The Karush-Kuhn-Tucker (KKT) conditions for this convex programme imply that the resulting vector of scaled residuals, w_ora := (f̂(X) − D_{β_0} X β̂_ora-sq) / ‖f̂(X) − D_{β_0} X β̂_ora-sq‖_2, satisfies the near-orthogonality property ‖X^T D_{β_0} w_ora‖_∞ ≤ C√(log p) when λ_sq = C√(log p/n). Note that in performing this square-root Lasso regression, we are not assuming that f̂ is well-approximated by a sparse linear combination of variables: we are simply exploiting the stationarity properties of the solution to the square-root Lasso optimisation problem.
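The role of the KKT conditions here can be illustrated numerically. The sketch below (Python/numpy, for illustration only; the paper's procedure uses a weighted square-root Lasso rather than the plain Lasso shown) solves a standard Lasso by coordinate descent and verifies the analogous stationarity property: the residual correlations with all columns are capped at the regularisation level, which is the mechanism behind condition (4).

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for (1/2n)||y - Xb||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    c = (X ** 2).sum(axis=0) / n        # per-coordinate curvature ||X_j||^2 / n
    r = y.copy()                         # current residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]          # remove coordinate j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / c[j]  # soft-threshold
            r -= X[:, j] * b[j]
    return b, r

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.ones(3) + rng.normal(size=n)
lam = 0.2
b_hat, resid = lasso_cd(X, y, lam)

# KKT stationarity: the residual is nearly orthogonal to every column,
# with correlations capped at the penalty level lam
kkt = np.max(np.abs(X.T @ resid)) / n
print(kkt <= lam + 1e-6)
```

The same cap, applied with weighted columns D X and penalty λ_sq √n, is exactly what delivers ‖X^T D w‖_∞ ≤ C√(log p) for the normalised residual direction.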
From the reasoning above, we conclude that, under appropriate conditions, a simple test based on the asymptotic normality of w_ora^T R_ora will keep the type I error under control. In order to create a version of the test statistic that does not require oracular knowledge of D_{β_0}, we may replace this quantity with variance estimates based either on β̂ or on an estimate derived from the auxiliary data; we use the latter approach as this simplifies the analysis. The overall procedure is summarised in Algorithm 1 below.
2: Residual prediction: Compute the residuals Y_A − µ(X_A β̂_A) and fit a flexible regression method of these residuals versus X_A to obtain a prediction function f̂ : R^{n×p} → R^n.
In practice, the auxiliary dataset (X A , Y A ) would be obtained through sample splitting. The effect of the randomness induced by such a split can be mitigated using methods designed to aggregate over multiple sample splits, as studied for instance in Meinshausen et al. (2009).
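The aggregation rule of Meinshausen et al. (2009) can be sketched as follows: take a γ-quantile of the p-values from the individual splits, scaled by 1/γ. (This Python/numpy sketch shows a fixed γ; the adaptive version searches over a grid of γ values with an extra correction factor.)

```python
import numpy as np

def aggregate_pvalues(pvals, gamma=0.5):
    """Quantile aggregation of p-values from multiple sample splits,
    in the spirit of Meinshausen, Meier and Buhlmann (2009)."""
    pvals = np.asarray(pvals, dtype=float)
    q = np.quantile(pvals / gamma, gamma)   # gamma-quantile of scaled p-values
    return min(1.0, q)

# p-values from, say, B = 20 random splits of the same data
rng = np.random.default_rng(2)
null_pvals = rng.uniform(size=20)           # behaviour under the null
print(0.0 < aggregate_pvalues(null_pvals) <= 1.0)
```

The scaling by 1/γ is what keeps the aggregated quantity a valid p-value despite the dependence between splits.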

Group testing
Our framework of residual prediction tests also encompasses significance testing of groups of regression coefficients in a generalized linear model. Suppose that we wish to test H_0 : β_{0,G} = 0 for a given group G ⊆ {1, …, p}. We first form the vector R̂_G of residuals based on a GLM Lasso fit of Y on X_{−G}. Then, rather than constructing a single direction w using an auxiliary dataset, we can use multiple directions given by the columns of X_G. Specifically, we use the test statistic max_{j∈G} |ŵ_j^T R̂_G|, where ŵ_j is given by the scaled residuals of the weighted square-root Lasso regression of X_j on to X_{−G}.
Note that under the null, X_G will be independent of the noise ε in (1), and so sample splitting is not necessary in this case to mitigate the potentially complicated dependence of the directions and residuals R̂_G. The limiting distribution of the test, however, will not be Gaussian due to the maximisation over multiple directions. Instead, we argue that max_{j∈G} |ŵ_j^T R̂_G| ≈ max_{j∈G} |ŵ_j^T ε| and then use a wild bootstrap procedure to approximate the distribution of this latter quantity. The overall procedure is summarised in Algorithm 2 below.
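The wild (multiplier) bootstrap step can be sketched as follows. This Python/numpy toy version takes the directions ŵ_j as the columns of a matrix W and a residual vector ε; it illustrates the principle rather than reproducing the full Algorithm 2.

```python
import numpy as np

def group_test_pvalue(W, eps, n_boot=999, seed=0):
    """Multiplier bootstrap p-value for T = max_j |w_j^T eps|,
    with directions in the columns of W (cf. Chernozhukov et al., 2013)."""
    rng = np.random.default_rng(seed)
    T_obs = np.max(np.abs(W.T @ eps))
    n = eps.shape[0]
    boot = np.empty(n_boot)
    for b in range(n_boot):
        e = rng.standard_normal(n)           # multiplier variables e_1, ..., e_n
        boot[b] = np.max(np.abs(W.T @ (eps * e)))
    return (1 + np.sum(boot >= T_obs)) / (1 + n_boot)

rng = np.random.default_rng(3)
n, G = 200, 5
W = rng.normal(size=(n, G))
W /= np.linalg.norm(W, axis=0)               # unit-norm directions
eps = rng.standard_normal(n)                 # null residuals
pval = group_test_pvalue(W, eps)
print(0.0 < pval <= 1.0)
```

Multiplying each residual by an independent N(0, 1) variable preserves the covariance structure of the vector (ŵ_j^T ε)_{j∈G} conditionally on the data, which is what makes the maximum over the group bootstrappable.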

Output: p-value
Our Algorithm 2 is similar to the de-biased Lasso for generalized linear models (van de Geer et al., 2014). The main difference, however, is that the de-biased Lasso aims to ensure the directions ŵ_j are almost orthogonal to X_{−j}, whereas we only impose near-orthogonality with respect to X_{−G}. Thus for large groups G, more of the direction of X_G is preserved in the ŵ_j, which typically leads to better power of the test.

Theoretical guarantees
In this section we provide theoretical guarantees for the tests proposed in Algorithms 1 and 2. We consider an asymptotic regime with the sample size n tending to infinity and the number of parameters p = p n growing as a function of n.

Size of the test
In the following sections, we show that under the null hypothesis, the size of the test is asymptotically correct. We explore goodness-of-fit testing in Section 3.1.1 and group testing in Section 3.1.2.

Goodness-of-fit testing
Here we show that our test statistic has a Gaussian limiting distribution and we establish a bound on the type I error of the test. In this section, we condition on the design and the auxiliary dataset. Our result makes use of the fact that when the model is well specified, the GLM Lasso performs well in terms of estimation. Specifically, under certain conditions, it holds with high probability that β̂ ∈ Θ(λ, β_0, X), where Θ(λ, β_0, X) is a local neighbourhood of β_0 defined by Θ(λ, β_0, X) := {β ∈ R^p : ‖β − β_0‖_1 ≤ sλ}, with s := ‖β_0‖_0 as the number of non-zero entries of β_0; see for example Bühlmann and van de Geer (2011, Corollary 6.3). Sufficient conditions for this to occur include λ ≍ √(log p/n), s = o(n/log p) and further conditions on the tail behaviour of the errors Y_i − µ(x_i^T β_0), the design matrix X and the link function, as detailed below.
Condition 1 is satisfied for generalized linear models with canonical links under mild additional conditions. For example, in the case of logistic regression, the condition on the weights is satisfied if the class probability π_0(x) = P(Y_i = 1 | x_i = x) is bounded away from zero and one. Boundedness of the design (along with other conditions, including 12 d_min^{−2} L K_X s λ_A ≤ 1) guarantees that the weights can be consistently estimated. For our result below, it is convenient to introduce the shorthand notation Z_A := (X, X_A, Y_A).
Theorem 1. Consider Algorithm 1 with tuning parameters λ, λ_A, λ_sq > 0. Assume that Condition 1 is satisfied and that β̂_A ∈ Θ(λ, β_0, X_A). Then there exists a constant C > 0 such that, whenever Z_A satisfies f̂(X) ≠ X β̂_sq, we have for any z ∈ R that where Φ denotes the standard normal distribution function.
For the asymptotically optimal choice of tuning parameters λ ≍ λ_A ≍ K_X √(log p/n) and λ_sq ≍ √(log p/n), the bound in Theorem 1 reduces to (7). We now discuss the terms on the right-hand side of (7). The terms λ_sq √n sλ and K_X sλ_A arise from bounding the bias term (near-orthogonalization step) and from bounding the weights, respectively. The presence of the ‖ŵ_A‖_∞ term in the bound stems from the contribution of each individual component ŵ_{A,i} to the variance of the pivot term and that of the higher-order terms omitted in (2), which create a bias in the distribution of the test statistic.
To provide some intuition on the size of ‖ŵ_A‖_∞, recall that ŵ_A is a vector in R^n with ‖ŵ_A‖_2 = 1, so we may hope for ‖ŵ_A‖_∞ to be small; in fact, it can be shown in certain settings, and under additional technical conditions, that ‖ŵ_A‖_∞ = O_P(log n/√n). We also remark that we observe ŵ_A, and if the size of its ℓ_∞-norm is a concern, then we can modify the square-root Lasso objective to control it explicitly. Indeed, the KKT conditions of the modified optimisation problem imply in particular both that a near-orthogonality condition similar to (4) is satisfied for suitable λ_sq, and also that ‖w_A‖_∞ ≤ √n λ_η. Our empirical results in Section 4, however, suggest that in practice ‖ŵ_A‖_∞ typically satisfies the necessary constraint and therefore we propose to use the simpler standard square-root Lasso without the above modifications.

Group testing
In this section, we derive theoretical properties for the group testing procedure proposed in Algorithm 2. Since we do not use sample splitting, we cannot directly apply the arguments of Theorem 1, as the direction ŵ_j depends on β̂_{−G} via the weights D̂ = D_{β̂_{−G}}. In order to understand this dependence, we here consider the setting of random bounded design and assume that the response-covariate pairs (Y_i, x_i) are all independent and identically distributed.
We aim to use the multiplier bootstrap procedure (Chernozhukov et al., 2013) to estimate the distribution of the test statistic max_{j∈G} |T_j| as described in Algorithm 2, but we first summarize a preliminary result which shows that, under appropriate conditions, T_j can be asymptotically approximated by the zero-mean average w_j^T ε. Here w_j denotes the population version of the direction ŵ_j, defined through γ_{0,j}, the population version of γ̂_j from Algorithm 2. In order to guarantee consistency of γ̂_j in Algorithm 2, we will introduce the additional requirement that γ_{0,j} is sparse. Denote by β_{0,−G} ∈ R^{p−|G|} the subset of components of β_0 corresponding to indices in G^c. We also define Condition 2.
and assume that there exist constants c_1, c_2, c_3 > 0 such that for all i = 1, …, n.
(v) We have max_{j∈G} ‖γ_{0,j}‖_0 ≤ s, ‖β_0‖_0 ≤ s, and there exists a sequence (a_n) with a_n → 0 and K³ s log p/√n ≤ a_n.
Proposition 1. Assume that Conditions 1 and 2 are satisfied and that β̂_{−G} satisfies the stated estimation bounds. Consider ŵ_j, j ∈ G (assumed to be non-degenerate) as defined in Algorithm 2 with tuning parameters λ ≍ √(log p/n) and λ_w ≍ K√(log p/n). Assume that the null hypothesis H_0 : β_{0,G} = 0 holds. Then there exists a constant C > 0 such that, with probability at least 1 − 3/p, we have the approximation stated above. Using Proposition 1 and the results of Chernozhukov et al. (2013), we can show that the quantiles of max_{j∈G} |T_j| can be approximated by the quantiles of max_{j∈G} |Σ_{i=1}^n ŵ_{j,i} ε̂_i e_i^b|, where e_1^b, …, e_n^b are independent N(0, 1) random variables. We only need to guarantee that T_j is well-approximated by w_j^T ε, and we pay a price of log |G| for testing |G| hypotheses simultaneously, where |G| denotes the cardinality of G. Here P_e is the probability measure induced by the multiplier variables (e_1^b, …, e_n^b), with the data held fixed.
Theorem 2. Assume the conditions of Proposition 1 and that there exist constants C_2, c_2 > 0 such that Then there exist constants c, C > 0 such that Theorem 2 shows that if the generalized linear model is correct and the null hypothesis β_{0,G} = 0 holds, then Algorithm 2 produces an asymptotically valid p-value for testing this hypothesis.

Power analysis for goodness-of-fit testing
The choice of w as postulated in Theorem 1 guarantees that the type I error for goodness-of-fit testing stays under control. We now provide guarantees on the local power of the test. To this end, let us suppose that the true model has conditional expectation function m_0(x) = µ(x^T β_0 + g_0(x)). Here g_0 represents a small nonlinear perturbation of the linear predictor x^T β_0. Our aim here is to understand how this propagates through to the distribution of our test statistic. We will suppose that the perturbation is small enough that GLM Lasso estimates lie with high probability within local neighbourhoods of β_0. Let us first provide some intuition on the expected value of the test statistic under model misspecification. Writing f_0(x) = x^T β_0 + g_0(x), the expectation of the theoretical residuals ε = D_{β_0}^{−1}(Y − µ(Xβ_0)) is given by E(ε | X) = D_{β_0}^{−1}{µ(f_0(X)) − µ(Xβ_0)} =: w_0. As argued in Section 2, the oracular test statistic w^T R_ora can be approximated by the scalar product w^T ε. Therefore, in order to obtain good power properties, we should seek to construct a direction w so as to maximize E(w^T ε). The oracular choice w := w_0/‖w_0‖_2 yields, by a Taylor expansion, the approximation E(w^T ε) = ‖w_0‖_2 ≈ ‖D_{β_0} g_0(X)‖_2. We therefore see that the test statistic behaves in expectation as a weighted ℓ_2-norm of the nonlinear term g_0(X). We now provide a theoretical justification which can be used for local asymptotic guarantees on the power of our method. We introduce the following conditions, which are modifications of Condition 1 to account for the case when the model is misspecified.
Theorem 3. Consider Algorithm 1 with tuning parameters λ, λ_A, λ_sq. Assume Condition 3. Then there exists a constant C > 0 such that, whenever Z_A is such that f̂(X) ≠ X β̂_sq, we have for any z ∈ R that Under the null hypothesis, we have ∆ = 0 and D_Y = D_{β_0}; thus we recover the result of Theorem 1. The departure of the model from the null hypothesis is captured by ∆. Hence the theorem shows that to detect departures from the null, the direction w must be "correlated" with the signal that remains in the residuals under misspecification, namely µ(f_0(X)) − µ(Xβ_0). Under a local alternative, e.g. f_0(X) = Xβ_0 + g_0(X)/√n with ‖g_0(X)‖_2 = 1, we have ∆ ≍ 1. Theorem 3 relies on rates of convergence of the "projected" estimator β̂ in (10) when the model is misspecified. Oracle inequalities for Lasso-regularized estimators in high-dimensional settings have been well explored; we refer to Bühlmann and van de Geer (2011) and the references therein. If there is misspecification, we hope that the projected estimator behaves as if it knows which variables are relevant for a linear approximation of the possibly nonlinear target f_0. In Appendix A.4, we summarize how misspecification affects estimation of the best linear approximation, based on the approach of Bühlmann and van de Geer (2011). These results guarantee that under a local alternative, the Lasso for generalized linear models still satisfies the condition P(β̂ ∈ Θ(λ, β_0, X) | X) → 1.

Consequences for logistic regression
In this section we show how our general theory applies to the problem of goodness-of-fit testing for logistic regression models. We take Y = {0, 1} and assume (Y_i | x_i = x) ∼ Bernoulli(π_0(x)). Define f_0(x) := log{π_0(x)/(1 − π_0(x))}. The function f_0 may be potentially nonlinear in x. The ℓ_1-regularized logistic regression estimator is β̂ := argmin_{β ∈ R^p} { (1/n) Σ_{i=1}^n {d(x_i^T β) − Y_i x_i^T β} + λ‖β‖_1 }, where d(ξ) := log(1 + e^ξ). Let β_0 ∈ R^p be the best approximation obtained by a GLM (Bühlmann and van de Geer, 2011, Section 6.3, p. 115). We define S := {j : β_{0,j} ≠ 0} and s := |S|. Corollary 1 below follows by combining Theorem 3 with existing results on ℓ_1-penalized logistic regression. We assume conditions, stated formally in Lemma 6 in Appendix A, which guarantee that with high probability this penalized estimator is sufficiently close to β_0.

Empirical results: Logistic regression
In this section we explore the empirical performance of the methods for goodness-of-fit testing and group testing in the setting of logistic regression. We begin by considering goodness-of-fit testing in low-dimensional settings in Section 4.1 and in high-dimensional settings in Section 4.2. Goodness-of-fit testing on semi-real data is investigated in Section 4.3, while in Section 4.4, we explore group testing in high-dimensional settings.

Low-dimensional settings
While for low-dimensional settings there are numerous methods available for testing goodness-of-fit, as discussed in Section 1.2, we show that our test from Algorithm 1 may be advantageous even here. We compare the performance of our test (with the residual prediction method being a random forest with default tuning parameter choices) against the Hosmer-Lemeshow Ĉ test, the Hosmer-Lemeshow Ĥ test (see Lemeshow and Hosmer Jr (1982)) and the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test (see Hosmer et al. (1997)). These tests are implemented in the function HLgof.test() in the R package MKmisc (Kohl, 2018). We simulated data from a logistic regression model with sample size N = 300 and p = 10 covariates according to (Y_i | x_i = u) ∼ Bernoulli(π(u)), where π(u) := µ(u_1 + u_2 + u_3 + σ g(u)).
We considered different forms for the misspecification g(·):
• quadratic effects: (a) g(u) = 2u_1^2, (b) g(u) = 2u_5^2;
• interactions.
Here σ ≥ 0 measures the size of the departure from the null hypothesis H_0 : σ = 0. Note that our GRP testing methodology requires an auxiliary sample of size n_A. We therefore randomly split the sample taking n_A = n = N/2, with n being the number of observations in the main sample. The observation vectors x_i follow a N_10(0, Σ_0) distribution, where Σ_0 is the Toeplitz matrix with entries (Σ_0)_{jk} = ρ^{|j−k|} and correlation ρ = 0.6. The results for the six settings above are shown in Figure 2. All methods maintain good control over type I error, but in most scenarios our GRP-test has significantly greater power compared with the other methods.
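The simulation design above can be sketched as follows (a Python/numpy illustration under our own naming; the paper's experiments are run in R). The snippet generates one dataset for the quadratic setting (a), with µ taken to be the logistic function.

```python
import numpy as np

def toeplitz_cov(p, rho):
    """Toeplitz correlation matrix with entries rho^|j - k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(N=300, p=10, rho=0.6, sigma=0.0, rng=None):
    """Simulate (X, y) from the logistic model with optional
    quadratic misspecification g(u) = 2 * u_1^2 scaled by sigma."""
    rng = np.random.default_rng(rng)
    Sigma = toeplitz_cov(p, rho)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    g = 2.0 * X[:, 0] ** 2                       # misspecification term
    eta = X[:, 0] + X[:, 1] + X[:, 2] + sigma * g
    pi = 1.0 / (1.0 + np.exp(-eta))              # pi(u) = mu(u1 + u2 + u3 + sigma*g(u))
    y = rng.binomial(1, pi)
    return X, y
```

Setting sigma = 0 recovers the null model, so rejection rates at sigma = 0 estimate the type I error.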

High-dimensional settings
Here we consider logistic regression models with two different types of misspecification from the presence of a pure quadratic effect and an interaction term. Specifically, we take the log-odds to be f_0(x) = x^T β_0 + σg(x), where β_0 = (1, 1, 1, 1, 1, 0, . . . , 0)^T ∈ R^p and either (a) g(u) = u_1 u_2 or (b) g(u) = u_1^2/2. The parameter σ controls the degree of the misspecification and we look at σ ∈ [0, 3].
The observation vectors x_i are independent with a Gaussian distribution N_p(0, Σ_0), where Σ_0 is the Toeplitz matrix (13) for a range of correlations ρ ∈ {0.4, 0.6, 0.8}. We consider a setting with p = 500 and N = 800. The GRP-test requires sample splitting and we use a 50% split; thus the auxiliary sample has size n_A = 400. Again, we use a random forest as the residual prediction method. In order to achieve better control of the type I error, we use a slight modification of the procedure proposed in Algorithm 1. Define Ŝ := {j : β̂_j ≠ 0}. Rather than calculating the direction ŵ_A through β̂_sq, we instead use w̃_A, the projection of ŵ_A onto the orthogonal complement of the column space of D̂_A X_Ŝ,

in its place within (6). In this way, we enforce that ŵ_A is exactly orthogonal to D̂_A X_Ŝ. This helps to keep the remainder term arising from the asymptotic expansion of the test statistic under control, as can be seen from the following argument. Assume that a "beta-min condition" is satisfied, that is, for all j ∈ S := {k ∈ {1, . . . , p} : β_{0,k} ≠ 0} it holds that |β_{0,j}| ≳ s√(log p/n), where s := |S|. Then, asymptotically, it holds that Ŝ ⊇ S with high probability (e.g. Bühlmann and van de Geer, 2011, Corollary 7.6). On the event that this occurs, the remainder term in (1) vanishes. Even without such a beta-min condition, it is plausible that we will obtain a reduction in this bias term through this strategy of exact orthogonalization. In the high-dimensional setting, there is no obvious method that we can use for comparison with our proposed GRP-test. Therefore, as a theoretical benchmark, we consider an oracle GRP-test applied to a reduced design matrix containing only the variables in the active set S = {1, . . . , 5}, thereby reducing the problem to a low-dimensional one. The results are reported in Tables 1 and 2, from which we see that the GRP-test does indeed control the type I error, and suffers only a relatively small loss in power compared with the oracle GRP-test.
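The exact orthogonalization step amounts to a linear projection and can be sketched as follows (Python/numpy; `orthogonalize` is our own illustrative helper, not part of GRPtests). Here D is passed as the diagonal of the weight matrix.

```python
import numpy as np

def orthogonalize(w_hat, D, X_S):
    """Project w_hat onto the orthogonal complement of col(diag(D) @ X_S),
    so that the returned direction is exactly orthogonal to D X_S."""
    M = D[:, None] * X_S                 # the matrix D_A X_S, column by column
    Q, _ = np.linalg.qr(M)               # orthonormal basis of col(M)
    return w_hat - Q @ (Q.T @ w_hat)     # remove the component lying in col(M)
```

By construction (D X_S)^T w̃ = 0 up to numerical error, which is the orthogonality enforced in the modified procedure.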

Semi-real data example
We use a gene-expression dataset on lung cancer available from the NCBI database (Spira et al. (2007), https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2771) to illustrate the size and power performance of the goodness-of-fit test. We aim to detect whether the model is a logistic regression, or whether there are extra nonlinear effects. The full dataset contains airway epithelial gene expressions for 22215 genes from each of 192 smokers with (suspected) lung cancer, but this was reduced by retaining the 500 genes with the largest variances. Having scaled the resulting variables, we fit an ℓ_1-penalized logistic regression using cv.glmnet() from the package glmnet (Friedman et al., 2010) and obtained a parameter estimate β̂ with its corresponding support set Ŝ. We then fit a Gaussian copula model to the rows of the design matrix and generated a new, augmented design matrix X by simulating a further N = 800 observation vectors from this fitted model. Finally, we generated 800 new responses Y_i ∼ Bernoulli(µ(x_i^T β̂ + g(x_i))) for the following three scenarios, where j_ℓ ∈ Ŝ, ℓ = 1, . . . , 4, are uniformly sampled entries of Ŝ. We report rejection probabilities for all three scenarios from 100 repetitions in Table 3. In each case, the GRP-test is able to detect the misspecification relatively reliably, while keeping the type I error under control.
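The copula augmentation step can be sketched as follows (Python; a simplified rank-based Gaussian copula fit with our own function name — the paper does not specify the exact fitting procedure). Marginals are handled via empirical ranks and empirical quantiles, using only numpy and the standard library.

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()

def copula_augment(X, N_new, rng=None):
    """Fit a Gaussian copula to the rows of X and simulate N_new new rows."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # ranks -> uniform scores in (0, 1) -> normal scores, column by column
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    U = ranks / (n + 1)
    Z = np.vectorize(_nd.inv_cdf)(U)
    R = np.corrcoef(Z, rowvar=False)         # copula correlation matrix
    Z_new = rng.multivariate_normal(np.zeros(p), R, size=N_new)
    U_new = np.vectorize(_nd.cdf)(Z_new)
    # map back through the empirical quantile function of each column
    X_new = np.column_stack(
        [np.quantile(X[:, j], U_new[:, j]) for j in range(p)]
    )
    return X_new
```

Because the new marginals come from empirical quantiles, each simulated column stays within the observed range of the corresponding original column.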

Testing goodness-of-fit of logistic regression on semi-real data on lung cancer
                                              Prob. of rejection of H_0
g(u) = 0                                      0.05
g(u) = u_{j_1}^2 + u_{j_2}^2                  0.77
g(u) = u_{j_1} u_{j_2} + u_{j_3} u_{j_4}      0.81

Table 3: Estimated probabilities of rejection of H_0 : g(u) = 0 at significance level 0.05, averaged over 100 generated datasets.

Group testing
Finally, we consider the problem of testing for the significance of groups of predictors using the methodology set out in Section 3.1.2. We compare the GRP-test (Algorithm 2) with the globaltest (Goeman et al., 2004); empirical distribution functions of the resulting p-values are shown in Figure 3 for three settings: (a) no misspecification, g(u) = 0; (b) quadratic effects, g(u) = u_{j_1}^2 + u_{j_2}^2; (c) interactions, g(u) = u_{j_1} u_{j_2} + u_{j_3} u_{j_4}.

Discussion
In this work, we have introduced a new method for detecting conditional mean misspecification in generalized linear models based on predicting remaining signal in the residuals. For this task of prediction, we have a number of powerful machine learning methods at our disposal. While the estimation performance of these methods is largely theoretically intractable, by employing sample splitting and a careful debiasing strategy involving the square-root Lasso, our generalized residual prediction framework provides formal statistical tests with type I error control when used in conjunction with (essentially) arbitrary machine learning methods. One requirement for these theoretical guarantees is that the sparsity of the true regression coefficient β_0 satisfies s = o(√n / log p), a condition that was also needed in related work on the de-biased Lasso (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014). It would be very interesting if this could be relaxed to s = o(n / log p), for instance, which would encompass settings where the GLM Lasso estimate satisfies ‖β̂ − β_0‖_2 → 0 though ‖β̂ − β_0‖_1 may be diverging.
Another interesting question is whether sample splitting can be completely avoided if we were able to obtain guarantees for an estimator ŵ of a population direction w. Such alternatives to sample splitting could be particularly helpful for settings where there is dependence across the observations, such as in the case of generalized linear mixed effect models.

Figure 3: Comparison of GRP-test from Algorithm 2 with the de-biased Lasso (Dezeure et al., 2015) and globaltest (Goeman et al., 2004).
Acknowledgements: The authors would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the 'Statistical Scalability' programme when work on this paper was undertaken, supported by EPSRC grant number EP/R014604/1. JJ is supported by a Swiss National Science Foundation fellowship. RDS is supported by an EPSRC First Grant and an EPSRC programme grant. PB is supported by the European Research Council under the grant agreement No. 786461 (CausalStats - ERC-2017-ADG). RJS is supported by an EPSRC fellowship and an EPSRC programme grant.

6 Proofs

6.1 Proofs for Section 3

6.1.1 Proofs for Section 3.1.1

Proof of Theorem 1. This follows from the more general Theorem 3, noting that under the null hypothesis, ∆ = 0. The proof of Theorem 3 can be found in Section 6.1.3.
6.1.2 Proofs for Section 3.1.2

Proof of Proposition 1. For j ∈ G, let u_j := X_j − X_{−G} γ_{0,j} and define the sets where C will be specified below. We first derive a high-probability bound for the set By Lemma 3 in Section A, there exists a constant A > 0 (in the definition of T_{3,j}) such that In the rest of the proof, we work on the event T. Define D_{β_0,−G} := µ'(X_{−G} β_{0,−G}) (note that under H_0, it holds that D_{β_0,−G} = D_{β_0}). Now consider the decomposition where rem_{1,j} := (D^{−1} and where we write D̂ := D_{β̂_{−G}}. We first derive the rates of convergence for the estimator γ̂_j from Algorithm 2 and then proceed to bound the remainders. By Lemma 4 in Section A, there exist positive constants C_3 and C_4 such that on T, we have max_{j∈G} ‖γ̂_j − γ_{0,j}‖_1 ≤ C_3 K^2 s√(log p/n).

Remainder rem_{1,j}: Let k_j denote the index of column X_j in the matrix X_{j∪G^c}. For j ∈ G, we define Γ̂_j := (−γ̂_{j,1}, . . . , −γ̂_{j,k_j−1}, 1, −γ̂_{j,k_j+1}, . . . , −γ̂_{j,|G^c|})^T and its population-level counterpart Γ_{0,j} based on γ_{0,j}. First note that and similarly, Therefore, we obtain using Hölder's inequality that On the set ∩_{j∈G} T_{3,j} ⊆ T, we have Next we bound ‖Γ̂_j/τ̂_j − Γ_{0,j}/τ_j‖_1. Firstly, we can decompose and bound Now note that Combining (21) with the fact that where e_ℓ denotes the ℓ-th standard basis vector in R^{p−|G|+1}. Define Σ_{β_0,j∪G^c} := E X_{j∪G^c}^T D^2_{β_0} X_{j∪G^c}/n and note from Condition 2(iv) that Σ_{β_0,j∪G^c} is invertible. Consequently, Γ_{0,j}/τ_j^2 = Σ^{−1}_{β_0,j∪G^c} e_{k_j}, so 1/τ_j^2 = (Σ^{−1}_{β_0,j∪G^c})_{k_j k_j}. Further note that max_{j∈G} τ_j^2 Therefore, by Condition 2(iv), we have Consequently, and by the sparsity of γ_{0,j} assumed in Condition 2(v), it follows that By Condition 2(v), we can find n_0 ∈ N such that a_n ≤ 1/(2C_4 C_e) for n ≥ n_0.
Then from (17) and (22), we obtain on T that for n ≥ n_0, Using (20), we conclude that on T, Consequently, combining (19) and (23), there exists a constant such that it holds on T that max_{j∈G} |rem_{1,j}| = O(K^3 s log p/√n) = O(a_n).
Remainder rem_{2,j}: By the mean value theorem, for each i = 1, . . . , n, there exists α_i ∈ [0, 1] such that Consequently, using Hölder's inequality and the KKT conditions of the optimization problem in Algorithm 2, we obtain where rem_{3,j} := Σ_{i=1}^n ŵ_{j,i} D^{−1} By Condition 2(ii), we have ‖X_{−G} γ_{0,j}‖_∞ ≤ K, so we obtain Since µ is Lipschitz and ‖x_i‖_∞ ≤ K, we obtain that on T, by Condition 2(iii) and Condition 2(v). Therefore, on T, it follows that Then, using the result above, on T we obtain The result follows from (15), together with the bounds (14), (24), (25) and (30).
Proof of Theorem 2. We want to show that the quantiles of our test statistic for group testing, T := max_{j∈G} |Σ_{i=1}^n ŵ_{j,i} R̂_{G,i}|, can be approximated by the quantiles of its bootstrapped version T^b := max_{j∈G} |Σ_{i=1}^n ŵ_{j,i} R̂_{G,i} e_i^b|, where (e_i^b)_{i=1}^n is a sequence of independent and identically distributed N(0, 1) random variables. We can apply Theorem 4 from Section A.3 together with Proposition 1, adopting the notation of Theorem 4. Note that the maxima above can be rewritten without the absolute values using the fact that for any a ∈ R it holds that |a| = max{a, −a}. Thus for i = 1, . . . , n and j ∈ G we let x_{ij} := √n w_{j,i} ε_i and x̂_{ij} := √n ŵ_{j,i} R̂_{G,i}. Moreover, for i = 1, . . . , n and j ∈ G we also define x_{i(j+|G|)} := −x_{ij} and x̂_{i(j+|G|)} := −x̂_{ij}. We will apply Theorem 4 with x_{ij} and x̂_{ij}, where i = 1, . . . , n and j ∈ H := G ∪ {j + |G| : j ∈ G}.
We now check that conditions (51), (52), (53), (54) and (55) of Theorem 4 are satisfied. By Condition 2(iii), it follows that c_0 ≤ D^2_{β_0,ii} ≤ C_0. Consequently, and using Condition 2(i), there exist constants c, C such that c ≤ E(ε_i^2 | X) = E(η_i^2/D^2_{β_0,ii} | X) ≤ C. It follows that Now recalling that Therefore, combining (31) and (32), we obtain c ≤ n^{−1} Σ_{i=1}^n E x_{ij}^2 ≤ C for j ∈ H, as required. Using this bound, we will now check that for suitable B_n ≳ K, First observe that Taking B_n ≳ K sufficiently large, we can therefore guarantee that (33) holds, as required.
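The multiplier-bootstrap approximation used in the proof above can be sketched numerically (Python/numpy; the array shapes and the helper name are our own choices). Given the directions ŵ_{j,·} and residuals R̂_G, the bootstrap statistic multiplies each summand by an independent standard Gaussian.

```python
import numpy as np

def multiplier_bootstrap_quantile(W, R, alpha=0.05, B=1000, rng=None):
    """Multiplier-bootstrap (1 - alpha)-quantile of
    T^b = max_j | sum_i w_{j,i} * R_i * e_i^b |  over B bootstrap draws.

    W: (|G|, n) array of directions, one row per j in G.
    R: length-n vector of residuals."""
    rng = np.random.default_rng(rng)
    contrib = W * R[None, :]                   # entries w_{j,i} * R_i
    E = rng.standard_normal((B, len(R)))       # i.i.d. N(0,1) multipliers e_i^b
    Tb = np.abs(contrib @ E.T).max(axis=0)     # B draws of the max statistic
    return np.quantile(Tb, 1 - alpha)
```

The observed statistic T = max_j |Σ_i ŵ_{j,i} R̂_{G,i}| is then compared with this critical value.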

6.1.3 Proofs for Section 3.2
Proof of Theorem 3. In this proof, it is convenient to write P_A(·) and E_A(·) as shorthand for P(· | Z_A) and E(· | Z_A) respectively. Consider the decomposition of the test statistic into the following three terms.

I. The term φ is the pivot. By the Berry–Esseen theorem, we will show below that (after scaling) it is well approximated by a normal random variable.
II. The term ∆ captures the deviation from the null hypothesis. If the null hypothesis is true, then ∆ = 0.
III. The term rem 1 is a stochastic remainder term, for which we will develop a probabilistic bound below.
Proof of Corollary 1. We apply Theorem 3 to the case of logistic regression to obtain local guarantees on the power of the test. To this end, we need to bound δ in (10) and verify Condition 3 of Theorem 3.
To bound δ, we note that by Lemma 6 in Section A with t := log(2p), we have with probability at least 1 − 1/(2p) that In what follows we work on the event where this occurs. We next want to obtain a bound on ‖X(β̂ − β_0)‖_2^2/n. Note that the second derivative of the loss function is For ‖x‖_∞ ≤ K and any f with sup_x: Note that for any β̃ on the line segment between β_0 and β̂, we have Thus we can conclude, using a Taylor expansion of the loss function, that there exist {β^{(i)} : i = 1, . . . , n}, each on the line segment from β_0 to β̂, such that We deduce that there exists a constant C̃ > 0 such that with λ = C̃√(log(2p)/n), we have δ = P(β̂ ∉ Θ(λ, β_0, X) | X) ≤ 1/(2p). It remains to check that Condition 3 of Theorem 3 is satisfied. Firstly, the inverse link function µ(u) = 1/(1 + e^{−u}) is differentiable and Lipschitz with constant 1. Moreover, by (42), D^2_{Y,ii} ≥ d^2_min := c_0^2 and also D^2_{β_0,ii} ≥ c_0^2. Finally, observe that E(|Y_i − µ(f_0(x_i))|^3 | X) ≤ 1. Moreover, 12 d^{−2}_min L K_X sλ = 12 c_0^{−2} L K_X sλ ≤ 1 by hypothesis. Therefore, Condition 3 is satisfied.

A Appendix
A.1 Auxiliary lemmas

Lemma 1 (Hoeffding's inequality for a maximum of p averages). Suppose that for each j = 1, . . . , p, the random variables Z_{1j}, . . . , Z_{nj} are independent with E Z_{ij} = 0 and |Z_{ij}| ≤ c_i. Then for all t > 0,

P( max_{j=1,...,p} |n^{−1} Σ_{i=1}^n Z_{ij}| ≥ c̄√(2(t + log(2p))/n) ) ≤ e^{−t},

where c̄^2 := n^{−1} Σ_{i=1}^n c_i^2.

Proof of Lemma 1. Apply Corollary 17.1 in van de Geer (2016) together with a union bound.
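A quick numerical illustration of the √(log(p)/n) scaling in Lemma 1 (Python; the constants and the choice t = log 2 here are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 200
# mean-zero bounded variables with |Z_ij| <= c_i = 1, so c_bar = 1
Z = rng.uniform(-1.0, 1.0, size=(n, p))
max_avg = np.abs(Z.mean(axis=0)).max()     # max_j |n^{-1} sum_i Z_ij|
# threshold from the lemma at t = log 2, i.e. failure probability <= 1/2
threshold = np.sqrt(2 * (np.log(2) + np.log(2 * p)) / n)
```

In a typical draw the observed maximum sits well below the Hoeffding threshold, consistent with the (conservative) union bound.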
Lemma 2. Let Ã, B̃ ∈ R^{n×n} be diagonal matrices and suppose that B̃ is invertible. Let w ∈ R^n satisfy ‖w‖_2 = 1. Then Proof of Lemma 2.
as required.

A.2 Auxiliary lemmas for Group Testing
Lemma 3. Under the conditions of Proposition 1, we have Proof. To obtain a probability bound for T_{1,j}, we apply Lemma 1 with Z_{ijk} := u_{j,i} D_{β_0,ii} X_{−G,i,k}, where X_{−G,i,k} is the (i, k)-th entry of the matrix X_{−G} and u_{j,i} is the i-th entry of u_j. Note that by Condition 2(ii), it follows that |u_{j,i}| ≤ 2K, and by Condition 2(iii), we have |D_{β_0,ii}| ≤ C_0. Therefore, |Z_{ijk}| ≤ c_i for c_i := 2C_0K^2 and for all i, j, k. Thus Lemma 1 implies that for all t > 0, Therefore, For the set T_{3,j}, by the sub-Gaussianity of η_i = Y_i − µ(x_i^T β_0) from Condition 2(i), there exists a constant C > 0 such that Therefore, P(T_j^c) ≤ P(T_{1,j}^c) + P(T_2^c) + P(T_{3,j}^c) ≤ 1/p^2 + P(T_2^c) + 1/p^2.
Lemma 4. Assume Conditions 1 and 2. For j ∈ G, let u_j := X_j − X_{−j} γ_{0,j} and define the sets T_{1,j} := {‖u_j^T D_{β_0} X_{−G}‖_∞/n ≤ 6C_0K^2√(log(2p)/n)}. Then there exist λ_nw ≍ √(log p/n) and positive constants C_3 and C_4 such that on the set ∩_{j∈G}(T_{1,j} ∩ T_2), it holds that max_{j∈G} ‖γ̂_j − γ_{0,j}‖_1 ≤ C_3 K^2 s√(log p/n) and max_{j∈G} |τ̂_j^2 − τ_j^2| ≤ C_4 K^2 s√(log p/n), whenever τ̂_j^2 := ‖D̂(X_j − X_{−G} γ̂_j)‖_2^2/n > 0.

Proof of Lemma 4. To obtain rates of convergence for γ̂_j from Algorithm 2, we follow the arguments in the proof of Theorem 3.2 in van de Geer et al. (2014), which considers nodewise regression with random bounded design. The difference is that they define a nodewise regression program with design matrix X_{−j} and use the Lasso, whereas we apply the nodewise regression with a smaller design matrix, X_{−G}, and we use the square-root Lasso. We also seek finite-sample, as opposed to asymptotic, bounds, but this requires only minor modifications. Since we assume that τ̂_j > 0, the square-root Lasso program with penalty λ_nw corresponds to the Lasso program with penalty λ_Lasso := τ̂_j λ_nw. We now check that the appropriate finite-sample analogues of conditions (D1)–(D5) of Theorem 3.2 from van de Geer et al. (2014) are satisfied for X̃ := X_{j∪G^c}. Firstly, the analogues of (D1), (D2) and (D4) are satisfied directly by the assumptions in Conditions 1 and 2. For (D3), first note that the smallest eigenvalue of Σ_{β_0,j∪G^c} := E X̃^T D^2_{β_0} X̃/n is bounded below by the smallest eigenvalue of Σ_0 = E X^T D^2_{β_0} X/n, which is in turn bounded below by 1/C_e. Similarly, ‖Σ_{β_0,j∪G^c}‖_∞ ≤ ‖Σ_{β_0}‖_∞ ≤ C_e. Finally, condition (D5) is satisfied on T_2.
As in the proof of Proposition 1, let k_j denote the index of column X_j in the matrix X_{j∪G^c}. We write Γ̂_j := (−γ̂_{j,1}, . . . , −γ̂_{j,k_j−1}, 1, −γ̂_{j,k_j+1}, . . . , −γ̂_{j,|G^c|})^T, and Γ_{0,j} for its analogue defined in terms of γ_{0,j}. Then note that we can write X_j − X_{−G} γ̂_j = X̃Γ̂_j, and recall that τ̂_j^2 = ‖D̂X̃Γ̂_j‖_2^2/n and τ_j^2 = E‖D_{β_0}X̃Γ_{0,j}‖_2^2/n. By inspecting the proof of Theorem 3.2 of van de Geer et al. (2014), we conclude that there exist positive constants C_3 and C_4 such that on ∩_{j∈G}(T_{1,j} ∩ T_2), it holds that max_{j∈G} ‖γ̂_j − γ_{0,j}‖_1 ≤ C_3 K^2 s√(log p/n) and max_{j∈G} |τ̂_j^2 − τ_j^2| ≤ C_4 K^2 s√(log p/n). Therefore, with probability at least 1 − 1/p − P(T^c), We now bound r_2. Using Condition 2(iii), together with the fact that β̂_{−G} ∈ Θ_{−G}(λ, β_0) on T, we have on this event that Then, using the fact that β̂_{−G} ∈ Θ_{−G}(λ, β_0) on T and that (which follows similarly as in the proof of Theorem 3), together with (47), we obtain that on T, ≲ sλ^2 n + K^2 s^2 λ^2 n ≲ K^2 s^2 λ^2 n.

A.3 Multiplier bootstrap
We summarize Corollary 3.1 from Chernozhukov et al. (2013). To this end, we need the following condition.
Condition 4. Let (x_i)_{i=1}^n be n independent random vectors with values in R^g satisfying E x_{ij}^2 ≤ C_1 (51) and max_{k=1,2} E(|x_{ij}|^{2+k}/B_n^k) ≤ C_1. Define T_0 := max_{j=1,...,g} n^{−1/2} Σ_{i=1}^n x_{ij}.
Let (e_i)_{i=1}^n be a sequence of i.i.d. N(0, 1) random variables independent of (x_i)_{i=1}^n, and define W := max_{j=1,...,g} n^{−1/2} Σ_{i=1}^n x̂_{ij} e_i.
Assume that there exist ζ_1, ζ_2 ≥ 0 such that where P_e is the probability measure induced by the multiplier variables (e_i)_{i=1}^n, holding (x_i)_{i=1}^n fixed.

A.4 Oracle inequalities for logistic regression under misspecification
We require a condition on the design matrix known as the compatibility condition (Bühlmann and van de Geer, 2011).
Definition 1 (Compatibility constant). We say that the compatibility condition is met with constant φ > 0 if, for all β ∈ R^p satisfying ‖β_{S^c}‖_1 ≤ 3‖β_S‖_1, it holds that ‖β_S‖_1^2 ≤ s‖Xβ‖_2^2/(nφ^2).
For our final lemma, we use the notation of Sections 3.3 and 6.1.4.
Proof of Lemma 6. The proof follows from Lemma 6.8 and Section 6.7 in Bühlmann and van de Geer (2011).