Generalised additive and index models with shape constraints

We study generalised additive models, with shape restrictions (e.g. monotonicity, convexity, concavity) imposed on each component of the additive prediction function. We show that this framework facilitates a nonparametric estimator of each additive component, obtained by maximising the likelihood. The procedure is free of tuning parameters and under mild conditions is proved to be uniformly consistent on compact intervals. More generally, our methodology can be applied to generalised additive index models. Here again, the procedure can be justified on theoretical grounds and, like the original algorithm, possesses highly competitive finite-sample performance. Practical utility is illustrated through the use of these methods in the analysis of two real datasets. Our algorithms are publicly available in the \texttt{R} package \textbf{scar}, short for \textbf{s}hape-\textbf{c}onstrained \textbf{a}dditive \textbf{r}egression.


Introduction
Generalized additive models (GAMs) (Hastie and Tibshirani, 1986, 1990; Wood, 2006) have become an extremely popular tool for modelling multivariate data. They are designed to enjoy the flexibility of non-parametric modelling while avoiding the curse of dimensionality (Stone, 1986). Mathematically, suppose that we observe pairs (X_1, Y_1), …, (X_n, Y_n), where X_i = (X_{i1}, …, X_{id})^T is the predictor and Y_i is the real-valued response, for i = 1, …, n. A GAM relates the predictor and the mean response μ_i = E(Y_i | X_i) through

g(μ_i) = f(X_i) = Σ_{j=1}^d f_j(X_{ij}) + c,

where g is a specified link function, and where the response Y_i conditional on X_i follows an exponential family distribution. Here c ∈ R is the intercept term and, for every j = 1, …, d, the additive component function f_j : R → R is assumed to satisfy the identifiability constraint f_j(0) = 0. Our aim is to estimate the additive components f_1, …, f_d together with the intercept c on the basis of the given observations. Standard estimators are based on penalized spline methods (e.g. Wood, 2004, 2008), and involve tuning parameters whose selection is not always straightforward, especially if different additive components have different levels of smoothness, or if individual components have non-homogeneous smoothness.
In this paper, we propose a new approach, motivated by the fact that the additive components of f often follow certain common shape constraints such as monotonicity, convexity or concavity. For instance, monotone regression techniques have been used in biology to search for gene-gene interactions (Luss et al., 2012), and in medicine to study the expression of a leukaemia antigen as a function of white blood cell count and DNA index (Schell and Singh, 1997). Economic theory dictates that utility functions are increasing and concave (Matzkin, 1991) and that the cost function of a standard perfectly competitive firm is increasing and convex (Aït-Sahalia and Duarte, 2003), whereas production functions are often assumed to be concave (Varian, 1984). In finance, theory restricts call option prices to be convex and decreasing functions of the strike price (Aït-Sahalia and Duarte, 2003); in stochastic control, value functions are often assumed to be convex (Keshavarz et al., 2011). The full list of constraints that we consider is given in Table 1, with each assigned a numerical label to aid our exposition. By assuming that each of f 1 , : : : , f d satisfies one of these nine shape restrictions, we show in Section 2 that it is possible to derive a non-parametric maximum likelihood estimator, which requires no choice of tuning parameters and which can be computed by using fast convex optimization techniques. In theorem 1, we prove that, under mild regularity conditions, it is uniformly consistent on compact intervals.
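The appeal of tuning-free shape-constrained estimation can be seen already in the simplest case. The pool-adjacent-violators algorithm below computes the isotonic (non-decreasing) least squares fit; it is a toy illustration only, not the scar implementation, which handles all nine constraints of Table 1 and general exponential family likelihoods. The point is that the monotonicity constraint itself supplies the regularization, so no smoothing parameter appears anywhere.

```python
def pava(y):
    """Pool-adjacent-violators algorithm: least squares fit subject only to
    the non-decreasing constraint. The shape constraint alone determines the
    amount of smoothing, so there is no tuning parameter."""
    blocks = []  # each block is [mean, weight]
    for v in y:
        blocks.append([float(v), 1.0])
        # merge adjacent blocks while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fit = []
    for m, w in blocks:
        fit.extend([m] * int(w))
    return fit
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into their average, giving the fit (1, 2.5, 2.5, 4).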
More generally, as we describe in Section 3, our approach can be applied to generalized additive index models (GAIMs), in which the predictor and the mean response μ_i = E(Y_i | X_i) are related through

g(μ_i) = f^I(X_i) = f_1(α_1^T X_i) + … + f_m(α_m^T X_i) + c,   (1)

where the value of m ∈ N is assumed known, where g is a known link function, and where the response Y_i | X_i again follows an exponential family distribution. Here, α_1, …, α_m ∈ R^d are called the projection indices, f_1, …, f_m : R → R are called the ridge functions (or, sometimes, additive components) of f^I, and c ∈ R is the intercept. Such index models have also been widely applied, especially in the area of econometrics (Li and Racine, 2007). When g is the identity function, the model is also known as projection pursuit regression (Friedman and Stuetzle, 1981); when m = 1, the model reduces to the single-index model (Ichimura, 1993). As for additive models, in some applications it is natural to impose shape constraints on the ridge functions; for instance, Foster et al. (2013) argued in favour of the use of monotone single-index models for analysing certain randomized clinical trial data. In other cases, as pointed out by Xu et al. (2014), shape restrictions are attractive as tractable non-parametric relaxations of linear models. Recent work by Kim and Samworth (2014) has shown that shape-restricted inference without further assumptions can lead to slow rates of convergence in higher dimensions. The additive or index structure therefore becomes particularly attractive in conjunction with shape constraints as an attempt to evade the curse of dimensionality. In Section 3, we extend our methodology and theory to this setting, allowing us to estimate simultaneously the projection indices, the ridge functions and the intercept. The challenge of computing our estimators is taken up in Section 4, where our algorithms are described in detail.
In Section 5, we summarize the results of a thorough simulation study designed to compare the finite sample properties of scar with several alternative procedures. We conclude in Section 6 with two applications of our methodology to real data sets concerning doctoral publications in biochemistry and the decathlon. All proofs, as well as various auxiliary results, are given in the on-line supplementary material.
This paper contributes to the larger literature on regression in the presence of shape constraints. In the univariate case, and with the identity link function, the properties of shape-constrained least squares procedures are well understood, especially for the problem of isotonic regression. See, for instance, Brunk (1958, 1970) and Barlow et al. (1972). For the problem of univariate convex regression, see Hanson and Pledger (1976), Groeneboom et al. (2001, 2008) and Guntuboyina and Sen (2013). These references cover consistency, local and global rates of convergence, and computational aspects of the estimator. Hall and Huang (2001) proposed an alternative approach to univariate monotone regression, based on perturbing a kernel estimator. Banerjee (2007) and Banerjee (2009) studied monotone regression models in which the conditional distribution of a response given a covariate is assumed to come from a regular parametric model and an exponential family respectively. There have been several papers studying additive isotonic regression, including Bacchetti (1989), Morton-Jones et al. (2000), Tutz and Leitenstorfer (2007), Cai and Dunson (2007), Mammen and Yu (2007), Brezger and Steiner (2008), Cheng (2009), Cheng et al. (2012), Fang and Meinshausen (2012), Rueda (2013) and Yu (2014). The recent work of Meyer (2013a) develops similar methodology (but not theory) to ours in the Gaussian, non-index setting. The problem of GAMs with shape restrictions was also recently studied by Pya and Wood (2015), who proposed a penalized spline method that is compared with ours in Section 5; in particular, they consider the same set of constraints as in this paper. Meyer et al. (2011) investigated a Bayesian spline-based approach to the problem of GAMs, with a focus on the isotonic case.

Background
Recall that the density function of a natural exponential family (EF) distribution with respect to a reference measure (either Lebesgue measure on R or counting measure on Q) can be written in the form

p(y; μ, φ) = exp[{y g(μ) − B(g(μ))}/φ] h(y, φ),

where μ ∈ M ⊆ R and φ ∈ Φ ⊆ (0, ∞) are the mean and dispersion parameters respectively. To simplify our discussion, we restrict our attention to the most commonly used natural EF distributions, namely the Gaussian, gamma, Poisson and binomial families, and take g to be the canonical link function. Expressions for g and the (strictly convex) log-partition function B for the different exponential families can be found in Table 2. The corresponding distributions are denoted by EF_{g,B}(μ, φ), and we write dom(B) = {η ∈ R : B(η) < ∞} for the domain of B. As a convention, for the binomial family, the response is scaled to take values in {0, 1/T, 2/T, …, 1} for some known T ∈ N.
If (X_1, Y_1), …, (X_n, Y_n) are independent and identically distributed pairs taking values in R^d × R, with Y_i | X_i ∼ EF_{g,B}(g^{-1}{f(X_i)}, φ) for some prediction function f : R^d → dom(B), then the (conditional) log-likelihood of f can be written as

ℓ_n(f) = Σ_{i=1}^n [{Y_i f(X_i) − B(f(X_i))}/φ + log{h(Y_i, φ)}].

Table 2. Exponential family distributions, their corresponding canonical link functions, log-partition functions and mean and dispersion parameter spaces

Since we are only interested in estimating f, it suffices to consider the scaled partial log-likelihood

ℓ̄_n(f) ≡ ℓ̄_n(f; (X_1, Y_1), …, (X_n, Y_n)) = n^{-1} Σ_{i=1}^n {Y_i f(X_i) − B(f(X_i))}.
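For concreteness, the scaled partial log-likelihood can be written out for the Poisson family, where B(η) = exp(η) under the canonical log link (see Table 2). The sketch below is illustrative only; the h(Y_i, φ) term is dropped since it does not involve f.

```python
import numpy as np

def poisson_scaled_partial_loglik(f_vals, y):
    """n^{-1} * sum_i { Y_i f(X_i) - B(f(X_i)) } with B(eta) = exp(eta),
    i.e. the Poisson family with canonical log link. `f_vals` holds the
    values f(X_i) of the prediction function at the design points."""
    f_vals = np.asarray(f_vals, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(y * f_vals - np.exp(f_vals)))
```

As a sanity check, for a single observation with y = 2 the criterion is maximized at f = log 2, the saturated fit.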

Maximum likelihood estimation under shape constraints
To guarantee the existence of our estimator, defined in expression (3) below, it turns out to be convenient to extend the definition of each ℓ_i (and therefore ℓ̄_n) to all f : R^d → R, which we do as follows.
(a) For the gamma family, if f(X_i) ≥ 0, then we take ℓ_i(f) = −∞.

For any shape vector L^d = (l_1, …, l_d)^T ∈ {1, 2, …, 9}^d, let F = F^{L^d} denote the class of functions of the form f(x) = Σ_{j=1}^d f_j(x_j) + c for x = (x_1, …, x_d)^T ∈ R^d, where, for every j = 1, …, d, f_j : R → R is a function obeying the shape restriction indicated by label l_j and satisfying f_j(0) = 0, and where c ∈ R. Whenever f has such a representation, we write f ∼ F(f_1, …, f_d, c), and call L^d the shape vector. The pointwise closure cl(F) of F is defined as the set of pointwise limits of sequences of functions in F. For a specified shape vector L^d, we define the shape-constrained maximum likelihood estimator (SCMLE) as

f̂_n ∈ arg max_{f ∈ cl(F)} ℓ̄_n(f).   (3)

Our reason for maximizing over cl(F) rather than F in the definition of f̂_n is a technical convenience: as we see from proposition 1 below, it ensures that a maximizer always exists. This would be false in certain special cases if instead we only maximized over F (see example 1 in Appendix A for such an instance), though, from the paragraph immediately following theorem 1 below, we see that the distinction is not too important. Like many other shape-restricted regression estimators, f̂_n is not unique in general. However, as can be seen from the second part of proposition 1, the value of f̂_n is uniquely determined at X_1, …, X_n.
In fact, as can be seen from the proof of proposition 1, if the EF distribution is Gaussian or gamma, then Ŝ_n ∩ F ≠ ∅. Whenever Ŝ_n ∩ F ≠ ∅, it contains an element for which each additive component is piecewise linear, and any solution obtained from our algorithm in Section 4.1 has this piecewise linear property.

Consistency of the shape-constrained maximum likelihood estimator
In this subsection, we show the consistency of f̂_n in a random-design setting. We shall impose the following assumptions.
Assumption 2. The random vector X has a Lebesgue density with support R^d.
Assumption 3. Conditionally on X_i, the response Y_i follows the EF_{g,B}(g^{-1}{f_0(X_i)}, φ_0) distribution, where f_0 ∈ F and φ_0 ∈ (0, ∞) denote the true prediction function and dispersion parameter respectively.
Assumption 4. f_0 is continuous on R^d.
We are now in a position to state our main consistency result. Under assumption 3, we may write f_0 ∼ F(f_{0,1}, …, f_{0,d}, c_0). When the assumptions of theorem 1 hold, in particular assumption 2, we see from the proof of theorem 1 that, for any a_0 > 0, with probability 1, for sufficiently large n, any f̂_n ∈ Ŝ_n can be written in the form f̂_n(x) = Σ_{j=1}^d f̂_{n,j}(x_j) + ĉ_n for x = (x_1, …, x_d)^T ∈ [−a_0, a_0]^d, where f̂_{n,j} satisfies the shape constraint l_j and f̂_{n,j}(0) = 0 for each j = 1, …, d.
We now turn to estimation of the additive components and the intercept. Recall that, whenever we write f ∼ F(f_1, …, f_d, c), we insist that f_j(0) = 0 for all j, and refer to this as an identifiability condition. This is because, if we also had f ∼ F(f̃_1, …, f̃_d, c̃), then we would have

Σ_{j=1}^d f_j(x_j) + c = Σ_{j=1}^d f̃_j(x_j) + c̃   (4)

for all x_1, …, x_d ∈ R, from which we could then conclude that c̃ = c and f̃_j = f_j for every j. Note, however, that, as observed by Meyer (2013a), if we only know that the equality (4) holds at a finite set of points (x_{1,1}, …, x_{1,d})^T, …, (x_{n,1}, …, x_{n,d})^T ∈ R^d, then we do not even necessarily know that f̃_j(x_{i,j}) = f_j(x_{i,j}) for all i, j. Nevertheless, the following corollary establishes the important fact that each additive component (as well as the intercept term) is estimated consistently by the SCMLE: for any a_0 > 0,

max_{j=1,…,d} sup_{x_j ∈ [−a_0, a_0]} |f̂_{n,j}(x_j) − f_{0,j}(x_j)| + |ĉ_n − c_0| → 0 almost surely as n → ∞.
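In practice, the constraint f_j(0) = 0 can always be enforced after fitting by recentring: subtract f_j(0) from each component and absorb the total shift into the intercept, which leaves the summed prediction unchanged. A small sketch, where the component functions are arbitrary placeholders rather than fitted estimates:

```python
def recenter(components, c, x0=0.0):
    """Enforce f_j(x0) = 0 for each additive component by shifting it and
    absorbing the shifts into the intercept; the summed prediction
    f(x) = sum_j f_j(x_j) + c is unchanged. `components` is a list of
    callables standing in for fitted component functions."""
    shifts = [f(x0) for f in components]
    new_components = [(lambda f, s: (lambda x: f(x) - s))(f, s)
                      for f, s in zip(components, shifts)]
    return new_components, c + sum(shifts)
```

The double-lambda captures each (f, s) pair by value, avoiding the usual late-binding pitfall with closures in a loop.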

The generalized additive index model and its identifiability
Recall that, in the GAIM, the real-valued response Y_i and the predictor X_i = (X_{i1}, …, X_{id})^T are related through equation (1), where g is a known link function, and where, conditionally on X_i, the response Y_i has a known EF distribution with mean parameter g^{-1}{f^I(X_i)} and dispersion parameter φ. Let A = (α_1, …, α_m) denote the d × m index matrix, where m ≤ d, and let f(z) = Σ_{j=1}^m f_j(z_j) + c for z = (z_1, …, z_m)^T ∈ R^m, so that the prediction function can be written as f^I(x) = f(A^T x) for x = (x_1, …, x_d)^T ∈ R^d. As in Section 2, we impose shape constraints on the ridge functions by assuming that f_j : R → R satisfies the shape constraint with label l_j ∈ {1, 2, …, 9}, for j = 1, …, m, and consider the shape vector L^m ∈ {1, …, 9}^m to be fixed throughout, so that in this section, as well as in the corresponding proofs and algorithm, F = F^{L^m}.
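The structure f^I(x) = f(A^T x) in equation (1) is straightforward to evaluate once the ingredients are in hand. A minimal sketch (the function names and the exp inverse link are illustrative choices, not the paper's notation):

```python
import numpy as np

def gaim_predict(X, A, ridge_fns, c, g_inv=np.exp):
    """Evaluate mu = g^{-1}{ f^I(x) }, where
    f^I(x) = sum_j f_j(alpha_j^T x) + c. The projection indices alpha_j are
    the columns of A; `g_inv` is the inverse link (exp shown, i.e. the
    Poisson canonical choice)."""
    Z = X @ A  # n x m matrix whose (i, j) entry is alpha_j^T X_i
    eta = sum(f(Z[:, j]) for j, f in enumerate(ridge_fns)) + c
    return g_inv(eta)
```

With the identity link and A the identity matrix, this reduces to an ordinary additive predictor.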
The interesting question of the identifiability of GAIMs was settled recently by Yuan (2011). To discuss the issue of identifiability carefully, we first state the main result of Yuan (2011) and then relate it to our shape-constrained setting. It is convenient to say that (α_1, …, α_m, f_1, …, f_m, c) satisfy the additive index model assumptions on (−a, a)^d with a ∈ (0, ∞] if the following conditions hold.
(g) Each f_j is either continuous at 0, or is monotonic or bounded on a subinterval of (−a, a).
Although theorem 2 requires several conditions, most of these are very natural and serve to rule out trivial identifiability failures. The most interesting conditions are 1(d) and 1(e). As explained in equation (2.4) of Yuan (2011), even if there is only one linear ridge function, further restrictions are required, because a linear ridge function f_j satisfies b_1 f_j(α_j^T x / b_1) = b_2 f_j(α_j^T x / b_2) for all non-zero scalars b_1 and b_2, so the scale of α_j cannot be separated from that of f_j. This is why we require α_k^T α_j = 0 for every k ≠ j whenever f_j is linear. As shown in proposition 1 of Yuan (2011), condition 1(e) is necessary, because, if there are two quadratic ridge functions, then their corresponding projection indices are not identifiable. This fact is closely related to the identifiability of independent component analysis models (Eriksson and Koivunen, 2004; Samworth and Yuan, 2012).

Generalized additive index model estimation
Let A_0 = (α_{0,1}, …, α_{0,m}) denote the true index matrix. For x = (x_1, …, x_d)^T ∈ R^d, let f^I_0(x) = f_{0,1}(α_{0,1}^T x) + … + f_{0,m}(α_{0,m}^T x) + c_0 be the true prediction function, and write f_0(z) = Σ_{j=1}^m f_{0,j}(z_j) + c_0 for z = (z_1, …, z_m)^T ∈ R^m. Again we restrict our attention to the common EF distributions listed in Table 2 and take g to be the corresponding canonical link function. In the light of the identifiability discussion in Section 3.1, it makes sense to define the class of index matrices associated with a given shape vector L^m as

A = {A = (α_1, …, α_m) ∈ R^{d×m} : A satisfies conditions 1(b) and 1(c), and, if ∃ k ∈ {1, …, m} such that l_k = 1, then α_j^T α_k = 0 for every j ≠ k}.

We can now consider the set G of shape-constrained additive index functions of the form f^I(x) = f(A^T x) with A ∈ A and f ∈ F. By analogy with the approach that was adopted in Section 2, a natural idea is to seek to maximize the scaled partial log-likelihood ℓ̄_n over the pointwise closure of G. As part of this process, and writing ℓ̄_n(f; A) = ℓ̄_n{f; (A^T X_1, Y_1), …, (A^T X_n, Y_n)} for the scaled partial index log-likelihood, we would like to find a d × m matrix in A that maximizes

Λ_n(A) = max_{f ∈ cl(F)} ℓ̄_n(f; A),

where the dependence of Λ_n(·) on L^m and (X_1, Y_1), …, (X_n, Y_n) is suppressed for notational convenience. We argue, however, that this strategy has two drawbacks. First, if m ≥ 2 and L^m ∈ 𝓛^m := {1, 4, 5, 6}^m ∪ {1, 7, 8, 9}^m, then, in certain cases, maximizing Λ_n(A) over A can lead to a perfect fit to the data; second, the function Λ_n(·) need not be upper semicontinuous, so we are not guaranteed that a maximizer exists. These phenomena are illustrated in examples 2 and 3 in Appendix A. As a result, certain modifications are required for our shape-constrained approach to be successful in the context of GAIMs.
To deal with the first issue when L^m ∈ 𝓛^m, we optimize Λ_n(·) over the restricted subset of matrices

A_δ = {A ∈ A : λ_min(A^T A) ≥ δ}

for some predetermined δ > 0, where λ_min(·) denotes the smallest eigenvalue of a non-negative definite matrix. Other strategies are also possible. For example, when L^m = (2, …, 2)^T, the 'perfect fit' phenomenon can be avoided by considering only matrices with non-negative entries (see Section 6.2 below).
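The eigenvalue restriction is cheap to verify in practice. A sketch of the membership check for the restricted class (the function name is ours, not the paper's):

```python
import numpy as np

def in_A_delta(A, delta):
    """Check the eigenvalue restriction: the smallest eigenvalue of A^T A
    must be at least delta. This rules out nearly collinear projection
    indices, which underlie the 'perfect fit' degeneracy."""
    lam_min = np.linalg.eigvalsh(A.T @ A).min()
    return lam_min >= delta
```

Orthonormal columns give λ_min(A^T A) = 1, so they pass for any δ ≤ 1, while nearly collinear columns drive λ_min toward 0 and are rejected.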
To address the second issue, we shall show that, given f^I_0 ∈ G satisfying the identifiability conditions, it is sufficient for consistent estimation to find f̃^I_n from a set S̃_n of near-maximizers of Λ_n(·), of the form f̃^I_n(x) = f̃_n(Ã_n^T x), where Ã_n = (α̃_{n,1}, …, α̃_{n,m}) ∈ A or A_δ is the estimated index matrix and f̃_n(z) = Σ_{j=1}^m f̃_{n,j}(z_j) + c̃_n is the estimated additive function, with f̃_{n,j} satisfying the shape constraint l_j and f̃_{n,j}(0) = 0 for every j = 1, …, m. We call f̃^I_n the shape-constrained additive index estimator (SCAIE), and write Ã_n and f̃_{n,1}, …, f̃_{n,m} respectively for the corresponding estimators of the index matrix and ridge functions.
When there is a maximizer of the function Λ_n(·) over A or A_δ, the set S̃_n is certainly non-empty; in fact, S̃_n is always non-empty, in view of the following proposition and the argument below it.
If a maximizer of Λ_n(·) does not exist, there must exist some Å_n such that Λ_n(Å_n) > Λ_n(A_0). It then follows from proposition 2 that lim inf_{A → Å_n} Λ_n(A) ≥ Λ_n(Å_n) > Λ_n(A_0) ≥ ℓ̄_n(f_0; A_0), so any A sufficiently close to Å_n yields a prediction function in S̃_n. A stochastic search algorithm can be employed to find such matrices; see Section 4.2 for details.

Consistency of shape-constrained additive index estimator
In this subsection, we show the consistency of f̃^I_n and Ã_n in a random-design setting. In addition to assumptions 1 and 2, we require the following conditions.

Condition 2. Fix L^m ∈ {1, 2, …, 9}^m. The true prediction function f^I_0 and the corresponding index matrix A_0 satisfy the shape-constrained additive index model conditions 1(a′) and 1(b)–1(f) on R^d with shape vector L^m. In particular, if the ridge function f_{0,j} is linear, then l_j = 1.
Theorem 3. Assume assumptions 1 and 2 as well as conditions 2–4. Then, provided that δ ≤ λ_min(A_0^T A_0) when L^m ∈ 𝓛^m, we have, for every a_0 > 0, that sup_{x ∈ [−a_0, a_0]^d} |f̃^I_n(x) − f^I_0(x)| → 0 almost surely as n → ∞.

To obtain consistency, whenever L^m ∈ 𝓛^m, the practitioner requires an a priori lower bound for λ_min(A_0^T A_0), and this lower bound plays a role in the computation. However, it is quite natural to require the projection indices not to be too highly correlated, to aid interpretability. In practice, choosing δ too small can result in overfitting when n is small, but in our experience the method is relatively insensitive to quite a wide range of choices of δ.
Consistency of the estimated index matrix and the ridge functions is established in the next corollary. Some care, however, is required to define an appropriate notion of distance between the estimator and the estimand. In particular, the ordering of the additive components is arbitrary: we can only hope to estimate the set of projection indices, and not their ordering, so it is only by allowing a permutation of the components that we can guarantee that the estimated quantities are asymptotically close to their population counterparts. Each projection index has a corresponding ridge function, so in permuting the ordering we must take care to apply the same permutation to both the projection indices and the ridge functions. Similarly, since we are also unable to estimate the zero entries of the index matrix exactly, we should also allow the sign of each column of the index matrix to be flipped. This discussion leads us to make the following definition, where P_m denotes the set of permutations of {1, …, m}.
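The permutation- and sign-invariant notion of distance described above can be made concrete as follows. This is a brute-force sketch (feasible for small m; the function name is ours):

```python
import itertools
import numpy as np

def index_matrix_distance(A_hat, A0):
    """Frobenius distance between an estimated and a true index matrix,
    minimized over column permutations and column sign flips, reflecting
    that neither the ordering nor the signs of the projection indices are
    estimable."""
    m = A0.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(m)):
        P = A_hat[:, list(perm)]
        for signs in itertools.product([1.0, -1.0], repeat=m):
            best = min(best, np.linalg.norm(P * np.array(signs) - A0))
    return best
```

For instance, swapping the columns of the identity matrix and negating one of them still gives distance zero, as it should.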

Computation of shape-constrained maximum likelihood estimator
Throughout this subsection, we fix the shape vector L^d = (l_1, …, l_d)^T, the EF distribution and the values of the observations, and present an algorithm for computing the SCMLE described in Section 2. The algorithm is quite different from the backfitting algorithms that are commonly used for fitting (generalized) additive models (Breiman and Friedman, 1985; Buja et al., 1989; Mammen and Park, 2006; Yu et al., 2008), but we found it to be somewhat faster in practice for our purposes.
Our aim is to reformulate the problem as a convex program in terms of basis functions and to apply an active set algorithm (Nocedal and Wright, 2006). Such algorithms have recently become popular for computing various shape-constrained estimators. For instance, Groeneboom et al. (2008) used a version, which they called the 'support reduction algorithm', in the one-dimensional convex regression setting; Dümbgen and Rufibach (2011) applied another variant to compute the univariate log-concave maximum likelihood density estimator. Recently, Meyer (2013b) developed a 'hinge' algorithm for quadratic programming, which can also be viewed as a variant of the active set algorithm.

Table 3. Pseudocode of the active set algorithm for computing the SCMLE

Step 1 (Initialization, outer loop): sort {X_i}_{i=1}^n coordinate by coordinate; define the initial working set as S_1 = {(0, j) : j ∈ {1, …, d_1}} ∪ {(1, j) : l_j ∈ {4, 7}}; in addition, define the set of potential elements as S = {(i, j) : i = 1, …, n, j = d_1 + 1, …, d}; set the iteration count k = 1.
Step 2 (Initialization, inner loop): if k > 1, set w* = w^{(k−1)}.
Step 3 (Unrestricted GLM): solve the unrestricted GLM problem by iteratively reweighted least squares (IRLS), where, for k > 1, w* is used as a warm start; store the solution in w^{(k)} (with zero weights for the elements outside S_k).
Step 4 (Working set refinement): if k = 1, or if w_{ij} > 0 for every (i, j) ∈ S_k \ S_1, go to step 5; otherwise, determine the moving ratio p and the set of elements to drop, set w* = (1 − p)w* + pw^{(k)}, remove the dropped elements from S_k, and go to step 3.
Step 5 (Derivative evaluation): for every (i, j) ∈ S, compute D^{(k)}_{i,j} = ∂ψ_n/∂w_{ij}(w^{(k)}).
Step 6 (Working set enlargement): write S_+ = arg max_{(i,j) ∈ S} D^{(k)}_{i,j} for the enlargement set, with maximum D^{(k)} = max_{(i,j) ∈ S} D^{(k)}_{i,j}; if D^{(k)} ≤ 0 (or some other criterion is met if the EF distribution is non-Gaussian, e.g. D^{(k)} < ε_IRLS for some predetermined small ε_IRLS > 0), stop the algorithm and go to step 7; otherwise, pick any single-element subset S*_+ ⊆ S_+, let S_{k+1} = S_k ∪ S*_+, set k := k + 1 and go back to step 2.
Step 7 (Output): for every j = 1, …, d, set f̂_{n,j}(x_j) = Σ_{i : (i,j) ∈ S_k} w^{(k)}_{ij} g_{ij}(x_j); take ĉ_n = w^{(k)}_{00}; finally, return the SCMLE as f̂_n(x) = Σ_{j=1}^d f̂_{n,j}(x_j) + ĉ_n.

Without loss of generality, we assume in what follows that only the first d_1 components (d_1 ≤ d) of f_0 are linear, i.e. l_1 = … = l_{d_1} = 1 and (l_{d_1+1}, …, l_d)^T ∈ {2, …, 9}^{d−d_1}. Furthermore, we assume that the order statistics {X_{(i),j}}_{i=1}^n of {X_{ij}}_{i=1}^n are distinct for every j = d_1 + 1, …, d. For x = (x_1, …, x_d)^T ∈ R^d, define the basis functions g_{0j}(x_j) = x_j for j = 1, …, d_1 and, for i = 1, …, n and j = d_1 + 1, …, d, basis functions g_{ij} adapted to the shape constraint l_j. Note that all of these basis functions are zero at the origin. Let W denote the set of weight vectors w = (w_{00}, w_{01}, …, w_{0d_1}, w_{1(d_1+1)}, …, w_{n(d_1+1)}, …, w_{1d}, …, w_{nd})^T ∈ R^{n(d−d_1)+d_1+1} satisfying w_{ij} ≥ 0 for every i = 1, …, n and every j with l_j ∈ {2, 3, 5, 6, 8, 9}, and w_{ij} ≥ 0 for every i = 2, …, n and every j with l_j ∈ {4, 7}. To compute the SCMLE, it suffices to consider prediction functions of the form f_w(x) = w_{00} + Σ_{j=1}^{d_1} w_{0j} g_{0j}(x_j) + Σ_{j=d_1+1}^d Σ_{i=1}^n w_{ij} g_{ij}(x_j). Our optimization problem can then be reformulated as maximizing ψ_n(w) = ℓ̄_n{f_w; (X_1, Y_1), …, (X_n, Y_n)} over w ∈ W. Note that ψ_n is a concave (but not necessarily strictly concave) function. Our goal here is therefore to find a sequence (w^{(k)}) such that ψ_n(w^{(k)}) → sup_{w ∈ W} ψ_n(w) as k → ∞. In Table 3, we give the pseudocode for our active set algorithm for finding the SCMLE, which is implemented in the R package scar (Chen and Samworth, 2014). We outline below some implementation details.
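The basis-function reformulation can be illustrated for a single convex, non-decreasing component. We assume hinge bases max(x − knot, 0) anchored at the ordered design points; this is one plausible basis choice for exposition, not necessarily the bases used by scar, and the Gaussian (least squares) case is shown since the sign-constrained fit then reduces to bounded linear least squares:

```python
import numpy as np
from scipy.optimize import lsq_linear

def convex_increasing_fit(x, y):
    """Sketch: least squares fit of a convex, non-decreasing, piecewise
    linear function as a free intercept plus non-negative weights on hinge
    bases max(x - knot, 0). Non-negative hinge weights give non-decreasing,
    convex cumulative slopes automatically."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    knots = xs[:-1]
    # design matrix: intercept column plus one hinge per (interior) knot
    B = np.column_stack([np.ones_like(xs)] +
                        [np.maximum(xs - k, 0.0) for k in knots])
    lb = np.concatenate([[-np.inf], np.zeros(len(knots))])  # intercept free
    res = lsq_linear(B, ys, bounds=(lb, np.full(len(knots) + 1, np.inf)))
    return B @ res.x
```

Because the hinge weights are constrained to be non-negative, the fitted slopes are non-decreasing, so the output is convex and non-decreasing by construction, mirroring how sign constraints on w encode the shape restrictions.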
(a) Iteratively reweighted least squares (IRLS): step 3 solves an unrestricted GLM problem by applying IRLS. Since the canonical link function is used here, IRLS is simply the Newton-Raphson method. If the EF distribution is Gaussian, then IRLS gives the exact solution of the problem in just one iteration. Otherwise, there is no closed-form expression for the solution, so a threshold ε_IRLS must be picked to serve as part of the stopping criterion. Note that IRLS can here be replaced by other methods that solve GLM problems, though we found that IRLS offers competitive timing performance.

(b) Derivative evaluation: the partial derivatives in step 5 take the form D^{(k)}_{i,j} = n^{-1} Σ_{u=1}^n r^{(k)}_u g_{ij}(X_{uj}), where r^{(k)}_u = Y_u − B′(f_{w^{(k)}}(X_u)) denotes the uth residual at the kth iteration.

Table 4. Pseudocode of the stochastic search algorithm for computing the SCAIE

Step 1 (Initialization): let N denote the total number of stochastic searches; set k = 1.
Step 2 (Draw random matrices): draw a d × m random matrix A_k by initially choosing the entries to be IID N(0, 1) random variables; for each column of A_k, if there is a j ∈ {1, …, m} such that l_j = 1, subtract its projection onto the jth column of A_k so that condition 1(d) is satisfied; then normalize each column so that conditions 1(b) and 1(c) are satisfied.
Step 3 (Rejection sampling): if L^m ∈ 𝓛^m and λ_min{(A_k)^T A_k} < δ, then go back to step 2; otherwise, if k < N, set k := k + 1 and go to step 2.
Step 4 (Evaluation of Λ_n): for every k = 1, …, N, compute Λ_n(A_k) using the active set algorithm described in Table 3.
Step 5 (Index matrix estimation, 1): let A* ∈ arg max_{1 ≤ k ≤ N} Λ_n(A_k); set Ã_n = A*.
Step 6 (Index matrix estimation, 2; optional): treat A* as a warm start and apply another optimization strategy to find A** in a neighbourhood of A* such that Λ_n(A**) > Λ_n(A*); if such an A** can be found, set Ã_n = A**.
Step 7 (Output): use the active set algorithm described in Table 3 with Ã_n to compute the corresponding ridge function and intercept estimates.

For simplicity, we suppress henceforth the superscript k. Now fix j and reorder the pairs (r_i, X_{ij}) as (r_{(1)}, X_{(1),j}), …, (r_{(n)}, X_{(n),j}) such that X_{(1),j} ≤ … ≤ X_{(n),j} (note that this sorting is performed in step 1).
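The equivalence of IRLS and Newton-Raphson under the canonical link can be sketched for an unrestricted Poisson GLM. This is a self-contained illustration of the step 3 sub-problem, not the scar subroutine: with μ = exp(Xβ), the score is X^T(y − μ) and the Hessian is −X^T diag(μ) X.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """Newton-Raphson (equivalently, IRLS under the canonical log link) for
    an unrestricted Poisson GLM. Each step solves the weighted least
    squares system with working weights mu."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                    # current fitted means
        hessian = (X * mu[:, None]).T @ X        # X^T diag(mu) X
        beta = beta + np.linalg.solve(hessian, X.T @ (y - mu))
    return beta
```

If y happens to equal exp(Xβ*) exactly, the score equations are solved at β*, so the iteration recovers β* to machine precision; in the Gaussian analogue a single step would suffice, matching the remark in (a).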
Furthermore, define R_{i,j} = n^{-1} Σ_{u=1}^i r_{(u)} for i = 1, …, n, where we suppress the explicit dependence of r_{(u)} on j in the notation. We have R_{n,j} = 0 because of the presence of the intercept w_{00}. The following recurrence relations can be derived by simple calculation.
(c) Convergence: if the EF distribution is Gaussian, then it follows from theorem 1 of Groeneboom et al. (2008) that our algorithm converges to the optimal solution after finitely many iterations. In general, the convergence of this active set strategy depends on the following two aspects.
(i) Convergence of IRLS: the convergence of the Newton-Raphson method in step 3 depends on the starting values, and is not guaranteed without step-size optimization; see Jørgensen (1983). However, from the second iteration onwards, each IRLS run starts from the previous, well-approximated solution, which typically makes the method work well. (ii) Accuracy of IRLS: if IRLS gave the exact solution every time, then ψ_n(w^{(k)}) would increase at each iteration. In particular, one can show that, at the kth iteration, the new element S*_+ added to the working set in step 6 would remain in the working set S_{k+1} after the (k + 1)th iteration. However, since IRLS returns only an approximate solution, there is no guarantee that this phenomenon continues to hold. One way to resolve the issue is to reduce the tolerance ε_IRLS whenever ψ_n(w^{(k)}) ≤ ψ_n(w^{(k−1)}), and to redo the computations for both the previous and the current iteration.
Here we terminate our algorithm in step 6 if either ψ_n(w^{(k)}) is non-increasing or D^{(k)} < ε_IRLS. In our numerical work, we did not encounter convergence problems, even outside the Gaussian setting.

Computation of shape-constrained additive index estimator
The computation of the SCAIE can be divided into two parts.
(a) For a given A, find f ∈ cl(F) that maximizes ℓ̄_n(f; A) by using the algorithm in Table 3, but with A^T X_i replacing X_i. Denote the corresponding maximum value by Λ_n(A). (b) For the (lower semicontinuous) function Λ_n on A or A_δ, as appropriate, find a maximizing sequence (A_k) in this set.
The second part of this algorithm solves a finite-dimensional optimization problem. Possible strategies include the differential evolution method (Price et al., 2005) or the stochastic search strategy (Dümbgen et al., 2013) described below. In Table 4, we give the pseudocode for computing the SCAIE. We note that step 4 of the stochastic search algorithm is parallelizable.
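The draw-normalize-reject-score pattern of the stochastic search can be sketched as follows. This is a simplified illustration: `score_fn` stands in for Λ_n (in reality the profiled partial log-likelihood computed by the active set algorithm), and the orthogonalization step for linear components is omitted.

```python
import numpy as np

def stochastic_search(score_fn, d, m, n_draws=200, delta=None, seed=0):
    """Random search over candidate index matrices: draw N(0,1) entries,
    normalize each column to unit length, optionally reject matrices with
    lambda_min(A^T A) < delta, and keep the best-scoring candidate.
    `score_fn` is a stand-in for Lambda_n."""
    rng = np.random.default_rng(seed)
    best_A, best_val = None, -np.inf
    while n_draws > 0:
        A = rng.standard_normal((d, m))
        A /= np.linalg.norm(A, axis=0)           # unit-norm columns
        if delta is not None and np.linalg.eigvalsh(A.T @ A).min() < delta:
            continue                              # rejection step
        n_draws -= 1
        val = score_fn(A)
        if val > best_val:
            best_A, best_val = A, val
    return best_A, best_val
```

Since each draw is scored independently, the evaluation loop is trivially parallelizable, as noted above for step 4.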

Simulation study
To analyse the empirical performance of the SCMLE and SCAIE, we ran a simulation study focusing on the running time and the predictive performance. Throughout this section, we took ε_IRLS = 10^{−8}.

Generalized additive models with shape restrictions
For each data set, we took X_1, …, X_n ∼ IID U[−1, 1]^d. The following three problems were considered.
(i) Gaussian: for i = 1, …, n, conditionally on X_i, draw Y_i ∼ N{f_0(X_i), 0.5^2} independently. (ii) Poisson: for i = 1, …, n, conditionally on X_i, draw Y_i independently from the Poisson distribution with mean g^{-1}{f_0(X_i)}, where g(μ) = log(μ). (iii) Binomial: for i = 1, …, n, draw T_i (independently of X_1, …, X_n) from a uniform distribution on {11, 12, …, 20}, and then, conditionally on X_i and T_i, draw T_i Y_i independently from the Bin[T_i, g^{-1}{f_0(X_i)}] distribution. All the component functions are convex, so f_0 is convex; this allows us to compare our method with other shape-restricted methods in the Gaussian setting. In the binomial setting, we considered EFs with different dispersion parameters, since T_i can here take different values. The resulting partial log-likelihood can be viewed as a weighted version of the original partial log-likelihood, and this feature is easily incorporated into the SCMLE. Regardless of the EF distribution, problem 3 represents a more challenging (higher dimensional) problem.
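The Gaussian data-generating mechanism above is simple to reproduce. A sketch, in which `f0` is a placeholder for a true prediction function rather than any of the paper's specific choices:

```python
import numpy as np

def simulate_gaussian(n, d, f0, sigma=0.5, seed=1):
    """One simulated data set for the Gaussian setting:
    X_i iid Uniform[-1, 1]^d and Y_i | X_i ~ N{f0(X_i), sigma^2}."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    Y = f0(X) + sigma * rng.standard_normal(n)
    return X, Y
```

The Poisson and binomial settings differ only in the conditional draw of Y_i given the linear predictor.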
In the Gaussian setting, we compared the performance of the SCMLE with shape-constrained additive models, SCAM (Pya and Wood, 2015), GAMs with integrated smoothness estimation, GAMIS (Wood, 2004), multivariate adaptive regression splines with maximum interaction degree equal to 1, MARS (Friedman, 1991), regression trees, Tree (Breiman et al., 1984), convex adaptive partitioning, CAP (Hannah and Dunson, 2013), and multivariate convex regression, MCR (Lim and Glynn, 2012; Seijo and Sen, 2011). Some of these methods are not designed to deal with non-identity link functions, so in the Poisson and binomial settings we compared the SCMLE with only SCAM and GAMIS. SCAM can be viewed as a shape-restricted version of GAMIS; it is a spline-based method, and is implemented in the R package scam (Pya, 2012). GAMIS is implemented in the R package mgcv (Wood, 2012), whereas MARS can be found in the R package mda (Hastie et al., 2011). Regression trees are implemented in the R package tree (Ripley, 2012), and CAP is implemented in MATLAB by Hannah and Dunson (2013). We implemented MCR in MATLAB using the interior-point-convex solver. Default settings were used for all the competitors mentioned above.
For different sample sizes n = 200, 500, 1000, 2000, 5000, we ran all the methods on 50 randomly generated data sets. Our numerical experiments were carried out on standard 32-bit desktops with 1.8 GHz central processing units. Each method was given at most 1 h per data set; beyond this limit, the run was forced to stop and the corresponding results were omitted. Tables 1 and 2 in the on-line supplementary material provide the average running time of each method per training data set. The SCMLE takes roughly 1-2 s with a sample size of n = 1000 and 4-8 additive components, which is unsurprisingly slower than Tree or MARS, but it is typically faster than other shape-constrained methods such as SCAM and MCR. Note that MCR is particularly slow compared with the other methods and becomes computationally infeasible for n ≥ 1000.
To study the empirical performance of the SCMLE, we drew 10^5 covariates independently from U[−0.98, 0.98]^d and estimated the mean integrated squared error MISE, namely E{∫_[−0.98, 0.98]^d (f̂_n − f_0)^2}, using Monte Carlo integration. Estimated MISEs are given in Tables 5 and 6. For every setting that we considered, the SCMLE performs better than Tree, CAP and MCR. This is largely because these three estimators do not take into account the additive structure. In particular, MCR suffers severely from its boundary behaviour. The performance of SCAM depends on the choice of a tuning parameter k that controls the number of B-spline basis functions for each component function. The default choice in the scam package is k = 10, though we also experimented with k = 20 and k = 30. The picture is somewhat mixed: in some settings, moving from k = 10 to k = 20 resulted in improvements for larger sample sizes, whereas in others the results were very similar, or performance even deteriorated. Moving from k = 20 to k = 30 made much smaller differences. A drawback of increasing k is that the computation quickly became very burdensome, and in fact sometimes went beyond our '1 h per data set' cut-off for k = 30 when n = 5000. The k = 10 and k = 20 versions of SCAM are denoted SCAM_10 and SCAM_20 respectively in Tables 5 and 6. We found that SCAM and GAMIS occasionally offer slightly better performance than the SCMLE when n is small. This is also mainly caused by the boundary behaviour of the SCMLE, and is alleviated as the number of observations n increases. In fact, in each of the problems considered, the SCMLE enjoys better predictive performance than the other methods for n ≥ 500. The SCMLE appears to offer particular advantages when the true signal exhibits inhomogeneous smoothness, since it can regularize in a locally adaptive way, whereas both SCAM and GAMIS rely on a single level of regularization throughout the covariate space.
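The Monte Carlo estimate of the integrated squared error just described can be sketched as follows; `f_hat` and `f_true` stand for a fitted prediction function and the truth, and the Monte Carlo sample size here is an arbitrary illustrative choice (averaging this quantity over many fitted estimators gives the MISE).

```python
import random

def monte_carlo_ise(f_hat, f_true, d, n_mc=10**4, a=0.98, seed=0):
    # Estimates the integral of (f_hat - f_true)^2 over [-a, a]^d by
    # drawing uniform points and rescaling the sample mean by the
    # volume of the cube.
    rng = random.Random(seed)
    vol = (2 * a) ** d  # volume of [-a, a]^d
    total = 0.0
    for _ in range(n_mc):
        x = [rng.uniform(-a, a) for _ in range(d)]
        total += (f_hat(x) - f_true(x)) ** 2
    return vol * total / n_mc
```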
Finally, we note that in certain other shape-constrained estimation problems where boundary effects are also observed, such as log-concave density estimation, it is possible to construct a fully automatic smoothed estimate to alleviate these issues (Dümbgen and Rufibach, 2009;Cule et al., 2010;Chen and Samworth, 2013). However, in this instance, it seems that the most obvious remedy for boundary effects for small sample sizes would be to impose an additional constraint on the Lipschitz constant of each convex or concave component, and an upper and lower bound on each monotone component. Although feasible to implement, it seems difficult to give practical advice for the choice of these tuning parameters, and we do not pursue this issue further here.

Generalized additive index models with shape restrictions
In our comparisons of different estimators in GAIMs, we focused on the Gaussian case to facilitate comparisons with other methods. For each data set we took X_1, ..., X_n ∼ i.i.d. U[−1, 1]^d and considered the following two problems. (a) In problem 4, d = 4 and m = 1. We set L_1 = 4 and chose f_0^I accordingly.
In both problems, conditionally on X_i, we drew Y_i ∼ N{f_0^I(X_i), 0.5^2} independently for i = 1, ..., n. We compared the performance of our SCAIE with projection pursuit regression, PPR (Friedman and Stuetzle, 1981), multivariate adaptive regression splines with maximum interaction degree equal to 2, MARS, and Tree. In addition, in problem 4, we also considered the semiparametric single-index method SSI (Ichimura, 1993), CAP and MCR. SSI was implemented in the R package np (Hayfield and Racine, 2013). The SCAIE was computed by using the algorithm illustrated in Table 4. We picked the total number of stochastic searches to be N = 100. Because problem 4 is a single-index problem (i.e. m = 1), there is no need to supply δ. In problem 5, we chose δ = 0.1. We considered sample sizes n = 200, 500, 1000, 2000, 5000. Table 3 in the on-line supplementary material gives the average running time of each method per training data set. Although the SCAIE is slower than PPR, MARS and Tree, its computation can be accomplished within 10-20 s when n = 1000. As SSI adopts a leave-one-out cross-validation strategy, it is typically considerably slower than the SCAIE.
Estimated MISEs of different estimators over [−0:98, 0:98] d are given in Table 7 based on 50 randomly generated data sets. In both problem 4 and problem 5, we see that SCAIE outperforms its competitors for all the sample sizes that we considered. It should, of course, be noted that SSI, PPR, MARS and Tree do not enforce the shape constraints, whereas MARS, Tree, CAP and MCR do not take into account the additive index structure.
In the index setting, it is also of interest to compare the performance of those methods that directly estimate the index matrix. We therefore estimated root-mean squared errors RMSE, given by √E(‖α̂_n,1 − α_0,1‖_2^2) in problem 4, where α_0,1 = (0.25, 0.25, 0.25, 0.25)^T. For problem 5, we estimated mean errors in the Amari distance ρ, defined by Amari et al. (1996) as

ρ(Ã_n, A_0) = (1/2d) Σ_{i=1}^d { (Σ_{j=1}^d |C_ij|) / (max_j |C_ij|) − 1 } + (1/2d) Σ_{j=1}^d { (Σ_{i=1}^d |C_ij|) / (max_i |C_ij|) − 1 },

where C_ij = (Ã_n A_0^{−1})_ij and

A_0 = ( 0.5   0.5
        0.5  −0.5 ).

This distance measure is invariant to permutation and takes values in [0, d − 1]. Results obtained for the SCAIE and, where applicable, SSI and PPR are displayed in Table 8. For both problems, the SCAIE performs better than both SSI and PPR in terms of estimating the projection indices.
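As a hedged illustration, the Amari distance above can be computed as follows for the 2 × 2 matrices of problem 5; the helper names `inv2` and `matmul` are ours, not part of any package.

```python
def inv2(A):
    # inverse of a 2 x 2 matrix via the adjugate formula
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def amari_distance(C, d):
    # rho computed from C = A_tilde @ inv(A_0); equals 0 exactly when C
    # is a scaled permutation matrix, and at most d - 1.
    row = sum(sum(abs(c) for c in Ci) / max(abs(c) for c in Ci) - 1.0
              for Ci in C)
    col = 0.0
    for j in range(d):
        colj = [abs(C[i][j]) for i in range(d)]
        col += sum(colj) / max(colj) - 1.0
    return (row + col) / (2.0 * d)
```

With a perfect estimate (or any column permutation of it), C is a permutation matrix and the distance is 0, which is what makes the measure suitable when the index matrix is identifiable only up to permutation.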

Real data examples
In this section, we apply our estimators in two real data examples. In the first, we study doctoral publications in biochemistry and fit a generalized (Poisson) additive model with concavity constraints, whereas in the second we use an additive index model with monotonicity constraints to study javelin performance in the decathlon.

Doctoral publications in biochemistry
The scientific productivity of a doctoral student may depend on many factors, including the number of young children that they have, the productivity of their supervisor, and their gender and marital status. Long (1990) studied this topic, focusing on the gender difference; see also Long (1997). The data set is available in the R package AER (Kleiber and Zeileis, 2013) and contains n = 915 observations. Here we model the number of articles written by the ith doctoral student in the last 3 years of their research as a Poisson random variable with mean μ_i, where

log(μ_i) = f_1(kids_i) + f_2(mentor_i) + a_3 gender_i + a_4 married_i + c,

for i = 1, ..., n, where kids_i and mentor_i are respectively the number of that student's children who are less than 6 years old, and the number of papers published by that student's supervisor during the same period of time. Both gender_i and married_i are factors taking values 0 and 1, where 1 indicates 'female' and 'married' respectively. In the original data set, there is an extra continuous variable that measures the prestige of the graduate programme. We chose to drop this variable in our example because (a) its values were determined quite subjectively and (b) including it does not seem to improve the predictive power in the above settings.
To apply the SCMLE, we assume that f_1 is a concave and monotone decreasing function, whereas f_2 is a concave function. The main estimates obtained from the SCMLE are summarized in Table 9 and Fig. 1. Outputs from SCAM and GAMIS are also reported for comparison. We see that, with the exception of f̂_n,2, the estimates obtained from these methods are relatively close. Note that, in Fig. 1, the GAMIS estimate of f_2 displays local fluctuations that might be more difficult to interpret than the estimates obtained by using the SCMLE and SCAM methods.
Finally, we examine the predictive power of the different methods via cross-validation. Here we randomly split the data set into training (70%) and validation (30%) subsets. For each split, we compute estimates by using only the training set and assess their predictive accuracy in terms of the root-mean-square prediction error RMSPE on the validation set. The RMSPEs reported in Table 10 are averages over 500 splits. Our findings suggest that, although comparable with SCAM, the SCMLE offers slight improvements over GAMIS and Tree for this data set.
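The split-and-refit scheme just described can be sketched generically; `fit` stands for any of the competing procedures, mapping a training set to a prediction function (the function names are ours, not part of the scar package).

```python
import math
import random

def rmspe_cv(data, fit, n_splits=500, train_frac=0.7, seed=0):
    # Average root-mean-square prediction error over random
    # training/validation splits; `data` is a list of (x, y) pairs and
    # `fit(train)` returns a function x -> prediction.
    rng = random.Random(seed)
    errs = []
    for _ in range(n_splits):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        cut = int(train_frac * len(data))
        train = [data[i] for i in idx[:cut]]
        valid = [data[i] for i in idx[cut:]]
        predict = fit(train)
        mse = sum((y - predict(x)) ** 2 for x, y in valid) / len(valid)
        errs.append(math.sqrt(mse))
    return sum(errs) / len(errs)
```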

Javelin throw
In this section, we consider the problem of predicting a decathlete's javelin performance from their performances in the other decathlon disciplines. Our data set consists of decathlon athletes who scored at least 6500 points in at least one athletic competition in 2012 and scored points in every event at that competition. To avoid dependence between observations, we include only one performance from each athlete, namely their 2012 personal best performance (over the whole decathlon). The data set, which consists of n = 614 observations, is available in the R package scar (Chen and Samworth, 2014). For simplicity, we only select events (apart from the javelin) that directly reflect the athlete's ability in throwing and short distance running, namely the shot put, discus, 100 m race and 110 m hurdles race. We fit the following additive index model:

javelin_i = f_1(A_11 shot_i + A_21 discus_i + A_31 100m_i + A_41 110m_i) + f_2(A_12 shot_i + A_22 discus_i + A_32 100m_i + A_42 110m_i) + ε_i,

for i = 1, ..., 614, where ε_i ∼ i.i.d. N(0, σ^2), and where javelin_i, shot_i, discus_i, 100m_i and 110m_i represent the corresponding decathlon event scores for the ith athlete. For the SCAIE, we assume that both f_1 and f_2 are monotone increasing, and we also assume that A_11, ..., A_41, A_12, ..., A_42 are non-negative. This slightly restricted version of the SCAIE aids interpretability of the indices, and prevents the 'perfect fit' phenomenon (see Section 3.2), so no choice of δ is required. Table 11 gives the index loadings estimated by the SCAIE. We observe that the first projection index can be viewed as a measure of the general athleticism of the athlete, whereas the second can be interpreted as a measure of throwing ability. Note that, when using the SCAIE, Â_n,32 and Â_n,42 are relatively small. To simplify our model further, and to seek improvement in predictive power, we therefore considered forcing these entries to be exactly 0 in the optimization steps of the SCAIE. This sparse version is denoted SCAIE_s. Its estimated index loadings are also reported in Table 11.
To compare the performance of our methods with PPR, MARS with maximum interaction degree equal to 2, and Tree, we again estimated the predictive power (in terms of RMSPE) via 500 repetitions of 70%-30% random splits into training and test sets. The corresponding RMSPEs are reported in Table 12. We see that both the SCAIE and SCAIE_s outperform their competitors on this particular data set. It is also interesting to note that SCAIE_s has a slightly lower RMSPE than the SCAIE, suggesting that the simpler (sparser) model might be preferred for prediction here.

Extensions and outlook
In this paper, we have developed methodology and theory for fitting GAMs with shape constraints on the additive components. Despite the non-parametric nature of the problem, our approach has the attractive feature that there are no tuning parameters to choose, and moreover it can be extended to handle an index structure. The algorithms that we have developed are fast to compute and are publicly available in the R package scar. We now describe various possible extensions of our theoretical result on the consistency of the SCMLE (theorem 1), and we conclude with a more general discussion of remaining challenges and possible future directions. As a generalization of the Gaussian shape-constrained additive model, consider the setting where the IID pairs (X_1, Y_1), ..., (X_n, Y_n) satisfy Y_i = f_0(X_i) + ε_i, for i = 1, ..., n, where E(ε_i | X_i) = 0 and var(ε_i | X_i) = φ_0. In this case, we can define the shape-constrained least squares estimator (SCLSE) by

f̃_n ∈ argmin_f Σ_{i=1}^n {Y_i − f(X_i)}^2,

where the minimum is taken over the class of additive functions satisfying the shape constraints, and note that the SCLSE coincides with the SCMLE when ε_i | X_i ∼ N(0, φ_0). Theorem 1 can be extended to show the consistency of the SCLSE; see the remarks following the proof of theorem 1 in the on-line supplementary material.
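As a small concrete instance of the SCLSE, the d = 1 monotone increasing case reduces to classical isotonic regression, which the pool-adjacent-violators algorithm solves exactly. A minimal sketch, assuming the responses are already ordered by their covariate values (the additive case cycles such one-dimensional fits over components):

```python
def isotonic_lse(y):
    # Pool-adjacent-violators: least squares fit of a monotone
    # increasing sequence to y. Each block stores [sum, count];
    # adjacent blocks are merged while their means violate monotonicity.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)  # each block is fitted by its mean
    return fit
```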
Several of the other assumptions of theorem 1 can be weakened at the expense of lengthening the proof still further. First, in assumption 2, we can instead assume that the support supp(X) of the covariates is a convex subset of R^d with positive Lebesgue measure. In that case, it can be concluded that the SCMLE f̂_n converges uniformly to f_0 almost surely on any compact subset contained in the interior of supp(X). In fact, with some minor modifications, our proof can also be generalized to situations where some components of X are discrete. Second, consistency under a weaker L_1-norm on [−a_0, a_0]^d can be established without assumption 4. In addition, instead of assuming a single dispersion parameter φ_0 as done here, we can take φ_ni = φ_0 / w_ni for i = 1, ..., n, where the w_ni are known, positive weights (this is frequently needed in practice in the binomial setting). In that case, the new partial log-likelihood can be viewed as a weighted version of the original partial log-likelihood. Consistency of the SCMLE can be established provided that lim inf_{n→∞} (min_i w_ni) / (max_i w_ni) > 0.
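For the canonical Poisson/log-link case, the weighted partial log-likelihood just described can be sketched as follows, up to additive terms not depending on f and normalized by n here; the function name is ours.

```python
import math

def weighted_poisson_loglik(f_vals, y, w):
    # Weighted partial log-likelihood for the Poisson model with
    # log-link: theta_i = f(X_i) is the canonical parameter and
    # b(theta) = exp(theta). Taking all w_i = 1 recovers the
    # unweighted criterion.
    n = len(y)
    return sum(wi * (yi * fi - math.exp(fi))
               for fi, yi, wi in zip(f_vals, y, w)) / n
```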
In another direction, our work could potentially be extended to non-canonical link functions, though this would require further technical conditions. For example, proposition 1 does not necessarily hold for every monotone link function g, i.e. the SCMLE f̂_n may not be unique at X_1, ..., X_n. See Table 1 of Wedderburn (1976) for a summary of the uniqueness of the maximum likelihood estimator for various link functions and EF distributions. Moreover, a different algorithm would be needed here because the objective function is not necessarily concave for a general link function.
There remain several outstanding theoretical and methodological challenges. On the theory side, it is of great interest to understand both the local and the global rates of convergence of the least squares and maximum likelihood estimators in shape-constrained additive models. The only prior works in this direction with which we are familiar are Mammen and Yu (2007), Cheng (2009) and Yu (2014) on local rates. Mammen and Yu (2007) studied the simplest setting of least squares estimators f̂_1, ..., f̂_d based on independent observations from the monotone regression model Y_i = f_1(X_i1) + ... + f_d(X_id) + ε_i, for i = 1, ..., n, where ε_i is assumed to have a subexponential distribution. Under regularity conditions, including an assumption that the design points are supported on [0, 1]^d and that each f_j is strictly increasing with a bounded derivative on [0, 1], they showed that n^{1/3} {f̂_1(x_1) − f_1(x_1)} has a non-degenerate limiting distribution at all points x_1 ∈ (0, 1) with f_1′(x_1) > 0. This shows that, under these conditions, the least squares estimator evades the curse of dimensionality that one typically observes in multivariate shape-constrained regression problems, attaining the n^{−1/3} rate that is optimal for univariate monotone regression. It is natural to conjecture that, under appropriate conditions, the least squares estimator would also achieve the optimal rate of O_p(n^{−2/5}) for convex or concave components. Cheng (2009) and Yu (2014) extended the work of Mammen and Yu (2007) to a setting where the expected response is an additive function that can be decomposed as a sum of linear and monotone components. The behaviour of our estimators under model misspecification is another intriguing topic that warrants further research.
On the methodological side, in addition to studying other distributional families and shape constraints, it would be desirable to extend our methodology to handle settings with large numbers of covariates under an assumption that most of these covariates have a negligible effect on the response. In this direction, Fang and Meinshausen (2012) studied the special case of a high dimensional additive model with isotonic restrictions on the additive components. We suggest that the additive structure is an attractive way of pushing ideas of shape-constrained estimation into high dimensional settings.