An adaptive estimation of dimension reduction space
Abstract
Summary. Searching for an effective dimension reduction space is an important problem in regression, especially for high dimensional data. We propose an adaptive approach based on semiparametric models, which we call the (conditional) minimum average variance estimation (MAVE) method, within quite a general setting. The MAVE method has the following advantages. Most existing methods must undersmooth the nonparametric link function estimator to achieve a faster rate of consistency for the estimator of the parameters (than for that of the nonparametric function). In contrast, a faster consistency rate can be achieved by the MAVE method even without undersmoothing the nonparametric link function estimator. The MAVE method is applicable to a wide range of models, with fewer restrictions on the distribution of the covariates, to the extent that even time series can be included. Because of the faster rate of consistency for the parameter estimators, it is possible for us to estimate the dimension of the space consistently. The relationship of the MAVE method with other methods is also investigated. In particular, a simple outer product gradient estimator is proposed as an initial estimator. In addition to theoretical results, we demonstrate the efficacy of the MAVE method for high dimensional data sets through simulation. Two real data sets are analysed by using the MAVE approach.
1. Introduction
Let y and X be respectively ℝ‐valued and ℝp‐valued random variables. Without prior knowledge about the relationship between y and X, the regression function g(x)=E(y|X=x) is often modelled in a flexible nonparametric fashion. When the dimension of X is high, much recent effort has been expended on finding the relationship between y and X efficiently. The final goal is to approximate g(x) by a function having a simplifying structure which makes estimation and interpretation possible even for moderate sample sizes. There are essentially two approaches: the first is largely concerned with function approximation and the second with dimension reduction. Examples of the former are the additive model approach of Hastie and Tibshirani (1986) and the projection pursuit regression proposed by Friedman and Stuetzle (1981); both assume that the regression function is a sum of univariate smooth functions. Examples of the latter are the dimension reduction of Li (1991) and the regression graphics of Cook (1998).
We assume a dimension reduction model of the form

y = g(B0TX) + ɛ,   (1.1)

where g is an unknown smooth link function, B0=(β1,…,βD) is a p×D orthogonal matrix (B0TB0=ID) with D<p and E(ɛ|X)=0 almost surely. The column vectors β1,…,βD are called the effective dimension reduction (EDR) directions and the space spanned by them the EDR space.
A simple approach that is directly related to the estimation of EDR directions is the average derivative estimation (ADE) proposed by Härdle and Stoker (1989). For the single‐index model y=g1(β1TX)+ɛ, the expectation of the gradient ∇g1(X) is a scalar multiple of β1, so a nonparametric estimator of ∇g1(X) leads to an estimator of β1 (a sketch of this gradient-averaging step follows the list below). There are several limitations of ADE.
- (a)
To estimate β1, the condition E{g1′(β1TX)}≠0 is needed. This condition is violated when g1(⋅) is an even function and X is symmetrically distributed.
- (b)
As far as we know, there is no successful extension to the case of more than one EDR direction.
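To fix ideas, here is a minimal sketch of the gradient-averaging step behind ADE (our own illustration, not the authors' code; the Gaussian kernel, the bandwidth default and the function names are assumptions): the gradient of the regression function is estimated at each sample point by a weighted local linear fit, and the normalized average of these gradients estimates β1 up to sign, provided that E{g1′(β1TX)}≠0.

```python
import numpy as np

def local_gradients(X, y, h):
    """Estimate the gradient of E(y|X=x) at every sample point by a
    weighted local linear regression with a Gaussian kernel (bandwidth h)."""
    n, p = X.shape
    grads = np.empty((n, p))
    for j in range(n):
        d = X - X[j]                             # rows are X_i - X_j
        w = np.exp(-0.5 * np.sum((d / h) ** 2, axis=1))
        Z = np.hstack([np.ones((n, 1)), d])      # intercept + linear terms
        WZ = Z * w[:, None]
        coef, *_ = np.linalg.lstsq(WZ.T @ Z, WZ.T @ y, rcond=None)
        grads[j] = coef[1:]                      # slope part = gradient estimate
    return grads

def ade_direction(X, y, h=0.5):
    """Average derivative estimate of the single-index direction beta_1."""
    avg = local_gradients(X, y, h).mean(axis=0)  # estimates E{grad g1(X)}
    return avg / np.linalg.norm(avg)             # normalized; sign is arbitrary
```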
The sliced inverse regression (SIR) method of Li (1991) instead requires a probabilistic condition on the covariates, namely that, for any b∈ℝp,

E(bTX | β1TX, …, βDTX) is linear in β1TX, …, βDTX.   (1.2)

As pointed out by Cook and Weisberg in their discussion of Li (1991), the most important family of distributions satisfying condition (1.2) is that of elliptically symmetric distributions. Now, in time series analysis we typically set X=(yt−1,…,yt−p)T, where {yt} is a time series. Then it is easy to prove that elliptical symmetry of X for all p together with (second‐order) stationarity of {yt} implies that {yt} is time reversible, a feature which is the exception rather than the rule in time series analysis. (For a discussion of time reversibility, see, for example, Tong (1990).)
Another aspect of searching for the EDR space is the determination of the corresponding dimension. The method proposed by Li (1991) can be applied to determine the dimension of the EDR space in some cases but for reasons mentioned above it is typically not relevant for time series data.
In this paper, we shall propose a new method to estimate the EDR directions. We call it the (conditional) minimum average variance estimation (MAVE) method. Our approach is inspired by the SIR method, the ADE method and the idea of local linear smoothers (see, for example, Fan and Gijbels (1996)). It is easy to implement and needs no strong assumptions on the probabilistic structure of X. Specifically, our methods apply to model (1.1) including its generalization within the additive noise set‐up. The joint density function of the covariate X is needed if we search for the EDR space globally. However, if we have some prior information about the EDR directions and we look for them locally, then existence of the density of X in the directions around the EDR directions will suffice. These cases include those in which some of the covariates are categorical or functionally related. The observations need not be independent; time series data, for example, are allowed. On the basis of the properties of the MAVE method, we shall propose a method to estimate the dimension of the EDR space, which again does not require strong assumptions on the design X and has wide applicability.
We shall also consider a more general semiparametric multi-index extension of model (1.1), model (1.3), in Section 3.3. The rest of this paper is organized as follows. Section 2 describes the MAVE procedure and gives some results. Section 3 discusses some comparisons with existing methods and proposes a simple average outer product of gradients (OPG) estimation method and an inverse MAVE method. To check the feasibility of our approach, we have conducted many simulations, typical ones of which are reported in Section 4. In Section 5 we study the circulatory and respiratory data of Hong Kong and the hitters' salary data of the USA using the MAVE methodology. In practice, we standardize our observations. Appendix A establishes the efficiency of the algorithm proposed. Some of our theoretical proofs are very lengthy and are not included here; they are available on request from the authors. Finally, the programs are available at http://www.blackwellpublishers.co.uk/rss/
2. Estimation of effective dimension reduction space
2.1. The estimation of effective dimension reduction directions
Let us denote the working dimension by d with 1 ≤ d ≤ p. Because the EDR directions are orthogonal to one another, we need to estimate only a set of orthogonal vectors. There are many related methods for this and similar purposes. Most of the existing methods adopt two separate cost functions: the first is used to estimate the link function and the second to estimate the directions on the basis of the estimated link function. See, for example, Hall (1989), Härdle and Stoker (1989) and Carroll et al. (1997). It is therefore not surprising that the performance of the direction estimator suffers from the bias problem in nonparametric estimation. Härdle et al. (1993) noticed this and overcame the problem for a single‐index model by minimizing a cross‐validation‐type sum of squares of the residuals simultaneously with respect to the bandwidth and the directions. However, the cross‐validation‐type sum of squares of residuals affects the performance of estimation; see Xia et al. (1999). Moreover, the minimization is not trivial: Härdle et al. (1993) used the grid search method in their simulations, which is quite inefficient when the dimension is high.
Suppose that model (1.1) holds and let B=(β1,…,βd) be a p×d matrix with BTB=Id. Define the conditional variance

σB2(BTX) = E[{y − E(y|BTX)}2 | BTX].   (2.1)

Because

E{y − E(y|BTX)}2 = E{σB2(BTX)},   (2.2)

the true B0 solves the population problem of minimizing E{σB2(BTX)} over B with BTB=Id; we call this the (conditional) minimum average variance estimation. To construct a sample version, note that, for Xi close to a point X0, a local linear expansion gives

E(y|BTXi) ≈ a + bTBT(Xi−X0),   (2.3)

so that the conditional variance at X0 is approximately

σB2(BTX0) ≈ E([y − {a + bTBT(X−X0)}]2 | BTX0).   (2.4)

Given observations {(Xi,yi), i=1,…,n}, the local estimator of σB2(BTX0) is

σ̂B2(BTX0) = min over (a,b) of Σi=1n [yi − {a + bTBT(Xi−X0)}]2 wi0,   (2.5)

where the wi0 are non-negative local weights with Σi=1n wi0=1, and the sample analogue of E{σB2(BTX)} is the average

n−1 Σj=1n σ̂B2(BTXj).   (2.6)

On the basis of expressions (2.1), (2.3) and (2.6), we can estimate the EDR directions by solving the minimization problem

min over B with BTB=Id, and over (aj,bj), j=1,…,n, of Σj=1n Σi=1n [yi − {aj + bjTBT(Xi−Xj)}]2 wij.   (2.7)

If the weights depend on B, the implementation of the minimization in problem (2.7) is non‐trivial. The weight wi0 in approximation (2.5) should be chosen such that the value of wi0 is a function of the distance between Xi and X0. Next, we give two choices of wi0.
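As a concrete illustration of problem (2.7) under these definitions, the following sketch (our own; the Gaussian product kernel, the bandwidth and the function name are assumptions) evaluates the sample average conditional variance for a candidate B by solving the inner weighted least squares fit at each centre Xj; the MAVE estimator minimizes this quantity over orthogonal B. In practice the minimization over B is carried out by the algorithm of Section 2.3 rather than by generic optimization.

```python
import numpy as np

def mave_objective(B, X, y, h):
    """Sample version of E{sigma_B^2}: for each centre X_j, fit the local
    linear model y_i ~ a_j + b_j' B'(X_i - X_j) by weighted least squares
    and accumulate the weighted squared residuals (problem (2.7))."""
    n, p = X.shape
    total = 0.0
    for j in range(n):
        d = X - X[j]                             # rows are X_i - X_j
        w = np.exp(-0.5 * np.sum((d / h) ** 2, axis=1))
        w /= w.sum()                             # weights sum to 1 at each centre
        Z = np.hstack([np.ones((n, 1)), d @ B])  # regress on B'(X_i - X_j)
        WZ = Z * w[:, None]
        theta, *_ = np.linalg.lstsq(WZ.T @ Z, WZ.T @ y, rcond=None)
        resid = y - Z @ theta
        total += np.sum(w * resid ** 2)
    return total / n
```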
2.1.1. Multidimensional kernel weight

Let Kh(⋅)=h−pK(⋅/h), where K(⋅) is a p‐dimensional kernel function and h is a bandwidth. The multidimensional kernel weight is

wi0 = Kh(Xi−X0) / Σl=1n Kh(Xl−X0).

This kind of weight can be used as an initial step of estimation. Given d, we obtain a set of directions B̂=(β̂1,…,β̂d) via the minimization in problem (2.7). Let 𝒮(B̂) denote the subspace spanned by the column vectors of B̂. The distance between the space 𝒮(B0), the space spanned by the column vectors of B0, and the space 𝒮(B̂) can be measured by ‖(I−B0B0T)B̂‖ if d<D and by ‖(I−B̂B̂T)B0‖ if d≥D. Here and later, obvious augmentations by zero vectors are understood and the distance is denoted by m(B̂,B0).

Theorem 1 (stated under conditions 1–6 in Appendix A) establishes the consistency of B̂ and gives its rate of convergence. Provided that the dimension is chosen correctly, the rate of consistency for B̂ is OP{hopt3 log(n)} if we use the optimal bandwidth hopt of the regression function estimation in the sense of minimizing the mean integrated squared errors. This is faster than the rate that is achieved by the other methods, which is OP(hopt2). Note that the consistency rate for the local linear estimator of the link function is also OP(hopt2). The faster rate is due to minimizing the average (conditional) variance with respect to both the directions and the local linearization of the link function. Moreover, if we extend the idea to higher order local polynomial smoothers, root n consistency for the estimator of B0 can be achieved; see the discussion in Section 6.
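The subspace distance is straightforward to compute; the small utility below is our own sketch, taking ‖·‖ to be the Frobenius norm (one common choice) and assuming orthonormal columns.

```python
import numpy as np

def edr_distance(B_hat, B0):
    """Distance m between the spaces spanned by the columns of B_hat and B0:
    ||(I - B0 B0')B_hat|| if d < D and ||(I - B_hat B_hat')B0|| if d >= D,
    with orthonormal columns assumed and the Frobenius norm used."""
    d, D = B_hat.shape[1], B0.shape[1]
    I = np.eye(B_hat.shape[0])
    if d < D:
        return np.linalg.norm((I - B0 @ B0.T) @ B_hat)
    return np.linalg.norm((I - B_hat @ B_hat.T) @ B0)
```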
2.1.2. Refined kernel weight

Suppose that B̂ is the latest estimator of B0 (initially, the estimator obtained with the multidimensional kernel weight). Let

ŵij = Kh{B̂T(Xi−Xj)} / Σl=1n Kh{B̂T(Xl−Xj)}.   (2.8)

Re‐estimate B0 by the minimization in problem (2.7) with the weights ŵij replacing wij. By an abuse of notation, we denote the new estimator of B0 by B̂ also. Replace B̂ in equation (2.8) by the latest B̂ and estimate B0 again. Repeat this procedure until B̂ converges; we call the limit the refined MAVE (RMAVE) estimator. Results similar to those of theorem 1 can be obtained. We here use a lower dimensional kernel, and the bandwidth now is smaller than that used in the multidimensional wij, leading to a faster rate of consistency.
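A minimal sketch of the refined weights in equation (2.8) (ours; the Gaussian kernel on the reduced variables B̂TX is an assumption):

```python
import numpy as np

def refined_weights(B_hat, X, h):
    """Refined kernel weights (2.8): w_ij = K_h{B'(X_i - X_j)}, normalized
    over i, computed on the current d-dimensional reduced variables."""
    Z = X @ B_hat                                  # (n, d) reduced variables
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * d2 / h ** 2)                 # K[i, j] = K_h{B'(X_i - X_j)}
    return K / K.sum(axis=0, keepdims=True)        # column j sums to 1
```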

2.2. Dimension of effective dimension reduction space
Methods have been proposed for the determination of the number of the EDR directions. See, for example, Li (1992), Schott (1994) and Cook (1998). Their approaches tend to be based on probabilistic assumptions on the covariates X similar to those imposed by SIR. We now propose an alternative approach within our set‐up. It is well known that a cross‐validation approach penalizes the complexity of the model. See, for example, Stone (1974). We now extend the cross‐validation method of Cheng and Tong (1992) and Yao and Tong (1994) to solve the above problem. A similar extension may be effected by using the approach of Auestad and Tjøstheim (1990), which is asymptotically equivalent to the cross‐validation method.
We replace the unknown directions by their estimates β̂1,…,β̂d and write B̂d=(β̂1,…,β̂d). As we have proved that the rate of consistency of the β̂s is faster than that of the nonparametric link function estimators, the replacement is justified. Let

âd0,j = Σi≠j yi Khd{B̂dT(Xi−Xj)} / Σi≠j Khd{B̂dT(Xi−Xj)}

be the leave-one-out Nadaraya–Watson estimator at B̂dTXj. Here, we use the suffix d to highlight the fact that the bandwidth depends on the working dimension d. Let

CV(d) = n−1 Σj=1n (yj − âd0,j)2,   d=1,…,p.

To allow for the possibility that y and X are independent, we further define

CV(0) = n−1 Σj=1n (yj − ȳ)2,

where ȳ is the sample mean of the yj, and we estimate the dimension of the EDR space as the d, 0≤d≤p, minimizing CV(d).
If X is not bounded, we may consider only a compact domain over which the density is positive. Then we have a small probability of overestimating the dimension (Cheng and Tong, 1992; Yao and Tong, 1994). Note that âd0,j is the Nadaraya–Watson estimator. We can alternatively use the local linear estimator for âd0,j, which also leads to a consistent estimator of the dimension. However, the local linear estimator involves more complicated computation. Moreover, as far as cross‐validatory determination of the dimension is concerned, our experience shows that using the local linear estimator tends to lead to a poorer performance in comparison with using the Nadaraya–Watson estimator. Empirical evidence suggests that using the latter tends to incur a smaller bandwidth and to lead to a heavier penalty for overfitting.
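A sketch of the cross-validatory choice of dimension (ours; the Gaussian kernel is an assumption, and CV(0) is taken to be the variance of y about its sample mean, as above):

```python
import numpy as np

def cv_criterion(B_hat, X, y, h):
    """CV(d): leave-one-out Nadaraya-Watson prediction error of y on the
    reduced variables B_hat'X; CV(0) is the error of the sample mean."""
    if B_hat is None:                          # d = 0: no directions
        return np.mean((y - y.mean()) ** 2)
    Z = X @ B_hat
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    K = np.exp(-0.5 * d2 / h ** 2)
    np.fill_diagonal(K, 0.0)                   # leave one out
    y_hat = (K @ y) / K.sum(axis=1)
    return np.mean((y - y_hat) ** 2)

# The estimated dimension is the d (0 <= d <= p) minimizing cv_criterion,
# with the directions and bandwidth refitted at each working dimension d.
```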
2.3. Bandwidth and algorithm

Our search procedure is as follows.
- Step 1 (directions): for each d, 1≤d≤p, we search for the d directions as follows.
  - (a) Initial value: use the multidimensional kernel weight to obtain an initial estimate B̂ of possible EDR directions by minimizing problem (2.7).
  - (b) Refined estimation: let B̂ constitute the latest estimator of B. We then obtain refined kernel weights by using equation (2.8) and refine the estimator via expression (2.7) using the refined kernel weights. Continue this procedure until convergence. The CV(d) values can be obtained by using the final estimators of the directions.
- Step 2 (dimension and output results): compare the CV(d), 0≤d≤p. The d with the smallest CV(d) value is the estimated dimension. The corresponding estimator of B in step 1(b) gives the estimated EDR directions.

Let B̂ and B̃ be the estimators of B in two adjacent iterations in step 1(b). A suggested stopping rule for step 1(b) is that the distances m(B̂,B̃) in several adjacent iterations are each less than a pre‐set tolerance. Next, we describe one method to implement the minimization in problem (2.7). For any d, let B=(β1,…,βd) be the initial value (set β1=β2=…=βd=0 in step 1(a)) and write Bl,k=(β1,…,βk−1) and Br,k=(βk+1,…,βd), k=1,2,…,d. For each k in turn, minimize the sum of weighted squared residuals in problem (2.7) over (aj,bj), j=1,…,n, and over βk subject to the constraints Bl,kTβk=0 and Br,kTβk=0. This constrained quadratic problem has a closed-form solution expressible in terms of a Moore–Penrose inverse, where A+ denotes the Moore–Penrose inverse of a matrix A and λ is the usual Lagrangian multiplier for the constrained minimization. Finally, we normalize βk.
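The sketch below is one convenient rendering of the alternating minimization for problem (2.7), not the authors' exact direction-by-direction scheme: with the weights W held fixed, it alternates the weighted local linear fits for (aj, bj) given B with the closed-form weighted least squares solution for vec(B) given (aj, bj), and then re-orthonormalizes B by a QR factorization in place of the Lagrangian normalization.

```python
import numpy as np

def mave_step(B, X, y, W):
    """One alternating-minimization pass for problem (2.7) with fixed
    weights W (W[i, j] = w_ij): update (a_j, b_j) given B, then update B
    given (a_j, b_j), and re-orthonormalize the columns of B."""
    n, p = X.shape
    d = B.shape[1]
    A = np.empty(n)
    Bj = np.empty((n, d))
    for j in range(n):                          # local linear fit at each X_j
        D = X - X[j]
        Z = np.hstack([np.ones((n, 1)), D @ B])
        WZ = Z * W[:, j][:, None]
        theta, *_ = np.linalg.lstsq(WZ.T @ Z, WZ.T @ y, rcond=None)
        A[j], Bj[j] = theta[0], theta[1:]
    # Quadratic solve for vec(B): the fitted term is (outer(X_i - X_j, b_j)
    # flattened) dotted with vec(B), so B has a weighted least squares update.
    G = np.zeros((p * d, p * d))
    g = np.zeros(p * d)
    for j in range(n):
        D = X - X[j]
        R = np.einsum('ik,l->ikl', D, Bj[j]).reshape(n, p * d)
        wR = R * W[:, j][:, None]
        G += wR.T @ R
        g += wR.T @ (y - A[j])
    vecB = np.linalg.lstsq(G, g, rcond=None)[0]
    Q, _ = np.linalg.qr(vecB.reshape(p, d))     # orthonormalize: B'B = I_d
    return Q
```

Iterating mave_step to convergence and then recomputing the weights from the current estimate, as in equation (2.8), gives an RMAVE-style procedure.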
3. Links with other methods and generalization
3.1. Outer product of gradients estimation
Consider the more general model

y = g(X) + ɛ,

with E(ɛ|X)=0 almost surely. Consider the minimization in problem (2.6). Under assumptions 1–6 (in Appendix A), and with weights that do not depend on B, the leading term of the average conditional variance does not depend on B. Thus, the minimization problem (2.7) depends mainly on

E[∇g(X)T{I−BBT}∇g(X)].

Therefore, the B which minimizes this quantity consists of the first d eigenvectors corresponding to the d largest eigenvalues of

Σ = E{∇g(X)∇g(X)T},

which is the average OPG of g(⋅).

Lemma 1. Suppose that g(⋅) is differentiable. If model (1.1) is true, then B0 is in the space spanned by the first D eigenvectors of Σ corresponding to the largest D eigenvalues.

Lemma 1 suggests a simple estimation method: estimate the gradients by local linear fitting. At each Xj, solve

min over (aj,bj) of Σi=1n [yi − {aj + bjT(Xi−Xj)}]2 wij,   (3.1)

and estimate Σ by

Σ̂ = n−1 Σj=1n b̂j b̂jT,

where b̂j is the minimizer from expression (3.1). Finally, we estimate the EDR directions by the first d eigenvectors of Σ̂. We call this method the method of OPG estimation.

Let β̂1,…,β̂d be the first d eigenvectors of Σ̂ corresponding to the largest d eigenvalues, and let B̂=(β̂1,…,β̂d). Suppose that conditions 1–6 (in Appendix A) hold and model (1.1) is true. If nhp/log(n)→∞ and h→0, then B̂ is a consistent estimator of a basis of the space spanned by the first d eigenvectors of Σ.
Unlike the ADE method, the OPG method still works even if E{∇g(X)}=0. Moreover, the OPG method can handle multiple EDR directions simultaneously, whereas the ADE method can handle only the first EDR direction (i.e. the single‐index model). We can further refine the OPG estimator by using refined weights as in the RMAVE method. Compared with the MAVE method, the OPG method still suffers from the effect of the bias term in nonparametric function estimation; therefore, its rate of consistency is slower than that of the MAVE method when the dimension is chosen correctly. However, the OPG method is easy to implement and can be used to provide initial values for other estimation methods. Li (1992) proposed the principal Hessian directions (PHD) method by estimating the Hessian matrix of g(⋅); similarly to the OPG method, the directions are taken to be eigenvectors, here of the Hessian matrix. For a normally distributed design X, the Hessian matrix can be estimated simply by using Stein's lemma. However, the PHD method assumes a probabilistic structure on the design X which is frequently violated in time series analysis. More fundamentally, the PHD method involves estimators of second derivatives whereas the OPG method involves only first derivatives, which are considerably simpler and easier to estimate.
3.2. Inverse regression minimum average (conditional) variance estimation
In the same spirit, we may exchange the roles of X and y and consider kernel weights based on the response,

wij = Kh(yi−yj) / Σl=1n Kh(yl−yj).   (3.2)

To find a linear combination of X that is most predictable from y, we solve

min over β with ‖β‖=1, and over (aj,bj), j=1,…,n, of Σj=1n Σi=1n {βTXi − aj − bj(yi−yj)}2 wij.   (3.3)

To obtain the (k+1)th direction, we need to perform the same minimization with β constrained to be orthogonal to the k directions already obtained; call this problem (3.4). We call the estimation method based on minimizing expression (3.3) with wij as defined in equation (3.2) the inverse MAVE (IMAVE) method. The IMAVE method is in line with the most predictable variate (Hotelling, 1935). The minimizations in expressions (3.3) and (3.4) can be seen as looking for linear combinations of X that are most predictable from y. Under a similar assumption on X as in SIR, we have the following result.
Let β̂1,…,β̂d denote the resulting estimated directions. If h→0 and nh/log(n)→∞, then the β̂s are consistent estimators of directions in the EDR space.
This result is similar to that of Zhu and Fang (1996). As noted previously, the assumption on the design X can be a handicap as far as applications of the IMAVE method are concerned. Interestingly, simulations show that the SIR method and the IMAVE method can sometimes produce useful results in the case of independent data even when this assumption is mildly violated. However, for time series data, we find that this is often not so.
3.3. Semiparametric multi‐index models

Model (1.3) includes many models with a fixed dimension of EDR space. Examples are the single‐index model of Ichimura and Lee (1991), the generalized partially linear single‐index model of Carroll et al. (1997) and Xia et al. (1999) and the single‐index coefficient regression model of Xia and Li (1999). Here the estimation of the unknown function is also important. An obvious question is whether we can estimate both the function and the directions (multi‐indices) simultaneously, each at its optimal rate of consistency. This problem has attracted much attention. See, for example, Härdle et al. (1993), Severini and Wong (1992) and Carroll et al. (1997).
For most methods, the estimator of the direction suffers from the effect of the bias in the estimator of the unknown link function. Therefore, undersmoothing the estimator of the link function is necessary for the estimator of the direction to achieve its optimal rate of consistency, and we are not aware of any recommended method for selecting the undersmoothing bandwidth. By minimizing a cross‐validation‐type sum of squares of residuals simultaneously with respect to both the bandwidth and the direction, Härdle et al. (1993) gave a positive answer to the question raised in the previous paragraph. However, we have discussed the problems with this approach in Section 2. In contrast, the MAVE‐type methods can handle all the models mentioned above effectively. Specifically, when D′=1, the root n rate of consistency for the direction estimator can be obtained and, at the same time, the optimal rate of consistency for the nonparametric function estimator can be achieved.
3.4. Discrete or functionally related covariates
Generally, dimension reduction methods cannot be applied to models with discrete or functionally related covariates because the EDR directions are then not estimable, in the sense that there can be more than one dimension reduction space, even up to orthogonal transformations.
We believe that, provided that the link function can be approximated locally by `tangent' planes, the MAVE method can still be practically useful for discrete or functionally related covariates. The limiting accuracy will, of course, depend on the accuracy of the tangent plane approximation. We must keep in mind two points:
- (a)
the bandwidth cannot be selected to be smaller than a critical value because we must use adjacent points to estimate the `tangent' plane and
- (b)
if none of the X design points has repeated measurements then bandwidth selection methods based on cross‐validation may be considered. If the latter methods are ruled out, a feasible alternative may be one based on the idea of the nearest neighbours as follows: for any point xk, we choose a neighbourhood of xk which includes enough observations that the plane y=a+bTX is estimable, i.e. there is a unique solution (a,b) to the corresponding local least squares problem; cf. the nearest neighbour method due to Wong and Shen (unpublished) mentioned in Section 2. (A sketch of such a neighbourhood choice follows this list.)
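As mentioned in item (b), here is a hypothetical sketch of such a neighbourhood choice (all names and the rank test are ours): the neighbour set around xk is grown until the local design matrix has full column rank, so that the `tangent' plane y=a+bTX is estimable.

```python
import numpy as np

def estimable_neighbourhood(X, k0, x_k):
    """Grow a nearest-neighbour set around x_k until the local plane
    y = a + b'X is estimable, i.e. the local design matrix [1, X_i - x_k]
    has full column rank (so (a, b) is uniquely determined)."""
    n, p = X.shape
    order = np.argsort(np.sum((X - x_k) ** 2, axis=1))
    m = max(k0, p + 1)                          # need at least p + 1 points
    while m <= n:
        Z = np.hstack([np.ones((m, 1)), X[order[:m]] - x_k])
        if np.linalg.matrix_rank(Z) == p + 1:
            return order[:m]                    # indices of usable neighbours
        m += 1
    return order                                # fall back to all points
```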
If X includes continuous covariates as well as categorical or functionally related covariates, then the RMAVE method still applies with appropriate initial values. If we carry out a global search for the EDR directions, the procedure may, with positive probability, be trapped by spurious directions arising from the categorical data. If we have some prior information about the EDR directions such that we only need to search for the directions locally, then the density requirement can be relaxed, namely that the density function of BTX exists for all B∈ℬ={B:BTB=ID and ║B−B0║<c} for some c>0. Suppose further that E(XXT|BTX=v) and E(X|BTX=v) exist and have continuous second‐order derivatives. Then the RMAVE method in our paper applies with appropriate initial values in ℬ and with the search for the directions conducted within the same region.
4. Simulations
In this section, we carry out simulations to check the performance of the proposed OPG method and the MAVE‐type methods. We shall use the squared distance m2, where the distance m was defined in Section 2, to measure the estimation error when we compare our method with others.
4.1. Example 1
(4.1)

(4.2)

The sample size is set at n=200 or n=400 and 100 replications are drawn in each case. Let β1=(1,0,…,0)T, β2=(0,1,…,0)T and B0=(β1,β2). Fig. 1 shows the means of the estimation errors for β̂1 and β̂2; they are labelled `1' and `2' for β1 and β2 respectively. In our simulations, the IMAVE method outperforms the SIR method but is outperformed by the MAVE method. The RMAVE method performs best of all the methods. Zhu and Fang (1996) proposed a kernel smoothed version of the SIR method; however, it does not show a significant improvement over the original SIR method.

Fig. 1. Means of the estimation errors for β̂1 (labelled 1) and β̂2 (labelled 2) (broken curves are based on the MAVE method; full curves are based on the IMAVE method; wavy curves are based on the SIR method; bold curves are based on the RMAVE method; the horizontal axes give the numbers of slices or the bandwidth (in square brackets) for the SIR method or IMAVE method respectively): (a) model (4.1), sample size 200, bandwidths 1–3 (MAVE method) and 0.1–1 (RMAVE method); (b) model (4.1), sample size 400, bandwidths 1–2 (MAVE method) and 0.1–1 (RMAVE method); (c) model (4.2), sample size 200, bandwidths 1–3 (MAVE method) and 0.1–1 (RMAVE method); (d) model (4.2), sample size 400, bandwidths 1–2 (MAVE method) and 0.1–1 (RMAVE method)
4.2. Example 2
(4.3)

With sample sizes n=100, 200 and 400, 200 independent samples are drawn in each case. The average distance from the estimated EDR directions to 𝒮(B0) is calculated for the PHD method (Li, 1992), the OPG method, the MAVE method and the RMAVE method. The results are listed in Table 1 and suggest that the MAVE method performs better than the OPG method, which performs better than the PHD method, whereas the RMAVE method shows a significant improvement over the MAVE method. Our method for the estimation of the number of EDR directions also gives satisfactory results.
Table 1. Average distance from the estimated EDR directions to 𝒮(B0) for model (4.3) by using different methods

| n | Method | k=1 | k=2 | k=3 | k=4 | Frequencies of estimated numbers of EDR directions |
|---|---|---|---|---|---|---|
| 100 | PHD | 0.2769 | 0.2992 | 0.4544 | 0.5818 | f1=0, f2=10, f3=23, |
| | OPG | 0.1524 | 0.2438 | 0.3444 | 0.4886 | f4=78, f5=44, f6=32, |
| | MAVE | 0.1364 | 0.1870 | 0.2165 | 0.3395 | f7=11, f8=1, f9=1, |
| | RMAVE | 0.1137 | 0.1397 | 0.1848 | 0.3356 | f10=0 |
| 200 | PHD | 0.1684 | 0.1892 | 0.3917 | 0.6006 | f1=0, f2=0, f3=5, |
| | OPG | 0.0713 | 0.1013 | 0.1349 | 0.2604 | f4=121, f5=50, f6=16, |
| | MAVE | 0.0710 | 0.0810 | 0.0752 | 0.1093 | f7=8, f8=0, f9=0, |
| | RMAVE | 0.0469 | 0.0464 | 0.0437 | 0.0609 | f10=0 |
| 400 | PHD | 0.0961 | 0.1151 | 0.3559 | 0.6020 | f1=0, f2=0, f3=0, |
| | OPG | 0.0286 | 0.0388 | 0.0448 | 0.0565 | f4=188, f5=16, f6=6, |
| | MAVE | 0.0300 | 0.0344 | 0.0292 | 0.0303 | f7=0, f8=0, f9=0, |
| | RMAVE | 0.0170 | 0.0119 | 0.0116 | 0.0115 | f10=0 |
4.3. Example 3
(4.4)

Now, the simulation results summarized in Table 2 show that both the OPG method and the MAVE method have quite small estimation errors. As expected, the RMAVE method works better than the MAVE method, which outperforms the OPG method. The PHD method does not fare very well. The number of EDR directions is also estimated correctly most of the time.
Table 2. Average distance from the estimated EDR directions to 𝒮(B0) for model (4.4) by using different methods

| n | Method | k=1 | k=2 | k=3 | Frequency of estimated number of EDR directions |
|---|---|---|---|---|---|
| 100 | PHD | 0.1582 | 0.2742 | 0.3817 | f1=3, f2=73, |
| | OPG | 0.0427 | 0.1202 | 0.2803 | f3=94, f4=25, |
| | MAVE | 0.0295 | 0.1201 | 0.2924 | f5=4, f6=1 |
| | RMAVE | 0.0096 | 0.1712 | 0.2003 | |
| 200 | PHD | 0.1565 | 0.2656 | 0.3690 | f1=0, f2=34, |
| | OPG | 0.0117 | 0.0613 | 0.1170 | f3=160, f4=5, |
| | MAVE | 0.0059 | 0.0399 | 0.1209 | f5=1, f6=0 |
| | RMAVE | 0.0030 | 0.0224 | 0.0632 | |
| 300 | PHD | 0.1619 | 0.2681 | 0.3710 | f1=0, f2=11, |
| | OPG | 0.0076 | 0.0364 | 0.0809 | f3=185, f4=4, |
| | MAVE | 0.0040 | 0.0274 | 0.0666 | f5=0, f6=0 |
| | RMAVE | 0.0017 | 0.0106 | 0.0262 | |
5. Examples
5.1. Circulatory and respiratory problems in Hong Kong
Consider the effect of the levels of pollutants and weather on the total number yt of daily hospital admissions of patients suffering from circulatory and respiratory problems. The pollutant and weather data are the daily average levels of sulphur dioxide (x1t, μg m−3), nitrogen dioxide (x2t, μg m−3), respirable suspended particulates (x3t, μg m−3), ozone (x4t, μg m−3), temperature (x5t, °C) and relative humidity (x6t, %). The data were collected daily in Hong Kong from January 1st, 1994, to December 31st, 1995, and are shown in Fig. 2. The basic question is this: are the prevailing levels of the pollutants a cause for concern?

Fig. 2. (a) Total number of daily hospital admissions of circulatory and respiratory patients (——, time trend) and average levels of (b) sulphur dioxide, (c) nitrogen dioxide, (d) respirable suspended particulates, (e) ozone, (f) temperature and (g) humidity
We began with a linear regression fit, model (5.1). Note that the coefficients of x3t, x5t and x6t are not significantly different from 0 (at the 5% level of significance) by reference to their standard errors, and that the negative and significant coefficients of x1t and x4t are difficult to interpret. Refinements of this model are, of course, possible within the linear framework but are unlikely to throw much light on the opening question because, as we shall see, the situation is quite complex. Previous analyses, such as Fan and Zhang (1999) and Cai et al. (2000), have not included the weather effect. However, it turns out that the weather has an important role to play.
The daily admissions shown in Fig. 2(a) suggest non‐stationarity in the form of almost a level shift taking place in early 1995 although none of the covariates seems to show a similar level shift. Now, a trend was also observed by Smith et al. (1999) in their study of the effect of particulates on human health. They conjectured that the trend was due to the epidemic effect. In our case, we understand from our data provider that additional hospital beds were released to accommodate circulatory and respiratory patients in the course of his joint project. As a result, we estimate the time dependence by a simple kernel method and the result is shown in Fig. 2(a). Another factor is the day of the week effect, presumably due to the hospital booking system. The day of the week effect can be estimated by a simple regression method using dummy variables. To assess the effect of pollutants better, we remove these two factors first. By an abuse of notation, we shall continue to use yt to denote the `filtered' data, now shown in Fig. 3.
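The two filtering steps admit a compact sketch (ours; the Gaussian kernel for the trend, its bandwidth and the least squares day-of-the-week dummies are assumptions standing in for the authors' unspecified choices):

```python
import numpy as np

def remove_trend_and_weekday(y, bandwidth=30.0):
    """Filter a daily series: subtract a kernel-smoothed time trend, then
    subtract day-of-the-week means estimated by dummy-variable regression."""
    n = len(y)
    t = np.arange(n, dtype=float)
    K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    trend = (K @ y) / K.sum(axis=1)             # Nadaraya-Watson time trend
    resid = y - trend
    D = (t[:, None].astype(int) % 7 == np.arange(7)[None, :]).astype(float)
    beta, *_ = np.linalg.lstsq(D, resid, rcond=None)
    return resid - D @ beta                     # the 'filtered' admissions
```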

Fig. 3. `Filtered' number of daily hospital admissions of circulatory and respiratory patients after removing the time trend and the day-of-the-week effect

Now, using the RMAVE method with a cross‐validation bandwidth, we obtain the results in Table 3. The cross‐validation choice of the dimension is 3. The corresponding direction estimates are listed in Table 4.
Table 3. Cross-validation results for the filtered Hong Kong data

| Dimension | Bandwidth | CV(d) value |
|---|---|---|
| 1 | 0.10 | 0.33 |
| 2 | 0.13 | 0.28 |
| 3 | 0.16 | 0.27 |
| 4 | 0.20 | 0.29 |
| 5 | 0.21 | 0.29 |
| 6 | 0.24 | 0.31 |
| 7 | 0.24 | 0.34 |
| 8 | 0.29 | 0.31 |
| 9 | 0.31 | 0.34 |
| 10 | 0.31 | 0.37 |
Table 4. The estimated EDR directions β̂1, β̂2 and β̂3†
| Parameter | Lag 1 | Lag 2 | Lag 3 | Lag 4 | Lag 5 | Lag 6 | Lag 7 |
|---|---|---|---|---|---|---|---|
| x 1 | 0.0586 | − 0.0854 | 0.0472 | − 0.0152 | 0.1083 | − 0.0942 | 0.0734 |
| x 2 | 0.0876 | 0.0313 | − 0.1964 | 0.0893 | − 0.0867 | 0.0951 | − 0.1068 |
| x 3 | − 0.2038 | 0.1103 | 0.0153 | 0.0740 | − 0.0756 | 0.1283 | − 0.0520 |
| x 4 | 0.0155 | 0.0692 | 0.1622 | − 0.2624 | 0.1312 | 0.1342 | 0.0976 |
| x 5 | 0.5065 | − 0.4079 | 0.0743 | 0.0859 | − 0.3024 | − 0.1734 | − 0.0302 |
| x 6 | − 0.0294 | − 0.0610 | 0.0129 | − 0.0392 | − 0.0075 | 0.2850 | 0.0513 |
| x 1 | − 0.1525 | 0.0962 | − 0.1112 | 0.1170 | − 0.0388 | − 0.0605 | − 0.0326 |
| x 2 | − 0.0029 | 0.1614 | − 0.0955 | − 0.1160 | − 0.2185 | 0.0826 | 0.1696 |
| x 3 | − 0.0096 | − 0.1874 | 0.2422 | − 0.0047 | 0.3272 | − 0.2646 | − 0.0041 |
| x 4 | − 0.0013 | − 0.1162 | 0.0673 | 0.2113 | − 0.2193 | 0.1235 | − 0.1282 |
| x 5 | 0.1410 | 0.1193 | − 0.1425 | 0.1819 | − 0.2793 | − 0.0880 | − 0.0325 |
| x 6 | − 0.0345 | − 0.1479 | − 0.0400 | 0.4033 | 0.0474 | 0.0899 | 0.1336 |
| x 1 | 0.0701 | 0.0065 | − 0.0535 | − 0.1570 | − 0.0553 | − 0.0091 | − 0.0363 |
| x 2 | −0.0529 | 0.1360 | 0.0723 | 0.1045 | − 0.0045 | − 0.0200 | 0.0221 |
| x 3 | − 0.0121 | − 0.1189 | 0.0715 | − 0.0814 | 0.0112 | 0.0155 | 0.1214 |
| x 4 | 0.2215 | 0.0103 | − 0.3304 | 0.1028 | 0.0160 | − 0.1805 | 0.1341 |
| x 5 | 0.2909 | − 0.2372 | 0.0621 | − 0.0211 | 0.0950 | − 0.0954 | 0.2507 |
| x 6 | 0.2797 | − 0.1094 | − 0.3038 | 0.0452 | 0.1754 | − 0.3937 | 0.2597 |
- †The three successive blocks of six rows give β̂1, β̂2 and β̂3 respectively; entries in bold have relatively large absolute values.
Figs 4(a)–4(c) show yt plotted against the respective EDR directions. These plots and Table 4 suggest the following features.

Fig. 4. yt plotted against (a)–(c) the three estimated EDR directions for the full covariate set and (d)–(f) the three estimated EDR directions for the reduced covariate set: ——, polynomial regression to make trends more visualizable
- (a) Rapid temperature changes play an important role. (Note the dominant coefficients for temperature in the two most recent past days in β̂1.)
- (b) Of the pollutants, the most influential seems to be the particulates (note the large coefficient for particulates at lag 5 in β̂2) and the least influential seems to be sulphur dioxide.
- (c) The weather covariates are influential. (Note the many large coefficients for the weather covariates in all three of the β̂s.)
Comparing the levels of the individual pollutants in Hong Kong against the national ambient quality standard of the USA lends further support to feature (b).


On the basis of these observations, we select a reduced set of five covariates, namely x3,t−2, x4,t−6, vt, x5,t−4 and x6,t−2, where vt is a measure of recent temperature variation. The above proposed procedure then yields the results in Table 5.
Table 5. Cross-validation results for the reduced covariate set

| Dimension | Bandwidth | CV(d) value |
|---|---|---|
| 1 | 0.325 | 0.3593 |
| 2 | 0.325 | 0.3516 |
| 3 | 0.325 | 0.3435 |
| 4 | 0.325 | 0.3523 |
| 5 | 0.475 | 0.3450 |

Figs 4(d)–4(f) show yt plotted against the three directions. The `price' of using the reduced set with five covariates instead of the original set with 42 covariates is, loosely speaking, an increase in the percentage of unexplained variation from about 27% to about 34%. (As we use standardized observations, we may interpret the CV(d) value as a percentage of unexplained variation.) In return, we can gain further insight.
- (a)
The first EDR direction is −0.1317x3,t−2−0.0772x4,t−6+0.5256vt−0.8366x5,t−4−0.0235x6,t−2, with temperature and temperature variation being the two dominant components. Fig. 4(d) suggests that this direction sees practically only the mean level of the hospital admissions.
- (b)
The second EDR direction is 0.4809x3,t−2+0.3154x4,t−6−0.6414vt−0.5078x5,t−4+0.0018x6,t−2, which, together with Fig. 4(e), suggests that high levels of suspended particulates and/or high levels of ozone during cold spells tend to cause high admissions.
- (c)
The third EDR direction is 0.0101x3,t−2+0.3815x4,t−6+0.1345vt+0.0734x5,t−4−0.9115x6,t−2, which, together with Fig. 4(f), suggests that high ozone levels on extremely dry days tend to cause high admissions.
This analysis suggests that pollutants have reached such a level in Hong Kong that it only takes the weather to enter the right regime to exacerbate the circulatory and respiratory problems there.
5.2. Hitters' salary data
The hitters' salary data set has attracted much attention among statisticians. The data consist of times at bat (x1), hits (x2), home runs (x3), runs (x4), runs batted in (x5) and walks (x6) in 1986, years in the major leagues (x7), times at bat (x8), hits (x9), home runs (x10), runs (x11), runs batted in (x12) and walks (x13) during their entire career up to 1986, annual salary (y) in 1987, put‐outs (x14), assists (x15) and errors (x16). For ease of exposition, we abuse the notation and set y as the logarithm of annual salary in 1987, xj the standardized xj (j=1,…,16) and X the vector (x1,…,x16)T. The main interest is `why they make what they make', which was the main topic of a conference organized around these data by the American Statistical Association in 1988. More recent studies of these data include Chaudhuri et al. (1994) and Li et al. (2000). The latter suggested the existence of an `aging effect' on salary.

Now, applying the RMAVE method to the data set and using model (1.1), we estimate the dimension of the EDR space as 2. We plot y against the two EDR directions as shown in Fig. 5. It suggests that there are seven outliers, in general agreement with an observation made by Li et al. (2000). Next, applying the RMAVE method to the data with the outliers removed, we have the following results. Table 6 shows that the dimension estimate remains at 2 and Fig. 6 shows the plots of y against the estimated EDR directions. The similarity between the results before and after the removal of outliers suggests a high degree of robustness enjoyed by the RMAVE method.

Fig. 5. y plotted against (a) the first and (b) the second estimated EDR directions for the hitters' salary data: *, outlier
Table 6. Cross-validation results for the hitters' salary data with the outliers removed

| Dimension | Bandwidth | CV(d) value |
|---|---|---|
| 1 | 0.148 | 0.265 |
| 2 | 0.395 | 0.118 |
| 3 | 0.473 | 0.134 |
| 4 | 0.609 | 0.158 |
| 5 | 0.544 | 0.139 |
| 6 | 0.596 | 0.135 |
| 7 | 0.662 | 0.146 |
| 8 | 0.572 | 0.133 |
| 9 | 0.655 | 0.132 |
| 10 | 0.927 | 0.178 |

Fig. 6. y plotted against (a) the first and (b) the second estimated EDR directions for the hitters' salary data with the outliers removed
The EDR directions are given in the first pair of columns of Table 7. Note that, in the second direction, the negative coefficient (−0.23) of x7 lends some support to the aging effect on salary suggested by Li et al. (2000).
Table 7. Estimated pairs of EDR directions for the hitters' salary data†

| Outliers removed: direction 1 | Direction 2 | Right-hand regime: direction 1 | Direction 2 | Model (5.2): direction 1 | Direction 2 |
|---|---|---|---|---|---|
| − 0.25 (x1) | 0.08 (x1) | − 0.05 (x1) | 0.14 (x1) | − 0.01 (x1) | − 0.27 (x1) |
| 0.24 (x2) | 0.04 (x2) | 0.04 (x2) | − 0.20 (x2) | 0.15 (x2) | 0.17 (x2) |
| 0.09 (x3) | − 0.01 (x3) | − 0.03 (x3) | − 0.09 (x3) | 0.02 (x3) | 0.09 (x3) |
| 0.00 (x4) | 0.07 (x4) | 0.03 (x4) | 0.40 (x4) | 0.01 (x4) | − 0.04 (x4) |
| − 0.01 (x5) | − 0.04 (x5) | −0.03 (x5) | − 0.03 (x5) | −0.06 (x5) | 0.01 (x5) |
| 0.05 (x6) | 0.04 (x6) | − 0.09 (x6) | − 0.29 (x6) | 0.06 (x6) | 0.04 (x6) |
| 0.52 (x7) | − 0.23 (x7) | 0.18 (x7) | 0.02 (x7) | 0.01 (x7) | 0.51 (x7) |
| 0.55 (x8) | − 0.49 (x8) | 0.51 (x8) | − 0.57 (x8) | 0.03 (x8) | 0.74 (x8) |
| 0.37 (x9) | 0.75 (x9) | − 0.81 (x9) | − 0.26 (x9) | 0.90 (x9) | − 0.11 (x9) |
| 0.10 (x10) | 0.15 (x10) | 0.00 (x10) | − 0.08 (x10) | 0.24 (x10) | 0.03 (x10) |
| 0.23 (x11) | 0.12 (x11) | 0.08 (x11) | 0.27 (x11) | 0.14 (x11) | 0.16 (x11) |
| 0.08 (x12) | 0.17 (x12) | − 0.10 (x12) | 0.08 (x12) | 0.04 (x12) | − 0.13 (x12) |
| 0.30 (x13) | 0.22 (x13) | 0.07 (x13) | 0.43 (x13) | 0.27 (x13) | 0.08 (x13) |
| − 0.01 (x14) | 0.09 (x14) | − 0.06 (x14) | 0.09 (x14) | 0.08 (x14) | − 0.04 (x14) |
| 0.04 (x15) | − 0.03 (x15) | 0.03 (x15) | 0.12 (x15) | − 0.03 (x15) | 0.06 (x15) |
| 0.00 (x16) | 0.04 (x16) | − 0.05 (x16) | − 0.08 (x16) | 0.05 (x16) | − 0.02 (x16) |
- †Entries in bold have relatively large absolute values.

Fig. 6 suggests splitting the data into two regimes along the first direction. For the left‐hand regime, the measure of unexplained variation is 0.26 and R2=0.865. The threshold is set at −0.47. The right‐hand regime is much more volatile and we may return to the RMAVE method. The estimated dimension is still 2 and the estimated directions are given in the second pair of columns in Table 7. Let z1 and z2 denote the projections of X on these two directions. We may then fit to the right‐hand regime a polynomial regression in z1 and z2.
For this model, R2=0.714, and the overall measure of unexplained variation is 0.27. A simple calculation shows that the coefficient of x7 in the model for the right‐hand regime is again negative, with the implication mentioned previously. As a comparison with the regression tree results obtained by Li et al. (2000), we quote the corresponding figures for the classification and regression trees method with five bases, 0.33 for the multivariate adaptive regression splines method with 13 bases, 0.44 for the SUPPORT algorithm with two bases and 0.35 for the PHDRT algorithm. For our simple‐minded hybrid, the overall figure of 0.27 is achieved with two bases.
We may also fit a semiparametric model, model (5.2), in which y has a linear part in one index and a nonparametric part g in a second index. Doing so yields the direction estimates given in the third pair of columns of Table 7, with corresponding indices z1 and z2, and the estimate of the function g as shown in Fig. 7(b). (Because the density of the data along the second index is not so uniform, a variable bandwidth is used. See Fan and Gijbels (1996), page 152.) The dominant covariates in z1 are x2, x9, x10, x11 and x13, all with positive coefficients. Four out of these five covariates measure past performance and so we may interpret z1 as principally a measure of past performance. Fig. 7(a) shows that, along the z1‐axis, players with better past performance are paid better. Note also that the number of years in the major league (x7) only features in z2, and quite prominently so. The estimated g(z2) lends support to the existence of an aging effect, now with the salary peaking at around z2=−0.5.

Fig. 7. (a) Estimated regression surface of model (5.2) (•, observations) and (b) estimated regression function g (——) and estimate of the density function along the direction (⋯⋯) (•, residuals after removing the linear part in model (5.2))
6. Conclusions
Our theoretical analysis, simulations and real applications have led us to believe that the MAVE methodology has many attractive attributes. Unlike most existing methods for the estimation of the directions, the MAVE estimators of the directions have a faster rate of consistency than the corresponding estimators of the link function. On the basis of this faster rate of consistency, a consistent method for the determination of the number of EDR directions has been proposed. The MAVE method can easily be extended to more complicated models. It does not require strong assumptions on the design X or on the regression function, and it can be applied to both independent and dependent data.

As a by‐product, we have extended the ADE method of Härdle and Stoker (1989) to the case of more than one EDR direction, resulting in the OPG method. This method has wider applicability with respect to designs for X and regression functions. Our basic idea has also led to the IMAVE method, which is closely related to the SIR method and the most predictable problem of Hotelling (1935), but in our simulations IMAVE seems to enjoy a better performance than SIR. The refined kernel weights can further improve the accuracy of estimation of the directions; our simulations show that substantial improvements can be achieved.


Unlike the SIR method, the MAVE method is well adapted to time series; our experience suggests that the MAVE method is also robust against outliers. Furthermore, all our simulations show that the MAVE method has a much better performance than the SIR method (and OPG method). Although theorem 2 furnishes a partial explanation, we are still intrigued because SIR uses the one‐dimensional kernel (for the kernel version) whereas the MAVE method uses a multidimensional kernel. However, because the SIR method uses y to produce the kernel weight, its efficiency will suffer from fluctuations in the link function. The gain by using the y‐based one‐dimensional kernel does not seem to be sufficient to compensate for the loss in efficiency caused by these fluctuations, but further research is needed here.
Acknowledgements
We thank the Biotechnology and Biological Science Research Council and Engineering and Physical Sciences Research Council of the UK, the Research Grants Council of Hong Kong, the Committee on Research and Conference Grants of the University of Hong Kong, the Friends of London School of Economics (Hong Kong) and the Wellcome Trust for partial support. We are most grateful to two referees for constructive comments. We thank Professor Wing Hung Wong and Professor X. Shen for making available to us their unpublished work and Professor T. S. Lau for providing the Hong Kong data and some background information.
Appendix A
A.1. Assumptions and remarks


Condition 2:
- (a)
E|y|k < ∞ for all k>0;
- (b)
E║X║k< ∞ for all k>0.
Condition 3:
- (a)
the density function f of X has bounded fourth derivative and is bounded away from 0 in a neighbourhood 𝒟 around 0;
- (b)
the density function fy of y has bounded derivative and is bounded away from 0 on a compact support.
Condition 4: the generalized conditional densities pX|y(x|y) of X given y and p(X0,Xl)|(y0,yl) of (X0,Xl) given (y0,yl) are bounded for all l≥1.
Condition 5:
- (a)
g has bounded, continuous third derivatives;
- (b)
E(X|y) and E(XXT|y) have bounded, continuous third derivatives.
Condition 6: K(⋅) is a spherically symmetric density function with a bounded derivative. All the moments of K(⋅) exist.

A.2. The efficiency of the algorithm

This estimation procedure is very efficient in that, in theory, after two steps the estimate from our procedure can achieve the final consistency rate.
A similar result was discovered in a different context by Hannan (1969). Specifically, he developed an estimation procedure for the parameters of autoregressive moving average processes. Starting with arbitrary consistent estimators of the parameters, a modification by one step of a Newton–Raphson-type iteration can make the estimators asymptotically efficient. In the MAVE method, the first step is to find a consistent `initial' estimator; the second step is to modify the `initial' estimator, which can also make the estimate asymptotically efficient. In spite of the asymptotic efficiency, the iterative application of the procedure beyond the two steps was suggested by Hannan (1969) as a way of further improving the estimator. For the MAVE method, our simulations also suggest that further iterations are beneficial.
Discussion on the paper by Xia, Tong, Li and Zhu
J. T. Kent (University of Leeds)
The paper is an ambitious attempt to tackle high dimensional regression problems. There are connections to several areas of statistics, including multivariate analysis, nonparametric regression and linear regression. I would like to direct some comments to each area in turn.
Multivariate analysis
Consider the classical situation of k groups in discriminant analysis, where X given y=j has a p-variate normal distribution with mean μj, j=1,…,k, and a common covariance matrix, and let μ̄ denote the average of these mean values. Canonical variate analysis is a tool for improving the interpretability in this setting via dimension reduction. It is assumed that these means lie on a lower dimensional plane of dimension D, say, where D<min(k−1,p), i.e. we assume that the μj−μ̄ span a subspace of dimension D. Let B(p×D) be a matrix whose columns span this subspace and let C(p×(p−D)) be a complementary matrix so that (B,C) is non‐singular. Reversing the conditioning yields a logistic‐type regression model in which the conditional distribution of y given X depends on X only through BTX, so that y is conditionally independent of CTX given BTX.

Thus this model can be regarded as a discrete and parametric version of the authors' model (1.1). In passing, note that similar conditional independence statements form the building‐blocks of graphical models, except that in our setting B is unknown.
In the k‐groups model, the marginal distribution of X is a mixture of p‐variate normals. However, when attention is focused on the conditional distribution of y|X in the logistic‐type regression model, it is usual to allow more general possibilities for the marginal distribution of X. The k‐groups model can be viewed as a motivating example for the sliced inverse regression approach to nonparametric multiple regression, whereas the logistic‐type regression model better matches the tone of the current paper.
Nonparametric regression
A generalized additive model takes the form y=Σj=1Dgj(βjTX)+ɛ. The ridge terms gj(βjTX) can be viewed as `main effects' in the directions βj. In contrast, the more general model (1.1), y=g(B0TX)+ɛ, which forms the foundation of the paper, also allows `interaction terms'. However, I am concerned that there is a tendency in practice to interpret the columns of B0 as main effects and to ignore possible interactions. For example, consider the plots of y versus the first direction and y versus the second direction in Fig. 5. There are two related problems with these plots. First, any possible interactions are ignored; it might be better to represent the whole response surface. The second problem is that these two directions have no preferred status: it is possible to take any other basis of their column space without affecting the validity of the model.
Linear regression
Reduced rank models are also of interest in linear regression analysis. Of course the ordinary least squares regression model is a special case of model (1.1) with D=1 and g linear. However, when p is large, it is well known that the least squares estimator can be unstable, so attempts are often made to reduce the dimensionality of X. One class of methods involves variable selection. However, a class of methods that is more in keeping with the current paper involves the construction of new linear composite variables from X. One of the simplest such methods is principal components regression in which X is replaced by its first few dominant principal components. Unfortunately, this method is rather unsatisfactory since the dominant principal components depend just on the X‐variability and not on the relationship to y. A hybrid approach between ordinary least squares and principal components regression is partial least squares; see Stone and Brooks (1990) for a unified treatment. Of course these methods of dimension reduction (including variable selection methods as well) depend heavily on the covariance structure of X.
Are there any lessons from this methodology for this paper? In particular, what happens when there is very high correlation between the X‐variables or, more generally, when the X‐variables become nearly collinear? My concern is that the estimate of the column space of B will become unstable and that problem (2.7) might have multiple solutions.
I have found the paper tremendously stimulating, and it gives me great pleasure to propose the vote of thanks.
Adrian Bowman (University of Glasgow)
It is a great pleasure to add my thanks for this paper. I enjoyed both its reading and its presentation. Over the past few years there has been a considerable amount of work in the dimension reduction area. Regression used to be a topic which we thought we understood. Now we are not so sure. It is one of the merits of this paper that it brings together a variety of approaches in this area and synthesizes them into a simple but potentially powerful idea. Direct and simultaneous estimation of both the nonparametric and the directional components of the model brings some significant benefits. These include an avoidance of some of the usual difficulties with bias incurred by smoothing, a weakening of assumptions, the ability to handle the special but important case of time series, some impressively strong supporting asymptotics and evidence of good behaviour in numerical work. However, it is difficult to believe that these properties are not bought at some price and I would like to explore one or two aspects of where the costs may lie.
The first relevant feature is that, although the central idea is attractively simple, the implementation is necessarily more sophisticated. It involves a variety of steps. The first is smoothing in, possibly high dimensional, covariate space. Most people feel comfortable when applying smoothing in one, two or occasionally three dimensions. The authors have been courageous in going rather beyond that. In the hospital admissions data courage gives way to heroism by smoothing in 42 dimensions. Of course, the refinements introduced by the authors quickly reduce attention to the much smaller dimensional space defined by the current effective dimension reduction (EDR) directions where smoothing can be applied without difficulty. At the same time, there is a high dimensional minimization in operation to identify the EDR directions. Beyond this lies a cross‐validation step to compare the EDR dimensions. Finally, there is some mention in the paper of the possibility of using a data‐dependent bandwidth choice, although the authors wisely do not routinely incorporate this. The end result is a set of EDR directions which have been produced by a set of complex operations on the data. However, there is no difficulty in principle with that. Complex data may require complex methods of analysis and if the end result brings insight then it has been worthwhile.
On the question of insight, I would like to use the hospital admissions data as a means of raising some practical issues. The first concerns the robustness and sensitivity of the procedure. A scatterplot matrix reveals a variety of features in the covariates. One is the presence of substantial skewness. The sulphur dioxide variable is a good example of this and it includes in particular two very large observations. Since the sulphur dioxide, nitrogen dioxide and particulates covariates are all concentrations, it would be natural to take a log‐transformation of each. Ozone, although also skewed, contains observations at or close to zero and so it may be best left unaltered, along with temperature. Humidity is a percentage, with many observations at high values and so the logistic transformation would be natural here. The question is whether the broad qualitative conclusions of the analysis will remain unchanged when repeated using the variables on these, arguably more natural, scales. The assumptions of the model are weak but one can only feel that there will be greater stability if the variables exhibit approximately normal variation. A second issue arises from the scatterplot of log(nitrogen dioxide) and log(particulates) which shows a strong linear relationship between these two variables. This is exactly the situation assumed by the model. However, it then seems surprising that particulates feature strongly in the conclusions whereas nitrogen dioxide does not. This raises the question of whether the decisions being made by the procedure on the weights to assign to variables are ones which we shall always feel comfortable with.
An issue of the appropriateness of the model is raised by the scatterplot of nitrogen dioxide against temperature. This shows a clear non‐linear pattern which will be obscured by the linear combinations around which the model is built. Of course, a second dimension will, in this case, allow the full relationship between the covariates to be expressed. However, it would seem more appropriate to incorporate specific non‐linear relationships into the model in a more direct way, where these are appropriate.
Finally, some important issues arise under the heading of interpretation. The first derives from the fact that EDR delivers a subspace, not a co‐ordinate system. The same subspace can be represented by EDR directions which are rotated in different ways. This makes the interpretation of specific elements of the EDR direction vectors rather difficult. The nonparametric surface g has an unspecified shape, built from all EDR directions simultaneously. The marginal space may change radically as the EDR co‐ordinate system is rotated. An interpretation can therefore only be made from the entire collection of EDRs and this is not an easy task. In addition, if we simulate data where y is unrelated to x we are still likely to identify EDR directions of apparent meaning. This highlights the need for some statistical methods of model comparison, beyond CV(d), to ensure that the results of EDR can safely be attributed to meaningful structure rather than to noise.
When the authors have come so far, it may seem churlish to ask them to go yet further. However, I raise these issues in the hope that the authors will be able to devote their considerable powers to addressing them. To return to the original remarks, this is clearly a simple but potentially powerful idea which deserves to be considered carefully. I have great pleasure again in congratulating the authors on their paper and in warmly seconding the vote of thanks.
The vote of thanks was passed by acclamation.
Santiago Velilla (Universidad Carlos III de Madrid, Getafe)
In developing minimum average variance estimation (MAVE) the authors seem to have in mind a first‐order regression problem in which all the information that X carries on the response y is captured by the conditional expectation E(y|X). In this sense, the populational objective function (2.1) and its sample version (2.7) seem to be appropriate when the error ɛ in model (1.1) not only satisfies the condition E(ɛ|X)=0 but also var(ɛ|X)=σ2. If the conditional variance is not constant, expressions (2.1) and (2.7) should perhaps be modified accordingly.
In comparing the four new methods proposed in this interesting paper, I find that both the outer product of gradients method, in Section 3.1, and inverse MAVE, in Section 3.2, have a natural nested character. Once a decision has been taken on the value of the dimension of the effective dimension reduction space, directions are determined sequentially. In contrast, both MAVE and refined MAVE seem to require specific computation in each step d=1,2,…. Moreover, as indicated in the algorithm of Section 2.3, computation is required for all 1≤d≤p. In view of the pattern of Tables 3, 5 and 6 in the examples in Sections 5.1 and 5.2, where the change in the CV(d) value is `small' when spurious directions are considered, for `large' values of d the algorithm could be initialized using the results for d−1, making it `nested', i.e. looking only for the dth direction once the first d−1 directions have been determined. Of course, this is just a suggestion based on the pattern of the tables in the examples, but this simplified scheme for spurious values of d might save some computational time.
Finally, in connection with condition (1.2), in Velilla (1998), section 4.1, I proposed a method for generating regressors X satisfying condition (1.2) that are not necessarily elliptical. This method has been applied, for example, in Bura and Cook (2001a, b) for assessing by simulation the performance of some methods for testing for dimension.
Wenyang Zhang (University of Kent at Canterbury)
I have two comments to make on this interesting paper.
Shannon's entropy
A measure of uncertainty, Shannon's entropy, was introduced by Shannon (1948), which is extremely useful in communication theory. It also can be used to reduce dimension in regression to avoid the `curse of dimensionality'. For two random variables Y and Z with joint density fY,Z and marginal densities fY and fZ, the associated mutual information is

I(Y,Z) = E[log{fY,Z(Y,Z)/fY(Y)fZ(Z)}],

which is non-negative and is 0 if and only if Y and Z are independent.

Let Y be the response, X be the covariate with high dimension p and (Xi,Yi), i=1,…,n, be a sample from (X,Y). For any fixed β, the estimate Î(Y,βτX) of I(Y,βτX) can be obtained by standard density estimation; see Fan and Gijbels (1996). An alternative dimension reduction procedure is to maximize Î(Y,βτX) subject to ||β||=1, to find the maximizer β1 and maximum I1, then to maximize Î(Y,βτX) subject to βτβ1=0 and ||β||=1, to find the maximizer β2 and maximum I2, and to continue this exercise until Iq is less than a selected critical value c which may be obtained by cross-validation. (β1,…,βq) forms the efficient directions to reduce the dimension. It would be very interesting to compare this approach with that in the paper; a rough sketch is given below.
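A rough sketch of this proposal (ours, using Gaussian kernel density estimates from scipy; the orthogonality constraints and the cross-validated critical value are omitted, and a crude random search stands in for a proper constrained optimizer):

```python
import numpy as np
from scipy.stats import gaussian_kde

def mutual_information(u, v):
    """Estimate the mutual information I(u, v) for scalar samples u, v by
    kernel density estimation: average of log{f(u,v) / f(u) f(v)}."""
    joint = gaussian_kde(np.vstack([u, v]))
    fu, fv = gaussian_kde(u), gaussian_kde(v)
    return np.mean(np.log(joint(np.vstack([u, v])) / (fu(u) * fv(v))))

def best_direction(X, y, n_starts=50, seed=0):
    """Crude maximization of I(y, beta'X) over unit vectors beta by random
    search (a gradient-free stand-in for constrained optimization)."""
    rng = np.random.default_rng(seed)
    best, best_val = None, -np.inf
    for _ in range(n_starts):
        b = rng.standard_normal(X.shape[1])
        b /= np.linalg.norm(b)                  # enforce ||beta|| = 1
        val = mutual_information(y, X @ b)
        if val > best_val:
            best, best_val = b, val
    return best, best_val
```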
Curse of dimensionality

If the dimension p of X is very large, it would be impossible to obtain an initial B with small bias owing to the `curse of dimensionality'. My question is: does this bias matter in your procedure? If not, why could we not take the whole range of the Xi as the initial bandwidth?
Frank Critchley (The Open University, Milton Keynes)
In welcoming the faster rate of consistency and time series extensions afforded by the paper, I would like to make the following points in which Yx:=(Y|X=x) and ɛx:=(ɛ|X=x).
- (a)
I was somewhat surprised not to find fuller reference to the important body of work by Cook and co‐workers, surveyed to that date in Cook (1998). Among other attractive features, such as its graphical emphasis, this approach examines how the whole distribution of Yx—not just, as here, its mean g(x)—varies with x. Again, it exploits a conditional independence formulation throughout, that is both logically cogent and statistically intuitive. I would also like to draw attention to two forthcoming papers, available on the Annals of Statistics Web site and directly relevant to this paper: Cook and Li (2002), which addresses dimension reduction for g(x), and Chiaromonte et al. (2002), which overlaps with Section 3.4.
- (b)
There are two apparent significant errors of omission.
- (i)
In the sentence two after equation (1.1), a simple counter‐example can be constructed. The omission appears to be that model (1.1) should be augmented by the location regression requirement Y⊥⊥X|E(Y|X) (Cook (1998), page 111); a similar remark applies to model (1.3).
- (ii)
In the sentence including expression (2.1), additional conditions—such as constancy of var(ɛx) over x—apparently are required.
- (i)
- (c)
The benefits of this paper—including relaxation of condition (1.2) on X—come at the price of other non‐trivial restrictions to its applicability: in particular, to additive error models that are special cases of location regression and in which certain additional conditions hold.
- (d)
In unpublished preliminary discussions with Cook, it was noted that the conditional independence approach seems natural in a variety of time series contexts, autoregressive processes being obvious examples. This would seem a promising line of enquiry.
- (e)
In view of the quadratic nature of the criterion minimized, I was somewhat surprised by the robustness to outliers claim (Section 6) and would value further details.
- (f)
Concerning Section 2.1.2, under what conditions is convergence (to a unique solution) guaranteed?
Anthony Atkinson (London School of Economics and Political Science)
I congratulate the authors on an interesting paper which stimulated an excellent discussion. I have five points.
- (a)
John Kent placed the authors' proposal in the context of other dimension reduction methods, including partial least squares. This method is often used with p close to n. Is this likely to cause any problems? Partial least squares is also often used with p≫n, e.g. in the spectroscopic data set analysed again by Brown et al. (2001). Can the authors' method be extended to this important class of problems?
- (b)
The interpretation of results like those of Table 4 seems beset with difficulties, since the directions can be rotated in the D‐dimensional subspace. Basilevsky (1994), section 6.10, discussed the similar problem of rotation and interpretation in factor analysis.
- (c)
On pages 378–379 the data have the effects of two factors removed, so that the yt are indeed notationally abused, being residuals. The method of added variables (e.g. Atkinson and Riani (2000), section 2.2) indicates that the same regression should be performed on the explanatory variables as on the response, so that the analysis becomes one of residuals on residuals. Incidentally, this use of only one set of residuals is a frequent occurrence in time series analysis, where a series is `pre‐whitened', but the regressors left untouched.
- (d)
Some discussants have mentioned robustness. It has been the experience of Marco Riani and myself that use of the forward search (Atkinson and Riani, 2000) reveals masked outliers and their effects in a way that is impossible by looking at a fit to all the data. The data are fitted to subsets of increasing size, and parameter estimates, residuals and other quantities are monitored. The starting‐point for the searches is a robustly chosen subset of p, or a few more, observations. Could relatively small subsets of the data be used here to start such a process?
- (e)
Many statistical methods, including, I suspect, that described here, tend to work better if the data are approximately normal. In applications of inverse regression for dimension reduction, the data are sometimes transformed to approximate multivariate normality by using a multivariate Box–Cox transformation. An example is the analysis of data on New Zealand mussels in chapters 10 and 11 of Cook and Weisberg (1994). A robust version of this transformation using the forward search is illustrated in Riani and Atkinson (2001). What is the effect here of such transformations both on computation time and on the conclusions drawn from Tables 4 and 7?
Qiwei Yao (London School of Economics and Political Science)
The authors should be congratulated for making a further contribution along their impressive list of publications on nonparametric multivariate regression—a very important and immensely difficult topic.
Theorem 1 may be presented in a slightly stronger form by defining the weights wij in terms of {BTXi} instead of {Xi}. This effectively changes a p‐dimensional smoothing problem into a d‐dimensional one. The gain in convergence rate would now be h_opt log(n) = O{n−1/(d+4) log(n)}, at the price of the added computational complication in the minimization of problem (2.7).
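As a small illustration of this suggestion, the following sketch computes kernel weights from the reduced variables BTXi instead of from Xi; the Gaussian product kernel and the single common bandwidth h are illustrative assumptions, not choices from the paper.

```python
import numpy as np

def kernel_weights(X, B=None, h=0.3):
    # Weights w_ij from B^T X_i (d-dimensional smoothing) when B is
    # supplied, or from the full X_i (p-dimensional) when it is not.
    Z = X if B is None else X @ B
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-0.5 * d2 / h ** 2)           # Gaussian kernel (assumed)
    return W / W.sum(axis=0, keepdims=True)  # normalize: sum_i w_ij = 1
```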
As B0 is only defined up to orthogonal transforms, will the alternating iteration between refined kernel weights and estimating βj in step 1(b) lead to a stable estimate B̂? The use of refined kernel weights only makes sense if such a stable solution is guaranteed. A related question is how to measure the error of B̂: let a distance between the spaces spanned by B̂ and B0 be defined (the displayed definition is omitted here). Then this distance converges to 0 in probability if and only if B̂ estimates B0 `correctly'.
Finally the method proposed is most useful when D is small such as 2 or 3, as we still need to estimate the link function even if we have the right effective dimension reduction. If model (1.1) does not hold, will the procedure lead to a `good' approximation for the conditional expectation of y given X?
A. H. Welsh (University of Southampton)
Comparisons of minimum average variance estimation (MAVE) with sliced average variance estimation (SAVE) proposed by Cook and Weisberg (1991)(see Cook and Yin (2001) for recent references) in addition to sliced inverse regression may be interesting and more insightful. Robustness issues in sliced inverse regression and SAVE were raised at the 2000 Australian conference in a presentation by Ursula Gather and the discussion to Cook and Yin (2001). The issues are subtle so the claim that MAVE has good robustness properties needs a proper investigation.
The asymptotic behaviour of the minimum average variance estimator of β0 is essentially determined by an expansion whose displayed form is omitted here. The approach in which we estimate g and g′ by smoothing (as in the present paper) but estimate β0 by standard maximum likelihood (Brillinger, 1992; Weisberg and Welsh, 1994) seems rather different. However, it is important to centre Xi about an estimate of E(X|XTβ0=XiTβ0) and, under the simplifying conditions of the present paper and using local linear smoothing (Ruckstuhl and Welsh, 1999), the equivalent expression for this estimator is similar (the display is again omitted). Whereas we usually use undersmoothing, higher order kernels or higher order polynomials in local polynomial smoothing to increase the rate of convergence of the nonparametric components so that they are asymptotically negligible, MAVE estimates integrals of g rather than g itself, so we can use optimal bandwidths for g while estimating β0. If the above expressions are correct, MAVE should have the same asymptotic distribution (possibly up to centring of the covariates) as the maximum likelihood estimator, but this needs to be checked carefully. Finally, MAVE should also be extended to other distributions, presumably by maximizing the average local log‐likelihood.
Hengjian Cui (Beijing Normal University) and Guoying Li (Academy of Mathematics and System Sciences, Beijing)
This paper is very interesting and very provocative! The authors give us new ideas for searching for the effective dimension reduction (EDR) space in nonparametric regression settings.

Then, the associated p‐dimensional kernel can be taken as a product of p one‐dimensional kernels. This intuitively makes sense by theorem 1 and lemma 1. Also, we may refine the kernel weights and determine the number D by the procedures described in Sections 2.1.2 and 2.2 respectively.
The example in Section 5.2 shows that the (refined) MAVE method is robust. It seems to us that it is robust against outliers in the X‐space because the local smoother puts lower weights on more distant Xjs. If the outliers occur in the Y‐space the story may be different.
There are at least two obvious questions. One is the inference of the EDR directions, which involves the asymptotic normality of the estimated directions. This is true for single‐index models (Härdle et al., 1993; Xia and Li, 1999). We believe that the estimator obtained by (refined) MAVE has √n‐consistency and asymptotic normality under some regularity conditions. The expression for its asymptotic covariance matrix could be complicated, and a consistent estimator of it is needed. This may be given by, say, a bootstrap method. Moreover, the estimation of the link function is also important. In particular, we may first ask whether the link function is additive (Cui et al., 2001). Also, it is expected that the MAVE method may be extended to the case where X includes continuous as well as categorical (or, generally, discrete) or functionally related covariates, as mentioned in Section 3.4. Further work is definitely needed in this area.
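As a sketch of the bootstrap idea mentioned here (our illustration, not a procedure from the paper): resample pairs (Xi, yi), refit, align signs (a direction is identified only up to sign) and take the empirical covariance; `estimate_direction` is a placeholder for any (refined) MAVE fit.

```python
import numpy as np

def bootstrap_cov(X, y, estimate_direction, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = []
    for _ in range(B):
        idx = rng.integers(0, n, n)        # resample pairs (X_i, y_i)
        b = estimate_direction(X[idx], y[idx])
        b = b / np.linalg.norm(b)
        if betas and b @ betas[0] < 0:     # align signs across replicates
            b = -b
        betas.append(b)
    return np.cov(np.asarray(betas), rowvar=False)
```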
Vladimir Spokoiny (Weierstrass Institute and Humboldt University, Berlin)
The refined weights can only improve matters if the initial estimator of B0 is sufficiently accurate, and this accuracy comes from the first‐step estimation with the multidimensional weights. If this first‐step estimator is not sufficiently precise then the advantage of using the refined weights disappears and the whole procedure may fail in estimating the true effective dimension reduction. Hristache, Juditski and Spokoiny (2001) and Hristache, Juditski, Polzehl and Spokoiny (2001) proposed another way of selecting the refined weights wij, based on the idea of structural adaptation. The idea is to pass progressively from the multidimensional weights wij to low dimensional weights computed from the current estimate of B0. In this context, an interesting question is the possibility of joining the proposal of this paper (to estimate the index space by minimizing the mean average squared error) with the structural adaptation method.
The following contributions were received in writing after the meeting.
K. S. Chan (University of Iowa, Iowa City) and Ming‐Chung Li (EMMES Corporation, Rockville)
We congratulate the authors for their masterly piece of work that will certainly stimulate much research on semiparametric modelling and non‐linear time series.
[The reduced rank model fitted to the pollution data and its estimation details are omitted here.] We plotted the smoothed estimates ĝ1 and ĝ2 against the indices u1 and u2. Whereas ĝ1 seems linear, ĝ2 appears to be piecewise linear. The estimate of C, and that after a transformation that renders the two non‐linear principal components uncorrelated and of unit variance, are likewise omitted.
Fig.: (a) smoothed graph of ĝ1, (b) smoothed graph of ĝ2, (c) time series plots of the two non‐linear principal components and (d) dendrogram from a cluster analysis of the dynamics of the four pollution variables, based on the rotated Ĉ
The Euclidean distance between any two rows of the rotated C measures the dissimilarity in the dynamics of the corresponding variables. The rotated Ĉ suggests that the sulphur dioxide variable enjoyed different dynamics from the other variables, whereas the suspended particulates and nitrogen dioxide variables shared similar dynamics over the study period; see also Fig. 8.
Pavel Čížek and Wolfgang Härdle (Humboldt University, Berlin) and Lijian Yang (Michigan State University, East Lansing)
This paper addresses the challenging problem of dimension reduction and we congratulate the authors for this new insight into modelling high dimensional data. They provide the new minimum average variance estimation (MAVE) approach that creates a variety of semiparametric modelling strategies. The technical treatment is excellent and the algorithms derived are directly implementable. From a practitioner's point of view, there are probably questions about the performance of the method in non‐standard situations.
For an assumed number of directions, the MAVE method is based on the local linear approximation of a regression function. The main idea is to use this approximation (conditionally on yet unknown indices) directly in the local linear smoothing procedure by using a multidimensional kernel. This is just a simultaneous minimization with respect to function and direction estimates, which is broader than the usual methods that estimate only function values or only directions. According to theorem 1, this makes undersmoothing of the bandwidth selection unnecessary. Additionally, MAVE together with a cross‐validation procedure can be used to estimate the effective dimension reduction (EDR) dimension.
On the basis of MAVE, the authors design generalizations of several existing methods (e.g. the outer product of gradients (OPG) method is a generalization of the average derivative estimation of Härdle and Stoker (1989)). Additionally, these extensions even outperform the original methods. However, we must keep in mind that these generalizations are valid only under assumptions on the smoothness of all the variables and therefore cannot replace the corresponding single‐ and multi‐index methods that can also handle discrete variables (e.g. semiparametric least squares by Ichimura (1993)).
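For concreteness, a minimal sketch of the OPG idea follows: kernel‐weighted local linear fits yield gradient estimates, whose averaged outer product is eigendecomposed. The Gaussian kernel, the fixed bandwidth and the absence of any standardization are simplifying assumptions of ours, not the paper's exact choices.

```python
import numpy as np

def opg_directions(X, y, d, h=0.5):
    n, p = X.shape
    M = np.zeros((p, p))
    for j in range(n):
        D = X - X[j]                            # design centred at X_j
        w = np.exp(-0.5 * (D ** 2).sum(axis=1) / h ** 2)
        Z = np.hstack([np.ones((n, 1)), D])     # local linear basis
        WZ = Z * w[:, None]
        coef, *_ = np.linalg.lstsq(WZ.T @ Z, WZ.T @ y, rcond=None)
        grad = coef[1:]                         # local slope = gradient estimate
        M += np.outer(grad, grad) / n
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:d]]  # leading d eigenvectors
```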
Finally, the MAVE method is claimed to be robust against outliers, supposedly in the space of explanatory variables. We examined in more detail the robustness to outliers and random noise of the choice of the EDR dimension and of the OPG and MAVE methods. In the first case, our simulations regarding the cross‐validation procedure in the presence of a single outlier show two main effects: the outlier generally results in an upwardly biased estimate of the EDR dimension and, additionally, in most cases model estimates under contamination do not reduce the variance of the dependent variable conditionally on the regression function. In the second case, we studied the behaviour of MAVE and OPG under contamination. The most interesting result is that OPG, which for clean data is always worse than MAVE, can keep up with or even outperform MAVE when applied to contaminated data. We achieved similar results also under no contamination and a high variance of the error term.
R. D. Cook (University of Minnesota, St Paul)
The authors refer to span(B0) from model (1.1) as the effective dimension reduction (EDR) subspace, but I find this characterization to be incorrect. Li (1991) defined the EDR subspace as the span(B) in the representation y=g(BTX,δ), where the error δ⊥⊥X and B=(b1,…,bk). Because ɛ may depend on X, equation (1.1) permits a model with ɛ=σ(C0TX)δ, where σ(C0TX)≥0. For this version of model (1.1), the EDR subspace is span(B0)+span(C0), not span(B0) as the paper implies. This confusion is unfortunate but perhaps understandable because published descriptions of the EDR subspace are not explicitly constructive.
A mean subspace is any subspace span(B) of ℝp such that y⊥⊥E(y|X)|BTX. If the intersection of all mean subspaces is itself a mean subspace it is called the central mean subspace (CMS) and may be taken as the subject of a regression inquiry. Recently introduced by Cook and Li (2002), the CMS seems to be the subspace pursued in this paper.
A dimension reduction subspace (DRS) is any subspace span(B) such that y⊥⊥X|BTX. When the intersection of all DRSs is itself a DRS it is called the central subspace (CS; Cook (1996a, b, 1998)), which is a metaparameter for dimension reduction. The CS may not exist when the EDR subspace does exist. And the CS may exist straightforwardly when the construction of the EDR subspace is problematic (e.g. binary responses). I find the CS to be much easier to handle in theory and widely applicable in practice. The CMS is contained in the CS. The CS is invariant under strictly monotonic transformations of Y, whereas the CMS and span(B0) are not. Compactness of the support of X is not required for the CMS or the CS (see the discussion following lemma 1).
I do not regard sliced inverse regression (SIR) and refined minimum average variance estimation (RMAVE) to be direct competitors. SIR estimates directions in the CS, whereas RMAVE apparently estimates the CMS. The authors demonstrate that RMAVE does better than SIR in some situations that RMAVE was designed to handle. I wonder how RMAVE would perform across the many situations where SIR, sliced average variance estimation and related methods have apparently uncovered key regression structures.
The fact that SIR will not perform well in models like model (4.3) is known (Cook and Weisberg, 1991). Does the performance of RMAVE degrade when there are strong non‐linear relationships among the predictors, the kind that would render SIR ineffective?
I found this paper interesting because of the suggestion that local methods might mitigate the need for restrictions on the predictors.
Jianqing Fan (University of North Carolina at Chapel Hill)
Model parameters and identifiability
For any working matrix B, the best prediction of y given BTX is E(y|BTX), and the corresponding mean‐squared prediction error is E[{y−E(y|BTX)}2]=E{var(y|BTX)}; minimizing this over B is the same as expression (2.1). Hence, the model assumption (2.1) is not needed as far as the procedure for estimating B0 and g is concerned. Under what conditions does the optimization problem (2.1) have a unique solution, namely when is the parameter B0 identifiable? (Indeed, only the space spanned by the columns of B0 is possibly identifiable.)
Minimum average variance estimation and profile likelihood
The profile approach first estimates, for each given B, the link function by smoothing, yielding ĝB. Now, find the parameter B to minimize the residual sum of squares Σi{yi−ĝB(BTXi)}2.
The fully iterated procedure in Carroll et al. (1997) used this idea. Minimum average variance estimation is a nice variation on the profile likelihood method. It is motivated by estimating the conditional variance with a kernel estimator rather than by minimizing the mean‐squared errors directly. As a result, it has the nice expression (2.7), which facilitates theoretical studies but involves an extra loop of summation in computation. The merits of both approaches are worth exploring further. However, it is worth mentioning that the profile likelihood method generally gives semiparametric efficient estimators (see, for example, Carroll et al. (1997) and Murphy and van der Vaart (2000)). Whether minimum average variance estimation has this kind of optimality remains to be seen. The two procedures share at least one merit in common: no undersmoothing is needed for estimating the parametric components (Carroll et al. (1997) and theorem 1 of the present paper). In fact, the criteria that the two procedures optimize are approximately the same.
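A minimal sketch of the profile idea just described, under two simplifying assumptions of ours: a Nadaraya–Watson (rather than local linear) profile fit with a fixed bandwidth, and leave‐one‐out weights to avoid interpolation.

```python
import numpy as np
from scipy.optimize import minimize

def profile_rss(beta, X, y, h=0.3):
    # For fixed beta, replace g by a leave-one-out kernel estimate of
    # E(y | beta^T X) and return the residual sum of squares.
    beta = beta / np.linalg.norm(beta)
    z = X @ beta
    K = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)              # leave observation i out
    ghat = (K @ y) / K.sum(axis=1)        # Nadaraya-Watson profile fit
    return ((y - ghat) ** 2).sum()

# outer step: minimize over the index direction (illustrative only)
# res = minimize(profile_rss, beta0, args=(X, y), method='Nelder-Mead')
```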
Expression (2.7) is somewhat informal, since its minimization with respect to B is not unique, though the effective dimension reduction space that it determines is. Could the authors therefore explain how problem (2.7) is minimized and clarify the convergence criterion in Section 2.3?
L. Ferré (University of Toulouse le Mirail)
The paper is interesting since it substitutes local linear smoothing for inverse regression in estimating the effective dimension reduction space. The main advantage of the method over inverse regression is that condition (1.2) is relaxed, allowing applications to time series. Even if my own experience of the application of sliced inverse regression in time series is quite positive, time reversibility is indeed an awkward condition derived from equation (1.2). However, an argument in favour of inverse regression is simplicity: estimates of the effective dimension reduction space are deduced from a simple eigenvalue decomposition of a matrix, independently of g. This feature allows in particular extensions to functional data (see for example Dauxois et al. (2001)). This necessary reduction of the dimension (recall the goal: to overcome the `curse of dimensionality') comes before (and independently of) the nonparametric estimation of g. For deriving this dimension, tests have been proposed, relying, in the original papers, on distributional assumptions. These assumptions can be removed, since recent unpublished work has shown that the existence of the first four moments is sufficient. An alternative is to use a model selection approach based on the distance between S(B0) and its estimate, by letting d vary (Ferré, 1998). The main idea is that a working dimension that is lower than the `true' dimension D can be preferable, and the distance between the estimated space and a d‐subspace of the unknown S(B0) is finally used. Simple estimates of this criterion have been proposed for elliptically distributed explanatory variates, but also for the general case by using the bootstrap or jackknife (see Ferré (1997, 1998)). Local linear smoothing intends to estimate the regression function and the effective dimension reduction space at the same time. The price to pay is that the amount of local linear smoothing needed grows with the number of covariates included in the model. For the dimensionality, a global model selection approach is considered, but cross‐validation, in addition to its high computational cost, does not avoid the curse of dimensionality. Indeed, the estimator involved is the Nadaraya–Watson estimator, which may perform poorly for large values of d, and my feeling is that overparameterization is to be feared.
Ker‐chau Li (University of California at Los Angeles)
The dramatic improvement of the methods proposed over sliced inverse regression (SIR) and the principal Hessian directions method for the three examples deserves some non‐asymptotic explanations. For n=200 and p=10, it is difficult to tell why the nice asymptotic theorems are relevant. For the first two examples, a simple explanation goes like this. First, least squares regression is known to be consistent in finding an effective dimension reduction direction (Brillinger, 1983; Li and Duan, 1989) under condition (1.2). It is straightforward to extend this result to weighted least squares regression provided that the weight function depends on (y,x) only through (y,B0TX). Now because equation (2.6) is basically a weighted least squares regression, one can prove that, for the population version of equation (2.6), bTBT should be in the effective dimension reduction space. If condition (1.2) does not hold, then the result may be biased and an upper bound of bias can be evaluated (Duan and Li, 1991; Li, 1997). Problem (2.7) amounts to averaging over a number of weight functions. Averaging may help the cancellation of bias in the time series context.
For fairness, I would like to point out that weighted versions of SIR and similar procedures have been proposed before to temper the bias problem; see the discussion and rejoinder in Li (1991). It is worth pointing out the difference between condition (1.2) and elliptical symmetry (Hall and Li, 1993). Also SIR and principal Hessian directions can be applied to residuals after deterministic components have been taken out. Iteration does improve the results. However, the issue of non‐linear confounding (Li, 1997) sets a limitation that is difficult to bypass by any procedure. It is not clear to me whether the new approach can do anything about it.
For brevity, I shall not go over the long list of clever ideas that I found interesting in this path breaking work by the authors. Let me close by noting that they did not compare their procedure with projection pursuit regression. A dozen years ago when I submitted my SIR paper to the Journal of the American Statistical Association, the Associate Editor recommended rejection because he or she thought that SIR was not as good as projection pursuit regression. Luckily my paper was salvaged by the Editor, who allowed me to explain the difference between the two approaches. Apparently the authors have done more than enough to convince the reviewers just as they have convinced me!
Lexin Li (University of Minnesota, St Paul)
Adopting the notation in model (1.1) and following the definitions of the central mean subspace (CMS) (Cook and Li, 2002), the minimum average variance estimation (MAVE) methods seem to pursue the CMS only. To confirm this, simulations were done on models of the form y=g(B1TX)+h(B2TX)ɛ, where g and h are both unknown functions, ɛ is independent of X and E(ɛ)=0. My results indicate that MAVE methods can successfully estimate B1 in the mean structure E(y|X), whereas they always miss B2 in the error structure.
Refined MAVE (RMAVE) does not require sliced inverse regression's (Li, 1991) linearity condition. Simulations were done to examine the performance of RMAVE when there are strong non‐linear relationships among the predictors X. I considered one‐dimensional models only, where B∈ℜp. The results show that RMAVE has good performance for one‐dimensional models when the non‐linearity in X is strong.
Under the assumption D=1, however, there is still room for improvement, compared with RMAVE, in estimating the underlying true direction without the requirement of the linearity condition. Cook and Nachtsheim (1994) suggested a co‐ordinatewise reweighting approach to remove the non‐linearity in X and to make X elliptically contoured. I have been investigating the possibility of extending the idea of removing the non‐linearity in X by clustering on the X‐space as the first step. An ordinary least squares (OLS) estimate is obtained from each cluster, and all those estimates are combined to estimate the true direction. Intuitively, the clusterwise OLS method works because the non‐linearity in X is broken and within each cluster the linearity condition should hold approximately. Then the Li–Duan proposition (Li and Duan (1989), theorem 2.1, and Cook (1998), proposition 8.1) is applicable within each cluster. I also consider an iterative version of the algorithm, which obtains the estimate by iteratively clustering on the current fitted indices, the estimate from the ith iteration defining the indices for the next. Simulations show that the OLS estimate with clustering achieves a better performance than RMAVE. As an example, consider the model x1∼uniform(0,1) and x2=log(x1)+e, where e∼uniform(−0.3,0.3), and y=log(x1)+ɛ, where ɛ∼N(0,0.01). The actual direction is B=(1,0)T. With 100 observations, RMAVE gives an estimate with angle to B equal to 7.626°, whereas OLS with five clusters produces an estimate with angle to B equal to 2.196°. Here the number of clusters, 5, is chosen before we see the computational results, to make the comparison fair. Details of this work will be reported elsewhere.
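A minimal sketch of the clusterwise OLS idea as we read it; the use of k‐means and the simple sign‐aligned average as the combination rule are our assumptions, since the precise clustering and combination steps are to be reported elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def clusterwise_ols(X, y, n_clusters=5, seed=0):
    # Cluster on X so that the linearity condition holds approximately
    # within each cluster, fit OLS per cluster, then combine the slopes.
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(X)
    slopes = []
    for k in range(n_clusters):
        mask = labels == k
        if mask.sum() <= X.shape[1]:        # skip degenerate clusters
            continue
        b = LinearRegression().fit(X[mask], y[mask]).coef_
        b = b / np.linalg.norm(b)
        if slopes and b @ slopes[0] < 0:    # align signs before averaging
            b = -b
        slopes.append(b)
    beta = np.mean(slopes, axis=0)
    return beta / np.linalg.norm(beta)
```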
Oliver Linton (London School of Economics and Political Science)
The authors' procedure minimizes a kernel‐weighted least squares criterion jointly over the local parameters (a,b) and the direction estimates. The Ichimura (1993) procedure involves sequential minimization, with the difference that he uses only local constant fits but also includes the dependence of wij on β; this leads to a nasty non‐linear optimization problem, whereas the authors' procedure is just bilinear least squares, and so is conditionally linear. They apparently prove that after two iterations their estimator behaves as if (a,b) were known in expression (A.1). I think that this is an important idea that will make estimation of these models much easier. The authors develop many useful tools and apply them impressively. I have some comments and questions.
The initial consistent estimator that lurks in Appendix A.2 is either the average derivative estimator (in which case the criticisms in (a) and (b) of the second page apply) or some non‐linear least squares estimator, which itself will be heavily computational.
I suppose that the authors' estimator achieves the semiparametric efficiency bound in, for example, the special case of Appendix A.2 with independent and identically distributed ɛ, but it is not so clear to me.
In time series, we come across special sorts of indices like Σk=0 ∞ βkXt−k, where β is unknown; this would generalize the linear model yt=βyt−1+γXt+ɛt that is widely used. Have the authors thought about this case?
I do not think that the optimal amount of smoothing for the function will always be the same as the optimal amount of smoothing for the parameter. Generally speaking, it seems that in `adaptive' cases the optimal bandwidths for the parameter and for the function have the same magnitude, although not the same constant; see for example Carroll and Härdle (1989). In non‐adaptive cases this is not usually so. In the partially linear model y=βx+g(z)+e, Linton (1995) showed that the Robinson (1988) estimator of β has a second‐order expansion, under twice continuous differentiability of g, which suggests an optimal bandwidth rate of h ∝ n−1/9, i.e. it is optimal to undersmooth. Maybe, though, the authors can find an estimator of β that has the optimal bandwidth rate of h ∝ n−1/5.
Liqiang Ni (University of Minnesota, St Paul)
I applaud the authors for the promising refined minimum average variance estimation (RMAVE) algorithm and the intriguing idea of determining the dimension in a cross‐validation approach. Many methods have been proposed to estimate directions in the effective dimension reduction space (Li, 1991), or the central subspace (Cook, 1996). Sliced inverse regression (SIR) can discover directions of linear terms in mean functions but fails in symmetric situations like y=(βTX)2+ɛ with X normal, E(X)=0 and ɛ ⊥⊥ X, where the direction can be detected by sliced average variance estimation (Cook and Weisberg, 1991). In my experience, RMAVE can estimate both linear and quadratic terms well.
[A display describing a partial RMAVE procedure for categorical covariates is omitted here.]
The selection of the bandwidth seems tricky. The estimation of the dimension is much more stable when CV adopts the Nadaraya–Watson estimator than when it uses a local linear estimator. Nevertheless, it is still sensitive to the bandwidth. I applied RMAVE to the AIS data (Chiaromonte et al., 2002), which consist of a mixture of two linear regressions determined by the only categorical predictor, gender. Considering only the continuous predictors, the Nadaraya–Watson CV values suggested two dimensions with a larger bandwidth and only one dimension with a smaller bandwidth. The partial RMAVE method described above, however, suggested one dimension consistently, which confirmed that both linear regressions associate with the same direction, y=GC(βTX)+ɛ.
I have a question about inverse MAVE. The essence of SIR is that, under the linearity condition (1.2), the space spanned by E(Z|Y), where Z is the standardized covariate with E(Z)=0 and cov(Z)=I, is a subset of the EDR space. To estimate this space, Li (1991) proposed slicing on Y, and Zhu and Fang (1996) proposed kernel methods. I am not sure whether inverse MAVE is intended to estimate span{E(Z|Y)} also.
Megu Ohtaki and Yasunori Fujikoshi (Hiroshima University)
We praise the authors of this paper, which has a highly original and fascinating content. The paper is sure to be one of the monumental works in the field of multivariate analysis.
In the paper it is clearly shown that the minimum average variance estimation (MAVE) method and its algorithm have many advantages over existing methods for searching for an effective dimension reduction (EDR) space. Just like the sliced inverse regression method, however, it gives no prescription for reducing the number of the original covariables. It is also important to consider selection of the original variables as well as of the covariables β1TX,…,βpTX. In practical situations of data analysis, a model with a small number of original covariables is preferable provided that the bias is negligible. [The mathematical formulation of this variable selection problem is omitted here.]
In linear statistical inference, it has been reported that model selection using Akaike's information criterion (AIC) is not consistent for estimating the true model (see, for example, Shibata (1976) and Fujikoshi (1985)). Stone (1974) showed that the cross‐validation criterion and AIC are asymptotically equivalent for model selection. Given these results, we wonder whether theorem 2 is consistent with the classical results.
James R. Schott (University of Central Florida, Orlando)
Over the past decade, there has been a considerable amount of work on dimensionality reduction techniques in the regression setting. This paper represents a substantial contribution to that area. I have just a couple of minor comments relating to the sliced inverse regression (SIR) procedure of Li (1991) and subsequent similar types of procedure such as the sliced average variance estimate of Cook and Weisberg (1991).
The linear condition given in equation (1.2) is a fundamental requirement for most of these procedures. Additional assumptions may be needed; for instance, sliced average variance estimation requires a constant variance assumption, and inferential methods, associated with these procedures, for determining the correct dimension often require stronger conditions. These additional assumptions are certainly restrictive, but it is important to note that equation (1.2) is a fairly mild condition. It is weaker than elliptical symmetry because it only has to hold for the directions B0. Thus, we may not have elliptical symmetry but be sufficiently lucky still to have condition (1.2) hold. In fact, Hall and Li (1993) have shown that, loosely speaking, if the dimension of X is high, then it is likely that condition (1.2) holds at least approximately.
A further point to note is that procedures like SIR estimate a space that may be a proper subspace of the space spanned by the columns of B0. Have we missed any important directions? If so, how do we recover them? These are questions that may need to be answered when using SIR. However, they are not relevant questions for the adaptive procedures proposed here since they directly estimate the space spanned by the columns of B0.
C. M. Setodji (University of Minnesota, St Paul)
We have been presented with a constructive and useful paper and the authors are to be congratulated. Minimum average variance estimation (MAVE) seems to be an interesting and intriguing method for dimension reduction estimation. Equation (1.1) is applicable to any regression problem since, for any Y and X, we can always define ɛ=Y−E(Y|X) which depends on X and satisfies the conditions in the paper. I have applied MAVE to three well‐known sets of data that have been studied in the dimension reduction literature, and the optimal bandwidth was used throughout. Background on the examples was given by Cook and Critchley (2000). In all three examples, MAVE fails to produce the directions obtained by other methods.
First the methods proposed were applied to the bank‐note data. With a binary response (the bank‐note's authenticity) and six predictors, all the information in the regression is contained in the mean function. The first direction given by the refined MAVE method is the same as that produced by sliced average variance estimation (SAVE) (Cook and Critchley, 2000; Chiaromonte et al., 2002) and projection pursuit analysis (Posse, 1995). Whereas the first MAVE and SAVE directions are essentially the same, the second directions are quite different. The second SAVE direction shows two kinds of forged notes, but the role of the second MAVE direction is unclear. It misses the clustering in the counterfeit notes.
We also applied MAVE to the Hawkins data, designed to challenge traditional and robust regression methods with outliers. Although the data, with four covariates and a continuous response, have two directions in the mean function, refined MAVE and inverse MAVE suggest independence, whereas the outer products of gradients method suggests only one direction. SAVE correctly identifies the regression structure. Lastly, the method was applied to the AIS data, a data set with mixtures. MAVE suggested one direction, whereas sliced inverse regression infers a further one. MAVE evidently missed the `joining information' for males and females.
Many regression problems are filled with `mixtures', which is the one thing that all these data sets have in common. Mixtures increase the dimension of the mean function. My experience suggests that the MAVE methods fail to detect mixture regressions. Is it possible to enhance the proposed method to handle such an issue?
Finally, for me, one of the weaknesses of the method proposed is the fact that it is not invariant under linear transformations. Using (x1,x2) or (x1+x2,x2) as predictors may yield different first directions when d=1. More developments need to be pursued for these methods.
Nils Chr. Stenseth and Ole Chr. Lingjæ rde (University of Oslo)
Lynx populations undergo regular density cycles all across the boreal forest of Canada (see, for example, Stenseth et al. (1998)). In a previous analysis of the lynx dynamics (Stenseth et al., 1999) two competing hypotheses were put forward regarding the spatial structure of the dynamics. One predicts that the dynamical structure clusters into groups defined according to ecological‐based features, whereas the other predicts that it clusters into groups according to climatic‐based features. On the basis of an analysis of 21 time series from 1821 onwards, Stenseth et al. (1999) found evidence in support of the latter hypothesis, assuming a piecewise linear autoregressive model for each population. However, their model did not explicitly include any climatic effects.
[The model fitted here and the cross‐validation estimates of the orders d(s) are omitted.]
Fig.: Comparison of dynamic structures across Canada, using cross‐validation estimates for the orders d(s) (the comparison is based on the largest principal angles between the estimated reduction subspaces for each region): (a) average linkage hierarchical clustering of the 21 time series; (b) pseudocolour checker‐board plot of distances (the plotted values are non‐linearly scaled as exp{ϕ(s,s′)} to accentuate the regions of similar dynamics; order of regions (from left to right), with two major clusters emphasized: L18, L19, L16, L8, L14, L3, L22, L17, L15, L2, L7, L12, L20, L6, L9, L11, L5, L10, L21, L13, L4)
The results are strikingly similar to what we proposed as the ecological region structuring, and there is no strong support for the climatic region structuring, which Stenseth et al. (1999) concluded to be the more appropriate. To understand the underlying reasons for these differences certainly requires further work, on both the ecological and the statistical side, work that we would like to pursue.
The authors replied later, in writing, as follows.
The extraordinarily kind words from so many distinguished discussants have overwhelmed us. We thank all the discussants for their constructive remarks and stimulating questions. Limitations of time and space prevent us from answering every question raised. Moreover, some of the suggestions will keep us busy for a while!
We thank Professor Kent for pointing out possible connections with other areas. His point regarding reduced rank models is clearly related to Chan and M. Li's important contribution. Turning to partial least squares, one of us has studied a nonparametric partial least squares regression after transformation. For data (y,X), a spline transformation G(⋅) of the response y is carried out so that the partial least squares regression can be modelled without knowing the exact form of G(⋅). Readers can refer to Zhu (2002) for more details. The basic idea is to `linearize' a smooth function G(⋅) of the response y by π(⋅)Tθ, where π(⋅) is a vector of B‐spline basis functions of y and θ is an unknown projection parameter.
Concerning the issue of possible confounding between the covariates sulphur dioxide, nitrogen dioxide and the particulates (Bowman), the contribution by Professor Chan and Dr M. Li is relevant.
[Models (1) and (2) are omitted here.] We take X=(x1,x2,x3,x4)T. Two cases are considered:
- (a)
an uncorrelated design, x1,x2,x3,x4,ɛ∼IIDN(0,1), and
- (b)
a design with functional relationships, x3=(2x1+2x2+ɛ1)/3,x4={sgn(x1)|x1|2+ɛ2}/2 and x1,x2,ɛ1,ɛ2,ɛ∼IIDN(0,1).
We estimate model (2) under the nonparametric setting and under the non‐linear parametric setting respectively. With different sample sizes and bandwidths 0.6, 0.5, 0.45, 0.4, 0.35, 0.3, 0.28 and 0.25, results for the parametric estimators (obtained with the SAS software) and the RMAVE estimators are shown in Fig. 10, where the error is defined as the distance between the estimated space and the space spanned by B0=(β1,β2). It is clear that both methods suffer from functional relationships between the covariates. The relative degradation of efficiency for RMAVE due to collinearity and functional relationships between covariates is similar to that for the parametric case. (A small sketch of this design and of a subspace error measure is given after Fig. 10.)
Fig. 10: (a) parametric estimation and (b) nonparametric estimation (♦, results with uncorrelated design; ◯, results for designs with functional relationships)
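For readers wishing to reproduce the flavour of this experiment, the sketch below generates design (b) and computes one common measure of subspace error, the sine of the largest principal angle between the estimated space and span(B0); the exact norm behind Fig. 10 is not reproduced in the printed discussion, so that measure is our assumption.

```python
import numpy as np

# Design (b): covariates with functional relationships.
rng = np.random.default_rng(0)
n = 200
x1, x2, e1, e2 = rng.standard_normal((4, n))
x3 = (2 * x1 + 2 * x2 + e1) / 3
x4 = (np.sign(x1) * np.abs(x1) ** 2 + e2) / 2
X = np.column_stack([x1, x2, x3, x4])

def subspace_error(B_hat, B0):
    # Spectral norm of (I - P0) Q_hat = sine of the largest principal
    # angle between the two column spaces.
    Q0, _ = np.linalg.qr(B0)
    Qh, _ = np.linalg.qr(B_hat)
    P0 = Q0 @ Q0.T
    return np.linalg.norm((np.eye(P0.shape[0]) - P0) @ Qh, 2)
```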
Our remark on the apparent robustness, based on our experience with MAVE, has somewhat to our surprise aroused substantial interest among the discussants (Critchley, Atkinson, Cui, G. Li, Yao, Čížek, Härdle, Yang and Welsh). The issue is important but we have as yet no theoretical results to offer.
[Models (3)–(5) are omitted here; in the last of them, β2=(0,…,0,0.5,0.5,0.5,0.5)T.] The simulation results are reported in Table 8, and Fig. 11 shows that, by using the conditional expectation E(|y−Δk||X), we can capture all the EDR directions.
Table 8. Means and standard deviations of the estimated directions†

| | Value |
|---|---|
| β̂1 (mean) | (0.4538, 0.4387, 0.4467, 0.4443, −0.0041, 0.0013, 0.0242, 0.0067, 0.0193, 0.0128)T |
| Standard deviation | (0.0985, 0.0988, 0.1089, 0.0969, 0.0765, 0.1025, 0.1973, 0.1900, 0.1815, 0.1961)T |
| β̂2 (mean) | (0.0106, −0.0033, 0.0223, 0.0127, 0.0016, 0.0071, 0.3983, 0.3833, 0.3765, 0.3428)T |
| Standard deviation | (0.2290, 0.2300, 0.2435, 0.2520, 0.1339, 0.1511, 0.2063, 0.1963, 0.1869, 0.2223)T |

- †h = 0.6 was used.

Fig. 11: 1000 observations from model (3) (·) and conditional expectations (——), based on kernel regression from 1 million observations

[Model (6) is omitted here.] With a sample size of 200 and 200 independent replications, the estimated errors are listed in Table 9. The PPR algorithm in S‐PLUS performs much worse than the MAVE algorithm; even without the benefit of the additive noise structure, the RMAVE method still outperforms the PPR algorithm in S‐PLUS.

Table 9. Means of estimation errors for model (6) based on different algorithms†
| Method | Means of estimation errors for various bandwidths or spans | ||
|---|---|---|---|
| PPR (S‐PLUS) | [0.4] (0.3459, 0.2876) | [0.5] (0.2997, 0.2613) | [0.6] (0.3707, 0.2776) |
| RMAVE (additive) | [0.2] (0.0415, 0.0355) | [0.3] (0.0088, 0.0170) | [0.4] (0.0214, 0.0212) |
| RMAVE (non‐additive) | [0.3] (0.0305, 0.0516) | [0.4] (0.0481, 0.0731) | [0.5] (0.1104, 0.0586) |
- †Bandwidths or spans are given in square brackets.
We refer Professor Ohtaki and Professor Fujikoshi to Cheng and Tong (1992), which establishes consistency of the cross‐validation (CV) estimate, and to Professor Ferré's contribution.
We now consider Professor Setodji's examples. Because we estimate the remainder term, we face fewer problems than methods that rely on undersmoothing, and we may use the optimal bandwidth chosen by data‐driven methods. For example, the CV method for the local linear smoothing of yi on the current reduced variables B̂TXi can be applied in step 1(b) of our algorithm to choose the bandwidth that is used for the next iteration of the estimation. Using this kind of bandwidth, we have re‐examined the data sets cited by Professor Setodji. As usual we standardize each covariate before applying the RMAVE method. Table 10 shows our results, with the smallest CV values highlighted in bold. (An illustrative CV computation is sketched after the table.)
Table 10. Cross‐validation values for the data sets considered by Professor Setodji†

| Data | Method† | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|
| Bank‐note | Bandwidth | — | 0.1 | 0.3 | 0.5 | 0.6 | 0.7 | 0.7 |
| | LL–CV value | 0.2525 | **0.0016** | 0.0029 | 0.0045 | 0.0061 | 0.0093 | 0.0153 |
| | NW–CV value | 0.2525 | **0.0036** | 0.0049 | 0.0047 | 0.0079 | 0.0078 | 0.0085 |
| AIS | Bandwidth | — | 0.6 | 0.7 | 0.9 | | | |
| | LL–CV value | 150.5675 | 13.7718 | **12.4450** | 12.9045 | | | |
| | NW–CV value | 150.5675 | 20.2026 | **19.8053** | 27.1200 | | | |
| Hawkins | Bandwidth | — | 0.28 | 0.26 | 0.28 | 1 | | |
| | LL–CV value | 9.2133 | **7.8666** | 8.8623 | 10.4843 | 18.5332 | | |
| | NW–CV value | 9.2133 | **7.6566** | 9.0900 | 11.1208 | 11.6386 | | |

- †LL, local linear; NW, Nadaraya–Watson. Column headings give the working dimension; the smallest CV value in each row is in bold.
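To make the CV values of Table 10 concrete, here is a minimal sketch of a leave‐one‐out Nadaraya–Watson CV criterion computed on the reduced variables; the Gaussian kernel and the absence of the authors' exact standardization are simplifying assumptions of ours.

```python
import numpy as np

def loo_cv_value(Z, y, h):
    # Z = X @ B_hat holds the reduced variables for a working dimension d;
    # returns the leave-one-out prediction error of the NW smoother.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * d2 / h ** 2)
    np.fill_diagonal(K, 0.0)            # leave observation i out
    yhat = (K @ y) / K.sum(axis=1)
    return np.mean((y - yhat) ** 2)

# choose the dimension minimizing the CV value over fitted reductions, e.g.
# cv = [loo_cv_value(X @ B_hats[d], y, h[d]) for d in range(1, p + 1)]
```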

For the bank‐note data, the CV estimated dimension is 1 (the fitted single‐index direction is not reproduced here). See also Fig. 12(a). With this simple deterministic single‐index relationship, it seems difficult to believe that the efficient dimension is 2, as suggested by the sliced average variance estimation (SAVE) method in Cook and Critchley (2000). One possible explanation for suggesting a second dimension is that, if we classify {β1TXi, i=1,…,n} into two groups, then one of the notes might be in the wrong group on the basis of the SIR (or SAVE) direction, as shown in Fig. 12(c). However, on the basis of the RMAVE direction above, there is no such `outlier'; see Fig. 12(b). For the AIS data, the CV estimated dimension is 2, which is the same as that suggested by SAVE. The results are shown in Figs 12(d) and 12(e). It seems to us that RMAVE has not missed any information. For the Hawkins data, the dimension is estimated to be 1. The model seems to give a reasonable fit to the data, although the estimated dimension is lower than 2; see Fig. 12(f). Since the data set was generated from two regression models, we have also explored RMAVE with dimension 2 (and bandwidth 0.2). The directions are estimated as β1=(0.0326,0.7432,−0.2440,0.6221)T and β2=(0.7139,−0.1634,0.5653,0.3796)T. The difference between these directions and the directions β01 and β02 that are estimated on the basis of the two regressions above is very small; see also Figs 12(g)–12(j). Fig. 12(g) can distinguish the observations by their models. The rotation in Figs 12(g) and 12(h) is useful for interpretation purposes and is related to questions about the effect of rotation raised by Professor Bowman, Professor Atkinson, Professor Chan and Dr M. Li, and Professor Yao.

Fig. 12: calculations for (a)–(c) the bank‐note data (◯, y=1; •, y=0), (d) and (e) the AIS data (◯, females; •, males) and (f)–(j) the Hawkins data (◯, primary regression; •, second regression)
Professor Stenseth and Dr Lingjærde's application of the RMAVE method to the Canadian lynx populations is clearly very interesting. We also look forward to using the partial RMAVE method suggested by Professor Ni.
Concerning Professor Spokoiny's question, a further improvement on MAVE can be made. For example, we can improve the stability of the algorithm along the lines suggested by him.
References in the discussion
- Ping Yu, Jiang Du, Zhongzhan Zhang, Single-index partially functional linear regression model, Statistical Papers, 10.1007/s00362-018-0980-6, (2018).
- Peng Zeng, Yu Zhu, A novel regularization method for estimation and variable selection in multi-index models, Communications in Statistics - Theory and Methods, 10.1080/03610926.2018.1473603, (1-15), (2018).
- Xianyan Chen, Qingcong Yuan, Xiangrong Yin, Sufficient dimension reduction via distance covariance with multivariate responses, Journal of Nonparametric Statistics, 10.1080/10485252.2018.1562065, (1-21), (2018).
- Jun Zhang, Cuizhen Niu, Tao Lu, Zhenghong Wei, Estimation of the error distribution function for partial linear single-index models, Communications in Statistics - Simulation and Computation, 10.1080/03610918.2018.1468461, (1-17), (2018).
- Qian Lin, Zhigen Zhao, Jun S. Liu, Sparse Sliced Inverse Regression via Lasso, Journal of the American Statistical Association, 10.1080/01621459.2018.1520115, (1-33), (2018).
- Weixin Yao, Debmalya Nandy, Bruce G. Lindsay, Francesca Chiaromonte, Covariate Information Matrix for Sufficient Dimension Reduction, Journal of the American Statistical Association, 10.1080/01621459.2018.1515080, (1-28), (2018).
- Xin Zhang, Qing Mai, Efficient Integration of Sufficient Dimension Reduction and Prediction in Discriminant Analysis, Technometrics, 10.1080/00401706.2018.1512901, (0-0), (2018).
- Eliana Christou, Robust dimension reduction using sliced inverse median regression, Statistical Papers, 10.1007/s00362-018-1007-z, (2018).
- Huybrechts F. Bindele, Ash Abebe, Karlene N. Meyer, General rank-based estimation for regression single index models, Annals of the Institute of Statistical Mathematics, 10.1007/s10463-017-0618-9, 70, 5, (1115-1146), (2017).
- Ulrike Genschel, The Effect of Data Contamination in Sliced Inverse Regression and Finite Sample Breakdown Point, Sankhya A, 10.1007/s13171-017-0102-x, 80, 1, (28-58), (2017).
- Chuanlong Xie, Lixing Zhu, A minimum projected-distance test for parametric single-index Berkson models, TEST, 10.1007/s11749-017-0568-9, 27, 3, (700-715), (2017).
- Shu Liu, Liangyuan Liu, Efficient estimation for time-dynamic longitudinal single-index model, Communications in Statistics - Theory and Methods, 10.1080/03610926.2017.1361986, 47, 15, (3656-3674), (2017).
- Dongfang Lou, Zhiyuan Ma, Estimating coefficients of single-index regression models by minimizing variation, Journal of Statistical Computation and Simulation, 10.1080/00949655.2017.1390574, 88, 2, (343-358), (2017).
- Jun Zhang, Zhenghui Feng, Xiaoguang Wang, A constructive hypothesis test for the single-index models with two groups, Annals of the Institute of Statistical Mathematics, 10.1007/s10463-017-0616-y, 70, 5, (1077-1114), (2017).
- Xin Chen, Wenhui Sheng, Xiangrong Yin, Efficient Sparse Estimate of Sufficient Dimension Reduction in High Dimension, Technometrics, 10.1080/00401706.2017.1321583, 60, 2, (161-168), (2017).
- Chung Eun Lee, Xiaofeng Shao, Martingale Difference Divergence Matrix and Its Application to Dimension Reduction for Stationary Multivariate Time Series, Journal of the American Statistical Association, 10.1080/01621459.2016.1240083, 113, 521, (216-229), (2017).
- Kofi P. Adragni, Minimum average deviance estimation for sufficient dimension reduction, Journal of Statistical Computation and Simulation, 10.1080/00949655.2017.1392523, 88, 3, (411-431), (2017).
- Yan Fan, Wolfgang Karl Härdle, Weining Wang, Lixing Zhu, Single-Index-Based CoVaR With Very High-Dimensional Covariates, Journal of Business & Economic Statistics, 10.1080/07350015.2016.1180990, 36, 2, (212-226), (2017).
- Yuan Xue, Nan Zhang, Xiangrong Yin, Haitao Zheng, Sufficient dimension reduction using Hilbert–Schmidt independence criterion, Computational Statistics & Data Analysis, 10.1016/j.csda.2017.05.002, 115, (67-78), (2017).