Volume 71, Issue 3

Invariant co‐ordinate selection

David E. Tyler

Rutgers University, Piscataway, USA

Frank Critchley

The Open University, Milton Keynes, UK

Lutz Dümbgen

University of Berne, Switzerland

Hannu Oja

University of Tampere, Finland

First published: 01 June 2009
David E. Tyler, Department of Statistics, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854‐8019, USA.
E‐mail: dtyler@rci.rutgers.edu

Abstract

Summary. A general method for exploring multivariate data by comparing different estimates of multivariate scatter is presented. The method is based on the eigenvalue–eigenvector decomposition of one scatter matrix relative to another. In particular, it is shown that the eigenvectors can be used to generate an affine invariant co‐ordinate system for the multivariate data. Consequently, we view this method as a method for invariant co‐ordinate selection. By plotting the data with respect to this new invariant co‐ordinate system, various data structures can be revealed. For example, under certain independent components models, it is shown that the invariant co‐ordinates correspond to the independent components. Another example pertains to mixtures of elliptical distributions. In this case, it is shown that a subset of the invariant co‐ordinates corresponds to Fisher's linear discriminant subspace, even though the class identifications of the data points are unknown. Some illustrative examples are given.

1. Introduction

When sampling from a multivariate normal distribution, the sample mean vector and sample variance–covariance matrix are a sufficient summary of the data set. To protect against non‐normality, and in particular against longer‐tailed distributions and outliers, we can replace the sample mean and covariance matrix with robust estimates of multivariate location and scatter (or pseudocovariance). A variety of robust estimates of the multivariate location vector and scatter matrix have been proposed. Among them are multivariate M‐estimates (Huber, 1981; Maronna, 1976), the minimum volume ellipsoid estimate and the minimum covariance determinant estimate (Rousseeuw, 1986), S‐estimates (Davies, 1987; Lopuhaä, 1985), projection‐based estimates (Maronna et al., 1992; Tyler, 1994), τ‐estimates (Lopuhaä, 1991), constrained M‐estimates (Kent and Tyler, 1996) and MM estimates (Tatsuoka and Tyler, 2000; Tyler, 2002), as well as one‐step versions of these estimates (Lopuhaä, 1999). After computing robust estimates of multivariate location and scatter, outliers can often be detected by examining the corresponding robust Mahalanobis distances; see for example Rousseeuw and Leroy (1987).

Summarizing a multivariate data set via a location and a scatter statistic, and then inspecting the corresponding Mahalanobis distance plot for possible outliers, is appropriate if the bulk of the data arises from a multivariate normal distribution or, more generally, from an elliptically symmetric distribution. However, if the data arise from a distribution which is not symmetric, then different location statistics are estimating different notions of central tendency. Moreover, if the data arise from a distribution other than an elliptically symmetric distribution, even one which is symmetric, then different scatter statistics are not necessarily estimating the same population quantity but rather are reflecting different aspects of the underlying distribution. This suggests that comparing different estimates of multivariate scatter may help to reveal interesting departures from an elliptically symmetric distribution. Such data structures may not be apparent in a Mahalanobis distance plot.

In this paper, we present a general multivariate method based on the comparison of different estimates of multivariate scatter. This method is based on the eigenvalue–eigenvector decomposition of one scatter matrix relative to another. An important property of this decomposition is that the corresponding eigenvectors generate an affine invariant co‐ordinate system for the multivariate observations, and so we view this method as a method for invariant co‐ordinate selection (ICS). By plotting the data with respect to this new invariant co‐ordinate system, various data structures can be revealed. For example, when the data arise from a mixture of elliptical distributions, the space that is spanned by a subset of the invariant co‐ordinates gives an estimate of Fisher's linear discriminant subspace, even though the class identifications of the data points are unknown. Another example pertains to certain independent components models. Here the variables that are obtained by using the invariant co‐ordinates correspond to estimates of the independent components.

The paper is organized as follows. Section 2 sets up some notation and concepts to be used in the paper. In particular, the general concept of affine equivariant scatter matrices is reviewed in Section 2.1 and some classes of scatter matrices are briefly reviewed in Section 2.2. The idea of comparing two different scatter matrices by using the eigenvalue–eigenvector decomposition of one scatter matrix relative to another is discussed in Section 3, with the invariance properties of the ICS transformation being given in Section 4. Section 5 gives a theoretical study of the ICS transformation under the aforementioned elliptical mixture models (Section 5.1), and under independent components models (Section 5.2). The results in Section 5.1 represent a broad generalization of results given under the heading of generalized principal components analysis by Ruiz‐Gazen (1993) and Caussinus and Ruiz‐Gazen (1993, 1995). Readers who are primarily interested in how ICS works in practice may wish to skip Section 5 at a first reading. In Section 6, a general discussion on the choice of scatter matrices that we may consider when implementing ICS, along with some examples illustrating the utility of the ICS transformation for diagnostic plots, is given. Further discussion, open research questions and the relationship of ICS to other approaches are given in Section 7. All formal proofs are reserved for Appendix A. An R package entitled ICS (Nordhausen, Oja and Tyler, 2008) is freely available for implementing the ICS methods.

2. Scatter matrices

2.1. Affine equivariance

Let FY denote the distribution function of the multivariate random variable Y ∈ ℜp, and let 𝒫p represent the set of all symmetric positive definite matrices of order p. Affine equivariant multivariate location and scatter functionals, say μ(FY) ∈ ℜp and V(FY) ∈ 𝒫p respectively, are functions of the distribution satisfying the property that, for Y*=AY+b, with A non‐singular and b ∈ ℜp,
  μ(FY*) = A μ(FY) + b  and  V(FY*) = A V(FY) A′.   (1)
Classical examples of affine equivariant location and scatter functionals are the mean vector μY=E[Y] and the variance–covariance matrix ΣY=E[(Y−μY)(Y−μY)′] respectively, provided that they exist. For our purposes, affine equivariance of the scatter matrix can be relaxed slightly to require only affine equivariance of its shape components. A shape component of a scatter matrix V ∈ 𝒫p refers to any function of V, say S(V), such that
  S(λV) = S(V)  for any λ > 0.   (2)
Thus, we say that the ‘shape’ of V(FY) is affine equivariant if
  V(FY*) ∝ A V(FY) A′.   (3)
For a p‐dimensional sample of size n, Y={y1,…,yn}, affine equivariant multivariate location and scatter statistics, say μ̂(Y) and V̂(Y) respectively, are defined by applying the above definition to the empirical distribution function, i.e. they are statistics satisfying the property that, for any non‐singular A and any b ∈ ℜp,
  μ̂(Y*) = A μ̂(Y) + b  and  V̂(Y*) = A V̂(Y) A′,  where Y* = {Ay1+b,…,Ayn+b}.   (4)
Likewise, the shape of V̂(Y) is said to be affine equivariant if
  V̂(Y*) ∝ A V̂(Y) A′.   (5)

The sample mean vector ȳn and sample variance–covariance matrix Sn are examples of affine equivariant location and scatter statistics respectively, as are all the estimates that were cited in Section 1.
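
As a concrete illustration of property (4) (ours, not part of the original text), the following base R snippet checks numerically that the sample mean and covariance matrix transform as required under an affine map; the particular A and b are arbitrary choices.

```r
# Numerical check of the equivariance property (4) for the sample mean and
# covariance matrix under y_i* = A y_i + b (A and b are arbitrary choices).
set.seed(42)
Y <- matrix(rnorm(200 * 3), 200, 3)
A <- matrix(c(2, 1, 0, 0, 1, -1, 1, 0, 3), 3, 3)   # non-singular
b <- c(1, -2, 0.5)
Ystar <- Y %*% t(A) + matrix(b, nrow(Y), 3, byrow = TRUE)
all.equal(colMeans(Ystar), as.vector(A %*% colMeans(Y) + b))  # TRUE
all.equal(cov(Ystar), A %*% cov(Y) %*% t(A))                  # TRUE
```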

Typically, in practice, V(FY) is normalized so that it is consistent at the multivariate normal model for the variance–covariance matrix. The normalized version is thus given as V(FY)/β, where β>0 is such that V(FZ)=βI when Z has a standard multivariate normal distribution. For our purposes, it is sufficient to consider only the unnormalized scatter matrix V(FY) since our proposed methods depend only on the scatter matrix up to proportionality, i.e. only on the shape of the scatter matrix.

Under elliptical symmetry, affine equivariant location and scatter functionals have relatively simple forms. Recall that an elliptically symmetric distribution is defined to be one arising from an affine transformation of a spherically symmetric distribution, i.e., if Z and QZ have the same distribution for any p×p orthogonal matrix Q, then the distribution of Y=AZ+μ is said to have an elliptically symmetric distribution with centre μ ∈ ℜp and shape matrix Γ=AA′; see for example Bilodeau and Brenner (1999). If the distribution of Y is also absolutely continuous, then it has a density of the form
  f(y; μ, Γ, g) = det(Γ)^{-1/2} g{(y−μ)′ Γ^{-1} (y−μ)}   (6)
for some non‐negative function g and with Γ=AA′. As defined, the shape parameter Γ of an elliptically symmetric distribution is only well defined up to a scalar multiple, i.e., if Γ satisfies the definition of a shape matrix for a given elliptically symmetric distribution, then λΓ also does for any λ>0. In the absolutely continuous case, if no restrictions are placed on the function g, then the parameter Γ is confounded with g. One could normalize the shape parameter by setting, for example, det(Γ)=1 or tr(Γ)=p. Again, this is not necessary for our purposes since only the shape components of Γ, as defined in expression (2), are of interest in this paper, and these shape components for an elliptically symmetric distribution are well defined.

Under elliptical symmetry, any affine equivariant location functional corresponds to the centre of symmetry and any affine equivariant scatter functional is proportional to the shape matrix, i.e. μ(FY)=μ and V(FY)∝Γ. In particular, μY=μ and ΣY∝Γ when the first and second moments exist respectively. More generally, if V(FY) is any functional satisfying condition (3), then V(FY)∝Γ.

As noted in Section 1, for general distributions, affine equivariant location functionals are not necessarily equal and affine equivariant scatter functionals are not necessarily proportional to each other. The corresponding sample versions of these functionals are therefore estimating different population features. The difference in these functionals reflects in some way how the distribution differs from an elliptically symmetric distribution.

Remark 1. The class of distributions for which all affine equivariant location functionals are equal and all affine equivariant scatter functionals are proportional to each other is broader than the class of elliptical distributions. For example, this can be shown to be true for FY when Y=AZ+μ with the distribution of Z being exchangeable and symmetric in each component, i.e. Z and DJZ have the same distribution for any permutation matrix J and any diagonal matrix D having diagonal elements ±1. We conjecture that this is the broadest class for which this property holds. This class contains the elliptically symmetric distributions, since these correspond to Z having a spherically symmetric distribution.

2.2. Classes of scatter statistics

Conceptually, the simplest alternatives to the sample mean ȳn and sample covariance matrix Sn are the weighted sample means and sample covariance matrices respectively, with the weights dependent on the classical Mahalanobis distances. These are defined by
  ȳW = ave{u1(si) yi} / ave{u1(si)}  and  SW = ave{u2(si) (yi − ȳW)(yi − ȳW)′},   (7)
where ave{·} denotes averaging over i = 1,…,n, si = (yi − ȳn)′ S_n^{-1} (yi − ȳn), and u1(s) and u2(s) are some appropriately chosen weight functions. Other simple alternatives to the sample covariance matrix can be obtained by applying only the scatter equation above to the sample of pairwise differences, i.e. to the symmetrized data set
  Ys = {yi − yj : i ≠ j, i, j = 1,…,n},   (8)
for which the sample mean is 0. Even though the weighted mean and covariance matrix, as well as the symmetrized version of the weighted covariance matrix, may downweight outliers, they have unbounded influence functions and zero breakdown points.
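
To make the construction in expression (7) concrete, here is a minimal base R sketch (ours, not from the paper). The weight functions u1 and u2 are illustrative choices, and whether the weighted scatter is centred at the weighted or the unweighted mean is a detail that varies between definitions; the weighted mean is used here.

```r
# A sketch of the weighted sample mean and covariance matrix (7); u1 and u2
# are illustrative downweighting functions, not a recommendation.
weighted_mean_cov <- function(Y, u1 = function(s) 1 / (1 + s),
                                 u2 = function(s) 1 / (1 + s)) {
  s  <- mahalanobis(Y, colMeans(Y), cov(Y))     # classical squared distances s_i
  w1 <- u1(s); w2 <- u2(s)
  mu_w <- colSums(w1 * Y) / sum(w1)             # weighted mean
  Yc   <- sweep(Y, 2, mu_w)                     # centred at the weighted mean
  S_w  <- crossprod(sqrt(w2) * Yc) / nrow(Y)    # ave{u2(s_i)(y_i - m)(y_i - m)'}
  list(location = mu_w, scatter = S_w)
}
```
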
A more robust class of multivariate location and scatter statistics is given by the multivariate M‐estimates, which can be viewed as adaptively weighted sample means and sample covariance matrices respectively. More specifically, they are defined as solutions to the M‐estimating equations
  μ̂ = ave{u1(si) yi} / ave{u1(si)}  and  V̂ = ave{u2(si) (yi − μ̂)(yi − μ̂)′} / ave{u3(si)},   (9)
where si = (yi − μ̂)′ V̂^{-1} (yi − μ̂), and u1(s), u2(s) and u3(s) are again some appropriately chosen weight functions. We refer the reader to Huber (1981) and Maronna (1976) for the general theory regarding the multivariate M‐estimates. The equations given in expression (9) are implicit equations in (μ̂, V̂) since the weights depend on the Mahalanobis distances relative to (μ̂, V̂), i.e. on the si themselves. Nevertheless, relatively simple algorithms exist for computing the multivariate M‐estimates. The maximum likelihood estimates of the parameters μ and Γ of an elliptical distribution for a given spread function g in expression (6) are special cases of M‐estimates.
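
As an illustration of how the implicit equations (9) can be solved, the following sketch (ours) iterates the usual reweighting fixed‐point algorithm for the Cauchy maximum likelihood weights u1(s)=u2(s)=(p+1)/(s+1) and u3(s)=1, which reappear in Section 6.1; the tolerance and iteration limit are arbitrary choices.

```r
# A sketch of the reweighting algorithm for the M-estimating equations (9)
# with Cauchy weights; convergence settings are illustrative only.
cauchy_m_est <- function(Y, tol = 1e-8, maxit = 500) {
  n <- nrow(Y); p <- ncol(Y)
  mu <- colMeans(Y); V <- cov(Y)                 # starting values
  for (it in seq_len(maxit)) {
    s <- mahalanobis(Y, mu, V)                   # s_i = (y_i - mu)' V^{-1} (y_i - mu)
    w <- (p + 1) / (s + 1)                       # u1 = u2, and u3 = 1
    mu_new <- colSums(w * Y) / sum(w)
    Yc <- sweep(Y, 2, mu_new)
    V_new <- crossprod(sqrt(w) * Yc) / n
    if (max(abs(mu_new - mu)) < tol && max(abs(V_new - V)) < tol) break
    mu <- mu_new; V <- V_new
  }
  list(location = mu, scatter = V)
}
```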

From a robustness perspective, an often‐cited drawback to the multivariate M‐estimates is their relatively low breakdown in higher dimension. Specifically, their breakdown point is bounded above by 1/(p+1). Subsequently, numerous high breakdown point estimates have been proposed, such as the minimum volume ellipsoid, the minimum covariance determinant, the S‐estimates, the projection‐based estimates, the τ‐estimates, the constrained M‐estimates and the MM estimates, all of which are cited in Section 1. All the high breakdown point estimates are computationally intensive and, except for small data sets, are usually computed by using approximate or probabilistic algorithms. The computational complexity of high breakdown point multivariate estimates is especially challenging for extremely large data sets in high dimensions, and this remains an open and active area of research.

The definition of the weighted sample means and covariance matrices given by expression (7) can be readily generalized by using any initial affine equivariant location and scatter statistic, say μ̂0 and V̂0 respectively, i.e.
  μ̂W = ave{u1(si) yi} / ave{u1(si)}  and  V̂W = ave{u2(si) (yi − μ̂W)(yi − μ̂W)′} / ave{u3(si)},   (10)
where now si = (yi − μ̂0)′ V̂0^{-1} (yi − μ̂0). In the univariate setting such weighted sample means and variances are sometimes referred to as one‐step W‐estimates (Hampel et al., 1986; Mosteller and Tukey, 1977), and so we refer to their multivariate versions as multivariate one‐step W‐estimates. Given a location and a scatter statistic, a corresponding one‐step W‐estimate provides a computationally simple choice for an alternative location and scatter statistic.

Any method that one uses for obtaining location and scatter statistics for a data set Y can also be applied to its symmetrized version Ys to produce a scatter statistic. For symmetrized data, any affine equivariant location statistic is always 0.

The functional or population versions of the location and scatter statistics that were discussed in this section are readily obtained by replacing the empirical distribution of Y with the population distribution function FY. For the M‐estimates and the one‐step W‐estimates, this simply implies replacing the averages in expressions (9) and (10) respectively with expected values. For symmetrized data, the functional versions are obtained by replacing the empirical distribution of Ys with its almost sure limit FYs, the distribution function of Ys=Y1−Y2, where Y1 and Y2 are independent copies of Y.

3. Comparing scatter matrices

Comparing positive definite symmetric matrices arises naturally within a variety of multivariate statistical problems. Perhaps the most obvious case is when we wish to compare the covariance structures of two or more different groups; see for example Flury (1988). Other well‐known cases occur in multivariate analysis of variance, wherein interest lies in comparing the within‐group and between‐group sum of squares and cross‐products matrices, and in canonical correlation analysis, wherein interest lies in comparing the covariance matrix of one set of variables with the covariance matrix of its linear predictor based on another set of variables. These methods involve either multiple populations or two different sets of variables. Less attention has been given to the comparison of different estimates of scatter for a single set of variables from a single population. Some work in this direction, though, can be found in Art et al. (1982), Caussinus and Ruiz‐Gazen (1990, 1993, 1995), Caussinus et al. (2003) and Ruiz‐Gazen (1993), which will be discussed in later sections.

Typically, the difference between two positive definite symmetric matrices can be summarized by considering the eigenvalues and eigenvectors of one matrix with respect to the other. More specifically, suppose that V1 ∈ 𝒫p and V2 ∈ 𝒫p. An eigenvalue, say ρj, and a corresponding eigenvector, say hj, of V2 relative to V1 correspond to a non‐trivial solution to the matrix equations
  det(V2 − ρj V1) = 0  and  (V2 − ρj V1) hj = 0.   (11)
Equivalently, ρj and hj are an eigenvalue and corresponding eigenvector respectively of V_1^{-1}V2. Since most readers are probably more familiar with the eigenvalue–eigenvector theory of symmetric matrices, we note that ρj also represents an eigenvalue of the symmetric matrix M = V_1^{-1/2} V2 V_1^{-1/2}, where V_1^{1/2} denotes the unique positive definite symmetric square root of V1. Hence, we can choose p ordered eigenvalues, ρ1 ≥ ρ2 ≥ … ≥ ρp > 0, and an orthonormal set of eigenvectors qj, j=1,…,p, such that Mqj=ρjqj. The relationship between hj and the eigenvectors of M is given by hj = V_1^{-1/2} qj, and so hi′ V1 hj = 0 for i≠j. This yields the following simultaneous diagonalization of V1 and V2:
  H′ V1 H = D1  and  H′ V2 H = D2,   (12)
where H=(h1 … hp), D1 and D2 are diagonal matrices with positive entries and Δ = D_1^{-1} D2 = diag(ρ1,…,ρp). Without loss of generality, we can take D1=I by normalizing hj so that hj′ V1 hj = 1. Alternatively, we can take D2=I. Such a normalization is not necessary for our purposes and we simply prefer the general form (12) since it reflects the exchangeability between the roles of V1 and V2. Note that the matrix V_1^{-1}V2 has the spectral value decomposition
  V_1^{-1}V2 = H Δ H^{-1}.   (13)
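
The decomposition in equations (11)–(13) is straightforward to compute. The sketch below (ours, not the authors' code or the ICS package) works through the symmetric matrix M = V1^{-1/2} V2 V1^{-1/2} and normalizes the eigenvectors so that H′V1H = I.

```r
# Eigenvalues and eigenvectors of V2 relative to V1, cf. (11)-(12);
# V1 and V2 must be symmetric positive definite p x p matrices.
relative_eigen <- function(V1, V2) {
  e1 <- eigen(V1, symmetric = TRUE)
  V1_inv_sqrt <- e1$vectors %*% diag(1 / sqrt(e1$values), length(e1$values)) %*%
                 t(e1$vectors)
  eM  <- eigen(V1_inv_sqrt %*% V2 %*% V1_inv_sqrt, symmetric = TRUE)
  H   <- V1_inv_sqrt %*% eM$vectors             # h_j = V1^{-1/2} q_j, so H'V1H = I
  list(roots = eM$values, H = H)                # roots rho_1 >= ... >= rho_p
}
```
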
Various useful interpretations of the eigenvalues and eigenvectors in equation (11) can be given whenever V1 and V2 are two different scatter matrices for the same population or sample. We first note that the eigenvalues ρ1,…,ρp are the maximal invariants under affine transformation for comparing V1 and V2, i.e., if we define a function G(V1,V2) such that G(V1,V2)=G(AV1A′,AV2A′) for any non‐singular A, then G(V1,V2)=G(D1,D2)=G(I,Δ), with D1, D2 and Δ being defined as above. Furthermore Δ is invariant under such transformations. Since scatter matrices tend to be well defined only up to a scalar multiple, it is more natural to be interested in the difference between V1 and V2 up to proportionality. In this case, if we consider a function G(V1,V2) such that G(V1,V2)=G(λ1AV1A′,λ2AV2A′) for any non‐singular A and any λ1>0 and λ2>0, then G(V1,V2)=G{I,Δ/det(Δ)^{1/p}}, i.e. maximal invariants in this case are
  ρj / (ρ1 ρ2 ⋯ ρp)^{1/p},  j = 1,…,p,
or, in other words, we are interested in (ρ1,…,ρp) up to a common scalar multiplier.
A more useful interpretation of the eigenvalues arises from the following optimality property, which follows readily from standard eigenvalue–eigenvector theory. For h ∈ ℜp, let
  κ(h) = (h′ V2 h) / (h′ V1 h).   (14)

For V1=V1(FY) and V2=V2(FY), κ(h) represents the square of the ratio of two different measures of scale for the variable h′Y. Recall that the classical measure of kurtosis corresponds to the fourth power of the ratio of two scale measures, namely the fourth root of the fourth central moment and the standard deviation. Thus, the value of κ(h)² can be viewed as a generalized measure of ‘relative’ kurtosis. The term relative is used here since the scatter matrices V1 and V2 are not necessarily normalized. If both V1 and V2 are normalized so that they are both consistent for the variance–covariance matrix under a multivariate normal model, then a deviation of κ(h) from 1 would indicate non‐normality. In general, though, the ratio κ(h1)²/κ(h2)² does not depend on any particular normalization.

The maximal possible value of κ(h) over h ∈ ℜp is ρ1 with the maximum being achieved in the direction of h1. Likewise, the minimal possible value of κ(h) is ρp with the minimum being achieved in the direction of hp. More generally, we have
  sup{κ(h) : h′ V1 hj = 0, j = 1,…,m−1} = ρm,   (15)
with the supremum being obtained at hm, and
  inf{κ(h) : h′ V1 hj = 0, j = m+1,…,p} = ρm,   (16)
with the infimum being obtained at hm. These successive optimality results suggest that plotting the data or distribution by using the co‐ordinates Z=H′Y may reveal interesting structures. We explore this idea in later sections.
Remark 2. An alternative motivation for the transformation Z=H′Y is as follows. Suppose that Y is first ‘standardized’ by using a scatter functional V1(F) satisfying condition (3), i.e. X=V1(FY)^{-1/2}Y. If Y is elliptically symmetric about μY, then X is spherically symmetric about the centre μX=V1(FY)^{-1/2}μY. If a second scatter functional is then applied to X, say V2(F) satisfying condition (3), then V2(FX)∝I, and hence no projection of X is any more interesting than any other projection of X. However, if Y is not elliptically symmetric, then V2(FX) is not necessarily proportional to I. This suggests that a principal components analysis of X based on V2(FX) may reveal some interesting projections. By taking the spectral value decomposition V2(FX)=QDQ′, where Q is an orthogonal matrix, and then constructing the principal component variables Q′X, we obtain
  Q′X = Q′ V1(FY)^{-1/2} Y = H′Y = Z   (17)
whenever H is normalized so that H′V1(FY)H=I.
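
In sample terms, remark 2 suggests a simple two‐step computation of the invariant co‐ordinates. The sketch below assumes that scatter1 and scatter2 are user‐supplied functions returning p × p scatter estimates (e.g. scatter1 = cov); it is our sketch of the idea rather than the authors' implementation (an R package, ICS, is available; see Section 1).

```r
# ICS via remark 2: whiten with V1^{-1/2}, then rotate to the principal axes
# of the second scatter computed on the whitened data; rows of Z are the
# invariant co-ordinates, with H normalized so that H'V1H = I.
ics_coordinates <- function(Y, scatter1 = cov, scatter2) {
  V1 <- scatter1(Y)
  e1 <- eigen(V1, symmetric = TRUE)
  W  <- e1$vectors %*% diag(1 / sqrt(e1$values), ncol(Y)) %*% t(e1$vectors)
  X  <- Y %*% W                                  # whitened data
  e2 <- eigen(scatter2(X), symmetric = TRUE)     # V2(F_X) = Q D Q'
  list(roots = e2$values, H = W %*% e2$vectors, Z = X %*% e2$vectors)
}
```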

4. Invariant co‐ordinate systems

In this and the following section we study the properties of the transformation Z=HY in more detail, and in Section 6 we give some examples illustrating the utility of the transformation when used in diagnostic plots. For simplicity, unless otherwise stated, we hereafter state any theoretical properties by using the functional or population version of scatter matrices. The sample version then follows as a special case based on the empirical distributions. Examples are, of course, given for the sample version. The following condition is assumed throughout and the following notation is used hereafter.

Condition 1. For Y ∈ ℜp having distribution FY, let V1(F) and V2(F) be two scatter functionals satisfying condition (3). Further, suppose that both V1(F) and V2(F) are uniquely defined at FY.

Definition 1. Let H(F)=(h1(F) … hp(F)) be a matrix of eigenvectors defined as in equations (11) and (12), with ρ1(F) ≥ … ≥ ρp(F) being the corresponding eigenvalues, whenever V1 and V2 are taken to be V1(F) and V2(F) respectively.

It is well known that principal component variables are invariant under translations and orthogonal transformations of the original variables, but not invariant under other general affine transformations. An important property of the transformation that is proposed here, i.e. Z=H(FY)′Y, is that the resulting variables are invariant under any affine transformation.

Theorem 1. In addition to condition 1, suppose that the roots ρ1(FY),…,ρp(FY) are all distinct. Then, for the affine transformation Y*=AY+b, with A being non‐singular,
  ρj(FY*) = γ ρj(FY),  j = 1,…,p,   (18)
for some γ>0. Moreover, the components of Z=H(FY)′Y and Z*=H(FY*)′Y* differ at most by co‐ordinatewise location and scale, i.e., for some constants α1,…,αp and β1,…,βp, with αj≠0 for j=1,…,p,
  Zj* = αj Zj + βj,  j = 1,…,p.   (19)

Owing to property (19) we refer to the transformed variables Z=H(FY)′Y as an invariant co‐ordinate system, and the method for obtaining them as ICS. If a univariate standardization is applied to the transformed variables, then the standardized versions of Zj and Zj* differ only by a factor of ±1.

A generalization of the previous theorem, which allows for possible multiple roots, can be stated as follows.

Theorem 2. Let Y, Y*, Z and Z* be defined as in theorem 1. In addition to condition 1, suppose that the roots ρ1(FY),…,ρp(FY) consist of m distinct values, say ρ(1)>…>ρ(m), with ρ(k) having multiplicity pk for k=1,…,m, and hence p1+…+pm=p. Then, expression (18) still holds. Furthermore, suppose that we partition Z = (Z(1)′,…,Z(m)′)′ and Z* = (Z*(1)′,…,Z*(m)′)′, where Z(k) ∈ ℜpk. Then, for some non‐singular matrix Ck of order pk and some pk‐dimensional vector βk,
  Z*(k) = Ck Z(k) + βk,  k = 1,…,m,   (20)
i.e. the space that is spanned by the components of Z*(k) is the same as the space that is spanned by the components of Z(k).

As with any eigenvalue–eigenvector problem, eigenvectors are not well defined. For a distinct root, the eigenvector is well defined up to a scalar multiple. For a multiple root, say with multiplicity p0, the corresponding p0 eigenvectors can be chosen to be any linearly independent vectors spanning the corresponding p0‐dimensional eigenspace. Consequently Z(k) in theorem 2 is not well defined. One could construct some arbitrary rule for defining Z(k) uniquely. However, this is not necessary here since, no matter which rule we may use to define Z(k) uniquely, the results of theorem 2 hold.

5. Invariant co‐ordinate selection under non‐elliptical models

When Y has an elliptically symmetric distribution, all the roots ρ1(FY),…,ρp(FY) are equal, and so the ICS transformation Z=H(FY)′Y is arbitrary. The aim of ICS, though, is to detect departures of Y from an elliptically symmetric distribution. In this section, the behaviour of the ICS transformation is demonstrated theoretically for two classes of non‐elliptically symmetric models, namely for mixtures of elliptical distributions and for independent components models.

5.1. Mixture of elliptical distributions

In practice, data often appear to arise from mixture distributions, with the mixing being the result of some unmeasured grouping variable. Uncovering the different groups is typically viewed as a problem in cluster analysis. One clustering method, which was proposed by Art et al. (1982), is based on first reducing the dimension of the clustering problem by attempting to identify Fisher's linear discriminant subspace. To do this, they gave an iterative algorithm for approximating the within‐group sum of squares and cross‐products matrix, say Wn, and then considered the eigenvectors of W_n^{-1}Tn, where Tn is the total sum‐of‐squares and cross‐products matrix. The approach that was proposed by Art et al. (1982) was motivated primarily by heuristic arguments and was supported by a Monte Carlo study.

Subsequently, Ruiz‐Gazen (1993) and Caussinus and Ruiz‐Gazen (1993, 1995) showed for a location mixture of multivariate normal distributions with equal variance–covariance matrices that Fisher's linear discriminant subspace can be consistently estimated even when the group identification is not known, provided that the dimension, say q, of the subspace is known. Their results are based on the eigenvectors that are associated with the q largest eigenvalues of S_{1,n}^{-1}Sn, where Sn is the sample variance–covariance matrix and S1,n is either the one‐step W‐estimate (7) or its symmetrized version. They also required that S1,n differ from Sn by only a small perturbation, since their proof involves expanding the functional version of S1,n about the functional version of Sn. In this subsection, it is shown that these results can be extended to essentially any pair of scatter matrices, and also that the results hold under mixtures of elliptical distributions with proportional scatter parameters.

For simplicity, we first consider properties of the ICS transformation for a mixture of two multivariate normal distributions with proportional covariance matrices. Considering proportional covariance matrices allows for the inclusion of a point mass contamination as one of the mixture components, since a point mass contamination is obtained by letting the proportionality constant go to 0.

Theorem 3. In addition to condition 1, suppose that
  FY = (1−α) Np(μ1, Γ) + α Np(μ2, λΓ),
where 0<α<1, μ1≠μ2, λ>0 and Γ ∈ 𝒫p. Then either
  • (a) ρ1(FY) > ρ2(FY) = … = ρp(FY),
  • (b) ρ1(FY) = … = ρp−1(FY) > ρp(FY), or
  • (c) ρ1(FY) = … = ρp(FY).

For p>2, if case (a) holds, then h1(FY) ∝ Γ^{-1}(μ1−μ2) and, if case (b) holds, then hp(FY) ∝ Γ^{-1}(μ1−μ2). For p=2, if ρ1(FY)>ρ2(FY), then either h1(FY) or h2(FY) is proportional to Γ^{-1}(μ1−μ2).

Thus, depending on whether case (a) or case (b) holds, h1 or hp respectively corresponds to Fisher's linear discriminant function (see for example Mardia et al. (1980)), even though the group identity is unknown. An intuitive explanation about why we might expect this to hold is that any estimate of scatter contains information on the between‐group variability, i.e. the difference between μ1 and μ2, and the within‐group variability or shape, i.e. Γ. Thus, one might expect that we could separate these two sources of variability by using two different estimates of scatter. This intuition though is not used in our proof of theorem 3; nor is our proof based on generalizing the perturbation arguments that were used by Ruiz‐Gazen (1993) and Caussinus and Ruiz‐Gazen (1995) in deriving their aforementioned results. Rather, the proof of theorem 3 that is given in Appendix A relies solely on invariance arguments.

Whether case (a) or case (b) holds in theorem 3 depends on the choice of V1(F) and V2(F) and on the nature of the mixture. Obviously, if case (a) holds and then the roles of V1(F) and V2(F) are reversed, then case (b) would hold. Case (c) holds only in very specific situations. In particular, case (c) holds if μ1=μ2, in which case Y has an elliptically symmetric distribution. When μ1μ2, i.e. when the mixture is not elliptical itself, it is still possible for case (c) to hold. This though is dependent not only on the specific choice of V1(F) and V2(F) but also on the particular value of the parameters α, μ1,μ2,Γ and λ.

For example, suppose that V1(F)=Σ(F), the population covariance matrix, and V2(F)=VK(F), where
  VK(FY) = E[{(Y−μY)′ Σ_Y^{-1} (Y−μY)} (Y−μY)(Y−μY)′].   (21)

Besides being analytically tractable, the scatter functional VK(F) is one which arises in a classical algorithm for independent components analysis and is discussed in more detail in later sections. For the special case λ=1 and when μ1≠μ2, if we let η=α(1−α), then it can be shown that case (a) holds for η<1/6, case (b) holds for η>1/6 and case (c) holds for η=1/6. Also, for any of these three cases, we have ρ1(FY)−ρp(FY)=η|1−6η|θ²/(1+ηθ)², where θ=(μ1−μ2)′Γ^{-1}(μ1−μ2).
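
The following simulation (our illustration, with arbitrary parameter values) mimics the mixture above with λ = 1 and η = α(1 − α) < 1/6, using the sample covariance matrix and a sample version of VK; relative_eigen() is the sketch from Section 3. One root should separate from the remaining, nearly equal, roots, and its eigenvector should be approximately proportional to the Fisher discriminant direction Γ^{-1}(μ1 − μ2).

```r
# Sample version of V_K in (21), centred at the sample mean (cf. (24)).
covK <- function(Y) {
  s  <- mahalanobis(Y, colMeans(Y), cov(Y))
  Yc <- sweep(Y, 2, colMeans(Y))
  crossprod(sqrt(s) * Yc) / nrow(Y)
}
# Two-group normal mixture with lambda = 1 and eta = alpha(1 - alpha) < 1/6.
set.seed(1)
n <- 10000; p <- 5; alpha <- 0.1                  # eta = 0.09 < 1/6
grp <- rbinom(n, 1, alpha)
Y <- matrix(rnorm(n * p), n, p) + outer(grp, c(5, rep(0, p - 1)))
res <- relative_eigen(cov(Y), covK(Y))
round(res$roots, 2)   # one distinct root, the remaining roots (nearly) equal
round(res$H[, 1], 2)  # roughly proportional to (1, 0, 0, 0, 0), the discriminant direction
```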

Other examples have been studied in Caussinus and Ruiz‐Gazen (1993, 1995). In their work, V2(F)=Σ(F) and V1(F) corresponds to the functional version of the symmetrized version of the one‐step W‐estimate (7). Paraphrasing, they showed for the case λ=1 and for the class of weight functions u2(s)=u(βs) that case (a) holds for sufficiently small β provided that η<1/6. They did not note, though, that case (a) or (b) can hold for other values of β and η. The reason that the condition η<1/6 arises in their work, as well as in the discussion in the previous paragraph, is because their proof involves expanding u(βs) about u(s), with the matrix VK(F) then appearing in the linear term of the corresponding expansion of the one‐step W‐estimate about Σ(F).

Theorem 3 readily generalizes to a mixture of two elliptical distributions with equal shape matrices, but with possibly different location vectors and different spread functions, i.e., if Y has density
  fY(y) = (1−α) f(y; μ1, Γ, g1) + α f(y; μ2, Γ, g2),
where 0<α<1, μ1μ2 and f(y;μ,Γ,g) is defined by expression (6), then the results of theorem 3 hold. Note that this mixture distribution includes the case where both mixture components are from the same elliptical family but with proportional shape matrices. This special case corresponds to setting g2(s)=g1(s/λ), and hence f(y;μ2,Γ,g2)=f(y;μ2,λΓ,g1).

An extension of these results to a mixture of k elliptically symmetric distributions with possibly different centres and different spread functions, but with equal shape matrices, is given in the following theorem. Stated more heuristically, this theorem implies that Fisher's linear discriminant subspace (see for example Mardia et al. (1980)) corresponds to the span of some subset of the invariant co‐ordinates, even though the group identifications are not known.

Theorem 4. In addition to condition 1, suppose that Y has density
  fY(y) = α1 f(y; μ1, Γ, g1) + … + αk f(y; μk, Γ, gk),
where αj>0 for j=1,…,k, α1+…+αk=1, Γ ∈ 𝒫p and g1,…,gk are non‐negative functions. Also, suppose that the centres μ1,…,μk span some q‐dimensional hyperplane, with 0<q<p. Then, using the notation of theorem 2 for multiple roots, there is at least one root ρ(j), j=1,…,m, with multiplicity greater than or equal to p−q. Furthermore, if no root has multiplicity greater than p−q, then there is a root with multiplicity p−q, say ρ(t), such that
  span{Hq(FY)} = span{Γ^{-1}(μi − μj) : 1 ≤ i < j ≤ k},   (22)
where Hq(FY) = (h1(FY),…,h_{p1+…+p_{t−1}}(FY), h_{p1+…+p_t+1}(FY),…,hp(FY)).

The condition in theorem 4 that only one root has multiplicity p−q and no other root has a greater multiplicity reduces to cases (a) and (b) in theorem 3 when k=2. Analogously to the discussion given after theorem 3, this condition generally holds except for special cases. For a given choice of V1(FY) and V2(FY), these special cases depend on the particular values of the parameters.

5.2. Independent components analysis models

Independent components analysis (ICA) is a highly popular method within many applied areas which routinely encounter multivariate data. For a good overview, see Hyvärinen et al. (2001). The most common ICA model presumes that Y arises as a convolution of p independent components or variables, i.e. Y=BX, where B is non‐singular, and the components of X, say X1,…,Xp, are independent. The main objective of ICA is to recover the mixing matrix B so that we can ‘unmix’ Y to obtain the independent components X*=B^{-1}Y. Under this ICA model, there is some indeterminacy in the mixing matrix B, since the model can also be expressed as Y=B0X0, where B0=BQΛ and X0=Λ^{-1}Q′X, Q being a permutation matrix and Λ a diagonal matrix with non‐zero entries. The components of X0 are then also independent. Under the condition that at most one of the independent components X1,…,Xp has a normal distribution, it is well known that this is the only indeterminacy for B, and consequently the independent components X*=B^{-1}Y are well defined up to permutations and componentwise scaling factors.

The relationship between ICS and ICA for symmetric distributions is given in the next theorem.

Theorem 5. In addition to condition 1, suppose that Y=BX+μ, where B is non‐singular, and the components of X, say X1,…,Xp, are mutually independent. Further, suppose that X is symmetric about 0, i.e. X and −X have the same distribution, and the roots ρ1(FY),…,ρp(FY) are all distinct. Then, the transformed variable Z=H(FY)′Y consists of independent components or, more specifically, Z and X differ by at most a permutation and/or componentwise location and scale.

From the proof of theorem 5, it can be noted that the condition that X be symmetrically distributed about 0 can be relaxed to require that only p−1 of the components of X be symmetrically distributed about 0. It is also worth noting that the condition that all the roots be distinct is more restrictive than the condition that at most one of the components of X is normal. This follows since it is straightforward to show in general that, if the distributions of two components of X differ from each other by only a location shift and/or scale change, then there is at least one root having multiplicity greater than 1.

If X is not symmetric about 0, then we can symmetrize Y before applying theorem 5, i.e. suppose that Y=BX+μ with X having independent components, and let Y1 and Y2 be independent copies of Y. Then Ys=Y1−Y2=BXs, where Xs=X1−X2 is symmetric about zero and has independent components. Thus, theorem 5 can be applied to Ys. Moreover, since the convolution matrix B is the same for both Y and Ys, it follows that the transformed variable Z=H(FYs)′Y and X differ by at most a permutation and/or componentwise location and scale, where FYs refers to the symmetrized distribution of FY, i.e. the distribution of Ys.

An alternative to symmetrizing Y is to choose both V1(F) and V2(F) so that they satisfy the following independence property.

Definition 2. An affine equivariant scatter functional V(F) is said to have the ‘independence property’ if V(FX) is a diagonal matrix whenever the components of X are mutually independent, provided that V(FX) exists.

Assuming this property, Oja et al. (2006) proposed to use principal components on standardized variables as defined in remark 2 to obtain a solution to the ICA problem. Their solution can be restated as follows.

Theorem 6. In addition to condition 1, suppose that Y=BX+μ, where B is non‐singular, and the components of X, say X1,…,Xp, are mutually independent. Further, suppose that both scatter functionals V1(F) and V2(F) satisfy the independence property that is given in definition 2, and the roots ρ1(FY),…,ρp(FY) are all distinct. Then, the transformed variable Z=H(FY)′Y consists of independent components or, more specifically, Z and X differ by at most a permutation and/or componentwise location and scale.

The covariance matrix Σ(F) is of course well known to satisfy definition 2. It is also straightforward to show that the scatter functional VK(F) that is defined in equation (21) does as well. Theorem 6 represents a generalization of an early ICA algorithm that was proposed by Cardoso (1989) based on the spectral value decomposition of a kurtosis matrix. Cardoso's algorithm, which he called the fourth‐order blind identification algorithm, can be shown to be equivalent to choosing V1(F)=Σ(F) and V2(F)=VK(F) in theorem 6.
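
To illustrate theorem 6 in this FOBI‐type special case, the short simulation below (ours; source distributions and the mixing matrix are arbitrary choices) mixes two independent non‐normal sources and recovers them through the invariant co‐ordinates; covK() and relative_eigen() are the sketches given earlier.

```r
# FOBI-style ICS: V1 = sample covariance, V2 = sample version of V_K.
set.seed(7)
n <- 5000
X <- cbind(runif(n), rexp(n))        # independent sources, neither normal
B <- matrix(c(1, 2, -1, 1), 2, 2)    # arbitrary non-singular mixing matrix
Y <- X %*% t(B)                      # observed data y_i = B x_i
res <- relative_eigen(cov(Y), covK(Y))
Z <- Y %*% res$H                     # invariant co-ordinates
round(cor(Z, X), 2)                  # close to a signed permutation matrix
```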

It is worth noting that the independence property that is given by definition 2 is weaker than the property
  {V(FX)}ij = 0  whenever Xi and Xj are independent, i ≠ j.   (23)

The covariance matrix satisfies property (23), whereas VK(F) does not.

An often overlooked observation is that property (23) does not hold for robust scatter functionals in general, i.e. independence does not necessarily imply a zero pseudocorrelation. It is an open problem what scatter functionals other than the covariance matrix, if any, satisfy property (23). Furthermore, robust scatter functionals tend not to satisfy in general even the weaker definition 2. At symmetric distributions, though, the independence property can be shown to hold for general scatter matrices in the following sense.

Theorem 7. Let V(F) be a scatter functional satisfying condition (3). Suppose that the distribution of X is symmetric about some centre μ ∈ ℜp, with the components of X being mutually independent. If V(FX) exists, then it is a diagonal matrix.

Consequently, given a scatter functional V(F), we can construct a new scatter functional satisfying definition 2 by defining Vs(F)=V(Fs), where Fs represents the symmetrized distribution of F. Using symmetrization to obtain scatter functionals which satisfy the independence property has been studied recently by Taskinen et al. (2007).

Finally, we note that the results of this section can be generalized in two directions. First, we consider the case of multiple roots, and next we consider the case where only blocks of the components of X are independent.

Theorem 8. In addition to condition 1, suppose that Y=BX+μ, where B is non‐singular, and the components of X, say X1,…,Xp, are mutually independent. Further, suppose that either

  • (a) X is symmetric about 0, i.e. X and −X have the same distribution, or
  • (b) both V1(F) and V2(F) satisfy definition 2.

Then, using the notation of theorem 2 for multiple roots, for the transformed variable Z=H(FY)′Y the random vectors Z(1),…,Z(m) are mutually independent.

Theorem 9. In addition to condition 1, suppose that Y=BX+μ, where B is non‐singular, and X=(X(1)′,…,X(m)′)′ has mutually independent components X(1) ∈ ℜp1,…,X(m) ∈ ℜpm, with p1+…+pm=p. Further, suppose that X is symmetric about 0, and the roots ρ1(FY),…,ρp(FY) are all distinct. Then, there is a partition {J1,…,Jm} of {1,…,p} with the cardinality of Jk being pk for k=1,…,m such that, for the transformed variable Z=H(FY)′Y, the random vectors
  Z(j) = {Zi : i ∈ Jj},  j = 1,…,m,
are mutually independent. More specifically, Z(j) and X(j) are affine transformations of each other.

From the proof of theorem 9 in Appendix A, it can be noted that the theorem still holds if one of the X(j)s is not symmetric. If the distribution of X is not symmetric, theorems 8 and 9 can be applied to Ys, the symmetrized version of Y. To generalize theorem 6 to the case where blocks of the components of X are independent, a modification of the independence property is needed. Such generalizations of definition 2, theorem 6 and theorem 7 are fairly straightforward and so are not treated formally here.

Remark 3. The general case of multiple roots for the setting that is given in theorem 9 is more problematic. The problem stems from the possibility that a multiple root may not be associated with a particular X(j) but rather with two or more different X(j)s. For example, consider the case p=3 and X=(X(1)′, X(2))′, with X(1) ∈ ℜ2 and X(2) ∈ ℜ. For this case, V1(FX)^{-1}V2(FX) is block diagonal with diagonal blocks of order 2 and 1. The three eigenvalues ρ1(FY), ρ2(FY) and ρ3(FY) correspond to the two eigenvalues of the diagonal block of order 2 and to the last diagonal element, but not necessarily respectively. So, if ρ1(FY)=ρ2(FY)>ρ3(FY), this does not imply that the last diagonal element corresponds to ρ3(FY), and hence Z(1) ∈ ℜ2 and Z(2) ∈ ℜ, as defined in theorem 2, are not necessarily independent.

6. Discussion and examples

Although the theoretical results of this paper essentially apply to any pair of scatter matrices, in practice the choice of scatter matrices can affect the resulting ICS method. From our experience, for some data sets, the choice of the scatter matrices does not seem to have a big influence on the diagnostic plots of the ICS variables, particularly when the data are consistent with one of the mixture models or one of the independent component models that were considered in Section 5. For some other data sets, however, the resulting diagnostic plots can be quite sensitive to the choice of the scatter matrices. In general, different pairs of scatter matrices may reveal different types of structure in the data, since departures from an elliptical distribution can come in many forms. Consequently, it is doubtful whether any specific pair of scatter matrices is best for all situations. Rather than choosing two scatter matrices beforehand, especially when one is in a purely exploratory situation having no idea of what to expect, it would be reasonable to consider a number of different pairs of scatter matrices and to consider the resulting ICS transformations as complementary.

A general sense of how the choice of the pair of scatter matrices may impact the resulting ICS method can be obtained by a basic understanding of the properties of the scatter matrices being used. For the purpose of this discussion, we divide the scatter matrices into three broad classes. Class I scatter statistics will refer to those which are not robust in the sense that their breakdown point is essentially zero. This class includes the sample covariance matrix, as well as the one‐step W‐estimates defined by expression (7) and their symmetrized version. Other scatter statistics which lie within this class are the multivariate sign and rank scatter matrices; see for example Visuri et al. (2000). Class II scatter statistics will refer to those which are moderately robust in the sense that they have bounded influence functions as well as positive breakdown points, but with breakdown points being no greater than 1/(p+1). This class primarily includes the multivariate M‐estimates, but it also includes among others the sample covariance matrices that are obtained after applying either convex hull peeling or ellipsoid hull peeling to the data; see Donoho and Gasko (1992). Class III scatter statistics will refer to the high breakdown point scatter matrices which are discussed in Section 2.2. The symmetrized version of a class II or III scatter matrix, as well as the one‐step W‐estimates of scatter (10) which uses an initial class II or III scatter matrix for downweighting, are viewed respectively as class II or III scatter matrices themselves.

If one or both scatter matrices are from class I, then the resulting ICS transformation may be heavily influenced by a few outliers at the expense of finding other structures in the data. In addition, even if there are no spurious outliers and a mixture model or an independent components model of the form that was discussed in Section 5 holds, but with long‐tailed distributions, then the resulting sample ICS transformation may be an inefficient estimate of the corresponding population ICS transformation. Simulation studies that were reported in Nordhausen, Oja and Ollila (2008) have shown that for ICA an improved performance is obtained by choosing robust scatter matrices for the ICS transformation. Nevertheless, since they are simple to compute, the use of class I scatter matrices can be useful if the data set is known not to contain any spurious outliers or if the objective of the diagnostics is to find such outliers, as recommended in Caussinus and Ruiz‐Gazen (1990).

If we use class II or III scatter matrices, then we can still find spurious outliers by plotting the corresponding robust Mahalanobis distances. The resulting ICS transformation, though, would not be heavily affected by the spurious outliers. Outliers affect class II scatter matrices more so than class III scatter matrices, although even a high proportion of spurious outliers may not necessarily affect the class II scatter matrices. For outliers to affect a class II scatter matrix heavily, they usually need to lie in a cluster; see for example Dümbgen and Tyler (2005). The results of Section 5.1 though suggest that such clustered outliers can be identified after making an ICS transformation, even if they cannot be identified by using a robust Mahalanobis distance based on a class II statistic.

Using two class III scatter matrices for an ICS transformation may not necessarily give good results, unless we are interested only in the structure of the ‘inner’ 50% of the data. For example, suppose that the data arise from a 60–40 mixture of two multivariate normal distributions with widely separated means but equal covariance matrices. A class III scatter matrix is then primarily determined by the properties of the 60% component. Consequently, when using two class III scatter matrices for ICS the corresponding ICS roots will tend to be equal or nearly equal. In the case where all the roots are equal, theorem 3 does not apply. In the case where the roots are nearly equal, owing to sampling variation, the sample ICS transformation may not satisfactorily uncover Fisher's linear discriminant function.

A reasonable general choice for the pair of scatter matrices to use for an ICS transformation would be to use one class II and one class III scatter matrix. If we wish to avoid the computational complexity that is involved with a class III scatter matrix, then using two class II scatter matrices may be adequate. In particular, we could choose a class II scatter matrix whose breakdown point is close to 1/(p+1), such as the M‐estimate corresponding to the maximum likelihood estimate for an elliptical Cauchy distribution (Dümbgen and Tyler, 2005), together with a corresponding one‐step W‐estimate for which ψ(s)=su2(s)→0 as s→∞. Such a one‐step W‐estimate of scatter has a redescending influence function. From our experience, the use of a class III scatter matrix for ICS does not seem to reveal any data structures that cannot be obtained otherwise.

The remarks and recommendations that are made here are highly conjectural. The question of what pairs of scatter matrices are best at detecting specific types of departure from an elliptical distribution remains a broad open problem. In particular, it would be of interest to discover for what types of data structures it would be advantageous to use at least one class III scatter matrix in the ICS method. Most likely, some advantages may arise when working with very high dimensional data sets, in which case the computational intensity that is needed to compute a class III scatter matrix is greatly amplified; see for example Rousseeuw and van Driessen (1999).

We demonstrate some of the concepts in the following examples. These examples illustrate for several data sets the use of the ICS transformation for constructing diagnostic plots. They also serve as illustrations of the theory that has been presented in the previous sections.

6.1. Example 1

Rousseeuw and van Driessen (1999) analysed a data set consisting of n=677 metal plates on which p=9 characteristics are measured. For this data set they computed the sample mean and covariance matrix as well as the minimum covariance determinant estimate of centre and scatter. Their paper helps to illustrate the advantage of using high breakdown point multivariate estimates, or class III statistics, for uncovering multiple outliers in a data set.

For our illustration, we choose two class II location and scatter statistics. The first estimate (μ̂1, V̂1) is taken to be the maximum likelihood estimate that is derived from an elliptical Cauchy distribution. This corresponds to an M‐estimate (9) with u1(s)=u2(s)=(p+1)/(s+1) and u3(s)=1. The M‐estimating equations for this M‐estimate are known to admit a unique solution in general, which can be found via a simple reweighting algorithm regardless of the initial value.

For our second estimate (μ̂2, V̂2), we take the sample mean vector and sample covariance matrix, using only the inner 50% of the data as measured by the Mahalanobis distances that are derived by using the Cauchy M‐estimate. This corresponds to a multivariate one‐step W‐estimate of scatter (10) with u1(s)=u2(s)=I(s ≤ 1) and u3(s)=1, and with the initial estimates taken to be μ̂1 and V̂1, the latter rescaled so that the median of the resulting Mahalanobis distances si equals 1.
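
A sketch of how this second estimate might be computed is given below (the exact scaling convention is our assumption); mu0 and V0 would be initial location and scatter estimates, for instance the Cauchy M‐estimates returned by the reweighting sketch in Section 2.2.

```r
# Sample mean and covariance of the inner 50% of the data, as measured by
# Mahalanobis distances based on an initial (robust) location and scatter.
inner_half_mean_cov <- function(Y, mu0, V0) {
  s <- mahalanobis(Y, mu0, V0)
  keep <- s <= median(s)                       # the inner 50% of the points
  list(location = colMeans(Y[keep, , drop = FALSE]),
       scatter  = cov(Y[keep, , drop = FALSE]))
}
```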

Figs 1(a) and 1(b) show the Mahalanobis distance plots for (μ̂1, V̂1) and (μ̂2, V̂2) respectively, with Fig. 1(c) being a scatter plot of these two sets of distances. These plots are somewhat similar to the plots that are based on the classical Mahalanobis distances and those based on the minimum covariance determinant given in Rousseeuw and van Driessen (1999). As noted in Rousseeuw and van Driessen (1999), Figs 1(b) and 1(c) indicate that there are at least three distinct groups: the first 100 points, those with index 491–565 and the rest. The index itself is a factor that is not taken into account in obtaining the Mahalanobis distances. It represents an order of production and is clearly an important factor. The effect of the index is also apparent in some of the plots of the individual variables.

Fig. 1. Example 1: Mahalanobis distances based on (a) (μ̂1, V̂1), (b) (μ̂2, V̂2) and (c) the distances in (b) plotted against those in (a)

The comparative Mahalanobis distance plot that is given in Fig. 1(c) indicates that the data do not arise from an elliptically symmetric distribution. Otherwise, the scatter plot of the two distances would be approximately linear since the two location statistics would be estimating the same centre and the two scatter matrices would be estimating the same population shape matrix up to a proportionality constant, and consequently the resulting Mahalanobis distances would be approximately proportional to each other. The non‐elliptical nature of the data can most likely be attributed to a mixture resulting from the index factor.

The affine invariant plots that are given in Fig. 1 do not reveal whether the three groups that are observed in the plots correspond to three clusters, since the Mahalanobis distances give no indication of the relative distance of the points from each other. A more complete affine invariant view of the data can be obtained from a pairs plot of the ICS transformation of the data based on the scatter matrices V̂1 and V̂2 described above. For this analysis, the resulting ICS roots are inline image. Fig. 2(a) shows the scatter plot for the first two ICS components, with Figs 2(b) and 2(c) showing the first two ICS components separately. The three groups can also be seen in these plots. Moreover, we can ascertain how the groups differ. In particular, it can be noted that the group that is associated with index 491–565 essentially lies in a particular direction from the rest of the data, namely that determined by the first ICS component, whereas the first 100 points essentially lie in a different direction, determined by the second ICS component. Finally, if we plot the other ICS components, various isolated outliers also become visible.

Fig. 2. Example 1: first and second ICS co‐ordinates based on V̂1 and V̂2

6.2. Example 2

The pairs plot that is given in Fig. 3(a) arises from simulation of a random sample of size n=500 from a p=4 dimensional distribution. Arguably nothing seems particularly remarkable about this data set. The sample variance–covariance matrix of the data set in Fig. 3(a) is the identity matrix and so a principal components analysis does not indicate any particular direction of interest. If we apply an ICS transformation to these data, however, we can uncover an interesting hidden structure in the data as seen in the pairs plot given in Fig. 3(b). The corresponding ICS roots are inline image. For this example, the original data yi correspond to an affine transformation of a distribution that is generated by simulating a uniform distribution on the unit circle, to which independent normal noise with mean 0 and standard deviation 0.01 is added, concatenated with a standard normal distribution and a t‐distribution on 5 degrees of freedom. Note that, no matter how the simulated data are affinely transformed, the resulting ICS co‐ordinates are always given by Fig. 3(b).

Fig. 3. Example 2: (a) a simulated four‐dimensional data set and (b) the ICS co‐ordinates by using Sn and V̂K

The two scatter matrices that are used here for the ICS transformation are the sample covariance matrix Sn and the sample version of the scatter matrix VK(F) given in equation (21), namely
  V̂K = ave{ si (yi − ȳn)(yi − ȳn)′ },  where si = (yi − ȳn)′ S_n^{-1} (yi − ȳn).   (24)
V̂K can be viewed as a multivariate one‐step W‐estimate of scatter (10), with the sample mean vector ȳn and covariance matrix Sn as the initial statistics, obtained by weighting each point by its classical squared Mahalanobis distance, i.e. choosing u2(s)=s and u3(s)=1. As an estimate of scatter, V̂K is obtained by actually upweighting outliers. Even though neither Sn nor V̂K are robust estimates of scatter, they can uncover the structure in this particular data set since it contains no spurious outliers. Such a structure would be difficult to detect, with or without spurious outliers, if we were to consider only the Mahalanobis distances or a robust version of them. Similar results to those displayed in Fig. 3(b) arise when using almost any other pair of scatter matrices for the ICS transformation.

Note that none of the theorems that were given in Section 5.2 are directly applicable to this example, but rather this example is of the type that is discussed in remark 3. For the functional version of this example, we have Y=AX with X=(X(1)′, X(2), X(3))′, with the distribution of X(1) ∈ ℜ2 being that of the uniform distribution on the unit circle plus bivariate spherical normal noise with variances 0.1, X(2) ∈ ℜ having a standard normal distribution, X(3) ∈ ℜ having a t‐distribution on 5 degrees of freedom and with X(1), X(2) and X(3) being mutually independent. By using invariance arguments, it can be shown that, regardless of the choice of the two scatter matrices, at least two of the roots ρ1(FY),…,ρ4(FY) are equal, and hence there are at most three distinct roots. For the case of three distinct roots, the two ICS variables that are associated with the multiple root correspond to an affine transformation of X(1), and the ICS variables that are associated with the two distinct roots correspond to univariate linear transformations of X(2) and X(3). The case of three distinct roots tends to hold except for very special choices of V1(F) and V2(F). In particular, it can be shown to hold for the choice V1(F)=Σ(F) and V2(F)=VK(F), with the smallest root being the multiple root, the largest root being associated with X(3) and the second‐largest root being associated with X(2). Hence, the results that are displayed in Fig. 3 are as expected.
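
Readers wishing to experiment with this example can generate data of the same type as follows (a sketch with an assumed noise level and an arbitrary mixing matrix); relative_eigen() and covK() are as sketched earlier.

```r
# Simulate a noisy circle, a standard normal and a t_5 component, apply an
# arbitrary affine transformation, and inspect the invariant co-ordinates.
set.seed(3)
n <- 500
phi <- runif(n, 0, 2 * pi)
X <- cbind(cos(phi) + rnorm(n, sd = 0.1), sin(phi) + rnorm(n, sd = 0.1),
           rnorm(n), rt(n, df = 5))
A <- matrix(rnorm(16), 4, 4)          # arbitrary (almost surely non-singular) mixing
Y <- X %*% t(A)
res <- relative_eigen(cov(Y), covK(Y))
pairs(Y %*% res$H)   # the pair with the smallest, nearly equal, roots shows the circle
```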

6.3. Other examples

We briefly explain here the results of some other examples. The first is the classical Fisher iris data, which can be found in the statistical package R (R Development Core Team, 2005). This data set consists of p=4 measurements, namely sepal length, sepal width, petal length and petal width, on n=150 iris flowers. The 150 flowers belong to three different varieties of irises. Suppose that we ignore the group classification of the data and perform an ICS transformation of the n=150 data points by using the sample covariance matrix and a Cauchy M‐estimate. It turns out that the first ICS component is almost identical with the first linear discriminant function that we would obtain if we did a discriminant analysis using the varieties as the group variable, with a sample correlation between the two being 0.99, even though the former does not take the group classification into account. The results of the ICS method for this example are similar for almost any pair of scatter matrices that we may choose. This can be attributed to the data being consistent with the mixture models that are discussed in Section 5.1 together with the absence of any obvious outliers.

The next example uses the modified wood gravity data set that was given in Rousseeuw and Leroy (1987). This data set is frequently used to illustrate outlier detection methods. It consists of n=20 observations in p=6 dimensions, of which four are artificial outliers that were put into the data set by the original authors. Rousseeuw and Leroy (1987) demonstrated how classical outlier detection methods fail to uncover these outliers, whereas they are readily uncovered by using Mahalanobis distances based on high breakdown point location and scatter statistics. For this data set, we compute a Cauchy M‐estimate and a t2M‐estimate, which have breakdown points of 1/7=0.143 and 1/8=0.125 respectively. Unlike example 1, neither of the corresponding Mahalanobis distance plots, which are given in Figs 4(a) and 4(b), reveals any outliers, and the two plots are fairly similar. Since the proportion of contamination is 4/20=0.20, we would not expect the Mahalanobis distances based on either of the two M‐estimates to reveal the outliers if the outliers formed a cluster; see Tyler (2002). However, if the outliers do form a cluster, then the results of Section 5.1 suggest that an ICS transformation based on these two scatter estimates may separate the main cluster of data from the cluster of four outliers. Such is the case here, with all four outliers clearly appearing in the first ICS co‐ordinate; see Fig. 4(c). The results here are again not heavily dependent on the scatter matrices being used in the ICS transformation.
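For readers who wish to reproduce this kind of comparison, the Cauchy and t2 M-estimates can be computed by the standard EM/fixed-point iteration for the multivariate t maximum likelihood problem; the sketch below is hypothetical code showing one common way to compute such estimates, not the authors' own routine. It returns the location vector and scatter matrix for a given value of the degrees of freedom nu (nu = 1 for the Cauchy M-estimate, nu = 2 for the t2 M-estimate).

import numpy as np

def t_mestimate(X, nu, n_iter=500, tol=1e-9):
    # EM/fixed-point iteration for the multivariate t maximum likelihood
    # estimates of location and scatter (nu = 1: Cauchy; nu = 2: t2).
    n, p = X.shape
    mu, V = X.mean(axis=0), np.cov(X, rowvar=False)
    for _ in range(n_iter):
        xc = X - mu
        d2 = np.einsum('ij,jk,ik->i', xc, np.linalg.inv(V), xc)
        w = (p + nu) / (nu + d2)                    # down-weights distant points
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        xc = X - mu_new
        V_new = (w[:, None] * xc).T @ xc / n
        if max(np.abs(V_new - V).max(), np.abs(mu_new - mu).max()) < tol:
            return mu_new, V_new
        mu, V = mu_new, V_new
    return mu, V

The corresponding Mahalanobis distances (as in Figs 4(a) and 4(b)) and the ICS transformation based on the two resulting scatter matrices can then be computed as in the previous sketch.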

Fig. 4. Example of ICS on the modified wood gravity data set

As a final example, consider the RANDU data set, which can also be found in R (R Development Core Team, 2005); see Fig. 5(a). This data set consists of n=300 observations in p=3 dimensions which were supposedly generated by a random‐number generator. In reality, though, the data lie on parallel planes which are not apparent in the original co‐ordinates. However, if we transform this data set to the ICS co‐ordinates by using the sample covariance matrix Sn and the one‐step W‐estimate based on pairwise differences given by
image(25)
then the parallel plane structure in the data becomes apparent in a pairs plot; see Fig. 5(b).
Fig. 5. Example of ICS on the RANDU data set

In the last example, the presumed distribution from which the data arise is not an elliptical distribution but rather a uniform distribution within the unit cube, which falls within the class of distributions that was discussed in remark 1. Thus, we might expect that a departure from this presumed distribution would be reflected in an ICS analysis. The parallel lines in the data set may be viewed as arising from a location mixture of symmetric singular distributions. Consequently the behaviour of an ICS transformation for mixture distributions that was given in Section 5.1, although not directly applicable since the mixture components are not strictly elliptically distributed, gives some rationale about why such a pattern can be detected. For this example, though, the resulting ICS pairs plots are fairly sensitive to the choice of the scatter matrices being used. In particular, if we replace the square term in the denominator of equation (25) with a power of q, then for q<1.5 the lines in the RANDU data set are not very apparent. In contrast, the results do not appear to be heavily dependent on q for q>1.5.
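Since equation (25) is not reproduced above, the following sketch implements a pairwise-difference W-estimate of the general form described, with the exponent q exposed as a tuning parameter (q = 2 playing the role of the 'square term in the denominator'); the precise weighting scheme is an assumption made for illustration and is not claimed to be exactly the estimator (25).

import numpy as np

def pairwise_w_scatter(X, q=2.0):
    # One-step W-estimate built from pairwise differences, with each
    # difference down-weighted by the q-th power of its squared Mahalanobis
    # distance (taken with respect to the sample covariance matrix).
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    i, j = np.triu_indices(len(X), k=1)
    D = X[i] - X[j]                                # all pairwise differences
    d2 = np.einsum('ij,jk,ik->i', D, S_inv, D)     # squared Mahalanobis norms
    w = 1.0 / d2 ** q
    return (w[:, None] * D).T @ D / w.sum()

Re-running the ICS step with Sn and this matrix for a grid of q values is a direct way to examine the sensitivity to q noted above.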

7. Concluding remarks

7.1. Relationship to projection pursuit

Aside from its relationship to mixture models and to ICA, the concept underlying the ICS method has similarities to projection pursuit methods, and in particular to what Huber (1985) referred to as class III projection pursuit approaches, i.e. approaches which investigate the affine invariant aspects of the data. In projection pursuit, we typically seek interesting, usually meaning non‐normal, projections of the data; see for example Cook et al. (1993), Friedman and Tukey (1974), Huber (1985) and Jones and Sibson (1987). The evaluation of what makes a projection interesting in the projection pursuit context depends solely on the distribution of the particular projection. In general, the pursuit in projection pursuit methods tends to be computationally intensive. In contrast, the value of κ(h) given by equation (14) and used in ICS is not strictly a function of the distribution of the linear combination h^TY but rather depends on the multivariate distribution F through V1(F) and V2(F). The sequential optimization of κ(h) as given by equations (15) and (16) has an analytic solution in terms of eigenvectors and so is not computationally intensive. In this sense, ICS can be viewed as projection pursuit without the pursuit effort.
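For orientation, the generalized kurtosis measure referred to here can be written as a generalized Rayleigh quotient; the following sketch is written from the description in the text and is not a verbatim copy of equations (14)–(16):

\[
\kappa(h) \;=\; \frac{h^{\top} V_2(F)\,h}{h^{\top} V_1(F)\,h},
\qquad h \in \mathbb{R}^p\setminus\{0\},
\]

with the successive directions obtained sequentially as

\[
h_j \;=\; \arg\max_{h}\,\kappa(h)
\quad\text{subject to}\quad h^{\top} V_1(F)\,h_k = 0,\; k=1,\dots,j-1,
\]

so that the solutions are eigenvectors of $V_1(F)^{-1}V_2(F)$ with $\kappa(h_j)=\rho_j(F)$, which is why no numerical pursuit is needed.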

The relationship between projection pursuit itself and ICA is well documented; see for example Hyvärinen et al. (2001). Almost all algorithms proposed for the ICA problem tend to be of a projection pursuit nature. One notable exception, though, is the previously cited algorithm that was proposed by Cardoso (1989). It is also worth noting that a relationship between projection pursuit based on kurtosis and multivariate normal mixtures has been observed by Peña and Prieto (2001). They showed that, for a mixture of two multivariate normal distributions with equal covariance matrices, either the projection which minimizes or the projection which maximizes the classical kurtosis coefficient corresponds to Fisher's linear discriminant function between the two elements of the mixture. An analogous result was obtained by Yenyukov (1988) when using the ratio of a robust variance to the sample variance as a projection index. He also proposed to use κ(h) based on the sample covariance matrix and a robust estimate of the covariance matrix as an approximation to such a projection index.

7.2. Other related methods

As noted in Section 1, ICS can be viewed as a more general formulation of what was referred to by Ruiz‐Gazen (1993) and Caussinus and Ruiz‐Gazen (1995) as generalized principal components analysis. They used this terminology since the matrix equation (11) is commonly referred to as a generalized eigenvalue–eigenvector problem. The term generalized principal components analysis, however, is often used in the literature to describe various unrelated generalizations of principal components. Hence, to distinguish this method from other methods that are referred to as generalized principal components, as well as to emphasize a central property of the method, we use the term ICS. Furthermore, we do not view ICS as a generalization of principal components analysis; nor do we consider ICS to be a competitor to principal components analysis or to a robust version of it, but rather a complementary method. Principal components analysis is concerned with understanding the spread or scatter of a data cloud, a property which cannot be identified within an affine invariant setting. As suggested by Huber (1985), a fuller understanding of a data set is obtained by exploring its affine invariant aspects in addition to its location–scale information.

Recently, Critchley et al. (2007) proposed to perform a principal components analysis on standardized data, as described in remark 2, which they referred to as principal axis analysis. Their proposal thus corresponds to the special case of the ICS transformation when we take inline image, the sample covariance matrix, and inline image to be a one‐step reweighted covariance matrix with the weights corresponding to the inverse of the classical squared Mahalanobis distances, i.e. inline image is a one‐step W‐estimate as defined in expression (7) with weight functions u2(s)=1/s and u3(s)=1, using the sample mean and covariance matrix as the initial estimates. They referred to their approach as principal axis analysis since inline image depends on the standardized sample vectors inline image only through their pairs of opposed directions ±Xi/‖Xi‖. In Critchley et al. (2007), heuristic arguments are given to motivate the use of principal axis analysis for detecting well‐separated clusters when the data arise from a mixture of elliptical distributions, even one with possibly different shape matrices.
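A compact sketch of this special case (hypothetical code; scaling constants are omitted since ICS is unaffected by proportionality):

import numpy as np
from scipy.linalg import eigh

def paa_scatter(X):
    # One-step W-estimate with weights u2(s) = 1/s and u3(s) = 1: each centred
    # observation is weighted by the inverse of its classical squared
    # Mahalanobis distance, so the matrix depends on the standardized
    # observations only through their directions.
    xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', xc, S_inv, xc)
    return (xc / d2[:, None]).T @ xc / len(X)

# Principal axis analysis viewed as a special case of ICS:
# rho, H = eigh(paa_scatter(X), np.cov(X, rowvar=False))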

Another approach for generating affine invariant co‐ordinates, which is well known within the area of multivariate non‐parametric statistics, is the transformation–retransformation (TR) approach that was proposed by Chakraborty and Chaudhuri (1996, 1998). The basic idea behind the TR approach in the one‐sample problem is to transform the multivariate data by multiplying each observation by the inverse of a matrix containing p of the observations. The TR approach, though, is not invariant under permutation of the n observations, unless either the p observations that are used for standardizing are chosen randomly or some permutation invariant criterion is used to select them. In any event, it is difficult to express the TR approach in terms of functionals, and this makes the theoretical properties of the TR transformation problematic to study. The TR transformation, however, is not meant to be an exploratory transformation of the data, but rather a step that is used for generating affine invariant multivariate non‐parametric tests. Likewise, an ICS transformation can also be used for generating such tests (see for example Nordhausen et al. (2006)), or in general for defining multivariate versions of univariate concepts. An affine equivariant componentwise multivariate median for Y ∈ ℜp, for example, can be defined by μY, where inline image with Z=H(FY)Y corresponding to an ICS transformation. In such settings, the main focus of ICS is not on dimension reduction but rather on the complete affine invariant co‐ordinate system.
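As an illustration of this last point, the sketch below constructs such a median by taking componentwise medians in the invariant co-ordinates and mapping the result back; the back-transformation by the inverse of H is one plausible reading of the construction and is an assumption here.

import numpy as np
from scipy.linalg import eigh

def ics_componentwise_median(X, V1, V2):
    # Componentwise median computed in ICS co-ordinates and mapped back.
    _, H = eigh(V2, V1)                   # columns of H define the ICS directions
    Z = X @ H                             # each row is H' applied to an observation
    med_Z = np.median(Z, axis=0)          # ordinary componentwise median of Z
    return np.linalg.solve(H.T, med_Z)    # mu_Y solving H' mu_Y = med_Z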

7.3. Summary and continuing research

In this paper, we have introduced the concept of ICS as a general affine invariant method for exploring multivariate data. After removing the effect of the centre and scatter from a multivariate data set, ICS essentially addresses the question of whether there is anything else of interest in the data set. The paper also shows how an ICS transformation theoretically behaves under elliptical mixture models and under ICA models.

From a statistical modelling perspective, one might argue that ICA models may seem impractical except for very specific problems. Nevertheless, ICA algorithms have become increasingly popular in many areas that routinely apply multivariate methods and often yield interesting results even when the ICA model may seem unrealistic. One of the original goals of this paper was to give a model‐free explanation about why this might be so. The results in this paper relating ICS with both mixture models and with ICA provide one such explanation.

As noted in Section 6, the theoretical results of this paper apply to essentially any choice of two scatter matrices. The statistical variability and the robustness properties of the ICS transformations, though, do depend on the particular scatter matrices being used. The results concerning ICS under mixture models that were given in Section 5.1 suggest that the method may have some natural robustness properties, at least in terms of detecting clusters of outliers, even if the estimates themselves are not particularly robust.

As with all eigenvector methods, the stability of the ICS transformations will depend on the spread of the theoretical roots ρ1(F),…,ρp(F), which in turn depends on the choice of the scatter matrices. The effects of the choice of the scatter matrices on the resulting ICS method, as well as the statistical properties that are associated with ICS methods, are currently being studied by the authors and their students. Rather than choose two scatter matrices beforehand, one promising strategy seems to be to allow the data to choose two scatter matrices from among a large class of scatter matrices based on the observed separation of the respective roots. Such a data‐driven approach unfortunately makes a theoretical study of the statistical properties of the resulting method far more challenging.
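One naive version of such a data-driven strategy is sketched below (hypothetical code; the separation criterion, here simply the ratio of the largest to the smallest sample root, is an assumption for illustration):

import numpy as np
from scipy.linalg import eigh

def choose_scatter_pair(X, estimators):
    # 'estimators' maps names to functions returning a scatter matrix for X;
    # keep the ordered pair whose sample ICS roots are most widely separated.
    mats = {name: est(X) for name, est in estimators.items()}
    best, best_sep = None, -np.inf
    for name1, V1 in mats.items():
        for name2, V2 in mats.items():
            if name1 == name2:
                continue
            rho = eigh(V2, V1, eigvals_only=True)
            sep = rho.max() / rho.min()
            if sep > best_sep:
                best, best_sep = (name1, name2), sep
    return best, best_sep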

Acknowledgements

The authors are grateful to Klaus Nordhausen for his help in providing R code for the examples and illustrations, to Anne Ruiz‐Gazen for bringing to our attention the literature on generalized principal components analysis and to Stefan Van Aelst for providing access to the data set that was used in example 1.

The research of the first author was supported by National Science Foundation grant DMS‐0604596. The research of the third author was supported by the Swiss National Science Foundation.

    Appendix

    Appendix A: Proofs

    A.1. Proof of theorems 1 and 2

    From property (3), it follows that there are γ1>0 and γ2>0 such that V1(FY*)=γ1 A V1(FY) A^T and V2(FY*)=γ2 A V2(FY) A^T. By definition, V2(FY*) hj(FY*)=ρj(FY*) V1(FY*) hj(FY*) and so
    image(26)
    where γ=γ1/γ2. This implies that condition (18) holds. If ρj(FY) is a distinct root, then equation (26) also implies that hj(FY)=aj A^T hj(FY*) for some scalar aj≠0, and so
    image
    which completes the proof for theorem 1. Consider now the case of a multiple root, say ρ(k)=ρj1(FY)=…=ρj2(FY), where j2=j1+pk−1, and let H(k)(F)=(hj1(F),…,hj2(F)). As a consequence of a multiple root, the exact choice of H(k)(F) is somewhat arbitrary unless some rule is specified about how to choose its columns. However, the span of H(k)(F) is uniquely defined and so, no matter what rule we use to define H(k)(F), equation (26) implies that inline image for some non‐singular matrix Bk. This implies that
    image
    which completes the proof for theorem 2.

    A.2. Proof of theorems 3 and 4

    Since theorem 3 is a special case of theorem 4, it is only necessary to prove the latter. Using the notation of theorem 4, let inline image, where M=(μ1,…,μk) and 1k ∈ ℜk is a vector of 1s. Since M0 has rank q, the triangular decomposition for matrices gives
    image
    with P being an orthogonal matrix of order p and Tu being an upper triangular matrix of order q. The distribution of X=PΓ−1/2(Y−μk) is then a mixture of k spherical distributions with centres t1,…,tk, where tq+1=…=tk=0, and spread functions gi, i=1,…,k, i.e. the density of X is given by
    image
    The distributions of X and QX are thus the same for any orthogonal Q of the form
    image
    where Iq is the identity matrix of order q and Q22 is an orthogonal matrix of order p−q. Thus, given a scatter functional V(F) satisfying condition (3), V(FX)=V(FQX)∝Q V(FX) Q^T, for any such Q, and so
    image(27)
    for any orthogonal matrix Q22. Note that equality holds in expression (27) rather than just proportionality since the upper block diagonal matrices are equal (and non‐zero). By making appropriate choices for Q22 in expression (27) we obtain V12(FX)=0 and V22(FX)=γIp−q, for some γ>0. Thus, for the two scatter functionals V1(F) and V2(F),
    image

    This matrix has at least one root with multiplicity greater than or equal to p−q. By theorem 2, we know that the roots of V1(FY)−1V2(FY) are proportional to the roots of V1(FX)−1V2(FX), and so at least one of the roots ρ(j) has a multiplicity that is greater than or equal to p−q.

    Suppose now that no root has multiplicity greater than p−q, which by theorem 2 applies to V1(FX)−1V2(FX) as well as to V1(FY)−1V2(FY). For V1(FX)−1V2(FX), one root with multiplicity p−q must be γ2/γ1. Also, the q‐dimensional subspace that is spanned by the eigenvectors of V1(FX)−1V2(FX), other than those associated with γ2/γ1, is the same as the subspace that is spanned by (Iq 0)^T, or equivalently it is the same as the subspace that is spanned by T. From the shape equivariant property (3), we have V1(FX)−1V2(FX)∝PΓ1/2V1(FY)−1V2(FY)Γ−1/2P^T, and so it follows that, if a is an eigenvector of V1(FX)−1V2(FX), then h=Γ−1/2P^T a is an eigenvector of V1(FY)−1V2(FY). If the eigenvector a is associated with the root γ2/γ1, then h is associated with some root, say ρ(t), with multiplicity p−q. The subspace that is spanned by all the eigenvectors of V1(FY)−1V2(FY), other than those associated with ρ(t), is thus the same as the subspace that is spanned by inline image, and hence equation (22) holds.

    A.3. Proof of theorem 5

    The symmetry of X, along with X having independent components, implies that inline image for any diagonal matrix inline image having only 1s and −1s as entries, i.e. for matrices of the form inline image. So, for any scatter functional V(F) which satisfies the shape equivariant property (3), it follows that inline image, for any such inline image. The last equality follows from property (3) since the diagonal components of V(FX) and SV(FX)S are the same. By choosing inline image, we note that all the off‐diagonal terms in the first row and in the first column of V(FX) must be 0. Continuing, we conclude that V(FX) is a diagonal matrix.

    For the two scatter functionals V1(F) and V2(F), V1(FX)−1V2(FX) is a diagonal matrix. By theorem 1, it follows that inline image, where Δ(FY)=diag{ρ1(FY),…,ρp(FY)} and P is a permutation matrix. Using property (3) again gives
    image
    which by the spectral value decomposition (13) implies that inline image for some non‐singular diagonal matrix inline image. The theorem then follows since
    image

    A.4. Proof of theorem 6

    It follows immediately that, if V(F) is a scatter functional satisfying definition 1, then V(FX) is a diagonal matrix. The remainder of the proof is then identical to the proof of theorem 5.

    A.5. Proof of theorem 7

    By equivariance, we can assume without loss of generality that μ=0. The proof is then given by the first part of the proof to theorem 5.

    A.6. Proof of theorem 8

    The proof of theorem 8 is analogous to the proof of theorems 5 and 6. The only difference is that the matrix inline image at the end of the proof is not necessarily a diagonal matrix, but rather a block diagonal matrix with diagonal blocks of order p1,…,pm.

    A.7. Proof of theorem 9

    The proof of theorem 9 is a generalization of the proof for theorem 5. For this proof, a reference to the blocks of a matrix of order p refers to the partitioning of the matrix in blocks of dimension pi×pj for i,j=1,…,m. The symmetry condition on X, along with the assumption that X has mutually independent subvectors, implies that inline image for any block diagonal matrix inline image having diagonal blocks of the form ±Ipk, k=1,…,m. So, for any scatter functional V(F) which satisfies the shape equivariant property (3), it follows that inline image, for any such inline image. The last equality follows from property (3) since the block diagonal components of V(FX) and inline image are the same. By choosing the first diagonal block of inline image to be −Ip1 and the other diagonal blocks to be Ipk for k=2,…,m, and then continuing in this fashion, we conclude that V(FX) is a block diagonal matrix.

    For the two scatter functionals V1(F) and V2(F), V1(FX)−1V2(FX) is a block diagonal matrix. Applying the spectral value decomposition (13) to the block diagonal elements gives inline image, with Δj being a diagonal matrix of order pj for j=1,…,m. Let Δ be the diagonal matrix of order p with diagonal blocks Δj, and let H be the block diagonal matrix of order p with diagonal blocks Hj. Thus, V1(FX)−1V2(FX)=HΔH−1. It follows from theorem 2 that inline image, where Δ(FY)∝diag{ρ1(FY),…,ρp(FY)} and P is a block permutation matrix. Applying property (3) again gives
    image
    Comparing this with the spectral value decomposition (13) for V1(FY)−1V2(FY) gives inline image for some non‐singular diagonal matrix inline image. Thus,
    image
    where inline image. Since inline image, it then follows that
    image
    with inline image, for j=1,…,m. Hence, theorem 9 holds.

    Discussion on the paper by Tyler, Critchley, Dümbgen and Oja

    J. T. Kent (University of Leeds)

    This is a delightful paper. Like many of the best ideas in statistics, it is based on a simple, yet elegant, idea which leads to powerful new methods of discovering patterns in data. All the user needs to do is to specify two scatter functionals (effectively two metrics for the specific set of data) and then to carry out a relative eigenanalysis. As the examples of the paper make clear, this new methodology has proved its value in a wide range of settings.

    Dual metrics have a long history in multivariate analysis. The simplest example is perhaps principal component analysis itself, which involves the eigendecomposition of one matrix (typically a sample covariance matrix) with respect to another matrix (typically the identity matrix, which is often implicit). Another example arises in multivariate analysis of variance, which is also called discriminant analysis, in terms of the ‘between‐’ and ‘within‐’groups sums of squares and products matrices B and W (or, alternatively, in terms of T=B+W and W), though in this case prior knowledge of the grouping structure is needed. These two examples involve quadratic functions of the data, whereas the focus in this paper is on non‐quadratic functions of the data.

    The simplest non‐quadratic function of a random variable U is the kurtosis, which takes the form kurt(U)=E(U^4)−3, when U is centred and scaled to have mean 0 and variance 1. It is useful to distinguish three cases:

    • (a) kurt(U)=0, which holds under normality, and the alternatives
    • (b) kurt(U)>0 and
    • (c) kurt(U)<0,
    which can be termed the 'super‐Gaussian' and 'sub‐Gaussian' cases respectively. The super‐Gaussian case arises for long‐tailed distributions, whereas the sub‐Gaussian case arises for what might be called 'balanced' mixtures of two normal distributions with different means. Here balanced means that the mixing proportions are not too far from inline image and the variances are not too dissimilar. When the variances are equal, it is possible to characterize the class of balanced mixtures explicitly; see, for example, Section 5.1 of the paper or Peña and Prieto (2001).
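A quick numerical illustration of these two regimes (a hypothetical simulation, included here only for intuition):

import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis(u):
    # Classical excess kurtosis E(U^4) - 3 of a standardized sample.
    u = (u - u.mean()) / u.std()
    return np.mean(u ** 4) - 3.0

n = 100_000
labels = rng.random(n) < 0.5
mixture = np.where(labels, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))
heavy_tailed = rng.standard_t(df=10, size=n)

print(excess_kurtosis(mixture))       # about -1.3: sub-Gaussian (balanced mixture)
print(excess_kurtosis(heavy_tailed))  # about +1.0: super-Gaussian (long tails)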

    One way to look for structure in a p‐dimensional random vector Y (with mean 0 and covariance matrix Σ) is to look for a linear combination a to maximize the absolute kurtosis, |kurt(a^TY)|, and this criterion forms the basis of one of the standard algorithms in independent component analysis (ICA). However, the above paragraph suggests that, when clustering is suspected, a better approach might be to minimize the signed kurtosis kurt(a^TY), which leads to what can be termed a 'sub‐ICA' algorithm; see, for example, Bugrien and Kent (2005) for more details.

    The set of all fourth‐order moments forms a four‐way array (and hence does not define a metric). Since the kurtosis involves a quartic function of a acting on this four‐way array, optimization for either of these algorithms must generally be carried out numerically.

    The paper finesses its way out of this numerical problem by replacing the full four‐way array of fourth‐order moments by a matrix of selected fourth‐order moments, inline image in equation (21), and replacing the quartic optimization by a quadratic optimization. Thus a natural question is which criterion offers more insight into the structure of multivariate data:
    image

    Further, do any insights here offer any guidance to the analysis of more general scatter functionals?

    Several other questions also spring to mind.

    • (a) Ordering of eigenvalues: the paper makes little distinction between inline image and inline image. Would it be helpful to label one of the matrices as 'more robust' than the other and to distinguish between the interpretation of the largest eigenvalue and the smallest (cf. sub‐ICA above)?
    • (b) Estimating the centre of the data: this topic has received little discussion in the paper, but it seems potentially important. Does the choice of location functional matter, especially for skew data? Or does symmetrization successfully deal with the issue? If symmetrization is not used, would it be desirable to enforce a common estimate of location when defining the matrices V1 and V2?
    • (c) Third moments: this paper works with and extends fourth moments (kurtosis). Is it worth investigating and extending third moments (skewness)?
    • (d) High dimension: the examples in this paper involve data sets of fairly modest dimension. Are there opportunities for insights with high dimensional data (n<p or n≈p), after regularizing?

    Let me end with a more philosophical question. The authors motivate the methods in the paper by using ideas from robustness theory, which was developed to protect against outliers. However, this paper is more concerned with pattern detection, which is a more subtle problem. Is it merely serendipity that methods developed for one problem provide tools for another, or is there something deeper going on?

    This has been a fascinating paper opening up a whole new direction in the search for patterns in multivariate data. It gives me great pleasure to propose the vote of thanks.

    Trevor Ringrose (Cranfield University, Swindon)

    Multivariate analysis often seems to be a randomly assorted grab‐bag of vaguely related methods rather than a coherent field, so it is very encouraging to see a paper which shows the connections between several methods and even more importantly opens up a wide array of potentially useful generalizations and special cases.

    The authors rightly point out that when different robust estimators of ostensibly the same parameter produce different answers this is not necessarily a bad thing, as there is information in these differences. Similarly in introductory statistics lectures we often mention that the mean, median and mode are all roughly the same for samples from symmetrical distributions, so it tells us something useful if they are all different. The paper offers a convincing method for making such comparisons in a multivariate setting, which the reader can understand by analogy with the very similar use of within‐groups and between‐groups covariance matrices in multivariate analysis of variance and canonical variate analysis.

    However, the job of the seconder of the vote of thanks is to be more critical, so we might ask the obvious question of how well do the methods proposed work in practice, and in particular what do they add to what we already have? Some of the examples are not very convincing. It was admitted during the verbal presentation that the outlying cluster can in fact be seen quite clearly in a matrix plot of the wood gravity data. Ignoring the distinction between response and explanatory variables (as in the paper) a biplot of the principal component analysis (PCA) solution (79% of variance on the first two axes) clearly picks out the cluster and shows that they have above‐average values on x2 and x5 and below‐average values for the other variables, as can then be seen clearly in the raw data. Similarly, distinguishing between the species in Fisher's iris data is trivially easy because again a simple matrix plot shows very clear differences in petal sizes, and the authors note in Nordhausen et al. (2008) that simple PCA does almost as well as invariant co‐ordinate selection. Admittedly this is a mainly theoretical paper so these data sets were chosen for illustration rather than real interest, but even given this they still seem excessively easy. Similarly, the picture mixing example of invariant co‐ordinate selection as independent component analysis in Nordhausen et al. (2008), pages 24–26, can be performed almost equally well (in R) by using PCA, and in this example PCA seems to cope better with cases where the number of output mixtures exceeds the number of input signals.

    It is a criticism of all of us that we tend to use and reuse the same toy examples in published work, which might easily make the cynical outsider suspicious that our methods work only in certain restricted cases. In particular, I would like to propose a moratorium on further published use of Fisher's iris data!

    I have two final comments. Firstly, the paper concentrates very much on the co‐ordinate scores on the new axes, but in many cases the eigenvector coefficients will also be of interest. Can meaningful biplots be produced? Secondly, the final paragraph mentions data‐driven choice of scatter matrices based on the observed separation of eigenvalues (with larger separations regarded as better, one assumes). Sample eigenvalues are usually more spread out than population eigenvalues anyway, and this will then tend to pick the biggest of these overestimates of the separation. Is this good or bad, though? It might turn out to be good when searching for outliers but bad when trying to model the bulk of a distribution.

    While reading the paper it seemed very clear that the authors must have started the work separately and from differing perspectives. One of the authors confirmed that this was indeed so: that they had worked independently until three of them realized that they had all talked about the same thing at a conference. It is a pleasingly self‐referential aspect of the paper that these three independent components of the work can be unmixed by the reader.

    The criticisms above are very minor, however, as this is a very interesting paper which points the way to many more papers developing the methods and their practical application. In recent years developments in multivariate statistics seem to have fallen behind those in areas such as regression modelling and Bayesian methods, and this paper should help to spark new interest in the field. This paper is a very welcome addition, and I have no hesitation in seconding the vote of thanks.

    The vote of thanks was passed by acclamation.

    Davy Paindaveine (Université Libre de Bruxelles)

    Beyond the role that it plays in detecting departures from ellipticity, invariant co‐ordinate selection (ICS) is potentially useful to choose a proper model for the data at hand among the many multivariate models that are available in the literature: (mixtures of) elliptical models, the independent component (IC) models of Section 5.2 (see also Nordhausen et al. (2009b)), skew elliptical models (see, for example, Genton (2004)), etc. This discussion partly supports this claim by proposing an informal graphical method that allows us to ‘test’ the null hypothesis inline image under which IC models are appropriate.

    In Fig. 1(c), it is shown how a couple of location–scatter estimates inline image, l=1,2, can be used to detect departures from ellipticity, on the basis of the fact that, for any such couple and under ellipticity, we should have inline image for some λ>0. For inline image, we could similarly think of using three—or four—different scatter estimates to derive—typically, via theorem 5—a couple of consistent estimates inline image for the underlying mixing matrix H (clearly, it is crucial to adopt a common normalization for inline image and H here, such as the Z‐standardization in the R package ICS; see Nordhausen et al. (2008) for details). Although proper (Frobenius‐type) distances between the resulting inline image and inline image would provide natural test statistics for inline image, a direct graphical tool, in the same spirit as in Fig. 1(c), is the scatter plot of ICS distances inline image with
    image
    where inline image is the vector of marginal medians for the lth ICS and inline image is the diagonal matrix collecting the corresponding marginal median absolute deviations. Under inline image, all points in such scatter plots should roughly sit on the main diagonal, which allows us to detect possible violations of inline image.

    The choice of the various scatter matrices is, here as well, a delicate issue. But one might still argue that combining scatter matrices with different robustness properties could reveal interesting features. This is illustrated (with the same data as in Section 6.1) in Fig. 6, where, interestingly, only the plot based exclusively on robust scatter matrices seems to be compatible with inline image.

    Fig. 6. Scatter plots of ICS distances, i=1,…,n, for different choices of the scatter matrices

    As shown beautifully in the paper, though, the relevance of ICS extends far beyond IC models, and I congratulate the authors for one of the most refreshing and inspiring works of the decade in the field of multivariate statistics.

    Mervyn Stone (University College London)

    This useful paper starts with Cartesian co‐ordinates that come with any data, graduates to matrices and ends up with affine invariance—in other words, next door to the open‐air geometry of co‐ordinate freedom!

    I doubt whether the authors depended on the algebra of Sections 2–4 to be confident that that would happen—before writing the computer program that does have to use co‐ordinates and matrices. Readers of the paper might have been spared the algebra—if only that great exponent of co‐ordinate freedom, Paul Halmos, had gone deeper into probability and statistics to wean us off co‐ordinates and matrices wherever and whenever these impede understanding.

    It is not too late to supply the alternative thin gruel.

    • (a) A few concepts and terms from the thinnest and least influential books on multivariate analysis: inline image is the vector space of variables (made out of p names) and inline image is its dual space of evaluators e whose evaluation of variable v (a possible 'observation' if v is a name) is the bilinear product [e,v]. V1 and V2 are inner products on inline image and also so‐called 'covariance operators' (linear inline image).
    • (b) Realization that fixed point theory can open the door to a simplified equivalent eigenanalysis for V1 and V2: inline image is the closed surface of a (V1+V2)‐hemisphere in inline image. The transformation inline image that is defined by inline image is continuous. So inline image has a fixed point h with V2h=ρ(h)V1h and, as a consequence, you can take it from here with a willingness to 'go to the pictures'.
    • (c) The pictures that I refer to here are downloadable and are more fully explained in Stone (2008). Their reassuring features are affine invariants as obvious as three lines meeting in a point—and simply discovering them can be a more rewarding and liberating activity for a statistician than sudoku.

    Christian Hennig (University College London)

    The authors did a good job in providing a framework for a class of projection methods to visualize multivariate data sets. The comments on the choice of shape matrices mainly focus on robustness aspects. I think that other considerations are important as well, and in many situations the choice matters more than the paper suggests.

    I show a situation in which the choices that are suggested in the paper and the ICS software package do not work well, and an alternative shape matrix does better.

    This needs a definition of quality, depending on the patterns of interest, which are clusterings here. What should a good projection method deliver in a benchmark situation with a one‐dimensional interesting pattern in a three‐dimensional data set? The analogue of what is expected in a high dimensional situation is that the pattern should appear along either the first or the last invariant co‐ordinate.

    The three variables of the example data set have been generated independently from a t2‐distribution, a uniform distribution and a mixture with 300 points from inline image points from inline image and 400 points from inline image. Fig. 7 shows the solution with the default of the ICS software inline image; this is similar to the solution with V1 the maximum likelihood (ML) estimate for the Cauchy distribution and V2 the ML estimate for t2. The cluster pattern is not optimally visible along the third co‐ordinate. In Fig. 8, ML for t2 and the minimum covariance determinant (MCD) have been used. This shows the pattern along the second co‐ordinate.

    Fig. 7. ICS plot with Sn and inline image

    Fig. 8. ICS plot with ML for t2 and MCD

    Fig. 9 shows the best solution, which stems from using the MCD as V2 and a 'local shape' matrix as V1, defined by the following steps.

    Fig. 9. ICS plot with local shape and MCD

    • (a) Compute a matrix of Mahalanobis distances between points (based on the MCD with 20% breakdown point, say).
    • (b) For every point, compute the covariance matrix of its 10% nearest neighbours.
    • (c) Standardize all these matrices by their traces to unify the influence of every point.
    • (d) Pool the covariance matrices.

    Using this matrix together with a global covariance matrix brings forth those co‐ordinates along which the local structure differs from the global structure.
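A rough sketch of this local shape construction is given below (hypothetical code; the treatment of ties, the inclusion of each point among its own neighbours and the way the robust scatter matrix for the Mahalanobis metric is supplied are assumptions, which Dr Hennig's actual computation may handle differently).

import numpy as np

def local_shape(X, V_robust, k_frac=0.10):
    # (a) pairwise squared Mahalanobis distances in the metric of V_robust
    #     (e.g. an MCD scatter matrix computed elsewhere);
    # (b) for every point, the covariance matrix of its nearest neighbours;
    # (c) standardization of each local matrix by its trace;
    # (d) pooling over all points.
    n, p = X.shape
    k = max(p + 1, int(k_frac * n))
    V_inv = np.linalg.inv(V_robust)
    diff = X[:, None, :] - X[None, :, :]
    D2 = np.einsum('ijk,kl,ijl->ij', diff, V_inv, diff)
    pooled = np.zeros((p, p))
    for i in range(n):
        nbrs = X[np.argsort(D2[i])[:k]]     # nearest neighbours (including point i)
        C = np.cov(nbrs, rowvar=False)
        pooled += C / np.trace(C)
    return pooled / n

This matrix, together with a global scatter matrix such as the MCD, can then be fed into the same generalized eigendecomposition as before.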

    Here is another idea.

    • (a) Compute an affine invariant clustering of the data.
    • (b) Use the pooled within‐cluster covariance matrix.

    Conclusion: if clustering is of interest, it is advantageous to choose scatter matrices to explore global versus within‐cluster structure.

    A. P. Dawid (University of Cambridge)

    The central idea of this paper is very neat: that, in the presence of two different measures of scatter, defining two different inner products over the variables, we can apply simultaneous diagonalization to define a ‘natural’ set of basic variables for further analysis and display of the data. However, this set is natural only to the extent that the chosen pair of scatter measures can be considered natural. But, even in this case, why stop at two such measures?—in many problems there will be a wide variety of interesting scatter measures. Unfortunately the theory as presented requires, not one, not three, but exactly two scatter measures.

    Is there anything useful that can be said about an appropriate treatment (in a symmetrical fashion) of more than two?

    The following contributions were received in writing after the meeting.

    Henri Caussinus (Institut de Mathématiques de Toulouse) and Anne Ruiz‐Gazen (Toulouse School of Economics)

    We congratulate the authors for their very interesting paper, which brings significant improvements in the theoretical understanding of the comparison of scatter matrices. From our perspective several issues deserve further attention. The first issue is the choice of the dimension of the graphical display, which has been a crucial concern since the early days of the projection pursuit approach (Sun, 1991): which projections are significant, i.e. which projections contain a genuine structure rather than merely random variation corresponding to elliptical distributions? For example, within the framework of theorem 4, what is the value of k or, more precisely, what is the dimension of the subspace containing the k centres μj? The answer to this practical question rests on the distribution of the eigenvalues of the matrix product involved. We gave very preliminary theoretical results for specific scatter matrices in Caussinus et al. (2003a) for the detection of outliers and in Caussinus et al. (2003b) for the detection of groups.

    Another issue is the complementary use of invariant co‐ordinate selection and classification. The co‐ordinates that are selected by invariant co‐ordinate selection can be used to visualize possible groups, to suggest their number and to improve the efficiency of clustering algorithms. These various aspects were illustrated in Caussinus and Ruiz‐Gazen (2007) as an encouragement for further research.

    A third issue concerns the choice of the (class of) scatter matrices to be compared with respect to the structure of interest. From our experience, many choices lead to displaying outliers. Since, in practice, outliers are often present in the data sets, they can mask other interesting features. To display groups or special structures like those of example 2 or the RANDU data set, much care is needed in the choice of the scatter estimators to be compared, and all the more so in the presence of outliers. It seems that scatter matrices resting on pairwise differences are of special interest. A class of scatter matrices depends on a tuning parameter whose choice is also challenging. As noted by Tyler and his colleagues, some of our results lead to choosing small values of this parameter; this is basically the case when looking for outliers. However, in other cases of interest, our practice and some limited unpublished results lead to different values, e.g. 2, the value which appears in Caussinus et al. (2003a). We hope that the authors will be interested in further investigating these various issues.

    Christophe Croux (Katholieke Universiteit Leuven)

    This paper introduces a new tool for multivariate data analysis, called invariant co‐ordinate selection. I consider the ideas in this paper to be new and innovative, and this paper is very likely to result in a new stream of research in multivariate analysis. I congratulate the authors for this fascinating paper, and for the clear exposition of their work.

    The method is quite easy to put in practice: you compute eigenvalues and eigenvectors of V1 and V2, with V1 and V2 two scatter matrices. The idea only works if V1 and V2 are different scatter matrices. The reason why this method has not been discovered earlier is probably because most statisticians only use the covariance matrix. Scatter matrices are well known in the robustness literature, but application of the methods here does not require the scatter matrices to be robust. What I consider as most important contributions are as follows.

    • (a) The introduction of an 'invariant co‐ordinate system': an affine transformation of the data does not change the co‐ordinate system. Principal component analysis has this property only for orthogonal transformations. The invariant co‐ordinate system depends on the choice of the two scatter matrices and yields arbitrary transformations for elliptical distributions. Also in principal components, one computes eigenvectors of a chosen scatter matrix (most often the covariance matrix), and one obtains arbitrary rotations for spherical distributions.
    • (b) The result that invariant co‐ordinate selection retrieves
      • (i) the independent components of the independent component analysis model and
      • (ii) Fisher's linear discriminant subspace for mixtures of elliptical distributions.

    If we have neither a mixture of elliptical distributions nor an independent component analysis model, then the interpretation of the selected co‐ordinates relies on a projection pursuit argument, where one uses a generalized measure of kurtosis. Note that this measure of kurtosis is defined conditionally on a given multivariate distribution. For an arbitrary univariate distribution, it is not so clear how this generalized measure of kurtosis is defined.

    Whereas most of the theory in multivariate statistics relies on elliptical distributions, the authors go one step beyond this and open a whole new area of research. I liked reading the paper, and I congratulate the authors once more.

    Peter Filzmoser (Vienna University of Technology)

    I congratulate the authors for this interesting contribution that combines and generalizes several approaches. The work of Caussinus and Ruiz‐Gazen on generalized principal components analysis is generalized, and Fisher's linear discriminant subspace turns out to be a special case. Also independent component analysis and projection pursuit are taken into account. For the latter method, invariant co‐ordinate selection (ICS) does not require the pursuit effort. In contrast, one could evaluate the co‐ordinate pairs resulting from ICS for their ‘interestingness’, thereby using standard projection pursuit indices. Moreover, as already indicated by the authors, different pairs of scatter matrices could be used to find interesting projections. A further idea could be to use linear combinations of scatter matrices and to combine them in the same way as is done now with equation (13). Depending on the coefficients for the linear combinations, different insights into the multivariate data structure could be obtained.

    An interesting aspect of ICS is that the pairs plots offer the possibility of interpreting the outliers. For instance, in example 1 (Fig. 2) the directions of the first two ICS components refer to the contributions of the nine variables. Thus, by inspecting these ‘loadings’ it could be possible to interpret the outlier groups in terms of the original variables.

    Finally, thanks to the available R package 'ICS', I did some experiments with high dimensional data. I generated two multivariate normally distributed data clouds in 1000 dimensions, the first cloud consisting of 2000 observations and the second of 200, both centred at the origin. The covariance matrices are the identity matrix for the first cloud and the identity matrix multiplied by 1.2 for the second cloud. Thus, it is practically impossible to distinguish the two groups in any pairs plot. With the default parameters for the 'ics' function we can see slightly different behaviour of the two groups in the first and last ICS directions. When taking the classical covariance matrix of the original data and of the weighted data, with weights obtained from a multivariate outlier detection method, we can clearly see both groups. Here I used an outlier detection method that is not affine equivariant (Filzmoser et al., 2008) and, although the theoretical results for ICS would no longer hold, the practical results are very useful.

    Marc Hallin (Université Libre de Bruxelles)

    This paper, which brings together and unifies fundamental ideas from several statistical areas—principal components, discriminant analysis, robustness, invariance, statistical depth, flexible modelling, independent component analysis, …—is certainly among the most stimulating and refreshing that I have read for many years.

    Focusing on the use of two distinct measures of scatter (or shape) V1(F) and V2(F) in detecting departures from ellipticity, one question, which is not examined by the authors, naturally comes to mind: for given non‐elliptical F, is there any such thing as a 'most efficient' or 'most contrasting' choice of the functionals F↦Vj(F), j=1,2—maximizing, for instance, some adequate distance between the scaled version of (ρ1,…,ρp) and (1,…,1)? This question, quite presumably, is related to the problem of constructing 'optimal' tests for sphericity (robust alternatives to the traditional Mauchly (1940) and John (1972) tests can be found in Tyler (1982, 1987) and Hallin and Paindaveine (2006)). Answering such a question would be most useful, for instance in the problem of recovering, in an optimal way, the independent components in independent components analysis models.

    Affine invariance or equivariance, however, is not the only invariance property that we could require for the scatter functionals F↦Vj(F), j=1,2. Another group of transformations, of equal relevance, is not mentioned, which also preserves ellipticity: the group of monotone radial transformations. More precisely, assuming that some location θ=θ(F) has been chosen, consider a scatter functional F↦V(F) (in the sense of this paper), and let
    image

    Then, Y (with distribution function FY) is elliptical if and only if inline image (with distribution function inline image) is also elliptical, where r↦g(r) is an arbitrary continuous monotone increasing transformation of inline image such that g(0)=0 and lim_{r→∞} g(r)=∞. Classical invariance arguments suggest that inline image be proportional (shape equivalent) to V(FY) for any g and FY—a property that the scatter functionals considered in the paper have only when restricted to the family of elliptical FYs. This invariance under radial transformations severely restricts the class of admissible scatter functionals; note that the functional that was proposed by Tyler (1987) satisfies the condition—but other solutions do exist.

    In the empirical version (denote by inline image the empirical distribution function for a sample Y1,…,Yn of size n), similar invariance arguments imply that inline image should be measurable with respect to U_{V(n);i} := (V(n))−1/2(Yi−θ)/‖(V(n))−1/2(Yi−θ)‖ and the ranks inline image of the distances r_{V(n);i} := {(Yi−θ)^T(V(n))−1(Yi−θ)}1/2, i=1,…,n. This is not easily achieved for finite n, but it holds for the M‐estimator that was proposed by Tyler (1987) and, in asymptotic form, for the R‐estimators of shape that were developed in Hallin et al. (2006).

    Daniel Peña and Júlia Viladomat (Universidad Carlos III de Madrid)

    The authors present a very general method to generate an affine invariant co‐ordinate system by projecting the data onto some eigenvectors of the matrix inline image, where V1 and V2 are any pair of (robust) affine equivariant scatter matrices. These projections are shown to reveal departures from an elliptical distribution and can be seen as a projection pursuit method based on kurtosis (see equation (14)). Projection directions maximizing and minimizing kurtosis were shown to be useful for robust multivariate estimation in Peña and Prieto (2001b), who also proved the optimality properties of these directions for clustering (Peña and Prieto, 2001a). They used numerical optimization to find these optimal directions. An important contribution of this paper is that these directions can also be obtained as eigenvectors of some general class of kurtosis matrices.

    Thus, we have two ways of finding extreme directions of kurtosis. The first way is through numerical optimization and the second finds the eigenvectors of some generalized kurtosis matrix. In Peña et al. (2008) we have compared these two approaches in a particular case. Given a multivariate random vector X with mean μ and covariance matrix Σ, we propose to compute the eigenvectors of the kurtosis matrix K=E{(Z^TZ)ZZ^T}, where Z=Σ−1/2(X−μ). Using this matrix is equivalent to choosing V1=Σ and V2=E{(Z^TZ)(X−μ)(X−μ)^T} in this paper. We then show that, if the ratio n/p is large, where n is the sample size and p the dimension, the estimation of a matrix of dimension p is reliable and estimating its eigenvectors becomes accurate and useful. Also, in this case numerical optimization is computationally intensive. However, when n/p is small, estimating the elements of the matrix has limited precision and the eigenvectors are not useful for showing the clusters. Since the use of the kurtosis matrix K is based on an existing kurtosis‐based algorithm, we can use the algorithm in Peña and Prieto (2001a) when n/p is small. An interesting problem is the performance of these two procedures under the more general situation of different scatter matrices. Then the use of just any pair of robust scatter matrices does not guarantee the identification of the clusters, whereas the directions of extreme kurtosis have been found to be effective in this situation.

    Werner A. Stahel and Martin Mächler (Eidgenössiche Technische Hochschule, Zurich)

    The paper introduces an elegant piece of theory and derives a very useful tool for finding patterns in multivariate data. We warmly congratulate the authors for this work.

    This comment recalls a benchmark distribution for multivariate tools that aim at good robustness properties, which was introduced in section 5.5a of Hampel et al. (1986), which we shall call the ‘barrow wheel’. It is a mixture of a flat normal distribution contaminated with a portion ɛ=1/p of gross errors concentrated near a one‐dimensional subspace. Let
    image
    where p is the dimension and H is the distribution of Y, where Y(1) has a symmetric distribution with inline image and is independent of inline image (Fig. 10(b)). Then, this distribution is rotated such that the X(1)‐axis points in the space diagonal direction (1, 1,…,1), and the components are rescaled to obtain G. Note that the covariance matrix of both G0 and G will tend to Ip for σ1→0 and σ2→0, and all known 'cheap alternatives' to high breakdown point ('class III') covariance estimation fail to detect the outlier part H. For more details and R functions, see http://stat.ethz.ch/research/areas/robustness.
    Fig. 10. Scatter plot matrices of a sample from the barrow wheel distribution, p=4, and of the invariant co‐ordinates obtained from it

    Is the barrow wheel an artificial situation? The ‘wheel’ describes a multivariate normal distribution with a strong linear relationship between variables—a situation which multivariate statistics searches for. The outliers are ‘nasty’, but making them more realistic does not render the problem of detecting the structure much easier. Robust multivariate procedures should therefore pass this benchmark.

    Fig. 10(a) shows a sample from G for p=4 and σ1=0.1 and σ2=0.2. Any structure seems difficult to spot. The invariant co‐ordinate selection that is obtained from using the robust MCD covariance as V2 and the empirical covariance matrix as V1 shows the structure very clearly (Fig. 10(b)). Note that the outliers would appear in the last co‐ordinates if we followed the advice of the authors to use a mildly or non‐robust scatter estimate as V1 and a more robust estimate as V2.

    Thus, invariant co‐ordinate selection passes the benchmark—if a high breakdown scatter matrix is used. The cheaper alternative that is based on a class II scatter matrix and a one‐step W‐estimate applied to it (Section 6 of the paper) will generally miss the structure. If a full class III estimate is too expensive, we recommend simply restricting the number of elemental subsets of the usual resampling algorithm to find such an estimate and using the respective ‘unsecure’ estimate as V2.

    The authors replied later, in writing, as follows.

    We thank all the discussants for their insightful and generally encouraging remarks. Many of the points that were made by them have also been major concerns of ours, and we hope that our paper stimulates others to develop this topic further. The discussants have already pointed the way to many important open problems.

    Rather than respond to the discussants one by one, we address their main recurring themes.

    Choice of scatter and statistical variability

    One of the more prominent themes in the contributions is the choice of the scatter matrices. This is certainly a major topic deserving a better understanding. A good choice for the scatter matrices, though, will probably depend on the problem at hand, e.g. whether interest lies in a mixture problem, an independent components analysis (ICA) problem or some other problem.

    Much of the discussion tends to focus on the role of invariant co‐ordinate selection (ICS) in detecting mixtures or clusters. In this setting, it seems natural that one should try to define one scatter matrix so that it can be viewed as a measure of within‐group scatter. This is essentially the idea behind Dr Hennig's proposal for a local shape matrix. (As defined, this matrix is not affine equivariant but can be made so by replacing tr(Vi) with det(Vi) in its definition.) It is also the motivating idea behind the clustering algorithm that was proposed by Art et al. (1982), as well as the idea behind the scatter matrices based on downweighting large pairwise differences that were noted in the discussion of Professor Caussinus and Professor Ruiz‐Gazen and in Lutz Dümbgen's oral presentation explaining the choice of scatter matrices that were used in the RANDU example.

    Nevertheless, if one of the models that are considered in Section 5 holds, the results of our paper imply that the choice of the scatter matrices used in deriving the ICS co‐ordinates is theoretically irrelevant for sufficiently large sample sizes. As noted by Professor Hallin and by Dr Ringrose, the main considerations are the theoretical separation of the ICS roots, ρ1(F),…,ρp(F), and the statistical variability of the sample scatter matrices inline image and inline image. If the theoretical roots are not well separated then some modest statistical variability in the scatter matrices may result in the ICS co‐ordinates being poorly estimated. The theoretical ICS co‐ordinates, however, do not depend on the choice of the scatter matrices, at least within the context of the theorems of Section 5. Studying the statistical variability of the ICS roots and co‐ordinates is a reasonably straightforward problem, at least asymptotically. However, the more important problem of understanding the theoretical separation of the roots based on two given scatter functionals for a specific underlying model appears to be very challenging, and any results on this topic are greatly welcomed.

    To illustrate these points further, consider the example that was presented by Dr Hennig. This interesting example is presented as a clustering problem but it does not fall under the mixture models that were considered in Section 5.1. Rather, it provides a nice example of an ICA model. Theorem 5 states that essentially any two scatter matrices should uncover the structure. Furthermore, as noted in the discussion after theorem 5, the independence property or symmetrization is not needed here since two of the three marginals are symmetric. The scatter matrices that were first considered by Hennig, Sn being one of them, may not be appropriate since neither is defined at the population model owing to the t2‐distribution. (Curiously, the t2‐component is easily found and the difficulty appears to be in separating the mixture component from the uniform component.) Otherwise, any well‐defined pair of scatter matrices should find the independent components for a sufficiently large sample size, even if they are not specialized to this particular problem.

    Figs 11(a) and 11(b) show the results for Dr Hennig's example when Dümbgen's scatter matrix is used as one of the two scatter matrices in both figures, with the t2 M‐estimate of scatter and its symmetrized version used respectively as the other. From Fig. 11 the symmetrized version appears to give a slightly better recovery of the independent components, with the sample ICS roots being more widely separated in the symmetrized version, i.e. (1.26, 0.98, 0.80) for Fig. 11(a) versus (1.53, 0.87, 0.75) for Fig. 11(b). This suggests, as commented on by Professor Kent and by Professor Caussinus and Professor Ruiz‐Gazen, that there may be advantages to symmetrization, at least for moderate sample sizes. It also seems that using a common centre for both scatter matrices may be advantageous. The ICS plot that uses a t2 M‐estimate of scatter and Tyler's shape matrix (which is the unsymmetrized version of Dümbgen's scatter matrix) centred at the t2 M‐estimate of location is similar to Fig. 11(b). Using a common centre avoids the additional computations that are needed in working with symmetrized data.

    Fig. 11. ICS for Hennig's example: (a) the t2 M‐estimate and Dümbgen's scatter; (b) the symmetrized t2 M‐estimate and Dümbgen's scatter

    This example also sheds some light on Professor Kent's question regarding the distinction between larger and smaller ICS roots. Dümbgen's matrix may be viewed in a loose sense as being more ‘robust’ than both the t2M‐estimate and its symmetrized version. In Fig. 11(a), though, the t2‐component is related to the largest root whereas in Fig. 11(b) it is related to the smallest root.
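
    To indicate what is involved computationally, the following is a minimal sketch, in Python and assuming only numpy, of a t2 M‐estimate of scatter and its symmetrized version computed by the usual fixed‐point iteration. It is not the implementation that was used for Fig. 11, it omits convergence safeguards and it forms all pairwise differences explicitly, which is wasteful for large n; the function names are illustrative only.

        import numpy as np
        from itertools import combinations

        def t_mscatter(Z, nu=2.0, max_iter=200, tol=1e-8):
            """t_nu M-estimate of scatter with the centre fixed at 0, computed by the
            fixed-point iteration V <- mean_i w_i z_i z_i',
            with w_i = (nu + p) / (nu + z_i' V^{-1} z_i)."""
            n, p = Z.shape
            V = np.cov(Z, rowvar=False)
            for _ in range(max_iter):
                d = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(V), Z)  # squared Mahalanobis distances
                W = ((nu + p) / (nu + d))[:, None] * Z
                V_new = W.T @ Z / n
                if np.max(np.abs(V_new - V)) < tol:
                    return V_new
                V = V_new
            return V

        def symmetrized_t_mscatter(X, nu=2.0):
            """The same M-estimate applied to all pairwise differences x_i - x_j,
            which removes the need for a location estimate."""
            D = np.array([X[i] - X[j] for i, j in combinations(range(len(X)), 2)])
            return t_mscatter(D, nu=nu)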

    Statistical inference and general distributions

    Our paper does not give any results on statistical inference, but rather leaves this topic open for further research. As noted by Professor Caussinus and Professor Ruiz‐Gazen, there are some interesting open inferential problems when we assume a mixture model. Perhaps the most fundamental question, though, is to determine first whether the ICS roots differ significantly from each other. Otherwise, the ICS method is simply exploring noise.

    Some work on using two scatter matrices to test for multivariate normality can be found in Kankainen et al. (2007). In Wang (2008), the sample ICS roots are used to develop tests for the hypothesis that the data come from an elliptical distribution. The local power functions of these tests under mixtures of elliptical distributions and under skewed elliptical distributions are also obtained.
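
    Pending such formal procedures, a crude way to gauge the sampling variability of the roots in a particular application is a nonparametric bootstrap. The following sketch, in Python and assuming numpy together with any user‐supplied function that returns the ICS roots for a data matrix, is exploratory only and is not a substitute for the tests of Kankainen et al. (2007) or Wang (2008).

        import numpy as np

        def bootstrap_ics_roots(X, ics_roots, n_boot=200, seed=0):
            """Bootstrap resampling of the rows of X; returns an (n_boot x p) array
            of ICS roots whose spread indicates their sampling variability."""
            rng = np.random.default_rng(seed)
            n = len(X)
            return np.array([ics_roots(X[rng.integers(0, n, size=n)]) for _ in range(n_boot)])

    With the earlier sketch one could, for instance, take ics_roots = lambda X: ics_coordinates(X)[0].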

    The question that is posed by Professor Hallin regarding the optimal choice of scatter matrices for such tests again depends on the problem at hand, i.e. on the alternative model. The problem of testing the hypothesis of ellipticity against a general multivariate distribution is far more complex than testing the hypothesis of sphericity within the class of elliptical distributions. Even when considering a mean mixture of two multivariate normal distributions, we have noted in some preliminary work that one pair of scatter matrices may be more powerful than another pair for some mixtures, whereas the reverse may hold for other mixtures.

    For distributions other than the models that are discussed in Section 5, the theoretical ICS co‐ordinates themselves can be heavily dependent on the scatter functionals. In this case, the use of more than two scatter functionals, as pondered by Professor Dawid, may be helpful for exploring these more complex non‐elliptical structures. Generating a new co‐ordinate system based on the comparison of more than two scatter matrices is more problematic since in general three or more scatter functionals cannot be simultaneously diagonalized. (Note that theorem 5 states that all scatter functionals can be simultaneously diagonalized for the ICA model that is considered in the theorem.) Perhaps some approximate simultaneous diagonalization as suggested by Professor Filzmoser can be developed. Approximate simultaneous diagonalization techniques have been developed in another context in ICA; see for example Cardoso and Souloumiac (1996).

    The contribution by Dr Paindaveine presents a clever application which uses the information that is contained in more than two scatter matrices, namely a graphical method for assessing how varied different ICS co‐ordinate systems may be. The insight that is obtained from this can be used to diagnose whether one of the models considered in Section 5 is appropriate. The graphs that he presents suggest that a wide range of scatter matrices should be considered in practice, since some pairs of scatter matrices may give similar ICS results, whereas others may give differing results.

    The hypothesis of the equality of the theoretical ICS roots is equivalent to the hypothesis V1∝V2. Information coming from several scatter matrices can also be used to develop alternative and perhaps more powerful tests for ellipticity by considering the hypothesis V1∝V2∝…∝Vk. Expanding on Dr Paindaveine's idea, several scatter matrices can also be used to test whether there is any significant deviation from one of the models considered in Section 5. For such a test, rather than testing whether the scatter matrices are proportional to each other, we would be interested in testing whether the scatter matrices can be simultaneously diagonalized. These are challenging topics for future research.

    High dimensional data and projection pursuit

    Several discussants bring up the topic of high dimensional data, and in particular when the sample size n is small relative to the dimension p. For n≤p+1, all affine equivariant sample scatter matrices are proportional to each other (see for example Tyler (2009)), and so the ICS method is not applicable in this case. When n is not too large relative to p, as noted by Professor Peña and Dr Viladomat, ICS is unlikely to be successful at finding underlying structures because of the statistical variability of the scatter matrices, unless the structures are extreme.

    As p increases proportionally to n, it is not clear whether the ICS roots and co‐ordinates converge to anything. In an infinite dimensional space, how do we define an affine equivariant scatter operator other than the covariance operator? In this setting, and in the setting in which n/p is not very large, we may need to relax the requirement of affine equivariance and, as suggested by Professor Kent, introduce some type of regularization.

    The question of the differences between ICS and projection pursuit methods that are derived by using one‐dimensional projection indexes is a natural one. Professor Kent, for example, questions the difference between using kurt(a′Y) and using the corresponding ICS criterion based on two scatter matrices. Professor Peña and Dr Viladomat report on some recent work comparing these two measures, which we are eager to read. They remark that the theory for ICS does not guarantee the identification of clusters when the scatter matrices of the mixture components differ (or, more precisely, when they are not proportional), whereas projections with extreme univariate kurtosis have been effective in this situation. To the best of our knowledge, the theory for projection pursuit in this case only guarantees identifying the components of a normal mixture when the covariance matrices are equal, or when the components are well separated. It is possible to show that the latter case also holds for ICS. A special case of this has been considered by Critchley et al. (2007).

    We do not generally recommend using the fourth‐moment scatter matrix and Σ as the scatter matrices in ICS. For this choice, the results of ICS can be too heavily focused on just a few spurious outliers, and the statistical variability of the method can be high for longer‐tailed distributions, including mixture models. This pair of scatter matrices is nevertheless of interest, not only for their role in one of the earliest ICA algorithms, FOBI, but also because they are analytically tractable and hence can help to lead to a better theoretical understanding of the method. For example, in the mixture of two multivariate normal distributions that was discussed after theorem 3, it is shown that when the mixing proportion α satisfies α(1−α)=1/6 the generalized kurtosis is constant and therefore no direction is distinguishable from any other. If we examine the formula for the kurtosis of a mixture of two univariate normal distributions (see Preston (1953)), we draw the same conclusion regarding kurt(a′Y), i.e. it is constant. There is also a relationship between the two measures under the ICA model, namely
    image
    when a′Y is one of the independent components; see Nordhausen et al. (2008).
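
    To indicate where this critical mixing proportion comes from (a short calculation that we add here for illustration), consider a mean mixture of two multivariate normal distributions with common covariance matrix, mixing proportion α and mean difference μ. Any projection a′Y is then a normal variate plus a shifted Bernoulli term, and its fourth cumulant is

        \[
          \kappa_4(a^{\top}Y) \;=\; (a^{\top}\mu)^4\,\alpha(1-\alpha)\,\{1 - 6\,\alpha(1-\alpha)\},
        \]

    which vanishes in every direction exactly when α(1−α)=1/6, i.e. α=(3±√3)/6≈0.21 or 0.79; at these proportions both kurt(a′Y) and the pair consisting of Σ and the fourth‐moment scatter matrix lose all power to distinguish directions.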

    Content and style

    As observed by Dr Ringrose, the examples that are given in the paper are intended to illustrate the theory clearly. Consequently, the structures that are found in some of the examples can also be found by using a simple principal components analysis (PCA). It is a fair question, though, to ask what ICS can do that cannot be done with PCA. This question can also be asked of discriminant analysis; some of us have had the experience of consulting with researchers in applied fields who obtain answers from PCA even though the problem is one of discriminant analysis. It is well known to researchers in multivariate analysis that one can easily construct examples where PCA fails to uncover the group differences, particularly when the means vary along the smallest principal component direction relative to the within‐group covariance matrix. Analogous cases, with the group identifications unknown, produce a similar difference between ICS and PCA.

    Other examples can be found in the vast ICA literature, where one of the main motivations for its development is that PCA often fails to find important structures in a multivariate data set. If one applied PCA to our example 2, one would not find the underlying structure, regardless of the scatter matrix that was used for the PCA. In practice, it is possible for the ICS roots to be significantly different even though no obvious structure, groups or outliers are visible in the plots of the ICS co‐ordinates. The theory then assures us that the underlying distribution is more complicated than an elliptical distribution, so a deeper understanding of the data is needed, as opposed to a simple location–scatter summary, and a closer examination of the ICS co‐ordinates may be enlightening. Such examples, though, do not make good initial illustrations.

    Whether or not a moratorium should be placed on Fisher's iris data is a matter of debate. They can be useful for illustration while taking up minimal space in a paper or minimal time in a presentation since one does not need to explain the data set in detail. Also, if a method does not perform well on the iris data, then the theory may be suspect. Other fields have their pet data sets for illustrating methods and theory (the iris data are a pet rather than a toy), such as the famous Lena image in computer vision.

    We appreciate Professor Stone's co‐ordinate‐free formulation of the ICS variates. The co‐ordinate‐free approach to multivariate statistics certainly offers a theoretically elegant and concise view of the topic. Professor Kent's comments also hint at the co‐ordinate‐free approach in his mention of dual metrics. The statement that two scatter matrices can always be simultaneously diagonalized can be expressed more elegantly in a co‐ordinate‐free manner by simply noting that, for any two inner products on a finite dimensional vector space, there is a basis (the ICS basis) which is orthogonal (but not necessarily orthonormal) relative to both inner products. Alternatively, rather than present the usual technical gruel of relating the spectral value decomposition of the symmetric matrix V1−1/2V2V1−1/2 to the eigenvalues and eigenvectors of V1−1V2, we could note that ICS simply corresponds to the usual spectral value decomposition of the symmetric operator V1−1V2, where symmetry is with respect to the inner product ⟨x, y⟩=x′V1y. Some of us have mentioned these more abstract concepts in presentations and at times receive the query: why confuse the audience with the abstraction? So, a common consideration in presenting results is the intended audience, which for this paper is those who are interested in general multivariate methodology. The interpretation of ICS as a PCA on standardized data may be particularly appealing to practitioners.
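
    In symbols, with the two inner products ⟨x, y⟩i = x′Vi y for i=1, 2, the statement is that the basis h1,…,hp can be chosen so that

        \[
          h_j^{\top} V_1 h_k \;=\; h_j^{\top} V_2 h_k \;=\; 0 \qquad (j \neq k),
        \]

    which is exactly the simultaneous diagonalization that is used throughout the paper.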

    A co‐ordinate‐free approach can be beneficial in any attempt to generalize ICS to infinite dimensional Hilbert spaces. Here the concept of a basis that is mutually orthogonal relative to two different inner products still holds, as does the spectral value decomposition (or the Karhunen–Loève decomposition) for symmetric operators. The covariance operator can also be defined within a co‐ordinate‐free format; see for example Eaton (1983). It is not clear, however, how other scatter functionals, whether in finite or infinite dimensional space, can be formulated in a co‐ordinate‐free manner.

    Robustness

    Several of the discussants noted the role of robustness in ICS. Dr Croux gives a very perceptive discussion on the key points of our paper and in particular notes how ICS is a natural outgrowth of contemplating problems in robust statistics. Researchers within the robustness community are accustomed to working with competing functionals (and estimates) which, under general assumptions, measure (estimate) presumably the same population parameter. Robust statistics is often focused on automatically accommodating outliers so that they do not influence the interpretation of the majority of the data. It is then natural to ask in the multivariate setting whether a location–scatter summary is reasonable even for the majority of the data, e.g. a 30–30–40 mixture.

    The type of outliers that usually cause ‘breakdown’ for many statistics are not simply spurious outliers but rather outliers that have a pattern of their own, with the most extreme case being point mass contamination. One can argue that such data structures are confounded with mixture models, and as an alternative one could try to identify such a pattern while accommodating spurious outliers. Statistics other than high breakdown point statistics tend to blur the majority structure with any outlier structure. However, as shown in the modified wood gravity data example, when two such scatter statistics are used together in ICS, the separate patterns may become more apparent.

    A similar phenomenon occurs for the example that was given by Stahel and Mächler, which was originally used in Hampel et al. (1986) to illustrate the concept of breakdown at the edge. In this example, the majority of the data lie close to some subspace. As correctly noted in their contribution, scatter statistics which do not have breakdown points near 1/2 fail to pick up the near singularity of the majority of the data. This does not imply, though, that ICS based on two lower breakdown point statistics will also fail to detect this pattern. Although this example does not correspond to one of the mixture models or ICA models that were considered in Section 5, it turns out that the ICS co‐ordinates do not depend theoretically on the two scatter matrices used.

    To see this, we first note that, because of the invariance of ICS, it is sufficient to consider the distribution G0. The distribution G0 is invariant under transformations of the form QX, i.e., if X has distribution G0, then so does QX, where Q is block diagonal with blocks q11=±1 and Q22 a 3×3 orthogonal matrix. Consequently, any affine equivariant scatter matrix V at G0 must be block diagonal with elements v11 and V22=v22I3, where I3 is the 3×3 identity matrix. It follows that V1−1V2 has the same block diagonal form, and so either the first or the last ICS co‐ordinate will correspond to the first variate in G0, and the other three co‐ordinates will correspond to some rotation of the last three co‐ordinates in G0 (except in the idiosyncratic case in which V1 and V2 are theoretically proportional to each other). Again the question of choice depends on the separation of the theoretical roots along with the statistical variability of the roots. For a sufficiently large sample size, the pattern should be detected for any two choices of scatter, and neither one needs to have a high breakdown point. Fig. 12 illustrates this for a sample of size n=100 from G0 using the sample covariance matrix and the Cauchy M‐estimate.

    Fig. 12. ICS for Stahel–Mächler's example: the t1 M‐estimate and the sample covariance as the two scatter matrices
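
    In symbols, and writing vi,11 and vi,22 for the corresponding elements of Vi(G0) for i=1, 2, the invariance argument above gives

        \[
          V_i(G_0) \;=\; \begin{pmatrix} v_{i,11} & 0 \\ 0 & v_{i,22} I_3 \end{pmatrix},
          \qquad
          V_1(G_0)^{-1} V_2(G_0) \;=\; \begin{pmatrix} v_{2,11}/v_{1,11} & 0 \\ 0 & (v_{2,22}/v_{1,22})\, I_3 \end{pmatrix},
        \]

    so that, whichever two affine equivariant scatter functionals are used, there is one simple ICS root and one root of multiplicity 3 at G0, unless the two ratios happen to coincide.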

    Other remarks

    We have not been able to respond to all of the comments in the contributions in detail. Dr Ringrose and Professor Filzmoser bring up the topic of biplots for the ICS co‐ordinates, and so we wish to note briefly that such biplots have been considered by Caussinus et al. (2003) in the context of generalized PCA. Professor Kent raises the question of possible extensions for third moments. Here, we note some recent work on this topic by Nordhausen et al. (2009a).

    Again we thank all the discussants for their contributions, and we hope that our responses are somewhat enlightening. Overall, though, there is still much work to do and new methodologies are needed to understand better the nature of multivariate data, especially when we move away from the comfort of elliptical distributions.

    Appendix

    Appendix A: Proofs

    A.1. Proof of theorems 1 and 2

    From property (3), it follows that there are γ1>0 and γ2>0 such that V1(FY*)=γ1AV1(FY)A′ and V2(FY*)=γ2AV2(FY)A′. By definition, V2(FY*)hj(FY*)=ρj(FY*)V1(FY*)hj(FY*), and so
        V2(FY){A′hj(FY*)} = γρj(FY*)V1(FY){A′hj(FY*)},     (26)
    where γ=γ1/γ2. This implies that condition (18) holds. If ρj(FY) is a distinct root, then equation (26) also implies that hj(FY)=ajA′hj(FY*) for some scalar aj≠0, and so
        hj(FY*)′Y* = hj(FY*)′(AY+b) = aj−1hj(FY)′Y + hj(FY*)′b,
    which completes the proof for theorem 1. Consider now the case of a multiple root, say ρ(k)=ρj1(FY)=…=ρj2(FY), where j2=j1+pk−1, and let H(k)(F)=(hj1(F),…,hj2(F)). As a consequence of a multiple root, the exact choice of H(k)(F) is somewhat arbitrary unless some rule is specified about how to choose its columns. However, the span of H(k)(F) is uniquely defined and so, no matter what rule we use to define H(k)(F), equation (26) implies that A′H(k)(FY*)=H(k)(FY)Bk for some non‐singular matrix Bk. This implies that
        H(k)(FY*)′Y* = Bk′H(k)(FY)′Y + H(k)(FY*)′b,
    which completes the proof for theorem 2.

    A.2. Proof of theorems 3 and 4

    Since theorem 3 is a special case of theorem 4, it is only necessary to prove the latter. Using the notation of theorem 4, let M0=M−μk1k′, where M=(μ1,…,μk) and 1k ∈ ℜk is a vector of 1s. Since M0 has rank q, the triangular decomposition for matrices gives
    image
    with P being an orthogonal matrix of order p and Tu being an upper triangular matrix of order q. The distribution of X=PΓ−1/2(Y−μk) is then a mixture of k spherical distributions with centres t1,…,tk, where tq+1=…=tk=0, and spread functions gi, i=1,…,k, i.e. the density of X is given by
    image
    The distributions of X and QX are thus the same for any orthogonal Q of the form
    image
    where Iq is the identity matrix of order q and Q22 is an orthogonal matrix of order p−q. Thus, given a scatter functional V(F) satisfying condition (3), V(FX)=V(FQX)∝QV(FX)Q′, for any such Q, and so
    image(27)
    for any orthogonal matrix Q22. Note that equality holds in expression (27) rather than just proportionality since the upper block diagonal matrices are equal (and non‐zero). By making appropriate choices for Q22 in expression (27) we obtain V12(FX)=0 and V22(FX)=γIp−q, for some γ>0. Thus, for the two scatter functionals V1(F) and V2(F),
    image

    This matrix has at least one root with multiplicity greater than or equal to p−q. By theorem 2, we know that the roots of V1(FY)−1V2(FY) are proportional to the roots of V1(FX)−1V2(FX), and so at least one of the roots ρ(j) has a multiplicity that is greater than or equal to p−q.

    Suppose now that no root has multiplicity greater than p−q, which by theorem 2 applies to V1(FX)−1V2(FX) as well as to V1(FY)−1V2(FY). For V1(FX)−1V2(FX), one root with multiplicity p−q must be γ2/γ1. Also, the q‐dimensional subspace that is spanned by the eigenvectors of V1(FX)−1V2(FX), other than those associated with γ2/γ1, is the same as the subspace that is spanned by (Iq 0)′, or equivalently it is the same as the subspace that is spanned by T. From the shape equivariant property (3), we have V1(FX)−1V2(FX)∝PΓ1/2V1(FY)−1V2(FY)Γ−1/2P′, and so it follows that if a is an eigenvector of V1(FX)−1V2(FX) then h=Γ−1/2P′a is an eigenvector of V1(FY)−1V2(FY). If the eigenvector a is associated with the root γ2/γ1, then h is associated with some root, say ρ(t), with multiplicity p−q. The subspace that is spanned by all the eigenvectors of V1(FY)−1V2(FY), other than those associated with ρ(t), is thus the same as the subspace that is spanned by Γ−1/2P′T, and hence equation (22) holds.

    A.3. Proof of theorem 5

    The symmetry of X, along with X having independent components, implies that X and SX have the same distribution for any diagonal matrix S having only 1s and −1s as entries, i.e. for matrices of the form S=diag(s1,…,sp) with sj=±1. So, for any scatter functional V(F) which satisfies the shape equivariant property (3), it follows that V(FX)=V(FSX)=SV(FX)S, for any such S. The last equality follows from property (3) since the diagonal components of V(FX) and SV(FX)S are the same. By choosing S=diag(−1,1,…,1), we note that all the off‐diagonal terms in the first row and in the first column of V(FX) must be 0. Continuing in this fashion, we conclude that V(FX) is a diagonal matrix.
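
    As a concrete illustration of the sign‐change step, for p=2 and S=diag(−1, 1) the equality of V(FX) and SV(FX)S reads

        \[
          \begin{pmatrix} v_{11} & v_{12} \\ v_{12} & v_{22} \end{pmatrix}
          \;=\;
          \begin{pmatrix} v_{11} & -v_{12} \\ -v_{12} & v_{22} \end{pmatrix},
        \]

    which forces v12=0; the general argument applies the same reasoning one co‐ordinate at a time.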

    For the two scatter functionals V1(F) and V2(F), V1(FX)−1V2(FX) is a diagonal matrix. By theorem 1, it follows that inline image, where Δ(FY)=diag{ρ1(FY),…,ρp(FY)} and P is a permutation matrix. Using property (3) again gives
    image
    which by the spectral value decomposition (13) implies that inline image for some non‐singular diagonal matrix inline image. The theorem then follows since
    image

    A.4. Proof of theorem 6

    It follows immediately that, if V(F) is a scatter functional satisfying definition 1, then V(FX) is a diagonal matrix. The remainder of the proof is then identical to the proof of theorem 5.

    A.5. Proof of theorem 7

    By equivariance, we can assume without loss of generality that μ=0. The proof is then given by the first part of the proof to theorem 5.

    A.6. Proof of theorem 8

    The proof of theorem 8 is analogous to the proof of theorems 5 and 6. The only difference is that the matrix inline image at the end of the proof is not necessarily a diagonal matrix, but rather a block diagonal matrix with diagonal blocks of order p1,…,pm.

    A.7. Proof of theorem 9

    The proof of theorem 9 is a generalization of the proof for theorem 5. For this proof, a reference to the blocks of a matrix of order p refers to the partitioning of the matrix into blocks of dimension pi×pj for i,j=1,…,m. The symmetry condition on X, along with the assumption that X has mutually independent subvectors, implies that X and SX have the same distribution for any block diagonal matrix S having diagonal blocks of the form ±Ipk, k=1,…,m. So, for any scatter functional V(F) which satisfies the shape equivariant property (3), it follows that V(FX)=V(FSX)=SV(FX)S, for any such S. The last equality follows from property (3) since the block diagonal components of V(FX) and SV(FX)S are the same. By choosing the first diagonal block of S to be −Ip1 and the other diagonal blocks to be Ipk for k=2,…,m, and then continuing in this fashion, we conclude that V(FX) is a block diagonal matrix.

    For the two scatter functionals V1(F) and V2(F), V1(FX)−1V2(FX) is a block diagonal matrix. Applying the spectral value decomposition (13) to the jth block diagonal element gives HjΔjHj−1, with Δj being a diagonal matrix of order pj for j=1,…,m. Let Δ be the diagonal matrix of order p with diagonal blocks Δj, and let H be the block diagonal matrix of order p with diagonal blocks Hj. Thus, V1(FX)−1V2(FX)=HΔH−1. It follows from theorem 2 that inline image, where Δ(FY)∝diag{ρ1(FY),…,ρp(FY)} and P is a block permutation matrix. Applying property (3) again gives
    image
    Comparing this with the spectral value decomposition (13) for V1(FY)−1V2(FY) gives inline image for some non‐singular diagonal matrix inline image. Thus,
    image
    where inline image. Since inline image, it then follows that
    image
    with inline image, for j=1,…,m. Hence, theorem 9 holds.
