Large covariance estimation by thresholding principal orthogonal complements
Summary
The paper deals with the estimation of a high dimensional covariance with a conditional sparsity structure and fast diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross‐sectional correlation even after taking out common but unobservable factors. We introduce the principal orthogonal complement thresholding method ‘POET’ to explore such an approximate factor structure with sparsity. The POET‐estimator includes the sample covariance matrix, the factor‐based covariance matrix, the thresholding estimator and the adaptive thresholding estimator as specific examples. We provide mathematical insights when the factor analysis is approximately the same as the principal component analysis for high dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the effect of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.
1. Introduction
Information and technology make large data sets widely available for scientific discovery. Much statistical analysis of such high dimensional data involves the estimation of a covariance matrix or its inverse (the precision matrix). Examples include portfolio management and risk assessment (Fan et al., 2008), high dimensional classification such as the Fisher discriminant (Hastie et al., 2009), graphical models (Meinshausen and Bühlmann, 2006), statistical inference such as controlling false discoveries in multiple testing (Leek and Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap et al., 2009; Xiong et al., 2011) and testing the capital asset pricing model (Sentana, 2009), among others. See Section 5. for some of those applications. Yet the dimensionality is often either comparable with the sample size or even larger. In such cases, the sample covariance is known to have poor performance (Johnstone, 2001), and some regularization is needed.
Realizing the importance of estimating large covariance matrices and the challenges that are brought by the high dimensionality, in recent years researchers have proposed various regularization techniques to estimate Σ consistently. One of the key assumptions is that the covariance matrix is sparse, namely many entries are 0 or nearly so (Bickel and Levina, 2008; Rothman et al., 2009; Lam and Fan, 2009; Cai and Zhou, 2012; Cai and Liu, 2011). In many applications, however, the sparsity assumption directly on Σ is not appropriate. For example, financial returns depend on the equity market risks, housing prices depend on the economic health and gene expressions can be stimulated by cytokines, among others. Because of the presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative method is to assume a factor model structure, as in Fan et al. (2008). However, they restrict themselves to the strict factor models with known factors.
In this paper, we consider the approximate factor model
y_it = b_i'f_t + u_it. (1.1)
Here y_it is the observed response for the ith (i=1,…,p) individual at time t=1,…,T, b_i is a vector of factor loadings, f_t is a K×1 vector of common factors and u_it is the error term, which is usually called the idiosyncratic component, uncorrelated with f_t. Both p and T diverge to ∞, whereas K is assumed fixed throughout the paper, and p is possibly much larger than T.
In model (1.1), only y_it is observable. It is intuitively clear that the unknown common factors can only be inferred reliably when there are sufficiently many cases, i.e. p→∞. In a data rich environment, p can diverge at a rate that is faster than T. The factor model (1.1) can be put in a matrix form as
y_t = Bf_t + u_t, (1.2)
where y_t = (y_1t,…,y_pt)', B = (b_1,…,b_p)' and u_t = (u_1t,…,u_pt)'. We are interested in Σ, the p×p covariance matrix of y_t, and its inverse, which are assumed to be time invariant. Under model (1.1), Σ is given by
Σ = B cov(f_t) B' + Σ_u, (1.3)
where Σ_u = (σ_u,ij)_{p×p} is the covariance matrix of u_t. The literature on approximate factor models typically assumes that the first K eigenvalues of B cov(f_t) B' diverge at rate O(p), whereas all the eigenvalues of Σ_u are bounded as p→∞. This assumption holds easily when the factors are pervasive in the sense that a non‐negligible fraction of factor loadings should be non‐vanishing. The decomposition (1.3) is then asymptotically identified as p→∞. In addition to it, in this paper we assume that Σ_u is approximately sparse as in Bickel and Levina (2008) and Rothman et al. (2009): for some q ∈ [0,1), the quantity
m_p = max_{i≤p} Σ_{j≤p} |σ_u,ij|^q
does not grow too quickly as p→∞. In particular, when q=0, m_p is the maximum number of non‐zero elements in each row.
The conditional sparsity structure of form (1.2) was explored by Fan et al. (2011a) in estimating the covariance matrix, when the factors f_t are observable. This allows them to use regression analysis to estimate Σ_u. This paper deals with the situation in which the factors are unobservable and must be inferred. Our approach is simple and optimization free and it uses the data only through the sample covariance matrix. Run the singular value decomposition on the sample covariance matrix of y_t, keep the covariance matrix that is formed by the first K principal components and apply the thresholding procedure to the remaining covariance matrix. This results in a principal orthogonal complement thresholding estimator POET. When the number of common factors K is unknown, it can be estimated from the data. See Section 2. for additional details. We shall investigate various properties of POET under the assumption that the data are serially dependent, which includes independent observations as a specific example. The rates of convergence under various norms for both estimated Σ and Σ_u and their precision (inverse) matrices will be derived. We show that the effect of estimating the unknown factors on the rate of convergence vanishes when p log(p)≫T and, in particular, the rate of convergence for the estimator of Σ_u achieves the optimal rate in Cai and Zhou (2012).
This paper focuses on the high dimensional static factor model (1.2), which is innately related to the principal component analysis (PCA), as clarified in Section 2. This feature makes it different from the classical factor model with fixed dimensionality (e.g. Lawley and Maxwell (1971)). In the last decade, much theory on the estimation and inference of the static factor model has been developed, e.g. Stock and Watson (1998, 2002), Bai and Ng (2002), Bai (2003) and Doz et al. (2011), among others. Our contribution is on the estimation of covariance matrices and their inverse in large factor models.
The static model that is considered in this paper is to be distinguished from the dynamic factor model as in Forni et al. (2000); the latter allows y_t to depend also on f_t with lags in time. Their approach is based on the eigenvalues and principal components of spectral density matrices, and on the frequency domain analysis. Moreover, as shown in Forni and Lippi (2001), the dynamic factor model does not really impose a restriction on the data‐generating process, and the assumption of idiosyncrasy (in their terminology, a p‐dimensional process is idiosyncratic if all the eigenvalues of its spectral density matrix remain bounded as p→∞) asymptotically identifies the decomposition of y_t into the common component and idiosyncratic error. The literature includes, for example, Forni et al. (2000, 2004), Forni and Lippi (2001), Hallin and Liška (2007, 2011) and many other references therein. Above all, both the static and the dynamic factor models are receiving increasing attention in applications of many fields where information usually is scattered through a (very) large number of interrelated time series.
There has been an extensive literature in recent years that deals with sparse principal components, which have been widely used to enhance the convergence of the principal components in high dimensional space. d'Aspremont et al. (2008), Shen and Huang (2008), Witten et al. (2009) and Ma (2013) proposed and studied various algorithms for computations. More literature on sparse PCA is found in Johnstone and Lu (2009), Amini and Wainwright (2009), Zhang and El Ghaoui (2011) and Birnbaum et al. (2012), among others. In addition, there has also been a growing literature that theoretically studies the recovery from a low rank plus sparse matrix estimation problem; see, for example, Wright et al. (2009), Lin et al. (2008), Candès et al. (2011), Luo (2011), Agarwal et al. (2012) and Pati et al. (2012). This line of work corresponds to the identifiability issue of our problem.
There is a big difference between our model and those considered in the aforementioned literature. In the current paper, the first K eigenvalues of Σ are spiked and grow at a rate O(p), whereas the eigenvalues of the matrices that have been studied in the existing literature on covariance estimation are usually assumed to be either bounded or slowly growing. Because of this distinctive feature, the common components and the idiosyncratic components can be identified and, in addition, PCA on the sample covariance matrix can consistently estimate the space that is spanned by the eigenvectors of Σ. The existing methods of either thresholding directly or solving a constrained optimization method can fail in the presence of very spiked principal eigenvalues. However, there is a price to pay here: as the first K eigenvalues are ‘too spiked’, one can hardly obtain a satisfactory rate of convergence for estimating Σ in absolute terms, but it can be estimated accurately in relative terms (see Section 3.3. for details). In addition, Σ⁻¹, Σ_u and Σ_u⁻¹ can be estimated accurately.
We would like to note further that the low rank plus sparse representation of our model is on the population covariance matrix, whereas Candès et al. (2011), Wright et al. (2009) and Lin et al. (2009) considered such a representation on the data matrix. (We thank a referee for reminding us about these related works.) As there is no Σ to estimate, their goal is limited to producing a low rank plus sparse matrix decomposition of the data matrix, which corresponds to the identifiability issue of our study, and does not involve estimation and inference. In contrast, our ultimate goal is to estimate the population covariance matrices as well as the precision matrices. For this, we require the idiosyncratic components and common factors to be uncorrelated and the data‐generating process to be strictly stationary. The covariances that are considered in this paper are constant over time, though slow time varying covariance matrices are applicable through localization in time (time domain smoothing). Our consistency result on the estimator of Σ_u demonstrates that decomposition (1.3) is identifiable, and hence our results also shed light on the ‘surprising phenomenon’ of Candès et al. (2011) that one can separate fully a sparse matrix from a low rank matrix when only the sum of these two components is available.
The rest of the paper is organized as follows. Section 2. gives our estimation procedures and builds the relationship between the PCA and the factor analysis in high dimensional space. Section 3. provides the asymptotic theory for various estimated quantities. Section 4. illustrates how to choose the thresholds by using cross‐validation and guarantees the positive definiteness in any finite sample. Specific applications of regularized covariance matrices are given in Section 5. Numerical results are reported in Section 6. Finally, Section 7. presents a real data application on portfolio allocation. All proofs are given in Appendix A. Throughout the paper, we use λ_min(A) and λ_max(A) to denote the minimum and maximum eigenvalues of a matrix A. We also denote by ‖A‖_F, ‖A‖, ‖A‖_1 and ‖A‖_max the Frobenius norm, spectral norm (also called the operator norm), L_1‐norm and elementwise norm of a matrix A, defined respectively by ‖A‖_F = tr^{1/2}(A'A), ‖A‖ = λ_max^{1/2}(A'A), ‖A‖_1 = max_j Σ_i |a_ij| and ‖A‖_max = max_{i,j} |a_ij|. When A is a vector, both ‖A‖_F and ‖A‖ are equal to the Euclidean norm. Finally, for two sequences, we write a_T ≪ b_T if a_T = o(b_T), and a_T ≍ b_T if a_T = O(b_T) and b_T = O(a_T).
The programs that were used to analyse the data can be obtained from
2. Regularized covariance matrix via principal components analysis
The goals of this section are threefold:
- to understand the relationship between PCA and high dimensional factor analysis;
- to estimate both covariance matrices Σ and the idiosyncratic Σ_u and their precision matrices in the presence of common factors;
- to investigate the effect of estimating the unknown factors on the covariance estimation.
The propositions in Section 2.1. show that the space that is spanned by the principal components at the population level Σ is close to the space that is spanned by the columns of the factor loading matrix B.
2.1. High dimensional principal components analysis and factor model

Consider model (1.2). The number of common factors, K, is small compared with p and T, and thus is assumed to be fixed throughout the paper. In the model, the only observable variable is the data y_t. One of the distinctive features of the factor model is that the principal eigenvalues of Σ are no longer bounded, but growing fast with the dimensionality. We illustrate this in the following example.
2.1.1. Example 1
Consider a single‐factor model y_it = b_i f_t + u_it, where b_i is a scalar loading. Suppose that the factor is pervasive in the sense that it has a non‐negligible effect on a non‐vanishing proportion of outcomes. It is then reasonable to assume that p⁻¹ Σ_{i≤p} b_i² > c for some c>0. Therefore, assuming that var(f_t)=1, an application of decomposition (1.3) yields
λ_max(Σ) ≥ Σ_{i≤p} b_i² − ‖Σ_u‖ ≥ cp − O(1) → ∞ as p→∞.
We now elucidate why PCA can be used for the factor analysis in the presence of spiked eigenvalues. Write B = (b_1,…,b_p)' as the p×K loading matrix. Note that the linear space that is spanned by the first K principal components of B cov(f_t) B' is the same as that spanned by the columns of B when cov(f_t) is non‐degenerate. Thus, we can assume without loss of generality that the columns of B are orthogonal and cov(f_t) = I_K, the identity matrix. This canonical form corresponds to the identifiability condition in decomposition (1.3). Let b̃_1,…,b̃_K be the columns of B, ordered such that {‖b̃_j‖}_{j≤K} is in a non‐increasing order. Then {b̃_j/‖b̃_j‖}_{j≤K} are eigenvectors of the matrix BB' with eigenvalues {‖b̃_j‖²}_{j≤K} and the rest 0. We shall impose the pervasiveness assumption that all eigenvalues of the K×K matrix p⁻¹B'B are bounded away from 0, which holds if the factor loadings {b_i}_{i≤p} are independent realizations from a non‐degenerate population. Since the non‐vanishing eigenvalues of the matrix BB' are the same as those of B'B, from the pervasiveness assumption it follows that {‖b̃_j‖²}_{j≤K} are all growing at rate O(p).
Let λ_1 ≥ λ_2 ≥ … ≥ λ_p be the eigenvalues of Σ in a descending order and ξ_1,…,ξ_p be their corresponding eigenvectors. Then, an application of Weyl's eigenvalue theorem (see Appendix A) yields the following proposition.
Proposition 1. Assume that the eigenvalues of p⁻¹B'B are bounded away from 0 for all large p. For the factor model (1.3) with the canonical condition
cov(f_t) = I_K and B'B is diagonal, (2.1)
we have |λ_j − ‖b̃_j‖²| ≤ ‖Σ_u‖ for j ≤ K, and |λ_j| ≤ ‖Σ_u‖ for j > K. In addition, for j ≤ K, λ_j grows at rate O(p).
Using proposition 1 and the sin(θ) theorem of Davis and Kahan (1970) (see their appendix), we have the following proposition.
Proposition 2. Under the assumptions of proposition 1, if {‖b̃_j‖}_{j≤K} are distinct, then
‖ξ_j − b̃_j/‖b̃_j‖‖ = O(p⁻¹‖Σ_u‖) for j ≤ K.
Propositions 1 and 2 state that PCA and the factor analysis are approximately the same if ‖Σ_u‖ = o(p). This is assured through a sparsity condition on Σ_u, which is frequently measured through
m_p = max_{i≤p} Σ_{j≤p} |σ_u,ij|^q, (2.2)
for some q ∈ [0,1). Note that ‖Σ_u‖ ≤ ‖Σ_u‖_1 ≤ m_p max_{i≤p} (σ_u,ii)^{1−q}, where the diagonal entries σ_u,ii are bounded. Therefore, when m_p = o(p), proposition 1 implies that we have distinguished eigenvalues between the principal components {λ_j}_{j≤K} and the rest of the components {λ_j}_{j>K}, and proposition 2 ensures that the first K principal components are approximately the same as the columns of the factor loadings.
The aforementioned sparsity assumption appears reasonable in empirical applications. Boivin and Ng (2006) conducted an empirical study and showed that imposing zero correlation between weakly correlated idiosyncratic components improves the forecast. (We thank a referee for this interesting reference.) More recently, Phan (2012) empirically estimated the level of sparsity of the idiosyncratic covariance by using UK market data.
In summary, under the pervasiveness assumption and the sparsity condition (2.2) with m_p = o(p), the first K eigenvalues of Σ grow at rate O(p), whereas the remaining eigenvalues are bounded by ‖Σ_u‖:
λ_j ≍ p for j ≤ K, and λ_j ≤ ‖Σ_u‖ for j > K, (2.3)
and the leading eigenvectors {ξ_j}_{j≤K} are close to the normalized vectors {b̃_j/‖b̃_j‖}_{j≤K} when p→∞. This provides the mathematics for using the first K principal components as a proxy for the space that is spanned by the columns of the factor loading matrix B. In addition, because of condition (2.3), the signals of the first K eigenvalues are stronger than those of the spiked covariance model that was considered by Jung and Marron (2009) and Birnbaum et al. (2012). Therefore, our other conditions for the consistency of principal components at the population level are much weaker than those in the spiked covariance literature. However, this also shows that, under our setting, PCA is a valid approximation to factor analysis only if p→∞. The fact that PCA on the sample covariance is inconsistent when p is bounded has also previously been demonstrated in the literature (see, for example, Bai (2003)).
With assumption (2.3), the standard literature on approximate factor models has shown that PCA on the sample covariance matrix Σ̂_sam can consistently estimate the space that is spanned by the factor loadings (e.g. Stock and Watson (1998) and Bai (2003)). Our contribution in propositions 1 and 2 is that we connect the high dimensional factor model to the principal components and obtain the consistency of the spectrum at the population level Σ instead of the sample level Σ̂_sam. The spectral consistency also enhances the results in Chamberlain and Rothschild (1983). This provides the rationale behind the consistency results in the factor model literature.
2.2. Principal orthogonal complement thresholding
Let λ̂_1 ≥ λ̂_2 ≥ … ≥ λ̂_p be the ordered eigenvalues of the sample covariance matrix Σ̂_sam and {ξ̂_i}_{i≤p} be their corresponding eigenvectors. Then the sample covariance has the following spectral decomposition:
Σ̂_sam = Σ_{i≤K} λ̂_i ξ̂_i ξ̂_i' + R̂, (2.4)
where R̂ = Σ_{i>K} λ̂_i ξ̂_i ξ̂_i' = (r̂_ij)_{p×p} is the principal orthogonal complement, and K is the number of diverging eigenvalues of Σ. Let us first assume that K is known.
We now apply thresholding on R̂. Define
Σ̂_u^T = (σ̂_ij^T)_{p×p}, with σ̂_ij^T = r̂_ii if i=j, and σ̂_ij^T = s_ij(r̂_ij) if i≠j, (2.5)
where s_ij(·) is a generalized shrinkage function of Antoniadis and Fan (2001), employed by Rothman et al. (2009) and Cai and Liu (2011), and τ_ij > 0 is an entry‐dependent threshold. In particular, the hard thresholding rule s_ij(x) = x 1(|x| ≥ τ_ij) (Bickel and Levina, 2008) and the constant thresholding parameter τ_ij = τ are allowed. In practice, it is more desirable to have τ_ij entry adaptive. An example of the adaptive thresholding is
τ_ij = τ (r̂_ii r̂_jj)^{1/2}, (2.6)
where r̂_ii is the ith diagonal element of R̂. This corresponds to applying the thresholding with parameter τ to the correlation matrix of R̂.
The POET estimator is then defined as
Σ̂ = Σ_{i≤K} λ̂_i ξ̂_i ξ̂_i' + Σ̂_u^T. (2.7)
We shall call this estimator the principal orthogonal complement thresholding estimator POET. It is obtained by thresholding the remaining components of the sample covariance matrix, after taking out the first K principal components. One of the attractive features of POET is that it is optimization free and hence is computationally appealing. (We have written an R package for POET, which outputs the estimated Σ, Σ_u, K, the factors and the loadings.)
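To make the construction concrete, the following is a minimal numpy sketch of estimator (2.7), assuming a centred p×T data matrix, hard thresholding and the correlation‐based threshold (2.6); it is an illustration of the procedure, not the authors' R package, and the function name poet is ours.

```python
import numpy as np

def poet(Y, K, tau):
    """Sketch of the POET estimator (2.7) for a centred p x T data matrix Y,
    with K factors and hard thresholding at the adaptive level (2.6)."""
    p, T = Y.shape
    S = Y @ Y.T / T                       # sample covariance matrix
    lam, xi = np.linalg.eigh(S)           # eigenvalues in ascending order
    lam, xi = lam[::-1], xi[:, ::-1]      # reorder to descending
    low_rank = (xi[:, :K] * lam[:K]) @ xi[:, :K].T   # first K principal components
    R = S - low_rank                      # principal orthogonal complement (2.4)
    tau_ij = tau * np.sqrt(np.outer(np.diag(R), np.diag(R)))   # threshold (2.6)
    R_thr = np.where(np.abs(R) >= tau_ij, R, 0.0)    # hard thresholding
    np.fill_diagonal(R_thr, np.diag(R))   # the diagonal is not thresholded
    return low_rank + R_thr               # POET estimator (2.7)
```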
With the choice of τ_ij in expression (2.6) and the hard thresholding rule, our estimator encompasses many popular estimators as its specific cases. When τ=0, the estimator is the sample covariance matrix and, when τ=1, the estimator becomes that based on the strict factor model (Fan et al., 2008). When K=0, our estimator is the same as the thresholding estimator of Bickel and Levina (2008) and (with a more general thresholding function) Rothman et al. (2009) or the adaptive thresholding estimator of Cai and Liu (2011) with a proper choice of τ_ij.
In practice, the number of diverging eigenvalues (or common factors) can be estimated on the basis of the sample covariance matrix. Determining K in a data‐driven way is an important topic and is well understood in the literature. We shall describe the estimator POET with a data‐driven K in Section 2.4.
2.3. Least squares point of view
The POET estimator (2.7) has an equivalent representation by using a constrained least squares method. The least squares method seeks b̂_i and f̂_t such that
(b̂_i, f̂_t) = arg min_{b_i, f_t} Σ_{i≤p} Σ_{t≤T} (y_it − b_i'f_t)², (2.8)
subject to the normalization
T⁻¹ Σ_{t≤T} f_t f_t' = I_K, and p⁻¹ Σ_{i≤p} b_i b_i' is diagonal. (2.9)
Here we assume that the mean of each variable has been removed, i.e. E(y_it) = 0 and E(f_jt) = 0 for all i≤p, j≤K and t≤T. Putting it in a matrix form, the optimization problem can be written as
arg min_{B,F} ‖Y − BF'‖_F², (2.10)
where Y = (y_1,…,y_T) is p×T, B = (b_1,…,b_p)' and F' = (f_1,…,f_T). For each given F, the least squares estimator of B is B̂ = T⁻¹YF, using the constraint (2.9) on the factors. Substituting this into problem (2.10), the objective function now becomes
‖Y − T⁻¹YFF'‖_F² = tr{(I_T − T⁻¹FF')Y'Y}.
The minimizer is now clear: the columns of F̂/√T are the eigenvectors corresponding to the K largest eigenvalues of the T×T matrix Y'Y and B̂ = T⁻¹YF̂ (see, for example, Stock and Watson (2002)).
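A minimal sketch of this least squares solution, under the same assumption of a centred p×T data matrix Y; the function and variable names are ours.

```python
import numpy as np

def pc_factors(Y, K):
    """Constrained least squares solution of (2.10): the columns of
    F_hat / sqrt(T) are the top-K eigenvectors of Y'Y, and B_hat = Y F_hat / T."""
    p, T = Y.shape
    eigval, eigvec = np.linalg.eigh(Y.T @ Y)        # T x T eigenproblem
    F_hat = np.sqrt(T) * eigvec[:, ::-1][:, :K]     # satisfies F'F / T = I_K
    B_hat = Y @ F_hat / T                           # p x K factor loadings
    U_hat = Y - B_hat @ F_hat.T                     # residuals u_hat_it
    return B_hat, F_hat, U_hat
```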
Given F̂ and B̂, let û_it = y_it − b̂_i'f̂_t. Then û_it consistently estimates the true u_it uniformly over i≤p and t≤T. Since Σ_u is assumed to be sparse, we can construct an estimator of Σ_u by using the adaptive thresholding method by Cai and Liu (2011) as follows. Let
σ̂_ij = T⁻¹ Σ_{t≤T} û_it û_jt and θ̂_ij = T⁻¹ Σ_{t≤T} (û_it û_jt − σ̂_ij)².
For some predetermined decreasing sequence ω_T > 0, and sufficiently large C>0, define the adaptive threshold parameter as
τ_ij = Cω_T √θ̂_ij.
The estimated idiosyncratic covariance estimator is then given by
Σ̂_u^T = (σ̂_ij^T)_{p×p}, with σ̂_ij^T = σ̂_ii if i=j, and σ̂_ij^T = s_ij(σ̂_ij) if i≠j, (2.11)
where the entry‐dependent shrinkage function s_ij(·) satisfies (see Antoniadis and Fan (2001))
s_ij(z) = 0 when |z| ≤ τ_ij, and |s_ij(z) − z| ≤ τ_ij.
It is easy to verify that s_ij(·) includes many interesting thresholding functions such as hard thresholding (s_ij(z) = z 1(|z| ≥ τ_ij)), soft thresholding (s_ij(z) = sgn(z)(|z| − τ_ij)_+), smoothly clipped absolute deviation and the adaptive lasso (see Rothman et al. (2009)).
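For illustration, sketches of three such shrinkage rules follow; the smoothly clipped absolute deviation form is taken to be the usual thresholding rule of Fan and Li (2001) with the conventional choice a = 3.7, which is an assumption on our part.

```python
import numpy as np

def hard(z, tau):
    """Hard thresholding: s(z) = z * 1(|z| >= tau)."""
    return np.where(np.abs(z) >= tau, z, 0.0)

def soft(z, tau):
    """Soft thresholding: s(z) = sgn(z) * (|z| - tau)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def scad(z, tau, a=3.7):
    """Smoothly clipped absolute deviation thresholding rule."""
    z = np.asarray(z, dtype=float)
    out = soft(z, tau)                              # region |z| <= 2 tau
    mid = (np.abs(z) > 2 * tau) & (np.abs(z) <= a * tau)
    out = np.where(mid, ((a - 1) * z - np.sign(z) * a * tau) / (a - 2), out)
    return np.where(np.abs(z) > a * tau, z, out)    # no shrinkage far from 0
```

All three satisfy s(z) = 0 for |z| ≤ τ and |s(z) − z| ≤ τ, the two defining properties above.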
Analogous to decomposition (1.3), we obtain the substitution estimators
Σ̂ = B̂B̂' + Σ̂_u^T, (2.12)
with Σ̂_u^T given by estimator (2.11). (2.13)
In practice, the true number of factors K might be unknown to us. However, for any determined K1 ≤ p, we can always construct either Σ̂^{K1} as in estimator (2.7) or Σ̂^{K1} as in estimator (2.12), with K replaced by K1, to estimate Σ. The following theorem shows that, for each given K1, the two estimators based on either regularized PCA or least squares substitution are equivalent. Similar results were obtained by Bai (2003) when K1 = K and no thresholding was imposed.
Theorem 1. Suppose that the entry‐dependent threshold in definition (2.5) is the same as the thresholding parameter that is used in expression (2.11). Then, for any K1 ≤ p, estimator (2.7) is equivalent to the substitution estimator (2.12), i.e. the regularized PCA estimator and the least squares substitution estimator with K1 factors coincide.
In this paper, we shall use a data‐driven K̂ to construct POET (see Section 2.4.), which has two equivalent representations according to theorem 1.
2.4. Principal orthogonal complement thresholding with unknown K
Determining the number of factors in a data‐driven way has been an important research topic in the econometrics literature. Bai and Ng (2002) proposed a consistent estimator as both p and T diverge. Other recent criteria have been proposed by Kapetanios (2010), Onatski (2010) and Alessi et al. (2010), among others.
Our method uses a data‐driven K̂ to estimate the covariance matrices. In principle, any procedure that gives a consistent estimate of K can be adopted. In this paper we apply the well‐known method in Bai and Ng (2002). It estimates K by
K̂ = arg min_{0≤K1≤M} log{(pT)⁻¹ ‖Y − T⁻¹YF̂_{K1}F̂_{K1}'‖_F²} + K1 g(T,p), (2.14)
where M is a prescribed upper bound, F̂_{K1} is a T×K1 matrix whose columns are √T times the eigenvectors corresponding to the K1 largest eigenvalues of the T×T matrix Y'Y and g(T,p) is a penalty function of (p,T) such that g(T,p)=o(1) and min{p,T} g(T,p)→∞. Two examples suggested by Bai and Ng (2002), IC1 and IC2, are respectively
g(T,p) = ((p+T)/(pT)) log{pT/(p+T)},
g(T,p) = ((p+T)/(pT)) log{min(p,T)}.
Let K̂ be the solution to problem (2.14) by using either IC1 or IC2. The asymptotic results are not affected regardless of the specific choice of g(T,p). We define the POET‐estimator with unknown K as
Σ̂ = Σ_{i≤K̂} λ̂_i ξ̂_i ξ̂_i' + Σ̂_u^T. (2.15)
The procedure is as stated in Section 2.2. except that K is now data driven.
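A compact sketch of criterion (2.14) with the IC1 penalty; the upper bound M = 8 is an arbitrary illustrative choice.

```python
import numpy as np

def estimate_K(Y, M=8):
    """Estimate the number of factors by criterion (2.14) with the IC1
    penalty of Bai and Ng (2002); Y is a centred p x T data matrix."""
    p, T = Y.shape
    _, eigvec = np.linalg.eigh(Y.T @ Y)
    eigvec = eigvec[:, ::-1]                          # descending order
    g = (p + T) / (p * T) * np.log(p * T / (p + T))   # IC1 penalty g(T, p)
    best_K, best_ic = 0, np.inf
    for K1 in range(M + 1):
        F = np.sqrt(T) * eigvec[:, :K1]               # T x K1 factor matrix
        resid = Y - (Y @ F) @ F.T / T                 # Y - (1/T) Y F F'
        ic = np.log(np.sum(resid ** 2) / (p * T)) + K1 * g
        if ic < best_ic:
            best_K, best_ic = K1, ic
    return best_K
```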
3. Asymptotic properties
3.1. Assumptions
This section presents the assumptions on model (1.2), in which only {y_t}_{t≤T} are observable. Recall the identifiability condition (2.1).
The first assumption has been one of the most essential in the literature of approximate factor models. Under this assumption and other regularity conditions, the number of factors, loadings and common factors can be consistently estimated (e.g. Stock and Watson (1998, 2002), Bai and Ng (2002) and Bai (2003)).
Assumption 1. All the eigenvalues of the K×K matrix p⁻¹B'B are bounded away from both 0 and ∞ as p→∞.
Remark 1.
- It is implied from proposition 1 in Section 2. that the first K eigenvalues of Σ grow at rate O(p). This unique feature distinguishes our work from most of other work on low rank plus sparse covariances that has been considered in the literature, e.g. Luo (2011), Pati et al. (2012), Agarwal et al. (2012) and Birnbaum et al. (2012). (To our best knowledge, the only other references that estimate large covariances with diverging eigenvalues (growing at the rate of dimensionality O(p)) are Fan et al. (2008, 2011a) and Bai and Shi (2011). Whereas Fan et al. (2008, 2011a) assumed that the factors are observable, Bai and Shi (2011) considered the strict factor model in which Σ_u is diagonal.)
- Assumption 1 requires the factors to be pervasive, i.e. to impact a non‐vanishing proportion of individual time series. See example 1 in Section 2.1.1 for its meaning. (It is important to distinguish the model that we consider in this paper from the ‘sparse factor model’ in the literature, e.g. Carvalho et al. (2008) and Pati et al. (2012), which assumes that the loading matrix B is sparse. The intuition of a sparse loading matrix is that each factor is related to only a relatively small number of stocks, assets, genes, etc. With B being sparse, all the eigenvalues of BB' and hence those of Σ are bounded.)
- As will be illustrated in Section 3.3. below, owing to the fast diverging eigenvalues, we can hardly achieve a good rate of convergence for estimating Σ under either the spectral norm or the Frobenius norm when p>T. This phenomenon arises naturally from the characteristics of the high dimensional factor model, which is another distinguishing feature compared with those convergence results in the existing literature.
Assumption 2.
- (a) {u_t, f_t}_{t≥1} is strictly stationary. In addition, E(u_it) = 0 and E(u_it f_jt) = 0 for all i≤p, j≤K and t≤T.
- (b) There are constants c_1, c_2 > 0 such that λ_min(Σ_u) > c_1, ‖Σ_u‖_1 < c_2 and min_{i≤p, j≤p} var(u_it u_jt) > c_1.
- (c) There are r_1, r_2 > 0 and b_1, b_2 > 0 such that, for any s>0, i≤p and j≤K,
P(|u_it| > s) ≤ exp{−(s/b_1)^{r_1}} and P(|f_jt| > s) ≤ exp{−(s/b_2)^{r_2}}.
Condition (a) requires strict stationarity as well as the non‐correlation between u_t and f_t. These conditions are slightly stronger than those in the literature, e.g. Bai (2003), but are still standard and simplify our technicalities. Condition (b) requires that Σ_u be well conditioned. The condition ‖Σ_u‖_1 < c_2 instead of a weaker condition λ_max(Σ_u) < c_2 is imposed here to estimate K consistently. But it is still standard in the approximate factor model literature as in Bai and Ng (2002), Bai (2003), etc. When K is known, such a condition can be removed. Fan et al. (2011b) shows that the results continue to hold for a growing (known) K under the weaker condition λ_max(Σ_u) < c_2. Condition (c) requires exponential‐type tails, which allow us to apply the large deviation theory to T⁻¹ Σ_{t≤T} u_it u_jt − σ_u,ij and T⁻¹ Σ_{t≤T} f_jt u_it.
Let F⁰₋∞ and F^∞_T denote the σ‐algebras that are generated by {(f_t, u_t): t ≤ 0} and {(f_t, u_t): t ≥ T} respectively. In addition, define the mixing coefficient
α(T) = sup_{A∈F⁰₋∞, B∈F^∞_T} |P(A)P(B) − P(A∩B)|. (3.1)
Assumption 3. (strong mixing.) There exists r_3 > 0 such that 3r_1⁻¹ + 1.5r_2⁻¹ + r_3⁻¹ > 1, and C>0 satisfying, for all positive integers T,
α(T) ≤ exp(−CT^{r_3}).
In addition, we impose the following regularity conditions.
Assumption 4. There exists M>0 such that, for all i≤p, t≤T and s≤T,
- (a) ‖b_i‖_max < M,
- (b) E[{p^{−1/2}(u_s'u_t − E u_s'u_t)}⁴] < M, and
- (c) E‖p^{−1/2} Σ_{i≤p} b_i u_it‖⁴ < M.
These conditions are needed to estimate consistently the transformed common factors as well as the factor loadings. Similar conditions were also assumed in Bai (2003) and Bai and Ng (2006). The number of factors is assumed to be fixed. Our conditions in assumption 4 are weaker than those in Bai (2003) as we focus on different aspects of the study.
3.2. Convergence of the idiosyncratic covariance
Estimating the covariance matrix Σ_u of the idiosyncratic components {u_t} is important for many statistical inferences. For example, it is needed for large sample inference of the unknown factors and their loadings, for testing the capital asset pricing model (Sentana, 2009) and large‐scale hypothesis testing (Fan et al., 2012). See Section 5.
We estimate Σ_u by thresholding the principal orthogonal complements after the first K̂ principal components of the sample covariance have been taken out, as in estimator (2.5) with K replaced by K̂. By theorem 1, it also has an equivalent expression given by estimator (2.11), with û_it = y_it − b̂_i'f̂_t. Throughout the paper, we apply the adaptive threshold
τ_ij = Cω_T √θ̂_ij, where ω_T = 1/√p + √{log(p)/T}, (3.2)
and C>0 is a sufficiently large constant. The rate ω_T reflects the uniform estimation error of the entries of the residual covariance. When direct observation of u_it is not available, the effect of estimating the unknown factors also contributes to this uniform estimation error, which is why the term 1/√p appears in the threshold.
The following theorem gives the rate of convergence of the estimated idiosyncratic covariance. Let γ⁻¹ = 3r_1⁻¹ + 1.5r_2⁻¹ + r_3⁻¹ + 1. In the convergence rate below, recall that m_p and q are defined in the measure of sparsity (2.2).
Theorem 2. Suppose that log(p) = o(T^{γ/6}), T = o(p²) and assumptions 1–4 hold. Then, for a sufficiently large constant C>0 in the threshold (3.2), the POET‐estimator Σ̂_u^T satisfies
‖Σ̂_u^T − Σ_u‖ = O_P(ω_T^{1−q} m_p).
If further ω_T^{1−q} m_p = o(1), then the eigenvalues of Σ̂_u^T are all bounded away from 0 with probability approaching 1, and
‖(Σ̂_u^T)⁻¹ − Σ_u⁻¹‖ = O_P(ω_T^{1−q} m_p).
When estimating Σ_u, p is allowed to grow exponentially fast in T, and Σ̂_u^T can be made consistent under the spectral norm. In addition, Σ̂_u^T is asymptotically invertible whereas the classical sample covariance matrix based on the residuals is not when p>T.
Remark 2.
- Consistent estimation of Σ_u indicates that Σ_u is identifiable in model (1.3), namely the sparse Σ_u can be separated perfectly from the low rank matrix there. The result here gives another proof (when assuming that ω_T^{1−q} m_p = o(1)) of the ‘surprising phenomenon’ in Candès et al. (2011) under different technical conditions.
- Fan et al. (2011a) recently showed that, when {f_t} are observable and q=0, the rate of convergence of the adaptive thresholding estimator is given by
‖Σ̂_u^T − Σ_u‖ = O_P{m_p √(log(p)/T)}.
- Hence, when the common factors are unobservable, the rate of convergence has an additional term of order m_p/√p, coming from the effect of estimating the unknown factors. This effect vanishes when p log(p)≫T, in which case the minimax rate as in Cai and Zhou (2012) is achieved. As p increases, more information about the common factors is collected, which results in more accurate estimation of the common factors {f_t}.
- When K is known and grows with p and T, with slightly weaker assumptions, Fan et al. (2011b) shows that, under the exactly sparse case (i.e. q=0), the result continues to hold, with a rate of convergence in which the dependence on K appears explicitly.
3.3. Convergence of POET
Since the first K eigenvalues of Σ grow with p, we can hardly estimate Σ with satisfactory accuracy in absolute terms. This problem does not arise from the limitation of any estimation method but is due to the nature of the high dimensional factor model. We illustrate this by using a simple example.
3.3.1. Example 2
Consider an ideal case where we know the spectrum except for the first eigenvector of Σ. Let {λ_j, ξ_j}_{j≤p} be the eigenvalues and vectors, and assume that the largest eigenvalue λ_1 ≥ cp for some c>0. Let ξ̂_1 be the estimated first eigenvector and define the covariance estimator
Σ̂ = λ_1 ξ̂_1 ξ̂_1' + Σ_{j≥2} λ_j ξ_j ξ_j'.
Assume that ξ̂_1 is a good estimator in the sense that ‖ξ̂_1 − ξ_1‖² = O_P(T⁻¹). However,
‖Σ̂ − Σ‖ = λ_1 ‖ξ̂_1ξ̂_1' − ξ_1ξ_1'‖ = O_P(λ_1 ‖ξ̂_1 − ξ_1‖) = O_P(p/√T),
which can diverge when T = O(p²).
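A quick simulation illustrating this point, under the stylized setting Σ = diag(p, 1, …, 1), so that λ_1 = p is the single spiked eigenvalue; the sizes p = 500 and T = 100 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T = 500, 100
lam = np.ones(p)
lam[0] = p                                       # Sigma = diag(p, 1, ..., 1)
X = rng.standard_normal((T, p)) * np.sqrt(lam)   # rows are N(0, Sigma) draws
S = X.T @ X / T                                  # sample covariance
lam1_hat = np.linalg.eigvalsh(S)[-1]             # leading sample eigenvalue
print(abs(lam1_hat / p - 1))    # relative error: small, of order 1/sqrt(T)
print(abs(lam1_hat - p))        # absolute error: of order p/sqrt(T), large
```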

In the presence of very spiked eigenvalues, although the covariance Σ cannot be consistently estimated in absolute terms, it can be well estimated in terms of the relative error matrix
Σ^{−1/2} Σ̂ Σ^{−1/2} − I_p, (3.3)
whose size can be quantified by either its spectral norm or the normalized Frobenius norm p^{−1/2}‖Σ^{−1/2}Σ̂Σ^{−1/2} − I_p‖_F, in which the factor p^{−1/2} plays the role of normalization. The loss (3.3) is closely related to the entropy loss, which was introduced by James and Stein (1961). Also note that
p^{−1/2}‖Σ^{−1/2}Σ̂Σ^{−1/2} − I_p‖_F = ‖Σ̂ − Σ‖_Σ,
where ‖A‖_Σ = p^{−1/2}‖Σ^{−1/2}AΣ^{−1/2}‖_F is the weighted quadratic norm in Fan et al. (2008).
Fan et al. (2008) showed that, in a large factor model, the sample covariance is such that ‖Σ̂_sam − Σ‖_Σ = O_P{√(p/T)}, which does not converge if p>T. In contrast, theorem 3 below shows that ‖Σ̂ − Σ‖_Σ can still be convergent as long as p = o(T²). Technically, the effect of high dimensionality on the convergence rate of Σ̂ − Σ is via the number of rows in B. We show in Appendix A that B appears in ‖Σ̂ − Σ‖_Σ through B'Σ⁻¹B, whose eigenvalues are bounded. Therefore it successfully cancels out the curse of high dimensionality that is introduced by B.
Compared with estimating Σ, in a large approximate factor model, we can estimate the precision matrix with a satisfactory rate under the spectral norm. The intuition follows from the fact that Σ⁻¹ has bounded eigenvalues.
The following theorem summarizes the rate of convergence under various norms.
Theorem 3. Under the assumptions of theorem 2 the POET‐estimator that is defined in equation (2.15) satisfies
‖Σ̂ − Σ‖_Σ = O_P{√p log(p)/T + m_p ω_T^{1−q}} and ‖Σ̂ − Σ‖_max = O_P(ω_T).
In addition, if m_p ω_T^{1−q} = o(1), then Σ̂ is non‐singular with probability approaching 1, with
‖Σ̂⁻¹ − Σ⁻¹‖ = O_P(m_p ω_T^{1−q}).
Remark 3.
- When estimating Σ⁻¹, p is allowed to grow exponentially fast in T, and the estimator has the same rate of convergence as that of the estimator Σ̂_u^T in theorem 2. When p becomes much larger than T, the precision matrix can be estimated at the same rate as if the factors were observable.
- As in remark 2, when K>0 is known and grows with p and T, Fan et al. (2011a) prove analogous results (when q=0). The results state explicitly the dependence of the rate of convergence on the number of factors. (The assumptions in Fan et al. (2011a) are slightly weaker than those presented here, in that they required that λ_max(Σ_u) instead of ‖Σ_u‖_1 be bounded.)
- The relative error ‖Σ^{−1/2}Σ̂Σ^{−1/2} − I_p‖ in operator norm can be shown to have the same order as the maximum relative error of estimated eigenvalues. It does not converge to 0 nor diverge. It is much smaller than ‖Σ̂ − Σ‖, which is of order p/√T (see example 2).
3.4. Convergence of unknown factors and factor loadings
Many applications of the factor model require estimating the unknown factors. In general, factor loadings in B and the common factors f_t are not separably identifiable, as, for any K×K matrix H such that H'H = I_K, Bf_t = BH'Hf_t. Hence (B, f_t) cannot be identified from (BH', Hf_t). Note that the linear space that is spanned by the rows of B is the same as that spanned by those of BH'. In practice, it often does not matter which is used.
Let V denote the K̂×K̂ diagonal matrix of the first K̂ largest eigenvalues of the sample covariance matrix in decreasing order. Recall that F̂' = (f̂_1,…,f̂_T) and define a K̂×K matrix
H = T⁻¹ V⁻¹ F̂'FB'B.
Then, for t≤T, f̂_t is an estimator of Hf_t. Note that
Hf_t = T⁻¹ V⁻¹ F̂'(FB')(Bf_t)
depends only on the data V⁻¹F̂' and an identifiable part of parameters {Bf_t: t≤T}. Therefore, there is no identifiability issue in Hf_t regardless of the identifiability condition imposed.
Bai (2003) obtained the rate of convergence for both b̂_i and f̂_t for any fixed (i,t). However, the uniform rate of convergence is more relevant for many applications (see example 3 in Section 5.). The following theorem extends those results in Bai (2003) in a uniformity sense. In particular, with a more refined technique, we have improved the uniform convergence rate for max_{t≤T} ‖f̂_t − Hf_t‖.
Theorem 4. Under the assumptions of theorem 2,
max_{i≤p} ‖b̂_i − Hb_i‖ = O_P(ω_T) and max_{t≤T} ‖f̂_t − Hf_t‖ = O_P(1/√T + T^{1/4}/√p).
As a consequence of theorem 4, we obtain the following corollary (recall that the constant r_2 is defined in assumption 2).
Corollary 1. Under the assumptions of theorem 2,
max_{i≤p, t≤T} |b̂_i'f̂_t − b_i'f_t| = O_P{(log T)^{1/r_2} ω_T + T^{1/4}/√p}.
The rates of convergence that were obtained above also explain the condition T = o(p²) in theorems 2 and 3. It is needed to estimate the common factors {f_t} uniformly in t≤T. When we do not observe {f_t}, in addition to the factor loadings, there are KT factors to estimate. Intuitively, the condition T = o(p²) requires the number of parameters that are introduced by the unknown factors to be ‘not too many’, so that we can consistently estimate them uniformly. Technically, as demonstrated by Bickel and Levina (2008), Cai and Liu (2011) and many others, achieving uniform accuracy is essential for large covariance estimations.
4. Choice of threshold
4.1. Finite sample positive definiteness
The threshold (3.2) is of the form Cω_T √θ̂_ij, where C is determined by the users. To make POET operational in practice, we must choose C to maintain the positive definiteness of the estimated covariances for any given finite sample. We write Σ̂ = Σ̂(C), where the covariance estimator depends on C via the threshold. We choose C in the range where λ_min{Σ̂(C)} > 0. Define
C_min = inf{C>0: λ_min{Σ̂(M)} > 0 for all M>C}. (4.1)
When C is sufficiently large, the estimator becomes diagonal, whereas its minimum eigenvalue must retain strict positivity. Thus, C_min is well defined and, for all C > C_min, Σ̂(C) is positive definite under finite samples. We can obtain C_min by solving λ_min{Σ̂(C)} = 0. We can also approximate C_min by plotting λ_min{Σ̂(C)} as a function of C, as illustrated in Fig. 1. In practice, we can choose C in the range (C_min + ɛ, M) for a small ɛ and sufficiently large M. Choosing the threshold in a range to guarantee the finite sample positive definiteness has also been previously suggested by Fryzlewicz (2012).
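A sketch of this search, reusing the poet sketch from Section 2.2 with the constant C playing the role of the threshold parameter; the grid bounds are illustrative choices.

```python
import numpy as np

def c_min(Y, K, grid=np.linspace(0.0, 3.0, 61)):
    """Approximate C_min in definition (4.1) by scanning the minimum
    eigenvalue of the POET estimator over a grid of threshold constants."""
    for C in grid:
        sigma_hat = poet(Y, K, C)                 # POET estimator at constant C
        if np.linalg.eigvalsh(sigma_hat)[0] > 0:  # positive definite?
            return C        # smallest grid value giving positive definiteness
    return grid[-1]         # fall back to the largest constant on the grid
```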

Fig. 1. λ_min{Σ̂(C)} as a function of C for three choices of thresholding rules, hard thresholding, soft thresholding and smoothly clipped absolute deviation (the plot is based on the simulated data set in Section 6.2)
4.2. Multifold cross‐validation
In practice, C can be data driven, and chosen through multifold cross‐validation. After obtaining the estimated residuals {û_t}_{t≤T} by PCA, we divide them randomly into two subsets, which are, for simplicity, denoted by {û_t}_{t∈J_1} and {û_t}_{t∈J_2}. The sizes of J_1 and J_2, which are denoted by T(J_1) and T(J_2), satisfy T(J_1) + T(J_2) = T. For example, in sparse matrix estimation, Bickel and Levina (2008) suggested the choice T(J_1) = T{1 − 1/log(T)}.
We denote by Σ̂_u^{T,C}(J_1) the POET‐estimator with the threshold Cω_T √θ̂_ij on the training data set {û_t}_{t∈J_1}. We also denote by Σ̂_u(J_2) the sample covariance based on the validation set, defined by
Σ̂_u(J_2) = T(J_2)⁻¹ Σ_{t∈J_2} û_t û_t'.
Then we choose the constant C* by minimizing a cross‐validation objective function over a compact interval:
C* = arg min_{C_min+ɛ ≤ C ≤ M} ‖Σ̂_u^{T,C}(J_1) − Σ̂_u(J_2)‖_F², (4.2)
where C_min is the minimum constant that guarantees the positive definiteness of Σ̂_u^{T,C}(J_1) for C > C_min as described in the previous subsection, and M is a large constant such that Σ̂_u^{T,M}(J_1) is diagonal. The resulting C* is data driven, so it depends on Y as well as p and T via the data. In contrast, for each given p×T data matrix Y, C* is a universal constant in the threshold C*ω_T √θ̂_ij in the sense that it does not change with respect to the position (i,j). We also note that the cross‐validation is based on the estimate of Σ_u rather than Σ because POET thresholds the error covariance matrix. Thus cross‐validation improves the performance of thresholding.
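A sketch of objective (4.2); here threshold_cov is a hypothetical helper that applies the adaptive threshold (3.2) with constant C to a covariance matrix, and the grids, split proportion and number of random splits are our illustrative choices.

```python
import numpy as np

def choose_C(U_hat, C_grid=np.linspace(0.1, 3.0, 30), n_splits=5, seed=0):
    """Multifold cross-validation (4.2) for the threshold constant C.
    U_hat is the p x T matrix of PCA residuals; threshold_cov(S, C) is
    assumed to apply the adaptive threshold (3.2) to the covariance S."""
    rng = np.random.default_rng(seed)
    p, T = U_hat.shape
    T1 = int(T * (1 - 1 / np.log(T)))   # training size, as in Bickel and Levina (2008)
    losses = np.zeros(len(C_grid))
    for _ in range(n_splits):           # average over random splits
        perm = rng.permutation(T)
        J1, J2 = perm[:T1], perm[T1:]
        S1 = U_hat[:, J1] @ U_hat[:, J1].T / len(J1)   # training covariance
        S2 = U_hat[:, J2] @ U_hat[:, J2].T / len(J2)   # validation covariance
        for k, C in enumerate(C_grid):
            losses[k] += np.sum((threshold_cov(S1, C) - S2) ** 2)
    return C_grid[np.argmin(losses)]    # C* minimizing the CV objective
```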
It is possible to derive the rate of convergence for the estimator with the data‐driven constant C* under the current model setting, but it ought to be much more technically involved than the regular sparse matrix estimation that was considered by Bickel and Levina (2008) and Cai and Liu (2011). To keep our presentation simple we do not pursue it in the current paper.
5. Applications of POET
We give four examples to which the results in theorems 2–4 can be applied. Detailed pursuits of these are beyond the scope of the paper.
5.1. Example 3 (large‐scale hypothesis testing)

Suppose that we wish to test simultaneously the hypotheses H_0i: μ_i = 0 against H_1i: μ_i ≠ 0, for i=1,…,p, and these test statistics Z = (Z_1,…,Z_p)' are jointly normal N(μ,Σ) where Σ is unknown. For a given critical value x, the false discovery proportion is then defined as FDP(x)=V(x)/R(x), where V(x) = #{i: μ_i = 0, |Z_i| ≥ x} and R(x) = #{i: |Z_i| ≥ x} are the total number of false discoveries and the total number of discoveries respectively. Our interest is to estimate FDP(x) for each given x. Note that R(x) is an observable quantity. Only V(x) needs to be estimated.
Suppose that the test statistics have the factor structure
Z = μ + Bf + u, (5.1)
where f are the common factors and the covariance matrix Σ_u of u is sparse. By the principal factor approximation (theorem 1, Fan et al. (2012)),
V(x) ≈ Σ_{i: μ_i = 0} [Φ{a_i(z_x + η_i)} + Φ{a_i(z_x − η_i)}], (5.2)
when p→∞ and the number of true significant hypotheses {i: μ_i ≠ 0} is o(p), where z_x is the upper x‐quantile of the standard normal distribution, η_i = b_i'f and a_i = (1 − ‖b_i‖²)^{−1/2}.
Now suppose that we have n repeated measurements from model (5.1). Then, by corollary 1, {η_i = b_i'f} can be uniformly consistently estimated, and hence V(x) and FDP(x) can be consistently estimated. Efron (2010) obtained these repeated test statistics on the basis of the bootstrap sample from the original raw data. Our theory (theorem 4) gives a formal justification to the framework of Efron (2007, 2010).
5.2. Example 4 (risk management)
The covariance matrix Σ appears in risk assessment as in Fan et al. (2012). For a fixed portfolio allocation vector w, the true portfolio variance and the estimated variance are given by w'Σw and w'Σ̂w respectively. The estimation error is bounded by
|w'Σ̂w − w'Σw| ≤ ‖Σ̂ − Σ‖_max ‖w‖_1²,
where ‖w‖_1, the L_1‐norm of w, is the gross exposure of the portfolio. Usually a constraint is placed on the total percentage of the short positions, in which case we have a restriction ‖w‖_1 ≤ c for some c>0. In particular, c=1 corresponds to a portfolio with no short positions (all weights are non‐negative). Theorem 3 quantifies the maximum approximation error via ‖Σ̂ − Σ‖_max.
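The bound is elementary; a small numerical check on arbitrary stand‐in inputs (the matrices and weights below are random placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 50
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                                  # stand-in true covariance
Sigma_hat = Sigma + 0.01 * rng.standard_normal((p, p))
Sigma_hat = (Sigma_hat + Sigma_hat.T) / 2            # symmetrize the estimator
w = rng.dirichlet(np.ones(p))                        # no-short-sale weights, ||w||_1 = 1
lhs = abs(w @ (Sigma_hat - Sigma) @ w)               # |w' Sigma_hat w - w' Sigma w|
rhs = np.abs(Sigma_hat - Sigma).max() * np.sum(np.abs(w)) ** 2
assert lhs <= rhs                                    # the gross exposure bound holds
```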

5.3. Example 5 (panel regression with a factor structure in the errors)

Consider the panel regression model
y_it = x_it'β + ε_it, with ε_it = b_i'f_t + u_it,
where x_it is a vector of observable regressors with fixed dimension. The regression error ε_it has a factor structure and is assumed to be independent of x_it, but b_i, f_t and u_it are all unobservable. We are interested in the common regression coefficients β. This panel regression model has been considered by many researchers, such as Ahn et al. (2001) and Pesaran (2006), and has broad applications in social sciences.
Although ordinary least squares produces a consistent estimator of β, a more efficient estimation can be obtained by generalized least squares. The generalized least squares method depends, however, on an estimator of Σ_ε⁻¹, the inverse of the covariance matrix of ε_t = (ε_1t,…,ε_pt)'. By assuming that the covariance matrix of u_t is sparse, we can successfully solve this problem by applying theorem 3. Although ε_it is unobservable, it can be replaced by the regression residuals ε̂_it, obtained via first regressing y_it on x_it. We then apply POET to the residuals {ε̂_it}. By theorem 3, the inverse of the resulting estimator is a consistent estimator of Σ_ε⁻¹ under the spectral norm. A slight difference lies in the fact that, when we apply POET, ε_it is replaced with ε̂_it, which introduces an additional term in the estimation error.
5.4. Example 6 (validating an asset pricing theory)
Suppose that the excess return y_it of firm i at time t follows model (1.1), in which f_t are the excess returns of the risk factors at time t. To test the validity of the pricing theory, we embed the model into the multivariate linear model
y_t = α + Bf_t + u_t, (5.3)
and test the null hypothesis H_0: α = 0. The F‐test statistic involves the estimation of the covariance matrix Σ_u, whose estimates are degenerate without regularization when p≥T. Therefore, in the literature (Sentana (2009), and references therein), the focus is on the case in which p is relatively small. The typical choices of parameters are T=60 monthly data points and the number of assets p=5, 10, 25. However, the capital asset pricing model should hold for all tradeable assets, not just a small fraction of assets. With our regularization technique, a non‐degenerate estimate Σ̂_u^T can be obtained and the F‐test or likelihood ratio test statistics can be employed even when p≫T.
Let α̂ be the least squares estimator of model (5.3). Then, when u_t is normal, α̂ ~ N(α, a_T Σ_u) for a constant a_T > 0 which depends on the observed factors. When Σ_u is known, the Wald test statistic is W = a_T⁻¹ α̂'Σ_u⁻¹α̂. When it is unknown and p is large, it is natural to use the F‐type of test statistic Ŵ = a_T⁻¹ α̂'(Σ̂_u^T)⁻¹α̂. The difference between these two statistics is bounded by
|Ŵ − W| ≤ a_T⁻¹ ‖(Σ̂_u^T)⁻¹ − Σ_u⁻¹‖ ‖α̂‖².
Since under the null hypothesis α = 0, we have a_T⁻¹‖α̂‖² = O_P(p). Thus, it follows from the boundedness of the above quantities that
|Ŵ − W| = O_P(p) ‖(Σ̂_u^T)⁻¹ − Σ_u⁻¹‖.
Theorem 2 provides the rate of convergence for this difference. Detailed development is out of the scope of the current paper, and we shall leave it as a separate research project (see Pesaran and Yamagata (2012)).
6. Monte Carlo experiments

The factor loadings are drawn from a trivariate normal distribution N_3(μ_B, Σ_B), the idiosyncratic errors from N_p(0, Σ_u), and the factor returns f_t follow a vector auto‐regressive VAR(1) model. To make the simulation more realistic, model parameters are calibrated from the financial returns, as detailed in the following section.
6.1. Calibration
The model parameters are calibrated by using excess daily returns of stocks and the Fama–French factor returns, which are computed for the period from January 1st, 2009, to December 31st, 2010. Here, we present a short outline of the calibration procedure.
- Given the excess returns as the input data, we fit a Fama–French three‐factor model and calculate a 100×3 loading matrix B̂ and a 500×3 factor matrix F̂, using the principal components method that was described in Section 3.1.
- We summarize the 100 factor loadings (the rows of B̂) by their sample mean vector μ_B and sample covariance matrix Σ_B, which are reported in Table 1. The factor loadings b_i for i=1,…,p are drawn from N_3(μ_B, Σ_B).
- We run the stationary vector auto‐regressive model f_t = μ + Φf_{t−1} + ε_t, which is a VAR(1) model, on the data F̂ to obtain the multivariate least squares estimator for μ and Φ, and we estimate Σ_ε = cov(ε_t). Note that all eigenvalues of Φ in Table 2 fall within the unit circle, so our model is stationary. The covariance matrix cov(f_t) can be obtained by solving the linear equation cov(f_t) = Φ cov(f_t)Φ' + Σ_ε (see the sketch after Table 2). The estimated parameters are depicted in Table 2 and are used to generate f_t.
- For each value of p, we generate a sparse covariance matrix Σ_u of the form Σ_u = DΣ_0D. Here, Σ_0 is the error correlation matrix, and D is the diagonal matrix of the standard deviations of the errors. We set D = diag(σ_1,…,σ_p), where each σ_i is generated independently from a gamma distribution G(α,β), and α and β are chosen to match the sample mean and sample standard deviation of the standard deviations of the errors. A similar approach to that of Fan et al. (2011a) has been used in this calibration step. The off‐diagonal entries of Σ_0 are generated independently from a normal distribution, with mean and standard deviation equal to the sample mean and sample standard deviation of the sample correlations between the estimated residuals, conditional on their absolute values being no larger than 0.95. We then employ hard thresholding to make Σ_0 sparse, where the threshold is found as the smallest constant that provides the positive definiteness of Σ_0. More precisely, start with threshold value 1, which gives Σ_0 = I_p, and then decrease the threshold values in a grid until positive definiteness is violated.
Table 1. Mean μ_B and covariance Σ_B of the calibrated factor loadings

| μ_B | Σ_B | | |
|---|---|---|---|
| 0.0047 | 0.0767 | −0.00004 | 0.0087 |
| 0.0007 | −0.00004 | 0.0841 | 0.0013 |
| −1.8078 | 0.0087 | 0.0013 | 0.1649 |
Table 2. Parameters of the f_t‐generating process

| μ | cov(f_t) | | | Φ | | |
|---|---|---|---|---|---|---|
| −0.0050 | 1.0037 | 0.0011 | −0.0009 | −0.0712 | 0.0468 | 0.1413 |
| 0.0335 | 0.0011 | 0.9999 | 0.0042 | −0.0764 | −0.0008 | 0.0646 |
| −0.0756 | −0.0009 | 0.0042 | 0.9973 | 0.0195 | −0.0071 | −0.0544 |
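For the calibration, the stationary covariance of the VAR(1) factors solves a discrete Lyapunov equation, and the sparse Σ_u is obtained by thresholding a random correlation matrix until positive definiteness fails; a sketch of both steps, with all distribution parameters as placeholders for the calibrated values, and with the conditioning on |correlation| ≤ 0.95 approximated by clipping:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def factor_cov(Phi, Sigma_eps):
    """Stationary covariance of f_t = mu + Phi f_{t-1} + eps_t, i.e. the
    solution of cov(f) = Phi cov(f) Phi' + Sigma_eps."""
    return solve_discrete_lyapunov(Phi, Sigma_eps)

def calibrated_sigma_u(p, alpha, beta, mu_r, sd_r, seed=0):
    """Sparse error covariance Sigma_u = D Sigma_0 D: gamma standard
    deviations, normal off-diagonal correlations, and the largest hard
    threshold that keeps Sigma_0 positive definite."""
    rng = np.random.default_rng(seed)
    D = np.diag(rng.gamma(alpha, beta, size=p))
    R = rng.normal(mu_r, sd_r, size=(p, p))
    R = np.clip((R + R.T) / 2, -0.95, 0.95)   # clipping stands in for conditioning
    np.fill_diagonal(R, 1.0)
    keep = np.eye(p)                          # threshold 1 gives the identity
    for thr in np.arange(1.0, -0.01, -0.05):  # decrease the threshold on a grid
        R_thr = np.where(np.abs(R) >= thr, R, 0.0)
        np.fill_diagonal(R_thr, 1.0)
        if np.linalg.eigvalsh(R_thr)[0] <= 0: # positive definiteness violated
            break
        keep = R_thr                          # last positive definite candidate
    return D @ keep @ D
```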
6.2. Simulation
For the simulation, we fix T=300, and let p increase from 1 to 600. For each fixed p, we repeat the following steps N=200 times, and record the means and the standard deviations of each respective norm.
- Step 1: generate independently {b_i}_{i≤p} from N_3(μ_B, Σ_B), and set B = (b_1,…,b_p)'.
- Step 2: generate independently {u_t}_{t≤T} from N_p(0, Σ_u).
- Step 3: generate {f_t}_{t≤T} as a vector auto‐regressive sequence of the form f_t = μ + Φf_{t−1} + ε_t.
- Step 4: calculate {y_t}_{t≤T} from model (1.2).
- Step 5: set hard thresholding with threshold 0.5ω_T √θ̂_ij. Estimate K by using IC1 of Bai and Ng (2002). Calculate covariance estimators by using POET. Calculate the sample covariance matrix Σ̂_sam.
In the graphs below, we plot the averages and standard deviations of the distance from Σ̂ and Σ̂_sam to the true covariance matrix Σ, under the norms ‖·‖_Σ, ‖·‖ and ‖·‖_max. We also plot the means and standard deviations of the distances from Σ̂⁻¹ and Σ̂_sam⁻¹ to Σ⁻¹ under the spectral norm. The dimensionality p ranges from 20 to 600 in increments of 20. Because of invertibility, the spectral norm for Σ̂_sam⁻¹ is plotted only up to p=280. Also, we zoom into these graphs by plotting the values of p from 1 to 100, this time in increments of 1. Note that we also plot the distance from the estimator with known factors to Σ for comparison; this is the estimated covariance matrix that was proposed by Fan et al. (2011a), assuming that the factors are observable.
6.3. Results
In a factor model, we expect POET to perform as well as the estimator with known factors when p is relatively large, since the effect of estimating the unknown factors should vanish as p increases. This is illustrated in the plots.
From the simulation results, reported in Figs 2–5, we observe that POET under the unobservable factor model performs just as well as the estimator in Fan et al. (2011a) if the factors are known, when p is sufficiently large. The cost of not knowing the factors is approximately of order 1/√p. It can be seen in Figs 2 and 3 that this cost vanishes for p≥200. To give a better insight into the effect of estimating the unknown factors for small p, a separate set of simulations is conducted for p≤100. As we can see from Figs 2(b) and 2(d) and 3(b), 3(c), 3(e) and 3(f), the effect decreases quickly. In addition, when estimating Σ⁻¹ and Σ_u⁻¹, it is difficult to distinguish the estimators with known and unknown factors, whose performances are quite stable compared with the sample covariance matrix. Also, the maximum absolute elementwise error (Fig. 4) of our estimator performs very similarly to that of the sample covariance matrix, which coincides with our asymptotic result. Fig. 5 shows that the performances of the three methods are indistinguishable in the spectral norm, as expected.

Fig. 2. Averages and standard deviations of the estimation errors of the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: (a), (c) p ranges in 20–600 with increment 20; (b), (d) p ranges in 1–100 with increment 1

Fig. 3. Averages and standard deviations of the estimation errors of the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: (a), (d) p ranges in 20–600 with increment 20; (b), (e) p ranges in 1–100 with increment 1; (c), (f) the same as (a) and (d) with the sample covariance curve omitted

Fig. 4. Averages of the maximum absolute elementwise errors of the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: they are nearly undifferentiable

Fig. 5. Averages of the estimation errors under the spectral norm, panels (a) and (b), of the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: the three curves are barely distinguishable in (a)
6.4. Robustness to the estimation of K
POET depends on the estimated number of factors. Our theory uses a consistent estimator K̂. To assess the robustness of our procedure to K̂ in finite samples, we calculate the POET estimator for K=1,2,…,10. Again, the threshold is fixed to be 0.5ω_T √θ̂_ij.
6.4.1. Design 1
The simulation set‐up is the same as before, where the true K=3. We calculate ‖Σ̂_u^K − Σ_u‖, ‖Σ̂^K − Σ‖_Σ, ‖(Σ̂^K)⁻¹ − Σ⁻¹‖ and ‖Σ̂^K − Σ‖_max for K=1,2,…,10. Fig. 6 plots these norms as p increases but with a fixed T=300. The results demonstrate a trend that is quite robust when K≥3; especially, the accuracy of estimation of the spectral norms for large p are close to each other. When K=1 or K=2, the estimators perform badly because of modelling bias. Therefore, POET is robust to overestimated K, but not to underestimation.

Fig. 6. Robustness of POET to the choice of K as p increases (design 1, T=300): (a)–(d) estimation errors under the norms listed above, each plotted for K=1,2,…,10
6.4.2. Design 2

We now simulate from a K=3 factor model, where the factors are independently simulated as standard normal variables and the idiosyncratic covariance Σ_u is a banded matrix.
Table 3 summarizes the average estimation error of covariance matrices across K in the spectral norm. Each simulation is replicated 50 times and T=200.
| | Errors for the following values of K: | | | | | | |
|---|---|---|---|---|---|---|---|
| | 1 | 2 | 3 | 4 | 5 | 6 | 8 |
| p=100 | | | | | | | |
| ‖Σ̂_u^K − Σ_u‖ | 10.70 | 5.23 | 1.63 | 1.80 | 1.91 | 2.04 | 2.22 |
| ‖(Σ̂_u^K)⁻¹ − Σ_u⁻¹‖ | 2.71 | 2.51 | 1.51 | 1.50 | 1.44 | 1.84 | 2.82 |
| ‖(Σ̂^K)⁻¹ − Σ⁻¹‖ | 2.69 | 2.48 | 1.47 | 1.49 | 1.41 | 1.56 | 2.35 |
| ‖Σ̂^K − Σ‖ | 94.66 | 91.36 | 29.41 | 31.45 | 30.91 | 33.59 | 33.48 |
| ‖Σ̂^K − Σ‖_Σ | 17.37 | 10.04 | 2.05 | 2.83 | 2.94 | 2.95 | 2.93 |
| p=200 | | | | | | | |
| ‖Σ̂_u^K − Σ_u‖ | 11.34 | 11.45 | 1.64 | 1.71 | 1.79 | 1.87 | 2.01 |
| ‖(Σ̂_u^K)⁻¹ − Σ_u⁻¹‖ | 2.69 | 3.91 | 1.57 | 1.56 | 1.81 | 2.26 | 3.42 |
| ‖(Σ̂^K)⁻¹ − Σ⁻¹‖ | 2.67 | 3.72 | 1.57 | 1.55 | 1.70 | 2.13 | 3.19 |
| ‖Σ̂^K − Σ‖ | 200.82 | 195.64 | 57.44 | 63.09 | 64.53 | 60.24 | 56.20 |
| ‖Σ̂^K − Σ‖_Σ | 20.86 | 14.22 | 3.29 | 4.52 | 4.72 | 4.69 | 4.76 |
| p=300 | | | | | | | |
| ‖Σ̂_u^K − Σ_u‖ | 12.74 | 15.20 | 1.66 | 1.71 | 1.78 | 1.84 | 1.95 |
| ‖(Σ̂_u^K)⁻¹ − Σ_u⁻¹‖ | 7.58 | 7.80 | 1.74 | 2.18 | 2.58 | 3.54 | 5.45 |
| ‖(Σ̂^K)⁻¹ − Σ⁻¹‖ | 7.59 | 7.49 | 1.70 | 2.13 | 2.49 | 3.37 | 5.13 |
| ‖Σ̂^K − Σ‖ | 302.16 | 274.12 | 87.92 | 92.47 | 91.90 | 83.21 | 92.50 |
| ‖Σ̂^K − Σ‖_Σ | 23.43 | 16.89 | 4.38 | 6.04 | 6.16 | 6.14 | 6.20 |
- †True K=3.
Table 3 illustrates some interesting patterns. First, the best accuracy of estimation is achieved when K=3, the true number of factors. Second, the estimation is robust for K≥3. As K increases from 3, the estimation error becomes larger but is increasing slowly in general, which indicates the robustness when a slightly larger K has been used. Third, when the number of factors is underestimated, corresponding to K=1,2, all the estimators perform badly, which demonstrates the danger of missing any common factors. Therefore, overestimating the number of factors, while still maintaining a satisfactory accuracy of estimation of the covariance matrices, is much better than underestimating. The resulting bias caused by underestimation is more severe than the additional variance that is introduced by overestimation. Finally, estimating Σ, the covariance of y_t, does not achieve good accuracy even when K=3 in the absolute term ‖Σ̂^K − Σ‖, but the relative error ‖Σ̂^K − Σ‖_Σ is much smaller. This is consistent with our discussions in Section 3.3.
6.5. Comparisons with other methods
6.5.1. Comparison with related methods
We compare POET with related methods that address low rank plus sparse covariance estimation, specifically, the low rank and sparse covariance estimator LOREC proposed by Luo (2011), the strict factor model SFM by Fan et al. (2008), the dual method (Dual) by Lin et al. (2009) and, finally, the singular value thresholding method of Cai et al. (2008), SVT. In particular, SFM is a special case of POET which employs a large threshold that forces Σ̂_u to be diagonal even when the true Σ_u might not be. Note that Dual, SVT and many others dealing with low rank plus sparseness, such as Candès et al. (2011) and Wright et al. (2009), assume a known Σ and focus on recovering the decomposition. Hence they do not estimate Σ or its inverse, but decompose the sample covariance into two components. The resulting sparse component may not be positive definite, which can lead to large estimation errors for the estimated Σ_u⁻¹ and Σ⁻¹.
Data are generated from the same set‐up as design 2 in Section 6.4. Table 4 reports the averaged estimation error of the five methods being compared, calculated on the basis of 50 replications for each simulation. Dual and SVT assume that the data matrix has a low rank plus sparse representation, which is not so for the sample covariance matrix (though the population Σ has such a representation). The tuning parameters for POET, LOREC, Dual and SVT are chosen to achieve the best performance for each method. (We used the R package for LOREC that was developed by Luo (2011) and the MATLAB codes for Dual and SVT provided on Yi Ma's Web site ‘Low‐rank matrix recovery and completion via convex optimization’ at the University of Illinois. The tuning parameters for each method have been chosen to minimize the sum of relative errors of the estimated covariance and precision matrices. We have also written an R package for POET.)
| Method | ‖Σ̂_u − Σ_u‖ | ‖Σ̂_u⁻¹ − Σ_u⁻¹‖ | RelE | ‖Σ̂⁻¹ − Σ⁻¹‖ | ‖Σ̂ − Σ‖ |
|---|---|---|---|---|---|
| p=100 | | | | | |
| POET | 1.624 | 1.336 | 2.080 | 1.309 | 29.107 |
| LOREC | 2.274 | 1.880 | 2.564 | 1.511 | 32.365 |
| SFM | 2.084 | 2.039 | 2.707 | 2.022 | 34.949 |
| Dual | 2.306 | 5.654 | 2.707 | 4.674 | 29.000 |
| SVT | 2.59 | 13.64 | 2.806 | 103.1 | 29.670 |
| p=200 | | | | | |
| POET | 1.641 | 1.358 | 3.295 | 1.346 | 58.769 |
| LOREC | 2.179 | 1.767 | 3.874 | 1.543 | 62.731 |
| SFM | 2.098 | 2.071 | 3.758 | 2.065 | 60.905 |
| Dual | 2.41 | 6.554 | 4.541 | 5.813 | 56.264 |
| SVT | 2.930 | 362.5 | 4.680 | 47.21 | 63.670 |
| p=300 | | | | | |
| POET | 1.662 | 1.394 | 4.337 | 1.395 | 65.392 |
| LOREC | 2.364 | 1.635 | 4.909 | 1.742 | 91.618 |
| SFM | 2.091 | 2.064 | 4.874 | 2.061 | 88.852 |
| Dual | 2.475 | 2.602 | 6.190 | 2.234 | 74.059 |
| SVT | 2.681 | | 6.247 | | 80.954 |

- †RelE represents the relative error ‖Σ̂ − Σ‖_Σ.
6.5.2. Comparison with direct thresholding
We also compare POET with direct thresholding of the sample covariance matrix, which takes no factor structure into account (denoted by THR), on three models:
- Model 1, one‐factor: the factors and loadings are independently generated from N(0,1). The error covariance is the same banded matrix as design 2 in Section 6.4. Here Σ has one diverging eigenvalue.
- Model 2, sparse covariance: set K=0; hence Σ = Σ_u itself is a banded matrix with bounded eigenvalues.
- Model 3, cross‐sectional AR(1): set K=0, but σ_ij = ρ^{|i−j|} for some ρ ∈ (0,1). Now Σ is no longer sparse (or banded) but is not too dense either, since σ_ij decreases to 0 exponentially fast as |i−j|→∞. This is the correlation matrix if y_it follows a cross‐sectional AR(1) process: y_it = ρ y_{i−1,t} + ε_it.
For each model, POET uses an estimated K̂ based on IC1 of Bai and Ng (2002), whereas THR thresholds the sample covariance directly. We find that, in model 1, POET performs significantly better than THR as the latter misses the common factor. For model 2, IC1 estimates K̂ = 0 precisely in each replication, and hence POET is identical to THR. For model 3, POET still outperforms THR. The results are summarized in Table 5.
| Model | ‖Σ̂ − Σ‖ | | ‖Σ̂ − Σ‖_Σ | | Average K̂ |
|---|---|---|---|---|---|
| | POET | THR | POET | THR | |
| p=200 | | | | | |
| 1 | 26.20 | 240.18 | 1.31 | 2.67 | 1 |
| 2 | 2.04 | 2.04 | 2.07 | 2.07 | 0 |
| 3 | 7.73 | 11.24 | 8.48 | 11.40 | 6.2 |
| p=300 | | | | | |
| 1 | 32.60 | 314.43 | 2.18 | 2.58 | 1 |
| 2 | 2.03 | 2.03 | 2.08 | 2.08 | 0 |
| 3 | 9.41 | 11.29 | 8.81 | 11.41 | 5.45 |
- †The reported numbers are the averages based on 100 replications.
6.6. Simulated portfolio allocation
We demonstrate the improvement of our method compared with the sample covariance and that based on the strict factor model, in a problem of portfolio allocation for risk minimization purposes.
Let Σ̂ be a generic estimator of the covariance matrix of the return vector y_t, and w be the allocation vector of a portfolio consisting of the corresponding p financial securities. Then the theoretical and the empirical risk of the given portfolio are R(w) = w'Σw and R̂(w) = w'Σ̂w respectively. Now, define
ŵ = arg min_{w'1=1} w'Σ̂w,
the allocation vector that minimizes the empirical risk subject to the budget constraint w'1 = 1. Then the actual risk of the estimated portfolio is R(ŵ) = ŵ'Σŵ, and the estimated risk (which is also called the empirical risk) is equal to R̂(ŵ) = ŵ'Σ̂ŵ. In practice, the actual risk is unknown, and only the empirical risk can be calculated.
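For reference, a sketch of this construction, assuming the minimum‐variance problem with only the budget constraint w'1 = 1, whose minimizer has the standard closed form Σ̂⁻¹1/(1'Σ̂⁻¹1):

```python
import numpy as np

def min_variance_weights(Sigma_hat):
    """Closed-form solution of min { w' Sigma_hat w : w'1 = 1 }."""
    ones = np.ones(Sigma_hat.shape[0])
    x = np.linalg.solve(Sigma_hat, ones)     # Sigma_hat^{-1} 1
    return x / (ones @ x)                    # normalize so that w'1 = 1

def risks(Sigma_hat, Sigma):
    """Empirical risk w'Sigma_hat w and actual risk w'Sigma w of the
    portfolio built from the estimator Sigma_hat (Sigma is the truth)."""
    w = min_variance_weights(Sigma_hat)
    return w @ Sigma_hat @ w, w @ Sigma @ w
```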
For each fixed p, the population Σ was generated in the same way as described in Section 6.1., with a sparse but not diagonal error covariance. We use three different methods to estimate Σ and to obtain ŵ: the strict factor model (which estimates Σ_u by using a diagonal matrix), our POET‐estimator (both are with unknown factors) and the sample covariance. We then calculate the corresponding actual and empirical risks.
It is interesting to examine the accuracy and the performance of the actual risk of our portfolio ŵ in comparison with the oracle risk R* = R(w*), which is the theoretical risk of the portfolio that we would have created if we knew the true covariance matrix Σ. We thus compare the regret R(ŵ) − R*, which is always non‐negative, for three estimators of Σ. They are summarized by using the boxplots over the 200 simulations. The results are reported in Fig. 7. In practice, we are also concerned about the difference between the actual and the empirical risk of the chosen portfolio ŵ. Hence, in Fig. 8, we also compare the average estimation error |R(ŵ) − R̂(ŵ)| and the average relative estimation error |R̂(ŵ)/R(ŵ) − 1| over 200 simulations. When Σ̂ is obtained on the basis of the strict factor model, both differences (between actual and oracle risk, and between actual and empirical risk) are persistently greater than the corresponding differences for the approximate factor estimator. Also, in terms of the relative estimation error, that of the factor‐model‐based method is negligible, whereas the sample covariance does not have such a property.

Fig. 7. Boxplots of the regret R(ŵ) − R* for (a) p=80 and (b) p=140: in each panel, the boxplots from left to right correspond to ŵ obtained by using Σ̂ based on the approximate factor model, the SFM and the sample covariance

Fig. 8. Estimation errors of the risks for POET, the SFM and the sample covariance: (a) average absolute error |R̂(ŵ) − R(ŵ)|; (b) average relative error |R̂(ŵ)/R(ŵ) − 1| (here, ŵ and R̂ are obtained on the basis of three estimators of Σ)
7. Real data example
We demonstrate the sparsity of the approximate factor model on real data and present the improvement of POET over the SFM in a real world application of portfolio allocation.
7.1. Sparsity of idiosyncratic errors
The data were obtained from the Center for Research in Security Prices database and consist of p=50 stocks and their annualized daily returns for the period January 1st, 2010–December 31st, 2010 (T=252). The stocks are chosen from five different industry sectors (more specifically, ‘consumer goods—textiles and apparel clothing’, ‘financial—credit services’, ‘healthcare—hospitals’, ‘services—restaurants’ and ‘utilities—water utilities’), with 10 stocks from each sector. We made this selection to demonstrate a block diagonal trend in the sparsity. More specifically, we show that the non‐zero elements are clustered mainly within companies in the same industry. We also note that these are the same groups that show predominantly positive correlation.
The largest eigenvalues of the sample covariance equal 0.0102, 0.0045 and 0.0039, whereas the rest are bounded by 0.0020. Hence K=0,1,2,3 are the possible values of the number of factors. Fig. 9 shows the heat map of the thresholded error correlation matrix (for simplicity, we applied hard thresholding). The threshold has been chosen by using cross‐validation as described in Section 4. We compare the level of sparsity (the percentage of non‐zero off‐diagonal elements) for the five diagonal blocks of size 10×10, versus the sparsity of the rest of the matrix. For K=2, our method results in 25.8% non‐zero off‐diagonal elements in the five diagonal blocks, as opposed to 7.3% non‐zero elements in the rest of the covariance matrix. Note that, out of the non‐zero elements in the central five blocks, 100% are positive, as opposed to a distribution of 60.3% positive and 39.7% negative among the non‐zero elements in off‐diagonal blocks. There is a strong positive correlation between the returns of companies in the same industry after the common factors have been taken out, and the thresholding has preserved it. The results for K=1, 2, 3 show the same characteristics. These provide stark evidence that the strict factor model is not appropriate.
Fig. 9. Heat map of the thresholded error correlation matrix
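A minimal R sketch of this sparsity summary follows (the names are ours; R_thr stands for the thresholded error correlation matrix, with stocks ordered by sector as above):

```r
## Percentage of non-zero off-diagonal entries inside the five 10 x 10
## industry blocks of a thresholded p x p correlation matrix R_thr,
## versus the percentage outside those blocks.
block_sparsity <- function(R_thr, block_size = 10) {
  p      <- ncol(R_thr)
  sector <- ceiling(seq_len(p) / block_size)        # industry label of each stock
  within <- outer(sector, sector, "==") & !diag(p)  # same sector, off diagonal
  across <- !outer(sector, sector, "==")            # different sectors
  c(within_blocks  = 100 * mean(R_thr[within] != 0),
    between_blocks = 100 * mean(R_thr[across] != 0))
}
```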
7.2. Portfolio allocation
We extend our data size by including larger industrial portfolios (p=100) and a longer time span (10 years) of annualized daily excess returns: from January 1st, 2000, to December 31st, 2010. Two portfolios are created at the beginning of each month, based on two different covariance estimates through approximate and strict factor models with unknown factors. At the end of each month, we compare the risks of both portfolios.
On the first trading day of each month, we estimate the strict factor model covariance (method SFM) and the POET covariance (POET with soft thresholding) by using the historical data of excess daily returns for the preceding 12 months (T=252). The value of the threshold is determined by using the cross‐validation procedure. We minimize the empirical risk of both portfolios to obtain the two respective optimal portfolio allocations ŵ_SFM and ŵ_POET. At the end of the month (21 trading days), their actual risks are compared, calculated as the sample variance of each portfolio's realized daily returns over those 21 days.
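A minimal R sketch of this rolling exercise follows (the names are ours; `estimator` stands for any covariance routine, such as a POET or SFM implementation, and min_var_weights() is the allocation function sketched in Section 6.6):

```r
## Monthly rebalancing: estimate the covariance on the preceding 252
## trading days, form the minimum-variance portfolio and record its
## realized variance over the next 21 days. Y is the full T x p matrix
## of daily excess returns.
rolling_risk <- function(Y, estimator, window = 252, hold = 21) {
  starts <- seq(window + 1, nrow(Y) - hold + 1, by = hold)
  sapply(starts, function(s) {
    S_hat <- estimator(Y[(s - window):(s - 1), ])   # preceding 12 months
    w     <- min_var_weights(S_hat)                 # minimize empirical risk
    var(drop(Y[s:(s + hold - 1), ] %*% w))          # realized monthly risk
  })
}
```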
We can see from Fig. 10 that the minimum risk portfolio that was created by POET performs significantly better, achieving a lower variance 76% of the time. Among those months, the risk is decreased by 48.63% on average. In contrast, during the months in which POET produces a higher risk portfolio, the risk is increased by only 17.66%.
Fig. 10. Comparison of the actual risks of the minimum risk portfolios created by POET and by the SFM
Next, we demonstrate the effect of the choice of the number of factors and of the threshold on the performance of POET. If cross‐validation is deemed too computationally expensive, we can choose a common soft threshold throughout the whole investment process. The average constant in the cross‐validation was 0.53, which is close to our suggested constant 0.5 used for simulation. We also present the results based on various choices of the constant C=0.25, 0.5, 0.75, 1 in the soft threshold. The results are summarized in Table 6. The performance of POET seems consistent across different choices of these parameters.
| C | K̂=1 | K̂=2 | K̂=3 |
|---|---|---|---|
| 0.25 | 0.58/29.6% | 0.68/38% | 0.71/33% |
| 0.5 | 0.66/31.7% | 0.70/38.2% | 0.75/33.5% |
| 0.75 | 0.68/29.3% | 0.70/29.6% | 0.71/25.1% |
| 1 | 0.66/20.7% | 0.62/19.4% | 0.69/18% |
- †The first number is the proportion of the time that POET outperforms and the second number is the percentage of average risk improvements. C represents the constant in the threshold.
8. Conclusion and discussion
We study the problem of estimating a high dimensional covariance matrix with conditional sparsity. Realizing that an unconditional sparsity assumption is inappropriate in many applications, we introduce a latent factor model that has a conditional sparsity feature and propose POET to take advantage of the structure. This considerably expands the scope of the strict factor model, which assumes independent idiosyncratic noise and is too restrictive in practice. By assuming a sparse error covariance matrix, we allow for the presence of cross‐sectional correlation even after taking out the common factors. The sparse covariance is estimated by the adaptive thresholding technique.
It is found that the rates of convergence of the estimators contain an extra term of order approximately 1/√p in addition to the rates based on observable factors in Fan et al. (2008, 2011a); this term arises from the effect of estimating the unobservable factors. As we can see, this effect vanishes as the dimensionality increases, because more information about the common factors becomes available. When p grows sufficiently large, the effect of estimating the unknown factors is negligible, and we can estimate the covariance matrices as if the factors were known.
The proposed POET also has wide applicability in statistical genomics. For example, Carvalho et al. (2008) applied a Bayesian sparse factor model to study breast cancer hormonal pathways. Their real data results have identified about two common factors that have highly loaded genes (about half of 250 genes). As a result, these factors should be treated as ‘pervasive’ (see the explanation in example 1 in Section 2.1.1), which will result in one or two very spiked eigenvalues of the gene expressions’ covariance matrix. POET can be applied to estimate such a covariance matrix and its network model.
Acknowledgements
The research was partially supported by National Institutes of Health grants R01GM100474‐01 and R01‐GM072611, grant DMS‐0704337 and the Bendheim Center for Finance at Princeton University. The bulk of the research was carried out while Yuan Liao was a postdoctoral fellow at Princeton University.
Appendix A: Estimating a sparse covariance with contaminated data
We estimate Σ_u by applying the adaptive thresholding given by expression (2.11). However, the task here is slightly different from the standard problem of estimating a sparse covariance matrix in the literature, as no direct observations for the errors u_t are available. In many cases the original data are contaminated; this includes any situation in which the data are themselves estimates because direct observations are not available. This typically happens when the u_t represent the error terms in regression models or when the data are subject to measurement errors. Instead, we may observe estimates û_t; for instance, in the approximate factor models, û_it is the residual after the estimated common component has been removed. We estimate Σ_u by using the adaptive thresholding proposed by Cai and Liu (2011): for the threshold τ_ij, define

Σ̂_u = (s_ij(σ̂_ij))_{p×p},   (A.1)

where σ̂_ij is the sample covariance of the û_it and the entrywise thresholding function s_ij(·) satisfies, for all z, s_ij(z) = 0 when |z| ≤ τ_ij and |s_ij(z) − z| ≤ τ_ij.
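As an illustration, a minimal R sketch of such an entry-adaptive soft thresholding rule follows (the names are ours; the threshold form τ_ij = C·ω·√θ̂_ij, with θ̂_ij a variance proxy for the products û_it û_jt, is in the spirit of Cai and Liu (2011), and the constants C and omega are user-supplied assumptions):

```r
## Entry-adaptive soft thresholding of the covariance of estimated
## errors. U is a T x p matrix of estimated errors (mean zero).
adaptive_soft_threshold <- function(U, C, omega) {
  Tn <- nrow(U); p <- ncol(U)
  S  <- crossprod(U) / Tn                 # sigma_hat_ij
  Theta <- matrix(0, p, p)                # theta_hat_ij = var of u_it * u_jt
  for (i in seq_len(p)) for (j in seq_len(p))
    Theta[i, j] <- mean((U[, i] * U[, j] - S[i, j])^2)
  tau <- C * omega * sqrt(Theta)          # entry-dependent thresholds
  R <- sign(S) * pmax(abs(S) - tau, 0)    # the soft rule s_ij
  diag(R) <- diag(S)                      # the diagonal is not thresholded
  R
}
```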
When the estimated errors are sufficiently close to the true ones, we can show that Σ̂_u is also consistent. The following theorem extends the standard thresholding results in Bickel and Levina (2008) and Cai and Liu (2011) to the case when no direct observations are available, or the original data are contaminated. The constants below depend on the tail and mixing parameters that are defined in assumptions 2 and 3.
Theorem 5. Suppose that assumptions 2 and 3 hold and, in addition, that there is a sequence a_T = o(1) that bounds the uniform estimation error of the û_it. Then there is a constant C>0 in the adaptive thresholding estimator (A.1) with which Σ̂_u consistently estimates Σ_u. If further the corresponding rate of convergence is o(1), then Σ̂_u is invertible with probability approaching 1, and Σ̂_u⁻¹ converges at the same rate.
Proof. By assumptions 2 and 3, the conditions of lemmas A.3 and A.4 of Fan et al. (2011a) are satisfied. Hence, for any ɛ>0, there are positive constants such that each of the relevant events holds with probability at least 1−ɛ. Under these events the entrywise estimation errors of the σ̂_ij can be bounded, and, with probability at least 1−2ɛ, the stated rate holds. Since ɛ is arbitrary, the first claim follows. If, in addition, the rate is o(1), then the minimum eigenvalue of Σ̂_u is bounded away from 0 with probability approaching 1, since that of Σ_u is. This then implies the claim for Σ̂_u⁻¹.
Appendix B: Proofs for Section 2
We first cite two useful results, which are needed to prove propositions 1 and 2. In lemma 1 below, let λ₁ ≥ … ≥ λ_p be the eigenvalues of Σ in descending order and ξ₁, …, ξ_p be their associated eigenvectors. Correspondingly, let λ̂₁ ≥ … ≥ λ̂_p be the eigenvalues of Σ̂ in descending order and ξ̂₁, …, ξ̂_p be their associated eigenvectors.
Lemma 1.
- (Weyl's theorem) |λ̂_j − λ_j| ≤ ‖Σ̂ − Σ‖ for each j.
- (sin(θ) theorem; Davis and Kahan (1970)) ‖ξ̂_j − ξ_j‖ ≤ √2 ‖Σ̂ − Σ‖ / min(|λ̂_{j−1} − λ_j|, |λ_j − λ̂_{j+1}|).
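A small numerical illustration of part (a) of lemma 1 may be helpful (the matrices and dimensions below are arbitrary choices of ours, not taken from the paper):

```r
## Check Weyl's inequality on a random positive definite matrix and a
## small symmetric perturbation.
set.seed(1)
p <- 50
Sigma <- crossprod(matrix(rnorm(p * p), p))              # a covariance matrix
E <- matrix(rnorm(p * p, sd = 0.1), p); E <- (E + t(E)) / 2
lam     <- eigen(Sigma,     symmetric = TRUE, only.values = TRUE)$values
lam_hat <- eigen(Sigma + E, symmetric = TRUE, only.values = TRUE)$values
op_norm <- max(abs(eigen(E, symmetric = TRUE, only.values = TRUE)$values))
max(abs(lam_hat - lam)) <= op_norm   # TRUE, as Weyl's theorem predicts
```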
B.1. Proof of proposition 1
Recall that λ_j are the eigenvalues of Σ and that λ̄_j are the first K eigenvalues of BB′ (the remaining p−K eigenvalues are 0); then, by Weyl's theorem, for each j≤K, |λ_j − λ̄_j| ≤ ‖Σ_u‖. For j>K, λ_j ≤ ‖Σ_u‖. However, the first K eigenvalues of BB′ are also the eigenvalues of B′B. By the assumption, the eigenvalues of p⁻¹B′B are bounded away from 0. Thus, when j≤K, the λ_j/p are bounded away from 0 for all large p.
B.2. Proof of proposition 2
By Weyl's theorem, it suffices to control ‖Σ − BB′‖ = ‖Σ_u‖. For a generic constant c>0, λ_j ≥ cp for all large p when j≤K, since the eigenvalues of p⁻¹B′B are bounded away from 0 but ‖Σ_u‖ is bounded by proposition 1. However, if j<K, the same argument implies that λ_j − λ_{j+1} ≥ cp. If j=K, λ_K − λ_{K+1} ≥ cp, where λ_K/p is bounded away from 0, but λ_{K+1} ≤ ‖Σ_u‖ is bounded. Hence, again, the eigengap is at least cp.
B.3. Proof of theorem 1

and
. If we show that
, then from the decompositions of the sample covariance,

. Consequently, applying thresholding on
is equivalent to applying thresholding on
, which gives the desired result.
We now show that
indeed holds. Consider again the least squares problem (2.8) but with the following alternative normalization constraints:
, and
is diagonal. Let
be the solution to the new optimization problem. Switching the roles of B and F, the solution of problem (2.10) is
and
. In addition,
. From
, it follows that
.
Appendix C:: Proofs for Section 3
We proceed by proving theorems 4, 2 and 3, in that order.
C.1. Preliminary lemmas
The following results are to be used subsequently. The proofs of lemmas 2, 3 and 4 are found in Fan et al. (2011a).
Lemma 2. Suppose that A and B are symmetric positive semidefinite matrices, and
for a sequence
. If
, then
, and

Lemma 3. Suppose that the random variables
and
both satisfy the exponential‐type tail condition: there exist
,
and
, such that, ∀s>0,

and
, and any s>0,
(C.1)
Lemma 4. Under the assumptions of theorem 2,
,
and
.
Lemma 5. Let
denote the Kth largest eigenvalue of
; then
with probability approaching 1 for some
.
Proof. First, by proposition 1, under assumption 1, the Kth largest eigenvalue
of Σ satisfies, for some c>0,

. Without loss of generality, we prove the result under the identifiability condition (2.1). Using model (1.2),
. Using this and model (1.3),
can be decomposed as the sum of the four terms


, which is
if K log(p)=o(T). Consequently, by assumption 1, we have

. It follows from lemma 4 that

, it remains to deal with
, which is bounded by

since log(p)=o(T).
Lemma 6. Under assumption 3,
.
Proof. Since
is weakly stationary,
. In addition,
for some constant M and any i and t since
has an exponential tail. Hence by Davydov's inequality (corollary 16.2.4 in Athreya and Lahiri (2006)), there is a constant C>0, for all i≤p,t≤T,
, where α(t) is the α‐mixing coefficient. By assumption 3,
. Thus, uniformly in T,

C.2. Proof of theorem 4
Our derivation below relies on a result that was obtained by Bai and Ng (2002), who showed that the estimated number of factors is consistent, in the sense that K̂ equals the true K with probability approaching 1. Note that, under our assumptions 1–4, all the assumptions in Bai and Ng (2002) are satisfied. Thus we immediately have the following lemma.
Lemma 7 (theorem 2 in Bai and Ng (2002)). For K̂ defined in expression (2.14), P(K̂ = K) → 1 as p, T → ∞.
Proof. See Bai and Ng (2002).
We first prove some preliminary results in the following lemmas. Denote
.
Lemma 8. For all
,
,
,
and
.
Proof.
- We have, ∀i,
. By the Cauchy–Schwarz inequality,

- By lemma 6,
, which then yields the result.
- By the Cauchy–Schwarz inequality,

- Note that
. By assumption 4,
, which implies that
and yields the result.
- By definition,
. We first bound
. Assumption 4 implies that
. Therefore, by the Cauchy–Schwarz inequality,

- Similarly to part (c), noting that
is a scalar, we have
where the last line follows from the Cauchy–Schwarz inequality.
Lemma 9.
,
,
and
.
Proof.
- By the Cauchy–Schwarz inequality and the fact that
,
The result then follows from assumption 3.
- By the Cauchy–Schwarz inequality,
It follows from assumption 4 that

. It then follows from Chebyshev's inequality and Bonferroni's method that
.
- By assumption 4,
. Chebyshev's inequality and Bonferroni's method yield
with probability 1, which then implies

- By the Cauchy–Schwarz inequality and assumption 4, we have demonstrated that
. In addition, since
,
. It follows that

Lemma 10.
.
.
.
Proof. We prove this lemma conditioning on the event
. Once this has been done, because
, it then implies the unconditional arguments.
- When
, by lemma 5, all the eigenvalues of V/p are bounded away from 0. Using the inequality
and identity (C.2), we have, for some constant C>0,
Each of the four terms on the right‐hand side are bounded in lemma 8, which then yields the desired result.
- Part (b) follows from part (a) and
Part (c) is implied by identity (C.2) and lemma 9.

Lemma 11.
.
.
Proof. We first condition on
.
- Lemma 5 implies that
. Also
. In addition,
. It then follows from the definition of H that
. Define
. Applying the triangular inequality gives
(C.3) - By lemma 4, the first term in inequality (C.3) is
. The second term of inequality (C.3) can be bounded, by the Cauchy–Schwarz inequality and lemma 10, as follows:

- Still conditioning on
, since
and
, right multiplying H gives
. Part (a) also gives, conditioning on
,
. Hence further left multiplying
yields
. Because
, we reach the desired result.
C.2.1. Completion of proof of theorem 4
The second part of theorem 4 was proved in lemma 10. We now derive the convergence rate of
.
, and that
, we have
(C.4)
. Therefore,
. The Cauchy–Schwarz inequality and lemma 10 imply

Finally,
and
imply that the third term is
.
C.2.2. Proof of corollary 1
. By theorem 4, uniformly in i and t,

C.3. Proof of theorem 2
Lemma 12.
, and
.
Proof. We have
. Therefore, using the inequality
, we have

C.3.1. Completion of proof of theorem 2
Theorem 2 follows immediately from theorem 5 and lemma 12.
C.4. Proof of theorem 3

Lemma 13.
, and
.
.
.
-
Proof.
- We have
. Moreover, since all the eigenvalues of Σ are bounded away from 0, for any matrix A,
. Hence
.
- By theorem 2,
.
The same argument as in the proof of theorem 2 in Fan et al. (2008) implies that
. Thus,
is upper bounded by
.
- Again, by
, and lemma 11,
(C.5)
C.4.1. Proof of theorem 3, part (a)
. Hence, for a generic constant C>0,

Lemma 14.
.
Proof.
. Hence
(C.6)
Lemma 15. If
, then with probability approaching 1, for some c>0,
,
,
and
.
Proof.
- By lemma 11, with probability approaching 1,
is bounded away from 0. Hence,

- The result follows from part (a) and lemma 14. Parts (c) and (d) follow from a similar argument to that for part (a) and lemma 11.
C.4.2. Completion of proof of theorem 3
. Define

and
. The triangular inequality gives

, where
(C.7)
is bounded by theorem 2. Let
; then

. Lemma 15 then implies that
. This shows that
. Similarly
. In addition, since
,
. Similarly
. Finally, let
. By lemma 15,
. Then, by lemma 14,

. Adding up
–
gives


C.4.3. Completion of proof of theorem 3: 
. Repeatedly using the triangular inequality yields

be the (i,j) entry of
. Then
.

Hence
. The result then follows immediately.
References
Discussion on the paper by Fan, Liao and Mincheva
Marc Hallin (Université Libre de Bruxelles and Princeton University)
Consider T observations of a p-dimensional process. Such realizations can be seen either as a collection of p one‐dimensional time series, related to p individuals or cross‐sectional items, or as one observed time series
in dimension p. The objective is the estimation of the covariance matrix Σ of
. As both p and T are ‘large’, (p, T) asymptotics are appropriate. Owing to the effect of a common environment (unobserved ‘factors’ or ‘common shocks’ generating a large number of cross‐covariances), the traditional assumption of a sparse Σ is unlikely to hold. However, the same assumption becomes reasonable once the effect of unobserved factors or common shocks has been removed. The main challenge, thus, consists in removing that effect—which is highly non‐trivial, as those factors or common shocks are not observed and are not even well defined.
Factor model methods in such a context are tailor‐made solutions. The general idea behind factor models consists in decomposing each observation into the sum of a common component (accounting for the effect of the factors) and an idiosyncratic component (on which a sparsity assumption is to be made), the two being unobserved, mutually orthogonal (at all leads and lags) processes. Further identification constraints on the two components of course are needed, yielding a variety of factor models. The better the common component is at accounting for the effect of common factors (cross‐correlations), the more plausible the sparsity assumption on the covariance of the idiosyncratic components.
The identification assumption that was chosen by the authors requires the common component to be of the form b_i′f_t, where f_t is some latent K‐dimensional vector process (the factors), yielding a static (approximate) factor model of the type studied by Chamberlain and Rothschild (1983), Bai and Ng (2002), Stock and Watson (2002a, b) and many others. The authors then successfully propose a principal‐component‐based method for reconstructing the common and idiosyncratic components, followed by a thresholding procedure for the estimation of the idiosyncratic covariance matrix, and derive powerful consistency results, with rates that depend on the sparsity of that matrix.
This is a path-breaking contribution to the literature on high dimensional covariance estimation. The authors should be congratulated for this, and I have no hesitation in proposing a vote of thanks. However, ‘Sans la liberté de blâmer, il n'est point d'éloge flatteur’ (‘Praising has no value in the absence of free criticism’ (Beaumarchais, Le Mariage de Figaro)), and I would like to enhance my praise of the authors' work with a couple of friendly critical comments.
Principal components, the traditional static ones, computed from the ‘instantaneous’ covariances of the observations, are a fundamental tool in the authors' approach. Since Brillinger (1981), however, it is widely admitted that static principal components are not the adequate concept of principal components in a time series context. By maximizing the variances of normed contemporaneous linear combinations, indeed, they completely overlook serial dependences. A static principal component with a small eigenvalue may have a negligible contemporaneous effect, but a significant effect at other lags, and hence a high predictive value: discarding it, as the static principal component method does, shifts its contribution to the idiosyncratic component, possibly jeopardizing the assumed idiosyncratic sparsity.
Static principal components of course are fine under the assumptions of the static factor model, which in turn are pertinent in the presence of independent observations (the type that the authors probably have in mind despite the time series setting of their paper: gene expressions, financial returns, etc.). They are no longer adequate in the presence of serial dependence, and this is an indication that the static factor model assumptions are unlikely to hold in a genuine time series context. The weak point of static factor assumptions is the fact that all factors are to be loaded contemporaneously at time t whereas, in most practical situations, factors are loaded with lags. A general form for the common component is b_i(L)′u_t instead of b_i′f_t, with K×1 loading filters b_i(L) (in the lag operator L) instead of the K×1 loading vectors b_i, where u_t is a K‐tuple of mutually orthogonal white noises, the common shocks. Adopting this dynamic characterization of the common component leads to the general dynamic factor model that was studied in Forni et al. (2000) or Forni and Lippi (2001). An important advantage of that dynamic model is that, in contrast with the static one, it holds, basically, without any assumption (but second‐order stationarity) on the observed process.
With this dynamic specification replacing the static model, the idiosyncratic covariance matrices, as well as the lagged idiosyncratic cross‐covariance matrices, are much more likely to satisfy sparsity assumptions. Brillinger's dynamic principal components can be used (Forni et al., 2000), very much in the same way as the static principal components in the static model, to reconstruct the decomposition into common and idiosyncratic components, before applying the thresholding technique that is recommended by the authors. (Dynamic principal components are based on the maximization of the variances of filtered linear combinations involving the past, present and future values of the observations, and can be computed from the eigenvalues and eigenvectors of the spectral density matrices of the series.) The resulting consistency rates will involve the sparsity of the dynamic factor idiosyncratic covariance matrix rather than that of its static factor counterpart. And the same methods as proposed by the authors naturally apply with the more ambitious objective of estimating the full (high dimensional) autocovariance structure of the observed series.
Piotr Fryzlewicz and Na Huang (London School of Economics and Political Science)
We would like to start by congratulating Professor Fan, Dr Liao and Ms Mincheva for the stimulating and thought‐provoking paper.
The POET‐estimator is the sum of two parts: the non‐sparse, low rank part resulting from the factor model, and the sparse part arising as a result of thresholding the ‘principal orthogonal complement’. The estimator has been designed with a particular factor model in mind, and therefore it is natural to ask, firstly, whether and how one could verify this model assumption and, secondly, whether POET offers acceptable performance if the assumption does not hold.
We may be wrong here, but we are unaware of a reliable technique for estimating the number of factors K which works well except in the most ‘textbook’ cases of the first few eigenvalues being ‘visibly’ larger than others. Even if a factor structure is present, the presence of both stronger and less strong factors may lead to the cut‐off in the eigenvalues being less obvious, in which case any inference for the number of factors may not be reliable. However, it is important to choose K correctly from the point of view of the usability of POET: the authors warn us that POET may perform poorly if K is underestimated. It is therefore tempting to ask whether POET may benefit from averaging over K as a possible guard against picking one ‘wrong’ (e.g. underestimated) value of K. Averaging may also be beneficial in cases when the factor model assumption is not satisfied.
Our proposal combines the sample covariance matrix with its thresholded version:

Σ̂(δ, λ) = δ Σ̂ + (1 − δ) t(Σ̂, λ),

where Σ̂ is the p×p sample covariance matrix, δ is a constant in [0,1], λ is a p×p matrix with non‐negative entries and t(·,·) is a function that applies soft, hard or other thresholding to each non‐diagonal entry of its first argument, with the threshold value equal to the corresponding entry of its second argument. λ will typically be parameterized by one scalar parameter. Obviously, δΣ̂ and (1−δ)t(Σ̂, λ) are the non‐sparse and sparse components respectively, and Σ̂(δ, λ) performs ‘shrinkage of the sample covariance towards a sparse target’. To the best of our knowledge, Σ̂(δ, λ) is a new proposal, although shrinkage towards some other targets has been studied extensively before, notably by Ledoit and Wolf (2003), who proposed shrinkage towards a one‐factor target, and Schaefer and Strimmer (2005), who reviewed and discussed six commonly used targets. Some ideas for the ‘optimal’ choice of δ are proposed by Ledoit and Wolf (2003) and Schaefer and Strimmer (2005) and can be adopted in the context of Σ̂(δ, λ), thereby reducing the number of ‘free’ parameters of the procedure to the single scalar parameter of the threshold matrix λ. If all new covariance estimators were required to have ‘literary’ names (such as POET), we would name ours ‘NOVELIST’, for ‘novel integration of the sample and thresholded covariance estimators’. The benefits of NOVELIST include simplicity, ease of implementation and the fact that its application avoids eigenanalysis, which is unfamiliar to many practitioners.
We now briefly exhibit the performance of POET versus NOVELIST on a simulated covariance matrix Σ of size 100×100, available from http://stats.lse.ac.uk/fryzlewicz/testcov.RData (use load("testcov.RData") in R; the variable name is testcov). Σ was not generated from a factor model and is not sparse. The range of its diagonal elements is [3.32, 7.09], and only 56 of the non‐diagonal entries are larger than 1 in absolute value. The sample size is n=100, so the sample covariance matrix itself is not invertible. In NOVELIST, we use both δ=0 and a data‐dependent value δ̂, and the constant matrix λ≡1. In POET, we use K=7, following the authors' advice, given in the R package POET, to choose a large K (to avoid issues related to K being underestimated), but preferably smaller than 8. Both POET and NOVELIST use soft thresholding. Other POET parameters are set to default values. The data are Gaussian, and there are N=100 repetitions. Table 7 shows the results. POET performs poorly for Σ: it is the worst in all norms by a large margin. NOVELIST with δ=0 (which reduces to the simple soft thresholding estimator) and with δ̂ are difficult to tell apart in terms of their performance. However, as far as the estimation of Σ⁻¹ is concerned, NOVELIST with δ̂ is the best, followed by POET and then by the simple soft thresholding. The overall clear ‘winner’ in this example is NOVELIST with δ̂.
Table 7. Average distances to Σ (left‐hand section) and to Σ⁻¹ (right‐hand section) for Σ̂(δ,λ) with δ=0, with δ̂ and for the POET‐estimator, in the ℓ₁‐, Frobenius, max‐ and spectral norms†

| Norm | Σ: δ=0 | Σ: δ̂ | Σ: POET | Σ⁻¹: δ=0 | Σ⁻¹: δ̂ | Σ⁻¹: POET |
|---|---|---|---|---|---|---|
| ℓ₁ | *34* | *34* | 61 | 42 | *38* | 41 |
| Frobenius | *30* | 32 | 50 | 34 | *30* | 33 |
| max | 2.09 | *2.03* | 2.29 | 4.88 | *3.47* | 3.92 |
| spectral | 10 | *9* | 19 | 17 | *15* | 17 |
- †Distances to Σ⁻¹ were multiplied by 10 before averaging. The best results are in italics.
In summary, our discussion attempts to list some research questions regarding POET which we believe are worth exploring further, and proposes NOVELIST as a simple competitor.
We found the paper a pleasure to read and thought it was written in a clear and pedagogical way. We are convinced that POET will stimulate further research in the important field of large covariance estimation. It therefore gives us great pleasure to second the vote of thanks for this paper.
The vote of thanks was passed by acclamation.
Wenyang Zhang (University of York) and Heng Peng (Hong Kong Baptist University)
We congratulate Professor Fan, Dr Liao and Ms Mincheva on such a brilliant paper. We believe that this paper will have a huge influence on the estimation of covariance matrices of large size and will stimulate much further research on this topic.
The condition of sparsity is often imposed when it comes to large matrix estimation. As the authors rightly point out sparsity may not be appropriate in some circumstances; the covariance matrix of asset returns used in portfolio allocation is an example. It is better to impose some kind of structure on the covariance matrix on the basis of the data concerned. The matrices with the structure stated in this paper are very general and appear in many research areas, such as portfolio allocation, risk management and image analysis. The estimation that is proposed in this paper is intuitive and easy to implement; it will become very popular in the areas where large matrix estimation is needed.
Among the many clever and stimulating ideas in this paper, we particularly appreciate the connection between principal component analysis and factor models. This connection leads to a brand new estimation of factor loadings in factor analysis and of the covariance matrix of idiosyncratic components.
When p is larger than T, the sample covariance matrix would be singular, and some of its eigenvalues will be 0. This problem will become more acute when p≫T. How would this affect the estimator of the covariance matrix of idiosyncratic components? Would some kind of iteration improve the accuracy of the estimator? For example: treat the thresholded principal orthogonal complement as an initial estimator of the idiosyncratic covariance; subtract it from the sample covariance and decompose the difference by its principal components; denote the sum of the first K terms in the decomposition as the new low rank part; apply the thresholding rule to the new remainder to obtain an improved idiosyncratic estimator; and continue this iterative procedure until convergence (see the sketch below).
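The following minimal R sketch gives one possible reading of this iteration (the names, the initialization and the stopping rule are our own interpretation of the description above, not the discussants' code; `threshold` is any off-diagonal thresholding rule, e.g. a fixed-threshold version of the soft_threshold() sketched in Section 6):

```r
## Alternate between extracting the first K principal components of the
## sample covariance S minus the current idiosyncratic estimate, and
## re-thresholding the remainder.
iterate_poet <- function(S, K, threshold, max_iter = 50, tol = 1e-6) {
  Sigma_u <- threshold(S)                 # initial idiosyncratic estimate
  for (it in seq_len(max_iter)) {
    eig <- eigen(S - Sigma_u, symmetric = TRUE)
    V   <- eig$vectors[, seq_len(K), drop = FALSE]
    L   <- V %*% diag(eig$values[seq_len(K)], K) %*% t(V)  # first K terms
    Sigma_u_new <- threshold(S - L)       # re-threshold the remainder
    if (max(abs(Sigma_u_new - Sigma_u)) < tol) break
    Sigma_u <- Sigma_u_new
  }
  L + Sigma_u_new                         # low rank plus sparse estimate
}
```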
The presence of spiked eigenvalues is clearly formulated in this paper; however, when the factor loadings are not available, which is the case in reality, how do we check whether there are spiked eigenvalues? Would it work simply by checking whether there is a jump among the eigenvalues of the sample covariance matrix?
As far as the estimation of the covariance matrix Σ is concerned, is it really necessary to have the condition of the presence of spiked eigenvalues? Would POET always work regardless of whether the spiked eigenvalues exist or not? We may have missed some important points on this issue.
The selection of K on the basis of the Akaike or Bayesian information criterion does not seem to make use of the information about the presence of spiked eigenvalues; is there any room to improve the selection by incorporating this information in the selection procedure?
Alexei Onatski (University of Cambridge)
My comments on this interesting paper are confined to its central assumption that the first K eigenvalues of
diverge at rate O(p), whereas all the eigenvalues of
are bounded as p→∞. This factor pervasiveness assumption implies that
and
are good approximations to the sample covariances of the systematic and idiosyncratic components of the data respectively, which is a key to the success of POET. Unfortunately, it may be misleading in many economic and financial applications.
For example, as Fig. 11 shows, except for i=1, there are no large gaps between eigenvalues i and i+1 of the sample covariance matrix of the excess return data that were used in Section 6. However, since, as is commonly believed, such data contain at least three factors, the factor pervasiveness assumption suggests the existence of a large gap for i≥3.
Fig. 11. Eigenvalues of the sample covariance matrix of the excess return data used in Section 6 and of data simulated from the factor model calibrated as in Section 6 (boxplots based on 100 replications)
The absence of the gap between ‘systematic’ and ‘idiosyncratic’ eigenvalues may have a negative effect on the performance of POET. Table 8 reports the mean quality of POET over 1000 replications of data simulated as in Section 6.2, but with cross‐sectionally dependent idiosyncratic terms governed by a parameter ρ. As ρ increases from 0.4 to 0.9, the systematic–idiosyncratic gap decreases from 3.7 to 1.1. For the 100 industrial portfolios data that were used in Section 6, the observed gap is best matched by the simulated data with ρ=0.8. The quality of POET dramatically deteriorates in the neighbourhood of ρ=0.8.
Table 8. Mean quality of POET and of the shrinkage estimator for various values of ρ (each consecutive POET/shrinkage pair of rows corresponds to one quality criterion)

| Estimator | ρ=0.4 | ρ=0.5 | ρ=0.6 | ρ=0.7 | ρ=0.8 | ρ=0.9 |
|---|---|---|---|---|---|---|
| POET | 0.74 | 1.04 | 1.57 | 2.10 | 3.89 | 8.98 |
| Shrinkage | 1.21 | 1.63 | 2.18 | 2.96 | 4.68 | 9.24 |
| POET | 7.38 | 10.1 | 18.7 | 43.2 | | |
| Shrinkage | 6.16 | 9.63 | 14.1 | 22.2 | 39.2 | 90.7 |
| POET | 0.85 | 0.90 | 1.07 | 1.33 | 2.14 | 6.65 |
| Shrinkage | 0.97 | 1.05 | 1.11 | 1.16 | 1.20 | 1.27 |
| POET | 7.09 | 9.64 | 17.8 | 39.0 | | |
| Shrinkage | 4.35 | 5.67 | 7.47 | 10.5 | 16.9 | 41.9 |
| POET | 20.2 | 20.4 | 20.4 | 21.3 | 19.7 | 21.2 |
| Shrinkage | 20.3 | 20.4 | 20.4 | 21.3 | 19.7 | 21.1 |
For comparison, in the rows of Table 8 marked ‘shrinkage’, I report the quality of the estimator that replaces POET's thresholding step by Ledoit and Wolf's (2004) linear shrinkage procedure applied to the principal orthogonal complement. The deterioration of the quality of the shrinkage estimator is not as dramatic as that of POET.
Continuing the comparison, I computed the risk of portfolios created as in Section 7 with the shrinkage estimator. The minimum risk portfolio that was created with shrinkage had a lower variance than that created with POET 51% of the time. Among those months, the risk was decreased by 19%. During the months when POET produced a lower risk portfolio, the risk was decreased by 15%. These results indicate that POET does not dominate other simple covariance matrix estimation methods in applications where there is no clear gap between systematic and idiosyncratic eigenvalues. Developing covariance estimation methods that would work well in such situations is an important task for future research.
Clifford Lam and Charlie Hu (London School of Economics and Political Science)
We would like to discuss two potential issues:
- (a) the potential underestimation of the number of factors K;
- (b) the potential non‐sparseness of the estimated principal orthogonal complement.
Point (a) is addressed by using a larger K. With pervasive factors assumed in the paper, it is relatively easy to find such K. However, in an analysis of macroeconomic data for example, there can be a mix of pervasive factors and many weaker ones; see Chudik et al. (2011), Lam et al. (2011) and Lam and Yao (2012) for a general definition of weak factors.
In Stock and Watson (2005), monthly data of p=132 US macroeconomic time series from 1959 to 2003 (n=526) were analysed. Using principal component analysis (Bai and Ng, 2002), the method in Lam et al. (2011) and a modified version called autocovariance‐based factor modelling, we compute the average forecast errors of 30 monthly forecasts by using a vector auto‐regressive model VAR(3) on the estimated factors from these methods with different numbers of factors r (Fig. 12). Whereas three pervasive factors decrease forecast errors sharply, including more factors, up to r=35, decreases forecast errors more slowly, showing the existence of many ‘weaker’ factors.
Fig. 12. Average forecast errors against the number of factors r used (autocovariance‐based factor modelling; Lam et al. (2011); principal component analysis)
Hence it is not always possible to have ‘enough’ factors for accurate thresholding of the principal orthogonal complement, which can still include contributions from many weak factors and is not sparse. Points (a) and (b) can therefore be closely related, and can be addressed if we regularize the condition number of the orthogonal complement instead of thresholding it. Whereas Won et al. (2013) restrict the extreme eigenvalues with a tuning parameter to be chosen, we use the idea of Abadir et al. (2010) (whose properties have unfortunately not been investigated sufficiently); we are studying its theoretical properties.
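As a crude illustration of regularizing the condition number rather than thresholding, the following R sketch clips the spectrum of the principal orthogonal complement from below (our own construction, in the spirit of Won et al. (2013); kappa is a tuning parameter):

```r
## Bound the condition number of a symmetric matrix R by flooring its
## eigenvalues at (largest eigenvalue) / kappa.
condreg <- function(R, kappa) {
  eig  <- eigen(R, symmetric = TRUE)
  vals <- pmax(eig$values, max(eig$values) / kappa)  # floor the spectrum
  eig$vectors %*% diag(vals) %*% t(eig$vectors)      # rebuild the matrix
}
```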
We also consider a regression setting, following example 5 of the paper, with the errors being independent AR(1) processes and the regressors the standardized macroeconomic data in Stock and Watson (2005) plus independent N(0,0.2) noise. We estimate the error covariance matrix by using different methods and plot the sum of the absolute biases for estimating β by generalized least squares against the number of factors r used (Fig. 13). Clearly, regularizing the condition number leads to more stable estimators.
Fig. 13. Sum of absolute biases against the number of factors r used in POET (C=0.5) and in the condition‐number‐regularized estimator (the bias for the least squares method is constant throughout)
Parallel to Section 7.2, we compare the risk of portfolios created by using POET and the method above. Again, Fig. 14 and Table 9 show the more stable performance of regularization on the condition number.
| K | Proportion of time POET outperforms | % of average risk improvement |
|---|---|---|
| 1 | 0.40 | −4.07 |
| 2 | 0.46 | −2.50 |
| 3 | 0.56 | 0.66 |
| 4 | 0.56 | 0.71 |
Natalia Bailey (University of Cambridge), M. Hashem Pesaran (University of Southern California, Los Angeles, and Trinity College, Cambridge) and Takashi Yamagata (University of York)
The paper's key contribution lies in tackling the problem of estimation of a large covariance matrix, Σ, when the data set examined is ‘contaminated’ with strong unobserved factors. Both cross‐sectional and time dimensions (p,T) are assumed large. Fan and his colleagues extend the existing literature on regularization of such a matrix via thresholding when it is strictly sparse (e.g. Bickel and Levina (2008) and Cai and Liu (2011)). Their proposed estimator, POET, is applied to the covariance matrix of residuals extracted from a regression of the original data on K estimated factors. The errors are considered to be weakly cross‐sectionally dependent (i.e. Σ_u is sparse).
They establish the rate of convergence (1) for Σ̂_u, the thresholded version of the estimated residual covariance matrix, and the same rate for its inverse, ensuring positive definiteness of Σ̂_u by setting a lower bound on their thresholding parameter. Using this result they then suggest that Σ̂_u⁻¹ can be used in various applications in finance, in particular in their example 6, where they consider testing α=0 in the linear asset pricing model (2). One might hope that plugging Σ̂_u⁻¹ into the development of such a test will be valid even when p≫T. However, as shown in Pesaran and Yamagata (2012), this cannot be so even if the rate condition (1) holds. Pesaran and Yamagata (2012) conduct a test of α=0, using expression (2) as the generating process, and propose a test statistic which, under normality of the errors, can be written as expression (3), in which τ is a T×1 vector of 1s. When Σ_u is known the test is valid for any T>K+1 but, if an estimator of Σ_u is inserted in expression (3), then the test remains valid only if p log(p)/T→0, which requires p<T.
To illustrate this point we conducted a small Monte Carlo simulation following the set‐up in Pesaran and Yamagata (2012), where we plugged in the POET estimator and the Ledoit and Wolf (2004) shrinkage estimator as estimates of Σ_u in expression (3), for a set of (p,T) combinations. As shown in Table 10, considerable size distortions are visible when either estimator is used for p>T. Size improves only when T increases.
Table 10. Size and power of the tests in the case of models with three factors†

| Estimator | T | Size (p=50) | Power (p=50) | Size (p=100) | Power (p=100) | Size (p=200) | Power (p=200) |
|---|---|---|---|---|---|---|---|
| POET plug‐in | 60 | 0.14 | 0.78 | 0.19 | 0.91 | 0.25 | 0.98 |
| | 100 | 0.08 | 0.93 | 0.11 | 0.98 | 0.14 | 1.00 |
| Shrinkage plug‐in | 60 | 0.14 | 0.62 | 0.18 | 0.78 | 0.25 | 0.92 |
| | 100 | 0.11 | 0.85 | 0.13 | 0.92 | 0.17 | 0.98 |
| Threshold test (4) | 60 | 0.05 | 0.58 | 0.04 | 0.74 | 0.04 | 0.89 |
| | 100 | 0.05 | 0.87 | 0.05 | 0.95 | 0.05 | 1.00 |
- †Errors are weakly cross‐sectionally dependent and normally distributed. Sparseness of Σ_u is defined as in Table 3 of Pesaran and Yamagata (2012). Size: α_i=0 for all i=1,…,p. Power: the α_i are IIDN(0,1) for a subset of the securities and 0 otherwise. The number of replications is set to 2000. ‘Hard’ thresholding in the POET estimator is conducted by using cross‐validation.
Pesaran and Yamagata (2012) also propose a threshold test, given by expressions (4) and (5), whose statistic aggregates the squared standard t‐ratios of the estimated α_i in the ordinary least squares regressions of the individual asset returns; I(·) is an indicator function, and the threshold value is chosen such that the contribution of spurious exceedances declines steadily with p. Size and power for this test are also summarized in Table 10 and show the ability of the test to control the size well, with high power, even if p≫T.
Cinzia Viroli (University of Bologna)
This paper is very interesting and rich with thought‐provoking themes. I would like to comment on the double formulation of the estimation problem for large covariance matrices.
Fan and his colleagues assume that the data have been generated according to an approximate K‐factor model and suggest recovery of the covariance matrix via its decomposition into a low rank matrix and a sparse error matrix. They first address this issue by resorting to the first K principal components to estimate the factor loadings and to the thresholded principal orthogonal complement for the estimation of the idiosyncratic variance.
They then show that the estimator has an equivalent representation by using a constrained least squares method. For each given vector F of factor scores they derive the least squares estimator of the factor loadings (as a function of F) and hence estimate the factor scores. Given both the factor scores and the factor loadings they recover the idiosyncratic factors, derive their covariance matrix and suitably threshold it.
One of the advantages of the least squares formulation is that it generalizes and extends, to the unknown factor case, the results that the authors have obtained for known factors, both assuming a strict (Fan et al., 2008) and an approximate (Fan et al., 2011) factor model. It is also coherent with a large part of the literature on the topic and allows us to use consolidated proof strategies and model selection tools.
However, by pursuing the least squares approach the authors seem somewhat to neglect the principal components analysis results. The way that they solve the least squares problem (first obtain the loadings and then the factor scores) is natural when the factors are known and the loadings unknown, but, when both of them are unknown, the dual problem (first scores and then loadings) could be as meaningful, as in Stock and Watson (2002b). This would allow us to obtain the factor loadings as the eigenvectors of the sample covariance matrix directly, thus leading to the POET‐estimator and making theorem 1 in Section 2 no longer needed. Also, the principal components analysis approach would allow us to obtain the idiosyncratic errors simply as the scores in the principal components orthogonal complement, with no need to estimate the common factor scores first. I really appreciated the paper, but I cannot help noting that, as the paper proceeds, the poetry of POET somewhat fades. I am wondering whether there is some drawback of the principal components analysis approach that justifies the prevalence of least squares.
Angela Montanari (University of Bologna)
This is a very interesting paper, full of stimuli in all its parts.
One of the most relevant aspects is the identification of sparseness of the idiosyncratic covariance matrix as a sufficient condition (together with factor pervasiveness) for the identifiability of an approximate factor model and for its estimation through principal components analysis (PCA).
This result, besides further justifying Chamberlain and Rothschild's (1983) result (which required limited idiosyncratic eigenvalues only) also provides a clear indication on the kind of dependence structure which the model can capture. And this is very important from an empirical point of view, as the examples in Section 5 clearly show.
Sparseness of the idiosyncratic covariance (together with diverging p) also offers PCA a sort of ground for revenge as an estimation method for a factor model. For finite p, and under the strict factor model assumption, it is well known that PCA performs poorly as an estimation method, since it generates correlated errors; but for diverging p, and under the approximate factor model assumption, this paper shows that PCA represents a natural and theoretically grounded estimation instrument. I would speak, for PCA, of a blessing of dimensionality.
Within this coherent framework, however, I feel a little uneasy about the empirical application in Section 7.
50 series, related to as many stocks (chosen from five different industry sectors), and their annualized daily returns for T=252 days are considered. Fan and his colleagues identify three relevant factors, estimate the factor loadings through the first three principal components of the sample covariance matrix and finally obtain the thresholded error correlation matrix. This matrix shows that a strong positive correlation between the returns of companies in the same industry is still present after taking out the common factors, and from this the authors conclude that it provides strong evidence that the strict factor model is inappropriate.
In this case p is not very large with respect to T. My feeling is that we are still dealing with the finite p situation in which, as already said, PCA returns correlated idiosyncratic errors, even when they are actually uncorrelated. In other words, I am wondering whether the residual correlation is evidence of the inappropriateness of a strict factor model or, on the contrary, it is simply induced by an inappropriate use of PCA. If the factor loadings had been estimated by any of the estimation methods ordinarily used in classical factor analysis, would the residual correlations still be non‐vanishing?
Yi Yu and Richard J. Samworth (University of Cambridge)
We congratulate the authors on their paper. POET elegantly tackles low rank plus sparse matrix estimation, provided that the eigenvalues of the low rank matrix grow at rate O(p) (see assumption 1). Suppose now that this assumption does not hold, and instead, we have the following condition.
Assumption 5. All the eigenvalues of the K×K matrix p^{−α}B′B are bounded away from both 0 and ∞ as p→∞, where 0<α<1.
Similar conditions are widely used in sparse principal components analysis and low rank plus sparse matrix estimation problems; see, for example, Amini and Wainwright (2009) and Agarwal et al. (2012). In what follows, we consider the three main objectives in Section 2. The notation and model are the same as those in the paper.
Proposition 3. Assume assumption 5. For the factor model with condition (2.1), we have

Moreover, if
are distinct, then

From this we see that, under a suitable sparsity condition on
, the first K principal components are still approximately the same as the columns of the factor loadings, even if the eigenvalues are not as spiked as O(p).
that

The other half of this theorem still holds, however, so the less spiked structure will not asymptotically increase the risk of overestimation in the selection of K.
The performances of IC, the Akaike information criterion AIC and the Bayesian information criterion BIC are compared in Table 11, with the corresponding largest eigenvalues shown in Fig. 15. If the spectrum structure satisfies assumption 1, both IC and BIC select the correct value of K. However, if we shrink the spiked eigenvalues, IC and BIC tend to underestimate, whereas AIC overestimates, the true K.
| Methods | Results for the following values of C: | | | |
|---|---|---|---|---|
| IC | 6.00 (0.00) | 1.08 (0.27) | 1.00 (0.00) | 6.00 (0.00) |
| AIC | 20.00 (0.00) | 20.00 (0.00) | 20.00 (0.00) | 20.00 (0.00) |
| BIC | 6.00 (0.00) | 2.00 (0.00) | 1.00 (0.00) | 6.00 (0.00) |
- †For the same u and B as in Section 6.2, expand B to a block diagonal matrix by making it the diagonal block, with the remaining rows generated from a normal distribution. Expand the generating process of F similarly to match, and generate the data accordingly. Here, K=6. The means of the estimated K are reported over 100 repetitions, with standard errors in parentheses.
Fig. 15. The largest eigenvalues of the sample covariance matrix in cases (a)–(d)
Suppose now that K is underestimated, but the estimator is otherwise constructed in the same way, with the entrywise shrunk estimator of the error covariance. In this case, owing to the omitted common factor, most of the pairs of cross‐sectional units are no longer ‘weakly correlated’. Note that the thresholds in Appendix A are still the same, i.e. no extra shrinkage is introduced. However, the sparsity measure used in theorems 2 and 3 is not o(p), so the error bound does not converge to 0; but, when K is correctly estimated or overestimated, even substituting assumption 5 for assumption 1, the corresponding results in theorems 2 and 3 still hold. Thus, if there is doubt about the validity of assumption 1, a less severe penalty (e.g. AIC) may be preferable, to avoid the more serious error of underestimation of K.
Frank Critchley (The Open University, Milton Keynes)
- Consistent estimation of Σ, or its inverse, relies on key assumptions: typically, a static model with pervasive factors and (approximately) sparse errors. Two key questions arise.
- How far can they be checked? Given POET's fast (singular value decomposition) nature, deletion diagnostics seem entirely feasible. Again, individuals and/or time points can be deleted. Further, the static model can be tested within broader—e.g. first‐order auto‐regressive—models.
- What effects do these assumptions have on subsequent inference? Indeed, could such context‐specific considerations help to guide the choices to be made within a POET‐analysis? In short, is there scope for a ‘POET for purpose’?
- POET having an equivalent least squares formulation, potential lack of robustness to outliers merits consideration, with the usual array of possible solutions. In particular, there is a variety of robust versions of both principal component analysis and factor analysis.
- Might there be a role for the invariant co‐ordinate selection (ICS) methodology introduced in Tyler et al. (2009)? If so, there would seem to be several possible advantages. ICS requiring two affine equivariant scatter functionals, a robust choice could, for example, be used alongside the regular covariance. Of particular relevance here, I have recently shown that one of these functionals can be singular, without essential loss. ICS could be used as a complement to POET, the former using a generalized form of the principal component analysis asymptotically determining the latter. More radically, it could replace POET, as is perhaps natural on invariance grounds: subsuming centring, ICS's invariance to linear transformation of the data combines principal component analysis's invariance to orthogonal transformation with that of factor analysis to separate scaling of the variables. This would have the additional advantage of potentially extending POET to a wider range of data types. In particular, incommensurable variables could be accommodated. Finally, however ICS is implemented, it retains POET's computational speed, while providing visual displays. These offer a range of diagnostic and other potential benefits, notably, multivariate outlier detection or, again, group detection (via implicit estimation of Fisher's discriminant subspace).
Once again, it is a pleasure to congratulate the authors on a very stimulating paper.
Jian Zhang (University of Kent, Canterbury)
I congratulate Fan and his colleagues on their ground breaking and innovative contribution to high dimensional covariance estimation. I would like to contribute to the discussion by the following comments.
Consider model (6), in which Ω is a grid approximation to the brain, β(r,t) is a latent univariate time source of interest at location r, x{r,η(r)} is a design vector determined by the so‐called Maxwell equations with orientation η(r) and ɛ(t) is noise. Assume that β(r,t) is sparse, i.e. the temporal variability (called power or the marginal variance) var{β(r,·)}=0 for all r ∈ Ω except a few locations (i.e. non‐null sources). We want to localize these non‐null sources among an infinite number of candidates. A spatial filtering theory has been developed by Zhang et al. (2012) and Zhang (2012) for searching for a non‐null source. Under certain conditions, the covariance matrix of Y(t) can be expressed as a sum of rank one terms, each involving the output vector of the sensors that would be induced by a unit magnitude source at a given location along a given orientation, weighted by the power at that location. So it is not surprising that our theory relies on an appropriate estimate of the covariance matrix when p is much larger than n and J. In this sense, I expect that POET can significantly improve our spatial filters.
There are two different asymptotics: expanding time domain asymptotics, where the time window expands as J increases, and infill asymptotics, where the observation times t_j, 1≤j≤J, are restricted to a fixed time window as J increases. Under the infill setting, the current strong mixing condition in Section 3 will not hold. I am curious about the performance of POET in the infill setting.
John T. Kent (University of Leeds)
It is interesting to compare the methodology of this paper with conventional multivariate analysis in a fixed dimension p, where p is small or moderate and the relevant asymptotics involve the sample size n growing large. The simplest methodology is either invariant or equivariant under affine transformations, e.g. Hotelling's T²‐statistic. Thus if there are two variables, height and weight, say, it does not matter what units we use to measure them; further, the original two variables can be replaced by any two linearly independent linear combinations.
However, even in this simple setting, there is typically less invariance for dimension reduction methods. For example, principal component analysis is equivariant only under orthogonal transformations of the data. Factor analysis is equivariant only under diagonal transformations, i.e. rescaling the variables.
The model of the paper assumes that
- the factor loadings are pervasive and
- the idiosyncratic covariance matrix Σ_u is sparse.
These assumptions suggest that
- the variables themselves (rather than linear combinations) are important;
- the choice of variables is important (for example we are not in a situation where half the underlying variables measure the same feature of the data);
- the choice of measurement units is important (ideally the variables should be commensurate, so that all the variables are measured in the same units with comparable variances).
I would be interested in the authors’ comments on these thoughts.
The following contributions were received in writing after the meeting.
Amir Ahmad, Sarosh Hashmi and Sami M. Halawani (King Abdulaziz University, Rabigh)
We congratulate Fan and his colleagues for this interesting paper. The paper proposes a model for the estimation of a high dimensional covariance matrix. The proofs are detailed and the experiments are extensive. The discussion provides good insight into the problem.
Gene expression data sets and protein expression data sets (e.g. Golub et al. (1999) and Alon et al. (1999)) provide a challenge because of their high dimensions and small number of data points. The authors have talked about statistical genomics as one of the fields of application of the methods proposed in the paper. Hence, it would be interesting if they could show some results obtained by the proposed method on these data sets and if they could comment on future extensions of the model proposed.
Charles Bouveyron (University Paris 1 Panthéon‐Sorbonne)
Before I go further, I would like to thank the authors greatly for this very interesting and painstaking work. I found that this paper was put together with real care. I particularly appreciated the fair balance between theory and experiments.
The subject of the paper, large covariance matrix estimation, has become a central problem in modern statistics. Indeed, the technological advances of the last two decades have significantly modified the nature of data, and consequently of statistics. In particular, modern data are often high dimensional (large numbers of variables), big (large numbers of observations) or available as a stream (the observations pass and cannot be stored).
The paper focuses on the factor model and discusses solutions for estimating the covariance matrix. The POET‐method that is introduced has the advantage of including existing regularization strategies for large covariance matrix estimation. Among those strategies, one consists in thresholding the principal directions associated with the smallest eigenvalues. For this, POET completes the eigendecomposition with a thresholded matrix, let us say R. This allows us in particular to perform the inversion of the covariance matrix efficiently. An alternative would be to use the covariance matrix approximation that was used in Bouveyron et al. (2007), which leads to an explicit inversion of the covariance matrix. Furthermore, recent strategies for estimating sparse covariance matrices include ℓ₁‐type penalties. A theoretical and experimental comparison with these approaches would be interesting.
D. S. Coad and H. Maruri‐Aguilar (Queen Mary University of London)
We congratulate Fan and his colleagues on this beautiful paper, which provides an elegant method for estimating a high dimensional covariance matrix with a conditional sparsity structure. The simplicity of the approach and its wide applicability make it very appealing. Asymptotic properties and simulation results convincingly demonstrate the superiority of the method. We feel that the estimator proposed has a multitude of other potential uses in practice.
The problem of controlling the false discovery rate in example 3 often presents itself in gene association analysis, but limited numbers of observations are available. Since there is only a small number of observations for each hypothesis, a one‐stage design can lead to tests with poor power. However, Zehetmayer et al. (2005) have shown that a two‐stage design based on combining the p‐values from a screening stage and a testing stage can significantly improve the power. A generalization to multistage designs is provided by Zehetmayer et al. (2008). A natural question is whether the principal factor approximation can be applied to these designs.
In the multiperiod asset pricing model that is outlined in example 6, to test the null hypothesis (1.2), the model is embedded in the multivariate linear model (5.3). When p<T, the usual test statistic has either a χ²‐ or an F‐distribution under the null hypothesis, according to whether the covariance matrix Σ_u is used or an estimate. However, when p≫T, the estimate of Σ_u is degenerate and the non‐degenerate estimate Σ̂_u can be employed instead. It would be interesting to know what the distributions are of the test statistic W and of its counterpart based on Σ̂_u. In particular, it is unclear what the corresponding degrees of freedom would be.
A problem with large data sets in computer experiments is the intractability of the usual Gaussian process model. The main obstacle is the evaluation and inversion of large covariance matrices. Kaufman et al. (2008, 2011) used respectively tapering to produce sparse correlation matrices and correlation functions with compact support. The thresholding methods that are described could be used for the analysis of computer experiments, by devising a special form for the entry-adaptive thresholding rule. This would allow fast covariance computations and tractability of the problem.
Wei Dang (Shihezi University) and Keming Yu (Brunel University, London)
The principal component analysis method for large covariance matrix estimation is a novel idea for a challenging issue. By assuming a sparse error covariance matrix in a multifactor model, the proposed principal orthogonal complement thresholding estimator POET does have a proper rate of convergence.
Whereas principal component analysis can be applied to the analysis of non-stationary time series (Lansangan and Barrios, 2009), POET may lose the good properties presented in theorems 1 and 3 for non-stationary and non-ergodic time series. Because POET relies on stationarity and ergodicity assumptions for the underlying time series, it may exclude many important application examples, including financial time series analysis and health science data analysis. For example, modern mathematical models largely focus on martingale models, including Brownian motion, but a multi-dimensional Brownian motion may not be ergodic. In health sciences the data under analysis may be the yearly heights and weights of a large group of children recorded from their early ages to the end of their high school studies. It is often observed that the heights and weights of these children rise much more quickly in certain years than in others, so the sample means and variances in the periods of quick growth differ statistically from those in other years; the data under analysis would then be non-stationary. One way to apply POET to these problems may be to transform the data first, e.g. detrending non-stationary processes into stationary ones, or using a Laplace transform to turn non-ergodic processes into ergodic ones.
The other issue with the proposed POET is to adapt it for the analysis of data with outliers. Many empirical studies find that the distribution of stock returns departs from normality, including the stock returns from the Center for Research in Security Prices database that are used in the paper. Like principal component analysis, POET may become unreliable if outliers are present in the data. The same type of data may occur in health science. As Jolliffe (2002) pointed out, for a sample of healthy children of various ages between 5 and 15 years old, an observation with height 175 cm and weight 25 kg is not particularly extreme on either the height or the weight individually, but the combination (175 cm, 25 kg) is an outlier. In such cases, it is desirable to employ a statistical estimation procedure that is more efficient and robust than ordinary least squares, yielding a robust POET-estimator.
Matteo Farnè (University of Bologna)
I thank the authors for this very challenging paper. While reading it and listening to the presentation, I have learnt much. My comment is on possible extensions of the method proposed.
In Farnè and Montanari (2013) I have done some work on a different approach to the estimation of large covariance matrices, namely the approach based on shrinkage, under assumptions which differ from those considered in this paper. Ledoit and Wolf (2003, 2004) suggested obtaining a well-behaved covariance matrix by shrinking the sample covariance matrix either towards a scaled identity matrix or, to impose some structure on the estimator, towards a single-index model covariance matrix. Boehm and von Sachs (2008, 2009) have successfully extended shrinkage approaches to the estimation of the spectral matrix of a multivariate time series.
My feeling is that the POET‐method could be profitably extended to the estimation of large spectral matrices also. Of course, owing to the particular nature of spectral matrices the extension is not straightforward; for instance the effect of smoothing must also be taken into account. I am wondering whether Fan and his co‐authors would suggest that we employ their method in the frequency domain also or on the contrary whether they see any reason why such an extension is not advisable.
Marco A. R. Ferreira (University of Missouri, Columbia)
I congratulate Professor Fan and his colleagues for their valuable contribution to the area of large covariance matrix estimation.
Professor Fan and colleagues have developed a method for estimating large covariance matrices when there are common unobservable factors and additional cross-sectional correlation. They consider the case when, as the number of individuals p and the number of time points T grow to infinity, the number of common unobservable factors K remains fixed. In addition, in their set-up the eigenvalues corresponding to the common factors diverge as p→∞. Finally, they assume that the covariance matrix of the idiosyncratic component is approximately sparse.
With these assumptions, the authors develop a method based on principal component analysis for covariance matrix estimation. Specifically, first they estimate the contribution of the common factors to the covariance matrix by the sum of the K first terms of the sample covariance matrix spectral decomposition. Then, they subtract the estimated common factors contribution from the sample covariance matrix to obtain the principal orthogonal complement. Further, they apply thresholding to the principal orthogonal complement to obtain an estimator of the idiosyncratic covariance matrix. Finally, their covariance matrix estimator is the sum of the estimated common factors contribution and the estimated idiosyncratic covariance matrix.
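In code, this construction reads roughly as follows: a minimal numpy sketch (ours, not the authors' implementation), which for brevity replaces the paper's entry-adaptive threshold with a single universal constant tau.

```python
import numpy as np

def poet(Y, K, tau):
    """Minimal POET sketch. Y is a p x T data matrix (rows = variables),
    K the number of factors, tau a universal threshold (the paper uses
    entry-adaptive thresholds). Returns the estimated covariance matrix."""
    S = np.cov(Y)                              # p x p sample covariance
    vals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:K]           # K leading components
    # contribution of the common factors: first K spectral terms
    low_rank = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    R = S - low_rank                           # principal orthogonal complement
    # threshold the off-diagonal entries, keep the diagonal intact
    R_thr = np.where(np.abs(R) >= tau, R, 0.0)
    np.fill_diagonal(R_thr, np.diag(R))
    return low_rank + R_thr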
- As the number of individuals p increases, it seems intuitive to assume that the underlying process generating the data should grow in complexity, i.e. it seems intuitive that K should grow with p. What would be the potential technical issues that would arise if one decides to extend the current work to the case when K grows with p?
- For the application of thresholding, there are a number of constants that must be chosen such as τ in equation (2.6) and C in equation (3.2). There seems to be an opportunity for the use of empirical Bayes methodology for the estimation of those threshold parameters.
Florian Frommlet (Medical University Vienna)
I congratulate the authors on this impressive paper concerned with estimating high dimensional covariance matrices under conditional sparsity. Their approach is surprisingly simple: first compute the principal components of the sample covariance matrix, then estimate the number of relevant components and finally apply a thresholding procedure on the remaining covariance matrix. In spite of this simplicity extensive simulation studies in their paper show that POET, the implementation of the approach presented, outperforms competing algorithms in various scenarios.
It is not too surprising that POET performs well in those scenarios based on factor models with few factors, which mimic the situation under which the authors have derived asymptotic results for their method, i.e. when the covariance matrix has a small (fixed) number K of very large eigenvalues. It is quite intuitive that in this situation the first K principal components will simply represent the corresponding factors of the factor model. Also it appears to be clear that the procedure works well when no factors are present, as long as the number of components is then correctly estimated to be 0.
For me the most astonishing result is that POET appears to do relatively well in model 3 of Section 6.5.2, where data were simulated according to an auto‐regressive AR(1) model. This is the only presented simulation scenario where data were not simulated either from a factor model with a small number of strong factors, or alternatively from a model without factors and sparse covariance matrix. The covariance matrix of the suggested AR(1) model does not have particularly spiked eigenvalues, but the eigenvalues smoothly decrease from their maximum. In fact for p=200 and p=300 there are 36 and 53 eigenvalues larger than 1 respectively. According to the simulations presented POET picks for this scenario (both for p=200 and p=300) on average roughly six factors to model the covariance structure, outperforming direct thresholding of the sample covariance matrix. This result indicates that POET might work well even in situations which are not covered by the asymptotic analysis presented. However, further work seems to be necessary to explain why that would be so.
I. Gijbels and K. Herrmann (KU Leuven) and A. Verhasselt (Universiteit Hasselt and Universiteit Antwerpen)
Fan and his colleagues present a very nice estimation technique (POET) for high dimensional covariance estimation, based on principal component estimation and thresholding the orthogonal complement of the principal components. They show that POET is equivalent to constrained least squares (CLS) estimation.
We wonder how robust POET is when the data matrix is corrupted, since it is well known that least‐squares‐based methods are not robust. The equivalence of POET to a CLS estimation problem seems to open the way for a more robust procedure. The use of robust principal component methods (e.g. Engelen et al. (2005)) could also offer a possibility.
As pointed out in the literature (see for example Antoniadis (2007)), the qualitative properties of a thresholding rule turn out to be important. For example, the hard thresholding rule is discontinuous, whereas the soft thresholding rule is continuous. In CLS regression hard thresholding leads to a larger variance of the estimates, whereas soft thresholding shifts the estimates, creating a bias. What is the effect of such qualitative properties of the thresholding rule on the POET estimator?
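For reference, the two rules under discussion can be written as follows; this is the standard formulation (ours, not taken from the paper), which makes the discontinuity of hard thresholding and the shrinkage bias of soft thresholding explicit.

```python
import numpy as np

def hard_threshold(x, tau):
    # keep an entry unchanged if it exceeds the threshold, else set it to 0:
    # discontinuous at |x| = tau, no bias on surviving entries
    return np.where(np.abs(x) >= tau, x, 0.0)

def soft_threshold(x, tau):
    # shrink every entry towards 0 by tau: continuous, but biased
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
```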
The authors use a computationally expensive cross‐validation criterion to choose C. It might be worth the effort to exploit the equivalent CLS problem and to use criteria based on this equivalence, such as an Akaike type of criterion.
Portfolio allocation in the Markowitz (1952) framework is chosen as a numerical illustration of POET. In the simulation studies and empirical application, the emphasis is on estimating the weights of the minimum variance (MV) portfolio as the solution to minimizing w⊤Σw subject to w⊤1=1. This is in line with the current literature (Kourtis et al. (2012) and references therein), where the MV portfolio is preferred because it alleviates the necessity of estimating the stock returns. As the MV weights admit the expression w=Σ⁻¹1/(1⊤Σ⁻¹1), the comparison of POET, the strict factor model (Fan et al., 2008) and the sample covariance (SC) matrix estimator amounts to a comparison of the estimated precision matrix in the models considered. It is known that the SC precision matrix performs poorly (Fan et al., 2008; Kourtis et al., 2012) and measures to counterbalance estimation errors must be taken. In a p<T framework, shrinkage methods for example are applied to the SC matrix before (Ledoit and Wolf, 2003) or after inversion (Kourtis et al., 2012), significantly enhancing results. Shrinkage methods have also been applied in the p≫T framework (see Ledoit and Wolf (2004)), establishing a possible competitor in this scenario as well. Owing to the known shortcomings of the SC precision matrix, deeper insights can be expected from a comparison with such refined methods.
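Since the comparison turns entirely on the estimated precision matrix, the following minimal sketch (ours) shows how the MV weights are obtained from any covariance estimate supplied as input:

```python
import numpy as np

def min_variance_weights(Sigma_hat):
    """Minimum variance portfolio weights w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    p = Sigma_hat.shape[0]
    ones = np.ones(p)
    w = np.linalg.solve(Sigma_hat, ones)  # Sigma^{-1} 1 without explicit inversion
    return w / w.sum()                    # normalize so the weights sum to 1
```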
Wally Gilks (University of Leeds)
Fan and his colleagues state that the low rank plus sparse representation of their model is for the population covariance matrix. The most obvious interpretation of this assertion is that the model is intended to describe the population, not the specific individuals sampled. This interpretation is somewhat at odds with the design of the sparse component Σu of the model, which accounts for idiosyncratic correlations between specific individuals.
Consider, for instance, a model in which any two individuals i and j interact with probability π, independently across pairs, and in which the correlation of their idiosyncratic errors is ρ if i and j interact, and 0 otherwise. In a sample of size p, the number of other individuals with which individual i idiosyncratically interacts is distributed as binomial(p−1,π). The authors require that their measure of sparsity, m_p, grows with sample size as o(p). Letting M_p and m̄_p denote the maximum and mean values in a sample of size p from a binomial(p−1,π) distribution, both grow linearly in p, and m_p is driven by M_p. Thus m_p ≠ o(p) unless ρ=0.
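Gilks's growth claim is easy to check numerically; the following small simulation (ours, with an assumed interaction probability π=0.05) shows the maximal number of interactions per individual growing linearly in p:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.05                                    # interaction probability (assumed)
for p in (100, 400, 1600):
    # interactions per individual ~ binomial(p - 1, pi)
    counts = rng.binomial(p - 1, pi, size=p)
    # the maximum count grows roughly like pi * p, i.e. linearly in p
    print(p, counts.max(), counts.max() / p)
```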
Hajo Holzmann and Anna Leister (Philipps‐Universität Marburg)
We congratulate Fan and his colleagues for an inspiring paper on estimating the factor structure in high dimensional, approximate factor models, and its consequences for estimating the underlying covariance matrix.
Let us consider implications for the time series structure of y_t, specifically its lagged covariance matrix, and convergence in the max-norm. For the covariance matrix Σ itself, Fan et al. (2011), theorem 3.2, obtain the rate O_P{√(log p/T)} in the max-norm for an estimate based on an observed factor structure, whereas in the present paper, utilizing estimated factors, the authors obtain the rate O_P{√(log p/T)+1/√p}. Now, for the sample covariance, writing the entrywise estimation error as a sum of centred cross-products, we obtain the rate O_P{√(log p/T)} as well. Thus, using an estimated factor structure may not be beneficial for moderate values of p.
For the lagged covariance Σ_y(h)=cov(y_{t+h},y_t), h≥1, assume for distinction that the errors u_t are known to be serially uncorrelated: cov(u_{t+h},u_t)=0. Then Σ_y(h)=B cov(f_{t+h},f_t)B⊤, and an estimate based on the observed factor structure attains the rate O_P{√(log p/T)} in the max-norm. In contrast, for the estimate based on estimated factors and loadings, i,j=1,…,p, we obtain from corollary 1 (in the present paper) an additional 1/√p term in the rate.
We give a finite sample illustration, similar in setting to Fan et al. (2011), but with factors following a more strongly dependent AR(1) process. The results are plotted in Fig. 16. Further details for the above statements and the simulation can be found at http://www.unimarburg.de/fb12/stoch/files/holzmann/fandiscuss.pdf.
Fig. 16. (a) Estimation errors in the max-norm (sample; factor; observed factor); (b) averages of lagged covariance (max-norm) over 500 simulations against p
Hanwen Huang, Yufeng Liu, J. S. Marron, Dan Shen and Haipeng Shen (University of North Carolina at Chapel Hill)
We congratulate Fan and his colleagues on a very interesting contribution, which takes the fundamentally important field of covariance matrix estimation in some important new directions. We agree that now is a good time to be studying asymptotic contexts, where the first K eigenvalues of Σ grow quickly. The asymptotic mode of the sample size tending to ∞, with an exponentially growing dimension, can be improved by taking the dimension as the asymptotic driver, with the sample size growing at a logarithmic rate. This makes it clear that this setting is very close to the high dimension, low sample size setting with fixed sample size (Hall et al., 2005). Shen et al. (2012) studied another notion of principal component analysis consistency, in a wide range of such settings. Shen et al. (2013) studied another approach to sparsity under a growing eigenvalue assumption, establishing a new characterization of the boundary between regions of consistency and strong inconsistency for sparse principal component analysis in high dimension, low sample size settings. Can similar results be established for POET?
Fig. 17. Estimated eigenvalues by the soft thresholding and POET methods, together with the truth, for two simulated data sets with d=1000 and n=50: (a) results for spike size λ=5 and w=200 spike entries (POET works best); (b) results for λ=100 and w=10 (soft thresholding works best)
Fig. 17(a) shows that, in situations where the number of the spikes is larger than the sample size, the POET‐method gives a major improvement. Fig. 17(b) shows that, in situations with few large spikes, the soft method works better than POET owing to better background noise estimation. Ultimately, a combination of POET with existing SigClust methods may work better.
Jian Huang (University of Iowa, Iowa City, and Shanghai University of Finance and Economics) and Yong Zhou (Chinese Academy of Sciences, Beijing, and Shanghai University of Finance and Economics)
We congratulate Fan and his colleagues on presenting a wonderful and thought-provoking paper dealing with an important topic in high dimensional data analysis. They introduce the POET-methodology for covariance matrix estimation and study its properties. The theoretical results obtained by them are highly original, notably the aspects concerning the relative magnitude of the sample size and the number of variables. They also describe several poetic and important applications. We focus our discussion on an application of POET in the linear regression model y=Xβ+ɛ. Here y is the n×1 response vector, X is the n×p matrix of predictors, β is the vector of regression coefficients and ɛ consists of independent errors.
An attractive approach to selection and estimation in high dimensional regression is based on the penalized least squares criterion (2n)⁻¹‖y−Xβ‖² + Σ_j ρ(|β_j|;λ), where ρ(·;λ) is a suitable penalty function with a tuning parameter λ≥0. The success of this approach depends on the behaviour of the restricted eigenvalues and related quantities of the Gram matrix X⊤X/n (see, for example, Bickel et al. (2009)). We discuss a way to repair the degeneracy of X⊤X/n based on POET.
Let Σ be the covariance matrix of the row vectors in X. Assume factor model (1.1) in the paper for the predictors and denote the POET-estimator of Σ by Σ̂. Consider the spectral decomposition Σ̂=VΛV⊤, where V is a p×p orthonormal matrix of the eigenvectors and Λ is a diagonal matrix of the eigenvalues of Σ̂. Let U=XVΛ^{−1/2}. Then X=UΛ^{1/2}V⊤. This expression is reminiscent of a singular value decomposition, but here U is only approximately orthogonal, since V comes from Σ̂ rather than from X itself. However, it can be viewed as a POET-regularized singular value decomposition.
The least squares loss equals ‖y−Xβ‖² = ‖y‖² − 2y⊤Xβ + β⊤X⊤Xβ. Replacing the Gram matrix X⊤X/n by Σ̂ in this expression and noting that X=UΛ^{1/2}V⊤, we obtain a quadratic form ‖ỹ−X̃β‖² plus a term independent of β, where X̃=Λ^{1/2}V⊤ and ỹ=U⊤y (up to scaling by n). It is natural to consider the penalized criterion based on ‖ỹ−X̃β‖² plus the penalty Σ_j ρ(|β_j|;λ). The loss function here can be considered a regularized version of the least squares loss, in which the rank deficiency of X⊤X/n is repaired by making use of Σ̂.
In particular, the least squares estimator based on this regularized loss is β̃=Σ̂⁻¹X⊤y/n, which is well defined because Σ̂ is invertible. With standardized predictors, the marginal statistic X⊤y/n was used for screening variables by Fan and Lv (2008). The estimator β̃ can be considered a corrected version of X⊤y/n based on Σ̂, and it also can be used for screening.
The validity of the above proposal rests on the properties of Σ̂. Simulation studies are needed to evaluate its finite sample performance, and much work is required to analyse its theoretical properties. The results obtained by the authors provide a solid basis for such analyses.
Sungkyu Jung (University of Pittsburgh) and Jason P. Fine (University of North Carolina at Chapel Hill)
In this very stimulating paper, Fan and colleagues show the gain of conditional sparsity assumptions in covariance matrix estimation when principal components are pervasive. It is striking that the estimation procedure uses the first few principal components from the sample covariance matrix without any thresholding. The sample principal components are known to be inconsistent without strong assumptions (Johnstone and Lu, 2009). The present paper shows that under the conditional sparsity assumption the covariance estimator based on these principal components is consistent.
It seems that the gains in the methodology proposed arise in part from the sparsity assumptions and in part from the pervasive factor assumptions. The latter assumption requires that the K largest eigenvalues of the p×p covariance matrix Σ are of order p. It is well known that the magnitude of eigenvalues is a critical condition for the consistency of principal component directions when dimension p increases with the sample size n. As an example, suppose that the largest eigenvalue λ₁ of Σ is of magnitude δ(p) with the rest being simply 1. The corresponding sample eigenvector is consistent, in the sense that its angle with the population eigenvector vanishes as p→∞, when δ(p)/p→∞. In contrast, such a strong result does not hold whenever δ(p)=O(p) or o(p) (Jung and Marron, 2009; Jung et al., 2012). This gives the insight that the sparsity assumption on the error covariance matrix is critical for the proposed estimator under the pervasive factor assumption (i.e. δ(p)=O(p)).
Should we be tied to the pervasive factor assumption? Many other conditions have been considered in the literature. For example, in random-matrix theory where n and p increase at the same rate, it is customary to assume fixed eigenvalues for all p (see, for example, Paul (2007)), which corresponds to δ(p)≡δ=o(p). Meanwhile, implicit assumptions in sparse estimation of principal components are that the number of non-zero loadings in population eigenvectors grows at a slower rate than p (Shen et al., 2013). The case δ(p)/p→∞ yields a trivial solution, with easy separation of the leading eigenvectors from the error covariance matrix. In contrast, the case δ(p)/p→0 makes estimation of the covariance matrix much more difficult. What can be said about the proposed estimator in the latter case?
It would be worthwhile to investigate more carefully the interaction between the two key assumptions on the magnitudes of the eigenvalues and the sparsity of the orthogonal complement matrix.
Although the theoretical results are quite elegant, concerns arise about the practical implications of the two key assumptions. This is particularly true since there may not be information in the data to detect violations of the assumptions. In what types of applications are such assumptions reasonable? Are there situations where one type of assumption is more realistic than the other type? Are there diagnostics that might be employed in real data analysis? Additional practical guidance would be welcomed.
Oliver Linton and Michael Vogt (University of Cambridge)
The authors decompose the population covariance matrix as Σ = BB⊤ + Σu, where
- the systematic part BB⊤ has K large (O(p)) eigenvalues and p−K zero eigenvalues and
- the residual part Σu is a sparse matrix with bounded eigenvalues.
In financial applications, Σu represents idiosyncratic risk that can be diversified away, and so makes a smaller order contribution to portfolio risk, but in practice it can be important. The authors are to be congratulated on their comprehensive and useful method for taking full account of this structure.
The assumption that all of the non-zero eigenvalues of BB⊤ dominate the largest eigenvalue of Σu by the magnitude p is likely to be a little strong in practice. If K is moderately large, the Kth eigenvalue of BB⊤ can be expected to be much closer to the largest eigenvalue of Σu than the first eigenvalue is. This may affect the quality of the estimation procedure and make the problem of selecting K difficult. Fig. 18 illustrates this point with the help of the data from Section 7. We wonder whether the main theoretical results continue to hold under weaker assumptions on the growth rate of the smallest positive eigenvalue of BB⊤.
Fig. 18. Eigenvalues of the estimated BB⊤ and Σu for the various yearly data samples used in Section 7.2 (the time points on the x-axis indicate the starting date of each sample and K=3 as in the paper; the plots show that the first (i.e. the largest) eigenvalue of BB⊤ is much more spiked than the third, the latter roughly having the same magnitude as the largest eigenvalue of Σu): (a) largest eigenvalue of BB⊤ and of Σu in each sample; (b) comparison of the third largest eigenvalue of BB⊤ with the largest eigenvalue of Σu
Another remark concerns the technical assumption 2, part (c), which imposes exponentially decaying tails on the model residuals. This is a very strong condition which in particular implies that all moments exist. In applications to daily equity returns this is likely to be violated. Fig. 19 shows a log‐rank plot of the data from Section 7 which suggests that the residuals are far from having exponentially decaying tails. To have a better idea of how the estimation procedure works when applied to financial data, it would thus be important to understand which parts of the procedures are robust to weaker moment conditions and which are not.
Fig. 19. Log-rank plots of the residuals; the median of the estimated tail exponents obtained for each sample (as can be seen, the exponents take values roughly between 3.5 and 5, indicating that in many cases only the first few moments will exist)
Regarding the portfolio choice application, the authors use shrinkage methods to impose sparsity on the idiosyncratic part of the covariance matrix of returns. An alternative or perhaps complementary approach here (see Yen (2011)) is to impose sparsity on the portfolio weights through an ℓ1-penalty. Each non-zero investment entails a transaction cost and so it makes financial sense to minimize the number of such transactions; this is especially relevant for very large portfolios turned over daily. One further concern with the portfolio methodology is that no smoothness assumptions on the thresholded idiosyncratic covariances are exploited. In particular, the location of the 0s in the thresholded matrices (and thus their eigenstructure) may change abruptly over time, even though the rolling window data overlap considerably from period to period.
Han Liu (Princeton University) and Lie Wang (Massachusetts Institute of Technology, Cambridge)
We congratulate Professor Fan, Professor Liao and Miss Mincheva for their thought-provoking paper. We believe that the proposed methodology will have a profound impact and stimulate much further research.
Estimating a large covariance matrix under a small sample size is a fundamental problem. However, it suffers from the challenge that the eigenvalues of the sample covariance matrix do not converge to the population truth when the population eigenvalues are bounded or grow at a slow rate. In this paper, Professor Fan, Dr Liao and Miss Mincheva avoid this problem by exploiting an approximate factor model with a spiked eigenvalue condition: they assume that the population covariance matrix decomposes into a low rank component and a residual component. The eigenvalues of the low rank component are spiked and diverge at a fast rate, whereas the eigenvalues of the residual component are bounded. The POET‐estimator directly runs the singular value decomposition on the sample covariance matrix. It estimates the low rank component by the top principal components and applies thresholding methods to estimate the residual component according to different sparsity and smoothness conditions. The covariance matrix is then estimated by combining these two components.
Their paper stimulated us to consider the following two extensions.
Semiparametric extension
The POET-method requires exponential-type tails of the data to establish large deviation results. It is interesting to extend this method to handle data from the semiparametric nonparanormal family (Liu et al., 2012).
A random vector X=(X₁,…,X_p)⊤ belongs to the nonparanormal family if there exist monotone functions f₁,…,f_p such that the transformed vector (f₁(X₁),…,f_p(X_p))⊤ is Gaussian, i.e. (f₁(X₁),…,f_p(X_p))⊤ ∼ N(0,Σ). For identifiability, Σ is constrained to be a correlation matrix. Under the nonparanormal model, Liu et al. (2012) suggested replacing the sample correlation matrix by the transformed Kendall's τ rank correlation matrix R̂ with entries
R̂_jk = sin(πτ̂_jk/2), (7)
where τ̂_jk is the empirical Kendall τ-statistic between X_j and X_k. By assuming that Σ admits the 'low rank plus sparse' structure, we could apply the POET-method on R̂. We have obtained encouraging numerical results; further theoretical investigation is on the way.
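A minimal sketch (ours) of the transformed rank correlation matrix in expression (7), using scipy's Kendall τ; the double loop is quadratic in p and is shown only for clarity:

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_correlation(X):
    """X is an n x p data matrix; returns R_hat with entries sin(pi/2 * tau_jk)."""
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(0.5 * np.pi * tau)
    return R
```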
Tuning‐insensitive extension
Another extension is to apply more sophisticated methods to estimate the residual component matrix. For Gaussian models, Liu and Wang (2012) proposed a sparse inverse covariance estimation method named TIGER. The TIGER-estimator is tuning insensitive and achieves the optimal rates of convergence for both covariance and inverse covariance estimation under different norms. It would be interesting to see whether these good theoretical properties still hold for the corresponding POET-estimator.
Jorge Mateu (University Jaume I, Castellón)
Fan and his colleagues are to be congratulated on a valuable and thought‐provoking contribution on the estimation of high dimensional covariances with a conditional sparsity structure. As they note, this problem can be encountered in a wide variety of practical examples and scientific fields. In particular they mention the problem of high dimensional classification.
I would like to comment on this problem in the context of spatial point processes. Byers and Raftery (1998) considered the problem of detecting features in spatial point processes in the presence of substantial clutter with the aim of outlining seismic faults. They used kth‐nearest‐neighbour distances to produce high breakdown point robust estimators of a covariance matrix in a high dimensional problem. If in this context we assume that we have some common but unknown factors, we can then use the idea of the principal orthogonal complement thresholding method to explore such an approximate factor structure with sparsity. We have sound strategies to calculate in this spatial context the sample covariance matrix and the factor‐based covariance matrix so that we can use the idea of conditional sparsity.
In the context of spatial point processes, Collins and Cressie (2001) developed exploratory data analytic tools, in terms of local indicators of spatial association (LISA) functions based on the product density, to examine individual points in the point pattern in terms of how they relate to their neighbouring points. For each point of the point pattern we have a LISA function. To perform statistical inference, needed for example in testing for local clustering, Collins and Cressie (2001) developed closed expressions for the autocovariance and cross-covariance between any two such functions. These covariance structures are complicated to work with as they live in (very) high dimensional spaces. Again, it is not difficult to assume common factors among these functions and thus it could be appropriate to consider conditional sparsity to estimate the covariance matrix consistently. This will provide new insight into such a problem.
Consider, finally, an approach based on latent process modelling and principal component analysis to obtain a computationally feasible exploratory tool for discovering patterns of association between components of a highly multivariate point process. The latent Gaussian fields are obtained as linear combinations of some independent Gaussian processes. Again it is easy to think about the POET‐method to estimate the complicated covariance matrix.
Guangming Pan (Nanyang Technological University, Singapore) and Heng Peng (Hong Kong Baptist University)
We congratulate Professor Fan, Dr Liao and Ms Mincheva for such a timely paper. We enjoyed reading it since there are some good and novel ideas here, particularly the idea to estimate a covariance matrix by reducing Σ to a low rank matrix plus a sparse matrix and the concept of conditional sparsity.
Our comment concerns the application of POET to high dimensional linear regression. Suppose that the response in a regression model
y_t = x_t⊤β + ε_t (8)
would depend on the first finite principal components of the regressor. The regressor x_t can then be supposed to follow a factor model like
x_t = Bf_t + u_t, (9)
where f_t is the r×1 factor process and u_t are the p×1 idiosyncratic error components. Combining model (8) with model (9) we then have
y_t = f_t⊤B⊤β + u_t⊤β + ε_t, (10)
say. When considering the new model (10), β is not necessarily sparse. Though the dimension of the model is still ultrahigh, as in Wang (2012) and Ke et al. (2012), it can be efficiently reduced by the sure independence screening procedure (Fan and Lv, 2008) if we impose some simple structure on the covariance matrix of u_t.
Fan, Liao and Mincheva assume that the number of principal components, K, is known. In many applications K is unknown and must be estimated. There is some literature focusing on this problem, e.g. Bai and Ng (2002) and Onatski (2009, 2010). Their approaches require similar spike conditions such that the first K largest eigenvalues go to ∞ and the remaining eigenvalues are bounded. But what would happen if Σ is structured in a different way, say, Σ is a Toeplitz matrix where the eigenvalues are not spiked? Can the number of factors still be estimated consistently? Below we consider the problem of estimating K (the first K largest eigenvalues do not tend to ∞).
In some sense, estimating the number of factors is equivalent to finding the number of eigenvalues of the population covariance matrix Σ which are greater than a constant number C, i.e. K = #{i: λ_i > C}, where #{i: i ∈ A} denotes the number of indices i which satisfy the property A and λ₁ ≥ … ≥ λ_p are the eigenvalues of Σ. When p/n→0, the sample eigenvalues are consistent estimates of the respective population eigenvalues of Σ (theorem 4 of Chen and Pan (2012)). Hence K can be determined from the sample eigenvalues λ̂_i as K̂ = #{i: λ̂_i > C}. When p/n→c ∈ (0,∞), Baik and Silverstein (2006) and Bai and Yao (2008) stated that the eigenvalues of a spiked Σ that are greater than 1+√c can be recovered from those of the sample covariance matrix. Each population eigenvalue λ outside [1−√c,1+√c] pulls one sample eigenvalue away from the support [(1−√c)²,(1+√c)²] of the Marchenko–Pastur distribution and positions it at λ{1+c/(λ−1)}. Therefore, when C>1+√c, we can estimate K as
K̂ = #[i: λ̂_i > C{1+c/(C−1)}].
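Under our reading of this rule, a minimal numpy sketch (ours) is as follows; the function name and interface are assumptions:

```python
import numpy as np

def estimate_num_factors(S, n, C):
    """Count the sample eigenvalues of S (p x p, from n observations) exceeding
    the almost sure limit psi(C) = C * (1 + c / (C - 1)) of a population spike
    at C, valid in the regime p/n -> c with C > 1 + sqrt(c)."""
    p = S.shape[0]
    c = p / n
    psi = C * (1.0 + c / (C - 1.0))
    eigvals = np.linalg.eigvalsh(S)
    return int(np.sum(eigvals > psi))
```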
When p/n→∞, we can first split the components of the random vector into several subgroups by some criteria. For every subgroup, the dimension would be smaller than or proportional to n. Hence the number of factors for every subgroup can be determined by the method suggested above. Since every subgroup shares the same factors, the estimated numbers of factors for the different subgroups can be averaged, or maximized over, to obtain the final estimate of K, the number of factors for the whole random vector. Though such an idea, including its computational burden, would need further investigation, we believe that, when p is significantly larger than n, the number of factors in the model should be estimable accurately. We would be very interested in hearing the authors' views on this point.
Mohsen Pourahmadi (Texas A&M University, College Station)
How do we go beyond the prevalent sparsity assumption in the recent literature and estimate a large, non-sparse covariance matrix? The question arises naturally in factor models where Σ takes the form of a low rank plus a sparse (non-diagonal) matrix, known as approximate factor models (Chamberlain and Rothschild, 1983). Given sample data with dimension p≥n, the attraction of the POET covariance estimator proposed by Fan, Liao and Mincheva is in its simplicity, transparency, generality and rigour. These attributes are highly desirable and we would like to see more of them in the rapidly growing algorithm-driven area of high dimensional data analysis (Pourahmadi, 2013).
Construction of a POET-estimator is simple and proceeds as follows: compute the spectral decomposition of the sample covariance matrix, retain the first q principal components, and apply thresholding, with parameter δ, to the covariance matrix of the residuals (the principal orthogonal complement). The resulting family of estimators (11) has notable special cases.
- When δ=0 and q=p, it reduces to the sample covariance matrix.
- When δ=1, the estimator reduces to that based on the standard factor model.
- When q=0, it reduces to the thresholded estimator of Bickel and Levina (2008) or the adaptive thresholded estimator of Cai and Liu (2011), depending on the choice of the thresholding rule.
In addition, as a bonus, using equation (11) and the Sherman–Morrison–Woodbury formula we obtain estimators of the precision matrix.
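Concretely, writing the estimator in low rank plus sparse form Σ̂ = B̂B̂⊤ + Σ̂u (notation ours for the estimated loadings), the Sherman–Morrison–Woodbury formula gives

$$\hat{\Sigma}^{-1} \;=\; \hat{\Sigma}_u^{-1} \;-\; \hat{\Sigma}_u^{-1}\hat{B}\,\bigl(I_K + \hat{B}^{\top}\hat{\Sigma}_u^{-1}\hat{B}\bigr)^{-1}\hat{B}^{\top}\hat{\Sigma}_u^{-1},$$

so that only a K×K matrix and the sparse Σ̂u need to be inverted.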
The asymptotic properties of the POET covariance estimators are established when the data are temporally correlated and under the strict stationarity assumption. In this set-up, it is desirable to go beyond estimating Σ and to have consistent estimators of the autocovariance matrices or the spectral density matrix of the underlying process. I wonder whether the authors have thought of this problem and can shed any light on how their conditions relate to those in Forni et al. (2004) in the context of generalized dynamic factor models.
Cheng Yong Tang (University of Colorado, Denver) and Yingying Fan (University of Southern California, Los Angeles)
We most heartily congratulate Fan and his colleagues for their thought‐provoking and impactful work on estimating the large covariance matrix, which is pivotal in many contemporary scientific and practical studies. Facilitated by a factor model, a parsimonious structure is proposed for the large covariance matrix by combining a low rank matrix and a sparse covariance matrix. In the authors’ framework, a factor model is used to characterize the systematic common components underlying the target large‐scale dynamics in various problems, and a sparse covariance matrix is imposed to incorporate the remaining idiosyncratic contributions to the variations and covariations. Our comments are mainly on the treatment for the idiosyncratic component, i.e. the remaining dynamics after identifying and removing the systematic part.
An important assumption of the approach proposed is that a sparse covariance matrix Σu is imposed for modelling the idiosyncratic component. One may naturally wonder, in situations when a sparse Σu is inadequate, what alternative approach can be used for modelling the idiosyncratic component. Further, can a similar idea of parsimonious modelling by structural decomposition be extended to solve other problems such as large precision matrix estimation? In the framework of graphical models, Tang and Fan (2013) investigate the problem of large precision matrix estimation by parsimoniously modelling the idiosyncratic component using a sparse precision matrix Σu⁻¹. They observe that the large-scale precision matrix Σ⁻¹ depends on the idiosyncratic component only through the precision matrix Σu⁻¹. Thus a similar idea of structural decomposition can be equally applied for estimating the large precision matrix, with the systematic component being captured by a factor model. Facilitated by the interpretation that 0s in a precision matrix imply conditional independence between the corresponding components, a sparse Σu⁻¹ can have useful practical implications. For example, in the famous Fama–French factor model (Fama and French, 1993) in finance, a non-diagonal sparse precision matrix for the idiosyncratic component characterizes the interpretable market effects among returns of stocks at different levels, such as the industrial segmentwise connections and the intrinsic within-industry associations, say, among financial firms. Existence of such effects after removing the dynamics corresponding to the systematic component may result in a non-sparse Σu, yet sparse modelling can still be valid by exploring the sparse precision matrix Σu⁻¹.
Joong‐Ho Won (Korea University, Seoul) and Woncheol Jang and Johan Lim (Seoul National University)
We congratulate Fan and his colleagues for a stimulating paper in which they have made a substantial contribution to challenging problems in large covariance estimation.
As practitioners, we are most interested in finite sample positive definiteness of the estimator proposed by the authors. They suggest using a scaling constant C in the threshold for the idiosyncratic covariance matrix Σ̂u and adjusting C to render its minimum eigenvalue positive. This idea leads to the univariate root finding procedure of expression (4.1). Although this procedure looks apparently simple, it requires computing the minimum eigenvalue of a p×p matrix, which is computationally expensive by itself for even a modest value of p, for every value of C tried. Furthermore, altering C means that the thresholding must be recomputed in every iteration, changing the sparsity pattern of the initial estimate. Thus we are concerned that the resulting cost of solving expression (4.1) may not be so cheap, especially when the target function in it is not smooth (Fig. 1).
We suggest instead an alternating projections approach. First, project the thresholded estimate Σ̂u onto a space of positive definite matrices. This can be done by solving
minimize ‖Σ − Σ̂u‖ subject to Σ ⪰ μI, (12)
whose solution is obtained by flooring the eigenvalues at μ in the spectral decomposition of Σ̂u (Boyd and Vandenberghe, 2004). Second, replace the entries of the projected matrix that correspond to the zero-thresholded entries of Σ̂u with 0. Repeat these two steps until convergence. This alternating projections procedure is guaranteed to converge, as both constraint sets are convex (Boyd and Dattorro, 2003). The first step (12) requires a spectral decomposition as in the root finding procedure, but the second step is free of comparisons with varying thresholds.
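A minimal numpy sketch (ours) of the two-step procedure just described; the function name, tolerance and iteration cap are assumptions:

```python
import numpy as np

def alternating_projections(Su, mu=0.1, tol=1e-3, max_iter=100):
    """Su: thresholded (possibly indefinite) error covariance estimate.
    Alternates (i) projection onto {eigenvalues >= mu} and (ii) restoring
    the zero pattern of Su, until the iterates stabilize."""
    zero_mask = (Su == 0.0)
    X = Su.copy()
    for _ in range(max_iter):
        X_old = X.copy()
        vals, vecs = np.linalg.eigh(X)                 # step 1: floor eigenvalues
        X = (vecs * np.maximum(vals, mu)) @ vecs.T
        X[zero_mask] = 0.0                             # step 2: restore sparsity
        if np.max(np.abs(X - X_old)) < tol:
            break
    return X
```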
We numerically compared the two procedures in a simple setting. The comparison was done for 100 data sets with n=50 samples of a p=100-dimensional standard normal vector. The results are summarized in Table 12. The root finding took roughly 1 min to converge, whereas the alternating projections converged in 2.5 s, adding little time over the ordinary POET-estimator (i.e. POET without adjustment for positive definiteness) and using less than half the iterations.
| Procedure | Computing time (s) | Minimum eigenvalue | Number of iterations to converge |
|---|---|---|---|
| POET | 2.34 (0.148) | <0 | — |
| Root finding | 62.5 (9.93) | 0.149 (0.054) | 20.7 (4.14) |
| Alternating projections | 2.43 (0.118) | 0.0997 (0.000) | 7.91 (0.71) |
†Numbers in cells are averages over 100 data sets along with their empirical standard deviations in parentheses. The code is written in Octave 3.2.3 (Eaton, 2002) on a laptop computer (MacBook Air, 1.8 GHz i5 processor with 4 Gbytes memory). Covariance hard thresholding was used in the ordinary POET with C=0.1 and K=3. In the root finding, the Octave function fzero() was used to find the root of equation (4.1), starting from C=0.1; final thresholding was conducted at the root found. In the alternating projections, the lower bound μ for the minimum eigenvalue was set to 0.1. Both procedures terminated once the iterate did not change up to the third digit after the decimal point.
The POET‐method may theoretically be optimization free, but the post hoc adjustment to make the ordinary POET‐estimator positive definite involves some numerical optimization anyway. A little more attention to this step may greatly improve the practicality of the method proposed.
Lingzhou Xue (Princeton University) and Hui Zou (University of Minnesota, Minneapolis)
We first congratulate Fan, Liao and Mincheva for their innovative and timely contribution to high dimensional covariance matrix estimation. POET is a statistically and computationally appealing method for estimating a large covariance matrix with a conditional sparsity structure. We discuss two alternative methods for estimating the error covariance matrix in POET.
POET2 via positive definite adaptive thresholding estimation
POET applies adaptive thresholding, with an entry-dependent threshold, to the principal orthogonal complement to estimate the sparse error covariance matrix. In Section 4.1 Fan, Liao and Mincheva discussed the importance of choosing a proper threshold to guarantee the finite sample positive definiteness of the thresholded estimator. POET chooses the threshold constant C in a range whose lower end point is defined in expression (4.1). Xue et al. (2012) proposed a direct convex programme to deliver a positive definite thresholding covariance matrix estimator. We adopt the idea thereof to construct another positive definite adaptive thresholding estimator for POET. Specifically, we consider a constrained ℓ1-minimization problem that soft thresholds the principal orthogonal complement subject to the constraint that the solution is positive definite. We introduce a new variable Θ and an equality constraint Σ=Θ, yielding an augmented Lagrangian L(Θ,Σ;Λ). We iteratively solve L(Θ,Σ;Λ) for (Θ,Σ) by alternating minimization, and then we update the Lagrange multiplier Λ. The complete alternating direction method of multipliers algorithm alternates a projection of Θ onto the set of suitably positive definite matrices with an entrywise soft thresholding update of Σ, followed by a dual update of Λ. The two operators involved, the positive definite projection and the soft thresholding operator ST(·), are defined in Xue et al. (2012).
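A rough numpy sketch (ours) of this alternating direction scheme; the exact updates and operator definitions in Xue et al. (2012) differ in details, and the parameters lam, eps and rho here are assumptions:

```python
import numpy as np

def soft_threshold_offdiag(A, t):
    S = np.sign(A) * np.maximum(np.abs(A) - t, 0.0)
    np.fill_diagonal(S, np.diag(A))        # leave the diagonal unpenalized
    return S

def project_pd(A, eps):
    vals, vecs = np.linalg.eigh(A)         # floor the eigenvalues at eps
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def admm_pd_threshold(A, lam, eps=1e-2, rho=1.0, n_iter=200):
    """Sketch of an ADMM solver for
       min 0.5*||Sigma - A||_F^2 + lam*|Sigma|_{1,off}  s.t.  Sigma >= eps*I,
    in the spirit of the positive definite programme of Xue et al. (2012)."""
    p = A.shape[0]
    Sigma, U = A.copy(), np.zeros((p, p))
    for _ in range(n_iter):
        # Theta update: projected minimizer of the two quadratic terms
        Theta = project_pd((A + rho * (Sigma - U)) / (1.0 + rho), eps)
        # Sigma update: entrywise soft thresholding
        Sigma = soft_threshold_offdiag(Theta + U, lam / rho)
        # dual update of the (scaled) Lagrange multiplier
        U = U + Theta - Sigma
    return Sigma
```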
We call the resulting estimator the POET2 estimator of Σ. We compared POET2 and POET by using simulation models 1–3 with T=200 and p=200 from Section 6.5.2. As can be seen from Table 13, the two versions of POET have very similar performance.
| Error measure | Model 1, POET | Model 1, POET2 | Model 2, POET | Model 2, POET2 | Model 3, POET | Model 3, POET2 |
|---|---|---|---|---|---|---|
| | 26.20 | 26.18 | 2.04 | 2.04 | 7.73 | 7.74 |
| | 1.31 | 1.30 | 2.07 | 2.06 | 8.48 | 8.50 |
POET3 via principal orthogonal complement banding
If the error covariance matrix Σu is in fact bandable, another version of POET can use banding instead of thresholding to regularize the principal orthogonal complement. The bandable structure is widely used to model dependence between ordered variables. Given a banding parameter k, principal orthogonal complement banding keeps only the entries of the orthogonal complement within k bands of the diagonal. To guarantee the positive definiteness, we consider the eigendecomposition of the banded complement and floor its eigenvalues at a small positive constant (a sketch of this step is given below, after Table 14). The POET3-estimator of Σ is then defined as the sum of the retained principal components and this positive definite banded complement. We compared POET3 and POET by using simulation models 1 and 2. As shown in Table 14, POET3 performs better than POET by taking advantage of the bandable structure. However, POET3 is potentially better only when the bandable structure is reliable and the ordering information is accurate. Otherwise, POET (or POET2) should be preferred.
| Error measure | Model 1, POET | Model 1, POET3 | Model 2, POET | Model 2, POET3 |
|---|---|---|---|---|
| | 26.20 | 25.76 | 2.04 | 1.68 |
| | 1.31 | 1.26 | 2.07 | 1.73 |
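For concreteness, here is a minimal numpy sketch (ours) of the banding and eigenvalue-correction steps described above; the function names and the eigenvalue floor eps are assumptions:

```python
import numpy as np

def band_matrix(R, k):
    """Keep entries within k bands of the diagonal, zero the rest."""
    p = R.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, R, 0.0)

def poet3_complement(R, k, eps=1e-4):
    """Band the principal orthogonal complement R and floor its
    eigenvalues at eps to guarantee positive definiteness."""
    B = band_matrix(R, k)
    vals, vecs = np.linalg.eigh(B)
    return (vecs * np.maximum(vals, eps)) @ vecs.T
```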
The authors replied later, in writing, as follows.
We are very grateful to all contributors for their stimulating comments and questions on high dimensional covariance matrix estimation in the presence of common factors. They have touched many important issues, from theoretical understanding to methodological improvements and applications. Their contribution is important for the better understanding of the proposed POET‐estimator. We shall not be able to resolve all points in a brief rejoinder. Indeed, the discussion can be seen as a collective research agenda for the future, and some of the agendas have already been undertaken by the discussants.
Spiked eigenvalues
Several discussants commented on the spiked eigenvalues of Σ. We impose the pervasiveness of the factors through the assumption that the eigenvalues of the K×K matrix p⁻¹B⊤B are bounded away from 0 and ∞ as p→∞. As a consequence, the first K eigenvalues of Σ are of order p whereas the remaining eigenvalues are bounded.
This pervasiveness is not the minimum condition to make the problem identifiable. As correctly pointed out by Jung and Fine, the spikiness of the eigenvalues of the low rank matrix BB⊤ and the sparseness of Σu together play an important role in distinguishing the systematic and idiosyncratic components. As long as the largest eigenvalue of Σu is much smaller than the smallest non-zero eigenvalue of BB⊤, these two components can be distinguished. Of course, the rates of convergence depend on the size of the gaps and other parameters. For example, Yu and Samworth suggested a weaker version of the pervasive condition, which replaces the order p in the definition of pervasiveness with p^α for some α ∈ (0,1). With this weaker condition, all results should still go through, and carefully inspecting our technical proofs should yield the rates of convergence. In contrast, there is also recent literature that requires α=0 or replaces the pervasiveness condition with approximately 'sparse loading matrices' (Pati et al., 2012; Carvalho et al., 2009). See also the discussion by Pan and Peng for a novel approach. Intuitively, this allows for non-pervasive (weak) factors that have no effect on a non-negligible portion of the individuals. However, this will bring more difficulty to estimating the number of spiked eigenvalues and to identifying the low rank part from the idiosyncratic part, because the signal is too weak.
We agree wholeheartedly with H. Huang, Y. Liu, Marron, D. Shen and H. Shen that now is a good time to study asymptotic contexts, where the first K eigenvalues of Σ grow quickly. Indeed, sparsity appears rarely in applications, yet conditional sparsity is likely to be more relevant for many applications. Studying spiked eigenvalues amounts to exploring the main structure of the covariance matrix.
We agree on the existence of weaker factors in applications (Lam and Hu, Linton and Vogt, and Onatski). These factors are usually difficult to differentiate from the idiosyncratic components and do not play a noticeable role without a large amount of data. We would like to add that our assumption on the spikiness of eigenvalues is imposed on the population covariance, not on the sample covariance matrix. Model diagnostics based on sample eigenvalues should be interpreted with care owing to large estimation errors in high dimensional matrices.
Choice of the number of factors K
The gaps between the spiked eigenvalues and the remaining eigenvalues have an impact on the choice of the number of factors K. Fryzlewicz and N. Huang, Lam and Hu, and other discussants carried out many interesting simulations on the issue of choosing K, the number of these spiked eigenvalues. In many simulations by the contributors, the responses are not driven by a few common factors. In contrast, POET builds on the principal components analysis of the sample covariance matrix, whose first K eigenvalues grow at rate O(p). The existence of these spiked eigenvalues is implied by the pervasive condition for the common factors. This gap can be made smaller if Yu and Samworth's assumption is imposed instead. As pointed out by the discussants (H. Huang, Y. Liu, Marron, D. Shen and H. Shen) and Shen et al. (2012), the existence of spiked eigenvalues is necessary to achieve principal components analysis consistency in the high dimension, low sample size context.
In the paper we choose K by minimizing the penalized criterion
K̂ = arg min_{0≤k≤M} [log{p⁻¹ tr Σ̂u(k)} + k IC(p,T)],
where Σ̂u(k) is the orthogonal complement of the first k principal components of the sample covariance matrix, M is a given upper bound and IC(p,T) is one of the information criteria in Bai and Ng (2002). However, if there is no gap among the eigenvalues, then there are either no pervasive common factors (all eigenvalues are small) or too many common factors (most eigenvalues are very large). In the first case, the consistent method will estimate K as 0. In the latter case, factor analysis is inappropriate because it does not effectively reduce the dimension. Pan and Peng suggested a new way of choosing K by comparing the sample eigenvalues with a given threshold. Their method should work well even when the factors are weak. Tests based on the gaps among eigenvalues were also proposed by Onatski (2009).
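Under our reconstruction of the criterion above, a minimal numpy sketch (ours; the penalty shown is one of the Bai–Ng choices) is:

```python
import numpy as np

def choose_num_factors(S, T, M):
    """Information criterion for K: for each k <= M, remove the first k
    principal components of the p x p sample covariance S and penalize the
    log-average remaining variance with a Bai-Ng (2002) style penalty."""
    p = S.shape[0]
    vals = np.sort(np.linalg.eigvalsh(S))[::-1]
    penalty = (p + T) / (p * T) * np.log(p * T / (p + T))  # one Bai-Ng choice
    # trace of the orthogonal complement after removing k components
    ic = [np.log(vals[k:].sum() / p) + k * penalty for k in range(M + 1)]
    return int(np.argmin(ic))
```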
We assume that K is fixed in the current version of the paper. Our working paper version allows K also to grow with p but is assumed to be known. We agree that allowing a growing and unknown K can be a good extension, as commented by Ferreira, and the first step should be consistently estimating it. This can be done by carefully reviewing the proofs in Bai and Ng (2002). Successfully solving this problem will also contribute to the literature on approximate factor models.
Generalized dynamic factor model and spectral matrix
Our paper was written for modelling large covariance matrices in genomics and finance. The former typically assumes that data are collected independently across the population, whereas the latter assumes that markets are efficient so past data play limited roles in asset returns. Hence, a conditional sparsity structure (conditioning here means specifically taking the linear dependence out) is imposed without considering the time‐lagged variables.
As correctly pointed out by Hallin and Pourahmadi, allowing lagged factors is important for other applied time series problems. In addition, Farnè and Pourahmadi raise the question of estimating the spectral matrix and frequency domain analysis. Indeed, the generalized dynamic factor model, except requiring second‐order stationarity, holds without any further assumptions. The method of Forni et al. (2000) should naturally extend POET to the frequency domain analysis, which will also enable us to estimate the autocovariance matrices and the spectral matrix. This will further broaden the scope of POET in both theory and application.
One simple possibility is to augment the data vector with its lags. The sample covariance matrix of the augmented vector involves the cross-covariance matrices at the included lags. An application of POET to this augmented data vector would yield factors that are constructed on the basis of the present and past data.
Holzmann and Leister provided an interesting derivation for estimating the lagged covariance under the max-norm. In terms of the convergence rate under the max-norm, knowing the factor structure does not yield a significant improvement over the sample covariance. This has been demonstrated and explained earlier in Fan et al. (2008).
Alternative and related methods
We thank several discussants for suggesting useful alternative steps to improve POET. Won, Jang and Lim propose iterative procedures to produce a finite sample positive definite error covariance matrix. A similar and conceptually simpler method is given by the contribution of Xue and Zou. Zhang and Peng also propose an iterative version of our method. The cost of the potential improvements is the loss of the simplicity of the POET method. Onatski's linear shrinkage to the principal orthogonal components provides a useful alternative approach to regularizing the error covariance matrix.
Fryzlewicz and N. Huang suggest an interesting alternative covariance estimation based on the aggregation of the sample covariance matrix (unbiased) and a regularized covariance matrix (biased but with low variance). This aggregation allows a trade-off between the bias and the variance in the estimation. Although NOVELIST works well in their simulations, theoretical understanding of the procedure is needed. For example, in the approximate factor model, Σ is a dense matrix. Hence, it is difficult to explain why thresholding is applied to the sample covariance matrix, which should also be dense.
Critchley, Dang and Yu, and Gijbels, Herrmann and Verhasselt recommend robust principal components analysis to protect against outliers and possible extensions to deal with non-stationary time series. The transformed Kendall's τ rank correlation matrix suggested by H. Liu and Wang provides an answer to the robustness issue concerning the tails of errors, raised by Linton and Vogt. Bouveyron proposes alternative methods based on ℓ1-penalization. He suggests use of the covariance matrix approximation approach in Bouveyron et al. (2007). The effectiveness of the proposed approach for high dimensional covariance matrices hinges on good approximations and remains to be seen. We emphasize that POET works particularly well in the presence of common factors, as it takes out a few principal components in the first step of the singular value decomposition. Moreover, being based on the singular value decomposition, POET is optimization free except for choosing the number of factors (which involves a one-dimensional optimization). It is also adaptive to locally stationary processes through time localization and time domain smoothing.
As pointed out by Gijbels, Herrmann and Verhasselt, and Ferreira, the issue of choosing the tuning parameters for thresholding is important in practice and also exists in all regularization procedures. Besides the method suggested by Gijbels and her colleagues, additional research is still needed.
Extensions
Xue and Zou suggest a banding orthogonal complements estimator, to deal with banded idiosyncratic components. This case is asymptotically nested in POET because thresholding can also produce a banded matrix. But, if the structure is indeed ‘conditionally banded’ (given the common factors), their suggested method should improve the finite sample performance. Technically, it is not difficult to achieve similar rates of convergence in this case.
It is also interesting to work with the sparse inverse idiosyncratic covariance, as suggested by Tang and Fan, and H. Liu and Wang. Tang and Fan also mention a couple of interesting applications that fit into this case. Under the high dimensionality, estimating a sparse precision matrix usually involves optimizations that may introduce some computational burdens. H. Liu and Wang suggest a viable column‐by‐column penalized square‐root lasso method to explore sparsity in inverse idiosyncratic covariance matrices, which is insensitive to the tuning parameter. Lam and Hu suggest an idea to deal with weak factors, which complements our method.
Applications
One of the immediate applications of POET is portfolio allocation, as commented by Linton and Vogt, and Gijbels and her colleagues because the problem crucially depends on estimating a high dimensional covariance matrix, and financial returns are often driven by a few common factors. Once the volatility matrix has been well estimated, we can proceed to portfolio selection via the Markowitz framework. We agree with Linton and Vogt that sparse portfolio allocation is another interesting idea to enhance the stability and the performance of portfolios. It has been thoroughly studied by Jagannathan and Ma (2003) and Fan et al. (2012).
For the high dimensional testing problem raised by Pesaran and Yamagata, if we simply bound the estimation error of the weighted quadratic form by the product of the norms of its factors, the resulting bound does not vanish unless p log (p)≪T, even when the factors are observable. We would like to note that such an upper bound is too crude. To show that the estimation error is asymptotically negligible, we should not separate the estimation error of the covariance matrix from the vector that it weights, because the term of interest is a weighted estimation error. More careful investigation of this term can yield an improved rate of convergence. However, we wholeheartedly agree with the notion in Pesaran and Yamagata (2012) that ignoring the correlation structure in constructing test statistics yields more stable test statistics, whose sizes can be more accurately determined.
Pan and Peng, and J. Huang and Zhou connect POET with an application to the high dimensional linear regression. Indeed, when the regressors depend on a few common factors, POET can be applied to estimate their joint covariance, which will help variable selections and prediction. J. Huang and Zhou suggest the use of POET to improve sure independence screening in Fan and Lv (2008) and this can be a fruitful direction to pursue. Sparse principal components can also be used as predictors. Further research along this line is required. We thank H. Huang, Y. Liu, Marron, D. Shen and H. Shen for reminding us of the literature on SigClust. POET works well in dealing with spiked eigenvalues, so we can foresee the success of combining POET and SigClust or other methods to estimate the principal eigenvalues in high dimension, low sample size contexts.
Ahmad, Hashmi and Halawani suggest genomics applications as an interesting test bed for POET. We agree with them. In fact, Fan and Han (2013) address large‐scale hypothesis testing problems in considerable detail, including applications to the type of genomic data that Ahmad and his colleagues suggest.
POET is very fast to compute. We are also excited to learn about the potential applications of POET to the spatial point processes suggested by Mateu, source localization problems discussed by Jian Zhang and computer experiments suggested by Coad and Maruri‐Aguilar, among others. They open broad areas where POET can be successfully applied and achieve important scientific discoveries.
Comments
Various contributors raise excellent points. For brevity, some of them have been answered only partially above, and many of them constitute a good research agenda.
We appreciate that Kent and Critchley remind us of the issues of invariance and equivariance under affine transformation. We agree that, if the measurement units change, the principal components will change, and hence POET is not equivariant. In many applications like finance and genomics, the measurement units are comparable, and affine transformations would create some interpretation problems. If the units used are a concern, one can apply POET to a correlation matrix: the sparsity of the idiosyncratic covariance remains intact. The equivariance issue in high dimensional problems is very challenging and sits at a very different scale of detail from high dimensional data analysis. If affine transforms are considered as in Tyler et al. (2009), sparsity should be imposed to enhance the interpretability.
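A minimal sketch of this unit-free route, assuming that a generic correlation-scale estimator (for instance POET applied to the standardized series) is supplied as a function argument:

```python
import numpy as np

def cov_from_correlation_estimate(Y, estimate_correlation):
    """Y: p x T data; estimate_correlation: any p x T -> p x p estimator."""
    d = Y.std(axis=1, ddof=1)                            # marginal scales
    Z = (Y - Y.mean(axis=1, keepdims=True)) / d[:, None] # standardized series
    R_hat = estimate_correlation(Z)                      # correlation-scale estimate
    return R_hat * np.outer(d, d)                        # rescale to covariances
```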
Viroli makes an interesting comment on the dual problem of estimating the factors and the loadings, which corresponds to carrying out the principal components analysis on either the p×p or the T×T sample moment matrix; the two dual problems give the same estimate of the common components. Theorem 1 in the paper is similar to what was obtained by Stock and Watson (2002b), but the result presented here allows any K≤p, not just the true K.
Montanari is concerned about the precision of the estimated factors in our empirical illustration, in which p=50. As every time series loads on the same common factors, 50 series contain enough information to estimate the factors and therefore the idiosyncratic components. So the heat maps presented in the paper indeed demonstrate the sparsity of the idiosyncratic covariance. We feel that the spurious correlations created by inaccurately estimated factors should be small. Besides PCA, one can use a quasi‐maximum‐likelihood method, which is typically used for classical factor analysis (Lawley and Maxwell, 1971) and is also consistent under high dimensionality (Bai and Li, 2012).
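For illustration, one off-the-shelf route to (quasi-)maximum-likelihood factor extraction is Gaussian factor analysis; the sketch below runs it on synthetic data, with the sizes T, p and K chosen arbitrarily for the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
T, p, K = 200, 50, 3                          # illustrative sizes
X = rng.standard_normal((T, K)) @ rng.standard_normal((K, p)) \
    + rng.standard_normal((T, p))             # synthetic factor-model data
fa = FactorAnalysis(n_components=K).fit(X)    # Gaussian ML factor analysis
loadings = fa.components_.T                   # p x K estimated loadings
factors = fa.transform(X)                     # T x K estimated factor scores
```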
Gilks gives an interesting example in which each off‐diagonal entry of the error covariance is a non‐zero constant with probability π. This is a Bayesian perspective in which the population covariance is generated according to some probability distribution. From this perspective, in many applied problems the probability of being 0 for each off‐diagonal entry should not be a universal constant but should vary over the entries. For instance, correlations between companies in the same industry may have smaller probabilities of being 0 than those across industries. Hence we can allow this probability to depend on the position (i,j) and require that the probability of a non‐zero entry vanishes fast for most of the (i,j) pairs.
Frommlet comments on our simulated results in which Σ is generated from a cross‐sectional auto‐regressive AR(1) process instead of a factor model. We do not claim that POET can solve all large covariance estimation problems; rather, its power lies in first controlling a few relatively large eigenvalues. Since POET is reasonably robust to overestimation of the number of factors, it also works well with sparse matrices. This explains why it works well with the AR(1) covariance structure, which is effectively sparse, as the snippet below illustrates.
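A small numerical check of this effective sparsity: the cross-sectional AR(1) covariance σ_ij = ρ^{|i−j|} decays geometrically off the diagonal, so all but a narrow band of entries are negligible and thresholding removes them.

```python
import numpy as np

p, rho = 100, 0.5
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # sigma_ij = rho^{|i-j|}
print(np.mean(np.abs(Sigma) < 1e-3))                 # most entries are near zero
```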
Conclusion
In summary, the contributors have provided a wide range of discussions on many aspects of estimating a high dimensional covariance matrix. Many of the suggested applications open exciting opportunities for interdisciplinary collaboration. We are very pleased to have exchanged ideas and look forward to new tools in this important research area. We conclude by reiterating our thanks to all the contributors, and to the Royal Statistical Society and the journal for hosting this forum.
Appendix A: Estimating a sparse covariance with contaminated data
We estimate
by applying the adaptive thresholding given by expression (2.11). However, the task here is slightly different from the standard problem of estimating a sparse covariance matrix in the literature, as no direct observations for
are available. In many cases the original data are contaminated; this includes any setting in which an estimate of the data must be used because direct observations are unavailable. It typically happens when
represent the error terms in regression models or when the data are subject to measurement errors. Instead, we may observe
. For instance, in the approximate factor models, 
by using the adaptive thresholding proposed by Cai and Liu (2011): for the threshold
, define

(A.1)
satisfies, for all
,
and 
When
is sufficiently close to
, we can show that
is also consistent. The following theorem extends the standard thresholding results in Bickel and Levina (2008) and Cai and Liu (2011) to the case when no direct observations are available, or the original data are contaminated. For the tail and mixing parameters
and
that are defined in assumptions 2 and 3, let
.
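Before stating the theorem, the following schematic version of entry-adaptive thresholding in the spirit of Cai and Liu (2011) may help fix ideas; the per-entry variability estimate and the constant C follow the standard recipe, but the routine is a simplified stand-in for estimator (A.1), not its exact definition.

```python
import numpy as np

def adaptive_threshold_cov(U, C=0.5):
    """U: p x T matrix of (estimated, demeaned) idiosyncratic components."""
    p, T = U.shape
    S = U @ U.T / T                                    # residual sample covariance
    # theta_ij = mean over t of (u_it u_jt - s_ij)^2 estimates entry variability
    theta = np.mean((U[:, None, :] * U[None, :, :] - S[:, :, None]) ** 2, axis=2)
    lam = C * np.sqrt(theta * np.log(p) / T)           # entry-dependent thresholds
    out = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(out, np.diag(S))                  # keep the diagonal intact
    return out
```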
Theorem 5. Suppose that
, and that assumptions 2 and 3 hold. In addition, suppose that there is a sequence
so that
and
then there is a constant C>0 in the adaptive thresholding estimator (A.1) with


If further
, then
is invertible with probability approaching 1, and

Proof. By assumptions 2 and 3, the conditions of lemmas A.3 and A.4 of Fan et al. (2011a) are satisfied. Hence, for any ɛ>0, there are positive constants
and
such that each of the events

. Now for
under the event 

Let
. Then, with probability at least 1−2ɛ,
. Since ɛ is arbitrary, we have
. If, in addition,
, then the minimum eigenvalue of
is bounded away from 0 with probability approaching 1 since
. This then implies that
.
Appendix B: Proofs for Section 2
We first cite two useful theorems, which are needed to prove propositions 1 and 2. In lemma 1 below, let $\{\lambda_j\}_{j\le p}$ be the eigenvalues of Σ in descending order and $\{\xi_j\}_{j\le p}$ be their associated eigenvectors. Correspondingly, let $\{\hat{\lambda}_j\}_{j\le p}$ be the eigenvalues of $\hat{\Sigma}$ in descending order and $\{\hat{\xi}_j\}_{j\le p}$ be their associated eigenvectors.
Lemma 1.
- (Weyl's theorem) $|\hat{\lambda}_j-\lambda_j|\le\|\hat{\Sigma}-\Sigma\|$ for each $j\le p$.
- (sin(θ) theorem; Davis and Kahan (1970)) $\|\hat{\xi}_j-\xi_j\|\le\sqrt{2}\,\|\hat{\Sigma}-\Sigma\|/\min(|\hat{\lambda}_{j-1}-\lambda_j|,\,|\lambda_j-\hat{\lambda}_{j+1}|)$.
B.1. Proof of proposition 1
are the eigenvalues of Σ and
are the first K eigenvalues of
(the remaining p−K eigenvalues are 0), then by Weyl's theorem, for each j≤K,

For j>K,
. On the other hand, the first K eigenvalues of BB′ are also the eigenvalues of
. By the assumption, the eigenvalues of
are bounded away from 0. Thus, when j≤K,
are bounded away from 0 for all large p.
B.2. Proof of proposition 2

For a generic constant c>0,
for all large p, since
but
is bounded by proposition 1. However, if j<K, the same argument implies that
. If j=K,
, where
is bounded away from 0, but
. Hence, again,
.
B.3. Proof of theorem 1

and
. If we show that
, then from the decompositions of the sample covariance,

. Consequently, applying thresholding on
is equivalent to applying thresholding on
, which gives the desired result.
We now show that
indeed holds. Consider again the least squares problem (2.8) but with the following alternative normalization constraints:
, and
is diagonal. Let
be the solution to the new optimization problem. Switching the roles of B and F, we see that the solution of problem (2.10) is
and
. In addition,
. From
, it follows that
.
Appendix C: Proofs for Section 3
We shall prove theorems 4, 2 and 3, in that order.
C.1. Preliminary lemmas
The following results are to be used subsequently. The proofs of lemmas 2, 3 and 4 are found in Fan et al. (2011a).
Lemma 2. Suppose that A and B are symmetric positive semidefinite matrices, and
for a sequence
. If
, then
, and

Lemma 3. Suppose that the random variables
and
both satisfy the exponential‐type tail condition: there exist
,
and
, such that, ∀s>0,

and
, and any s>0,
(C.1)
Lemma 4. Under the assumptions of theorem 2,
,
and
.
Lemma 5. Let
denote the Kth largest eigenvalue of
; then
with probability approaching 1 for some
.
Proof. First, by proposition 1, under assumption 1, the Kth largest eigenvalue
of Σ satisfies, for some c>0,

. Without loss of generality, we prove the result under the identifiability condition (2.1). Using model (1.2),
. Using this and model (1.3),
can be decomposed as the sum of the four terms


, which is
if K log(p)=o(T). Consequently, by assumption 1, we have

. It follows from lemma 4 that

, it remains to deal with
, which is bounded by

since log(p)=o(T).
Lemma 6. Under assumption 3,
.
Proof. Since
is weakly stationary,
. In addition,
for some constant M and any i and t since
has an exponential tail. Hence by Davydov's inequality (corollary 16.2.4 in Athreya and Lahiri (2006)), there is a constant C>0, for all i≤p,t≤T,
, where α(t) is the α‐mixing coefficient. By assumption 3,
. Thus, uniformly in T,

C.2. Proof of theorem 4
Our derivation below relies on a result of Bai and Ng (2002), who showed that the estimated number of factors is consistent, in the sense that
equals the true K with probability approaching 1. Note that, under our assumptions 1–4, all the assumptions in Bai and Ng (2002) are satisfied. Thus immediately we have the following lemma.
Lemma 7 (theorem 2 in Bai and Ng (2002)). For
defined in expression (2.14),

Proof. See Bai and Ng (2002).
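For intuition, a criterion of this type selects the number of factors by minimizing log V(k) plus a penalty, where V(k) is the mean squared residual after removing the first k principal components. The sketch below uses the IC_p1 penalty of Bai and Ng (2002) and is an illustration rather than expression (2.14) verbatim.

```python
import numpy as np

def estimate_num_factors(Y, k_max=8):
    """Y: p x T data matrix; returns the k minimizing the information criterion."""
    p, T = Y.shape
    # V(k) equals the sum of the trailing eigenvalues of Y Y' / (p T)
    vals = np.linalg.eigvalsh(Y @ Y.T / (p * T))[::-1]     # descending order
    penalty = (p + T) / (p * T) * np.log(p * T / (p + T))
    ic = [np.log(vals[k:].sum()) + k * penalty for k in range(k_max + 1)]
    return int(np.argmin(ic))
```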
We first prove some preliminary results in the following lemmas. Denote
.
Lemma 8. For all
,
,
,
and
.
Proof.
- We have, ∀i,
. By the Cauchy–Schwarz inequality,

- By lemma 6,
, which then yields the result.
- By the Cauchy–Schwarz inequality,

- Note that
. By assumption 4,
, which implies that
and yields the result.
- By definition,
. We first bound
. Assumption 4 implies that
. Therefore, by the Cauchy–Schwarz inequality,

- Similarly to part (c), noting that
is a scalar, we have
where the last line follows from the Cauchy–Schwarz inequality.
Lemma 9.
,
,
and
.
Proof.
- By the Cauchy–Schwarz inequality and the fact that
,
The result then follows from assumption 3.
- By the Cauchy–Schwarz inequality,
It follows from assumption 4 that

. It then follows from Chebyshev's inequality and Bonferroni's method that
.
- By assumption 4,
. Chebyshev's inequality and Bonferroni's method yield
with probability 1, which then implies

- By the Cauchy–Schwarz inequality and assumption 4, we have demonstrated that
. In addition, since
,
. It follows that

Lemma 10.
.
.
.
Proof. We prove this lemma conditioning on the event
. Once this has been done, because
, the unconditional statements then follow.
- When
, by lemma 5, all the eigenvalues of V/p are bounded away from 0. Using the inequality
and identity (C.2), we have, for some constant C>0,
Each of the four terms on the right‐hand side is bounded in lemma 8, which then yields the desired result.
- Part (b) follows from part (a) and
Part (c) is implied by identity (C.2) and lemma 9.

Lemma 11.
.
.
Proof. We first condition on
.
- Lemma 5 implies that
. Also
. In addition,
. It then follows from the definition of H that
. Define
. Applying the triangular inequality gives
(C.3)
By lemma 4, the first term in inequality (C.3) is
. The second term of inequality (C.3) can be bounded, by the Cauchy–Schwarz inequality and lemma 10, as follows:

- Still conditioning on
, since
and
, right‐multiplying by H gives
. Part (a) also gives, conditioning on
,
. Hence further left‐multiplying by
yields
. Because
, we reach the desired result.
C.2.1. Completion of proof of theorem 4
The second part of theorem 4 was proved in lemma 10. We now derive the convergence rate of
.
, and that
, we have
(C.4)
. Therefore,
. The Cauchy–Schwarz inequality and lemma 10 imply

Finally,
and
imply that the third term is
.
C.2.2. Proof of corollary 1
. By theorem 4, uniformly in i and t,

C.3. Proof of theorem 2
Lemma 12.
, and
.
Proof. We have
. Therefore, using the inequality
, we have

C.3.1. Completion of proof of theorem 2
Theorem 2 follows immediately from theorem 5 and lemma 12.
C.4. Proof of theorem 3

Lemma 13.
, and
.
.
.
-
Proof.
- We have
. Moreover, since all the eigenvalues of Σ are bounded away from 0, for any matrix A,
. Hence
.
- By theorem 2,
.
- The same argument as in the proof of theorem 2 of Fan et al. (2008) implies that
. Thus,
is upper bounded by
.
- Again, by
, and lemma 11,
(C.5)
C.4.1. Proof of theorem 3, part (a)
. Hence, for a generic constant C>0,

Lemma 14.
.
Proof.
. Hence
(C.6)
Lemma 15.If
, then with probability approaching 1, for some c>0,
,
,
and
.
Proof.
- By lemma 11, with probability approaching 1,
is bounded away from 0. Hence,

- The result follows from part (a) and lemma 14. Parts (c) and (d) follow from a similar argument to that for part (a) and lemma 11.
C.4.2. Completion of proof of theorem 3
. Define

and
. The triangular inequality gives

, where
(C.7)
is bounded by theorem 2. Let
; then

. Lemma 15 then implies that
. This shows that
. Similarly
. In addition, since
,
. Similarly
. Finally, let
. By lemma 15,
. Then, by lemma 14,

. Adding up
–
gives


C.4.3. Completion of proof of theorem 3: 
. Repeatedly using the triangular inequality yields

be the (i,j) entry of
. Then
.

Hence
. The result then follows immediately.
References in the discussion