Volume 75, Issue 4
Original Article

Large covariance estimation by thresholding principal orthogonal complements

Jianqing Fan

Corresponding Author

Princeton University, USA

Address for correspondence: Jianqing Fan, Department of Operations Research and Financial Engineering, Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA. E‐mail: jqfan@princeton.edu
Yuan Liao

University of Maryland, College Park, USA

First published: 12 August 2013

Summary

The paper deals with the estimation of a high dimensional covariance with a conditional sparsity structure and fast diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross‐sectional correlation even after taking out common but unobservable factors. We introduce the principal orthogonal complement thresholding method ‘POET’ to explore such an approximate factor structure with sparsity. The POET‐estimator includes the sample covariance matrix, the factor‐based covariance matrix, the thresholding estimator and the adaptive thresholding estimator as specific examples. We provide mathematical insights when the factor analysis is approximately the same as the principal component analysis for high dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the effect of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.

1. Introduction

Information and technology make large data sets widely available for scientific discovery. Much statistical analysis of such high dimensional data involves the estimation of a covariance matrix or its inverse (the precision matrix). Examples include portfolio management and risk assessment (Fan et al., 2008), high dimensional classification such as the Fisher discriminant (Hastie et al., 2009), graphical models (Meinshausen and Bühlmann, 2006), statistical inference such as controlling false discoveries in multiple testing (Leek and Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap et al., 2009; Xiong et al., 2011) and testing the capital asset pricing model (Sentana, 2009), among others. See Section 5 for some of these applications. Yet the dimensionality is often either comparable with the sample size or even larger. In such cases, the sample covariance is known to have poor performance (Johnstone, 2001), and some regularization is needed.

Realizing the importance of estimating large covariance matrices and the challenges that are brought by the high dimensionality, in recent years researchers have proposed various regularization techniques to estimate Σ consistently. One of the key assumptions is that the covariance matrix is sparse, namely many entries are 0 or nearly so (Bickel and Levina, 2008; Rothman et al., 2009; Lam and Fan, 2009; Cai and Zhou, 2012; Cai and Liu, 2011). In many applications, however, the sparsity assumption directly on Σ is not appropriate. For example, financial returns depend on the equity market risks, housing prices depend on the economic health and gene expressions can be stimulated by cytokines, among others. Because of the presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative method is to assume a factor model structure, as in Fan et al. (2008). However, they restrict themselves to the strict factor models with known factors.

A natural extension is conditional sparsity: given the common factors, the outcomes are only weakly correlated. To model this, we consider an approximate factor model, which has been frequently used in economic and financial studies (Chamberlain and Rothschild (1983), Fama and French (1992) and Bai and Ng (2002), among others):
$y_{it} = \mathbf{b}_i'\mathbf{f}_t + u_{it}. \qquad (1.1)$

Here $y_{it}$ is the observed response for the ith ($i = 1,\dots,p$) individual at time $t = 1,\dots,T$, $\mathbf{b}_i$ is a $K\times 1$ vector of factor loadings, $\mathbf{f}_t$ is a $K\times 1$ vector of common factors and $u_{it}$ is the error term, which is usually called the idiosyncratic component, uncorrelated with $\mathbf{f}_t$. Both p and T diverge to ∞, whereas K is assumed fixed throughout the paper, and p is possibly much larger than T.

We emphasize that, in model (1.1), only $y_{it}$ is observable. It is intuitively clear that the unknown common factors can only be inferred reliably when there are sufficiently many cases, i.e. p→∞. In a data rich environment, p can diverge at a rate that is faster than T. The factor model (1.1) can be put in a matrix form as
$\mathbf{y}_t = \mathbf{B}\mathbf{f}_t + \mathbf{u}_t, \qquad (1.2)$
where $\mathbf{y}_t = (y_{1t},\dots,y_{pt})'$, $\mathbf{B} = (\mathbf{b}_1,\dots,\mathbf{b}_p)'$ and $\mathbf{u}_t = (u_{1t},\dots,u_{pt})'$. We are interested in Σ, the p×p covariance matrix of $\mathbf{y}_t$, and its inverse, which are assumed to be time invariant. Under model (1.1), Σ is given by
$\Sigma = \mathbf{B}\,\mathrm{cov}(\mathbf{f}_t)\,\mathbf{B}' + \Sigma_u, \qquad (1.3)$
where $\Sigma_u = (\sigma_{u,ij})_{p\times p}$ is the covariance matrix of $\mathbf{u}_t$. The literature on approximate factor models typically assumes that the first K eigenvalues of $\mathbf{B}\,\mathrm{cov}(\mathbf{f}_t)\,\mathbf{B}'$ diverge at rate O(p), whereas all the eigenvalues of $\Sigma_u$ are bounded as p→∞. This assumption holds easily when the factors are pervasive in the sense that a non‐negligible fraction of the factor loadings are non‐vanishing. The decomposition (1.3) is then asymptotically identified as p→∞. In addition, in this paper we assume that $\Sigma_u$ is approximately sparse, as in Bickel and Levina (2008) and Rothman et al. (2009): for some q ∈ [0,1),
$m_p = \max_{i\le p}\sum_{j\le p}|\sigma_{u,ij}|^q$
does not grow too fast as p→∞. In particular, this includes the exact sparsity assumption (q=0), under which $m_p = \max_{i\le p}\sum_{j\le p} I(\sigma_{u,ij}\neq 0)$ is the maximum number of non‐zero elements in each row.

The conditional sparsity structure of form (1.2) was explored by Fan et al. (2011a) in estimating the covariance matrix, when the factors $\{\mathbf{f}_t\}$ are observable. This allows them to use regression analysis to estimate $\Sigma_u$. This paper deals with the situation in which the factors are unobservable and must be inferred. Our approach is simple and optimization free, and it uses the data only through the sample covariance matrix. Run the singular value decomposition on the sample covariance matrix of $\{\mathbf{y}_t\}$, keep the covariance matrix that is formed by the first K principal components and apply the thresholding procedure to the remaining covariance matrix. This results in the principal orthogonal complement thresholding estimator, POET. When the number of common factors K is unknown, it can be estimated from the data. See Section 2 for additional details. We shall investigate various properties of POET under the assumption that the data are serially dependent, which includes independent observations as a specific example. The rates of convergence under various norms for both the estimated Σ and the estimated $\Sigma_u$, and for their precision (inverse) matrices, will be derived. We show that the effect of estimating the unknown factors on the rate of convergence vanishes when $p\,\log(p) \gg T$ and, in particular, the rate of convergence for estimating $\Sigma_u$ achieves the optimal rate in Cai and Zhou (2012).

This paper focuses on the high dimensional static factor model (1.2), which is innately related to principal component analysis (PCA), as clarified in Section 2. This feature makes it different from the classical factor model with fixed dimensionality (e.g. Lawley and Maxwell (1971)). In the last decade, much theory on the estimation and inference of the static factor model has been developed, e.g. Stock and Watson (1998, 2002), Bai and Ng (2002), Bai (2003) and Doz et al. (2011), among others. Our contribution is on the estimation of covariance matrices and their inverses in large factor models.

The static model that is considered in this paper is to be distinguished from the dynamic factor model as in Forni et al. (2000); the latter allows $\mathbf{y}_t$ to depend also on $\mathbf{f}_t$ with lags in time. Their approach is based on the eigenvalues and principal components of spectral density matrices, and on frequency domain analysis. Moreover, as shown in Forni and Lippi (2001), the dynamic factor model does not really impose a restriction on the data‐generating process, and the assumption of idiosyncrasy (in their terminology, a p‐dimensional process is idiosyncratic if all the eigenvalues of its spectral density matrix remain bounded as p→∞) asymptotically identifies the decomposition of $\mathbf{y}_t$ into the common component and the idiosyncratic error. The literature includes, for example, Forni et al. (2000, 2004), Forni and Lippi (2001), Hallin and Liška (2007, 2011) and many other references therein. Above all, both the static and the dynamic factor models are receiving increasing attention in applications in many fields where information is usually scattered through a (very) large number of interrelated time series.

There has been an extensive literature in recent years dealing with sparse principal components, which have been widely used to enhance the convergence of the principal components in high dimensional space. d'Aspremont et al. (2008), Shen and Huang (2008), Witten et al. (2009) and Ma (2013) proposed and studied various algorithms for computation. More literature on sparse PCA is found in Johnstone and Lu (2009), Amini and Wainwright (2009), Zhang and El Ghaoui (2011) and Birnbaum et al. (2012), among others. In addition, there has also been a growing literature that theoretically studies recovery in the low rank plus sparse matrix estimation problem; see, for example, Wright et al. (2009), Lin et al. (2008), Candès et al. (2011), Luo (2011), Agarwal et al. (2012) and Pati et al. (2012). It corresponds to the identifiability issue of our problem.

There is a big difference between our model and those considered in the aforementioned literature. In the current paper, the first K eigenvalues of Σ are spiked and grow at a rate O(p), whereas the eigenvalues of the matrices studied in the existing literature on covariance estimation are usually assumed to be either bounded or slowly growing. Because of this distinctive feature, the common components and the idiosyncratic components can be identified and, in addition, PCA on the sample covariance matrix can consistently estimate the space that is spanned by the eigenvectors of Σ. The existing methods of either thresholding directly or solving a constrained optimization problem can fail in the presence of very spiked principal eigenvalues. However, there is a price to pay here: as the first K eigenvalues are 'too spiked', one can hardly obtain a satisfactory rate of convergence for estimating Σ in absolute terms, but it can be estimated accurately in relative terms (see Section 3.3 for details). In addition, $\Sigma^{-1}$ can be estimated accurately.

We would like to note further that the low rank plus sparse representation of our model is on the population covariance matrix, whereas Candès et al. (2011), Wright et al. (2009) and Lin et al. (2009) considered such a representation on the data matrix. (We thank a referee for reminding us about these related works.) As there is no Σ to estimate, their goal is limited to producing a low rank plus sparse matrix decomposition of the data matrix, which corresponds to the identifiability issue of our study, and does not involve estimation and inference. In contrast, our ultimate goal is to estimate the population covariance matrices as well as the precision matrices. For this, we require the idiosyncratic components and common factors to be uncorrelated and the data‐generating process to be strictly stationary. The covariances that are considered in this paper are constant over time, though slowly time‐varying covariance matrices can be handled through localization in time (time domain smoothing). Our consistency result on the estimation of $\Sigma_u$ demonstrates that decomposition (1.3) is identifiable, and hence our results also shed light on the 'surprising phenomenon' of Candès et al. (2011) that one can separate fully a sparse matrix from a low rank matrix when only the sum of these two components is available.

The rest of the paper is organized as follows. Section 2 gives our estimation procedures and builds the relationship between PCA and factor analysis in high dimensional space. Section 3 provides the asymptotic theory for various estimated quantities. Section 4 illustrates how to choose the thresholds by using cross‐validation and guarantees the positive definiteness in any finite sample. Specific applications of regularized covariance matrices are given in Section 5. Numerical results are reported in Section 6. Finally, Section 7 presents a real data application on portfolio allocation. All proofs are given in Appendix A. Throughout the paper, we use $\lambda_{\min}(\mathbf{A})$ and $\lambda_{\max}(\mathbf{A})$ to denote the minimum and maximum eigenvalues of a matrix $\mathbf{A}$. We also denote by $\|\mathbf{A}\|_F$, $\|\mathbf{A}\|$, $\|\mathbf{A}\|_1$ and $\|\mathbf{A}\|_{\max}$ the Frobenius norm, spectral norm (also called the operator norm), $L_1$‐norm and elementwise norm of a matrix $\mathbf{A}$, defined respectively by $\|\mathbf{A}\|_F = \mathrm{tr}^{1/2}(\mathbf{A}'\mathbf{A})$, $\|\mathbf{A}\| = \lambda_{\max}^{1/2}(\mathbf{A}'\mathbf{A})$, $\|\mathbf{A}\|_1 = \max_{j}\sum_{i}|a_{ij}|$ and $\|\mathbf{A}\|_{\max} = \max_{i,j}|a_{ij}|$. When $\mathbf{A}$ is a vector, both $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|$ are equal to the Euclidean norm. Finally, for two sequences, we write $a_T \ll b_T$ if $a_T = o(b_T)$, and $a_T \asymp b_T$ if $a_T = O(b_T)$ and $b_T = O(a_T)$.

The programs that were used to analyse the data can be obtained from

http://www.blackwellpublishing.com/rss

2. Regularized covariance matrix via principal components analysis

There are three main objectives of this paper:
  1. to understand the relationship between PCA and high dimensional factor analysis;
  2. to estimate both the covariance matrix Σ and the idiosyncratic covariance $\Sigma_u$, and their precision matrices, in the presence of common factors;
  3. to investigate the effect of estimating the unknown factors on the covariance estimation.

The propositions in Section 2.1 show that the space spanned by the principal components of the population covariance Σ is close to the space spanned by the columns of the factor loading matrix B.

2.1. High dimensional principal components analysis and factor model

Consider a factor model
$y_{it} = \mathbf{b}_i'\mathbf{f}_t + u_{it},$
where the number of common factors, $K = \dim(\mathbf{f}_t)$, is small compared with p and T, and thus is assumed to be fixed throughout the paper. In the model, the only observable variable is the data $y_{it}$. One of the distinguishing features of the factor model is that the principal eigenvalues of Σ are no longer bounded but grow quickly with the dimensionality. We illustrate this in the following example.

2.1.1. Example 1

Consider a single‐factor model $y_{it} = b_i f_t + u_{it}$, where $\mathrm{var}(f_t) = 1$. Suppose that the factor is pervasive in the sense that it has a non‐negligible effect on a non‐vanishing proportion of outcomes. It is then reasonable to assume that $p^{-1}\sum_{i=1}^{p} b_i^2 > c$ for some c>0. Therefore, assuming that $\lambda_{\max}(\Sigma_u)$ is bounded, an application of decomposition (1.3) yields
$\lambda_{\max}(\Sigma) \ge \sum_{i=1}^{p} b_i^2 - \lambda_{\max}(\Sigma_u) \ge cp/2$
for all large p.

We now elucidate why PCA can be used for the factor analysis in the presence of spiked eigenvalues. Write $\mathbf{B} = (\mathbf{b}_1,\dots,\mathbf{b}_p)'$ as the p×K loading matrix. Note that the linear space that is spanned by the first K principal components of $\mathbf{B}\,\mathrm{cov}(\mathbf{f}_t)\,\mathbf{B}'$ is the same as that spanned by the columns of B when $\mathrm{cov}(\mathbf{f}_t)$ is non‐degenerate. Thus, we can assume without loss of generality that the columns of B are orthogonal and $\mathrm{cov}(\mathbf{f}_t) = \mathbf{I}_K$, the identity matrix. This canonical form corresponds to the identifiability condition in decomposition (1.3). Let $\tilde{\mathbf{b}}_1,\dots,\tilde{\mathbf{b}}_K$ be the columns of B, ordered such that $\{\|\tilde{\mathbf{b}}_j\|\}_{j=1}^{K}$ is in a non‐increasing order. Then $\{\tilde{\mathbf{b}}_j/\|\tilde{\mathbf{b}}_j\|\}_{j=1}^{K}$ are eigenvectors of the matrix $\mathbf{B}\mathbf{B}'$ with eigenvalues $\{\|\tilde{\mathbf{b}}_j\|^2\}_{j=1}^{K}$, and the rest are 0. We shall impose the pervasiveness assumption that all eigenvalues of the K×K matrix $p^{-1}\mathbf{B}'\mathbf{B}$ are bounded away from 0, which holds if the factor loadings $\{\mathbf{b}_i\}_{i=1}^{p}$ are independent realizations from a non‐degenerate population. Since the non‐vanishing eigenvalues of the matrix $\mathbf{B}\mathbf{B}'$ are the same as those of $\mathbf{B}'\mathbf{B}$, from the pervasiveness assumption it follows that $\{\|\tilde{\mathbf{b}}_j\|^2\}_{j=1}^{K}$ are all growing at rate O(p).
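To see the pervasiveness phenomenon numerically, the following small simulation (our illustration, not part of the paper) draws independent standard normal loadings, sets $\Sigma = \mathbf{B}\mathbf{B}' + \mathbf{I}_p$ and tracks the spectrum as p grows: the first K eigenvalues grow linearly in p, whereas the (K+1)th stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
for p in (100, 400, 1600):
    B = rng.normal(size=(p, K))             # pervasive loadings: p^{-1} B'B -> I_K
    Sigma = B @ B.T + np.eye(p)             # factor covariance with Sigma_u = I_p
    eigs = np.linalg.eigvalsh(Sigma)[::-1]  # eigenvalues in descending order
    print(p, eigs[:K] / p, eigs[K])         # eigs[:K]/p stabilizes; eigs[K] stays near 1
```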

Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ be the eigenvalues of Σ in descending order and $\{\xi_j\}_{j=1}^{p}$ be their corresponding eigenvectors. Then an application of Weyl's eigenvalue theorem (see Appendix A) yields the following proposition.

Proposition 1. Assume that the eigenvalues of $p^{-1}\mathbf{B}'\mathbf{B}$ are bounded away from 0 for all large p. For the factor model (1.3) with the canonical condition

$\mathrm{cov}(\mathbf{f}_t) = \mathbf{I}_K \quad \text{and} \quad \mathbf{B}'\mathbf{B} \text{ diagonal}, \qquad (2.1)$

we have

$|\lambda_j - \|\tilde{\mathbf{b}}_j\|^2| \le \|\Sigma_u\| \text{ for } j\le K, \qquad \lambda_j \le \|\Sigma_u\| \text{ for } j > K.$

In addition, for $j\le K$, $\|\tilde{\mathbf{b}}_j\|^2 > cp$ for some constant $c>0$.

Using proposition 1 and the $\sin(\theta)$ theorem of Davis and Kahan (1970) (see their appendix), we have the following proposition.

Proposition 2. Under the assumptions of proposition 1, if $\{\|\tilde{\mathbf{b}}_j\|\}_{j\le K}$ are distinct, then

$\big\|\xi_j - \tilde{\mathbf{b}}_j/\|\tilde{\mathbf{b}}_j\|\big\| = O(p^{-1}\|\Sigma_u\|) \quad \text{for } j\le K.$

Propositions 1 and 2 state that PCA and factor analysis are approximately the same if $\|\Sigma_u\| = o(p)$. This is assured through a sparsity condition on $\Sigma_u$, which is frequently measured through

$m_p = \max_{i\le p}\sum_{j\le p}|\sigma_{u,ij}|^q. \qquad (2.2)$

The intuition is that, after taking out the common factors, many pairs of the cross‐sectional units become weakly correlated. This generalized notion of sparsity was used in Bickel and Levina (2008) and Cai and Liu (2011). Under this generalized measure of sparsity, we have

$\|\Sigma_u\| \le \|\Sigma_u\|_1 \le m_p \max_{i\le p}\sigma_{u,ii}^{1-q} = O(m_p)$

if the noise variances $\{\sigma_{u,ii}\}_{i\le p}$ are bounded. Therefore, when $m_p = o(p)$, proposition 1 implies that we have distinguished eigenvalues between the principal components $\{\lambda_j\}_{j\le K}$ and the rest of the components $\{\lambda_j\}_{j>K}$, and proposition 2 ensures that the first K principal components are approximately the same as the columns of the factor loadings.
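As a quick numerical check (ours; the helper name `sparsity_measure` is illustrative), the measure (2.2) and the bound $\|\Sigma_u\| \le m_p \max_{i}\sigma_{u,ii}^{1-q}$ can be verified directly, e.g. for a tridiagonal $\Sigma_u$, whose $m_p$ stays bounded however large p is:

```python
import numpy as np

def sparsity_measure(Sigma_u, q):
    """Generalized sparsity measure m_p of (2.2): max over rows of sum_j |sigma_ij|^q.
    For q = 0 this counts the non-zero entries in the worst row."""
    A = np.abs(Sigma_u)
    return (A > 0).sum(axis=1).max() if q == 0 else (A ** q).sum(axis=1).max()

p = 500
Sigma_u = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))  # tridiagonal example
q = 0.5
m_p = sparsity_measure(Sigma_u, q)
# spectral norm is controlled by m_p times max_i sigma_ii^{1-q}
assert np.linalg.norm(Sigma_u, 2) <= m_p * np.diag(Sigma_u).max() ** (1 - q)
```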

The aforementioned sparsity assumption appears reasonable in empirical applications. Boivin and Ng (2006) conducted an empirical study and showed that imposing zero correlation between weakly correlated idiosyncratic components improves the forecast. (We thank a referee for this interesting reference.) More recently, Phan (2012) empirically estimated the level of sparsity of the idiosyncratic covariance by using UK market data.

Recent developments in random‐matrix theory, e.g. Johnstone and Lu (2009) and Paul (2007), have shown that, when p/T is not negligible, the eigenvalues and eigenvectors of Σ might not be consistently estimated from the sample covariance matrix. A distinguishing feature of the covariance that is considered in this paper is that there are some very spiked eigenvalues. By propositions 1 and 2, in the factor model, the pervasiveness condition

$\text{all the eigenvalues of } p^{-1}\mathbf{B}'\mathbf{B} \text{ are bounded away from 0 as } p\to\infty \qquad (2.3)$

implies that the first K eigenvalues are growing at a rate p. Moreover, when p is large, the principal components $\{\xi_j\}_{j\le K}$ are close to the normalized vectors $\{\tilde{\mathbf{b}}_j/\|\tilde{\mathbf{b}}_j\|\}_{j\le K}$ when $m_p = o(p)$. This provides the mathematics for using the first K principal components as a proxy for the space that is spanned by the columns of the factor loading matrix B. In addition, because of condition (2.3), the signals of the first K eigenvalues are stronger than those of the spiked covariance model that was considered by Jung and Marron (2009) and Birnbaum et al. (2012). Therefore, our other conditions for the consistency of principal components at the population level are much weaker than those in the spiked covariance literature. However, this also shows that, under our setting, PCA is a valid approximation to factor analysis only if p→∞. The fact that PCA on the sample covariance is inconsistent when p is bounded has also previously been demonstrated in the literature (see, for example, Bai (2003)).

With assumption (2.3), the standard literature on approximate factor models has shown that PCA on the sample covariance matrix can consistently estimate the space that is spanned by the factor loadings (e.g. Stock and Watson (1998) and Bai (2003)). Our contribution in propositions 1 and 2 is that we connect the high dimensional factor model to the principal components and obtain the consistency of the spectrum at the population level, i.e. for Σ itself, rather than at the sample level. The spectral consistency also enhances the results in Chamberlain and Rothschild (1983). This provides the rationale behind the consistency results in the factor model literature.

2.2. Principal orthogonal complement thresholding

A sparsity assumption directly on Σ is inappropriate in many applications owing to the presence of common factors. Instead, we propose a non‐parametric estimator of Σ based on PCA. Let $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_p$ be the ordered eigenvalues of the sample covariance matrix $\hat\Sigma_{\mathrm{sam}}$ and $\{\hat\xi_i\}_{i=1}^{p}$ be their corresponding eigenvectors. Then the sample covariance has the following spectral decomposition:

$\hat\Sigma_{\mathrm{sam}} = \sum_{i=1}^{K}\hat\lambda_i\hat\xi_i\hat\xi_i' + \hat{\mathbf{R}}_K, \qquad (2.4)$

where $\hat{\mathbf{R}}_K = \sum_{i=K+1}^{p}\hat\lambda_i\hat\xi_i\hat\xi_i' = (\hat r_{ij})_{p\times p}$ is the principal orthogonal complement, and K is the number of diverging eigenvalues of Σ. Let us first assume that K is known.

Now we apply thresholding on $\hat{\mathbf{R}}_K$. Define

$\hat{\mathbf{R}}_K^{\mathcal{T}} = (\hat r_{ij}^{\mathcal{T}})_{p\times p}, \qquad \hat r_{ij}^{\mathcal{T}} = \begin{cases}\hat r_{ii}, & i = j,\\ s_{ij}(\hat r_{ij}), & i \ne j,\end{cases} \qquad (2.5)$

where $s_{ij}(\cdot)$ is a generalized shrinkage function of Antoniadis and Fan (2001), employed by Rothman et al. (2009) and Cai and Liu (2011), and $\tau_{ij} > 0$ is an entry‐dependent threshold. In particular, the hard thresholding rule $s_{ij}(z) = z\,I(|z|\ge\tau_{ij})$ (Bickel and Levina, 2008) and the constant thresholding parameter $\tau_{ij} = \delta$ are allowed. In practice, it is more desirable to have $\tau_{ij}$ entry adaptive. An example of the adaptive thresholding is

$\tau_{ij} = \tau(\hat r_{ii}\hat r_{jj})^{1/2} \quad \text{for a given } \tau > 0, \qquad (2.6)$

where $\hat r_{ii}$ is the ith diagonal element of $\hat{\mathbf{R}}_K$. This corresponds to applying the thresholding with parameter τ to the correlation matrix of $\hat{\mathbf{R}}_K$.

The estimator of Σ is then defined as

$\hat\Sigma_K^{\mathcal{T}} = \sum_{i=1}^{K}\hat\lambda_i\hat\xi_i\hat\xi_i' + \hat{\mathbf{R}}_K^{\mathcal{T}}. \qquad (2.7)$

We shall call this estimator the principal orthogonal complement thresholding estimator POET. It is obtained by thresholding the remaining components of the sample covariance matrix, after taking out the first K principal components. One of the attractive features of POET is that it is optimization free and hence is computationally appealing. (We have written an R package for POET, which outputs the estimated Σ, $\Sigma_u$, K, the factors and the loadings.)
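For concreteness, a minimal Python sketch of the construction (2.4)-(2.7) with the correlation‐based hard thresholding rule (2.6) might look as follows; this is our illustration, not the authors' R package, and the function name `poet` is ours.

```python
import numpy as np

def poet(Y, K, tau):
    """Minimal POET sketch (our illustration): Y is p x T with rows as series,
    K the number of factors, tau the correlation-scale threshold of (2.6)."""
    p, T = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)          # remove the sample means
    S = Yc @ Yc.T / T                               # sample covariance matrix
    lam, xi = np.linalg.eigh(S)                     # ascending eigenvalues
    lam, xi = lam[::-1], xi[:, ::-1]                # reorder to descending
    low_rank = (xi[:, :K] * lam[:K]) @ xi[:, :K].T  # sum_{i<=K} lam_i xi_i xi_i'
    R = S - low_rank                                # principal orthogonal complement
    d = np.sqrt(np.diag(R))                         # residual standard deviations
    thresh = tau * np.outer(d, d)                   # entry-adaptive threshold (2.6)
    R_T = np.where(np.abs(R) >= thresh, R, 0.0)     # hard thresholding rule
    np.fill_diagonal(R_T, np.diag(R))               # diagonal is kept untouched
    return low_rank + R_T                           # POET estimator (2.7)
```

For instance, `poet(Y, K=3, tau=0.5)` returns a p×p estimate; the soft or smoothly clipped absolute deviation rules discussed in Section 2.3 can replace the hard rule in the `np.where` line.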

With the choice of $\tau_{ij}$ in expression (2.6) and the hard thresholding rule, our estimator encompasses many popular estimators as its specific cases. When τ=0, the estimator is the sample covariance matrix and, when τ=1, the estimator becomes that based on the strict factor model (Fan et al., 2008). When K=0, our estimator is the same as the thresholding estimator of Bickel and Levina (2008) and (with a more general thresholding function) Rothman et al. (2009), or the adaptive thresholding estimator of Cai and Liu (2011) with a proper choice of $\tau_{ij}$.

In practice, the number of diverging eigenvalues (or common factors) can be estimated on the basis of the sample covariance matrix. Determining K in a data‐driven way is an important topic and is well understood in the literature. We shall describe the estimator POET with a data‐driven K in Section 2.4.

2.3. Least squares point of view

The estimator POET (2.7) has an equivalent representation using a constrained least squares method. The least squares method seeks $\hat{\mathbf{B}} = (\hat{\mathbf{b}}_1,\dots,\hat{\mathbf{b}}_p)'$ and $\hat{\mathbf{F}}' = (\hat{\mathbf{f}}_1,\dots,\hat{\mathbf{f}}_T)$ such that

$(\hat{\mathbf{B}}, \hat{\mathbf{F}}) = \arg\min_{\mathbf{B},\mathbf{F}} \sum_{i=1}^{p}\sum_{t=1}^{T}(y_{it} - \mathbf{b}_i'\mathbf{f}_t)^2, \qquad (2.8)$

subject to the normalization

$\frac{1}{T}\sum_{t=1}^{T}\mathbf{f}_t\mathbf{f}_t' = \mathbf{I}_K \quad \text{and} \quad \mathbf{B}'\mathbf{B} \text{ is diagonal}. \qquad (2.9)$

The constraints (2.9) correspond to the normalization (2.1). Here we assume that the mean of each variable has been removed, i.e. $E(y_{it}) = E(f_{jt}) = 0$ for all $i\le p$, $j\le K$ and $t\le T$. Putting it in a matrix form, the optimization problem can be written as

$\min_{\mathbf{B},\mathbf{F}} \|\mathbf{Y} - \mathbf{B}\mathbf{F}'\|_F^2, \qquad (2.10)$

where $\mathbf{Y} = (\mathbf{y}_1,\dots,\mathbf{y}_T)$ and $\mathbf{F}' = (\mathbf{f}_1,\dots,\mathbf{f}_T)$. For each given F, the least squares estimator of B is $\hat{\mathbf{B}} = T^{-1}\mathbf{Y}\mathbf{F}$, using the constraint (2.9) on the factors. Substituting this into problem (2.10), the objective function now becomes $\mathrm{tr}\{(\mathbf{I}_T - T^{-1}\mathbf{F}\mathbf{F}')\mathbf{Y}'\mathbf{Y}\}$. The minimizer is now clear: the columns of $T^{-1/2}\hat{\mathbf{F}}$ are the eigenvectors corresponding to the K largest eigenvalues of the T×T matrix $\mathbf{Y}'\mathbf{Y}$, and $\hat{\mathbf{B}} = T^{-1}\mathbf{Y}\hat{\mathbf{F}}$ (see, for example, Stock and Watson (2002)).
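The closed‐form solution just described is cheap to compute because the eigenproblem is only T×T. A minimal sketch, under the normalization (2.9), might be as follows (our illustration; the name `ls_factors` is ours):

```python
import numpy as np

def ls_factors(Y, K):
    """Constrained least squares solution of (2.8)-(2.9), our sketch: the columns
    of F_hat / sqrt(T) are the top-K eigenvectors of the T x T matrix Y'Y."""
    p, T = Y.shape
    _, v = np.linalg.eigh(Y.T @ Y)           # eigenvectors, ascending eigenvalues
    F_hat = np.sqrt(T) * v[:, ::-1][:, :K]   # T x K; satisfies F'F / T = I_K
    B_hat = Y @ F_hat / T                    # p x K least squares loadings
    U_hat = Y - B_hat @ F_hat.T              # residuals: estimated idiosyncratics
    return B_hat, F_hat, U_hat
```

Thresholding the sample covariance of `U_hat` as in (2.11) below and adding `B_hat @ B_hat.T` reproduces estimator (2.7), in line with theorem 1.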
We shall show that, under some mild regularity conditions, as p and T→∞, $\hat{\mathbf{b}}_i'\hat{\mathbf{f}}_t$ consistently estimates the true $\mathbf{b}_i'\mathbf{f}_t$ uniformly over $i\le p$ and $t\le T$. Since $\Sigma_u$ is assumed to be sparse, we can construct an estimator of $\Sigma_u$ by using the adaptive thresholding method of Cai and Liu (2011) as follows. Let $\hat u_{it} = y_{it} - \hat{\mathbf{b}}_i'\hat{\mathbf{f}}_t$ and $\hat\sigma_{ij} = T^{-1}\sum_{t=1}^{T}\hat u_{it}\hat u_{jt}$. For some predetermined decreasing sequence $\omega_T > 0$ and sufficiently large C>0, define the adaptive threshold parameter as $\tau_{ij} = C\hat\theta_{ij}^{1/2}\omega_T$, where $\hat\theta_{ij} = T^{-1}\sum_{t=1}^{T}(\hat u_{it}\hat u_{jt} - \hat\sigma_{ij})^2$. The estimated idiosyncratic covariance estimator is then given by

$\hat\Sigma_u^{\mathcal{T}} = (\hat\sigma_{ij}^{\mathcal{T}})_{p\times p}, \qquad \hat\sigma_{ij}^{\mathcal{T}} = \begin{cases}\hat\sigma_{ii}, & i = j,\\ s_{ij}(\hat\sigma_{ij}), & i \ne j,\end{cases} \qquad (2.11)$

where, for all $z\in\mathbb{R}$ (see Antoniadis and Fan (2001)),

$s_{ij}(z) = 0 \text{ when } |z|\le\tau_{ij}, \qquad |s_{ij}(z) - z|\le\tau_{ij}.$

It is easy to verify that $s_{ij}(\cdot)$ includes many interesting thresholding functions such as hard thresholding ($s_{ij}(z) = z\,I(|z|\ge\tau_{ij})$), soft thresholding ($s_{ij}(z) = \mathrm{sgn}(z)(|z|-\tau_{ij})_+$), smoothly clipped absolute deviation and the adaptive lasso (see Rothman et al. (2009)).
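For illustration, sample implementations of three such rules are sketched below (ours; a = 3.7 is the customary smoothly clipped absolute deviation constant); each satisfies $s_{ij}(z) = 0$ for $|z|\le\tau_{ij}$ and $|s_{ij}(z) - z|\le\tau_{ij}$.

```python
import numpy as np

def hard(z, tau):
    # hard thresholding: keep z unchanged when |z| exceeds tau, else 0
    return np.where(np.abs(z) >= tau, z, 0.0)

def soft(z, tau):
    # soft thresholding: shrink towards 0 by tau
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def scad(z, tau, a=3.7):
    # smoothly clipped absolute deviation: soft near 0, linear blend, then identity
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= 2 * tau,
                    soft(z, tau),
                    np.where(np.abs(z) <= a * tau,
                             ((a - 1) * z - np.sign(z) * a * tau) / (a - 2),
                             z))
```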

Analogous to the decomposition (1.3), we obtain the following substitution estimators:

$\hat{\mathbf{B}}\hat{\mathbf{B}}' + \hat\Sigma_u^{\mathcal{T}} \qquad (2.12)$

and, by the Sherman–Morrison–Woodbury formula, noting that $T^{-1}\hat{\mathbf{F}}'\hat{\mathbf{F}} = \mathbf{I}_K$,

$(\hat{\mathbf{B}}\hat{\mathbf{B}}' + \hat\Sigma_u^{\mathcal{T}})^{-1} = (\hat\Sigma_u^{\mathcal{T}})^{-1} - (\hat\Sigma_u^{\mathcal{T}})^{-1}\hat{\mathbf{B}}\{\mathbf{I}_K + \hat{\mathbf{B}}'(\hat\Sigma_u^{\mathcal{T}})^{-1}\hat{\mathbf{B}}\}^{-1}\hat{\mathbf{B}}'(\hat\Sigma_u^{\mathcal{T}})^{-1}. \qquad (2.13)$

In practice, the true number of factors K might be unknown to us. However, for any determined $K_1$, we can always construct either $\hat\Sigma_{K_1}^{\mathcal{T}}$ as in estimator (2.7) or $\hat{\mathbf{B}}\hat{\mathbf{B}}' + \hat\Sigma_u^{\mathcal{T}}$ as in estimator (2.12) to estimate Σ. The following theorem shows that, for each given $K_1$, the two estimators based on either regularized PCA or least squares substitution are equivalent. Similar results were obtained by Bai (2003) when $K_1 = K$ and no thresholding was imposed.

Theorem 1. Suppose that the entry‐dependent threshold in definition (2.5) is the same as the thresholding parameter that is used in expression (2.11). Then, for any $K_1$, estimator (2.7) is equivalent to the substitution estimator (2.12), i.e.

$\sum_{i=1}^{K_1}\hat\lambda_i\hat\xi_i\hat\xi_i' + \hat{\mathbf{R}}_{K_1}^{\mathcal{T}} = \hat{\mathbf{B}}\hat{\mathbf{B}}' + \hat\Sigma_u^{\mathcal{T}},$

where $\hat{\mathbf{B}}$ and $\hat\Sigma_u^{\mathcal{T}}$ on the right‐hand side are constructed from $K_1$ estimated factors.

In this paper, we shall use a data‐driven $\hat K$ to construct POET (see Section 2.4), which has two equivalent representations according to theorem 1.

2.4. Principal orthogonal complement thresholding with unknown K

Determining the number of factors in a data‐driven way has been an important research topic in the econometrics literature. Bai and Ng (2002) proposed a consistent estimator as both p and T diverge. Other recent criteria have been proposed by Kapetanios (2010), Onatski (2010) and Alessi et al. (2010), among others.

Our method also allows a data‐driven $\hat K$ to estimate the covariance matrices. In principle, any procedure that gives a consistent estimate of K can be adopted. In this paper we apply the well‐known method in Bai and Ng (2002). It estimates K by

$\hat K = \arg\min_{0\le k\le M}\ \log\Big\{\frac{1}{pT}\big\|\mathbf{Y}(\mathbf{I}_T - T^{-1}\hat{\mathbf{F}}_k\hat{\mathbf{F}}_k')\big\|_F^2\Big\} + k\,g(T,p), \qquad (2.14)$

where M is a prescribed upper bound, $\hat{\mathbf{F}}_k$ is a $T\times k$ matrix whose columns are √T times the eigenvectors corresponding to the k largest eigenvalues of the T×T matrix $\mathbf{Y}'\mathbf{Y}$, and g(T,p) is a penalty function of (p,T) such that g(T,p) = o(1) and $\min\{p,T\}\,g(T,p)\to\infty$. Two examples suggested by Bai and Ng (2002), IC1 and IC2, are respectively

$g_1(T,p) = \frac{p+T}{pT}\,\log\Big(\frac{pT}{p+T}\Big),$

$g_2(T,p) = \frac{p+T}{pT}\,\log\{\min(p,T)\}.$

Throughout the paper, we let $\hat K$ be the solution to problem (2.14) by using either IC1 or IC2. The asymptotic results are not affected regardless of the specific choice of g(T,p). We define the POET‐estimator with unknown K as

$\hat\Sigma^{\mathcal{T}} = \sum_{i=1}^{\hat K}\hat\lambda_i\hat\xi_i\hat\xi_i' + \hat{\mathbf{R}}_{\hat K}^{\mathcal{T}}. \qquad (2.15)$

The procedure is as stated in Section 2.2 except that the known K is now replaced by the data‐driven $\hat K$.
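A minimal sketch of criterion (2.14) with the IC1 penalty might look as follows (our illustration; `estimate_K` is not a function from the paper):

```python
import numpy as np

def estimate_K(Y, M=8):
    """Bai and Ng (2002) style estimator (2.14) with the IC1 penalty.
    Y is p x T, M is a prescribed upper bound on the number of factors."""
    p, T = Y.shape
    _, v = np.linalg.eigh(Y.T @ Y)                    # T x T eigenproblem
    v = v[:, ::-1]                                    # descending order
    g = (p + T) / (p * T) * np.log(p * T / (p + T))   # IC1 penalty g(T, p)
    best_k, best_ic = 0, np.inf
    for k in range(M + 1):
        F = np.sqrt(T) * v[:, :k]                     # T x k factor estimate
        resid = Y - (Y @ F / T) @ F.T                 # Y (I - FF'/T)
        ic = np.log((resid ** 2).sum() / (p * T)) + k * g
        if ic < best_ic:
            best_k, best_ic = k, ic
    return best_k
```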

3. Asymptotic properties

3.1. Assumptions

This section presents the assumptions on model (1.2), in which only $\{y_{it}\}_{i\le p,\,t\le T}$ are observable. Recall the identifiability condition (2.1).

The first assumption has been one of the most essential in the literature of approximate factor models. Under this assumption and other regularity conditions, the number of factors, loadings and common factors can be consistently estimated (e.g. Stock and Watson (1998, 2002), Bai and Ng (2002) and Bai (2003)).

Assumption 1. All the eigenvalues of the K×K matrix $p^{-1}\mathbf{B}'\mathbf{B}$ are bounded away from both 0 and ∞ as p→∞.

Remark 1.

  1. It is implied from proposition 1 in Section 2 that the first K eigenvalues of Σ grow at rate O(p). This unique feature distinguishes our work from most other work on low rank plus sparse covariances that has been considered in the literature, e.g. Luo (2011), Pati et al. (2012), Agarwal et al. (2012) and Birnbaum et al. (2012). (To our best knowledge, the only other references that estimate large covariances with diverging eigenvalues (growing at the rate of dimensionality O(p)) are Fan et al. (2008, 2011a) and Bai and Shi (2011). Whereas Fan et al. (2008, 2011a) assumed that the factors are observable, Bai and Shi (2011) considered the strict factor model in which $\Sigma_u$ is diagonal.)
  2. Assumption 1 requires the factors to be pervasive, i.e. to impact a non‐vanishing proportion of individual time series. See example 1 in Section 2.1.1 for its meaning. (It is important to distinguish the model that we consider in this paper from the 'sparse factor model' in the literature, e.g. Carvalho et al. (2008) and Pati et al. (2012), which assumes that the loading matrix B is sparse. The intuition of a sparse loading matrix is that each factor is related to only a relatively small number of stocks, assets, genes, etc. With B being sparse, all the eigenvalues of $\mathbf{B}'\mathbf{B}$, and hence those of Σ, are bounded.)
  3. As will be illustrated in Section 3.3 below, owing to the fast diverging eigenvalues, we can hardly achieve a good rate of convergence for estimating Σ under either the spectral norm or the Frobenius norm when p>T. This phenomenon arises naturally from the characteristics of the high dimensional factor model, which is another distinguishing feature compared with the convergence results in the existing literature.

Assumption 2.

  (a) $\{\mathbf{u}_t, \mathbf{f}_t\}_{t\ge 1}$ is strictly stationary. In addition, $E(u_{it}) = E(u_{it}f_{jt}) = 0$ for all $i\le p$, $j\le K$ and $t\le T$.
  (b) There are constants $c_1, c_2 > 0$ such that $\lambda_{\min}(\Sigma_u) > c_1$, $\|\Sigma_u\|_1 < c_2$ and
    $\min_{i\le p,\,j\le p}\,\mathrm{var}(u_{it}u_{jt}) > c_1.$
  (c) There are $r_1, r_2 > 0$ and $b_1, b_2 > 0$ such that, for any s>0, $i\le p$ and $j\le K$,
    $P(|u_{it}| > s) \le \exp\{-(s/b_1)^{r_1}\}, \qquad P(|f_{jt}| > s) \le \exp\{-(s/b_2)^{r_2}\}.$

Condition (a) requires strict stationarity as well as the non‐correlation between $\{\mathbf{u}_t\}$ and $\{\mathbf{f}_t\}$. These conditions are slightly stronger than those in the literature, e.g. Bai (2003), but are still standard and simplify our technicalities. Condition (b) requires that $\Sigma_u$ be well conditioned. The condition $\|\Sigma_u\|_1 < c_2$ instead of the weaker condition $\|\Sigma_u\| < c_2$ is imposed here to estimate K consistently. But it is still standard in the approximate factor model literature as in Bai and Ng (2002), Bai (2003), etc. When K is known, such a condition can be removed. Fan et al. (2011b) shows that the results continue to hold for a growing (known) K under the weaker condition $\|\Sigma_u\| < c_2$. Condition (c) requires exponential‐type tails, which allow us to apply the large deviation theory to $T^{-1}\sum_{t=1}^{T} u_{it}u_{jt} - \sigma_{u,ij}$ and $T^{-1}\sum_{t=1}^{T} f_{jt}u_{it}$.

We impose the strong mixing condition. Let $\mathcal{F}_{-\infty}^{0}$ and $\mathcal{F}_{T}^{\infty}$ denote the σ‐algebras that are generated by $\{(\mathbf{f}_t, \mathbf{u}_t): t\le 0\}$ and $\{(\mathbf{f}_t, \mathbf{u}_t): t\ge T\}$ respectively. In addition, define the mixing coefficient

$\alpha(T) = \sup_{A\in\mathcal{F}_{-\infty}^{0},\,B\in\mathcal{F}_{T}^{\infty}}|P(A)\,P(B) - P(A\cap B)|. \qquad (3.1)$

Assumption 3 (strong mixing). There exists $r_3 > 0$ such that $3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} > 1$, and C>0 satisfying, for all $T\in\mathbb{Z}^{+}$,

$\alpha(T) \le \exp(-CT^{r_3}).$

In addition, we impose the following regularity conditions.

Assumption 4. There exists M>0 such that, for all $i\le p$, $t\le T$ and $s\le T$:

  1. $\|\mathbf{b}_i\|_{\max} < M$;
  2. $E\big|p^{-1/2}\{\mathbf{u}_s'\mathbf{u}_t - E(\mathbf{u}_s'\mathbf{u}_t)\}\big|^4 < M$;
  3. $E\big\|p^{-1/2}\sum_{i=1}^{p}\mathbf{b}_i u_{it}\big\|^4 < M$.

These conditions are needed to estimate consistently the transformed common factors as well as the factor loadings. Similar conditions were also assumed in Bai (2003) and Bai and Ng (2006). The number of factors is assumed to be fixed. Our conditions in assumption 4 are weaker than those in Bai (2003) as we focus on different aspects of the study.

3.2. Convergence of the idiosyncratic covariance

Estimating the covariance matrix $\Sigma_u$ of the idiosyncratic components $\{\mathbf{u}_t\}$ is important for many statistical inferences. For example, it is needed for large sample inference on the unknown factors and their loadings, for testing the capital asset pricing model (Sentana, 2009) and for large‐scale hypothesis testing (Fan et al., 2012). See Section 5.

We estimate $\Sigma_u$ by thresholding the principal orthogonal complement after the first $\hat K$ principal components of the sample covariance have been taken out: $\hat\Sigma_u^{\mathcal{T}} = \hat{\mathbf{R}}_{\hat K}^{\mathcal{T}}$. By theorem 1, it also has an equivalent expression given by estimator (2.11), with the residuals $\hat u_{it}$ computed from the $\hat K$ estimated factors. Throughout the paper, we apply the adaptive threshold

$\tau_{ij} = C\hat\theta_{ij}^{1/2}\omega_T, \qquad \omega_T = \frac{1}{\sqrt{p}} + \sqrt{\frac{\log p}{T}}, \qquad (3.2)$

where C>0 is a sufficiently large constant, though the results hold for other types of thresholding. As in Bickel and Levina (2008) and Cai and Liu (2011), the threshold that is chosen in the current paper is in fact obtained from the optimal uniform rate of convergence of $\max_{i\le p,\,j\le p}|\hat\sigma_{ij} - \sigma_{u,ij}|$. When direct observation of $u_{it}$ is not available, the effect of estimating the unknown factors also contributes to this uniform estimation error, which is why $1/\sqrt{p}$ appears in the threshold.

The following theorem gives the rate of convergence of the estimated idiosyncratic covariance. Let $\gamma^{-1} = 3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} + 1$. In the convergence rate below, recall that $m_p$ and q are defined in the measure of sparsity (2.2).

Theorem 2. Suppose that $\log(p) = o(T^{\gamma/6})$, $T = o(p^2)$ and assumptions 1–4 hold. Then, for a sufficiently large constant C>0 in the threshold (3.2), the POET‐estimator $\hat\Sigma_u^{\mathcal{T}}$ satisfies

$\|\hat\Sigma_u^{\mathcal{T}} - \Sigma_u\| = O_p(m_p\,\omega_T^{1-q}).$

If further $m_p\,\omega_T^{1-q} = o(1)$, then the eigenvalues of $\hat\Sigma_u^{\mathcal{T}}$ are all bounded away from 0 with probability approaching 1, and

$\|(\hat\Sigma_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| = O_p(m_p\,\omega_T^{1-q}).$

When estimating $\Sigma_u$, p is allowed to grow exponentially fast in T, and $\hat\Sigma_u^{\mathcal{T}}$ can be made consistent under the spectral norm. In addition, $\hat\Sigma_u^{\mathcal{T}}$ is asymptotically invertible, whereas the classical sample covariance matrix based on the residuals is not when p>T.

Remark 2.

  1. Consistent estimation of $\Sigma_u$ indicates that $\Sigma_u$ is identifiable in model (1.3), namely the sparse $\Sigma_u$ can be separated perfectly from the low rank matrix there. The result here gives another proof (when assuming that $m_p\,\omega_T^{1-q} = o(1)$) of the 'surprising phenomenon' in Candès et al. (2011) under different technical conditions.
  2. Fan et al. (2011a) recently showed that, when $\{\mathbf{f}_t\}$ are observable and q=0, the rate of convergence of the adaptive thresholding estimator is given by
    $\|\hat\Sigma_u^{\mathcal{T}} - \Sigma_u\| = O_p\Big(m_p\sqrt{\frac{\log p}{T}}\Big).$
  3. Hence, when the common factors are unobservable, the rate of convergence has an additional term $m_p/\sqrt{p}$, coming from the effect of estimating the unknown factors. This effect vanishes when $p\,\log(p) \gg T$, in which case the minimax rate as in Cai and Zhou (2012) is achieved. As p increases, more information about the common factors is collected, which results in more accurate estimation of the common factors $\{\mathbf{f}_t\}$.
  4. When K is known and grows with p and T, with slightly weaker assumptions, Fan et al. (2011b) shows that, under the exactly sparse case (i.e. q=0), the result continues to hold, with a rate of convergence that depends explicitly on K.

3.3. Convergence of POET

Since the first K eigenvalues of Σ grow with p, we can hardly estimate Σ with satisfactory accuracy in absolute terms. This problem does not arise from the limitation of any estimation method but is due to the nature of the high dimensional factor model. We illustrate this by using a simple example.

3.3.1. Example 2

Consider an ideal case where we know the spectrum except for the first eigenvector of Σ. Let $\{\lambda_j, \xi_j\}_{j=1}^{p}$ be the eigenvalues and eigenvectors, and assume that the largest eigenvalue satisfies $\lambda_1 \ge cp$ for some c>0. Let $\hat\xi_1$ be the estimated first eigenvector and define the covariance estimator $\hat\Sigma = \lambda_1\hat\xi_1\hat\xi_1' + \sum_{j=2}^{p}\lambda_j\xi_j\xi_j'$. Assume that $\hat\xi_1$ is a good estimator in the sense that $\|\hat\xi_1 - \xi_1\|^2 = O_p(T^{-1})$. However,

$\|\hat\Sigma - \Sigma\| = \lambda_1\|\hat\xi_1\hat\xi_1' - \xi_1\xi_1'\| = O_p(\lambda_1\|\hat\xi_1 - \xi_1\|) = O_p(p\,T^{-1/2}),$

which can diverge when $T = O(p^2)$.
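A small simulation (ours) reproduces this phenomenon: with $\lambda_1 \asymp p$ and an $O(T^{-1/2})$ error in the leading eigenvector only, the absolute spectral error grows with p, while the relative error matrix introduced just below remains controlled.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
for p in (100, 400, 1600):
    b = rng.normal(size=p)                     # pervasive single factor
    Sigma = np.outer(b, b) + np.eye(p)         # lambda_1 grows like p
    lam, xi = np.linalg.eigh(Sigma)
    lam1, xi1 = lam[-1], xi[:, -1]
    # perturb the leading eigenvector by an O(T^{-1/2}) orthogonal error
    noise = rng.normal(size=p)
    noise -= (noise @ xi1) * xi1
    xi1_hat = xi1 + noise / np.linalg.norm(noise) / np.sqrt(T)
    xi1_hat /= np.linalg.norm(xi1_hat)
    Delta = lam1 * (np.outer(xi1_hat, xi1_hat) - np.outer(xi1, xi1))
    abs_err = np.linalg.norm(Delta, 2)         # spectral norm, grows like p/sqrt(T)
    inv_sqrt = xi @ np.diag(lam ** -0.5) @ xi.T
    rel_err = np.linalg.norm(inv_sqrt @ Delta @ inv_sqrt, 2)  # stays bounded
    print(p, round(abs_err, 2), round(rel_err, 4))
```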
In the presence of very spiked eigenvalues, although the covariance Σ cannot be consistently estimated in absolute terms, it can be well estimated in terms of the relative error matrix

$\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - \mathbf{I}_p,$

which is more relevant for many applications (see example 4 in Section 5). The relative error matrix can be measured by either its spectral norm or the normalized Frobenius norm defined by

$p^{-1/2}\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - \mathbf{I}_p\|_F. \qquad (3.3)$

In equality (3.3), there are p terms being added in the trace operation and the factor $p^{-1/2}$ plays the role of normalization. The loss (3.3) is closely related to the entropy loss, which was introduced by James and Stein (1961). Also note that

$p^{-1/2}\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - \mathbf{I}_p\|_F = \|\hat\Sigma - \Sigma\|_\Sigma,$

where $\|\mathbf{A}\|_\Sigma = p^{-1/2}\|\Sigma^{-1/2}\mathbf{A}\Sigma^{-1/2}\|_F$ is the weighted quadratic norm in Fan et al. (2008).

Fan et al. (2008) showed that, in a large factor model, the sample covariance is such that $\|\hat\Sigma_{\mathrm{sam}} - \Sigma\|_\Sigma = O_p(\sqrt{p/T})$, which does not converge if p>T. In contrast, theorem 3 below shows that $\|\hat\Sigma^{\mathcal{T}} - \Sigma\|_\Sigma$ can still be convergent as long as p = o(T²), up to logarithmic factors. Technically, the effect of high dimensionality on the convergence rate of $\hat\Sigma^{\mathcal{T}} - \Sigma$ is via the number of rows in B. We show in Appendix A that B appears in $\|\hat\Sigma^{\mathcal{T}} - \Sigma\|_\Sigma$ through $\mathbf{B}'\Sigma^{-1}\mathbf{B}$, whose eigenvalues are bounded. Therefore it successfully cancels out the curse of high dimensionality that is introduced by B.

Compared with estimating Σ, in a large approximate factor model we can estimate the precision matrix with a satisfactory rate under the spectral norm. The intuition follows from the fact that $\Sigma^{-1}$ has bounded eigenvalues.

The following theorem summarizes the rate of convergence under various norms.

Theorem 3. Under the assumptions of theorem 2, the POET‐estimator that is defined in equation (2.15) satisfies

$\|\hat\Sigma^{\mathcal{T}} - \Sigma\|_\Sigma = O_p\Big\{\frac{\sqrt{p}\,\log(p)}{T} + m_p\,\omega_T^{1-q}\Big\}, \qquad \|\hat\Sigma^{\mathcal{T}} - \Sigma\|_{\max} = O_p(\omega_T).$

In addition, if $m_p\,\omega_T^{1-q} = o(1)$, then $\hat\Sigma^{\mathcal{T}}$ is non‐singular with probability approaching 1, with

$\|(\hat\Sigma^{\mathcal{T}})^{-1} - \Sigma^{-1}\| = O_p(m_p\,\omega_T^{1-q}).$

Remark 3.

  1. When estimating $\Sigma^{-1}$, p is allowed to grow exponentially fast in T, and the estimator has the same rate of convergence as that of the estimator $(\hat\Sigma_u^{\mathcal{T}})^{-1}$ in theorem 2. When p becomes much larger than T, the precision matrix can be estimated at the same rate as if the factors were observable.
  2. As in remark 2, when K>0 is known and grows with p and T, Fan et al. (2011a) prove analogous results (when q=0), with rates of convergence that state explicitly the dependence on the number of factors. (The assumptions in Fan et al. (2011a) are slightly weaker than those presented here, in that they required that $\|\Sigma_u\|$ instead of $\|\Sigma_u\|_1$ be bounded.)
  3. The relative error $\|\Sigma^{-1/2}\hat\Sigma^{\mathcal{T}}\Sigma^{-1/2} - \mathbf{I}_p\|$ in operator norm can be shown to have the same order as the maximum relative error of the estimated eigenvalues. It neither converges to 0 nor diverges. It is much smaller than $\|\hat\Sigma^{\mathcal{T}} - \Sigma\|$, which is of order p/√T (see example 2).

3.4. Convergence of unknown factors and factor loadings

Many applications of the factor model require estimating the unknown factors. In general, the factor loadings in B and the common factors $\mathbf{f}_t$ are not separably identifiable, as, for any K×K matrix H such that $\mathbf{H}'\mathbf{H} = \mathbf{I}_K$, $\mathbf{B}\mathbf{f}_t = \mathbf{B}\mathbf{H}'\mathbf{H}\mathbf{f}_t$. Hence $(\mathbf{B}, \mathbf{f}_t)$ cannot be identified from $(\mathbf{B}\mathbf{H}', \mathbf{H}\mathbf{f}_t)$. Note that the linear space spanned by the columns of B is the same as that spanned by those of $\mathbf{B}\mathbf{H}'$. In practice, it often does not matter which one is used.

Let $\mathbf{V}$ denote the $\hat K\times\hat K$ diagonal matrix of the first $\hat K$ largest eigenvalues of the sample covariance matrix in decreasing order. Recall that $\hat{\mathbf{F}}' = (\hat{\mathbf{f}}_1,\dots,\hat{\mathbf{f}}_T)$ and define a $\hat K\times K$ matrix $\mathbf{H} = T^{-1}\mathbf{V}^{-1}\hat{\mathbf{F}}'\mathbf{F}\mathbf{B}'\mathbf{B}$. Then, for $t\le T$,

$\mathbf{H}\mathbf{f}_t = T^{-1}\mathbf{V}^{-1}\hat{\mathbf{F}}'(\mathbf{B}\mathbf{f}_1,\dots,\mathbf{B}\mathbf{f}_T)'\,\mathbf{B}\mathbf{f}_t.$

Note that $\mathbf{H}\mathbf{f}_t$ depends only on the data $\mathbf{V}^{-1}\hat{\mathbf{F}}'$ and an identifiable part of the parameters $\{\mathbf{B}\mathbf{f}_t\}_{t\le T}$. Therefore, there is no identifiability issue in $\mathbf{H}\mathbf{f}_t$ regardless of the identifiability condition imposed.

Bai (2003) obtained the rate of convergence for both $\hat{\mathbf{b}}_i$ and $\hat{\mathbf{f}}_t$ for any fixed (i,t). However, the uniform rate of convergence is more relevant for many applications (see example 3 in Section 5). The following theorem extends those results in Bai (2003) in a uniformity sense. In particular, with a more refined technique, we have improved the uniform convergence rate for $\hat{\mathbf{f}}_t$.

Theorem 4. Under the assumptions of theorem 2,

$\max_{i\le p}\|\hat{\mathbf{b}}_i - \mathbf{H}\mathbf{b}_i\| = O_p(\omega_T), \qquad \max_{t\le T}\|\hat{\mathbf{f}}_t - \mathbf{H}\mathbf{f}_t\| = O_p\Big(\frac{1}{\sqrt{T}} + \frac{T^{1/4}}{\sqrt{p}}\Big).$

As a consequence of theorem 4, we obtain the following corollary (recall that the constant $r_2$ is defined in assumption 2).

Corollary 1. Under the assumptions of theorem 2,

$\max_{i\le p,\,t\le T}|\hat{\mathbf{b}}_i'\hat{\mathbf{f}}_t - \mathbf{b}_i'\mathbf{f}_t| = O_p\Big\{(\log T)^{1/r_2}\sqrt{\frac{\log p}{T}} + \frac{T^{1/4}}{\sqrt{p}}\Big\}.$

The rates of convergence that were obtained above also explain the condition $T = o(p^2)$ in theorems 2 and 3. It is needed to estimate the common factors $\{\mathbf{f}_t\}_{t\le T}$ uniformly in $t\le T$. When we do not observe $\{\mathbf{f}_t\}_{t\le T}$, in addition to the factor loadings, there are KT factor values to estimate. Intuitively, the condition $T = o(p^2)$ requires the number of parameters that are introduced by the unknown factors to be 'not too many', so that we can consistently estimate them uniformly. Technically, as demonstrated by Bickel and Levina (2008), Cai and Liu (2011) and many others, achieving uniform accuracy is essential for large covariance estimations.

4. Choice of threshold

4.1. Finite sample positive definiteness

Recall that the threshold value $\tau_{ij} = C\hat\theta_{ij}^{1/2}\omega_T$, where C is determined by the user. To make POET operational in practice, we must choose C to maintain the positive definiteness of the estimated covariances for any given finite sample. We write $\hat\Sigma^{\mathcal{T}} = \hat\Sigma^{\mathcal{T}}(C)$, where the covariance estimator depends on C via the threshold. We choose C in the range where $\lambda_{\min}\{\hat\Sigma^{\mathcal{T}}(C)\} > 0$. Define

$C_{\min} = \inf\big[C > 0: \lambda_{\min}\{\hat\Sigma^{\mathcal{T}}(M)\} > 0,\ \forall M > C\big]. \qquad (4.1)$

When C is sufficiently large, the estimator becomes diagonal, whereas its minimum eigenvalue must retain strict positivity. Thus, $C_{\min}$ is well defined and, for all $C > C_{\min}$, $\hat\Sigma^{\mathcal{T}}(C)$ is positive definite under finite samples. We can obtain $C_{\min}$ by solving $\lambda_{\min}\{\hat\Sigma^{\mathcal{T}}(C)\} = 0$. We can also approximate $C_{\min}$ by plotting $\lambda_{\min}\{\hat\Sigma^{\mathcal{T}}(C)\}$ as a function of C, as illustrated in Fig. 1. In practice, we can choose C in the range $(C_{\min} + \epsilon, M)$ for a small ɛ and a sufficiently large M. Choosing the threshold in a range to guarantee the finite sample positive definiteness has also been previously suggested by Fryzlewicz (2012).
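Reusing the `poet` sketch from Section 2.2, $C_{\min}$ can be approximated by a grid search on the minimum eigenvalue (our illustration, with $\tau = C\,\omega_T$ on the correlation scale standing in for the entry‐dependent threshold):

```python
import numpy as np

def c_min(Y, K, grid=np.linspace(0.0, 3.0, 61)):
    """Approximate C_min of (4.1) by a grid search (our sketch): return the
    smallest grid value of C at which lambda_min of the POET estimate is positive.
    Relies on the poet() sketch defined earlier."""
    p, T = Y.shape
    omega = 1.0 / np.sqrt(p) + np.sqrt(np.log(p) / T)
    lam_min = np.array([np.linalg.eigvalsh(poet(Y, K, C * omega)).min()
                        for C in grid])
    ok = np.where(lam_min > 0)[0]
    return grid[ok[0]] if ok.size else None
```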

Fig. 1. Minimum eigenvalue of $\hat\Sigma^{\mathcal{T}}(C)$ as a function of C for three choices of thresholding rules (hard thresholding, soft thresholding and smoothly clipped absolute deviation); the plot is based on the simulated data set in Section 6.2.

4.2. Multifold cross‐validation

In practice, C can be data driven and chosen through multifold cross‐validation. After obtaining the estimated residuals $\hat u_{it} = y_{it} - \hat{\mathbf{b}}_i'\hat{\mathbf{f}}_t$ by PCA, we divide them randomly into two subsets, which are, for simplicity, denoted by $\{\hat{\mathbf{u}}_t\}_{t\in J_1}$ and $\{\hat{\mathbf{u}}_t\}_{t\in J_2}$. The sizes of $J_1$ and $J_2$, which are denoted by $T(J_1)$ and $T(J_2)$, satisfy $T(J_1) + T(J_2) = T$. For example, in sparse matrix estimation, Bickel and Levina (2008) suggested the choice $T(J_2) = T/\log(T)$.

We repeat this procedure H times. At the jth split, we denote by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0309 the POET-estimator with threshold urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0310 on the training data set urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0311, and by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0312 the sample covariance based on the validation set, defined by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0313. Then we choose the constant urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0314 by minimizing the cross-validation objective function over a compact interval:
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0315(4.2)
Here urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0316 is the minimum constant that guarantees the positive definiteness of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0317 for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0318, as described in the previous subsection, and M is a large constant such that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0319 is diagonal. The resulting urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0320 is data driven, so it depends on Y, and hence on p and T, through the data. In contrast, for each given p×T data matrix Y, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0321 is a universal constant in the threshold urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0322, in the sense that it does not change with the position (i,j). We also note that the cross-validation is based on an estimate of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0323 rather than of Σ, because POET thresholds the error covariance matrix; the cross-validation is thus targeted at the performance of the thresholding step.
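A hedged sketch of this multifold cross-validation, assuming residuals U (p × T) from the principal components step and a user-supplied function sigma_u_threshold(U_train, C) that returns the thresholded residual covariance computed on the training split; the squared Frobenius norm plays the role of the loss in expression (4.2), and the split fraction and grid are illustrative rather than prescribed:

import numpy as np

def cv_choose_C(U, sigma_u_threshold, C_grid, H=20, frac_train=2/3, seed=0):
    # U: p x T matrix of estimated residuals from PCA
    rng = np.random.default_rng(seed)
    p, T = U.shape
    T1 = int(frac_train * T)
    scores = np.zeros(len(C_grid))
    for _ in range(H):                       # H random splits
        idx = rng.permutation(T)
        train, valid = idx[:T1], idx[T1:]
        S_valid = U[:, valid] @ U[:, valid].T / len(valid)   # validation covariance
        for k, C in enumerate(C_grid):
            diff = sigma_u_threshold(U[:, train], C) - S_valid
            scores[k] += np.sum(diff ** 2)   # squared Frobenius norm
    return C_grid[int(np.argmin(scores))]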

It is possible to derive the rate of convergence for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0324 under the current model setting, but it would be much more technically involved than in the regular sparse matrix estimation considered by Bickel and Levina (2008) and Cai and Liu (2011). To keep the presentation simple, we do not pursue it in this paper.

5. Applications of POET

We give four examples to which the results in theorems 2–4 can be applied; a detailed pursuit of each is beyond the scope of this paper.

5.1. Example 3 (large‐scale hypothesis testing)

Controlling the false discovery rate in large‐scale hypothesis testing based on correlated test statistics is an important and challenging problem in statistics (Leek and Storey, 2008; Efron, 2010; Fan et al., 2012). Suppose that the test statistic for each of the hypotheses
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0325
is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0326, and these test statistics Z are jointly normal N(μ,Σ), where Σ is unknown. For a given critical value x, the false discovery proportion is defined as FDP(x)=V(x)/R(x), where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0327 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0328 are the number of false discoveries and the total number of discoveries respectively. Our interest is in estimating FDP(x) for each given x. Note that R(x) is an observable quantity; only V(x) needs to be estimated.
If the covariance Σ admits the approximate factor structure (1.3), then the test statistics can be stochastically decomposed as
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0329(5.1)
where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0330 is sparse. By the principal factor approximation (theorem 1 of Fan et al. (2012)),
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0331(5.2)
when urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0332 and the number of true significant hypotheses urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0333 is o(p), where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0334 is the upper x‐quantile of the standard normal distribution, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0335 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0336.

Now suppose that we have n repeated measurements from model (5.1). Then, by corollary 1, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0337 can be estimated uniformly consistently, and hence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0338 and FDP(x) can be estimated consistently. Efron (2010) obtained such repeated test statistics by bootstrapping from the original raw data. Our theory (theorem 4) gives a formal justification of the framework of Efron (2007, 2010).
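An illustrative computation, with hypothetical parameters and assuming for concreteness the two-sided rule that rejects when |Z_i| > x, of the quantities R(x), V(x) and FDP(x) from model (5.1); in practice V(x) is unobservable and is the quantity that approximation (5.2) lets us estimate:

import numpy as np

rng = np.random.default_rng(1)
p, K, x = 1000, 3, 2.5
B = rng.normal(size=(p, K)) / np.sqrt(K)    # factor loadings
W = rng.normal(size=K)                      # common factors
mu = np.zeros(p)
mu[:30] = 3.0                               # 30 true signals, o(p)
Z = mu + B @ W + rng.normal(size=p)         # test statistics as in model (5.1)

reject = np.abs(Z) > x
R = int(reject.sum())                       # R(x): total discoveries (observable)
V = int((reject & (mu == 0)).sum())         # V(x): false discoveries (unobservable)
print("R(x) =", R, " V(x) =", V, " FDP(x) =", V / max(R, 1))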

5.2. Example 4 (risk management)

The maximum elementwise estimation error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0339 appears in risk assessment, as in Fan et al. (2012). For a fixed portfolio allocation vector w, the true portfolio variance and the estimated (perceived) variance are urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0340 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0341 respectively. The estimation error is bounded by
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0342
where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0343, the urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0344-norm of w, is the gross exposure of the portfolio. Usually a constraint is placed on the total proportion of short positions, in which case we have the restriction urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0345 for some c>0; in particular, c=1 corresponds to a portfolio with no short positions (all weights are non-negative). Theorem 3 quantifies the maximum approximation error.
The discussion above concerns the absolute difference between the perceived risk and the true risk. The relative difference is bounded by
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0346
for any allocation vector w. Theorem 3 quantifies this relative error.
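A quick numerical check of the elementwise bound that drives this example, namely |w′(Σ̂−Σ)w| ≤ ‖Σ̂−Σ‖_max ‖w‖₁² with ‖·‖_max the maximum absolute entry (a standard inequality; the matrices below are arbitrary symmetric stand-ins rather than POET output):

import numpy as np

rng = np.random.default_rng(0)
p = 50
A = rng.normal(size=(p, p))
Sigma = A @ A.T / p                         # a positive definite 'truth'
E = 0.01 * rng.normal(size=(p, p))
Sigma_hat = Sigma + (E + E.T) / 2           # a perturbed 'estimate'
w = rng.normal(size=p)
w /= np.abs(w).sum()                        # gross exposure ||w||_1 = 1

lhs = abs(w @ (Sigma_hat - Sigma) @ w)      # error in the perceived variance
rhs = np.max(np.abs(Sigma_hat - Sigma)) * np.abs(w).sum() ** 2
assert lhs <= rhs + 1e-12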

5.3. Example 5 (panel regression with a factor structure in the errors)

Consider the panel regression model
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0347
where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0348 is a vector of observable regressors of fixed dimension. The regression error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0349 has a factor structure and is assumed to be independent of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0350, but urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0351, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0352 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0353 are all unobservable. We are interested in the common regression coefficients β. This panel regression model has been considered by many researchers, such as Ahn et al. (2001) and Pesaran (2006), and has broad applications in the social sciences.

Although ordinary least squares produces a consistent estimator of β, a more efficient estimator can be obtained by generalized least squares. The generalized least squares method depends, however, on an estimator of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0354, the inverse of the covariance matrix of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0355. By assuming that the covariance matrix of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0356 is sparse, we can solve this problem by applying theorem 3. Although urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0357 is unobservable, it can be replaced by the regression residuals urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0358, obtained by first regressing urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0359 on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0360; we then apply POET to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0361. By theorem 3, the inverse of the resulting estimator is a consistent estimator of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0362 under the spectral norm. A slight difference is that, when we apply POET, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0363 is replaced by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0364, which introduces an additional term urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0365 into the estimation error.
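A hedged sketch of this two-step procedure; poet(U, K) is an assumed routine returning the POET estimate of the p × p error covariance from residuals U (p × T), and the errors are treated as cross-sectionally correlated but independent and identically distributed over t in the feasible generalized least squares step:

import numpy as np

def panel_gls(Y, X, poet, K):
    # Y: p x T responses; X: p x T x d observable regressors
    p, T, d = X.shape
    beta_ols, *_ = np.linalg.lstsq(X.reshape(p * T, d), Y.reshape(p * T), rcond=None)
    U = Y - X @ beta_ols                         # p x T residual matrix
    Omega_inv = np.linalg.inv(poet(U, K))        # POET error covariance, inverted
    XtOX = sum(X[:, t, :].T @ Omega_inv @ X[:, t, :] for t in range(T))
    XtOY = sum(X[:, t, :].T @ Omega_inv @ Y[:, t] for t in range(T))
    return np.linalg.solve(XtOX, XtOY)           # feasible generalized least squares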

5.4. Example 6 (validating an asset pricing theory)

A celebrated theory in financial economics is the capital asset pricing model (Sharpe, 1964), which won William Sharpe the Nobel prize in economics in 1990; its extension is the multifactor model (Ross, 1976; Chamberlain and Rothschild, 1983). It states that, in a frictionless market, the excess return of any financial asset equals the excess returns of the risk factors times its factor loadings, plus noise. In the multiperiod model, the excess return urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0366 of firm i at time t follows model (1.1), in which urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0367 are the excess returns of the risk factors at time t. To test the null hypothesis (1.2), we embed the model into the multivariate linear model
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0368(5.3)
and wish to test urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0369. The F-test statistic involves an estimate of the covariance matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0370, which is degenerate without regularization when p ≥ T. Therefore the literature (Sentana (2009), and the references therein) focuses on the case in which p is relatively small; typical choices are T=60 monthly observations with p=5, 10 or 25 assets. However, the capital asset pricing model should hold for all tradeable assets, not just a small fraction of them. With our regularization technique, a non-degenerate estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0371 can be obtained, and the F-test or likelihood ratio test statistics can be employed even when p ≥ T.
To provide some insight, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0372 be the least squares estimator of model (5.3). Then, when urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0373, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0374 for a constant urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0375 that depends on the observed factors. When urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0376 is known, the Wald test statistic is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0377. When it is unknown and p is large, it is natural to use the F-type test statistic urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0378. The difference between these two statistics is bounded by
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0379

Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0380 under the null hypothesis, we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0381. Thus it follows from the boundedness of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0382 that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0383. Theorem 2 provides the rate of convergence for this difference. A detailed development is beyond the scope of the current paper, and we leave it as a separate research project (see Pesaran and Yamagata (2012)).
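A schematic, with hypothetical scaling and again an assumed poet routine, of forming such a statistic: estimate the intercepts and loadings asset by asset via least squares, regularize the residual covariance with POET, and evaluate the quadratic form, which remains well defined even when p ≥ T:

import numpy as np

def alpha_test_stat(Y, F, poet, K):
    # Y: p x T excess returns; F: K x T observed factor returns
    T = Y.shape[1]
    Z = np.vstack([np.ones(T), F])               # regressors: intercept plus factors
    coef = Y @ Z.T @ np.linalg.inv(Z @ Z.T)      # asset-by-asset least squares
    alpha_hat = coef[:, 0]                       # estimated intercepts
    U = Y - coef @ Z                             # p x T residual matrix
    Sigma_u_hat = poet(U, K)                     # regularized error covariance
    return T * alpha_hat @ np.linalg.solve(Sigma_u_hat, alpha_hat)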

6. Monte Carlo experiments

In this section we examine the finite sample performance of POET and demonstrate the effect of this estimator on asset allocation and risk assessment. Similarly to Fan et al. (2008, 2011a), we simulate from a standard Fama–French three-factor model, assuming a sparse error covariance matrix and three factors. Throughout this section the time span is fixed at T=300, and the dimensionality p increases from 1 to 600. We assume that the excess returns of each of the p stocks over the risk-free interest rate follow the model
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0384

The factor loadings are drawn from a trivariate normal distribution urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0385, the idiosyncratic errors from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0386, and the factor returns urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0387 follow a first-order vector autoregressive, VAR(1), model. To make the simulation more realistic, the model parameters are calibrated from financial returns, as detailed in the following subsection.

6.1. Calibration

To calibrate the model, we use data on the annualized returns of 100 industrial portfolios from the Web site of Kenneth French, and data on 3-month Treasury bill rates from the Center for Research in Security Prices database. These industrial portfolios are formed as the intersections of 10 portfolios based on size (market equity) and 10 portfolios based on the ratio of book equity to market equity. Their excess returns urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0388 are computed for the period from January 1st, 2009, to December 31st, 2010. Here we present a short outline of the calibration procedure.
  1. Given urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0389 as the input data, we fit a Fama–French three-factor model and calculate a 100×3 matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0390 and a 500×3 matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0391, using the principal components method described in Section 3.1.
  2. We summarize 100 factor loadings (the rows of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0392) by their sample mean vector urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0393 and sample covariance matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0394, which are reported in Table 1. The factor loadings urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0395 for i=1,…,p are drawn from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0396.
  3. We fit the stationary vector autoregressive model urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0397, a VAR(1) model, to the data urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0398 to obtain the multivariate least squares estimators of μ and Φ, and we estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0399. All eigenvalues of Φ in Table 2 fall within the unit circle, so our model is stationary. The covariance matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0400 can then be obtained by solving the linear equation urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0401 (a discrete Lyapunov equation; see the sketch after Table 2). The estimated parameters are reported in Table 2 and are used to generate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0402.
  4. For each value of p, we generate a sparse covariance matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0403 of the form
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0404
    Here, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0405 is the error correlation matrix, and D is the diagonal matrix of the standard deviations of the errors. We set urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0406, where each urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0407 is generated independently from a gamma distribution G(α,β), with α and β chosen to match the sample mean and sample standard deviation of the standard deviations of the errors; a similar calibration was used by Fan et al. (2011a). The off-diagonal entries of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0408 are generated independently from a normal distribution, with mean and standard deviation equal to the sample mean and sample standard deviation of the sample correlations between the estimated residuals, conditional on their absolute values being no larger than 0.95. We then apply hard thresholding to make urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0409 sparse, where the threshold is the smallest constant that guarantees the positive definiteness of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0410: starting from threshold value 1, which gives urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0411, we decrease the threshold along a grid and retain the last value before positive definiteness is violated (as sketched after Table 2).
Table 1. Mean and covariance matrix used to generate b (first column, the mean vector urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0412; remaining columns, the covariance matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0413)

urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0412    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0413
 0.0047     0.0767   −0.00004    0.0087
 0.0007    −0.00004   0.0841     0.0013
−1.8078     0.0087    0.0013     0.1649
Table 2. Parameters of the urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0414-generating process (first column, μ; next three columns, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0415; last three columns, Φ)

μ          urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0415         Φ
−0.0050    1.0037   0.0011  −0.0009    −0.0712   0.0468   0.1413
 0.0335    0.0011   0.9999   0.0042    −0.0764  −0.0008   0.0646
−0.0756   −0.0009   0.0042   0.9973     0.0195  −0.0071  −0.0544
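A sketch of calibration steps 3 and 4 under stated assumptions: for a stationary VAR(1) model, the factor covariance solves the discrete Lyapunov equation Sigma_f = Phi Sigma_f Phi' + Sigma_eps, and the sparse error covariance is built by hard thresholding the calibrated correlation matrix at the smallest grid level that preserves positive definiteness (the function and argument names are illustrative):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def stationary_var_cov(Phi, Sigma_eps):
    # solves Sigma_f = Phi Sigma_f Phi' + Sigma_eps (step 3)
    return solve_discrete_lyapunov(Phi, Sigma_eps)

def sparse_error_cov(stds, R0, grid=np.linspace(1.0, 0.0, 101), eps=1e-8):
    # hard threshold the dense correlation matrix R0 at the smallest level
    # on the grid that keeps it positive definite (step 4)
    D = np.diag(stds)
    keep = None
    for thr in grid:                      # decrease the threshold from 1
        R = np.where(np.abs(R0) >= thr, R0, 0.0)
        np.fill_diagonal(R, 1.0)
        if np.linalg.eigvalsh(R)[0] > eps:
            keep = R                      # still positive definite
        else:
            break                         # violated: retain the previous value
    return D @ keep @ D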

6.2. Simulation

For the simulation, we fix T=300 and let p increase from 1 to 600. For each fixed p, we repeat the following steps (one replication is sketched in code after the list) N=200 times, and record the means and standard deviations of each respective norm.

  • Step 1: generate independently urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0416, and set urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0417
  • Step 2: generate independently urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0418.
  • Step 3: generate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0419 as a vector auto‐regressive sequence of the form urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0420.
  • Step 4: calculate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0421 from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0422.
  • Step 5: apply hard thresholding with threshold urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0423; estimate K by using IC1 of Bai and Ng (2002); calculate the covariance estimators by using POET; calculate the sample covariance matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0424.
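A condensed sketch of one replication of steps 1–5; poet and bai_ng_K are assumed estimator routines (POET and the IC1 criterion of Bai and Ng (2002)) rather than code given in this paper, and the parameters are those calibrated in Section 6.1:

import numpy as np

def gen_var1(mu, Phi, Sigma_eps, T, rng, burn=100):
    # simulate a stationary VAR(1): f_t = mu + Phi f_{t-1} + eps_t
    K = len(mu)
    f, out = np.zeros(K), np.empty((K, T))
    for t in range(-burn, T):                    # burn-in for stationarity
        f = mu + Phi @ f + rng.multivariate_normal(np.zeros(K), Sigma_eps)
        if t >= 0:
            out[:, t] = f
    return out

def one_replication(p, T, mu_B, Sigma_B, Sigma_u, mu_f, Phi, Sigma_eps,
                    poet, bai_ng_K, rng):
    B = rng.multivariate_normal(mu_B, Sigma_B, size=p)             # step 1: loadings
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=T).T    # step 2: errors
    F = gen_var1(mu_f, Phi, Sigma_eps, T, rng)                     # step 3: factors
    Y = B @ F + U                                                  # step 4: returns
    K_hat = bai_ng_K(Y)                                            # step 5: number of factors
    Yc = Y - Y.mean(axis=1, keepdims=True)
    return poet(Y, K_hat), Yc @ Yc.T / T                           # POET and sample covariance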

In the graphs below, we plot the averages and standard deviations of the distances from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0425 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0426 to the true covariance matrix Σ under the norms urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0427, ‖·‖ and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0428. We also plot the means and standard deviations of the distances from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0429 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0430 to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0431 under the spectral norm. The dimensionality p ranges from 20 to 600 in increments of 20. Because the sample covariance matrix is no longer invertible once p reaches T=300, the spectral norm for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0432 is plotted only up to p=280. We also zoom into these graphs by plotting the values of p from 1 to 100, this time in increments of 1. In addition, we plot the distance from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0433 to Σ for comparison, where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0434 is the covariance estimator proposed by Fan et al. (2011a), which assumes that the factors are observable.

6.3. Results

In a factor model, we expect POET to perform as well as urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0435 when p is relatively large, since the effect of estimating the unknown factors should vanish as p increases. This is illustrated in the plots.

From the simulation results, reported in Figs 2–5, we observe that, when p is sufficiently large, POET under the unobservable factor model performs just as well as the estimator of Fan et al. (2011a) that assumes known factors. The cost of not knowing the factors is approximately of order urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0436; it can be seen in Figs 2 and 3 that this cost vanishes for p≥200. To give better insight into the effect of estimating the unknown factors for small p, a separate set of simulations was conducted for p≤100. As we can see from Figs 2(b) and 2(d) and 3(b), 3(c), 3(e) and 3(f), the effect decreases quickly. In addition, when estimating urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0437, it is difficult to distinguish the estimators with known and unknown factors, whose performances are quite stable compared with the sample covariance matrix. Also, the maximum absolute elementwise error of our estimator (Fig. 4) is very similar to that of the sample covariance matrix, which coincides with our asymptotic result. Fig. 5 shows that the performances of the three methods are indistinguishable under the spectral norm, as expected.

Fig. 2. (a), (b) Averages and (c), (d) standard deviations of the relative error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0438 with known factors (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0439), POET (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0440) and the sample covariance (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0441) over 200 simulations, as a function of the dimensionality p: (a), (c) p ranges over 20–600 with increment 20; (b), (d) p ranges over 1–100 with increment 1
Fig. 3. (a)–(c) Averages and (d)–(f) standard deviations of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0442 with known factors (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0443), POET (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0444) and the sample covariance (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0445) over 200 simulations, as a function of the dimensionality p: (a), (d) p ranges over 20–600 with increment 20; (b), (e) p ranges over 1–100 with increment 1; (c), (f) the same as (a) and (d) with the sample covariance curve omitted
Fig. 5. (a) Averages and (b) standard deviations of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0451 with known factors (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0452), POET (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0453) and the sample covariance (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0454) over 200 simulations, as a function of the dimensionality p: the three curves are nearly undifferentiable
Fig. 4. Averages of (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0446 and (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0447 with known factors (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0448), POET (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0449) and the sample covariance (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0450) over 200 simulations, as a function of the dimensionality p: the three curves are barely distinguishable in (a)

6.4. Robustness to the estimation of K

POET depends on the estimated number of factors. Our theory assumes a consistent estimator urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0455. To assess the robustness of our procedure to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0456 in finite samples, we calculate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0457 for K=1,2,…,10. Again, the threshold is fixed at urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0458.

6.4.1. Design 1

The simulation set-up is the same as before, where the true urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0459. We calculate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0460, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0461, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0462 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0463 for K=1,2,…,10. Fig. 6 plots these norms as p increases, with T=300 fixed. The results demonstrate a trend that is quite robust for K≥3; in particular, the estimation accuracies under the spectral norm are close to each other for large p. When K=1 or K=2, the estimators perform badly because of modelling bias. POET is therefore robust to overestimation of K, but not to underestimation.

Fig. 6. Robustness to the choice of K as p increases (design 1, T=300): (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0464; (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0465; (c) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0466; (d) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0467

6.4.2. Design 2

We also simulated from a new data‐generating process for the robustness assessment. Consider a banded idiosyncratic matrix
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0468
We still consider a urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0469 factor model, where the factors are independently simulated as
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0470

Table 3 summarizes the average estimation errors of the covariance matrices across K under the spectral norm. Each simulation is replicated 50 times, with T=200.

Table 3. Robustness of K: design 2, estimation errors in spectral norm†
Errors for the following values of K:
1 2 3 4 5 6 8
p=100
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0471 10.70 5.23 1.63 1.80 1.91 2.04 2.22
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0472 2.71 2.51 1.51 1.50 1.44 1.84 2.82
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0473 2.69 2.48 1.47 1.49 1.41 1.56 2.35
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0474 94.66 91.36 29.41 31.45 30.91 33.59 33.48
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0475 17.37 10.04 2.05 2.83 2.94 2.95 2.93
p=200
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0476 11.34 11.45 1.64 1.71 1.79 1.87 2.01
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0477 2.69 3.91 1.57 1.56 1.81 2.26 3.42
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0478 2.67 3.72 1.57 1.55 1.70 2.13 3.19
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0479 200.82 195.64 57.44 63.09 64.53 60.24 56.20
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0480 20.86 14.22 3.29 4.52 4.72 4.69 4.76
p=300
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0481 12.74 15.20 1.66 1.71 1.78 1.84 1.95
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0482 7.58 7.80 1.74 2.18 2.58 3.54 5.45
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0483 7.59 7.49 1.70 2.13 2.49 3.37 5.13
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0484 302.16 274.12 87.92 92.47 91.90 83.21 92.50
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0485 23.43 16.89 4.38 6.04 6.16 6.14 6.20
†True K=3.

Table 3 illustrates some interesting patterns. First, the best estimation accuracy is achieved when urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0486. Second, the estimation is robust for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0487: as K increases from urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0488, the estimation error grows, but only slowly in general, which indicates robustness when a slightly larger K has been used. Third, when the number of factors is underestimated, corresponding to K=1,2, all the estimators perform badly, which demonstrates the danger of omitting any common factors. Therefore overestimating the number of factors, while still maintaining a satisfactory accuracy of estimation, is much better than underestimating it; the bias caused by underestimation is more severe than the additional variance introduced by overestimation. Finally, estimating Σ, the covariance of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0489, does not achieve good accuracy even when urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0490 in terms of the absolute error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0491, but the relative error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0492 is much smaller. This is consistent with our discussion in Section 3.3.

6.5. Comparisons with other methods

6.5.1. Comparison with related methods

We compare POET with related methods that address low rank plus sparse covariance estimation: the low rank and sparse covariance estimator LOREC of Luo (2011), the strict factor model (SFM) of Fan et al. (2008), the dual method (Dual) of Lin et al. (2009) and the singular value thresholding method (SVT) of Cai et al. (2008). In particular, SFM is a special case of POET that employs a threshold large enough to force urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0493 to be diagonal even when the true urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0494 might not be. Note that Dual, SVT and many other methods dealing with low rank plus sparseness, such as Candès et al. (2011) and Wright et al. (2009), assume a known Σ and focus on recovering the decomposition; hence they do not estimate Σ or its inverse, but decompose the sample covariance into two components. The resulting sparse component may not be positive definite, which can lead to large estimation errors for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0495 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0496.

Data are generated from the same set-up as design 2 in Section 6.4. Table 4 reports the average estimation errors of the five methods compared, calculated from 50 replications of each simulation. Dual and SVT assume that the data matrix has a low rank plus sparse representation, which is not the case for the sample covariance matrix (though the population Σ has such a representation). The tuning parameters of POET, LOREC, Dual and SVT are chosen to achieve the best performance for each method. (We used the R package for LOREC developed by Luo (2011) and the MATLAB code for Dual and SVT provided on Yi Ma's Web site ‘Low-rank matrix recovery and completion via convex optimization’ at the University of Illinois. The tuning parameters of each method were chosen to minimize the sum of relative errors urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0497. We have also written an R package for POET.)

Table 4. Method comparison under spectral norm for T=100†
Method urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0498 urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0499 RelE urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0500 urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0501
p=100
POET 1.624 1.336 2.080 1.309 29.107
LOREC 2.274 1.880 2.564 1.511 32.365
SFM 2.084 2.039 2.707 2.022 34.949
Dual 2.306 5.654 2.707 4.674 29.000
SVT 2.59 13.64 2.806 103.1 29.670
p=200
POET 1.641 1.358 3.295 1.346 58.769
LOREC 2.179 1.767 3.874 1.543 62.731
SFM 2.098 2.071 3.758 2.065 60.905
Dual 2.41 6.554 4.541 5.813 56.264
SVT 2.930 362.5 4.680 47.21 63.670
p=300
POET 1.662 1.394 4.337 1.395 65.392
LOREC 2.364 1.635 4.909 1.742 91.618
SFM 2.091 2.064 4.874 2.061 88.852
Dual 2.475 2.602 6.190 2.234 74.059
SVT 2.681 urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0502 6.247 urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0503 80.954
†RelE represents the relative error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0504.

6.5.2. Comparison with direct thresholding

This section compares POET with direct thresholding of the sample covariance matrix without taking out the common factors (Rothman et al., 2009; Cai and Liu, 2011); we denote this method by THR. We also run simulations to demonstrate the finite sample performance when Σ itself is sparse and has bounded eigenvalues, corresponding to the case K=0. Three models are considered; both POET and THR use soft thresholding. We fix T=200. The reported results are averages over 100 replications.
  1. Model 1, one‐factor: the factors and loadings are independently generated from N(0,1). The error covariance is the same banded matrix as design 2 in Section 6.4. Here Σ has one diverging eigenvalue.
  2. Model 2, sparse covariance: set K=0; hence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0505 itself is a banded matrix with bounded eigenvalues.
  3. Model 3, cross-sectional AR(1): set K=0, but urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0506. Now Σ is no longer sparse (or banded), but it is not too dense either, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0507 decreases to 0 exponentially fast as |i−j|→∞. This is the correlation matrix if urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0508 follows a cross-sectional AR(1) process: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0509. (Illustrative constructions of the banded and AR(1) designs are sketched after this list.)
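Illustrative constructions of the banded and the cross-sectional AR(1) designs; the band width and the AR coefficient below are hypothetical stand-ins for the values fixed in the expressions above:

import numpy as np

def banded_cov(p, rho=0.4):
    # banded matrix with first off-diagonal rho; diagonally dominant, and
    # hence positive definite, for |rho| < 0.5
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.where(d <= 1, rho ** d, 0.0)

def ar1_corr(p, rho=0.5):
    # cross-sectional AR(1) correlation rho^{|i-j|}: not sparse, but its
    # entries decay to 0 exponentially fast in |i-j|
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return rho ** d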

For each model, POET uses urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0510 estimated by IC1 of Bai and Ng (2002), whereas THR thresholds the sample covariance directly. We find that, in model 1, POET performs significantly better than THR, as the latter misses the common factor. In model 2, IC1 estimates urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0511 precisely in each replication, and hence POET is identical to THR. In model 3, POET still outperforms THR. The results are summarized in Table 5.

Table 5. Method comparison for T=200†

Model    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0512 (POET / THR)    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0513 (POET / THR)    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0514
p=200
1 26.20 240.18 1.31 2.67 1
2 2.04 2.04 2.07 2.07 0
3 7.73 11.24 8.48 11.40 6.2
p=300
1 32.60 314.43 2.18 2.58 1
2 2.03 2.03 2.08 2.08 0
3 9.41 11.29 8.81 11.41 5.45
†The reported numbers are averages over 100 replications.

6.6. Simulated portfolio allocation

We demonstrate the improvement of our method over the sample covariance matrix and over the estimator based on the strict factor model in a problem of portfolio allocation for risk minimization.

Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0515 be a generic estimator of the covariance matrix of the return vector urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0516, and let w be the allocation vector of a portfolio consisting of the corresponding p financial securities. Then the theoretical and the empirical risks of this portfolio are urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0517 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0518 respectively. Now define
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0519
the estimated (minimum variance) portfolio. The actual risk of the estimated portfolio is defined as urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0520, and the estimated risk (also called the empirical risk) equals urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0521. In practice, the actual risk is unknown; only the empirical risk can be calculated.
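A short sketch of these quantities, using the standard closed form w = Σ̂^{-1}1/(1′Σ̂^{-1}1) for the unconstrained minimum variance problem; Sigma is the true covariance matrix, Sigma_hat a generic estimate, and the risks are reported as portfolio variances:

import numpy as np

def min_variance_weights(Sigma_hat):
    # standard closed form: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
    ones = np.ones(Sigma_hat.shape[0])
    x = np.linalg.solve(Sigma_hat, ones)
    return x / (ones @ x)

def actual_and_empirical_risk(Sigma_hat, Sigma):
    w = min_variance_weights(Sigma_hat)
    return w @ Sigma @ w, w @ Sigma_hat @ w    # actual risk, empirical risk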

For each fixed p, the population Σ is generated as described in Section 6.1, with a sparse but non-diagonal error covariance. We use three different methods to estimate Σ and to obtain urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0522: the strict factor model urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0523 (which estimates urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0524 by a diagonal matrix), our POET-estimator urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0525 (both with unknown factors) and the sample covariance urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0526. We then calculate the corresponding actual and empirical risks.

It is interesting to examine the actual risk of our portfolio urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0527 in comparison with the oracle risk urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0528, the theoretical risk of the portfolio that we would have created if we knew the true covariance matrix Σ. We therefore compare the regret urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0529, which is always non-negative, for the three estimators of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0530; they are summarized by boxplots over the 200 simulations, reported in Fig. 7. In practice, we are also concerned about the difference between the actual and the empirical risk of the chosen portfolio urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0531. Hence, in Fig. 8, we also compare the average estimation error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0532 and the average relative estimation error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0533 over the 200 simulations. When urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0534 is obtained from the strict factor model, both differences (between the actual and the oracle risk, and between the actual and the empirical risk) are persistently greater than the corresponding differences for the approximate factor estimator. Also, in terms of the relative estimation error, the factor-model-based method yields a negligible error, whereas the sample covariance does not.

Fig. 7. Boxplots of the regrets urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0535 for (a) p=80 and (b) p=140: in each panel, the boxplots from left to right correspond to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0536 obtained by using urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0537 based on the approximate factor model, the SFM and the sample covariance
Fig. 8. Estimation errors for risk assessment as a function of the portfolio size p (POET; SFM; sample covariance): (a) average absolute error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0538; (b) average relative error urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0539 (here, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0540 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0541 are obtained on the basis of the three estimators of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0542)

7. Real data example

We demonstrate the sparsity of the approximate factor model on real data, and we present the improvement of POET over the SFM in a real world application of portfolio allocation.

7.1. Sparsity of idiosyncratic errors

The data were obtained from the Center for Research in Security Prices database and consist of p=50 stocks and their annualized daily returns for the period January 1st, 2010–December 31st, 2010 (T=252). The stocks are chosen from five different industry sectors (more specifically, ‘consumer goods—textiles and apparel clothing’, ‘financial—credit services’, ‘healthcare—hospitals’, ‘services—restaurants’ and ‘utilities—water utilities’), with 10 stocks from each sector. We made this selection to demonstrate a block diagonal pattern in the sparsity: the non-zero elements cluster mainly among companies in the same industry. We also note that these are the same groups that show predominantly positive correlation.

The three largest eigenvalues of the sample covariance matrix equal 0.0102, 0.0045 and 0.0039, whereas the rest are bounded by 0.0020; hence K=0, 1, 2, 3 are the plausible values of the number of factors. Fig. 9 shows the heat map of the thresholded error correlation matrix (for simplicity, we applied hard thresholding), with the threshold chosen by cross-validation as described in Section 4. We compare the level of sparsity (the percentage of non-zero off-diagonal elements) of the five diagonal blocks of size 10×10 with the sparsity of the rest of the matrix. For K=2, our method results in 25.8% non-zero off-diagonal elements in the five diagonal blocks, as opposed to 7.3% non-zero elements in the rest of the covariance matrix. Note that, of the non-zero elements in the central five blocks, 100% are positive, as opposed to a distribution of 60.3% positive and 39.7% negative among the non-zero elements in the off-diagonal blocks. There is a strong positive correlation between the returns of companies in the same industry after the common factors have been taken out, and the thresholding has preserved it. The results for K=1, 2, 3 show the same characteristics. These findings provide stark evidence that the strict factor model is not appropriate.

Fig. 9. Heat maps of the thresholded error correlation matrix for number of factors (a) K=0, (b) K=1, (c) K=2 and (d) K=3

7.2. Portfolio allocation

We extend our data set to larger industrial portfolios (p=100) and a longer period (10 years) of annualized daily excess returns, from January 1st, 2000, to December 31st, 2010. Two portfolios are created at the beginning of each month, based on two different covariance estimates obtained from the approximate and the strict factor models with unknown factors. At the end of each month, we compare the risks of the two portfolios.

The number of factors is determined by using the penalty function proposed by Bai and Ng (2002), as defined in expression (2.14). For calibration, we use the last 100 consecutive business days of the above data; both IC1 and IC2 give urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0543. On the first trading day of each month, we estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0544 (method SFM) and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0545 (POET with soft thresholding) by using the historical excess daily returns of the preceding 12 months (T=252), with the value of the threshold determined by the cross-validation procedure. We minimize the empirical risk of both portfolios to obtain the two respective optimal portfolio allocations urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0546 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0547 (based on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0548 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0549): urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0550. At the end of the month (21 trading days), their actual risks are compared, calculated by
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0551

We can see from Fig. 10 that the minimum risk portfolio created by POET performs significantly better, achieving a lower variance 76% of the time. In those months, the risk is decreased by 48.63%. In contrast, in the months in which POET produces the higher risk portfolio, the risk is increased by only 17.66%.

Fig. 10. Risks of the portfolios created with POET and SFM

Next, we demonstrate the effect of the choices of the number of factors and of the threshold on the performance of POET. If cross-validation seems computationally expensive, we can choose a common soft threshold throughout the whole investment process. The average constant from the cross-validation was 0.53, which is close to our suggested constant 0.5 used for the simulations. We also present the results based on various choices of the constant C=0.25, 0.5, 0.75, 1, with soft threshold urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0552. The results are summarized in Table 6. The performance of POET is consistent across these choices of the parameters.

Table 6. Comparisons of the risks of portfolios by using POET and SFM†
C    Results for the following values of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0553: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0554, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0555, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0556
0.25 0.58/29.6% 0.68/38% 0.71/33%
0.5 0.66/31.7% 0.70/38.2% 0.75/33.5%
0.75 0.68/29.3% 0.70/29.6% 0.71/25.1%
1 0.66/20.7% 0.62/19.4% 0.69/18%
†The first number is the proportion of the time that POET outperforms, and the second number is the percentage of average risk improvement; C represents the constant in the threshold.

8. Conclusion and discussion

We study the problem of estimating a high dimensional covariance matrix with conditional sparsity. Realizing that an unconditional sparsity assumption is inappropriate in many applications, we introduce a latent factor model that has a conditional sparsity feature and propose POET to take advantage of this structure. This considerably expands the scope of the strict factor model, which assumes independent idiosyncratic noise and is too restrictive in practice. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after the common factors are taken out. The sparse covariance is estimated by the adaptive thresholding technique.

It is found that the rates of convergence of the estimators contain an extra term, approximately urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0557, in addition to the rates based on observable factors in Fan et al. (2008, 2011a); this term arises from the effect of estimating the unobservable factors. This effect vanishes as the dimensionality increases, as more information about the common factors becomes available. When p grows sufficiently large, the effect of estimating the unknown factors is negligible, and we can estimate the covariance matrices as if the factors were known.

The proposed POET also has wide applicability in statistical genomics. For example, Carvalho et al. (2008) applied a Bayesian sparse factor model to study breast cancer hormonal pathways. Their real data results identified about two common factors on which a large fraction of the genes (about half of 250) load highly. Consequently, these factors should be treated as ‘pervasive’ (see the explanation in example 1 in Section 2.1.1), and they result in one or two very spiked eigenvalues of the covariance matrix of the gene expressions. POET can be applied to estimate such a covariance matrix and its network model.

Acknowledgements

The research was partially supported by National Institutes of Health grants R01GM100474‐01 and R01‐GM072611, grant DMS‐0704337 and the Bendheim Center for Finance at Princeton University. The bulk of the research was carried out while Yuan Liao was a postdoctoral fellow at Princeton University.

    Appendix A: Estimating a sparse covariance with contaminated data

    We estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0558 by applying the adaptive thresholding given by expression (2.11). However, the task here differs slightly from the standard problem of estimating a sparse covariance matrix in the literature, as no direct observations of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0559 are available; the original data are contaminated, as is the case for any type of estimate of the data when direct observations are unavailable. This typically happens when urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0560 represent the error terms in regression models, or when the data are subject to measurement errors. Instead, we may observe urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0561. For instance, in the approximate factor models, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0562.

    We can estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0563 by using the adaptive thresholding proposed by Cai and Liu (2011): for the threshold urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0564, define
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0565
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0566(A.1)
    where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0567 satisfies, for all urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0568, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0569 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0570
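    A sketch, in the spirit of the adaptive thresholding of Cai and Liu (2011), with soft thresholding standing in for the general rule urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0567 and with entry dependent thresholds proportional to the estimated standard deviation of each sample covariance entry (the scaling and the names are illustrative):

    import numpy as np

    def adaptive_threshold_cov(U, C, omega):
        # U: p x T matrix of (possibly estimated) error observations
        p, T = U.shape
        Uc = U - U.mean(axis=1, keepdims=True)
        S = Uc @ Uc.T / T                                  # sample covariance
        # theta_hat_ij = (1/T) sum_t (u_it u_jt - s_ij)^2
        theta = ((Uc[:, None, :] * Uc[None, :, :] - S[:, :, None]) ** 2).mean(axis=2)
        tau = C * omega * np.sqrt(theta)                   # entry dependent thresholds
        off = S - np.diag(np.diag(S))
        soft = np.sign(off) * np.maximum(np.abs(off) - tau, 0.0)
        return np.diag(np.diag(S)) + soft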

    When urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0571 is sufficiently close to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0572, we can show that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0573 is also consistent. The following theorem extends the standard thresholding results of Bickel and Levina (2008) and Cai and Liu (2011) to the case in which no direct observations are available, or the original data are contaminated. For the tail and mixing parameters urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0574 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0575 defined in assumptions 2 and 3, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0576.

    Theorem 5. Suppose that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0577 and that assumptions 2 and 3 hold. In addition, suppose that there is a sequence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0578 such that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0579 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0580. Then there is a constant C>0 in the adaptive thresholding estimator (A.1) with

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0581
    such that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0582

    If further urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0583, then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0584 is invertible with probability approaching 1, and

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0585

    Proof. By assumptions 2 and 3, the conditions of lemmas A.3 and A.4 of Fan et al. (2011a) are satisfied. Hence, for any ɛ>0, there are positive constants urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0586 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0587 such that each of the events

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0588
    occurs with probability at least 1−ɛ. By the condition on the threshold function, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0589. Now, for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0590, under the event urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0591,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0592

    Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0593. Then, with probability at least 1−2ɛ, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0594. Since ɛ is arbitrary, we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0595. If, in addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0596, then the minimum eigenvalue of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0597 is bounded away from 0 with probability approaching 1 since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0598. This then implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0599.

    Appendix B: Proofs for Section 2

    We first cite two useful results, which are needed to prove propositions 1 and 2. In lemma 1 below, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0600 be the eigenvalues of Σ in descending order and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0601 their associated eigenvectors. Correspondingly, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0602 be the eigenvalues of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0603 in descending order and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0604 their associated eigenvectors.

    Lemma 1.

    1. (Weyl's theorem) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0605.
    2. ( sin (θ) theorem; Davis and Kahan (1970)):
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0606

    B.1. Proof of proposition 1

    Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0607 are the eigenvalues of Σ and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0608 are the first K eigenvalues of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0609 (the remaining p−K eigenvalues are 0), by Weyl's theorem, for each j ≤ K,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0610

    For j>K, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0611. However, the first K eigenvalues of BB′ are also the eigenvalues of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0612. By the assumption, the eigenvalues of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0613 are bounded away from 0. Thus, when j ≤ K, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0614 are bounded away from 0 for all large p.

    B.2. Proof of proposition 2

    Applying the  sin (θ) theorem yields
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0615

    For a generic constant c>0, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0616 for all large p, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0617 but urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0618 is bounded, by proposition 1. However, if j<K, the same argument implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0619. If j=K, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0620, where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0621 is bounded away from 0 but urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0622. Hence, again, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0623.

    B.3. Proof of theorem 1

    The sample covariance matrix of the residuals by using the least squares method is given by
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0624
    where we used the normalization conditions urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0625 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0626. If we show that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0627, then, from the decompositions of the sample covariance,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0628
    we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0629. Consequently, applying thresholding on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0630 is equivalent to applying thresholding on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0631, which gives the desired result.

    We now show that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0632 indeed holds. Consider again the least squares problem (2.8), but with the following alternative normalization constraints: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0633, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0634 is diagonal. Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0635 be the solution of the new optimization problem. Switching the roles of B and F, the solution of problem (2.10) is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0636 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0637; in addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0638. From urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0639, it follows that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0640.

    Appendix C: Proofs for Section 3

    We prove theorems 4, 2 and 3, in that order.

    C.1. Preliminary lemmas

    The following results are used repeatedly in what follows. The proofs of lemmas 2, 3 and 4 can be found in Fan et al. (2011a).

    Lemma 2. Suppose that A and B are symmetric positive semidefinite matrices, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0641 for a sequence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0642. If urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0643, then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0644, and

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0645

    Lemma 3. Suppose that the random variables urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0646 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0647 both satisfy the exponential-type tail condition: there exist urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0648, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0649 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0650 such that, ∀ s>0,

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0651
    Then, for some urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0652 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0653, and any s>0,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0654(C.1)

    Lemma 4. Under the assumptions of theorem 2,

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0655,
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0656 and
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0657.

    Lemma 5. Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0658 denote the Kth largest eigenvalue of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0659; then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0660 with probability approaching 1, for some urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0661.

    Proof. First, by proposition 1, under assumption 1 the Kth largest eigenvalue urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0662 of Σ satisfies, for some c>0,

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0663
    for sufficiently large p. By Weyl's theorem, we need only prove that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0664. Without loss of generality, we prove the result under the identifiability condition (2.1). Using model (1.2), urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0665. Using this and model (1.3), urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0666 can be decomposed as the sum of the four terms
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0667

    We now deal with them term by term. We shall repeatedly use the fact that, for a p×p matrix A,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0668
    First, by lemma 4, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0669, which is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0670 if K log(p)=o(T). Consequently, by assumption 1, we have
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0671
    We now deal with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0672. It follows from lemma 4 that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0673
    Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0674, it remains to deal with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0675, which is bounded by
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0676
    which is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0677 since log(p)=o(T).
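    For the reader's convenience, the appeal to Weyl's theorem in the proof above is the standard eigenvalue perturbation bound for symmetric matrices: for symmetric A and E,
    \[ |\lambda_K(A+E)-\lambda_K(A)|\leq\|E\|, \]
    so the Kth largest eigenvalues of the sample and population covariance matrices differ by at most the spectral norm of their difference.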

    Lemma 6. Under assumption 3, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0678.

    Proof. Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0679 is weakly stationary, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0680. In addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0681 for some constant M and any i and t, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0682 has an exponential tail. Hence, by Davydov's inequality (corollary 16.2.4 in Athreya and Lahiri (2006)), there is a constant C>0 such that, for all i≤p and t≤T, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0683, where α(t) is the α-mixing coefficient. By assumption 3, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0684. Thus, uniformly in T,

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0685
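    For reference, one standard form of Davydov's inequality used above (the constant varies across references) is
    \[ |\mathrm{cov}(X,Y)|\leq 8\,\alpha^{1/q}\,\|X\|_r\,\|Y\|_s, \qquad \frac{1}{q}+\frac{1}{r}+\frac{1}{s}=1, \]
    where α denotes the mixing coefficient between the σ-fields that generate X and Y.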

    C.2. Proof of theorem 4

    Our derivation below relies on a result of Bai and Ng (2002), who showed that the estimated number of factors is consistent, in the sense that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0686 equals the true K with probability approaching 1. Note that, under our assumptions 1–4, all the assumptions in Bai and Ng (2002) are satisfied. Thus we immediately have the following lemma.

    Lemma 7 (theorem 2 in Bai and Ng (2002)). For urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0687 defined in expression (2.14),

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0688

    Proof. See Bai and Ng (2002).

    Using expression (A.1) in Bai (2003), we have the identity
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0689(C.2)
    where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0690, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0691 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0692.

    We first prove some preliminary results in the following lemmas. Denote urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0693.

    Lemma 8. For all urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0694,

    (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0695,
    (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0696,
    (c) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0697 and
    (d) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0698.

    Proof.

    (a) We have, ∀i, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0699. By the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0700
    (b) By lemma 6, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0701, which then yields the result.
    (c) By the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0702
    (d) Note that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0703. By assumption 4, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0704, which implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0705 and yields the result.
    (e) By definition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0706. We first bound urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0707. Assumption 4 implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0708. Therefore, by the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0709
    (f) Similarly to part (c), noting that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0710 is a scalar, we have
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0711
      where the last line follows from the Cauchy–Schwarz inequality.

    Lemma 9.

    (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0712,
    (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0713,
    (c) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0714 and
    (d) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0715.

    Proof.

    (a) By the Cauchy–Schwarz inequality and the fact that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0716,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0717
      The result then follows from assumption 3.
    (b) By the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0718
      It follows from assumption 4 that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0719. It then follows from Chebyshev's inequality and Bonferroni's method that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0720.
    (c) By assumption 4, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0721. Chebyshev's inequality and Bonferroni's method yield urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0722 with probability 1, which then implies
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0723
    (d) By the Cauchy–Schwarz inequality and assumption 4, we have demonstrated that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0724. In addition, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0725, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0726. It follows that
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0727

    Lemma 10.

    (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0728.
    (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0729.
    (c) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0730.

    Proof. We prove this lemma conditionally on the event urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0731. Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0732, the unconditional results then follow.

    (a) When urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0733, by lemma 5, all the eigenvalues of V/p are bounded away from 0. Using the inequality urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0734 and identity (C.2), we have, for some constant C>0,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0735
      Each of the four terms on the right-hand side is bounded in lemma 8, which then yields the desired result.
    (b) This follows from part (a) and
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0736
    (c) This is implied by identity (C.2) and lemma 9.

    Lemma 11.

    (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0737.
    (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0738.

    Proof. We first condition on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0739.

    (a) Lemma 5 implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0740. Also urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0741. In addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0742. It then follows from the definition of H that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0743. Define urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0744. Applying the triangular inequality gives
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0745(C.3)
      By lemma 4, the first term in inequality (C.3) is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0746. The second term of inequality (C.3) can be bounded, by the Cauchy–Schwarz inequality and lemma 10, as follows:
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0747
    (b) Still conditioning on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0748, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0749 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0750, right multiplying by H gives urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0751. Part (a) also gives, conditioning on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0752, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0753. Hence further left multiplying by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0754 yields urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0755. Because urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0756, we reach the desired result.

    C.2.1. Completion of proof of theorem 4

    The second part of theorem 4 was proved in lemma 10. We now derive the convergence rate of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0757.

    Using the facts that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0758, and that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0759, we have
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0760(C.4)
    We bound the three terms on the right‐hand side. It follows from lemmas 4 and 11 that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0761
    For the second term, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0762. Therefore, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0763. The Cauchy–Schwarz inequality and lemma 10 imply
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0764

    Finally, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0765 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0766 imply that the third term is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0767.

    C.2.2. Proof of corollary 1

    Under assumption 3, it can be shown by Bonferroni's method that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0768. By theorem 4, uniformly in i and t,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0769

    C.3. Proof of theorem 2

    Lemma 12. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0770, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0771.

    Proof. We have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0772. Therefore, using the inequality urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0773, we have

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0774
    The first part of lemma 12 then follows from theorem 4 and lemma 10. The second part follows from corollary 1.

    C.3.1. Completion of proof of theorem 2

    Theorem 2 follows immediately from theorem 5 and lemma 12.

    C.4. Proof of theorem 3

    Define
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0775

    Lemma 13.

    (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0776, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0777.
    (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0778.
    (c) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0779.
    (d) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0780.

    Proof.

    (a) We have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0781. Moreover, since all the eigenvalues of Σ are bounded away from 0, for any matrix A, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0782. Hence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0783.
    (b) By theorem 2, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0784.
    (c) The same argument as in the proof of theorem 2 in Fan et al. (2008) implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0785. Thus, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0786 is bounded above by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0787.
    (d) Again, by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0788 and lemma 11,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0789(C.5)

    C.4.1. Proof of theorem 3, part (a)

    By lemma 13, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0790. Hence, for a generic constant C>0,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0791

    Lemma 14. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0792.

    Proof. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0793. Hence

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0794(C.6)

    Lemma 15. If urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0795, then, with probability approaching 1, for some c>0,

    (a) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0796,
    (b) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0797,
    (c) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0798 and
    (d) urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0799.

    Proof.

    (a) By lemma 11, with probability approaching 1, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0800 is bounded away from 0. Hence,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0801
    (b) The result follows from part (a) and lemma 14. Parts (c) and (d) follow from a similar argument to that for part (a) and lemma 11.

    C.4.2. Completion of proof of theorem 3

    We derive the rate for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0802. Define
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0803
    Note that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0804 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0805. The triangular inequality gives
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0806
    Using the Sherman–Morrison–Woodbury formula, we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0807, where
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0808(C.7)
    We bound each of the six terms. First, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0809 is bounded by theorem 2. Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0810; then
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0811
    Note that theorem 2 implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0812. Lemma 15 then implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0813. This shows that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0814. Similarly urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0815. In addition, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0816, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0817. Similarly urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0818. Finally, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0819. By lemma 15, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0820. Then, by lemma 14,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0821
    Consequently, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0822. Adding up urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0823 – urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0824 gives
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0825
    However, using the Sherman–Morrison–Woodbury formula again implies that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0826
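    For reference, the Sherman–Morrison–Woodbury formula that is invoked twice above states that, whenever the relevant inverses exist,
    \[ (A+UCV)^{-1}=A^{-1}-A^{-1}U\,(C^{-1}+VA^{-1}U)^{-1}\,VA^{-1}. \]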

    C.4.3. Completion of proof of theorem 3: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0827

    We first bound urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0828. Repeatedly using the triangular inequality yields
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0829
    However, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0830 be the (i,j) entry of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0831. Then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0832.
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0833

    Hence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0834. The result then follows immediately.

    Discussion on the paper by Fan, Liao and Mincheva

    Marc Hallin (Université Libre de Bruxelles and Princeton University)

    Fan and his colleagues are dealing with finite realizations
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0835
    of double‐indexed stochastic processes of the form urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0836. Such realizations can be seen either as a collection of p one‐dimensional time series, related to p individuals or cross‐sectional items, or as one observed time series urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0837 in dimension p. The objective is the estimation of the covariance matrix Σ of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0838. As both p and T are ‘large’, (p, T) asymptotics are appropriate. Owing to the effect of a common environment (unobserved ‘factors’ or ‘common shocks’ generating a large number of cross‐covariances), the traditional assumption of a sparse Σ is unlikely to hold. However, the same assumption becomes reasonable once the effect of unobserved factors or common shocks has been removed. The main challenge, thus, consists in removing that effect—which is highly non‐trivial, as those factors or common shocks are not observed and are not even well defined.

    Factor model methods in such context are tailor‐made solutions. The general idea behind factor models consists in decomposing each urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0839 into urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0840 where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0841 (Y's common component, accounting for the effect of the factors) and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0842 (Y's idiosyncratic component, on which a sparsity assumption is to be made) are unobserved mutually orthogonal (at all leads and lags) processes. Further identification constraints on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0843 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0844 of course are needed, yielding a variety of factor models. The better urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0845 is at accounting for the effect of common factors (cross‐correlations), the more plausible the sparsity assumption on the covariance of the urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0846.

    The identification assumption that was chosen by the authors requires urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0847 to be of the form urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0848, where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0849 is some latent K‐dimensional vector process (the factors)—yielding a static (approximate) factor model of the type studied by Chamberlain and Rothschild (1983), Bai and Ng (2002), Stock and Watson (2002a,b), and many others. The authors then successfully propose a principal‐component‐based method for reconstructing urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0850 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0851, followed by a thresholding procedure for the estimation of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0852, and derive powerful consistency results, with rates that depend on the sparsity of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0853.

    This is a path breaking contribution to the literature on high dimensional covariance estimation. The authors should be congratulated for this, and I have no hesitation in proposing a vote of thanks. However, 'Sans la liberté de blâmer, il n'est point d'éloge flatteur' ('Praising has no value in the absence of free criticism' (Beaumarchais, Le Mariage de Figaro)), and I would like to enhance my praise of the authors' work with a couple of friendly critical comments.

    Principal components—the traditional static ones, computed from the ‘instantaneous’ covariances of the urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0855s—are a fundamental tool in the authors’ approach. Since Brillinger (1981), however, it is widely admitted that static principal components are not the adequate concept of principal components in a time series context. By maximizing normed linear combinations of the form urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0856, indeed, they completely overlook serial dependences. A static principal component with a small eigenvalue may have a negligible contemporaneous effect on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0857, but a significant effect on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0858, and hence a high predictive value: discarding it, as does the static principal component method, shifts its contribution to the idiosyncratic component, possibly jeopardizing assumed idiosyncratic sparsity.

    Static principal components of course are fine under the assumptions of the static factor model, which in turn are pertinent in the presence of independent observations (the type that the authors probably have in mind despite the time series setting of their paper: gene expressions, financial returns, etc.). They are no longer adequate in the presence of serial dependence, and this is an indication that the static factor model assumptions are unlikely to hold in a genuine time series context. The weak point of static factor assumptions is the fact that all factors are to be loaded contemporaneously at time t whereas, in most practical situations, factors are loaded with lags. A general form for the common component is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0859 instead of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0860, with K×1 loading filters urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0861 instead of the K×1 loading matrices urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0862 (equivalently, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0863, where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0864 is a K-tuple of mutually orthogonal white noises, the common shocks). Adopting this dynamic characterization of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0865 leads to the general dynamic factor model that was studied in Forni et al. (2000) or Forni and Lippi (2001). An important advantage of that dynamic model is that, in contrast with the static one, it holds, basically, without any assumption (but second-order stationarity) on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0866.

    With this dynamic specification replacing the static model, the idiosyncratic covariance matrices, as well as the lagged idiosyncratic cross‐covariance matrices, are much more likely to satisfy sparsity assumptions. Brillinger's dynamic principal components can be used (Forni et al., 2000), very much in the same way as the static principal components in the static model, to reconstruct the decomposition urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0867, then applying the thresholding technique that is recommended by the authors. (Dynamic principal components are based on the maximization of linear combinations of the form urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0868 involving the past, present and future values of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0869 and can be computed from the eigenvalues and eigenvectors of the spectral density matrices of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0870.) The resulting consistency rates will involve the sparsity of the dynamic factor idiosyncratic covariance matrix rather than that of its static factor counterpart. And, the same methods as proposed by the authors naturally apply with the more ambitious objective of estimating the full (high dimensional) autocovariance structure of the observed series.
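    To make the construction concrete, the following is a minimal R sketch of Brillinger-style dynamic principal components, not the implementation of Forni et al. (2000): it eigendecomposes a lag-window estimate of the spectral density matrix over a frequency grid. The function name, the Bartlett weights and the tuning choices (n_lags and the grid) are our own illustrative assumptions.

    dynamic_pc <- function(Y, n_lags = 10, freqs = seq(0, pi, length.out = 64)) {
      Yc <- scale(Y, center = TRUE, scale = FALSE)   # centre the T x p panel
      Tn <- nrow(Yc)
      # sample autocovariance matrices Gamma(l), l = 0, ..., n_lags
      Gamma <- lapply(0:n_lags, function(l)
        t(Yc[(1 + l):Tn, , drop = FALSE]) %*% Yc[1:(Tn - l), , drop = FALSE] / Tn)
      lapply(freqs, function(w) {
        # lag-window spectral density estimate at frequency w, Bartlett weights
        f <- Gamma[[1]]
        for (l in 1:n_lags) {
          wt <- 1 - l / (n_lags + 1)
          f <- f + wt * (Gamma[[l + 1]] * exp(-1i * l * w) +
                         t(Gamma[[l + 1]]) * exp(1i * l * w))
        }
        eigen(f / (2 * pi))   # dynamic eigenvalues and eigenvectors at w
      })
    }

    The leading eigenvectors across frequencies define the dynamic principal components; collapsing the analysis to lag 0 recovers the static principal components used by the authors.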

    Piotr Fryzlewicz and Na Huang (London School of Economics and Political Science)

    We would like to start by congratulating Professor Fan, Dr Liao and Ms Mincheva for the stimulating and thought‐provoking paper.

    The POET‐estimator is the sum of two parts: the non‐sparse, low rank part resulting from the factor model, and the sparse part arising as a result of thresholding the ‘principal orthogonal complement’. The estimator has been designed with a particular factor model in mind, and therefore it is natural to ask, firstly, whether and how one could verify this model assumption and, secondly, whether POET offers acceptable performance if the assumption does not hold.

    We may be wrong here, but we are unaware of a reliable technique for estimating the number of factors K which works well except in the most ‘textbook’ cases of the first few eigenvalues being ‘visibly’ larger than others. Even if a factor structure is present, the presence of both stronger and less strong factors may lead to the cut‐off in the eigenvalues being less obvious, in which case any inference for the number of factors may not be reliable. However, it is important to choose K correctly from the point of view of the usability of POET: the authors warn us that POET may perform poorly if K is underestimated. It is therefore tempting to ask whether POET may benefit from averaging over K as a possible guard against picking one ‘wrong’ (e.g. underestimated) value of K. Averaging may also be beneficial in cases when the factor model assumption is not satisfied.

    An appealing aspect of the construction of POET is the inclusion of the non‐sparse part (which is done in case the target matrix Σ is not sparse) and the sparse part (to ensure the invertibility of the estimator). It is tempting to consider other possible estimators along these lines. Motivated by POET, we propose an estimator of Σ of the form
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0871
    where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0872 is the p×p sample covariance matrix, δ is a constant in [0,1], λ is a p×p matrix with non‐negative entries and t(·,·) is a function that applies soft, hard or other thresholding to each non‐diagonal entry of its first argument, with the threshold value equal to the corresponding entry of its second argument. λ will typically be parameterized by one scalar parameter. Obviously, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0873 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0874 are the non‐sparse and sparse components respectively.
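    A minimal R sketch of this combination, assuming soft thresholding and a constant scalar threshold (the function names are ours):

    soft_threshold <- function(S, lambda) {
      Tm <- sign(S) * pmax(abs(S) - lambda, 0)   # soft-threshold every entry
      diag(Tm) <- diag(S)                        # leave the diagonal untouched
      Tm
    }
    novelist <- function(S, delta, lambda) {
      # delta * (non-sparse part) + (1 - delta) * (sparse part)
      delta * S + (1 - delta) * soft_threshold(S, lambda)
    }
    # e.g. novelist(cov(Y), delta = 0.5, lambda = 1); delta = 0 recovers the
    # simple soft thresholding estimator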

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0875 performs 'shrinkage of the sample covariance towards a sparse target'. To the best of our knowledge, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0876 is a new proposal, although shrinkage towards some other targets has been studied extensively before, notably by Ledoit and Wolf (2003), who proposed shrinkage towards a one-factor target, and Schaefer and Strimmer (2005), who reviewed and discussed six commonly used targets. Some ideas for the 'optimal' choice of δ are proposed by Ledoit and Wolf (2003) and Schaefer and Strimmer (2005) and can be adopted in the context of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0877, thereby reducing the number of 'free' parameters of the procedure to the single scalar parameter of the threshold matrix λ. If all new covariance estimators were required to have 'literary' names (such as POET), we would name ours 'NOVELIST', for 'novel integration of the sample and thresholded covariance estimators'. The benefits of NOVELIST include simplicity, ease of implementation and the fact that its application avoids eigenanalysis, which is unfamiliar to many practitioners.

    We now briefly exhibit the performance of POET versus NOVELIST on a simulated covariance matrix Σ of size 100×100, available from http://stats.lse.ac.uk/fryzlewicz/testcov.RData (use load("testcov.RData") in R; the variable name is testcov). Σ was not generated from a factor model and is not sparse. The range of its diagonal elements is [3.32, 7.09], and only 56 of the non-diagonal entries are larger than 1 in absolute value. The sample size is n=100, so urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0878 itself is not invertible. In NOVELIST, we use both δ=0 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0879, and the constant matrix λ≡1. In POET, we use K=7, following the authors' advice, given in the R package POET, to choose a large K (to avoid issues related to K being underestimated), but preferably smaller than 8. Both POET and NOVELIST use soft thresholding. Other POET parameters are set to default values. The data are Gaussian, and there are N=100 repetitions. Table 7 shows the results. POET performs poorly for Σ: it is the worst in all norms by a large margin. NOVELIST with δ=0 (which reduces to the simple soft thresholding estimator) and with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0880 are difficult to tell apart in terms of their performance. However, as far as urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0881 is concerned, NOVELIST with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0882 is the best, followed by POET and then by the simple soft thresholding. The overall clear 'winner' in this example is NOVELIST with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0883.

    Table 7. Averaged (and rounded, except max-) distances to Σ (left-hand section) and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0884 (right-hand section) for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0885 with δ=0, with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0886 and for the POET-estimator, in the urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0887-, Frobenius, max- and spectral norms†

    Norm          Distances to Σ: δ=0, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0889, POET      Distances to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0888: δ=0, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0890, POET
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0891      34, 34, 61      42, 38, 41
    Frobenius     30, 32, 50      34, 30, 33
    max           2.09, 2.03, 2.29      4.88, 3.47, 3.92
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0892      10, 9, 19      17, 15, 17

    †Distances to urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0893 were multiplied by 10 before averaging. The best results are in italics.

    By way of summary, POET is an elegant construction which combines parsimony of representation in the low rank component with sparsity in the thresholded part. This brief discussion
    1. attempts to list some research questions regarding POET which we believe are worth exploring further and
    2. proposes a simple competitor.

    We found the paper a pleasure to read and thought it was written in a clear and pedagogical way. We are convinced that POET will stimulate further research in the important field of large covariance estimation. It therefore gives us great pleasure to second the vote of thanks for this paper.

    The vote of thanks was passed by acclamation.

    Wenyang Zhang (University of York) and Heng Peng (Hong Kong Baptist University)

    We congratulate Professor Fan, Dr Liao and Ms Mincheva for such a brilliant paper. We believe that this paper will have a huge influence on the estimation of covariance matrices of large size and will stimulate much further research on this topic.

    The condition of sparsity is often imposed when it comes to large matrix estimation. As the authors rightly point out, sparsity may not be appropriate in some circumstances; the covariance matrix of asset returns used in portfolio allocation is an example. It is better to impose some kind of structure on the covariance matrix on the basis of the data concerned. The matrices with the structure stated in this paper are very general and appear in many research areas, such as portfolio allocation, risk management and image analysis. The estimation method that is proposed in this paper is intuitive and easy to implement; it will become very popular in areas where large matrix estimation is needed.

    Among the many clever and stimulating ideas in this paper, we particularly appreciate the connection between principal component analysis and factor models. This connection leads to a brand new estimation of factor loadings in factor analysis and of the covariance matrix of idiosyncratic components.

    The estimation of the covariance matrix of idiosyncratic components is based on
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0894

    When p is larger than T, the sample covariance matrix is singular, and some urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0895 will be 0 when i>K+1. This problem becomes more acute when p≫T. How would this affect the estimator of the covariance matrix of idiosyncratic components? Would some kind of iteration improve the accuracy of the estimator? For example: treat urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0896 as an initial estimator of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0897, decompose urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0898 by its principal components, denote the sum of the first K terms in the decomposition by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0899, apply the thresholding rule on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0900 to obtain an improved urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0901, and continue this iterative procedure until convergence (see the sketch below).
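    Under one reading of this suggestion, and purely as an illustration (the function and all tuning choices below are ours, not the discussants' or the authors' procedure), the iteration might look as follows in R.

    iterate_poet <- function(S, K, lambda, n_iter = 50, tol = 1e-6) {
      Sigma <- S
      for (iter in 1:n_iter) {
        eig <- eigen(Sigma, symmetric = TRUE)
        V <- eig$vectors[, 1:K, drop = FALSE]
        lowrank <- V %*% diag(eig$values[1:K], K) %*% t(V)  # first K PC terms
        resid <- S - lowrank               # principal orthogonal complement
        thr <- sign(resid) * pmax(abs(resid) - lambda, 0)   # soft thresholding
        diag(thr) <- diag(resid)           # do not threshold the diagonal
        Sigma_new <- lowrank + thr
        if (max(abs(Sigma_new - Sigma)) < tol) break
        Sigma <- Sigma_new
      }
      Sigma
    }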

    The presence of spiked eigenvalues is clearly formulated in this paper; however, when the factor loadings are not available, which is the case in reality, how do we check whether there are spiked eigenvalues? Would it work simply by checking whether there is a jump among the eigenvalues of the sample covariance matrix?
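    One informal version of the check asked about, in the spirit of eigenvalue-ratio methods for choosing the number of factors (the function and the cut-off max_K are illustrative assumptions):

    check_jump <- function(Y, max_K = 20) {
      # consecutive eigenvalue ratios of the sample covariance matrix;
      # requires ncol(Y) > max_K
      ev <- eigen(cov(Y), symmetric = TRUE, only.values = TRUE)$values
      ratios <- ev[1:max_K] / ev[2:(max_K + 1)]
      which.max(ratios)   # location of the most pronounced jump
    }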

    As far as the estimation of the covariance matrix Σ is concerned, is it really necessary to have the condition of the presence of spiked eigenvalues? Would urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0902 always work regardless of whether the spiked eigenvalues exist or not? We may have missed some important points on this issue.

    The selection of K on the basis of the Akaike or Bayesian information criterion does not seem to make use of the information about the presence of spiked eigenvalues; is there any room to improve the selection by incorporating this information in the selection procedure?

    Alexei Onatski (University of Cambridge)

    My comments on this interesting paper are confined to its central assumption that the first K eigenvalues of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0903 diverge at rate O(p), whereas all the eigenvalues of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0904 are bounded as p→∞. This factor pervasiveness assumption implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0905 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0906 are good approximations to the sample covariances of the systematic and idiosyncratic components of the data respectively, which is a key to the success of POET. Unfortunately, it may be misleading in many economic and financial applications.

    For example, as Fig. 11 shows, except for i=1, there are no large gaps between eigenvalues i and i+1 of the sample covariance matrix of the excess return data that were used in Section 6. However, since, as is commonly believed, such data contain at least three factors, the factor pervasiveness assumption suggests the existence of a large gap for i≥3.

    Fig. 11. Sample covariance eigenvalues of 100 industrial portfolio data and of data simulated from the factor model calibrated as in Section 6 (boxplots based on 100 replications)

    The absence of the gap between ‘systematic’ and ‘idiosyncratic’ eigenvalues may have a negative effect on the performance of POET. Table 8 reports the mean quality of POET over 1000 replications of data simulated as in Section 6.2, but with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0907. As ρ increases from 0.4 to 0.9, the systematic–idiosyncratic gap measured by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0908 decreases from 3.7 to 1.1. For the 100 industrial portfolios data that were used in Section 6, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0909, which is best matched by the simulated data with ρ=0.8. The quality of POET dramatically deteriorates in the neighbourhood of ρ=0.8.

    Table 8. Performance of POET and shrinkage when the systematic–idiosyncratic eigenvalue gap becomes small

    Results for the following values of ρ and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0910: ρ=0.4 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0911), ρ=0.5 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0912), ρ=0.6 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0913), ρ=0.7 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0914), ρ=0.8 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0915), ρ=0.9 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0916)

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0917
      POET        0.74    1.04    1.57    2.10    3.89    8.98
      Shrinkage   1.21    1.63    2.18    2.96    4.68    9.24
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0918
      POET        7.38    10.1    18.7    43.2    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0919    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0920
      Shrinkage   6.16    9.63    14.1    22.2    39.2    90.7
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0921
      POET        0.85    0.90    1.07    1.33    2.14    6.65
      Shrinkage   0.97    1.05    1.11    1.16    1.20    1.27
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0922
      POET        7.09    9.64    17.8    39.0    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0923    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0924
      Shrinkage   4.35    5.67    7.47    10.5    16.9    41.9
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0925
      POET        20.2    20.4    20.4    21.3    19.7    21.2
      Shrinkage   20.3    20.4    20.4    21.3    19.7    21.1

    For comparison, in the rows of Table 8 marked 'shrinkage', I report the quality of the estimator that replaces POET's thresholding step by Ledoit and Wolf's (2004) linear shrinkage procedure applied to the principal orthogonal complement. The deterioration in the quality of the shrinkage estimator is not as dramatic as that of POET.

    Continuing the comparison, I computed the risk of portfolios created as in Section 7 with the shrinkage estimator. The minimum risk portfolio that was created with shrinkage had lower variance than that created with POET 51% of the time. Among those months, the risk was decreased by 19%. During the months in which POET produced a lower risk portfolio, the risk was decreased by 15%. These results indicate that POET does not dominate other simple covariance matrix estimation methods in applications where there is no clear gap between the systematic and idiosyncratic eigenvalues. Developing covariance estimation methods that would work well in such situations is an important task for future research.

    Clifford Lam and Charlie Hu (London School of Economics and Political Science)

    We congratulate Fan and his colleagues for this insightful paper. Here we suggest a method to address two concerns:
    (a) the potential underestimation of the number of factors K;
    (b) the potential non-sparseness of the estimated principal orthogonal complement.

    Point (a) is addressed by using a larger K. With the pervasive factors assumed in the paper, it is relatively easy to find such a K. However, in an analysis of macroeconomic data, for example, there can be a mix of pervasive factors and many weaker ones; see Chudik et al. (2011), Lam et al. (2011) and Lam and Yao (2012) for a general definition of weak factors.

    In Stock and Watson (2005), monthly data of p=132 US macroeconomic time series from 1959 to 2003 (n=526) were analysed. Using principal component analysis (Bai and Ng, 2002), the method in Lam et al. (2011) and a modified version called autocovariance-based factor modelling, we compute the average forecast errors of 30 monthly forecasts by using a vector auto-regressive model VAR(3) on the estimated factors from these methods with different numbers of factors r (Fig. 12). Whereas three pervasive factors decrease forecast errors sharply, including more factors, up to r=35, decreases forecast errors more slowly, showing the existence of many 'weaker' factors.

    Fig. 12. Average forecast errors for various numbers of factors r, for autocovariance-based factor modelling, the method of Lam et al. (2011) and principal component analysis

    Hence it is not always possible to have 'enough' factors for accurate thresholding of the principal orthogonal complement, which can still include contributions from many weak factors and is not sparse. Points (a) and (b) can therefore be closely related, and can be addressed if we regularize the condition number of the orthogonal complement instead of thresholding it. Whereas Won et al. (2013) restrict the extreme eigenvalues with a tuning parameter to be chosen, we use the idea of Abadir et al. (2010), whose properties unfortunately have not yet been sufficiently investigated. We are studying its theoretical properties.
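    As a purely illustrative sketch of regularizing the condition number (a simple eigenvalue-clipping rule of our own, rather than the criterion-based choice of Won et al. (2013) or the approach of Abadir et al. (2010)):

    cond_reg <- function(R, kappa) {
      # floor the eigenvalues of R so that its condition number is at most kappa
      eig <- eigen(R, symmetric = TRUE)
      lam <- pmax(eig$values, max(eig$values) / kappa)
      eig$vectors %*% diag(lam, length(lam)) %*% t(eig$vectors)
    }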

    We simulate 100 times from the panel regression model
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0926
    with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0927 being independent AR(1) processes and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0928 the standardized macroeconomic data in Stock and Watson (2005) plus independent N(0,0.2) noise. Following example 5 of the paper, we estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0929 by using different methods and plot the sum of absolute bias for estimating β by using generalized least squares against the number of factors r used (Fig. 13). Clearly regularizing on condition number leads to stabler estimators.
    Fig. 13. Sum of absolute bias (averaged over 100 simulations) for estimating β by using generalized least squares against the number of factors r used, for POET (C=0.5) and the condition-number-regularized estimator (the bias for the least squares method is constant throughout)

    Parallel to Section 7.2, we compare the risk of portfolios created by using POET and the method above. Again, Fig. 14 and Table 9 show the stabler performance of regularization on the condition number.

    Table 9. Comparison of the risks of portfolios by using POET (C=0.5) and the condition-number-regularized estimator

    K    Proportion of time POET outperforms    % of average risk improvement
    1    0.40                                   −4.07
    2    0.46                                   −2.50
    3    0.56                                    0.66
    4    0.56                                    0.71

    Fig. 14. Risk of portfolios created with POET (C=0.5) and the condition-number-regularized estimator: (a) K=1; (b) K=2; (c) K=3; (d) K=4

    Natalia Bailey (University of Cambridge), M. Hashem Pesaran (University of Southern California, Los Angeles, and Trinity College, Cambridge) and Takashi Yamagata (University of York)

    The paper's key contribution lies in tackling the problem of estimation of a large covariance matrix, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0930, when the data set examined is ‘contaminated’ with strong unobserved factors. Both cross‐sectional and time dimensions (p,T) are assumed large. Fan and his colleagues extend existing literature on regularization of such a matrix via thresholding when this is strictly sparse—e.g. Bickel and Levina (2008) and Cai and Liu (2011). Their proposed estimator, POET, is applied to the covariance matrix of residuals extracted from a regression of the original data urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0931 on K estimated factors. The errors urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0932 are considered to be weakly cross‐sectionally dependent (i.e. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0933 is sparse or urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0934).

    The main result of the paper is the order condition obtained for the norm of the deviation of POET from the true value of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0935:
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0936(1)
    where q ∈ [0,1). urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0937 is the thresholded version of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0938, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0939 when q=0. They find the same rate (1) for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0940, ensuring positive definiteness of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0941 by setting a lower bound on their thresholding parameter. Using this result they then suggest that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0942 can be used in various applications in finance, in particular in their example 6, where they consider testing α=0 in the linear asset pricing model
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0943(2)
    It is claimed that use of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0944 in the development of such a test will be valid even when p≥T. However, as shown in Pesaran and Yamagata (2012), this cannot be so even if the rate condition (1) holds. Pesaran and Yamagata (2012) conduct a test of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0945 using model (2) as the generating process and propose a test statistic which, under normality of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0946, can be written as
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0947(3)
    as p→∞ for any fixed T>K+1, where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0948 is a T×1 vector of 1s and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0949, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0950. When urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0951 is known the test is valid for any T>K+1 but, if an estimator of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0952 is inserted in expression (3), then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0953 only if p log(p)/T→0, which requires p<T.

    To illustrate this point we conducted a small Monte Carlo simulation following the set‐up in Pesaran and Yamagata (2012) where we plugged in urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0954 and the Ledoit and Wolf (2004) shrinkage estimator, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0955, as estimates of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0956 in expression (3) for a set of (p,T) combinations. As shown in Table 10, considerable size distortions are visible when either estimator is used for p>T. Size improves only when T increases.

    Table 10. Size and power of the urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0957 test in the case of models with three factors†

    Estimator    T      p=50: Size, Power      p=100: Size, Power      p=200: Size, Power
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0958
                 60     0.14, 0.78             0.19, 0.91              0.25, 0.98
                 100    0.08, 0.93             0.11, 0.98              0.14, 1.00
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0959
                 60     0.14, 0.62             0.18, 0.78              0.25, 0.92
                 100    0.11, 0.85             0.13, 0.92              0.17, 0.98
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0960
                 60     0.05, 0.58             0.04, 0.74              0.04, 0.89
                 100    0.05, 0.87             0.05, 0.95              0.05, 1.00

    †Errors are weakly cross-sectionally dependent and normally distributed. Sparseness of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0961 is defined as in Table 3 of Pesaran and Yamagata (2012) with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0962. Size: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0963 for all i=1, …, p. Power: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0964 ∼ IID N(0,1) for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0965, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0966; otherwise urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0967. The number of replications is set to 2000. 'Hard' thresholding in urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0968 is conducted by using cross-validation.
    To overcome this problem Pesaran and Yamagata (2012) propose the following simple test statistic that ignores the off‐diagonal elements of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0969:
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0970(4)
    where υ=TK−1, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0971 denotes the standard t‐ratio of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0972 in the ordinary least squares regression of individual asset returns, and
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0973(5)
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0974 I(·) is an indicator function and the threshold value urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0975 is chosen such that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0976 declines steadily with p. Size and power for this test are also summarized in Table 10 and show the ability of the test to control the size well, with high power, even if p≥T.
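    The building blocks of this test can be sketched in R as follows; this is a hedged simplification, not the exact statistic (4) of Pesaran and Yamagata (2012). Here returns is a T×p matrix, factors is T×K and υ=T−K−1; the function names are ours.

    alpha_tratios <- function(returns, factors) {
      # OLS t-ratio of the intercept, asset by asset
      apply(returns, 2, function(y)
        summary(lm(y ~ factors))$coefficients["(Intercept)", "t value"])
    }
    standardized_sum <- function(tstats, v) {
      # a squared t_v variable has mean v/(v-2) and, for v > 4,
      # variance 2 v^2 (v-1) / ((v-2)^2 (v-4)); sum the centred squares
      # across assets, ignoring cross-sectional correlation
      m <- v / (v - 2)
      s2 <- 2 * v^2 * (v - 1) / ((v - 2)^2 * (v - 4))
      sum(tstats^2 - m) / sqrt(length(tstats) * s2)
    }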

    Cinzia Viroli (University of Bologna)

    This paper is very interesting and rich with thought‐provoking themes. I would like to comment on the double formulation of the estimation problem for large covariance matrices.

    Fan and his colleagues assume that the data have been generated according to an approximate K‐factor model and suggest recovery of the covariance matrix via its decomposition into a low rank matrix and a sparse error matrix. They first address this issue by resorting to the first K principal components to estimate the factor loadings and to the thresholded principal orthogonal complement for the estimation of the idiosyncratic variance.

    They then show that the estimator has an equivalent representation by using a constrained least squares method. For each given vector F of factor scores they derive the least squares estimator of the factor loadings (as a function of F) and hence estimate the factor scores. Given both the factor scores and the factor loadings they recover the idiosyncratic factors, derive their covariance matrix and suitably threshold it.

    One of the advantages of the least squares formulation is that it generalizes and extends, to the unknown factor case, the results that the authors have obtained for known factors, both assuming a strict (Fan et al., 2008) and an approximate (Fan et al., 2011) factor model. It is also coherent with a large part of the literature on the topic and allows us to use consolidated proof strategies and model selection tools.

    However, by pursuing the least squares approach the authors seem somewhat to neglect the principal components analysis results. The way that they solve the least squares problem (first obtain the loadings and then the factor scores) is natural when the factors are known and the loadings unknown but, when both of them are unknown, the dual problem (first scores and then loadings) could be as meaningful, as in Stock and Watson (2002b). This would allow us to obtain the factor loadings as the eigenvectors of the sample covariance matrix directly, thus leading to the POET-estimator and making theorem 1 in Section 2 no longer needed. Also, the principal components analysis approach would allow us to obtain the idiosyncratic errors simply as the scores in the principal components orthogonal complement, with no need to estimate the common factor scores first. I have really appreciated the paper, but I cannot help noting that, as the paper proceeds, the poesy of POET somewhat fades. I am wondering whether there is some drawback of the principal components analysis approach that justifies the prevalence of least squares.
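    A short R sketch of this dual route, with our own naming and the centring step made explicit (an illustration, not the authors' implementation):

    dual_pca <- function(Y, K) {
      Yc <- scale(Y, center = TRUE, scale = FALSE)
      V <- eigen(cov(Yc), symmetric = TRUE)$vectors[, 1:K, drop = FALSE]
      common <- Yc %*% V %*% t(V)   # projection onto the span of the first K PCs
      list(loadings = V, common = common, idiosyncratic = Yc - common)
    }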

    Angela Montanari (University of Bologna)

    This is a very interesting paper, full of stimuli in all its parts.

    One of the most relevant aspects is the identification of sparseness of the idiosyncratic covariance matrix as a sufficient condition (together with factor pervasiveness) for the identifiability of an approximate factor model and for its estimation through principal components analysis (PCA).

    This result, besides further justifying Chamberlain and Rothschild's (1983) result (which required bounded idiosyncratic eigenvalues only), also provides a clear indication of the kind of dependence structure which the model can capture. And this is very important from an empirical point of view, as the examples in Section 5 clearly show.

    Sparseness of the idiosyncratic covariance (together with diverging p) also offers PCA a sort of ground for revenge as an estimation method for a factor model. For finite p, and under the strict factor model assumption, it is well known that PCA performs poorly as an estimation method, since it generates correlated errors; but, for diverging p and under the approximate factor model assumption, this paper shows that PCA represents a natural and theoretically grounded estimation instrument. I would speak, for PCA, of a blessing of dimensionality.

    Within this coherent framework, anyway, I feel a little uneasy with the empirical application in Section 7.

    Fifty series, related to as many stocks (chosen from five different industry sectors), and their annualized daily returns for T=252 days are considered. Fan and his colleagues identify three relevant factors; they estimate the factor loadings through the first three PCs of the sample covariance matrix and finally obtain the thresholded error correlation matrix. This matrix shows that a strong positive correlation between the returns of companies in the same industry is still present after taking out the common factors, and from this the authors conclude that it provides strong evidence that the strict factor model is inappropriate.

    In this case p is not very large with respect to T. My feeling is that we are still dealing with the finite p situation in which, as already said, PCA returns correlated idiosyncratic errors, even when they are actually uncorrelated. In other words, I am wondering whether the residual correlation is evidence of the inappropriateness of a strict factor model or, on the contrary, it is simply induced by an inappropriate use of PCA. If the factor loadings had been estimated by any of the estimation methods ordinarily used in classical factor analysis, would the residual correlations still be non‐vanishing?

    Yi Yu and Richard J. Samworth (University of Cambridge)

    We congratulate the authors on their paper. POET elegantly tackles low rank plus sparse matrix estimation, provided that the eigenvalues of the low rank matrix grow at rate O(p) (see assumption 1). Suppose now that this assumption does not hold, and instead, we have the following condition.

    Assumption 5. All the eigenvalues of the K×K matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0977 are bounded away from both 0 and ∞ as p→∞, where 0<α<1.

    Similar conditions are widely used in sparse principal components analysis and low rank plus sparse matrix estimation problems; see, for example, Amini and Wainwright (2009) and Agarwal et al. (2012). In what follows, we consider the three main objectives in Section 2. The notation and model are the same as those in the paper.

    Proposition 3. Assume assumption 5. For the factor model with condition (2.1), we have

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0978

    Moreover, if urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0979 are distinct, then

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0980

    From this we see that, under a suitable sparsity condition on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0981, the first K principal components are still approximately the same as the columns of the factor loadings, even if the eigenvalues are not as spiked as O(p).

    However, for POET to control the relative error of the matrix estimate, assumption 1 is necessary, as can be seen from a close inspection of the proof of theorem 2 of Bai and Ng (2002). In fact, if assumption 1 is replaced with assumption 5, we have, for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0982, that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0983

    The other half of this theorem still holds, however, so the less spiked structure will not asymptotically increase the risk of overestimation in the selection of K.

    The performances of IC, the Akaike information criterion AIC and the Bayesian information criterion BIC are compared in Table 11, with the corresponding largest eigenvalues urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0984 shown in Fig. 15. If the spectrum structure satisfies assumption 1 (urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0985), both IC and BIC select the correct value of K. However, if we shrink the spiked eigenvalues, IC and BIC tend to underestimate, whereas AIC overestimates, the true K.

    Table 11. Performance of IC, AIC and BIC†

    Method    Results for the following values of C: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0986, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0987, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0988, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0989
    IC        6.00 (0.00)     1.08 (0.27)     1.00 (0.00)     6.00 (0.00)
    AIC       20.00 (0.00)    20.00 (0.00)    20.00 (0.00)    20.00 (0.00)
    BIC       6.00 (0.00)     2.00 (0.00)     1.00 (0.00)     6.00 (0.00)

    †For the same u and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0990 as in Section 6.2, define urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0991 and expand urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0992 to a block diagonal matrix urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0993 by making urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0994 the diagonal block of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0995. The rows of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0996 are generated from an urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0997 distribution. Expand the generating process of F similarly to match urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0998, generate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0999 accordingly and then let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1000. Here, K=6 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1001. The means of the estimated K are reported over 100 repetitions, with standard errors in parentheses.
    Fig. 15. Largest 20 eigenvalues of the covariance matrix in cases (a)–(d), corresponding to the four values of C in Table 11
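    As an illustration of the criteria being compared here, the following is a minimal sketch (ours, not the discussants') of one common variant of the Bai and Ng (2002) information criterion, computed from the eigenvalues of the sample covariance matrix; the function name and the IC_1 penalty variant are our choices.

        import numpy as np

        def ic_number_of_factors(Y, kmax=20):
            """Select the number of factors by an IC_1-type criterion.

            Y : (T, p) array, rows are time points. Returns the k in
            {0, ..., kmax} minimising log V(k) + k * penalty, where V(k)
            is the average residual variance after removing k principal
            components (a sketch of Bai and Ng, 2002)."""
            T, p = Y.shape
            Y = Y - Y.mean(axis=0)                        # centre each series
            eigvals = np.linalg.eigvalsh(Y.T @ Y / T)[::-1]   # descending order
            total = eigvals.sum()
            penalty = (p + T) / (p * T) * np.log(p * T / (p + T))
            ics = [np.log((total - eigvals[:k].sum()) / p) + k * penalty
                   for k in range(kmax + 1)]
            return int(np.argmin(ics))

    Shrinking the spiked eigenvalues in such a simulation reproduces the qualitative pattern in Table 11: the heavier penalties start to underestimate K while an AIC-type penalty overestimates it.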
    To examine the effect of missing the Kth common factor, assume condition (2.1) and that rank(B)=K, but the estimator is
    Σ̂ = Σ_{i=1}^{K−1} λ̂_i ξ̂_i ξ̂_i^T + R̂^𝒯_{K−1},
    where R̂^𝒯_{K−1} is the entrywise shrunk estimator of the principal orthogonal complement R_{K−1}. In this case, owing to the missed common factor, most of the pairs of cross‐sectional units in R_{K−1} are no longer ‘weakly correlated’. Note that the thresholds τ_ij in Appendix A are still the same, i.e. no extra shrinkage is introduced. However, the sparsity measure m_p used in theorems 2 and 3 is not o(p), so the error bound does not converge to 0; but, when K is correctly estimated or overestimated, even substituting assumption 5 for assumption 1, the corresponding results in theorems 2 and 3 still hold. Thus, if there is doubt about the validity of assumption 1, a less severe penalty (e.g. AIC) may be preferable, to avoid the more serious error of underestimation of K.

    Frank Critchley (The Open University, Milton Keynes)

    In warmly welcoming tonight's paper, I offer three sets of comments, adding the aside that there could be value in using measures of discrepancy specifically adapted to non‐negative definite symmetric matrices.
    1. Consistent estimation of Σ, or its inverse, relies on key assumptions—typically, a static model with pervasive factors and (approximately) sparse errors. Two key questions arise.
      1. How far can they be checked? Given POET's fast (singular value decomposition) nature, deletion diagnostics seem entirely feasible. Again, individuals and/or time points can be deleted. Further, the static model can be tested within broader—e.g. first‐order auto‐regressive—models.
      2. What effects do these assumptions have on subsequent inference? Indeed, could such context‐specific considerations help to guide the choices to be made within a POET‐analysis? In short, is there scope for a ‘POET for purpose’?
    2. POET having an equivalent least squares formulation, potential lack of robustness to outliers merits consideration, with the usual array of possible solutions. In particular, there is a variety of robust versions of both principal component analysis and factor analysis.
    3. Might there be a role for the invariant co‐ordinate selection (ICS) methodology introduced in Tyler et al. (2009)? If so, there would seem to be several possible advantages. ICS requiring two affine equivariant scatter functionals, a robust choice could, for example, be used alongside the regular covariance. Of particular relevance here, I have recently shown that one of these functionals can be singular, without essential loss. ICS could be used as a complement to POET, the former using a generalized form of the principal component analysis asymptotically determining the latter. More radically, it could replace POET, as is perhaps natural on invariance grounds: subsuming centring, ICS's invariance to linear transformation of the data combines principal component analysis's invariance to orthogonal transformation with that of factor analysis to separate scaling of the variables. This would have the additional advantage of potentially extending POET to a wider range of data types. In particular, incommensurable variables could be accommodated. Finally, however ICS is implemented, it retains POET's computational speed, while providing visual displays. These offer a range of diagnostic and other potential benefits, notably, multivariate outlier detection or, again, group detection (via implicit estimation of Fisher's discriminant subspace).

    Once again, it is a pleasure to congratulate the authors on a very stimulating paper.

    Jian Zhang (University of Kent, Canterbury)

    I congratulate Fan and his colleagues on their groundbreaking and innovative contribution to high dimensional covariance estimation. I would like to contribute to the discussion with the following comments.

    Their method is applicable to the source localization problem in magnetoencephalography‐based neuroimaging. Suppose that we observe a multivariate time course Y(t) from n sensors, which can be modelled by
    Y(t) = Σ_{r∈Ω} x{r,η(r)} β(r,t) + ɛ(t), (6)
    where Ω is a grid approximation to the brain, β(r,t) is a latent univariate time source of interest at location r, x{r,η(r)} is a design vector determined by the so‐called Maxwell equations with orientation η(r) and ɛ(t) is noise. Assume that β(r,t) is sparse, i.e. the temporal variability (called power or the marginal variance) var{β(r,·)}=0 for all r ∈ Ω except a few locations (i.e. non‐null sources). We want to localize these non‐null sources among an infinite number of candidates. A spatial filtering theory has been developed by Zhang et al. (2012) and Zhang (2012) for searching for a non‐null source. Under certain conditions, the covariance matrix of Y(t) can be expressed as
    Σ_y = Σ_{r∈Ω} var{β(r,·)} x{r,η(r)} x{r,η(r)}^T + cov{ɛ(t)},
    where x{r,η(r)} is the output vector of the sensors that would be induced by a unit magnitude source located at r along orientation η(r) and var{β(r,·)} is the power at r. So it is not surprising that our theory relies on an appropriate estimate of the covariance matrix when p is much larger than n and J. In this sense, I expect that POET can significantly improve our spatial filters.
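    To make the connection to spatial filtering concrete, here is a minimal sketch (ours) of the minimum variance (beamformer) power estimate on which such filters rest; the function and its arguments are illustrative, with Sigma_hat standing for any estimate of the sensor covariance, e.g. a POET estimate.

        import numpy as np

        def lcmv_power(Sigma_hat, leadfield):
            """Power estimate at one candidate source location via a
            minimum variance spatial filter w = Sigma^{-1} x / (x' Sigma^{-1} x).
            The filtered power w' Sigma w then equals 1 / (x' Sigma^{-1} x).

            Sigma_hat : (n, n) estimated sensor covariance;
            leadfield : (n,) sensor output x{r, eta(r)} for a unit source at r."""
            sol = np.linalg.solve(Sigma_hat, leadfield)   # Sigma^{-1} x
            return 1.0 / (leadfield @ sol)

    Scanning this quantity over the grid Ω and flagging locations of large power is, in essence, how the non-null sources are localized; the quality of the covariance estimate enters directly through Sigma_hat.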

    There are two different asymptotics: expanding time domain asymptotics, where the time window expands as J increases, and infill asymptotics, where the time points t_j, 1≤j≤J, are restricted to a fixed time window as J increases. Under the infill setting, the current strong mixing condition in Section 3 will not hold. I am curious about the performance of POET in the infill setting.

    John T. Kent (University of Leeds)

    It is interesting to compare the methodology of this paper with conventional multivariate analysis in a fixed dimension p, where p is small or moderate and the relevant asymptotics involve the sample size n growing large. The simplest methodology is either invariant or equivariant under affine transformations, e.g. Hotelling's T²‐statistic. Thus if there are two variables, height and weight, say, it does not matter what units we use to measure them; further, the original two variables can be replaced by any two linearly independent linear combinations.

    However, even in this simple setting, there is typically less invariance for dimension reduction methods. For example, principal component analysis is equivariant only under orthogonal transformations of the data. Factor analysis is equivariant only under diagonal transformations, i.e. rescaling the variables.

    For large p, especially when p exceeds the sample size n, some sort of regularization is needed; the price is stronger assumptions and less invariance. In the current paper it is assumed that
    1. the factor loadings are pervasive and
    2. the idiosyncratic covariance matrix Σ_u is sparse.
    My impression is that these assumptions imply that the methodology of the paper is not equivariant under either orthogonal or diagonal transformations. If so, there are several limitations on the types of data for which this paper might be useful:
    1. the variables themselves (rather than linear combinations) are important;
    2. the choice of variables is important (for example we are not in a situation where half the underlying variables measure the same feature of the data);
    3. the choice of measurement units is important (ideally the variables should be commensurate, so that all the variables are measured in the same units with comparable variances).

    I would be interested in the authors’ comments on these thoughts.

    The following contributions were received in writing after the meeting.

    Amir Ahmad, Sarosh Hashmi and Sami M. Halawani (King Abdulaziz University, Rabigh)

    We congratulate Fan and his colleagues for this interesting paper. The paper proposes a model for the estimation of a high dimensional covariance matrix. The proofs are detailed and the experiments are extensive. The discussion provides good insight into the problem.

    Gene expression data sets and protein expression data sets (e.g. Golub et al. (1999) and Alon et al. (1999)) provide a challenge because of their high dimensions and small numbers of data points. The authors mention statistical genomics as one of the fields of application of the methods proposed in the paper. Hence, it would be interesting if they could show some results obtained by the proposed method on these data sets and comment on future extensions of the proposed model.

    Charles Bouveyron (University Paris 1 Panthéon‐Sorbonne)

    Before I go further, I would like to thank the authors greatly for this very interesting and painstaking work. I found that this paper was prepared with real care. I particularly appreciated the fair balance between theory and experiments.

    The subject of the paper, large covariance matrix estimation, has become a central problem in modern statistics. Indeed, the technological advances of the last two decades have significantly modified the nature of data, and consequently of statistics. In particular, modern data are often high dimensional (large numbers of variables), big (large numbers of observations) or available as a stream (the observations pass and cannot be stored).

    The paper focuses on the factor model and discusses solutions for estimating the covariance matrix. The POET‐method that is introduced has the advantage of including existing regularization strategies for large covariance matrix estimation. Among those strategies, one consists in thresholding the principal directions associated with the smallest eigenvalues. For this, POET completes the eigendecomposition with a thresholded matrix, let us say R. This allows us in particular to perform the inversion of the covariance matrix efficiently. An alternative would be to use the covariance matrix approximation that was used in Bouveyron et al. (2007), which leads to an explicit inversion of the covariance matrix. Furthermore, recent strategies for estimating sparse covariance matrices include ℓ1‐type penalties. A theoretical and experimental comparison with these approaches would be interesting.

    D. S. Coad and H. Maruri‐Aguilar (Queen Mary University of London)

    We congratulate Fan and his colleagues on this beautiful paper, which provides an elegant method for estimating a high dimensional covariance matrix with a conditional sparsity structure. The simplicity of the approach and its wide applicability make it very appealing. Asymptotic properties and simulation results convincingly demonstrate the superiority of the method. We feel that the estimator proposed has a multitude of other potential uses in practice.

    The problem of controlling the false discovery rate in example 3 often presents itself in gene association analysis, but limited numbers of observations are available. Since there is only a small number of observations for each hypothesis, a one‐stage design can lead to tests with poor power. However, Zehetmayer et al. (2005) have shown that a two‐stage design based on combining the p‐values from a screening stage and a testing stage can significantly improve the power. A generalization to multistage designs is provided by Zehetmayer et al. (2008). A natural question is whether the principal factor approximation can be applied to these designs.

    In the multiperiod asset pricing model that is outlined in example 6, to test the null hypothesis (1.2), the model is embedded in the multivariate linear model (5.3). When p<T, the usual test statistic has either a χ²‐ or an F‐distribution under the null hypothesis, according to whether the covariance matrix Σ_u is known or estimated. However, when p≥T, the sample estimate of Σ_u is degenerate, and the non‐degenerate POET‐estimator can be employed instead. It would be interesting to know the distributions of the test statistic W and of its POET counterpart. In particular, it is unclear what the corresponding degrees of freedom would be.

    A problem with large data sets in computer experiments is the intractability of the usual Gaussian process model. The main obstacle is the evaluation and inversion of large covariance matrices. Kaufman et al. (2008, 2011) used tapering to produce sparse correlation matrices and correlation functions with compact support respectively. The thresholding methods that are described could be used for the analysis of computer experiments, by devising a special form for the entry‐adaptive thresholding rule s_ij(·). This would allow fast covariance computations and tractability of the problem.

    Wei Dang (Shihezi University) and Keming Yu (Brunel University, London)

    The principal component analysis method for large covariance matrix estimation is a novel idea for a challenging issue. By assuming a sparse error covariance matrix in a multifactor model, the proposed principal orthogonal complement thresholding estimator POET does have a proper rate of convergence.

    Whereas principal component analysis can be applied to the analysis of non‐stationary time series (Lansangan and Barrios, 2009), POET may lose the good properties presented in theorems 1 and 3 for non‐stationary and non‐ergodic time series. Because POET relies on stationarity and ergodicity assumptions on the underlying time series, it may exclude many important application examples, including financial time series analysis and health science data analysis. For example, modern mathematical models largely focus on martingale models, including Brownian motion, but a multi‐dimensional Brownian motion may not be ergodic. In the health sciences the data under analysis may be the yearly heights and weights of a large group of children recorded from their early years to the end of their high school studies. It is often observed that the heights and weights of these children rise much more quickly in certain years than in others, so that the sample means and variances in periods of quick growth differ statistically from those in other years; the data under analysis would then be non‐stationary. One way to apply POET to these problems may be to use a transformation first, such as detrending non‐stationary processes into stationary ones, or a Laplace transform of non‐ergodic processes into ergodic ones.

    The other issue with the proposed POET is how to adapt it for the analysis of data with outliers. Many empirical studies find that the distribution of stock returns departs from normality, including the stock returns from the Center for Research in Security Prices database that are used in the paper. Like principal component analysis, POET may become unreliable if outliers are present in the data. The same type of data may occur in health science. As Jolliffe (2002) pointed out, for a sample of healthy children of various ages between 5 and 15 years old, an observation with height 175 cm and weight 25 kg is not particularly extreme on either the height or the weight individually, but the combination (175 cm, 25 kg) is an outlier. In such cases, it is desirable to employ a statistical estimation procedure that may be more efficient and robust than ordinary least squares, yielding a robust POET‐estimator.

    Matteo Farnè (University of Bologna)

    I thank the authors for this very challenging paper. While reading it and listening to the presentation, I have learnt much. My comment is on possible extensions of the method proposed.

    In Farnè and Montanari (2013) I have done some work on a different approach to the estimation of large covariance matrices, namely the approach based on shrinkage, under assumptions which differ from those considered in this paper. Ledoit and Wolf (2003, 2004) suggested obtaining a well‐behaved covariance matrix by shrinking the sample covariance matrix either towards a scaled identity matrix or, to impose some structure on the estimator, towards a single‐index model covariance matrix. Boehm and Von Sachs (2008, 2009) have successfully extended shrinkage approaches to the estimation of the spectral matrix of a multivariate time series.

    My feeling is that the POET‐method could be profitably extended to the estimation of large spectral matrices also. Of course, owing to the particular nature of spectral matrices the extension is not straightforward; for instance the effect of smoothing must also be taken into account. I am wondering whether Fan and his co‐authors would suggest that we employ their method in the frequency domain also or on the contrary whether they see any reason why such an extension is not advisable.

    Marco A. R. Ferreira (University of Missouri, Columbia)

    I congratulate Professor Fan and his colleagues for their valuable contribution to the area of large covariance matrix estimation.

    Professor Fan and colleagues have developed a method for estimating large covariance matrices when there are common unobservable factors and additional cross‐sectional correlation. They consider the case when, as the number of individuals p and the number of time points T grow to infinity, the number of common unobservable factors K remains fixed. In addition, in their set‐up the eigenvalues corresponding to the common factors diverge as p→∞. Finally, they assume that the covariance matrix of the idiosyncratic component is approximately sparse.

    With these assumptions, the authors develop a method based on principal component analysis for covariance matrix estimation. Specifically, they first estimate the contribution of the common factors to the covariance matrix by the sum of the first K terms of the spectral decomposition of the sample covariance matrix. Then, they subtract the estimated contribution of the common factors from the sample covariance matrix to obtain the principal orthogonal complement. Further, they apply thresholding to the principal orthogonal complement to obtain an estimator of the idiosyncratic covariance matrix. Finally, their covariance matrix estimator is the sum of the estimated contribution of the common factors and the estimated idiosyncratic covariance matrix.
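    The construction just described can be summarized in a short sketch (ours; the correlation‐based hard threshold with rate 1/√p + √(log p/T) is one of the variants discussed in the paper, and the constant C is illustrative):

        import numpy as np

        def poet(Y, K, C=0.5):
            """A minimal sketch of the POET construction.

            Y : (p, T) data matrix (rows = series); K : number of factors."""
            p, T = Y.shape
            Y = Y - Y.mean(axis=1, keepdims=True)
            S = Y @ Y.T / T                                  # sample covariance
            vals, vecs = np.linalg.eigh(S)
            vals, vecs = vals[::-1], vecs[:, ::-1]           # descending order
            # low rank part: first K terms of the spectral decomposition
            low_rank = (vecs[:, :K] * vals[:K]) @ vecs[:, :K].T
            R = S - low_rank                                 # principal orthogonal complement
            # hard thresholding of R on the correlation scale, keeping the diagonal
            omega = 1 / np.sqrt(p) + np.sqrt(np.log(p) / T)
            d = np.sqrt(np.diag(R))
            thresh = C * omega * np.outer(d, d)
            R_thr = np.where(np.abs(R) >= thresh, R, 0.0)
            np.fill_diagonal(R_thr, np.diag(R))
            return low_rank + R_thr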

    I have two main comments or questions on the paper.
    1. As the number of individuals p increases, it seems intuitive to assume that the underlying process generating the data should grow in complexity, i.e. it seems intuitive that K should grow with p. What would be the potential technical issues that would arise if one decides to extend the current work to the case when K grows with p?
    2. For the application of thresholding, there are a number of constants that must be chosen such as τ in equation (2.6) and C in equation (3.2). There seems to be an opportunity for the use of empirical Bayes methodology for the estimation of those threshold parameters.

    Florian Frommlet (Medical University Vienna)

    I congratulate the authors on this impressive paper concerned with estimating high dimensional covariance matrices under conditional sparsity. Their approach is surprisingly simple: first compute the principal components of the sample covariance matrix, then estimate the number of relevant components and finally apply a thresholding procedure to the remaining covariance matrix. In spite of this simplicity, extensive simulation studies in their paper show that POET, the implementation of the approach presented, outperforms competing algorithms in various scenarios.

    It is not too surprising that POET performs well in those scenarios based on factor models with few factors, which mimic the situation under which the authors have derived asymptotic results for their method, i.e. when the covariance matrix has a small (fixed) number K of very large eigenvalues. It is quite intuitive that in this situation the first K principal components will simply represent the corresponding factors of the factor model. Also it appears to be clear that the procedure works well when no factors are present, as long as the number of components is then correctly estimated to be 0.

    For me the most astonishing result is that POET appears to do relatively well in model 3 of Section 6.5.2, where data were simulated according to an auto‐regressive AR(1) model. This is the only presented simulation scenario where data were not simulated either from a factor model with a small number of strong factors, or alternatively from a model without factors and sparse covariance matrix. The covariance matrix of the suggested AR(1) model does not have particularly spiked eigenvalues, but the eigenvalues smoothly decrease from their maximum. In fact for p=200 and p=300 there are 36 and 53 eigenvalues larger than 1 respectively. According to the simulations presented POET picks for this scenario (both for p=200 and p=300) on average roughly six factors to model the covariance structure, outperforming direct thresholding of the sample covariance matrix. This result indicates that POET might work well even in situations which are not covered by the asymptotic analysis presented. However, further work seems to be necessary to explain why that would be so.
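    The eigenvalue counts quoted here are easy to reproduce. The following sketch (ours) assumes the AR(1) covariance Σ_ij = ρ^{|i−j|} with ρ=0.85, a value chosen by us because it approximately reproduces the quoted counts, and assumes scipy is available:

        import numpy as np
        from scipy.linalg import toeplitz

        # Eigenvalues of the AR(1) covariance matrix Sigma_ij = rho^{|i-j|}.
        rho = 0.85
        for p in (200, 300):
            Sigma = toeplitz(rho ** np.arange(p))
            eig = np.linalg.eigvalsh(Sigma)
            print(p, int((eig > 1).sum()))   # roughly 36 and 53 eigenvalues exceed 1

    The counts grow linearly in p (the fraction of eigenvalues above 1 tends to arccos(ρ)/π), so the spectrum is smoothly decaying rather than spiked, which is what makes POET's good performance here noteworthy.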

    I. Gijbels and K. Herrmann (KU Leuven) and A. Verhasselt (Universiteit Hasselt and Universiteit Antwerpen)

    Fan and his colleagues present a very nice estimation technique (POET) for high dimensional covariance estimation, based on principal component estimation and thresholding the orthogonal complement of the principal components. They show that POET is equivalent to constrained least squares (CLS) estimation.

    We wonder how robust POET is when the data matrix is corrupted, since it is well known that least‐squares‐based methods are not robust. The equivalence of POET to a CLS estimation problem seems to open the way for a more robust procedure. The use of robust principal component methods (e.g. Engelen et al. (2005)) could also offer a possibility.

    As pointed out in the literature (see for example Antoniadis (2007)), the qualitative properties of a thresholding rule turn out to be important. For example, the hard thresholding rule is discontinuous, whereas the soft thresholding rule is continuous. In CLS regression hard thresholding leads to a larger variance of the estimates, whereas soft thresholding shifts the estimates, creating a bias. What is the effect of such qualitative properties of the thresholding rule on the POET estimator?

    The authors use a computationally expensive cross‐validation criterion to choose C. It might be worth the effort to exploit the equivalent CLS problem and to use criteria based on this equivalence, such as an Akaike type of criterion.

    Portfolio allocation in the Markowitz (1952) framework is chosen as a numerical illustration of POET. In the simulation studies and empirical application, the emphasis is on estimating the weights of the minimum variance (MV) portfolio as the solution to min_w w^TΣw subject to w^T1=1. This is in line with current literature (Kourtis et al. (2012) and references therein), where the MV portfolio is preferred because it alleviates the necessity of estimating the stock returns. As the MV weights admit the expression w=Σ^{−1}1/(1^TΣ^{−1}1), the comparison of POET, the strict factor model (Fan et al., 2008) and the sample covariance (SC) matrix estimator amounts to a comparison of the estimated precision matrix in the models considered. It is known that the SC precision matrix performs poorly (Fan et al., 2008; Kourtis et al., 2012) and measures to counterbalance estimation errors must be taken. In a p<T framework shrinkage methods, for example, are applied to the SC matrix before (Ledoit and Wolf, 2003) or after inversion (Kourtis et al., 2012), significantly enhancing results. Shrinkage methods have also been applied to the p≥T framework (see Ledoit and Wolf (2004)), establishing a possible competitor in this scenario as well. Owing to the known shortcomings of the SC precision matrix, deeper insights can be expected from a comparison with such refined methods.
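    For concreteness, the closed‐form MV weights take one line to compute from any covariance estimate (a sketch; the function name is ours):

        import numpy as np

        def min_variance_weights(Sigma_hat):
            """Minimum variance portfolio weights w = Sigma^{-1} 1 / (1' Sigma^{-1} 1),
            the closed-form solution of min w' Sigma w subject to w' 1 = 1."""
            ones = np.ones(Sigma_hat.shape[0])
            sol = np.linalg.solve(Sigma_hat, ones)   # Sigma^{-1} 1
            return sol / (ones @ sol)

    Since only Σ^{−1}1 enters, any error in the estimated precision matrix feeds directly into the weights, which is why the comparison of estimators reduces to a comparison of precision matrices.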

    Wally Gilks (University of Leeds)

    Fan and his colleagues state that the low rank plus sparse representation of their model is for the population covariance matrix. The most obvious interpretation of this assertion is that the model is intended to describe the population, not the specific individuals sampled. This interpretation is somewhat at odds with the design of the sparse component of the model, Σ_u, which accounts for idiosyncratic correlations between specific individuals.

    At a population level, such idiosyncratic components can only be represented in terms of probabilities of idiosyncratic correlation. For example, suppose that two individuals i and j, randomly and independently sampled from the population, have a probability π of interacting idiosyncratically. Suppose further that the covariance σ_{u,ij} of their idiosyncratic errors is ρ if i and j interact, and 0 otherwise. In a sample of size p, the number of other individuals with which individual i idiosyncratically interacts is distributed as binomial(π,p−1). The authors require that their measure of sparsity, m_p = max_i Σ_j |σ_{u,ij}|^q, grows with sample size as o(p). Letting M_p and M̄_p denote the maximum and mean values in a sample of size p from a binomial(π,p−1) distribution, we have
    m_p ≥ |ρ|^q M_p ≥ |ρ|^q M̄_p ≈ |ρ|^q π(p−1).
    Thus, m_p ≠ o(p) unless ρ=0.
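    This argument can be checked numerically; in the following sketch (ours, with an illustrative π=0.05) the maximum interaction count divided by p stabilizes near π rather than vanishing, in line with the conclusion above:

        import numpy as np

        rng = np.random.default_rng(0)
        pi = 0.05                        # interaction probability (illustrative)
        for p in (100, 1000, 10000):
            # interaction counts of p individuals: Binomial(p-1, pi) draws
            degrees = rng.binomial(p - 1, pi, size=p)
            print(p, degrees.max() / p)  # stabilises near pi: max degree = O(p), not o(p)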

    Hajo Holzmann and Anna Leister (Philipps‐Universität Marburg)

    We congratulate Fan and his colleagues for an inspiring paper on estimating the factor structure in high dimensional, approximate factor models, and its consequences for estimating the underlying covariance matrix.

    Let us consider implications for the time series structure of y_t, specifically its lagged covariance matrix, and convergence in the max‐norm.

    When estimating Σ, Fan et al. (2011), theorem 3.2, obtain the rate √(log p/T) in the max‐norm for an estimate based on an observed factor structure, whereas in the present paper, utilizing estimated factors, the authors obtain the rate 1/√p + √(log p/T). Now, for the sample covariance, writing the entrywise estimation error in terms of the factor and idiosyncratic components and using lemma 4 (in the present paper) as well as max‐norm bounds on the cross‐terms, we obtain the rate √(log p/T) for the sample covariance too. Thus, using an estimated factor structure may not be beneficial for moderate values of p.

    For the lagged covariance Σ_y(h)=cov(y_t,y_{t+h}), h≥1, assume for distinction that the errors (u_t) are known to be serially uncorrelated: cov(u_t,u_{t+h})=0.

    Then using either an observed factor structure or the sample autocovariance gives the rate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1054. In contrast, for the estimate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1055, i,j=1,…,p, we obtain from corollary 1 (in the present paper)
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1056

    We give a finite sample illustration, similar in setting to Fan et al. (2011), but with factors following a more strongly dependent AR(1) process. The results are plotted in Fig. 16. Further details for the above statements and the simulation can be found at http://www.unimarburg.de/fb12/stoch/files/holzmann/fandiscuss.pdf.

    Fig. 16. Time lag h=1: (a) averages of the estimation error (max‐norm) over 500 simulations against p, for the sample, factor and observed‐factor estimators; (b) averages of the lagged covariance error (max‐norm) over 500 simulations against p

    Hanwen Huang, Yufeng Liu, J. S. Marron, Dan Shen and Haipeng Shen (University of North Carolina at Chapel Hill)

    We congratulate Fan and his colleagues on a very interesting contribution, which takes the fundamentally important field of covariance matrix estimation in some important new directions. We agree that now is a good time to be studying asymptotic contexts, where the first K eigenvalues of Σ grow quickly. The asymptotic mode of the sample size tending to ∞, with an exponentially growing dimension, can be improved by taking the dimension as the asymptotic driver, with the sample size growing at a logarithmic rate. This makes it clear that this setting is very close to the high dimension, low sample size setting with fixed sample size (Hall et al., 2005). Shen et al. (2012) studied another notion of principal component analysis consistency, in a wide range of such settings. Shen et al. (2013) studied another approach to sparsity under a growing eigenvalue assumption, establishing a new characterization of the boundary between regions of consistency and strong inconsistency for sparse principal component analysis in high dimension, low sample size settings. Can similar results be established for POET?

    Another reason why we are excited about these results is that covariance estimation is a critical component of SigClust, which is very useful for testing the statistical significance of clusters in high dimensional contexts (Liu et al., 2008; Huang et al., 2013). This motivated us to compare POET with the approaches used in SigClust. A key step of the SigClust analysis is to estimate the eigenvalues of the covariance matrix of the null multivariate Gaussian distribution. Huang et al. (2013) proposed a likelihood‐based soft thresholding approach for estimating the covariance eigenvalues which gave a large improvement relative to the hard thresholding approach of the former paper. Fig. 17 shows estimates of the eigenvalue spectrum for two simulated high dimension, low sample size examples with sample size n=50 and dimension d=1000. Gaussian data are simulated with mean 0 and covariance matrix Σ = diag(λ,…,λ,1,…,1), in which the first w diagonal entries equal the spike size λ and the remainder equal 1.
    Fig. 17. Estimated covariance matrix eigenvalues based on the hard, soft and POET methods, together with the true eigenvalues, for two simulated data sets with d=1000 and n=50: (a) results for spike size λ=5 and w=200 spike entries (POET works best); (b) results for λ=100 and w=10 (soft thresholding works best)

    Fig. 17(a) shows that, in situations where the number of spikes is larger than the sample size, the POET‐method gives a major improvement. Fig. 17(b) shows that, in situations with few large spikes, the soft method works better than POET owing to better background noise estimation. Ultimately, a combination of POET with existing SigClust methods may work better.
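    To fix ideas, here is a stylized sketch (ours) of hard and soft corrections of sample eigenvalues against a background noise level; both the rules and the noise level sigma2 are simplified stand‐ins for the actual SigClust procedures cited above:

        import numpy as np

        def threshold_eigenvalues(sample_eigs, sigma2, how="soft"):
            """Stylised hard/soft corrections of sample eigenvalues against a
            background noise level sigma2 (an illustration, not the SigClust code)."""
            lam = np.asarray(sample_eigs, dtype=float)
            if how == "hard":
                # keep eigenvalues above the noise level, floor the rest at sigma2
                return np.where(lam > sigma2, lam, sigma2)
            # soft: shrink all eigenvalues towards the noise level, never below it
            return np.maximum(lam - sigma2, 0.0) + sigma2

    Hard rules leave large spikes untouched but can badly overstate the many moderate sample eigenvalues that pure noise produces when d≫n, which is where a POET-type or soft correction helps.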

    Jian Huang (University of Iowa, Iowa City, and Shanghai University of Finance and Economics) and Yong Zhou (Chinese Academy of Sciences, Beijing, and Shanghai University of Finance and Economics)

    We congratulate Fan and his colleagues on presenting a wonderful and thought‐provoking paper dealing with an important topic in high dimensional data analysis. They introduce the POET‐methodology for covariance matrix estimation and study its properties. The theoretical results obtained by them are highly original, notably the aspects concerning the relative magnitude of the sample size and the number of variables. They also describe several poetic and important applications. We focus our discussion on an application of POET in the linear regression model y=Xβ+ɛ. Here y=(y_1,…,y_n)^T, X is the n×p design matrix, β=(β_1,…,β_p)^T is the vector of regression coefficients and ɛ consists of independent errors.

    An attractive approach to selection and estimation in high dimensional regression is based on the penalized least squares criterion ‖y−Xβ‖²/(2n)+Σ_j ρ(|β_j|;λ), where ρ(·;λ) is a suitable penalty function with a tuning parameter λ≥0. The success of this approach depends on the behaviour of the restricted eigenvalues and related quantities of the Gram matrix X^TX/n (see, for example, Bickel et al. (2009)). We discuss a way to repair the degeneracy of X^TX/n based on POET.

    Let Σ be the covariance matrix of the row vectors in X. Assume factor model (1.1) in the paper for the predictors and denote the POET‐estimator by Σ̂. Consider the spectral decomposition Σ̂=VΛV^T, where V is a p×p orthonormal matrix of the eigenvectors and Λ is a diagonal matrix of the eigenvalues of Σ̂. Let U=XVΛ^{−1/2}. Then X=UΛ^{1/2}V^T. This expression is reminiscent of the singular value decomposition. But here U is only approximately orthogonal, since V is from Σ̂. However, it can be viewed as a POET‐regularized singular value decomposition.

    The least squares loss equals ‖y−Xβ‖²=β^TX^TXβ−2y^TXβ+y^Ty. Replacing X^TX by nΣ̂ in this expression and noting that VV^T=I, we obtain ‖ỹ−X̃β‖² plus a term independent of β, where X̃=√n Λ^{1/2}V^T and ỹ=n^{−1/2}Λ^{−1/2}V^TX^Ty. It is natural to consider the penalized criterion ‖ỹ−X̃β‖²/(2n)+Σ_j ρ(|β_j|;λ). The loss function here can be considered a regularized version of the least squares loss, in which the rank deficiency of X^TX is repaired by making use of Σ̂.

    In particular, the least squares estimator based on (X̃,ỹ) is β̂=(X̃^TX̃)^{−1}X̃^Tỹ=Σ̂^{−1}X^Ty/n, which is well defined because Σ̂ is invertible. With standardized predictors, X^Ty/n was used for screening variables by Fan and Lv (2008). The estimator β̂ can be considered a corrected version of X^Ty/n based on Σ̂. It can also be used for screening.
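    A sketch of this repaired least squares estimator (our code; Sigma_hat denotes any invertible POET‐type estimate of the predictor covariance):

        import numpy as np

        def poet_repaired_ls(X, y, Sigma_hat):
            """Least squares with the degenerate Gram matrix X'X/n replaced by an
            invertible covariance estimate: beta_hat = Sigma_hat^{-1} X' y / n."""
            n = X.shape[0]
            return np.linalg.solve(Sigma_hat, X.T @ y / n)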

    The validity of the above proposal rests on the properties of Σ̂. Simulation studies are needed to evaluate their finite sample performance. Also, much work is required to analyse their theoretical properties. The results obtained by the authors provide a solid basis for such analyses.

    Sungkyu Jung (University of Pittsburgh) and Jason P. Fine (University of North Carolina at Chapel Hill)

    In this very stimulating paper, Fan and colleagues show the gain of conditional sparsity assumptions in covariance matrix estimation when principal components are pervasive. It is striking that the estimation procedure uses the first few principal components from the sample covariance matrix without any thresholding. The sample principal components are known to be inconsistent without strong assumptions (Johnstone and Lu, 2009). The present paper shows that under the conditional sparsity assumption the covariance estimator based on these principal components is consistent.

    It seems that the gains in the methodology proposed arise in part from the sparsity assumptions and in part from the pervasive factor assumptions. The latter assumption requires that the K largest eigenvalues of the p×p covariance matrix Σ are of order p. It is well known that the magnitude of eigenvalues is a critical condition for the consistency of principal component directions when the dimension p increases with the sample size n. As an example, suppose that the largest eigenvalue λ_1 of Σ is of magnitude δ(p), with the rest being simply 1. The corresponding sample eigenvector û_1 is consistent, in the sense that the angle between û_1 and the population eigenvector converges to 0 as p→∞, when δ(p)/p→∞. In contrast, such a strong result does not hold whenever δ(p)=O(p) or o(p) (Jung and Marron, 2009; Jung et al., 2012). This gives the insight that the sparsity assumption on the error covariance matrix is critical for the proposed estimator under the pervasive factor assumption (i.e. δ(p)=O(p)).

    Should we be tied to the pervasive factor assumption? Many other conditions have been considered in the literature. For example, in random‐matrix theory, where n and p increase at the same rate, it is customary to assume fixed eigenvalues for all p (see, for example, Paul (2007)), which corresponds to δ(p)≡δ=o(p). Meanwhile, implicit assumptions in sparse estimation of principal components are that the number of non‐zero loadings in population eigenvectors grows at a slower rate than p (Shen et al., 2013). The case δ(p)/p→∞ yields a trivial solution, with easy separation of the leading eigenvectors from the error covariance matrix. In contrast, the case δ(p)/p→0 makes estimation of the covariance matrix much more difficult. What can be said about the proposed estimator when δ(p)/p→0? It would be worthwhile to investigate more carefully the interaction between the two key assumptions on the magnitudes of the eigenvalues and the sparsity of the orthogonal complement matrix.

    Although the theoretical results are quite elegant, concerns arise about the practical implications of the two key assumptions. This is particularly true since there may not be information in the data to detect violations of the assumptions. In what types of applications are such assumptions reasonable? Are there situations where one type of assumption is more realistic than the other type? Are there diagnostics that might be employed in real data analysis? Additional practical guidance would be welcomed.

    Oliver Linton and Michael Vogt (University of Cambridge)

    Fan and his colleagues address the important issue of estimating large structured covariance matrices. They restrict the large p×p covariance matrix Σ to have the form Σ=BB^T+Σ_u, where
    1. the systematic part BB^T has K large (O(p)) eigenvalues and p−K zero eigenvalues and
    2. the residual part Σ_u is a sparse matrix with bounded eigenvalues.

    In financial applications, Σ_u represents idiosyncratic risk that can be diversified away, and so makes a smaller order contribution to portfolio risk, but in practice it can be important. The authors are to be congratulated on their comprehensive and useful method for taking full account of this structure.

    The assumption that all of the non‐zero eigenvalues of BB^T dominate the largest eigenvalue of Σ_u by the magnitude p is likely to be a little strong in practice. If K is moderately large, the Kth eigenvalue of BB^T can be expected to be much closer to the largest eigenvalue of Σ_u than to the first eigenvalue. This may affect the quality of the estimation procedure and make the problem of selecting K difficult. Fig. 18 illustrates this point with the help of the data from Section 7. We wonder whether the main theoretical results continue to hold under weaker assumptions on the growth rate of the smallest positive eigenvalue of BB^T.

    Fig. 18. Estimated eigenvalues of the matrices BB^T and Σ_u for the various yearly data samples used in Section 7.2 (the time points on the x‐axis indicate the starting date of each sample and K=3 as in the paper; the plots show that the first (i.e. the largest) eigenvalue of BB^T is much more spiked than the third, the latter roughly having the same magnitude as the largest eigenvalue of Σ_u): (a) largest eigenvalue of BB^T and of Σ_u in each sample; (b) comparison of the third largest eigenvalue of BB^T with the largest eigenvalue of Σ_u

    Another remark concerns the technical assumption 2, part (c), which imposes exponentially decaying tails on the model residuals. This is a very strong condition which in particular implies that all moments exist. In applications to daily equity returns this is likely to be violated. Fig. 19 shows a log‐rank plot of the data from Section 7 which suggests that the residuals are far from having exponentially decaying tails. To have a better idea of how the estimation procedure works when applied to financial data, it would thus be important to understand which parts of the procedures are robust to weaker moment conditions and which are not.

    Fig. 19. Estimated tail exponents of the model residuals for the various yearly data samples in Section 7.2: for each sample we calculate the residuals and apply a log‐rank regression to them to obtain estimates of the Pareto tail exponents; the median of the estimated exponents is marked for each sample (as can be seen, the exponents take values roughly between 3.5 and 5, indicating that in many cases only the first few moments will exist)
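    The log‐rank regression referred to here admits a simple sketch (ours; the tail fraction is an illustrative choice): sort the absolute residuals, regress log rank on log value over the upper tail, and read the tail exponent off the slope.

        import numpy as np

        def log_rank_tail_exponent(residuals, tail_fraction=0.1):
            """Pareto tail exponent by log-rank regression: for the largest
            observations, log(rank) is roughly linear in log(value) with
            slope -alpha (assumes the tail values are positive)."""
            x = np.sort(np.abs(residuals))[::-1]          # descending order
            k = max(int(tail_fraction * x.size), 10)      # number of tail points
            ranks = np.arange(1, k + 1)
            slope, _ = np.polyfit(np.log(x[:k]), np.log(ranks), 1)
            return -slope                                 # tail exponent estimate

    Exponents between roughly 3.5 and 5, as in Fig. 19, imply that only the first few moments exist, which sits uneasily with exponential-tail conditions.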

    Regarding the portfolio choice application, the authors use shrinkage methods to impose sparsity on the idiosyncratic part of the covariance matrix of returns. An alternative or perhaps complementary approach here (see Yen (2011)) is to impose sparsity on the portfolio weights through an ℓ1‐penalty. Each non‐zero investment entails a transaction cost, so it makes financial sense to minimize the number of such transactions; this is especially relevant for very large portfolios turned over daily. One further concern with the portfolio methodology is that no smoothness assumptions on the thresholded idiosyncratic covariances are exploited. In particular, the location of the 0s in the thresholded matrices (and thus their eigenstructure) may change abruptly over time, even though the rolling window data overlap considerably from period to period.

    Han Liu (Princeton University) and Lie Wang (Massachusetts Institute of Technology, Cambridge)

    We congratulate Professor Fan, Professor Liao and Miss Mincheva for their thought‐provoking paper. We believe that the proposed methodology will have a profound impact and stimulate much further research.

    Estimating a large covariance matrix under a small sample size is a fundamental problem. However, it suffers from the challenge that the eigenvalues of the sample covariance matrix do not converge to the population truth when the population eigenvalues are bounded or grow at a slow rate. In this paper, Professor Fan, Dr Liao and Miss Mincheva avoid this problem by exploiting an approximate factor model with a spiked eigenvalue condition: they assume that the population covariance matrix decomposes into a low rank component and a residual component. The eigenvalues of the low rank component are spiked and diverge at a fast rate, whereas the eigenvalues of the residual component are bounded. The POET‐estimator directly runs the singular value decomposition on the sample covariance matrix. It estimates the low rank component by the top principal components and applies thresholding methods to estimate the residual component according to different sparsity and smoothness conditions. The covariance matrix is then estimated by combining these two components.

    Their paper stimulated us to consider the following two extensions.

    Semiparametric extension

    The POET‐method requires exponential‐type tails of the data to establish large deviation results. It is interesting to extend this method to handle data from the semiparametric nonparanormal family (Liu et al., 2012).

    A random vector X belongs to the nonparanormal family if there is a set of univariate monotone functions f_1,…,f_p such that {f_1(X_1),…,f_p(X_p)} is Gaussian with covariance matrix Σ₀. For identifiability, Σ₀ is constrained to be a correlation matrix. Under the nonparanormal model, Liu et al. (2012) suggested replacing the sample correlation matrix by the Kendall τ rank correlation matrix R̂ with entries
    R̂_jk = sin(πτ̂_jk/2), (7)
    where τ̂_jk is the empirical Kendall τ‐statistic between X_j and X_k. By assuming that Σ₀ admits the ‘low rank plus sparse’ structure, we could apply the POET‐method to R̂. We have obtained encouraging numerical results; further theoretical investigation is on the way.
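    A minimal sketch of this construction (ours), building the rank‐based correlation matrix entry by entry; it assumes scipy is available:

        import numpy as np
        from scipy.stats import kendalltau

        def rank_correlation_matrix(X):
            """Nonparanormal correlation estimate R_jk = sin(pi/2 * tau_jk),
            with tau the empirical Kendall rank correlation (Liu et al., 2012).
            X : (n, p) data matrix; POET can then be applied to R."""
            p = X.shape[1]
            R = np.eye(p)
            for j in range(p):
                for k in range(j + 1, p):
                    tau, _ = kendalltau(X[:, j], X[:, k])
                    R[j, k] = R[k, j] = np.sin(np.pi / 2 * tau)
            return R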

    Tuning‐insensitive extension

    Another extension is to apply more sophisticated methods to estimate the residual component matrix. For Gaussian models, Liu and Wang (2012) proposed a sparse inverse covariance estimation method named TIGER. The TIGER‐estimator is tuning insensitive and achieves the optimal rates of convergence for both covariance and inverse covariance estimation under different norms. It would be interesting to see whether these good theoretical properties still hold for the corresponding POET‐estimator.

    Jorge Mateu (University Jaume I, Castellón)

    Fan and his colleagues are to be congratulated on a valuable and thought‐provoking contribution on the estimation of high dimensional covariances with a conditional sparsity structure. As they note, this problem can be encountered in a wide variety of practical examples and scientific fields. In particular they mention the problem of high dimensional classification.

    I would like to comment on this problem in the context of spatial point processes. Byers and Raftery (1998) considered the problem of detecting features in spatial point processes in the presence of substantial clutter with the aim of outlining seismic faults. They used kth‐nearest‐neighbour distances to produce high breakdown point robust estimators of a covariance matrix in a high dimensional problem. If in this context we assume that we have some common but unknown factors, we can then use the idea of the principal orthogonal complement thresholding method to explore such an approximate factor structure with sparsity. We have sound strategies to calculate in this spatial context the sample covariance matrix and the factor‐based covariance matrix so that we can use the idea of conditional sparsity.

    In the context of spatial point processes, Collins and Cressie (2001) developed exploratory data analytic tools, in terms of local indicators of spatial association functions based on the product density, to examine individual points in the point pattern in terms of how they relate to their neighbouring points. For each point of the point pattern we have a local indicator of spatial association function. To perform statistical inference, needed for example in testing for local clustering, Collins and Cressie (2001) developed closed expressions for the autocovariance and cross‐covariance between any two such functions. These covariance structures are complicated to work with as they live in (very) high dimensional spaces. Again, it is not difficult to assume common factors among these functions, and thus it could be appropriate to consider conditional sparsity to estimate the covariance matrix consistently. This would provide new insight into such a problem.

    Consider, finally, an approach based on latent process modelling and principal component analysis to obtain a computationally feasible exploratory tool for discovering patterns of association between components of a highly multivariate point process. The latent Gaussian fields are obtained as linear combinations of some independent Gaussian processes. Again, it is natural to think of the POET‐method for estimating the complicated covariance matrix.

    Guangming Pan (Nanyang Technological University, Singapore) and Heng Peng (Hong Kong Baptist University)

    We congratulate Professor Fan, Dr Liao and Ms Mincheva for such a timely paper. We enjoyed reading it since there are some good and novel ideas here, particularly the idea to estimate a covariance matrix by reducing Σ to a low rank matrix plus a sparse matrix and the concept of conditional sparsity.

    The idea of this paper can be applied also to an ultrahigh dimensional linear model. Consider a high dimensional linear regression model
    y_i = x_i^Tβ + ɛ_i, i=1,…,n. (8)
    When p, the dimension of β, is large, β is always assumed to be sparse. In some applications, y_i would depend on the first finite principal components of x_i. The regressor x_i can then be supposed to follow a factor model like
    x_i = B^Tf_i + u_i, (9)
    where B is the r×p factor loading matrix, f_i is the r×1 factor process and u_i are the p×1 idiosyncratic error components. Combining model (8) with model (9) we then have
    y_i = f_i^Tγ + u_i^Tβ + ɛ_i, (10)
    where γ=Bβ. When considering the new model (10), the coefficient vector is not necessarily sparse. Though the dimension of the model is still ultrahigh, as in Wang (2012) and Ke et al. (2012), it can be efficiently reduced by the sure independence screening procedure (Fan and Lv, 2008) if we impose some simple structure on the covariance matrix of u_i.

    Fan, Liao and Mincheva assume that the number of principal components, K, is known. In many applications K is unknown and must be estimated. There is some literature focusing on this problem, e.g. Bai and Ng (2002) and Onatski (2009, 2010). These approaches require similar spike conditions such that the K largest eigenvalues go to ∞ and the remaining eigenvalues are bounded. But what would happen if Σ is structured in a different way, say, if Σ is a Toeplitz matrix whose eigenvalues are not spiked? Can the number of factors still be estimated consistently? Below we consider the problem of estimating K when the K largest eigenvalues do not tend to ∞.

    In some sense, estimating the number of factors is equivalent to finding the number of eigenvalues of the population covariance matrix Σ which are greater than a constant number C, i.e. K = #{i : λ_i(Σ) > C}, where #{i : i ∈ A} denotes the number of indices i satisfying property A and λ_1(Σ)≥…≥λ_p(Σ) are the eigenvalues of Σ.

Note that, when p/n→0, under some regularity conditions, the eigenvalues λ̂_1 ≥ … ≥ λ̂_p of the sample covariance matrix Σ̂ are consistent estimates of the respective population eigenvalues of Σ (theorem 4 of Chen and Pan (2012)). Hence K can be determined from the sample eigenvalues as K̂ = #{i: λ̂_i > C}. When p/n→c ∈ (0,∞), Baik and Silverstein (2006) and Bai and Yao (2008) showed that the eigenvalues of a spiked Σ that are greater than 1+√c can be recovered from those of Σ̂. Each population eigenvalue λ outside [1−√c, 1+√c] pulls one sample eigenvalue away from the support [(1−√c)², (1+√c)²] of the Marchenko–Pastur distribution and positions it at λ{1+c/(λ−1)}. Therefore, when C > 1+√c, we can estimate K as
K̂ = #[i: λ̂_i > C{1+c/(C−1)}].
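As an illustration only (no code accompanies this discussion), here is a minimal numpy sketch of this rule; the function name estimate_K and the use of the plain sample covariance matrix are our own choices.

```python
import numpy as np

def estimate_K(X, C):
    """Estimate the number of factors from an n x p data matrix X by
    counting sample eigenvalues above the corrected threshold
    C{1 + c/(C - 1)}, where c = p/n; requires C > 1 + sqrt(c)."""
    n, p = X.shape
    c = p / n
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
    threshold = C * (1.0 + c / (C - 1.0))
    return int(np.sum(eigvals > threshold))
```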

When p/n→∞, we can first split the components of the random vector into several subgroups by some criterion, so that the dimension of every subgroup is smaller than or proportional to n. The number of factors for each subgroup can then be determined by the method suggested above. Since every subgroup loads on the same factors, the estimated numbers of factors across the subgroups can be averaged, or their maximum taken, to obtain the final estimate of K, the number of factors for the whole random vector. Though such an idea, including its computational burden, would need further investigation, we believe that, when p is significantly larger than n, the number of factors in the model should be estimable accurately. We would be very interested in hearing the authors' views on this point. A sketch of this split-and-aggregate scheme follows.
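The following sketch is our own illustration; the grouping into contiguous blocks and the aggregation by maximum are assumptions, not part of the proposal above.

```python
import numpy as np

def estimate_K_split(X, C, n_groups, estimator):
    """Split the p coordinates into subgroups whose dimensions are at most
    of the order of n, estimate the number of factors within each subgroup
    (estimator may be estimate_K from the previous sketch) and aggregate
    by the maximum, since all subgroups share the same factors."""
    groups = np.array_split(np.arange(X.shape[1]), n_groups)
    return max(estimator(X[:, g], C) for g in groups)
```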

    Mohsen Pourahmadi (Texas A&M University, College Station)

How do we go beyond the prevalent sparsity assumption in the recent literature and estimate a large, non-sparse covariance matrix? The question arises naturally in factor models of the form Σ = BB' + Σ_u, a low rank plus a sparse (non-diagonal) matrix, known as approximate factor models (Chamberlain and Rothschild, 1983). Given sample data of dimension p with sample size n, the attraction of the POET covariance estimator proposed by Fan, Liao and Mincheva lies in its simplicity, transparency, generality and rigour. These attributes are highly desirable, and we would like to see more of them in the rapidly growing algorithm-driven area of high dimensional data analysis (Pourahmadi, 2013).

Construction of a POET-estimator is simple and proceeds in the following three steps (a code sketch is given after the steps).

Step 1: start with the spectral decomposition of the sample covariance matrix of the data,
Σ̂ = Σ_{i=1}^q λ̂_i ξ̂_i ξ̂_i' + R̂_q,
where q is the number of selected principal components, (λ̂_i, ξ̂_i) are the eigenvalue–eigenvector pairs of Σ̂ and R̂_q = (r̂_ij) is the residual matrix.
Step 2: apply the adaptive thresholding (Cai and Liu, 2011) to R̂_q,
R̂_q^δ = (r̂_ij^δ),  r̂_ij^δ = s_ij(r̂_ij) I(|r̂_ij| ≥ τ_ij),
where
τ_ij = δ(r̂_ii r̂_jj)^{1/2}
is an entry-dependent threshold and s_ij(·) is a shrinkage function.
Step 3: a (q,δ) POET-estimator of Σ is
Σ̂_{(q,δ)} = Σ_{i=1}^q λ̂_i ξ̂_i ξ̂_i' + R̂_q^δ. (11)
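The following minimal numpy sketch of the three steps is our own illustration; the hard thresholding rule and the correlation-scale threshold τ_ij = δ(r̂_ii r̂_jj)^{1/2} are assumptions chosen to be consistent with the special cases listed below.

```python
import numpy as np

def poet(X, q, delta):
    """Sketch of the (q, delta) POET construction: spectral decomposition,
    adaptive thresholding of the residual, and recombination."""
    S = np.cov(X, rowvar=False)
    lam, xi = np.linalg.eigh(S)
    lam, xi = lam[::-1], xi[:, ::-1]            # descending eigenvalues
    # Step 1: keep q principal components; R is the residual matrix
    low_rank = (xi[:, :q] * lam[:q]) @ xi[:, :q].T
    R = S - low_rank
    # Step 2: hard thresholding with tau_ij = delta * sqrt(r_ii * r_jj)
    d = np.sqrt(np.outer(np.diag(R), np.diag(R)))
    R_delta = np.where(np.abs(R) >= delta * d, R, 0.0)
    np.fill_diagonal(R_delta, np.diag(R))       # never threshold the diagonal
    # Step 3: the (q, delta) POET-estimator of equation (11)
    return low_rank + R_delta
```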
In spite of its simplicity, the POET-estimator is quite general in that it subsumes some important old and new covariance estimators for various choices of (q,δ) in equation (11).
1. When δ=0 and q=p, it reduces to the sample covariance matrix.
2. When δ=1, the estimator reduces to that based on the standard factor model.
3. When q=0, it reduces to the thresholding estimator of Bickel and Levina (2008) or the adaptive thresholding estimator of Cai and Liu (2011), depending on the choice of the shrinkage function s_ij(·).

In addition, as a bonus, using equation (11) and the Sherman–Morrison–Woodbury formula we obtain estimators of the precision matrix.
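For completeness, write Ξ_q = (ξ̂_1,…,ξ̂_q) and Λ_q = diag(λ̂_1,…,λ̂_q); one rendering of this identity (our own sketch in the notation above, not the discussant's formula) is
Σ̂_{(q,δ)}^{−1} = (R̂_q^δ)^{−1} − (R̂_q^δ)^{−1} Ξ_q {Λ_q^{−1} + Ξ_q'(R̂_q^δ)^{−1} Ξ_q}^{−1} Ξ_q'(R̂_q^δ)^{−1},
so that only the sparse p×p matrix R̂_q^δ and a q×q matrix need to be inverted.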

The asymptotic properties of the POET covariance estimators are established when the data are temporally correlated, under a strict stationarity assumption. In this set-up, it is desirable to go beyond estimating the contemporaneous covariance matrix and to have consistent estimators of the autocovariance matrices at non-zero lags, or of the spectral density matrix of the underlying process. I wonder whether the authors have thought of this problem and can shed any light on how their conditions relate to those in Forni et al. (2004) in the context of generalized dynamic factor models.

    Cheng Yong Tang (University of Colorado, Denver) and Yingying Fan (University of Southern California, Los Angeles)

    We most heartily congratulate Fan and his colleagues for their thought‐provoking and impactful work on estimating the large covariance matrix, which is pivotal in many contemporary scientific and practical studies. Facilitated by a factor model, a parsimonious structure is proposed for the large covariance matrix by combining a low rank matrix and a sparse covariance matrix. In the authors’ framework, a factor model is used to characterize the systematic common components underlying the target large‐scale dynamics in various problems, and a sparse covariance matrix is imposed to incorporate the remaining idiosyncratic contributions to the variations and covariations. Our comments are mainly on the treatment for the idiosyncratic component, i.e. the remaining dynamics after identifying and removing the systematic part.

An important assumption of the approach proposed is that a sparse covariance matrix Σ_u is imposed for modelling the idiosyncratic component. One may naturally wonder what alternative approach can be used in situations where a sparse Σ_u is inadequate for modelling the idiosyncratic component. Further, can a similar idea of parsimonious modelling by structural decomposition be extended to other problems such as large precision matrix estimation? In the framework of graphical models, Tang and Fan (2013) investigate the problem of large precision matrix estimation by parsimoniously modelling the idiosyncratic component with a sparse precision matrix Σ_u^{−1}. They observe that the large-scale precision matrix Σ^{−1} depends on the idiosyncratic component only through Σ_u^{−1}. Thus a similar idea of structural decomposition can be equally applied to estimating the large precision matrix, with the systematic component being captured by a factor model. Facilitated by the interpretation that 0s in a precision matrix imply conditional independence between the corresponding components, a sparse Σ_u^{−1} can have useful practical implications. For example, in the famous Fama–French factor model (Fama and French, 1993) in finance, a non-diagonal sparse precision matrix for the idiosyncratic component characterizes interpretable market effects among returns of stocks at different levels, such as industrial segmentwise connections and intrinsic within-industry associations, say, among financial firms. Existence of such effects after removing the dynamics of the systematic component may result in a non-sparse Σ_u, yet sparse modelling can still be valid by exploring the sparse precision matrix Σ_u^{−1}.

    Joong‐Ho Won (Korea University, Seoul) and Woncheol Jang and Johan Lim (Seoul National University)

    We congratulate Fan and his colleagues for a stimulating paper in which they have made a substantial contribution to challenging problems in large covariance estimation.

As practitioners, we are most interested in the finite sample positive definiteness of the estimator proposed by the authors. They suggest using a scaling constant C in the threshold for the idiosyncratic covariance matrix estimator Σ̂_u^T and adjusting C to render its minimum eigenvalue positive. This idea leads to the univariate root finding procedure of expression (4.1). Although this procedure looks simple, it requires computing the minimum eigenvalue of a p×p matrix, which is computationally expensive by itself for even a modest value of p, for every value of C tried. Furthermore, altering C means that the thresholding must be recomputed in every iteration, changing the sparsity pattern of the initial Σ̂_u^T. Thus we are concerned that the resulting cost of solving expression (4.1) may not be so cheap, especially when the target function in it is not smooth (Fig. 1).

Here we consider an alternative procedure that ensures positive definiteness while preserving the initial sparsity pattern. First, project Σ̂_u^T onto a space of positive definite matrices. This can be done by solving
minimize ‖X − Σ̂_u^T‖_F subject to X ⪰ μI, (12)
for a matrix variable X and some μ>0. The solution to problem (12) is given by X* = Σ_i max(λ_i, μ) v_i v_i' for the spectral decomposition Σ̂_u^T = Σ_i λ_i v_i v_i' (Boyd and Vandenberghe, 2004). Second, replace the entries of X* that correspond to the zero-thresholded entries of Σ̂_u^T with 0. Repeat these two steps until convergence. This alternating projections procedure is guaranteed to converge, as both steps are projections onto convex sets (Boyd and Dattorro, 2003). The first step (12) requires a spectral decomposition of a p×p matrix, as in the root finding procedure, but the second step is free of comparisons with varying thresholds.
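A minimal numpy sketch of this alternating projections procedure (our own illustration of the two steps described above):

```python
import numpy as np

def alternating_projections(Sigma_u, mu=0.1, tol=1e-3, max_iter=100):
    """Alternate between (i) projecting onto {X : X - mu*I is positive
    semidefinite} by eigenvalue clipping and (ii) restoring the zero
    pattern of the thresholded input Sigma_u."""
    zero_mask = (Sigma_u == 0)
    X = Sigma_u.copy()
    for _ in range(max_iter):
        X_old = X.copy()
        lam, V = np.linalg.eigh(X)                 # step (i): problem (12)
        X = (V * np.maximum(lam, mu)) @ V.T
        X[zero_mask] = 0.0                         # step (ii): keep sparsity
        X = (X + X.T) / 2                          # guard symmetry
        if np.max(np.abs(X - X_old)) < tol:
            break
    return X
```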

We numerically compared the two procedures in a simple setting. The comparison was done for 100 data sets, each with n=50 samples of a p=100-dimensional standard normal vector. The results are summarized in Table 12. The root finding took roughly 1 min to converge, whereas the alternating projections converged in 2.5 s, adding little time over the ordinary POET-estimator (i.e. POET without adjustment for positive definiteness) and requiring less than half the iterations.

Table 12. Comparison of the root finding and the alternating projections procedures†

Procedure                 Computing time (s)   Minimum eigenvalue   Number of iterations to converge
POET                      2.34 (0.148)         <0                   —
Root finding              62.5 (9.93)          0.149 (0.054)        20.7 (4.14)
Alternating projections   2.43 (0.118)         0.0997 (0.000)       7.91 (0.71)
• †Numbers in cells are averages over 100 data sets, with empirical standard deviations in parentheses. The code is written in Octave 3.2.3 (Eaton, 2002) on a laptop computer (MacBook Air, 1.8 GHz i5 processor with 4 Gbytes memory). Covariance hard thresholding was used in the ordinary POET with C=0.1 and K=3. In the root finding, the Octave function fzero() was used to find C_min, the root of equation (4.1), starting from C=0.1; final thresholding was conducted at the value of C so obtained. In the alternating projections, the lower bound μ for the minimum eigenvalue was set to 0.1. Both procedures terminated when the value of C, or the estimated matrix, did not change up to the third digit after the decimal point.

The POET-method may be optimization free in theory, but the post hoc adjustment that makes the ordinary POET-estimator positive definite involves some numerical optimization anyway. A little more attention to this step may greatly improve the practicality of the proposed method.

    Lingzhou Xue (Princeton University) and Hui Zou (University of Minnesota, Minneapolis)

    We first congratulate Fan, Liao and Mincheva for their innovative and timely contribution to high dimensional covariance matrix estimation. POET is a statistically and computationally appealing method for estimating a large covariance matrix with a conditional sparsity structure. We discuss two alternative methods for estimating the error covariance matrix in POET.

    POET2 via positive definite adaptive thresholding estimation

POET uses adaptive thresholding estimation (Cai and Liu, 2011) on the principal orthogonal complement R̂ = (r̂_ij) to estimate the sparse error covariance matrix, namely
Σ̂_u^T = {s_ij(r̂_ij) I(|r̂_ij| ≥ τ_ij)},
where τ_ij is the entry-dependent threshold. In Section 4.1 Fan, Liao and Mincheva discussed the importance of choosing a proper threshold to guarantee the finite sample positive definiteness of Σ̂_u^T. POET chooses the threshold constant C in the range (C_min, M], where C_min is defined in expression (4.1). Xue et al. (2012) proposed a direct convex programme to deliver a positive definite thresholding covariance matrix estimator. We adopt the idea thereof to construct another positive definite adaptive thresholding estimator for POET. Specifically, we consider the following constrained ℓ1-minimization problem:
Σ̂_u^+ = arg min_{Σ: Σ ⪰ εI} ½‖Σ − R̂‖_F² + Σ_{i≠j} τ_ij|σ_ij|,
where ε>0 is some arbitrarily small constant. The alternating direction method of multipliers algorithm in Xue et al. (2012) can be easily modified to compute Σ̂_u^+. We introduce a new variable Θ and an equality constraint Σ=Θ, namely
min_{Σ,Θ: Θ ⪰ εI} ½‖Σ − R̂‖_F² + Σ_{i≠j} τ_ij|σ_ij| subject to Σ = Θ.
We minimize its augmented Lagrangian function for some given parameter ρ>0 (for simplicity we can fix ρ=1), i.e.
L(Θ,Σ;Λ) = ½‖Σ − R̂‖_F² + Σ_{i≠j} τ_ij|σ_ij| + ⟨Λ, Σ − Θ⟩ + (ρ/2)‖Σ − Θ‖_F².

We iteratively minimize L(Θ,Σ;Λ) over (Θ,Σ) by alternating minimization, and then we update the Lagrange multiplier Λ. The complete alternating direction method of multipliers algorithm proceeds as follows.

For i = 0,1,2,…: Θ step,
Θ^{i+1} = (Σ^i + ρ^{−1}Λ^i)_+;
Σ step,
Σ^{i+1} = ST{(R̂ + ρΘ^{i+1} − Λ^i)/(1+ρ)}, with entrywise thresholds τ_ij/(1+ρ);
Λ step,
Λ^{i+1} = Λ^i + ρ(Σ^{i+1} − Θ^{i+1}).

The two operators (·)_+ (projection onto {X: X ⪰ εI}) and ST(·) (entrywise soft thresholding) are defined in Xue et al. (2012).
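For concreteness, here is a numpy sketch of the iteration (our own illustration; the closed-form Σ step follows from the augmented Lagrangian above, and convergence properties are discussed in Xue et al. (2012)):

```python
import numpy as np

def poet2_admm(R, tau, eps=1e-3, rho=1.0, n_iter=200):
    """ADMM sketch for the constrained l1-minimization above.
    R: principal orthogonal complement; tau: matrix of entry-dependent
    thresholds (off-diagonal penalties only)."""
    p = R.shape[0]
    Sigma, Lam = R.copy(), np.zeros((p, p))
    T = tau / (1.0 + rho)
    np.fill_diagonal(T, 0.0)                 # diagonal is not penalized
    for _ in range(n_iter):
        # Theta step: eigenvalue clipping, the (.)_+ operator
        lam, V = np.linalg.eigh(Sigma + Lam / rho)
        Theta = (V * np.maximum(lam, eps)) @ V.T
        # Sigma step: entrywise soft thresholding, the ST(.) operator
        A = (R + rho * Theta - Lam) / (1.0 + rho)
        Sigma = np.sign(A) * np.maximum(np.abs(A) - T, 0.0)
        # Lambda step: dual update
        Lam = Lam + rho * (Sigma - Theta)
    return Theta
```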

We call the resulting estimator of Σ, with Σ̂_u^+ in place of the thresholded principal orthogonal complement, the POET2-estimator. We compared POET2 and POET by using simulation models 1–3 with T=200 and p=200 from Section 6.5.2. As can be seen from Table 13, the two versions of POET have very similar performance.

Table 13. Comparison of POET2 and POET in terms of average spectral norm loss over 100 replications (T=200, p=200)

                    Model 1           Model 2           Model 3
                    POET    POET2     POET    POET2     POET    POET2
‖Σ̂ − Σ‖            26.20   26.18     2.04    2.04      7.73    7.74
‖Σ̂⁻¹ − Σ⁻¹‖        1.31    1.30      2.07    2.06      8.48    8.50

    POET3 via principal orthogonal complement banding

If Σ_u is in fact bandable, another version of POET can use banding instead of thresholding to regularize the principal orthogonal complement. The bandable structure is widely used to model dependence between ordered variables. Given a banding parameter k, principal orthogonal complement banding yields R̂^B = {r̂_ij I(|i−j| ≤ k)}. To guarantee positive definiteness, we consider the eigendecomposition of R̂^B and replace its eigenvalues falling below a small ε>0 by ε, which defines a positive definite R̂^{B+}. The POET3-estimator of Σ is then defined by adding R̂^{B+} back to the retained principal components. We compared POET3 and POET by using simulation models 1 and 2. As shown in Table 14, POET3 performs better than POET by taking advantage of the bandable structure. However, POET3 is potentially better only when the bandable structure is reliable and the ordering information is accurate. Otherwise, POET (or POET2) should be preferred.
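A numpy sketch of this banding-plus-repair construction (our own reading of the steps above; the eigenvalue floor ε is an assumption):

```python
import numpy as np

def band_pd(R, k, eps=1e-3):
    """Band the principal orthogonal complement R at bandwidth k, then
    repair positive definiteness by flooring eigenvalues at eps."""
    p = R.shape[0]
    band = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) <= k
    Rb = np.where(band, R, 0.0)
    lam, V = np.linalg.eigh(Rb)
    return (V * np.maximum(lam, eps)) @ V.T
```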

Table 14. Comparison of POET3 and POET in terms of average spectral norm loss over 100 replications (T=200, p=200)

                    Model 1           Model 2
                    POET    POET3     POET    POET3
‖Σ̂ − Σ‖            26.20   25.76     2.04    1.68
‖Σ̂⁻¹ − Σ⁻¹‖        1.31    1.26      2.07    1.73

    The authors replied later, in writing, as follows.

We are very grateful to all contributors for their stimulating comments and questions on high dimensional covariance matrix estimation in the presence of common factors. They have touched many important issues, from theoretical understanding to methodological improvements and applications. Their contribution is important for a better understanding of the proposed POET-estimator. We shall not be able to resolve all points in this brief rejoinder. Indeed, the discussion can be seen as a collective research agenda for the future, and some items on that agenda have already been taken up by the discussants.

    Spiked eigenvalues

Several discussants (Critchley, Jung and Fine, Lam and Hu, Linton and Vogt, Onatski, and Yu and Samworth) gave detailed comments and questions regarding the spikiness of the eigenvalues. They express some concern that the assumed separation between the large and the remaining eigenvalues is too distinct. Their concerns are very relevant. If there are no large gaps between the large eigenvalues and the small ones, the systematic component of the covariance cannot even be differentiated from the idiosyncratic part in our factor model Σ = B cov(f_t)B' + Σ_u. We impose the pervasiveness of the factors through the assumption that the eigenvalues of the K×K matrix
p^{−1}B'B
are bounded away from both 0 and ∞ as p grows. The interpretation of this is very natural: the factors are common to the majority of the variables. Under this condition and the sparsity assumption on Σ_u, the first K eigenvalues of Σ are of order p whereas the remaining eigenvalues are bounded.

This pervasiveness is not the minimum condition to make the problem identifiable. As correctly pointed out by Jung and Fine, the spikiness of the eigenvalues of the low rank matrix BB' and the sparseness of Σ_u together play an important role in distinguishing the systematic and idiosyncratic components. As long as ‖Σ_u‖ is much smaller than λ_K(BB'), these two components can be distinguished. Of course, the rates of convergence depend on the size of the gaps and other parameters. For example, Yu and Samworth suggested a weaker version of the pervasiveness condition, which replaces p in the definition of the pervasiveness matrix with p^α for some α ∈ (0,1). With this weaker condition, all results should still go through, and carefully inspecting our technical proofs should yield the rates of convergence. In contrast, there is also recent literature that requires α=0 or replaces p^α with log(p), which corresponds to approximately 'sparse loading matrices' (Pati et al., 2012; Carvalho et al., 2009). See also the discussion by Pan and Peng for a novel approach. Intuitively, this allows for non-pervasive (weak) factors that have no effect on a non-negligible portion of the individuals. However, this brings more difficulty to estimating the number of spiked eigenvalues, and to identifying the low rank part from the idiosyncratic part, because the signal is too weak.

We agree wholeheartedly with H. Huang, Y. Liu, Marron, D. Shen and H. Shen that now is a good time to study asymptotic contexts in which the first K eigenvalues of Σ grow quickly. Indeed, exact sparsity appears rarely in applications, whereas conditional sparsity is likely to be relevant for many of them. Studying spiked eigenvalues amounts to exploring the main structure of the covariance matrix.

    We agree on the existence of weaker factors in applications (Lam and Hu, Linton and Vogt, and Onatski). These factors are usually difficult to differentiate from the idiosyncratic components and do not play a noticeable role without a large amount of data. We would like to add that our assumption on the spikiness of eigenvalues is imposed on the population covariance, not on the sample covariance matrix. Model diagnostics based on sample eigenvalues should be interpreted with care owing to large estimation errors in high dimensional matrices.

    Choice of the number of factors K

The gaps between the spiked eigenvalues and the remaining eigenvalues affect the choice of the number of factors K. Fryzlewicz and N. Huang, Lam and Hu, and other discussants carried out many interesting simulations on the issue of choosing K, the number of these spiked eigenvalues. In many of the contributors' simulations, the responses are not driven by a few common factors. In contrast, POET builds on the principal components analysis of the sample covariance matrix, whose first K eigenvalues grow at rate O(p). The existence of these spiked eigenvalues is implied by the pervasiveness condition for the common factors. The gap can be made smaller if Yu and Samworth's assumption is imposed instead. As pointed out by the discussants (H. Huang, Y. Liu, Marron, D. Shen and H. Shen) and Shen et al. (2012), the existence of spiked eigenvalues is necessary to achieve principal components analysis consistency in the high dimension, low sample size context.

One can apply standard testing procedures to test for the existence of spiked eigenvalues (e.g. Onatski (2009)), and consistently estimate the number of these eigenvalues by
K̂ = arg min_{0≤k≤M} [log{p^{−1} tr(R̂_k)} + k IC(p,T)],
where R̂_k is the orthogonal complement of the sample covariance matrix after taking out its first k principal components, M is a given upper bound and IC(p,T) is one of the information criteria penalties in Bai and Ng (2002). However, if there is no gap among the eigenvalues, then there are either no pervasive common factors (all eigenvalues are small) or too many common factors (most eigenvalues are very large). In the first case, the consistent method will estimate K as 0. In the latter case, factor analysis is inappropriate because it does not effectively reduce the dimension. Pan and Peng suggested a new way of choosing K by comparing the sample eigenvalues with a given threshold. Their method should work well even when the factors are weak. Tests based on the gaps among eigenvalues were also proposed by Onatski (2009).
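As an illustration, here is a numpy sketch of the information-criterion rule above (our own; it assumes the IC_{p1} penalty g(p,T) = {(p+T)/(pT)} log{pT/(p+T)} from Bai and Ng (2002) and uses the identity tr(R̂_k) = Σ_{i>k} λ̂_i):

```python
import numpy as np

def choose_K(X, M):
    """Choose the number of factors by minimizing
    log{tr(R_k)/p} + k * g(p, T) over 0 <= k <= M (M < p assumed);
    X is a T x p data matrix."""
    T, p = X.shape
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
    g = (p + T) / (p * T) * np.log(p * T / (p + T))
    ic = [np.log(lam[k:].sum() / p) + k * g for k in range(M + 1)]
    return int(np.argmin(ic))
```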

We assume that K is fixed in the current version of the paper. Our working paper version allows K to grow with p, but K is still assumed to be known. We agree that allowing a growing and unknown K would be a good extension, as commented by Ferreira, and the first step should be to estimate it consistently. This can be done by carefully reviewing the proofs in Bai and Ng (2002). Successfully solving this problem will also contribute to the literature on approximate factor models.

    Generalized dynamic factor model and spectral matrix

Our paper was written for modelling large covariance matrices in genomics and finance. The former typically assumes that data are collected independently across the population, whereas the latter assumes that markets are efficient, so past data play limited roles in asset returns. Hence, a conditional sparsity structure (conditioning here specifically means taking the linear dependence out) is imposed without considering time-lagged variables.

As correctly pointed out by Hallin and Pourahmadi, allowing lagged factors is important for other applied time series problems. In addition, Farnè and Pourahmadi raise the question of estimating the spectral matrix and of frequency domain analysis. Indeed, apart from requiring second-order stationarity, the generalized dynamic factor model holds without further assumptions. The method of Forni et al. (2000) should naturally extend POET to frequency domain analysis, which would also enable us to estimate the autocovariance matrices and the spectral matrix. This will further broaden the scope of POET in both theory and application.

By expanding the vector of state variables, POET can be used to construct factors that depend on the lagged variables. For a given lag q, let
z_t = (y_t', y_{t−1}', …, y_{t−q}')'.

The sample covariance matrix of z_t involves the cross-covariance matrices at lags up to q. An application of POET to the data vector z_t would yield factors that are constructed on the basis of the present and past data.
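A small sketch of the stacking (our own illustration; Y is taken to be a T×p array of observations):

```python
import numpy as np

def stack_lags(Y, q):
    """Form z_t = (y_t', y_{t-1}', ..., y_{t-q}')' from a T x p series;
    POET can then be applied to the (T-q) x p(q+1) stacked data."""
    T = Y.shape[0]
    return np.hstack([Y[q - j: T - j] for j in range(q + 1)])
```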

Holzmann and Leister provided an interesting derivation for estimating the lagged covariance under the max-norm. In terms of the rate of convergence under the max-norm, knowing the factor structure does not yield a significant improvement over the sample covariance. This has been demonstrated and explained earlier in Fan et al. (2008).

    Alternative and related methods

We thank several discussants for suggesting useful alternative steps to improve POET. Won, Jang and Lim propose an iterative procedure to produce a finite sample positive definite error covariance matrix. A similar and conceptually simpler method is given in the contribution of Xue and Zou. Zhang and Peng also propose an iterative version of our method. The cost of these potential improvements is the loss of the simplicity of the POET-method. Onatski's linear shrinkage of the principal orthogonal complement provides a useful alternative approach to regularizing the error covariance matrix.

Fryzlewicz and N. Huang suggest an interesting alternative covariance estimator based on the aggregation of the sample covariance matrix (unbiased) and a regularized covariance matrix (biased but with low variance). This aggregation allows a trade-off between bias and variance in the estimation. Although NOVELIST works well in their simulations, a theoretical understanding of the procedure is needed. For example, in the approximate factor model, Σ is a dense matrix; hence it is difficult to explain why thresholding is applied to the sample covariance matrix, which should also be dense.

Critchley, Dang and Yu, and Gijbels, Herrmann and Verhasselt recommend robust principal components analysis to protect against outliers, and possible extensions to deal with non-stationary time series. The transformed Kendall's τ rank correlation matrix suggested by H. Liu and Wang provides an answer to the robustness issue concerning heavy-tailed errors, raised by Linton and Vogt. Bouveyron proposes alternative methods based on penalization, and suggests use of the covariance matrix approximation approach in Bouveyron et al. (2007). The effectiveness of the proposed approach for high dimensional covariance matrices hinges on good approximations and remains to be seen. We emphasize that POET works particularly well in the presence of common factors, as it takes out a few principal components in the first step of the singular value decomposition. Moreover, being based on the singular value decomposition, POET is optimization free except for choosing the number of factors (which involves a one-dimensional optimization). It is also adaptive to locally stationary processes through time localization and time domain smoothing.

As pointed out by Gijbels, Herrmann and Verhasselt, and by Ferreira, the issue of choosing the tuning parameters for thresholding is important in practice and arises in all regularization procedures. Besides the method suggested by Gijbels and her colleagues, additional research is still needed.

    Extensions

Xue and Zou suggest a banded orthogonal complement estimator to deal with banded idiosyncratic components. This case is asymptotically nested in POET because thresholding can also produce a banded matrix. But, if the structure is indeed 'conditionally banded' (given the common factors), their suggested method should improve the finite sample performance. Technically, it is not difficult to achieve similar rates of convergence in this case.

It is also interesting to work with a sparse inverse idiosyncratic covariance matrix, as suggested by Tang and Fan, and by H. Liu and Wang. Tang and Fan also mention a couple of interesting applications that fit into this case. Under high dimensionality, estimating a sparse precision matrix usually involves optimizations that may introduce some computational burden. H. Liu and Wang suggest a viable column-by-column penalized square-root lasso method to explore sparsity in inverse idiosyncratic covariance matrices, which is insensitive to the tuning parameter. Lam and Hu suggest an idea to deal with weak factors, which complements our method.

    Applications

One of the immediate applications of POET is portfolio allocation, as commented by Linton and Vogt, and by Gijbels and her colleagues, because the problem crucially depends on estimating a high dimensional covariance matrix, and financial returns are often driven by a few common factors. Once the volatility matrix has been well estimated, we can proceed to portfolio selection via the Markowitz framework. We agree with Linton and Vogt that sparse portfolio allocation is another interesting idea to enhance the stability and performance of portfolios. It has been thoroughly studied by Jagannathan and Ma (2003) and Fan et al. (2012).

We appreciate the detailed comments provided by Bailey, Pesaran and Yamagata and by Coad and Maruri-Aguilar on testing the multifactor capital asset pricing model. When the POET-estimator is used to replace urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1223, if we simply bound the estimation error by
urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1224
then indeed the upper bound is not o_p(1) unless p log(p) ≪ T, even when the factors are observable. We would like to note that the above upper bound is too crude. To show that the estimation error is asymptotically negligible, we should not separate urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1226 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1227, because urn:x-wiley:13697412:media:rssb12016:rssb12016-math-1228 is a weighted estimation error. A more careful investigation of this term can yield an improved rate of convergence. However, we wholeheartedly agree with the notion in Pesaran and Yamagata (2012) that ignoring the correlation structure in constructing test statistics yields more stable test statistics, whose sizes can be more accurately determined.

Pan and Peng, and J. Huang and Zhou connect POET with applications to high dimensional linear regression. Indeed, when the regressors depend on a few common factors, POET can be applied to estimate their joint covariance, which will help variable selection and prediction. J. Huang and Zhou suggest the use of POET to improve sure independence screening in Fan and Lv (2008), and this can be a fruitful direction to pursue. Sparse principal components can also be used as predictors. Further research along this line is required. We thank H. Huang, Y. Liu, Marron, D. Shen and H. Shen for reminding us of the literature on SigClust. POET works well in dealing with spiked eigenvalues, so we can foresee the success of combining POET with SigClust or other methods to estimate the principal eigenvalues in high dimension, low sample size contexts.

Ahmad, Hashmi and Halawani suggest genomics applications as an interesting test bed for POET. We agree with them. In fact, Fan and Han (2013) address large-scale hypothesis testing problems in considerable detail, including applications to the type of genomic data that Ahmad and his colleagues suggested.

POET is very fast to compute. We are also excited to learn about the potential applications of POET to the spatial point processes suggested by Mateu, the source localization problems discussed by Jian Zhang and the computer experiments suggested by Coad and Maruri-Aguilar, among others. They open broad areas where POET can be successfully applied to achieve important scientific discoveries.

    Comments

    Various contributors raise excellent comments. For brevity, some of them have been partially answered above, and many can be seen as a good research agenda.

We appreciate that Kent and Critchley remind us of the issues of invariance and equivariance under affine transformations. We agree that, if the measurement units change, the principal components will change, and hence POET is not affine equivariant. In many applications, such as finance and genomics, the measurement units are comparable, and affine transformations would create interpretation problems. If the units used are a concern, one can apply POET to the correlation matrix instead; the sparsity of Σ_u remains intact. The equivariance issue in high dimensional problems is very challenging and is at a very different level of detail from high dimensional data analysis. If affine transformations are considered, as in Tyler et al. (2009), sparsity should be imposed to enhance interpretability.
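A small sketch of this unit-free variant (our own illustration; poet_fn stands for any POET-type estimator, such as the sketch given earlier in the discussion):

```python
import numpy as np

def poet_on_correlation(X, poet_fn, **kw):
    """Apply a POET-type estimator on the correlation scale and map the
    estimate back to the covariance scale."""
    d = X.std(axis=0, ddof=1)                      # per-variable scales
    R_hat = poet_fn((X - X.mean(axis=0)) / d, **kw)
    return R_hat * np.outer(d, d)
```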

Viroli makes an interesting comment on the dual problem of estimating the factors and the loadings, corresponding to carrying out the principal components analysis on either the p×p or the T×T product of the data matrix; both give the same estimate of the common components. Theorem 1 in the paper is similar to what was obtained by Stock and Watson (2002b), but the result presented here allows any K ≤ p, not just the true number of factors.

Montanari is concerned about the precision of the estimated factors in our empirical illustration, where p=50. As every time series loads on the same common factors, 50 series contain enough information to estimate the factors and therefore the idiosyncratic components. So the heat maps presented in the paper indeed demonstrate the sparsity of Σ_u. We feel that the spurious correlations created by inaccurately estimated factors should be small. Besides principal components analysis, one can use a quasi-maximum-likelihood method, which is typically used for classical factor analysis (Lawley and Maxwell, 1971) and is also consistent under high dimensionality (Bai and Li, 2012).

Gilks gives an interesting example where each off-diagonal entry of Σ_u is a non-zero constant with probability π. This is a Bayesian perspective in which the population covariance is generated according to some probability distribution. From this perspective, in many applied problems the probability of being 0 for each off-diagonal entry should not be a universal constant but should vary over the entries. For instance, correlations between companies in the same industry may have smaller probabilities of being 0 than those across industries. Hence we can write π_ij for each position (i,j) and require that π_ij → 0 fast for most of the (i,j)s.

Frommlet comments on our simulation results where Σ is generated from a cross-sectional auto-regressive AR(1) process instead of a factor model. We do not claim that POET can solve all large covariance estimation problems, but we indicate its power by first controlling a few relatively large eigenvalues. Since POET is reasonably robust to overestimation of the number of factors, it also works well with sparse matrices. This explains why it works well with the AR(1) covariance structure, which is effectively sparse.

    Conclusion

In summary, the contributors have provided a wide range of discussions about many aspects of estimating a high dimensional covariance matrix. Many of the applications suggested have motivated exciting opportunities for interdisciplinary collaborations. We are very pleased to have exchanged ideas and very much look forward to new tools in this important research area. We conclude by reiterating our thanks to all the contributors, and to the Royal Statistical Society and the journal for hosting this forum.

Appendix A: Estimating a sparse covariance with contaminated data

We estimate Σ_u by applying the adaptive thresholding given by expression (2.11). However, the task here is slightly different from the standard problem of estimating a sparse covariance matrix in the literature, as no direct observations of {u_t} are available. In many cases the original data are contaminated; this includes any situation in which the data themselves must be estimated because direct observations are unavailable. This typically happens when u_t represents the error terms in regression models or when the data are subject to measurement errors. Instead, we may only observe an estimate û_t. For instance, in the approximate factor model, û_t = y_t − B̂f̂_t.

We can estimate Σ_u by using the adaptive thresholding proposed by Cai and Liu (2011): for the entry-dependent threshold τ_ij = Cω√(θ̂_ij), with a rate ω specified in theorem 5 below, define
σ̂_ij = T^{−1} Σ_{t=1}^T û_it û_jt,  θ̂_ij = T^{−1} Σ_{t=1}^T (û_it û_jt − σ̂_ij)²,
Σ̂_u^T = {s_ij(σ̂_ij) I(|σ̂_ij| ≥ τ_ij)}_{p×p}, (A.1)
where the shrinkage function s_ij(·) satisfies, for all z, |s_ij(z) − z| ≤ τ_ij, and s_ij(z) = 0 when |z| ≤ τ_ij.
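A numpy sketch of estimator (A.1) with hard thresholding as the shrinkage function (our own illustration; hard thresholding, s_ij(z) = z on {|z| ≥ τ_ij}, satisfies the stated conditions):

```python
import numpy as np

def adaptive_threshold(U, C, omega):
    """Estimator (A.1) from estimated residuals U (a T x p matrix),
    with tau_ij = C * omega * sqrt(theta_ij). Builds a T x p x p array,
    so it is meant for moderate p only."""
    T, p = U.shape
    Sig = U.T @ U / T                              # sigma_ij
    prods = U[:, :, None] * U[:, None, :]          # u_it * u_jt
    Theta = ((prods - Sig) ** 2).mean(axis=0)      # theta_ij
    tau = C * omega * np.sqrt(Theta)
    Sig_T = np.where(np.abs(Sig) >= tau, Sig, 0.0) # hard thresholding
    np.fill_diagonal(Sig_T, np.diag(Sig))          # keep the diagonal
    return Sig_T
```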

When û_it is sufficiently close to u_it, we can show that Σ̂_u^T is also consistent. The following theorem extends the standard thresholding results of Bickel and Levina (2008) and Cai and Liu (2011) to the case where no direct observations are available, or the original data are contaminated. For the tail and mixing parameters urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0574 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0575 that are defined in assumptions 2 and 3, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0576.

Theorem 5. Suppose that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0577 and that assumptions 2 and 3 hold. In addition, suppose that there is a sequence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0578 such that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0579 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0580; then there is a constant C>0 in the adaptive thresholding estimator (A.1) with

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0581
    such that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0582

    If further urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0583, then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0584 is invertible with probability approaching 1, and

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0585

Proof. By assumptions 2 and 3, the conditions of lemmas A.3 and A.4 of Fan et al. (2011a) are satisfied. Hence, for any ɛ>0, there are positive constants urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0586 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0587 such that each of the events

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0588
occurs with probability at least 1−ɛ. By the condition on the threshold function, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0589. Now, for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0590, under the event urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0591,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0592

    Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0593. Then, with probability at least 1−2ɛ, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0594. Since ɛ is arbitrary, we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0595. If, in addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0596, then the minimum eigenvalue of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0597 is bounded away from 0 with probability approaching 1 since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0598. This then implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0599.

Appendix B: Proofs for Section 2

We first cite two useful results, which are needed to prove propositions 1 and 2. In lemma 1 below, let λ_1 ≥ … ≥ λ_p be the eigenvalues of Σ in descending order and ξ_1,…,ξ_p be their associated eigenvectors. Correspondingly, let λ̂_1 ≥ … ≥ λ̂_p be the eigenvalues of Σ̂ in descending order and ξ̂_1,…,ξ̂_p be their associated eigenvectors.

Lemma 1.

1. (Weyl's theorem) |λ̂_j − λ_j| ≤ ‖Σ̂ − Σ‖.
2. (sin(θ) theorem; Davis and Kahan (1970))
  ‖ξ̂_j − ξ_j‖ ≤ √2 ‖Σ̂ − Σ‖ / min(|λ̂_{j−1} − λ_j|, |λ_j − λ̂_{j+1}|).

    B.1. Proof of proposition 1

Since λ_1 ≥ … ≥ λ_p are the eigenvalues of Σ and {λ_j(BB')}_{j≤K} are the first K eigenvalues of BB' (the remaining p−K eigenvalues are 0), by Weyl's theorem, for each j ≤ K,
|λ_j − λ_j(BB')| ≤ ‖Σ_u‖.

For j > K, λ_j ≤ ‖Σ_u‖. However, the first K eigenvalues of BB' are also the eigenvalues of B'B. By the assumption, the eigenvalues of p^{−1}B'B are bounded away from 0. Thus, when j ≤ K, the ratios λ_j(BB')/p are bounded away from 0 for all large p.

    B.2. Proof of proposition 2

    Applying the  sin (θ) theorem yields
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0615

For a generic constant c>0, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0616 for all large p, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0617 but urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0618 is bounded by proposition 1. However, if j<K, the same argument implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0619. If j=K, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0620, where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0621 is bounded away from 0, but urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0622. Hence, again, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0623.

    B.3. Proof of theorem 1

The sample covariance matrix of the residuals from the least squares method is given by
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0624
    where we used the normalization condition urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0625 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0626. If we show that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0627, then from the decompositions of the sample covariance,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0628
    we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0629. Consequently, applying thresholding on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0630 is equivalent to applying thresholding on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0631, which gives the desired result.

We now show that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0632 indeed holds. Consider again the least squares problem (2.8) but with the following alternative normalization constraints: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0633, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0634 is diagonal. Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0635 be the solution to the new optimization problem. Switching the roles of B and F, the solution of problem (2.10) is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0636 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0637. In addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0638. From urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0639, it follows that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0640.

Appendix C: Proofs for Section 3

We proceed by proving theorems 4, 2 and 3 in turn.

    C.1. Preliminary lemmas

The following results will be used subsequently. The proofs of lemmas 2, 3 and 4 can be found in Fan et al. (2011a).

Lemma 2. Suppose that A and B are symmetric positive semidefinite matrices, and that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0641 for a sequence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0642. If urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0643, then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0644, and

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0645

Lemma 3. Suppose that the random variables urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0646 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0647 both satisfy the exponential-type tail condition: there exist urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0648, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0649 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0650, such that, ∀s>0,

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0651
    Then, for some urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0652 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0653, and any s>0,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0654(C.1)

Lemma 4. Under the assumptions of theorem 2,

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0655,
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0656 and
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0657.

Lemma 5. Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0658 denote the Kth largest eigenvalue of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0659; then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0660 with probability approaching 1 for some urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0661.

Proof. First, by proposition 1, under assumption 1, the Kth largest eigenvalue urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0662 of Σ satisfies, for some c>0,

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0663
for sufficiently large p. Using Weyl's theorem, we need only prove that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0664. Without loss of generality, we prove the result under the identifiability condition (2.1). Using model (1.2), urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0665. Using this and model (1.3), urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0666 can be decomposed as the sum of the four terms
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0667

    We now deal with them term by term. We shall repeatedly use the fact that, for a p×p matrix A,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0668
    First, by lemma 4, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0669, which is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0670 if K log(p)=o(T). Consequently, by assumption 1, we have
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0671
    We now deal with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0672. It follows from lemma 4 that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0673
    Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0674, it remains to deal with urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0675, which is bounded by
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0676
    which is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0677 since log(p)=o(T).

Lemma 6. Under assumption 3, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0678.

Proof. Since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0679 is weakly stationary, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0680. In addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0681 for some constant M and any i and t, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0682 has an exponential-type tail. Hence, by Davydov's inequality (corollary 16.2.4 of Athreya and Lahiri (2006)), there is a constant C>0 such that, for all i≤p and t≤T, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0683, where α(t) is the α-mixing coefficient. By assumption 3, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0684. Thus, uniformly in T,

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0685

    C.2. Proof of theorem 4

Our derivation below relies on a result of Bai and Ng (2002), which shows that the estimated number of factors is consistent, in the sense that K̂ equals the true K with probability approaching 1. Note that, under our assumptions 1–4, all the assumptions in Bai and Ng (2002) are satisfied. Thus we immediately have the following lemma.

Lemma 7 (theorem 2 in Bai and Ng (2002)). For K̂ defined in expression (2.14),

P(K̂ = K) → 1 as T, p → ∞.

Proof. See Bai and Ng (2002).

    Using expression (A.1) in Bai (2003), we have the identity
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0689(C.2)
    where urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0690, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0691 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0692.

    We first prove some preliminary results in the following lemmas. Denote urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0693.

Lemma 8. For all urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0694,

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0695,
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0696,
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0697 and
    4. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0698.

    Proof.

    1. We have, ∀i, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0699. By the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0700
    2. By lemma 6, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0701, which then yields the result.
    3. By the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0702
    4. Note that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0703. By assumption 4, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0704, which implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0705 and yields the result.
    5. By definition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0706. We first bound urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0707. Assumption 4 implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0708. Therefore, by the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0709
    6. Similarly to part (c), noting that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0710 is a scalar, we have
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0711
      where the last line follows from the Cauchy–Schwarz inequality.

    Lemma 9.

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0712,
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0713,
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0714 and
    4. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0715.

    Proof.

    1. By the Cauchy–Schwarz inequality and the fact that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0716,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0717
      The result then follows from assumption 3.
    2. By the Cauchy–Schwarz inequality,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0718
      It follows from assumption 4 that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0719. It then follows from Chebyshev's inequality and Bonferroni's method that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0720.
3. By assumption 4, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0721. Chebyshev's inequality and Bonferroni's method yield urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0722 with probability approaching 1, which then implies
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0723
    4. By the Cauchy–Schwarz inequality and assumption 4, we have demonstrated that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0724. In addition, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0725, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0726. It follows that
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0727

    Lemma 10.

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0728.
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0729.
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0730.

Proof. We prove this lemma conditioning on the event urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0731. Once this has been done, because urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0732, the unconditional statements then follow.

    1. When urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0733, by lemma 5, all the eigenvalues of V/p are bounded away from 0. Using the inequality urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0734 and identity (C.2), we have, for some constant C>0,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0735
      Each of the four terms on the right‐hand side are bounded in lemma 8, which then yields the desired result.
2. Part (b) follows from part (a) and
  urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0736
3. Part (c) is implied by identity (C.2) and lemma 9.

    Lemma 11.

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0737.
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0738.

Proof. We first condition on the event urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0739.

1. Lemma 5 implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0740. Also urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0741. In addition, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0742. It then follows from the definition of H that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0743. Define urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0744. Applying the triangle inequality gives
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0745(C.3)
    2. By lemma 4, the first term in inequality (C.3) is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0746. The second term of inequality (C.3) can be bounded, by the Cauchy–Schwarz inequality and lemma 10, as follows:
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0747
3. Still conditioning on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0748, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0749 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0750, right multiplying by H gives urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0751. Part (a) also gives, conditioning on urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0752, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0753. Hence further left multiplying by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0754 yields urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0755. Because urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0756, we reach the desired result.

    C.2.1. Completion of proof of theorem 4

    The second part of theorem 4 was proved in lemma 10. We now derive the convergence rate of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0757.

    Using the facts that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0758, and that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0759, we have
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0760(C.4)
    We bound the three terms on the right‐hand side. It follows from lemmas 4 and 11 that
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0761
    For the second term, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0762. Therefore, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0763. The Cauchy–Schwarz inequality and lemma 10 imply
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0764

    Finally, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0765 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0766 imply that the third term is urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0767.
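    The three‐term structure of expression (C.4) is the standard add‐and‐subtract decomposition of a product estimate; in generic form (our illustration, with $\hat a$ and $\hat b$ estimating $a$ and $b$),
    $$\hat a'\hat b-a'b=(\hat a-a)'(\hat b-b)+(\hat a-a)'b+a'(\hat b-b),$$
    and each of the three error terms is bounded separately, exactly as is done above.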

    C.2.2. Proof of corollary 1

    Under assumption 3, it can be shown by Bonferroni's method that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0768. By theorem 4, uniformly in i and t,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0769
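    The uniform (in i and t) rates of corollary 1 are the kind of control that licenses thresholding the principal orthogonal complement. For concreteness, a minimal numerical sketch of the POET construction follows, assuming the standard form of the estimator (leading principal components of the sample covariance plus a soft‐thresholded orthogonal complement); the dimensions, data‐generating process and threshold constant are illustrative, not taken from the paper.

    import numpy as np

    # A minimal POET-style sketch: PCA on the sample covariance, then
    # soft-thresholding of the principal orthogonal complement.
    rng = np.random.default_rng(1)
    p, T, K = 100, 200, 3
    B = rng.standard_normal((p, K))                # loadings (illustrative)
    F = rng.standard_normal((T, K))                # latent factors
    Y = F @ B.T + rng.standard_normal((T, p))      # observed T x p panel

    S = np.cov(Y, rowvar=False)                    # p x p sample covariance
    vals, vecs = np.linalg.eigh(S)                 # eigenvalues in ascending order
    lead = (vecs[:, -K:] * vals[-K:]) @ vecs[:, -K:].T  # leading-K component
    R = S - lead                                   # principal orthogonal complement

    tau = 2.0 * np.sqrt(np.log(p) / T)             # illustrative threshold level
    R_thr = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)  # soft threshold
    np.fill_diagonal(R_thr, np.diag(R))            # leave the diagonal unthresholded
    Sigma_poet = lead + R_thr                      # POET-type covariance estimator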

    C.3. Proof of theorem 2

    Lemma 12. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0770, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0771.

    Proof. We have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0772. Therefore, using the inequality urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0773, we have

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0774
    The first part of lemma 12 then follows from theorem 4 and lemma 10. The second part follows from corollary 1.

    C.3.1. Completion of proof of theorem 2

    Theorem 2 follows immediately from theorem 5 and lemma 12.

    C.4. Proof of theorem 3

    Define
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0775

    Lemma 13.

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0776, and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0777.
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0778.
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0779.
    4. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0780.

    Proof.

    1. We have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0781. Moreover, since all the eigenvalues of Σ are bounded away from 0, for any matrix A, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0782 (a sketch of this norm inequality, under an assumed form of the weighted norm, follows the proof). Hence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0783.
    2. By theorem 2, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0784.
    3. The same argument as in the proof of theorem 2 of Fan et al. (2008) implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0785. Thus urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0786 is bounded above by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0787.
    4. Again, by urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0788 and lemma 11,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0789(C.5)
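    For readers reconstructing part 1: if the weighted norm has the entropy‐loss form $\|A\|_\Sigma=p^{-1/2}\|\Sigma^{-1/2}A\Sigma^{-1/2}\|_F$ (an assumed reconstruction, since the displays above are not recoverable), then the eigenvalue condition yields
    $$\|A\|_\Sigma\le p^{-1/2}\,\|\Sigma^{-1/2}\|_2^2\,\|A\|_F=p^{-1/2}\,\lambda_{\min}(\Sigma)^{-1}\,\|A\|_F,$$
    by applying $\|BC\|_F\le\|B\|_2\|C\|_F$ twice.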

    C.4.1. Proof of theorem 3, part (a)

    By lemma 13, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0790. Hence, for a generic constant C>0,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0791

    Lemma 14. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0792.

    Proof. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0793. Hence

    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0794(C.6)

    Lemma 15. If urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0795, then with probability approaching 1, for some c>0,

    1. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0796,
    2. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0797,
    3. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0798 and
    4. urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0799.

    Proof.

    1. By lemma 11, with probability approaching 1, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0800 is bounded away from 0. Hence,
      urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0801
    2. The result follows from part (a) and lemma 14. Parts (c) and (d) follow from arguments similar to that for part (a), together with lemma 11 (the elementary spectral bound behind these steps is recorded after the proof).
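    The step from "bounded away from 0" to a bound on the inverse is the elementary spectral fact, stated here for completeness: if $A$ is symmetric positive definite with $\lambda_{\min}(A)\ge c>0$, then
    $$\|A^{-1}\|_2=\lambda_{\min}(A)^{-1}\le c^{-1}.$$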

    C.4.2. Completion of proof of theorem 3

    We derive the rate for urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0802. Define
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0803
    Note that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0804 and urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0805. The triangle inequality gives
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0806
    Using the Sherman–Morrison–Woodbury formula, we have urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0807, where
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0808(C.7)
    We bound each of the six terms. First, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0809 is bounded by theorem 2. Let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0810; then
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0811
    Note that theorem 2 implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0812. Lemma 15 then implies that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0813. This shows that urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0814. Similarly urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0815. In addition, since urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0816, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0817. Similarly urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0818. Finally, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0819. By lemma 15, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0820. Then, by lemma 14,
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0821
    Consequently, urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0822. Adding up urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0823–urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0824 gives
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0825
    On the other hand, applying the Sherman–Morrison–Woodbury formula again gives
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0826
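    The Sherman–Morrison–Woodbury formula used in both displays is the standard identity; in the factor‐covariance form relevant here (our instantiation, with cov(f) denoting the factor covariance matrix) it reads
    $$\{B\,\mathrm{cov}(f)B'+\Sigma_u\}^{-1}=\Sigma_u^{-1}-\Sigma_u^{-1}B\{\mathrm{cov}(f)^{-1}+B'\Sigma_u^{-1}B\}^{-1}B'\Sigma_u^{-1},$$
    which trades a p×p inversion for a K×K one. A short numerical check of the identity (illustrative dimensions; cov(f)=I):

    import numpy as np

    # Verify the Sherman-Morrison-Woodbury identity in the factor-model form
    # above; the dimensions and the diagonal Sigma_u are illustrative choices.
    rng = np.random.default_rng(0)
    p, K = 40, 3
    B = rng.standard_normal((p, K))               # loadings
    Sigma_u = np.diag(rng.uniform(0.5, 2.0, p))   # error covariance (diagonal here)

    Su_inv = np.diag(1.0 / np.diag(Sigma_u))
    woodbury = Su_inv - Su_inv @ B @ np.linalg.inv(
        np.eye(K) + B.T @ Su_inv @ B) @ B.T @ Su_inv
    direct = np.linalg.inv(B @ B.T + Sigma_u)
    assert np.allclose(woodbury, direct)          # identity holds numerically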

    C.4.3. Completion of proof of theorem 3: urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0827

    We first bound urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0828. Repeated use of the triangle inequality yields
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0829
    Next, let urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0830 be the (i,j)th entry of urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0831. Then urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0832.
    urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0833

    Hence urn:x-wiley:13697412:media:rssb12016:rssb12016-math-0834. The result then follows immediately.
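    Entrywise bounds on matrix products of the kind used here typically follow from the Cauchy–Schwarz inequality: writing $a_i'$ for the rows of $A$ and $b_j$ for the columns of $B$ (generic matrices, our illustration),
    $$\max_{i,j}|(AB)_{ij}|=\max_{i,j}|a_i'b_j|\le\max_i\|a_i\|\,\max_j\|b_j\|.$$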
