Right singular vector projection graphs: fast high dimensional covariance matrix estimation under latent confounding

We consider the problem of estimating a high dimensional p×p covariance matrix Σ, given n observations of confounded data with covariance Σ+ΓΓT , where Γ is an unknown p×q matrix of latent factor loadings. We propose a simple and scalable estimator based on the projection onto the right singular vectors of the observed data matrix, which we call right singular vector projection (RSVP). Our theoretical analysis of this method reveals that, in contrast with approaches based on the removal of principal components, RSVP can cope well with settings where the smallest eigenvalue of ΓTΓ is relatively close to the largest eigenvalue of Σ, as well as when the eigenvalues of ΓTΓ are diverging fast. RSVP does not require knowledge or estimation of the number of latent factors q, but it recovers Σ only up to an unknown positive scale factor. We argue that this suffices in many applications, e.g. if an estimate of the correlation matrix is desired. We also show that, by using subsampling, we can further improve the performance of the method. We demonstrate the favourable performance of RSVP through simulation experiments and an analysis of gene expression data sets collated by the GTEX consortium.


Introduction
Suppose that a random vector w ∈ R p follows a multivariate normal distribution with covariance matrix Σ: w ∼ N p .μ w , Σ/: Given n independent and identically distributed (IID) copies of w whose rows form a data matrix W ∈ R n×p it is often of interest to estimate either Σ, or certain quantities that are derived from this such as the precision matrix Ω := Σ −1 or collections of conditional independences that may then be used to infer causal structure (Spirtes et al., 2000).
Suppose now that we cannot observe W directly, but we instead observe n IID copies of a random vector x which form the rows of X ∈ R n×p ; x is related to w through x = w + Γh: .1/ Here h ∈ R q is a vector of unobserved latent random variables, and Γ ∈ R p×q a fixed matrix of loadings. If we assume that h is normally distributed, without loss of generality we may take h ∼ N q .μ h , I/. We then have that the covariance Θ of the observed x contains a contribution ΓΓ T from latent confounding and a contribution Σ from idiosyncratic noise: If we simply ignore the confounding, we shall have the covariance Θ as the target of inference instead of Σ, and the two can be very different.
Applications where such confounding is important in practice include the following.
(a) Cell biology: the activities of proteins and messenger ribonucleic acid, for example, can be confounded by environmental factors. Two highly correlated protein activities are thus not necessarily close in a causal network (Leek and Storey, 2007;Stegle et al., 2012). (b) Financial assets: the returns of various stocks will be confounded by some latent factors (such as general market movement or sector influences) without the covariance necessarily revealing anything about causal connections between companies (Menchero et al., 2010). (c) Confounding in biology and genetics can also occur due to technical malfunction and laboratory effects (Gagnon-Bartsch et al., 2013).
Thus, in various settings, to infer meaningful connections between variables we would like to remove the effect of confounding from the empirical covarianceΘ of X and to estimate Σ.
As well as the intrinsic ill-posedness of the problem of separating Σ from a noisy observation of Σ + ΓΓ T with Γ unknown, a further challenge in the applications above and many others is that the dimension p may be very large indeed; of the order of thousands or more. This high dimensionality brings computational difficulties that must be addressed by any practical procedure.
For Σ to be identifiable, appropriate assumptions on both Σ and Γ must be made. One natural assumption is that the minimum eigenvalue γ l of Γ T Γ is larger than the largest eigenvalue σ u of Σ. In this setting, a popular strategy to deal with unwanted confounding is removal of top principal components fromΘ. This has been proposed in Gagnon-Bartsch et al. (2013) and Fan et al. (2013). The latter work, a discussion paper in the Journal of the Royal Statistical Society, Series B, shows that, when σ u is bounded and γ l = O.p/, and so the gap between the quantities is large, Σ may be recovered consistently. In this case the top q eigenvalues ofΘ will be well separated from the rest, and so exactly q principal components can be removed fromΘ: this is important as removing too many or too few principal components can result in a poor estimate.
However, as several discussants of Fan et al. (2013) pointed out, in many settings empirical covariances do not display well-separated eigenvalues even when latent factors are known to be present. When the gap between σ u and γ l is not sufficiently large, the top q eigenvalues can be close to the bulk, making estimation of q challenging and potentially impossible (Barigozzi and Cho, 2018). Furthermore the top principal components (PCs) of the empirical covariance can be far from those of Θ (Donoho et al., 2018), so, even if q were known, the PC removal approach would not work well.
In this paper, we propose a simple approach to estimating Σ that can cope with settings where the gap between γ l and σ u may range from large and O.p/ to potentially small. To achieve this ambitious objective, the method sacrifices estimation of the scale of Σ: we recover Σ only up to an unknown positive scalar factor. The loss of scale, however, is inconsequential when the ultimate goal is rather to estimate the correlation matrixΣ, or to locate the top s largest entries in Σ for a prespecified s, to build a network. In fact, we show that the scale-free nature of our estimator gives it an in-built robustness in that, if the rows of X have elliptical distributions, its distribution is precisely the same as if the data were Gaussian (see proposition 3 in Section 3).
Let V ∈ R p×.n−1/ be the matrix of right singular vectors with non-zero singular values of a column-centred version of X. Our estimator is based onΣ rsvp := VV T ; we call this right singular vector projection (RSVP). The PC removal estimate is proportional to VH 2 V T where H is a diagonal matrix of singular values of the centred X with the first q entries set to 0 (when q is known). Thus RSVP may be seen as a highly regularized version of PC removal, where the random H is set to the identity matrix to reduce its variance. In fact, we show that each entry ofΣ rsvp concentrates around its expectation at the same rate as the empirical covariance matrix after rescaling, even in settings where q is allowed to grow at almost the same rate as n (see theorem 1 in Section 3).
Despite the aggressive regularization, it turns out that the bias is dominated by the variance provided that p n, so n=p is small. As a consequence, we can show that, with high probability, inf κ>0 max j,k |Σ jk − κΣ rsvp,jk | c log.p/ n for some constant c > 0, even in certain settings when γ l is only larger than σ u by a constant factor, and the latter is bounded. In fact, we show that the statistical properties ofΣ rsvp are such that, when used as input to several standard procedures for conditional independence graph estimation or causal discovery procedures, the performances of the resulting estimates are, in many settings, identical to those attained when working with the unconfounded data, up to constant factors. One requirement forΣ rsvp to work well is that p n. For settings where n is large, we can circumvent this condition by using a subsampling strategy. We show that, surprisingly, by computing our estimator on subsamples of the data and averaging (Breiman, 1996), the bias may be reduced, and the variance inflated by only a factor of √ {log.p/}. Subsampling with a very small number of samples in each subsample is both statistically and computationally attractive and is the approach that we would recommend in settings where we do not have p n.

Related work
There is a large body of work on high dimensional covariance and precision matrix estimation: see for example the recent review paper of Cai et al. (2016) and references therein. Much of the work on the specific setting with latent confounding has focused on estimation of the precision matrix Ω, which is assumed to be sparse. The presence of the latent confounding causes the overall precision matrix of x to be a sum of Ω and a low rank component. One approach to sparse precision matrix estimation in the absence of confounding is the graphical lasso (Yuan and Lin, 2007;Yuan, 2010;Friedman et al., 2008). Building on this and work on sparsedense matrix decompositions in the noiseless setting (Candès et al., 2011;Chandrasekaran et al., 2011), the work of Chandrasekaran et al. (2012) formulates a convex objective involving nuclear norm penalization for Gaussian graphical model estimation with latent confounders. The work of Frot et al. (2019b) uses this as a stepping-stone for causal structure learning and causal effect estimation in low dimensional settings. A challenge for nuclear norm penalization and related approaches is that, although the objective is convex, optimizing it is nevertheless a computationally intensive task that does not scale to very large dimensions. A second approach to precision matrix estimation exploits the fact that coefficients from regressions of each variable on all others, known as nodewise regressions, match the entries of the precision matrix up to scale factors (Lauritzen, 1996;Meinshausen and Bühlmann, 2006). Adjusting for confounding can be built into a nodewise regression procedure, e.g. by using the 'Lava' method of Chernozhukov et al. (2017) which employs a sparse-dense decomposition of the regression coefficients; the sparse part of the coefficients can then be retained as the dense part is generally due to confounding. This regression may be formulated as a lasso regression with a transformed response and a particular preconditioned design matrix; see also Rohe (2015) for an earlier equivalent proposal. Ćevid et al. (2018) studied the theoretical properties of the Lava approach as well as more general forms of preconditioning including the Puffer transform that was proposed in Jia and Rohe (2015) and further investigated in Wang and Leng (2015). This, in analogy with RSVP, modifies the design matrix by replacing non-zero singular values with a constant. We also note that the asymptotic normal thresholding procedure of Ren et al. (2015), which employs nodewise regressions in a different fashion, is robust to weak confounding.
There has been comparatively less work on covariance matrix estimation in the presence of confounding, though, as we discuss and make use of in this work, an estimated covariance can be used as a starting point for conditional independence graph estimation or causal discovery. In addition to the work of Fan et al. (2013) and Gagnon-Bartsch et al. (2013) that was mentioned earlier, Fan et al. (2018) proposed a PC removal approach that can be applied to heavy-tailed data that follow an elliptical distribution.

Organization of the paper
The rest of the paper is organized as follows. In Section 2 we first discuss asymptotic identifiability of Σ and then introduce our RSVP estimatorΣ rsvp and versions involving subsampling. We present theoretical properties ofΣ rsvp and RSVP with sample splitting in Section 3. In Section 4 we present results on the use of RSVP estimators as input to methods for conditional independence graph estimation and for causal discovery via the PC algorithm. Numerical experiments are contained in Section 5 and we conclude with a discussion in Section 6. The on-line supplementary material for this paper contains all proofs and further results concerning the GTEX consortium data analyses that are presented in Section 5.2.

Notation
We write a b as shorthand for 'there is a constant c > 0 such that a cb'. This constant may be a universal constant, or a function of quantities that have been designated as constants in our assumptions. If a b and b a, we may write a b. For a matrix A ∈ R d×m , A will denote the operator norm, and A ∞ = max i=1,:::,d,j=1,:::,m |A ij |.
When d = m, so A is square, we shall write λ max .A/ and λ min .A/ for the maximum and minimum eigenvalues of A respectively. Further, given sets I, J ⊆ D := {1, : : : , d}, we shall denote by A I,J the |I| × |J| submatrix of A that is formed from those rows and columns of A indexed by I and J respectively. Such matrix subsetting operations will always be considered to have been performed first so that for example, when A I,I is invertible, A −1 I,I ≡ .A I,I / −1 . For j ∈ D, j or −j used in place of the subscripts I or J above will represent {j} and D \ {j} respectively, so A j,−j for example is the 1 × .d − 1/ matrix that is formed from the jth row of A with its jth entry removed.
In analogy with the matrix subsetting notation that was set out above, we shall write, for a vector v ∈ R d , v I for the subvector formed from the components of v indexed by I. Also, for j, k ∈ D, v −j and v −jk will be subvectors of v with jth and both the jth and kth components removed respectively. We denote by e j the jth standard basis vector; the dimension of this will be clear from the context.

Right singular vector projection
Let us assume that the observed data matrix X ∈ R .n+1/×p has rows given by n + 1 independent realizations of an N p .μ, Θ/ random vector (we shall later relax the Gaussian assumption; see proposition 3 in Section 3). The n + 1 rather than n is for mathematical convenience: the columncentred versionX := ΠX of X effectively contains n observations. Here Π = I − .n + 1/ −1 11 T where 1 is an .n + 1/-vector of 1s. Our goal is to construct an estimate of Σ based on these data where Σ + ΓΓ T = Θ and both Γ ∈ R p×q and q are unknown. We are interested in the case p n and shall assume that p > cn for some c > 1, unless specified otherwise.
In what follows we first study the identifiability of Σ in the model above. In Section 2.2 we discuss a general approach for estimating Σ based on transforming the spectrum of the covariance matrix, which includes PC removal and our RSVP method presented in Section 2.3 as special cases. Finally we introduce a sample splitting version of RSVP in Section 2.4.

Asymptotic identifiability
First consider an artificial setting where Θ itself is directly observed. Even in this noiseless setting, certain conditions must be placed on Γ and Σ for Σ to be recoverable given Θ. Define If γ l is large compared with σ u , we might hope that the top q eigenvectors of Θ will span most of the column space of Γ. Therefore removing these from Θ should yield a matrix that is close to Σ. Proposition 1 below, based in part on an application of the Davis-Kahan sin.θ/ theorem (Davis and Kahan, 1970), formalizes this intuition.
Let Θ have eigendecomposition PD 2 P T where the diagonal matrix D has D 11 D 22 : : : D pp . Also define, for l ∈ {1, : : : , p}, function H l taking as argument a square matrix, and outputting a matrix of the same dimension, by Thus the top left-hand l × l submatrix of H l .E/ is a matrix of 0s. Define Π Γ := Γ.Γ T Γ/ −1 Γ T , ρ 1 := Π Γ Σ and ρ 2 := max j Π Γ e j 2 : Proposition 1. Suppose that σ l is bounded away from 0 and γ l > cσ u for a constant c > 1. Then In order for removal of q PCs to yield a matrix that is close to Σ at the population level, we require ρ 2 to be small; this essentially requires that the column space of Γ is not too closely aligned with any of the standard basis vectors. We always have the bound ρ 1 σ u . However, in the setting where Γ is entirely uninformative about Σ, we might expect that ρ 1 may be smaller. Specifically, if we imagine that nature has chosen the column space of Γ uniformly at random, we shall have with high probability that  Fan et al. (2013Fan et al. ( , 2018 when σ u and q are bounded, and both γ l and γ u are O.p/. In these settings it is straightforward to show that ρ 2 p −1=2 , in which case the right-hand side of expression (2) may be replaced by p −1=2 .

Spectral transformations
We now return to the original noisy version of the problem. The empirical covariance matrix Θ =X TX =n has expectation Θ = PD 2 P T , so we would ideally like to modifyΘ such that the eigenstructure of its expectation more closely resembles PH q .D 2 /P. Therefore consider the following family of estimators that involve transforming the spectrum ofΘ. Note that, asX has been centred and p > n, the rank ofX is n. Let the singular value decomposition ofX be given byX = UΛV T where Λ ∈ R n×n is diagonal, and U ∈ R .n+1/×n and V ∈ R p×n each have orthonormal columns. Definê where function H here outputs n × n diagonal matrices. For such estimators, we have the following property.
Proposition 2. We have that the expectation E.Σ H / = PC 2 H P T where C H is diagonal. The fact that the eigenvectors of E.Σ H / coincide with those of Θ suggests that we should pick function H such that C 2 H is close to H q .D 2 /. A natural choice is a simple PC-analysis-based adjustment (Fan et al., 2013(Fan et al., , 2018Gagnon-Bartsch et al., 2013) of the form Σ pca .l/ :=Σ H l = n −1 VH l .Λ 2 /V T : The resulting PC removal estimator can be further thresholded as in Bickel and Levina (2008) and Fan et al. (2013), though, if our aim is to recover the locations of the largest entries of the covariance, this additional thresholding step is without consequence. The choice of the number l of principal components to remove is rather critical to the method but can be challenging. Even if we had knowledge about the dimensionality q of the latent confounders, the optimal choice would depend on the relative magnitude of the eigenvalues of Γ T Γ in relation to the eigenvalues of Σ. In the absence of this knowledge, one might resort to cross-validation schemes. Since the target of inference is the unobserved idiosyncratic part Σ of the covariance, it is not obvious how such a cross-validation can be set up in a meaningful way. Information criteria may be used as in Fan et al. (2013), but these rely on γ l =σ u p.

Right singular vector projection
One reason that the PC removal approach can struggle in settings where the separation between γ l and γ u is relatively small is that the top q eigenvectors ofΘ need not span the column space of Γ well and in general will have high variability. Thus althoughΘ = n −1 V Λ 2 V T concentrates well around its expectation Θ in l ∞ -norm, an approach that involves manipulating the contributions of individual singular vectors in V to the overall estimator is likely to have high variance. This suggests that some form of regularization may be helpful.
Taking the function H as one which always returns n times the identity matrix results in the simple estimatorΣ rsvp := VV T : Note that this is invariant to permutations of the columns of V and so is less dependent on properties of individual eigenvectors. As a consequence of the regularization, we have lost the scaling of the original covariance: the estimator is invariant to multiplying X from the left by any invertible n × n matrix. Thus we can only hope to recover Σ up to a constant scale factor. This suffices for our purposes, and we argue that this gives the estimator a certain robustness in that it is insensitive to particular pretransformations of the data such as scaling of the rows of X. In factΣ rsvp is more generally robust; see proposition 3 in Section 3. The computation time is dominated by the matrix multiplication of V and V T , which is O.np 2 /; thus the computational complexity is the same as that for computing the empirical covariance.
In a regression context, an analogous approach for preconditioning the design matrix has been explored in Jia and Rohe (2015) and Wang and Leng (2015). The Lava estimator (Chernozhukov et al., 2017) employs a similar preconditioning strategy but, instead of setting all non-zero singular values of the design matrix to 1, the singular values d i are transformed implicitly as where the constant c depends on a tuning parameter and the sample size. It may seem as if all information regarding the eigenvalues of Σ has been lost in the regularization as Λ does not play a role in the estimator. However, we show in Section 3 that, in certain high dimensional settings, we can even estimate Σ in l ∞ -norm at the same rate as the empirical covariance matrix in the absence of confounding, though only up to an unknown scale factor. Intuitively, the reason is that, when p n, with the exception of certain large eigenvalues in Λ due to large eigenvalues in Γ T Γ, the rest of the eigenvalues are essentially noise and bear no resemblance to the eigenvalues of Σ. This peculiar blessing of high dimensionality is a phenomenon that fails when p is of the same order as n, for example. It is, however, possible to subsample the data, and to average over estimates computed on the samples, to mimic the high dimensional setting. We discuss this below.

Subsampling right singular vector projection
Given m ∈ {1, : : : , n}, let V .b/ be the matrix of right singular vectors of a random sample of m rows ofX. We define the subsampling RSVP estimator aŝ The sample splitting RSVP estimatorΣ rsvp-split is defined similarly but where the sets of indices of the sampled rows are disjoint, and so B = .n + 1/=m . In practice, the subsampling estimator is preferable as the additional sampling can help to reduce the variance of the estimator. Our main reason for introducing the sample splitting version is that it is simpler to understand its theoretical properties (see theorem 4 in Section 3); however, sample splitting still performs well empirically as we demonstrate in Section 5. Both estimators are trivially parallelizable: the singular value decomposition computations for each subsample can be performed simultaneously and then added at the end. If B machines were available for the computations, the overall parallel computation time would be O.mp 2 / provided that log.B/ m.  (2007) and (f) Bai and Ng (2002) suggested in the principal orthogonal complement thresholding method (Fan et al., 2013); (g) proposed RSVP estimator with a subsample size of m D 20; RSVP manages to recover the smaller blocks, whereas the PC removal methods seemingly fail to recover any structure in the covariance matrix (a) Fig. 1 but for sample size n D 1000 and very weak latent confounding (ν D 0.01): the large block is now even visible in the empirical covariance matrix; the PC-removal-based methods fail to recover the structure of the large block as they all remove at least one PC

Example
Figs 1 and 2 show an example of the proposed sample splitting RSVP estimator, compared with the target Σ and PC removal. In Fig. 1, the latent confounding is so strong that the empirical covariance shows very little visual indication of the block structure of the idiosyncratic covariance. Likewise, PC removal fails to recover the structure, whether we use an oracle for determining the number of factors to remove or estimate the optimal number of factors. RSVP in contrast recovers the smaller blocks. It is shown here for m = 20 samples in each subsample (default) but the results do not change appreciably when choosing a different subsample size. When reducing the strength of the latent confounders ( Fig. 2), the empirical covariance shows the correct underlying structure visually but all PC removal methods fail to recover the largest block of variables as even just removing the first PC removes the large block.

Theoretical properties
In this section, we present some theoretical properties ofΣ rsvp andΣ rsvp-split . We first explain howΣ rsvp has low variance and then argue that its bias is also well controlled in the high dimensional setting. We then discuss the consequences forΣ rsvp-split . We shall assume condition 1 below in several of the results to follow.
Theorem 1. Assume condition 1 and that σ u p={n log.p/} and q n= log.p/. Then there exists a constant c > 0 such that, with probability at least 1 − c=p, We show in theorem 2 below that the entries in E.Σ rsvp / are of the order n=p, so the result shows that the rate at whichΣ rsvp concentrates is equivalent to that enjoyed by the empirical covariance matrix in the absence of confounding. The proof, given in section E.2 in the online supplementary material, is based on a variant of the classical concentration inequality for a Lipschitz function f : R d → R of IID Gaussian random variables ζ ∼ N d .0, I/, which may be of independent interest. Whereas the original result guarantees fast concentration when sup v∈R d ∇f.v/ 2 is small, our new result (theorem 7 in the supplementary material) requires only a high probability bound on ∇f.ζ/ 2 , and a potentially loose bound on E ∇f.ζ/ 2 2 . See also lemma 1.3 of Klochkov and Zhivotovskiy (2018) for a related result.
Although our proof technique for concentration ofΣ rsvp makes use of particular properties of Gaussian distributions, one attractive feature of the estimator is that it enjoys a certain in-built robustness to deviations from Gaussianity in the distribution of X. Indeed, consider now the weaker requirement that where M ∈ R .n+1/×.n+1/ is invertible and the rows of Z ∈ R .n+1/×p are independent following (potentially different) spherically symmetric distributions, so ZQ = d Z for any orthogonal matrix Q ∈ R .n+1/×.n+1/ . A sufficient condition for this to occur is that the rows of X are IID and have a density with elliptical contours. In this more general setting we have the following result.
Proposition 3. The law ofΣ rsvp under requirement (5) above is the same as that when X has independent rows distributed as N p .μ, Θ/.
For example, the entries in Z can have arbitrarily heavy tails; provided that the spherical symmetry is satisfied, all results in this section hold under this setting and more generally under requirement (5). This may seem surprising at first sight but is analogous to how, if ζ has a spherically symmetric distribution, then the distribution of ζ= ζ 2 is simply the uniform distribution on the d-dimensional spherical shell, and in particular identical to the distribution that is obtained when ζ ∼ N d .0, I/.
We now turn to the expectation ofΣ rsvp . Theorem 2 below shows that E.Σ rsvp / is approximately a scaled version of Σ.
Theorem 2. Assume condition 1. We have that E.Σ rsvp / = PC 2 P T where C is a diagonal matrix with C satisfying max j,k∈{q+1,:::,p} . 6/ This result shows that the ratio of C 2 jj to D 2 jj does not vary much across j ∈ {q + 1, : : : , p} provided that p n. In fact we also have max j∈{q+1,:::,p} in the case where σ u is bounded, which reveals the form of the scale factor, and in particular its dependence on the unknown q. A derivation is given in section F of the on-line supplementary material. We do not make direct use of this in the proof of theorem 3 below, however, as it is useful only when γ l is large; in contrast, expression (6) is valid for any value of γ l . Combining the results of proposition 1 and theorems 1 and 2 gives the following high probability bound on the l ∞ -norm error of estimating Σ, up to an unknown scale factor.
Theorem 3. Assume condition 1 and that σ u p={n log.p/}, q n= log.p/. With probability at least 1 − c=p for some constant c > 0, we have that there exists κ > 0 such that If we additionally assume that ρ 2 2 q=p and ρ 1 is bounded, we have that there exists κ > 0 such that The first two terms in the bounds (8) and (9) come directly from the population level result proposition 1. The remaining terms do not depend on γ l , demonstrating how RSVP, in contrast with the PC removal approach, does not rely on a large eigengap between Γ T Γ and Σ. The final √ {log.p/=n}-term is due to the variance (see theorem 1). Considering expression (9), in the case where σ u p √ log.p/=n 3=2 , q √ {n log.p/} and p √ log.p/ n 3=2 , we have that with high probability If the condition number of Γ T Γ were bounded, we need only γ l √ {n= log.p/} for the l ∞ -norm error above to be of the same order as that achieved by the empirical covariance matrix of the (unobserved) unconfounded data W . Although RSVP does not require strong eigengap conditions, we do need p n so that the term involving σ u n=p, due to the bias of the estimator, is small. By sample splitting and averaging in constructingΣ rsvp-split , we effectively reduce n, but only introduce an extra √ log.p/-factor in the variance term, as the following result shows.
Theorem 4. LetΣ rsvp-split be the sample splitting RSVP estimator with B subsamples of size m, so n + 1 = mB. We consider for simplicity the case where the data are column centred in each subsample. Assume condition 1, but without the requirement that c 3 n < p; instead suppose that c 1 m < p for c 1 > 1, and B < p c 2 for some c 2 > 0. Assume that 1 σ u p={m log.p/} and q m= log.p/. With probability at least 1 − c=p for some constant c > 0, we have that there exists κ > 0 such that If we additionally assume that ρ 2 2 q=p and ρ 1 is bounded, we have that there exists κ > 0 such that . 10/ Considering expression (10), we see that for an optimal m √ .pq=σ u / we have with high probability that . 11/ Although the simple RSVP estimator is most useful in the high dimensional case p n, this result shows that sample splitting gives good performance in moderate to low dimensional settings, which will be confirmed empirically in Section 5.

Weak confounding
The results and discussion thus far have considered the case where γ l > σ u . In cases where the confounding is sufficiently weak such that is small, and so the empirical covarianceΘ is itself a good estimator of Σ, a straightforward consequence of our previous results and their proofs is that RSVP will behave similarly to the empirical covariance.
Corollary 1. Consider the set-up of theorem 3 but now without any restriction on q and γ l (so in particular γ l < σ u is permitted). With probability at least 1 − c=p for some constant c > 0, there exists κ > 0 such that Σ − κΣ rsvp ∞ γ u ρ 2 2 + .γ u + σ u / n p + log.p/ n : Suppose additionally that γ u =γ l is bounded, ρ 2 2 q=p, σ u p √ log.p/=n 3=2 , q n max{ √ {log.p/=n}, 1= log.p/} and p log.p/ n 2 . Then, with probability at least 1 − c=p, Σ − κΣ rsvp ∞ log.p/ n : Note that the final result holds regardless of the strength of confounding, which can be arbitrarily weak or strong, though it relies on the condition number of Γ T Γ being bounded.

Conditional independence graph estimation and causal structure learning
In this section we consider using RSVP in conjunction with existing methods for conditional independence graph (CIG) estimation and causal structure learning. We first turn to the problem of estimating the CIG corresponding to Σ: this is the undirected graph on p nodes with an edge between nodes j and k with j = k if and only if w j ⊥ ⊥w k |w −jk , where recall that w ∼ N p .μ w , Σ/. Equivalently, we have an edge between j and k if and only if the precision matrix Ω = Σ −1 has Ω jk = 0.

Conditional independence graph estimation
Methods for CIG estimation when p n typically rely on Ω being sparse. Applying them directly to the observed data X will in general not work well, firstly as the inverse covariance Θ −1 of the observed data may be far from Ω, and secondly because Θ −1 will not be sparse but rather a sum of the sparse Ω and a low rank component due to the presence of latent confounding. However, many of the methods for sparse precision matrix estimation require only an estimated covariance as input and so can be readily applied to any estimate of Σ. Examples include neighbourhood selection (Meinshausen and Bühlmann, 2006), the graphical lasso (Yuan and Lin, 2007;Yuan, 2010;Friedman et al., 2008) and constrained l 1 -minimization for inverse matrix estimation (Cai et al., 2011). Note that, as RSVP estimates Σ up to an unknown scale factor only, we can similarly only hope to recover the precision matrix up to an unknown scale factor; this, however, suffices for estimating the CIG. Theoretical results for constrained l 1 -minimization for inverse matrix estimation and the graphical lasso require only an initial estimate of Σ that is close in l ∞ -norm, so our estimation error bounds for Σ translate directly into estimation error bounds on Σ −1 . We now present the corresponding result for neighbourhood selection, which is more delicate.
The procedure of neighbourhood selection involves running p lasso regressions of each variable against all others. The resulting coefficient estimates may then be used to derive an estimate of the CIG. Phrased in terms of an estimateΣ of the covariance, the so-called nodewise regressions take the formβ .j/ := arg min The population level minimizer β .j/ ∈ R p (i.e. withΣ replaced by Σ) of the above when λ j = 0 satisfies The {β .j/ } p j=1 encode the CIG; indeed w j ⊥ ⊥w k |w −jk if and only if β Here we shall takeΣ =Σ rsvp in expression (12); we thus expect that a scaled version ofβ .j/ becomes close to β .j/ .
To present our result on the statistical properties ofβ .j/ , we introduce the following quantities. Let S j = {l : β .j/ l = 0} and let s j := |S j | and s = max j s j ; thus s j and s are the degree of the jth node and the maximal degree in the CIG respectively. Also define η j = .β .j/ / T ΓΓ T β .j/ : Our theory will require the η j to be small. We always have η j s j for all j. Indeed As Θ is positive semidefinite, we have |Θ lk | max.Θ ll , Θ kk / 1 for all l and k. Thus, by the Gershgorin circle theorem, λ max .Θ S j ,S j / s j . Also, as , we have β .j/ 2 1, whence η j s j . However, in many settings we can expect the η j to be smaller: if we consider the column space of Γ to have been chosen (by nature) uniformly at random conditionally on Σ, then we have η j 1 for all j: .13/ A derivation of this is given in section I of the on-line supplementary material.
Theorem 5. Assume condition 1. Let Letβ .j/ be the nodewise regression coefficient whenΣ =Σ rsvp and Suppose that ρ 2 2 q=p, ρ 1 1, γ 2 l =γ u √ {sn= log.p/}, q √ {n log.p/=s} and σ u p √ {log.p/=.sn 3 /}; then Δ 1. If in addition η j 1 for all j, we recover the usual estimation error rates for the lasso: The following simple corollary shows that, under a minimum signal strength condition, appropriately thresholding the estimatesβ .j/ recovers the true CIG. Although edges in a CIG are typically given a causal interpretation, structural equation models (Pearl, 2009) and graphical modelling with directed acyclic graphs (Lauritzen, 1996) offer a more principled approach for causal inference. Below we explain how the popular PC algorithm (Spirtes et al., 2000) may be run with our RSVP estimate as its input to enable causal structure learning in the presence of hidden confounding.

Causal structure learning
In this section we describe how our RSVP estimator may be used for causal structure learning concerning the unconfounded w ∼ N p .μ w , Σ/. If we assume a structural causal model for w with an underlying directed acyclic graph (DAG) encoding parent-children relationships (Pearl, 2009), then the observational distribution factorizes according to this DAG. The interventional distributions under 'do' interventions can then be obtained by truncated factorizations (Robins, 1986;Pearl, 2009) under an assumption known as autonomy (Haavelmo, 1944).
If the underlying DAG G is unknown, it needs to be estimated from data; for a general overview of causal structure learning see for example Heinze-Deml et al. (2018). Under a faithfulness assumption (Meek, 1995), the set of conditional independences in the observational distribution will be exactly those that may be inferred via d-separation from G. In general, many DAGs will be compatible with the observational distribution in this way and these form an equivalence class which may be conveniently represented through a completed partially directed acyclic graph (CPDAG). A CPDAG contains both directed and undirected edges, and essentially contains all the information relating to causal structure that may be inferred from a given observational distribution under the assumption of faithfulness.
Our goal here is to infer the CPDAG corresponding to the distribution of the unconfounded data. To do this we employ the PC algorithm (Spirtes et al., 2000;Kalisch and Bühlmann, 2007). The population version of the PC algorithm is a procedure for determining the CPDAG C.G/ corresponding to a distribution P that is faithful to a DAG G given a list of conditional independences satisfied by P. In our context where P = N p .μ w , Σ/ with Σ positive definite, these conditional independences may be equivalently represented by partial correlations: we have for w ∼ N p .μ w , Σ/ that .14/ where the partial correlation ρ jk|S satisfies Harris and Drton, 2013). Here we have indexed the rows and columns of Ψ according to the elements of A. The sample version of the PC algorithm replaces queries of conditional independence with conditional independence tests. In our case, in analogy with expressions (14) and (15)  In the case where confounding is not present, the PC algorithm requires faithfulness and a certain minimum signal strength condition for partial correlations. We shall therefore assume that N p .μ w , Σ/ is faithful to a DAG G and our target of inference will be the corresponding CPDAG C.G/ =: C. We denote the maximum degree of G by d. Define also the following parameter controlling minimum signal strength: It will also be convenient to introduce a particular minimum restricted eigenvalue σ r of Σ defined through σ r := min I:|I| d+2 λ min .Σ I,I /. Note that we always have σ r σ l .
The result below follows directly from the proof of theorem 8 in Harris and Drton (2013).
Lemma 1. LetĈ τ be the output of the PC algorithm using conditional independence tests given by expression (16) with threshold τ . For any A 1 we have TakingΣ as eitherΣ rsvp or its sample splitting variantΣ rsvp-split , by combining lemma 1 with one of theorems 3 or 4, we can obtain high probability guarantees on recovering the CPDAG corresponding to the unconfounded data. As an example, we consider the setting where the assumptions of theorem 4 and those leading to expression (11) hold. Additionally, consider an asymptotic regime where . 17/ Then usingΣ rsvp-split with an optimal subsample size m √ .pq=σ u / we have the following conclusion: there is a sequence a n → 0 and constant c > 0 such thatĈ τ = C for all τ ∈ [a n , 1 − a n ] with probability at least 1 − c=p. We may compare this conclusion with the results that were obtained in Kalisch and Bühlmann (2007) that provide similar guarantees for the PC algorithm when confounding is not present. If we assume that the final term of log.p/= √ n on the left-hand side of equation (17) is the dominant term, our requirement is log.p/= √ n = o.ω=d/ whereas the equivalent result in Kalisch and Bühlmann (2007) In particular, we see that, in our setting, the maximal degree d cannot grow as quickly. This restriction is also present in the analogous result of Harris and Drton (2013) who considered applying the PC algorithm (in the absence of hidden confounding) by using conditional independence tests based on partial correlations derived from rank correlations.

Simulation experiments
In this section we provide some numerical results for various scenarios and compare the proposed estimator with the PC removal estimators, as employed in the principal orthogonal complement thresholding method (Fan et al., 2013). Results for shrinkage estimators of Ledoit-Wolf type (Ledoit and Wolf, 2004) are also included in our comparison.

Experimental set-ups
We consider five scenarios described below. For each of these, we generate n ∈ {100, 200, 500, 1000, 2000} independent samples from N p .0, Θ/ for a covariance matrix Θ ∈ R p×p that has an idiosyncratic component and a component due to confounding Θ = Σ + Γ T Γ. The number of variables is varied in p ∈ {100, 200, 500, 1000, 2000}. For q latent confounders, the entries of the matrix Γ ∈ R p×q are sampled independently from a standard normal distribution, and column k ∈ {1, : : : , q} of Γ is scaled by a factor ν exp.−k/ to have a decaying spectrum among the latent confounders. The strength ν ∈ {0:01, 0:1, 0:5, 1, 5, 20} allows for a variation in the overall strength of the latent confounding. The five scenarios that are considered distinguish themselves by a different structure of the idiosyncratic covariance matrix Σ and the number of latent confounders q. All diagonal entries of Σ are set to 1. Varying the structure, number of samples n, dimension p and strength ν of the latent confounders, we run 200 simulations of each unique parameter configuration and compute the following quantities.
(a) We obtain the estimated covariance matrixΣ pca .l/, where the number l is chosen first as l = 0, leading to the empirical covariance matrix. This first estimator is also the basis for comparisons with Ledoit-Wolf-type shrinkage (Ledoit and Wolf, 2004). (The results for a Ledoit-Wolf covariance estimator with the identity matrix as the shrinkage target are identical to those for PC removal with l = 0 (i.e. the empirical covariance matrix) as the objective that we measure will be unchanged by the shrinkage.) Next we use the oracle value l = q (which is of course unavailable in practice) and then, as suggested in Fan et al. (2013), the values of the two estimators of q that are based on the respective first information criterion in Bai and Ng (2002) and Hallin and Liška (2007). (b) We calculate the sample splitting RSVP estimatorΣ rsvp-split for subsample size m ∈ {20, 50, 70}.
Some results are shown in Figs 3 and 4. Other possible approaches such as the sparse-dense decomposition approach of Chandrasekaran et al. (2012) are unfortunately computationally infeasible for these settings. We would like to compare for each estimate its accuracy with respect to the true idiosyncratic covariance in a suitable norm, which we chose here for simplicity as the Frobenius norm. To be invariant with respect to scaling, we may consider Σ − κΣ F , which is monotonically decreasing with the empirical correlation ρ Σ,Σ between the vectorized matrices Σ andΣ; we shall use ρ Σ,Σ as a criterion for simplicity, and also we omit the diagonals from Σ andΣ in the computation. For estimation of the inverse covariance Ω := Σ −1 , we invert our estimates above by using the approach of Meinshausen and Bühlmann (2006) as implemented in the R package glasso (Friedman et al., 2018) to give estimatesΩ. The penalty parameter is set to a very small uniform value of λ = 10 −6 for computational speed and easier comparison between methods. Cross-validation of the penalty is also not straightforward to implement here as we do not have access to clean data that would be free of the influence of the latent confounders.

Results
A summary of results from each of the 750 = 5 × 5 × 6 × 5 unique parameter settings is shown in Fig. 5. The RSVP estimator with low number m = 20 of samples in each subsample in general dominates the other estimators (in terms of having higher mean correlation and higher quartiles), no matter whether we stratify according to design matrix structure, strength of latent confounders, sample size or dimension of the graph. The only exception seems to be the case ν = 0:01, where the latent confounders are effectively absent. Here the empirical covariance improves the RSVP estimator, as expected.
Comparing the various PC removal approaches, it is noteworthy that, for an increasing strength of the latent confounding, the oracle (true) value of q performs much better than using any of the suggested empirical estimates of q. In contrast, for weak confounding, removing all q latent confounders performs worse in general because of the decaying spectrum of the latent confounding: too much of the idiosyncratic covariance is removed by the oracle estimate in these cases. RSVP tends to perform at least as well as the optimal approach among the three PC removal approaches across all strengths of the latent confounding, even though in practice the oracle choice of q for PC removal is clearly not even available.
Analogous results for inverse covariance matrix estimation are shown in Fig. 6, with a single example outcome in Fig. 7. The differences between RSVP with different numbers of samples in each subsample are smaller, arguably because the error that is introduced by matrix inversion dominates the relatively small differences. Although estimating the covariance of a random Erdó´s-Rényi graph seems easy for the covariance, it becomes relatively difficult for the inverse covariance matrix. Finally, whereas a dimension of p = 2000 still yields very good results in Frobenius norm for covariance estimation, it seems to become very challenging for inverse covariance estimation.
The relative performance of the sample splitting version of RSVP as a function of number of samples m in each subsample is shown in Fig. 8. For very weak latent confounding, taking very small values of m performs optimally as the sampling splitting RSVP estimator then converges to the empirical covariance matrix. Whereas the scaling of the optimal m as proportional to √ .pq/=σ u emerges from the theory, in our examples the choice m = 2 √ p seems to be a good rule of thumb choice for the size of the subsamples.

Model violations
To investigate robustness against model violations for covariance estimation, we replace the normal distribution for the idiosyncratic noise for X and H by multivariate t-distributions with df 1 degrees of freedom, where df 1 ∈ {1, 2, 3, 5, 10, 20, 50, 100}. We also generate the loading matrix Γ by using a multivariate t-distributions with df 2 degrees of freedom and we vary this parameter among the same set of values as those used for df 1 . Analogously to Fig. 5, Fig. 9 shows the performance for covariance estimation marginally as a function of both df 1 and df 2 , where the remaining parameters (graph structure, dimension, sample size and strength of confounding) are averaged out. The cases df 1 = 1 and df 2 = 1 correspond to Cauchy distributions. We comment here that Σ does not correspond to a covariance if df 1 2. Nevertheless, Σ can still be identifiable from the distribution of w.
As an additional test of robustness, we consider, in a second set of experiments, replacing the linear structural equation x = w + Γh (1) with a max-linear model (Gissibl and Klüppelberg, 2018) x j = max{w j , .Γh/ j }; .18/ our goal is as before to recover Σ = cov.w/. We present in Fig. 10 the results averaged over all other parameters of our simulation set-up (graph structure, dimension, sample size, strength of confounders, df 1 and df 2 ). The performances of both oracle PC removal and RSVP suffer in the max-linear case and drop to similar levels to the data-driven PC removal methods. However, even in this case, RSVP outperforms data-driven PC removal approaches; in the case of the Hallin and Liška (2007) choice of the number of components, RSVP gives better results in more than three-quarters of all simulation settings, as can be seen in Fig. 10(h).

GTEX data analysis
In this section we illustrate the key properties of RSVP on a collection of gene expression data sets made publicly available by the GTEX consortium (Aguet et al., 2016). Such data sets are particularly prone to the type of confounding that is studied in this paper (Leek and Storey, 2007;Stegle et al., 2012;Gagnon-Bartsch et al., 2013). Our aim is to determine which genes are biologically related in that they regulate each other. To validate our results, we use the gene ontology database (Ashburner et al., 2000). The GTEX consortium conducted a large-scale ribonucleic acid sequencing experiment which resulted in the the collection of gene expression data from hundreds of donors in more than 50 human tissues. To carry out their analyses, they estimated confounders by leveraging external information such as gender and genetic relatedness between donors, and by inferring some confounders from the data themselves by using probabilistic estimation of expression residuals (PEER) (Stegle et al., 2012). Both the confounders and the fully processed, normalized and filtered gene expression data are available on the website of the consortium (https:// gtexportal.org/home/datasets; in addition, code to compute RSVP and subsampling versions, and also to reproduce all the results that are described in this section, is available from https://github.com/benjaminfrot/RSVP).
For each tissue T , where T is for example whole blood, lung or thyroid, there is available a data matrix X T of gene expression levels with dimensions n T × p T along with an n T × q T matrix of confounders. We removed tissues for which n T 100; the 44 remaining tissues had a ratio n T =p T ranging between 0:006 and 0:03 and values of p T ranging between 14337 and 17855. (The list of tissues as well as the number of samples and variables for each of them can be found in the on-line supplementary materials.) In line with the analysis methods of the GTEX consortium, we used all the PEER factors at our disposal, resulting in a total number of q T confounders for each tissue equal to the number of PEER factors for that tissue plus five confounders derived from external sources (e.g. donors' genotypes, gender, etc.). (According to the analysis methods section of the consortium's website 'the number of PEER factors was determined as a function of sample size N: 15 factors for N < 150, 30 factors for 150 N < 250, Because these covariates and factors are deemed the most relevant by the GTEX consortium, we refer to a data set X T from which all q T confounders have been removed as 'unconfounded'. However, it is possible that there is still unobserved confounding in the data sets. For each tissue, we create a sequence of data sets by regressing out 0, 1, 2, : : : , q T confounders. On each of these data sets, we run RSVP, PC removal with different values l of components removed. We also run the neighbourhood selection with the square-root lasso (Belloni et al., 2011) on both the sample covariance matrix of the raw data set, NS, and on the covariance matrix estimated by RSVP, RSVP+NS. Two commonly used proxies for pairs of genes being co-regulated are large off-diagonal entries in the covariance or non-zero entries in the inverse covariance matrix. We therefore form for each estimated covariance matrix, a sequence of estimated co-regulation networks containing edges corresponding to the largest r entries, with r ranging from 1 to 100. In the case of NS and RSVP+NS, we vary the tuning parameter of the square-root lasso until we obtain a graph with approximately 100 edges and then form a sequence of 100 networks corresponding to the largest r entries in the estimated inverse covariance matrices, with 1 r 100.
We first sought to quantify how sensitive the graphs that are returned by the various methods are to the addition of confounding. For that, for each (tissue, method, r) triple, we computed the Jaccard similarity between the edge set of a graph estimated on the unconfounded data and the graph with r edges estimated on the data set with k ∈ {0, 1, 5, 10, 20} confounders removed. Fig. 11 shows the resulting Jaccard similarities averaged across the 44 tissues. Unsurprisingly the more confounders are removed, the more similar the estimated graphs are to that obtained on the unconfounded data (k = q T ). However, this change for RSVP is only very slight and the method yields large similarities across different numbers of edges and k. This is an encouraging result, particularly given that a number of the confounders, such as gender and genotype data, were derived entirely from external data. In contrast, the performances of PC removal and NS are strongly influenced by the presence of the confounders, with the Jaccard similarity between raw and unconfounded data close to 0.
Consistently returning the same set of edges irrespectively of confounding does not imply anything about the quality of the estimates. To obtain a sense of their accuracy, we scored the graphs by using a reference data set: the gene ontology (Ashburner et al., 2000). Briefly, the gene ontology is a popular database which allows the annotation of each gene by a set of terms classified in three categories: cellular components, molecular function and biological process. Genes that tend to perform similar functions or to interact are expected to be annotated by similar terms. By mapping each node of each graph to its gene ontology terms, one can compute a so-called enrichment statistic (Frot et al., 2019a) reflecting whether the graph contains edges between related genes more often than would be expected in a random graph with a similar topology (such a graph has an expected statistic of 1). Fig. 12(a) shows the enrichment scores that were obtained in the raw data set (no confounders regressed out), averaged across all tissues. Fig.  12(b) gives the average score as a function of the number of confounders regressed out. In the online supplementary materials, the scores for each of the 44 tissues is plotted. Several comments are in order. RSVP performs well across the data sets and is the best performer on average when applied to the unconfounded data. Interestingly, as shown in the supplementary materials, there is at least one selection of l for each tissue where PC removal performs comparably with RSVP, but the optimal value of l changes from tissue to tissue. This would suggest a data-based selection for l; however, the selection criteria of Bai and Ng (2002) and Hallin and Liška (2007) both yield l = 0 on every tissue. The performance of the neighbourhood selection NS steadily increases as increasingly more confounders are regressed out, until it outperforms RSVP. This tends to  . 11. Average Jaccard similarity between the edge sets of graphs estimated on the unconfounded data (with all confounders regressed out) and data from which k confounders have been removed, for k D 0, 1, 5, 10, 20; similarities are averaged over 44 tissues (the RSVP estimate is here seen to be the most stable with respect to removal of confounders; the RSVP estimator shows highest Jaccard similarity across all graph sizes when zero or just a few confounders are regressed out): (a) confirm that the raw data do indeed contain latent confounders masking true biological signal. Moreover, the fact that methods forming networks based on the estimated inverse covariances, NS and RSVP+NS, perform best on the unconfounded data sets tends to confirm that it is indeed the precision matrices which contain relevant signal when it comes to co-regulation networks.
The computational cost of performing NS is far greater than for RSVP or the PC removal approaches. We also note that the RSVP and PC removal methods may be further accelerated by using large inner product search algorithms. For example, the xyz algorithm of Thanei et al. (2018) can locate the large entries in the matrix product VV T that forms RSVP at a fraction of the cost of performing the full matrix multiplication. On these GTEX data sets, it delivers similar performance to regular RSVP but cuts the computational cost by a factor of around 2000.

Discussion
In this work, we have introduced RSVP as a simple and fast method for estimating the idiosyncratic covariance Σ given data where latent factors are present. A notable aspect of the method is that all information about Σ that is contained in the spectrum of the empirical covariance matrix is thrown away. Estimation of Σ, which is permitted to have a diverging condition number, is performed by using a scaled multiple of a projection matrix whose eigenvalues are necessarily in {0, 1}. It may seem surprising at first sight that this should work at all, and the success of the method underlines the message that has emerged on the vast theory surrounding high dimensional PC analysis and covariance estimation, saying that the eigenvalues of the empirical covariance matrixΘ are extremely noisy. By removing the variance due to these noisy eigenvalues, RSVP can cope well even in settings that are particularly challenging for PC removal approaches where the eigenvalues of the combined covariance Θ are not well separated into two groups. A drawback of RSVP is that the scale of Σ is lost, but this is of little consequence in many applications of interest and has the advantage of allowing the method to be robust to certain heavy-tailed data, for example.
Our work leaves open some questions. For example, it would be interesting to explore whether there are other estimators of the form (4) that depend on the spectrum ofΘ such that the scale of Σ is not lost, but in a sufficiently smooth way as not to have high variance even in the challenging scenarios that were mentioned above. Another interesting problem is that of controlling for latent confounding when the influence of the confounding is not linear, such as the max-linear settings (Gissibl and Klüppelberg, 2018).