The conditional permutation test for independence while controlling for confounders

We propose a general new method, the conditional permutation test, for testing the conditional independence of variables X and Y given a potentially high dimensional random vector Z that may contain confounding factors. The test permutes entries of X non‐uniformly, to respect the existing dependence between X and Z and thus to account for the presence of these confounders. Like the conditional randomization test of Candès and co‐workers in 2018, our test relies on the availability of an approximation to the distribution of X|Z—whereas their test uses this estimate to draw new X‐values, for our test we use this approximation to design an appropriate non‐uniform distribution on permutations of the X‐values already seen in the true data. We provide an efficient Markov chain Monte Carlo sampler for the implementation of our method and establish bounds on the type I error in terms of the error in the approximation of the conditional distribution of X|Z, finding that, for the worst‐case test statistic, the inflation in type I error of the conditional permutation test is no larger than that of the conditional randomization test. We validate these theoretical results with experiments on simulated data and on the Capital Bikeshare data set.


Introduction
Independence is a central notion in statistical model building, as well as being a foundational concept for much of statistical theory. Originating with Francis Galton's work on correlation at the end of the 19th century (Stigler, 1989), many measures of dependence have been proposed, including mutual information, the Hilbert-Schmidt independence criterion and distance covariance (Cover and Thomas, 2012; Gretton et al., 2005; Székely et al., 2007); see also Josse and Holmes (2013) for an overview. Simultaneously, much research effort has gone into developing several different tests of independence, e.g. based on ranks, kernel methods, copulas and nearest neighbours (Weihs et al., 2018; Pfister et al., 2018; Kojadinovic and Holmes, 2013; Berrett and Samworth, 2019). Permutation tests are particularly attractive because of their simplicity and their ability to control the type I error (i.e. the false positive rate) without any distributional assumptions.
In practice, it is often conditional independence that is of primary interest (Dawid, 1979). For instance, in generalized linear models for a response Y ∈ R regressed on a high dimensional feature vector (X, Z) = (X, Z_1, ..., Z_p) ∈ R^{p+1}, the regression coefficient on feature X is 0 if and only if Y and X are conditionally independent given the remaining p features, Z = (Z_1, ..., Z_p). In this paper, we shall study the general problem of testing X ⊥⊥ Y | Z. (In the regression literature, it is more common to use the notation of regressing Y on (X_1, ..., X_p), and testing whether the coefficient on feature X_j is 0 after controlling for the remaining features X_{-j} = (X_1, ..., X_{j-1}, X_{j+1}, ..., X_p); thus X_j and X_{-j} correspond to our X and Z respectively.) We are typically interested in the setting where X and Y are one dimensional whereas Z is a high dimensional set of confounding variables that we would like to control for, but our results are not specific to this setting.
Within standard parametric regression models, conditional independence tests are well developed; unfortunately, however, they fail to control type I error under model misspecification. In fact, recent work of Shah and Peters (2019) has shown that, without placing some assumptions on the joint distribution of (X, Y, Z), conditional independence testing is effectively impossible: when (X, Y, Z) is continuously distributed, they proved that there is no conditional independence test that both (a) controls type I error over every null distribution (i.e. every distribution of (X, Y, Z) with X ⊥⊥ Y | Z) and (b) has better than random power against even one alternative hypothesis.
Our work complements this fundamental result of Shah and Peters (2019) by demonstrating that, given some additional knowledge, namely an approximation to the conditional distribution of X given Z, we can derive conditional independence tests that are approximately valid in finite samples, and that have non-trivial power.

Summary of contributions
In this paper, we introduce a new method, called the conditional permutation test (CPT), which is inspired by the conditional randomization test (CRT) of Candès et al. (2018). The CPT modifies the standard permutation test by using available distributional information to account correctly for the confounding variables Z, which leads to a non-uniform distribution over the set of possible permutations π on the n observations in our data set and restores type I error control.
Implementing the CPT is a challenging problem since we are sampling from a highly non-uniform distribution over the space of n! permutations, but we propose a Monte Carlo sampler that yields an efficient implementation of the test. We additionally develop theoretical results examining the robustness of both the CPT and the CRT to slight errors in modelling assumptions, proving that the type I error is only slightly inflated in both tests when our available distributional information is only approximately correct. In fact, in the worst case, the type I error is always less inflated for the new CPT method than for the CRT. Our empirical results verify the greater robustness of the CPT, which maintains comparable power in a range of scenarios.

Background
In this section, we briefly summarize several existing approaches to the problem of testing for dependence between X and Y in the presence of confounding variables. Before beginning, it will be helpful to define some notation. Throughout, we shall assume that the data consist of independent and identically distributed (IID) data points (X_i, Y_i, Z_i) ∈ 𝒳 × 𝒴 × 𝒵 for i = 1, ..., n, and write X = (X_1, ..., X_n), Y = (Y_1, ..., Y_n) and Z = (Z_1, ..., Z_n).

Permutation tests
One key reason why handling conditional independence in non-parametric contexts is so challenging is that the permutation approaches that are so effective for testing unconditional independence, X ⊥⊥ Y, cannot be directly applied when we seek to test conditional independence, X ⊥⊥ Y | Z. This is because the null hypothesis H_0 : X ⊥⊥ Y | Z may be true while X and Y are highly dependent marginally, because of correlation induced via each variable's dependence on Z. Under this null, if we sample a permutation π of {1, ..., n} uniformly at random, then the permuted data set (X_{π(1)}, Y_1), ..., (X_{π(n)}, Y_n) may have a very different distribution from the original data set (X_1, Y_1), ..., (X_n, Y_n), because of the confounding effect of Z.
In certain settings, in particular where Z is categorical, there is a simple and well-known fix for this problem: we can group the observations according to their value of Z, and then permute within groups. For example, if Z ∈ {0, 1} is binary, we could draw a permutation π that permutes the X_i's within the set of indices {i : Z_i = 0} and separately permutes the X_i's within the set {i : Z_i = 1}. However, this strategy cannot be applied directly when Z is continuously distributed, or when Z is discrete but with few repeated values (note that, when Z is high dimensional, even if it is discrete, each observation i will typically have a unique feature vector Z_i). In these settings, it is common to use a binning strategy, where first Z is discretized to fall into finitely many bins, and then the 'permute-within-groups' strategy is deployed. However, type I error control is no longer guaranteed, since the null hypothesis H_0 : X ⊥⊥ Y | Z does not imply that X ⊥⊥ Y | (Z ∈ bin b); the best that we can usually hope for is that the latter statement would be approximately true under the null. Furthermore, in a high dimensional setting, choosing these bins can itself be very challenging.
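As a concrete illustration, the classical permute-within-groups fix for a categorical Z can be sketched in a few lines of Python; the function name and toy data below are our own, not from the paper.

```python
import random

def permute_within_groups(x, z, rng):
    """Return a copy of x whose entries are shuffled only within
    groups of indices sharing the same (categorical) value of z."""
    x_perm = list(x)
    groups = {}
    for i, zi in enumerate(z):
        groups.setdefault(zi, []).append(i)
    # shuffle the x-values separately inside each group of indices
    for idx in groups.values():
        vals = [x[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            x_perm[i] = v
    return x_perm

# toy example with binary Z: values may move only within their Z-group
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [0, 0, 0, 1, 1, 1]
x_perm = permute_within_groups(x, z, random.Random(0))
```

The first three entries of `x_perm` are a rearrangement of {1, 2, 3} and the last three of {4, 5, 6}, so the dependence on the categorical confounder is preserved exactly.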
Apart from independence testing, permutation tests are also popular in other settings in which the null hypothesis is exchangeable (Ernst, 2004). Moreover, Roach and Valdar (2018) developed a theory of generalized permutation tests, primarily in the context of testing simple hypotheses, for non-exchangeable null models where the weights that are assigned to permutations are non-uniform.

The conditional randomization test
The CRT, which was proposed by Candès et al. (2018), works in a setting where no assumptions are made about the distribution of the response variable Y; instead, it is assumed that the conditional distribution of X given Z is known. In practice, in semisupervised learning settings where unlabelled data (X, Z) are easier to obtain than labelled data (X, Y, Z), it may be possible to obtain a very accurate estimate of the conditional distribution of X|Z, while testing for independence with Y remains challenging because of the limited sample size of the labelled data. Candès et al. (2018), section 1.3, gave examples of applications where unlabelled (X, Z) data are amply available whereas labelled data (X, Y, Z) are scarce, e.g. genomewide association studies, where it is important to determine whether a particular genetic variant X affects a response Y, such as disease status or some other phenotype, even after controlling for the rest of the genome, encoded in Z. Human genome data, i.e. (X, Z) data, are now plentiful, but labelled data (X, Y, Z) are expensive: if we do not know the disease status Y of the individuals in previously collected samples, we need to obtain the (X, Y, Z) samples ourselves.
Assuming then that the distribution of X|Z is known (or is estimated accurately from a large sample of unlabelled data), the CRT operates by sampling a new copy of the X-values in the data set. Letting Q(·|z) denote the distribution of X given Z = z, conditionally on Z_1, ..., Z_n the CRT draws X^{(1)}_i ∼ Q(·|Z_i), independently for each i = 1, ..., n, and independently of the observed X_i's and Y_i's. (In the special case where X is binary, earlier work by Rosenbaum (1984) proposed a related test, which is referred to as a 'CPT' but which in fact resamples X by estimating P(X = 1|Z) with a logistic model.) Under the null hypothesis H_0 that X ⊥⊥ Y | Z, we see that

(X^{(1)}, Y, Z) =_d (X, Y, Z),

where =_d denotes equality in distribution. This means that the original data (X, Y, Z) and the resampled triple (X^{(1)}, Y, Z) are identically distributed. Any large differences between these two triples (for instance, if Y is highly correlated with X but not with X^{(1)}) can therefore be interpreted as evidence against the null hypothesis. To construct a test of H_0, then, the CRT repeats this process M times, sampling X^{(m)}_i | X, Y, Z ∼ Q(·|Z_i), independently for i = 1, ..., n and m = 1, ..., M, to form control vectors X^{(1)}, ..., X^{(M)}. Under the null hypothesis, the triples (X, Y, Z), (X^{(1)}, Y, Z), ..., (X^{(M)}, Y, Z) are all identically distributed; in fact, they are exchangeable. For any statistic T = T(X, Y, Z) that is chosen in advance (or, at least, without looking at X), the random variables

T(X, Y, Z), T(X^{(1)}, Y, Z), ..., T(X^{(M)}, Y, Z)    (1)

are therefore exchangeable as well. We can compute a p-value by ranking the value that is obtained from the true X-vector against the values that are obtained from the CRT's copies:

p = (1 + Σ_{m=1}^{M} 1{T(X^{(m)}, Y, Z) ≥ T(X, Y, Z)}) / (1 + M).

The exchangeability of the random variables in expression (1) ensures that this is a valid p-value under the null, i.e. it satisfies P(p ≤ α) ≤ α for all α ∈ [0, 1] if the null hypothesis H_0 is true. The 'model-X knockoffs' framework of Candès et al. (2018) also extends the CRT technique to the high dimensional variable selection setting, where each of p features is tested in turn for conditional independence with the response Y, with the goal of false discovery rate control. In this framework, only a single copy of each feature is created. The robustness of the model-X knockoffs method, with respect to errors in the conditional distributions that are used to construct the knockoff copies of each feature (analogous to the X^{(m)}'s above), has also been studied in subsequent work.
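To make the CRT mechanism concrete, here is a minimal Python sketch under the illustrative assumption that Q(·|z) = N(z, 1) is known exactly, with T taken to be the absolute sample covariance between the X-vector and Y; all names and the toy data-generating model are our own.

```python
import math
import random

def crt_pvalue(X, Y, Z, M=200, seed=0):
    """CRT p-value sketch: compare T(X, Y, Z) with T(X^(m), Y, Z) for M
    copies X^(m) drawn coordinatewise from the assumed Q(.|z) = N(z, 1)."""
    rng = random.Random(seed)
    n = len(X)

    def T(Xv):
        xbar = sum(Xv) / n
        ybar = sum(Y) / n
        return abs(sum((xi - xbar) * (yi - ybar) for xi, yi in zip(Xv, Y)))

    t_obs = T(X)
    exceed = 0
    for _ in range(M):
        Xm = [rng.gauss(zi, 1.0) for zi in Z]   # X^(m)_i ~ Q(.|Z_i)
        if T(Xm) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + M)

# a null data set: X and Y are dependent only through the confounder Z
rng = random.Random(1)
Z = [rng.gauss(0, 1) for _ in range(100)]
X = [z + rng.gauss(0, 1) for z in Z]
Y = [z + rng.gauss(0, 1) for z in Z]
p = crt_pvalue(X, Y, Z)
```

Here X and Y are strongly correlated marginally, yet because the null X ⊥⊥ Y | Z holds and the copies respect Q(·|Z), the p-value is valid.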

Other tests of conditional independence
Before introducing our new work, we give a brief overview of some additional conditional independence testing methods that have been proposed in the literature. Many methods assume some parametric model for the response Y, such as a linear model, Y = αX + β^T Z + (noise), in which case the problem reduces to testing whether α = 0. This can be tested by, for instance, computing an estimate β̂ and testing whether the residual Y − β̂^T Z is correlated with X. Belloni et al. (2014) proposed a variant on this approach, which assumes approximate linear models for both Y and X. Their method regresses both X and Y on Z and then tests for correlation between the two resulting residual vectors; this 'double regression' offers superior performance by removing much of the bias coming from errors in estimating the effect of Z. Shah and Peters (2019) consider a more general double-regression framework, assuming that the conditional means E[X|Z = z] and E[Y|Z = z] can be estimated at a sufficiently fast rate. Away from the regression setting, many proposed methods are based on kernel representations or low dimensional projections of the data. Tests based on embedding the data in reproducing kernel Hilbert spaces have been studied by, for example, Fukumizu et al. (2008), Zhang et al. (2011) and Strobl et al. (2019). Other works use permutations of the data, including Doran et al. (2014) and Sen et al. (2017), where the methods have the flavour of binning Z and then permuting within groups. Bergsma (2004), Song (2009) and Veraverbeke et al. (2011) studied copula methods for testing conditional independence. There is also a large literature on extending measures of marginal independence to the conditional setting, including partial distance covariance (Székely and Rizzo, 2014), conditional mutual information (Runge, 2018), characteristic functions (Su and White, 2007), Hellinger distances (Su and White, 2008) and smoothed empirical likelihoods (Su and White, 2014).
A related problem is that of testing the null hypothesis that a certain treatment has no effect in a randomized experiment. In the treatment effects literature it is common to calculate p-values by comparing a test statistic with null statistics that are based on randomly reassigning treatments in the data. However, in some situations, uniformly random reassignment is inappropriate and does not result in valid p-values, because of some underlying structure; see Athey et al. (2018) for network dependence and Hennessy et al. (2016) for covariate imbalance. In such cases it is sometimes possible to develop non-uniform randomization schemes that result in valid p-values, as with the CPT and the CRT.

The conditional permutation test
Recall that the CRT (Candès et al., 2018) creates copies X^{(m)} of the vector X, sampled under the null hypothesis that X ⊥⊥ Y | Z, by drawing

X^{(m)} | X, Y, Z ∼ Q^n(·|Z),    (2)

independently for m = 1, ..., M, where we define Q^n(·|Z) := Q(·|Z_1) × ... × Q(·|Z_n). This mechanism creates copies X^{(1)}, ..., X^{(M)} that are exchangeable with the original vector X under the null hypothesis that X ⊥⊥ Y | Z. Our proposed method, the CPT, is a variant on the CRT, with X^{(1)}, ..., X^{(M)} drawn as in distribution (2) but under the constraint that each X^{(m)} must be a permutation of the original vector X. Once we have drawn X^{(1)}, ..., X^{(M)}, they will then be used exactly as for the CRT: given some predefined statistic T = T(X, Y, Z), our p-value is given by

p = (1 + Σ_{m=1}^{M} 1{T(X^{(m)}, Y, Z) ≥ T(X, Y, Z)}) / (1 + M).    (3)

All that remains, then, is to specify how these permuted copies X^{(m)} will be drawn.
To draw the X^{(m)}'s, we first need to define some notation. Let S_n denote the set of permutations on the indices {1, ..., n}. Given any vector x = (x_1, ..., x_n) and any permutation π ∈ S_n, define x_π = (x_{π(1)}, ..., x_{π(n)}), i.e. the vector x with its entries reordered according to the permutation π.
The CPT copies X^{(1)}, ..., X^{(M)} are then drawn as follows: after observing X, Y and Z, we draw M permutations π^{(1)}, ..., π^{(M)} according to the conditional distribution of X|Z, and then apply these permutations to X. Specifically, we set X^{(m)} = X_{π^{(m)}}, where each π^{(m)} is drawn with

P(π^{(m)} = π | X, Y, Z) = q^n(X_π|Z) / Σ_{π'∈S_n} q^n(X_{π'}|Z).    (4)

Here we let q(·|z) be the density of the distribution Q(·|z) (i.e. q(·|z) is the conditional density of X given Z = z), with respect to some base measure ν on 𝒳 that does not depend on z. We write q^n(·|Z) := q(·|Z_1) ⋯ q(·|Z_n) to denote the product density. Note that we are not necessarily assuming a continuous distribution; the base measure may be discrete, allowing X to be discrete as well.
Why is this the right distribution for drawing the permuted copies X^{(1)}, ..., X^{(M)}? To understand this, it is helpful to consider a different formulation of the permutation scheme. Let X_{()} = (X_{(1)}, ..., X_{(n)}) be the order statistics of the list of values X = (X_1, ..., X_n). (In the setting where 𝒳 = R, we can of course use the usual ordering on R. In the general case we can simply take an arbitrary total ordering on 𝒳; the choice of ordering is irrelevant, as its only role is to allow us to observe the set of values of X without knowing which one corresponds to which data point.) Define also X_{(π)} = (X_{(π(1))}, ..., X_{(π(n))}) for each π ∈ S_n, and let Π ∈ S_n be the permutation given by the ranks of the true observed vector X, so that X = X_{(Π)}. In other words, X_{()} gives the order statistics of X, and Π reveals the ranks; together these two pieces of information are sufficient to reconstruct X. (If the unlabelled values X_{(i)} are not unique then, formally, we define Π by choosing it uniformly at random from the set of all permutations that satisfy this condition.) Under the null hypothesis that X ⊥⊥ Y | Z, we can verify that the distribution of the true ranks Π, conditionally on Y and Z as well as on the order statistics X_{()}, is given by

P(Π = π | X_{()}, Y, Z) = q^n(X_{(π)}|Z) / Σ_{π'∈S_n} q^n(X_{(π')}|Z).    (5)

Furthermore, examining the definition (4) of the CPT copies X^{(1)}, ..., X^{(M)}, we can see that the CPT can equivalently be defined by drawing permutations Π^{(1)}, ..., Π^{(M)} IID from distribution (5) and setting X^{(m)} = X_{(Π^{(m)})}; in fact, comparing with expression (4), we see that Π^{(m)} = Π ∘ π^{(m)}. The following theorem formalizes the above intuition and verifies that this procedure yields a valid test of H_0.
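For very small n, the permutation distribution (4) can be computed exactly by enumerating all n! permutations. The brute-force sketch below (our own illustration, with an assumed Gaussian model for q) makes the non-uniformity explicit, using log-densities for numerical stability.

```python
import itertools
import math

def cpt_exact_distribution(X, Z, logq):
    """Enumerate P(pi) proportional to prod_i q(X_{pi(i)} | Z_i) over all
    n! permutations.  logq(x, z) is the log conditional density;
    enumeration is feasible only for tiny n."""
    n = len(X)
    perms = list(itertools.permutations(range(n)))
    logw = [sum(logq(X[p[i]], Z[i]) for i in range(n)) for p in perms]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]          # stabilized weights
    total = sum(w)
    return perms, [wi / total for wi in w]

# illustrative model: q(.|z) is the N(z, 1) density (constants cancel)
logq = lambda x, z: -0.5 * (x - z) ** 2
X = [0.1, -1.2, 0.9, 2.0]
Z = [0.0, -1.0, 1.0, 2.1]
perms, probs = cpt_exact_distribution(X, Z, logq)
```

In this example the identity permutation, which matches each X-value to the closest Z-value, receives the largest probability, illustrating how the CPT weights permutations to respect the dependence between X and Z.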

Comparing the conditional permutation test and conditional randomization test
To compare the construction of the copies X^{(1)}, ..., X^{(M)} in the two methods: for the CPT, the copies X^{(1)}, ..., X^{(M)} are IID draws from the null distribution of X, conditionally on X_{()}, Y and Z. In comparison, the CRT copies defined in expression (2) are IID draws from the null distribution of X conditionally on Y and Z, but without conditioning on X_{()}.
Each of these two constructions yields a valid test if the distribution Q(·|z) that is used to draw the (resampled or permuted) copies X^{(m)} is correct, i.e. if we know the true conditional distribution of X|Z. This result is proved in theorem 1 above for the CPT, whereas the analogous result for the CRT is proved in lemma F.1 on page 4 of the on-line supplement to Candès et al. (2018). However, if the null hypothesis is not true, which method might be more sensitive and better able to detect a non-null? Furthermore, what might occur for these two methods if Q(·|z) is not exactly correct? We next explore the difference between the two methods in greater detail to begin to address these questions.

Use of marginal distribution of X
In terms of how the tests are run, the difference between the CPT and the CRT can be described as follows: whereas both tests use the (true or estimated) conditional distribution Q(·|Z), the CPT additionally uses the marginal distribution of the observed data vector X, by observing its unlabelled values X_{()}. Intuitively, using this additional information can in some cases make the copies X^{(m)} more similar to the original X than for the CRT. Therefore, the CPT may be somewhat less likely to reject H_0, which could lead to lower type I error if H_0 is true, or reduced power to detect when H_0 is false. In Section 5, we shall develop theory to examine the two tests' robustness to errors in estimating the conditional distribution Q(·|Z), and we shall compare the tests in terms of both type I error and power in experiments in Section 6.

Invariance to base measure
Since the CPT works only over permutations of the same set of X-values, it follows that it is invariant to changes in the base measure on 𝒳. To make this concrete, suppose that q_1(·|z) is another conditional density, with the property that there are functions h(·) and c(·) such that q_1(x|z) = q(x|z) h(x) c(z) for all x ∈ 𝒳 and all z ∈ 𝒵. (Here we can think of h(x) as changing the base measure on 𝒳, whereas c(z) adjusts the normalizing constants as needed.) If this is so, then running the CPT with q_1 in place of q will have no effect on the outcome. This is because we can calculate

q_1^n(X_π|Z) = ∏_{i=1}^n q(X_{π(i)}|Z_i) h(X_{π(i)}) c(Z_i) = q^n(X_π|Z) · ∏_{i=1}^n h(X_i) c(Z_i).

The first term, q^n(X_π|Z), is the same as for the CPT run with conditional density q, whereas the second term, ∏_{i=1}^n h(X_i) c(Z_i), does not depend on the permutation π and therefore does not affect the resulting distribution of the sampled permutations. In other words, the CPT sampling distribution (4) is unchanged if we replace q with q_1.
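This invariance can be checked numerically: tilting q by any factors h(x) and c(z) leaves the permutation probabilities (4) unchanged, since the extra factors are identical for every π. A small self-contained check (our own illustration, with h(x) = exp(x²/4) and c(z) = exp(−z) as arbitrary tilts):

```python
import itertools
import math

X = [0.2, -1.0, 1.5, 0.3]
Z = [0.0, -0.8, 1.2, 0.5]

def perm_probs(logq):
    """Normalized CPT permutation probabilities by brute force (n = 4)."""
    perms = list(itertools.permutations(range(4)))
    logw = [sum(logq(X[p[i]], Z[i]) for i in range(4)) for p in perms]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]
    total = sum(w)
    return [wi / total for wi in w]

logq = lambda x, z: -0.5 * (x - z) ** 2            # N(z, 1), up to constants
# tilt by h(x) = exp(x^2/4) and c(z) = exp(-z): same base-measure class
logq_tilted = lambda x, z: logq(x, z) + x * x / 4.0 - z

p0 = perm_probs(logq)
p1 = perm_probs(logq_tilted)
max_gap = max(abs(a - b) for a, b in zip(p0, p1))  # numerically zero
```

The two probability vectors agree to machine precision, as the algebra above predicts.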
This means that the CPT is a valid test, i.e. the result of theorem 1 holds, even if the conditional density q(·|z) is correct only up to a change in base measure; that is, theorem 1 holds whenever the conditional distribution Q(·|Z) has a density of the form q(x|z) h(x) c(z), for some functions h(·) and c(·). Indeed, in some settings, it may be substantially simpler to estimate the conditional density only up to base measure: for instance, we can consider a semiparametric model with a conditional density of the form exp{x z^T θ − f(x) − g(z)}, in which case the CPT would need to estimate only the parametric component θ. In contrast, running the CRT requires being able to sample from the conditional distribution Q(·|Z), so we would need to approximate the full conditional density.

Sampling algorithms for the conditional permutation test
To run the CPT, we need to be able to sample permutations Π^{(1)}, ..., Π^{(M)} from distribution (5). We now turn to the problem of generating such samples efficiently.
One simple approach would be to run a Metropolis-Hastings algorithm with a proposal distribution that, from a current state π, draws its proposed permutation π' uniformly at random. For even a moderate n, however, the acceptance odds ratio

q^n(X_{π'}|Z) / q^n(X_π|Z)    (7)

will be extremely low for nearly all permutations π' (unless, of course, the dependence of X on Z is very weak). In other words, a uniformly drawn permutation π' is not likely to lead to a plausible vector of X-values, leading to slow mixing times. As a second attempt, we can consider a different proposal distribution: from the current state π, we propose the permutation π' = π ∘ σ_{ij}, where σ_{ij} is the permutation that swaps indices i and j, which are drawn at random. The acceptance odds ratio (7) now simplifies to

q(X_{π(j)}|Z_i) q(X_{π(i)}|Z_j) / {q(X_{π(i)}|Z_i) q(X_{π(j)}|Z_j)}.    (8)

The probability of accepting a swap will now be reasonably high; however, each step can alter only two of the n indices, again leading to slow mixing times.
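The pairwise-swap Metropolis-Hastings chain with acceptance odds (8) is straightforward to sketch; the implementation below is our own minimal version, with an assumed Gaussian log-density `logq` and the usual accept-with-probability-min(1, odds) rule.

```python
import math
import random

def swap_mcmc(X, Z, logq, steps, rng):
    """Single-swap Metropolis-Hastings chain on permutations: propose
    swapping positions i and j, accept with probability min(1, odds),
    where odds is the ratio in expression (8)."""
    n = len(X)
    pi = list(range(n))                 # start at the observed assignment
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        log_odds = (logq(X[pi[j]], Z[i]) + logq(X[pi[i]], Z[j])
                    - logq(X[pi[i]], Z[i]) - logq(X[pi[j]], Z[j]))
        if log_odds >= 0 or rng.random() < math.exp(log_odds):
            pi[i], pi[j] = pi[j], pi[i]
    return pi

logq = lambda x, z: -0.5 * (x - z) ** 2   # illustrative N(z, 1) model
rng = random.Random(0)
Z = [rng.gauss(0, 1) for _ in range(50)]
X = [z + rng.gauss(0, 0.5) for z in Z]
pi = swap_mcmc(X, Z, logq, steps=2000, rng=rng)
```

The output is always a valid permutation of {0, ..., n−1}; as the text notes, each step changes at most two positions, which is what motivates the parallelized variant below.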

A parallelized pairwise sampler
To address these issues, we propose a parallelized version of this pairwise algorithm. At each step, we first draw ⌊n/2⌋ disjoint pairs of indices from {1, ..., n}. Next, independently and in parallel for each pair (i, j), we decide whether or not to swap that pair, according to the odds ratio (8). This sampler is defined formally in algorithm 1 in Table 1. For ease of our theoretical analysis, we shall work with the order statistics X_{()}, rather than the original ordered vector X, in our sampler; this difference is only in the notation, i.e. the algorithm can equivalently be implemented with X in place of X_{()}.
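One sweep of this parallelized scheme can be sketched as follows. We draw a uniformly random disjoint pairing of the indices, then decide each swap independently; here each swap is accepted with probability odds/(1 + odds), a Gibbs-type rule consistent with detailed balance (this is our own sketch and does not reproduce the exact bookkeeping of the paper's algorithm 1).

```python
import math
import random

def parallel_pairwise_step(pi, X, Z, logq, rng):
    """One sweep of a parallelized pairwise sampler: pair up the indices
    at random, then accept or reject each disjoint swap independently,
    using the odds ratio of expression (8)."""
    n = len(pi)
    order = list(range(n))
    rng.shuffle(order)                          # random disjoint pairing
    for k in range(0, n - 1, 2):
        i, j = order[k], order[k + 1]
        log_odds = (logq(X[pi[j]], Z[i]) + logq(X[pi[i]], Z[j])
                    - logq(X[pi[i]], Z[i]) - logq(X[pi[j]], Z[j]))
        # swap with probability odds / (1 + odds), computed stably
        if log_odds > 0:
            p_swap = 1.0 / (1.0 + math.exp(-log_odds))
        else:
            e = math.exp(log_odds)
            p_swap = e / (1.0 + e)
        if rng.random() < p_swap:
            pi[i], pi[j] = pi[j], pi[i]
    return pi

logq = lambda x, z: -0.5 * (x - z) ** 2
rng = random.Random(0)
Z = [rng.gauss(0, 1) for _ in range(40)]
X = [z + rng.gauss(0, 0.5) for z in Z]
pi = list(range(40))
for _ in range(50):                             # S = 50 sweeps
    pi = parallel_pairwise_step(pi, X, Z, logq, rng)
```

Because the pairs are disjoint, all ⌊n/2⌋ swap decisions within a sweep can be made independently, so a sweep can update many coordinates at once rather than just two.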
The next theorem verifies that the resulting Markov chain yields the desired stationary distribution. (The proof of this theorem and all remaining proofs are given in Appendix A.) Theorem 2. For every initial permutation Π^{[0]}, the distribution (5) of the permutation Π conditionally on X_{()}, Y and Z is a stationary distribution of the Markov chain defined in algorithm 1. If additionally q(x|z) > 0 for all x ∈ 𝒳 and all z ∈ 𝒵, then it is the unique stationary distribution.
This result justifies the idea that, if algorithm 1 is run for a sufficient number of steps S, then the resulting copy X_{(Π^{[S]})} acts as an appropriate control for X in testing conditional independence. In fact, though, we can make a much stronger statement: since the original permutation Π also follows distribution (5) conditionally on X_{()}, Y and Z under the null, by initializing algorithm 1 at Π^{[0]} = Π (i.e. at the original data vector X) we are initializing with a draw from the stationary distribution. Therefore X^{[S]} = X_{(Π^{[S]})} is a draw from the target distribution for any S, and is a valid control for X even if the number of steps S is small. Of course, if S is too small, then the control copy will be too similar to the original data vector X, and our power to reject the null will be low; we explore this empirically in Section 6, where we shall see that the sampler mixes well at even a moderate S (for example, in our experiments we used S = 50).
In practice, we want to draw M copies, X^{(m)} for m = 1, ..., M, and we need to ensure that the original data X and each of the M permuted copies X^{(m)} are all exchangeable with each other. If we were to sample the permuted vectors X^{(1)}, ..., X^{(M)} sequentially, by running algorithm 1 for SM steps and extracting one copy X^{(m)} after each round of S steps, then we would not achieve exchangeability, since there would be some correlation between adjacent copies in this sequence. (Of course, in practice, if the number of steps S is chosen to be large, then the violation of exchangeability would be very mild.) Instead, we can construct an exchangeable sampling mechanism with algorithm 2 in Table 2.
Theorem 3. Let X_{()} and Π be the order statistics and ranks of X, as defined previously, so that X = X_{(Π)}. Let Π^{(1)}, ..., Π^{(M)} be the output of algorithm 2, when initialized at Π_init = Π, and let X^{(m)} = X_{(Π^{(m)})} for each m = 1, ..., M. Assume that the null hypothesis that X ⊥⊥ Y | Z holds, and that the conditional distribution of X|Z is given by Q(·|Z), so that the distribution of Π conditionally on X_{()}, Y and Z is given by expression (5). Then the triples (X, Y, Z), (X^{(1)}, Y, Z), ..., (X^{(M)}, Y, Z) are exchangeable. This result ensures that the results of theorem 1 hold when the permuted vectors X^{(1)}, ..., X^{(M)} are obtained via the exchangeable sampler.
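The idea behind an exchangeable sampler of this kind can be sketched as a hub-and-spokes construction: run the reversible chain S steps from the observed permutation to an intermediate 'hub' state, then launch M independent S-step runs from that hub. By reversibility, the original permutation and the M outputs are conditionally IID given the hub, hence exchangeable. The code below is our own illustration of this idea and does not reproduce algorithm 2's exact bookkeeping.

```python
import math
import random

def swap_step(pi, X, Z, logq, rng):
    """One Metropolis swap step with acceptance odds as in expression (8)."""
    i, j = rng.sample(range(len(pi)), 2)
    log_odds = (logq(X[pi[j]], Z[i]) + logq(X[pi[i]], Z[j])
                - logq(X[pi[i]], Z[i]) - logq(X[pi[j]], Z[j]))
    if log_odds >= 0 or rng.random() < math.exp(log_odds):
        pi[i], pi[j] = pi[j], pi[i]
    return pi

def exchangeable_copies(X, Z, logq, M, S, rng):
    """Hub-and-spokes sampler: S steps from the observed permutation to a
    hub state, then M independent S-step runs from the hub."""
    hub = list(range(len(X)))                   # start at the observed ranks
    for _ in range(S):
        hub = swap_step(hub, X, Z, logq, rng)
    copies = []
    for _ in range(M):
        pi = list(hub)
        for _ in range(S):
            pi = swap_step(pi, X, Z, logq, rng)
        copies.append(pi)
    return copies

logq = lambda x, z: -0.5 * (x - z) ** 2
rng = random.Random(0)
Z = [rng.gauss(0, 1) for _ in range(30)]
X = [z + rng.gauss(0, 0.5) for z in Z]
copies = exchangeable_copies(X, Z, logq, M=5, S=100, rng=rng)
```

The key design point is that all M runs branch from the same hub rather than continuing one long chain, which removes the sequential correlation between adjacent copies described in the text.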

Robustness of the conditional permutation test and conditional randomization test
We next consider whether the CPT and the CRT, based on resampling X from a known or estimated conditional distribution given Z, are robust to slight errors in this distribution. Suppose that the conditional distribution Q(·|Z) that we use for sampling when running the CPT or the CRT is only an approximation to the true conditional distribution, denoted by Q*(·|Z). In this section we provide bounds on the excess type I error of the CPT and the CRT as a function of the difference between the true conditional distribution Q* and its approximation Q. Throughout, we shall assume that the statistic T : 𝒳^n × 𝒴^n × 𝒵^n → R that is used in the test, as well as the approximation Q to the conditional distribution, are chosen independently of X and Y. For instance, in many applications, we may have access to unlabelled data, i.e. draws of (X, Z) without Y, which we can use to construct an estimate Q.
Our first result demonstrates that, conditionally on Y and Z, the excess type I error of both the CPT and the CRT is bounded by the total variation distance between Q* and Q. (For any two distributions Q_1 and Q_2 that are defined on the same probability space, the total variation distance is defined as d_TV(Q_1, Q_2) = sup_A |Q_1(A) − Q_2(A)|, where the supremum is taken over all measurable sets.) Theorem 4. Assume that H_0 : X ⊥⊥ Y | Z is true, and that the conditional distribution of X|Z is given by Q*(·|Z). For a fixed integer M ≥ 1, let X^{(1)}, ..., X^{(M)} be copies of X generated either from the CRT (2), from the CPT (4), or from the exchangeable sampler for the CPT (algorithm 2) with any fixed parameter S ≥ 1, using an estimate Q of the true conditional distribution Q*.
Then, for any desired type I error rate α ∈ [0, 1],

P(p ≤ α) ≤ α + d_TV{Q*^n(·|Z), Q^n(·|Z)},

where p is the p-value that is computed in expression (3), and the probability is taken with respect to the distribution of X, X^{(1)}, ..., X^{(M)} conditionally on Y and Z.
Of course, we can also bound the type I error rate unconditionally:

P(p ≤ α) ≤ α + E[d_TV{Q*^n(·|Z), Q^n(·|Z)}],

which we obtain from the result above by marginalizing over Y and Z. This result ensures that, if Q is a good approximation to Q*, then both the CPT and the CRT will have at most a mild increase in their type I error. Of course, theorem 4 is a worst-case result, proved with respect to an arbitrary statistic T, which may be chosen adversarially to be maximally sensitive to errors in estimating the true conditional distribution Q*. In practice, we might expect that the simple statistics T that we would most often use, such as the correlation between X and Y, could be more robust to errors than theorem 4 suggests.
Although theorem 4 provides an upper bound on the type I error for both the CPT and the CRT, we do not yet have a comparison between the two. The following theorem proves that, for the CRT, the upper bound is tight when the number of copies X^{(1)}, ..., X^{(M)} is large.
Theorem 5. Under the setting and assumptions of theorem 4, there is a statistic T : 𝒳^n × 𝒴^n × 𝒵^n → R such that, for the CRT, the excess type I error P(p ≤ α | Y, Z) − α approaches d_TV{Q*^n(·|Z), Q^n(·|Z)} for an adversarial choice of α, up to a factor that vanishes as M → ∞. In other words, if we use the statistic T that is best able to detect errors in our conditional distribution, and we choose α adversarially, then the excess type I error of the CRT is exactly equal to d_TV{Q*^n(·|Z), Q^n(·|Z)} (up to a vanishing factor), and is therefore at least as high as that of the CPT under any statistic.
Unlike for the CRT, we have found that there is no simple characterization of the worst-case scenario for the CPT. In particular, for some specially constructed distributions on (X, Y, Z), we can show that the CPT achieves the same lower bound as given in theorem 5 for the CRT (again, under a worst-case choice of the statistic T), but for other joint distributions on (X, Y, Z) we can verify that the CPT cannot achieve this error rate. In particular, since the CPT is invariant to the base measure (as discussed in Section 3.1), if Q(·|z) is correct up to the base measure, then the excess type I error of the CRT may be as large as d_TV{Q*^n(·|Z), Q^n(·|Z)} whereas the CPT is guaranteed to control the type I error at level α.
It is important to note that the lower bound for the CRT in theorem 5 applies only to a specific worst-case statistic T and does not guarantee that the excess error of the CRT will bound that of the CPT when both tests use some other statistic T . However, in Section 6 we shall see that, empirically, the CPT often yields a far lower type I error than does the CRT in simulations. Thus, we interpret theorem 5 as giving us a partial theoretical understanding of this phenomenon, since it addresses only the worst-case statistic.

When is the total variation distance small?
For theorem 4 to have practical implications, we need to verify that there are settings where, although the true distribution Q* of X|Z is unknown, it can be estimated to high accuracy, with d_TV{Q*^n(·|Z), Q^n(·|Z)} = o_p(1) (so that the excess type I error is guaranteed to be small). As discussed in Section 2.2, in many applications we may have a large unlabelled data set, say (X_i^unlab, Z_i^unlab), i = 1, ..., N, with which we can compute an estimate Q of Q*. (As has been noted in the setting of model-X knockoffs, the unlabelled data set does not need to have the same distribution over (X, Z) as the labelled data, as long as the conditional distribution of X|Z is the same.) In this section, we briefly sketch two settings where, given a large unlabelled sample size N, our estimate Q is likely to satisfy d_TV{Q*^n(·|Z), Q^n(·|Z)} = o_p(1). Our results here are stated informally, with no technical details, since we aim only to give intuition for the settings where theorem 4 is useful.

Parametric setting
We shall use Pinsker's inequality relating total variation distance to the Kullback-Leibler divergence, which for the product measures at hand yields

d_TV{Q*^n(·|Z), Q^n(·|Z)} ≤ √[(1/2) Σ_{i=1}^n d_KL{Q*(·|Z_i), Q(·|Z_i)}].

It is therefore sufficient to show that Σ_{i=1}^n d_KL{Q*(·|Z_i), Q(·|Z_i)} = o_p(1). If the true conditional distribution Q*(·|z) belongs to a parametric family, then this will typically hold whenever the unlabelled sample size satisfies N ≫ nk, where k is the number of parameters defining the models in the family. Specifically, we can think of a setting where Q*(·|z) has density f_θ*(·|z), where θ* ∈ R^k is the unknown parameter vector while the family of densities f_θ(·|z) is known. For example, suppose that Z = R^{k−1}, and the conditional distribution of X|Z is given by X|Z = z ∼ N(z^T β*, σ*²). Then the unknown parameters are θ* = (β*, σ*²) and standard least squares theory enables us to produce independent (maximum likelihood) estimates β̂ and σ̂² satisfying ‖β̂ − β*‖² = O_p(k/N) and |σ̂² − σ*²| = O_p(N^{−1/2}) under mild conditions on the distribution of Z. Putting everything together, if Z has a finite second moment we then have Σ_{i=1}^n d_KL{Q*(·|Z_i), Q(·|Z_i)} = O_p(nk/N), which is vanishing as long as the unlabelled sample size satisfies N ≫ nk.
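To make the estimation step concrete, the following sketch (an illustration under the Gaussian working model above, not code from the paper) fits (β̂, σ̂²) by least squares on a large unlabelled sample and evaluates the Kullback-Leibler sum that Pinsker's inequality requires to be small:

```python
import numpy as np

def fit_gaussian_conditional(X_unlab, Z_unlab):
    """Least squares fit of the working model X | Z = z ~ N(z^T beta, sigma^2).

    Returns (beta_hat, sigma2_hat), the MLEs under the Gaussian model."""
    beta_hat, _, _, _ = np.linalg.lstsq(Z_unlab, X_unlab, rcond=None)
    resid = X_unlab - Z_unlab @ beta_hat
    sigma2_hat = np.mean(resid ** 2)
    return beta_hat, sigma2_hat

def kl_gaussians(mu0, s0, mu1, s1):
    """KL divergence between N(mu0, s0) and N(mu1, s1) (s0, s1 are variances)."""
    return 0.5 * (np.log(s1 / s0) + s0 / s1 + (mu0 - mu1) ** 2 / s1 - 1.0)

rng = np.random.default_rng(0)
k, N, n = 5, 20000, 50                      # illustrative sizes with N >> n k
beta_star, sigma2_star = rng.normal(size=k), 1.0

# Large unlabelled sample used to estimate Q(.|z)
Z_unlab = rng.normal(size=(N, k))
X_unlab = Z_unlab @ beta_star + np.sqrt(sigma2_star) * rng.normal(size=N)
beta_hat, sigma2_hat = fit_gaussian_conditional(X_unlab, Z_unlab)

# Sum of KL divergences over a labelled sample of size n:
# the quantity that must be o_p(1) for the Pinsker bound to be useful
Z_lab = rng.normal(size=(n, k))
kl_sum = sum(kl_gaussians(z @ beta_star, sigma2_star, z @ beta_hat, sigma2_hat)
             for z in Z_lab)
print(kl_sum)   # small when N >> n k
```

With these sizes the sum is far below 1, consistent with the O_p(nk/N) rate.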

Non-parametric setting with binary data
As a second example, suppose that X = {0, 1}, so that estimating Q*(·|z) is equivalent to estimating the regression function p*(z) := P(X = 1|Z = z). Assuming that this probability is bounded away from 0 and 1, and again applying Pinsker's inequality, we see that, under mild conditions,

d_TV{Q*^n(·|Z), Q^n(·|Z)} ≤ O(1) · √[Σ_{i=1}^n {p̂(Z_i) − p*(Z_i)}²],

where p̂(z) is our estimate of p*(z) = P(X = 1|Z = z) based on the unlabelled sample. Since we are working in a non-parametric setting, suppose that we estimate p*(z) via a kernel method, working in a low dimensional space Z = R^k. Then standard non-parametric theory ensures that, for z in a high probability subset of Z, we can achieve error |p̂(z) − p*(z)| = O_p(N^{−a_k}), where the exponent a_k is a small positive value, depending on both the ambient dimension k and the properties of the function z ↦ p*(z) (e.g. smoothness or Lipschitz properties). Therefore, we can expect to have d_TV{Q*^n(·|Z), Q^n(·|Z)} = O_p(n^{1/2} N^{−a_k}), which is vanishing whenever the unlabelled sample size N is sufficiently large relative to the labelled sample size n.
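As an illustration, a kernel estimate of P(X = 1 | Z = z) from unlabelled binary data might look like the following Nadaraya-Watson sketch (the true function p* below and the bandwidth are hypothetical choices, not taken from the paper):

```python
import numpy as np

def nw_prob(z, Z_unlab, X_unlab, h):
    """Nadaraya-Watson estimate of P(X = 1 | Z = z) with a Gaussian kernel."""
    w = np.exp(-np.sum((Z_unlab - z) ** 2, axis=1) / (2 * h ** 2))
    return np.sum(w * X_unlab) / np.sum(w)

rng = np.random.default_rng(1)
k, N = 2, 50000                                    # low dimension, large N

# Hypothetical smooth truth, bounded away from 0 and 1 on the support
def p_star(z):
    return 1.0 / (1.0 + np.exp(-z[..., 0] + 0.5 * z[..., 1]))

Z_unlab = rng.uniform(-1, 1, size=(N, k))
X_unlab = (rng.uniform(size=N) < p_star(Z_unlab)).astype(float)

z0 = np.array([0.2, -0.3])
p_hat = nw_prob(z0, Z_unlab, X_unlab, h=0.2)
print(abs(p_hat - p_star(z0)))   # shrinks as N grows, at some rate N^{-a_k}
```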

Empirical results
We next examine the empirical performance of the CPT and CRT on simulated data, and on real data from the Capital Bikeshare system. The code that was used to analyse the data can be obtained from https://rss.onlinelibrary.wiley.com/hub/journal/14679868/seriesb-datasets.

Simulated data: power and error control
The results of Section 5 show that the CPT is more robust than the CRT to errors in the estimated conditional distribution Q(·|Z), when the worst-case test statistic T(X, Y, Z) is used. Our first aim here is to provide evidence to validate this result, and to show that this extra robustness is exhibited not only by the worst-case test statistic but also by practical and simple choices of T. Our second aim is to examine the power of the CPT and CRT to detect deviations from the null hypothesis.
In all of our simulations we set α = 0.05 as the desired type I error rate and use the marginal absolute correlation T(X, Y, Z) = |corr(X, Y)| as our test statistic. We generate M = 500 copies of X under either the CPT or the CRT. To run the CPT, we use algorithm 2 with S = 50 steps. All results are shown averaged over 1000 trials.
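As a concrete illustration of how the p-value is formed from the M copies (a minimal sketch; here uniform permutations stand in for the CPT or CRT copies, which would be valid only if X were independent of Z):

```python
import numpy as np

def mc_pvalue(T_obs, T_copies):
    """Monte Carlo p-value from M exchangeable copies:
    p = (1 + #{m : T^(m) >= T_obs}) / (1 + M)."""
    M = len(T_copies)
    return (1 + np.sum(np.asarray(T_copies) >= T_obs)) / (1 + M)

def abs_corr(x, y):
    """Test statistic T(X, Y, Z) = |corr(X, Y)| (Z enters only via the copies)."""
    return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(2)
n, M = 50, 500
X, Y = rng.normal(size=n), rng.normal(size=n)      # toy null data

# Stand-in copies: uniform permutations of X (the CPT/CRT would instead
# permute non-uniformly or redraw X from Q(.|Z))
copies = [abs_corr(rng.permutation(X), Y) for _ in range(M)]
p = mc_pvalue(abs_corr(X, Y), copies)
print(p)
```

The "+1" in numerator and denominator makes the test exactly valid at finite M under exchangeability of the observed statistic and its copies.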

Simulations under the null
First we test whether the CPT and CRT show large increases in type I error when the conditional distribution estimate Q(·|Z) is incorrect, in a setting where the null hypothesis H_0: X ⊥⊥ Y | Z holds.
We shall have X, Y ∈ R and Z ∈ R^p for p = 20. We first draw independent parameter vectors a, b ∼ N_p(0, I_p). The variables (X, Y, Z) are then generated as Z ∼ N_p(0, I_p), X|Z ∼ Q*(·|Z) and Y|X, Z ∼ N(a^T Z, 1), where Q*(·|Z) will be specified below. (Note that Y|X, Z depends on Z only, since we are working under the null hypothesis that X ⊥⊥ Y | Z.) Throughout, the estimated conditional distribution of X|Z will be given by Q(·|z) = N(b^T z, 1), but this estimate might not be exactly correct. We shall consider several sources of error in this model.
(a) Non-linear mean: one source of error comes from assuming a linear relationship between variables where this is not so. We choose sample size n = 50 and try three simple examples, taking Q*(·|z) = N{μ(z), 1}; the third of these is the hyperbolic tangent mean, μ(z) = tanh(θ b^T z)/θ. In each case, θ ≥ 0 is the model misspecification parameter. Note that θ = 0 corresponds to the case that Q(·|Z) = Q*(·|Z), i.e. the estimate is indeed correct, whereas larger values of θ correspond to increasing errors.
(b) Coefficients estimated on unlabelled data: even if the form of the model for X|Z is correct, the coefficients b may not be known perfectly. As described earlier, in many practical settings we may have access to ample unlabelled data (X, Z), separate from our labelled data set of points (X, Y, Z) that is used to test the hypothesis of conditional independence. For this setting, we estimate the unknown coefficient vector b by b̂, defined as the least squares estimate using an unlabelled sample (X_i^unlab, Z_i^unlab), i = 1, …, N, generated independently of the data points (X_i, Y_i, Z_i). This experiment is repeated for unlabelled sample sizes N = 50, 100, …, 500. The labelled sample size is given by n = 50 in each case.
(c) Coefficients estimated by reusing the data: finally, in settings where unlabelled data may not be available, we may be tempted to estimate the model of X|Z simply by using our data points (X_i, Y_i, Z_i), i = 1, …, n. This approach is not covered by our theory (since the conditional distribution Q(X|Z) is data dependent in this case), but it is certainly of practical interest to see how the method performs in this setting. We test sample sizes n = 50, 100, …, 500, in each case estimating the unknown true coefficient vector b by b̂, which is now given by the least squares regression of X on Z trained on the same data set, (X_1, Z_1), …, (X_n, Z_n).
Figs 1 and 2 show the results of these experiments when we have a non-linear mean, and when we estimate the coefficients by using unlabelled data or by reusing the data, respectively.
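The null data generation above can be sketched as follows (setting (a) with the hyperbolic tangent mean; a schematic reconstruction, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, theta = 50, 20, 1.0              # theta > 0: misspecified working model

a, b = rng.normal(size=p), rng.normal(size=p)
Z = rng.normal(size=(n, p))

# True model Q*(.|z): non-linear (tanh) mean; as theta -> 0 this recovers
# the linear mean b^T z
mu = np.tanh(theta * (Z @ b)) / theta if theta > 0 else Z @ b
X = mu + rng.normal(size=n)

# The null holds: Y depends on Z only, so X is conditionally independent of Y
Y = Z @ a + rng.normal(size=n)

# The (misspecified) working model used by the CPT/CRT is Q(.|z) = N(b^T z, 1)
mu_hat = Z @ b
```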
As the null hypothesis H_0: X ⊥⊥ Y | Z is true in all these experiments, we would hope for the probability of rejection to be close to the nominal level of α = 0.05, at least when the model misspecification parameter θ is not too large (for the non-linear mean setting) or when the unlabelled sample size N or labelled sample size n is not too small (when the model coefficients are trained on unlabelled data or reused data).
For the non-linear mean experiments, in Fig. 1 we see that in many cases the CPT is significantly more robust than the CRT. The cases θ = 0 confirm that both tests achieve the nominal type I error level α = 0.05 when the assumed distribution Q is correct. As the misspecification parameter θ increases (so that the model Q(·|z) that we use for running the CPT or CRT grows further from the true model Q*(·|z)), we see that both methods suffer an inflation of the type I error, but for the CPT the excess type I error is substantially lower than that of the CRT.
Next, we turn to the setting where the estimated model Q(·|z) is obtained by regressing X on Z using either a separate unlabelled data set, shown in Fig. 2(a), or the same data set reused, shown in Fig. 2(b). The results are encouraging, showing that, when using unlabelled data, the type I error is already very close to the nominal level as soon as the unlabelled sample size N exceeds n. When reusing the data, the method appears to be quite conservative at smaller sample sizes n; the cause of this is an interesting question that we hope to study in future work.

Simulations under the alternative
Our final simulation concerns the power of the tests. Here we generate Z as before, and generate X|Z ∼ N(b^T Z, 1), exactly according to the assumed distribution Q(·|Z), so that both tests have the nominal type I error level α = 0.05. Unlike the null setting, we now generate Y|X, Z ∼ N(a^T Z + cX, 1). The strength of the signal is controlled by the parameter c ≥ 0, where c = 0 corresponds to the null hypothesis being true whereas larger values of c move further away from the null. The results, which are shown in Fig. 3, reveal that the CPT is slightly less powerful than the CRT across a range of values of c but overall shows fairly similar performance. Thus there is only a small price to pay for the additional robustness of the CPT.
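A power experiment of this kind can be sketched as below (CRT version only, since fresh draws from the correct model are simple to code; the parameter values and trial counts are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, c, M, alpha = 50, 20, 0.2, 500, 0.05     # c > 0: alternative holds

a, b = rng.normal(size=p), rng.normal(size=p)

def one_trial():
    Z = rng.normal(size=(n, p))
    X = Z @ b + rng.normal(size=n)              # exactly the assumed Q(.|Z)
    Y = Z @ a + c * X + rng.normal(size=n)      # signal of strength c
    T = abs(np.corrcoef(X, Y)[0, 1])
    # CRT copies: resample X from the (correct) model, keeping (Y, Z) fixed
    copies = [abs(np.corrcoef(Z @ b + rng.normal(size=n), Y)[0, 1])
              for _ in range(M)]
    p_val = (1 + sum(t >= T for t in copies)) / (1 + M)
    return p_val <= alpha

power = np.mean([one_trial() for _ in range(100)])
print(power)
```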

Simulated data: mixing of the conditional permutation test sampler
In practice, we cannot implement the CPT method as defined in expression (4) (unless, of course, the sample size n is so small that we can simply enumerate all n! possible permutations). Instead, in our experiments, we use the exchangeable Markov chain Monte Carlo sampler defined in algorithm 2. All our simulations and real data experiments implement this sampler with S = 50, meaning that the Markov chain is run for 50 steps for each new permuted copy X^(m) of the data. Is this moderate number of steps sufficient to ensure that the chain has mixed well, or are we producing highly correlated copies that will lead to reduced power? To examine this question, we generate one data set, consisting of confounders Z and feature X generated exactly as in Section 6.1.2, and then run the parallel pairwise sampler (algorithm 1) independently for 20 trials (i.e. each time initializing at the same original data). At each iteration, setting X^[s] = X_(Π^[s]) to be our current CPT copy of the original data vector X, we track the log-likelihood, Σ_{i=1}^n log q(X_i^[s] | Z_i), and the correlation with the original data vector, corr(X, X^[s]). (Since X is strongly dependent on Z, it is to be expected that two draws of the data, i.e. X and X^[s], will necessarily have a high correlation.) The trace plots of these two quantities, plotted over s = 0, 1, 2, …, 250 in Fig. 4, demonstrate that, in this simulation, the Markov chain appears to mix quickly, within about 50 or 100 iterations. Of course, this will be affected by factors such as the strength of the dependence between X and Z, and the sample size n.
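For intuition, a parallel pairwise sampler of this flavour can be sketched as follows, assuming the Gaussian working model Q(·|z) = N(b^T z, 1) (a schematic reconstruction, not the authors' implementation): at each sweep the indices are partitioned into random disjoint pairs, and each pair is swapped with the odds that satisfy detailed balance for the CPT's non-uniform distribution over permutations.

```python
import numpy as np

def log_q(x, z, b):
    """Log-density (up to a constant) of the working model Q(.|z) = N(b^T z, 1)."""
    return -0.5 * (x - z @ b) ** 2

def pairwise_sweep(x, Z, b, rng):
    """One sweep: partition {1,...,n} into random disjoint pairs, then swap each
    pair (i, j) with probability proportional to the swapped likelihood,
    q(x_j|z_i) q(x_i|z_j) versus q(x_i|z_i) q(x_j|z_j) (Barker rule)."""
    n = len(x)
    perm = rng.permutation(n)
    x = x.copy()
    for i, j in zip(perm[0::2], perm[1::2]):
        log_odds = (log_q(x[j], Z[i], b) + log_q(x[i], Z[j], b)
                    - log_q(x[i], Z[i], b) - log_q(x[j], Z[j], b))
        if rng.uniform() < 1.0 / (1.0 + np.exp(-log_odds)):
            x[i], x[j] = x[j], x[i]
    return x

rng = np.random.default_rng(4)
n, p, S = 100, 20, 50
b = rng.normal(size=p)
Z = rng.normal(size=(n, p))
X = Z @ b + rng.normal(size=n)

x = X
for _ in range(S):          # S steps per permuted copy, as in the experiments
    x = pairwise_sweep(x, Z, b, rng)
print(np.corrcoef(X, x)[0, 1])   # high, since X depends strongly on Z
```

Each sweep only ever exchanges entries of X, so every copy is a permutation of the originally observed values, which is the defining feature of the CPT.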

Capital Bikeshare data set
We next implement the CPT and CRT on the Capital Bikeshare data set. (The data were obtained from https://www.capitalbikeshare.com/system-data.) Capital Bikeshare is a bike sharing programme in Washington DC, where users may check out a bike from one of its locations and return it at any other location. The data set contains every ride taken, recording the start time and location, end time and location, bike identification number and a user type, which can be 'member' (i.e. purchasing a long-term membership in the system) or 'casual' (i.e. paying for a one-time rental or a short-term pass). We use the following data: (a) test data set, all rides taken on weekdays (Monday-Friday) in October 2011, with sample size n = 7346 rides after an initial screening step (details are given below).
In our experiments, we are interested in determining whether the duration X of the ride is dependent on various factors Y, such as user type (member or casual). Of course, the duration of the ride will be heavily dependent on the length of the route, in addition to other factors, and so to control for this we let Z encode both the route, i.e. the start and end locations, and the time of day at the start of the ride, since varying traffic might also affect the speed of the ride.
To implement the CPT and CRT, we shall use a conditional normal distribution, i.e. (X|Z = z) ∼ N{μ(z), σ²(z)}, as an estimate Q(·|z) of Q*(·|z). Before running the CPT or CRT, as an initial screening step we discard any test points for which we do not have a good estimate of the conditional distribution of X, keeping only those test data points where we have ample training data for rides taken along the same route and at similar times of day. The details for fitting Q(·|Z), and for this initial screening step, are given in Appendix B. For both the CPT and the CRT, we sample M = 1000 copies of X to produce the p-value. For the CPT, the Monte Carlo sampler given in algorithm 2 is run with S = 50 steps for producing each copy.
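To make the CRT side of the comparison concrete, the following sketch draws the M resampled copies from the fitted conditional normal model; the per-ride means and variances below are entirely hypothetical stand-ins (the real fitting procedure is described in Appendix B):

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 200, 1000

# Hypothetical fitted conditional model for ride duration: mu[i] and sig2[i]
# would be estimated from training rides on the same route at a similar time
# of day; here they are simulated only for illustration.
mu = rng.uniform(5, 30, size=n)          # minutes
sig2 = rng.uniform(1, 4, size=n)
X = rng.normal(mu, np.sqrt(sig2))        # stand-in for the observed durations

# CRT copies: draw M fresh X-vectors from the fitted conditional distribution,
# independently across rides (the CPT would instead permute X non-uniformly)
crt_copies = rng.normal(mu, np.sqrt(sig2), size=(M, n))
```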

Results
We test the null hypothesis H_0: X ⊥⊥ Y | Z for several choices of the response Y.
(a) User type (member or casual): we might expect that casual users, who are likely to be tourists or infrequent bike riders, may ride at a slower speed.
(b) Date, treated as continuous: since the test data set is taken from the single month of October 2011, the date within this month is a continuous variable that acts as a proxy for factors such as weather and the times of sunrise and sunset.
(c) Day of the week (Monday-Friday), treated as categorical: bike riders' behaviour may differ on different days of the week, for instance if rides on Friday are more likely to be leisure rides than on the other days of the week.
For user type and date, the statistic T(X, Y, Z) that we use is the correlation between the vector Y and the vector of ride duration residuals after controlling for the effects of Z, in other words the vector with entries R_i = X_i − E_{X∼Q(·|Z_i)}[X]. For day of the week, our statistic T(X, Y, Z) is given by

max_{y ∈ {Mon, …, Fri}} {correlation between (R_1, …, R_n) and (1{Y_1 = y}, …, 1{Y_n = y})}.

Table 3 shows the resulting p-values for each choice of the variable Y. We can see that the CPT and CRT produce nearly identical p-values in all three cases. We conclude that the user type and duration of ride are dependent, even after controlling for our various confounding variables; in contrast, there is insufficient evidence to reach the same conclusion for the corresponding tests for the date and the day of the week.
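Under the conditional normal model, E_{X∼Q(·|Z_i)}[X] = μ(Z_i), so the two statistics can be written as follows (a sketch with hypothetical data; only the statistic definitions come from the text above):

```python
import numpy as np

def residual_corr_stat(X, Y, mu):
    """T(X, Y, Z) for user type or date: correlation of Y with the residuals
    R_i = X_i - E_{X ~ Q(.|Z_i)}[X] = X_i - mu(Z_i)."""
    R = X - mu
    return np.corrcoef(R, Y)[0, 1]

def max_day_stat(X, days, mu):
    """T(X, Y, Z) for day of week: max over y of the correlation between
    (R_1, ..., R_n) and the indicators (1{Y_1 = y}, ..., 1{Y_n = y})."""
    R = X - mu
    return max(np.corrcoef(R, (days == y).astype(float))[0, 1]
               for y in np.unique(days))

rng = np.random.default_rng(7)
n = 200
mu = rng.uniform(5, 30, size=n)                        # hypothetical mu(Z_i)
X = mu + rng.normal(size=n)                            # hypothetical durations
user_type = rng.integers(0, 2, size=n).astype(float)   # 0 = member, 1 = casual
days = rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri"], size=n)

t1 = residual_corr_stat(X, user_type, mu)
t2 = max_day_stat(X, days, mu)
```

Either statistic would then be compared against its M permuted (CPT) or resampled (CRT) copies to produce the p-value.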

Discussion
In this work, we have developed a CPT that modifies the standard permutation test of independence between X and Y to account for a known dependence of X on potentially relevant confounding variables Z. Our theoretical results prove finite sample type I error control, even when the distribution of X|Z is not known exactly.
We have shown that, empirically, resampling from the set of observed X-values preserves better type I error control under mild errors in our model, and does not lose much power, in settings where we use intuitive statistics such as correlation between Y and X. In contrast, our theoretical understanding of type I error control covers the worst-case scenario over all possible statistics, and it may be that the simple statistics that are used in practical analyses suffer much less inflation of the type I error. We hope to bridge this gap in future work, and also to provide some theoretical insight into the power of the CPT method, as well as to study the efficiency of the Monte Carlo sampler for the CPT and to examine whether proposing swaps non-uniformly may improve the speed at which we can obtain copies X .m/ that are not too correlated with each other.
Furthermore, although in many applications we have access to plenty of unlabelled data, there will certainly be some domains where this is not so and it may not be possible to estimate the conditional distribution of X|Z independently of the data. If only a small labelled data set (X, Y, Z) is available, with no additional unlabelled data (X, Z) with which to estimate this distribution, we would not want to split the data set, using one half for fitting Q(X|Z) and the remaining half to run the CPT, since this would incur both a substantial loss in the type I error control (under the theoretical results of Section 5.1) and a loss of power when the sample size is limited. It is therefore important to consider how the CPT (and the CRT) can retain their validity when the data are used for estimating Q(X|Z) and then reused for testing H_0: X ⊥⊥ Y | Z. It is possible that tools from the selective inference literature may enable us to develop theory towards addressing this question.
Finally, both the CPT and the CRT are based on a setting where it is assumed that modelling X|Z is easy whereas modelling Y|X, Z is difficult, i.e. our estimate Q(·|Z) of the conditional distribution of X|Z is assumed to be highly accurate, but testing H_0: X ⊥⊥ Y | Z is a substantial challenge. In contrast, many of the asymptotic tests that were described in Section 2.3 treat the X- and Y-variables symmetrically when testing X ⊥⊥ Y | Z. Are there settings in which we can construct methods offering finite sample guarantees in the style of the CPT and CRT while taking a more symmetric approach to this testing problem?
Acknowledgements
We thank the reviewers for helpful and constructive feedback on an earlier draft.

Appendix A: Proofs
A.1. Proving validity of the sampling mechanisms
A.1.1. Proof of theorem 2
The proof of theorem 2 consists of simply checking the detailed balance equations for the Markov chain that is defined by the algorithm.
Let P be the set of all partitions of {1, …, n} into n/2 disjoint pairs. For any p ∈ P and any permutations π and π′, we write π ∼_p π′ if π can be transformed into π′ by swapping any subset of the pairs in the partition p. For example, if (i, j) and (k, l) are two of the disjoint pairs in the partition p, and π and π′ are related via π′ = π ∘ σ_ij ∘ σ_kl, then π ∼_p π′ (recall that σ_ij is the permutation that swaps i and j). We note that '∼_p' defines an equivalence relation on the set of permutations.
We now compute the transition probability matrix of the Markov chain that is defined by algorithm 1. For ease of notation, for the remainder of this proof, we shall condition on X_(), Y and Z implicitly. In particular, all probabilities P(·) or P(·|·) should be interpreted as P(·|X_(), Y, Z) or P(·|·, X_(), Y, Z).

A.2. Proving robust type I error control
A.2.1. Proof of theorem 4
First we prove the result for the CRT. Let X̃ be an additional copy drawn also from Q(·|Z), independently of Y and of X, X^(1), …, X^(M). Then, since conditionally on Y and Z the copies X, X̃, X^(1), …, X^(M) are independent, we have

P{(X, X^(1), …, X^(M)) ∈ A_α | Y, Z} ≤ P{(X̃, X^(1), …, X^(M)) ∈ A_α | Y, Z} + d_TV{Q*^n(·|Z), Q^n(·|Z)}.

Finally, since X̃, X^(1), …, X^(M) are clearly IID after conditioning on Y and Z, and are therefore exchangeable, by definition of A_α we must have P{(X̃, X^(1), …, X^(M)) ∈ A_α | Y, Z} ≤ α, proving the desired bound for the CRT.
Next we turn to the CPT, for which the analysis is more complicated since the X^(m)'s depend on the observed values in the vector X. Let X̃ again be drawn from Q(·|Z), independently of Y and of X, and let X̃^(1), …, X̃^(M) be CPT copies generated from X̃, where X̃_() and X̃_(π) are defined analogously to X_() and X_(π) from Section 3. As in the CRT case, we shall use the fact that

P{(X, X^(1), …, X^(M)) ∈ A_α | Y, Z} ≤ P{(X̃, X̃^(1), …, X̃^(M)) ∈ A_α | Y, Z} + d_TV{Q*^n(·|Z), Q^n(·|Z)}.

Next, by comparing with the CPT sampling mechanism (6), we observe that the X̃^(m)'s, conditionally on X̃, are generated with the same mechanism as the X^(m)'s conditionally on X. In other words, for any x ∈ X^n, we have

(X̃^(1), …, X̃^(M)) | {X̃ = x, Y, Z} =_d (X^(1), …, X^(M)) | {X = x, Y, Z}.

We can verify that the same equality in distribution holds if we instead use the exchangeable sampler (algorithm 2) with some choice S ≥ 1 of the number of steps. In either case, then, applying result (9), and since X̃, X̃^(1), …, X̃^(M) are exchangeable after conditioning on Y and Z, we see that P{(X̃, X̃^(1), …, X̃^(M)) ∈ A_α | Y, Z} ≤ α, proving the desired bound for the CPT (with permutations drawn either IID as in expression (6), or from the exchangeable sampler given in algorithm 2).