Multiple influential point detection in high dimensional regression spaces

Influence diagnosis is an integral component of data analysis but has been severely underinvestigated in the high dimensional regression setting. One of the key challenges, even in the fixed dimensional setting, is how to deal with multiple influential points that give rise to masking and swamping effects. This paper proposes a novel group deletion procedure, referred to as multiple influential point detection, that studies two extreme statistics based on a marginal-correlation-based influence measure. Named the min- and max-statistics, they have complementary properties: the max-statistic is effective for overcoming the masking effect, whereas the min-statistic is useful for overcoming the swamping effect. Combining their strengths, we further propose an efficient algorithm that can detect influential points with a prespecified false discovery rate. The proposed influential point detection procedure is simple to implement, efficient to run and enjoys attractive theoretical properties. Its effectiveness is verified empirically via extensive simulation studies and data analysis. An R package implementing the procedure is freely available.


Introduction
Recent decades have witnessed an explosion of high dimensional data in applied fields including biology, engineering, finance and many other areas. Given a data set {(X_i, Y_i), i = 1, …, n}, where Y_i ∈ ℝ is the response and X_i ∈ ℝ^p is the covariate vector for the ith observation, the main interest is often to conduct a regression analysis relating Y to X, the simplest model for which takes the linear form Y_i = X_i^T β + ε_i.
A usual assumption in linear regression is that the observations are all generated from the same model. In many applications, however, the data that are collected often contain contaminated or noisy observations due to a plethora of reasons. Those observations exerting great influence on statistical analysis, thus named influential points, can seriously distort all aspects of data analysis such as altering the estimate of the regression coefficient and swaying the outcome of statistical inference (Draper and Smith, 2014). Thus, when influential points are present, fitting the model based on a clean data assumption leads to at best a very crude approximation to the model and at worst a completely wrong solution. For fixed dimensional models, we refer the reader to Cook (1977), Belsley et al. (1980), Chatterjee and Hadi (1986), Imon (2005), Zhu et al. (2007, 2012) and Nurunnabi et al. (2014), among many others. For high dimensional models, Zhao et al. (2013) found that influential observations could negatively impact many methods that have recently been developed for dealing with high dimensionality, such as the lasso for variable selection (Tibshirani, 1996) and sure independence screening for variable screening (Fan and Lv, 2008).
As a result, influence diagnosis has long been recognized as a central problem in statistical analysis. An entire line of research has been devoted to devising robust methods that are less prone to influential observations; see, for example, the books on robust regression by Maronna et al. (2006), Huber and Ronchetti (2009) and Rousseeuw and Hubert (2011), as well as the papers by Wang et al. (2007) and Fan et al. (2014) on variable selection with heavy-tailed noise and She and Owen (2011) on an outlier robust method for the mean shift model. However, identifying influential points can often be of major scientific interest in its own right. For multivariate high dimensional data containing only the X_i's, Aggarwal and Yu (2001) proposed to use projection, Filzmoser et al. (2008) and Shieh and Hung (2009) applied principal component analysis, and Ro et al. (2015) used a robust covariance matrix estimator for defining distance.
When p is fixed, an attractive approach is to quantify an individual observation's influence in changing the ordinary least squares (OLS) estimator and resulting quantities; see, notably, Cook's distance (Cook, 1977), Studentized residuals (Velleman and Welsch, 1981), DFFITS (Welsch and Kuh, 1977; Belsley et al., 1980) and Welsch's distance (Welsch, 1982). Since these measures are all based on OLS estimation, they are not applicable to high dimensional data. The problem of influence diagnosis in high dimensional regression has received little attention, mainly due to the difficulty of establishing a coherent theoretical framework, even in a fixed dimension setting, and a lack of easily implementable procedures. To overcome this, Zhao et al. (2013) proposed that influence can be measured by examining how an individual observation affects the marginal correlation between the response and the predictors, which is a ubiquitous quantity in almost all aspects of regression analysis. They proposed a high dimensional influence measure named HIM to flag as influential those points that have a great influence on the calculated value of the marginal correlations, in a sense to be defined later. An attractive feature of HIM is that its asymptotic properties can be rigorously established, enabling the use of a multiple-testing procedure for detecting influential points.
However, similarly to many fixed dimensional measures, HIM is based on the idea of leave-one-out observation: to quantify the influence of an observation, one compares a predefined measure evaluated on the whole data set with the same measure evaluated on the subset of the data that leaves out the observation under investigation. Because of this, HIM is useful for detecting the presence of a single influential point. In practice, however, multiple influential observations are commonly encountered, and it is not appropriate to apply a test for a single influential point sequentially to detect multiple points. Moreover, detecting multiple influential observations is much more challenging, because of the notorious 'masking' and 'swamping' effects (Hadi and Simonoff, 1993). Specifically, masking occurs when an influential point is not detected as influential, whereas swamping occurs when a non-influential point is classified as influential. In the language of multiple testing, masking is the problem of obtaining false negative results and swamping is the problem of obtaining false positive results. To handle these effects, group-deletion-based methods have been proposed for fixed dimensional problems (Rousseeuw and van Zomeren, 1990; Hadi and Simonoff, 1993; Imon, 2005; Pan et al., 2000; Nurunnabi et al., 2014; Roberts et al., 2015), but the problem remains open in high dimensions.
The main aim of this paper is to propose a new procedure for detecting multiple influential points in high dimensional data based on HIM. Via random group deletion, we propose a novel procedure named MIP, short for multiple influential point detection, for high dimensional data. Along the way, we propose two novel quantities, named the max- and min-statistics, to assess the extremeness of each point when the data are subsampled. Our theoretical studies show that these two statistics have complementary properties. The min-statistic is useful for overcoming the swamping effect but less effective for masked influential observations, whereas the max-statistic is well suited for detecting masked influential observations but is less effective in handling the swamping effect. Combining their advantages, we propose a computationally simple algorithm for obtaining a clean subset of the data that, with high probability, contains no points that greatly influence the marginal correlation. This clean set of data then serves as the benchmark for assessing the influence of other observations, which permits us to control the false discovery rate (FDR) of influential points by using, for example, the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). Remarkably, the theoretical properties of the max- and min-statistics can be studied and are rigorously established in this paper. We must point out that, even for fixed dimensional problems, there is a general lack of principled procedures for declaring significance for any defined influence measure. In contrast, our proposed MIP procedure is the first theoretically justified method for the more challenging high dimensional regression setting.
Before we proceed, we highlight the usefulness of the max- and min-statistics via an analysis of the microarray data in Section 5. Fig. 1 plots the logarithms of the p-values that are associated with the max-statistic in Fig. 1(a) and the min-statistic in Fig. 1(b) for each observation. With a prespecified FDR of 0.05, using the min-statistic, we identify a set of seven influential observations, represented as the full circles in Figs 1(a) and 1(b). It is interesting that the MIP procedure, combining the strengths of the two statistics, identifies the same set of seven influential points. In contrast, using the max-statistic, four additional observations, represented as triangles in Fig. 1(a), are declared influential. These findings are consistent with our theory that the max-statistic tends to identify more influential observations, making it more suitable for overcoming the masking effect, but may suffer from the swamping effect. Moreover, the fact that the min-statistic gives the same set of influential points as MIP in Fig. 1(b) implies that there may not be any masking effect in these data. Further analysis in Section 5 shows that the reduced data, which are obtained by removing the influential observations that are identified by MIP, result in a sparser model with a better fit when the lasso is applied.
The rest of the paper is organized as follows. In Section 2, we review the high dimensional influence measure in Zhao et al. (2013). In Section 3, on the basis of the idea of random group deletion, or leave-many-out observations, we propose the max- and min-statistics for assessing extremeness and establish their theoretical properties. We show in theorem 1 that, surprisingly, when there is no influential point, these two statistics both follow a χ²(1) distribution. When there are influential points, theorem 2 and theorem 3 show that, for a non-influential point, its max- and min-statistics still follow a χ²(1) distribution. Furthermore, in the presence of influential points, theorems 2 and 3 demonstrate that, under suitable conditions, the max- and min-statistics can identify the influential points with large probability. We then argue that these two statistics are complementary in detecting influential observations and we develop an algorithm to combine their strengths. Simulation results are presented in Section 4 and data analysis is provided in Section 5. In Section 6, we provide further discussion. All the proofs are given in the on-line supplementary materials. Here is the notation that is used throughout the paper. For any set A, we write |A| and N_A interchangeably for its cardinality. Let S_inf and S_inf^c be the sets of the influential and non-influential observations respectively, such that S_inf ∪ S_inf^c = {1, …, n}. We denote by |S_inf| = n_inf the size of the influential point set and by |S_inf^c| = n − n_inf the number of non-influential points. Denote by ‖v‖ the l₂-norm of a vector v ∈ ℝ^m. For any matrix A = (a_ij) ∈ ℝ^{m×n}, ‖A‖ denotes its spectral norm. Finally, let ‖A‖_max = max_{i,j} |a_ij|, and use C to denote a generic constant that may change depending on the context.

High dimensional influence measure, masking and swamping
Because HIM (Zhao et al., 2013) is the influence measure that is used for our influence diagnosis, we first give a brief review. Assume that the non-influential observations are independent and identically distributed from the model, and let μ_x = (μ_x1, …, μ_xp)^T = E(X_i) and σ_xj = var(X_ij)^{1/2}, 1 ≤ j ≤ p. HIM defines the influence of a point by measuring its contribution to the average marginal correlation between the response and the predictors. Specifically, define ρ = (ρ_1, …, ρ_p)^T where ρ_j = corr(X_ij, Y_i) is the marginal correlation between the jth variable and the response. From the data, we can obtain a sample estimate as ρ̂_j = Σ_{i=1}^n (X_ij − μ̂_xj)(Y_i − μ̂_y)/(n σ̂_xj σ̂_y), for j = 1, …, p, where μ̂_xj, μ̂_y, σ̂_xj and σ̂_y are the sample estimates of μ_xj, μ_y, σ_xj and σ_y respectively. The sample marginal correlation with the kth observation removed is similarly defined as ρ̂_j^{(k)} for 1 ≤ k ≤ n. HIM then measures the influence of the kth observation by comparing the sample correlations with and without this observation, defined formally as D_k = p^{−1} Σ_{j=1}^p (ρ̂_j − ρ̂_j^{(k)})². Intuitively, the larger D_k is, the more influential the corresponding observation is. When there is no influential point and min{n, p} → ∞, under mild conditions, it is proved that n² D_k →_d χ²(1), where χ²(1) is the χ²-distribution with 1 degree of freedom. Based on this, we can formulate the problem of influential point detection as a multiple-hypothesis-testing problem where one tests n hypotheses, one for each observation, stating that the observation under investigation is non-influential. Subsequently, the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) for multiple testing can be used to control the false discovery rate.
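As a concrete illustration, the leave-one-out measure just reviewed can be sketched in a few lines. This is a minimal sketch rather than the paper's R implementation: the function name `him_influence` and the use of `numpy.corrcoef` for the marginal correlations are our choices, and the n² scaling follows the stated χ²(1) limit.

```python
import numpy as np

def him_influence(X, Y):
    """Leave-one-out influence on the marginal correlations:
    D_k = p^{-1} * sum_j (rho_j - rho_j^{(k)})^2,
    where rho_j^{(k)} is the sample correlation of variable j with Y
    computed without observation k."""
    n, p = X.shape
    # full-sample marginal correlations rho_j
    rho = np.array([np.corrcoef(X[:, j], Y)[0, 1] for j in range(p)])
    D = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k
        # leave-one-out marginal correlations rho_j^{(k)}
        rho_k = np.array([np.corrcoef(X[keep, j], Y[keep])[0, 1]
                          for j in range(p)])
        D[k] = np.mean((rho - rho_k) ** 2)
    return D
```

With the n² scaling applied, approximate p-values follow from the χ²(1) survival function, which feeds directly into the multiple-testing formulation described above.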
Since HIM is based on the leave-one-out idea, the derived χ²(1) distribution is invalid whenever there are one or more influential points; i.e., for a non-influential point, the presence of even a single influential point can distort the null distribution of its HIM value according to the definition above. Similarly, the presence of more than one influential point can distort the HIM value of an influential point as well. This is a manifestation of a more general difficulty of multiple-influential-point detection, where the masking and swamping effects greatly hinder the usefulness of any leave-one-out procedure. To appreciate how the masking and swamping effects negatively impact the performance of HIM, we preview examples 1 and 2 in Section 4. The data are generated such that there is a strong masking effect in example 1 and a strong swamping effect in example 2. The magnitude of these effects depends on a parameter denoted as μ. Fig. 2 presents a comparison of HIM in Zhao et al. (2013) and the MIP procedure proposed in this paper for detecting influence, when the nominal level that is used for declaring a point influential in the Benjamini-Hochberg procedure is set at α = 0.05. From Fig. 2(a), we see that the true positive rates (TPRs) of HIM are much lower than those of MIP, i.e. HIM identifies far fewer influential points as influential and thus suffers severely from the masking effect. Meanwhile, the false positive rates (FPRs) of HIM are also much larger than the nominal level α = 0.05, especially when μ becomes large, i.e. HIM identifies many more non-influential points as influential, meaning that HIM also suffers from the swamping effect. From Fig. 2(b), we see that HIM suffers greatly from the swamping effect, as the FPRs can be very close to 1 for large μ. In contrast, for both examples, the FPRs of the MIP procedure are controlled well below the nominal level whereas its TPRs are monotone functions of μ and eventually become 1 for large μ.

A random group deletion procedure
As discussed earlier, any measure based on the leave-one-out approach may be ineffective when there are multiple influential observations, owing to the masking and swamping effects. Since the number of influential observations is generally unknown in practice, it is natural to employ a notion of leave-many-out, or group deletion, which has been used for fixed dimensional problems in identifying multiple influential points (Lawrence, 1995; Imon, 2005; Nurunnabi, 2011; Nurunnabi et al., 2014; Roberts et al., 2015), where deletion is often made according to the magnitude of (Studentized) residuals or similar criteria, with a good estimate of β necessary. For our random group deletion procedure, we proceed similarly but choose the subsets uniformly at random with replacement. Thus, the marginal correlations based on these subsets can be seen as perturbations of the marginal correlations based on the whole sample. It turns out that their extremeness can be summarized by two extremal statistics whose theoretical properties can be studied, as we do below. Existing group deletion procedures are not employed in this way and do not usually give theoretically tractable results. Since the emphasis of this paper is on influence diagnosis, throughout the paper we assume that the non-influential observations are generated from model (2.1) whereas the influential observations come from a different model, the properties of which will be specified later.
Write Z_k = (X_k, Y_k), 1 ≤ k ≤ n, for the kth data point. For any fixed k, to check whether Z_k is influential or not, we draw with replacement some subsets A_1, …, A_m ⊂ {1, …, n}\{k} uniformly at random, i.e. these subsets do not include Z_k. The choice of m will be discussed in Section 4. Write |A_r| = n_sub − 1, where n_sub = k_sub n + 1 for some k_sub ∈ (0, 1). We make the following assumption on n_inf and k_sub.
Assumption 1 allows n_inf to grow with n. Without loss of generality, from now on we take the conservative choice k_sub = 1/2, as non-influential points are expected to outnumber the influential points. For 1 ≤ r ≤ m, let B_r be the subset of non-influential observations in A_r and denote its size by N_{B_r} = |B_r|. Under assumption 1, we have min_{1≤r≤m} N_{B_r} > δ_1 n, i.e., for any subset A_r, the number of non-influential observations does not vanish.
For 1 ≤ r ≤ m, let A_r^{(+k)} = A_r ∪ {k}, which is of size n_sub. For Z_k, we compute its influence measure with respect to the rth random subset A_r as D_{r,k} = p^{−1} Σ_{j=1}^p (ρ̂_j^{A_r} − ρ̂_j^{A_r^{(+k)}})², where ρ̂^{A_r} and ρ̂^{A_r^{(+k)}} denote the estimates of ρ based on the observations in A_r and A_r^{(+k)} respectively. We are now ready to define the following two extreme statistics: T_{min,k,m} = min_{1≤r≤m} n_sub² D_{r,k} and T_{max,k,m} = max_{1≤r≤m} n_sub² D_{r,k}. We name them the min- and max-statistic respectively as they measure the extremeness of the influence measures based on randomly sampled data. Note that T_{min,k,m} and T_{max,k,m} are functions of Z^n = {Z_k, 1 ≤ k ≤ n} and A^m = {A_r, 1 ≤ r ≤ m}. The dependence on Z^n and A^m is summarized by using a subscript m to simplify the notation. These two statistics are invariant to the rotation of the covariates and to the scale translation of the response. Let Ŷ_t = (Y_t − μ̂_y)/σ̂_y and X̂_t = D̂_x^{−1}(X_t − μ̂_x), 1 ≤ t ≤ n, with D̂_x being the estimate of D_x = diag(σ_x1, …, σ_xp), a diagonal matrix in ℝ^{p×p}. To establish the asymptotics of T_{min,k,m} and T_{max,k,m}, we first characterize a key quantity J_B defined through these standardized observations. By definition, J_B is the square of the l₂-norm associated with the non-influential observations and is therefore unknown. Denote by Ẋ_t = D_x^{−1}(X_t − μ_x) the population version of X̂_t, and note that Ẏ_t is the population version of Ŷ_t. Without loss of generality, we assume in model (2.1) that μ_y = μ_x = 0 and σ_y = σ_xj = 1, 1 ≤ j ≤ p. Moreover, we make the following assumptions.
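To fix ideas, the random group deletion step and the two extreme statistics can be sketched as follows. This is a hedged sketch under two assumptions: that D_{r,k} takes the same p^{−1} Σ_j(·)² form as the HIM measure, and that the n_sub² scaling mirrors the n² scaling of the full-sample statistic; the function names are ours.

```python
import numpy as np

def marginal_corr(X, Y):
    # column-wise marginal correlations between the columns of X and Y
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    return (Xc * Yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((Yc ** 2).sum()))

def min_max_stats(X, Y, k, m=100, rng=None):
    """Sketch of T_min and T_max for observation k via m random subsets
    drawn (with replacement of subsets) from {1,...,n}\\{k}."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    n_sub = n // 2 + 1                    # k_sub = 1/2, |A_r^{(+k)}| = n_sub
    others = np.delete(np.arange(n), k)   # candidate indices excluding k
    D = np.empty(m)
    for r in range(m):
        A = rng.choice(others, size=n_sub - 1, replace=False)
        Ak = np.append(A, k)              # A_r^{(+k)} = A_r plus observation k
        rho_A = marginal_corr(X[A], Y[A])
        rho_Ak = marginal_corr(X[Ak], Y[Ak])
        D[r] = np.mean((rho_A - rho_Ak) ** 2)
    T = n_sub ** 2 * D                    # assumed scaling, mirroring n^2 D_k
    return T.min(), T.max()
```

By construction T_min ≤ T_max, and both reduce to perturbed versions of the full-sample leave-one-out measure when the subsets are large.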
Assumption 2. For 1 ≤ j ≤ p, ρ_j is constant and does not change as p increases.
Assumption 3. For the covariance matrix of the covariates Σ = cov(X_i) with eigendecomposition Σ = Σ_{j=1}^p λ_j u_j u_j^T, we assume that l_p = Σ_{j=1}^p λ_j² = O(p^r) for some 0 ≤ r < 2.
Assumption 4. The predictor X_i follows a multivariate normal distribution and the random noise ε_i ∈ ℝ follows a normal distribution with mean 0 and an unknown variance.
Assumption 5. Let (Q_y, R_y) = ((μ̂_y − μ_y)/σ_y, σ̂_y/σ_y − 1), S_{Q_y} = lim sup_{n→∞} E(n^{1/2} Q_y)^8 and S_{R_y} = lim sup_{n→∞} E(n^{1/2} R_y)^8. Assume that S_{Q_y} and S_{R_y} are finite. Furthermore, there are constants 0 < K, C < ∞, independent of n and p, such that, for any t > 0, P{n^{1/2}|μ̂_xj − μ_xj| > t} ≤ C exp(−Kt²) and P{n^{1/2}|σ̂_xj/σ_xj − 1| > t} ≤ C exp(−Kt) uniformly in 1 ≤ j ≤ p.

Assumptions 2-4 on the non-influential observations were also made in Zhao et al. (2013). Since it is assumed that σ_xj = 1, 1 ≤ j ≤ p, we have tr(Σ) = p and consequently it holds that l_p ≤ p² by the Cauchy-Schwarz inequality. When l_p = p², Σ is a degenerate matrix with rank 1, and assumption 3 rules out this case. In contrast, assumption 3 applies when the largest eigenvalue of Σ is bounded. Assumption 5 is similar to but stronger than condition (C.4) of Zhao et al. (2013), where only eighth moments of n^{1/2}(μ̂_xj − μ_xj) and n^{1/2}(σ̂_xj/σ_xj − 1) are required. In assumption 5, n^{1/2}(μ̂_xj − μ_xj) is assumed to have sub-Gaussian tails and the n^{1/2}(σ̂_xj/σ_xj − 1)s have subexponential tails. This assumption is satisfied by the sample mean and the sample variance under the normality of the (X_i, Y_i)s. As alternatives to the sample estimates, robust estimates of μ_x, μ_y, σ_xj and σ_y can also be used in practice. For example, we can estimate μ_xj and μ_y by the sample median and σ_xj by the median absolute deviation estimator. These estimates satisfy assumption 5 by the normality of the (X_i, Y_i)s. These robust estimates are the quantities that are used in our numerical examples. We now quantify the magnitude of J_{max,n}, the maximum effect of the non-influential points, which is independent of m and A^m and is a key quantity for establishing the asymptotic properties of the min- and max-statistics.
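The robust alternatives just mentioned (sample median for location, median absolute deviation for scale) can be sketched as a standardization step. The 1.4826 consistency factor for normal data is our added convention, not stated in the text.

```python
import numpy as np

def robust_standardize(X):
    """Median/MAD standardization of each column of X.

    The factor 1.4826 makes the median absolute deviation a consistent
    estimator of the standard deviation under normality, which is the
    setting of assumptions 4 and 5.
    """
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    return (X - med) / mad
```

The same two estimators can replace the sample mean and sample standard deviation wherever μ̂_xj, μ̂_y, σ̂_xj and σ̂_y appear in the marginal-correlation estimates.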
Lemma 1. Assume that the non-influential observations, generated from model (2.1), satisfy assumptions 2-5. Assume further that B ≠ ∅ and that ξ_{n,p} = n^{−1/2} log(p) log(n) log(np) → 0. Then J_{max,n} = o_p(1).

The conclusion is derived by bounding the maximum of p^{−1}‖Ŷ_t X̂_t‖² over t. Obviously, ξ_{n,p} → 0 if n^{−1/4+ε_0} log(p) → 0 for some sufficiently small ε_0 > 0. Under assumption 1, it is obvious that N_{B_r} ≥ nδ_1 and consequently B_r ∈ B for any 1 ≤ r ≤ ∞. Then lemma 1 implies immediately that, under assumption 1, max_{1≤r≤∞} J_{B_r} = o_p(1). Replacing (X̂_t, Ŷ_t) by (Ẋ_t, Ẏ_t) in J_B, it is shown in the proof that the corresponding statistic is no more than O_p(n^{−1} + p^{−1} l_p^{1/2}). On the basis of lemma 1, we make the following claim.
Theorem 1. Suppose that all observations are non-influential. Under assumption 1 and the assumptions of lemma 1, for any 1 ≤ k ≤ n, T_{min,k,m} →_d χ²(1) and T_{max,k,m} →_d χ²(1) uniformly over m and A^m, which is denoted as uniformly over (m, A^m) for brevity, as min{n, p} → ∞.
Theorem 1 seems surprising at first glance, since we always have T_{min,k,m} ≤ T_{max,k,m}. An explanation is in order. As we show below, D_{r,k} can be decomposed into two parts. The first part, depending on the quantity E_k to be defined soon, represents the effect of the observation Z_k, and the second part is controlled by J_{max,n}. Since J_{max,n} = o_p(1) by lemma 1, the asymptotic distributions of T_{min,k,m} and T_{max,k,m} are mainly determined by E_k. Thanks to the blessing of dimensionality, we can show that E_k asymptotically has a χ²(1) distribution. From theorem 1, when T_{max,k,m} or T_{min,k,m} is larger than χ²_{1−α}(1), the 100(1−α)% quantile of the χ²(1) distribution, for some prespecified α such as 0.05, we declare that there are outliers.
Recall that B_r collects the indices of the non-influential observations in A_r. Let O_r = A_r \ B_r be its complement in A_r. For each 1 ≤ r ≤ m, it is obvious that O_r ⊆ S_inf \ {k}, the latter being equal to S_inf if k ∈ S_inf^c. Since |A_r| = k_sub n, similarly to the proof of theorem 1, we can decompose D_{r,k} in terms of W_{inf,k,O_r} = Σ_{t∈O_r} Ŷ_t X̂_t/(n k_sub) and W_{non,k,B_r} = Σ_{t∈B_r} Ŷ_t X̂_t/(n k_sub), which are associated with the influential and non-influential observations respectively. Under assumption 1 and the conditions in lemma 1, we see from lemma 1 that the contribution of W_{non,k,B_r} is asymptotically negligible. Based on expressions (3.1) and (3.2), it is shown in lemma 1.3 of the on-line supplementary materials that, uniformly over (m, k, A^m), D_{r,k} is driven by the quantity E_k, which represents the effect of the kth observation Z_k. Let Z_n^inf = {Z_k, k ∈ S_inf} be the influential observations and O^m = {O_r, 1 ≤ r ≤ m} be the indices of the influential observations in the subsets. Define F_{max,k}(Z_n^inf, O^m) and F_{min,k}(Z_n^inf, O^m), which quantify the maximum and minimum joint effects of the influential observations respectively. Thus the asymptotic behaviour of T_{max,k,m} and T_{min,k,m} mainly depends on the magnitudes of these two quantities, of which F_{max,k}(Z_n^inf, O^m) is an increasing function of m. In lemma 1.1 in the on-line supplementary materials, we show that, uniformly over all (k, m, O^m), F_{max,k}(Z_n^inf, O^m) is bounded by a quantity F̃_{max,k}(Z_n^inf). Note that F̃_{max,k}(Z_n^inf) is independent of m and O^m. Deriving the upper bound of F_{min,k}(Z_n^inf, O^m) is rather involved and we present the result in lemma 1.2 of the on-line supplementary materials. We remark that the magnitude of F_{min,k}(Z_n^inf, O^m) depends on the distribution of the influential observations, and it is generally difficult to quantify explicitly how large m should be. This is illustrated by two cases that are considered in lemma 1.2 of the supplementary materials. In the first case, where the effect of the influential observations is small, it holds that F_{min,k}(Z_n^inf, O^m) →_p 0 uniformly over m, which means that m can be fixed. For the second case, where m is sufficiently large (e.g.
m v_inf^{n_inf} log(n)^{−1} → ∞ with v_inf defined in lemma 1.2 in the supplementary materials), we have F_{min,k}(Z_n^inf, O^m) →_p 0 uniformly over Z_n^inf, which means that F_{min,k}(Z_n^inf, O^m) can be made small as long as m is sufficiently large, regardless of the distribution of the influential observations. Simulation results show that a small or moderate m is usually enough to obtain satisfactory empirical results. If m must be chosen judiciously, we have also proposed a numerical criterion in Section 4, which works well in simulations. Below we state the properties of T_{max,k,m} and T_{min,k,m} separately, in theorem 2 in Section 3.1 and theorem 3 in Section 3.2.
Max-statistic T_{max,k,m} for the kth point

In theorem 1, we derived the null distribution of T_{max,k,m} and T_{min,k,m} when there is no influential point. We now study T_{max,k,m} when there are influential observations and develop the corresponding detection procedure. We have the following results.
Theorem 2. Under assumption 1 and the assumptions of lemma 1, when there are influential observations, the following two conclusions hold.
The max-unmask condition becomes weaker as m increases, since F_{min,k}(Z_n^inf, O^m) decreases with respect to m. Statement (b) of theorem 2 can be specified further by combining the properties of F_{min,k}(Z_n^inf, O^m). For example, by expression (2) of lemma 1.2 in the on-line supplementary materials, if δ_{inf,n} = o(1) and d_{S_inf} = O_p(1), then F_{min,k}(Z_n^inf, O^m) →_p 0 uniformly over (k, m, O^m), as min{n, p} → ∞. In this case, the max-unmask condition can be updated as P{E_k > χ²_{1−α}(1) + ε_0} → 1, as min{n, p} → ∞. Under the condition in result (a), for any non-influential observation Z_k, the asymptotic distributions of T_{min,k,m} and T_{max,k,m} are the same as those in theorem 1, i.e. the distribution of the min- and max-statistics of a non-influential observation is not affected by the presence of influential observations. As such, a non-influential point can be identified as non-influential with high probability, i.e. the swamping effect can be overcome under the condition in result (a). Since F̃_{max,k}(Z_n^inf) ≤ R²_{inf,n} d_{S_inf} by expression (3.3), a sufficient condition for F̃_{max,k}(Z_n^inf) →_p 0 is that R²_{inf,n} d_{S_inf} →_p 0, which holds if d_{S_inf} = O_p(1) and δ_{inf,n} → 0. This condition might be violated, however, if δ_{inf,n} does not vanish or some influential observations have large values of E_t. This implies that deleting points with large values of E_t is helpful to alleviate the swamping effect.
For an influential observation Z_k, the max-unmask condition in result (b) gives the requirement on its signal strength for it to be identified as influential. Since F_{min,k}(Z_n^inf, O^m) decreases with respect to m, this condition becomes weaker and easier to satisfy, and Z_k becomes easier to detect. This provides an opportunity to identify the influential observations that are masked by others, as long as we can make F_{min,k}(Z_n^inf, O^m) sufficiently small. In fact, as shown in lemma 1.2 in the on-line supplementary materials, F_{min,k}(Z_n^inf, O^m) can be very small if m is sufficiently large. Therefore, T_{max,k,m} has an advantage in overcoming the masking effect if m is large.
We do not assume explicitly any model for the influential observations, which makes the proposed method quite flexible. When a specific model is assumed for the influential observations, the conditions in theorem 2 have more explicit expressions. We illustrate this point by investigating the mean shift model, in which the response of an influential observation is shifted as Y_i + c_i with c_i ≠ 0, while the (X_i, Y_i)s are independent and identically distributed non-influential observations from model (2.1); i.e. the influential observations are generated by contaminating the response of non-influential observations. To simplify the argument, we assume that μ_xj = μ_y = 0 and σ_xj = 1, 1 ≤ j ≤ p, and consider the mean and variance of E_i. Specifically, by noting that Y_i ∼ N(0, 1), we have max_{i∈S_inf} |Y_i| = O_p{log(n_inf)^{1/2}}. Then, by expression (3.3), the condition in result (a) of theorem 2 becomes (n_inf/n)² max_{i∈S_inf} c_i² → 0. Since F_{min,k}(Z_n^inf, O^m) →_p 0 as m → ∞, by lemma 1.2 in the on-line supplementary materials, the max-unmask condition can be relaxed to min_{i∈S_inf} |c_i| > χ²_{1−α}(1)^{1/2} + ε_0 for sufficiently large m, which holds trivially when |c_i| ≥ log(n_inf)^{1/2}, i ∈ S_inf.
We now formally formulate a multiple-testing problem to test the influentialness of individual observations with n null hypotheses H_0k: Z_k is non-influential, 1 ≤ k ≤ n. By result (b) of theorem 2 and the above discussion, we can estimate the set of the influential observations as Ŝ_max = {k : p_{max,k,m} ≤ q_k}, where p_{max,k,m} = P{χ²(1) > T_{max,k,m}} is the p-value under H_0k and the q_k's are determined by the specific procedure that is used to control the error rate. Here the q_k's can be independent of k, if we aim to control the familywise error rate by the Bonferroni test. Alternatively, the q_k's can depend on k, if we want to control the FDR at level α_0. For example, for the procedure in Benjamini and Hochberg (1995), q_k can be taken as the largest p_{max,(k)} such that p_{max,(k)} ≤ kα_0/n, where p_{max,(1)} ≤ p_{max,(2)} ≤ ⋯ ≤ p_{max,(n)} are the ordered p_{max,k,m}'s. We now state the theory of using the Benjamini-Hochberg procedure and will use it later for numerical illustration, although other procedures developed for controlling the FDR can also be used.
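The Benjamini-Hochberg step just described can be sketched as follows. We use the identity P{χ²(1) > t} = erfc{(t/2)^{1/2}} to obtain p-values from the max-statistics, and the function name `bh_select` is ours.

```python
import math
import numpy as np

def bh_select(T, alpha0=0.05):
    """Benjamini-Hochberg selection at FDR level alpha0.

    T holds the statistics T_{max,k,m}; under H_0k they are approximately
    chi^2(1), so p_k = P{chi^2(1) > T_k} = erfc(sqrt(T_k / 2)).
    Returns the indices declared influential (the estimated set S_max).
    """
    T = np.asarray(T, dtype=float)
    n = len(T)
    pvals = np.array([math.erfc(math.sqrt(t / 2.0)) for t in T])
    order = np.argsort(pvals)
    # BH rule: largest k with p_(k) <= k * alpha0 / n
    below = pvals[order] <= alpha0 * np.arange(1, n + 1) / n
    if not below.any():
        return np.array([], dtype=int)
    cutoff = int(np.max(np.nonzero(below)[0]))
    return np.sort(order[:cutoff + 1])
```

The largest k with p_{max,(k)} ≤ kα₀/n sets the cut-off; all observations with p-values at or below it are declared influential.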
Proposition 1. Suppose that the Benjamini-Hochberg procedure is used to control the FDR at level α_0. Assume that assumption 1 and the conditions in lemma 1 hold. Suppose that the max-unmask condition in result (b) of theorem 2 holds simultaneously for all k ∈ S_inf with α < δ_{inf,n} α_0.
As discussed in remark 1 in the on-line supplementary materials, when n and m become large, we have max_k F_{min,k}(Z_n^inf, O^m) ≤ a_0 with probability tending to 1 for some small a_0. Proposition 1 shows that all the influential points will be identified as influential with high probability, i.e. the true positive rate (TPR) is well controlled. In addition, if R²_{inf,n} d_{S_inf} →_p 0, by expression (3.3) and result (a) of theorem 2, there will be no swamping effect and the statistic T_{max,k,m} under H_0k follows a χ²(1) distribution. Let FPR(Ŝ_max) = |Ŝ_max ∩ S_inf^c|/|S_inf^c| be the estimated FPR. When the Benjamini-Hochberg procedure is applied and there is no swamping effect, FPR(Ŝ_max) will be controlled. However, the condition R²_{inf,n} d_{S_inf} →_p 0 may fail if δ_{inf,n} does not converge to 0. In this case, the FPR may be out of control.
To summarize, the detection procedure based on the max-statistic T_{max,k,m} is effective in overcoming the masking effect, but it is somewhat aggressive in that the FPR may not be controlled well without strong conditions. However, we point out that the procedure based on T_{max,k,m} is computationally efficient, compared with that based on T_{min,k,m} below.

Min-statistic T_{min,k,m} for the kth point
We have argued that the statistic T_{min,k,m} is effective in alleviating the swamping effect. We formally state this in the following theorem. Recall that T_{min,k,m} is a function of (Z^n, A^m); it therefore makes sense to investigate the behaviour of T_{min,k,m} given A^m.
Theorem 3. Under assumption 1 and the assumptions of lemma 1, when there are influential observations, the following two conclusions hold.
The min-unmask condition can be strengthened as P{Ĩ_max,k(Z_n^inf)} → 1 as min{n, p} → ∞, and conclusion (b) of theorem 3 can be strengthened as P{min_{1≤m≤∞} T_min,k,m > χ²_{1−α}(1)} → 1 as min{n, p} → ∞. When the influential observations are generated from the mean shift model (3.4), similarly to the discussion after theorem 2, the min-unmask condition holds with probability tending to 1 for any 1 ≤ m ≤ ∞ if |c_i| ≥ log(n_inf)^{1/2} for i ∈ S_inf. Note that m plays a role in the conditions of both theorem 3 and theorem 2. Compared with result (a) of theorem 2, where F̂_max,k(Z_n^inf) →_p 0 is required, condition (a) of theorem 3 is much weaker. As shown in lemma 1.2 in the on-line supplementary materials and the discussion just before Section 3.1, F_min,k(Z_n^inf, O_m) →_p 0 when m is sufficiently large. Therefore, the statistic T_min,k,m is less sensitive to the swamping effect. In contrast, F_max,k(Z_n^inf, O_m) is involved in the min-unmask condition (b), which is much stronger than the max-unmask condition (b) of theorem 2; i.e. an influential observation Z_k will not be identified as influential unless its signal is very strong. Thus, the min-statistic is efficient in preventing the swamping effect but may be conservative in identifying influential points. Combining this with the result in Section 3.1 that the max-statistic T_max,k,m is effective in overcoming the masking effect but is aggressive, we conclude that the max-statistic T_max,k,m and the min-statistic T_min,k,m are complementary to each other.
If the min-unmask condition holds for all k ∈ S_inf simultaneously, then the Z_k with k ∈ S_inf will be detected correctly when a suitable error control procedure is used. For example, similarly to proposition 1, with α = δ_inf,n α_0, one can show that the Benjamini-Hochberg procedure correctly detects the influential observations. However, the min-unmask condition is very strong and may not be satisfied for all k ∈ S_inf simultaneously. We provide a sufficient condition for it to hold. Without loss of generality, assume that S_inf = {1, ..., n_inf} and write E_(1) ≥ E_(2) ≥ ... ≥ E_(n_inf) for the E_i, 1 ≤ i ≤ n_inf, ranked in decreasing order. Recall that R_inf,n = δ_inf,n / k_sub.
The condition in proposition 2 is strong: when δ_inf,n > 0 and E_(1) is large, it requires that E_(n_inf) is not too small, which may easily be violated. A remedy is to remove sequentially the influential observations that have been detected so far and then apply the detection procedure recursively on the remaining data, as we explain below.
To simplify the description, we introduce some notation. For any subset U ⊆ {1, ..., n} with cardinality n_U = |U| and any observation Z_k with k ∈ U, we draw at random with replacement subsets A_1,U, ..., A_m,U ⊂ U \ {k}, each with the same cardinality n_sub,U, where n_sub,U < n_U. Similarly to T_min,k,m, we define T_min,k,m(U) = min_{1≤r≤m} n²_sub,U D_r,k,U, where D_r,k,U = p^{-1} ||ρ̂^{A_r,U^(+k)} − ρ̂^{A_r,U}||², Z_U = {Z_i : i ∈ U} and A_m,U = (A_1,U, ..., A_m,U).
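To illustrate these definitions, the following is a Python sketch of computing the min-statistic from random subsets, taking U to be the full index set (the paper's implementation is an R package; `pearson`, `marginal_corrs` and `t_min` are our own illustrative names, and this is a sketch rather than the paper's exact code):

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def marginal_corrs(rows, ys):
    """rho-hat: marginal correlation between the response and each covariate."""
    p = len(rows[0])
    return [pearson([r[j] for r in rows], ys) for j in range(p)]

def t_min(X, y, k, m, n_sub, rng):
    """Min-statistic sketch for observation k: the minimum, over m random
    subsets A_r of the other observations (each of size n_sub), of
    n_sub^2 * D_{r,k}, where D_{r,k} is the average squared change in the
    marginal correlations when observation k is added to the subset."""
    p = len(X[0])
    others = [i for i in range(len(X)) if i != k]
    t = float("inf")
    for _ in range(m):
        A = rng.sample(others, n_sub)
        rho_A = marginal_corrs([X[i] for i in A], [y[i] for i in A])
        Ak = A + [k]
        rho_Ak = marginal_corrs([X[i] for i in Ak], [y[i] for i in Ak])
        d = sum((a - b) ** 2 for a, b in zip(rho_Ak, rho_A)) / p
        t = min(t, n_sub ** 2 * d)
    return t
```

On toy data with one gross outlier appended, t_min for the outlier is typically much larger than for a clean observation, because adding the outlier perturbs the marginal correlations of every subset.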
Denote by B_r,U the indices of the non-influential observations in A_r,U and let O_r,U = A_r,U \ B_r,U, 1 ≤ r ≤ m. Let k_sub,U be such that n_sub,U = n_U k_sub,U + 1. Then, similarly to T_min,k,m, F_min,k(Z_n^inf, O_m) and F_max,k(Z_n^inf, O_m), we define their analogues on U, which, when U = {1, ..., n}, are exactly the same as T_min,k,m, F_min,k(Z_n^inf, O_m) and F_max,k(Z_n^inf, O_m) respectively. Generally, suppose that the E_(i) can be separated into several groups in successive order, i.e. G_j = {E_(m_{j−1}+1), ..., E_(m_j)}, j = 1, ..., τ, such that 0 = m_0 < m_1 < ... < m_τ = n_inf. Denote I_j = {(m_{j−1}+1), ..., (m_j)}, 1 ≤ j ≤ τ. Let M_0 = S_inf, M_j = M_{j−1} \ I_j and U_j = M_{j−1} ∪ S_inf^c, 1 ≤ j ≤ τ. For simplicity, we assume that the n_sub,U_j do not depend on j, still denoted n_sub, and that the sufficient condition in proposition 2 holds for each group G_j, i.e. the inequality E_(m_{j−1}+1) ≥ … + χ²_{1−α}(1)^{1/2} + …, 1 ≤ j ≤ τ, (3.5) holds simultaneously with probability tending to 1 as min{n, p} → ∞, which is referred to as the group-min-unmask condition for simplicity. Then, similarly to the argument for proposition 2, the min-unmask condition holds simultaneously for any Z_k, k ∈ I_j, on the data set Z_{U_j} with probability tending to 1 for any 1 ≤ m ≤ ∞. Consequently, T_min,k,m(U_j) with k ∈ I_j will exceed χ²_{1−α}(1) with high probability. If the influential observations in I_1, ..., I_{j−1} are detected correctly and removed sequentially, the influential observations in group I_j can be detected successfully with high probability. We remark that the group-min-unmask condition is much weaker than the condition in proposition 2.
This motivates us to consider the following multiround procedure. Define the set of influential observations identified in the jth round as Ŝ_min^j = {k ∈ Û_j : p_min,k,m < q_k}, where q_k depends on the specific procedure used, similarly to the discussion in Section 3.1, Û_j = Û_{j−1} \ Ŝ_min^{j−1}, with Û_0 = {1, ..., n} and Ŝ_min^0 = ∅. Finally, we can estimate S_inf by Ŝ_τ = ∪_{j=1}^τ Ŝ_min^j for some τ. Let FPR(Ŝ_τ) be the FPR associated with the estimate Ŝ_τ. Recall condition (a) of theorem 3, where m = m(n) is such that F_min,k(Z_n^inf, O_m) →_p 0 as min{n, p} → ∞ for k ∈ S_inf^c. This condition ensures that the min-statistic for any Z_k with k ∈ S_inf^c is stochastically smaller than χ²(1). When a multiround procedure is used, we need this condition to hold in each round. Specifically, we assume that m = m(n) can be any series such that F_min,k(Z_n^inf, Õ_m) →_p 0 (3.6) as min{n, p} → ∞, where Õ_m is defined similarly to O_m. By expression (ii) of lemma 1.2 in the on-line supplementary materials, result (3.6) holds when m is sufficiently large, regardless of the distribution of Z_n^inf. Then the FPR of the multiround procedure can be controlled as follows.
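The control flow of the multiround procedure can be sketched as follows (a Python sketch with our own names; `detect` stands in for one round of min-statistic testing with error control, which the paper carries out via the Benjamini-Hochberg procedure):

```python
def multiround_min(n, detect, max_rounds=10):
    """Multiround sketch: run a one-round detector on the remaining
    observations, remove what it flags, and repeat until nothing new is
    flagged or max_rounds is reached.  `detect(active)` is a hypothetical
    callable returning the subset of `active` flagged as influential."""
    active = set(range(n))
    flagged = set()
    for _ in range(max_rounds):
        new = detect(set(active)) & active
        if not new:
            break
        flagged |= new   # S_hat_tau accumulates the per-round sets
        active -= new    # U_hat_j shrinks each round
    return flagged

# Toy detector: points 0 and 1 mask point 2, which becomes detectable
# only after they are removed -- the situation the multiround loop targets.
def toy_detect(active):
    if {0, 1} <= active:
        return {0, 1}
    if 2 in active:
        return {2}
    return set()

print(multiround_min(10, toy_detect))  # → {0, 1, 2}
```

The toy detector mimics the sequential unmasking that proposition 2 alone cannot deliver: group I_2 = {2} is only detectable once group I_1 = {0, 1} has been removed.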
Proposition 3. Suppose that the FDR is controlled at level α_0 in each round and that result (3.6) holds. Under assumption 1 and the conditions of lemma 1, for any fixed τ such that |Ŝ_τ^c| > c_0 n for some c_0 > δ_inf,n/k_sub + δ_min, where δ_min > 0 is sufficiently small, it holds that E{FPR(Ŝ_τ)} ≤ α_0 as min{n, p} → ∞.
Although the above iterative procedure can improve the performance of the min-statistic in overcoming the masking effect, requiring only the weaker group-min-unmask condition (3.5), its computation becomes more costly as the number of rounds τ grows, and a larger τ may also increase the FPR. If an early stopping strategy is adopted instead, the procedure may still suffer from the masking effect.
As a quick summary, the test statistic T_max,k,m is more efficient in dealing with the masking effect, because the strength of the influential observations required by T_max,k,m in result (b) of theorem 2 is much weaker than the group-min-unmask condition (3.5) required by T_min,k,m when m is large. Moreover, any procedure based on T_max,k,m is computationally efficient, identifying the influential observations in just one round. However, T_max,k,m may suffer from the swamping effect if the strong condition (a) of theorem 2 is violated. In contrast, the estimate Ŝ_τ based on the statistic T_min,k,m can maintain a good FPR at the expense of more intensive computation. Taking advantage of both statistics, we propose the following computationally efficient min-max-checking algorithm for identifying, with high probability, a clean set that contains no influential points and can serve as the benchmark for assessing the influence of other points.

Min-max-checking algorithm
We now present the min-max-checking algorithm, which combines the strengths of the max- and min-statistics.
Step 1 (min-max-step): let S_total^(0) = {1, ..., n} and fix k_sub = 1/2. Repeat steps 1.1 and 1.2 until stopping, to estimate a clean set.
Step 2 (checking step): denote the estimated clean subset by S_c. Check, for each k ∈ S = {1, ..., n} \ S_c, whether the kth observation is influential.
In the min-max-step, the algorithm identifies, with high probability, a clean data set containing no influential points, with cardinality at least n/2, by successively removing potential influential points. Here α_k is specified by the procedure that controls the error rate and can be determined in the same way as q_k in Section 3.1. The main rationale is, as argued above, that the max-statistic T_max,k,m is aggressive in declaring points influential whereas the min-statistic T_min,k,m is conservative. We first run a min-step to eliminate influential observations of strong strength, alleviating the swamping effect. Combined with the efficiency of T_max,k,m in overcoming the masking effect, a large clean set is very likely to be obtained in one iteration. If the clean set is not sufficiently large, we run the min-step again to remove further influential observations of strong strength.
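A one-iteration min-max-step can be sketched structurally as follows (Python, with our own names; `p_min` and `p_max` are hypothetical callables returning the min-/max-statistic p-values on a given index set, and error control uses the Benjamini-Hochberg rule as in Section 3.1):

```python
def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up rule on a dict {index: p-value}."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(items)
    k_star = 0
    for rank, (_, p) in enumerate(items, start=1):
        if p <= rank * alpha / n:
            k_star = rank
    return {k for k, _ in items[:k_star]}

def min_max_step(n, p_min, p_max, alpha0):
    """One-iteration sketch of the min-max-step: a min-step first removes
    strongly influential points (alleviating swamping); a max-step on the
    reduced data then removes the remaining ones (overcoming masking)."""
    idx = list(range(n))
    s_min = bh_reject(p_min(idx), alpha0)         # min-step on the full data
    reduced = [k for k in idx if k not in s_min]
    s_max = bh_reject(p_max(reduced), alpha0)     # max-step on reduced data
    flagged = s_min | s_max
    clean = [k for k in idx if k not in flagged]  # estimated clean set
    return flagged, clean
```

In the real procedure, the p-values are computed from the tail of the χ²(1) distribution at the observed min- and max-statistics; here they are abstracted away so that only the control flow is shown.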
In the checking step, we denote by S_c the final clean set obtained by the min-max-step. Its complement, written S = {1, ..., n} \ S_c, is an estimate of the set containing all potential influential observations. However, S may still contain non-influential observations, since the procedure for obtaining a clean set aims only to find a subset of the non-influential points. A further step is to check whether any point in S is truly influential, if necessary. This step is easy because we now have a clean data set. We outline the exact procedure. For any Z_i, i ∈ S, consider the data with indices in S_c and S_c^(i) = S_c ∪ {i}. We then compute the statistic D_i as in Section 2, where ρ̂ and ρ̂^(i) are computed on the data sets indexed by S_c and S_c^(i) respectively. Since S_c is a good estimate of the clean data, this leave-one-out approach may still be effective for testing the multiple null hypotheses H_0i: Z_i is non-influential, i ∈ S. Observations whose corresponding hypotheses are rejected are labelled influential.
Numerically, when the Benjamini-Hochberg procedure is used to control the error rate, we find that the min-statistic is not only robust to the swamping effect but also powerful, eliminating most influential observations of moderate or large effects in one go. For example, Fig. S.2 of the on-line supplementary materials shows that the TPR of the min-statistic is quite insensitive to m and only slightly worse than that of the max-statistic, whereas its FPR is much smaller. As a result, the swamping effect in the following max-step is very weak, and this max-step can remove further influential observations of weak signal strength. Thus, in most of the cases that we have considered, one iteration is all that is needed for the min-max-step. The resulting estimated clean set S_c contains most of the non-influential observations, i.e. the difference between S_c and the true clean set S_inf^c is small. Since the non-influential observations are averaged in the HIM statistic, when this difference is small, the critical value determined by χ²(1) still gives reasonable results in the checking step. Our numerical study in Section 4 shows that, indeed, the resulting procedure can still control the FDR at the desired level.
In the numerical study, we find that the min-max-step with just one iteration already leads to good results, and consequently the estimates are often identical with and without the checking step. In what follows we provide a high level theoretical analysis of the algorithm when the min-max-step is iterated once, while leaving the details to the on-line supplementary materials.
Without loss of generality, assume that S_inf = {1, ..., n_inf} and write E_(1) ≥ E_(2) ≥ ... ≥ E_(n_inf) for the E_i, 1 ≤ i ≤ n_inf, ranked in decreasing order. Suppose that the influential observations are separated into two groups, G_st = {(k) : 1 ≤ k ≤ m_1} and G_wk = {(k) : m_1+1 ≤ k ≤ n_inf}, where n_inf − m_1 < (k²_sub − δ_1)n for some 0 < δ_1 < k²_sub; these stand for the indices of the observations with large and small values of E_i respectively.
Under conditions similar to those in Section 3.2, the min-statistic can identify G_st successfully; under conditions similar to those in Section 3.3, the subsequent max-step, applied to the reduced data, can identify the remaining influential observations. The required conditions are presented as conditions (D1) and (D2) in the on-line supplementary materials.
Denote by Ŝ_min and Ŝ_max^ng the indices of the observations labelled influential by the min-step and the subsequent max-step respectively. Specifically, Ŝ_min = {k : p_min,k < q_k, 1 ≤ k ≤ n}, where p_min,k = P{χ²(1) > T_min,k,m} is the p-value computed on the full data, and Ŝ_max^ng = {k : p_max,k < q_k, k ∈ Ŝ_min^c}, where p_max,k = P{χ²(1) > T_max,k,m(Ŝ_min^c)} is the p-value computed from the reduced data {Z_k : k ∈ Ŝ_min^c} and T_max,k,m(Ŝ_min^c) denotes the max-statistic applied to the reduced data. Write Ŝ_mm = Ŝ_min ∪ Ŝ_max^ng for the set of influential observations estimated by the min- and max-steps together.
Theorem 4. Consider the min-max-step run for one iteration with the Benjamini-Hochberg procedure used to control the FDR at level α_0. Assume that assumptions 1-5 and conditions (D1) and (D2) in the on-line supplementary materials hold, where m = m(n) is any series satisfying conditions (D1) and (D2). Then P(Ŝ_mm ⊇ S_inf) → 1 and E{FPR(Ŝ_mm)} ≤ α_0 as min{n, p} → ∞.
The benefit of combining the min- and max-steps can be seen by comparing conditions (D1) and (D2) with those required by the max- or min-step alone. For the min-step alone to detect all influential observations successfully, we would need the min-unmask condition in proposition 2 to hold.

Simulation
We conduct extensive simulation to assess the performance of MIP. Towards this, we generate n observations from the linear model (2.1). Influential points are created by replacing the first 10 observations with points generated differently, so that the resulting data set contains 10 influential points. In particular, we consider three examples. The first is constructed to have a strong masking effect, whereas the second is generated to have a strong swamping effect. The magnitude of these effects is determined by a parameter μ, and the true β in these two examples is sparse. For each of these two examples, we examine different combinations of n and p. We also consider a third example taken from Maronna (2011), in which β is random and non-sparse. For brevity, we summarize the main findings of the simulation study while leaving the details to the on-line supplementary materials.
First, we set (n, p) = (100, 1000) in examples 1 and 2 to assess the performance of MIP. In particular, we evaluate TPR_inf, the TPR for influential observation detection; FPR_inf, the FPR for detection; ERR = ||β̂ − β||, measuring the accuracy of the estimate of β after the influential points declared by MIP have been removed; TPR_vs, the TPR for estimating the support of β; and FPR_vs, the FPR for estimating the support of β. With some abuse of notation, the lasso estimate obtained after the influential points declared by MIP have been removed is also abbreviated as MIP. For comparison, we provide the corresponding quantities when HIM is used for outlier detection and when the lasso is fitted to the full data. The results are summarized in Table S.1 in the on-line supplementary materials and Fig. 2.
We then compare MIP with a few more competitors on the three examples. This is done by generating data sets for examples 1 and 2 with p equal to n/2, n or 2n and n equal to 100, 300 or 500, to assess how effective MIP is for different combinations of sample size and dimensionality. For influential point detection, we compare MIP with Cook's distance, DFFITS and the iterative procedure for outlier detection (IPOD) of She and Owen (2011) when p < n, and with HIM when p > n. For parameter estimation and identification of the support of β in examples 1 and 2, the performance is compared with penalized least absolute deviation (LAD) regression (Wang et al., 2007) and a robust estimate named MM-Lasso (Smucler and Yohai, 2017). For example 3, we compare MIP with LAD and a robust estimator called S-Ridge (Maronna, 2011), as in this case the true β is non-sparse. The results are summarized in section 2 of the on-line supplementary materials.

Tuning parameters of the algorithm
Before presenting the results, we briefly discuss how to choose k_sub, the relative size of the random subsets, and m, the number of random subsets, when implementing the algorithm.
For k_sub, we recommend specifying it as an upper bound on the proportion of influential points in the data set. A reasonable choice is 1/2, as we expect that, for any influential point identification method to work, the number of non-influential points should exceed the number of influential points. Additional simulation using k_sub = 1/3 or k_sub = 2/3 suggests that the results are quite insensitive to the choice of k_sub (see Fig. S.1 in the on-line supplementary materials).
Intuitively, the number of random subsets m has opposite effects on T_max and T_min. For T_max, larger m leads to higher TPR and FPR, whereas, for T_min, larger m produces lower TPR and FPR. The min-max-checking algorithm of MIP combines their advantages, giving higher TPR and better control of the FPR as m increases. This is confirmed numerically in Fig. S.2 of the supplementary materials, where seven values of m (m = 1, 5, 10, 50, 100, 500, 1000) are investigated for their influence on using T_max, T_min and MIP for influence diagnosis. MIP is quite insensitive to the choice of m, especially when m ≥ 50.
In practice, however, it may still be useful to have a data-driven procedure for specifying m, and we present one here. Our starting point is that, as m increases, the estimated set of influential observations becomes stable, and so does the sum of |log(p-value)| over all rejected hypotheses in the min-max- and checking steps of the algorithm. We can plot this sum, denoted g(m), against m and identify a point M_0 such that g(m) becomes flat after M_0; in other words, M_0 is the elbow or knee point of the function g(m). There are many algorithms for finding M_0, and we have implemented a method suggested in Satopaa et al. (2011). The effect of this data-driven procedure for choosing m is illustrated in Fig. S.3 in the supplementary materials. Since we have found that the identified M_0 performs similarly to using m = 100 or m = 1000, all the simulations below use m = 100.
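As a rough stand-in for the knee-finding step, the sketch below locates the point of a curve farthest from the chord joining its endpoints; this is a simple heuristic, not the Satopaa et al. (2011) algorithm that the paper actually uses:

```python
import math

def knee_point(xs, ys):
    """Return the x at which (x, y) lies farthest (in perpendicular
    distance) from the straight line joining the first and last points;
    a crude elbow/knee estimate for a flattening curve g(m)."""
    (x0, y0), (x1, y1) = (xs[0], ys[0]), (xs[-1], ys[-1])
    norm = math.hypot(x1 - x0, y1 - y0)
    best_x, best_d = xs[0], -1.0
    for x, y in zip(xs, ys):
        # distance from (x, y) to the chord through (x0, y0) and (x1, y1)
        d = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if d > best_d:
            best_x, best_d = x, d
    return best_x

# g(m) flattening after m = 50: the knee is picked out at 50
ms = [1, 5, 10, 50, 100, 500, 1000]
g = [2.0, 6.0, 8.0, 9.5, 9.8, 9.9, 10.0]
print(knee_point(ms, g))  # → 50
```

In practice, an elbow on the raw m-scale is sensitive to how the trial values of m are spaced, so one may prefer to search on a log scale of m.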

Summary of the simulation study
In terms of identifying influential points, across all simulation settings in the three examples, MIP controls the FDR at the nominal level and often has more power than HIM, Cook's distance and DFFITS in the settings where their FPRs are controlled. In all the cases considered, MIP also performs uniformly better than IPOD. In terms of parameter estimation, MIP outperforms MM-Lasso and penalized LAD in examples 1 and 2, and S-Ridge and LAD in example 3, often by a large margin. In terms of identifying the support of β in examples 1 and 2, MIP again almost always outperforms its competitors, especially in achieving the smallest FPR for variable selection.
We now briefly discuss the time and space complexity of implementing MIP. To compute the min- or the max-statistic, we need to sample m subsets of the data. For each observation and each subset, we compute HIM, which has computational complexity O(np). Thus, the total computational complexity is of the order O(n²pm), scaling linearly with m; see Fig. S.10 in the on-line supplementary materials for some simulation results. The computational time can be substantially reduced because the computation of the min- and max-statistics can be parallelized across multiple processors, one for each of the m subsets. The space complexity is of the order O(n + p + m).

Real data analysis
In this section, we apply MIP to a microarray data set in which p is large and compare it with HIM. We also apply MIP to two small data sets in which p is small and compare it with Cook's distance and DFFITS. We remark that our theory may not apply to the two small data sets, as their dimensionality can be regarded as fixed. Nevertheless, it is interesting to see how MIP performs in this classical set-up.

High dimensional data
As an illustration, we apply MIP to detect influential points in the microarray data from Chiang et al. (2006), previously analysed by Zhao et al. (2013). For this data set, we focus on 120 12-week-old male offspring selected for tissue harvesting from the eyes and for microarray analysis. The data set contains 31042 different probe sets. Following Huang et al. (2006), we take the probe of gene TRIM32 as the response. This gene is interesting because it was found to cause Bardet-Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina (Chiang et al., 2006). One question of interest in this data analysis is to find genes whose expression is correlated with that of TRIM32. Following Huang et al. (2006), we exclude probes that were not expressed in the eye or that lacked sufficient variation, and we select the p = 1500 genes most correlated with the probe of TRIM32. Therefore, the analysis has p = 1500 predictors and sample size n = 120. Before further analysis, all the probes are standardized to have mean 0 and standard deviation 1. Applying the lasso to the full data by using the default setting of the glmnet function in R (Friedman et al., 2010), we identify 15 significant variables, and the l_2-norm of the estimated coefficient vector equals 0.097.
Applying HIM and MIP to these data with the FDR level α = 0.05, HIM finds 15 influential observations whereas MIP finds seven. Interestingly, the set of influential points found by MIP is a subset of that found by HIM. In Fig. 3, we plot the influential observations found by MIP as full circles and the extra influential observations found by HIM as crossed circles, where the y-axis denotes the logarithm of the p-values obtained by using HIM (Fig. 3(a)) or MIP (Fig. 3(b)). To make the plots comparable, the checking step of the min-max-checking algorithm is applied to all observations so that we obtain a p-value for each observation. From Fig. 3(b), we can see that the crossed circles identified by HIM as influential do not seem to have very small p-values.
To make a further comparison, we apply OLS estimation to the important variables found via the lasso, after applying either HIM or MIP, on the non-influential point set identified by HIM. We compare their Bayesian information criterion (BIC) scores, defined as BIC = n log(RSS/n) + k log(n), where RSS is the residual sum of squares, n = 105 is the sample size after removing the 15 influential points identified by HIM and k is the number of variables used. A model with a smaller BIC is preferred. Note that k = 9 if HIM is used and k = 6 if MIP is applied. Because of the set-up, this comparison favours HIM in some sense. We find BIC = −567.34 if HIM is applied for influential point detection and BIC = −578.94 if MIP is applied. Thus, MIP is potentially more effective than HIM at finding a better model, as its BIC value is smaller.
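The BIC comparison uses the standard formula stated above; as a minimal sketch (the function name is our own):

```python
import math

def bic(rss, n, k):
    """BIC = n * log(RSS / n) + k * log(n); smaller is better."""
    return n * math.log(rss / n) + k * math.log(n)
```

For a fixed residual sum of squares, a model with fewer variables (smaller k) attains a smaller, i.e. better, BIC; a model must reduce the RSS enough to justify each extra variable it uses.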
For real data, of course, it is not known which observations are truly influential. To assess the performance of HIM and MIP further, we artificially add influential points to the data set and evaluate whether the methods can find them. Specifically, we first remove the influential points detected by each method and add 10 new observations to the remaining data. This scheme gives a total of 115 observations for assessing HIM and 123 observations for MIP. The 10 added influential observations are generated as X_iS = 1.1 x_S + Z_S, X_iS^c = x_S^c, Y_i = 1.1 y + ε, 1 ≤ i ≤ 10, where Z ∼ N(0, 0.01 I_p), S is a random subset of {1, ..., p} consisting of 10 distinct indices, Z_S is the subvector of Z with indices in S, (x, y) is chosen randomly from the non-influential point set identified by HIM and ε ∼ N(0, 0.01) is independent of Z.
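The contamination scheme above can be sketched in Python (the function name is our own; the paper's experiments are carried out in R):

```python
import random

def make_influential(x, y, S, rng, sd=0.1):
    """Perturb a clean observation (x, y) into an artificial influential
    point following the scheme above: scale the coordinates indexed by S
    (and the response) by 1.1 and add Gaussian noise with standard
    deviation sd = 0.1, i.e. variance 0.01; the remaining coordinates
    are copied unchanged."""
    x_new = list(x)
    for j in S:
        x_new[j] = 1.1 * x[j] + rng.gauss(0.0, sd)
    y_new = 1.1 * y + rng.gauss(0.0, sd)
    return x_new, y_new
```

Drawing (x, y) at random from the estimated clean set and applying this function 10 times, with S a fresh random subset of 10 indices, reproduces the contamination scheme.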
We apply MIP and HIM to the contaminated data described above, with the nominal FDR level set to 0.05 in the Benjamini-Hochberg procedure, and repeat the process 100 times. We then compute the TPR and FPR of the two methods for identifying the artificial influential points. It turns out that MIP gives a TPR of 1 and an FPR of 0.008, whereas HIM gives a TPR of 1 and an FPR as high as 0.585. Obviously, HIM suffers seriously from the swamping effect caused by the addition of the new influential observations, whereas MIP does not seem to be affected by them.

Low dimensional data sets
We now apply MIP to two classical data sets with small p that are used extensively in the literature as benchmark cases for influence diagnosis.
For the first data set, we examine the stack loss data of Brownlee (1965), consisting of n = 21 observations with p = 3. The covariates X_i ∈ R³ are air flow, cooling water inlet temperature and acid concentration, and the response Y_i is the stack loss. Several references have identified five observations, namely cases 1, 2, 3, 4 and 21, as potential outliers (Rousseeuw and Leroy, 1987; Billor et al., 2000; Nurunnabi et al., 2014), usually by carefully examining the effect of deleting groups of observations on the OLS estimates. Applying MIP to this data set with the number of subsets m specified as 20, 50 or 100, we always identify cases 1, 2 and 3 as outliers. If one believes that cases 1, 2, 3, 4 and 21 are the true outliers, then the TPR of MIP is 0.6 and its FPR is 0. If we instead examine Cook's distance based on leave-one-out deletion, the TPR becomes 0 and the FPR becomes 0.0625. If the DFFITS-statistic is used for identifying outliers, the TPR is 0 and the FPR is also 0. Thus neither Cook's distance nor DFFITS has any power in detecting these outliers.

For the second case, we look at a data set with p = 3 that is designed to have masking and swamping effects (Hawkins et al., 1984; Nurunnabi et al., 2014). There are n = 75 observations in total, the first 10 of which are specifically perturbed to be influential. Interestingly, after applying MIP, we find the first 13 observations to be influential, meaning that the TPR of MIP is 1 and its FPR is 0.046. If we apply Cook's distance alone, the TPR becomes 0 and the FPR is 0.0615, whereas both the TPR and the FPR become 0 if we apply the DFFITS-statistic for outlier detection. For this example, MIP is much more powerful with a controlled FDR.
We point out that the theory for MIP requires min{n, p} → ∞, although the rates at which n and p diverge can be arbitrarily slow. From the analysis of the two low dimensional data sets above, however, we can see that MIP continues to provide useful results and is at least more competitive than naively examining Cook's distance or the DFFITS-statistic.

Discussion
We have proposed a novel procedure named MIP for multiple influential point detection in high dimensional regression spaces. The MIP procedure is intuitive, theoretically justified and easy to implement. In particular, by combining the strengths of the max- and min-statistics, the proposed MIP framework can overcome the masking and swamping effects that are notorious in influence diagnosis, and it can identify multiple influential points with prespecified accuracy in terms of FDR control, as verified empirically by extensive simulation.
Both HIM and MIP are based on the idea of measuring the change in marginal correlations when one observation is removed. The primary consideration for using the marginal correlation is its ubiquity in statistical analysis and the possibility of deriving rigorous theoretical results, as we have shown. However, it need not be the only quantity that defines influence, and it will be interesting to explore the use of other quantities to define influence. In this paper, we have confined our attention to linear regression. An interesting topic for future research is to extend the idea to other models such as the generalized linear model. A major challenge, however, is to define a tractable influence measure similar to HIM.
Finally, we hope that this paper brings to the attention of the statistics community the importance of influence diagnosis and how one might think about defining influence and devising automatic procedures for assessing influence, in a theoretically justified fashion. With the rapid advances in 'big data' analytics, we believe that the issue of influence diagnosis will only become more relevant and we hope that this paper can serve as a catalyst to stimulate more research in this area.