ROBUST WITHIN GROUPS ANOVA: DEALING WITH MISSING VALUES

The paper considers the problem of testing the hypothesis that J ≥ 2 dependent groups have equal population measures of location when a robust estimator is used and there are missing values. For J = 2, methods based on trimmed means have been studied, but they are not readily extended to the case J > 2. Here, two alternative test statistics were considered, one of which performed poorly in some situations. The one method that performed well in simulations is based on a very simple test statistic with the null distribution approximated via a basic bootstrap technique. The method uses all of the available data to estimate each of the marginal (population) trimmed means. Other robust measures of location, for which imputation methods have been derived, were considered as well, but in simulations the actual Type I error probability was estimated to be substantially less than the nominal level, even when there are no missing values.


Introduction
Consider J ≥ 2 dependent groups and let θ_j (j = 1, ..., J) be some robust measure of location associated with the jth marginal distribution. The paper considers the problem of testing

H_0: θ_1 = θ_2 = ... = θ_J    (1)

when there are missing values. A simple strategy is to use a complete case analysis. That is, exclude any rows of data where there are missing values and analyze the data that remain. But this approach might result in a reduction in power when testing hypotheses (e.g., [15]). For the special case where θ_j is taken to be the population mean, various methods for handling missing observations have been derived and studied (e.g., [1,20,5,29]). Approaches include a maximum likelihood method [12,13], a weighted adjustment method [4], single imputation [23] and multiple imputation [28,16]. When testing (1), a simple strategy is to impute missing values and then use some conventional test statistic in the usual manner. It is known, however, that this approach can be unsatisfactory, as noted for example by Liang et al. [15] as well as Wang and Rao [32].
Ludbrook [18] suggested using a permutation test when comparing groups based on means. A positive feature of this permutation test is that when testing the hypothesis of identical distributions, the exact probability of a Type I error can be determined. But as a method for testing (1), it can be unsatisfactory even when there are no missing values and the goal is to compare measures of location (e.g., [2,25]).
Let (X_1, Y_1), ..., (X_n, Y_n) be a random sample of size n from some bivariate distribution and let D_i = X_i − Y_i (i = 1, ..., n). Let µ_D be the corresponding population mean of D. An approach to missing values using an empirical likelihood method, based on single imputation, was derived by Liang et al. [15], which is readily adapted to the problem of computing a confidence interval for µ_D even when there are no missing values. But for heavy-tailed distributions, it can be unsatisfactory. For example, suppose D has the contaminated normal distribution

H(x) = (1 − ε)Φ(x) + εΦ(x/K),

where Φ(x) is the cumulative distribution function of the standard normal distribution and K > 1. Based on a simulation with 5000 replications, if the sample size is n = 100 and ε = 0.1, the actual Type I error is estimated to be 0.084 when testing at the .05 level. Using the Bartlett correction studied by DiCiccio et al. [8], the corresponding estimate is 0.078. Here it was found that even with a few missing values, the probability coverage deteriorates, and so this approach was abandoned. Even when there are no missing values, there are well known concerns about inferential methods based on means. A basic concern is that the population mean is not robust (e.g., [11,30]). Moreover, the sample mean has a breakdown point of only 1/n, roughly meaning that even a single outlier can result in the sample mean being arbitrarily large or small. Also, outliers can result in very poor power relative to methods based on some robust measure of location (e.g., [34]). Indeed, even a small departure from normality can result in a relatively large standard error and poor power.
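To illustrate what such heavy-tailed data look like, the following R sketch generates observations from this contaminated normal. The function name rcnorm and the choice K = 10 are assumptions made purely for illustration; only ε = 0.1 is stated above.

# Sketch: generate n observations from the contaminated normal H(x) = (1 - eps)*Phi(x) + eps*Phi(x/K).
# rcnorm and K = 10 are illustrative assumptions; with probability eps an observation is drawn
# from a normal distribution with standard deviation K instead of 1.
rcnorm <- function(n, eps = 0.1, K = 10) {
  heavy <- rbinom(n, 1, eps) == 1
  ifelse(heavy, rnorm(n, sd = K), rnorm(n))
}
set.seed(1)
D <- rcnorm(100)   # e.g., n = 100 differences D_i = X_i - Y_i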
One robust estimator that has been studied extensively is the 20% trimmed mean. A reason for considering 20% trimming, rather than other amounts of trimming, is that its efficiency compares well to the mean under normality, but unlike the mean, its standard error remains relatively small as we move toward more heavy-tailed distributions [26]. Consequently, under normality, power is nearly the same as for hypothesis testing methods based on means. For J = 2, methods for handling missing values when using a 20% trimmed mean have been studied that perform well in simulations [33]. But it is unclear how these methods might be generalized to J > 2. (Arguments for considering other robust estimators can be made, but for the situation at hand they were found to be unsatisfactory for reasons outlined in Section 2.0.1.) Here, two test statistics were considered, coupled with several robust measures of location. Among these methods, only one performed well in simulations in terms of controlling the probability of a Type I error in a reasonably accurate manner. Details of this method are given in Section 2, along with a brief outline of the methods that were considered and abandoned.

Methods That Were Considered
The one measure of location that performed well in simulations is the 20% trimmed mean. Momentarily consider a single random sample X_1, ..., X_n. The γ-trimmed mean is computed as follows. Let X_(1) ≤ ... ≤ X_(n) be the values written in ascending order and let g = [nγ], 0 ≤ γ < 0.5, where [nγ] is the greatest integer less than or equal to nγ. Then the γ-trimmed mean is

X̄_t = (X_(g+1) + ... + X_(n−g)) / (n − 2g).

A 20% trimmed mean corresponds to γ = 0.2.
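As a small illustration, the computation can be expressed in R as follows (the function name trimmean is ours; R's built-in mean(x, trim = 0.2) performs the same trimming):

# Sketch: the gamma-trimmed mean of a single sample, following the definition above.
trimmean <- function(x, gamma = 0.2) {
  x <- sort(x)                    # X_(1) <= ... <= X_(n)
  n <- length(x)
  g <- floor(n * gamma)           # g = [n * gamma]
  mean(x[(g + 1):(n - g)])        # average of the middle n - 2g order statistics
}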
Now consider a random sample from some unknown J-variate distribution, (X_i1, ..., X_iJ), i = 1, ..., n. Rather than discard any row of data having one or more missing values, all of the data are used to compute each marginal trimmed mean. So if the jth column of data has no missing values, all n values are used, even if there are missing values in some other column. If the jth column contains, say, m missing values, then the trimmed mean is based on the n − m values that are available.
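With the data stored in an n-by-J matrix X, say, and NA indicating a missing value, the marginal trimmed means just described might be computed as follows (a sketch; the function name marginal_tmeans is ours):

# Sketch: marginal 20% trimmed means, each based on all available (non-missing) values in its column.
marginal_tmeans <- function(X, gamma = 0.2) {
  apply(X, 2, function(x) mean(x, trim = gamma, na.rm = TRUE))
}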
To test (1), where now θ_j is taken to be the population trimmed mean of the jth marginal distribution, the test statistic is

Q = Σ_j (X̄_tj − X̄_t)²,

where X̄_t = Σ_j X̄_tj / J and X̄_tj is the trimmed mean for the jth group. The strategy is to approximate the null distribution of Q via a bootstrap method. In more detail, let C_ij = X_ij − X̄_tj. That is, shift the empirical distributions so that the null hypothesis is true. Next, a bootstrap sample is obtained by resampling, with replacement, n rows of data from the n × J matrix whose (i, j)th entry is C_ij.
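A minimal R sketch of Q and of the centered matrix C follows; the function names Qstat and center_columns are illustrative, and marginal_tmeans is the helper sketched earlier. The bootstrap step itself is sketched at the end of this subsection.

# Sketch: the test statistic Q and the centered matrix C used for the bootstrap.
Qstat <- function(X, gamma = 0.2) {
  tj <- marginal_tmeans(X, gamma)        # marginal trimmed means, each using all available data
  sum((tj - mean(tj))^2)                 # Q = sum over j of (Xbar_tj - Xbar_t)^2
}
center_columns <- function(X, gamma = 0.2) {
  sweep(X, 2, marginal_tmeans(X, gamma)) # C_ij = X_ij - Xbar_tj, so the null hypothesis holds
}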
For the jth column of the bootstrap sample just generated, compute X̄*_tj, the 20% trimmed mean. Note that all of the non-missing values are being used. Then compute

Q* = Σ_j (X̄*_tj − X̄*_t)²,

where X̄*_t = Σ_j X̄*_tj / J. Repeating this process B times yields Q*_1, ..., Q*_B, which provide an approximation of the null distribution of Q; the hypothesis is rejected at the α level if Q is greater than or equal to the (1 − α) quantile of these B values. Here, B = 599 was used, which has been found to perform well, in terms of controlling the Type I error probability, when testing other hypotheses based on some robust estimator [34]. But in terms of power, a larger choice for B might have practical value. Racine and MacKinnon [22] discuss this issue at length. (Also see [14].) Davidson and MacKinnon [7] proposed a pretest procedure for choosing B.
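Putting the pieces together, a minimal sketch of the bootstrap test reads as follows; the p-value computation and the function name boot_test_Q are illustrative, not necessarily the exact implementation used in the simulations.

# Sketch: bootstrap test based on Q, resampling rows of the centered matrix C with replacement.
# Missing values travel with the resampled rows and are handled by marginal_tmeans().
boot_test_Q <- function(X, B = 599, alpha = 0.05, gamma = 0.2) {
  n <- nrow(X)
  Q <- Qstat(X, gamma)
  C <- center_columns(X, gamma)
  Qstar <- replicate(B, Qstat(C[sample(n, n, replace = TRUE), , drop = FALSE], gamma))
  p.value <- mean(Qstar >= Q)            # proportion of bootstrap statistics at least as large as Q
  list(Q = Q, p.value = p.value, reject = p.value <= alpha)
}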

Alternative Methods That Performed Poorly
Although the trimmed mean used here guards against the deleterious effects of outliers among the marginal distributions, a possible criticism is that it does not deal with outliers in a manner that takes into account the overall structure of the data. There are several (affine equivariant) estimators that deal with this issue (e.g., Wilcox, 2012, Section 6.3). For some of these estimators, imputation methods have been derived (e.g., [31,6]). So a simple strategy would be to impute missing values and then use the test statistic Q in conjunction with the bootstrap method just described. However, even with no missing values, control over the Type I error probability was found to be poor. The estimators considered were the minimum covariance determinant estimator (see, for example, [27]), the Donoho and Gasko [9] trimmed mean, the orthogonal Gnanadesikan-Kettenring (OGK) estimator derived by Maronna and Zamar [19], and the translated-biweight S-estimator proposed by Rocke [24]. A basic concern is that when using Q and testing at the .05 level, the actual level can be less than .01 even with n = 60. (When using the method in [6], via the R package GSE, bootstrap samples were encountered where the measure of location could not be computed.) Consequently, further details are omitted.
An alternative hypothesis testing method, based on trimmed means, was considered, details of which can be found in Wilcox (2012, p. 393). Missing data were handled as previously indicated. But situations were found where the actual level was estimated to be much higher than the nominal level, so for brevity, further details are omitted.

Simulation Results
Simulations were used to study the finite sample properties of the method in Section 2. It is known that, generally, both skewness and kurtosis can impact the probability of a Type I error when comparing measures of location, so data were generated where the marginal distributions have one of four distributions: normal, skewed and light-tailed, symmetric and heavy-tailed, and skewed and heavy-tailed. More precisely, a g-and-h distribution [10] was used to generate the data. To elaborate, let Z be a random variable generated from a standard normal distribution. Then

X = ((exp(gZ) − 1)/g) exp(hZ²/2), if g > 0,
X = Z exp(hZ²/2), if g = 0,

has a g-and-h distribution. The parameters g and h determine the skewness and kurtosis of the distribution.
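For reference, a single marginal sample from this distribution can be generated as follows (a sketch with an illustrative function name; the simulations themselves used the R function rmul, as noted next, to obtain dependent marginals):

# Sketch: generate n independent observations from a g-and-h distribution as defined above.
rgh <- function(n, g = 0, h = 0) {
  Z <- rnorm(n)
  X <- if (g > 0) (exp(g * Z) - 1) / g else Z
  X * exp(h * Z^2 / 2)
}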
(Data were generated with the R function rmul in the R package WRS.) The four distributions considered here are: standard normal (g = 0, h = 0), asymmetric light-tailed (g = 0.2, h = 0), symmetric heavy-tailed (g = 0, h = 0.2), and asymmetric heavy-tailed (g = 0.2, h = 0.2). Table 1 summarizes some properties of the g-and-h distribution, where κ_1 indicates skewness and κ_2 indicates kurtosis. Table 2 reports the estimated Type I error probabilities based on 2000 replications. Based on results in Pratt (1968), the hypothesis that the actual level is .05 would be rejected if the estimated level is less than or equal to .04 or greater than or equal to .06. The power of this check is .995 if the actual level is .03, and .959 if the actual level is .07. Although the seriousness of a Type I error depends on the situation, Bradley [3] suggests that, as a general guide, at a minimum the actual Type I error probability should be between 0.025 and 0.075 when testing at the .05 level. All indications are that this criterion is met among the situations considered when using the bootstrap method in Section 2 based on the test statistic Q.
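As a rough check on these figures, the decision bounds and the two power values can be approximated with a normal approximation to the binomial distribution of the estimated level over 2000 replications (a sketch, not the exact computation in Pratt, 1968):

# Sketch: normal approximation to the binomial behavior of the estimated level over 2000 replications.
reps <- 2000
se05 <- sqrt(0.05 * 0.95 / reps)
c(0.05 - 1.96 * se05, 0.05 + 1.96 * se05)                                    # roughly .040 and .060
pnorm(0.04, mean = 0.03, sd = sqrt(0.03 * 0.97 / reps))                      # about .995 if the actual level is .03
pnorm(0.06, mean = 0.07, sd = sqrt(0.07 * 0.93 / reps), lower.tail = FALSE)  # about .96 if the actual level is .07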

Conclusions
As is evident, simulation results cannot guarantee that the bootstrap method considered here, used in conjunction with a 20% trimmed mean and the test statistic Q, will always provide reasonably accurate control over the Type I error probability. The main result is that it is the only method that performed reasonably well among the situations and location estimators that were considered. Indeed, the other (affine equivariant) robust estimators that were considered did not perform well even with no missing values.