Robust Multivariate Location Estimation in the Existence of Casewise and Cellwise Outliers

Multivariate outliers can exist in two forms, casewise and cellwise. Data collection typically contains unknown proportion and types of outliers which can jeopardize the location estimation and affect research findings. In cases where the two coexist in the same data set, traditional distance-based trimmed mean and coordinate-wise trimmed mean are unable to perform well in estimating location measurement. Distance-based trimmed mean suffers from leftover cellwise outliers after the trimming whereas coordinate-wise trimmed mean is affected by extra casewise outliers. Thus, this paper proposes new robust multivariate location estimation known as α-distance-based trimmed median ( ) to deal with both types of outliers simultaneously in a data set. Simulated data were used to illustrate the feasibility of the new procedure by comparing with the classical mean, classical median and α-distance-based trimmed mean. Undeniably, the classical mean performed the best when dealing with clean data, but contrarily on contaminated data. Meanwhile, classical median outperformed distance-based trimmed mean when dealing with both casewise and cellwise outliers, but still affected by the combined outliers' effect. Based on the simulation results, the proposed yields better location estimation on contaminated data compared to the other three estimators considered in this paper. Thus, the proposed can mitigate the issues of outliers and provide a better location estimation.


Multivariate Outliers
Most parametric statistical tools were derived using the population parameters mean (μ) and (co)variance (Σ), but most of the time, these population parameters are unknown. Consequently, these parameters are usually estimated from the sample mean and sample covariance, denoted as � and S respectively. However, Hampel [1] mentioned that real data typically comprises of abnormal observations (outliers). The classical mean with 0% breakdown point is highly sensitive to outliers, even with only one outlier could divert the estimation from the supposed location and lead to the defect of the least-square-based (co)variance [2][3][4][5][6]. This situation is even complicated in the context of multivariate data as multivariate outliers are harder to be identified.
In multivariate context, outliers occur in a single form or can be a combination of two abnormalities within a single set of data. These outliers can be categorized as either casewise outlier, which refers to one whole observation as a contaminant to the data itself such that the observation originates from another distribution, or cellwise outlier which refers to extreme values exist within each variable of the dataset independently similar to those in univariate context [7][8]. Figure 1 illustrates a condition where both casewise and cellwise outliers coexist in a dataset.
Based on Figure 1, the z 9 and z 10 are contaminants originated from another distribution but mistakenly included into Z-distribution. Such contaminations can happen due to sampling error or model misspecification [7], while the extreme values within the components can be some random errors. Not until recent decade, researchers began to realize the situation where both casewise and cellwise outliers present simultaneously. Nevertheless, the existing methods particularly dealt with either casewise or cellwise outliers but not both [9,10]. Thus, the main goal of this article is to introduce a more accurate location estimation method even in the presence of either one or both types of outliers.

Robust Approach -Trimming
Tukey [11] acknowledged the need of a better location estimator dealing with non-normality for Student's t-test and suggested the notion of trimming off samples. Trimming refers to a process of removing abnormal observations that caused a sample to be non-normally distributed. In univariate context, the most common trimming procedure for a size n sample is to decide a trimming proportion (α%) and removes nα% observations from both end of the ordered statistics.
Generally, multivariate trimming can be done differently via coordinate-wise or distance-based approach. Coordinate-wise trimming (CT) is simple where it treats each of the components of the sample matrix as univariate vector and applies the common trimming procedures. CT is able to retain valid inputs while the outlying inputs were discarded [12]. On the contrary, distance-based approach relies on Mahalanobis Squared Distance (MSD) to determine contaminants for trimming. The common type of distance-based trimming is performed by determining a certain percentile of Chi-Squared distribution to identify multivariate outliers. However, Alloway and Raghavachavari [13] developed a straightforward MSD-based trimmed mean to construct robust Hotelling T 2 control chart by removing two observations with highest MSD values from the sample. One thing to be noted is that, the trimming procedure introduced by [13] was confined to only 2 observations trimming which can be further extended to α-percent distance-based trimmed mean ( � ( , ) ).
Although the high-distance observations were deemed contaminants and trimmed off, there are possibilities that cellwise outliers may still remain in the multivariate dataset. The remaining data might not be as clean as what we are expecting, for more accurate location estimation. Such situation may cause the application of mean on post-trimming data to be affected by any undetected cellwise outlier. Figure 2 illustrates the post-trimming conditions of a contaminated dataset.
According to Figure 2, the remaining extreme values still exist after employing the distance-based trimming and pose a risk in the estimation of the classical mean ( � ). In a previous study on Robust Linear Discriminant Analysis [14], the application of median after MSD-based trimming yielded a more consistent performance than of the MSD-based trimmed mean. Therefore, a more reliable location estimator should be applied after the MSD-α%trimming procedure to alleviate the problem faced when using the highly sensitive classical mean.
As mentioned earlier, the trimming process by [13] is rather straight forward. Observations with high MSD measurement will be trimmed off at α percentage. Supposed that x ij comes from j-dimensional feature vectors with i=1,...,n th sample vector and j=1,...,d th variable dimension, the classical MSD is obtained via following Eq. 1 [15]:- where � j = classical mean vector of n samples of j th dimension S = classical covariance matrix of n samples × j dimension After MSD vector containing distance measurement for each of the observation was obtained, the following trimming procedure is carried out:-Step1: Arrange the MSD in ascending order; Step2: Trim α% observations with the highest MSD; Step3: Compute the mean for each dimension based on the remaining observations.

Distance-Based Trimmed Median ( � ( , ) )
Directly employing classical median instead of the classical mean on the original data may seem simple and straightforward, but the existence of casewise outliers may distort the estimation of location measure. Therefore, the distance-based trimming procedure extended from Alloway and Raghavachavari [13] is useful in removing casewise outliers. However, for the cases where both casewise and cellwise outliers coexist, the use of classical mean on the remaining post-trimmed data (X trimmed ) may still be affected by the remaining cellwise outliers in the dataset. Thus, to overcome this shortcoming of the classical mean, [14] employed median as it is more robust location estimation compared to mean.
To ensure that the outliers are well managed, a two-stage outliers filtration location estimator, known as distance-based trimmed median, is proposed. The first stage is to eliminate casewise outlier by using distance-based approach whereas the use of median estimation acts as a secondary outlier filtration for any possible outlier leftovers (casewise and/or cellwise). The same algorithm presented in Section 2.1. is used to compute the median (instead of the mean in Step 3) for each dimension of the trimmed matrix. A simulation study was conducted to compare the performance of the proposed estimator with other chosen estimators for this study.

Trimming Percentage
Trimming percentages come in various forms. Among the frequently used and suggested percentages of trimming by most researchers are symmetrical trimming of 20% to 25% [16], 20% [17-19], 15% [20], 10% to 15% [21]. For univariate data, the trimming percentage is based on each tail, thus, if the trimming percentage is 10%, the total trimming from left and right tail will be 20% from each dataset. However, when using MSD-α% trimming, the total amount of trimming is the amount of the given nα%. Considering trimming percentages from [16][17][18][19][20][21], the one-tailed distance-based trimming percentage in this study was capped at 40%.

Simulation Settings
The Tukey-Huber Contamination Model (THCM) [22][23] was commonly employed in robust multivariate analyses to generate contaminated data for simulated study. However, the problem with THCM is that the simulation mechanism generates a two-part distribution, i.e., a "clean" part which is normally distributed data whereas the other part consists of rows of contaminated data (casewise outliers). Thus, in order to simulate both casewise and cellwise outliers, THCM was adapted and modified to randomly include cellwise outliers within the simulated data in this study.
Simulated data, matrix X n×p , with sample size n and variable dimension p were generated according to Eq. 2 around pre-set Location, L p , and simulated according to the settings presented in Table 1.
*where 1 were randomly selected from clean data for each dimension p independently to perform 1 inflation To determine the ability of the estimators in handling outliers, the simulation was conducted according to the following procedure:- Step 1: Generate data according to the settings in Table  1; Step 2: Compute the location estimators ( � , � , � ( , ) and � ( , ) ); Step 3: Find the absolute difference of each location estimators and L p ; Step 4: Repeat Step 1 to Step 3 for 10,000 iterations; Step 5: Compute the average of each location estimations and absolute difference from the 10,000 iterations.
After obtaining the average absolute difference of estimated location and pre-set location, ANOVA test was conducted to confirm significant difference between the estimators' performance and Tukey's HSD test conducted for post-hoc comparisons between the estimators.
ANOVA test shows significant difference between the estimators' performance when sample size n = 40 (p = 0.000) and n = 100 (p = 0.000). Tukey's HSD test results were obtained and sorted to rank the estimators based on the difference in |T-L| into Tables 2b and 3b. Tukey's HSD test confirms the preliminary results in Tables 2a  and 3a, with � being the best fit to the pre-set location L p whereas � ( , . ) being the least fit to L p .    As cellwise outliers and casewise outliers present in the samples, � is inevitably affected. Tables 4a and 5a show that � deviates the most from L p compared to other estimators at both sample sizes, n = 40 and n = 100. On the other hand, � ( , ) significantly outperforms � with better accuracy as trimming percentage (α) increases. Simulation also shows that � is better than � and � ( , ) in handling outliers' influence. Meanwhile, � ( , ) illustrates a better and more stable performance compared to the other estimators.
Similarly, ANOVA test indicates significant difference amongst the estimators' performance when n = 40 (p = 0.000) and n = 100 (p= 0.000). The rankings according to Tukey's HSD test determine � ( , . ) as the best fit while � is the least fit location estimation for both sample sizes n = 40 (Table 4b) and n = 100 (Table 5b).
ANOVA results with p = 0.000 are reported, reported for both sample sizes n = 40 and n = 100. Tables 6b and 7b show that � is the least fit estimation for both n = 40 and n = 100. On the contrary, � ( , ) did not have significant difference (p-value = 1.000) for α = 0.2, 0.3, 0.4 with sample size n = 40 (Table 6b). Similarly, Table 7b (n = 100) also reported no difference between � ( , . ) and � ( , . ) . However, the rankings between the three � ( , ) estimations against other estimators are shown in the following tables.       With ε 1 maintained at 0.1 and ε 2 increased from 0.2 to 0.4, the total outlying proportion in the data added up to 50%. The � estimation is now severely deviated from L p regardless of sample sizes (Tables 8a and 9a). Obvious deviation also can be observed in � ( , ) when the outlying proportion is greater than trimming percentages (α = 0.2, 0.3, 0.4), while � starts showing increase in deviation, as well. Meanwhile, noticeable differences in � ( , ) between the trimming percentages can be detected, as higher α will have lower |T-L|.
The ANOVA test and Tukey HSD test for n = 40 (p=0.000) and n = 100 (p=0.000) have reported significant difference in performance between the estimators. Tables 8b and 9b show identical rankings for estimators, with � ( , . ) has the best fit while � has the least fit for both sample size n = 40 and n = 100.
As expected, X � deviates the most from L p with average deviation of 6.997 for both n = 40 and n = 100 under each dimension. Further deviation also can be seen on X � (MSD,α) , which corresponds with the cell wise outliers' inflation.
However, � performance maintains at this level of outliers' proportions, ε 1 = 0.1 and ε 2 = 0.4, despite the cellwise outliers' inflation. Similarly, � ( , ) 's deviation does not affected much from the inflation, maintaining its low deviation even at higher trimming percentage.
Lastly, ANOVA test again indicates significant difference on estimators' performance with p = 0.000 for n = 40 and n = 100. The ranking of the estimators is similar to the previous section (as shown in Tables 10b and 11b).
Note: All pairwise Tukey HSD results are significant at 0.05 significant level  Note: All pairwise Tukey HSD results are significant at 0.05 significant level

Summary
To summarize, the classical mean ( � ) is good when applied to clean data, but very susceptible to the existence of outliers. The performance of classical median ( � ) is considered moderate as the rankings are consistently in between the other estimators. On the other hand, despite having comparable performance to � when applied to clean data, � ( , ) 's performance worsens when dealing with contaminated data. The estimation through � ( , ) increasingly deviates from L p as the proportion (ε 2 ) of casewise outliers increases. The deviation also increases when the cellwise outliers' inflation (µ 1 ) increases. However, � ( , ) holds a dominant performance over � , � and � ( , ) on all contaminated occasions regardless of the trimming percentage applied. Overall, � ( , . ) yields best location estimations than the other two trimming percentages (α = 0.2, 0.3).

Conclusions
In this paper, the capability of the existing multivariate location estimators in handling situation where both cellwise outliers and casewise outliers coexist is discussed and reviewed. One should take note that when both casewise and cellwise outliers coexist in the dataset, they should be treated separately. Treating the coexistence of these outliers like a single problem could still jeopardize the location measure estimation as the outliers might not be well identified (leftover outliers). Based on the results of simulation study, the proposed distance-based trimmed median, � ( , ) , yields more accurate location estimations despite the leftover outliers compared to the other three chosen estimators in this study.
Pertaining the feasibility of using median over mean after α-distance-based trimming, the simulation study shows that � ( , ) has consistently lower deviation as compared to � ( , ) . Despite the performance of both � ( , ) and � ( , ) getting better as the trimming percentage (α) increases, in cases where α relatively lower than the total outlying proportions in the data, � ( , ) outperforms � ( , ) with noticeable differences. When the contaminated data is under-trimmed, the leftover cellwise and casewise outliers can cause � ( , ) to breakdown. On the other hand, even though trimming percentage equal or greater than the total outlying proportions in the data, some cellwise outliers might be left from trimming due to the cellwise outliers which are randomly spread within the data. Thus, the use of median after α-distance-based trimming is recommended to mitigate the problem.
As a conclusion, the proposed two-stage outlier's filtration location estimator, � ( , ) , is recommended as a better multivariate location estimator when dealing with unknown outlier's situation.