The Performance of Different Correlation Coefficient under Contaminated Bivariate Data

Bivariate data consist of 2 random variables observed on the same population. The relationship between the 2 variables can be measured by a correlation coefficient, which is computed from the sample data and measures the strength and direction of their linear relationship. However, the classical correlation coefficient is inadequate in the presence of outliers. Therefore, this study examines the performance of different correlation coefficients under contaminated bivariate data. We compared 5 correlation coefficients: the classical Pearson, Spearman and Kendall's Tau correlations, and 2 robust alternatives, the median correlation and the median absolute deviation (MAD) correlation. The results show that when the data are not contaminated, all 5 methods indicate a strong relationship between the 2 random variables. Under contamination, however, the median absolute deviation correlation indicates a stronger relationship than the other methods.


Introduction
Analysis of bivariate data is a statistical method for investigating the relationship between 2 variables. Two variables, for example X and Y, are said to be related when the value assumed by one variable affects the distribution of the other. For bivariate data, the input or independent variable, xi, is plotted on the horizontal axis, and the output or dependent variable, yi, is plotted on the vertical axis. A scatter plot places all the ordered pairs (xi, yi) on a coordinate axis system and displays the relationship between the 2 variables. Besides the scatter plot, a correlation coefficient can be used to measure the relationship between 2 variables [1]. There are several types of correlation coefficient, such as the Pearson correlation coefficient, the Spearman rank correlation coefficient and Kendall's Tau correlation coefficient, as well as robust correlation methods.
The correlation coefficient is a simple statistical measure of the relationship between 2 random variables. Computed from the sample data, it measures the strength and direction of a linear relationship between the 2 variables [2]. If there is a strong linear relationship, the value of the correlation coefficient is close to +1 or -1, depending on the direction of the relationship. When there is no linear relationship, or only a weak one, the value is close to 0. The correlation coefficient also indicates the statistical fit of a regression: its squared value, known as the coefficient of determination, is frequently used as a measure of goodness of fit [2]-[4]. Furthermore, testing correlation has become an important subject in multivariate analysis, with applications in many areas such as economics, financial markets, medicine, and social science [5]-[9].
However, testing correlation matrices with classical statistics is problematic when outliers exist in the data set [10], [11]. In the presence of outliers, the Pearson correlation coefficient gives inadequate results: its capability to measure the strength of the relationship is reduced, and the outliers can damage the statistical analysis, increase the error variance and reduce the power of statistical tests [12], [13].
To overcome the presence of outliers, robust methods can be used as alternatives that reduce the influence of outliers [10], [14]-[16]. The aim of a robust statistical procedure is to analyse the data without being affected by the outliers in the data and to ensure the stability of statistical inference under deviations from the assumed distribution model.

Methodology
In this study, we focus on the performance of robust correlation coefficients with contaminated data, so that hypothesis testing of correlation can be conducted effectively. The ultimate goal of this research is therefore to identify, by means of a simulation study, the best correlation coefficient for contaminated data.

Simulation Study
A simulation study is used to compare the performance of all correlation coefficients under 2 situations: with and without contaminated data. The simulations were run in R version 3.4.2. Correlation values, bias and standard errors were used to analyse the performance of the correlation methods [12], [19].
We generated 200 datasets based on the linear relation yi = 6.0 + 2.0xi + zi, where xi is a pseudo-random number from the uniform distribution U(0,1) and zi is drawn from the normal distribution N(0,1). The 200 datasets were generated for each of the sample sizes 20, 30, 50 and 100. For each dataset, we generated contaminated data at 10%, 20%, 30%, 40% and 50% of the sample size, respectively. We then calculated the correlation coefficient value, bias and standard deviation.
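The data-generating step can be sketched as follows. This is a minimal illustration in Python rather than the R environment used in the study. The text does not state how the contaminated observations were produced, so the mechanism here (shifting a randomly chosen fraction of the y-values by a fixed offset, `OUTLIER_SHIFT`) is purely an assumption, as are the function and variable names.

```python
import random

# ASSUMPTION: hypothetical offset used to turn a point into an outlier;
# the source does not specify the contamination mechanism.
OUTLIER_SHIFT = 20.0

def simulate_dataset(n, contamination=0.0, seed=None):
    """Generate n pairs from y = 6.0 + 2.0*x + z, then contaminate a fraction of them."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]   # x_i ~ U(0, 1)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]     # z_i ~ N(0, 1)
    y = [6.0 + 2.0 * xi + zi for xi, zi in zip(x, z)]
    k = int(round(contamination * n))               # number of contaminated points
    for i in rng.sample(range(n), k):               # ASSUMED: random positions, shifted in y
        y[i] += OUTLIER_SHIFT
    return x, y

# The 4 sample sizes and 6 contamination levels described in the text.
conditions = [(n, p) for n in (20, 30, 50, 100)
              for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)]

x, y = simulate_dataset(20, contamination=0.1, seed=1)
```

Repeating `simulate_dataset` 200 times per entry of `conditions`, with a different seed each time, would give one set of replicates per condition.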

Nature of Data
Figures 1-6 show 1 of the 200 datasets as an example of the nature of the data, from no contamination up to 50% contaminated data. Figure 1 shows a strong positive linear relationship between the 2 variables, with no contaminated data in the dataset. The simple regression for this dataset is y = 6.3552 + 1.9414x. We took the dataset in Figure 1 and added 10% contaminated data; some points then lie somewhat distant from the rest, as illustrated in Figure 2. We continued to add contaminated data, from 20% up to 50% of the dataset. Figures 3-6 show more points lying away from the original data as the percentage of contaminated data increases.

Method of Data Analysis
Generally, we have 6 contamination conditions (0%, 10%, 20%, 30%, 40%, and 50%) and 4 sample sizes (n = 20, 30, 50, and 100), which gives 24 sets of 200 generated datasets. The performance evaluation is based on 3 measurements: the value of the correlation, the bias and the standard error. Five correlation coefficients are applied: the classical Pearson, Spearman and Kendall's Tau correlations, and the robust median correlation and median absolute deviation (MAD) correlation.

Pearson Correlation
Pearson correlation is one of several methods for measuring relationships among variables. The Pearson correlation coefficient is a classical statistical approach to measuring the relationship between 2 variables, especially when the data are normally distributed and the relationship between the 2 variables is linear. Let X and Y be 2 random variables observed on a sample of size n; the sample correlation of the 2 random variables introduced by Pearson is defined in (1). The Pearson correlation coefficient can be computed when the data are measured on at least an interval scale. However, the Pearson correlation is excessively influenced by outliers, unequal variances, non-normality, and nonlinearity.
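For reference, the sample Pearson correlation is conventionally written in the product-moment form below. The symbols for the sample means of the xi and yi values are notation assumed here and may differ from the symbols used in the original equation.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Because both the numerator and the denominator are built from deviations about the mean, a single extreme observation can change r substantially, which is the sensitivity to outliers noted in this section.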

Spearman Correlation
Spearman correlation is a non-parametric alternative to Pearson's linear correlation coefficient. The Spearman correlation coefficient is usually used when the data are not normally distributed, are measured on an ordinal scale, or contain outliers. Technically, the Spearman correlation involves ranking each set of data; it is therefore sometimes called Spearman's rank-order correlation. It measures the strength and direction of association between 2 ranked variables. This correlation coefficient is defined in (2).
where di is the difference in paired rankings, R(xi) − R(yi), and n is the number of pairs of data. The value of rs ranges from -1 to +1 and is interpreted in the same manner as Pearson's linear correlation coefficient, r.
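A minimal sketch of the rank-difference calculation, assuming the usual textbook form rs = 1 − 6 Σ d² / (n(n² − 1)) and data without ties. The helper name `ranks` is hypothetical, and the study itself used R.

```python
def ranks(values):
    """Return 1-based ranks of the values; this sketch assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation from the squared differences of paired ranks."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

rs_increasing = spearman([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # identical rankings
rs_reversed = spearman([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])     # opposite rankings
```

Only the ordering of the observations enters the calculation, so the size of an extreme value has no effect beyond its rank.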

Kendall's Tau Correlation
Similar to the Spearman rank correlation coefficient, Kendall's Tau correlation is based on the ranks of observations. Kendall's Tau correlation is commonly used to measure the degree of correspondence between sets of rankings where the measures are not equidistant. Because it is used with ordinal data, this correlation is also a non-parametric alternative to the Pearson correlation coefficient. The Kendall's Tau correlation coefficient is defined in (3).
where nc is the number of concordant pairs, nd is the number of discordant pairs, and n is the number of pairs of data.
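The counting of concordant and discordant pairs can be sketched as below. The sketch assumes the simple form in which the difference between the two counts is divided by the total number of pairs, n(n − 1)/2, and it counts tied pairs as neither concordant nor discordant. The original equation may use a different normalisation, and the function name is illustrative.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's Tau from concordant and discordant pairs (assumed form, no tie correction)."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both variables order the pair the same way
        elif sign < 0:
            discordant += 1      # the variables order the pair in opposite ways
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

tau_same = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
tau_one_swap = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])   # 5 concordant, 1 discordant
```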

Median Correlation
Another alternative for calculating the correlation between 2 variables is the robust correlation approach. Robust correlation methods can ensure high stability of statistical inference under deviations from the assumed distribution model [18], [20]. Median correlation is one type of robust correlation coefficient, also known as the median-product correlation coefficient. Just as the Pearson correlation is a product-moment correlation, an alternative approach to calculating the correlation between 2 variables can use the median as an estimator in place of the mean. This method uses a 2-dimensional sample position determined by the median, used as an estimator in place of the mean.
From bivariate data with pairs (X,Y), we determined the median of the X variable, denoted by Med(x), and the median of the Y variable, denoted by Med(y). The median absolute deviation of variable X, denoted by MAD(x), is given by MAD(x) = Med(|xi − Med(x)|), while the median absolute deviation of variable Y, denoted by MAD(y), is given by MAD(y) = Med(|yi − Med(y)|). The median covariance of variables X and Y, denoted by Covmed(X,Y), is the median of the products (xi − Med(x))(yi − Med(y)). In (1), the denominator is replaced by MAD(x) and MAD(y), and the numerator is replaced by Covmed(X,Y). The median correlation is then given by using (4) and (5).
Table 1 shows the performance of the Pearson correlation compared with the other correlation coefficients. Two variables have a strong relationship if the value of the correlation is close to 1 or -1. For uncontaminated data, which contain no outliers, all the correlation methods give a coefficient value close to the original, ρ = +1. The Pearson correlation gives the highest values when the sample size is small: r = 0.986 for n = 20 and r = 0.985 for n = 30. For sample sizes 50 and 100, the MAD correlation gave the highest value, r = 0.99, for both sample sizes. On the other hand, the Kendall's Tau correlation recorded the lowest values for all sample sizes.
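The median-based quantities of the Median Correlation section (Med, MAD and Covmed) can be sketched as follows. The sketch assumes that the median covariance is the median of the products of deviations from the medians and that the median correlation is Covmed(X,Y) divided by MAD(x)·MAD(y). The function names are illustrative, no consistency constant is applied to the MAD, and this is not necessarily the exact estimator coded in the study.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: Med(|v - Med(values)|), unscaled."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

def cov_med(x, y):
    """Median covariance: median of the products of deviations from the medians (assumed form)."""
    mx, my = median(x), median(y)
    return median([(a - mx) * (b - my) for a, b in zip(x, y)])

def median_correlation(x, y):
    """Median correlation: Covmed(X,Y) / (MAD(x) * MAD(y)); undefined if either MAD is 0."""
    return cov_med(x, y) / (mad(x) * mad(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # exactly linear in x
r_med = median_correlation(x, y)
```

For this exactly linear example, MAD(x) = 1, MAD(y) = 2 and Covmed(X,Y) = 2, so the ratio equals 1. Unlike the Pearson coefficient, a ratio of this kind is not guaranteed to stay within -1 and +1 for arbitrary data.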

Results and Discussion
The performance of all 5 correlation methods can also be measured by their average bias and standard error. Based on 200 generated samples, we calculated the average and standard error of the sample correlation. The biases and standard errors are displayed in Table 1. Bias is the difference between the average of the sample correlation coefficients and the true correlation. The smaller the bias and standard error, the better the performance of the method. According to Table 1, the Pearson correlation coefficient shows the smallest bias and standard error for small sample sizes (n = 20 and 30), while the MAD correlation coefficient does so for large sample sizes (n = 50 and 100).
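The two summary measures can be sketched as below for a list of correlation estimates obtained from repeated samples. Two points are assumptions rather than details taken from the text: the bias is computed as the average estimate minus the true correlation (the tables may report its magnitude), and the standard error is taken to be the sample standard deviation of the estimates across replicates.

```python
from statistics import mean, stdev

def bias_and_standard_error(estimates, true_rho):
    """Summarise replicate correlation estimates against the true correlation."""
    average = mean(estimates)
    bias = average - true_rho          # ASSUMED sign convention
    standard_error = stdev(estimates)  # ASSUMED: spread of the estimates across replicates
    return bias, standard_error

# Three made-up estimates, only to show the arithmetic (not results from the study).
bias, se = bias_and_standard_error([0.90, 0.95, 1.00], true_rho=1.0)
```

In the simulation described here, `estimates` would hold the 200 values of one correlation coefficient for one combination of sample size and contamination level.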
The performance of all 5 correlation methods is compared with and without contaminated data. Tables 2-5 show the correlation coefficient values of all 5 methods for different percentages of contaminated data, for sample sizes n = 20, 30, 50 and 100, respectively. Based on Table 2, for a small sample size (n = 20), the correlation value of every method drops when contaminated data exist in the dataset, compared with the uncontaminated case. This means that the correlation value is highly sensitive to the presence of outliers, which can cause invalid results. With 10% contaminated data in the dataset, the MAD correlation gave the highest correlation value (r = 0.968). The MAD correlation thus shows that, although outliers exist in the data, its value is barely affected. The Pearson correlation value (r = 0.554) drops more than that of the other methods; the Pearson correlation coefficient therefore suffers in the presence of outliers even when only 10% of the dataset is outliers. When we added another 10% of outliers, giving 20% outliers in the dataset, the MAD correlation still gave a good correlation value (r = 0.926) compared with the others. The MAD correlation shows the best performance among the correlation methods even with a small sample size. When the data contain more than 40% contaminated data, the correlation values of all the methods indicate weak relationships.
We next consider the performance of the correlation methods as the sample size increases. Tables 3-5 display the coefficient values of the different correlation methods for sample sizes n = 30, 50 and 100, respectively. The pattern for the large sample sizes, n = 50 and n = 100, is the same as for the small sample sizes, n = 20 and n = 30. As Tables 3-5 show, the correlation values of all methods drop compared with the case without outliers in the dataset. The MAD correlation still produces the highest correlation coefficient values for the different percentages of contaminated data, except for 40% contaminated data.
The performance of all 5 correlation methods with contaminated data can also be measured using their average bias and standard error. Based on 200 generated samples containing 10% contaminated data, the bias and standard error of the correlation methods were calculated, as displayed in Table 6. The results show that the MAD correlation has the smallest bias and standard error of all the methods for all sample sizes. The median correlation has the second lowest bias and standard error. Meanwhile, the Pearson correlation has the largest bias and standard error for all sample sizes.
We continued the performance comparison of the correlation methods as the amount of contaminated data increased. Tables 7-10 display the bias and standard error of the 5 correlation methods for different percentages of contaminated data. Based on Table 7, we conclude that the same pattern occurs. The MAD correlation has the smallest bias and standard error of all the methods for all sample sizes, followed by the median correlation with the second lowest bias and standard error. The Pearson correlation still has the largest bias and standard error for all sample sizes.
With 50% contaminated data, all the correlation methods gave poor results, as displayed in Table 10. All correlation methods produced a high bias and standard error, with the bias ranging from about 0.748 to 0.973. This means that with 50% contaminated data, the measured relationship is reduced from strong to essentially none.

Conclusions
For perfect data with no outliers in the dataset, all 5 correlation methods demonstrated a strong relationship, with values close to +1. When outliers exist in the dataset, however, the MAD correlation still indicated a stronger relationship than the other methods. This shows that, although outliers exist in the data, the value of the MAD correlation is little affected. The relationship measured by the MAD correlation became weak when there was 40% contaminated data. The Pearson correlation value drops more than that of the other correlation methods. The Pearson correlation coefficient therefore suffers in the presence of outliers: its value is highly sensitive to them, which can cause invalid results.