Performance of Classification Analysis: A Comparative Study between PLS-DA and Integrating PCA+LDA

Classification methods are fundamental techniques designed to find mathematical models that can recognize the membership of each object in its proper class on the basis of a set of measurements. When the number of variables in an experiment is large, classifying objects into groups can cause misclassification problems. This study explores approaches for tackling the classification problem with a large number of independent variables using two parametric methods, namely PLS-DA and PCA+LDA. Data are generated with the data simulator in Azure Machine Learning (AML) Studio through a custom R module. The performance of the PLS-DA model was analysed and compared with the PCA+LDA model for different numbers of variables (p) and different sample sizes (n), with both models evaluated on minimum misclassification rate. The results demonstrate that PLS-DA performs better than PCA+LDA for large sample sizes. PLS-DA can therefore be considered a good and reliable technique when dealing with large datasets for classification tasks.


Introduction
A classification method not only plays a role as a classifier but also acts as a predictive and descriptive model, as well as a means of discriminative variable selection. The purpose of classification is to achieve a minimum misclassification rate. Classification methods can be grouped into three types: parametric, non-parametric and semi-parametric. According to [14], parametric methods are more reliable than non-parametric methods provided that the data are normally distributed and exhibit a bell-shaped curve. Frequently used parametric classification methods include Quadratic Discriminant Analysis (QDA), Partial Least Squares Discriminant Analysis (PLS-DA) and Linear Discriminant Analysis (LDA).
In contrast, non-parametric methods are more flexible than parametric methods because they are robust to the distribution of the data [18]. For instance, k-Nearest Neighbour (KNN), decision trees (CART) and survival analysis [3] make no assumptions about the data distribution. Meanwhile, semi-parametric methods combine parametric and non-parametric methods. According to [9], semi-parametric methods achieve greater precision than non-parametric models but with weaker assumptions than parametric models. Semi-parametric estimators can possess better operating characteristics in small samples because they have smaller variance than non-parametric estimators [20]. Logistic discriminant analysis is an example of a semi-parametric method [12].
This study focuses on two parametric methods: LDA and PLS-DA. These methods were chosen because LDA works efficiently when the assumption of equal population covariance structures across groups is satisfied and the independent variables follow a multivariate normal distribution [16], while PLS-DA has demonstrated great success in modeling high-dimensional datasets over the past two decades [5,17].

Classification Problems
A large number of variables is computationally demanding and suffers from the curse of dimensionality, which is caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. According to [2], the existence of irrelevant variables causes misclassification problems. Multicollinearity exists when many measured variables are correlated with one another. Under multicollinearity, the standard errors of the parameter estimates may be unreasonably large, the parameter estimates may not be significant, and the estimates may differ markedly from what is expected [1]. According to [4], variable selection and reduction is the best solution to reduce the problem of multicollinearity. No dimensionality reduction technique is universally better than the others; depending on the characteristics of the dataset, one method may provide a better approximation than another.

Classification Techniques
Principal component analysis (PCA) is a variable reduction technique that reduces a complex dataset to a lower-dimensional subspace. PCA helps to compress the data without much loss of information [6]. After conducting PCA, an LDA model is constructed for classification purposes. LDA is a well-known scheme for dimension reduction that is intended for classification problems where the output variable is categorical; hence, it has been successfully used as a statistical tool in several classification problems [7]. According to [15], the weakness of this technique is that it cannot perform well when the dataset contains a large number of variables relative to the number of measurements taken.
For this study, we use PCA to deal with a very large number of variables, and then construct an LDA model for classification as the first model. Past studies showed that PCA and LDA are popular methods for variable reduction and classification. This study intends to integrate PCA and LDA for a large number of variables and to perform the classification task using the reduced set of variables resulting from PCA.
PLS-DA can be thought of as a "supervised" version of Principal Component Analysis (PCA). Partial Least Squares Discriminant Analysis (PLS-DA) is a multivariate dimensionality-reduction tool and classifier [17]. It is reported that PLS-DA can be effective both for variable reduction and as a classifier for a large number of variables. In addition, PLS-DA performs well in modeling high-dimensional datasets for classification tasks.
Hence, the objective of this study is to compare the performance of PCA+LDA and PLS-DA on variable reduction. PCA is used to perform variable reduction and is then integrated with LDA for the classification task, while PLS-DA performs both the data reduction and the classification task itself. The performance of PCA+LDA and PLS-DA over the various simulated datasets is assessed based on misclassification rates.

Data Generation
Data are generated with the data simulator in Azure Machine Learning (AML) Studio through a custom R module. Performance analysis is conducted for different numbers of independent variables (p) and sample sizes (n). For each sample size n = 30, n = 100 and n = 150, the numbers of independent variables considered are p = 10, 30, 50, 100 and 200. To measure the performance of PLS-DA and PCA+LDA, fifteen datasets were therefore simulated.
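The simulation design can be sketched as follows. The paper generated its data through a custom R module in AML Studio, so this Python sketch is only an illustrative stand-in: the two shifted-Gaussian classes, the function name `simulate_dataset` and the `shift` parameter are our own assumptions, not the paper's generator.

```python
import numpy as np

def simulate_dataset(n, p, shift=1.0, seed=0):
    """Simulate a balanced two-class dataset with n observations and p variables.

    Class 0 is drawn from N(0, I) and class 1 from N(shift, I), so the
    classes overlap but are separable on average (illustrative only).
    """
    rng = np.random.default_rng(seed)
    n0, n1 = n // 2, n - n // 2
    X = np.vstack([rng.normal(0.0, 1.0, size=(n0, p)),
                   rng.normal(shift, 1.0, size=(n1, p))])
    y = np.array([0] * n0 + [1] * n1)
    return X, y

# The fifteen (n, p) settings used in the study:
settings = [(n, p) for n in (30, 100, 150) for p in (10, 30, 50, 100, 200)]
```

Each of the fifteen `(n, p)` pairs yields one simulated dataset, matching the design described above.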

Computation of PCA
The PCA algorithm finds the first principal component along the first eigenvector by minimizing the projection error (i.e., by minimizing the average squared distance from each data point to its projection on the principal component, or equivalently by maximizing the variance of the projected data). The algorithm then iteratively projects all the points onto a subspace orthogonal to the last principal component and repeats the process on the projected points, thus constructing an orthonormal basis of eigenvectors and principal components. An alternative formulation is that the principal component vectors are given by the eigenvectors of the non-singular portion of the covariance matrix C, given by

C = X^T C_n X / (n - 1), (1)

where C_n = I_n - (1/n) 1 1^T is the n × n centering matrix. (2)
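The eigenvector formulation above can be sketched directly in NumPy. This is a minimal illustration, assuming the covariance matrix is formed with the centering matrix and an (n − 1) divisor as in equations (1)-(2); the function name `pca_eig` is our own.

```python
import numpy as np

def pca_eig(X, d):
    """PCA via eigendecomposition of the covariance matrix.

    Builds the centering matrix C_n = I - (1/n) 11^T, forms the covariance
    C = X^T C_n X / (n - 1), and returns the data projected onto the
    eigenvectors of the d largest eigenvalues.
    """
    n = X.shape[0]
    Cn = np.eye(n) - np.ones((n, n)) / n      # centering matrix, Eq. (2)
    C = X.T @ Cn @ X / (n - 1)                # covariance matrix, Eq. (1)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]     # indices of the d largest
    W = eigvecs[:, order]                     # principal directions
    return (Cn @ X) @ W                       # scores of the centered data
```

Because the principal directions are eigenvectors of C, the resulting score columns are mutually orthogonal with decreasing variance.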

Construction of LDA
The LDA model is constructed from the reduced set of variables resulting from PCA using the following steps: estimate the mean vectors for (π1) and (π2); construct the LDA model using equation (3); and evaluate the performance of the constructed PCA+LDA model based on the lowest misclassification rate.

Construction of PCA+LDA
Suppose that n1 observations are from group 1 (π1) and n2 observations are from group 2 (π2). For classification based on LDA, an object x is classified to (π1) if

(x̄1 − x̄2)^T S_pooled^{-1} x ≥ (1/2) (x̄1 − x̄2)^T S_pooled^{-1} (x̄1 + x̄2), (3)

where x̄1 and x̄2 are the group mean vectors and S_pooled is the pooled covariance matrix (assuming equal priors and misclassification costs); otherwise it is classified to (π2).
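The two-group rule of equation (3) can be sketched as follows, again assuming equal priors and misclassification costs; `lda_rule` is an illustrative name, not from the paper.

```python
import numpy as np

def lda_rule(x, X1, X2):
    """Classify x to group 1 or 2 with the linear discriminant rule.

    Classifies to group 1 when
        (xbar1 - xbar2)^T S^{-1} x >= 0.5 (xbar1 - xbar2)^T S^{-1} (xbar1 + xbar2),
    where S is the pooled sample covariance matrix (equal priors assumed).
    """
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance: weighted average of the two group covariances.
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, m1 - m2)           # discriminant direction
    threshold = 0.5 * w @ (m1 + m2)           # midpoint between the groups
    return 1 if w @ x >= threshold else 2
```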

Steps to Integrate PCA+LDA
Firstly, perform PCA to reduce the very large number of measured variables. Then, estimate the parameters (mean, covariance matrix and prior probability) using the reduced set of variables. Next, construct the classification model based on LDA using the estimated parameters. Lastly, evaluate the performance of the constructed PCA+LDA model based on the minimum misclassification rate.
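The integration steps can be sketched with scikit-learn's PCA and LDA estimators chained in a pipeline. The number of retained components is an assumption of this sketch, since the paper does not state how many were kept.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_lda_misclassification(X, y, n_components):
    """Fit PCA followed by LDA and return the training misclassification rate.

    n_components is the number of retained principal components; LDA is
    fitted on the reduced scores, mirroring the integration steps above.
    """
    model = make_pipeline(PCA(n_components=n_components),
                          LinearDiscriminantAnalysis())
    model.fit(X, y)
    return 1.0 - model.score(X, y)   # score() is accuracy, so this is the error rate
```

The pipeline guarantees that LDA only ever sees the reduced set of variables produced by PCA.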

Construction of PLS-DA
As with PCA, the principal components of PLS-DA are linear combinations of the features, and the number of these components defines the dimension of the transformed space. In the standard variant of PLS-DA, the components are required to be orthogonal to each other (although this is not strictly necessary); this variant is employed in the mixOmics package in R. Similar to Eq. (1), the principal components of PLS-DA can be formulated as the eigenvectors of the non-singular portion of the covariance matrix C, here formed between the data and the label matrix Y. (5) The iterative process computes the transformation vectors (also called loading vectors) a_1, ..., a_d, which give the importance of each feature in each component. In iteration h, PLS-DA has the following objective:

max_{a_h, b_h} cov(X_h a_h, y_h b_h), (6)

where b_h is the loading for the label vector y_h, X_1 = X, and X_h and y_h are the residual (error) matrices after transforming with the previous h − 1 components.

Steps to Build the PLS-DA Model
First, perform PLS-DA to create a pseudo-linear Y value against which the samples are correlated. Specify the number of components, or latent variables (LVs), to use for the data. Then, plot the scores of the latent variables to examine the separation between sample groups. If a sample causes problems, filter that sample out. Construct the PLS-DA model based on the scores and weights after filtering. Finally, evaluate the performance of the constructed PLS-DA model based on the minimum misclassification rate.

Result and Analysis
The investigations based on different numbers of independent variables (p) and different sample sizes (n) are conducted to compare the performance of the PLS-DA model with the integrated PCA+LDA model based on their misclassification rates.

Table 1 (recovered entries, n = 100). Misclassification rate (number of misclassified objects in parentheses):
p = 30: PLS-DA 0.0400 (4); PCA+LDA 0.1100 (11)
p = 50: PLS-DA 0.0500 (5); PCA+LDA 0.0700 (7)
p = 100: PLS-DA 0.0200 (2); PCA+LDA 0.0900 (9)
p = 200: PLS-DA 0.0600 (6); PCA+LDA 0.0500 (…)

Table 1 shows that for the small sample size (n = 30) there is almost no difference in performance between the PLS-DA and PCA+LDA models. However, there are two cases where PLS-DA is much better than PCA+LDA, namely when the number of measured variables is p = 30 and p = 200. When p = 30, PLS-DA gives a zero misclassification rate while PCA+LDA achieves a slightly higher misclassification rate of 3.33%. Meanwhile, when p = 200, PLS-DA again gives a zero misclassification rate while PCA+LDA produces a much greater misclassification rate of 26.67%.
For the larger sample size (n = 100), the performance of PLS-DA improves significantly relative to PCA+LDA in all cases except p = 200, where there is only a 1% difference in misclassification rate between PLS-DA and PCA+LDA.
Finally, for the large sample size (n = 150), the performance of PLS-DA is greatly improved over PCA+LDA in all cases, with consistently smaller misclassification rates. In particular, when p = 10, PLS-DA obtains a 2.0% misclassification rate while PCA+LDA shows a much higher misclassification rate of 24.0%; in other words, PLS-DA misclassified only 3 out of 150 objects while PCA+LDA misclassified 36 objects under the same conditions. These results demonstrate that PLS-DA plays an important role in dealing with a large number of variables, although there is no significant difference between PLS-DA and PCA+LDA for the smaller sample size. The discussion of the results based on the relationship between sample size and the number of independent variables can be summarized as follows:
- For the smaller sample size (n = 30), the misclassification rate of PLS-DA generally gets smaller as p gets larger; in fact, PLS-DA performs better than the PCA+LDA model for larger p.
- For the sample size n = 100, PLS-DA performed better than PCA+LDA: in almost all cases the misclassification rate of PLS-DA is below 6.0%, compared with above 7.0% for PCA+LDA.
- For the large sample size (n = 150), the performance of PLS-DA is consistent and better than PCA+LDA in all cases, hence producing the better model.
- Regardless of sample size, the performance of PLS-DA is better than PCA+LDA, especially when the number of variables (p) equals the sample size (n).
- The PLS-DA model shows better performance for large sample sizes in most cases.
- This finding is consistent with the result of [17], where PLS-DA outperformed PCA+LDA when dealing with large sample sizes.

Conclusions and Future Work
As the sample size gets larger, the misclassification rate of PLS-DA becomes smaller, whereas the performance of PCA+LDA is inconsistent. In conclusion, the overall results reveal that PLS-DA is highly recommended over PCA+LDA for large sample sizes in dimension reduction and classification. PLS-DA can be considered a good and reliable technique when dealing with large datasets for classification tasks. Future work might include real datasets to compare the performance of PCA+LDA and PLS-DA on data with a large number of variables.