Comparison Analysis: Large Data Classification Using PLS-DA and Decision Trees

Classification is widely applied in many areas of research. In this study, we explore two approaches to classification with a large number of measurements: partial least squares discriminant analysis (PLS-DA) and decision trees (DT). The performance of the two methods was compared using sample data on breast tissues from the University of Wisconsin Hospital. Both methods were used to predict the diagnosis of breast tissues (M = malignant, B = benign). A total of 699 patient diagnoses (458 benign and 241 malignant) were used. Performance was evaluated using the misclassification error and the accuracy rate. The results show that PLS-DA has good prediction accuracy and can be considered a good and reliable technique for classification tasks on a large dataset.


Introduction
In multivariate classification, the aim is to find a mathematical model that can identify the class membership of each sample on the basis of a set of measurements. Once calibrated, the model is used to predict the class membership of unknown samples. Classification techniques are not limited to quantitative variables; they can also handle a qualitative response. To classify the categories of a qualitative response, a mathematical relationship between the response and a set of descriptive variables is identified. Classification plays an important role in real-world applications in every field where a correct result must be determined.
The aim of this study is to investigate the performance of two classification methods, PLS discriminant analysis and decision tree analysis, in predicting the diagnosis of breast tissues. In medical science, a major problem is obtaining the correct diagnosis from the available information. To our knowledge, no research has been carried out on classifying the diagnosis of breast tissues (M = malignant, B = benign) using PLS-DA. This paper therefore provides a more detailed investigation of the performance of PLS-DA and decision trees in classifying the diagnosis of breast tissues.

PLS Discriminant Analysis (PLS-DA)
Partial least squares discriminant analysis (PLS-DA) is a linear classification method based on the PLS regression algorithm [8,16]. It combines the properties of partial least squares regression with the discrimination power of a classification technique. PLS-DA handles one or several dependent Y variables by searching for the latent variables that have maximum covariance with the Y variables [1]. One advantage of PLS-DA is that it models the variability of the data through these latent variables. The latent variable scores and loadings allow graphical visualization of the data and of the relations among them. The number of latent variables is identified by cross-validation. However, PLS-DA runs into difficulty as the number of variables increases, because it becomes hard to find the proper size of the relevant subspace of the variable space [6]. For small datasets the covariance estimates are usually unstable, while for large samples extensive computation time is needed. Recent studies have paid increasing attention to variable selection in PLS-based algorithms [2,5,17].

Decision tree analysis (DT)
The decision tree [9] is one of the most important techniques for classification problems in breast cancer databases and in the medical field. Decision tree classification has two basic steps: constructing the tree and then applying it to the database. The decision tree algorithm creates user-friendly rules that indicate the important attributes, requires less calculation, and is easier to understand than other algorithms such as neural networks [13]. The main advantages of the decision tree are that it is flexible, easy to build, easy to debug, and suitable for both classification and regression. Simulation results in [11] support the use of a priority-based decision tree algorithm for the SEER breast cancer dataset. B. Padmapriya and T. Velmurugan [3] discussed several algorithms, such as C4.5, ID3, and CART (Classification and Regression Trees), for classifying data using decision trees. The CART algorithm was chosen to classify the breast cancer data because it provides better precision for medical datasets than ID3. The decision tree is a powerful technique for classification and prediction in the breast cancer diagnosis problem [14].
Various supervised learning classifiers are available. S. Joshi and A. V. Vidyapeetham [15] compared multi-class classifiers, decision trees (J48), Naïve Bayes, SMO, KNN, bagging, DNTB, AD Tree, and Rep tree to identify the best classifier for a breast cancer dataset. Their experimental results show that the decision tree algorithm gave a more accurate classification than the other classifiers, at 75.52%. Therefore, this study investigates the performance of PLS-DA and decision trees on a large dataset for predicting the diagnosis of breast tissues.

Data
The large sample of breast tissue data was obtained from the University of Wisconsin Hospital. The sample consists of 699 patient diagnoses, of which 458 are benign and 241 are malignant. The aim is to predict the true disease status from nine variables. The data are divided into a training sample, 70% of the patient diagnoses, used to build the model, and a testing sample, the remaining 30%, used to evaluate the performance of the model. The PLS-DA method is then compared with DT to determine the more efficient model. The performance of PLS-DA and DT is evaluated using the misclassification error rate and the accuracy rate, that is, the percentage of testing samples correctly classified by the model. The dataset is described in Table 1.
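A 70/30 split of this kind can be sketched as follows. The text does not state how the split was drawn, so the shuffling approach, the fixed seed, and the function name `split_train_test` are assumptions made only for illustration; the sizes of the split actually used in the study may differ from the ones this sketch produces.

```python
import random

def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) for a random split of range(n_samples)."""
    indices = list(range(n_samples))
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 699 is the number of patient diagnoses reported in the text.
train_idx, test_idx = split_train_test(699)
print(len(train_idx), len(test_idx))
```

Every index ends up in exactly one of the two samples, so the model is never evaluated on a diagnosis it was trained on.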

Construction of PLS-DA
PLS-DA is used to construct predictive models when there are many independent variables and high multicollinearity among them. PLS-DA also allows a series of equations to be analyzed simultaneously, whereas traditional regression may require a separate regression equation for each.
In the standard variant of PLS-DA, the components are required to be orthogonal to each other, which is why PLS-DA is not affected by multicollinearity. This variant is implemented in the package mixOmics. The principal components of PLS-DA can be formulated as the eigenvectors of the non-singular portion of the covariance matrix C, given by:

$$C = \frac{1}{(n-1)^2}\, X^{T} C_n\, y\, y^{T} C_n\, X \qquad (1)$$

where $C_n$ is the $n \times n$ centering matrix, $X$ is the data matrix, and $y$ is the label vector.
The iterative process computes the transformation vectors (also called loading vectors) $a_1, \ldots, a_d$, which give the importance of each feature in the corresponding component. In iteration $h$, PLS-DA has the following objective:

$$\max_{(a_h,\, b_h)} \operatorname{cov}(X_h a_h,\; y_h b_h)$$

where $b_h$ is the loading for the label vector $y_h$, $X_1 = X$, and $X_h$ and $y_h$ are the residual (error) matrices after transforming with the previous $h - 1$ components.
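To make the first iteration concrete, the sketch below computes the first loading vector for a single label vector y. For centred data, the unit-length vector a that maximises cov(Xa, y) is proportional to the vector of covariances between each feature and y, and the same vector is the eigenvector with non-zero eigenvalue of the matrix formed by the outer product of that covariance vector with itself, which is the covariance matrix C described above. The toy data and all function names are hypothetical, and this is not the mixOmics implementation.

```python
import math

def centered(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cross_covariance(X, y):
    """Sample covariance between each column of X and y: X^T C_n y / (n - 1)."""
    n = len(y)
    y_c = centered(y)
    columns = [centered([row[j] for row in X]) for j in range(len(X[0]))]
    return [sum(x * t for x, t in zip(col, y_c)) / (n - 1) for col in columns]

def first_loading(X, y):
    """Unit-length a_1 maximising cov(X a, y); proportional to X^T C_n y."""
    s = cross_covariance(X, y)
    norm = math.sqrt(sum(v * v for v in s))
    return [v / norm for v in s]

def matrix_C(X, y):
    """C = X^T C_n y y^T C_n X / (n - 1)^2, i.e. the outer product s s^T."""
    s = cross_covariance(X, y)
    return [[sj * sk for sk in s] for sj in s]

# Hypothetical toy data: five samples, two features, labels coded 0/1.
X_toy = [[1, 2], [2, 1], [3, 5], [6, 7], [7, 9]]
y_toy = [0, 0, 0, 1, 1]

a1 = first_loading(X_toy, y_toy)
C = matrix_C(X_toy, y_toy)
C_a1 = [sum(C[j][k] * a1[k] for k in range(len(a1))) for j in range(len(a1))]
eigenvalue = sum(C[j][j] for j in range(len(C)))  # trace = the only non-zero eigenvalue (rank 1)
print(a1, eigenvalue)
```

With a single label vector, C has rank one, so only one eigenvector is obtained this way; later components come from repeating the step on the residual matrices.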

Steps for building the PLS-DA model
First, perform PLS-DA to create a pseudo-linear Y value against which to correlate the samples. Second, specify the number of components, or latent variables (LVs), to use for the data. Third, plot the scores of the latent variables against each other to examine the separation between sample groups. If particular samples cause problems, filter them out. Then construct the PLS-DA model from the scores and weights obtained after filtering. Finally, evaluate the performance of the constructed PLS-DA model based on the minimum misclassification rate.
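The core of these steps can be sketched with a one-component PLS-DA classifier. The sketch codes the class label as a pseudo-linear Y value (1 for malignant, 0 for benign), computes one latent variable, regresses Y on its scores, and assigns a class by thresholding the predicted Y. The 0.5 threshold, the single component, the toy data, and the function names are all assumptions for illustration; the prediction rule used by the software in this study may be different.

```python
def fit_one_component_plsda(X, labels, positive="M"):
    """Fit a single-latent-variable PLS-DA model on a 0/1 coded response."""
    y = [1.0 if lab == positive else 0.0 for lab in labels]   # pseudo-linear Y
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and Y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Latent variable scores, then regression of Y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    c = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_means": x_means, "y_mean": y_mean, "w": w, "c": c}

def predict_plsda(model, X, positive="M", negative="B", threshold=0.5):
    """Predict class labels by thresholding the fitted pseudo-linear Y."""
    predictions = []
    for row in X:
        score = sum((xj - m) * wj
                    for xj, m, wj in zip(row, model["x_means"], model["w"]))
        y_hat = model["y_mean"] + model["c"] * score
        predictions.append(positive if y_hat >= threshold else negative)
    return predictions

# Hypothetical toy data with two features per sample.
X_toy = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]]
labels_toy = ["B", "B", "B", "M", "M", "M"]

model = fit_one_component_plsda(X_toy, labels_toy)
print(predict_plsda(model, [[2, 2], [7, 8]]))
```

The misclassification rate is then the share of samples whose predicted label differs from the recorded diagnosis.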

R-coding for PLS-DA
To construct the PLS-DA predictive model for the diagnosis of breast tissues, we used the R package mixOmics.

Construction of Decision Trees
Decision trees are indispensable graphical tools in such settings and are displayed in a simple, easy-to-understand format. Each branch of the decision tree represents a possible decision or occurrence. The target variable can be categorical or continuous. A decision tree model calculates the probability that a given observation belongs to each category of the target variable, or classifies the observation by assigning it to the most likely category [4].
Classification trees apply to data where the target variable (outcome) is a classification label, such as the disease status of a patient. Classification trees are decision trees derived using recursive partitioning algorithms that classify each case into one of the class labels for the outcome. A classification tree consists of three types of nodes: the root node (the top node of the tree, comprising all the data), splitting nodes (nodes that assign data to a subgroup), and terminal nodes (the final decision or outcome).
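A single step of recursive partitioning can be sketched as a search for the threshold on one variable that gives the purest pair of subgroups. The sketch below scores candidate splits with the Gini impurity, which is commonly used by CART; the text does not state which splitting criterion was used in this study, so the criterion, the toy data, and the function names are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value < threshold'."""
    best = (None, gini(labels))        # no split unless impurity decreases
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2   # midpoint between adjacent observed values
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical values of one variable and the corresponding diagnoses.
values_toy = [1, 1, 2, 3, 5, 8, 10]
labels_toy = ["B", "B", "B", "B", "M", "M", "M"]
print(best_split(values_toy, labels_toy))
```

Applying the same search to each resulting subgroup, until the subgroups are pure or too small, yields the splitting nodes and terminal nodes described above. Taking midpoints between observed values produces half-integer thresholds whenever the variable takes integer values.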

R-coding for DT
We used the package foreign for predicting the diagnosis of breast tissues. Figure 1 shows the decision tree rules for classifying the diagnosis of breast tissues (M = malignant, B = benign). A tissue is classified as benign in three cases: (i) the uniformity of cell size is less than 2.5 and bare nuclei is less than 4.5; (ii) the uniformity of cell size is greater than or equal to 2.5 and the uniformity of cell shape is less than 2.5; (iii) the uniformity of cell size is at least 2.5 but less than 4.5, the uniformity of cell shape is greater than or equal to 2.5, and bare nuclei is less than 2.5. A tissue is classified as malignant in three cases: (i) the uniformity of cell size is less than 2.5 and bare nuclei is greater than or equal to 4.5; (ii) the uniformity of cell size is at least 2.5 but less than 4.5, the uniformity of cell shape is greater than or equal to 2.5, and bare nuclei is greater than or equal to 2.5; (iii) the uniformity of cell size is greater than or equal to 4.5 and the uniformity of cell shape is greater than or equal to 2.5.
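The six rules read from the tree can be written as one nested function, which also makes it easy to check that they cover every case exactly once. The thresholds are those reported in the text; the function name, the argument names, and the order in which the conditions are nested are assumptions, since the layout of Figure 1 is not reproduced here.

```python
def classify_tissue(cell_size, cell_shape, bare_nuclei):
    """Return 'B' (benign) or 'M' (malignant) from the reported tree rules.

    cell_size   -- uniformity of cell size
    cell_shape  -- uniformity of cell shape
    bare_nuclei -- bare nuclei
    """
    if cell_size < 2.5:
        # Small, uniform cells: the decision depends on bare nuclei only.
        return "B" if bare_nuclei < 4.5 else "M"
    if cell_shape < 2.5:
        return "B"
    if cell_size < 4.5:
        # Intermediate cell size with irregular shape.
        return "B" if bare_nuclei < 2.5 else "M"
    return "M"

print(classify_tissue(1, 1, 1), classify_tissue(8, 8, 1))
```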

Results and Analysis
The PLS-DA and DT models were compared on their misclassification and accuracy rates, using 70% of the samples for training and 30% for testing. Table 2 shows the classification of the training and testing samples for the large sample (n = 699). For the training sample, PLS-DA correctly predicts 306 benign and 160 malignant cases, while DT correctly predicts 297 benign and 166 malignant cases. For the testing sample, PLS-DA correctly classifies 142 benign cases and DT 136; for the malignant class the numbers are 64 and 66, respectively. Table 3 summarizes the performance of PLS-DA and DT. DT has a lower accuracy rate than PLS-DA in both the training and testing samples, but the difference is small: 0.62 percentage points in training and 1.87 in testing. Accordingly, PLS-DA has a lower misclassification rate than DT for both training and testing. The percentage correctly classified by PLS-DA is 96.08% for training and 96.26% for testing, compared with 95.46% and 94.39% for DT.
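The accuracy and misclassification rates follow directly from the counts of correctly classified cases. The sketch below recomputes them from the counts given in the text. The sample sizes of 485 (training) and 214 (testing) are not stated in the text; they are inferred here as the sizes that sum to 699 and reproduce the reported percentages, and should be read as an assumption. The function and variable names are hypothetical.

```python
def accuracy_pct(correct_benign, correct_malignant, n_samples):
    """Percentage of samples that were classified correctly."""
    return 100.0 * (correct_benign + correct_malignant) / n_samples

# Assumed sample sizes (inferred, not stated in the text).
N_TRAIN, N_TEST = 485, 214

results = {
    ("PLS-DA", "training"): accuracy_pct(306, 160, N_TRAIN),
    ("PLS-DA", "testing"): accuracy_pct(142, 64, N_TEST),
    ("DT", "training"): accuracy_pct(297, 166, N_TRAIN),
    ("DT", "testing"): accuracy_pct(136, 66, N_TEST),
}

for (method, sample), acc in results.items():
    # The misclassification rate is the complement of the accuracy rate.
    print(f"{method} {sample}: accuracy {acc:.2f}%, "
          f"misclassification {100 - acc:.2f}%")
```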

Conclusion
The PLS-DA model was chosen as the best predictive model for the breast cancer class because it gave the best accuracy rate for both the training and testing samples. In conclusion, the results of this study indicate that PLS-DA can be considered a good and reliable technique for classification on a large dataset, with the advantage that its components are orthogonal and hence not affected by multicollinearity.