Empirical Study to Evaluate the Performance of Classification Algorithms on Healthcare Datasets

Healthcare is a rapidly growing industry in both developed and developing countries. Advances in technology have facilitated the storage and analysis of the diverse data which the healthcare industry generates. Data mining algorithms have been employed in the healthcare industry for the past few years for diverse decision-making and predictive-analysis tasks. Classification algorithms have been widely used for early detection of disease symptoms among patients. However, the selection of a suitable classifier for a particular dataset is an important problem in various healthcare-related applications. This paper puts forward an empirical comparison of five important classifiers built using decision trees, Bayesian learning, support vector machines and ensemble learning on twelve UCI healthcare datasets. The experimental results are examined from multiple perspectives, namely accuracy, precision, recall and F-measure.


I. Introduction
The amount of data generated in today's world is enormous; databases with terabytes of data are common in enterprises and research facilities. The quantity of data is increasing exponentially and can be used effectively to draw out useful knowledge which can assist in decision making [1]. The Indian healthcare industry alone was worth USD 100 billion in 2015 and is poised to grow at an estimated Compound Annual Growth Rate (CAGR) of 22.9 per cent to USD 280 billion by 2020 [2]. Deloitte Touche Tohmatsu India has predicted that, with increased digital adoption, the Indian healthcare market is likely to grow at a CAGR of 23 per cent. However, per capita healthcare spending in India was only USD 40 in 2010, far below that of developed nations like the USA and the UK, where per capita spending was USD 7,285 and USD 3,867 respectively; it is in fact well below the worldwide per capita expenditure of USD 802. Further, the population boom and changes in lifestyle aggravate the challenge. Hence, there is an urgent need to make rapid advancements in the medical field and to empower it through the proper use of Information Technology.
Data mining techniques enable the extraction of knowledge from large databases [3]. Classification is an important area of data mining and machine learning. A data classification system makes essential data easy to find and retrieve, thus helping in risk management, legal discovery and compliance. Written procedures and guidelines for data classification should define what categories and criteria the organization will use to classify data. The goal of a classification algorithm is to construct a classifier and develop an accurate model by analyzing the characteristics of unknown data. Data classification can be performed using various methods such as Support Vector Machines, Naïve Bayes and Random Forests, and the performance of a classifier is evaluated using several criteria such as accuracy, precision and recall.
Classification of healthcare data sets is of extreme importance, particularly in nations with rising economies like India. Health care has become one of India's largest sectors, both in terms of revenue and employment. The Indian healthcare system is divided into two major parts: the Government, i.e. the public healthcare system, which comprises limited secondary and tertiary care institutions in key cities and focuses on providing basic healthcare facilities in the form of primary healthcare centers (PHCs) in rural areas; and the private sector, which supplies the bulk of secondary, tertiary and quaternary care institutions, with a major concentration in metros and tier I and tier II cities. Table 1 presents facts about the Indian healthcare industry which suggest that massive reinforcements and improvements need to be implemented using Information Technology. China faces similar challenges: in addition to the country being hard hit by cancer, the WHO estimates that approximately 230 million Chinese currently suffer from cardiovascular disease and that annual cardiovascular events will increase by 50 per cent between 2010 and 2030 based on population aging and growth alone. The incidence of diabetes tells a similar tale: almost one in three diabetes sufferers worldwide today is in China, with approximately 114 million adults afflicted by the disease. The data about these people can be used to extract knowledge about the causes and effects of these diseases, not only to help in curing them, but also to study and predict the sources of these problems.
Health statistics are crucial for decision making at all levels of health care systems. The healthcare sector generates huge amounts of data related to patients on an everyday basis. Few nations in the world today have effective and comprehensive arrangements in place to collect this data [7]. Healthcare data mining provides countless possibilities for investigating hidden patterns in these data sets. These patterns can be used by physicians to determine diagnoses, prognoses and treatments for patients in healthcare organizations. They also facilitate better decisions in policy design, health planning, management, monitoring and evaluation of programs and services, including patient care, and support advances in overall health service performance and outcomes. Unfortunately, data sets are not easily available in India because the healthcare industry lacks the necessary tools for storage and manipulation of these data.
Classification is a supervised learning technique which divides data samples into known target classes. The data set is partitioned into a training set and a test set: the classifier is trained using the training set, and its correctness is then tested using the test set. Classification is one of the most widely used data mining methods in healthcare organizations [8]. Each disease can be detected using a different set of parameters, and the classification technique predicts the target class for each data point based on the disease for which it is being assessed. For example, patients can be classified as "high risk" or "low risk" on the basis of their disease pattern. Classification may be binary or multiclass: in binary classification only two possible classes, such as "high" or "low" risk, are considered, while the multiclass approach has more than two targets, for example "high", "medium" and "low" risk.
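The train/test workflow described above can be sketched in a few lines. This is a purely illustrative Python example (the study itself used R): a toy binary "risk" classifier is trained on 70% of synthetic single-feature records by learning a threshold, and its correctness is checked on the held-out 30%. The marker values and class means are made up for the demonstration.

```python
import random

random.seed(0)

# Synthetic records: (marker value, true risk label) -- illustrative only.
data = [(random.gauss(3.0, 1.0), "low") for _ in range(70)] + \
       [(random.gauss(6.0, 1.0), "high") for _ in range(70)]
random.shuffle(data)

split = int(0.7 * len(data))          # 70:30 train/test partition
train, test = data[:split], data[split:]

# "Training": pick the midpoint between the two class means as the threshold.
lows  = [x for x, y in train if y == "low"]
highs = [x for x, y in train if y == "high"]
threshold = (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2

def classify(x):
    return "high" if x >= threshold else "low"

# Correctness of the classifier is tested on the held-out test set.
accuracy = sum(classify(x) == y for x, y in test) / len(test)
print(round(accuracy, 2))
```

The same split-train-evaluate pattern underlies every experiment reported later in the paper, with the threshold rule replaced by a real classifier.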
The remainder of the paper is organized as follows: other works in the domain are discussed in Section 2; Section 3 provides details of the classification methods employed, followed by a discussion of the evaluation approach for the classification algorithms in Section 4. Section 5 provides a detailed description of the experimental setup, with a discussion of the results in Section 6 and directions for future work in Section 7.

II. Related Works
Data mining has been applied to a variety of healthcare domains to help in decision making: classification techniques are applied to different diagnostic datasets to extract useful knowledge. The application of decision trees for the detection of high-risk breast cancer groups, using the dataset produced by the Department of Genetics of the Faculty of Medical Sciences of Universidade Nova de Lisboa, is highlighted in [9], where it was shown that statistically significant associations with breast cancer can be found using decision trees. In [10], the performance of Naïve Bayes and a weighted associative classifier (WAC) was analyzed to predict the likelihood of patients having heart disease; the system uses the CRISP-DM methodology to build the mining models, and WAC gives the highest percentage of correct predictions for diagnosing patients with heart disease.
A weighted K-NN classifier was used to diagnose skin diseases in [11], where the weighted KNN algorithm was shown to give better results than the basic KNN algorithm. Classification methods such as decision trees, SVM and ensemble approaches for microarray data analysis are discussed in [12]; the experimental results indicate that all ensemble methods were significantly more accurate. In [13], data on children with Diabetes mellitus and Diabetes insipidus were studied using different classification methods, namely rule-based learning, decision trees and Artificial Neural Networks; the emphasis there is on how to build a model using these classification methods, and a comparison of the different methods is not provided.

III. Classification Methods
Machine learning and statistics offer many algorithms for performing classification on datasets. In this study, five well-known classification techniques, including ensemble methods, have been used to train the models for empirical comparison, namely Support Vector Machines (SVM) [14], Naïve Bayes [15], Conditional Inference Trees [16], Random Forests [17] and Gradient Boosting [18]. All algorithms were implemented using R, a free, open-source data mining environment [19]. Below is a short description of each classifier, along with the reasoning behind choosing it for this study.
SVM: It performs a non-linear mapping to transform the training data into a higher dimension, then searches for the optimal linear separating hyperplane and uses it to separate data of different classes [20]. It is one of the top 10 data mining algorithms [21].
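At prediction time, a trained linear SVM reduces to the sign of a linear decision function. The tiny Python sketch below illustrates only this decision rule; the weight vector and bias are made-up stand-ins for what a real SVM would learn by maximizing the margin, and no training is performed here.

```python
# Hypothetical weights and bias standing in for a trained linear SVM model;
# a real SVM learns these by maximizing the margin between the classes.
w = [0.8, -0.5]
b = -0.2

def svm_predict(x):
    # The sign of the decision function w.x + b determines which side of
    # the separating hyperplane the instance falls on.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(svm_predict([1.0, 0.5]))   # falls on the positive side
print(svm_predict([0.0, 1.0]))   # falls on the negative side
```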
Naïve Bayes: It models probabilistic relationships between predictor variables and the class variable by estimating class-conditional probabilities based on Bayes' theorem [22], assuming independence among attributes when assigning probabilities. It also belongs to the top 10 data mining algorithms [21].
Conditional Inference Trees: These are decision trees which recursively perform univariate splits of the dependent variable based on values of a set of covariates. It searches from root node to a leaf node to determine the class of instance [23].
Random Forests: This is an ensemble learning approach in which multiple trees are constructed and the mode of the classes predicted by the individual trees is taken as the final output. Each tree uses a different bootstrap sample of the data, and each node is split using the best among a subset of predictors chosen randomly at that node. Ensemble algorithms have been relatively less studied in empirical comparisons, though they provide several advantages over non-ensemble algorithms: for example, they do not assume linear features and they handle binary as well as high-dimensional features well [24].
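The two randomization ideas above (bootstrap sampling per tree, majority vote across trees) can be sketched in Python with each "tree" reduced to a one-level threshold stump; the data is synthetic and the random feature-subset selection at each node is omitted for brevity, so this is an illustration of bagging with voting, not a full Random Forest.

```python
import random

random.seed(1)
# Synthetic one-feature data: label is "high" exactly when x > 5.
data = [(x, "high" if x > 5 else "low") for x in range(11)]

def train_stump(sample):
    # Pick the threshold that best separates the bootstrap sample.
    best_t, best_acc = 0, -1
    for t in range(11):
        acc = sum(("high" if x > t else "low") == y for x, y in sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]   # bootstrap sample
    stumps.append(train_stump(sample))

def forest_predict(x):
    votes = ["high" if x > t else "low" for t in stumps]
    return max(set(votes), key=votes.count)        # mode = majority vote

print(forest_predict(8), forest_predict(2))
```

Because every stump sees a slightly different resample, their thresholds disagree a little, and the vote averages that disagreement away.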
Gradient Boosting: This is another ensemble approach which produces a prediction model in the form of an ensemble of weak prediction models. The model is constructed in a stage-wise fashion, so each new addition helps to correct errors previously made, making the model progressively more expressive. Gradient Boosting has shown better performance than Random Forests in various studies, such as [25].
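The stage-wise error-correcting behaviour can be shown with a deliberately minimal Python regression sketch: each round fits a weak model to the residuals of the current ensemble and adds a damped copy of it. The weak learner here is just the mean residual, so the ensemble can only converge to the overall target mean; real gradient boosting uses small trees as weak learners, which also fit per-instance structure. All numbers are illustrative.

```python
# Targets for four synthetic instances (illustrative only).
ys = [2.0, 4.0, 6.0, 8.0]

prediction = [0.0] * len(ys)        # the ensemble starts empty
learning_rate = 0.5
for stage in range(20):
    # Residuals of the current ensemble...
    residuals = [y - p for y, p in zip(ys, prediction)]
    # ...fit by the simplest possible weak model: the mean residual.
    weak = sum(residuals) / len(residuals)
    # Each stage adds a damped correction of the remaining error.
    prediction = [p + learning_rate * weak for p in prediction]

print([round(p, 2) for p in prediction])
```

After 20 stages the remaining mean residual has shrunk by a factor of 0.5 per stage, illustrating how every addition corrects what the ensemble so far gets wrong.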

IV. Evaluation Approach of Classification Algorithms
There are various measures for evaluating classification algorithms, and these measures have evolved to capture different aspects of performance. Studies have demonstrated that an algorithm that achieves the best performance according to a given measure on a dataset may not be the best under a different measure [26]. Characteristics of datasets, such as size, class distribution or noise, can also affect the performance of classifiers. Hence, evaluating the performance of classification algorithms using only one or two measures on very few datasets often proves to be inadequate. Based on the above, Accuracy, Precision, Recall and F1 Score have been used in this study to compare the performance of the algorithms. Accuracy is the proportion of correctly classified instances [27]. These measures are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × Precision × Recall / (Precision + Recall)

where TP (True Positives) is the number of correctly classified positive instances, FP (False Positives) is the number of negative instances misclassified as positive, TN (True Negatives) is the number of correctly classified negative instances, and FN (False Negatives) is the number of positive instances misclassified as negative.
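As a quick sanity check, all four measures can be computed directly from raw confusion-matrix counts. The counts in this Python snippet are illustrative only and are not results from this study.

```python
# Illustrative confusion-matrix counts: true/false positives and negatives.
tp, fp, tn, fn = 40, 10, 45, 5

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correctly classified share
precision = tp / (tp + fp)                    # how pure the positives are
recall    = tp / (tp + fn)                    # how many positives we catch
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, round(recall, 3), round(f1, 3))
```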

V. Experimental Setup
This section provides detailed information about the experimental datasets and the various steps involved in analyzing the performance of the different classifiers. It contains two parts: (i) data set information and (ii) methodology.

Data Set Information
The experiment was conducted on 12 healthcare data sets downloaded from the UCI repository [28]. Table 2 summarizes the characteristics of the data sets used in the experiment.

Methodology
The steps undertaken in the experiment are described below: (i) The dataset is pre-processed to remove attributes, if any, which offer no variation across all instances. (ii) The dataset is split into two subsets, a training set and a test set, in a 70:30 ratio [29]. (iii) The performance scores discussed in Section 4 are calculated for the different algorithms. (iv) The performance of the different algorithms is compared and tables are generated, as presented in Table 3, containing the performance outcomes.
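Steps (i) and (ii) above can be sketched in Python (the study itself used R). The rows and column names below are synthetic and illustrative: a constant attribute is detected and dropped, then the remaining rows are shuffled and split 70:30.

```python
import random

# Synthetic patient rows; "site" is deliberately constant across instances.
rows = [{"age": 40 + i, "site": "A", "outcome": "pos" if i % 2 else "neg"}
        for i in range(10)]

# Step (i): remove attributes that take a single value on every instance,
# since they offer no variation and cannot help a classifier.
constant = [k for k in rows[0] if len({row[k] for row in rows}) == 1]
rows = [{k: v for k, v in row.items() if k not in constant} for row in rows]

# Step (ii): shuffle, then split 70:30 into training and test subsets.
random.seed(42)
random.shuffle(rows)
cut = int(0.7 * len(rows))
train, test = rows[:cut], rows[cut:]

print(constant, len(train), len(test))
```

Each classifier is then trained on `train` and scored on `test`, as in Section 4.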
For datasets with multiple classes, the recall and precision values were averaged across the classes.

VI. Results
The results obtained by applying the classification algorithms Naïve Bayes, SVM, Conditional Inference Trees, Random Forests and Gradient Boosting have been compiled in Table 3. Column 1 of the table gives the names of the datasets, column 2 specifies the five classification algorithms for each dataset, and columns 3, 4, 5 and 6 give the values of the performance measures (Accuracy, Recall, Precision and F1 Score) obtained on applying the given algorithm to the specified dataset. Among the classifiers, Naïve Bayes performed the worst with an average classification accuracy of 75.2%, followed by Conditional Inference Trees (81.9%), SVM (82.9%) and Random Forests (84.9%), while the best performance in terms of accuracy was given by Gradient Boosting with an average accuracy of 85.1%. The other evaluation metrics show the same ordering. Figures 1a, 1b, 1c and 1d depict a graphical view of each performance measure when a particular algorithm is applied to a dataset. Tables 4a, 4b, 4c and 4d further bolster the above claims: ranks have been assigned to all the algorithms in terms of relative performance, and an overall rank has been assigned by averaging the individual ranks for every algorithm. Gradient Boosting came first on all the measures, followed by Random Forests; SVM took third place for all the metrics, followed by Conditional Inference Trees and Naïve Bayes respectively. Tables 5a, 5b, 5c and 5d show the p-values calculated for the different classifiers based on the different performance measures. Considering the significance level (α) to be 0.05, we can conclude from Table 5a that, on the basis of accuracy, the difference in values for Naïve Bayes is statistically significant compared to all other classification algorithms.
The differences for Conditional Inference Trees are also statistically significant with respect to all other algorithms except Support Vector Machines. While the p-values in Tables 5b and 5c, calculated using precision and recall values, do not show any of the differences between the classification algorithms to be statistically significant, the p-values in Table 5d, using the F1 score, show Naïve Bayes to be statistically significantly different from Random Forests and Gradient Boosting individually, keeping the significance level the same (α = 0.05). The classifiers were also pitted against each other on all the metrics on a one-to-one basis, to evaluate which of the two is better, in terms of Wins (W), Draws (D) and Losses (L). These results have been compiled in Table 6, which confirms all the results stated so far. It is evident from the tables and figures that the ensemble algorithms, Gradient Boosting and Random Forests, provide the best performance, although they take more time than the traditional algorithms to execute and evaluate. Hence, a user has to decide on a trade-off between time and performance: traditional algorithms take far less time but also give noticeably poorer results. Since the matter of concern here is health care, where some decisions need to be made very quickly while others can take relatively more time, the choice of algorithm must be made in keeping with the complexity and requirements of the situation.

VII. Conclusion and Future Work
In this paper an empirical comparison has been presented to assess the performance of classification algorithms. This work is distinctive because of its healthcare-oriented focus: multiple data sets from this domain have been studied, and an attempt has been made to find the best-suited classifier for each of them from a selection of classifiers, judging their performance using multiple evaluation measures. Moreover, the study compares the classification algorithms against one another and shows that the best classifier for a dataset can vary depending upon the evaluation standard. In total, twelve healthcare data sets were studied using the five algorithms. The results show that the ensemble algorithms, Gradient Boosting and Random Forests, give the best results, followed by Support Vector Machines, Conditional Inference Trees and Naïve Bayes. The study can be broadened to incorporate feature selection techniques in order to reduce dimensionality and consequently improve the evaluation time for a healthcare dataset. There is also a need to work on theoretical foundations to make classification more appropriate for health care.