Comparing the Performance of AdaBoost, XGBoost, and Logistic Regression for Imbalanced Data

An imbalanced data problem occurs when there is no even class distribution between classes. Imbalanced data causes a classifier to be biased towards the majority class, as standard classification algorithms assume that the training set is balanced. Therefore, it is crucial to find a classifier that can deal with imbalanced data for any given classification task. The aim of this research is to find the best method among AdaBoost, XGBoost, and logistic regression for dealing with imbalanced simulated and real datasets. The performances of these three methods on both simulated and real imbalanced datasets are compared using five performance measures, namely sensitivity, specificity, precision, F1-score, and g-mean. The results on the simulated datasets show that logistic regression performs better than AdaBoost and XGBoost on highly imbalanced datasets, whereas on the real imbalanced datasets, AdaBoost and logistic regression demonstrate similarly good performance. All methods perform well on datasets that are not severely imbalanced. Compared to AdaBoost and XGBoost, logistic regression predicts better on datasets with severe imbalance ratios. However, all three methods perform poorly on data with a 5% minority class and a sample size of n = 100. This study finds that different methods perform best for data with different minority percentages.


Introduction
Statisticians often encounter the imbalanced data issue when performing classification tasks. Imbalanced data, a major source of classification problems, occurs when the distribution between outcome classes is unequal [1]. Imbalanced data arises when one majority class makes up almost the whole dataset alongside one small minority class [2]. In imbalanced data, the minority class usually contains the important information. Hence, standard classification is not the best approach for predictions involving imbalanced datasets [3].
Real-world applications of classification that involve imbalanced data include text categorization, fault and fraud detection, and oil spill detection through satellite images [1]. It is crucial to find the best way to deal with imbalanced datasets [4]. The authors of [5] emphasize finding a solution for accurate prediction of the minority cases.
Many issues arise in mining imbalanced cases. The first issue is classification accuracy: although accuracy is the evaluation metric most commonly used for classification tasks, it is not appropriate for imbalanced cases because it is a biased performance measure [6]. The second issue is a lack of data (absolute rarity). This problem arises when the number of observations in the minority class is small in the absolute sense, making it almost impossible for the classifier to detect any regularities in the minority class [6]. The third issue is a relative lack of data (relative rarity). This problem arises when the minority cases are not rare in an absolute sense but are rare relative to other cases, making it difficult to identify patterns [6]. The fourth issue is data fragmentation: partitioning the data into smaller groups can also lead to an absolute lack of data within a single partition [6]. Fifth, noise affects the minority class more than the majority class and has a large negative impact on detecting minority cases, making it difficult to form decision boundaries around the minority class [6]. Lastly, an inappropriate inductive bias can affect the overall ability to learn about the minority class [6].
Machine learning algorithms are often preferred over traditional classification approaches when it comes to handling imbalanced data, as traditional classification methods tend to ignore the imbalance issue, leading to poor classification performance on imbalanced datasets. Moreover, the severity of the imbalance problem is associated with the ratio of the majority to the minority class, the size of the overall training dataset, the classifier involved, and the complexity of the concept represented by the data [7].
Thus, this study intends to investigate the performance of two boosting algorithms, namely AdaBoost and XGBoost, along with a regression method known as logistic regression.

Adaptive Boosting (AdaBoost)
AdaBoost [8] is the earliest boosting technique, introduced by Freund and Schapire [9]. AdaBoost makes use of weak classifiers to build a stronger and more stable classifier. According to Tavish [10], in the initial step the base learner, or weak learner, allocates equal weight to each observation. Once a weight has been assigned to each observation, the weak learner is used for prediction. Greater weights are then assigned to observations that are misclassified by the weak learner, and the next base learner is used for prediction. This process is iterated until it reaches the limit $T$ of the base learning algorithm, where each iteration is denoted by $t$. In the final step, the outputs of the weak learners are combined to form a stronger learner that enhances the prediction accuracy. In this study, a decision tree of depth 1 is chosen as the weak classifier.
According to Freund and Schapire [11], the AdaBoost algorithm operates on a training set containing the pairs $(x_1, y_1), \ldots, (x_m, y_m)$, where each $x_i$ is an element of the domain $X$ and each $y_i$ is a label from the set $Y$. In this study, the labels in $Y$ are set as $\{0, 1\}$, where 0 stands for the majority class and 1 for the minority class in each dataset. At each iteration $t$, a weight distribution $D_t(i)$ over the cases $i \in \{1, 2, \ldots, m\}$ is maintained. In the first iteration, $t = 1$, the weights are set equally: $D_1(i) = 1/m$. Next, a weak classifier $h_t : X \to \{0, 1\}$ is obtained, and its error is computed as $\epsilon_t = \sum_{i=1}^{m} D_t(i)\,e_i$, where $e_i$ is 0 if $h_t(x_i) = y_i$ and 1 otherwise. At the next iteration, the weight is updated by $D_{t+1}(i) = D_t(i)\,\beta_t^{1 - e_i} / Z_t$, where $\beta_t = \epsilon_t / (1 - \epsilon_t)$ and $Z_t$ is a normalisation constant such that $\sum_{i=1}^{m} D_{t+1}(i) = 1$.
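The weight-update rule above can be sketched in code. The following is a minimal, illustrative AdaBoost with depth-1 decision stumps; the function names and toy data are hypothetical, not from the study (which was carried out in R):

```python
import numpy as np

def adaboost_stumps(X, y, T=20):
    """Minimal AdaBoost sketch with depth-1 decision stumps,
    mirroring the weight-update rule above (labels y in {0, 1})."""
    n = len(y)
    D = np.full(n, 1.0 / n)               # D_1(i) = 1/n: equal initial weights
    stumps, betas = [], []
    for _ in range(T):
        best = None
        # Exhaustively search threshold stumps for the lowest weighted error
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = (sign * (X[:, j] - thr) > 0).astype(int)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        beta = err / (1.0 - err)          # beta_t = eps_t / (1 - eps_t)
        D = np.where(pred == y, D * beta, D)   # down-weight correct cases
        D /= D.sum()                      # normalise: weights sum to 1 (Z_t)
        stumps.append((j, thr, sign))
        betas.append(beta)
    return stumps, betas

def adaboost_predict(X, stumps, betas):
    # Combine the weak learners by a weighted vote with weights log(1/beta_t)
    votes, total = np.zeros(len(X)), 0.0
    for (j, thr, sign), beta in zip(stumps, betas):
        w = np.log(1.0 / beta)
        votes += w * (sign * (X[:, j] - thr) > 0)
        total += w
    return (votes >= total / 2).astype(int)

# Tiny separable example (hypothetical data)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
stumps, betas = adaboost_stumps(X_demo, y_demo, T=5)
print(adaboost_predict(X_demo, stumps, betas))
```

The exhaustive stump search stands in for the generic weak learner; any classifier slightly better than chance could take its place.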

Extreme Gradient Boosting (XGBoost)
Boosting involves combining weak classifiers to produce a strong classifier. XGBoost, a form of boosting, was developed from the existing gradient boosting technique. According to [12], XGBoost adapts gradient boosting to enhance computing speed, scalability, and generalization performance. The first step in implementing XGBoost is to organize the data: all categorical variables must be transformed into numeric form, since XGBoost only works with numeric vectors. This transformation can be completed using One Hot Encoding. The next steps are to clean the data and perform feature engineering.
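As a small illustration of this encoding step, the snippet below (with hypothetical column names, not from the study's datasets) uses pandas to expand a categorical column into 0/1 indicator columns:

```python
import pandas as pd

# Hypothetical mixed-type data; XGBoost requires numeric inputs
df = pd.DataFrame({"size": [1.2, 0.8, 1.5],
                   "colour": ["red", "blue", "red"]})

# One Hot Encoding: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["colour"])
print(list(encoded.columns))
```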
The estimated model at step $t$ can be obtained using the general additive form $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$, where $\hat{y}_i^{(t)}$ denotes the prediction at step $t$. To prevent overfitting, XGBoost minimises the regularised objective
$$\mathcal{L}^{(t)} = \sum_i l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_k \Omega(f_k),$$
where $l(y_i, \hat{y}_i)$ is a loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$, and $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$ is a regularisation term that penalises complex models. Here $\gamma$ is the minimum loss reduction needed to further partition a leaf node (governing tree pruning), $T$ is the number of leaves in the tree, $\lambda$ is a regularisation parameter that scales the penalty, and $w$ is the vector of leaf weights. A second-order expansion of the loss function then gives the objective
$$\tilde{\mathcal{L}}^{(t)} \approx \sum_i \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t),$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss function. The optimal weight of leaf $j$ can be identified as
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$
and the sum of the loss over all leaf nodes can be characterised by the loss function
$$\tilde{\mathcal{L}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T,$$
where $I_j$ refers to the set of data samples in leaf node $j$. Thus, the change in the model's performance after a node split in the decision tree can be assessed from this objective function [13].
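The leaf-weight and objective formulas above can be checked numerically. The sketch below assumes a squared-error loss (so $g_i = \hat{y}_i - y_i$ and $h_i = 1$); the function names and toy values are illustrative only, not part of the XGBoost library:

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    # Optimal leaf weight: w* = -sum(g) / (sum(h) + lambda)
    return -g.sum() / (h.sum() + lam)

def leaf_score(g, h, lam=1.0):
    # One leaf's contribution to the objective: -(1/2) G^2 / (H + lambda)
    return -0.5 * g.sum() ** 2 / (h.sum() + lam)

def split_gain(g, h, mask, lam=1.0, gamma=0.0):
    # Objective reduction from splitting a node into left (mask) and right
    parent = leaf_score(g, h, lam)
    children = leaf_score(g[mask], h[mask], lam) \
             + leaf_score(g[~mask], h[~mask], lam)
    return parent - children - gamma

# Squared-error loss: g_i = y_hat_i - y_i and h_i = 1 (toy values)
y = np.array([1.0, 1.0, 0.0, 0.0])
y_hat = np.full(4, 0.5)                       # current predictions
g, h = y_hat - y, np.ones(4)
mask = np.array([True, True, False, False])   # candidate split
print(split_gain(g, h, mask))
```

A split is kept only when this gain exceeds zero after subtracting $\gamma$, which is how the pruning parameter acts in practice.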

Logistic Regression
Logistic regression [14] is a regression technique used when the response variable is binary with mutually exclusive outcomes. In other words, logistic regression models binary outcomes by estimating the probability that a record belongs to the class of interest. The response variable has only two possible outcomes, represented by a binary indicator variable taking the values 0 and 1. A logistic regression model is made up of one or more independent variables (continuous or categorical) and one categorical dependent variable. Given $p$, the probability of success, $p$ is transformed into odds, which take the exponential form $p / (1 - p) = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}$. Taking the natural log of both sides gives the logit function $\operatorname{logit}(p) = \ln\!\big(\frac{p}{1 - p}\big) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_k$ are the coefficients of the independent variables $x_1, \ldots, x_k$.
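As a brief sketch of this model in code (the toy data are hypothetical, and the study itself used R), scikit-learn's `LogisticRegression` estimates the coefficients, and inverting the logit recovers the fitted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-predictor data (hypothetical values)
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

logit = b0 + b1 * X[:, 0]          # logit(p) = ln(p / (1 - p)) = b0 + b1*x
p = 1.0 / (1.0 + np.exp(-logit))   # invert the logit to recover p
print(np.round(p, 3))
```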

Performance Measure
A confusion matrix will be used to evaluate the output of the approaches by documenting how many instances are correctly and incorrectly classified. Next, the efficiency of the three classifiers will be determined based on their sensitivity, specificity, precision, F1-score, and g-mean. Sensitivity is the percentage of real positives that are correctly recognised, while specificity is the percentage of real negatives that are correctly recognised; the g-mean is the geometric mean of the two.
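These measures can be computed directly from the confusion-matrix counts. A minimal sketch follows (the toy labels are hypothetical), treating the minority class as the positive class 1:

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision, F1-score, and g-mean from the
    2x2 confusion matrix (positive class = 1, the minority class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)             # true positive rate
    spec = tn / (tn + fp)             # true negative rate
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    gmean = np.sqrt(sens * spec)      # geometric mean of sens and spec
    return {"sensitivity": sens, "specificity": spec,
            "precision": prec, "f1": f1, "g_mean": gmean}

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(imbalance_metrics(y_true, y_pred))
```

Unlike plain accuracy, the g-mean collapses to zero whenever a classifier ignores one of the two classes, which is why it is favoured for imbalanced data.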

Simulation Study
A Monte Carlo simulation study is performed to generate nine imbalanced datasets with different imbalance ratios of 5%, 10%, and 25% and different sample sizes of n = 100, 300, and 500. This study adapts a simulation study by [15], where the imbalanced data are simulated to suit a binary logistic regression with $x_1$ and $x_2$ as the independent variables and $y$ as the binary dependent variable. In this simulation study, the coefficient values $\beta_0$ and $\beta_1$ are set at 1.08 and 2.08, respectively, the same values chosen by Rahman and Yap [15] in their study. The dependent variable, $y$, is a binary variable with an assigned value of 0 or 1. Cases with a model-generated probability of less than or equal to 0.5 fall into class 0, while those with a probability greater than 0.5 fall into class 1. The looping stops upon reaching the imbalance ratio that has been set.
The simulation process is replicated 5,000 times for each combination of imbalance ratio and sample size. Each generated dataset is partitioned into two sets, with 70% as the training set and 30% as the testing set. The training sets are used to fit AdaBoost, XGBoost, and logistic regression models, which are then used for prediction. The performance of each model is assessed on both the training and test sets, and the mean values of sensitivity, specificity, precision, F1-score, and g-mean are obtained to assess each classifier.
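One replicate of this design can be sketched as follows. This is an illustrative reconstruction, not the authors' R code: the exact generating model and stopping rule of [15] are assumed, and the `simulate_imbalanced` helper is hypothetical.

```python
import numpy as np

def simulate_imbalanced(n, minority_ratio, b0=1.08, b1=2.08, seed=1):
    """Draw one simulated dataset: covariates come from a standard normal,
    the logistic model assigns class 1 when p > 0.5, and sampling continues
    until the target minority count is reached (assumed stopping rule)."""
    rng = np.random.default_rng(seed)
    n1 = max(1, round(minority_ratio * n))   # minority (class 1) target count
    n0 = n - n1                              # majority (class 0) target count
    rows, labels = [], []
    c0 = c1 = 0
    while c0 < n0 or c1 < n1:
        x1, x2 = rng.normal(), rng.normal()
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x1)))  # assumed linear predictor
        cls = int(p > 0.5)
        if cls == 1 and c1 < n1:
            c1 += 1
        elif cls == 0 and c0 < n0:
            c0 += 1
        else:
            continue        # quota for this class already met; discard draw
        rows.append((x1, x2))
        labels.append(cls)
    X, y = np.array(rows), np.array(labels)
    cut = int(0.7 * n)      # 70/30 train-test split, as in the study
    perm = rng.permutation(n)
    return (X[perm[:cut]], y[perm[:cut]]), (X[perm[cut:]], y[perm[cut:]])

(train_X, train_y), (test_X, test_y) = simulate_imbalanced(100, 0.05)
print(len(train_y), len(test_y), train_y.sum() + test_y.sum())
```

In the full study this generation, splitting, fitting, and scoring cycle would be wrapped in a loop over the 5,000 replicates for each of the nine ratio-by-size settings.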

Application of Real Datasets
The data used to achieve the second objective are secondary rather than primary data. The secondary data were retrieved from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html), which contains numerous collections of datasets and was created by students at UC Irvine.
In this study, the selected datasets are Glass, Ecoli, and Wifi Localization from UCI Repository. Three degrees of imbalance are tested, namely 5% (highly imbalanced), 10% (moderately imbalanced), and 25% (imbalanced). These imbalanced datasets are then compared to see how they affect the training of the classifiers.

Results
This section presents the analyses and findings of AdaBoost, XGBoost, and logistic regression on the simulation study, the descriptive analysis of the imbalanced real datasets, and the performance of each classifier model for each imbalanced real dataset. These analyses are conducted using R Statistical Software version 3.5.1. All three methods perform the worst on the dataset with a 5% minority class and a sample size of 500. We can conclude that as the sample size increases and the percentage of the minority class decreases, the classifiers fail to predict the minority cases, treating them as noise. Thus, the classifiers focus on predicting the majority cases, which they do well. Logistic regression performs the best for most of the datasets, and it outperforms the boosting methods on the datasets with a 5% minority class.

Minority 5%
A constant pattern is observed for all sample sizes, as the specificity rate is very high across all three sample sizes. The sensitivity of AdaBoost and XGBoost does not change much across the three sample sizes, suggesting that sample size does not influence their sensitivity; a larger sample size would be needed to obtain reliable results for AdaBoost and XGBoost. Meanwhile, logistic regression shows a constant pattern for all sample sizes used.

Minority 10%
A constant pattern is observed across all sample sizes. Specificity is very high across all three sample sizes, while sensitivity is much lower for all sample sizes. The precision of AdaBoost, XGBoost, and logistic regression is the lowest when the sample size is 100, and the precision of logistic regression is the lowest when the sample size is 300. However, when the sample size is increased to 500, the methods show an increase in precision. As for the F1-score and g-mean, logistic regression shows a systematic improvement as the sample size increases. Hence, logistic regression is the most stable for the datasets with a 10% minority class.

Minority 25%
A constant pattern is observed for all sample sizes. Specificity is high across all three sample sizes, and all sample sizes show relatively high sensitivity. The F1-score and g-mean are the highest when the sample size is 100 for all three methods.

AdaBoost
AdaBoost performs well when the imbalance is not severe. For instance, its sensitivity rate is the highest, at about 64.58%, for the dataset with a 25% minority class and a sample size of 500. AdaBoost's performance improves as the minority percentage increases, and its F1-score and g-mean also increase as the sample size increases to 500. AdaBoost has the best sensitivity, F1-score, and g-mean for the dataset with a sample size of 300 and a 25% minority class. In contrast, it has the lowest sensitivity, F1-score, and g-mean when the sample sizes are 100 (with a 25% minority) and 300 (with a 10% minority).

XGBoost
XGBoost performs the best when the sample size is small, achieving the highest sensitivity rate of 71.85% when n = 100 with a 25% minority class. This is followed by a sensitivity of 56.75% when n = 500 with a 25% minority class. However, XGBoost's performance is inconsistent, as the g-mean does not improve steadily as the sample size increases.

Logistic regression
Logistic regression performs the best when the imbalance ratio is not severe. It achieves the highest sensitivity rate of 65.78% when n = 100 with a 25% minority class. Logistic regression also achieves a relatively high g-mean and F1-score on the 10% minority datasets. Logistic regression performs better than the boosting algorithms for most datasets, such as the 10% and 25% minority datasets.
Table 3. Results of Performance Measure

Sensitivity
All three methods have the lowest sensitivity rate when the dataset is highly imbalanced, with a 5% minority class and a sample size of 100. All three methods start to improve, showing good performance in predicting the positive class, once the sample size reaches 500 with a 5% minority class.

Specificity
All three methods show very consistent, high specificity rates for all minority percentages and all sample sizes. However, the specificity rate for all three methods is the lowest when the sample size is set at 100 with a 25% minority class.

Precision
All three methods have the lowest precision rate when the minority class is highly imbalanced at 5% and the sample size is set at 500. All three methods start to improve, showing good performance in predicting positive cases, when the dataset is not highly imbalanced (10%) and the sample size is set at 500 onwards.

F1-score
AdaBoost and XGBoost have their lowest F1-scores when the minority class is highly imbalanced at 5% and the sample size is set at 100; logistic regression is the exception. All three methods start to improve, showing good performance in predicting positive cases, when the dataset has a 5% minority class with a sample size of 500 onwards.

G-mean
All three methods have the lowest g-mean when the minority class is highly imbalanced at 5% with a sample size of 100. All three methods start to improve, showing good performance in predicting positive cases, when the dataset has a 5% minority class with a sample size of 500 onwards. Regarding the imbalanced real datasets, all the methods perform perfectly in predicting the positive class in the glass dataset, achieving a 100% sensitivity rate. However, there is an overfitting issue in the case of AdaBoost and logistic regression, as the training set obtains better results than the testing set. Hence, from Table 4, the best method for the glass dataset is XGBoost, as it obtains the highest results for sensitivity (100%), specificity (100%), precision (100%), F1-score (100%), and g-mean (100%) on the test set.

Glass Dataset
Based on Table 5, AdaBoost, XGBoost, and logistic regression perform well and are stable in predicting positive cases. AdaBoost and XGBoost have similar performances, which are better than logistic regression's, as all the performance measures are higher for the boosting methods. However, all three methods face an overfitting issue, since the training set performs better than the testing set for each method. Finally, from Table 6, XGBoost shows the highest sensitivity and the highest values for all other performance measures in comparison to AdaBoost and logistic regression. The overall abilities of all three methods are best when the imbalance ratio is not severe, at 25%, and with a large sample size of n = 2000.

Conclusions
In this study, we have reviewed how imbalanced datasets can be handled using existing approaches, namely boosting algorithms and a traditional classification method, by comparing the performance of AdaBoost, XGBoost, and logistic regression. We applied AdaBoost, XGBoost, and logistic regression to nine simulated imbalanced datasets with three different minority percentages and three sample sizes. All three methods have an overfitting issue, as their performance on the training set is better than on the test set. The study did not find a single best method for the simulated datasets, as each minority percentage has a different best-performing method: logistic regression performed the best for the 5% minority datasets, XGBoost for the 10% minority datasets, and AdaBoost for the 25% minority datasets. For the real datasets, XGBoost performed the best for two out of three datasets, namely the ecoli and wifi datasets. The inconsistent results could have been caused by underlying issues of multicollinearity and outliers in the simulated datasets, which were not examined due to time constraints. Finally, AdaBoost, XGBoost, and logistic regression all handle imbalanced datasets well when the imbalance is not severe; however, logistic regression is the best method for severely imbalanced datasets compared to the boosting algorithms.