Robust Method in Multiple Linear Regression Model on Diabetes Patients

This paper focuses on the application of robust methods in the multiple linear regression (MLR) model to diabetes data. The objectives of this study are to identify the significant variables that affect diabetes by using the MLR model with and without robust methods, and to measure the performance of the MLR model with and without robust methods. Robust methods are used to overcome the problem of outliers in the data. Three robust methods are used in this study: least quartile difference (LQD), median absolute deviation (MAD) and the least-trimmed squares (LTS) estimator. The results show that multiple linear regression with the LTS estimator is the best model, since it has the lowest mean square error (MSE) and mean absolute error (MAE). In conclusion, plasma glucose concentration in an oral glucose tolerance test is positively affected by body mass index, diastolic blood pressure, triceps skin fold thickness, diabetes pedigree function, age and yes/no for diabetes according to WHO criteria, while it is negatively affected by the number of pregnancies. This finding can be used as a guideline for medical doctors in the early prevention of stage 2 of diabetes.


Introduction
Diabetes is defined as a disease in which the body's capability to produce or respond to the hormone insulin is reduced, resulting in abnormal metabolism of carbohydrates and raised levels of glucose in the blood. Diabetes is already a common disease and is becoming more common. Age-adjusted prevalence is set to increase from 5.9% to 7.1% (246-380 million) worldwide in the 20-79 year age group [1]. Among adults, 20% of cases of diabetes were in the urban population and 10% were in the rural population [2]. Several factors contribute to diabetes, such as age, body mass index (BMI) and central adiposity measured as waist circumference (WC) [2]. Reduced physical activity, combined with increased food intake and high sugar consumption, gives people in India a high tendency to develop insulin resistance [3].
The multiple linear regression (MLR) model is a statistical approach for describing the association between two or more quantitative variables so that the dependent variable can be predicted from the others. MLR is widely used in business, the social and behavioural sciences and many other areas [4]. MLR assumes normally distributed variables, and measurement errors necessarily cause underestimation of simple regression coefficients [5]. Robust regression helps in the detection and deletion of outliers; it is an approach that provides estimation, inference and testing that are not influenced by outlying observations but that correctly describe the structure of the data [6]. The goal of robust regression is to produce linear models that are not biased by a few outliers [7]. Other considerable studies have been carried out in statistical modelling [8,9]. The objectives of this study are to identify the significant variables that affect diabetes by using the multiple linear regression model, to apply robust regression methods to diabetes data, and to measure the performance of the robust methods by comparing the MLR model alone with the MLR model combined with a robust method.

Materials and Methods
The data were collected from the webpage of the US National Institute of Diabetes and Digestive and Kidney Diseases. They cover 332 women of Pima Indian heritage who were at least 21 years old and living near Phoenix, Arizona. There are 8 variables in this study: 1 dependent variable and 7 independent variables. The dependent variable, denoted by y, is plasma glucose concentration in an oral glucose tolerance test. This test is commonly used in general hospitals in Malaysia and other countries. In this test, a standard dose of glucose is ingested by mouth and blood glucose levels are checked two hours later; the normal reading should be below 6.1 mmol/L. The other 7 independent variables are denoted by $x_1, x_2, \ldots, x_7$.

Multiple Linear Regression
Multiple linear regression (MLR) is one of the most commonly used statistical methods. MLR is a form of predictive analysis used to explain the relationship between one dependent variable and two or more independent variables. Leona (2012) described the multiple regression model as a linear regression model with two or more predictors and one response [10]. The model equation expresses the value of the dependent variable as a linear function of two or more independent variables and an error term, as in (1) [4]:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + e_i \tag{1}$$
where $y_i$ = dependent variable or real data, $\beta_0$ = constant of MLR, $\beta_j$ = coefficient of the $j$th independent variable, $x_{ij}$ = independent variables, $e_i = y_i - \hat{y}_i$ = residual, and $\hat{y}_i$ = estimated data from the MLR model.
Before the MLR can be done, two assumptions need to be checked. Firstly, the normality of y versus all $x_j$ should be tested by using a P-P plot or by numerical calculation. In a P-P plot, points lying close to a straight line indicate a normal distribution. Next, a multicollinearity test among the $x_j$ variables should be done, because multicollinearity can affect the accuracy of the least squares estimates of the model. A variance inflation factor (VIF) value of less than 10 indicates no multicollinearity among the $x_j$ variables [4].
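The least squares fit and the VIF check described in this section can be sketched as follows. This is a minimal illustration on synthetic data, not the diabetes data set; the helper names `fit_ols` and `vif` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares fit of y = b0 + b1*x1 + ... + bk*xk; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all of the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        coef = fit_ols(others, target)
        fitted = coef[0] + others @ coef[1:]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (ss_res / ss_tot))  # equals 1 / (1 - R^2)
    return np.array(out)

# Synthetic stand-in data with three independent predictors (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=200)

beta = fit_ols(X, y)   # estimates of the constant and the three coefficients
vifs = vif(X)          # values below 10 suggest no multicollinearity
```

Because the synthetic predictors are generated independently, their VIF values should be close to 1, well under the threshold of 10 quoted above.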

Least Quartile Difference (LQD) Method
LQD is a highly robust regression estimator, since it can resist up to 50% of largely deviant data values without becoming too biased. Since LQD has a breakdown point of almost 50%, it is expected to cope with unusual observations and to perform well when the data are not contaminated. The LQD estimator minimizes the first quartile $Q_1$ of the absolute pairwise differences of the residuals, $|e_i - e_k|$, as in (2) [11]. The LQD estimator of $\beta$ is defined by
$$\hat{\beta}_{LQD} = \arg\min_{\beta} \; Q_1\{\, |e_i(\beta) - e_k(\beta)| : i < k \,\} \tag{2}$$
In this method, the residuals are sorted first. Then the upper 25% and the lower 25% of the data are discarded, and the remaining 50% of the data are analysed using the MLR model. The resulting model is then applied to all the data in order to find the new MSE and MAE values.
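The sort, discard and refit steps described above can be sketched as follows. This follows the procedure as worded in the text rather than a full search for the LQD optimum, and it runs on synthetic data with artificially shifted points; the function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def quartile_trimmed_refit(X, y):
    """Sort the initial residuals, drop the lowest and highest 25%,
    and refit the model on the middle 50% of the observations."""
    resid = y - predict(fit_ols(X, y), X)
    order = np.argsort(resid)           # indices from smallest to largest residual
    cut = len(y) // 4                   # size of each discarded quarter
    keep = np.sort(order[cut:len(y) - cut])
    return fit_ols(X[keep], y[keep]), keep

# Synthetic stand-in data: y = 1 + 2x + noise, with every 10th point shifted up.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
X = x.reshape(-1, 1)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
outliers = np.arange(0, 100, 10)
y[outliers] += 50.0

coef_ols = fit_ols(X, y)
coef_trim, keep = quartile_trimmed_refit(X, y)

# As in the text, the refitted model is scored on all of the data.
errors = y - predict(coef_trim, X)
mse_all = np.mean(errors ** 2)
mae_all = np.mean(np.abs(errors))
```

In this toy setting the shifted points have the largest residuals, so they fall in the discarded upper quarter and the refitted intercept moves back towards its true value of 1.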

Median Absolute Deviation (MAD) Method
Median absolute deviation (MAD) is one of the most familiar robust measures. MAD is defined as the median of the absolute deviations of the observations from the overall median of the data set. MAD is a robust measure of the variability of a univariate sample of quantitative data; the term also refers to the population parameter that is estimated by the MAD calculated from a sample, as in (3) [12]. For a univariate sample $z_1, \ldots, z_N$,
$$\mathrm{MAD} = \mathrm{median}_i \left| z_i - \mathrm{median}_j(z_j) \right| \tag{3}$$
MAD is an estimator of the spread in data and has an approximately 50% breakdown point like the median.
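As a small worked example of this definition, the sketch below computes the MAD of a toy sample and flags values that lie far from the median. The scaling MADe = 1.4826 × MAD is the usual normal-consistency factor; the cutoff of 3 and the function names are illustrative assumptions, not values taken from this study.

```python
from statistics import median

def mad(values):
    """Median of the absolute deviations from the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)

def flag_outliers(values, cutoff=3.0):
    """Indices whose distance from the median exceeds cutoff * MADe,
    where MADe = 1.4826 * MAD (the cutoff is an illustrative choice)."""
    centre = median(values)
    made = 1.4826 * mad(values)
    return [i for i, v in enumerate(values) if abs(v - centre) > cutoff * made]

sample = [1, 2, 3, 4, 100]
print(mad(sample))            # 1
print(flag_outliers(sample))  # [4]
```

The single extreme value 100 barely moves the median (3) or the MAD (1), which is why it stands out so clearly against them.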

Least-Trimmed Squares (LTS) Method
Least-trimmed squares (LTS) is a robust estimator with a 50% breakdown point. This estimator resists the influence of outliers provided that the outliers make up no more than 50% of the data, and it can be represented as in (4) and (5) [13].
Ordering the squared residuals from smallest to largest:
$$(e^2)_{(1)} \le (e^2)_{(2)} \le \cdots \le (e^2)_{(N)} \tag{4}$$
The LTS estimator chooses the regression coefficients b to minimize the sum of the smallest m of the squared residuals,
$$\mathrm{LTS}(b) = \sum_{i=1}^{m} (e^2)_{(i)} \tag{5}$$
where m is a little more than half of the observations and $(e^2)_{(i)}$ = the $i$th smallest squared residual.
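A minimal sketch of the LTS idea follows, assuming NumPy and synthetic data. It evaluates the trimmed objective and performs one trimming pass: fit, keep the m smallest squared residuals, refit. The optional repetition of this pass (`n_steps`) is an assumption borrowed from common LTS algorithms, and m = N // 2 + 1 is only a placeholder for "a little more than half".

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def squared_residuals(coef, X, y):
    return (y - (coef[0] + X @ coef[1:])) ** 2

def trimmed_sse(coef, X, y, m):
    """LTS objective: the sum of the m smallest squared residuals."""
    return np.sort(squared_residuals(coef, X, y))[:m].sum()

def lts_refit(X, y, m, n_steps=1):
    """Start from the full OLS fit, then keep the m observations with the
    smallest squared residuals and refit; repeat n_steps times."""
    coef = fit_ols(X, y)
    keep = np.arange(len(y))
    for _ in range(n_steps):
        sq = squared_residuals(coef, X, y)
        keep = np.sort(np.argsort(sq)[:m])
        coef = fit_ols(X[keep], y[keep])
    return coef, keep

# Synthetic stand-in data: y = 1 + 2x + noise, with every 10th point shifted up.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
X = x.reshape(-1, 1)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
outliers = np.arange(0, 100, 10)
y[outliers] += 50.0

m = len(y) // 2 + 1                 # placeholder for "a little more than half"
coef_ols = fit_ols(X, y)
coef_lts, keep = lts_refit(X, y, m)
```

Each trimming pass cannot increase the trimmed sum of squares, because the refit is the least squares optimum on the retained subset.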

Cross Validation Technique
Cross validation is a technique for evaluating how well the results of a predictive model will match the real data set. It is used to evaluate how accurate a predictive model is when it is applied to real data [14]. In this study, only two cross-validation measures are used, as in (6) and (7).

Mean Square Error (MSE)
MSE is represented as in (6), where $y_i$ is the real data, $\hat{y}_i$ is the predicted data and N is the number of observations.
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{6}$$

Mean Absolute Error (MAE)
MAE is represented as in (7), where $y_i$ is the real data, $\hat{y}_i$ is the predicted data and N is the number of observations.
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \tag{7}$$
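The two error measures can be computed directly from paired lists of real and predicted values. The sketch below assumes the usual definitions that divide by N; the three data points are made up purely for illustration.

```python
def mse(actual, predicted):
    """Mean square error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 120, 140]       # hypothetical real values
predicted = [110, 115, 150]    # hypothetical model outputs

print(mse(actual, predicted))  # 75.0
print(mae(actual, predicted))  # 8.333...
```

Here the errors are -10, 5 and -10, so the MSE is (100 + 25 + 100) / 3 = 75 and the MAE is (10 + 5 + 10) / 3 ≈ 8.33. Squaring makes the MSE more sensitive to large errors than the MAE.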

Multiple Linear Regression
Referring to Figure 1, the P-P plot shows that the data lie close to a straight line, which indicates that the distribution of y versus all $x_j$ is normal. Since the VIF value for all $x_j$ is less than 10, multicollinearity among the $x_j$ variables does not exist. So, the two initial assumptions of MLR are satisfied. The MLR model equation, in which all variables are included without exception, is given in (8). The MSE of the MLR model is 650.042, whereas the MAE of the MLR model is 20.5234. Using the studentized residual test, 12 points are identified as outliers. This is why a robust method is needed in this study.
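One common way to carry out a studentized residual check is sketched below, assuming NumPy and synthetic data. It uses internally studentized residuals, $e_i / (s\sqrt{1 - h_{ii}})$, and an illustrative cutoff of 3; the exact variant and cutoff used in this study are not stated, so both are assumptions.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals e_i / (s * sqrt(1 - h_ii))."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    Q, _ = np.linalg.qr(A)                  # leverages h_ii are the row sums of Q**2
    leverage = np.sum(Q ** 2, axis=1)
    dof = len(y) - A.shape[1]               # observations minus fitted parameters
    s = np.sqrt(np.sum(resid ** 2) / dof)   # residual standard error
    return resid / (s * np.sqrt(1.0 - leverage))

# Synthetic stand-in data with one point shifted far from the trend.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=1.0, size=50)
y[7] += 20.0

r = studentized_residuals(X, y)
flagged = np.flatnonzero(np.abs(r) > 3.0)   # indices of suspected outliers
```

Dividing by $\sqrt{1 - h_{ii}}$ puts residuals from high-leverage and low-leverage points on a comparable scale before the cutoff is applied.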

Least Quartile Difference (LQD) Method
First, the residual values are sorted from smallest to largest, and then the upper 25% and the lower 25% of the data are removed. The remaining data are used to build the new model, which is stated in (9).
$$\hat{y} = 56.990 + 0.311x_1 - 1.041x_2 + 0.318x_3 + 0.160x_4 + 9.711x_5 + 0.407x_6 + 28.604x_7 \tag{9}$$
The new model equation in (9) is applied to the original data, and then the MSE and MAE are calculated. After using the LQD method, the MSE value is 641.9068 and the MAE value is 20.4288.

Median Absolute Deviation (MAD) Method
The analysis uses the MADe method. The new model equation in (10) is applied to the original data and the values of MSE and MAE are calculated. The values of MSE and MAE obtained are 652.9255 and 20.53357 respectively.

Least-Trimmed Squares (LTS) Method
In the LTS estimator method, the squared residuals are sorted in ascending order in Microsoft Excel. Then 116 of the data were removed and the remaining 171 data are used to build the new model by using equation (5). The new model equation in (11) is as follows,
$$\hat{y} = 60.6 + 0.261x_1 - 0.966x_2 + 0.290x_3 + 0.172x_4 + 10.9x_5 + 0.378x_6 + 28.8x_7 \tag{11}$$
The new model in (11) is then applied to the original data, and the MSE and MAE values calculated are 640.2429 and 20.4255 respectively.
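Applying a fitted model to data means substituting each observation into the equation. The sketch below does this for the coefficients reported in (11); the function name and the input values are hypothetical placeholders, used only to show the arithmetic.

```python
# Coefficients of model (11): the constant, then the multipliers of x1 ... x7.
INTERCEPT = 60.6
COEFFICIENTS = [0.261, -0.966, 0.290, 0.172, 10.9, 0.378, 28.8]

def predict_glucose(x):
    """Evaluate model (11) for one observation x = [x1, ..., x7]."""
    if len(x) != len(COEFFICIENTS):
        raise ValueError("expected 7 independent variables")
    return INTERCEPT + sum(b * v for b, v in zip(COEFFICIENTS, x))

# Placeholder inputs (not a real patient): every variable set to 1.
print(round(predict_glucose([1, 1, 1, 1, 1, 1, 1]), 3))  # 100.435
```

Running the same function over every row of the original data gives the predicted values from which the MSE and MAE are then calculated.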

Conclusions
In order to measure the effectiveness of the models, the methods are compared. The MSE and MAE values of the original MLR model, MLR with LQD, MLR with MAD and MLR with the LTS estimator were compared. Based on these values, it is concluded that the MLR model with the LTS estimator is the best model, since it has the lowest MSE and MAE. In conclusion, plasma glucose concentration in an oral glucose tolerance test is positively affected by increases in body mass index, diastolic blood pressure, triceps skin fold thickness, diabetes pedigree function, age and yes/no for diabetes according to WHO criteria. In fact, diabetes pedigree function and yes/no for diabetes according to WHO criteria have the highest impact on plasma glucose concentration in an oral glucose tolerance test. In addition, plasma glucose concentration in an oral glucose tolerance test is negatively affected by the number of pregnancies. This result can be used as a guideline for medical doctors in the early prevention of stage 2 of diabetes.
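The comparison can be reproduced from the MSE and MAE values reported in the results. The short sketch below collects those four pairs and picks the model with the smallest value of each measure; the dictionary layout and labels are illustrative.

```python
# MSE and MAE values as reported for each model in the results.
results = {
    "MLR":       {"MSE": 650.042,  "MAE": 20.5234},
    "MLR + LQD": {"MSE": 641.9068, "MAE": 20.4288},
    "MLR + MAD": {"MSE": 652.9255, "MAE": 20.53357},
    "MLR + LTS": {"MSE": 640.2429, "MAE": 20.4255},
}

best_by_mse = min(results, key=lambda name: results[name]["MSE"])
best_by_mae = min(results, key=lambda name: results[name]["MAE"])

print(best_by_mse)  # MLR + LTS
print(best_by_mae)  # MLR + LTS
```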