A Modified Robust Support Vector Regression Approach for Data Containing High Leverage Points and Outliers in the Y-direction

The support vector regression (SVR) model is currently a very popular non-parametric method for estimating linear and non-linear relationships between response and predictor variables. However, vertical outliers may be selected as support vectors, which can unduly affect the regression estimates. Outliers arising from abnormal data points may result in poor predictions, and the problem is further complicated when both vertical outliers and high leverage points are present in the data. In this paper, we propose a modified robust SVR method that extends the Double SVR (DSVR) algorithm by integrating three SVR formulations (ε-SVR, ν-SVR and ε-BSVR) with different kernel functions.


Introduction
The SVR model was first introduced by [1]. It builds on the powerful concept of the support vector machine (SVM), a well-established framework in statistical learning. With a unique, global optimal solution, SVR has become popular in recent years due to its excellent performance and high generalization capability [2]. Some of the reasons for its widespread use include reduced sensitivity to local minima, theoretical performance guarantees, and the flexibility to add dimensions to the input space without increasing the model's complexity [3].
The superiority of SVR lies in its ability to approximate nonlinear relationships through the kernel trick while constructing a sparse model to tackle regression problems [3]. To explain this, consider a data set in which an input is $x \in \mathbb{R}^n$ and a target output is $y \in \mathbb{R}$. In the SVR method, the input vector $x$ is first mapped onto a high-dimensional feature space that is nonlinearly connected to the input space. The idea is to use the kernel trick so that the nonlinear relationship in the original input space becomes linear in the high-dimensional predictor space [4][5]. The regression function is given by

$$f(x) = w^\top \varphi(x) + b, \qquad (1)$$

where $\varphi(\cdot)$ is a non-linear mapping, and $w$ and $b$ correspond to the weight vector and the bias. The goal of SVR is to estimate the values of the parameters $w$ and $b$ that optimize the expected risk by minimizing the ε-insensitive loss function

$$L_\varepsilon\big(y, f(x)\big) = \max\big(0,\, |y - f(x)| - \varepsilon\big). \qquad (2)$$

In other words, SVR tries to minimize a bound on the generalization error, so that good generalization is achieved instead of merely minimizing the training error [6]. The method is characterized by the use of kernel functions, a sparse solution, and a margin and number of support vectors controlled through Vapnik-Chervonenkis (VC) theory [7]. The key terminologies in SVR are shown in Figure 1.

According to [8], most real-world data applications are subject to anomalies and noise, a common issue that leads to misleading and false conclusions. Barnett and Lewis [9] describe an outlier as an observation that appears to deviate markedly from the other members of the sample in which it occurs. Outliers are observations that are far from the bulk of the rest of the data [10]. In SVR, it is possible for outlying observations (outliers and high leverage points) to be selected as support vectors, which may affect the estimation process [11][12]. It is, therefore, useful to implement robust methods in SVR to remedy both outliers and high leverage points.
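As a concrete illustration of the ε-insensitive loss in (2), the following minimal Python sketch evaluates the loss on a few residuals; the numbers are illustrative only.

```python
import numpy as np

def eps_insensitive_loss(y, f, eps):
    """epsilon-insensitive loss (2): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y - f) - eps)

# Residuals of 0.05 and 0.08 fall inside an eps = 0.1 tube and cost nothing;
# the residual of 0.30 is penalized only by the 0.20 that exceeds the tube.
y = np.array([1.00, 2.00, 3.00])
f = np.array([1.05, 1.92, 3.30])
print(eps_insensitive_loss(y, f, eps=0.1))  # -> [0.  0.  0.2]
```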
Several studies have applied SVR to data containing outliers. For instance, Jordaan and Smits [13], taking advantage of the Lagrange multipliers, extended the traditional SVR method to outlier detection using the Karush-Kuhn-Tucker (KKT) conditions. The main drawbacks of this approach are high computational cost, difficult operation for non-expert users, and the possible emergence of masking and swamping problems. To solve the problems of the standard method, [14] suggested the µ-ε-SVR outlier detection procedure, which uses a new regularization parameter (abbreviated as µ). The possible shortcomings of this method are detection of only one outlier per iteration, high computational cost, and the lack of a clear rule for selecting the threshold value ε, which complicates the approach. To avoid the disadvantages of the above techniques, [15] presented a practical technique for outlier detection using non-sparse ε-SVR, which minimizes the time cost and incorporates fixed parameters. The robust SVR based on this approach is often referred to as Double SVR (DSVR). The objective of this study is to modify the DSVR technique by integrating three types of SVR models, i.e. epsilon-SVR (ε-SVR), nu-SVR (ν-SVR) and epsilon bound-constrained SVR (ε-BSVR), with different kernel functions into the algorithm.

Standard SVR Models
The SVR concept was suggested by [1] to solve function-fitting problems and is called ε-SVR. Given a set of data points $\{(x_1, y_1), \dots, (x_n, y_n)\}$ such that $x_i \in \mathbb{R}^n$ is an input and $y_i \in \mathbb{R}$ is a target output, the ε-SVR optimization problem is

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
$$\text{subject to} \quad y_i - w^\top \varphi(x_i) - b \le \varepsilon + \xi_i, \quad w^\top \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \qquad (3)$$

This can be solved as a quadratic programming problem.
where $C$ is the parameter that controls the influence of the error, $\xi_i, \xi_i^*$ are slack variables and $b$ is the bias term.
In [16], the authors introduced an improved SVR model, called nu-SVR (ν-SVR), which uses a parameter ν to control the amount of training error and the number of support vectors. In other words, ν specifies the proportion of support vectors to be retained in the solution relative to the total number of samples in the dataset; its value lies in the interval (0, 1), that is, 0 < ν < 1. After replacing the fixed ε with the parameter ν (ε itself becomes a variable to be optimized), (3) is modified as shown in (4):

$$\min_{w,\, b,\, \xi,\, \xi^*,\, \varepsilon} \ \frac{1}{2}\|w\|^2 + C \Big( \nu\varepsilon + \frac{1}{n} \sum_{i=1}^{n} (\xi_i + \xi_i^*) \Big), \qquad (4)$$

subject to the same constraints as (3) together with $\varepsilon \ge 0$. The difference between ε-SVR and ν-SVR is that in ν-SVR we can control the number of support vectors, whereas in ε-SVR we control the amount of error in the model instead. Finally, the method proposed in [17], in which the squared bias term is added to the objective function, is known as the epsilon bound-constrained SVR (ε-BSVR) formulation. The optimization problem of ε-BSVR is

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + \frac{1}{2} b^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*), \qquad (5)$$

subject to the same constraints as (3). This can be solved through Wolfe's dual as a quadratic programming problem with box constraints only. The parameters of ε-BSVR are the same as those of ε-SVR, except that the bias term $b$ is restricted to a given range.
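As a hedged illustration of the difference between the two formulations, the following sketch fits ε-SVR and ν-SVR with scikit-learn (one common implementation; ε-BSVR has no scikit-learn counterpart, so it is omitted). The data and parameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# eps-SVR: the tube width eps is fixed in advance; the number of support
# vectors then follows from the data.
eps_svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# nu-SVR: nu in (0, 1] upper-bounds the fraction of training errors and
# lower-bounds the fraction of support vectors; eps is found automatically.
nu_svr = NuSVR(kernel="rbf", C=10.0, nu=0.3).fit(X, y)

print(len(eps_svr.support_), len(nu_svr.support_))  # support-vector counts
```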

Kernel Functions
To employ SVR models, kernel functions are used. The kernel function $K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$ measures the dot product of two vectors $x_i$ and $x_j$ in the feature space. In the SVR procedure, it provides the nonlinear mapping of the input space into the high-dimensional predictor space. Several kernel function types are employed in SVR, and the choice of kernel affects both the estimates and the computational complexity of the resulting SVR. Table 1 outlines the eight common kernel function types considered in this study.
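As an illustration of plugging a kernel outside the built-in set into an off-the-shelf solver, the sketch below implements one common form of the Laplace kernel, $K(x, x') = \exp(-\|x - x'\|/\sigma)$, which is used later with the Belgian phone data; scikit-learn is assumed as the implementation, and it accepts a callable that returns the Gram matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVR

def laplace_kernel(X, Y, sigma=1.0):
    """One common form of the Laplace kernel: K(x, x') = exp(-||x - x'|| / sigma)."""
    return np.exp(-cdist(X, Y, metric="euclidean") / sigma)

# scikit-learn's SVR accepts a callable returning the Gram matrix, which is
# how kernels outside the built-in set (linear, poly, RBF, sigmoid) plug in.
X = np.linspace(0, 6, 60).reshape(-1, 1)
y = np.sin(X).ravel()
model = SVR(kernel=lambda A, B: laplace_kernel(A, B, sigma=1.0),
            C=100.0, epsilon=0.1).fit(X, y)
```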

Double SVR (DSVR) Algorithm
The Double SVR (DSVR) was proposed by [12]. It is a practical method for detecting influential observations that considers three aspects: the type of transformation, sparseness and robustness. Its efficiency arises from the fact that it decreases the computational cost and can identify outliers without deleting them; outlying points are instead handled by minimizing their weights. The DSVR is based on the fixed-parameters SVR (FPSVR) method [8], whose benefit is to fix the free parameters ε, C and σ, where σ is the hyperparameter of the RBF kernel.
The DSVR method based on fixed parameters can be summarized as follows:

Step 1: Based on FPSVR (the fixed parameters of ε-SVR are ε = 0, C = 10,000 and σ = 1), find the fitted values of the ε-SVR model:

$$\hat{y}_i = \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, K(x_j, x_i) + b, \qquad (7)$$

where $K(\cdot,\cdot)$ is the kernel function and $b$ is the constant (bias term). Alternatively, find the absolute residuals based on FPSVR (the fixed parameters are ε = 0, C = 0.0001 and σ = 1 in the case of using residuals z). The absolute residuals are given in (8):

$$z_i = |y_i - \hat{y}_i|. \qquad (8)$$
Step 2: Any point with an absolute value z larger than the cutoff point is considered an outlier. The cutoff point is given in (9):

$$CP = \operatorname{median}(z_i) + 2\sqrt{\operatorname{var}(z_i)}, \qquad i = 1, 2, \dots, n. \qquad (9)$$

Step 3: Compute the weight function as given in (10). A sketch of Steps 1-3 follows.
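The sketch below implements Steps 1-3 under stated assumptions: scikit-learn's SVR is the assumed implementation, the mapping of σ to scikit-learn's gamma, the reconstruction of the cutoff in (9), and the downweighting form used for (10) are all assumptions rather than the paper's exact specification.

```python
import numpy as np
from sklearn.svm import SVR

def dsvr_flag_outliers(X, y, use_residuals=True, sigma=1.0):
    """Steps 1-2 of DSVR with the fixed parameters from the text:
    z = |y - yhat| from an FPSVR with C = 0.0001, or z = |yhat| from an
    FPSVR with C = 10,000 (eps = 0 and sigma = 1 in both cases)."""
    C = 0.0001 if use_residuals else 10_000
    # Assumed mapping of sigma to scikit-learn's gamma: K = exp(-||.||^2 / (2 sigma^2)).
    fps = SVR(kernel="rbf", C=C, epsilon=0.0, gamma=1.0 / (2.0 * sigma**2))
    yhat = fps.fit(X, y).predict(X)
    z = np.abs(y - yhat) if use_residuals else np.abs(yhat)   # (8) or |fitted|
    cp = np.median(z) + 2.0 * np.sqrt(np.var(z))              # cutoff (9), as reconstructed
    outlier = z > cp                                          # Step 2
    w = np.where(outlier, cp / np.maximum(z, 1e-12), 1.0)     # assumed form of (10)
    return z, cp, outlier, w
```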

The Proposed Robust SVR Method
Since the standard SVR models are not robust against outliers and high leverage points, the DSVR was developed. Nevertheless, the DSVR considers only ε-SVR with the RBF kernel. Hence, we improve the DSVR by combining all three forms of SVR, with different kernel functions, into the algorithm. The hyperparameter of any kernel is abbreviated as h, and in this method its value is set equal to 1. If the kernel function has more than one hyperparameter, as the polynomial kernel does, they are all set equal to 1. The proposed robust SVR approach based on fixed parameters can be summarized as follows:

Step 1: Based on FPSVR (in ε-SVR and ε-BSVR the fixed parameters are ε = 0, C = 10,000 and h = 1, and in ν-SVR, ε = 0, ν = 1 and h = 1), find the fitted values of the selected SVR model.
The parameters C and ν take these values in (12) and (13) because, in both equations, the fitted values and estimated errors need to be large enough to reveal the vertical outliers.
Step 2: Any point with an absolute value z larger than the cutoff point is considered an outlier.

Step 3: The suspect high leverage points are detected using the robust Mahalanobis distance (RMD) based on the Minimum Volume Ellipsoid (MVE) developed by [20]:

$$RMD_i = \sqrt{\big(x_i - T_R(X)\big)^\top C_R(X)^{-1} \big(x_i - T_R(X)\big)}, \qquad i = 1, 2, \dots, n,$$

where $T_R(X)$ and $C_R(X)$ are the robust location and shape estimates of the MVE, respectively. Rahmatullah Imon [21] suggested a cut-off point for the robust Mahalanobis distances:

$$CP(RMD) = \operatorname{median}(RMD_i) + 3\,\operatorname{MAD}(RMD_i).$$

The RMD is only valid for low-dimensional data (p < n), where p is the number of predictor variables and n is the number of data points. A sketch of this step is given below.
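A minimal sketch of Step 3. scikit-learn does not ship an MVE estimator, so the closely related Minimum Covariance Determinant (MCD) is used here as a stand-in for the robust location and shape estimates (an assumption; in R, MASS::cov.rob(method = "mve") would give MVE directly).

```python
import numpy as np
from sklearn.covariance import MinCovDet

def flag_high_leverage(X, random_state=0):
    """Robust Mahalanobis distances with a median + 3*MAD cutoff.

    MCD stands in for the paper's MVE location/shape estimates (assumed)."""
    mcd = MinCovDet(random_state=random_state).fit(X)
    rmd = np.sqrt(mcd.mahalanobis(X))                 # robust Mahalanobis distances
    mad = np.median(np.abs(rmd - np.median(rmd))) / 0.6745
    cutoff = np.median(rmd) + 3.0 * mad               # Imon-type cutoff
    return rmd, cutoff, rmd > cutoff
```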
Step 4: Compute the weights for the detected vertical outliers and high leverage points (HLPs), where z is the estimated, predicted or fitted values, as in (12).
Step 5: Estimate the final robust SVR model by refitting the selected SVR model with the computed weights. An end-to-end sketch of the procedure follows.
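The sketch below ties Steps 1-5 together, assuming the dsvr_flag_outliers and flag_high_leverage helpers from the previous sketches are in scope; the weight floor applied to high leverage points is an assumed choice, since the exact weight function is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVR

def robust_svr(X, y, kernel="rbf", eps=0.1, C=50.0):
    """End-to-end sketch of the proposed method (Steps 1-5)."""
    _, _, _, w = dsvr_flag_outliers(X, y)        # Steps 1-2: vertical outliers
    _, _, hlp = flag_high_leverage(X)            # Step 3: high leverage points
    w = np.where(hlp, np.minimum(w, 0.1), w)     # Step 4: downweight HLPs
                                                 # (the 0.1 floor is an assumed choice)
    return SVR(kernel=kernel, C=C, epsilon=eps).fit(X, y, sample_weight=w)  # Step 5
```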

Performance Measures
To measure accuracy and determine the best SVR model, this study considers the following three types of errors, as stated in [22]:
1. Training error: the error obtained when the trained model is run on the training data.
2. Test error: the error obtained when the trained model is run on the test data.
3. k-fold cross-validation error: the average error over k folds, where the model is trained on the remaining folds and evaluated on each held-out fold in turn.
A minimal illustration of the three error types is sketched below.
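The sketch uses scikit-learn on synthetic data (both assumptions); the 70/30 split mirrors the one used later for the HBK data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 2))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(X_tr, y_tr)

train_err = np.mean((y_tr - model.predict(X_tr)) ** 2)   # 1. training error
test_err = np.mean((y_te - model.predict(X_te)) ** 2)    # 2. test error
cv_err = -cross_val_score(SVR(kernel="linear", C=10.0, epsilon=0.1),
                          X, y, cv=10,
                          scoring="neg_mean_squared_error").mean()  # 3. 10-fold CV
```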
Also, we can determine the goodness of the model by using the following measures:
1. The R-squared (R²) gives the percentage of the total variation of the response variable that is explained by the predictor variable(s):

$$R^2 = 1 - \frac{RSS}{TSS},$$
where RSS is Residual Sum of Squares and TSS is Total Sum of Squares.
2. The predicted R-squared (R²pred) reveals how well a regression model predicts outcomes for new data:

$$R^2_{pred} = 1 - \frac{PRESS}{TSS},$$
where PRESS is Predicted Residual Error Sum of Squares.
Finally, the merits of the selected SVR, the DSVR and the proposed robust SVR methods are assessed using the following performance measures:
1. Mean Square Error (MSE), used as an efficiency test to determine which method is best in various cases:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad (22)$$

2. Root Mean Square Error, $RMSE = \sqrt{MSE}$, (23)
where $\hat{y}_i$ is the fitted value and n is the number of data points. A helper implementing these measures is sketched below.
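A small helper collecting these measures; the predicted R² below is computed from held-out predictions as a PRESS-style approximation, which is an assumption about the exact PRESS computation.

```python
import numpy as np

def performance(y, y_hat, y_new=None, y_new_hat=None):
    """MSE and RMSE as in (22)-(23), R^2, and a predicted R^2 computed from
    held-out predictions (a PRESS-style approximation, assumed here)."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    mse = rss / len(y)
    out = {"MSE": mse, "RMSE": np.sqrt(mse), "R2": 1.0 - rss / tss}
    if y_new is not None:
        press = np.sum((y_new - y_new_hat) ** 2)   # prediction error sum of squares
        out["R2_pred"] = 1.0 - press / tss
    return out
```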

Belgian Phone Data Set
The Belgian phone data, collected from the Belgian Statistical Survey, is a real data set containing vertical outliers. It consists of the total number of international phone calls (in tens of millions) made between 1950 and 1973 [20]. The cross-validation error is determined using 10-fold cross-validation, which we used only to select the best model for this data set.
The results in Tables 2, 3 and 4 present the performance of the three SVR types with eight kernel functions on the Belgian phone dataset, tuning the parameters ε, C and ν by grid search with 10-fold cross-validation. The optimum parameters obtained for ε-SVR and ε-BSVR by the grid search are ε = 0.3 and C = 251, while for ν-SVR the optimum parameters are ε = 0.4 and ν = 0.1. From these results we find that the MSE is lowest when the Laplace kernel is used. In addition, the coefficient of determination (R²) is presented to measure the goodness of each model. For this data set, ε-SVR with the Laplace kernel is the best model, since its MSE and cross-validation errors are minimal and, compared with the other SVR models, its R² is also relatively high. Figure 2 displays the outlier detection of the ε-SVR technique with the Laplace kernel using fixed parameters (C = 10,000, σ = 1 and ε = 0) and the median absolute deviation of the absolute fitted values. We can see that cases 15-20 are vertical outliers. A sketch of the tuning step follows.
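The sketch below mirrors the grid-search tuning on a synthetic stand-in for the phone series (the real values are not reproduced here); the grid includes the reported optimum C = 251 but is otherwise illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the phone series (year vs. calls); cases 15-20 are
# shifted upward to mimic the vertical outliers in the real data.
rng = np.random.default_rng(2)
X = np.arange(1950, 1974, dtype=float).reshape(-1, 1)
y = 0.1 * (X.ravel() - 1950) + rng.normal(scale=0.05, size=24)
y[14:20] += 10.0

grid = GridSearchCV(
    SVR(kernel=lambda A, B: np.exp(-cdist(A, B) / 1.0)),   # Laplace, sigma = 1
    param_grid={"epsilon": [0.1, 0.2, 0.3, 0.4], "C": [1, 10, 100, 251]},
    cv=10, scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```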
In Figure 3, we compare the performance of the three methods under different values of the parameters (ε = 0.1, 0.2, 0.3; C = 1, 20, 50; and σ = 0.5, 1, 5). Values of C greater than 50 give approximately the same results as those used. Figure 3 shows that ε-SVR is less efficient at reducing the estimation effects of the outliers, whereas the proposed robust SVR and DSVR methods compete with each other. In this data set, however, our proposed method is marginally better than DSVR, attaining the lowest MSE and RMSE values.

Hawkins-Bradu-Kass Data Set
In this example, we use the well-known artificial data set created by [23], referred to as the Hawkins-Bradu-Kass (HBK) data set. It consists of 75 observations with one dependent and three independent variables; the first 14 cases are influential observations. Seventy percent of the data set is randomly selected as training data and the remaining 30 percent as test data. The cross-validation error is again determined using 10-fold cross-validation; we used this split only to select the appropriate model and used the complete data set for the rest of the analysis.
Tables 5, 6 and 7 show the results of the three forms of SVR with eight kernel functions on the HBK data set, tuning the parameters ε, C and ν by grid search with 10-fold cross-validation. The optimum parameters obtained for ε-SVR and ε-BSVR are ε = 0 and C = 15, while the optimum parameters for ν-SVR are ε = 0.2 and ν = 0.1. Three types of errors, i.e. training, test and cross-validation errors, are considered to optimize the accuracy of model selection. From the results, we can see that these errors are small when the RBF (Gaussian) kernel is used. Moreover, the predicted R² (the R² of the test data) is provided to determine the goodness of fit of each model. Based on these findings, the suitable SVR model for this data set is ε-BSVR with the RBF kernel, as its test and cross-validation errors are minimal and its predicted R² is reasonably large compared with the other SVR models.
The results in Figure 4, using RMD based on MVE, illustrate the detection of high leverage points in the HBK data set. We can observe that cases 1-14 are detected as influential observations.
In Figure 5, we consider different values of the parameters (ε = 0.1, 0.2, 0.3; C = 1, 50, 100; and σ = 0.5, 1, 5) to evaluate the performance of the three methods. Values of C greater than 100 yield approximately the same results as those shown. Furthermore, the results in Figure 5 clearly show the efficiency of the proposed robust SVR approach over the DSVR and ε-BSVR techniques in terms of achieving lower MSE and RMSE values. Based on these outcomes, the proposed robust SVR method is recommended for the HBK data set.

Simulation Study
Two simulation studies are conducted to compare the merits of our newly developed robust SVR technique with those of existing methods (standard SVR and DSVR) in the presence of outliers and high leverage points. The first simulation study deals with the linear case and the second with the nonlinear case. Both were performed using the R software.

Simulation I: Linear Case
A linear regression model with two independent variables and different sample sizes, namely n = 50, 100 and 150, is considered in this simulation study. For each sample size, the following relationship is used to produce the clean data [24]:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i.$$

To obtain contaminated data, a percentage of the observations is replaced by vertical outliers and high leverage points, with x₁ and x₂ otherwise the same as in the clean data. We considered different contamination percentages (10%, 15% and 20%) and different values of the parameters (ε = 0.1, 0.3 (as small as possible) and C = 50, 100). For this simulation study, ε-SVR with a linear kernel is considered. For each of the 1000 simulation runs, MSE and RMSE are computed as in (22) and (23) for the ε-SVR, DSVR and proposed robust SVR methods, and their averages over the 1000 iterations are recorded. These averages are used to assess the merits of the methods. A sketch of the data-generating scheme is given below.
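A sketch of the data-generating scheme under stated assumptions: the coefficients, error scale and shift sizes are illustrative, since the exact design of [24] is not reproduced here.

```python
import numpy as np

def simulate_linear(n, contam=0.10, seed=0):
    """Clean linear data plus vertical outliers and high leverage points.
    Coefficients, error scale and shift sizes are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)
    m = int(contam * n)
    y[:m] += 20.0               # vertical outliers in the y-direction
    x1[m:2 * m] += 10.0         # high leverage points in the x-direction
    return np.column_stack([x1, x2]), y

X, y = simulate_linear(n=100, contam=0.15)
```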
The complete results of this simulation study, based on the proposed robust SVR method, the DSVR and ε-SVR, are presented graphically in Figures 6, 7 and 8, which show the estimates of these techniques for the different sample sizes and contamination levels. Based on the absolute residuals of SVR, the DSVR successfully detected the vertical outliers in this simulation study, but it does not detect high leverage points. By using RMD based on MVE, the proposed robust SVR approach detects both the vertical outliers and the high leverage points. Additionally, the proposed robust SVR approach achieves very low values of MSE and RMSE relative to DSVR and ε-SVR for all combinations of sample sizes and contamination percentages. This reveals that the proposed robust SVR method is more efficient than the other approaches in the presence of various percentages of contamination points in both the x and y directions.

Simulation II: Nonlinear Case
A nonlinear regression model with two independent variables and different sample sizes, namely n = 50, 100 and 250, is considered in this simulation study. For each sample size, a nonlinear relationship in x₁ and x₂, following [12], is used to generate the clean data, and a percentage of the observations is then contaminated with vertical outliers and high leverage points. Different contamination percentages (10%, 15% and 20%) and different values of the parameters (ν = 0.2, 0.5 and σ = 1, 5) are taken into consideration.
In this simulation study, ν-SVR with the RBF kernel is considered. For each of the 1000 simulation runs, the MSE and RMSE of the ν-SVR, DSVR and proposed robust SVR methods are computed as in (22) and (23), and their averages over the 1000 iterations are recorded. These averages are used to evaluate the effectiveness of the techniques.
Figures 9, 10 and 11 present graphically the comparison of the proposed robust SVR technique, the DSVR and ν-SVR in this simulation study, describing the estimates of these methods for different sample sizes and contamination levels. In this simulation study, vertical outliers are effectively identified by both the DSVR and the proposed robust SVR approach using the absolute fitted values, while high leverage points are identified by the proposed approach using the robust Mahalanobis distance based on MVE. Furthermore, the proposed robust SVR method has smaller MSE and RMSE than the DSVR and ν-SVR for all combinations of sample sizes and contamination percentages. This demonstrates that the proposed robust SVR approach is the best of the compared techniques in the presence of different levels of contamination.

Figure 9. The MSE and RMSE of the ν-SVR, the DSVR and the proposed robust SVR methods for different sample sizes and 10% contamination.
Figure 10. The MSE and RMSE of the ν-SVR, the DSVR and the proposed robust SVR methods for different sample sizes and 15% contamination.
Figure 11. The MSE and RMSE of the ν-SVR, the DSVR and the proposed robust SVR methods for different sample sizes and 20% contamination.