New Gradient Methods for Bandwidth Selection in Bivariate Kernel Density Estimation

The bivariate kernel density estimator is fundamental in data smoothing methods, especially for data exploration and visualization, because its results are easy to interpret graphically. The crucial factor that determines its performance is the bandwidth. We present new methods for bandwidth selection in bivariate kernel density estimation based on gradient methods and compare the results with the biased cross-validation method. The results show that the new methods are reliable and provide an improved choice of the smoothing parameter. The asymptotic mean integrated squared error is used as the measure of performance of the new methods.


Introduction
Kernel density estimators are widely used nonparametric estimation techniques due to their simple form and smoothness. Kernel density estimation is the construction of a probability density estimate from a given sample with few assumptions about the underlying probability density function, and it is a popular tool for visualising the distribution of data [1]. These estimates depend on a bandwidth, also known as the smoothing parameter, which controls the smoothness, and on a kernel function, which plays the role of a weighting function [2]. Bandwidth selection is a key issue in kernel methods and has attracted the attention of researchers over the years; it is still an active research area in kernel density estimation. Progress has been made recently on data-based smoothing parameter selectors by several researchers [3,4].
The importance of the bivariate kernel density estimator cannot be overemphasized because it occupies a unique position, bridging the univariate kernel density estimator and higher dimensional kernel estimators [5]. Its usefulness lies mainly in the simplicity with which probability density estimates can be presented, either as surface plots or contour plots, and it also aids the understanding of higher dimensional kernel estimators [2]. In bivariate kernel density estimation, (X, Y) is taken to be the pair of random variables with joint probability density function f(x, y). The random variables (X_i, Y_i), i = 1, 2, ..., n, are the set of observations and n is the sample size. The bivariate kernel density estimate of f(x, y) is of the form

  f̂(x, y) = (1 / (n h_x h_y)) Σ_{i=1}^{n} K((x − X_i)/h_x, (y − Y_i)/h_y),   (1.1)

where h_x > 0 and h_y > 0 are the smoothing parameters in the x and y axes and K(x, y) is a bivariate kernel function [5,6]. With a product kernel, the bivariate kernel density estimator in (1.1) can be written as [7]

  f̂(x, y) = (1/n) Σ_{i=1}^{n} (1/h_x) K((x − X_i)/h_x) (1/h_y) K((y − Y_i)/h_y).   (1.2)

Bivariate bandwidth selection is a difficult problem which may be simplified by imposing constraints on h_x and h_y. For example, h_x and h_y may be restricted to be the diagonal elements of the bandwidth matrix, and the advantages of imposing such restrictions have been investigated [8]. One popular data-based method of bandwidth selection is the biased cross-validation method, which is based on minimizing an estimate of the asymptotic mean integrated squared error [9]. The bivariate biased cross-validation criterion BCV(h_x, h_y) replaces the unknown curvature terms of the asymptotic mean integrated squared error by leave-one-out kernel estimates evaluated at the scaled pairwise differences Δ_x = (X_i − X_j)/h_x and Δ_y = (Y_i − Y_j)/h_y, where φ is the standard normal density function [10].
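As an illustrative sketch (not the authors' implementation), the product-kernel estimator (1.2) with a standard normal kernel in each axis can be coded in a few lines; the function name `bivariate_kde` and the synthetic sample below are our own assumptions for the example:

```python
import numpy as np

def bivariate_kde(x, y, data_x, data_y, hx, hy):
    """Product-kernel bivariate density estimate, eq. (1.2),
    with a standard normal kernel in each axis."""
    phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Average of the scaled kernel weights over the n observations.
    return np.mean(phi((x - data_x) / hx) * phi((y - data_y) / hy)) / (hx * hy)

# Illustration on a synthetic standard bivariate normal sample;
# the true density at the origin is 1/(2*pi), about 0.159.
rng = np.random.default_rng(0)
dx, dy = rng.standard_normal(500), rng.standard_normal(500)
est = bivariate_kde(0.0, 0.0, dx, dy, 0.5, 0.5)
```

Any symmetric kernel could be substituted for `phi`; the standard normal is used here because it is the kernel adopted later in the paper.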
This paper focuses on methods of selecting the bandwidth for the bivariate kernel density estimator using gradient methods. The rest of the paper is organized as follows: in section 2, we present the asymptotic mean integrated squared error of the bivariate kernel density estimator; in section 3, we present the gradient methods; section 4 presents numerical illustrations of the results; section 5 concludes the paper.

Asymptotic MISE Approximations
The performance of the estimate f̂(x, y) in (1.2) is measured by the asymptotic mean integrated squared error (AMISE). A straightforward asymptotic approximation of (1.2) using the multivariate version of Taylor's series expansion yields the integrated variance (IV) and the integrated squared bias (ISB):

  IV = R(K)^d / (n h_x h_y),   ISB = (1/4) μ₂(K)² ∫ [h_x² f_xx(x, y) + h_y² f_yy(x, y)]² dx dy,   (2.1)

where R(K) = ∫ K(u)² du is the roughness of the kernel, ∇²f is the trace of the Hessian matrix (the matrix of second partial derivatives) of f, d is the dimension of the kernel and μ₂(K) = ∫ u² K(u) du is the second moment of the kernel. Combining the terms in (2.1) yields an estimate of the asymptotic mean integrated squared error of the form

  AMISE(h_x, h_y) = R(K)^d / (n h_x h_y) + (1/4) μ₂(K)² ∫ [h_x² f_xx(x, y) + h_y² f_yy(x, y)]² dx dy.   (2.2)

The smoothing parameter that minimizes the AMISE of (2.2) is of the form

  h_AMISE = [ d R(K)^d / (n μ₂(K)² R(∇²f)) ]^{1/(d+4)},   (2.3)

which contains a bias term that depends on R(f″) and therefore cannot be evaluated if the true density function is not known. Several suggestions have been made, and one of the simplest solutions to this problem is to obtain the value of R(f″) from the normal distribution. This works for unimodal data, but for multimodal data (which is precisely what kernel density estimation is expected to reveal) it performs poorly; in that situation it serves instead as an initial point for other iterative methods [11,12].
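As a hedged sketch of how an AMISE of the form (2.2) can be evaluated for the standard normal product kernel (for which the univariate roughness is R(K) = 1/(2√π) and μ₂(K) = 1), assuming the curvature functionals R(f_xx), ∫f_xx f_yy and R(f_yy) are supplied by the user; the function and argument names are our own:

```python
import numpy as np

def amise(hx, hy, n, r_fxx, i_fxxfyy, r_fyy):
    """AMISE for a bivariate standard normal product kernel:
    variance term R(K)^2 / (n hx hy) plus the squared-bias term
    (1/4) [hx^4 R(fxx) + 2 hx^2 hy^2 I(fxx fyy) + hy^4 R(fyy)]."""
    rk = 1.0 / (2.0 * np.sqrt(np.pi))  # roughness of the univariate normal kernel
    iv = rk**2 / (n * hx * hy)         # integrated variance
    isb = 0.25 * (hx**4 * r_fxx + 2.0 * hx**2 * hy**2 * i_fxxfyy + hy**4 * r_fyy)
    return iv + isb
```

The variance term shrinks with n while the squared-bias term grows with the bandwidths, which is the trade-off the selector in (2.3) balances.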

The Gradient Method
Gradient methods are iterative methods for optimizing functions that are at least twice continuously differentiable [13]. They generate successive points in the direction of the gradient with the aim of reaching the stationary points [14]. The gradient of a function f(x), denoted ∇f(x), is given by

  ∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂x_d)ᵀ,

and the generic iteration is x_{k+1} = x_k − α_k ∇f(x_k), where the elements of the gradient are nonlinear and α_k is a positive scalar (the stepsize) that determines the distance between the current point and the previous point [13]. Generally, the gradient method requires higher derivatives of the function to be approximated. In this work, we modify the gradient method of Barzilai and Borwein and the relaxed steepest descent method of Raydan and Svaiter [15,16].
In the application of the gradient methods, we replace the unknown quantity R(f″) in (2.3) by a suitable kernel-based estimate denoted by R̂(f̂″), i.e. R(f″) = R̂(f̂″). The kernel function used is the standard normal kernel, which produces smooth density estimates and simplifies the mathematical computations needed. In the case of the standard normal kernel, successive smoothing parameters are obtained from the resulting approximation, given in (3.2) below. Since all gradient methods are iterative and require the choice of a starting value, say h₀, we apply a kernel-based estimate as the initial point of the iteration, of the form

  h₀ = σ̂ n^{−1/6},

where σ̂ is the standard deviation of the variate.
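The generic gradient iteration described above can be sketched as follows. This is an illustrative minimiser of a simple quadratic, not the bandwidth objective itself, and the fixed stepsize is an assumption made purely for the sketch:

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-5, max_iter=1000):
    """Classical gradient iteration x_{k+1} = x_k - alpha * grad(x_k)
    with a fixed stepsize alpha; stops when the update is below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - step * grad(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy objective f(h) = (h - 2)^2 with gradient 2(h - 2); minimiser h = 2.
h = gradient_descent(lambda h: 2.0 * (h - 2.0), np.array([5.0]))
```

The methods below keep this update but replace the fixed stepsize with data-driven choices.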

The Barzilai and Borwein Gradient Method
This method solves the problems associated with the classical gradient method through a novel approach to stepsize selection. It requires fewer computations and its rate of convergence is accelerated; it uses the same search direction as the classical gradient method but achieves better performance [17]. The method is used for solving the large sparse linear systems of equations that arise from partial differential equations, and it also applies in optimization theory due to its ease of computation and implementation [18,19]. The Barzilai and Borwein gradient method is of the form

  h_{k+1} = h_k − α_k ∇f(h_k),   α_k = (s_{k−1}ᵀ s_{k−1}) / (s_{k−1}ᵀ y_{k−1}),

where s_{k−1} = h_k − h_{k−1}, y_{k−1} = ∇f(h_k) − ∇f(h_{k−1}), and the underlying Hessian approximation must be symmetric. The value of the kernel-based estimate R̂(f̂″) must be positive, i.e. R̂(f̂″) > 0, because the smoothing parameter must be positive. Substituting the estimate R̂(f̂″) into (3.2) gives the smoothing parameter that minimizes the AMISE. In all the gradient methods considered, the modifications are in the method of obtaining the initial point of the iteration, which is kernel based, and in the introduction of the third step that is used to obtain the required solution. The algorithm proceeds as follows.

STEP 1. Choose an initial point h₀ (the kernel-based estimate above) and compute ∇f(h₀).

STEP 2. (a) Compute s_{k−1} = h_k − h_{k−1} and y_{k−1} = ∇f(h_k) − ∇f(h_{k−1}). (b) Compute the step size α_k = (s_{k−1}ᵀ s_{k−1}) / (s_{k−1}ᵀ y_{k−1}).

STEP3.
Employ the kernel-based estimate R̂(f̂″), evaluated at the current iterate, to compute the smoothing parameter using (3.2) above. STEP 4. Test a criterion for stopping the iterations. If the test is satisfied, then stop. Otherwise, set k = k + 1 and continue with step 2 by updating the step size α_k, where h_{k+1} = h_k − α_k ∇f(h_k).
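The Barzilai-Borwein stepsize rule can be sketched on a toy quadratic objective as follows; this is an illustration of the stepsize computation in step 2, not the authors' bandwidth code, and the quadratic test problem is our own assumption:

```python
import numpy as np

def barzilai_borwein(grad, x0, x1, tol=1e-5, max_iter=500):
    """Barzilai-Borwein iteration: stepsize
    alpha_k = (s^T s) / (s^T y) with s = x_k - x_{k-1} and
    y = grad(x_k) - grad(x_{k-1}), reusing past iterates
    instead of performing a line search."""
    x_prev, x = np.asarray(x0, float), np.asarray(x1, float)
    g_prev = grad(x_prev)
    for _ in range(max_iter):
        g = grad(x)
        s, y = x - x_prev, g - g_prev
        alpha = float(s @ s) / float(s @ y)  # BB stepsize
        x_prev, g_prev = x, g
        x = x - alpha * g
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x

# Toy quadratic f(x) = 0.5 x^T diag(1, 10) x, gradient diag(1, 10) x;
# the minimiser is the origin.
sol = barzilai_borwein(lambda x: np.array([1.0, 10.0]) * x,
                       np.array([1.0, 1.0]), np.array([0.9, 0.9]))
```

Note that the method needs two starting points, which is why the kernel-based initial estimate plus one gradient step are used to seed the iteration.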

The Relaxed Steepest Descent Method
Another modification of the steepest descent method was made by Raydan and Svaiter [16], who concluded that the poor performance of the method is due to the stepsize and not the search direction. To address the stepsize selection problem, they introduced a scalar θ, called the relaxation parameter, which lies between 0 and 2, into the classical steepest descent method. The classical steepest descent method is of the form

  h_{k+1} = h_k − α_k ∇f(h_k),   where α_k = argmin_{α>0} f(h_k − α ∇f(h_k)).
Multiplying the stepsize by the relaxation parameter improves the method by accelerating its rate of convergence when applied to numerical problems [20]. It should be noted that when θ = 1 the classical steepest descent method is recovered, so we take θ ≠ 1. The algorithm mirrors the steps above, with the update in step (c) replaced by h_{k+1} = h_k − θ α_k ∇f(h_k).

STEP3.
Employ the kernel-based estimate R̂(f̂″), evaluated at the current iterate, to compute the smoothing parameter using (3.2) above. STEP 4. Test a criterion for stopping the iterations. If the test is satisfied, then stop. Otherwise, set k = k + 1 and continue with step 2 by updating the step size. In the application of this algorithm, the relaxation parameter θ is taken to be a function of the sample size n, because of the role the sample size plays in the choice of smoothing parameter: it is well known that the smoothing parameter depends strongly on the sample size, and that as the sample size increases the smoothing parameter tends to decrease [21]. The stopping criterion is ‖h_{k+1} − h_k‖ < ε with ε = 10⁻⁵, where ε is the tolerance level. The results of the modified methods are compared with the biased cross-validation method.
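The relaxed steepest descent update can be sketched as follows on the same kind of toy quadratic; the value θ = 0.9 and the exact-line-search stepsize formula for a quadratic are assumptions made for the illustration, not the paper's data-driven choice of θ:

```python
import numpy as np

def relaxed_steepest_descent(grad, alpha_exact, x0, theta=0.9,
                             tol=1e-5, max_iter=1000):
    """Steepest descent with a relaxation parameter theta in (0, 2)
    multiplying the exact-line-search stepsize (Raydan-Svaiter style):
    x_{k+1} = x_k - theta * alpha_k * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        x_new = x - theta * alpha_exact(x, g) * g
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy quadratic f(x) = 0.5 x^T A x with A = diag(1, 10); the exact
# line-search stepsize is (g^T g) / (g^T A g).
A = np.array([1.0, 10.0])
sol = relaxed_steepest_descent(lambda x: A * x,
                               lambda x, g: float(g @ g) / float(g @ (A * g)),
                               np.array([1.0, 1.0]), theta=0.9)
```

Setting `theta=1.0` reproduces classical steepest descent, which is why the paper insists on θ ≠ 1.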

Results and Discussion
In order to illustrate the efficiency of these methods, we compare their performance with that of the biased cross-validation method, using the asymptotic mean integrated squared error (AMISE) as the error criterion function. The results compare the gradient methods (the Modified Barzilai and Borwein (MBB) and the Modified Relaxed Steepest Descent (MRSD) methods) with the Biased Cross-Validation (BCV) method for the bivariate kernel density estimator. As is generally known, one method is better than another when it gives a smaller value of the AMISE [12]. The comparison also involves the kernel estimates (graphs) of the methods considered, since kernel density estimation has direct applications in data analysis such as exploratory data analysis and data visualisation [1,2,6].
Two data sets were used to illustrate the results of the methods, showing in tabular form the smoothing parameters, the Asymptotic Integrated Variance (AIV), the Asymptotic Integrated Squared Bias (AISB) and the Asymptotic Mean Integrated Squared Error (AMISE) using the bivariate standard normal kernel. An important point to note from the tables below is that, in terms of performance, the gradient methods resulted in smaller values of the AMISE.
One very important step to take when examining a bivariate data set is to consider its scatterplot. However, while a kernel density estimate will reveal or highlight important features, a scatterplot often cannot play this vital role [2]. Scatterplots have been regarded as the most frequently used tools for graphically displaying bivariate data sets, but they have the serious disadvantage that the eye is drawn only to the peripheries of the data cloud, while structure in the main body of the data is hidden by the high density of the points [22]. Kernel density estimates remove these disadvantages because they have an advantage in presenting information about the distribution of the data set. For the data sets considered here, the modes are not apparent from the scatterplots as they are in the kernel density estimates, and this exemplifies the usefulness of bivariate kernel density estimates for highlighting structure.
The first data set examined is the Volcanic Crater data of the Bunyaruguru Volcanic Field in Western Uganda [23]. It contains the locations of the centers of the craters of 120 volcanoes in two variables, in which variable X represents the first center coordinate and variable Y represents the second. Figure 1 shows the scatterplot of the Crater data, which clearly shows a strong relationship between the variables: the two location coordinates are highly positively correlated. A significant feature of this data set that is very noticeable from the kernel density estimates (graphs) is the bimodality of the data, but this is hidden in the scatterplot. We standardized the data in order to obtain equal variances in each dimension, because in most multivariate statistical analyses the data should be standardized so that differences among the ranges of the variables disappear [2,10,24,25]. Figures 2, 3 and 4 below show the kernel estimates of the Crater data. The biased cross-validation method yields a smoothing parameter that produces an estimate in which the bimodality is clearly present, as shown in Figure 2. The gradient methods also yield smoothing parameter values that retain the bimodality of the data, as shown in Figures 3 and 4 respectively. Table 1 below shows the bandwidths, the asymptotic integrated variance (AIV), the asymptotic integrated squared bias (AISB) and the asymptotic mean integrated squared error (AMISE) of the methods considered. From Table 1 it is clear that, in terms of performance, the biased cross-validation method produced the largest AMISE value, while the gradient methods yield smoothing parameters with smaller AMISE values.
The second data set examined is the waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA [26]. The data set is made up of 272 observations on two variables, in which variable X represents the duration of the eruption and variable Y represents the waiting time between eruptions. One very important point to note from the bivariate kernel estimates of this data is that the data set is bimodal, which provides very strong evidence that the eruption times and the time intervals until the next eruption follow a bimodal distribution [2]. It is also evident that the time interval until the next eruption is highly positively correlated with the duration of the eruption. The scatterplot of the Old Faithful data is shown in Figure 5, while Figures 6, 7 and 8 are the bivariate kernel estimates of the data for the methods considered; the scatterplot shows a strong relationship between the variables. The data were also standardized to obtain equal variances in each dimension [24,25]. Table 2 below shows the smoothing parameters, the asymptotic integrated variance (AIV), the asymptotic integrated squared bias (AISB) and the asymptotic mean integrated squared error (AMISE) of the methods considered. From Table 2, the biased cross-validation method yielded the larger AMISE value; again the gradient methods yield smoothing parameters with the smaller AMISE values.

Conclusions
The methods presented are compared with the biased cross-validation method because both are based on a suitable estimate of the asymptotic mean integrated squared error (AMISE). The results show that the new methods are reliable and provide an improved choice of the smoothing parameter. An advantage of the gradient methods is that they are easily computed provided the function is at least twice differentiable.
For the bivariate case considered here, which sits between the univariate and higher dimensional settings and uses a standard normal product kernel, the gradient methods are, on the evidence of their performance, at least as competitive as the existing biased cross-validation method.