Tree-based Threshold Model for Non-stationary Extremes with Application to the Air Pollution Index Data

Air pollution index (API) is a common tool used to describe the air quality in the environment. High level of API indicates the greater level of air pollution which will gives bad impact on human health. Statistical model for high level of API is important for the purpose of forecasting the level of API so that the public can be warned. In this study, extremes of API are modelled using Generalized Pareto Distribution (GPD). Since the values of API are determined by the value of five pollutants namely sulphur dioxide, nitrogen dioxide, carbon monoxide, ozone and suspended particulate matter, data on API exhibit non-stationarity. Standard method for modelling the non-stationary extremes using GPD is by fixing the high constant threshold and incorporating the covariate model in the GPD parameters for data above the threshold to account for the non-stationarity. However, high constant threshold value might be high enough on certain covariate for GPD approximation to be a valid model for extreme values, but not on the other covariates which leads to the violation of the asymptotic basis of GPD model. New method for the threshold selection in non-stationary extremes modelling using regression tree is proposed to the API data. Regression tree is used to partition the API data into a stationary group with similar covariate condition. Then, a high threshold value can be applied within a group. Study shows that model for extremes of API using tree-based threshold gives a good fit and provides an alternative to the model based on standard method.


Introduction
Air quality is an important aspect of human life. In Malaysia, it is monitored and enforced by the Department of Environment (DOE). Air quality is determined by the Air Pollution Index (API) which is updated hourly by DOE. API value is derived from five main pollutants which are ground level ozone (O 3 ), nitrogen dioxide (NO 2 ), particulate matter (PM 10 ), carbon monoxide (CO) and sulphur dioxide (SO 2 ). High concentration of these pollutants in the air is harmful for everyone and also causing serious health problem. High concentration of SO 2 , NO 2 and CO in the air can cause a heart and lungs problem [1]. High level of PM 10 is associated with haze days which can limit the eyesight and cause the respiratory problem [2]. According to Malaysian Ambient Air Quality Guidelines (MAAQG), API values which above 100 are considered unhealthy and could threaten public health. Therefore it is important to understand the behavior of high level of API particularly to give health warnings for the public. In order to describe the behavior of high level API at a particular area, it is important to identify the distributions which best fit the data [3]. Extreme value distribution from extreme value theory are suitable in modelling such high values.
Extreme value theory (EVT) is a branch of statistics which study the stochastic behavior of a process at unusually large or small values [3]. Particularly, EVT provides procedures for tail estimation which are scientifically and statistically rational. Two significant results from EVT are first, the asymptotic distribution of the standardized series of maxima (minima) is shown to converge to the Gumbel, Frechet, or Weibull distributions. A standard form of these three distributions is called the generalized extreme value (GEV) distribution. The second significant result concerns the distribution of excess over a given threshold, where the limiting distribution is a generalized Pareto distribution (GPD). The application of EVT is important in various disciplines such as hydrology, meteorology, finance, engineering, insurance and environment. Common applications of EVT include modeling the extremes of heavy rainfall, sea-levels, pollution concentrations and many more. There are many application of extreme value analysis (EVA) in the context of air pollution data set in many parts of the world. An early comprehensive review of the application of EVA to the air quality data (SO 2 and NO 2 ) can be found in [4] and [5]. [6] compares the performance of GEV and GPD fitted on PM 10 data based on estimated parameters and return levels while [7] modelled API values which are above 100 using GPD. [8] use MRL plot to select thresholds for API, O 3 and PM 10 data. Using Pickands dependence function plots, [8] shows that the PM 10 and O 3 are the dominant pollutants which could affect the API at high level.
As the API value varies according to the variation of five main pollutants which are O 3 , NO 2 , PM 10 , CO and SO 2 , considering these pollutants into model for API seems reasonable. Modelling the high API in the presence of these covariates also known as model for non-stationary, requires a specific treatment to account for these non-stationarity. Standard extreme non-stationary model using GPD is to fixed a high threshold u while the effect of non-stationarity is accounted by the inclusion of covariate model in the GPD parameters. However, the pre-selected high u does not guarantee that it is high enough for all the covariate to produce those API values, hence, imposing an inaccurate approximation of GPD as a model for threshold exceedances [9]. One of the possible solution for these problem is to use high u for API values produced by similar pollutants condition. This can be achieved by grouping the pollutants into the same group which produce most, if not all, stationary API values. In this paper we propose a new threshold selection for non-stationary GPD model using a regression tree. This paper is organized as follows. In Section 2, we explain the data that will be used in this study followed by an explanation on a methodology in Section 3. Section 4 evaluate the performance of the proposed method by a simulation study. Section 5 will discuss the finding of the research and some concluding remarks in Section 6.

Data and Study Area
The hourly air quality data consist of air pollution index (API) and hourly average of ground level ozone (O 3 ), nitrogen dioxide (NO 2 ), particulate matter (PM 10 ), carbon monoxide (CO) and sulphur dioxide (SO 2 ) obtained from Department of Environment Malaysia for the period of 1 st January 2008 until 31 st December 2017. The data is from a Continuous Air Quality Monitoring Station located in Klang Valley which is Sekolah Kebangsaan Bandar Damansara Utama, Petaling Jaya station. This station are considered as residential and industrial areas where air pollutants are produced at a higher rate [10]. Besides, Petaling Jaya also located at the edge of Kuala Lumpur, the capital city of Malaysia which makes it a highly populated area. Figure 1 shows the location of Petaling Jaya station in the state of Selangor. The minimum, maximum and median values for API observations are 18, 257 and 54 respectively. According to Malaysian Ambient Air Quality Guidelines (MAAQG), API values which above 100 are considered unhealthy and could threaten public health. In this data set, there is 97 API observations which exceed 100.

Model Formulation for the Tree-based Threshold
Let Y 1 , Y 2 , . . . be a sequence of independent random variables with common continuous distribution function F (y) and denote the upper end point as y F . The extreme observations refer to those of the y that exceed some pre-determined high threshold u with u < y F . According to [11], as u → y F , the distribution of the threshold exceedances, z = y − u|y > u can be modeled by distribution function of the form defined on the set {z : z > 0 and (1 + ξz/σ u ) > 0}. The distribution function defined by Eq. (1) is called the Generalized Pareto distribution (GPD). The parameters of the GPD are determined by the scale σ u > 0 and shape −∞ < ξ < ∞. The density function of GPD is The parameters of the GPD can be estimated by maximum likelihood estimation method. Suppose that the values z 1 , . . . , z k are the k threshold exceedances. The likelihood function derived from Eq. (2) is Tree-based Threshold Model for Non-stationary Extremes with Application to the Air Pollution Index Data with θ = (σ u , ξ). By taking a logarithm, the likelihood function given by Eq. (3) becomes We use numerical optimization method to optimize the log-likelihood function in Eq. (4) since the analytical maximization is not possible.
In real-life applications, the distribution of the data sets cannot always be assumed to be identically distributed. This situation which is known as non-stationary is often apparent because of seasonal effects, trends or because the variable of interest is related to covariate. The usually adopted approach is by using the standard extreme value models as a basic templates that can be enhanced by statistical modeling.
Let Y 1 , Y 2 , . . . be a non-stationary series and information about some covariates {X} are available. Suppose that the random variable Y are related to the random variables X. The standard method for modeling the extremes of non-stationary series is focuses on retaining a constant high threshold u and incorporating the covariate models in Generalized Pareto (GP) parameters to account for the non-stationarity [12]. The distribution of the threshold exceedances from a non-stationary series can be model by where σ u (x) and ξ(x) are the covariate models. The distribution of the threshold excesses in Eq. (5) can be approximate by the GP if each covariate x have a high enough threshold. However, high enough threshold for one covariate might not be high enough for the other covariate to produce those high y values leads to the invalidity of the GP to approximate the distribution of threshold exceedances. To remedy the problem, we propose a tree-based threshold to model the threshold exceedances. The regression tree is use to partition the y sequences into m homogeneous stationary clusters. In order to obtain a stationary cluster, we apply a stationary test and use a stopping criteria for growing the tree. We defer the discussion of these to Section 3.2.
Suppose that the observations within each clusters produced by regression tree are stationary or approximately stationary. Then, a constant high threshold can be set within each clusters producing a different threshold in each of the clusters known as tree-based threshold. Then, the distribution of the tree-based threshold exceedances can be model by where y k is the observations within the cluster k (k = 1, 2, . . . , m) and u k is a threshold set in a cluster k. Denote the tree-based threshold exceedances z k = y k − u k , the distribution of z k has a form of The density function of distribution function of Eq. (7) is similar as in Eq. (3). The estimation of the parameters is simply done using maximum likelihood estimation by maximizing the likelihood function as given in Eq. (3).
If the covariate model is still needed to model the distribution of the tree-based threshold exceedances due to the inability of the regression tree produced the stationary observations in each cluster, GP model with covariate function in the parameters can also be estimated using maximum likelihood estimation method. Let z i1 , z i2 , . . . , z ik for i = 1, 2, . . . , n uk where n uk is a number of exceedances in cluster k, be a tree-based threshold exceedances follow the GP(σ u k (x), ξ(x)) where each of σ u k (x) and ξ(x) have an expression in terms of parameters vector and covariates. Denoting β as a vector of parameters of a covariate model, the numerical technique is required to optimize the likelihood function derived from Eq. (7) to estimate β.

Stopping Criteria in Regression Tree
Regression tree is a supervised learning method that construct a flowchart-like tree from the data as a prediction tree model and uses the model to classify the future data [13]. Regression tree consists of one parent node, internal nodes and terminal nodes. In our study, we refer to the terminal nodes as a cluster which consist of stationary observations. To determine which cluster an observation is belongs to, all observations are placed at the root (parent node) of the tree. We follow a path from the root and proceed to one of the internal node called leaf by following a question that split the parent node. The observations with yes answer will be placed at the left leaf (internal node) while observations with no answer will be placed at the right leaf (internal node). The tree model is fitted using binary recursive partitioning where a parent node in a decision tree is split into two internal nodes based on the splitting criterion. The split is chosen such that the impurity level of the tree is reduced the most by the split. The tree impurity level is measured by sum squared errors of the tree which given by where m c = 1 nc i∈c y i , is the mean of observations within leaf c.
The binary partitioning process will be applied over and over again until it meets some stopping criteria. The stopping criteria is set to control the size of the tree so that the tree will stop to grow when the observations within the clusters are stationary. In this study, we set a value δ, such that the reduction of sum squared errors of the tree after each split is not less than δ. We consider values of δ between 0.000001 and 0.01 with 0.000001 interval. The maximum value of δ which grow the tree with stationary observations within a cluster will be chosen as a stopping criteria. The stationarity of the observations within the cluster is tested using Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test at significance level α = 0.05. Then, a constant high threshold can be set within each resulted clusters.
The threshold is set at q th percentile where q kept similar for all clusters so that the rate of exceedances remain constant throughout the data set. Since each cluster has different number of observations, the threshold value might differ for each clusters. In other words, each observation will have their own threshold value. These threshold values are arranged according to the index of observations producing a varying threshold. In this study, the 95 th percentile value for threshold are chosen for each cluster. This percentile value is reasonably sufficient for GPD approximation be a valid limiting distribution for threshold exceedances while still keep the number of exceedances large enough for the model estimation [14].

Simulation Study
In this section we will illustrate the efficiency of the treebased threshold method over the standard method for modelling the non-stationary extremes by a simulation study. We simulate random numbers from generalized extreme value distribution using inverse sampling method. Our argument for the choice of GEV distribution is as follows. If Y 1 , . . . , Y n is distributed as GEV(µ, σ, ξ), then, it can be shown that the block maxima M n = max(Y 1 , . . . , Y n ) will also GEV(µ * , σ * , ξ) with µ * = µ + σ(n ξ − 1) ξ and σ * = n ξ σ.
According to [3], if the distribution of a block maxima is GEV, then the excesses of high enough threshold, u can be approximated by GPD with parameterσ and ξ wherẽ σ = σ * + ξ(u − µ * ).
Here, parameter ξ is equal to that of the corresponding parameter ξ in GEV distribution.
Covariates model is incorporated in the GEV location parameter, µ to induce non-stationarity in the simulated random numbers. Two covariate models are used which are: 3 x for cyclic trend where t and n represent time covariate and number of observations respectively. Another covariate x is generated from standard normal distribution. Time covariate, t is included to create trends in the data sets, while the covariate x represents a random variable which might affect the variable y. The time covariate is simply an increasing index from 1 until 3653 which corresponds to number of days in 10 years. The covariate x is simulated using function rnorm in R statistical software. We consider four non-stationary GEV data sets of size n = 3653, each containing either a linear or a cyclic trend in parameter µ with shape parameter ξ = 0.4 and ξ = −0.4. The abbreviation of non-stationary GEV data sets are given in Table 1. The scale parameter, σ is fixed at 1 for all data sets. The location parameter µ 0 , µ 1 , µ 2 , µ 3 are chosen arbitrarily and given in Table 2. The tree-based threshold selection method is applied to the simulated data sets. The excesses of tree-based threshold are modelled by both stationary and non-stationary GP model. Covariate models are incorporated in the scale parameter of nonstationary GP model such that the scale parameter is either or where Eq. (8) is for data set with linear trend while Eq. (9) is for data set with a cyclic trend. The performance of the fitted stationary GP and non-stationary GP models are compared using Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The AIC and BIC values are shown in Table 3. Table 3 shows that the AIC and BIC values of stationary model fitted to the tree-based threshold exceedances for simulated data with positive shape parameters are smaller compared to the non-stationary model. For simulated data with negative shape parameter, both AIC and BIC show a negative values indicates less information loss than a positive values. Comparison between stationary and non-stationary models shows that, in overall, the AIC and BIC values are smaller for stationary than non-stationary therefore favor the stationary model in modelling the tree-based threshold exceedances. This conclude that regression tree method are able to produce most stationary data within the cluster, hence simpler model can be fit into the threshold exceedances. Now we compare the performance of tree-based threshold method using stationary model with the standard method which is based on a constant high threshold u. For a standard method, value of threshold is fixed at 95 th percentile of the simulated data. Threshold exceedances for standard method are modelled using GPD with the covariate models incorporated into the scale parameter as given in Eq. (8) and Eq. (9). The Root  Table 4 shows that the RMSE values for treebased threshold method are smaller compared to the standard method except for GEVLP. However, the value of RMSE for tree-based threshold method is quite close to the RMSE for standard method with covariates model indicating that these two methods are comparable with the advantage to the treebased threshold method because of less parameter has to be estimated. Moreover, the R 2 values for tree-based threshold selection method is closer to 1 compared to standard method.

API Data Analysis
In this section, the proposed tree-based threshold selection method is applied to daily maxima of API and covariates PM 10 , O 3 , SO 2 , NO 2 and CO data as described in Section 2. Table 5 shows the percentage of missing values for the data sets. As the percentage of missing values in several covariate are quite high, we use the Full Conditional Specification technique to impute the missing data. In this technique, each incomplete variable is imputed by a separate model. The imputation is done using mice package in R software and the algorithm is completely discussed in [15]. The descriptive statistics for API data and the covariates after the imputation is given in Table 6. From Table 6, the highest API recorded during the study period is 257 which falls in very unhealthy level. We develop the regression tree for API and covariates data using procedure discussed in Section 3.2 with δ = 0.000083. The regression tree shown in Figure 2 produce 71 clusters with O 3 and PM 10 become a dominant covariates that split the API data. For each resulted clusters, we set 95 th percentile threshold producing a covariate-varying threshold known as tree-based threshold as shown in Figure 3. The exceedances of tree-based threshold is then modelled by both stationary and non-stationary GP model. Since PM 10 and O 3 are the dominant pollutants that split the API data, we consider these pollutants in modelling the tree-based threshold excesses of API data for non-stationary GP model. Study by [8] also shows that PM 10 and O 3 are the dominants pollutants that affect the variation in API data. The covariates model is incorporated in the scale parameter of GP model such that σ = exp(σ 0 +σ 1 O 3 +σ 2 PM 10 ) where the exponential function is used to ensure that the positivity of σ is respected for all values of PM 10 and O 3 . Table 7 shows the parameter estimates of stationary and non-stationary GP model fitted to the tree-based threshold excesses API data. From Table 7, both models have positive shape parameter indicates that the distribution of tree-based threshold exceedances for API data is unbounded. The performance of the stationary and non-stationary GP model fitted to the tree-based threshold exceedances are evaluated using AIC and BIC. The AIC and BIC values are shown in Table 8. Based on Table 8, the AIC and BIC values are smaller for stationary GP model compared to non-stationary model, which conclude that modelling treebased threshold excess with stationary GP model produce less information loss. Hence, this method provide much simpler model to explain the variation in API data. The goodness-of-fit (GoF) of the stationary GP model is tested using Anderson-Darling (AD) test and Cramer von Misses (CVM) test. The p-values of the GoF tests shown in Table 9 indicates that the stationary GP model fit the tree-based threshold exceedances of API data well. We also compare the tree-based threshold selection method with the standard method using RMSE and R 2 . A constant threshold is set at 95 th percentile for standard method. Results in Table 10 shows that RMSE value for tree-based threshold method are smaller than the RMSE value using standard method indicating that the tree-based threshold method pro-

Conclusion
In this paper, a new and simple method for threshold selection for the GPD in the presence of covariate is presented. The method uses regression tree to partition the data sets into approximately stationary series. The excesses of the tree-based threshold is shown to be better fitted with stationary GP model compared to non-stationary GP model producing much simpler model to explain the variation in data sets. Comparison made with the standard method shows that the proposed tree-based threshold is much better in terms of producing less error and better at forecasting values. In modelling the API data, the treebased threshold is sufficiently enough to produce a stationary threshold exceedances so that much simpler model could be fitted in order to explain the variation in API data. In practice, our method can be seen as an additional tool that complements existing threshold selection methods.