Projections of Cardiovascular Disease Mortality in Peninsular Malaysia Using Statistical Downscaling Based on Cluster Approach


Abstract Projecting future cardiovascular disease mortality is crucial for preparing mitigation strategies. The purpose of this research is to estimate the number of cardiovascular disease deaths in Peninsular Malaysia from future temperature projections using a cluster approach. Ward's method is used to group 45 meteorological stations into clusters based on the distance between the station coordinates. The output of a global climate model (GCM) is useful for projecting future temperature, but large biases between the GCM output and the observational datasets may lead to inaccurate projections. To tackle the bias, a well-fitted model for the temperature series is important to ensure that the mean and variability of the observed series are well captured, and the parameters must be estimated precisely for each cluster. Thus, this study proposes an appropriate statistical distribution for the temperature series to be used in the bias correction method (BCM) with the quantile mapping (QM) technique to reduce the biases between the observations and the historical GCM temperature series. The results show that the proposed model reduces the temperature biases between the GCM and the observations. Six clusters throughout Peninsular Malaysia are selected based on Ward's method. Based on the temperature projections, the number of cardiovascular disease deaths is projected to increase between 2006 and 2100 in all clusters across Peninsular Malaysia.

Introduction
Recent research has shown that climate change dramatically increases temperature-related mortality. However, the potential health risks related to climate change differ greatly between diseases. Because populations differ in their exposure to local social and environmental stressors and in their access to health care, potential health outcomes will also differ across regions and countries. Global climate models (GCMs) are the most common tools for assessing future climate. GCMs have a long history of development and offer a rare opportunity to model the global climate physically and to explore uncertainty in novel ways [1]. However, the resolution of GCMs is too coarse and is associated with distortions due to the structure of the model and the processes of parameterization, assumption, and calibration [2]. As a result, GCMs cannot be used to forecast future climates directly. Downscaling is the process of refining coarse-resolution GCM output to a finer resolution. Statistical and dynamical downscaling are the two types of downscaling. Statistical downscaling is the process of developing statistical relationships between local climate variables and large-scale atmospheric variables. Statistical downscaling is used in this study because it is less computationally demanding, captures the bias well, and can provide finer-resolution outputs than dynamical downscaling [3]. Statistical downscaling encompasses a wide range of techniques. The bias correction method (BCM) is one statistical downscaling approach; it corrects the bias between observations and historical GCM data, on the assumption that the correction remains valid under future conditions [4]. One of the BCM techniques, quantile mapping (QM), is often used to capture changes in the mean and variability of a GCM [5]. Grouping similar stations in a good structure could explain the climate of Peninsular Malaysia accurately [6].
Parameter values estimated without clustering may not represent the whole of Peninsular Malaysia well, as each region has different climatic characteristics and topography. It is therefore important to estimate the parameters for each cluster precisely [7]. Furthermore, a well-fitted model for the temperature series is crucial to ensure that the mean and variability of the observations are well captured. Thus, this study proposes an appropriate statistical distribution for the temperature series to be used in the bias correction method (BCM) with the quantile mapping (QM) technique to reduce the biases between the observations and the historical GCM temperature series. Next, Ward's method is applied to determine the optimal number of clusters for Peninsular Malaysia. The aims of this study are therefore to project the future series of daily mean temperatures (2006-2100) and to calculate the cardiovascular disease mortality in Peninsular Malaysia based on the temperature projections using the cluster approach.

Temperature in Malaysia
According to the Intergovernmental Panel on Climate Change (IPCC), Southeast Asian countries, especially developing countries like Malaysia, would be the most vulnerable to heatwaves [8]. From 1961 to 2002, surface temperature in most parts of Malaysia showed major warming trends [9]. Warming trends of between 2.7°C and 4.0°C per 100 years were observed at various stations in Peninsular Malaysia and northern Borneo. The Kuching and Bintulu stations had lower rates of between 1.0°C and 1.5°C per 100 years, while Miri had no noticeable warming or cooling trend. From 2008 to 2010, the lowest heatwave index in East Malaysia was 27.3°C in Kuching and the highest was 35.0°C in Sandakan [10]. From 2001 to 2010, across Peninsular Malaysia, the highest heatwave index was experienced in Kuala Lumpur, with an increase of 9.1°C, and the lowest in Alor Setar, with an increase of 0.1°C [11]. In addition, a moderate heatwave index of 4.2°C was found in Kuantan, and the longest heatwave, spanning 24 days, occurred in Ipoh, Perak, with amplitudes ranging from 29.4°C to 33.0°C [11]. The heatwave characteristics were also compared using spatial distribution maps, which revealed that the southeast, northeast, and west parts of Peninsular Malaysia experience the most heatwaves. The highest heatwave index occurred during the dry season, between March and July [11]. Tang [12] investigated daily mean temperature in five locations: Kota Kinabalu, Kuching, Malacca, Kuantan, and Subang Jaya. The annual moving average of daily mean temperature in Kota Kinabalu was found to be rising, with temperatures fluctuating between 26°C and 28.5°C. In Kuantan, the temperature varies from 25°C to 28°C, while in Subang Jaya it ranges from 26°C to 28.7°C. Meanwhile, Kuching has a temperature range of 25.5°C to 27.5°C, and Malacca has a temperature range of 26°C to 28.5°C.
Kuching had the smallest increment in the annual moving average of daily mean temperature, followed by Kota Kinabalu, Malacca, Kuantan, and Subang Jaya. Kuching experienced the smallest temperature rise because of its slower rate of development. These findings confirm that heatwave conditions in Malaysia are concerning and call for further research, because heatwaves may directly affect agriculture, the economy, and human health [10][11].

Cardiovascular Disease (CVD)
Cardiovascular diseases (CVDs) are a category of diseases that affect the heart and blood vessels. Coronary heart disease (CHD), one of the CVDs, is the leading cause of death globally. CHD mortality in Malaysia has more than tripled over the last 40 years and continues to grow. CVDs were ranked third among causes of death in the 1950s, rose steadily to first place in 1970, and continued to grow until 1989, a 16.5-fold increase from 1.8% in 1950 to 29.6% in 1989 [13]. This suggests that, among the six disease groups, CVDs have surpassed respiratory diseases, neoplasms, infectious diseases, metabolic diseases, and blood diseases as the leading cause of death. The four main diseases under CVD are coronary heart disease, cerebrovascular disease, hypertension, and rheumatic heart disease. Deaths due to CHD increased from 32.7% in 1965 to 38.2% in 1989, while deaths due to cerebrovascular disease, hypertension, and rheumatic heart disease decreased from 33.1%, 16.0%, and 4.5% in 1975 to 30.1%, 1.4%, and 1.7% in 1989, respectively [13]. Deaths due to other types of CVD decreased from 46% in 1965 to 28.6% in 1989 [13].

Cluster Analysis
Cluster analysis is the method of grouping objects so that objects in the same cluster are more similar to one another than to objects in other clusters. It involves formulating the problem, selecting a distance metric, selecting a clustering procedure, defining the number of clusters, interpreting and profiling the clusters, and evaluating the validity of the clustering. Cluster analysis serves three aims: uncovering the underlying structure of the data, to gain insight into previously unknown data and to identify significant features; classification, to determine the degree of similarity among data points; and compression, to organize and summarize the data into understandable segments [18][19]. Cluster analysis is becoming more common in atmospheric science for redefining climate divisions using specific climate and meteorological data [18][19][20]. Several studies have used cluster analysis to determine the spatial and temporal trends of monthly precipitation [6][21][22][23]. It has been suggested that cluster analysis can also be applied to variables such as temperature, to investigate the regional characteristics of the temperature distribution through pattern and spatial analysis of temperature variability on an annual and monsoon-seasonal basis for various climatic regions [6]. Several studies have used the k-means clustering approach to classify spatial and temporal patterns [23]. The k-means method requires users to specify the number of clusters, and specifying it incorrectly can lead to severe errors in the analysis [6]. Ward's method, on the other hand, is less sensitive to the choice of the number of clusters and is capable of capturing extreme values [7]. Hence, Ward's method is used for clustering in this study.

Bias Correction Method Quantile Mapping
BCMs use a transfer function to correct the bias between observations and historical climate variables. The transfer function is assumed to be stationary, so that it remains valid for the future period [4]. Several BCMs have been developed to downscale meteorological variables from GCM outputs, ranging from simple linear methods to advanced nonlinear methods [4][5][24]. Linear scaling, variance scaling, power transformation, QM, daily bias correction, local intensity scaling, daily translation, the delta change method, multiple linear regression, multiple linear regression with randomization, the analogue method, and the nearest neighbour analogue are only a few examples of BCMs [4][5][24][25][26][27][28][29][30][31]. QM is one of the best BCMs because it adjusts the mean, the quantiles, and the variance while preserving the extremes [4][5][26][27]. The aim of QM is to match the probability density function (PDF) of the GCM outputs with the PDF of the observed data [4][32]. This can be accomplished by adjusting the occurrence distributions of meteorological variables using a transfer function [33]. QM can be classified into three types: distribution-derived QM, parametric QM, and nonparametric QM [34]. Distribution-derived QM is used in this study because it offers the best combination of accuracy and robustness [26][27]. Previous research has usually assumed that temperature series are normally distributed, with a symmetrical shape [4][5][26][27][35][36][37]. However, the temperature series considered here are positively skewed. Thus, a Gamma distribution with three parameters is proposed in this study. The added location parameter can be used to determine the best central value for describing the data. In particular, QM is applied with a Gamma distribution fitted to the temperature series with three parameters, shape ($\alpha$), scale ($\beta$), and location ($\gamma$), to reduce the biases between the observed and GCM temperatures.

Data
The historical daily mean temperature for Peninsular Malaysia was obtained from 45 weather stations for the period 1976 to 2005 [23]. Outputs of the Model for Interdisciplinary Research on Climate version 5 (MIROC5), produced as part of the Coupled Model Intercomparison Project Phase 5 (CMIP5), are used. In this study, the historical GCM daily mean temperature and the future GCM daily mean temperature (2006-2100) under RCP4.5 and RCP8.5 were obtained. Representative Concentration Pathways (RCPs) are a collection of estimated values for radiative forcing, CO2 concentration, and temperature anomaly up to the year 2100. RCPs were created using economic activity, energy sources, population growth, and other socioeconomic variables [38]. RCP8.5 represents increasing greenhouse gas emissions over time, while RCP4.5 represents a stabilization scenario in which radiative forcing stabilizes shortly after 2100 while remaining below the long-run radiative forcing target level. Radiative forcing is the result of a process that changes the balance of incoming and outgoing energy in the Earth-atmosphere system. As shown in Table 2, each RCP has its own emissions trajectory and radiative forcing. The mortality data for unstable angina, STEMI, and NSTEMI recorded at Malaysian hospitals were obtained from [39][40][41][42].

Ward's Method
Hierarchical cluster analysis (HCA) and non-hierarchical cluster analysis (NHCA) are the two types of cluster analysis. HCA is used in this study to compress, organize, and summarize the data into manageable clusters. HCA connects the coordinates that are the most similar, using an agglomerative approach. It begins with $n$ data points, each of which forms an individual cluster. To form a new cluster, the shortest distance between each point and its $n - 1$ neighbours is measured. After that, the distances between the remaining $n - 2$ coordinates and the newly created cluster are considered. The pair with the shortest distance is then joined, either by adding a coordinate to an existing cluster or by forming a new cluster from two coordinates. This process is repeated until a single cluster containing all coordinates is obtained, regardless of the absolute distance between them. The algorithm is effective for reducing dimensionality. The results of HCA are often represented as a dendrogram, a tree-shaped graph that contains crucial information about the measured distances between clusters and the couplings formed. HCA has the advantage over NHCA that it is not sensitive to the choice of the number of clusters. One of the methods in HCA is Ward's method, which has the advantage of minimizing the total within-cluster variance. The distance between two points is measured as

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \qquad i, j = 1, \dots, n,$$

where $n = 45$ is the number of stations and $(x_i, y_i)$ are the coordinates of station $i$.
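The clustering step described above can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: it assumes SciPy's `linkage` with `method="ward"` on Euclidean distances, and the 45 station coordinates are synthetic points generated around six hypothetical centres, not the actual meteorological stations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

# Hypothetical (longitude, latitude) centres, used only to generate toy data.
centres = np.array([
    [100.4, 6.1], [101.0, 4.6], [101.7, 3.1],
    [102.3, 2.2], [103.3, 3.8], [102.2, 5.9],
])
counts = [8, 8, 8, 7, 7, 7]  # 45 synthetic stations in total

# Scatter the synthetic stations around each centre.
coords = np.vstack([
    centre + rng.normal(scale=0.1, size=(k, 2))
    for centre, k in zip(centres, counts)
])

# Agglomerative clustering with Ward's criterion on Euclidean distances.
Z = linkage(coords, method="ward")

# Cut the dendrogram so that at most six clusters remain.
labels = fcluster(Z, t=6, criterion="maxclust")
print(np.bincount(labels)[1:])  # number of stations per cluster
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree from which the number of clusters is read.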

Quantile Mapping (QM)
QM defines a transfer function to reduce the biases between the observations and the historical GCM temperature series. The transfer function is

$$T_c = F_o^{-1}\left(F_h\left(T_h\right)\right), \tag{1}$$

where $T_c$ represents the corrected temperature, $T_o$ and $T_h$ represent the observed and historical GCM daily mean temperature, and $F_h$ and $F_o$ represent the cumulative distribution functions (CDFs) of the historical GCM and the observed daily mean temperature, respectively. Equation (1) is also used to calculate the future daily mean temperature projections under RCP4.5 and RCP8.5. The observations ($T_o$) were replaced with the future GCM temperature under RCP4.5 ($T_{f_1}$) and RCP8.5 ($T_{f_2}$), while the historical GCM temperature ($T_h$) was replaced with the corrected temperature ($T_c$). Let $P_1$ and $P_2$ represent the projections of future daily mean temperature under RCP4.5 and RCP8.5, respectively.
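The PDF of the Gamma distribution with three parameters is

$$f(x; \alpha, \beta, \gamma) = \frac{(x - \gamma)^{\alpha - 1}}{\beta^{\alpha}\,\Gamma(\alpha)} \exp\left(-\frac{x - \gamma}{\beta}\right), \qquad x > \gamma,$$

where $\alpha$ is the shape, $\beta$ is the scale, $\gamma$ is the location, and $\Gamma(\cdot)$ is the gamma function. The parameter values can be estimated using maximum likelihood estimation (MLE). For a sample $x_1, \dots, x_n$, the log-likelihood is

$$\ln L = -n\alpha \ln\beta - n \ln\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^{n} \ln(x_i - \gamma) - \frac{1}{\beta}\sum_{i=1}^{n}(x_i - \gamma).$$

Taking the partial derivatives with respect to $\alpha$, $\beta$, and $\gamma$ and setting them equal to zero gives

$$\frac{\partial \ln L}{\partial \alpha} = -n\ln\beta - n\,\psi(\alpha) + \sum_{i=1}^{n}\ln(x_i - \gamma) = 0,$$

$$\frac{\partial \ln L}{\partial \beta} = -\frac{n\alpha}{\beta} + \frac{1}{\beta^{2}}\sum_{i=1}^{n}(x_i - \gamma) = 0,$$

$$\frac{\partial \ln L}{\partial \gamma} = -(\alpha - 1)\sum_{i=1}^{n}\frac{1}{x_i - \gamma} + \frac{n}{\beta} = 0,$$

where $\psi(\cdot)$ is the digamma function. The values of $\alpha$, $\beta$, and $\gamma$ cannot be estimated directly from these equations because of the term $\sum_{i=1}^{n} \frac{1}{x_i - \gamma}$. Instead, they can be calculated in closed form without having to solve the nonlinear equations simultaneously [43]. The estimator of $\gamma$ is based on $x_{(1)}$, the first order statistic of the daily mean temperature values. The estimators of $\alpha$ and $\beta$ are based on $\bar{x}$, the average of the daily mean temperature, and $s^2$, the variance of the daily mean temperature.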
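To make the fitting and mapping steps concrete, the sketch below fits a three-parameter Gamma distribution in closed form and applies distribution-derived quantile mapping. It is an illustration under stated assumptions, not the estimator of [43]: the location is taken as the sample minimum minus a small offset `eps`, and the shape and scale are the method-of-moments values given that location. The function names `fit_gamma3` and `quantile_map` and the synthetic temperature series are hypothetical.

```python
import numpy as np
from scipy import stats

def fit_gamma3(x, eps=1e-3):
    """Closed-form fit of a three-parameter Gamma distribution.

    Assumption: location = sample minimum - eps (so every x exceeds it);
    shape and scale then follow from the sample mean and variance.
    """
    x = np.asarray(x, dtype=float)
    loc = x.min() - eps
    mean_excess = x.mean() - loc
    var = x.var(ddof=1)
    shape = mean_excess ** 2 / var
    scale = var / mean_excess
    return shape, scale, loc

def quantile_map(x_gcm, gcm_params, obs_params):
    """Map GCM values through the GCM CDF, then the inverse observed CDF."""
    a_g, s_g, l_g = gcm_params
    a_o, s_o, l_o = obs_params
    p = stats.gamma.cdf(x_gcm, a_g, loc=l_g, scale=s_g)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # keep the inverse CDF finite
    return stats.gamma.ppf(p, a_o, loc=l_o, scale=s_o)

# Synthetic daily mean temperatures (degrees Celsius), for illustration only.
rng = np.random.default_rng(0)
obs = 24.0 + rng.gamma(4.0, 0.8, size=5000)       # stands in for observations
gcm_hist = 22.0 + rng.gamma(3.0, 1.0, size=5000)  # stands in for biased GCM output

obs_params = fit_gamma3(obs)
gcm_params = fit_gamma3(gcm_hist)
corrected = quantile_map(gcm_hist, gcm_params, obs_params)

print("bias before:", gcm_hist.mean() - obs.mean())
print("bias after: ", corrected.mean() - obs.mean())
```

With moment-based shape and scale, the fitted distribution reproduces the sample mean and variance exactly, which is why the corrected series ends up close to the observations in both respects.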

Validation Framework
The corrected temperatures from QM for all clusters are compared with the observations using k-fold cross-validation to reduce the effects of the training period selection and to ensure the robustness of the assessment [44]. To calibrate the parameters, the period from 1976 to 1996 is first used as the training period. Then, as a single experiment, the temperature biases are corrected for the remaining 10 years. The training period is then advanced by one year at a time, and the BCM and validation are completed for the remaining years. This operation is performed three times. Therefore, 20 continuous 30-year periods are chosen as training periods and the 10 remaining years as validation periods. The root-mean-square error (RMSE) and mean absolute error (MAE) are used to assess the performance of the corrected temperature from QM for all clusters. Furthermore, using the baseline period of 1976 to 1996 as a reference, the future temperatures for all clusters are projected from 2006 to 2100 under RCP4.5 and RCP8.5. RMSE and MAE are also used to assess the performance of the future temperature projections for all clusters under RCP4.5 and RCP8.5.
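The two error measures used in the validation can be written out as follows. This is a small sketch using the standard definitions of RMSE and MAE; the four temperature values are invented for illustration and are not taken from the study.

```python
import numpy as np

def rmse(observed, simulated):
    """Root-mean-square error between two equally long series."""
    diff = np.asarray(simulated, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(observed, simulated):
    """Mean absolute error between two equally long series."""
    diff = np.asarray(simulated, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.mean(np.abs(diff)))

# Illustrative daily mean temperatures in degrees Celsius (made-up values).
observed = np.array([26.8, 27.1, 27.5, 28.0])
corrected = np.array([26.9, 27.0, 27.7, 27.8])

rmse_value = rmse(observed, corrected)
mae_value = mae(observed, corrected)
print(rmse_value, mae_value)
```

Both measures are non-negative and equal zero only for a perfect match; RMSE is never smaller than MAE because it weights large errors more heavily.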

Attributable Annual Deaths
The attributable annual deaths formula was used to project the impact of temperature on CVD mortality in the future (2006-2100) [45][46][47]:

$$AD = Y_0 \times ERC \times POP,$$

where $AD$ represents the attributable annual deaths, $Y_0$ represents the baseline annual mortality, $ERC$ represents the attributable change in mortality due to cardiovascular disease at each temperature, and $POP$ represents the total population. The constant values of $Y_0$ and $POP$ were obtained from the National Cardiovascular Disease Database (NCVD) and the Department of Statistics Malaysia (DOSM) in 2015, respectively. A distributed lag non-linear model (DLNM) was used to model the relationship between temperature and cardiovascular disease mortality [45][46]. DLNM has the advantage of capturing complex non-linear and lagged dependencies by modelling two functions, the exposure-response relationship and the lag-response relationship. These two relationships are then combined in the cross-basis function. A quadratic B-spline is used for the exposure-response relationship, while a natural cubic B-spline is used for the lag-response relationship [46].

Figure 1 shows the cluster dendrogram for the 45 stations throughout Peninsular Malaysia. Ward's method has the advantage of being less sensitive to the choice of the number of clusters than the k-means method [6]. However, a different number of clusters could give different results. Hence, using the optimal number of clusters with Ward's method gives more reliable results [18]. Based on Table 3, six clusters give the minimum value of the Kelley-Gardner-Sutcliffe penalty function, 19.4703, compared with 22.2270 for four clusters. Thus, six clusters is the optimum number of clusters selected in this study. Table 4 lists the number of stations in each cluster throughout Peninsular Malaysia.

Table 5 shows the descriptive statistics for the observed daily mean temperature of the six clusters throughout Peninsular Malaysia.
The results show that the mean is greater than the median, which indicates that the observed daily mean temperature is positively skewed for all clusters. This is also supported by the positive skewness values. Thus, the Gamma distribution with three parameters is used to fit the observed daily mean temperature for all clusters throughout Peninsular Malaysia. The kurtosis values for clusters 1, 4, and 5 are positive, whereas those for clusters 2, 3, and 6 are negative. A positive kurtosis indicates that a distribution is peaked and has thick tails, while a negative kurtosis indicates a flatter distribution.

Table 6 shows the correlation coefficients between and within clusters across Peninsular Malaysia. The correlation coefficient within clusters is higher than the correlation coefficient between clusters. This indicates a strong relationship within clusters and no strong relationship between clusters.

Table 7 shows the statistical analysis of the observed, historical GCM, and corrected daily mean temperature of all clusters throughout Peninsular Malaysia after QM. The results show that the mean and variance of the corrected temperature for all clusters are almost identical to those of the observations. RMSE and MAE values closer to zero indicate a smaller difference between the observed and corrected daily mean temperature and hence better simulation accuracy. Cluster 5 has lower RMSE and MAE values than clusters 1, 2, 3, 4, and 6. This shows that, in comparison with the other clusters, QM provides the most reliable results for cluster 5 [48].

Table 9 shows the RMSE and MAE of the future temperature projections for all clusters throughout Peninsular Malaysia under the RCP4.5 and RCP8.5 scenarios. For both scenarios, the RMSE and MAE values are close to zero.
This means that the future temperature projections for all clusters across Peninsular Malaysia under the RCP4.5 and RCP8.5 scenarios are reliable. Table 10 shows the estimated values of the shape $\hat{\alpha}$, scale $\hat{\beta}$, and location $\hat{\gamma}$ for the corrected temperature of all clusters in Peninsular Malaysia and for the future GCM under RCP4.5 and RCP8.5. All values of $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$ are highly significant based on the 95% confidence interval.

Figure 2 shows the monthly mean temperature of the historical GCM, observed, and corrected series for all clusters throughout Peninsular Malaysia. The lowest monthly mean temperatures were recorded in January and December. This is because November and December are associated with the northeast monsoon wet season, when the weather is cooler than in other seasons. Meanwhile, the highest monthly mean temperature was recorded in May. Peninsular Malaysia experiences the southwest monsoon between March and October; the area receives minimal rainfall during this period, which is considered a dry period [49]. Different ranges of mean temperature were observed in different clusters because of differences in their climate characteristics. The delineated temperature zones are closely linked to geography and topographic characteristics [6][23].

Figure 3 shows the boxplot of mean temperature from January to December for all clusters. The outliers are more prominent in the first half of the year than in the second half. In general, the mean temperature distribution for all clusters is positively skewed between January and June, indicating that the weather is slightly warmer during this period. However, the weather becomes somewhat cooler towards the end of the year.

Figure 4 shows the monthly mean temperature of the raw future GCM and the corrected temperature of all clusters under RCP4.5.
The projected temperature for clusters 1, 3, and 5 is higher than the raw future temperature, whereas the projected temperature for clusters 4 and 6 is lower than the raw future temperature. The projected temperature for cluster 2 is similar to the raw future temperature. All clusters recorded the highest monthly mean temperature in June and the lowest monthly mean temperature between December and January.

Figure 5 shows the monthly mean temperature of the raw future GCM and the corrected temperature of all clusters throughout Peninsular Malaysia under RCP8.5. The projected temperature for cluster 3 is slightly higher than the raw future temperature, whereas the projected temperature for clusters 1, 2, 4, and 6 is lower than the raw future temperature. The projected temperature for cluster 5 is close to the raw future temperature. The highest monthly mean temperature was recorded in June in all clusters, while the lowest monthly mean temperature was recorded between December and January in all clusters.

Table 11 shows the projected temperature increments under RCP4.5 and RCP8.5 for the six clusters throughout Peninsular Malaysia. The projected temperature increment for all clusters is 0.05°C under RCP4.5, whereas it is 0.1°C under RCP8.5. Under RCP4.5, cluster 5 has the highest ratio and cluster 1 the lowest. Under RCP8.5, all clusters have similar ratios except for cluster 1.
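The step from projected temperature to projected deaths, described in the Attributable Annual Deaths section, can be sketched as below. This assumes the formula has the multiplicative form (attributable deaths = baseline mortality × attributable change in mortality × population) that the symbol definitions suggest. The function name and every number are hypothetical placeholders, not values from NCVD, DOSM, or the fitted DLNM.

```python
def attributable_deaths(baseline_rate, erc, population):
    """Attributable annual deaths under the assumed multiplicative form.

    baseline_rate: baseline annual CVD mortality (deaths per person per year)
    erc: attributable fractional change in mortality at the projected temperature
    population: total population, held constant as in the study
    """
    return baseline_rate * erc * population

# Placeholder inputs, for illustration only.
baseline_rate = 0.001
population = 1_000_000
erc_by_scenario = {"RCP4.5": 0.02, "RCP8.5": 0.05}  # hypothetical ERC values

deaths = {
    scenario: attributable_deaths(baseline_rate, erc, population)
    for scenario, erc in erc_by_scenario.items()
}
for scenario, value in deaths.items():
    print(scenario, round(value, 1))
```

Because the baseline mortality and population are held constant, differences between scenarios in this form come entirely from the temperature-dependent term.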

Conclusion
In conclusion, Ward's method was able to obtain the optimal number of clusters in Peninsular Malaysia based on the Kelley-Gardner-Sutcliffe penalty function. QM was able to fit the temperature series and to reduce the biases in the mean and variability, supporting an accurate projection of heat-related CVD mortality. RCP8.5 has a higher CVD mortality estimate than RCP4.5. This is because RCP8.5 expects global temperature to rise by about 4.3°C, whereas RCP4.5 gives a global temperature increase of around 2°C to 3°C. However, this study has a few caveats that must be addressed. First, it corrects only the temperature bias, ignoring biases from other sources such as sample size selection. Second, the population is projected to increase in the future, whereas this study holds it constant at its 2015 value. Finally, other causes of CVD and different categories of patients, such as gender, age, and race, must be examined in a future study in order to achieve more meaningful outcomes. All of these issues are important for further study.