Application of ARIMAX Model to Forecast Weekly Cocoa Black Pod Disease Incidence

The losses caused by cocoa black pod disease around the world exceeded $400 million due to inaccurate forecasting of cocoa black pod disease incidence which leads to inappropriate spraying timing. The weekly cocoa black pod disease incidence is affected by external factors, such as climatic variables. In order to overcome this inaccuracy of spraying timing, the forecasting disease incidence should consider the influencing external factors such as temperature, rainfall and relative humidity. The objective of this study is to develop a Autoregressive Integrated Moving Average with external variables (ARIMAX) model which tries to account the effects due to the climatic influencing factors, to forecast the weekly cocoa black pod disease incidence. With respect to performance measures, it is found that the proposed ARIMAX model improves the traditional Autoregressive Integrated Moving Average (ARIMA) model. The results of this forecasting can provide benefits especially for the development of decision support system in determine the right timing of action to be taken in controlling the cocoa black pod disease.


Introduction
Black pod or Phytophthora pod rot is the most economically important and widespread disease of cocoa, Theobroma cacao L. in Malaysia. The losses due to Phytophthora exceed $400 million worldwide [1]. Among the Phytophthora spesies which attacked the cocoa, Phytophthora palmivora is the most widely distributed in the world and it caused global yield loss of 20-30% and tree deaths of 10% annually [1]. The intensity of black pod disease in cocoa was influenced by numerous climatic parameters such as rainfall, temperature and high humidity as reported by Thorold [2,3], Dakwa [4], Wood [5] and Mpika [6]. Both rainfall and relative humidity has strong correlation which encourages pathogen sporulation by reproduction of zoospores to infect the cocoa pods while the optimum temperature is propitious to black pod symptom development [2][3][4][5][6]. It is important to quantify the black pod disease fluctuations due to the real effect of climatic parameters. Understanding the effect of the climatic variables on the cocoa black pod incidence can identifying the suitable management options such as fungicide spraying and culture practices to control the disease incidence under projected climate change scenario. The time-fractional partial differential equation was widely used in study the mathematical biology including the description of plant disease epidemic [7] but need the numerical approach to solve the equation. Another approach was developing the time series model especially auto-regressive moving average (ARIMA) described by Box and Jenkins [8] that widely used in forecasting where it used the historical sequences of observations to do the forecasting. In agriculture, ARIMA model used to forecast the annual production of several crop in countries that relied the crop for daily life or economy, for example production of rice [9], wheat [10], coffee [11] and cocoa [12].
As for the disease monitoring program, ARIMA model was used to predict Botrytis cinerea spore concentrations that caused grey mould in Spain and assist in deciding number of treatments needed [13]. In cocoa, it is very useful in forecasting the cocoa black pod incidence to understand the effect of previous incidence on the current incidence. However, ARIMA model along can't quantify the effect of climate variables on the cocoa black pod disease incidence and help in decision making process. The key problem is how to incorporate the climate information into the forecasting process and subsequently into the decision making process. When an ARIMA model includes other time series as input variables, the model is referred to as an ARIMAX [14]. The application of ARIMAX model has been used in agriculture, economics and engineering as ARIMAX model provides better forecast performance than ARIMA model [15][16][17].
Therefore, the present study was undertaken with the following objectives (i) Developing the univariate ARIMA cocoa black pod disease forecasting models and (ii) Fitting ARIMAX (climate parameters as regressors) models and testing the validity of the developed models.

ARIMA Models
ARIMA models are also known as Box-Jenkins models which require historical time series data of underlying variables. There are three stages involved in time series approach, namely process of model identification, parameter estimation and model checking. In the stage of model identification, the data series is determined whether the series is stationary before the Box-Jenkins model or ARIMA model is developed. A stationary series in Box-Jenkins model will have constant mean, constant variance and constant autocorrelation. For a nonstationary series, a differencing on the non-stationary series one or more times will be carried out to achieve stationary. For example, when we applied a first difference ∆y t = y t -y t -1 on non-stationary series to achieved stationary, so y is called as integrated of order 1 or y ~ (1).

Test of Stationary Series of Data
In order to test for non-stationary, the Augmented Dickey-Fuller (ADF) test [18] is used where it test for a unit root in a time series sample. Given where μ t would be a non-stationary random walk if b = 1.
To test whether y has a unit root or non-stationary, we regress where c = b-1 and test hypothesis that c = 0 against c < 0. For non-stationary series data, the differencing function will be applied to the data sequence to transform into stationary series.

Model Identification
When the stationarity has been addressed, then we need to identify the order (the p and q) of the autoregressive and moving average terms. The tools used are the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The formula for both ACF and PACF are available in most of time series book [8]. The protocol used in identifying what autoregressive or moving average models terms are given below; a. An ACF with steadily decline and a PACF will cut off suddenly after p lags indicates an autoregressive process at lag p, AR(p). b. An ACF cuts off suddenly after q lags and a PACF with steadily dkecline indicates a moving average process at lag q, MA(q).
The ACF and the PACF both exhibiting large spikes that gradually die out indicates that there are both autoregressive and moving averages processes.

Estimation of Model Parameters
Given a tentative ARIMA (p, d, q) model as follows; which can also be written as The degree of homogeneity d is determined in the identification, and Y t is a dependent variable representing stationary series. This stage is to estimate all unkown population parameters, which are α 1 , α 2 , α p , and β 1 , β 2 , ., β q in equation (3). The least squares method is used in estimating the model parameters of α 1 , α 2 , . . ., α p , and β 1 , β 2 , . . ., β q . The estimation procedure used in the least squares method involved minimizing the sum of squared residuals, 2 1

Diagnostic Checking
Diagnostic checking in time series model is similar to the regression analysis which included testing the parameters and residual tests. Parameters testing by using the t-test is to check and retain only those estimate parameters α �(L) andˆ( ) L β whose t -ratios are significantly greater than a predetermined critical value (that is, | t | > 2 at 5% significance level). Then, the residual tests are carried out using the Akaike Information Criterion (AIC) test and the Ljung-Box test or also known as Q statistics.
The AIC test formula is calculated as: AIC(p,q) = n log(RSS/n) + 2(p + q) where n is the number of data points (observations) and RSS is the residual sums of squares. The model with smaller AIC values is considered to be the best model. Meanwhile the Ljung-Box test the magnitudes of the ˆ( ) L α 31 residuals autocorrelation for significance: H 0 : The data are independently distributed with no serial correlation H a : The data are not independently distributed which they exhibit serial correlation.
The test statistic is: where n is the sample size, 2 k ρ ρ � k 2 is the sample autocorrelation at lag k, and h is the number of lags being tested. Under H 0 , the statistic Q follows a chi-square distribution, χ 2 (k -p -q). For significance level α, the critical region for rejection of the hypothesis of randomness is Q > χ 2 ( α, k -p -q).

ARIMAX Model
An ARIMAX model is viewed as a multiple regression model with one or more autoregressive (AR) terms and/or one or more moving average (MA) terms. Autoregressive terms for a dependent variable are merely lagged values of that dependent variable that have a statistically significant relationship with its most recent value. Moving average terms is nothing more than residuals (i.e., lagged errors) resulting from previously made estimates. The general ARIMAX models are as follows: Autogressive model with exogenous variables (ARX): Moving average model with exogenous variables (MAX): Autogressive Moving Average with exogenous variables model (ARMAX): where x t represents exogenous variables, β their coefficients, ϕ(L)y t is an AR model (ϕ 1 y t -1 + ϕ 2 y t -2 + ϕ 3 y t -3 + . . . + ϕ p y t -p ) and θ(L)ε t is the MA model (θ 1 ε t -1 + θ 2 ε t -2 + θ 3 ε t -3 + . . . + θ q ε t -q ). The approach used in ARIMAX model was based on the steps described by Andrews [19] with flowchart shown in Fig. 1. The approach involved two phases where the first phase deals with a linear regression model and the second phase deals with integration of AR and MA terms into a multiple regression model. In the first phase, the linear regression is used to identify the exogenous variables that are significant. Meanwhile, the second phase deals with iterative searching process to search the order of ARIMA part of the model. When there is large number of exogenous variables to be screened, the stepwise regression process is used in selecting and introducing a new exogenous variable into the ARIMAX model. This is followed by the iteration process of finding newly AR or MA or both terms to re-establish the random pattern of residuals in the ARIMAX models.

Data Collection and Statistical Analysis
The study is conducted in the cocoa mature areas in the Cocoa Research and Development Madai, Sabah with the coordinates of latitude and longitude as 4 o 47' 10'' N and 117 o 57' 54" E. The plot's size of 1 hectare was planted with cocoa clone of PBC 123. The study plot was divided into two treatments, i.e. first treatment was the combination of fungicide, pruning and phytosanitary of diseased pods removed from the field while second treatment was the combination between fungicide and phytosanitary of diseased pods removed from the field. Each treatment has three replications with 30 trees per replication. The random sampling method was used to select 10 trees in each replication for the black pod incidence studied meanwhile the assessment of the black pod incidence was adapted from Ndoumbe-Nkeng [20] and Ngoh Dooh [21] as follows: Black pod rate in treatment n (BPTRTn) was calculated each week i, (12) BP i *100where BP i is the number of black pods observed over 30 selected trees in week i, YP i is the number of young pods over 10 selected trees in week i and AP i is the number of adult pods over 30 selected trees in week i.
The daily weather variables (rainfall, relative humidity and temperature) and the weekly black pod incidence data from the period of February 2015 to June 2016 collected. The weather data were recorded using WatchDog data loggers (1000 Series Micro Station, Spectrum Technologies, Aurora, IL, USA). The statistical tool used in this paper is EViews version 6.0 to analyse the time series data and building the ARIMA and ARIMAX model. The forecasting performance of the best-fitting between ARIMA and ARIMAX models as plant disease forecasting model are compared using three criteria, namely mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE). The dataset for model validation purpose is 23 observations (or weeks) that covered observations from July 2016 to December 2016.

Descriptive Statistics
From Fig. 1 to 4, the black pod incidence and weather data have shown volatile pattern, showing ups and downs over a period of time except rainfall shows less volatile pattern as less rain occurred during that period. The high black pod incidence is recorded to be high especially during the rainfall and relative humidity were high during earlier days which is known as lag effect. The high relative humidity was also associated with a high frequency of rainfall often favorable for the occurrence of the disease as the presence of moisture on cocoa trees from rain or dew provided ideal conditions for P. palmivora infection [4]. The lag effect of the climate change could be due to the incubation period which is the period from fruit infection to the first disease symptoms [22].

Testing the Stationarity of Black Pod Incidence
The cocoa black pod incidence data in both treatments as shown in Fig. 1were found to be stationary for both treatments based on the ADF test or unit root test given in Table 1. ADF test successfully rejected hypothesis null of unit root at the 5% significant level in both treatments.

ARIMA Modeling
Fig . 5 shows the ACF and PACF plots were cut-off after one lag in treatment 1. This suggested the presence of autoregressive models AR(1) through PACF plot and also presence of moving average models MA(1) through ACF plot. So, possible ARIMA models to be considered are ARIMA(1,0,0), ARIMA(1,0,1) and ARIMA(0,0,1). Fig. 6 shows the ACF and PACF plots with the partial autocorrelation cofficients cut-off after one lag and the autocorrelation coefficients cut-off after two lags in treatment 2.
The Ljung-Box Q test in treatment 1 gave the Ljung-Box Q statistics of ARIMA(1,0,0) and ARIMA(0,0,1) at lag 24 as 10.750 and 10.248 with p values greater than 5% significant level. This proved residuals of fitted ARIMA(1,0,0) and ARIMA(0,0,1) in treatment 1 are independently distributed with no serial correlation (Table  4). Meanwhile Ljung-Box Q test in treatment 2 gave the Ljung-Box Q statistics of ARIMA(1,0,0) and ARIMA(0,0,1) at lag 24 as 14.091 and 17.504 with p values greater than 5% significant level to show there was no autocorrelation among the residuals of fitted ARIMA(1,0,0) and ARIMA(0,0,1) ( Table 4). The best fitted ARIMA model based on the AIC test and BIC test in Table 5 (Table 5). The selected ARIMA(1,0,0) model for both treatments explained that the forecast cocoa black pod disease incidence at time t was depend on the incidence at time t -1. This was because the cocoa black pod disease incidence existed in treatments 1 and 2 mostly caused by the environment as study by Guest [23] that showed under favorable condition will caused the sporangia from diseased pod to develop within 48 hours of infection.

ARIMAX Modeling
The twelve steps given in the flowchart shown in Fig. 1 was used to build the ARIMA model. Part of the tests have been carried out in ARIMA modeling. The pair-wise Granger Causality tests in Table 6 for treatments 1 and 2 showed three null hypotheses of Granger Causality tests being rejected at 5% significant level to indicate one-way causal relationships from the exogenous variables to a dependent variable. The exogenous variables in treatment 1 were the relative humidity at time t (MeanRH t ), average temperature at time t (Tmean t ) and maximum temperature at time t (Tmax t ). As for treatment 2, the exogenous variables were the total rainfall at time t (TRainF t ), average relative humidity at time t (MeanRH t ) and maximum temperature at time t (Tmax t ). It is important to ensure the exogenous variables have correct correlation sign to the disease incidence. The Spearman correlation analysis in treatment 1 determined four climate variables have correct correlation at 5% significant level ( Table 7). The climate variables were the average relative humidity at time t -6 (MeanRH t-6 ), average temperature at time t -3 (Tmean t-3 ), maximum temperature at time t (Tmax t ) and t -2 (Tmax t-2 ). There were nine climate variables in treatment 2 have correct correlation at 5% significant level ( Table 8). The climate variables were the total rainfall at time t -4 (TRainF t -4 ) and t -5 (TRainF t-5 ), average relative humidity at time t -3 (MeanRH t -3 ), t -4 (MeanRH t-4 ), t -5 (MeanRH t-5 ), t -6 (MeanRH t-6 ) and t -7 (MeanRH t-7 ), maximum temperature at time t (Tmax t ) and t -5 (Tmax t-5 ). The stepwise regression that develop a linear regression model between the climate variables and the cocoa black pod disease incidence showed both Tmean t -3 and Tmax t variables have significant F values at 5% level in treatment 1 while only MeanRH t -4 in treatment 2 has significant F values at 5% level (Table 9).  Table 10 shows the residuals from stepwise regression model in treatments1 and 2 rejected the hypothesis null of unit root at the 5% significant level in ADF test. This indicated the residuals from the stepwise regression model in treatments 1 and 2 were stationary.  Both the Ljung-Box Q statistics in treatments 1 and 2 (Table 11), 12.289 (lag 5) and 4.218 (lag 1) have p value greater than 0.05 which indicated autocorrelation exist in the residuals at 5% significant level. This provides an indication that AR and/or MA terms must be added into the model to remove the serial correlation.  Table 12 shows the ARIMAX model in treatment 1 was established using the AR(1) term and variable Tmax t as both were significant at 5% level while ARIMAX model in treatment 2 involved the AR(1) term and variable MeanRH t -4 as both were significant at 5% level. The sign check on the coefficient of the exogenous variable as given in Table 12 correctly correlated with the disease incidence. The selection of maximum temperature as one of the predictors in ARIMAX model in treatment 1 was due to its treatment applied the pruning as one of its control measure in managing the cocoa black pod disease incidence. The pruning provided good ventilation around and within the cocoa tree which can reduced the relative humidity to less volatile compared to temperature [24]. There were study proved that the difference in maximum temperature measured outside and inside canopy layers of cocoa trees caused the maximum temperature to be an important factor in effecting the cocoa production and also the disease infection [25][26]. Meanwhile the selection of average relative humidity as one of the predictors in ARIMAX model in the treatments 2 due to the control measure applied to the cocoa trees were free from pruning with fungicide application only. The cocoa tree without pruning have dense canopy and caused little sunshine penetration that will create highly relative humidity to favor the Phytophthora's inoculums to infect the cocoa pods [5,27].

Model Validation
The best cocoa black pod disease forecasting model between ARIMA and ARIMAX model was selected based on its forecasting performance using three criteria, namely mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE). Table 13  08139. This showed ARIMAX model has performed better compared to ARIMA model and therefore, ARIMAX model was selected as the best disease forecasting model in this study. The gap between the actual and ARIMAX predicted values seen to be far especially on sample with higher black pod incidence rates as seen in Fig.7 and Fig.8. The reason probable due to nonlinear effects in cocoa black pod incidence rate and also influencing external factors that not being considered in the study, i.e. duration of pods being exposed to wetness as direct impact to the disease.

Conclusions
This study shows that there were an effect of climate variables to the cocoa black pod incidence. Key findings indicate that maximum temperature and relative humidity have significant correlation with black pod incidence and suggested as indicator in forecasting the cocoa black pod incidence. The results in this paper has found that ARIMA model with climate variables as input series i.e. ARIMAX consistently showed the superiority over ARIMA models in capturing the percent relative deviations pertaining to cocoa black pod incidence forecasts in two different treatment plots. The ARIMAX models performed well with lower error as compared to the ARIMA models. The plant disease forecasting models developed from ARIMAX model able to provide benefits especially for the development of decision support system in determine the right timing of action to be taken in controlling the cocoa black pod disease.