A Hybrid Seasonal Box Jenkins-ANN Approach for Water Level Forecasting in Thailand

Every year, many basins in Thailand face recurring droughts and floods that have a great impact on the agricultural sector. To reduce this impact, water management should be applied to critical basins such as the Yom River basin. An important task in water management is the quantitative prediction of water, expressed here as water level. This study proposes a hybrid forecasting model that combines a stochastic approach, the seasonal autoregressive integrated moving average (SARIMA) model, with a machine learning approach, the artificial neural network (ANN). The proposed hybrid, called the seasonal autoregressive integrated moving average and artificial neural network (SARIMANN) model, is applied to the average monthly water level (AMWL) time series of the Yom River basin. The study period is April 2007 to March 2020, covering thirteen hydrological years. Forecasting performance is judged by the minimum root mean squared error (RMSE) and mean absolute percentage error (MAPE) among the SARIMA, ANN, and SARIMANN models. The results indicate that the three models give similar RMSE and MAPE at all four water level measurement stations in both the wet and dry seasons. The SARIMA model is the best approach for Y.31 Station [Wet Season], Y.20 Station [Wet Season], Y.37 Station [Wet Season], Y.31 Station [Dry Season], Y.20 Station [Dry Season], and Y.1C Station [Dry Season]; the ANN model is the best method for Y.37 Station [Dry Season]; and the SARIMANN model is the best approach for Y.1C Station [Wet Season]. All methods deliver similar results in the dry season, while both SARIMA and SARIMANN are better than ANN in the wet season by RMSE at all stations. Even though the downstream station is affected by many disturbances, its forecasts are still more accurate than those for the upstream station. This is visible evidence that the stochastic-based models, SARIMA and SARIMANN, proposed in this study are appropriate for highly fluctuating series. Furthermore, dry season forecasting is more accurate than wet season forecasting.


Introduction
Among the 25 main basins of Thailand, the Yom River basin is located in Northern Thailand and has a tropical wet and dry, or savanna, climate throughout the year. The Land Development Department reported in 2013 that the Yom River basin has an area of 907.75 km² with fewer than 4 droughts per ten years, 701.3 km² with 4-5 droughts per ten years, and 229.34 km² with over 5 droughts per ten years. These areas account for 49.38, 38.15, and 12.34 percent of the total area with repeated drought, and represent 3.77, 2.92, and 0.95 percent of the Yom River basin area. The Yom River basin also has an area of 2,454.06 km² with fewer than 4 floods per ten years, 1,334.25 km² with 4-7 floods per ten years, and 541.90 km² with over 8 floods per ten years. These areas account for 56.67, 30.81, and 12.51 percent of the total area with repeated flood, and represent 10.21, 5.55, and 2.25 percent of the Yom River basin area [1,2,3,4].
The streamflow of a river integrates climatic factors and precipitation. Changes in streamflow may be caused by climate change and by disturbance from human activities, which complicates hydrological modeling [4,5,6,7].
Sopipan, 2014 [8] studied the forecasting of monthly rainfall from April 2005 to March 2013 in Nakhon Ratchasima Province, Thailand, using the autoregressive integrated moving average (ARIMA) model and the multiplicative Holt-Winters method. The mean absolute percentage error (MAPE), mean squared error (MSE), and mean absolute error (MAE) were used to measure performance. Forecasts from both methods were found to be acceptable, but ARIMA gave a better result for that case. Fashae et al., 2018 [9] compared the artificial neural network (ANN) and ARIMA models of River Opeki discharge in Oyo State, Nigeria, from 1982 to 2010 and used the best model to forecast the discharge of the river from 2010 to 2020. The performance of the two models was assessed with the root mean square error (RMSE) and the correlation coefficient. The result showed that ARIMA performs better in terms of RMSE and has a higher correlation coefficient. Pongsiri, 2550 [10] compared the forecasting accuracy of the ARIMA model, the ANN model, and the hybrid ARIMA-ANN model on the daily closing price of the Trans-Pacific Partnership (TPP) time series from 2005 to 2007. The accuracy of the three models was assessed with the MSE, MAE, and MAPE. The ARIMA model is suitable for time series with a linear structure, while some machine learning models, e.g. the ANN model, are able to describe non-linear structure, as are, among stochastic approaches, the tree-based threshold model [22] and nonlinear principal component analysis [21]. A hybrid ARIMA-ANN model should therefore be able to describe time series data with both linear and non-linear structure. The result showed that the hybrid ARIMA-ANN model provided more accurate short-term (7-day) forecasts than the ARIMA and ANN models, but for long-term (30-day) forecasts the ANN model was the most accurate.
This study proposes a hybrid forecasting model that combines a stochastic approach, the seasonal autoregressive integrated moving average (SARIMA) model, with a machine learning approach, the artificial neural network (ANN). The proposed hybrid, called the seasonal autoregressive integrated moving average and artificial neural network (SARIMANN) model, is applied to the average monthly water level (AMWL) time series of the Yom River basin. The study period is April 2007 to March 2020, covering thirteen hydrological years. Forecasting performance is judged by the minimum root mean squared error (RMSE) and mean absolute percentage error (MAPE) among the SARIMA, ANN, and SARIMANN models.

Study Region and Dataset
The Yom River basin is located in Northern Thailand. It covers a surface area of approximately 24,046.89 km², between latitudes 14°50' N and 18°25' N and longitudes 99°16' E and 100°40' E. The basin consists of 11 major sub-river basins and administratively covers 11 provinces, 45 districts, 286 sub-districts, and 2,028 villages [4,11], as shown in Figure 1.
In 2014, the Yom River basin received an average annual precipitation of about 1,179 mm and an average annual runoff of about 5,261 million m³, which corresponds to fewer than 2,500 m³ per person per year and is less than the average annual runoff per person per year in Thailand (3,496 m³). In 2019, the Yom River basin had one large-sized reservoir and five medium-sized reservoirs with a total storage capacity of 295.62 million m³ [1,4,11,12] [13,14,15,16,17], as shown in Figure 1. The AMWL data at the four previously listed water level measurement stations were calculated for the wet and dry seasons of Thailand; the wet season is from May to October and the dry season is from November to April.
The multiplicative autoregressive integrated moving average (ARIMA) model is a universal and widely used model in time series analysis. An ARIMA model combines autoregressive (AR) and moving average (MA) parts with differencing. Generally, an ARIMA model is denoted by ARIMA$(p, d, q)$, where $p$, $d$, and $q$ are non-negative integers. In this notation, $p$ refers to the autoregressive (AR) part, $d$ refers to the order of regular differencing, and $q$ refers to the moving average (MA) part [18,19]. The back-shift operator $B$ is defined by $B^k Z_t = Z_{t-k}$, where $k$ is a positive integer. The multiplicative SARIMA$(p, d, q) \times (P, D, Q)_s$ model is represented by:
$$\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D Z_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t$$
where $Z_t$ is the observation at time $t$, $s$ is the seasonal period, and $\varepsilon_t$ is an independent random variable that represents the error term at time $t$.
The multiplicative SARIMA models are a form of the Box-Jenkins methodology. They require the time series $Z_t$ to be stationary, and the independent random variable ($\varepsilon_t$) must qualify as white noise.
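The regular and seasonal differencing operators used to make a series stationary can be illustrated with a minimal Python sketch. The function names `difference` and `sarima_difference` and the toy series are assumptions made for illustration; they are not the study's code or data.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag) once: z_t - z_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=0, D=0, s=12):
    """Apply (1 - B)^d (1 - B^s)^D to a series (hypothetical helper)."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=s)   # seasonal differencing
    for _ in range(d):
        out = difference(out, lag=1)   # regular differencing
    return out


# Toy monthly series (not AMWL data): linear trend plus a repeating
# 12-month pattern.
pattern = [0, 1, 3, 6, 8, 9, 8, 6, 3, 1, 0, 0]
series = [0.5 * t + pattern[t % 12] for t in range(36)]

# Seasonal differencing removes the 12-month pattern and leaves a constant.
seasonal = sarima_difference(series, D=1, s=12)
# One further regular difference removes that constant as well.
both = sarima_difference(series, d=1, D=1, s=12)
```

Each differencing pass shortens the series by its lag, so a short seasonal record loses a full season of observations for every seasonal difference.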

Artificial Neural Network Model (ANN Model)
An artificial neural network (ANN) is a foundation of artificial intelligence (AI): a computing system designed to simulate the way the human brain analyzes and processes information. ANNs have self-learning capabilities and two properties: learning and recall. The multi-layer perceptron neural network, or feed-forward backpropagation neural network, uses a supervised learning technique, which requires an input set and a target output as a training pair. Network training usually uses multiple training pairs. During training, the network generates an actual output that differs from the target output, and this difference is the error value. The network learns by adjusting the weights to minimize the error between the actual output and the target output. The weights are adjusted in small increments by presenting the data one item at a time, repeatedly, until the weights in the network converge. Its greatest strength is in non-linear solutions to ill-defined problems [20]. The multi-layer perceptron neural network is shown in Figure 2.
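The training procedure described above can be sketched in plain Python as a one-hidden-layer perceptron with sigmoid hidden nodes and a single linear output node, trained one sample at a time. The network size, the weight initialization, the toy sine series, and the names `train_mlp` and `sigmoid` are assumptions made for illustration; the default learning rate of 0.3 and 500 repetitions follow the values reported with the network weights in the Results.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_mlp(samples, n_hidden=3, lr=0.3, epochs=500, seed=0):
    """Train a one-hidden-layer perceptron by online backpropagation.

    samples: list of (inputs, target) pairs, inputs being a list of floats.
    Returns (predict, losses); losses[e] is the mean squared error
    observed while passing through the data in epoch e.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # The last entry of each weight row is the bias weight.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
        y = sum(w * v for w, v in zip(w_out, h + [1.0]))  # linear output
        return xb, h, y

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            xb, h, y = forward(x)
            err = target - y
            total += err * err
            # Hidden deltas use the output weights before they are updated.
            deltas = [err * w_out[j] * h[j] * (1.0 - h[j])
                      for j in range(n_hidden)]
            for j, v in enumerate(h + [1.0]):
                w_out[j] += lr * err * v
            for j in range(n_hidden):
                for i, v in enumerate(xb):
                    w_hid[j][i] += lr * deltas[j] * v
        losses.append(total / len(samples))

    return (lambda x: forward(x)[2]), losses


# Toy seasonal series scaled to [0.1, 0.9] (not AMWL data); the inputs are
# the values 1 and 12 steps back, the target is the current value.
series = [0.5 + 0.4 * math.sin(2 * math.pi * t / 12) for t in range(72)]
samples = [([series[t - 1], series[t - 12]], series[t]) for t in range(12, 72)]
predict, losses = train_mlp(samples)
```

Scaling the inputs to a bounded range, as done here, keeps the sigmoid nodes away from saturation and helps a learning rate as large as 0.3 remain stable.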
In Figure 2, each node of the input layer carries one item of input data. Each node of the hidden layer produces its output, after weight adjustment, using the sigmoid activation function. Each node of the output layer produces the actual output, after weight adjustment, using the linear activation function. The actual output is compared with the target output at that node to give the error of the output layer. The network has one set of weights between the input layer and the hidden layer and another between the hidden layer and the output layer.
The SARIMA models are a stochastic approach suited to linear variation with seasonal and trend components, but they cannot manage non-linear data such as hydrological data. In contrast, ANN models are a machine learning approach suited to non-linear situations. Therefore, we hybridize the seasonal autoregressive integrated moving average model and the artificial neural network model to deal with linear and non-linear situations simultaneously; the result is called the SARIMANN model. The SARIMANN model is created by taking the input variables obtained from the best-fit SARIMA models and using them as the input data of the ANN model, as shown in Figure 3.
In the SARIMA model notation, $\phi$ is the parameter of the autoregressive (AR) part of order $p$, $\Phi$ is the parameter of the seasonal autoregressive (SAR) part of order $P$, $\theta$ is the parameter of the moving average (MA) part of order $q$, $\Theta$ is the parameter of the seasonal moving average (SMA) part of order $Q$, $(1 - B)^d$ is the differencing (a.k.a. integrated, denoted by I) process of order $d$, and $(1 - B^s)^D$ is the seasonal differencing (a.k.a. seasonal integrated, denoted by SI) process of order $D$.
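One way to picture the hybrid construction is to derive the ANN input lags from the orders of a fitted SARIMA model. The sketch below assumes, which the text does not state, that the inputs are the lagged observations implied by the AR and seasonal AR orders of the multiplicative model; moving average terms are ignored. The names `sarima_ar_lags` and `lagged_samples` are hypothetical, and the orders and data are placeholders.

```python
def sarima_ar_lags(p, P, s):
    """Lags appearing in the product of the AR and seasonal AR polynomials.

    Expanding a degree-p polynomial in B times a degree-P polynomial in B^s
    gives terms B^(i + j*s) for i in 0..p and j in 0..P; lag 0 is dropped.
    """
    lags = {i + j * s for i in range(p + 1) for j in range(P + 1)}
    lags.discard(0)
    return sorted(lags)


def lagged_samples(series, lags):
    """Build (inputs, target) training pairs from the chosen lags."""
    start = max(lags)
    return [([series[t - k] for k in lags], series[t])
            for t in range(start, len(series))]


# Orders (p, P, s) = (1, 1, 12) are used purely as an illustration.
lags = sarima_ar_lags(p=1, P=1, s=12)
series = list(range(20))            # placeholder data, not AMWL observations
samples = lagged_samples(series, lags)
```

The resulting pairs have the same shape as the training pairs of a supervised network, so they could be passed to any feed-forward network trainer.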

Accuracy and Performance
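If the mean absolute percentage error (MAPE) and the root mean square error (RMSE) values are small, the forecast value of AMWL is highly accurate. They are defined as
$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{Z_t - \hat{Z}_t}{Z_t}\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(Z_t - \hat{Z}_t\right)^2}$$
where $\hat{Z}_t$ is the forecast value of the AMWL data at time $t$, $t = 1, 2, \ldots, n$; $Z_t$ is the observed value of the AMWL data at time $t$; and $n$ is the number of observations of AMWL data.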
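A minimal Python sketch of the two accuracy measures follows, using their standard definitions. The function names and the two-point example values are illustrative assumptions, not study results.

```python
import math


def rmse(observed, forecast):
    """Root mean square error between observations and forecasts."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, forecast)) / n)


def mape(observed, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any observation is zero.
    """
    n = len(observed)
    return 100.0 / n * sum(abs((o - f) / o) for o, f in zip(observed, forecast))


observed = [2.0, 4.0]   # made-up values for illustration
forecast = [1.0, 2.0]
```

RMSE is in the units of the water level and weights large errors more heavily, while MAPE is scale-free, which is why the two measures can rank models differently.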
Cross-validation is one of the methods for dividing data in order to test the performance of a variety of prediction models. The basic concept of cross-validation is to split the data into multiple parts and use some of the parts to predict the others. K-fold cross-validation is the most popular cross-validation procedure.
In K-fold cross-validation, first, the original data are partitioned into K equal-sized subsamples. Second, K−1 subsamples are treated as the training data, i.e. the model fitting dataset, and the remaining subsample as the testing dataset, i.e. the data that are compared with the predicted values from the trained model. Finally, this process is repeated K times, once for each of the K subsamples, and the performance is averaged over the K results [20].
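The three steps of K-fold cross-validation can be sketched in Python as follows. The names `k_fold_indices` and `cross_validate` are hypothetical, the folds are taken as contiguous blocks of indices, and when the sample size is not divisible by K the first folds are given one extra element; these are assumptions made for illustration, not details from the study.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]


def cross_validate(n, k, fit_and_score):
    """Average a score over k folds.

    fit_and_score(train, test) fits on the train indices and returns a
    float score (for example an RMSE) computed on the test indices.
    """
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / k
```

For time series, a plain K-fold split can place training observations later in time than the test observations, so the choice of fold layout is worth checking against the intended forecasting use.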

Results
Time series plots of the AMWL data at the four water level measurement stations show relatively clear seasonal variation in both the wet season and the dry season. There is a peak in the AMWL every hydrological year. The peak and the spread of the data are smaller in the dry season than in the wet season, as shown in Figure 4 and Figure 5.
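The split of the monthly series into wet season (May to October) and dry season (November to April) can be sketched in Python as below. The record layout `(year, month, level)`, the function names, and the labeling of a hydrological year by the calendar year in which it starts (April) are assumptions made for illustration.

```python
def season_of(month):
    """Wet season: May to October; dry season: November to April."""
    return "wet" if 5 <= month <= 10 else "dry"


def hydrological_year(year, month):
    """Label a month by the April-to-March year it falls in (start year)."""
    return year if month >= 4 else year - 1


def split_by_season(records):
    """Group (year, month, level) records into wet and dry season lists."""
    out = {"wet": [], "dry": []}
    for year, month, level in records:
        out[season_of(month)].append((year, month, level))
    return out


# All months in the study period, April 2007 to March 2020.
months = [(y, m) for y in range(2007, 2021) for m in range(1, 13)
          if (2007, 4) <= (y, m) <= (2020, 3)]
hydro_years = sorted({hydrological_year(y, m) for y, m in months})
```

Under these assumptions the study period contains 156 months, which group into thirteen hydrological years.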

Hybrid between Seasonal Autoregressive Integrated Moving Average and Artificial Neural Network Model (SARIMANN Model)
Based on the twenty-four possible SARIMA models, eight best-fit SARIMA models were selected for the AMWL data, covering all four water level measurement stations for the wet and dry seasons. The parameter estimates (coefficients) of the SARIMA models are shown in Table 1.
The parameter estimates of the SARIMA models of AMWL data at all four water level measurement stations for the wet and dry seasons were statistically significant (the p-value is less than the significance level of 0.01). Therefore, the parameter estimates were included in the models, as shown in Table 1.
The input data and weights of the ANN models of AMWL data at all four water level measurement stations for the wet and dry seasons are shown in Table 2, and those of the SARIMANN models are shown in Table 3. The network weights for both models are shown in Figure 6, where the learning rate is 0.3 with 500 repetitions. Results for the three models are given in Table 4 and Figure 7. The forecasts of the three methods for the AMWL data at all four water level measurement stations are shown in Figure 8 for the wet season and in Figure 9 for the dry season.

Conclusion and Discussion
The AMWL time series from April 2007 to March 2020 at four water level measurement stations (Y.31, Y.20, Y.1C, Y.37) for the wet and dry seasons in the Yom River basin were analyzed. The following conclusions and recommendations can be drawn from this study:
Comparisons of the three models reveal similar RMSE and MAPE at all four water level measurement stations, for both wet season forecasting (six months: May 2019 - October 2019) and dry season forecasting (six months: November 2019 - April 2020).
In the dry season, all methods perform similarly, while in the wet season SARIMA and SARIMANN clearly outperform the ANN method by RMSE at all stations from upstream to downstream.
The downstream station (Y.37 station) is forecast more accurately than the upstream station (Y.31 station), even though the downstream is affected by human activities (reservoirs, water use for agriculture and consumption) that disturb the hydrology, cause data uncertainty, and make forecasting harder. This is clear evidence that the stochastic-based models, SARIMA and SARIMANN, proposed in this study are suitable for handling highly fluctuating series.
Dry season forecasting is more accurate than wet season forecasting at all four water level measurement stations, because the seasonal variation of the AMWL time series is more complicated in the wet season than in the dry season and the dispersion of the data is wider in the wet season than in the dry season.
A possible further study, in a hybridized manner, for handling extreme values, which are often nonstationary, is the tree-based threshold model [22].
However, the proposed method, SARIMANN, is as suitable for the wet season as SARIMA and is clearly better than ANN for both upstream and downstream stations, especially downstream. The proposed model is therefore applicable to highly disturbed situations in any stream.
In the dry season, however, the proposed method performs much like SARIMA and ANN for both upstream and downstream stations, which may be a limitation of the proposed model. Prospective researchers should therefore try other approaches in future studies, for instance frequency-domain time series analysis or fuzzy time series analysis.