Time Series Forecasting with Trend and Seasonal Patterns using NARX Network Ensembles

In this research, we propose a Nonlinear Auto-Regressive network with exogenous inputs (NARX) model with a different approach, namely determining the main input variables using stepwise regression and the exogenous inputs using a deterministic seasonal dummy. There are two approaches to making a deterministic seasonal dummy, namely the binary and the sine-cosine dummy variables. The number of neurons in the hidden layer is set to approximately half the number of input variables plus one. Furthermore, the resilient backpropagation learning algorithm and the hyperbolic tangent activation function were used to train each network. Three ensemble operators are used, namely mean, median, and mode, to solve the overfitting problem and the single NARX model's weaknesses. Furthermore, we provide an empirical study using actual data, where forecasting accuracy is measured by the Mean Absolute Percent Error (MAPE). The empirical study shows that the NARX model with binary dummy exogenous inputs is the most accurate for data with a trend and a multiplicative seasonal pattern. For data with a trend and an additive seasonal pattern, the NARX model with sine-cosine dummy exogenous inputs is more accurate, except when the mean ensemble operator is used. For data with a trend and no seasonal pattern, the most accurate NARX model is obtained using the mean ensemble operator. This research also shows that the median and mode ensemble operators, which are rarely used, are more accurate than the mean ensemble operator for data with trend and seasonal patterns. The median ensemble operator requires the least average computation time, followed by the mode ensemble operator. Moreover, the accuracy of all our proposed NARX models consistently outperforms the exponential smoothing and ARIMA methods.


Introduction
Time series data forecasting is mainly done using statistical methods, namely the Auto-Regressive Integrated Moving Average (ARIMA) or Box-Jenkins method, Holt exponential smoothing, simple exponential smoothing, and multiplicative or additive Holt-Winter exponential smoothing. The choice of method depends on various aspects, namely time, data patterns, the type of system model being observed, the desired level of forecasting accuracy, and others. However, these statistical methods are limited to linear models [1,2,3]. Meanwhile, in many cases, time series data have a nonlinear tendency. Descriptions and applications of nonlinearity tests can be found in [4,5].
Recently, a more flexible approach has been developed for modelling linear and nonlinear relationships, namely the Neural Networks (NN) model. The NN model is a widely used alternative because it does not require assumptions on the data that are often difficult to fulfil in linear models [6,7]. The most popular NN model for modelling and forecasting time series data is the Nonlinear Auto-Regressive Neural Network (NARNN) model, also known as the Feed-Forward Neural Network (FFNN), Back-Propagation Neural Network (BPNN), or Multi-Layer Perceptron (MLP) [8,9]. On the other hand, the NN model is an unstable technique, because a small change can result in a significant change in the forecasting output. Many experiments have shown that the generalization results of a single NN model are not unique, meaning that a single NN model's solution is unstable. One alternative solution assumes that a discarded NN model with inaccurate performance still has potential as a candidate member of a combined model. Combining several NN models is one of the effective ways of reducing the overfitting problem and the weaknesses of a single NN model in model selection. This combination of NN models is known as the ensemble NN [10,11].
In this research, we propose an extension of the previous results of [11] on four sides. First, the forecasting model is considered for a more general class, namely the Nonlinear Auto-Regressive network with exogenous inputs (NARX) model, whereas previously the NARNN model was applied. Second, the NARX model focuses on monthly time series data comprising seasonal and trend patterns. The Cox-Stuart test was applied to a 12-period centered moving average of the time series to identify the presence of a trend; the test was carried out on the centered moving average to smooth out seasonal irregularities. Furthermore, the de-trended time series data were used to calculate and identify the seasonal indices, with a Friedman test used to determine their significant deviations. Seasonality or trend identified in the time series data is eliminated by seasonal or first differencing, in accordance with the research carried out by [12]. The time series data are then scaled linearly between -0.8 and 0.8 to facilitate the NARX model training. Besides, seasonal information is captured using a deterministic seasonal dummy as exogenous input variables. Determination of the main input variables is done in the same way as in [11], through a stepwise regression approach. Here, there are two approaches to making a deterministic seasonal dummy, namely the binary and the sine-cosine dummy variables.
Third, the NARX model has one hidden layer, with the number of neurons in the hidden layer equal to half the number of input variables plus one. A similar approach is applied by [13] for the NARNN model. The hyperbolic tangent activation function and the resilient back-propagation learning algorithm were used to train the NARX model. Fourth, the NARX model uses three ensemble operators, namely mean, median, and mode, to overcome the overfitting problems and weaknesses of a single NARX model. Different ensemble sizes were evaluated, from 10 networks to 100 networks. Results for a single NARX model selected as the best were not provided in this research, as there is ample evidence in the literature that the ensemble is superior; see e.g. [14,15]. This paper is structured as follows: this section explains our paper's background and contribution. Section 2 provides a brief description of the theories required for a complete description of the paper. Section 3 provides an empirical study with actual data, namely the number of international airline passengers, the number of accidental deaths in the United States of America (USA), and Indonesian inflation rates. Forecasting accuracy is measured using the Mean Absolute Percent Error (MAPE) value. Comparisons of MAPE values were carried out against the ARIMA and exponential smoothing methods, described in [16], to assess the improvement obtained by forecasting with an ensemble of NARX models.

Neural Networks
Neural Networks (NNs) have been successfully applied in forecasting univariate and multivariate time series. The Multi-Layer Perceptron (MLP) is the most utilized architecture, capable of representing linear and nonlinear functional relationships without preliminary assumptions on the underlying data generation process. Generally, MLPs are used to model Nonlinear Auto-Regressive NAR(p) and NARX(p) processes. These models produce forecasts with feed-forward architectures, while external variables are used to code exogenous events. For a time series y at a point in time t, a one-step-ahead forecast ŷ_{t+1} is determined using p = I observations y_t, y_{t−1}, ..., y_{t−I+1} from the I preceding points in time t, t−1, ..., t−I+1, with I representing the number of input units of the MLP. The functional form is as follows:

ŷ_{t+1} = f(Y, w) = β_0 + Σ_{h=1}^{H} β_h · g( γ_{0h} + Σ_{i=1}^{I} γ_{hi} y_{t−i+1} ),

where Y = [y_t, ..., y_{t−I+1}] is the vector of the observed input time series. The network weights are w = (β, γ), with γ = [γ_{11}, γ_{12}, ..., γ_{HI}] and β = [β_1, β_2, ..., β_H] for the input-to-hidden and hidden-to-output layers, respectively. The β_0 and γ_{0h} are the biases of each neuron, while I and H are the numbers of input and hidden units in the network. The g is a nonlinear transfer function, typically the logistic or hyperbolic tangent function [11].
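To make the notation concrete, the following minimal NumPy sketch computes the one-step-ahead MLP forecast described above. The weights are hypothetical illustration values, not the authors' trained network.

```python
import numpy as np

def mlp_forecast(y_lags, beta0, beta, gamma0, gamma, g=np.tanh):
    """One-step-ahead forecast of an MLP/NAR model.

    y_lags : the I most recent observations [y_t, ..., y_{t-I+1}]
    gamma  : (H, I) input-to-hidden weights; gamma0: H hidden biases
    beta   : H hidden-to-output weights; beta0: output bias
    g      : nonlinear transfer function (hyperbolic tangent here)
    """
    hidden = g(gamma0 + gamma @ y_lags)  # H hidden-unit activations
    return float(beta0 + beta @ hidden)

# Toy example with hypothetical weights (I = 3 inputs, H = 2 hidden units)
y_lags = np.array([0.5, 0.3, 0.1])
gamma = np.array([[0.2, -0.1, 0.4], [0.1, 0.3, -0.2]])
gamma0 = np.array([0.0, 0.1])
beta = np.array([0.5, -0.3])
print(mlp_forecast(y_lags, 0.05, beta, gamma0, gamma))
```

With a bounded transfer function such as tanh, the forecast is a bias plus a bounded combination of hidden activations, which is why the input series is scaled before training.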

NARX Model
The Nonlinear Auto-Regressive model with exogenous input (NARX) has been reported to be very essential for discrete-time nonlinear systems and is defined using the following mathematical relationship [17]:

y(t+1) = f( y(t), ..., y(t−n_y+1); u(t), ..., u(t−n_u+1); w ),

where u(t) and y(t) indicate the model's input and output at time t, respectively, and n_u ≥ 1 and n_y ≥ 1 (n_y ≥ n_u) represent the input- and output-memory orders. Here w is the weights matrix and f is the nonlinear function estimated by a Multi-Layer Perceptron (MLP) [18]. Basically, the NARX network is prepared using one of two models [19]. The first is the parallel architecture (or with feedback), where the output is fed back into the feed-forward neural network's input:

ŷ(t+1) = f̂( ŷ(t), ..., ŷ(t−n_y+1); u(t), ..., u(t−n_u+1); w ) = f̂( y_p(t); u(t); w ).

The second is the series-parallel architecture (or parallel architecture without feedback), where the output's regressors are formed only from the actual output values:

ŷ(t+1) = f̂( y(t), ..., y(t−n_y+1); u(t), ..., u(t−n_u+1); w ) = f̂( y_sp(t); u(t); w ).

As previously stated, the NARX network inputs include regressors of both the system's inputs and outputs, while in time-series forecasting only the output channel is available, without a measured input. The model's ability to forecast is therefore limited for time-series data without input regressors, because eliminating the tapped-delay line over the input signal reduces the NARX network to a plain focused time-delay neural network architecture [20,21]. According to [22] and [23], a simple strategy was proposed, in line with the Takens Embedding Theorem, as a solution to this problem; it allows full exploitation of the original NARX network's computational abilities in predicting nonlinear time series. First, the input signal regressors, denoted by u(t), are defined through the delay-embedding coordinates:

u(t) = [ y(t), y(t−τ), ..., y(t−(d_E−1)τ) ],

where d_E = n_u is the embedding dimension and τ is the embedding delay.
Secondly, since the NARX network can be trained in two different architectures, the output signal regressors y(t) are given by the following relationships:

y_sp(t) = [ y(t), ..., y(t−n_y+1) ]   (without feedback),
y_p(t) = [ ŷ(t), ..., ŷ(t−n_y+1) ]   (with feedback),

where the output regressor for the architecture without feedback contains n_y actual past values of the time series, while that for the architecture with feedback contains n_y estimated ones. For an effectively trained network, these estimated values ŷ(t+1) are required to follow the prognostic relationships of the NARX network:

ŷ(t+1) = f̂( y_sp(t); u(t); w )   and   ŷ(t+1) = f̂( y_p(t); u(t); w ).

The NARX networks trained according to these relationships are referred to as the NARX-SP and NARX-P networks, respectively. This research focused on NARX models with the series-parallel architecture, i.e., the parallel architecture without feedback.
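The difference between the two architectures is easiest to see in how a multi-step forecast is produced. The sketch below is schematic: the linear `narx_step` is a stand-in for the trained network f̂, not the paper's model.

```python
import numpy as np

def narx_step(y_reg, u_reg, w):
    # Stand-in for the trained network f-hat; a simple linear map for illustration.
    wy, wu = w
    return float(np.dot(wy, y_reg) + np.dot(wu, u_reg))

def forecast_parallel(y_init, u_seq, w, steps):
    """NARX-P (with feedback): estimated outputs are fed back as regressors."""
    y_reg = list(y_init)                # [y(t), ..., y(t - n_y + 1)]
    preds = []
    for k in range(steps):
        y_hat = narx_step(np.array(y_reg), np.asarray(u_seq[k]), w)
        preds.append(y_hat)
        y_reg = [y_hat] + y_reg[:-1]    # shift in the estimate, not the actual value
    return preds

# In the series-parallel (NARX-SP) architecture the actual observation would be
# shifted into y_reg instead, so one-step-ahead errors do not accumulate.
w = (np.array([0.5, 0.5]), np.array([1.0]))
print(forecast_parallel([1.0, 1.0], [[0.0], [0.0], [0.0]], w, 3))
```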

Exogenous Dummy Variables
According to [24], creating additional features in the form of dummy variables is one of the methods used to capture deterministic seasonality without selecting lagged inputs. The conventional approach used to model deterministic seasonal patterns is S − 1 binary dummy variables at each period t, where S denotes the seasonal length; for high-frequency data with long seasonalities, this requires S − 1 additional input time series. Alternatively, a set of sine and cosine dummy variables captures the deterministic seasonal elements of the time series [25]. Two inputs, x_{s,1} and x_{s,2}, are used to encode seasonality, created using sine and cosine functions for an explicit representation of the position within an identified seasonality of length s (see e.g. [24,25]), with

x_{s,1} = sin(2πt/s),   x_{s,2} = cos(2πt/s).

Furthermore, such sine-cosine pairs x_{s,1} and x_{s,2} can be constructed for each s to form the input vector for long and multiple seasonalities. This study applied these two conventional approaches as exogenous input variables in the NARX model.
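As an illustration, both encodings can be sketched for monthly data as follows; the function names and the choice of reference season are ours, not from the paper.

```python
import numpy as np

def binary_seasonal_dummies(t, s=12):
    """S-1 binary dummies for period index t (1-based); the last season is the reference."""
    d = np.zeros(s - 1)
    pos = (t - 1) % s              # position within the season, 0 .. s-1
    if pos < s - 1:                # the reference season is encoded as all zeros
        d[pos] = 1.0
    return d

def sincos_seasonal_dummies(t, s=12):
    """A single sine-cosine pair encoding the position within a season of length s."""
    return np.array([np.sin(2 * np.pi * t / s), np.cos(2 * np.pi * t / s)])

print(binary_seasonal_dummies(1))    # January: first dummy active
print(sincos_seasonal_dummies(3))    # March of a 12-month season
```

Note the trade-off the text describes: the binary encoding adds s − 1 input variables per seasonality, while the sine-cosine encoding adds only two.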

NARX Modelling
NARX modelling generally consists of five steps.
• Step 1: Preprocess the time series: check for seasonality and trend, and apply seasonal differencing or first differencing if the time series contains a seasonal or trend pattern. The time series is further scaled into the range [−0.8, 0.8].
• Step 2: The following strategy was used to obtain the autoregressive lags for the main time series and the exogenous variables. If the main time series frequency equals m, all lags from 1 to m were considered as candidate lags; for example, for monthly data, lags 1 to 12 were considered. The lag order of the exogenous variables was assumed to equal that of the main time series, and the significant lags were selected using the stepwise AIC method. Here, the exogenous variables are the seasonal dummy variables, with two approaches to making them, namely the binary and the sine-cosine dummy variables.
• Step 3: The number of neurons in the hidden layer was chosen to be half of the total input in Step 2 plus 1.
• Step 4: The NARX networks obtained in Step 3 were fitted from 10 to 100 times using different random starting weights, and the forecasts obtained were combined using the ensemble approach (mean, median, and mode) to produce the final forecast.
• Step 5: The recursive or iterative strategy was used for multi-step ahead forecasts.
Further detail, which also shows the above algorithm's applicability for the NARNN model, is provided in [11].
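The linear scaling of Step 1 can be sketched as a min-max mapping into [−0.8, 0.8]; we assume this is the intended transformation, together with its inverse for back-transforming forecasts to the original units.

```python
import numpy as np

def scale_series(y, lo=-0.8, hi=0.8):
    """Linearly map a series into [lo, hi]; also return the inverse mapping."""
    y = np.asarray(y, dtype=float)
    y_min, y_max = y.min(), y.max()
    scaled = lo + (y - y_min) * (hi - lo) / (y_max - y_min)

    def inverse(z):
        # Back-transform scaled values (e.g., forecasts) to the original units.
        return y_min + (np.asarray(z, dtype=float) - lo) * (y_max - y_min) / (hi - lo)

    return scaled, inverse

scaled, inv = scale_series([112, 118, 132, 129, 121])
print(scaled.min(), scaled.max())
```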


Ensemble Operators
Let ŷ_mt denote the forecast from model m for time t, where m = 1, ..., M and M represents the number of forecasts available for combination into an ensemble forecast ỹ_t. This section discusses the formation of ỹ_t through the use of the mean, median, and mode operators; a unimodal distribution is assumed to achieve dependable results.

Mean Ensemble
The mean ensemble is one of the prevalent ways to determine the central tendency of a set of data, weighted or unweighted. Let w_m denote the forecast's weight from model m, with 0 ≤ w_m ≤ 1 and Σ_{m=1}^{M} w_m = 1. The ensemble forecast is calculated as follows:

ỹ_t^{Mean} = Σ_{m=1}^{M} w_m ŷ_mt.

The combination produced with all w_m = M^{−1} is the unweighted mean, whose attributes and limits are well known. It is sensitive to outliers, which makes it unreliable for skewed distributions. It is, however, possible to use winsorized or truncated means to obviate some of these problems by making the mean behave more closely to the median. Moreover, for distributions with finite variance the maximum distance between the mean and the median is one standard deviation, and this has been reported to hold for sets of forecasts [26].
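A minimal sketch of the (weighted) mean combination; the unweighted case corresponds to w_m = 1/M.

```python
import numpy as np

def mean_ensemble(forecasts, weights=None):
    """Combine M forecasts for one time point by a (weighted) mean."""
    forecasts = np.asarray(forecasts, dtype=float)
    if weights is None:
        weights = np.full(forecasts.size, 1.0 / forecasts.size)  # w_m = 1/M
    weights = np.asarray(weights, dtype=float)
    assert (weights >= 0).all() and np.isclose(weights.sum(), 1.0)
    return float(weights @ forecasts)

print(mean_ensemble([410.0, 425.0, 405.0]))                       # unweighted mean
print(mean_ensemble([410.0, 425.0, 405.0], [0.5, 0.25, 0.25]))    # weighted mean
```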

Median Ensemble
The median can also be either unweighted or weighted, although the weighted version is not usually applied. The median ensemble ỹ_t^{Median} is calculated by sorting the w_m ŷ_mt and selecting the single middle value if M is odd, or averaging the two middle values if M is even. It is, however, important to note that the median is also affected by non-symmetric distributions, even though it is more robust than the mean.
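The unweighted median rule, sketched in code:

```python
def median_ensemble(forecasts):
    """Middle value for odd M; mean of the two middle values for even M."""
    s = sorted(forecasts)
    m = len(s)
    if m % 2 == 1:
        return float(s[m // 2])
    return (s[m // 2 - 1] + s[m // 2]) / 2.0

print(median_ensemble([410.0, 425.0, 405.0]))          # odd M  -> 410.0
print(median_ensemble([410.0, 425.0, 405.0, 430.0]))   # even M -> 417.5
```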

Mode Ensemble
The mode is the most frequently occurring value in a set of data and, unlike the mean and the median, is not sensitive to outliers. There is no closed-form formula for the mode of a continuous variable with unknown distribution, but two methods are usually applied: discretizing the data and identifying the most frequent bin, or kernel density estimation. The second approach was preferred in this research due to its effectiveness for the continuous-valued nature of forecasts.
Kernel density estimation is a nonparametric method for estimating the probability density function of a random variable, here the forecasts. Given a set of forecasts from a distribution with unknown density f, the shape of the function can be estimated as

f̂_h(x) = (1/(Mh)) Σ_{m=1}^{M} K( (x − ŷ_mt) / h ),

where K(·) is a function with ∫ K(x) dx = 1, known as the kernel, and h > 0 is the bandwidth. The kernel is usually chosen as a unimodal symmetric density function, which makes f̂_h(x) a density function itself; the most common choice is the Gaussian kernel φ(x). Previous literature has proposed several alternative kernel functions, but the choice of kernel has been observed to have a minimal effect on the results in most cases. The bandwidth h moderates the amount of smoothing, with higher values providing more smoothing. The selection of h is therefore very important, since under-smoothing or over-smoothing gives a wrong estimate of the density f [27]. In practice, the approximation by [28],

h = 1.06 σ̂ M^{−1/5},

is often used, where σ̂ represents the standard deviation of the forecast samples; this estimate is most appropriate for Gaussian kernels. Moreover, [29] proposed a method for the automatic selection of the bandwidth without relying on the arbitrary standard reference rules used in previous studies. This method is preferred in calculating the mode ensemble due to its ability to produce a bandwidth h that allows the ensemble's fast convergence and excellent performance.
The value x corresponding to the maximum of the estimated density function estimates the mode of the true underlying forecast distribution and serves as the mode ensemble value ỹ_t^{Mode}. Meanwhile, a relatively large number of observations is needed for the kernel density estimate to display the underlying density appropriately, because using a small number usually leads to a bad approximation.
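A sketch of the mode ensemble via Gaussian kernel density estimation follows. The default bandwidth here is the Silverman-style rule of thumb h = 1.06 σ̂ M^(−1/5), used as a common stand-in rather than the automatic selector of [29]; the grid-search maximization is also our simplification.

```python
import numpy as np

def mode_ensemble(forecasts, h=None, grid_size=512):
    """Mode of M forecasts: maximize a Gaussian kernel density estimate on a grid."""
    f = np.asarray(forecasts, dtype=float)
    M = f.size
    if h is None:
        sigma = f.std(ddof=1)
        h = 1.06 * sigma * M ** (-0.2) if sigma > 0 else 1.0  # rule-of-thumb bandwidth
    grid = np.linspace(f.min() - 3 * h, f.max() + 3 * h, grid_size)
    # f_hat(x) = (1 / (M h)) * sum_m K((x - y_m) / h), with Gaussian kernel K
    z = (grid[:, None] - f[None, :]) / h
    dens = np.exp(-0.5 * z**2).sum(axis=1) / (M * h * np.sqrt(2 * np.pi))
    return float(grid[np.argmax(dens)])

print(mode_ensemble([1.0, 2.0, 2.0, 2.0, 3.0]))  # close to 2, up to grid resolution
```

As the text notes, this estimate only becomes reliable when the number of forecasts M is reasonably large.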

Result and Discussion
This research was conducted on three cases of actual data: the number of international airline passengers, the number of accidental deaths in the USA, and Indonesian inflation rates. Forecasting accuracy is measured using the Mean Absolute Percent Error (MAPE) value. MAPE is a measure of accuracy that is widely used in time series forecasting studies; a method performs very well when the MAPE value is below 10% and performs well when the MAPE value is between 10% and 20%. MAPE is defined as follows [30]:

MAPE = (100% / N) Σ_{t=1}^{N} |A_t − F_t| / |A_t|,

where N is the number of data points, A_t is the actual value at time t, and F_t is the forecast value at time t.
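The MAPE computation is straightforward; a small sketch:

```python
def mape(actual, forecast):
    """Mean Absolute Percent Error, in percent (actual values must be nonzero)."""
    assert len(actual) == len(forecast) and len(actual) > 0
    n = len(actual)
    return 100.0 / n * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))

print(mape([100.0, 200.0], [110.0, 180.0]))  # (10% + 10%) / 2 = 10.0
```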

Airline Passengers
Data on the number of international airline passengers, better known as the airline data, are popular and often used as an example of applying time series methods, among others in [1,2,6,7]. The data used in this research are monthly airline passenger numbers from January 1949 to December 1960. Thus, the number of samples used is 144, with the last 12 observations used as testing data. The monthly data of international airline passengers have an upward trend pattern and a seasonal pattern with multiplicative properties, as can be seen in Figure 1.
This research aims to forecast monthly time series data with trend and seasonal patterns. The Cox-Stuart test was used to detect a trend in the 12-period moving average of the series; the test was carried out on the moving average to smooth out seasonal irregularities. Furthermore, the de-trended time series data were used to calculate and identify seasonal indices, whose significant deviations were tested with a Friedman test. The trend identified in the time series data was eliminated by first differencing. The time series data were further scaled linearly between -0.8 and 0.8 to facilitate the NARX model's training. The NARX model's design is as follows. The main input variables are determined in the same way as in [11], through a stepwise regression approach, and the exogenous input variables are made with a deterministic seasonal dummy; there are two approaches to making a deterministic seasonal dummy, namely the binary and the sine-cosine dummy variables. The number of neurons in the hidden layer is half the number of input variables plus one. Each network was trained using the resilient backpropagation learning algorithm and the hyperbolic tangent activation function. Finally, the NARX model uses three ensemble operators (mean, median, and mode) to address the overfitting problem and the weaknesses of a single NARX model. The ensemble size is evaluated from 10 networks to 100 networks. Based on this design, the MAPE results for the data on the number of international airline passengers can be seen in Tables 1 and 2. Overall, the results of the NARX model with binary dummy variables in Table 1 are more accurate than those of the NARX model with sine-cosine dummy variables in Table 2. This difference indicates that the second approach to the exogenous dummy variables is less robust for seasonal data with multiplicative properties.
However, all NARX models in this first case study consistently outperformed the exponential smoothing method and the ARIMA method. Exponential smoothing through the state-space approach obtained the best model using the Holt-Winter linear multiplicative method, or ETS (M,A,M), with an AIC of 1395.16. Meanwhile, through the stepwise ARIMA approach, the best model obtained is ARIMA (2,1,1)(0,1,0)[12] with an AIC of 1017.85. Graphical illustrations, in the form of plots of the actual data, in-sample fit, and out-of-sample forecasts using NARX 1 (the best NARX model with binary dummy variables), NARX 2 (the best NARX model with sine-cosine dummy variables), ARIMA, and ETS, can be seen in Figures 1 and 2.
Overall, the NARX model with mode ensemble operator (NARX-Mode) is an ideal model with the lowest MAPE value. The NARX-Mode performs better when the ensemble size is large enough. In particular, the ensemble size becomes very important for the accuracy of the mode ensemble operator. It is because the kernel density estimation becomes reliable when the number of observations is adequate, as discussed in Section 2. This research found that the rarely used median and mode operators are more accurate than the mean ensemble operator. The NARX-Median is slightly lower in performance but can perform well for small and large ensembles, meaning that the NARX-Median is relatively stable. The NARX-Median and NARX-Mean rank second and third with slight differences, respectively.

Accidental Deaths
Monthly data on accidental deaths in the United States of America (USA) are shown graphically in Figure 3. Monthly data from January 1973 to December 1978 were used as training data, and the data from January 1979 to June 1979 were used as testing data. The data used in this research are popular due to their complex pattern; Brockwell and Davis [3] analysed them by applying ARIMA and exponential smoothing models. The design is the same as in the first case study. The MAPE results of the NARX model for the monthly data of accidental deaths in the USA can be seen in Tables 3 and 4. The results of the NARX model with sine-cosine dummy variables (second approach) in Table 4 are, overall, more accurate for NARX-Median and NARX-Mode. However, the same does not hold for the NARX-Mean, whose accuracy with the first approach is slightly better. These results also indicate that the second approach is robust for NARX-Median and NARX-Mode on seasonal data with additive properties. Overall, the NARX-Mode is the most accurate model with the lowest MAPE value. The NARX-Median ranks second, followed by the NARX-Mean in third place. The accuracy of all NARX models in the second case study consistently outperformed the exponential smoothing method and the ARIMA method. The best exponential smoothing model uses the Holt-Winter linear additive method, or ETS (A,A,A), with an AIC of 1240.18. Meanwhile, the best ARIMA model is ARIMA (0,1,1)(0,1,1)[12] with an AIC of 942.27. A graphical illustration in the form of plots of the actual data, in-sample fit, and out-of-sample forecasts using NARX 1 (the best NARX model with binary dummy variables), NARX 2 (the best NARX model with sine-cosine dummy variables), ARIMA, and ETS is also provided.

Indonesian Inflation
The third case study uses data on the inflation rate in Indonesia, observed monthly from January 2007 to February 2018, i.e., 134 observations. Model training is carried out on the first 129 data points (training data), and the last five are used as testing data. Data on the inflation rate in Indonesia are shown graphically in Figure 5. The graph shows downward and upward trends over certain periods. The NARX model's design in the third case study differs from the first and second case studies in that it does not include a deterministic seasonal dummy. Here, the exogenous variable of the NARX model is chosen by a correlation approach, as in [31] for modelling a single NARX; the exogenous variable chosen is the interest rate of the Indonesian Central Bank. The lag order for the interest rate data was assumed to equal that of the inflation rate data. The MAPE results of the NARX model for the third case study can be seen in Table 5. Overall, the third case study found that the NARX-Mean is the best model with the lowest MAPE value, followed by the NARX-Median with a small difference. Meanwhile, the NARX-Mode, which was found most accurate in the first and second case studies, ranks third. This indicates that the NARX-Mode is less robust for non-seasonal data. However, all of the NARX models consistently outperformed the exponential smoothing method and the ARIMA method. The best exponential smoothing model uses the Holt linear method, or ETS (M,A,N), with an AIC of 648.19, while the best ARIMA model is ARIMA (0,1,1)(2,0,0)[12] with an AIC of 378.90. A graphical illustration in the form of plots of the actual data, in-sample fit, and out-of-sample forecasts using ARIMA, ETS, and NARX is shown in Figure 5.
On the other hand, the comparison of the computation time required for forecasting over all ensemble sizes for each dataset is shown in Table 6. To compare the different computation times, we summarize them as the average computation time, reported in seconds. The networks were trained in parallel on an Intel i5-8250U CPU at 1.8 GHz. Table 6 shows that the median operator requires the least average computation time compared to the mean and mode operators for each dataset. The mode operator ranks second with a slight difference. Meanwhile, the mean operator, which is most often used, requires the most average computation time.

Conclusions
The first case study, on trend and seasonal patterns with multiplicative properties, shows that all the NARX model approaches (NARX-Mean, NARX-Median, and NARX-Mode) with binary dummies are more accurate than the NARX model approaches with sine-cosine dummies. The second case study, on trend and seasonal patterns with additive properties, found that the NARX models with sine-cosine dummies (NARX-Median and NARX-Mode) are more accurate; the same does not hold for the NARX-Mean. Meanwhile, the third case study, on trend and non-seasonal patterns, obtained the most accurate results using the NARX-Mean. Overall, the NARX-Mode is the most accurate model with the lowest MAPE value for data that have both trend and seasonal patterns, whereas the NARX-Mean is most accurate for data with a trend and no seasonal pattern. The NARX-Mode performs better when the ensemble size is large enough. The NARX-Median has slightly lower performance but is stable across data patterns and ensemble sizes. This study also shows that the rarely used median and mode operators are more accurate than the mean operator. On the other hand, the median operator takes the least average computation time, followed by the mode operator in second place with a slight difference. All NARX models in this study consistently outperformed the exponential smoothing method and the ARIMA method. Based on these results, we recommend that future researchers use dummy variables as exogenous inputs of the NARX model and use the median and mode operators in ensemble research and applications, which previously focused only on the mean operator. However, further research is needed to develop the NARX model with the parallel (with-feedback) architecture and weighted versions of the operators.