Water Level Prediction Using Different Numbers of Time Series Data Based on Chaos Approach

Predicting the water level in a floodplain area is important for early warning signals and flood control. A total of 6350 hourly water level observations from Sungai Dungun were used in this study. The data were divided into a training set and a testing set. The training set consisted of the first 6000 data, which were used to predict the last 350 data. Six data sets containing different amounts of training data were constructed in order to determine the influence of the amount of data on prediction accuracy using the chaos approach. Each data set required its own combination of parameters for prediction, and the amount of data affected this combination. In addition, the correlation coefficient differed across the data sets, and prediction was excellent when all of the training data were used. Hence, the amount of data affects both the combination of parameters and the prediction accuracy for water level prediction based on the chaos approach in a floodplain area.


Introduction
Cao, Tao, Dong and Li [5] asserted that floods occur when the water level rises excessively in river areas, under either natural or man-made conditions. In scientific terms, floods are caused by excessively heavy rain that cannot be contained by the river basin, so that the water overflows onto the riverbanks or floodplains [18]. Flood disasters can harm people and nature, affecting land structure, agriculture, livestock and residential areas [7]. Therefore, prediction of the water level in floodplain areas is important for early warning signals and flood control.
The dynamics of time series data can be divided into two types: deterministic and random. In 1963, Lorenz [15] discovered chaotic dynamics, and this knowledge has since been used thoroughly in science and engineering research, while the term chaos was first introduced by Li and Yorke [14]. According to Abarbanel [1], chaotic dynamics lie between deterministic and random dynamics. Chaotic time series can be predicted only in the short term because of their sensitive dependence on initial conditions [19].
The chaos approach is an important discovery for predicting phenomena in scientific research. It has been applied to many types of time series data, such as river flow [3], ozone [9] and sea level [4]. Research applying this method to water level time series data is growing and has been conducted in several countries, including China [10], Iran [21] and Malaysia [16]. In addition, much research has focused on the time scale of water level data, such as hourly [13], daily [12] and weekly [12] scales.
The amount of data used affects the values of the parameters used in prediction; hence, choosing the right amount of data is important. Research on water level prediction is therefore extended here to prediction using different numbers of time series data, in order to examine how the amount of data influences prediction accuracy. Thus, this study predicts the water level using data sets of several sizes, based on the chaos approach, in a floodplain area where floods often occur.

Data
The water level time series data are measured in metres (m). The data used in this study are hourly; hourly data are suitable for hydrological studies, in particular flood prediction [13]. This study was conducted at Kampung Surau Station on Sungai Dungun, Terengganu. The data span July 2009 to March 2010, since the area was affected by flooding during this period [11]. As shown in Figure 1, a total of 6350 hourly time series data were used and divided into a training data set and a testing data set. The training data set consisted of the first 6000 data, while the remaining 350 data were used as the testing data set.
The 6000 data in the training data set were divided into several parts, as this study focused on testing the prediction performance using the 1000 data set (coded as SD1000), 2000 data set (SD2000), 3000 data set (SD3000), 4000 data set (SD4000), 5000 data set (SD5000) and 6000 data set (SD6000). Since prediction was tested on data 6001 to 6350, the data used in constructing the prediction were counted backwards from data 6000: SD1000 was taken from data 5001 to 6000, SD2000 from data 4001 to 6000, and so on. The amount of data, their ranges and the percentage of the total training data set used are presented in Table 1.
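The backward-counted subsets described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function name `make_training_subsets` and the stand-in series are hypothetical, and the real input would be the 6350 hourly water level readings.

```python
def make_training_subsets(series, n_train=6000,
                          sizes=(1000, 2000, 3000, 4000, 5000, 6000)):
    """Split a series into training subsets (counted backwards) and a test set."""
    if len(series) <= n_train:
        raise ValueError("series must be longer than the training length")
    train = list(series[:n_train])   # data 1 .. 6000
    test = list(series[n_train:])    # data 6001 .. 6350
    # Each subset keeps the LAST n training values, so every subset ends at data 6000.
    subsets = {f"SD{n}": train[n_train - n:] for n in sizes}
    return subsets, test


# Stand-in data: the value at each position equals its 1-based index.
series = list(range(1, 6351))
subsets, test = make_training_subsets(series)
print(subsets["SD1000"][0], subsets["SD1000"][-1])  # first and last index in SD1000
print(test[0], len(test))                           # first test index and test size
```

Because every subset ends at the same point, each one is followed directly by the same 350 testing data, which keeps the comparison between subsets fair.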

Methodology
The time series $X$ is recorded hourly as follows:

$$X = \{x_1, x_2, x_3, \ldots, x_N\} \qquad (1)$$

The time series $X$ is the training set of data and $N$ is the total number of data involved, such that $x_1$ is the value of the data at the first hour. For example, for SD6000 the total number of data involved is $N = 6000$. According to Takens [20], the phase space can be reconstructed from a single variable. The training set $X$ is reconstructed into the $m$-dimensional phase space $Y_i^m$ as follows:

$$Y_i^m = (x_i, x_{i+\tau}, x_{i+2\tau}, \ldots, x_{i+(m-1)\tau}) \qquad (2)$$

where $\tau$ is the time delay and $m$ is the embedding dimension, with $i = 1, 2, \ldots, N-(m-1)\tau$. In order to obtain the value of $\tau$, the Average Mutual Information (AMI) was used as follows:

$$I(\tau) = \sum_{x_i,\, x_{i+\tau}} p(x_i, x_{i+\tau}) \log_2 \left[ \frac{p(x_i, x_{i+\tau})}{p(x_i)\, p(x_{i+\tau})} \right] \qquad (3)$$

where $p(x_i)$ and $p(x_{i+\tau})$ are the marginal probabilities and $p(x_i, x_{i+\tau})$ is the joint probability.

In order to determine the value of $m$, the Cao method was used in this study. This method not only provides the value of $m$ but also helps to determine the chaotic dynamics of the data [6]. Furthermore, this method does not depend on a specific number of data and is therefore suited to this study, which involves different amounts of time series data in order to see the impact of the parameter values on the prediction accuracy. The Cao method is thus more relevant for determining the chaotic dynamics of water level time series data than other methods such as the Lyapunov exponent method [17], the phase space plot [8] and the correlation dimension. The Cao method involves two parameters, $E1(m)$ and $E2(m)$. The parameter $E1(m)$ can be calculated by:

$$E1(m) = \frac{E(m+1)}{E(m)} \qquad (4)$$

$$E(m) = \frac{1}{N - m\tau} \sum_{i=1}^{N-m\tau} a(i, m) \qquad (5)$$

$$a(i, m) = \frac{\left\| Y_i^{m+1} - Y_n^{m+1} \right\|}{\left\| Y_i^m - Y_n^m \right\|} \qquad (6)$$

where $\|\cdot\|$ refers to the Euclidean distance and $Y_n^m$ refers to the nearest neighbour of $Y_i^m$. If $E1(m)$ saturates once $m$ is larger than some value $m_0$, then $m_0 + 1$ is the optimum embedding dimension [22].
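The phase space reconstruction and the Cao quantities described above can be sketched in Python. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the function names are hypothetical, the time delay `tau` is assumed to be already known (for example from AMI), neighbours at zero distance are skipped to avoid division by zero, and the brute-force neighbour search is quadratic in the number of data, so it is only practical for short series.

```python
import math


def delay_vectors(x, m, tau):
    """Return the delay vectors Y_i^m = (x_i, x_{i+tau}, ..., x_{i+(m-1)tau})."""
    n_vec = len(x) - (m - 1) * tau
    if n_vec <= 0:
        raise ValueError("series too short for this (m, tau)")
    return [[x[i + j * tau] for j in range(m)] for i in range(n_vec)]


def cao_E(x, m, tau):
    """Return (E(m), E*(m)) using Euclidean nearest neighbours in dimension m."""
    n = len(x) - m * tau            # vectors that also exist in dimension m + 1
    if n < 2:
        raise ValueError("series too short for this (m, tau)")
    Ym = delay_vectors(x, m, tau)[:n]
    a_sum, star_sum, count = 0.0, 0.0, 0
    for i in range(n):
        best, nn = math.inf, -1
        for j in range(n):
            if j != i:
                d = math.dist(Ym[i], Ym[j])
                if 0.0 < d < best:  # skip exact duplicates (zero distance)
                    best, nn = d, j
        if nn < 0:
            continue
        # Going from dimension m to m + 1 appends one coordinate, x_{i + m*tau}.
        extra = abs(x[i + m * tau] - x[nn + m * tau])
        a_sum += math.sqrt(best * best + extra * extra) / best   # a(i, m)
        star_sum += extra
        count += 1
    if count == 0:
        raise ValueError("no usable nearest neighbours")
    return a_sum / count, star_sum / count


def cao_E1_E2(x, tau, m_max):
    """Return lists E1(m) and E2(m) for m = 1 .. m_max."""
    vals = [cao_E(x, m, tau) for m in range(1, m_max + 2)]
    E1 = [vals[i + 1][0] / vals[i][0] for i in range(m_max)]
    E2 = [vals[i + 1][1] / vals[i][1] for i in range(m_max)]
    return E1, E2


# Stand-in data: a sine wave instead of the water level record.
x = [math.sin(0.3 * t) for t in range(300)]
E1, E2 = cao_E1_E2(x, tau=5, m_max=5)
for m, (e1, e2) in enumerate(zip(E1, E2), start=1):
    print(m, round(e1, 4), round(e2, 4))
```

Reading the printed E1 column for the first m at which the values stop changing mirrors the saturation criterion described in the text; the threshold that counts as saturated is left to the reader.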
Besides identifying the value of $m$, the Cao method is also used to determine whether the dynamics of the system are chaotic or random. If the value of $E1(m)$ does not saturate as $m$ increases, the time series is random. Cao [6] also introduces $E2(m)$ as follows:

$$E2(m) = \frac{E^*(m+1)}{E^*(m)}, \qquad E^*(m) = \frac{1}{N - m\tau} \sum_{i=1}^{N-m\tau} \left| x_{i+m\tau} - x_{n+m\tau} \right| \qquad (7)$$

where $x_{n+m\tau}$ is the value that follows the nearest neighbour $Y_n^m$ of $Y_i^m$. If chaotic dynamics exist in the training set of data, the value of $E2(m)$ will differ from 1 for at least one value of $m$.

Meanwhile, prediction based on the chaos approach can be conducted using the local linear approximation method:

$$Y_{j+1}^m = A\, Y_j^m + B \qquad (8)$$

where $Y_{j+1}^m$ is the one-step-ahead phase space vector and $Y_j^m$ is the last phase space vector. The constants $A$ and $B$ depend on the $k$ nearest neighbours of $Y_j^m$. In this research, the value of $k$ is determined from the embedding dimension $m$.

Table 2 shows the values of $\tau$ obtained using the AMI method with different amounts of time series data. The different values of $\tau$ indicate that the amount of time series data used influences the value of $\tau$. The saturation value for $E1(m)$ was chosen to be between 0.9 and 1.0 [9]. Table 3 shows the values of $E1(m)$ for each set of data; the bold numbers mark where $E1(m)$ started to saturate for each data set. For SD1000, the $E1(m)$ values started to saturate at $m = 4$ and therefore the embedding dimension $m$ for SD1000 was 4. Similarly, the values of $m$ were 6, 6, 7, 6 and 5 for SD2000, SD3000, SD4000, SD5000 and SD6000, respectively. Hence, the different amounts of data induced different values of $m$.
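The local linear approximation step described above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the implementation used in the study: it predicts only the next scalar value (the new last component of the one-step-ahead vector), it fits the constants by least squares over the k nearest neighbours, it adds a small ridge term purely for numerical stability, and it takes k as an argument chosen by the caller. All function names are hypothetical.

```python
import math


def delay_vectors(x, m, tau):
    """Return the delay vectors Y_i^m = (x_i, x_{i+tau}, ..., x_{i+(m-1)tau})."""
    n_vec = len(x) - (m - 1) * tau
    if n_vec <= 1:
        raise ValueError("series too short for this (m, tau)")
    return [[x[i + j * tau] for j in range(m)] for i in range(n_vec)]


def solve_linear(A, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for q in range(c, n + 1):
                M[r][q] -= f * M[c][q]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][q] * sol[q] for q in range(r + 1, n))
        sol[r] = s / M[r][r]
    return sol


def local_linear_predict(x, m, tau, k, ridge=1e-8):
    """Predict the value after x[-1] from the k nearest neighbours of the last vector."""
    vecs = delay_vectors(x, m, tau)
    last = vecs[-1]
    cands = vecs[:-1]                 # vectors whose successor value is known
    order = sorted(range(len(cands)), key=lambda i: math.dist(cands[i], last))
    nbrs = order[:k]
    rows = [cands[i] + [1.0] for i in nbrs]                  # [Y, 1] for slope and offset
    targets = [x[i + (m - 1) * tau + 1] for i in nbrs]       # value following each neighbour
    p = m + 1
    # Normal equations (R^T R + ridge * I) w = R^T t; the ridge term is an assumption.
    G = [[sum(r[a] * r[b] for r in rows) + (ridge if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    h = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(p)]
    w = solve_linear(G, h)
    return sum(wi * yi for wi, yi in zip(w[:m], last)) + w[m]


# Stand-in data: a sine wave, which obeys an exact linear rule in two delay coordinates.
x = [math.sin(0.3 * t) for t in range(300)]
pred = local_linear_predict(x, m=2, tau=1, k=10)
print(pred, math.sin(0.3 * 300))
```

To forecast several hours ahead, one option is to append each prediction (or each newly observed value) to the series and call the function again; whether the study iterated predictions or used observed values at each step is not stated in the text.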

Results and Discussion
Note that the values of $E1(m)$ saturated as $m$ increased for all sets of data. Hence, the time series was deterministic rather than random. Moreover, the existence of $E2(m) \neq 1$ for each value of $m$ confirmed the presence of chaotic dynamics in each set of data, as shown in Table 4. Since chaos exists in the water level time series data of Sungai Dungun, prediction using the chaos approach can be conducted for all of the hourly time series data sets. The local linear approximation method was used. This method requires the parameter combination $(\tau, m)$, where $\tau$ comes from the AMI method in Table 2 and $m$ from $E1(m)$ in Table 3. The 350 testing data were used to test the prediction performed by each set of data, as presented in Figure 2. The predictions were compared to the actual data for each of SD1000, SD2000, SD3000, SD4000, SD5000 and SD6000. Table 5 shows the prediction performance for each data set, represented by the correlation coefficient (CC). For SD1000, prediction with the parameter combination (4, 4) gave a CC value of 0.9763. For SD2000 and SD3000, the parameter combination was the same, (11, 6), and generated a CC value of 0.9683. This shows that the same values of the parameters $\tau$ and $m$ may lead to the same value of CC. Meanwhile, the value of CC exceeded 0.9400 for SD4000, SD5000 and SD6000 with the parameter combinations (24, 7), (22, 6) and (18, 5), respectively. This shows that different amounts of data give different prediction results. However, the best prediction in this research was obtained by using the 6000-hour data set, which is the largest data set used.
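The correlation coefficient used to score each data set can be sketched in Python as follows. This sketch assumes that CC denotes the Pearson correlation between the observed and predicted testing values, which the text does not state explicitly; the function name and the sample numbers are hypothetical.

```python
import math


def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0.0 or var_p == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_o * var_p)


# Made-up values for illustration only; not data from the study.
observed = [1.20, 1.35, 1.50, 1.42, 1.30]
predicted = [1.18, 1.37, 1.47, 1.45, 1.28]
print(correlation_coefficient(observed, predicted))
```

Note that a CC close to 1 indicates that the predictions follow the shape of the observations; it does not by itself measure a constant offset between the two series.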

Conclusions
This study observes the influence of the amount of data used, together with the combination of the parameters $\tau$ and $m$, on prediction using the chaos approach. The amount of data affects the combination of parameters used for prediction. As such, this research suggests testing different amounts of data in order to find a good parameter combination for prediction. Furthermore, six sets of time series data of different sizes, containing up to 6000 data, were used in this study to determine the impact of the amount of data on prediction accuracy in a floodplain area. In conclusion, the amount of time series data influences the accuracy of the prediction. The proposed amount of data that gives excellent prediction is the full training set of 6000 data, used to predict the following 350 hours of data. Therefore, a large amount of data is needed in order to obtain excellent prediction accuracy with the chaos approach.