Regression Tree Analysis of SPM Entropy Groups: Case Study from the Irish Sea

This paper describes our studies of suspended particulate matter (SPM) in Liverpool Bay (UK). Monitoring data were analyzed using entropy analysis. Entropy analysis of in situ particle size spectra revealed 5 basic types, attributable to different sets of environmental conditions. These basic types were then subjected to classification tree analysis in order to identify the meteorological and oceanographic variables of importance for the characterisation of the shape of SPM spectra. The results obtained are a step towards a better characterisation of floc size, and therefore a more precise calculation of sedimentation and transport rates, and are thus relevant to the scientific analysis of a wider range of environmental issues.


Introduction
Suspended particulate matter (SPM) is of fundamental importance in ocean engineering. Knowledge of its dynamics is indispensable for understanding the corrosion and abrasion of materials and the formation of fluid mud, and is therefore relevant to issues of navigation and channel maintenance (Schrottke et al., 2006; Schwartz & Kozerski, 2003). SPM is also important for aquatic ecology and environmental management (Hakanson & Eckhell 2005), as it is intimately related to the transport of pollutants and influences water clarity and primary production, and hence also secondary production (Krivtsov et al. 2008b). Consequently, SPM characterisation has recently become an increasingly important topic in pollution control, environmental auditing and management (Audry The in situ particle size spectrum of suspended particulate matter in the aquatic environment influences the feeding pattern of bottom fauna (Cranford et al. 2005), affects the transmission and reflectance of light in water (Mikkelsen 2002) and is of importance for numerous sedimentological processes and a wide range of ecological processes (Krivtsov et al. 2008b). It has previously been shown (Sharp & Fan 1963) that such parameters as mean/median particle size and standard or median absolute deviation can be incomplete or even misleading descriptors of the shape of the size spectrum, in particular for multi-modal spectra (Mikkelsen et al. 2007; Mikkelsen et al. 2005). Here we have applied a combination of entropy modelling with regression tree analysis to deduce 5 basic types of SPM spectra, and describe their relationships with environmental variables.

Site Description
The data presented here were collected in Liverpool Bay, an area of the Irish Sea important for recreation and shipping (Figure 1). The site is characterised by tidal straining, intertidal regions with exposed banks, high suspended sediment concentration and complex biogeochemical interactions. Tidal currents are strong (up to 1 m/s during springs) and there are occasional large storm surges and waves (in particular associated with westerly winds). The principal freshwater inputs are from the rivers Mersey, Dee and Ribble. Further information about the area can be found in our previous publication and references therein (Krivtsov et al. 2008b).

Observations
The observational data forming the basis of this paper come from 9 cruises carried out by the Proudman Oceanographic Laboratory and the School of Ocean Sciences (Bangor) on the RV Prince Madog between September 2004 and February 2006. The oceanographic variables measured during the cruises using a profiling CTD package are standard, and include (among others) temperature, salinity, conductivity, beam attenuation, chlorophyll a fluorescence and photosynthetically active radiation (PAR). Data on wave characteristics are available from the CEFAS wave rider buoy. The principal observational evidence comes from the LISST-100 laser, which provides in situ estimates of volume concentrations in µl/l for 32 size classes corresponding to particle sizes between 2.5 and 500 µm (Agrawal & Pottsmith 2000).

Entropy Analysis
Entropy analysis has previously (Mikkelsen et al. 2007) been used to classify in situ particle (floc) size spectra of suspended particles into groups based on similar distribution characteristics. It was evident that the in situ spectra sorted into groups that reflected different forcing conditions (e.g. variations in turbulence). Importantly, the different forcing conditions were not necessarily reflected in other commonly used distribution measures such as median floc diameter; this suggests that entropy analysis may be an effective approach for investigating the effect of changes in forcing conditions on floc size (Sharp & Fan 1963).
In information theory, the concept of entropy is related to the randomness of an event or a signal. Essentially, entropy links the information content of a signal to its randomness: if a signal has a high entropy (high randomness), the information content is low, and vice versa. In particle size terms, this can be illustrated by considering a completely flat size spectrum, i.e. one in which all volume (or mass) occurs with the same frequency throughout the spectrum. This is essentially a random distribution of matter throughout the size spectrum, so a size spectrum with this shape has maximum entropy. Conversely, in a size spectrum where all particle volume or mass is found in only one bin there is no randomness in the distribution, so the entropy of such a spectrum is at a minimum. Therefore, a particle size spectrum can be characterized in terms of its entropy. For a particle size spectrum with n size bins, the entropy, E, is given as:
$$E = -\sum_{i=1}^{n} p_i \log p_i \qquad (1)$$
where $p_i$ is the proportion of particles in size bin i (Shannon 1948). Note that when $p_i = 0$, $p_i \log p_i = 0$ (according to L'Hôpital's rule). The entropy can vary between a maximum value, $E_{MAX}$, of log n when all $p_i = 1/n$ and a minimum value, $E_{MIN}$, of zero when $p_i = 1$ for exactly one of the bins i = 1…n. The entropy is related to the information gain, I, which is also known as the inequality statistic, by the equation:
$$I = E_{MAX} - E \qquad (2)$$
When E equals $E_{MAX}$, the proportion of particles is the same in all size bins, and I equals zero. As the value of I increases, the information content of the size spectrum increases.
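The two limiting cases described above can be checked numerically. The following Python sketch computes the entropy and the information gain for a flat spectrum and for a spectrum with all volume in a single bin. The function names are illustrative choices, and the natural logarithm is an assumption; another base would only rescale both quantities.

```python
import math

def entropy(proportions):
    """Shannon entropy E = -sum(p_i * log p_i), taking 0 * log 0 as 0."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

def information_gain(proportions):
    """Inequality statistic I = E_MAX - E, with E_MAX = log n."""
    return math.log(len(proportions)) - entropy(proportions)

n_bins = 32                               # number of LISST-100 size classes
flat = [1.0 / n_bins] * n_bins            # same proportion in every bin
single = [1.0] + [0.0] * (n_bins - 1)     # all volume in one bin

e_flat, e_single = entropy(flat), entropy(single)
i_flat, i_single = information_gain(flat), information_gain(single)
```

Under these assumptions the flat spectrum reaches the maximum entropy log n with zero information gain, whereas the single-bin spectrum has zero entropy and the maximum information gain.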
With an ensemble of size spectra, the inequality statistic can be used to divide the spectra into groups. Optimal grouping maximizes the inequality between the groups and minimizes the inequality within each group, so the spectra in each group all have similar shapes, and the shapes of the spectra differ mainly between groups. The first step in grouping size spectra is to express the proportion of particles in each size bin of each size spectrum as a proportion of the grand total (Johnston & Semple 1983). For size distributions expressed as volume concentrations, the volume concentration in each size bin of each spectrum must be divided by the grand total volume concentration, defined as the sum of the volume concentrations in all bins in all spectra:
$$Y_{ij} = \frac{VC_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{J} VC_{ij}} \qquad (3)$$
where N is the number of spectra, J is the number of size bins in each spectrum, $VC_{ij}$ is the volume concentration in spectrum i, bin j, and $Y_{ij}$ is the proportion of the total volume concentration in spectrum i, bin j.
Following Johnston and Semple (1983), the total inequality for all spectra is then given as:
$$I = \sum_{j=1}^{J} Y_j \sum_{i=1}^{N} Y_i \log (N Y_i) \qquad (4)$$
where $Y_j = \sum_{i=1}^{N} Y_{ij}$ and $Y_i = Y_{ij}/Y_j$. If the spectra have been divided into R groups, a measure of the efficiency of the grouping (in terms of maximising between-group inequality) can be obtained from the so-called $R_S$ statistic:
$$R_S = (I_B / I) \times 100 \qquad (5)$$
In Eq. (5), $I_B$ is the between-group inequality, which is defined as:
$$I_B = \sum_{j=1}^{J} Y_j \sum_{r=1}^{R} p_{jr} \log \left( \frac{p_{jr} N}{N_r} \right) \qquad (6)$$
where $p_{jr} = (\sum_{i \in r} Y_{ij})/Y_j$, and $N_r$ is the number of spectra in group r of R. High $R_S$ values indicate that the inequality is mostly related to differences between the groups and that the inequality within each group is low. In short, the spectra within each group have similar shapes, and the shapes of the spectra differ mainly between groups.
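The grouping efficiency can be illustrated with a short Python sketch. It assumes that the total inequality takes the form I = Σ_j Y_j Σ_i Y_i log(N Y_i) and the between-group inequality the form I_B = Σ_j Y_j Σ_r p_jr log(p_jr N / N_r), as implied by the definitions of Y_i, p_jr and N_r given above. The function names, the natural logarithm and the four toy spectra are illustrative assumptions, not part of the original analysis.

```python
import math

def normalise(vc):
    """Y_ij: each volume concentration divided by the grand total."""
    total = sum(sum(row) for row in vc)
    return [[v / total for v in row] for row in vc]

def total_inequality(y):
    """I = sum_j Y_j * sum_i Y_i * log(N * Y_i), with Y_i = Y_ij / Y_j."""
    n, result = len(y), 0.0
    for j in range(len(y[0])):
        y_j = sum(row[j] for row in y)
        for row in y:
            y_i = row[j] / y_j if y_j > 0 else 0.0
            if y_i > 0:
                result += y_j * y_i * math.log(n * y_i)
    return result

def between_group_inequality(y, groups):
    """I_B = sum_j Y_j * sum_r p_jr * log(p_jr * N / N_r)."""
    n, result = len(y), 0.0
    for j in range(len(y[0])):
        y_j = sum(row[j] for row in y)
        for r in set(groups):
            members = [i for i, g in enumerate(groups) if g == r]
            p_jr = sum(y[i][j] for i in members) / y_j if y_j > 0 else 0.0
            if p_jr > 0:
                result += y_j * p_jr * math.log(p_jr * n / len(members))
    return result

def rs_statistic(vc, groups):
    """R_S = 100 * I_B / I for a given assignment of spectra to groups."""
    y = normalise(vc)
    return 100.0 * between_group_inequality(y, groups) / total_inequality(y)

# Toy data: two identical fine-mode and two identical coarse-mode spectra.
spectra = [[8, 1, 1], [8, 1, 1], [1, 1, 8], [1, 1, 8]]
rs_sorted = rs_statistic(spectra, [0, 0, 1, 1])  # like grouped with like
rs_mixed = rs_statistic(spectra, [0, 1, 0, 1])   # each group mixes both shapes
```

With these toy data, grouping like with like leaves no inequality within the groups, so R_S reaches 100, whereas the mixed grouping explains none of the inequality and R_S is zero.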
Unfortunately there is no way to predict in advance how the spectra should be grouped or how many groups are desirable. The only way to obtain a best grouping (simply defined as the best R S statistic) is to perform all possible combinations of N spectra into R groups, compute the R S statistic for each of the combinations and then choose the combination that yields the largest R S statistic for that number of groups. This problem is well known from other grouping techniques such as, for example, principal component analysis, where the full set of principal components is as large as the original set of variables, but the vast majority of the variation usually can be explained by the first two to four principal components.
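The cost of an exhaustive search can be made concrete. The number of ways to divide N spectra into R non-empty groups is the Stirling number of the second kind, which the following Python sketch evaluates with the standard recurrence. The function name and the example values of N and R are illustrative assumptions, not figures from this study.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, r):
    """Number of ways to split n labelled items into r non-empty groups."""
    if n == r:
        return 1
    if r == 0 or r > n:
        return 0
    # Item n either joins one of the r existing groups or forms a new one.
    return r * stirling2(n - 1, r) + stirling2(n - 1, r - 1)

small = stirling2(10, 3)    # 10 spectra into 3 groups
large = stirling2(100, 5)   # 100 spectra into 5 groups
```

Ten spectra can already be divided into three groups in 9330 ways, and for 100 spectra in five groups the count exceeds 10^60, which is why an iterative reallocation is used in practice instead of a full enumeration.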
Johnston and Semple (1983) provided a FORTRAN routine that automatically arranges the data into a user-selected number of groups, and then shifts them between groups until an optimal grouping for that number of groups is found. Their routine was later adapted to QBASIC (Woolfe & Michibayashi 1995) for the analysis of sedimentological facies. Entropy analysis was also useful in delineating ecological habitats on the Scotian Shelf off Nova Scotia, Canada (Orpin & Kostylev 2006). Here we have used a Matlab implementation of the Entropy analysis reported previously (Mikkelsen et al. 2007).
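The idea of shifting spectra between groups until the grouping stops improving can be sketched in Python as follows. This is a simple greedy reallocation written for illustration only; it is not the FORTRAN, QBASIC or Matlab routine cited above, and it may stop at a local optimum. The function names, the starting assignment, the natural logarithm and the toy spectra are all assumptions.

```python
import math

def rs_statistic(vc, groups):
    """R_S = 100 * I_B / I for spectra `vc` (rows) assigned to `groups`."""
    n = len(vc)
    total = sum(map(sum, vc))
    sizes = {r: groups.count(r) for r in set(groups)}
    i_total = i_between = 0.0
    for j in range(len(vc[0])):
        y_j = sum(row[j] for row in vc) / total
        if y_j == 0:
            continue
        for row in vc:                      # total inequality, bin j
            y_i = row[j] / total / y_j
            if y_i > 0:
                i_total += y_j * y_i * math.log(n * y_i)
        for r, n_r in sizes.items():        # between-group inequality, bin j
            p = sum(row[j] for row, g in zip(vc, groups) if g == r) / total / y_j
            if p > 0:
                i_between += y_j * p * math.log(p * n / n_r)
    return 100.0 * i_between / i_total

def regroup(vc, n_groups, max_sweeps=100):
    """Move one spectrum at a time to another group while R_S improves."""
    groups = [i % n_groups for i in range(len(vc))]   # arbitrary start
    best = rs_statistic(vc, groups)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(vc)):
            home = groups[i]
            if groups.count(home) == 1:
                continue                    # never empty a group
            best_r = home
            for r in range(n_groups):
                if r == home:
                    continue
                groups[i] = r
                score = rs_statistic(vc, groups)
                if score > best + 1e-12:
                    best, best_r, improved = score, r, True
            groups[i] = best_r              # keep the best placement found
        if not improved:
            break
    return groups, best

# Toy data: three fine-mode spectra followed by three coarse-mode spectra.
spectra = [[8, 1, 1], [8, 1, 1], [7, 2, 1], [1, 1, 8], [1, 2, 7], [1, 1, 8]]
start_rs = rs_statistic(spectra, [0, 1, 0, 1, 0, 1])
groups, final_rs = regroup(spectra, n_groups=2)
```

Starting from an alternating assignment, the sketch ends with the fine-mode and coarse-mode spectra in separate groups and a markedly higher R_S than at the start.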

Regression Tree Analysis
To investigate whether the entropy group of an SPM spectrum could be predicted using a set of meteorological and oceanographic variables, we applied the data mining method of regression trees using the Matlab function 'treefit' with the 'classification' option. The resulting tree was subsequently pruned to level 4 and displayed using the 'treedisp' function. The list of variables used in this analysis is given in Table 1.
Regression trees are a representation of piece-wise constant or piece-wise linear functions, and the models are given in the form of hierarchical structures of their elements. The models predict the value of a dependent variable (in our case, the entropy group type of the SPM spectra) from the values of a set of independent variables. The space of examples is partitioned into axis-parallel rectangles and a model is fitted to each of these partitions. A regression tree has an inverted hierarchical structure with a test in each inner node (a junction from where two links go to the lower hierarchical levels). Each node tests the value of a certain independent variable, and each leaf (the lowest level of the hierarchical tree) displays a linear equation or (in the analysis presented here) just a constant for predicting the value of the dependent variable.
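The structure described above, with a test on one variable in each inner node and a constant in each leaf, can be illustrated with a minimal Python sketch. It is not the 'treefit' algorithm: the Gini impurity used as the split criterion, the function names and the eight data points (temperature in °C, wave orbital velocity in m/s, group number) are illustrative assumptions and are not taken from the cruise data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Best axis-parallel split as (impurity, variable index, threshold)."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[f] < t]
            right = [y for row, y in zip(rows, labels) if row[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Inner nodes are (variable, threshold, left, right); leaves are constants."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return (f, t,
            grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(node, row):
    """Follow the tests from the root down to a leaf."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node

# Synthetic examples: (temperature, orbital velocity) -> group number.
rows = [(6.0, 0.05), (7.0, 0.30), (8.0, 0.10), (9.0, 0.35),
        (15.0, 0.05), (16.0, 0.30), (17.0, 0.08), (18.0, 0.40)]
labels = [1, 2, 1, 2, 5, 4, 5, 4]
tree = grow(rows, labels)
```

For these synthetic points the root node tests temperature and the second level tests orbital velocity, so each leaf corresponds to one axis-parallel rectangle in the temperature and velocity plane.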

Results and Discussion
The 5 groups of spectra resulting from the entropy analysis are displayed in Figure 2, with the groups numbered in ascending order according to the position of the main mode. To investigate the relationships between the group number and environmental factors, the data were subjected to regression tree analysis. The variables used for this analysis were the ones known to be important to SPM characterisation from our previous work and also those showing particularly strong correlations with the spectra grouping (Krivtsov et al. 2012). The classification tree resulting from the regression tree analysis is displayed in Figure 3. It shows that the most important variables for the characterisation of the SPM spectra are temperature, the directions of wind and waves, wave period and orbital velocity. The upper level node is represented by temperature, thus dividing all the spectra into predominantly winter ones (types 1 and 2) and predominantly summer ones (types 4 and 5). The spectra belonging to type 3 constituted a rather small group, and they occurred under conditions broadly similar to (albeit somewhat more turbulent than) those of type 4 (data not shown).
It should be noted, however, that Type 2 spectra can also be observed during warmer periods, provided there are sufficiently high levels of turbulence. These may occur, for example, either on the tail of a passing depression (when the strong swell comes from the W/NW) or during sufficiently strong tidal currents (see Figure 3). It should also be noted that, based on the results of this analysis, wave-induced turbulence at the site studied appears to be more important for SPM characterisation than tidal currents, which is in line with our previous work (Krivtsov et al. 2009).

Table 1

It has previously been argued (Mikkelsen et al. 2007) that the shape of the in situ size spectrum must be a function of a limited number of variables, including turbulence, biological 'stickiness' and suspended matter concentration. Therefore, a group of size spectra that all have approximately the same shape should be indicative of a certain set of environmental conditions. Thus the complexity of the in situ size spectra in a particular body of water may be reduced to a few groups, each typical of forcing conditions varying within a certain limited range (Krivtsov et al. 2012). The results presented here not only support these considerations, but also provide an insight into how the shape of SPM spectra could be estimated from the concurrent environmental conditions. It should be noted, however, that to enable reliable predictions the analysis presented here should be repeated with data on particle sizes larger than the current measurement limit of 500 µm.

Conclusion
In this paper, we have presented evidence that a library of entropy groups could be built for a particular site, and that the average shape of the spectrum could subsequently be estimated from measurements of the forcing parameters. Figure 3 shows how a classification tree analysis could be used to deduce the spectra group number from the concurrent values of ambient parameters. Potentially, this strategy may enable computation of average floc effective density, floc settling velocity and floc fraction, hence providing valuable information for a wide range of engineering, environmental and ecological modelling applications.