Integration of Cluster Centers and Gaussian Distributions in Fuzzy C-Means for the Construction of Trapezoidal Membership Function

Fuzzy C-Means (FCM) is one of the most widely used techniques for fuzzy clustering and has proven robust and efficient across various applications. Image segmentation, stock market analysis and web analytics are examples of popular applications that use FCM. One limitation of FCM is that it produces only Gaussian membership functions (MFs). The literature shows that, depending on the data used, some types of membership function may perform better than others. Having the Gaussian membership function as the only option therefore limits the capability of fuzzy systems to produce accurate outcomes. Hence, this paper presents a method to generate another popular MF shape, the trapezoidal shape (trapMF), from FCM, giving FCM more flexibility in producing outputs. The trapMF is constructed using the mathematical theory of Gaussian distributions, confidence intervals and inflection points. The cluster centers, or means (μ), and standard deviations (σ) from the Gaussian output are used to determine the four trapezoidal parameters: lower limit a, upper limit d, lower support limit b, and upper support limit c, with the assistance of the function trapmf() in the Matlab Fuzzy toolbox. The results show that the mathematical theory of Gaussian distributions can be applied to generate trapMFs from FCM.


Introduction
Fuzzy logic is an approach that carries the uncertainties of the real world into the computing world by representing degrees of truth. It contrasts with crisp or Boolean values (0 or 1), which force a definite result and are often not realistic [1] [2]. The idea of fuzzy logic was introduced by Dr Lotfi Zadeh while he was working on natural language, which cannot easily be translated into absolute terms of true or false [3]. An application of fuzzy logic can be found in the fuzzy inference system (FIS). Fuzzification is the component of an FIS in which an input variable is compared against a membership function (MF) to obtain its membership degree [4]. The membership degree then passes through the fuzzy rule engine for processing, such as decision making. The membership functions of a fuzzifier can be constructed by two methods: from expert opinion or from data [5]. One of the common methods for MF construction from data is clustering.
Fuzzy clustering is a technique for handling unlabelled data, which may contain outliers and unusual patterns. Membership functions allow a data point to belong to more than one group or cluster [6]. The clusters of data are generated by a possibility distribution or collected from various resources. The distance measure used in most clustering algorithms to determine the cluster centres is the Euclidean distance [7]. Fuzzy C-Means (FCM) is one of the most widely used techniques for fuzzy clustering [8]. Across applications such as web usage mining, stock market analysis, web analytics and image segmentation, FCM has proven efficient, robust and reliable [9]. However, the resultant MF of FCM has only a Gaussian shape, owing to the straightforward way in which the clusters are distributed [10]. There are other regularly used parameterized MFs, such as triangular and trapezoidal MFs. These MFs are used in specific cases such as an antenna positioning fuzzy controller [11] and crime prevention analysis [12]. Thus, FCM should also be capable of producing linear MFs. The construction of Gaussian MFs is straightforward and is discussed in [13]. The construction of trapezoidal MFs is less straightforward; a convex hull method is proposed in [12] [13]. However, its implementation is unclear and depends on specific cases [16].
In this paper, we present a method to produce trapezoidal MFs based on the integration of cluster centres produced by FCM and the mathematical theory of Gaussian distributions. The paper is organized as follows. Section 2 reviews the literature on trapezoidal MFs and the Gaussian distribution and explains an approximation of trapezoidal MFs. Section 3 presents the results and testing, and Section 4 presents the conclusion.

Trapezoidal MF
According to [17], with experience one can decide which MF shape suits the application and cases under consideration. This is the degree of freedom offered by fuzzy systems: an MF can take any shape and form as long as it maps the given dataset to the desired membership degrees. The choice also depends on the size and type of the problem. The MF shape is not the only concern; setting the interval and the number of MFs is important too [17]. In addition, trial and error is often used to determine the shape of an MF. However, the trapezoidal MF (trapMF), which represents fuzzy intervals, has proven easy to implement and fast to compute [17]. In most practical applications, trapMFs work well since they use linear interpolation at both ends of the interval. A theoretical explanation of the practicality of trapMFs is given in [18], and an intuitive explanation in simple terms is given in [19]. In short, a trapMF (Figure 1) is defined by a lower limit a, an upper limit d, a lower support limit b, and an upper support limit c, where a < b < c < d [18]. Compactly, the mathematical representation of a trapMF is as follows:

$$\mu(x) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & a < x < b \\ 1, & b \le x \le c \\ \dfrac{d-x}{d-c}, & c < x < d \\ 0, & x \ge d \end{cases} \qquad (1)$$

As a simple example, for a property such as "small", different values x are assigned degrees μ(x). The main motivation in a fuzzy implementation is to ensure that when two values x and x' are close, their membership degrees μ(x) and μ(x') are close too [19].
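The piecewise definition of a trapMF by the four parameters a, b, c and d can be sketched in a few lines of Python. This is only an illustration: the function name `trap_mf` is hypothetical, and it is not the Matlab trapmf() implementation.

```python
def trap_mf(x, a, b, c, d):
    """Membership degree of x in a trapezoid with parameters a < b < c < d."""
    if not (a < b < c < d):
        raise ValueError("parameters must satisfy a < b < c < d")
    if x <= a or x >= d:
        return 0.0                  # outside the trapezoid
    if x < b:
        return (x - a) / (b - a)    # rising edge, linear interpolation
    if x <= c:
        return 1.0                  # flat top
    return (d - x) / (d - c)        # falling edge, linear interpolation
```

For instance, with a = 1, b = 2, c = 3 and d = 4, the value x = 1.5 lies halfway up the rising edge, so nearby values of x receive nearby membership degrees.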

Gaussian Distribution
Since the trapMF in this paper is constructed via FCM, the output of the data clustering is in the form of a Gaussian MF. Hence, it is natural to use the Gaussian distribution to convert the Gaussian MF into a trapMF. The Gaussian distribution is also called the normal distribution, and a random variable that follows it is said to be normally distributed. A normal distribution is informally called the bell curve. The function used to calculate the probability that a random variable falls within a particular range of values, rather than taking on any one value, is called the probability density function, shown in eq. (2):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (2)$$

where µ is the mean, σ is the standard deviation and σ² is the variance. Since the mean and standard deviation are provided by the output of the Gaussian MF, it is possible to construct the trapMF mathematically by using the Gaussian distribution.
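As a small illustration of how the density yields the probability of a range of values, the sketch below evaluates the normal density and estimates the area under it over an interval with a simple midpoint rule. The function names are hypothetical and the integration rule is chosen only for brevity.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def area_under_pdf(mu, sigma, lo, hi, steps=10000):
    """Midpoint-rule estimate of the probability that a value falls in [lo, hi]."""
    width = (hi - lo) / steps
    total = sum(gaussian_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return total * width
```

Integrating over a wide enough range gives an area close to 1, and integrating a standard normal density over (-1.96, 1.96) gives an area close to 0.95.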
In a standard normal distribution, µ = 0 and σ = 1, and it is described by the probability density function

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \qquad (3)$$

whereby the factor 1/√(2π) ensures that the area under the curve is equal to 1. The curve is symmetric, and its inflection points are x = +1 and x = -1. The standard normal probability density function is plotted in Figure 2. A confidence interval is a range of values that is expected to contain the true value at a stated confidence level [21]. It can be obtained from a known mean and standard deviation (sigma), which makes the range helpful for approximating the trapMF data distributions. It is suitable for estimating the range calculated from a given dataset [22]. The common choices for the confidence level C are 0.90, 0.95 and 0.99, each corresponding to a percentage of the area under the normal Gaussian curve. Calculations for the left half and the right half of the curve are the same since the Gaussian curve is symmetrical, which naturally produces a symmetrical trapezoid. As shown in Figure 3, each tail of the curve has an area equal to (1-C)/2. For a 95% confidence interval, the area in each tail is 0.05/2 = 0.025 [22].
The value z* in Figure 3 marks the point on the standard normal density curve for which the probability of observing a value greater than z* equals p; it is known as the upper p critical value of the standard normal distribution. For a confidence interval with level C, the value p is equal to (1-C)/2. For C = 0.95, the interval is (-1.96, 1.96), since 95% of the area under the curve falls within this interval [22].
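The relation between the confidence level C, the tail area p = (1-C)/2 and the critical value z* can be checked numerically. The sketch below uses `NormalDist` from the Python standard library (Python 3.8 or later); the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Upper p critical value z* of the standard normal, with p = (1 - C) / 2."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p = (1.0 - confidence) / 2.0             # area in each tail
    return NormalDist().inv_cdf(1.0 - p)     # point with area p to its right
```

Calling `critical_value(0.95)` gives a value that rounds to 1.96, which is why the 95% interval on the standard normal curve is (-1.96, 1.96).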

Inflection Points Approximations for TrapMF
Inflection points are where the curvature of a curve changes sign; for a Gaussian shape, the curve is concave downward near the peak and concave upward in the tails [23]. The inflection points need to be obtained from the probability density function of the Gaussian in order to locate the corresponding points on the x-axis. According to [23], inflection points are normally at ± σ, ±α and α√2. These points apply to a symmetrical, normal Gaussian shape.
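The probability density function (PDF) for a normally distributed random variable with known mean μ and standard deviation σ is given in eq. (4):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (4)$$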
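The notation exp[y] = e^y is used, where e is a constant approximately equal to 2.71828 [24]. The first derivative of the probability density function is found by differentiating e^x and applying the chain rule, as in eq. (5):

$$f'(x) = -\frac{x-\mu}{\sigma^{3}\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (5)$$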
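For completeness, the remaining step can be written out as follows. This is a standard calculus derivation, assuming the usual normal density; it is not a reproduction of the intermediate equations of the original paper. Differentiating once more with the product and chain rules and setting the second derivative to zero locates the inflection points:

```latex
\begin{align*}
f(x)   &= \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right], \\
f'(x)  &= -\frac{x-\mu}{\sigma^{2}}\, f(x), \\
f''(x) &= \frac{1}{\sigma^{2}}\left[\frac{(x-\mu)^{2}}{\sigma^{2}} - 1\right] f(x), \\
f''(x) = 0 &\iff (x-\mu)^{2} = \sigma^{2} \iff x = \mu \pm \sigma .
\end{align*}
```

Because f(x) is strictly positive, the sign of the second derivative is decided by the bracketed factor alone, which changes sign exactly at one standard deviation on either side of the mean.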
Based on the equation, the inflection points occur when x = μ ± σ [24]; that is, one standard deviation above the mean and one standard deviation below it. Hence, the cluster center obtained from FCM clustering is used in eq. (10), where the standard deviation is subtracted from it on the left side of the curve and added to it on the right side. To obtain the lower limit a and the upper limit d, the 95% confidence interval (-1.96, 1.96) is used, whereby the approximation x = μ ± 2σ suits the results for the datasets in the testing process.
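One reading of this approximation is that the inflection points μ ± σ give the inner parameters b and c, while μ ± 2σ gives the outer parameters a and d. The sketch below encodes that reading; the function name and the multipliers `inner_k` and `outer_k` are hypothetical, and eq. (10) itself is not reproduced here.

```python
def trap_params_from_gaussian(mu, sigma, inner_k=1.0, outer_k=2.0):
    """Approximate trapezoid parameters (a, b, c, d) from a Gaussian MF.

    Assumption: b and c sit at the inflection points mu -/+ sigma, and
    a and d at mu -/+ 2*sigma (rounded from the 95% interval +/- 1.96).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if not 0 < inner_k < outer_k:
        raise ValueError("need 0 < inner_k < outer_k so that a < b < c < d")
    a = mu - outer_k * sigma
    b = mu - inner_k * sigma
    c = mu + inner_k * sigma
    d = mu + outer_k * sigma
    return a, b, c, d
```

For a cluster center of 10 and a sigma of 2, this gives the symmetric trapezoid (6, 8, 12, 14). Keeping the multipliers as parameters leaves room for the trial-and-error adjustment of the points.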

Result and Testing
To test the theory, a dataset containing the response times of a web service is used to generate the GaussianMF from FCM. The dataset consists of 6145 points measured in milliseconds (ms). Response time is one of the attributes used to assess the quality of a web service, and the dataset was obtained from online resources. FCM generates two outputs, the mean and the standard deviation, which are used for the trapMF approximation. Figure 4 shows the result from FCM. The number of clusters (three in this particular sample) is determined using a Clustering Validity Index (CVI) [5]. Using the cluster centers (μ) and sigma (σ) of the Gaussian output, the trapMF approximation is performed. Both values are held fixed while the inflection points are tested by trial and error based on eq. (10). Figure 5 shows the trapMF generated from the approximations for the respective GaussianMF. Figure 6 shows the trapMF after being separated by its respective GaussianMF. The trapMF is compared with the trapMF generated by the Matlab toolbox to validate its parameters. Table 1 shows the cluster centers and sigma for the chosen dataset. From these results, the approximation from GaussianMF to trapMF is conducted, producing the parameters a, b, c and d in Table 2. To get the parameters, the function trapmf(), which produces trapezoidal membership functions in the Matlab Fuzzy toolbox, is used. To use trapmf(), the membership function range is obtained from the FCM output and entered in the toolbox, and the type is then set to trapmf. The toolbox's ability to generate and edit membership functions and to design fuzzy inference systems may help users apply fuzzy logic. However, a mathematical solution for generating membership functions is still important, since a toolbox usually involves software compatibility issues and is not publicly available.
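For readers without access to the toolbox, the clustering step can be sketched in plain Python. This is a minimal one-dimensional version of the standard FCM iteration with fuzzifier m = 2 and a deterministic start, not the Matlab routine. Deriving sigma as a membership-weighted standard deviation is an assumption made here for illustration, since the text does not state how sigma is computed.

```python
import math


def fcm_1d(data, n_clusters, m=2.0, max_iter=100, tol=1e-6):
    """Fuzzy C-Means on 1-D data. Returns (centers, sigmas)."""
    if m <= 1.0 or max_iter < 1:
        raise ValueError("need m > 1 and max_iter >= 1")
    lo, hi = min(data), max(data)
    # Deterministic start: centers spread evenly over the data range.
    centers = [lo + (i + 0.5) * (hi - lo) / n_clusters for i in range(n_clusters)]
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Membership update: u[i][k] is the degree of point k in cluster i.
        u = [[0.0] * len(data) for _ in range(n_clusters)]
        for k, x in enumerate(data):
            dists = [max(abs(x - c), 1e-12) for c in centers]
            for i in range(n_clusters):
                u[i][k] = 1.0 / sum((dists[i] / dj) ** power for dj in dists)
        # Center update: mean of the data weighted by u^m.
        new_centers = []
        for i in range(n_clusters):
            w = [uik ** m for uik in u[i]]
            new_centers.append(sum(wk * x for wk, x in zip(w, data)) / sum(w))
        shift = max(abs(p - q) for p, q in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Assumption: sigma is the u^m-weighted standard deviation around each center.
    sigmas = []
    for i, c in enumerate(centers):
        w = [uik ** m for uik in u[i]]
        var = sum(wk * (x - c) ** 2 for wk, x in zip(w, data)) / sum(w)
        sigmas.append(math.sqrt(var))
    return centers, sigmas
```

On two well-separated groups of values, the returned centers settle near the group means, and each (center, sigma) pair can then feed the trapMF approximation.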
Next, a t-test is performed to test for a significant difference between the manually generated and the Matlab toolbox-generated trapMF. The p-value, the probability that the results from the dataset occurred by chance, is the quantity that determines the validity of the result. P < 0.05 means the data have a statistically significant difference [25]. Our aim is to validate that the output of the manually generated trapMF is not significantly different from that of the toolbox-generated trapMF. Tables 3, 4 and 5 show the t-test results from the web response time dataset for each cluster from the FCM outputs. Based on the p-values, the results are valid and statistically significant.

Conclusions
In this paper, we presented a method to generate trapezoidal MFs based on the integration of cluster centers produced by FCM and the mathematical theory of Gaussian distributions. The MF is developed using the FCM clustering method in the Matlab environment. The mean and sigma of the Gaussian output are then used to construct the trapMF mathematically. Overall, the proposed method gives FCM more flexibility by allowing it to generate other membership function types. In future work, the proposed method may be explored further to generate trapezoidal type-2 fuzzy membership functions.