Investigation on the Clusterability of Heterogeneous Dataset by Retaining the Scale of Variables

Clustering with heterogeneous variables in a dataset is no doubt a challenging process owing to the different scales in the data. This paper introduces the SimMultiCorrData package in R to generate an artificial dataset for clustering. The construction of an artificial dataset with various distributions helps to mimic the nature of real datasets. Our experiments show that the clusterability of a dataset is influenced by various factors such as overlapping clusters, noise, sub-clusters, and unbalanced objects within the clusters.


Introduction
The presence of mixed types of variables is unavoidable, especially with the low cost of technology nowadays. Clustering a heterogeneous dataset is a challenging process, and the outcome of the analysis has a significant impact on the interpretation of clusters [1,2,3,4]. Moreover, it demands excessive computational effort and memory storage due to the incorporation of broad categories [5]. The most common approach to treating heterogeneous data is to convert the variables into a single scale of measurement. However, this method may result in information loss [6,7,4]. Meanwhile, conducting a separate cluster analysis can abandon the connections between the variables, which can be inappropriate. Constructing a cluster analysis that involves mixed variables requires a more significant effort to build a mathematical model suitable to the problem.
Past studies have implemented different clustering processes, namely k-means and k-prototypes, but few have used k-medoids. Yet k-medoids demonstrates satisfactory clustering results when the measured variables are of mixed type [8,2]. A variety of programming packages have been introduced in R for generating artificial datasets.
Despite the increasing attention, the existing packages employ the multivariate Gaussian, which is limited to generating continuous data. Therefore, we introduce a package created by Fialkowski et al. [15], known as SimMultiCorrData, to produce a heterogeneous artificial dataset. Generally, the artificial dataset from the SimMultiCorrData package is constructed based on the mean, variance, skewness, and/or kurtosis via the power method transformation (PMT), Y = c_0 + c_1 Z + c_2 Z^2 + ... + c_r Z^r, where Z ~ iid N(0,1), the c's stand for constants computed from the cumulants, and r represents the order of the method. The cumulants refer to the mean, variance, skewness, and kurtosis, as well as the standardized fifth and sixth cumulants for the fifth-order method based on Headrick's method. Nevertheless, this package was not proposed for clustering purposes. The current paper presents further investigations on the behavior of clustering on such an artificial dataset. Since we generate heterogeneous data, the best clustering algorithm that allows for mixed types of variables is the k-medoids algorithm. We assessed the performance of the clustering through selected internal clustering validation (ICV) indices. These ICVs quantify the goodness of a clustering structure based on the compactness of clusters and the separation between clusters. We also investigated the clusterability of the dataset in the presence of noise, unbalanced data points in a cluster, as well as subclusters.

Mixed Variables in Clustering
The partitioning approach aims to establish k clusters so that each object, x, in the dataset belongs to exactly one cluster and each cluster contains at least one object. The core process of partitioning clustering is carried out via an iterative approach that minimizes or maximizes an objective function. Typically, researchers tend to minimize the function to obtain homogeneous clusters. Partitioning around medoids (PAM) [16], a k-medoids method, uses actual objects in the dataset, known as medoids, as the cluster centers rather than the mean points.
PAM enables multiple measures to be incorporated, such as the Euclidean, Manhattan, and Gower distances. Of these, however, only Gower's distance is able to measure the dissimilarity of a heterogeneous dataset.

Gower's Distance
Assume a dataset with n objects and p variables x_ij (i = 1, 2, . . . , n; j = 1, 2, . . . , p), where p stands for the dimension of the variables, comprising c continuous, o ordinal, b binary, or m discrete variables. For an ordinal variable with o_t levels, the values are written as 1, 2, . . . , o_t. Generally, the differences in values are denoted in a matrix structure attained from a dissimilarity measure on the dataset.
Nonetheless, the literature on distances for continuous variables is more extensive than that for mixed types of variables [17]. Currently, several methods have been suggested to address the shortcomings of variation for mixed types of variables (see [18,19,20,21]). The most commonly used measure for dealing with mixed types of variables is Gower's similarity measure, introduced by Gower in the 1970s with features of the Euclidean distance [22]. Generally, the similarity between the ith object and the kth object, s_ik, is calculated as s_ik = (Σ_l w_ikl s_ikl) / (Σ_l w_ikl), where w_ikl refers to the weight and s_ikl to the per-variable similarity. w_ikl = 0 if objects i and k cannot be compared on variable l. For instance, let variable l be the criterion of chest pain after certain exercise. Comparing objects i and k on variable l would then be biased if one of these two objects does not show the pain.
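As an illustration, the weighted-average form of Gower's similarity can be sketched as follows. This is a hypothetical Python sketch, not the paper's R implementation; the record layout and the `kinds`/`ranges` arguments are our own illustrative assumptions.

```python
def gower_similarity(x, y, kinds, ranges):
    """Gower similarity s_ik = sum_l(w_ikl * s_ikl) / sum_l(w_ikl).

    kinds[l]  -- "continuous" or "binary"/"nominal" for variable l
    ranges[l] -- range R_l of variable l (used for continuous only)
    A missing value (None) makes the pair non-comparable: w_ikl = 0.
    """
    num = den = 0.0
    for xl, yl, kind, rng in zip(x, y, kinds, ranges):
        if xl is None or yl is None:        # not comparable: weight 0
            continue
        if kind == "continuous":
            s = 1.0 - abs(xl - yl) / rng    # range-normalised difference
        else:                               # binary / nominal: exact match
            s = 1.0 if xl == yl else 0.0
        num += s                            # unit weight otherwise
        den += 1.0
    return num / den

# Two records: (height_cm, smoker, blood_group)
a = (170.0, 1, "A")
b = (180.0, 1, "B")
sim = gower_similarity(a, b, ("continuous", "binary", "nominal"),
                       (50.0, None, None))  # -> (0.8 + 1 + 0) / 3 = 0.6
```

The similarity lies in [0, 1], and the corresponding dissimilarity used for clustering is simply 1 - s_ik.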
Nevertheless, Gower suggested dismissing ordinal variables. Since many datasets contain ordinal variables, Podani [23] extended Gower's work by inserting a function for ordinal variables, given in the following form: s_ikl = 1 - (|r_il - r_kl| - (T_il - 1)/2 - (T_kl - 1)/2) / ((r_l,max - r_l,min) - (T_l,max - 1)/2 - (T_l,min - 1)/2), where r_il and r_kl are the rank scores of objects i and k on variable l, respectively. If a number of objects share the same rank score on variable l, the tie corrections apply: T_il refers to the number of the n objects that share the same rank score as object i on variable l, T_kl denotes the number of objects that have the same rank score as object k on variable l, T_l,max reflects the number of objects with the maximum rank score on variable l, and T_l,min represents the number of objects with the minimum rank score on variable l. The extension of Gower's distance proposed by [23] allows for a metric or non-metric version. Currently, Gower's similarity index is widely used in support vector machines, machine learning, pattern recognition, bioinformatics, molecular biology, epidemiology, and other disciplines.
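A direct transcription of Podani's tie-corrected formula might look like the sketch below; the function name and the list-based rank input are our own assumptions, not the paper's code.

```python
def podani_ordinal_similarity(r_i, r_k, ranks):
    """Tie-corrected ordinal similarity in the spirit of Podani's
    extension of Gower's coefficient (an illustrative sketch).

    r_i, r_k -- rank scores of the two objects on variable l
    ranks    -- rank scores of all n objects on variable l
    """
    if r_i == r_k:
        return 1.0
    t_i = ranks.count(r_i)          # T_il: objects tied with object i
    t_k = ranks.count(r_k)          # T_kl: objects tied with object k
    r_max, r_min = max(ranks), min(ranks)
    t_max = ranks.count(r_max)      # T_l,max: ties at the maximum rank
    t_min = ranks.count(r_min)      # T_l,min: ties at the minimum rank
    num = abs(r_i - r_k) - (t_i - 1) / 2 - (t_k - 1) / 2
    den = (r_max - r_min) - (t_max - 1) / 2 - (t_min - 1) / 2
    return 1.0 - num / den
```

With ranks [1, 1, 2, 3, 3], the two extreme scores 1 and 3 give similarity 0, adjacent scores 1 and 2 give 0.5, and equal scores give 1, which matches the intended behaviour of the tie correction.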

Internal Clustering Validation (ICV)
The ICV measures (as in [24,25,26,27]) provide better options for researchers to select the most appropriate method for clustering. The validity indices determine the quality of clustering, apart from discovering the optimal K. To attain an effective evaluation of cluster analysis, understanding the 'definition' of internal validation is crucial: internal validation assesses the fit between the clustering structure and the data, based on the distances between the objects within a cluster as well as between the clusters [28].
Handl and Knowles [29] explained ICV from three aspects: (i) compactness, (ii) connectedness, and (iii) separation. Compactness is considered to attain homogeneous clusters; high homogeneity can lead to overlapping clusters. Next, connectedness examines the connection of neighbouring data points within the same cluster. Meanwhile, separation examines the degree of partition between the clusters. Additionally, compactness and separation are inversely related: lower compactness tends to accompany better separation, and vice versa. Based on these measurements, most previous studies conducted clustering by combining the measures of compactness and separation to assess the clusters' density and structure. In fact, compactness and separation were used to estimate the suitable number of clusters K of a dataset. The indices of ICV can be found in [30,31,32].
Intra-cluster compactness and separation are measured using the Dunn (D) index [33]. The D index shares the same range of values as the Davies-Bouldin (DB) index; however, the D index generates a maximum value to indicate compact and well-separated clusters. In addition, DB and D utilize ratio-type coefficients within and between the different clusters, which do not display any trends [25]. Meanwhile, the concept of ΓC is adopted from [34] and designed by [35]. They defined it as the average of the total distance between objects in a cluster relative to other clusters; it also involves the averages of the nearest and the furthest distances between pairs of objects in the dataset (Table 1). In the C-index, the minimum value represents good clusters. [36] suggested that the Silhouette (S) index should determine the objects that do not belong to a cluster and depict how well the objects fit in their cluster. Thus, the index measures the average distance between an object and the other objects within its cluster against the average distance to the objects in the nearest neighbouring cluster.
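For example, the silhouette width just described can be computed directly from a precomputed dissimilarity matrix (such as one produced by Gower's measure). This is a minimal sketch; the list-of-lists matrix format is an assumption for illustration.

```python
def silhouette_widths(dist, labels):
    """s(i) = (b_i - a_i) / max(a_i, b_i) for each object i.

    dist   -- full n x n dissimilarity matrix (list of lists)
    labels -- cluster label of each object
    """
    n = len(labels)
    widths = []
    for i in range(n):
        # a_i: mean distance to the other objects in i's own cluster
        own = [dist[i][j] for j in range(n)
               if labels[j] == labels[i] and j != i]
        a_i = sum(own) / len(own)
        # b_i: mean distance to the nearest other cluster
        b_i = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i])
        widths.append((b_i - a_i) / max(a_i, b_i))
    return widths

# Four objects forming two tight, well-separated pairs
d = [[0, 1, 10, 11],
     [1, 0, 9, 10],
     [10, 9, 0, 1],
     [11, 10, 1, 0]]
s = silhouette_widths(d, [0, 0, 1, 1])
```

All four widths come out close to 1, consistent with a strong clustering structure; widths near 0 would indicate objects sitting between clusters, and negative widths indicate likely misassignment.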

Methodology
This package creates all variables from standard normal variables via an intermediate correlation matrix. Continuous mixture variables are drawn from more than one component distribution. Categorical and count variables are generated from the inverse cumulative distribution function (cdf). As for ordinal variables, the data are generated by discretising the standard normal variables at quantiles in conformity with the targeted marginal distribution. Meanwhile, the count variables are obtained from the standard normal variables via the uniform distribution. Continuous mixture variables drawn from more than one component distribution are described in terms of a mixture distribution.
The setting for generating the heterogeneous dataset is described in the following: (a) the count variables were set to derive from one of three distributions: (i) the Poisson distribution (λ = 2, 6 and 11); (c) the continuous variables were drawn from Gaussian (N(0, 1)), gamma (Γ(α = β = 10)), and chi-square (χ²_4) distributions to obtain both normal and non-normal datasets.
The correlation matrix was drawn from the uniform distribution U(0.25, 0.7) and fell within the feasible correlation bounds. The valid_corr function for 'Correlation Method 1' determines whether the matrix is within these bounds. Subsequently, the variables were generated using the rcorrvar function, with the addition of an error loop to reduce the errors in correlation.
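The marginal distributions above can also be mimicked outside R. The following Python sketch draws the same marginals independently (only λ = 2 is shown, and the target correlation structure that SimMultiCorrData imposes is deliberately omitted, so this is not a substitute for the package):

```python
import math
import random

def poisson(lam, rng=random):
    """Poisson draw via Knuth's algorithm: multiply uniforms
    until the product drops below exp(-lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def draw_row(rng):
    """One synthetic object with the stated marginals (independent
    draws; no correlation structure is imposed here)."""
    return {
        "count":  poisson(2, rng),                  # Poisson(lambda = 2)
        "normal": rng.gauss(0.0, 1.0),              # N(0, 1)
        "gamma":  rng.gammavariate(10.0, 1 / 10),   # Gamma(alpha = 10, rate = 10)
        "chisq4": rng.gammavariate(2.0, 2.0),       # chi-square with 4 df
    }

rng = random.Random(1)           # fixed seed for reproducibility
data = [draw_row(rng) for _ in range(300)]   # 300 objects, as in the paper
```

The sample means should sit near the theoretical ones (2, 0, 1, and 4 respectively), giving a quick sanity check on the marginals.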
Generally, ten variables along with 300 objects were generated. The k-medoids (PAM) clustering then proceeded as follows: Step (1a): BUILD phase.
(c) Compute the dissimilarity between the medoids and the objects by employing Gower's distance.

Assign objects to medoids
(a) Assign the objects to the nearest medoids.
(b) Other objects (non-medoids) are selected as medoids if the objective function is not achieved: the SWAP phase.

Update medoids
Steps (2a) and (2b) are repeated until no change in the medoid locations is noted.
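The assign/SWAP/update loop above can be condensed into a small sketch. This is a simplified hill-climbing variant for illustration: the seeding is naive rather than PAM's greedy BUILD heuristic, and `dist` is assumed to be a precomputed dissimilarity matrix (e.g. from Gower's distance).

```python
import itertools

def pam(dist, k):
    """Simplified PAM-style k-medoids on a precomputed n x n
    dissimilarity matrix; returns (medoid indices, labels)."""
    n = len(dist)

    def cost(meds):
        # objective: total dissimilarity of every object to its nearest medoid
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    medoids = set(range(k))          # naive initial seeds (BUILD is greedier)
    improved = True
    while improved:                  # SWAP phase: try replacing a medoid
        improved = False
        for m, h in itertools.product(tuple(medoids), range(n)):
            if h in medoids:
                continue
            cand = (medoids - {m}) | {h}
            if cost(cand) < cost(medoids):
                medoids, improved = cand, True
                break                # restart the scan with the new medoid set
    # assignment: each object goes to its nearest medoid
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels

# Four objects forming two well-separated pairs
d = [[0, 1, 10, 11],
     [1, 0, 9, 10],
     [10, 9, 0, 1],
     [11, 10, 1, 0]]
medoids, labels = pam(d, 2)
```

Because medoids are actual objects chosen to minimise average dissimilarity, the procedure works for any dissimilarity measure, which is why it pairs naturally with Gower's distance on mixed-type data.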

Selection of k-clusters
For every k selected for PAM, ICV is conducted to identify the best-performing k.
The indices computed in the current paper are presented in Table 1. For review purposes, the ratio-type coefficients of between-cluster and within-cluster variation of the DB and D indices were implemented. DB offers details regarding the separation of clusters, as all clusters are measured to attain the mean values.
The scores of the DB and D indices lie in [0, ∞). However, the score value indicating a good clustering differs: DB should be minimized, while D should be maximized. The range of ΓC is [0, 1], and a score nearest to zero signifies good clustering. The S index takes values in [-1, 1], and an average silhouette width approaching 1 indicates that a strong clustering structure has been found.
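To make the minimize/maximize conventions concrete, selecting the best K per index can be done mechanically. The per-k values below are invented purely for illustration; only the optimization directions reflect the text.

```python
# Hypothetical index scores per candidate k (values are made up;
# what matters is the optimization direction of each index).
scores = {
    "DB": {2: 1.12, 3: 1.25, 10: 1.40},    # minimized: smaller is better
    "D":  {2: 0.040, 3: 0.050, 10: 0.045}, # maximized: larger is better
}
direction = {"DB": min, "D": max}

# Pick, for each index, the k that optimizes it in the right direction.
best_k = {name: direction[name](vals, key=vals.get)
          for name, vals in scores.items()}
```

Here `best_k` reports k = 2 for DB and k = 3 for D, illustrating how different indices can legitimately disagree on the number of clusters.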
We observed the clustering structure at different values of k to identify the effect of cluster diameters and the separation between clusters.

Findings
Based on the 300 generated observations, it was found that the simulation setup creates a few data points with greater distance from the others, as shown in Figure 1. However, we did not exclude those data points, in order to observe how the clustering performed. Generally, a mixed-variable dataset is known to exhibit overlapping clusters; therefore, it is hard to determine the optimal number of clusters K. Relying only on the scatterplot guarantees neither which objects belong to which clusters nor the number of clusters. Hence, the k-medoids algorithm with Gower's distance was employed for further evaluation. The dendrogram in Figure 3 is a compact visualization of the dissimilarity matrix. The y-axis indicates how similar the data points are within clusters and how dissimilar they are between clusters. The heights on the y-axis helped us to identify which data points belong to which clusters (or groups) at specific height values based on their similarity. Note that the dendrogram does not let us set K; it may only suggest an appropriate number of clusters.

Number of Clusters
The assessment of the clustering structure of an algorithm is evaluated based on the ICV values. Looking at Figure 2, it is apparent that DB, ΓC, and S show a point at which a significant change in index values occurs for k = [2, 10]. However, the D index hardly indicates any trend of changes in index values for the heterogeneous dataset. The index values of DB, D, ΓC, and S are 1.1163, 0.0501, 0.0529, and 0.3381, respectively. The DB and S indices suggest the construction of clusters at K = 2. On the other hand, the D index suggests the suitable cluster formation to be K = 3, while ΓC indicates K = 10 as the best number of clusters. Judging from the values obtained and the constraints of the four indices, the suitable K for this artificial dataset should be K = 10, since only the ΓC index is able to meet its constraints.

Cluster Structure Assessment
Further investigations into the behavior of cluster formation based on the selection of K from the DB, D, ΓC, and S indices are presented in Figure 4. From this figure, it can be seen that there is no distinct separation in the formation of clusters for K = 2 and K = 3. Interestingly, a pattern of clusters occurs for K = 10 in Figure 4(c), suggesting that the number of clusters for this dataset is equivalent to three. However, this only holds under some conditions: for example, the data points allocated to clusters 5, 7, and 8 should be classified as one class, and the data points in clusters 2, 4, and 9 as another class. Under these conditions, a precise clustering formation could have formed.
From the average silhouette width (figures not shown for K = 2 and K = 3), the overall average silhouette width for K = 2 is 0.28; the average of the first cluster is only 0.2, while the second cluster recorded an average of 0.45. For K = 3, the overall average silhouette width is 0.24, where the average of the first cluster is 0.3, the second cluster 0.35, and the third cluster only 0.4. For K = 10, the average silhouette width obtained is 0.27, and the average value of each cluster is shown in Figure 5.
We observed the structure at different K to identify the effect of cluster diameter and the separation between clusters. We notice that unbalanced objects within a cluster, overlapping between clusters, the effect of noise on the diameters of clusters, as well as sub-clustering, as shown in Figure 6, have a large impact on the clusterability of this dataset.

Discussion
Different ICVs yielded different numbers of clusters. It is common for a dataset to have more than one K that represents good clusters, because unsupervised clustering makes it more challenging to discover the best K. Careful validation of the clusters is nevertheless needed when discovering the best K, as particular indices are sensitive to outliers, noise, and even subclusters. A wise selection of the ICVs used in a study is essential, since the numerator and denominator of the indices are much influenced by the structure of the dataset.
The independent review by Halkidi, Batistakis, and Vazirgiannis [25] found that validation assessment performs best when clusters are thoroughly compact (well clustered), but performs badly on data with mixed variables. Having mixed types of variables in clustering may create some discrimination in terms of dissimilarity. Bijuraj [37] mentioned that it is more appropriate if a dataset has a single type of variables.
Intersection, insertion, or deletion of a certain magnitude of objects within a cluster significantly influences the dispersion (width) as well as the density of a cluster. As K increases, the intra-cluster compactness tends to decrease while the inter-cluster separation increases simultaneously. k-medoids assigns objects to their nearest medoids; the medoids are the points of the dataset with minimal average dissimilarity to the other points, a concept similar to that of k-means. This objective function explains the implications of these factors.
Since this dataset consists of overlapping data points, the k-medoids algorithm is not suitable for carrying out the cluster analysis. It is known that k-medoids is a hard clustering algorithm, in which each data point must belong to one and only one cluster. Therefore, other clustering approaches, such as density-based methods, should be employed. The differences between the ICVs in the decision on the appropriate K were influenced by various factors that affect the numerator and denominator of the ICVs.
The problem of identifying the appropriate K has created interest among researchers in the discipline of clustering. Moreover, it becomes an essential issue given its effects on the performance of the internal validity of clustering. A few research works on this matter have been carried out and are reported in [38,39,40] and a few others cited elsewhere.

Concluding Remarks
In this paper, we investigated the clusterability of mixed types of variables through an artificial dataset consisting of mixed types of data from various distributions generated by the SimMultiCorrData package. This package allows users to choose between empirical and theoretical approaches, based on one's objective in generating the simulated dataset. We conducted both approaches, and we discovered that the empirical method is time-consuming and that the correlation matrix is always hard to achieve. For this paper, we opted for the theoretical method, using the calc_theory function, with mixed types of variables. Furthermore, the simulated dataset from this package happened to generate some noise and outliers, which was beneficial for identifying their effect on clustering. Since the simulated data were not purposely developed for clustering, we are unable to measure the level of overlap between clusters.
We have demonstrated that by 'retaining' the scale of variables through Gower's distance, the formation of clusters does exist. However, we obtained poor clustering results using the k-medoids algorithm. Results from the average silhouette width indicated that no substantial structure had been found. As for the effect of various factors on the clusterability of the dataset, we performed a weighted clustering approach to reduce the overlapping between clusters while assigning the objects to their nearest cluster. We also deleted data points that were identified as outliers/noise (results not displayed) to obtain better clustering results. Yet, there was no improvement in the validity scores; it further worsened the clustering performance. Besides, the process of clustering mixed types of variables while retaining their scale throughout the analysis was indeed challenging.

Figure 6. Clusterability of the dataset based on various factors: (a) unbalanced data points and outliers/noise in a cluster, (b) overlapping of data points between clusters, and (c) sub-clusters.