Comparison of Unsupervised Learning Algorithms for Identifying Disease Clusters in Cognitive Impairment Using Functional MRI Connectivity Features

Machine learning techniques are often used to model data from functional MRI (fMRI), a noninvasive technique that measures brain activity by detecting changes in blood flow, and such models can be used to classify healthy and disease populations. Most studies use supervised machine learning techniques, which require labeled training data to make predictions. Unsupervised clustering avoids this requirement because it does not need labeled training data. However, most fMRI studies using unsupervised learning offer no justification for selecting one clustering algorithm over another and typically default to the popular K-Means algorithm. To realize the potential benefit of unsupervised learning techniques applied to fMRI data, we compare 12 unsupervised learning algorithms in identifying Alzheimer’s disease clusters based on fMRI connectivity features, with the aim of identifying the most effective algorithm for fMRI connectivity clustering. In an analysis of both clustering accuracy and execution time, the K-Medoids algorithm performed best on fMRI connectivity data.


Introduction
Brain imaging is the use of various techniques to create images of the structure or function of the nervous system, with two major types: structural imaging and functional imaging. Structural imaging aims to show the structure of the nervous system and aids in the diagnosis of large-scale disease. Functional imaging attempts to diagnose disease from a different perspective and is used in pre-surgical planning.
fMRI, or functional MRI, is a noninvasive technique to study and measure brain activity by identifying changes in blood flow. The most common form of fMRI uses the Blood-Oxygen-Level-Dependent (BOLD) contrast, which measures the ratio of oxygenated to deoxygenated hemoglobin in the blood. BOLD therefore reflects the metabolic demands of active neurons rather than neural activity itself. When neurons fire, they need energy delivered from an external source because they hold no energy reserves of their own. This gives rise to the hemodynamic response, in which blood releases oxygen to active neurons faster than to inactive neurons. Since hemoglobin has different magnetic properties in its oxygenated and deoxygenated forms, the response produces a signal that can be detected by an MRI scanner.
The first fMRI data was collected in the early 1990s, and early studies created activation maps, which show how strongly different parts of the image are activated, with high activation indicating that a certain feature was found. After researchers realized that fMRI had higher temporal resolution than PET, they began conducting event-related designs. Work in the late 1990s and early 2000s began examining noise in BOLD data. Since 2000, an approach that analyzes information in patterns rather than at individual voxels has become increasingly popular. Because it is noninvasive, does not pose a radiation threat, and is widely available, fMRI has exploded in popularity since the 1990s, with 160,000 publications from 2000-2009 and about 30,000 publications each year since. fMRI brain imaging is an interdisciplinary science at the intersection of neuroscience and psychology, with several applications in machine learning.
Machine learning teaches computers to do what is natural to humans: learn from experience. Machine learning applies when a task is too complex for handwritten rules, when the rules of a task are constantly changing, and when the nature of the data keeps changing. As a subset of artificial intelligence, machine learning models enable computers to carry out certain tasks much more effectively. As such, these techniques have found increasing use in the last decade for the study of brain imaging data, both to understand mechanistic models of brain functioning and to develop objective biomarkers of mental disorders. This study focuses on machine learning applications in the study of functional MRI connectivity data in those with cognitive impairment. Machine learning techniques can be used to model multivariate patterns in fMRI data.
Specifically, they can be used to construct models that can effectively make predictions for new observations. The data from fMRI images can be used to help us learn the relation between observed features and some outcome so that we can make predictions.
Machine learning is commonly divided into two types: supervised and unsupervised learning. Supervised learning requires correctly labeled examples (a known dataset) to train a model, which is then used to predict the response value for new data. Unsupervised learning instead aims to model the underlying structure or distribution of the data in order to learn more about the data and uncover hidden insights.
Most unsupervised learning methods are a form of cluster analysis. Clustering is a technique to group similar objects together while separating objects that are different. This occurs by first identifying features in the dataset for each observation which are then analyzed to identify clusters by minimizing intra-cluster distance and maximizing inter-cluster distance. Clustering has applications in market research, medical imaging, search result grouping, image segmentation for object recognition, crime analysis, data mining, bioinformatics, and several others.
Clustering can be further described as either hard clustering or soft clustering. Hard clustering algorithms apply when each data point belongs to only one cluster, and soft clustering algorithms apply when each data point can belong to more than one cluster. One hard clustering algorithm is called k-means clustering, which is used to find groups in the data which have not been explicitly labeled, and then to assign each data point to one of the groups previously found based on feature similarity. It does this by partitioning data into k distinct clusters based on the distance to the centroid of a cluster [1].
Unsupervised techniques include various clustering algorithms (hierarchical, k-means, mixture models, DBSCAN, and OPTICS), anomaly detection with k-nearest neighbors or local outlier factor, neural networks, and approaches for learning latent variable models.
Functional connectivity is the association between two or more fMRI time series that makes statements about the functional relationships among brain regions. It is usually quantified by Pearson's correlation coefficient. In this study, we have used Static Functional Connectivity (SFC), which provides one value representing the strength of the connection between two brain regions, as the features for our clustering algorithms.
Pearson's correlation coefficient, denoted by the letter r, describes the strength and direction of the linear relationship between two variables [2]. The coefficient r lies between -1 and 1, with a value of -1 representing a perfectly negative linear association and a value of +1 representing a perfectly positive linear association. As the magnitude of r increases, the strength of the linear relationship increases.
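For reference, the coefficient is conventionally written as below for two series x and y of n samples each. The symbols x_i, y_i, \bar{x}, \bar{y}, and n are our notation rather than the source's; in the present setting, x and y would be the time series of two brain regions.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```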
Currently, most fMRI studies use supervised learning techniques to classify observations, instead of unsupervised learning techniques [3], [4]. In addition, the fMRI studies that do use unsupervised learning techniques default to using the popular k-means clustering algorithm or only one other algorithm without a justification [5]-[8].
Without a justification as to why a selected unsupervised algorithm is most applicable to fMRI data, the outcomes of these studies do not reflect the true potential benefit of unsupervised learning techniques applied to fMRI data. To improve the results of future studies, it is important to find the best-suited unsupervised learning algorithm for fMRI connectivity data, so that there is a scientific basis for selecting one unsupervised learning algorithm over another.
In this study, we address this gap by comparing the performance of 12 unsupervised clustering algorithms in correctly identifying individuals with and without Alzheimer's disease. We believe that the findings of this study will promote a better choice of unsupervised learning algorithm in future studies. Using SFC features of 29 individuals with Alzheimer's disease and 35 matched healthy controls, we applied the 12 unsupervised clustering algorithms and recorded the accuracy and execution time of each, since an effective unsupervised clustering algorithm is characterized by both high accuracy and efficient runtime.
The organization of the paper is as follows: section 2 describes the methods used for organizing the data into clusters; section 3 details the results by presenting clustering accuracies and real-time applicability; and section 4 offers an evaluation of the results and drawn inferences.

Participants
The fMRI data for this study came from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. ADNI is devoted to researching and developing new methods for diagnosing the disease in its earliest stages and provides its data to all scientists without barriers. The ADNI fMRI dataset used for this study consisted of 35 healthy elderly controls who did not have Alzheimer's and 29 Alzheimer's patients, for a total of 64 subjects. Access to the ADNI dataset is available at www.adni-info.org.
The 3D+time processed fMRI data were then converted into ROI×time time-series data from 200 regions-of-interest (ROIs) through spectral clustering using the . A Pearson's correlation analysis was performed between every pair of ROIs, resulting in a 200×200 SFC matrix per subject. T-test analysis was used to determine the significance of the SFC features between each pair of ROIs for all the subjects at the 0.05 significance level (uncorrected). All SFC features that were significant were selected for further analysis, resulting in a 64x2603 (subjects × features) SFC matrix that would be used for unsupervised clustering.
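As an illustration of the feature construction described above, the following minimal Python sketch builds a correlation matrix from ROI time series and flattens its unique ROI pairs into one feature vector per subject. The function names, the toy sizes, and the choice to keep only the upper triangle are our assumptions, not details given in the study, and the group-level t-test selection step is not shown.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sfc_matrix(roi_series):
    """Build a symmetric ROI x ROI correlation matrix from ROI time series."""
    n_roi = len(roi_series)
    sfc = [[1.0] * n_roi for _ in range(n_roi)]
    for i in range(n_roi):
        for j in range(i + 1, n_roi):
            r = pearson_r(roi_series[i], roi_series[j])
            sfc[i][j] = r
            sfc[j][i] = r
    return sfc


def upper_triangle(sfc):
    """Flatten the unique off-diagonal entries into one feature vector."""
    n_roi = len(sfc)
    return [sfc[i][j] for i in range(n_roi) for j in range(i + 1, n_roi)]


# Toy example: 5 ROIs and 40 time points of random data (hypothetical sizes).
rng = random.Random(0)
toy_series = [[rng.gauss(0, 1) for _ in range(40)] for _ in range(5)]
toy_sfc = sfc_matrix(toy_series)
toy_features = upper_triangle(toy_sfc)
print(len(toy_features))  # 5 * 4 / 2 = 10 unique ROI pairs
```

Under the same flattening assumption, 200 ROIs would give 200 × 199 / 2 = 19,900 unique pairs per subject before any feature selection.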

Unsupervised Machine Learning Techniques
In this section, we explain the 12 unsupervised learning algorithms that were compared in this study to identify the most suitable technique for fMRI connectivity data. Although the descriptions are brief to keep the manuscript length within limits, detailed information about each algorithm can be found in the cited references.
K-Means starts by randomly placing k centroids in the dataset and then iterates over two steps: each data point is assigned to its nearest centroid, and each centroid is updated to the mean of all the points assigned to it. The process is repeated until no data point switches centroids, which implies that the clustering has converged [12].
CLARANS begins similarly to K-Means but, instead of computing new centroids as the mean of all points in each cluster, chooses a representative item, or medoid, to represent each cluster at each iteration [13].
Genetic Algorithm clustering generates an initial population of solutions and then optimizes and refines them using principles of evolution and natural genetics [14].
Agglomerative Hierarchical Clustering begins by treating every data point as its own cluster and then iteratively merges clusters until only one cluster remains [15], [16].
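The K-Means iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MATLAB implementation used in the study; the function names, the seed, and the toy points are hypothetical, and the initial centroids are assumed to be drawn from the data points themselves.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Alternate assignment and mean-update steps until labels stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point switched centroids
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2),
              (10.0, 10.0), (10.1, 10.1), (10.2, 10.2)]
labels, centroids = k_means(toy_points, k=2)
print(labels)
```

The cluster numbering depends on the random initialization, so only the grouping of the points, not the label values, is meaningful.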
ISODATA initially assigns randomly-placed cluster centers and then uses the standard deviation within each cluster and the distance between cluster centers to determine whether two adjacent clusters should be split or merged [17].
K-Medoids, like K-Means, iteratively updates a representative for each cluster and reassigns the data points accordingly. It differs in that each representative is an actual data point (a medoid), and the algorithm minimizes the sum of dissimilarities between points and their medoid instead of minimizing the total squared error [18], [19].
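To make the contrast with K-Means concrete, here is a minimal sketch of a simple alternating K-Medoids variant. It is not PAM's swap search and not the implementation used in the study; the function names, the Euclidean dissimilarity, the seed, and the toy points are all assumptions chosen for illustration.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance, used here as the dissimilarity measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: euclidean(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternate assignment and medoid-update steps until medoids are stable."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # indices of the medoid points
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total
            # dissimilarity to the other members of its cluster.
            best = min(
                members,
                key=lambda i: sum(euclidean(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(points, medoids), medoids


# Toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2),
              (10.0, 10.0), (10.1, 10.1), (10.2, 10.2)]
labels, medoids = k_medoids(toy_points, k=2)
print(sorted(medoids))  # [1, 4]: the middle point of each group
```

Because each medoid is an index into the dataset, the representative of a cluster is always an observed point, whereas a K-Means centroid is an average that need not coincide with any observation.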
OPTICS is a type of density-based clustering that identifies higher-density clusters, which contain more data points, before lower-density clusters, based on core distance and reachability distance [20], [21].
Partitioning Around Medoids, or PAM clustering, is a two-phased algorithm that repeatedly swaps selected and unselected objects to improve the quality of the clusters [22].
PSO-based clustering is inspired by swarms of birds. It treats the data points as particles, or searching points, and designates a leader that the entire swarm follows. Over many iterations, the algorithm explores several candidate solutions [23], [24].
Clustering by Local Gravitation (CLA) uses a model of local gravitation that views each observation as an object with mass and computes a parameter called the local resultant force based on the observation's neighbors. These characteristics are then used to separate the dataset into clusters [25].
Spectral Clustering works by clustering data points that are immediately next to each other, regardless of whether another point is closer distance-wise. Points that are connected are placed in the same cluster [11], [26], [27].
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) uses hierarchical clustering for high-dimensional datasets. It generates a compact summary of the data and then clusters hierarchically based on that summary instead of the actual dataset [28], [29].
These algorithms were chosen because they represent the major types of unsupervised clustering: density-based clustering with OPTICS, partition-based clustering with K-Means and K-Medoids, and hierarchical clustering.

Application of Unsupervised Machine Learning Techniques to fMRI Connectivity Data
After the fMRI connectivity data was processed into the 64x2603 (subjects × features) format, the extracted features were passed as input to the machine learning algorithms. The accuracy of each outcome was measured as the number of subjects correctly assigned to their cluster divided by the total number of observations, 64. The execution time of each algorithm was also recorded. The algorithms were run on a computer with the following configuration: Intel i7-7500U CPU @ 2.70GHz with 8GB DDR4 RAM. Figure 1 displays the complete methodological pipeline.
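The accuracy measure above can be sketched as follows. Cluster numbers returned by an unsupervised algorithm are arbitrary, so this sketch assumes that each cluster is matched to the diagnostic group that yields the most correct assignments; the study does not state how this matching was done. The function name and the per-group split in the example are hypothetical, and the sketch assumes there are no more clusters than diagnostic groups.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of subjects correctly assigned under the best cluster-to-group matching."""
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    best = 0
    # Try every one-to-one assignment of clusters to diagnostic groups.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for c, t in zip(cluster_labels, true_labels))
        best = max(best, correct)
    return best / len(true_labels)


# Hypothetical outcome for 29 patients (AD) and 35 controls (HC):
# 24 patients and 29 controls land in the matching cluster, 53 of 64 overall.
true_labels = ["AD"] * 29 + ["HC"] * 35
cluster_labels = [0] * 24 + [1] * 5 + [1] * 29 + [0] * 6
print(clustering_accuracy(true_labels, cluster_labels))  # 0.828125
```

With 64 subjects, every accuracy is a multiple of 1/64; for example, 53 correctly assigned subjects correspond to 82.8125%.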

Results
The algorithm with the most accurate cluster assignments for fMRI data was K-Medoids, which had an accuracy of 82.8125%. Spectral Clustering was second with 81.25%, and the popular K-Means and density-based OPTICS algorithms each obtained an accuracy of 79.875%.
The algorithm with the shortest runtime was also K-Medoids, at 0.0327 seconds on our computer (Intel i7-7500U CPU @ 2.70GHz). Since K-Medoids leads on both aspects of an effective unsupervised learning algorithm, highest accuracy and lowest runtime, it appears to be the best-suited algorithm for clustering fMRI functional connectivity data from those with Alzheimer's disease. Refer to Table 1 for the full results of the success rate and execution runtime for each algorithm. For a visual representation of the clustering success rates, refer to Figure 2.

Discussion
The goal of the study was to identify the most effective unsupervised clustering algorithm for fMRI connectivity data, as judged by how well it groups Alzheimer's patients and healthy subjects. After a comparison of the various unsupervised learning algorithms, K-Medoids clustering was found to have not only the highest accuracy, yielding the most correctly identified subjects, but also the shortest runtime on fMRI data, at 0.03 seconds. For those who want to apply unsupervised clustering to fMRI data, our study suggests K-Medoids clustering whether the first priority is accuracy or efficiency.
Previous literature on unsupervised clustering of fMRI data revolves mostly around the default choice of the K-Means algorithm, without justifying why K-Means is the most suitable choice. A better-suited clustering algorithm could aid in identifying Alzheimer's disease from fMRI data with more accuracy and/or efficiency, which is why research into unsupervised clustering algorithms of different types is necessary.
In this study, it was found that the K-Medoids algorithm, rather than the popular K-Means or any other clustering algorithm, should be used for the most accurate and efficient clustering of fMRI connectivity data. The results indicate the benefits of using K-Medoids instead of the default K-Means: K-Medoids had a higher accuracy (82.8125%) and a low runtime of just 0.03 seconds.
The study highlighted the lack of reasoning in the current literature about why certain unsupervised learning algorithms are chosen in place of others, if they are used at all. Until this study, unsupervised learning algorithms had not been comprehensively compared with one another on fMRI data for the identification of Alzheimer's. In addition, the study identified the best-performing unsupervised clustering algorithm for use in future fMRI studies. Implications of this study include better diagnostic prediction of mental disorders using better algorithms like K-Medoids, provided our findings replicate in a larger sample. Future studies must test whether our findings extend to other psychiatric and neurologic disorders, as well as whether they replicate in a larger sample.
Finally, we list a few limitations of this study and the scope for future work: (i) The sample size of the study was small. The issue with small samples in machine learning analyses is that accuracies estimated on them may not hold up in larger samples. With a sample size of only 64 subjects, the accuracies of the algorithms might have been artificially inflated. (ii) The study consisted solely of data for Alzheimer's disease, so its findings are specific to Alzheimer's disease and may not be applicable to fMRI studies of other conditions. A future study could examine the effectiveness of various unsupervised learning algorithms on other neurological and psychiatric diseases using fMRI data. Moreover, the study ignored the different stages of Alzheimer's disease and treated Alzheimer's as a binary condition, although the disease progresses through stages. (iii) This study was limited to only 12 algorithms, whereas the number of unsupervised learning algorithms that have been developed extends far beyond that. The selection was limited to unsupervised clustering algorithms for which MATLAB scripts applicable to high-dimensional fMRI data could be found online. A more comprehensive study comparing more clustering algorithms would be more effective in identifying the best-suited unsupervised clustering algorithms for fMRI data applications and, ultimately, disease detection.

Disclosures
The authors report no competing interests.