Application of Clustering in the Dimensionality Reduction Algorithms for Separation of Financial Status of Commercial Banks in Ukraine

The issue of determining the financial condition of commercial banks, and on that basis separating investment-attractive banks from problem banks, is extremely important for developing countries. The aim of this study is to verify, using the example of Ukraine, that commercial banks really form separate clusters in which more reliable, stable and efficient banks are well separable from less successful ones. The study used the t-SNE and UMAP dimensionality reduction algorithms and Ward's Agglomerative Hierarchical Clustering algorithm. Visual analysis of two-dimensional t-SNE projections shows that banks of different degrees of risk are well separable and have their own specifics. Clustering on the UMAP projection made it possible to distinguish clusters of Class A banks, "mid-tier" banks and problem banks by different parameters. The t-SNE and UMAP algorithms are compared for this problem. The results show that a purely visual analysis of the two-dimensional map of banks for the last period is best made using the t-SNE algorithm. UMAP, on the other hand, proved to be excellent when used in tandem with the clustering algorithm.


Introduction
The financial condition of commercial banks is one of the most important categories of the financial sector of any developed or developing country. The financial performance of banking institutions significantly affects the financial stability of the entire economy [1]. But an important and difficult issue is the formation of appropriate tools for assessing the financial condition of banks. Different techniques can give different results. The assessment of the reliability of the financial system in the country as a whole will depend on this. The reaction and behaviour of banking institutions during the 2007-2008 crises have been examined and analysed over the past decade by researchers around the world in order to assess the effect, the resulting damage and/or develop methods to mitigate future financial impacts [2][3][4][5]. This issue has become particularly important for developing countries. Such countries have weak financial systems and the reliability of the banking system is important. Confidence in the banking system provides confidence in the economy as a whole [5].
First of all, let us consider the concept of financial condition. We can define the financial condition of a commercial bank as a comprehensive economic characteristic of its activities in the short, medium and long term, determined by the structure of the bank's sources of funds and the effectiveness of their placement. The financial condition of a commercial bank is determined by a number of external and internal factors and by individual characteristics of the bank, which include the quality of assets, capital adequacy, loan to assets ratio, and the quality of management [6]. Sometimes another term with a very close meaning is used: financial performance. Studies of the financial performance of banks also involve such characteristics as capital adequacy, asset quality, quality of management, as well as profitability and bank liquidity [7].
Universal Journal of Accounting and Finance 10(1): 148-160, 2022
We consider these categories in the context of the Ukrainian financial market, some features of which should be noted. It is relatively young and still at the stage of active development [8]; it is to some extent in flux, in a state of uncertainty, in the process of transition from the post-Soviet model to European standards [9,10].
Despite the fact that this transition can be called relatively successful, there are still problematic points in the banking sector of Ukraine that do not allow asserting the unequivocal reliability and efficiency of commercial banks in the system: a high proportion of low-quality assets, a high level of credit risk in general, low efficiency of some systemically important banks, a decrease in the return on assets and capital and the instability of these indicators over time.
Based on the above, it can be unequivocally stated that the question of identifying the most effective, reliable, stable banks is urgent for individuals and legal entities, for whom it is important to determine the investment attractiveness of banks for the correct investment of their funds, which indicates the topicality of this research.
To conduct the study, we decided to move away from classical simulation approaches and turn to more innovative tools: unsupervised machine learning methods combined with data dimensionality reduction algorithms, since we assume that they hold unrealized potential for solving problems in the domain under study.
There are many works on data clustering in the scientific literature, and a significant part of them deal with the development of the algorithms themselves rather than their application, for example, the work of researchers [11]. Studies in the banking field often focus on a particular side of the bank or business line; for example, author [12] examined the bank's credit risk. The subject of a comprehensive analysis of the banking system using methods and tools of unsupervised machine learning with data dimensionality reduction algorithms has not been widely covered in the scientific literature, and it is even more difficult to find high-quality research in the context of Ukrainian realities. Therefore, we can argue that this study represents scientific novelty and is aimed at filling the above research gap.
Two objectives were identified to achieve the goal of this study. The primary objective is to make sure that the separation of commercial banks (their clustering), based on indicators of different aspects of their activities, is possible, and that successful investment-attractive banks are separable from problem banks, since banks with similar indicators lie next to each other in the multidimensional space. The secondary objective is to verify whether t-SNE is a suitable dimensionality reduction algorithm for working with data in the banking domain, or whether it is good for data visualization only and UMAP suits better.
The rest of the article is structured as follows. The Literature Review contains a review of the world scientific literature covering various aspects of this research and a list of the hypotheses put forward. Material and Methods contains a detailed description of the data used in this work and the approach to their pre-processing, as well as the dimensionality reduction and unsupervised machine learning algorithms used. The Results and Discussion section presents the proposed approaches to diagnosing the financial condition of Ukrainian commercial banks through the t-SNE and UMAP dimensionality reduction algorithms and Ward's Agglomerative Hierarchical Clustering algorithm, and tests the research hypotheses. The last part is Conclusions, which additionally considers the limitations of this work and suggests directions for further research.

Literature Review
One of the key categories of this work is the financial condition of a commercial bank. Some works, for example, author [13], say that indicators of banks' financial performance include profitability and return on assets. But this approach is more focused on bank owners than on their customers. From the point of view of clients, the bank's liquidity and stability are more important parameters than its profitability [13]. This category is also often referred to in the scientific literature as financial performance, and has been considered by researchers in different contexts, for example, in the context of corporate social responsibility [14]. We consider the financial performance of the bank as a result of the effect of different indicators of its operation. A similar approach can be found in the work of researchers [7], where the category of financial performance itself and the factors influencing it are considered.
If we consider the economic and mathematical models for diagnosing the financial condition of commercial banks, the frequent use of classical methods can be noted. Researcher [15] considers a model based on multiple discriminant analysis (MDA) and compares it with logistic regression in the context of diagnosing Slovak bank bankruptcies. As a result of the analysis, the author comes to the conclusion that "MDA is not a recommended method for bankruptcy prediction", and "in terms of logit model estimation, we have to admit its uselessness" [15]. Machine learning methods can be distinguished among more modern ones. For example, some authors consider machine learning based on modified logistic regression, which has shown good results in solving the problem of credit scoring. Ensemble models show high efficiency in the tasks of analysing bank performance; for example, researchers [16] consider the bagging method built on three tree-based models, which is used to predict the key factors of profitability of Turkish banks (Development and Investment Banks).
Unsupervised machine learning is one type of machine learning; models of this type solve clustering problems. Clustering has been used in the research of many scientists, and a good introduction to it as a separate class of algorithms is given by works that review and compare such algorithms [17,18]. Central banks around the world develop and implement their own systems and algorithms that best meet the characteristics of national financial systems. At the international level, such systems are still being harmonized for comparability. Clustering is a popular method because it makes it possible to divide banks into groups that share common features. This approach provides an opportunity to develop recommendations for each group of banks, taking into account the problems and risks that are specific to this group [17].
It is difficult to argue that the clustering of commercial banks has received much attention in the scientific literature. The article by researcher [19] considers the problem of identifying failing banks using neural networks such as self-organizing maps, concluding that "the results show that self-organizing maps can successfully carry out bank clustering tasks and identify banks that require immediate attention from the regulatory bodies" [19]. The work of researcher [20] deals with the clustering of Ukrainian banks using Ward's agglomerative hierarchical clustering, with the aim of dividing them into three groups ("Growth", "Stabilization", "Decline"). Researchers [21] also applied hierarchical cluster analysis to identify and separate problem banks from the rest based on 12 performance indicators. Considering that there is a certain research gap in studies of the Ukrainian financial market using machine learning and clustering methods, and that many existing works may have lost their relevance due to the high dynamism of the financial sector, we would like to fill this gap in this work and test the following hypotheses:
H1: Separation of Ukrainian commercial banks (their clustering) based on a wide range of indicators of their activities is possible, and effective investment-attractive banks are separable from problem banks.
H2: Problem banks tend to fall into areas where closed banks were located in the last period before they were liquidated, and this can be used to identify them even before clustering is applied.
An important stage in our research is the analysis of multidimensional data, which is a very difficult, sometimes impossible task due to the phenomenon known as the "curse of dimensionality" [22]. Moreover, visual analysis requires displaying a multivariate array of model features in two or three dimensions. Dimensionality reduction algorithms are used to solve this problem. The t-distributed Stochastic Neighbour Embedding (t-SNE) algorithm is one of the most popular among them; it was proposed by researchers [23] for multivariate data visualization. However, some researchers [24] see this algorithm as a pre-processor for further clustering, when clustering is performed on a two-dimensional map obtained after dimensionality reduction. Other researchers [25,26], on the contrary, point to the limitations of the algorithm, which include a strong dependence on the local data structure and the inability to preserve the global structure, the inability to identify outliers, and high requirements for computing power. The Uniform Manifold Approximation and Projection (UMAP) algorithm [34] can be used as an alternative. It is relatively young, but has already gained popularity in scientific research in the fields of bioinformatics, materials science, natural sciences and machine learning. Researchers [27,28] provide a comprehensive literature review on this topic and note the computational efficiency of the algorithm (speed of operation), its ability to form meaningful clusters, and its high reproducibility. Researchers [29] test four dimensionality reduction algorithms on 71 genetic datasets, stating that UMAP is superior to the PCA and Multidimensional Scaling (MDS) algorithms and shows advantages over t-SNE.
However, the work of Kobak & Berens (2019), which analyses the t-SNE algorithm in detail and shows how to optimize it, also compared t-SNE and UMAP and concluded that, with correct parameter settings, t-SNE preserves the global structure no worse than UMAP, that reproducible results can be achieved by using PCA initialization, and that the FIt-SNE modification can significantly increase the speed of the algorithm.
So, it is difficult to speak with certainty about the superiority of either algorithm, and given the polarity of opinions in the research field, we would like to put forward and test the following hypothesis:
H3: The UMAP algorithm is the most suitable for data dimensionality reduction in the study of Ukrainian commercial banks from the point of view of economic interpretation.

Research Methodology
The research design of this study includes all the stages necessary to accomplish its key objectives: Data collection, Feature preparation & selection, data analysis & pre-processing, Dimensionality reduction, Clustering.

Data Collection
The object of our research is Ukrainian commercial banks. To conduct it, we collected the relevant supervisory statistics of the National Bank of Ukraine (NBU), namely quarterly data on the grouped balance sheet balances of banks for the period QI 2015 to QIV 2020, broken down into four groups: Assets, Liabilities, Equity, Financial results (NBU data, Aggregated outstanding amounts on balance sheet accounts of the Ukrainian banks, 'Aggregation_eng.zip').

Feature Preparation & Selection, Data Analysis & Pre-processing
Taking into account the specifics of our research, we relied on a number of Ukrainian-language sources which provide an overview of diverse banking indicators. As a result, we identified 143 coefficients, 63 of which can be calculated based on data from the open statistics of the NBU.
We took a number of measures to eliminate distortions in the data. First, missing (N/A) values (110 observations out of 2243, i.e. about 5% of the dataset) were replaced with 0 in order to keep the affected banks in the sample while giving all such observations a common characteristic. We assume that zero values of the ratios more often indicate problems at a bank.
Second, we applied a data winsorization approach to limit outliers [30]. We additionally used the classic standardization to smooth the distribution:

$$z = \frac{X - \mu}{\sigma} \qquad (1)$$

where $X$ is the raw score, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.
We also used a robust scaler paired with winsorization. The robust scaler scales features using statistics that are robust to outliers. This method removes the median and scales the data by the interquartile range, i.e. the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Therefore, its formula is as follows:

$$X_{\text{scaled}} = \frac{X - \operatorname{median}(X)}{Q_3(X) - Q_1(X)} \qquad (2)$$

This data processing improved the quality of the dataset. The overall mean across the entire dataset decreased from 475.53 to 0.00 with the standard scaler and to 0.30 with the robust scaler; the overall standard deviation decreased from 10394.29 to 1.00 and 2.08, respectively.
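The pre-processing steps described above (zero imputation, winsorization, standard and robust scaling) can be sketched as follows. This is a minimal illustration on synthetic data, not the code used in the study: the function names, the 1st/99th percentile winsorization limits and the random matrix are assumptions made for the example.

```python
import numpy as np

def winsorize(X, lower_pct=1.0, upper_pct=99.0):
    """Clip each column to its [lower_pct, upper_pct] percentiles.
    The percentile limits are illustrative assumptions."""
    lo = np.percentile(X, lower_pct, axis=0)
    hi = np.percentile(X, upper_pct, axis=0)
    return np.clip(X, lo, hi)

def standard_scale(X):
    """z = (X - mean) / std, using the population standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def robust_scale(X):
    """(X - median) / (Q3 - Q1), i.e. centring on the median and
    dividing by the interquartile range."""
    q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - med) / (q3 - q1)

# Synthetic stand-in for the ratio matrix (rows: observations, columns: ratios).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0, 0] = 1e4        # an extreme outlier
X[1, 1] = np.nan     # a missing value

X = np.nan_to_num(X, nan=0.0)   # missing values replaced with 0
X_w = winsorize(X)              # outliers limited, not removed
X_std = standard_scale(X_w)
X_rob = robust_scale(X_w)
```

scikit-learn's `StandardScaler` and `RobustScaler` implement the same two transformations with their default settings; the hand-written versions are shown only to make the formulas explicit.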
Third, we performed multicollinearity analysis using Pearson's pairwise correlation and Variance Inflation Factor (VIF) metrics, relying on the work of researchers [31], in which these methods, among others, are considered as tools for determining the correlation between factors, and their thresholds are indicated at 0.7 and higher for the correlation and 10 or more for the VIF indicator. It is known that multicollinearity is extremely difficult and often impossible to correct by changing the data itself. But the solution to this problem is the feature selection process. We gradually added features to the model following a bottom-up approach based on the pairwise correlation metrics, VIF and two complementary methods, Mean Absolute Difference [32] and Spectral Feature Selection (SPEC) [33]. We also relied on our expertise in the banking area, primarily testing the indicators that, in our opinion, were the most important for inclusion in the model for assessing banking performance. A total of 17 features are included in the model. A complete list is provided in the Appendix. It was decided to leave one pair of coefficients (K12 and K113), in which the pair correlation is higher than 0.7 (0.98) and the VIF is greater than 10 (397.61 and 438.49, respectively), since their inclusion in the model has a positive effect on the interpretability and quality of data separation in two-dimensional space. In this case, this pair of features can be considered as one feature with increased weight in the model. As a result of the selection of indicators, the average VIF value for all data decreased from 1010.140 to 56.096. A further change in the number of features and/or their replacement did not affect the quality of clustering and the interpretability of the results at best; it had a negative effect at worst.
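The two multicollinearity checks mentioned above can be illustrated with a short sketch. It uses the thresholds quoted in the text (0.7 for the pairwise correlation, 10 for the VIF); the synthetic features and helper names are assumptions for the example. The VIF is computed through the standard identity that the VIF of feature j equals the j-th diagonal element of the inverse correlation matrix.

```python
import numpy as np

def pairwise_correlation(X):
    """Pearson correlation matrix of the columns of X."""
    return np.corrcoef(X, rowvar=False)

def vif(X):
    """Variance Inflation Factor per column: VIF_j = 1 / (1 - R_j^2),
    obtained as the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

# Synthetic features: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
X = np.column_stack([a, b, c])

corr = pairwise_correlation(X)
vifs = vif(X)

# Flag feature pairs and features that exceed the thresholds from the text.
flagged_pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
                 if abs(corr[i, j]) >= 0.7]
flagged_vif = [j for j, v in enumerate(vifs) if v >= 10]
```

In a bottom-up selection such as the one described, a candidate feature would be added and these two checks re-run; whether a flagged pair is nevertheless kept (as with K12 and K113) remains a judgment call.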

Dimensionality Reduction
We opted for two algorithms to reduce the dimensionality of multidimensional data: t-SNE and UMAP. We also considered the Principal Component Analysis (PCA) algorithm, but it showed significantly worse results, and the quality of clustering and data interpretability in the two-dimensional plane turned out to be unacceptably low.
When working with algorithms and setting their hyper-parameters, we relied on the original work of the authors of UMAP [34] and t-SNE [23], as well as on the work of researchers [29,35].

Clustering
We have chosen Ward's Agglomerative Hierarchical Clustering algorithm to solve the problem of data clustering. Hierarchical clustering is a cluster analysis technique that aims to build a hierarchy of objects. An overview can be found in [36,37]; the latter also provides an implementation of the algorithm in Python. Ward's method is described in the author's work [38]. The algorithm is implemented in the scikit-learn Python library and requires only one mandatory parameter: the number of clusters.
To assess the quality of clustering, three metrics were used that are suitable when the ground-truth cluster labels are unknown: the Silhouette coefficient [39], the Calinski-Harabasz score [40] and the Davies-Bouldin score [41].
Data analysis and descriptive statistics were prepared using the Deductor Studio software package developed by BaseGroup Lab. The import wizard and the Statistics module were used for this purpose.
All other stages of the research were carried out using the Python programming language in the Google Colab environment, with the support of the freely available libraries numpy, scipy, pandas, matplotlib, seaborn, sklearn, statsmodels, umap, pyclustertend.
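As a minimal sketch of how Ward clustering is invoked in scikit-learn, the example below clusters three synthetic blobs. The data and variable names are assumptions for illustration; only `n_clusters` and the Ward linkage reflect the description above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three well-separated synthetic groups of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Ward linkage merges, at each step, the two clusters whose union
# gives the smallest increase in within-cluster variance.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

# children_ stores the merge tree: one row per merge.
merges = model.children_
cluster_sizes = np.bincount(labels)
```

The `children_` attribute exposes the hierarchy itself, which is what distinguishes this family of methods from flat partitioning algorithms.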

Results
In our research, we came to several approaches to diagnosing the financial condition of Ukrainian commercial banks.

Dimensionality Reduction
In the first approach, we conduct a purely visual analysis of the distribution of points, using only dimensionality reduction algorithms, mapping the multidimensional space (these are 17 dimensions in our case) in a two-dimensional projection. To do this, we use the t-SNE algorithm, which has shown a fairly good result in preserving the local and global data structure and excellent results for subsequent visual interpretation. Our main goal is to find out if there are any regularities and economic sense in the distribution of banks on a two-dimensional projection in the context of one period (the last one). Let us refer to Figure 1 for this purpose.
Source: own study based on statistical data of the National Bank of Ukraine (NBU data, aggregated outstanding amounts on balance sheet accounts of the Ukrainian banks, 'Aggregation_eng.zip')
On the map in Figure 1, yellow labels indicate active banks, and red ones indicate those that have left the market (data for the last available period). If we consider only active banks (that is, yellow labels on the projection map), we can clearly see that all banks are divided into two sectors: the banks in the lower part are clearly separated by a gap from the banks in the upper part of the map. We draw a line R1 through this gap to indicate the separation visually. We emphasize that this line is not based on any specific calculations but on visual analysis only, and its single purpose is to highlight the gap between active banks (yellow labels) that can be noticed in the middle of the projection map. We can consider the R1 line simply as a visual separator that divides the whole map into two major groups: above and below the line. We can also notice that the bank points form groups among themselves. Let us analyse the separated areas and the banks therein, relying on two approaches: considering the rankings of banks (credit ratings of national rating agencies, rankings of the reputable Ukrainian financial Internet publication Minfin and the official ranking of the NBU), and analysing a number of key ratios (K13. Credit losses provision ratio; K113. Credit-investment portfolio share; K51. Highly liquid assets share; K31. Autonomy ratio (capital adequacy); K68. Return on costs).
Having considered all the banks, we can draw the following conclusions. Based on our visual analysis of the above projection map, we can distinguish five major groups of banks. The A+ area (the lower sector, where the bank points converge to the lower left corner) groups first-class banks that have the highest rankings from other experts and show good results in terms of their indicators. It contains the TOP-10 banks according to the Minfin ranking and 7 of the TOP-10 banks according to the NBU ranking; 11 out of 12 banks have the highest uaAAA rating. The B+ area (the lower sector, centre) contains mid-tier banks, mostly small businesses with above-average performance. The Corp area (lower sector, on the right) contains predominantly corporate, international and/or investment banks. The upper part of the upper sector groups state-owned banks, which are rather problematic in terms of their indicators; however, they cannot be unequivocally classified as such given the support of the state. The rest of the banks above R1 can be attributed to the Risk group: these are potentially problem banks, which form their own separate subgroups, each characterized by its own set of indicators according to which banks can be classified as problem banks. As Figure 1 also shows, most of the closed banks are located above the line, in the same place as problem banks (59 out of 73, that is, about 81%). There is a relationship between the indicators of the problematic sides of banks and the reasons for the closure of the neighbouring inactive banks. For example, the area P includes active banks with low indicators of liquidity and capital adequacy, and inactive banks closed due to lack of capital and liquidity. We can also see that some inactive/closed banks are located in the 'safe' area, below the R1 line. Most of them (9 out of 14 banks that are located below the line) are grouped near the border.
The general reasons why those banks were recognized as insolvent are a lack of liquidity and capital adequacy. Thus, we can conclude that those institutions are attracted to the P area on the projection map but are held back below the R1 line because indicators other than liquidity and capital adequacy were mostly healthier than those of the banks in the P area (in the last period when they were active). The other 5 banks, located near the A+, B+ and Corp areas, left the market for different reasons, such as consolidation with another bank, closure by the NBU due to low ownership structure transparency, or financial fraud. Such actions generally do not have any visible effect on the indicators considered in this research, so these banks were mapped in the 'safe' areas.
We can draw several important conclusions based on the results of only a visual analysis of a two-dimensional projection map obtained through t-SNE. First: the bank points located in the multidimensional space of the features we have selected really attract each other, having common features. This pattern is preserved when projecting data into a two-dimensional plane. Efficient and problem banks will "push off" from each other, locating on opposite sides of the centre of the map (in our case, these are the lower and lower left parts of the map, and the upper and upper right parts of the map), while mid-tier banks will be located near the centre of the map closer to investment-attractive banks. Second, problem banks do converge to fall into areas where closed banks were in the last period before they were liquidated. The location of a bank in an area of accumulation of already inactive banks is a worrying sign, and the reasons for the closure of the neighbouring inactive banks may indicate possible problem areas of active businesses. So, we can consider H2 hypothesis proved. Such an analysis will not be enough to prove hypothesis H1, thus we will further additionally use the clustering algorithm to test it.
The most adequate result when reducing the dimensionality was obtained with the following t-SNE settings: perplexity (which regulates the balance of attention between local and global aspects of the data) = 15, learning rate = 500, iterations = 1000, initialization with PCA, Barnes-Hut modification [42]. This configuration was determined through numerous experiments with different hyper-parameters and generally corresponds well with the recommendations of researchers [35]. With these settings, the algorithm shows a sufficient ability to preserve both the local structure (banks with similar characteristics are located nearby in clusters) and the global data structure (clusters of investment-attractive and problem banks are significantly distant from each other). UMAP, in turn, focuses more on the global structure, and although individual clusters of points can be well separable, the points within the clusters can be stuck very close to each other.
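The reported t-SNE settings could be expressed with scikit-learn's `TSNE` roughly as below. This is a sketch under stated assumptions: the input is a random 17-column matrix standing in for the scaled bank features, `random_state` is added only for reproducibility, and the iteration count is left at the library default of 1000 because the keyword for it differs between scikit-learn versions.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the scaled feature matrix (17 features, as in the model).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 17))

tsne = TSNE(
    n_components=2,        # two-dimensional projection
    perplexity=15,         # balance between local and global structure
    learning_rate=500,
    init="pca",            # PCA initialization
    method="barnes_hut",   # Barnes-Hut approximation
    random_state=0,        # assumption: fixed seed for repeatable maps
)
embedding = tsne.fit_transform(X)   # shape: (n_samples, 2)
```

The resulting `embedding` array holds the map coordinates that would then be plotted and inspected visually.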

Clustering of Commercial Banks with Dimensionality Reduction Algorithms
In the previous approach, we used the t-SNE dimensionality reduction algorithm, and it showed good results for purely visual analysis of data for the last period. However, the results of this algorithm are poorly suited for further clustering (Figure 2, right box). Figure 2 shows that the points on the low-dimensional projection form a field with a large number of groups of low-density points that are difficult to cluster regardless of the method chosen.
The UMAP algorithm, in turn, produces projections that are slightly less amenable to visual analysis, but it showed the highest quality result when used in conjunction with clustering algorithms. The following hyper-parameters were used for dimensionality reduction in UMAP: n_neighbors (number of neighbours, controls the local vs global structure preservation) = 20, min_dist (minimum distance apart that points are allowed to be in the low-dimensional representation) = 0, distance metric = correlation. The correlation distance is calculated as follows:

$$d(u, v) = 1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\lVert u - \bar{u} \rVert_2 \, \lVert v - \bar{v} \rVert_2} \qquad (3)$$

where $u$ and $v$ are the two data arrays holding the coordinates of the data points being compared; $\bar{u}$ and $\bar{v}$ are the means of the elements of $u$ and $v$, respectively; and $(u - \bar{u}) \cdot (v - \bar{v})$ is the dot product of $(u - \bar{u})$ and $(v - \bar{v})$.

We experimentally determined the optimal values of the parameters, relying on the official documentation of the UMAP library, which in turn is based on the work of the author of the algorithm [34].
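To make the correlation distance concrete, here is a small NumPy sketch of the standard definition (one minus the Pearson correlation of the two vectors). The function name and the toy vectors are assumptions for illustration.

```python
import numpy as np

def correlation_distance(u, v):
    """1 - Pearson correlation between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()   # centre each vector on its own mean
    vc = v - v.mean()
    return 1.0 - np.dot(uc, vc) / (np.linalg.norm(uc) * np.linalg.norm(vc))

u = np.array([1.0, 2.0, 3.0, 4.0])
d_same = correlation_distance(u, 2 * u + 1)   # perfectly correlated
d_opposite = correlation_distance(u, -u)      # perfectly anti-correlated
```

The distance ranges from 0 (identical profile shape) to 2 (opposite shape), so two banks with proportionally similar ratio profiles are treated as close even if their magnitudes differ. In the umap-learn library the settings listed above would presumably correspond to `umap.UMAP(n_neighbors=20, min_dist=0.0, metric="correlation")`; the exact call used in the study is not given in the text.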
Dimensionality reduction resulted in a map with projections of bank points onto a two-dimensional plane (Figure 2, left box). We used Ward's Hierarchical Agglomerative algorithm, for which only the number of clusters has to be set. In order to determine the optimal number of clusters, we relied on the clustering quality metrics discussed in the Research Methodology section, and on the condition that the number of clusters should not exceed 30 (our assumption based on visual analysis of the two-dimensional UMAP projection in Figure 2). Having run the clustering and evaluated its quality for the number of clusters from 2 to 30, we found that the optimal number of clusters is 11. At this value, the silhouette coefficient is 0.574, the Davies-Bouldin score is 0.532 and the Calinski-Harabasz score is 6029.184, which can generally be considered high enough quality for clustering, given that we did not remove outliers from the data but only smoothed them to preserve their impact on the division of banks into groups.
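The search over the number of clusters can be sketched as a simple loop. The example below is an illustration on synthetic two-dimensional points, not the study's code: the helper name is hypothetical, and picking the candidate with the highest silhouette coefficient is an assumed selection rule, since the text does not state how the three metrics were combined.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

def scan_cluster_counts(points, k_min=2, k_max=30):
    """Run Ward clustering for each k and collect the three quality metrics."""
    rows = []
    for k in range(k_min, k_max + 1):
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(points)
        rows.append({
            "k": k,
            "silhouette": silhouette_score(points, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(points, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(points, labels),        # lower is better
        })
    return rows

# Synthetic stand-in for a 2-D projection: four compact groups.
rng = np.random.default_rng(1)
centers = [(0, 0), (8, 0), (0, 8), (8, 8)]
points = np.vstack([np.array(c) + rng.normal(scale=0.4, size=(50, 2)) for c in centers])

results = scan_cluster_counts(points)
best = max(results, key=lambda r: r["silhouette"])   # assumed selection rule
```

In practice the three metrics do not always agree, so the table in `results` is usually inspected as a whole rather than reduced to a single criterion.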
The table below describes the resulting clusters (Table 1). Based on a set of indicators (K0. Assets share in the system; K13. Credit losses provision ratio; K113. Bank business activity ratio; K26. Dependence on naturals deposits ratio; K31. Autonomy ratio (capital adequacy); K51. Highly liquid assets share; K68. Return on costs), we have formed our own simple ranking system and ranked banks using it. The rankings attributed to the banks according to our system mostly coincide with the Minfin and the NBU rankings. For each cluster, the table shows the average ranking of all banks in the cluster. Let us note that this ranking system does not claim to be completely objective and highly accurate. Although the rankings of banks are based solely on ratios calculated from the current open statistics of the NBU, the ratios used to calculate the rankings were selected subjectively, based on our vision of the importance of certain indicators for assessing the financial condition of banks. The purpose of the ranking system is not to assess banks as bad or good; it serves solely for the quantitative assessment of the clusters formed in the course of simulation, for their further comparison with each other. An important note here is that the clusters presented in Table 1 are based on a mathematical approach, specifically cluster analysis. This approach differs from the one we used when analysing Figure 1, which was based fully on a visual analysis of the projection map. Thus, the groups identified in Figure 1 and the clusters described in Table 1 are not directly related to each other, as they result from different approaches. However, both approaches provide fairly good outcomes in separating different classes of banks. In Table 1, the features of the clusters that we consider to be cluster-forming based on the analysis of the ratios are highlighted in bold.
As we can see from the table, all banks can be conditionally divided into 3 large groups based on their financial condition and investment attractiveness. The first group is Class A (1st tier) banks (Cluster 1) or banks that can potentially be considered as such, but have minor problems in one or several indicators of their activity (Clusters 2-3). Such banks have the highest investment attractiveness and are reliable both for lending and for investing funds. The second group contains the largest number of banks (Clusters 4-6). Most of them can be classified as Class B (2nd tier) banks. They are divided into corporate and retail/universal (Clusters 4 and 6). The former may be of interest to clients looking for stable banks serving legal entities, and the latter may be considered attractive to the population as small and/or local banks. The rest of the banks (Cluster 5) are mid-tier banks that can hardly be called investment-attractive, but which are not unambiguously problematic. Those on the brink between the two categories can rather be characterized as risky. The third group is problem banks (Clusters 7-11). Each of the clusters has its own set of features that describe the banks that fell into it and indicate their problematic sides.

Table 1

| Cluster | Number of banks | Description | Average ranking |
|---|---|---|---|
| | | Medium banks, potential Class A banks, most of which have problem assets in their portfolios. | 6.20 / 9.00 |
| 3 | 1 | Potential Class A banks, which, however, have problems with asset quality and capitalization. | 6.00 / 9.00 |
| 4 | 10 | Class B banks. Mostly medium-sized banks that are active in lending and investment activities, some of them have a problem with profitability. | 5.50 / 9.00 |
| | 12 | Mostly medium-sized banks that do not stand out in terms of their performance. Most of the banks are pursuing a moderate or passive credit and investment policy and have problems with profitability. Some banks are undercapitalized and have problem assets in their portfolios. | 5.08 / 9.00 |
| | | Small banks, most of them focused on corporate lending. Most of the banks conduct a moderate or passive lending and investment policy. All banks have problem assets in their portfolios and problems with profitability; 4 out of 7 banks are unprofitable. | 2.57 / 9.00 |
| | | Technical banks focused on specific goals within the public financial system. They practically do not conduct banking activities in the usual sense. They are low-liquid and unprofitable. | 2.50 / 9.00 |
| 10 | 8 | Mostly small banks with low liquidity. Most of the banks conduct a passive lending and investment policy, have an unstable resource base and problem assets in their portfolios, and some banks are low-profit. | 2.13 / 9.00 |
| 11 | 7 | All banks pursue a passive policy, which is most likely justified by a small number of assets and a weak resource base. All banks have a significant share of problem assets on their balance sheets. 5 out of 7 banks are unprofitable, some are undercapitalized. | 2.00 / 9.00 |

Source: own study based on statistical data of the National Bank of Ukraine (NBU data, Aggregated outstanding amounts on balance sheet accounts of the Ukrainian banks, 'Aggregation_eng.zip')

Having obtained sufficiently adequate results in the course of interpreting the clusters, we can conclude that the separation of Ukrainian commercial banks (their clustering) based on a wide range of indicators of their activities is possible and efficient, and that investment-attractive banks are separable from problem banks. We thus confirmed our hypothesis H1.
Even after the research and tests carried out, it is still difficult to draw an unambiguous conclusion about the choice of a dimensionality reduction algorithm. However, we tend to reject hypothesis H3 that the UMAP algorithm is the most suitable for data dimensionality reduction in the study of Ukrainian commercial banks in terms of economic interpretation of the results. We justify this by the fact that, once its hyper-parameters are optimized, the t-SNE algorithm is slightly better suited than UMAP to visualization and subsequent economic interpretation of the resulting two-dimensional projection, although UMAP was still preferable for cluster analysis.

Discussion
In our research, we relied on the works of researchers [19][20][21], which are closest to our study and focus on dividing banks into clusters based on their financial performance indicators. However, they use small sets of indicators that describe mainly one side of the operation of banks. The study in [20] was carried out in the context of the Ukrainian market, but it dates from 2016 and its results have lost their relevance, since the banking system has undergone changes since then. Besides, a system of bank separation based on a solid economic and mathematical methodology has not been introduced and enshrined at the legislative level in Ukraine. We therefore attempted to expand the studies presented above using the example of Ukrainian banks, involving a wider range of indicators, and to propose our own approach to dividing banks into groups/clusters and analysing their financial condition.
While selecting a general list of possible indicators, we analysed a number of Ukrainian- and Russian-language sources, and found a list of similar indicators in the works of researchers [19,21,43,44,45]. When selecting the basic set of indicators, the pairwise correlation matrix and the VIF metrics were among our key criteria [31]. This approach is quite popular and is used, for example, in the work of researcher [20].
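As an illustration of this kind of screening, the sketch below computes the pairwise correlation matrix and the variance inflation factors (VIF) with NumPy, using the identity that the VIFs are the diagonal elements of the inverse correlation matrix. The synthetic data, the function name and the threshold of 10 (a common rule of thumb) are assumptions made for the example, not values taken from the study.

```python
import numpy as np


def variance_inflation_factors(features):
    """Return the VIF of every column of a (samples x features) array.

    VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element of
    the inverse of the feature correlation matrix.
    """
    corr = np.corrcoef(features, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)              # independent of x1
x3 = x1 + 0.05 * rng.normal(size=500)  # almost a copy of x1
features = np.column_stack([x1, x2, x3])

correlations = np.corrcoef(features, rowvar=False)
vif = variance_inflation_factors(features)

# Assumed rule of thumb: flag indicators with VIF above 10 as redundant.
redundant = [j for j, value in enumerate(vif) if value > 10]
print(np.round(correlations, 3))
print(np.round(vif, 2), redundant)
```

Here the first and third columns flag each other, so one of them would be dropped before dimensionality reduction, while the independent second column is kept.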
When processing the data, we used the winsorization method, and our experiments with its limits (tails) agree with the recommendations of researchers [46][47][48][49][50]: a two-way limit of 1% showed an excellent result for further work with the UMAP algorithm. We also conducted a number of our own experiments with the limits, reducing the dimensionality of different datasets, and found an improvement in the quality of t-SNE clustering at lower and upper limits of 2% and 5%, respectively. The use of dynamic indicators provides adaptive monitoring of the financial condition of banks [46]. This makes it possible to track changes in the dynamics and to respond to them quickly. Central banks, which stimulate technology transfer, play an important role in this [50]. Communication between central banks at the international level provides an exchange of new technologies and methods of assessing financial condition. This has a positive effect on national financial systems and on the international financial system as a whole.
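The winsorization limits mentioned above can be illustrated with a short NumPy sketch. It uses a quantile-based variant, in which values beyond the chosen lower and upper quantiles are clipped to those quantiles; whether the study used this exact variant is an assumption, and the function name and sample data are hypothetical.

```python
import numpy as np


def winsorize(values, lower=0.01, upper=0.01):
    """Clip values below the `lower` quantile and above the (1 - upper) quantile."""
    values = np.asarray(values, dtype=float)
    low, high = np.quantile(values, [lower, 1.0 - upper])
    return np.clip(values, low, high)


# Hypothetical ratio values 0, 1, ..., 100 for a single indicator.
ratio = np.arange(101, dtype=float)

symmetric = winsorize(ratio, lower=0.01, upper=0.01)   # two-way 1% limit
asymmetric = winsorize(ratio, lower=0.02, upper=0.05)  # 2% lower, 5% upper
print(symmetric.min(), symmetric.max())
print(asymmetric.min(), asymmetric.max())
```

Unlike trimming, this keeps every observation in the dataset, so extreme banks still influence the grouping, only with a bounded weight.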
In preparation for cluster analysis and data dimensionality reduction, we selected and tuned the hyper-parameters of the algorithms and, as a result of numerous experiments, arrived at the same parameters that were proposed in the work of researchers [35].

Conclusions
As a result of the study, we obtained a clustering of Ukrainian commercial banks that successfully separates investment-attractive institutions from potentially problematic ones. We obtained 11 clusters, 3 of which we classified as containing successful and efficient banks, and 5 more of which we classified as containing banks characterized by a risky level of performance. We obtained confirmation of our assumption that, in the multidimensional space of diverse features describing their operation, banks are located next to banks with similar features. It was also demonstrated visually that problem banks form a whole area on a two-dimensional map and are well separated from successful banks, as they are usually located in the diametrically opposite part of the two-dimensional projection.
As part of the study, we carried out numerous experiments with the t-SNE and UMAP dimensionality reduction algorithms. We came to the conclusion that the latter performs much better when paired with the hierarchical clustering algorithm. Provided that the parameters of both algorithms are optimized, t-SNE showed excellent results in a purely visual analysis of a two-dimensional map covering only the last period. The results of our research may be of interest primarily to depositors (individuals and legal entities) and to Ukrainian and foreign investors. The approach and set of methods used in our work can be extrapolated to other countries, but the set of coefficients should be formed with care.
Of course, our work has some limitations. First, more than half of the coefficients were excluded from the initial set of features, because the open aggregated statistics of the NBU do not provide enough data to calculate them. The excluded features may include ones that would have had a significant positive effect on the quality of dimensionality reduction and clustering. Second, from the modelling point of view we are dealing with unsupervised machine learning and have no actual data against which to compare the result. Assessing the quality of clustering with the metrics used can be subjective, and such metrics are more often applied to compare different clustering algorithms or their settings with each other. We relied on our simple ranking system to be able to compare our cluster scores with some actual value.
The use of other clustering algorithms to solve the problem of dividing banks into investment-attractive and problematic ones can be an interesting direction for further research. Another direction may be to deepen this study by using detailed reports from each bank and thus calculating additional ratios. The application of the described complex of stages (pipeline) to the analysis of the banking systems of other countries is also of scientific interest.