Female Diabetic Prediction in India Using Different Learning Algorithms

Abstract
Diabetes, also known as Diabetes Mellitus, is a metabolic disease that affects the body's natural blood glucose levels. It is a non-contagious illness with numerous serious health risks, and it is growing rapidly in India. It is a chronic disorder that occurs when the human body is unable to produce enough of the insulin hormone to keep blood sugar levels under control. Several characteristics associated with diabetes were investigated in this study, and multiple machine learning techniques were used to predict whether or not an unknown sample indicates diabetes. The PIMA Indians Diabetes dataset of female patients was used for this purpose. Six distinct classification models were applied for prediction. This research presents a comprehensive performance assessment of the multiple factors in the PIMA dataset, with a full discussion of how each factor relates to diabetes. Finally, a thorough analysis of the classification approaches was undertaken to identify the best automated diabetic prediction model. The Support Vector Machine (SVM) model delivered the best prediction result, with an accuracy of 83.5 percent, while the Random Forest (RF) classifier produced the second-best result, with an accuracy of 82.76 percent. The study's findings demonstrate that machine learning models produce efficient solutions. The 82-83 percent accuracy of the two best models provides a baseline for subsequent improvement of an autonomous forecasting tool, and could be improved further by integrating additional variables for prediction and classification.


Introduction
Diabetes Mellitus, also known simply as diabetes, is a metabolic condition characterized by insufficient regulation of blood sugar levels. The hormone insulin aids in the transport of glucose from the bloodstream into the tissues, where the sugar is absorbed as fuel for the internal organs [1,2]. Diabetes develops when the blood glucose level rises above or falls below its normal range. Diabetes, or hyperglycemia, can affect the nerves, eyes, kidneys, skin, and many other organs [3][4][5]. It is a highly prevalent and incurable condition. However, modifying one's lifestyle and eating habits, in conjunction with medication, can help manage blood sugar levels and support a healthy life.
Predicting diabetes in its early stages is important because of its adverse effects. Machine learning approaches are now being used in the healthcare sector to forecast diabetes more accurately. Machine learning allows a machine to become intelligent by learning from previous experience or input attributes and predicting an outcome and its category. Researchers have presented numerous diabetes prediction algorithms to effectively predict diabetes among PIMA Indian women. Some researchers conducted a comparative investigation of various machine learning techniques to assess their performance [6] and to calculate different performance factors from receiver operating characteristic curves [7]. Other researchers combined the PIMA diabetes dataset with data from Bangladesh and compared several machine learning techniques for detecting diabetic individuals [8]. Deep learning approaches also perform well in predicting diabetes [9]. The deep learning and machine learning algorithms achieved accuracy ranging from 90 to 98 percent; in one investigation, a deep learning strategy attained the highest accuracy, 98.04 percent [10,11].
This study reveals the behavior of several crucial parameters for diabetic individuals, as well as the interactions between the primary essential components, and is entirely focused on detecting diabetes in female patients. The most efficient classifier for prediction is identified by comparing six different machine learning methods.

Dataset
In this study, a dataset of Pima Indian women from the Kaggle repository [12] is utilized to predict a patient's diabetic proclivity. The dataset can be framed as a set of attributes containing the criteria for detecting that proclivity. The attribute 'Outcome' is used as the prediction's target class. This dataset is entirely dedicated to predicting diabetes in female patients in India.

Dataset Details
This dataset [13] includes information about 768 female patients: 500 non-diabetic and 268 diabetic. The diagnosis outcome is stored in the dataset as a binary value for each individual, along with the other parameters. The PIMA dataset's summary and diagnosis details are included in Table 1 and Figure 1. The dataset has 8 predictive variables and one responsive variable. Table 1 reports how many patients participated in the investigation, how many disease criteria were evaluated, how many of the participants were diagnosed as diabetic and how many were not, and the ratio of diabetic to non-diabetic patients:

Number of diabetic patients: 268
Number of non-diabetic patients: 500
Ratio of diabetic to non-diabetic: 0.35 : 0.65

Figure 1 depicts a visual representation of the total number of diabetic and non-diabetic patients. Table 2 lists the nine experimental characteristics or features that were utilized to predict diabetes, along with their minimum and maximum values.
The boxplot is frequently used to depict data dispersion. It reflects whether the data in the dataset are balanced, how neatly ordered the data are, and how biased they are. It generates a five-number statistical summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The "Minimum" is the smallest value of the data; "Quartile Q1" is the value midway between the minimum and the median (the 25th percentile); the "Median" is the middle value of the data; "Quartile Q3" is the value midway between the median and the maximum (the 75th percentile); and the "Maximum" is the largest value of the data. Figures 4A to 4H depict the barplots of each predictive variable against the responsive variable, respectively.
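The five-number summary a boxplot reports can be computed directly. A minimal sketch using NumPy on a hypothetical age column (illustrative values, not the actual PIMA data):

```python
import numpy as np

# Hypothetical ages for illustration only (the real column would come
# from the PIMA dataset).
values = np.array([21, 25, 26, 29, 31, 33, 35, 41, 50, 60, 72])

# np.percentile with its default linear interpolation yields the
# quartiles used in a boxplot's five-number summary.
q1, median, q3 = np.percentile(values, [25, 50, 75])
print(values.min(), q1, median, q3, values.max())  # → 21 27.5 33.0 45.5 72
```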
The barplot diagram describes the relationship between a numerical and a categorical variable. Each category is represented by a bar, and the size of the bar represents the quantitative value.
In this investigation, the barplots for each explanatory factor show how diabetic and non-diabetic outcomes vary over a particular range of values. The bars in each plot are organized by a specific range of the associated characteristic, and the length of each bar is measured as a percentage of its related component.
A correlation matrix is utilised to explain the statistical relations between variables. Correlation is a measure of how closely one parameter depends upon another. A correlation matrix is a table that displays the correlation coefficients between parameters; each entry in the matrix represents the relationship between two variables. It is used to summarize the data in a compact manner. Figure 5 depicts the correlations among the PIMA diabetes dataset's eight predictive variables and one responsive (Outcome) variable for female patients in India. The correlation matrix plot demonstrates that the dataset contains four significant pairs of correlated predictor variables: 1) the number of pregnancies with age, 2) the glucose level with insulin, 3) the BMI (Body Mass Index) with skin thickness, and 4) the BMI with blood pressure. Figures 5A to 5D depict the associations between these most highly correlated variables.
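As a minimal sketch of how such a matrix is computed, the example below builds synthetic columns standing in for three PIMA attributes (the real data would be loaded from the Kaggle dataset; the age-pregnancies link is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(21, 81, 200)                 # synthetic "Age" column
# Simulate the pregnancies-vs-age correlation reported for the dataset.
pregnancies = np.clip((age - 20) / 6 + rng.normal(0, 1, 200), 0, None)
glucose = rng.uniform(60, 200, 200)            # independent synthetic column
data = np.column_stack([pregnancies, glucose, age])

# Pairwise Pearson correlation coefficients; rows/cols follow column order.
corr = np.corrcoef(data, rowvar=False)
print(corr.round(2))
```

In the resulting 3x3 matrix, the pregnancies-age entry is strongly positive while the glucose entries stay near zero, mirroring how correlated pairs stand out in Figure 5.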
Figures 6A through 6D demonstrate how the four dependent components in the experimental dataset are interrelated.

Classification Models
A classifier uses training data to map input variables to target classes. The purpose of using a classifier in this investigation is to predict whether or not a patient has a diabetic predisposition. Each classification model predicts on the test dataset using the trained model.

Decision Tree (DT)
Decision Tree (DT) [14,15] is a machine learning classifier. For prediction, this model generates a tree-like structure and acquires knowledge from classification experience. Each outcome variable is represented as a DT leaf node, while non-leaf nodes serve as test decision nodes. The result of each test determines which of the decision node's branches is followed. The model generates classification results by growing the tree branches from the root node to a leaf node, and is used to forecast the target based on certain criteria.
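A minimal sketch, assuming scikit-learn and toy two-cluster data rather than the paper's exact configuration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the PIMA features: two well-separated clusters
# (class 0 around the origin, class 1 around 5). Illustrative only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

tree = DecisionTreeClassifier(random_state=0)  # grows branches until leaves are pure
tree.fit(X, y)
print(tree.score(X, y))  # training accuracy
```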

Random Forest (RF)
This classification system [15] is made up of numerous decision trees. Bagging and feature randomness are used to construct each individual tree. The method generates a statistically independent forest of trees whose combined forecast is far more precise than any individual tree's forecast. It predicts using ensemble learning, a technique that combines numerous weak classifiers to handle complicated problems.
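A hedged sketch of the same idea with scikit-learn's RandomForestClassifier, again on toy data (not the authors' settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy two-cluster data standing in for the PIMA features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Each of the 100 trees is trained on a bootstrap sample with random
# feature subsets; the forest predicts by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```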

Gradient Boosting (GB)
Gradient boosting [16] is a form of boosting in machine learning. The algorithm combines a number of weak machine learning models to produce a powerful predictive model. It is predicated on the idea that the best next model, when combined with the previous models, minimizes the total prediction error.
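A minimal sketch with scikit-learn's GradientBoostingClassifier on the same style of toy data (illustrative, not the paper's configuration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Each successive shallow tree is fitted to the residual error of the
# ensemble built so far, so the combined model reduces total error.
gb = GradientBoostingClassifier(random_state=0)
gb.fit(X, y)
print(gb.score(X, y))
```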

Support Vector Machine (SVM)
It is a supervised classification model [17]. When the training set contains a collection of labelled data, this approach can generalise between two separate classes. The primary job of SVM is to look for a hyperplane that can differentiate between the two classes.
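The hyperplane search can be sketched with scikit-learn's SVC on linearly separable toy data (an assumption for illustration; the PIMA classes overlap in practice):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters standing in for the two target classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear")  # finds the maximum-margin separating hyperplane
svm.fit(X, y)
print(svm.score(X, y))
```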

K-Nearest Neighbor (KNN)
K-Nearest Neighbour classifiers [18] are referred to as lazy learners. The classifier recognises objects in the feature space based on the distance between training and testing samples: it considers the k closest training objects and decides on a class by majority vote among them. The biggest difficulty with this classification strategy is determining an appropriate value of k.
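Because the learner is lazy, there is no training phase at all; classification happens at query time. A self-contained sketch with hypothetical 2-D points:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Lazy learner: rank training points by distance to the query,
    then take a majority vote among the k nearest labels."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: class 0 near (1, 1), class 1 near (8, 8).
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))  # → 0
```

Choosing k is the judgment call the text mentions: too small a k is noise-sensitive, too large a k blurs the class boundary.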

Logistic Regression (LR)
This regression model [19,20] is a classification approach that machine learning borrows from statistics. The statistical method is used to examine a dataset in which one or more independent factors influence the outcome. The purpose of logistic regression is to find the optimal model for describing the relationship between the dependent and independent variables.

Figure 7 depicts the complete process of predicting diabetic or non-diabetic status for an unknown sample using the PIMA Indian female diabetes dataset. In this investigation, six different classification methods were used. The algorithm for diabetic prediction is provided in the Algorithm section. The experimental dataset was obtained from the Kaggle archive [13].
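A minimal logistic-regression sketch with scikit-learn on toy two-cluster data (assumed for illustration; not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Fits a linear model of the log-odds of the positive class.
lr = LogisticRegression(max_iter=1000)
lr.fit(X, y)
print(lr.score(X, y))
```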

Dataset Pre-Processing
The input dataset contains some empty fields. To build a clean dataset, null values must be handled: each omitted or null entry is filled with the mean value of the associated feature.
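A hedged sketch of this mean-imputation step, using a hypothetical glucose column (the values are invented for illustration):

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

glucose = [148, None, 183, 89, None]   # hypothetical column with gaps
print(impute_mean(glucose))            # → [148, 140.0, 183, 89, 140.0]
```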

Split the Dataset
The second step is to prepare the dataset for classification. To classify the data, the dataset must be divided into two parts: training and testing. In this experiment, 80 percent of the data in the input dataset is chosen for training, while the remaining 20 percent is used for testing as unknown samples.
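The 80/20 split can be sketched in a few lines (a shuffled split with a fixed seed, shown here on row indices; the paper does not specify its exact splitting routine):

```python
import random

def split_80_20(rows, seed=0):
    """Shuffle the rows, then take 80% for training and 20% for testing."""
    rows = rows[:]                     # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]

train, test = split_80_20(list(range(768)))  # 768 records, as in PIMA
print(len(train), len(test))                 # → 614 154
```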

Train the Data for Classification
Following that, six distinct machine learning algorithms are applied one by one on the training dataset to train the data models.

Testing the Data and Make the Prediction
Afterwards, the trained data models are used to evaluate the test data. For each classification model, the performance on the test data is estimated. Finally, the models with the highest and second-highest prediction precision are identified.

Algorithm
Step 1: Collect the input dataset from Kaggle repository.
Step 2: Examine the entire dataset to discover whether any data are missing. If so, the null data should be replaced with the field's average score. Once the missing data have been filled in, the dataset is ready for the prediction model.
Step 3: Divide the dataset into training and testing parts (80 percent and 20 percent, respectively). The vast bulk of the data was utilised for training, with the remainder used for testing.
Step 4: Train the dataset using a classifier.
Step 5: Make a prediction on test data using the trained model.
Step 6: Calculate the accuracy for both training and testing data. Table 3 displays the outcomes of all classification models. The testing accuracy, sensitivity, specificity, and kappa values of all classifiers were used to evaluate their performance.
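The quality metrics named above (accuracy, sensitivity, specificity, kappa) all follow from a 2x2 confusion matrix. A sketch with hypothetical confusion counts (not the paper's actual numbers):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Cohen's kappa
    from the counts of a 2x2 confusion matrix."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)          # true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    # Chance agreement for Cohen's kappa: product of marginal rates,
    # summed over both classes.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, sensitivity, specificity, kappa

# Hypothetical counts for a 154-sample test split, for illustration only.
acc, sens, spec, kappa = classification_metrics(tp=40, tn=88, fp=12, fn=14)
print(round(acc, 3), round(sens, 3), round(spec, 3), round(kappa, 3))
```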

Discussions
Diabetes is a prevalent [21,22] and incurable disease. It is a protracted health condition that affects the human body's ability to produce energy from food. Various factors can be used to identify the diabetic status of a human body. The input dataset for this investigation contains the eight essential key parameters for diagnosing diabetes, which are listed in Table 2. Figure 4 shows that each predictive variable suggests different values for predicting diabetes in test samples:
- If a female patient has had more than six pregnancies, her chances of developing diabetes increase.
- If the glucose concentration is greater than 140, diabetes is more likely to be detected.
- As blood pressure rises in conjunction with the other variables, the likelihood of becoming diabetic rises as well.
- There is a link between skin thickness and insulin level. When insulin levels fall, the likelihood of becoming diabetic rises; similarly, as insulin levels fall, skin thickness increases.
- As BMI and Diabetes Pedigree values rise, so does the likelihood of detecting diabetes.
- Most females are diagnosed with diabetes at a certain age; according to the graph, the majority of those above the age of 60 fall into this category.
The comprehensive performance study of six alternative categorization models is shown in Table 3. Depending on the testing results, each model measures the accuracy and other quality metrics.
According to the performance of each model, the support vector machine classifier has the highest accuracy, 83.5%, followed by the random forest classifier with the second-highest accuracy, 82.8%. This comparison helps visualise the relationships between the various predictive and responsive factors, and it identifies the optimal classification model for predicting diabetic patients effectively and efficiently.

Conclusions
Diabetes mellitus is a chronic, noncommunicable disease with no complete cure. It can develop slowly in the human body and increase the risk of various linked diseases over time. Several risk factors, such as a poor lifestyle, obesity, and poor nutrition, are among the most serious concerns for a diabetic. Preventive action and increased awareness can help lower the risk of this disease. In a developing country like India, most individuals lack basic knowledge of a healthy lifestyle and are unaware that they already suffer from diabetes or are on the verge of developing the ailment. Detecting diabetes at an early stage allows a patient to take precautions and prevent the illness from intensifying. When it comes to their own households, women in India have traditionally played an essential role, emphasizing the health of their family over their own. As a result, owing to factors such as poor nutrition, obesity, and poor health, the number of female diabetes cases has increased compared to male diabetes cases.
In this study, an automated diabetic forecasting system was developed that can diagnose this disease efficiently and precisely. This technology can detect the condition at an early stage, allowing people with prediabetic signs to better manage their health.
This study included a thorough examination of eight predictive variables and one responsive variable. A correlation matrix was calculated to show the important variables and their dependencies in the dataset. The prediction was accomplished by a number of machine learning methods, which were then subjected to a thorough comparison. According to the comparative evaluation, the support vector machine classifier achieved the highest accuracy, 83.5 percent, while the random forest classifier achieved the second-highest accuracy, 82.8 percent. In this case, only one dataset was used, and its sample size is quite small. As a result, future research will incorporate a larger dataset as well as deep learning techniques to increase performance.