Survival Analysis Approach for Early Prediction of Student Dropout Using Enrollment Student Data and Ensemble Models

The Universal Access to Quality Tertiary Education Act is a law in the Philippines that provides college students with free tuition and other fees in state universities and local universities. Because the law is financed by taxpayers' money, the government should ensure that student retention or persistence is attained throughout the duration of the students' stay. To effectively decrease student dropout, it is necessary to understand which students are at risk of dropping out. This study proposes a model that detects and predicts student success in tertiary education through the selection of a suitable program, utilizing enrollment data that may have a significant effect on the study outcome of the students. The study experimented with single classifiers and then added ensemble classifiers to propose a predictive model that detects early dropout of first-year college students. Tree algorithms were used, and ensemble algorithms were then applied, to identify the student attributes that distinguish potential dropouts. The results indicate that students whose average grade is less than 85 have a high tendency to drop out of any program in which they enroll. Evaluation results in the final stage of model construction reveal that applying the bagging ensemble to the J-48 tree attained the highest accuracy compared with the other tree algorithms; however, the forest tree algorithm achieved the highest values for dropout precision and graduated recall. The results also show that applying ensemble approaches yields only a marginal increase in classification performance.



Introduction
The Universal Access to Quality Tertiary Education Act, which was ratified by both the Philippine Senate and Congress and signed by President Rodrigo Roa Duterte, covers a total of 190 colleges and universities nationwide, comprising 112 state universities and colleges (SUCs) and 78 local universities and colleges (LUCs). It gives full free tuition and other miscellaneous fees to all enrolled students. Because taxpayers' money is used to finance this law, the Philippine government should ensure that student retention or persistence is attained throughout the duration of the students' stay. This will assure that the money invested is not wasted.
With this scenario, Philippine higher education in the coming years is expected to reverse the current situation, from 80 percent of college students enrolled in private schools and 20 percent in state universities and colleges (SUCs) to 20 percent in private colleges and 80 percent in SUCs (Macha, Mackie, & Magaziner, 2018). Free tuition also translates into increasing enrollment rates in the SUCs, and the drawback is that the government is spending large amounts of money per student enrolled. However, the dropout rate has been reported at an alarming 83.7 percent, meaning the country produces 2.13 million college dropouts annually. College student dropout is therefore a major concern in the Philippine education system, and first-year student dropout is of particular importance, as the Philippine state has set aside more than 16 billion for higher education, including for full-time first-year students seeking baccalaureate degrees who do not return for a second year. Dropping out of SUCs is a serious problem: students may be dismissed or forced to leave the university, which may deny them their fundamental right to education, and the waste of funds, particularly public funds, is greater for students who eventually drop out without completing their degree.
Given this situation, student retention and graduation have become more significant than ever to SUCs in terms of accountability, and under recent national initiatives higher education institutions feel increasingly compelled to outline and implement interventions and strategies to increase student retention and student success. Whatever the reasons behind the dropout phenomenon at SUCs, one thing is certain: if this problem is not addressed, it will continue to cost the students, parents, and the larger public, especially taxpayers, and for the Philippine government it will mean a waste of scarce resources with very little benefit to show. Increasing the student graduation rate and decreasing the dropout rate is a long-term goal of state universities and colleges and of the Commission on Higher Education (CHED). From the point of view of parents and students, timely and successful graduation is vital, as these two factors strongly affect the students' employability and their intention to help their families financially.
Many students at the state university face several difficulties during the first year, and first-year performance has therefore been recognized as a significant predictor of timely graduation. Educators and researchers have extensively studied the factors behind the retention rate and suggest that early identification of students at high risk of failing enables timely intervention by educators, which would increase the graduation rate. According to Mallincrodt and Sedlacek (1987), in terms of keeping the students in the university, "the retention rate is a factor that should have been studied extensively".
One reason for high dropout rates reported in some studies is poor career choice and lack of personal interest (Kotsiantis & Pintelas, 2005). Yet few studies have been conducted to investigate and analyze the success of the career paths taken by students in the Philippines, especially the factors that affect the career or program choice of Filipino students. As a result, parents and students have limited information to help them identify proper career options and the course they should pursue in the future.
At Pangasinan State University, the number of first-year students has increased significantly in recent years. Thousands of students are admitted to study at universities every year, but after the first year of study some of them decide not to continue their schooling. Thus, monitoring and supporting students at risk is a topic that needs attention at many educational institutions in the Philippines. CHED and the SUCs would like 100% of students to finish their studies successfully; however, it is hard to recognize at-risk students in the early stages. It is important to explore effective approaches for predicting student dropout, as well as identifying the factors affecting it, with sufficiently high accuracy. An example of the massive dropout of first-year students can be seen at the Urdaneta City Campus of Pangasinan State University: for the academic years 2012-2016, the dropout and failure rate of freshmen is more than 50% in the first year and it increases to more or less 10% after passing two years of study, which shows the importance of modeling student dropout early during their studies.
Several studies indicate that important factors in the student dropout rate are the initial program or course studied at university and the secondary school academic records. Indeed, the dropout rate is higher among students in engineering programs and among students with relatively low performance during high school (Di Pietro & Cutillo, 2008; Tumapon, 2017; Min et al., 2011).
This study proposes a student dropout model using data gathered from the PSU-Urdaneta registrar's office. It aims to explore and understand the key determinants of dropout, to accurately identify students likely to give up their schooling, and to recommend policy interventions to eliminate or lessen student attrition.
This study analyzed the enrollment data that may have an impact on the study outcome of the students, explored the accuracy and efficiency of single classifiers, and tried to improve classification performance by applying the ensemble approach, in order to propose a predictive model that detects early dropout of first-year college students. A set of rules was obtained from the predictive model to identify a suitable program for a given student based on their enrollment data. Furthermore, the study experimented with, investigated, and compared different tree classifiers, and combined them with ensemble model techniques to improve their results. The purpose of this paper is to analyze the causes of first-year student dropout in the institution using real data from the Pangasinan State University Urdaneta Campus.

Student High School Profiles
The determinants of the success of college students and their academic performance were analyzed taking into account the student's pre-collegiate endowment of knowledge and various factors associated with the high school profile.

Predicting Factors
High school Grade Point Average (GPA), admissions test scores, gender, and race/ethnicity are the factors that researchers have consistently found to be significant predictors of retention and/or dropout. The main reason previous researchers used GPA as a predicting attribute is its tangible value for future educational and career mobility. It can also be considered an indication of realized academic potential (Mat et al., 2015). Further, Grade Point Average is an indicator used to evaluate the performance and success of students during their high school years.

Grade in Mathematics, Science, English and TLE
Some of the predictor attributes considered in this study were high school Grade Point Average (GPA) and average grades in Science, Mathematics, English, and Technology and Livelihood Education (TLE). Other researchers utilized the high school final grades in English, Mathematics, and major subjects as predictor attributes (Hayward et al., 2014). Beaulac & Rosenthal (2018) observed that grades in Mathematics, Biology, and Chemistry are consistently among the most important variables for predicting whether a student will complete their program. Studies varied in identifying the factors that most affect student retention during the freshman year. Zhang (2004) and Veenstra (2008) found that students' high school GPA and their grades in Mathematics and in science subjects such as Chemistry and Physics were all strong predictors of Engineering student retention.

Gender
Kovacic (2010) and Aulck et al. (2016) conducted research utilizing social and demographic variables that may influence persistence or dropout, pre-identifying successful and unsuccessful students. Cengiz & Uka (2014), in research utilizing data mining techniques, found that students' high school GPA and gender were the most influential predicting attributes.

Machine Learning Building Predictive Model
Many researchers have reported a substantial number of classification algorithms (learning machines) that apply different approaches to determining student performance. One of the most popular is the decision tree, which has been used to generate predictive models of student performance. Several researchers (Kabra & Bichkar, 2011; Kovacic, 2010; Al-Barrak & Al-Razgan, 2016; Kabakchieva, 2012) utilized decision trees in educational data mining to predict the performance of students from their past performance data. Such a model identifies the students who are likely to fail, drop out, or be unsuccessful, and allows the teacher to provide appropriate interventions.
Kabakchieva (2012) also applied different decision tree algorithms for predicting the academic success of students and found that the algorithm obtained a high accuracy prediction rate.
However, some studies utilized other classification algorithms (learning machines) to create their predictive models. El Zeweidy, Osman & Elhennawy (2013) conducted a study predicting student success using different classifying learning machines and found that the decision tree obtained the highest accuracy, in terms of correct classifications, among the classifying learning machines compared. The decision tree was utilized in this study because its results are simple to understand and easy to interpret, including for readers of this study. The majority of previous studies have used this algorithm because of its simplicity and its comprehensibility in uncovering small or large data structures and predicting values.

Materials and Methods
Educational data mining (EDM) is defined as follows: "Educational Data Mining is a new field in research that concerned with developing methods for exploring the unique and increasingly large-scale data derived from educational settings, and utilizing those methods to better understand students, and the settings which they learn" (International Educational Data Mining Society, 2019).

Predictive Analytics
Predictive analytics aims at predicting future events, outcomes, and behaviors in previously unseen data, using a model built from historical data and analytic techniques (Nyce & Cpcu, 2007; Shmueli & Koppius, 2011). Starting from previously collected data, a machine learning algorithm finds the relations between different properties of the data. The result is a predictive model that is able to predict future outcomes based on the properties of the collected data. The data in this study were collected through the enrollment form filled out by each student at the time of admission. In this form the student provides demographic data, such as gender and the type of school graduated from (public or private), and past academic performance, such as grades in Mathematics, English, Science, and TLE and the Grade Point Average (GPA). From this information, the attributes that possibly influence the students' results were selected, as shown in Table 1.
Most of the attributes reflect the historical performance of the students. The reasons for concentrating on past performance data are that (1) the data are easily available in the Registrar's Office of the campus, and (2) a student who performed well in high school is likely to perform well during the college years as well, and vice versa.

Data Selection and Preprocessing Data
The data set comprised 2401 admission forms/records of students of Pangasinan State University - Urdaneta City Campus, 956 of whom graduated and 1445 of whom dropped out, collected for students who finished their schooling in the years 2012-16, as shown in Table 1.
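The class distribution implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the two counts reported above; the majority-class baseline it prints is an added reference point for reading accuracy figures, not a result reported in the study.

```python
# Record counts as reported in the text above.
counts = {"graduated": 956, "dropout": 1445}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the larger class reaches this accuracy,
# which is a useful floor when judging any reported accuracy.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Because the two classes are not equally sized, per-class precision and recall are more informative than accuracy alone.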

Predictor Variables
The main predictor variables considered in the study were the high-school grade-point average (HSGPA) and the grade averages in English, Mathematics, Science, and TLE. The HSGPA used in this analysis was the weighted grade-point average, with 85 as the reference value, since this is the minimum GPA needed for admission to the majority of programs offered at the campus. Grades in Science, Mathematics, and TLE were also considered in the analysis because these subjects are regarded as prerequisites for college subjects; they are preparatory subjects for the general courses in Mathematics and English communication that students have to take. In addition, this study considered the type of school from which students graduated and their gender.

Classifying and Predictive Tool
Previous research reveals that the decision tree is the most popular supervised classification technique. It involves simple steps, is very fast, and is easy to apply, which makes it intuitive and easy to discuss and explain. A decision tree can be applied to any domain (Lakshmi et al., 2013). Its main objective is to produce a tree model that predicts the value of a target variable from several input attributes. In addition, the output of the decision tree can be utilized as a decision support tool in the form of a tree-like graph with several predicted possible outcomes. A decision tree is a classifier in the form of a tree structure where each node is either a leaf node, which indicates the value of the target attribute (class) of the examples, or a decision node, which specifies a test on a single attribute value, with one branch and sub-tree for each possible outcome of the test (Chahal, 2013). A sample result is shown in Figure 4. In this study, a node in the tree is a predicting attribute, and its branches are drawn on the basis of predicting a suitable program for each student. Every node provides a decision that contributes to predicting the success of students. However, to strengthen the predictive performance of the proposed tree model, ensemble methods were applied by combining several decision tree classifiers, boosting decision tree classifiers, and bagging decision tree classifiers.
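The leaf-node and decision-node structure described above can be sketched as a small data structure with a traversal function. This is an illustrative sketch only: the class names, the attribute keys, and the Mathematics threshold of 83 are hypothetical, and only the average-grade cut-off of 85 comes from the study.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # predicted class, e.g. "graduated" or "dropout"


@dataclass
class Decision:
    attribute: str                       # attribute tested at this node
    threshold: float                     # numeric split point
    below: "Union[Leaf, Decision]"       # sub-tree when value < threshold
    at_or_above: "Union[Leaf, Decision]"  # sub-tree when value >= threshold


def predict(node, record):
    """Follow decision nodes from the root until a leaf is reached."""
    while isinstance(node, Decision):
        value = record[node.attribute]
        node = node.below if value < node.threshold else node.at_or_above
    return node.label


# Toy tree: the second split (Mathematics < 83) is a made-up example.
toy_tree = Decision(
    "gpa", 85.0,
    Leaf("dropout"),
    Decision("math", 83.0, Leaf("dropout"), Leaf("graduated")),
)

print(predict(toy_tree, {"gpa": 80, "math": 90}))
print(predict(toy_tree, {"gpa": 88, "math": 90}))
```

Each root-to-leaf path in such a structure corresponds to one readable rule, which is the main reason tree models are easy to explain.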

Building Model Process
Building the predictive/classification model is the next step. In this process, the selection of appropriate decision tree algorithms and ensemble models was finalized, and the classifiers were evaluated under the cross-validation method. The proposed classification models used 8 input variables, as shown in Table 1. Data were collected from the enrollment information of all the students who graduated in the school years 2013-2017. The attribute with the maximum gain ratio is used to split each node, and the process continues until the complete decision tree is produced. RapidMiner was utilized as the tool kit for experimentation and construction of the decision trees. Regarding the choice of software, the study of Moghimipour & Ebrahimpour (2014) reveals that RapidMiner obtained the highest accuracy as compared with three data mining software tools. Figure 5 shows the decision tree construction. In the resulting decision tree, leaf nodes are represented by rectangles, and the root node and splitting nodes are represented by ovals.
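The gain-ratio criterion mentioned above can be illustrated with a short calculation. The sketch below assumes categorical attribute values (numeric grades would first need to be binned or split on a threshold), and the toy records and attribute names are hypothetical, not taken from the study's data.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Information gain of the split divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records.
labels = ["dropout", "dropout", "dropout", "graduated", "graduated", "graduated"]
gpa_band = ["low", "low", "low", "high", "high", "high"]
gender = ["F", "M", "F", "M", "F", "M"]

scores = {"gpa_band": gain_ratio(gpa_band, labels), "gender": gain_ratio(gender, labels)}
best_attribute = max(scores, key=scores.get)
print(scores, best_attribute)
```

In this toy data the GPA band separates the classes perfectly, so it would be chosen for the first split, while gender carries almost no information about the outcome.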

Figure 5. Decision Tree construction
This study used decision tree algorithms to generate predictive models that suggest programs based on the high school academic records of the students. According to Rokach & Maimon (2014), the decision tree provides many advantages, some of which are the following:
- It is simple and can be clearly understood by the reader, end-user, and analyst.
- It can accommodate different kinds of input data, such as textual, nominal, and numeric.
- It can continue to process data that contain erroneous, missing, or incomplete values.
- It produces a high level of performance with a minimal amount of effort and time.
- It can work in data mining applications over a variety of platforms.
However, decision tree classifiers also have drawbacks, such as the following (Gupta et al., 2017):
- There is a high probability of overfitting in decision trees.
- Generally, they produce lower prediction accuracy for a dataset than other machine learning algorithms.
- Information gain in a decision tree with categorical variables gives a biased response for attributes with a greater number of categories.
- Calculations can become complex when there are many class labels.
For this reason, the research applied ensemble models to counter the main weaknesses of decision tree classifiers and to form a strong learner, thus increasing the accuracy of the model.

Ensemble Model
The ensemble model is composed of meta-algorithms that create n learners from one algorithm sequentially or combine several machine learning techniques into one predictive model in order to decrease variance and bias and improve predictions (Hamilton, 2009). Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners combine to form a strong learner, thus increasing the accuracy of the resulting model (Garg, 2018). Previous research results show that ensemble models obtained better accuracies and more reliable rules than their component classifiers, other conventional forecasting tools, and other combination schemes (Garg, 2018; Satyanarayana & Nuckowski, 2016). In addition, the study of Adejo (2018) reveals that heterogeneous ensemble techniques are efficient and accurate in the prediction of student performance and useful for the proper identification of students at risk of attrition.

Vote (Stacking)
The Vote (stacking) ensemble operator is a nested ensemble operator whose subprocess contains two or more classification algorithms, called base learners. All the operators in the subprocess of the Vote operator accept the given data set, and together they generate a single combined classification model.
This study experimented with the Vote ensemble to handle imbalanced data (graduated / not graduated) and to improve the performance of suggesting and identifying suitable programs for a given student. Three tree classification algorithms were used, and their results were combined into a single score in order to improve the accuracy of predictive analytics and data mining applications.
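Combining the outputs of several base learners can be sketched as simple majority voting. This is a minimal illustration under the assumption that each of the three tree classifiers casts one equally weighted vote per student; the prediction lists are made up, and whether the tool used in the study combines votes in exactly this way is not stated in the text.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model prediction lists into one list by majority vote.

    predictions_per_model: list of lists, one list per base learner,
    all of the same length and in the same record order.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        winner, _ = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined


# Hypothetical predictions of three base learners for four students.
tree_a = ["dropout", "graduated", "dropout", "graduated"]
tree_b = ["dropout", "dropout", "graduated", "graduated"]
tree_c = ["graduated", "dropout", "dropout", "graduated"]

combined = majority_vote([tree_a, tree_b, tree_c])
print(combined)
```

With an odd number of base learners and two classes, a tie cannot occur, which is one practical reason to combine three classifiers rather than two.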

Bagging
Bagging (bootstrap aggregating) is an ensemble model that creates n learners from one algorithm in order to decrease the variance of the results. The data set is randomly sampled with replacement to create n data sets in a given ratio, one learner is trained on each of them, and the predictions of the learners are combined. There can be data points that are misclassified by a given learner (Sewwandi, 2018). In boosting, by contrast, the error of the previous classifier is considered and a new weight is given to each misclassified data element, letting that element appear more often in new data sets. Bagging therefore helps mainly by reducing result variance and overfitting. According to Breiman (1996), for unstable learning algorithms and unbalanced data, bagging is an effective ensemble technique, an unstable algorithm being one in which small changes in the training data set produce a big change in prediction results. In addition, the research of Hasan et al. (2015) reveals that applying the bagging ensemble model to decision tree classifiers improves their predictive accuracy.
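The sampling-with-replacement idea behind bagging can be sketched as follows. This is an illustration only: the base learner is a one-split "stump" on the average grade standing in for a full tree, and the training records, function names, and number of learners are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) items from records with replacement."""
    return [rng.choice(records) for _ in records]


def fit_stump(sample):
    """Pick the grade threshold t that minimizes errors of the rule
    'predict dropout if gpa < t, otherwise graduated' on the sample."""
    def errors(t):
        return sum((label == "dropout") != (gpa < t) for gpa, label in sample)

    best_t = min(sorted({gpa for gpa, _ in sample}), key=errors)
    return lambda gpa: "dropout" if gpa < best_t else "graduated"


def bagging_predict(train, gpa, fit, n_learners=25, seed=0):
    """Train n_learners on bootstrap samples and return the majority vote."""
    rng = random.Random(seed)
    votes = [fit(bootstrap_sample(train, rng))(gpa) for _ in range(n_learners)]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical (gpa, outcome) training records.
train = [(78, "dropout"), (80, "dropout"), (82, "dropout"), (84, "dropout"),
         (86, "graduated"), (88, "graduated"), (90, "graduated"), (92, "graduated")]

print(bagging_predict(train, 79, fit_stump))
print(bagging_predict(train, 93, fit_stump))
```

Each learner sees a slightly different data set, so individual thresholds vary, but the vote over all learners is more stable than any single one.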

AdaBoost
AdaBoost is part of a family of ensemble algorithms that are able to convert weak machine learners into strong machine learners. The main principle of AdaBoost is to fit a sequence of weak learner models, that is, models that are only slightly better than random guessing, such as small decision trees, to weighted versions of the data (Patel, 2017).
According to Patel (2017), AdaBoost was found to be the best ensemble model for predicting a student's result based on the academic marks obtained in the current semester. Furthermore, Smolyakov (2017) found that applying AdaBoost improves the performance of tree algorithms.
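The phrase "weighted versions of the data" can be made concrete with one reweighting round of the standard discrete AdaBoost scheme for two classes. The sketch assumes that textbook formulation; the function name and the five-sample example are hypothetical, and the weak learner itself is omitted.

```python
from math import exp, log


def adaboost_round(weights, correct):
    """One reweighting step of discrete two-class AdaBoost.

    weights: current sample weights, summing to 1
    correct: booleans, True where the weak learner classified the sample correctly
    Returns the learner's vote weight (alpha) and the normalized new weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1 - error) / error)
    # Misclassified samples are scaled up, correctly classified ones down.
    updated = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Five equally weighted samples; the weak learner misclassifies the last one.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the misclassified sample carries half of the total weight, so the next weak learner in the sequence is pushed to get it right.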

Model Evaluation and Interpretation
To evaluate the predictive models, 10-fold cross-validation and percentage split methods were applied. 10-fold cross-validation was utilized because the data set is small and it is not feasible to split it into two subsets; to minimize bias, it makes full use of the data set for both training and testing. Results were presented using a confusion matrix, which contains information about the actual and predicted classifications made by the predictive models (Hamilton, 2009). To compare performance, the accuracy, recall, and precision derived from the confusion matrix (Smolyakov, 2017) were employed. In the equations below, TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative predictions, respectively.
Accuracy: the accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using equation (1):
$$AC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
Recall: the recall (in the case of positive cases) is the proportion of positive cases that were correctly identified, as calculated using equation (2):
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
Precision: the precision (in the case of positive cases) is the proportion of predicted positive cases that were correct, as calculated using equation (3):
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$
In addition, ensemble models were applied to the tree algorithms, and the results were compared with those of the models without ensembles. This procedure provides a mechanism to gradually modify and correct for noisy and overfitting data, which may lead to better classification.
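The three measures can be computed directly from the four cells of a confusion matrix, using their standard definitions. The sketch below treats "dropout" as the positive class; that choice, the function names, and the ten toy predictions are assumptions for illustration and are not results from the study.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, TN, FN) for the given positive class label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fp, tn, fn):
    return tp / (tp + fn)


def precision(tp, fp, tn, fn):
    return tp / (tp + fp)


# Hypothetical toy labels: d = dropout, g = graduated.
actual = list("ddddgggggg")
predicted = list("dddgddgggg")
cells = confusion_counts(actual, predicted, positive="d")

print(cells)
print(accuracy(*cells), recall(*cells), precision(*cells))
```

Swapping the positive class to "g" gives the graduated precision and recall, which is how per-class figures such as dropout precision and graduated recall are obtained from the same matrix.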

Results and Discussion
The model in Figure 6 gives interesting information about students and provides guidance to students not to choose a track that is not suitable for them. The result indicates that the average grade is the best predictor of whether a student is going to graduate successfully or may drop out during college. In addition, grades in Mathematics, Science, and English are the next best predictors, as shown in the figure.

Rules Extraction
A set of rules can be extracted from the decision tree. These rules are used to predict and classify the suitable program for each student. The class label in this tree acts as the suitable classified program after the end of high school. The set of rules extracted from the decision tree is shown in Table 3. The result reveals that the average grade in high school is a very important predictor for all programs. In addition, for the Engineering course, average grades in Mathematics and Science are also dominant predicting variables. In the case of IT, the average grade in TLE is also an important predicting variable. As shown in the result, gender and the type of school from which the student graduated are not influential variables for predicting student success.
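Extracting rules from a tree amounts to listing every root-to-leaf path as an IF-THEN statement. The sketch below assumes a tree encoded as nested tuples; the encoding, the attribute names, and the Mathematics threshold of 83 are hypothetical, and only the average-grade cut-off of 85 is taken from the study.

```python
def extract_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN rule string.

    node is either a class label (str) or a tuple
    (attribute, threshold, subtree_if_below, subtree_if_at_or_above).
    """
    if isinstance(node, str):
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node}"]
    attribute, threshold, below, at_or_above = node
    return (
        extract_rules(below, conditions + (f"{attribute} < {threshold}",))
        + extract_rules(at_or_above, conditions + (f"{attribute} >= {threshold}",))
    )


# Toy tree: the second split is a made-up example.
toy_tree = ("average_grade", 85, "dropout", ("math", 83, "dropout", "graduated"))

rules = extract_rules(toy_tree)
for rule in rules:
    print(rule)
```

A tree with k leaves always yields exactly k rules, and the rules are mutually exclusive, so each student matches exactly one of them.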

Accuracy Results of Different Tree Classifiers and Applying Ensemble Approaches
Experimentation and evaluation results from the final stage of model construction are shown in Table 3. The table reveals that applying the bagging ensemble to the J-48 tree classifier obtained the highest accuracy as compared with the other tree classifiers. Furthermore, a significant increase was seen in the dropout recall and graduated precision measures; however, in terms of dropout precision and graduated recall, applying bagging to J-48 did not show any increase in performance as compared with the single J-48 classifier. The forest tree algorithm produced an impressive classifying performance of 76.62, which is the highest value for dropout precision, and also obtained the highest result for graduated recall. The table also reveals that the ensemble models did not perform markedly better in the classification task, obtaining only a minimal increase in classification performance in all measured areas. This result indicates that combining the ensemble model with these machine learning algorithms yields a minimal increase in classification performance, which can be attributed to the averaging of the performances of strong models and weak classifiers. According to Smolyakov (2017), building ensemble models from the most accurate models may not result in better ensemble models. Furthermore, experiments by Baba, Makhtar, Fadzli, & Awang (2015) show that in the case of substantial classification noise, bagging is much better than boosting and sometimes better than randomization.

Conclusions
This study explored and analyzed the enrollment data that may have an impact on the study outcome of students and proposes a tree classification algorithm to help them choose a suitable program when they enter PSU-Urdaneta City Campus. The experimentation with tree classification algorithms and the ensemble approach shows that applying bagging to the J-48 tree classifier attained the highest accuracy as compared with the other models. In addition, the forest tree classifier achieved the highest values for graduate recall and dropout precision. The experimental results revealed only a minimal increase in performance in all measured areas when the ensemble approach was applied to the other classification algorithms.
The experimental results show that the general average grade (GPA) is a significant predictor for all programs, whereas gender and type of school were not significant predictors. The study shows the potential of data mining in higher education, especially when used to improve students' performance and to detect early predictors of their success.