A Comparative Classification Models Study for Development of Early Dyslexia Screening System

This paper presents the development of a rapid dyslexia screening system using a Fuzzy Inference System (FIS), together with a comparative study using WEKA analysis (Random Forest, Decision Table and Naïve Bayes). The developed fuzzy system outputs two risk conditions, High Risk and Low Risk of dyslexia, based on the defined rule statements. The system performance is evaluated using pre-existing data (n = 30) comprising dyslexia and slow learner subjects. The proposed fuzzy system achieves an overall accuracy of 56.7 % (n = 30), whereas its accuracy towards dyslexia subjects is 100 % (n = 17). The low overall accuracy is due to insufficient tuning of the defined rule statements when analysing extreme conditions related to slow learners. In the WEKA analysis, the best classification algorithms are Decision Table (73.33 %) when using both subject groups and Random Forest (82.35 %) when using dyslexia subjects only. A larger dataset is needed to achieve better accuracy when conducting data mining. In future work, the rule statements will be modified and an additional IQ test will be added in order to improve the accuracy and robustness of the fuzzy inference system in distinguishing slow learners from dyslexia subjects.


Introduction
Dyslexia is a learning disability that affects an individual's performance in reading, writing, spelling and language [1]. As reported by Dyslexia International, the prevalence of dyslexia is at least 10 % in any given population [2]. Dyslexia affects not only academic performance but also self-confidence and social-emotional development. It is linked with various forms of learning disorders, namely Dysgraphia, Dyscalculia and language disorders, and it takes several forms, including Phonological Dyslexia, Surface Dyslexia, Visual Dyslexia, Double Deficit Dyslexia and Rapid Naming Deficit [3][4][5]. Each of these learning disabilities has its own specific characteristics. Dysgraphia is a writing disability in which the individual spells poorly, leading to missing words or letters, whereas dyscalculia is a disability in performing mathematics. Phonological dyslexia originates from the phonological theory, which states that a person tends to fail in reading due to a lack of skill in splitting the sounds in a word and connecting them with their respective written letters [6]. This skill is referred to as "phonemic awareness" and is reflected in the ability to read non-words (pseudowords); dyslexic individuals are found to have poor phonological awareness. Rapid naming is another issue encountered by dyslexic persons, as they cannot name letters or numbers within a specific short timeframe. Surface dyslexia is characterised by a selective deficit in reading irregularly spelled words [7]. Individuals with surface dyslexia struggle more with recognising irregular words than regular words and therefore make regularisation errors, as they rely more on grapheme-phoneme correspondences.
Slow learners are children who have a slower learning ability and are unable to cope with the work normally expected of their age group [8]. They often have low confidence and are timid and easily made anxious. Such behaviour is caused by subnormal intelligence, personal factors and environmental factors. Slow learners can be identified by observing their daily classroom behaviour, assessing their study performance or measuring their intelligence quotient (IQ).
Fuzzy logic is widely used in engineering practice for decision-making as well as classification [9]. As an example, fuzzy rules were applied to predict the risk level of heart disease using data obtained from the Machine Learning Repository of the University of California, Irvine (UCI) [10]. The developed system was mainly used for cardiovascular disease diagnosis by interpreting the numerical data to classify the rate of heart disease. Another fuzzy rule-based classification system was developed to assess coronary artery disease [11]. Using a total of 115 subjects, the developed system achieved 96.72 % accuracy, which shows the success of applying fuzzy rules in classification systems for medical diagnosis.
The Waikato Environment for Knowledge Analysis (WEKA) is another classification tool, which contains various classifier algorithms. It has been applied widely to classification tasks such as brain tumour detection and classification in MRI images [12]. In that work, the Instance Based K-Nearest using Log and Gaussian weight kernels classifier (IBkLG) was utilised to analyse MRI brain images and classify each case as normal, benign or malignant. The IBkLG showed a better accuracy rate of 86.6 % compared with the other two classifiers, J48 and Naïve Bayes. In a comparative study, WEKA also displayed better performance in heart disease classification [13]. It was compared with another classification tool named Orange in terms of precision and recall using four classification algorithms, namely Naïve Bayes, SMO (Support Vector Machine), Random Forest and IBk (K-Nearest Neighbour). The outcome from these two data mining tools showed that WEKA produced better results in both precision and recall.
In Malaysia, the evaluation of dyslexia is conducted in several forms, such as Literacy and Numeracy Screening (LINUS), LINUS 2.0 and Instrumen Senarai Semak Disleksia (ISD). The LINUS assessment is an effort by the Malaysian Education Ministry to recognise the dyslexia condition among Standard 1 students in order to make sure that they are able to cope with learning and arithmetic, using the Van Meter and Van Horn Model as the foundation [14]. It was followed by the establishment of LINUS 2.0 to intensify the performance of LINUS and raise the English language proficiency of students [15]. The Common European Framework of Reference (CEFR) serves as the guideline to ensure the validity of the assessment. Meanwhile, ISD is an instrument containing a list of questions to assess the dyslexia condition of students who have learning difficulties [16]. It is used by school teachers to carry out the assessment and obtain the probability of dyslexia for each student. The results are then reported to the parents, who decide whether to take their children for further diagnosis. Currently, a manual named "Ujian Pengesanan Awal Disleksia Bahasa Melayu", created by Persatuan Dyslexia Malaysia, is applied in the assessment of dyslexia. It consists of ten different types of tests to be answered by the subject, and the test scores are then analysed to identify the condition of dyslexia.
As mentioned previously, the existing method of screening for dyslexia is conducted manually and requires several days to obtain the result, which makes the protocol ineffective and inefficient. Motivated by this issue, a rapid screening system that identifies the risk level of dyslexia in a relatively short time is proposed in this study. The objectives of this research project are, firstly, to prescribe the input variables, rule base and output variables; secondly, to develop a classification tool based on a fuzzy inference system using the MATLAB Fuzzy Logic Toolbox; and lastly, to compare the accuracy of the proposed system with the WEKA data mining tool for dyslexia and slow learner subjects. The scope of this research covers, firstly, the implementation of four tests, namely Rapid Naming, One-Minute Reading, Two-Minute Spelling and Pseudowords, following the screening manual Ujian Pengesanan Awal Disleksia Bahasa Melayu; secondly, the use of the MATLAB Fuzzy Logic Toolbox to define rules for all the dyslexic conditions; and lastly, the evaluation of the developed system by calculating its accuracy and comparing the correctly classified instances with WEKA data mining analysis on the pre-existing data, which consists of dyslexia and slow learner subjects.

Subject
In this study, no subjects were recruited, since the data used was pre-existing data provided by Encik Saifuddin Mohtaram. The data consists of 30 subjects aged 6 to 10 years: 17 dyslexia subjects (10 % severe dyslexic; 47 % dyslexic) and 13 slow learners (43 %). The classification of the subjects' conditions (diagnosed as dyslexic) was done using the four tests (Rapid Naming, One-Minute Reading, Two-Minute Spelling and Pseudowords) in Ujian Pengesanan Awal Disleksia Bahasa Melayu.

Fuzzy Inference System (FIS)
Fuzzy logic is a logical system in which the truth of any statement becomes a matter of degree; in contrast to Boolean logic, it allows multivalued conditions [17]. It was established by L. A. Zadeh in 1965, in work that also explored a separation theorem for convex fuzzy sets. It can be explained as follows: let U be the universe set and x a particular element of U; a fuzzy set A defined on U may then be expressed as a set of ordered pairs, as shown in (1).
$$A = \{(x, \mu_A(x)) \mid x \in U\} \quad (1)$$
where μA: U → [0,1] is the membership function of A and μA(x) ∈ [0,1] is the degree of membership of x in A. Fuzzy logic applies statements of the following form: IF (a set of conditions is satisfied) THEN (a set of consequences can be inferred). This establishes the basic IF-THEN rule of fuzzy logic. In simpler form: if x ∈ A and y ∈ B, then z ∈ C, with A, B and C being fuzzy sets.
In the year of 1984, Mamdani subsequently examined the practicability of fuzzy set theory by applying it to the control of a simple dynamic plant [18]. Figure 1 displays the general block diagram of a Fuzzy Inference System (FIS). It is made up of fuzzification of the input, rules as the inference unit, defuzzification as the process of combining the input and rule results, and lastly the crisp value, which is the output generated by the system.

Execution Flowchart of Screening System
Figure 2 depicts the working principle of the designed screening system for dyslexia. Firstly, the input parameters are the scores from four tests, namely Rapid Naming, Pseudowords, One-minute Reading and Two-minute Spelling. Then, the input parameters are analysed according to the conditions set in the rule base. Finally, the system generates the outcome, which gives the risk level of dyslexia as either "Low Risk" or "High Risk".

Fuzzy Rule Based Method to Identify Risk Level of Dyslexia
The FIS is used to construct the fuzzy rules through several elements in MATLAB, namely the FIS Editor, Membership Function Editor, Rule Editor and Rule Viewer, as shown in Figure 3. The proposed framework of the fuzzy inference system is illustrated in Figure 4, which describes the entire process from fuzzification to defuzzification with a total of five input variables and one output variable for all the age groups. These processes are discussed in the next subsection. The FIS Editor allows input and output attributes to be added to the FIS. The input attributes are the types of tests based on the manual Ujian Pengesanan Awal Disleksia Bahasa Melayu, whereas the output attribute is the dyslexia risk level. As recommended by Encik Saifuddin, one of the authors of the dyslexia manual, four tests were utilised in this system: Rapid Naming, Pseudowords, One-minute Reading and Two-minute Spelling. The second element, the Membership Function Editor, adjusts the range of each input and output attribute so that the "Dyslexic" and "Normal" conditions can be differentiated according to the type of test and the age group. Table 1 tabulates the distribution of test scores for every test used in the system, whereas Figure 6 shows the Membership Function Editor for the 6 years old category. At the same time, it is vital to select an appropriate type of membership function to produce the degree of dyslexia risk level. In this screening system, triangular and trapezoidal membership functions were applied to the input and output variables. The linguistic terms used to represent the condition of a subject after assessment with the four tests were "dyslexic" and "normal". A triangular membership function is defined by a lower limit a, an upper limit b and a value m (mean), where a < m < b, as written in (2) and presented in Figure 7.
$$\mu_A(x) = \begin{cases} 0, & x \le a \\ \dfrac{x - a}{m - a}, & a < x \le m \\ \dfrac{b - x}{b - m}, & m < x < b \\ 0, & x \ge b \end{cases} \quad (2)$$
On the other hand, a trapezoidal membership function is defined by a lower limit a, an upper limit d, a lower support limit b and an upper support limit c, where a < b < c < d. The fuzzy function is expressed as (3).
$$\mu_A(x) = \begin{cases} 0, & x \le a \\ \dfrac{x - a}{b - a}, & a < x < b \\ 1, & b \le x \le c \\ \dfrac{d - x}{d - c}, & c < x < d \\ 0, & x \ge d \end{cases} \quad (3)$$
There are five membership functions for the inputs, namely gender, Test 1 (Rapid Naming), Test 3 (One-minute Reading), Test 5 (Two-minute Spelling) and Test 7 (Pseudowords), and one membership function for the output. For gender, two fuzzy sets in the form of linguistic weighting variables, "Boy" and "Girl", were used to define the gender on a numerical scale, as shown in Figure 9. Table 2 and Table 3 present examples of the membership functions for Test 1 (Rapid Naming) and Test 3 (One-minute Reading), respectively. As can be seen from the tables, two fuzzy sets in the form of linguistic weighting variables, "Dyslexic" and "Normal", were used to differentiate the condition of the subject. These variables are equivalent to fuzzy numbers on the numerical scale displayed in Table 1. The membership functions for the remaining tests were designed similarly. Figure 10 presents the membership function of the dyslexia risk level (output). Two fuzzy sets in the form of linguistic weighting variables, "Low Risk" and "High Risk", were used to evaluate the condition of the subjects. Again, these variables are equivalent to fuzzy numbers on the numerical scale displayed in Table 1. The AND fuzzy operator applied in the proposed system involves only a multiplication operation; it is the intersection between fuzzy sets A and B, which can be denoted as in (4).
$$\mu_{A \cap B}(x) = T(\mu_A(x), \mu_B(x)) \quad (4)$$
where μA∩B(x) is the intersection between the membership functions μA(x) and μB(x), whereas T is a binary mapping that represents the multiplication of these two fuzzy sets.
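As an illustration of the membership functions in (2) and (3) and the product AND operator in (4), the following is a minimal Python sketch. It is not the MATLAB Fuzzy Logic Toolbox implementation used in this study; the function names and the parameter values in the usage lines are assumptions chosen only for illustration, not values from Table 1.

```python
def triangular(x, a, m, b):
    """Triangular membership: lower limit a, peak m, upper limit b (a < m < b)."""
    if x <= a or x >= b:
        return 0.0
    if x <= m:
        return (x - a) / (m - a)
    return (b - x) / (b - m)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: limits a, d and support limits b, c (a < b < c < d)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def fuzzy_and(mu_a, mu_b):
    """Product T-norm: the AND of two membership degrees is their product."""
    return mu_a * mu_b


# Hypothetical parameters, for illustration only.
mu_first = triangular(2.5, a=0, m=5, b=10)      # rising edge of the triangle
mu_second = trapezoidal(7, a=0, b=2, c=6, d=8)  # falling edge of the trapezoid
rule_strength = fuzzy_and(mu_first, mu_second)
```

With the product operator, a rule fires at full strength only when every antecedent has full membership, and any antecedent with zero membership switches the rule off entirely.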
In the third element, the Rule Editor, all the possible outcomes are defined by formulating the desired rule statements with the AND operator. All the designed rule statements are processed and analysed to generate the output, which is the dyslexia risk level. Table 4 shows the fuzzy output sets based on the number of tests found to be dyslexic. If the number of tests found to be dyslexic is 3 or above, the subject is said to have a high risk of dyslexia. On the contrary, if the number of tests found to be dyslexic is 2 or below, the subject is said to have a low risk of dyslexia. Based on this, 30 rule statements were created, with output attributes following the defined fuzzy output sets, as tabulated in Table 5. The Rule Viewer displays the graph distribution of the overall fuzzy inference system based on the created rules. Figure 11 displays the graph distribution for all the input and output variables. The test scores obtained from each test are entered in the "Input" column, and the red line in the last plot indicates the result generated from the analysis of each input variable. This is done through the aggregation process, which combines the rule strengths to obtain the output membership function. The crisp value (on top of the last column) indicates the risk of dyslexia, which is either "Low Risk" or "High Risk", after the defuzzification process. The defuzzification applied is the centroid method, which returns the centre of the area under the curve as the defuzzified value of the fuzzy set, as expressed in (5).
$$x^{*} = \frac{\sum_{i=1}^{n} x_i \, \mu(x_i)}{\sum_{i=1}^{n} \mu(x_i)} \quad (5)$$
where x* denotes the defuzzified value, xi represents the sample element, μ(x) is the membership function and n is the number of elements in the sample.
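The decision in Table 4 and the centroid step in (5) can be sketched as follows. This is a simplified Python illustration rather than the MATLAB rule base itself: the function names are hypothetical, the rule is reduced to a crisp count of tests flagged as dyslexic, and the sampled output curve in the usage lines is made up for demonstration.

```python
def risk_from_flags(dyslexic_flags):
    """Crisp version of Table 4: 3 or more dyslexic tests means High Risk."""
    return "High Risk" if sum(dyslexic_flags) >= 3 else "Low Risk"


def centroid(xs, mus):
    """Discrete centroid defuzzification, following (5).

    xs  : sample points x_i on the output axis
    mus : aggregated membership degrees mu(x_i) at those points
    """
    total = sum(mus)
    if total == 0:
        raise ValueError("aggregated membership is zero everywhere")
    return sum(x * mu for x, mu in zip(xs, mus)) / total


# Hypothetical aggregated output curve, symmetric around 2.
sample_points = [0, 1, 2, 3, 4]
memberships = [0.0, 0.5, 1.0, 0.5, 0.0]
crisp_value = centroid(sample_points, memberships)

# Four test results, True when the test score falls in the "Dyslexic" set.
label = risk_from_flags([True, True, True, False])
```

Because the centroid weights each sample point by its membership degree, a symmetric output curve defuzzifies to its midpoint, while a curve dominated by one output set is pulled towards that set.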

Waikato Environment for Knowledge Analysis (WEKA) Data Mining Tool
WEKA was developed at the University of Waikato and has been funded by the New Zealand Government since 1993 [19]. The project aims to provide machine learning tools that can be applied in key areas of the New Zealand economy. It is a Java-based data mining tool that consists of four main applications, namely Explorer, Experimenter, Knowledge Flow and Simple CLI, for performing the desired data mining tasks. The Explorer application comprises the Preprocess, Classify, Cluster, Associate, Select Attributes and Visualize tools. Datasets must be loaded in the Attribute-Relation File Format (ARFF), which describes a list of instances sharing a set of attributes. The relation and attributes are declared in the header section, followed by the data section. Many machine learning algorithms are available for classification and regression problems. They fall into seven main groups, namely bayes, functions, lazy, meta, misc, rules and trees, and each group has numerous algorithms to select and apply, as shown in Table 6. Random Forest is a supervised algorithm that consists of a collection of randomized base regression trees {rn(x, Θm, Dn), m ≥ 1}, where Θ1, Θ2, … are the outputs of a randomizing variable Θ. These random trees are combined to form the aggregated regression estimate, as given in (6).
$$\bar{r}_n(X, \mathcal{D}_n) = E_{\Theta}\left[ r_n(X, \Theta, \mathcal{D}_n) \right] \quad (6)$$
where EΘ denotes the expectation with respect to the random parameter, conditional on X and the dataset Dn [29].
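To make the aggregation in (6) concrete, the sketch below approximates the expectation over Θ by averaging a finite number of randomized trees. It is a toy Python illustration under stated assumptions: each "tree" is a depth-one regression stump with a bootstrap sample and a random split point, the data are invented, and none of this reflects WEKA's actual Random Forest implementation.

```python
import random
from statistics import mean


def fit_random_stump(data, rng):
    """One randomized base tree: a depth-1 stump on a bootstrap sample.

    The randomness (bootstrap draw and split point) plays the role of Theta.
    """
    sample = [rng.choice(data) for _ in data]   # bootstrap resample of the data
    xs = [x for x, _ in sample]
    split = rng.uniform(min(xs), max(xs))       # random split point
    left = [y for x, y in sample if x <= split]
    right = [y for x, y in sample if x > split]
    overall = mean(y for _, y in sample)
    left_value = mean(left) if left else overall
    right_value = mean(right) if right else overall
    return lambda x: left_value if x <= split else right_value


def forest_predict(trees, x):
    """Average the tree outputs: a finite-sample estimate of E_Theta[r_n]."""
    return mean(tree(x) for tree in trees)


# Invented toy data: the target steps from 0 to 1 at x = 5.
toy_data = [(x, 0 if x < 5 else 1) for x in range(10)]
rng = random.Random(0)
trees = [fit_random_stump(toy_data, rng) for _ in range(200)]
low_prediction = forest_predict(trees, 0)
high_prediction = forest_predict(trees, 9)
```

Each individual stump is a crude predictor, but averaging many of them smooths out the randomness of any single split, which is the idea behind the aggregated estimate.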
In simpler terms, Random Forest creates numerous decision trees and integrates all of them into one to produce the output. Decision Table is a rule-based algorithm that expresses the conditions of labelled instances as "True" or "False" and maps unmatched instances to a default rule with the majority class to generate a hypothesis [30]. The Naïve Bayes algorithm applies Bayes' Theorem, shown in (7).
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \quad (7)$$
where the probability of A (hypothesis) can be found given that B (evidence) has occurred [31]. A simple flowchart showing the process of applying WEKA analysis to dyslexia classification is illustrated in Figure 12. The process begins with the collected dataset, which is stored in an Excel spreadsheet and saved in CSV format. The file is then converted to the ARFF file format, in which the header and data information are declared by assigning the relation and attributes. Next, the ARFF file is opened in WEKA and the dataset is processed using the Random Forest, Decision Table and Naïve Bayes algorithms in turn. The results are evaluated based on the correctly classified instances and incorrectly classified instances.
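The CSV-to-ARFF conversion step described above can be sketched in Python as follows. This is only an illustration of the file layout (relation, attribute declarations, then data): the function name, the attribute names, the relation name and the sample rows are all hypothetical placeholders and are not the study data or the conversion procedure actually used.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal):
    """Convert CSV text with a header row into ARFF text.

    nominal maps a column name to its list of allowed values; every other
    column is declared as numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for name in header:
        if name in nominal:
            lines.append(f"@attribute {name} {{{','.join(nominal[name])}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Placeholder rows for illustration only (not the study data).
sample_csv = """gender,rapid_naming,reading,spelling,pseudowords,condition
boy,34,6,0,0,dyslexia
girl,40,30,12,9,slow_learner
"""

arff_text = csv_to_arff(
    sample_csv,
    relation="dyslexia_screening",
    nominal={"gender": ["boy", "girl"],
             "condition": ["dyslexia", "slow_learner"]},
)
```

The resulting text can be saved with an `.arff` extension and loaded in the WEKA Explorer, where the last nominal attribute is typically selected as the class.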

Dyslexia Subjects
Using the pre-existing data, the system successfully classified all 17 dyslexia children as "High Risk". Figure 13 shows an example of a high risk classification for the 6-year-old category. In this Rule Viewer screenshot, the subject was a boy who scored 34, 6, 0 and 0 in Test 1, Test 3, Test 5 and Test 7, respectively. The output in the last column shows a Dyslexia Risk Level of 3.5, implying that the child was at "High Risk" of dyslexia.

Slow Learner Subjects
As for slow learners, the developed system classified all subjects as "High Risk" of dyslexia. Figure 14 shows an example of the Rule Viewer for the 7-year-old category. In the screenshot, the subject obtained scores of 0, 12, 4 and 2, and the output in the last column shows a crisp value of 3.5, which implies "High Risk" of dyslexia. The inaccurate classification was ascribed to the low or zero marks obtained by these subjects in the dyslexia screening tests. This indicates that the fuzzy rule statements related to slow learners need to be modified so that the system can differentiate slow learners from dyslexia subjects.

Overall System's Accuracy
To study the performance of the designed system, its accuracy, defined as the ability of a test to differentiate correctly between the classified subjects and healthy cases [32], is calculated as shown in Table 7 and (8).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\,\% \quad (8)$$
where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative classifications, respectively. The calculated accuracy of the developed system towards overall subjects is presented in Table 8, and its accuracy towards dyslexia subjects is presented in Table 9. In the WEKA analysis, on the other hand, Naïve Bayes was more efficient in building the model, as it took only 0 s, which is shorter than the other two algorithms. Table 10 shows the results generated utilising WEKA. For the dataset with dyslexic and slow learner subjects, Decision Table achieved 73.33 %, which was the highest rate of correctly classified instances among the three algorithms. This means that 22 instances were correctly classified and the other 8 instances were incorrectly classified. Random Forest achieved only 60 %, as only 18 instances were correctly classified and it failed to classify the rest of the subjects. Although the accuracy rate for Naïve Bayes was also only 60 %, it obtained the lowest mean absolute error (0.3074) and took the least time to build the model, which is 0 s. Table 11 shows the results generated utilising WEKA.
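The reported percentages can be reproduced with a short Python calculation. The sketch below assumes that dyslexia is treated as the positive class and slow learner as the negative class, so that the 13 slow learners classified as "High Risk" count as false positives; the count of 14 correct instances out of 17 is inferred from the reported 82.35 % rather than stated in the text, and the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy in percent: correct classifications over all classifications."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def percent_correct(correct, total):
    """Share of correctly classified instances, as reported by WEKA."""
    return 100.0 * correct / total


# Fuzzy system: all 17 dyslexia subjects and all 13 slow learners were
# classified as "High Risk" (assumed positive class: dyslexia).
fis_overall = round(accuracy(tp=17, tn=0, fp=13, fn=0), 1)
fis_dyslexia = round(accuracy(tp=17, tn=0, fp=0, fn=0), 1)

# WEKA results on the combined dataset of 30 instances.
decision_table_both = round(percent_correct(22, 30), 2)
random_forest_both = round(percent_correct(18, 30), 2)

# Inferred: 82.35 % of 17 dyslexia instances corresponds to 14 correct.
dyslexia_only = round(percent_correct(14, 17), 2)
```

The calculation makes the source of the low overall figure explicit: the fuzzy system gains nothing from the slow learner group, since none of those 13 subjects is classified correctly.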

Discussion on System's Accuracy using Fuzzy Logic and WEKA Analysis
Based on the accuracy calculated in the previous section, the fuzzy inference system obtained an overall accuracy of 56.7 %, whereas its accuracy towards dyslexia subjects was 100 %. The overall accuracy is therefore 43.3 percentage points lower than the accuracy obtained using only dyslexia subjects. The lower overall accuracy implies that the developed system is less capable of classifying slow learner subjects. Therefore, how to differentiate between slow learners and dyslexia subjects using the developed system can be a subject for future research. Regarding the effectiveness of the developed system towards dyslexia subjects, the 100 % classification accuracy indicates that it is adequate to implement only four tests from Ujian Pengesanan Awal Disleksia Bahasa Melayu. In the WEKA analysis, however, the same datasets gave different outcomes compared with the fuzzy inference system. For dyslexia subjects, the Random Forest and Decision Table algorithms attained only 82.35 % correctly classified instances, which is lower than the fuzzy inference system. Meanwhile, for the combined dyslexia and slow learner subjects, the best classification algorithm was Decision Table, which obtained 73.33 % correctly classified instances, higher than the fuzzy inference system. The difference between the two classification systems suggests that a larger dataset is needed when conducting a data mining task with WEKA, as a small dataset cannot provide sufficient information to fill the gaps between samples [33]. Moreover, the current results generated via WEKA do not necessarily identify the best algorithms for dyslexia classification, as further aspects need to be examined, including the functionality of the algorithms used and the complexity of the problem. Hence, more experimental work is recommended in the future.

Conclusions
In this paper, a rapid dyslexia screening system based on fuzzy logic was successfully developed and tested on two subject categories, dyslexia and slow learner, and the results were compared with WEKA analysis. The developed system was able to classify two conditions, high risk or low risk of dyslexia, removing the need to determine the dyslexic condition manually. A Fuzzy Inference System (FIS) was utilised to facilitate the risk classification procedure. The experimental results using pre-existing data showed an accuracy of 56.7 % for overall performance (using both subject categories) and 100 % for dyslexia subjects only. The low accuracy for both subject categories implies that the proposed system requires further modification of the rule statements, including the addition of an IQ test to differentiate slow learners from dyslexia subjects. This is because dyslexia subjects are not affected in the aspect of IQ.
As for the WEKA analysis, the Random Forest and Decision Table algorithms produced the same result for dyslexia subjects, with 82.35 % correctly classified instances. However, Random Forest is preferable, as it has a lower mean absolute error than Decision Table. On the other hand, Decision Table showed the best performance among the three algorithms for both subject categories, correctly classifying 73.33 % of the instances.
For future work, the accuracy and robustness of the fuzzy inference system will be improved by refining the rule statements and membership functions to cope with more varying degrees of risk level. In addition, more experimental work will be required to find the best-fit algorithm in WEKA for dyslexia classification. It is also suggested that a web-based version of the fuzzy logic screening system be implemented for ease of use, so that it can benefit parents as well as teachers.