The Characteristics of High-Risk Tryout Test Items for Indonesian Elementary Schools Students

High-risk tryout tests are one form of measurement used in education to gauge students' ability. This research examined the quality of the Mathematics tryout test questions that Indonesian teachers use to prepare students for the national examination, which is itself a high-risk examination. The study took a quantitative approach and analyzed the data with item response theory to determine model fit and item eligibility. The data were the answers of 924 elementary school students from 20 elementary schools in Indonesia who took the 2016 national examination tryout in mathematics. The mathematics test consisted of 40 questions aligned with the blueprint of the elementary school national examination. The results indicate that the mathematics tryout test items meet the assumptions of uni-dimension, local independence, and item parameter invariance. The logistic parameter model used was the 2-PL model, under which 31 of the 40 items were eligible. The novelty of this research is the use of item response theory to analyze the quality of an instrument. Item response theory was used to avoid results that depend on the examinee sample and on the item sample. Educational stakeholders can use this research as a reflection when preparing high-risk examinations.


Introduction
Assessing or measuring students' abilities is part of the education system. One of the purposes of evaluating/measuring students' ability is to use it as teachers' reflection material to improve teaching and learning processes and conduct a remedial program for students. It is also used to know the progress or learning outcomes of each student for reporting to parents, determining class progress, and determining students' graduation. Further, it is also used to treat students, in teaching and learning situations, appropriately to their abilities or characteristics, to know the background (psychology, physical and environmental) of students who experience learning difficulties. The results can be used as a basis for solving students' learning difficulties.
Measurement, in the educational field, means measuring the attributes or characteristics of individual students. In this case, it is not the students who are measured, but their characteristics or traits. This is in line with [1], which defines measurement as assigning numbers to individuals in a systematic way that reflects the characteristics of those individuals. The instruments used to obtain information about students' abilities can take the form of questions.
The questions made by teachers need to be analyzed to establish whether or not they measure the right competencies. This should be done repeatedly at the end of each learning process, since those competencies will eventually be measured at the end of the study period. In Indonesia, the end of the study period is marked by the national examination, which measures and determines the ability of students at the end of their studies. To prepare for the national exam and to get maximum results, both schools and individuals make many efforts, such as holding tryout tests. Many institutions host tryout tests, from private tutoring institutions to the schools themselves. However, no research has yet investigated whether the tryout test items follow national examination standards and are capable of measuring students' abilities.
It is essential to ensure that the tryout test items are of sufficient quality to measure students' ability as they prepare for the National Examination. Generally, the tryout test items are written by a team of teachers in the local area. In the national examination tryout for the Mathematics subject in elementary schools in Indonesian suburbs, a multiple-choice test was used to measure students' abilities. The test items compiled by the Teacher Team of a particular school group are used by the elementary schools in that group without any analysis of the quality of the items or of how well the instrument fits the ability of the students.
Therefore, to overcome weaknesses in the preparation of the national examination tryout test, as applied to the Mathematics tryout test in the particular district, it is crucial to analyze the quality of the Mathematics tryout tests compiled by the teachers of a specific school group, since the results of the tryout can describe students' passing rate. The analysis of the tryout test items focuses on item response theory because it rests on a mathematical model in which the probability that a test taker answers correctly depends on the ability of the student and the characteristics of the item [2]. Based on the explanation above, a study is needed to test the assumptions of item response theory in the developed test, to test the fit of the item response theory model, and to analyze whether or not the items are feasible for preparing students for the national examination.

Materials and Methods
This study used a quantitative approach based on item response theory. The data were the answers of 924 elementary school students from 20 elementary schools in Indonesia who took the 2016 national examination tryout in mathematics. The mathematics test consisted of 40 questions aligned with the blueprint of the elementary school national examination.
The raw data collected from the selected tryout samples were processed and analyzed in three steps. First, the assumptions of item response theory were checked: uni-dimension, local independence, and parameter invariance. Second, the fit of the logistic parameter model was tested, i.e., whether the questions follow the 1-PL, 2-PL, or 3-PL model. Third, under the model that fit, the estimated discriminating power, difficulty level, and pseudo-guessing parameters were used to determine whether the items in the tryout questions are good or not.
The software used to process the data was SPSS 20, BILOG MG 3.0, and Microsoft Excel 2013. SPSS was used to test the uni-dimensionality and local independence assumptions with factor analysis. BILOG MG 3.0 and Microsoft Excel 2013 were used to test the assumption of parameter invariance, to test the fit of the model to the tryout questions, and to classify the items as good or bad.
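The three logistic models named above differ only in which item parameters they allow to vary. The following is a minimal sketch of the standard logistic item response function, not the estimation routine used in this study. The function name `irf` is hypothetical, and the scaling constant D = 1.7 that some software applies is assumed to be omitted.

```python
import math


def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct answer under the logistic IRT family.

    theta: examinee ability
    b: item difficulty
    a: discriminating power (fixed at 1 in the 1-PL model)
    c: pseudo-guessing (fixed at 0 in the 1-PL and 2-PL models)
    Assumption: no D = 1.7 scaling constant is applied.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# 1-PL: only the difficulty b varies between items.
p_1pl = irf(theta=0.0, b=0.0)
# 2-PL: difficulty b and discriminating power a vary.
p_2pl = irf(theta=1.0, b=0.0, a=2.0)
# 3-PL: b, a, and the pseudo-guessing parameter c vary.
p_3pl = irf(theta=0.0, b=0.0, a=1.0, c=0.25)
```

When ability equals difficulty, the 1-PL and 2-PL models give a probability of 0.5, while the 3-PL model raises that value to (1 + c) / 2 because of guessing.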

Results and Discussion
The characteristics of the mathematics national examination tryout items for elementary schools were examined on a broad scale by testing the assumptions of item response theory, i.e., uni-dimension, local independence, and parameter invariance. Uni-dimension means that each test item measures only one ability. Local independence means that, when the factors that influence the tryout results are held constant, a subject's responses to any pair of items are statistically independent of one another. Parameter invariance means that the characteristics of an item do not depend on the distribution of the ability parameters of the test-takers, and the parameters that characterize test-takers do not depend on the characteristics of the items [2].
This article also tests the fit of the 1-PL, 2-PL, and 3-PL models to the mathematics national examination tryout for elementary schools. These results yield the parameters of the model that fits and the characteristics of both the good and the bad items in the developed instrument. The results of the item response theory analysis are discussed below.

The Assumptions of Item Response Theory Test
The quality of the questions will affect the accuracy to determine the ability of respondents. Detecting the quality of questions can be done through theoretical and empirical analysis. In general, this empirical item analysis can be divided into two, i.e., the classical test theory approach and Item Response Theory (IRT).
The classical test theory, or pure classical score theory, is based on an additive model, i.e., the observed score is the sum of the true score and the measurement error score [3]. Classical test theory has developed widely and has been the mainstream among psychologists and education experts, as well as in other behavioral studies, for decades [3]. Classical test theory has weaknesses because it is examinee sample dependent and item sample dependent [4]; [5]; [6]; [7]. In other words, the item parameters in classical test theory are characteristics that depend on the sample group used. In addition, classical test theory requires the measurement error to be equal for all subjects involved in the test, and the definition of parallel tests in classical test theory is also ...
This study used the IRT approach. According to [2], the uni-dimensionality assumption of IRT can be tested with factor analysis by examining the eigenvalues of the inter-item variance-covariance matrix. The factor analysis was preceded by an analysis of sample adequacy using SPSS. The Bartlett test gave a chi-square value of 17763.545 with 780 degrees of freedom and a p-value of less than 0.01. This shows that the sample size of 924 in this study was sufficient to test the uni-dimensionality assumption of item response theory. To assess uni-dimensionality, a scree plot of the eigenvalues was produced with SPSS.
The eigenvalues are presented in the scree plot in Figure 1. Based on the scree plot, the eigenvalues begin to level off at the 13th factor. This shows that there is more than 1 dominant factor in the mathematics national examination tryout for elementary schools. Still, the other 11 factors do not contribute significantly to the explained variance. The conclusion is that the developed instrument measures at least 2 factors, with the first factor as the dominant one. Another approach is to check whether the first factor explains more than 20% of the variance, or whether the ratio of the first eigenvalue to the second is 4 or 5 [8]. Data analysis with the SPSS 22 program gave the results shown in Table 2. Table 2 shows two high eigenvalues, 7.439 and 2.019. These two values are much higher than the other eigenvalues, but the dominant one is 7.439, so the dominant factor is the first factor. This is also consistent with [9]; [10], which state that an instrument is uni-dimensional if there is only one dominant factor.
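The Bartlett statistic and the eigenvalue comparison described above can be sketched in a few lines. This is an illustration only, not the SPSS procedure used in the study. It assumes that the eigenvalues are taken from the Pearson correlation matrix of the 0/1 item scores, and it runs on simulated responses, not the tryout data. The function names `bartlett_sphericity` and `eigenvalue_summary` are hypothetical.

```python
import numpy as np


def bartlett_sphericity(data):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test of sphericity."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # The log-determinant is computed directly for numerical stability.
    _, logdet = np.linalg.slogdet(corr)
    chi_square = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return chi_square, dof


def eigenvalue_summary(data):
    """Return sorted eigenvalues, the first/second ratio, and the first factor's share."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
    ratio = eigenvalues[0] / eigenvalues[1]
    share = eigenvalues[0] / eigenvalues.sum()
    return eigenvalues, ratio, share


# Simulated stand-in data: 924 examinees, 40 items, one underlying ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(924, 1))
difficulty = np.linspace(-1.5, 1.5, 40)
prob_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
responses = (rng.random((924, 40)) < prob_correct).astype(int)

chi_square, dof = bartlett_sphericity(responses)
eigenvalues, ratio, share = eigenvalue_summary(responses)
```

For 40 items the degrees of freedom are 40 × 39 / 2 = 780, which matches the value reported for the Bartlett test. For the two reported eigenvalues, the ratio 7.439 / 2.019 is about 3.68.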
The second assumption is local independence, which is evidenced by the uni-dimensionality of the participant response data on the mathematics national examination tryout for elementary schools. The uni-dimensionality test found one dominant factor, which means that the instrument measures one dimension: the ability to do the national mathematics examination. According to [11], a uni-dimensional result indicates that there is no correlation between a participant's response to one item and the response to another item; in other words, the assumption of local independence has been fulfilled.
The third assumption is the invariance of item parameters and ability parameters. Item parameter invariance is checked by estimating the item parameters in different groups of test-takers, here split by gender. In this article, the parameters estimated in each group are discriminating power (a), difficulty level (b), and pseudo-guessing (c), and the estimates are compared in scatter plot diagrams [11]. In the scatter plot for the discriminating power parameter, the points for the male and female groups lie relatively close to the line with slope 1. This shows that the estimated parameters do not vary between the female and male groups.

The scatter plot for parameter (b), the level of difficulty estimated in the male and female groups, is depicted in Figure 3. In this scatter plot, each point is relatively close to the line with slope 1. This means that the estimated difficulty level does not vary between the male and female groups. In other words, parameter invariance is fulfilled. The scatter plot for parameter (c), pseudo-guessing estimated in the male and female groups, is depicted in Figure 4. In this scatter plot, each point is also relatively close to the line with slope 1. This shows that the estimated pseudo-guessing parameter does not vary between the male and female groups. In other words, parameter invariance is fulfilled.
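The scatter plot comparison can be complemented with a numerical summary. The following is a minimal sketch, assuming that invariance is judged by how closely the paired estimates follow a line with slope 1. The function name `invariance_summary` and the difficulty estimates are hypothetical and are not taken from the study.

```python
def invariance_summary(group_1, group_2):
    """Return (correlation, slope, intercept) for paired item parameter estimates.

    group_1 and group_2 hold estimates of the same items from two groups.
    The slope and intercept come from a least-squares line of group_2 on group_1.
    """
    n = len(group_1)
    mean_1 = sum(group_1) / n
    mean_2 = sum(group_2) / n
    s11 = sum((x - mean_1) ** 2 for x in group_1)
    s22 = sum((y - mean_2) ** 2 for y in group_2)
    s12 = sum((x - mean_1) * (y - mean_2) for x, y in zip(group_1, group_2))
    correlation = s12 / (s11 * s22) ** 0.5
    slope = s12 / s11
    intercept = mean_2 - slope * mean_1
    return correlation, slope, intercept


# Hypothetical difficulty estimates (b) for five items in two groups.
b_male = [-1.0, -0.5, 0.0, 0.6, 1.2]
b_female = [-0.9, -0.6, 0.1, 0.5, 1.3]

correlation, slope, intercept = invariance_summary(b_male, b_female)
```

A correlation near 1 together with a slope near 1 and an intercept near 0 corresponds to points lying close to the identity line in the scatter plot.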
The invariance of the ability parameter is examined by estimating parameters in different item groups, here split into odd-numbered and even-numbered items. In this article, the parameters estimated in the odd and even clusters are discriminating power (a), difficulty level (b), and pseudo-guessing (c), and the estimates are compared in scatter plot diagrams [11]. The scatter plot diagram shows that the points for the mathematics national examination tryout items for elementary schools are spread out and do not lie along the straight line, so these data are not invariant.
Based on the test results above, the assumptions of item response theory in the tryout questions for elementary schools developed and tested on a broad scale have met all three characteristics, i.e., uni-dimension, local independence, and invariance of item parameters.

Fitness Test of Logistic Parameter Model
The fit of the logistic parameter model can be tested by analyzing the data with the 1-PL, 2-PL, and 3-PL models of item response theory. Users of this theory need to establish which of the three models the analyzed data follow. Two approaches can be used to determine the fit of the model, i.e., statistical model fit and inspection of the item characteristic curve plot [11]. In this study, statistical model fit was used.
For the statistical model selection, the fit of the three models was assessed with the chi-square value. Fit can be determined by comparing the calculated chi-square with the tabulated chi-square value at the given degrees of freedom. An item fits the model if the calculated chi-square value does not exceed the tabulated value. Fit can also be read from the probability value (significance, sig.): if the sig. value < α, then the item does not fit the model [11]. In this study, the data were analyzed with BILOG MG 3.0 to obtain the chi-square values used to determine model fit. The fit of each model was checked at the significance level α = 0.05. If the sig. value > α = 0.05, the model fits, that is, the model used suits the items developed; conversely, if the sig. value < α = 0.05, the model used does not suit the items developed.
Based on the analysis results, the 1-PL model has 15 fit items, the 2-PL model has 30 fit items, and the 3-PL model has 25 fit items. Most items therefore fit the 2-PL model, so the further item response theory analysis in this study uses the 2-PL model.
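The decision rule and the model comparison above amount to counting, for each model, the items whose significance value exceeds α = 0.05 and then choosing the model with the most fit items. The following is a minimal sketch. The function names are hypothetical, and the per-item significance values from BILOG MG are not reproduced here; only the reported counts are used.

```python
ALPHA = 0.05  # significance level used in the study


def item_fits(p_value, alpha=ALPHA):
    """An item fits the model when its significance value exceeds alpha."""
    return p_value > alpha


def count_fit_items(p_values, alpha=ALPHA):
    """Count how many items fit a model, given one significance value per item."""
    return sum(item_fits(p, alpha) for p in p_values)


# Fit counts reported in the study for the 40 tryout items.
fit_counts = {"1-PL": 15, "2-PL": 30, "3-PL": 25}

# The model with the most fit items is carried forward.
best_model = max(fit_counts, key=fit_counts.get)
```

With the reported counts, `best_model` is the 2-PL model.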

Instrument Item Parameters and Classification of Good and Bad Items under the 2-PL Model
The model fit test showed that the 2-PL model was the most appropriate for the mathematics national examination tryout items for elementary schools, so this model was used to estimate the item parameters and abilities. The estimated parameters are the discriminating power (a) and the difficulty level (b), obtained with BILOG MG 3.0.
The 2-PL item analysis with BILOG MG shows that 7 items are not good because their difficulty level is less than 0.4 and their discriminating power is also low. Therefore, these seven items must be dropped. Overall, the mathematics national examination tryout questions comprise 31 good items and 9 bad items. Using the good items rather than the bad ones is recommended so that students can develop their thinking skills [9].

Conclusions
The Mathematics tryout questions for elementary schools, which were tested on a broad scale, fulfill the assumptions of item response theory: uni-dimension, local independence, and invariance of item parameters. The uni-dimensionality assumption is supported by factor analysis in SPSS, which yielded two factors, one of which is dominant. The assumption of local independence is also fulfilled as a consequence of the previous assumption test, which showed that the Mathematics tryout items are uni-dimensional. Meanwhile, the item parameter invariance assumption was fulfilled based on the analysis of student responses in different gender groups.
Based on the model fit test, which used the significance values to determine which model suits the 2016 Mathematics tryout test, the 2-PL model fits best, with the discriminating power parameter (a) ranging from -5.470 to 2.179 and the difficulty level parameter (b) ranging from -1.087 to 1.926. From the estimated item parameters, the items were classified as good or bad, giving 31 eligible items and 9 ineligible items.