Rasch Strategies for Evaluating Quality of the Conceptions and Alternative Assessment Survey (CETAS)

Due to societal demand for educational development, education in Malaysia has begun to adopt alternative assessment approaches in schools and universities. This study developed the Conceptions and Alternative Assessment Survey (CETAS) to examine lecturers' conceptions of assessment (AC) and their practice of alternative assessment (AAP). For CETAS to be useful, a pilot study was conducted to examine item quality using the Rasch analysis approach. A total of 38 lecturers were involved in this study. Item analysis found that four items, Item 7 (AC), Item 8 (AAP), Item 16 (AAP) and Item 30 (AAP), did not meet the requirements of the fit statistics and local item dependency analyses. These four items were therefore deleted, while the remaining 58 items are suitable for measuring the intended constructs. In addition, the scale calibration analysis revealed that Scale 3 (slightly disagree) was not functioning well; after consideration of the analysis and expert review, Scale 3 was collapsed, leaving CETAS with five scales. CETAS has good item and person reliability and can be used to examine lecturers' conceptions of assessment and their practices of alternative assessment.


Introduction
Alternative assessment has recently regained attention after it was first introduced in 1990 (Quenemoen, 2008). In general, assessment is defined as a measure of performance, including knowledge, skills, attitudes and beliefs. For an educator, the major purposes of assessment revolve around the classroom: diagnosing students' strengths and weaknesses, monitoring students' progress, assigning students' grades and determining the educator's own instructional effectiveness (Popham, 2014; Green & Johnson, 2010). In addition, Popham (2014) listed three further purposes of assessment: influencing public perceptions of educational effectiveness, evaluating educators, and clarifying instructional intentions.
Traditional assessment, which usually employs pen-and-paper tests, has several limitations: it can only measure what learners can do at one particular time (Law & Eckes, 1995; Dikli, 2003); it focuses on lower-level thinking skills such as knowledge and comprehension (Dikli, 2003; Ince & Yilmaz, 2012); no immediate feedback is given to the learners (Bailey & Brown, 1998); its purpose is usually norm-referenced (Dikli, 2003); and it does not necessarily reflect students' own experience (Brempong, 2019). Generally, traditional assessments are used for assigning students' grades and diagnosing students' strengths and weaknesses. In contrast, alternative assessment, also known as non-formal testing, focuses on assessing higher-order thinking skills (Bagley, 2010), lifelong skills (Ince & Yilmaz, 2012) and real-life tasks, allowing students to best demonstrate what they have learned (Quansah, 2018). In addition, alternative assessment allows instructors to evaluate what students can and cannot do, rather than what they know and do not know. Alternative assessment has also been reported to motivate students to learn (Tal & Miedijensky, 2005; Bachelor, 2015). In general, alternative assessments are useful for monitoring students' progress and determining instructional effectiveness.
In Malaysia, a shift from traditional assessment to alternative assessment is occurring in both schools and universities, driven by societal demand for educational development and a move towards a more powerful learning environment. Assessment has always been a responsibility of instructors. Watkins et al. (2005) pointed out that instructors' conceptions of assessment are influenced by their views of theories of teaching and learning. Additionally, a study by Badariah et al. (2014) revealed that lecturers in higher education have limited practice of assessment for learning, possibly due to unfamiliarity with formative assessment; however, that study did not further examine lecturers' views on conceptions and alternative assessment. Therefore, this study developed a survey to examine lecturers' perspectives on conceptions of alternative assessment, and focuses on assessing the quality of the items in CETAS using Rasch measurement analysis (Rasch, 1960).

Rasch Measurement Model
The Rasch Measurement Model (RMM) is an item response theory-based model that has been widely used to evaluate item quality. RMM estimates only one item parameter, difficulty, while item discrimination is assumed to be constant across items and guessing is assumed to be absent (Magno, 2009). One of RMM's strengths is its measurement requirement, in which items in an instrument should fit the model well enough to produce useful measures (Boone & Noltemeyer, 2017). Another strength is that RMM converts nonlinear raw scores to a linear (logit) scale once this requirement is met (Boone & Noltemeyer, 2017; Boone, 2016). Therefore, once the requirement is met, researchers can make meaningful interpretations of their survey scores.
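In the dichotomous case, the Rasch model makes the probability of endorsing an item depend only on the difference between person ability and item difficulty, both on the logit scale. A minimal sketch of this relation (for illustration only; the study's actual estimation was done with Rasch software, not this code):

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Probability of endorsing an item under the dichotomous Rasch model.

    theta: person ability (logits); b: item difficulty (logits).
    Only difficulty is estimated per item; discrimination is fixed and
    there is no guessing parameter.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person whose ability equals the item's difficulty has a 50% chance
# of endorsement; the probability rises as ability exceeds difficulty.
```

Because the model is a monotonic function of a single difference, raw scores that fit it can be mapped onto one linear logit scale, which is the basis of the "nonlinear raw data to linear scale" conversion mentioned above.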
There are a variety of ways in which RMM can be used to evaluate the item quality of a social science research instrument (Bond, 2003; Boone & Noltemeyer, 2017; Boone, 2016). Among the analyses that RMM offers are: (i) examining internal consistency, (ii) examining fit statistics, (iii) examining unidimensionality and local item dependency, and (iv) scale calibration. Internal consistency includes person and item reliability and separation. Fit analyses include examining the point-measure correlation, mean-square error (MNSQ), and standardized fit statistic (ZSTD). For unidimensionality, the analysis focuses on the eigenvalue of the first contrast; for local item dependency, Rasch analysis examines residual correlations. Table 1 shows the requirements for the analyses above, including the following criteria:

Eigenvalue (unexplained variance in 1st contrast): should not be larger than 3. An eigenvalue larger than 3 indicates some kind of secondary dimension.

Residual correlation: should not be larger than 0.70. A correlation of more than 0.70 indicates that the item is locally dependent on another item.

Instrument
The Conceptions and Alternative Assessment Survey (CETAS) has two main dimensions: conceptions of assessment in general, and alternative assessment practice. For assessment conceptions, CETAS measures how lecturers view assessment in terms of improvement of teaching and learning, institutional accountability, irrelevance, and student accountability. For alternative assessment practice, CETAS measures lecturers' alternative assessment practice approaches, such as authentic assessment, challenge-based, integrated, performance, personalized, profiling, project-based and real-time assessment. The distribution of items is shown in Table 2.

Sample
The pilot study involved 38 lecturers from Universiti Teknologi Malaysia. The minimum number of respondents required for Rasch analysis is 30; according to Wright and Tennant (1996), 30 well-targeted respondents are enough to evaluate item quality.

Internal Consistency
As shown in Table 3, the overall analysis indicates that the Conceptions and Alternative Assessment Survey (CETAS) has a person reliability of 0.88 and a person separation of 2.67. Based on the person separation and person reliability, items in CETAS are able to separate respondents into 2.67 ≈ 3 levels. A person separation of more than 2 and a person reliability of more than 0.80 indicate that CETAS is sensitive enough to distinguish between low- and high-scoring respondents.
As for the item analysis, item separation indicates how well the person sample is able to distinguish items at different levels of difficulty. CETAS has an item reliability of 0.94 and an item separation of 3.90. An item separation of more than 2 and an item reliability of more than 0.80 imply that the person sample is large enough to confirm the difficulty hierarchy of the items in CETAS; the 38 respondents in this pilot study are therefore sufficient for that purpose.
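The reported reliability and separation values are consistent with the standard Rasch relation between the two statistics, G = sqrt(R / (1 − R)). A quick check of that relation (a sketch of the standard formula, not the software's own computation):

```python
import math

def separation(reliability: float) -> float:
    """Separation index G implied by a Rasch reliability R: G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))

def strata(g: float) -> float:
    """Number of statistically distinct levels implied by separation G: (4G + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0

# Person reliability 0.88 implies a separation of about 2.7, i.e. roughly
# three distinguishable levels of respondents; item reliability 0.94
# implies an item separation of about 4.0.
```

Small differences between these implied values and the reported ones (2.67 and 3.90) are expected, since software computes separation from model error variances rather than back from rounded reliability.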

Unidimensionality
To examine unidimensionality, Linacre (2015) suggested examining the unexplained variance in the first contrast, whose eigenvalue should not be larger than 3. However, as shown in Table 4, CETAS has an eigenvalue of 9.0, three times the required value. This indicates that CETAS is multidimensional rather than measuring a single dimension. The result is reasonable, as CETAS has two main dimensions: Assessment Conception (AC) and Alternative Assessment Practice (AAP). AC has four sub-dimensions: 'Improvement on Teaching and Learning', 'Institutional Accountability', 'Irrelevance' and 'Student Accountability'. AAP has eight sub-dimensions: 'Authentic', 'Challenge-based', 'Integrated', 'Performance', 'Personalized', 'Profiling', 'Project-based' and 'Real-time'. Therefore, this study further examined each of these sub-dimensions.
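The eigenvalue criterion comes from a principal component analysis of the standardized residuals: a first-contrast eigenvalue above 3 means roughly three items' worth of residual variance align on a secondary pattern. A simplified illustration with a constructed (hypothetical) residual correlation matrix, rather than real CETAS data:

```python
import numpy as np

# Hypothetical residual correlations for 8 items: two clusters of 4 items
# whose residuals correlate 0.8 within the cluster and 0 across clusters,
# i.e. two sub-dimensions hiding in the residuals.
block = np.full((4, 4), 0.8)
np.fill_diagonal(block, 1.0)
zero = np.zeros((4, 4))
resid_corr = np.block([[block, zero], [zero, block]])

# Largest eigenvalue of the residual correlation matrix: 1 + 3 * 0.8 = 3.4
first_contrast = np.linalg.eigvalsh(resid_corr).max()

# 3.4 > 3 items' worth of variance: flag a possible secondary dimension,
# analogous to the CETAS overall result (eigenvalue 9.0 across AC and AAP).
```

With more than three items' worth of shared residual variance, the instrument as a whole fails the unidimensionality criterion, which is why the analysis then proceeds sub-dimension by sub-dimension.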

Unidimensionality of Assessment Conception
Further analysis indicates that the eigenvalues of all the Assessment Conception sub-dimensions are less than 3, as shown in Table 5. However, the percentage of raw unexplained variance for each sub-dimension exceeds the 15% limit, so a further analysis of local item dependency was conducted. The correlation between a pair of items' residuals should not exceed 0.70; if it does, one of the items should be removed, as both items probably share the same dimension. To identify dependent items, Winsteps produces a table of the largest standardized residual correlations between pairs of items. Table 6 shows the pairs of Assessment Conception items with large correlations. As can be seen in Table 6, the correlation between Item 7 and Item 18 is quite large at 0.69, and the correlation between Item 1 and Item 3 is similarly large at 0.68. Although all correlations are below 0.70, Item 7, Item 8 and Item 18 each appear more than twice in Table 6; they are therefore flagged as candidates for omission. Table 7 also shows that Item 7 has unacceptable values on all fit indices, and its point-measure correlation is very small at 0.02. Additionally, Item 7 appears in the local item dependency table. After considering all fit indices and local item dependency, Item 7 was omitted from the Assessment Conception items.
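The dependency screen described above can be sketched as follows, using hypothetical residuals rather than the study's Winsteps output:

```python
import numpy as np

def flag_dependent_pairs(residuals: np.ndarray, cutoff: float = 0.70):
    """Flag locally dependent item pairs.

    residuals: items x persons matrix of Rasch residuals (observed - expected).
    Returns (i, j) index pairs whose residual correlation exceeds the cutoff,
    suggesting the two items share variance beyond the latent trait.
    """
    corr = np.corrcoef(residuals)
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j] > cutoff]

# Hypothetical residuals for 3 items and 5 persons: items 0 and 1 move
# together (item 1 is a scaled copy of item 0), item 2 does not.
resid = np.array([
    [1.0, -1.0, 2.0, -2.0, 0.0],
    [0.9, -0.9, 1.8, -1.8, 0.0],
    [0.0,  1.0, -1.0, 0.5, -0.5],
])
flagged = flag_dependent_pairs(resid)  # only the (0, 1) pair exceeds 0.70
```

When a pair is flagged, the usual remedy, as applied to CETAS, is to drop one item of the pair after weighing the fit statistics of each.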

Unidimensionality of Alternative Assessment Practice
The second dimension is Alternative Assessment Practice. As can be seen in Table 8, the eigenvalue of each sub-dimension is less than 3. However, the percentage of raw unexplained variance is more than the 15% limit, so the local item dependency of the Alternative Assessment Practice items was examined. As shown in Table 9, two pairs of items have correlations greater than 0.70: Item 8 with Item 16, and Item 8 with Item 9. Item 8 thus has high correlations with two items and is considered locally dependent; therefore, Item 8 was omitted. Fit statistics were then examined to identify any other items that do not fit the model. Table 10 shows the fit analysis for the Alternative Assessment Practice items. Based on the table, only two items, Item 16 and Item 30, did not fulfil all fit statistics criteria; they are considered misfitting and were also omitted from the questionnaire.
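The MNSQ fit indices used in these tables can be sketched for the dichotomous case (a simplified illustration; CETAS itself uses a polytomous rating scale and the software's implementation, so this is not the study's computation):

```python
import math

def fit_mnsq(abilities, difficulty, responses):
    """Infit and outfit mean-squares for one dichotomous Rasch item.

    abilities: person measures (logits); difficulty: item measure (logits);
    responses: 0/1 responses, one per person.
    Outfit averages the squared standardized residuals; infit weights the
    squared raw residuals by the model variance, making it less sensitive
    to responses from off-target persons. Expected value is 1.0.
    """
    z2, resid2, weights = [], [], []
    for theta, x in zip(abilities, responses):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
        w = p * (1.0 - p)          # model variance of the response
        r = x - p                  # raw residual
        z2.append(r * r / w)       # squared standardized residual
        resid2.append(r * r)
        weights.append(w)
    outfit = sum(z2) / len(z2)
    infit = sum(resid2) / sum(weights)
    return infit, outfit

# Two on-target persons (theta == difficulty), one agreeing and one
# disagreeing, yield the expected mean-squares of exactly 1.0.
```

Values well above 1 indicate noise beyond the model's prediction and values well below 1 indicate overly predictable responses, which is why items such as Item 16 and Item 30 falling outside the acceptance range were omitted.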

Rating Scale Calibration
Rasch analysis also allows researchers to calibrate an instrument's rating scale, providing information on whether a scale category should be collapsed or another category added. The Conceptions and Alternative Assessment Survey (CETAS) has six scales:
1. Strongly Disagree
2. Disagree
3. Slightly Disagree
4. Slightly Agree
5. Agree
6. Strongly Agree
There are a few ways to examine the rating scale, as shown in Table 11. The first is by examining the observed count and observed average. For the observed count, a minimum of 10 observations is required in each scale, with a fair distribution across the rating scales (Linacre, 2012). The observed average should increase steadily and consistently across the scales. The second is by examining the structure calibration: the difference between two adjacent scales, calculated manually, should not be less than 1.4 and should not be more than 5 (Linacre, 2012). A value below 1.4 indicates overlapping categories, meaning the respondents are unable to differentiate the scales. The last is by observing the probability curve pattern: each scale should have a distinct peak to indicate that the rating scale is functioning well. As can be seen in Figure 1, the observed average of the first scale is -1.04, and the average steadily increases to 1.69 at the last scale. However, the observed count decreases at scale 2, and the observed counts for scales 1 and 2 are the lowest of all. Table 12 shows the structure calibration differences between scales. The differences between scale 2 and scale 3 and between scale 3 and scale 4 are out of the required range. Additionally, Figure 2 shows that scale 2 is overshadowed by scale 1, while scale 3 is overshadowed by scale 4; neither scale 2 nor scale 3 has a distinct peak. This indicates that scales 2 and 3 are not functioning well: the respondents were unable to differentiate these scales.
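The structure calibration check described above reduces to inspecting the differences between adjacent threshold values. A small sketch with hypothetical calibrations chosen to mirror the CETAS pattern (not the actual Table 12 values):

```python
def check_threshold_spacing(thresholds, low=1.4, high=5.0):
    """Check Rasch-Andrich threshold (structure calibration) spacing.

    thresholds: one calibration per scale boundary (logits), in order.
    Returns (threshold_i, threshold_j, difference) for each adjacent pair
    whose difference falls outside [low, high]; a gap below `low` means
    the category between them is rarely the most probable choice.
    """
    problems = []
    for i in range(len(thresholds) - 1):
        diff = thresholds[i + 1] - thresholds[i]
        if diff < low or diff > high:
            problems.append((i + 1, i + 2, diff))
    return problems

# Hypothetical calibrations for a 6-category scale (5 thresholds).
# Adjacent differences are 1.2, 0.3, 2.0, 2.5, so the first two pairs
# fall below 1.4, mirroring the CETAS overlap around scales 2-4.
taus = [-3.0, -1.8, -1.5, 0.5, 3.0]
violations = check_threshold_spacing(taus)
```

Such under-spaced thresholds, combined with categories that never peak in the probability curves, are the usual grounds for collapsing a category, as was done with scale 3 of CETAS.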

Discussion
This study developed an instrument to measure lecturers' conceptions of assessment and their alternative assessment practice. The Conceptions and Alternative Assessment Survey (CETAS) was completed by 38 lecturers from Universiti Teknologi Malaysia (UTM). For CETAS to be useful for operational use, item analysis was conducted to evaluate item quality using Rasch analysis. Four item analyses were conducted: (i) examining internal consistency, (ii) examining unidimensionality and local item dependency, (iii) examining fit statistics, and (iv) scale calibration.
Based on the person and item reliability and separation values, items in CETAS are sensitive enough to separate respondents into three levels of ability; one characteristic of good items is that they can discriminate between respondents of at least two different ability levels. Unidimensionality and local item dependency complement each other: violation of local item independence may affect the unidimensionality of an instrument. For CETAS, the overall unidimensionality analysis indicated a sign of multidimensionality. This is reasonable, as CETAS has two distinct dimensions, conceptions (AC) and practice (AAP); therefore, unidimensionality analysis was conducted separately for assessment conception and alternative assessment practice. Items deemed inappropriate were deleted to preserve the unidimensionality of CETAS: any item that did not fulfil the requirements for local item dependency and fit statistics was removed. In total, four items were deleted: Item 7 (AC), Item 8 (AAP), Item 16 (AAP) and Item 30 (AAP).
Lastly, based on the scale calibration analysis, respondents could not differentiate scale 3 (slightly disagree) from its neighbouring scales. In addition, scale 3 has no distinct peak, which means the scale is not functioning well. Therefore, based on the analysis and expert review of these scales, scale 3 (slightly disagree) was collapsed, and for future operational use CETAS will have only five scales. After these deletions, CETAS can be used to examine lecturers' conceptions of assessment and their practice of alternative assessment.