Application of Rasch Analysis in Measuring Teacher Collegial Supervisory Instrument's Reliability and Validity

Collegial supervision (CS) is defined as collaborative work, extending beyond the professional sphere of relationships, that educators offer one another through feedback and sharing platforms. However, there is still a lack of studies and instruments evaluating CS practice within the Malaysian context. To measure the suitability of CS practice in Malaysian secondary schools, a questionnaire with 26 self-developed items representing five sub-dimensions (constructs), namely collegial relations, teacher's province, teacher growth, teacher collaboration, and reflective inquiry, was developed from a series of interviews with secondary teachers. The major aim of this paper is to validate the instrument and examine its psychometric properties by applying Rasch analysis to measure item-person reliability, principal component analysis, item-person distribution, fit and dimensionality. The analysis was based on feedback from 357 secondary school teachers. The findings revealed high person and item reliability values, and the items' difficulty was well matched with teachers' ability. In addition, principal component analysis revealed an acceptable value of raw variance explained, and the structure calibration of the rating scale indicated that most teachers agreed with most of the items. The study thus supports the internal consistency of the CS items, which contribute to a CS instrument for Malaysian secondary schools.


Introduction
Empirical studies have defined collegial supervision (CS), through the lens of instructional practice, as the collaborative efforts teachers make to refine their instructional practices through knowledge-sharing platforms and feedback aimed at teachers' professional growth [1,2,3]. In defining the concept of CS, Singh and Manser (2002) [4] described it as a learning process towards the practice of shared responsibility and values among the school community, which includes principals and teachers. CS is also sometimes referred to as peer supervision [5,6] because of its emphasis on assessment and feedback from colleagues acting as 'informal' supervisors, with broader mechanisms for improving teachers' performance in instructional practice, which is chiefly concerned with teaching and learning. In the context of this study, CS refers to a consistent process of facilitation in which colleagues (i.e. principals, administrators, and teachers) work together and offer one another feedback on their performance. According to Glatthorn (1984) [5], the approach is directed towards cooperative professional growth. Furthermore, CS is regarded as a collective process beyond the professional sphere of relationships [6], directed towards a common vision, aiming at improvement of the school culture and focusing on teachers' growth and development, interpersonal relationships and collaborative approaches.
As mentioned by Khun-Ineree (2020) [32], CS is needed in secondary schools because of the lack of knowledge on supervisee practices in helping teachers improve their teaching and learning activities, which in turn affects students' academic achievement. In addition, the school administrators, who are the persons entrusted with supervising teachers, are occupied with meetings and busy schedules [32]. It is also cautioned that, because of differences in professional and personal relationships, not all teachers will accept the comments and advice provided by their colleagues, even though the purpose of CS is to improve teachers' professional development [32,33]. Aktas (2018) [34] further stressed that, although CS is provided to teachers, and especially to novice teachers, they need to be flexible and selective in choosing their instructional approaches. This is because, under the mentoring and collegial assessment approaches, novice teachers shadow their mentors or senior teachers, who assess them with a view to improving teaching and learning during collegial supervision.
Although the practice of CS in the school context began as early as 1984 with the work of Glatthorn, studies on collegiality remained scarce until new collegiality measurement scales were developed [11,12]. This paucity has been attributed to the absence of a specific measure of collegiality (Sabharwal, 2011) [7], the predominance of non-quantitative approaches in the studies conducted [6], and the complexity of collegial practice itself [8,9,10]. The literature on supervision in the Malaysian context heavily emphasises direct supervision in its clinical mode [13,14] and pays little attention to CS practice. A clear standard framework, model and set of items related to CS also appear to be unavailable in the context of Malaysian secondary schools. In other words, the framework, model and instrumentation of CS for secondary schools have received little attention from local researchers, and empirical evidence about them is limited. This study therefore validates the psychometric properties of the CS scale by applying Rasch analysis to measure item-person reliability, principal component analysis, item-person distribution, fit and dimensionality.

Objectives of the Study
This study was designed to address the following research objectives: (i) to obtain the person and item reliability of the CS items; and (ii) to assess the psychometric properties of the CS scale based on principal component analysis, item-person distribution, fit and dimensionality analyses.

Conceptual Framework
The theoretical framework of this study is adapted from Zepeda's framework (2007, p. 28) [15], which focuses on formative and cyclical approaches to instructional supervision. In her framework, Zepeda defined the CS approach as a form of professional development for teachers that is based on instructional supervision, consisting of formative supervision and evaluation. In the context of this study, the standard of CS was chosen to replace the professional development aspects because of their similar nature; CS is itself a professional development type of supervision [16,17,18]. Formative supervision (observation) is concerned with the individual's ongoing professional development as carried out within the CS dimensions. The pre-observation conference, classroom observation and post-observation, as elements of formative supervision, are highlighted by Zepeda (2007) [15] in the CS dimensions [2,31].

Sampling
A total of 357 teachers were selected to provide feedback on the listed items. Secondary school teachers were selected using the multistage cluster sampling technique, also known as a multiple probability technique [19], because of the difficulty of determining the entire population. This technique is appropriate for large populations that are geographically spread and naturally grouped [19]; it eases the identification of groups and the locating of lists [20], and it reduces bias and representativeness issues.

Instrumentation
In this study, the Standard Framework of CS for Malaysian Secondary Schools (SFCSMSS) questionnaire was designed to assess the effective practice of CS in Malaysian secondary schools. The questionnaire consisted of 28 items: two items on demographics and 26 items representing the five sub-dimensions of CS, namely collegial relations (CR) (5 items), teacher's province (PR) (5 items), teacher growth (TG) (5 items), teacher collaboration (TC) (6 items), and reflective inquiry (RI) (5 items). The two demographic items covered teachers' gender and their years of experience in the teaching profession.
The items were constructed from the transcripts of a series of interviews with teachers about the practice of CS. Senior teachers were purposively selected and asked to comment on all items. The items were initially constructed in the Malay language; an English translation was later added at the suggestion of the English language teachers. The translation from Malay to English was carried out by a senior English language teacher with the assistance of a Malay language teacher. The items were then checked by senior teachers to assess the content validity of all 26 items. In terms of scaling, the SFCSMSS uses a five-point Likert scale: 1: strongly disagree, 2: disagree, 3: neutral, 4: agree, and 5: strongly agree. The five-point scale was chosen for the following reasons: (a) it is a common rating scale among social science researchers; and (b) it provides an equal opportunity for all respondents in giving their answers [21].

Teachers' Demographics
A total of 357 teachers participated in this study, a response rate of 59.01%, which exceeds the return rate (49%) recommended by Baruch and Holtom (2008) [22]. Table 1 below shows the distribution of teachers by gender and years of experience. In terms of experience, teachers were clustered into three groups: those with 0 to 10 years, those with 11 to 20 years, and those with 21 to 30 years of experience. The largest group, 149 teachers (41.7%), had 11 to 20 years of experience, followed by 111 teachers (31.1%) with 21 to 30 years and 97 teachers (27.2%) with 0 to 10 years. In terms of gender, 98 participants (27.5%) were male and 259 (72.5%) were female, which reflects the high number of female teachers in local secondary schools in Malaysia compared with their male counterparts.

Items and Person's Reliability
The Rasch person-item reliability tests were performed because of their capability to determine the internal reliability of both items and respondents. As shown in Figure 1, the Rasch person reliability is 0.94, which is considered an acceptable value [23]. The person separation value is 4.04, which indicates that the instrument is sensitive enough to distinguish between teachers with many years of experience and teachers with less experience. Thus, no additional items are needed [24]. Based on the analysis, teachers were categorised into four major classifications: teachers who always received supervision, teachers who received a medium amount of supervision, teachers who received the least amount of supervision, and teachers who never received any type of supervision. In Figure 2, the item separation is higher than 3.0, which implies that the person sample is large enough to confirm the item difficulty hierarchy of the instrument [24]. In sum, both reliability values indicate a sufficient sample for determining the item difficulty index of each item [23,25]. Items were classified into five classifications: too difficult, least difficult, answerable, easy, and too easy to answer.
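The reported person reliability (0.94) and person separation (4.04) can be cross-checked against each other using the standard Rasch relation between separation G and reliability R, namely R = G^2 / (1 + G^2). The following Python sketch is illustrative only and is not Winsteps output; the function names are hypothetical.

```python
import math


def reliability_from_separation(g: float) -> float:
    """Convert a Rasch separation index G into a reliability R = G^2 / (1 + G^2)."""
    return g ** 2 / (1.0 + g ** 2)


def separation_from_reliability(r: float) -> float:
    """Inverse relation: G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    return math.sqrt(r / (1.0 - r))


# Person separation reported in this section
person_separation = 4.04
person_reliability = reliability_from_separation(person_separation)
print(round(person_reliability, 2))  # 0.94, consistent with the reported value
```

Because reliability saturates near 1, the separation index is often the more informative of the two figures when comparing instruments.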

The Person-Item Distribution Map
Using the Winsteps application, the Rasch analysis was performed on the responses of 357 secondary teachers. The Person-Item Distribution Map (PIDM), illustrated in Figure 3, indicates teachers' abilities to respond relative to the items' difficulty. On the map produced by the Rasch measurement model, the 357 teachers were placed on the right side while all 26 items were plotted on the left side, both along the logit scale. A "logit" scale was used to express item difficulty on a linear scale that extends from negative infinity to positive infinity [26]. On the distribution map, the Mean item was plotted and served as a threshold, set at zero on the logit scale. Items placed higher than the Mean item are more difficult than items plotted lower than the Mean item. Teachers' abilities are then compared with item difficulty across the two sides of the map. If the right side of the map sits higher than the left side, most of the items are considered difficult for the teachers, and vice versa [27]. Based on the item-person map, the items labelled CR5 and PR3 are considered difficult items, while four items, CR2, PR1, TC6 and TG3, are considered easy items. However, based on the overall item and person distribution, the items' difficulties are matched with the teachers' abilities in answering all 26 items in the questionnaire.
Figure 3 shows that the person mean (Mean person) was 1.62 logits while the item mean (Mean item) was 0.0 logits. The highest-scoring teacher was located at 8.78 logits and the lowest at -3.32 logits. As for the item distribution, the item perceived as most difficult by teachers is CR5, at 1.84 logits, and the easiest item is located at -1.04 logits. A total of 11 items (CR5, PR3, RI2, RI4, CR3, RI3, RI5, TC2, TC1, TG2, and TG1) were found above the Mean item, which reflects the secondary teachers' abilities in understanding and answering the items in the questionnaire. Overall, the 26 items in the questionnaire are assumed not to be very difficult because the teachers' abilities are above the items' difficulty. Based on the Rasch analysis, a total of 265 teachers (74.5%) were above the Mean item and 37 teachers (10.5%) were below the Mean item. In conclusion, the items are aligned with teachers' abilities, since most of the 26 items matched teachers' abilities in responding to the questionnaire. Thus, it is assumed that teachers could understand the items well and answer all the questions correctly.
Based on the item-person distribution map obtained from teachers' feedback on the five constructs of CS, 11 items were located above the Mean item and the other 15 items were located below it. The findings on the items' plots and locations are presented in Table 2. As Table 2 shows, most of the items found difficult relate to reflective inquiry, which has four items scattered above the Mean item: RI2, RI4, RI3 and RI5. The collegial relations (CR3, CR5), teacher growth (TG1, TG2) and teacher collaboration (TC1, TC2) constructs each have two items above the Mean item, and the teacher's province construct has only one, PR3. The 15 items plotted below the Mean item indicate that most items in the questionnaire were easy items, compared with the 11 items situated above the Mean item.
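To give a sense of what a gap on the logit scale means, the sketch below evaluates the dichotomous Rasch model, in which the probability of endorsing an item depends only on the difference between person ability and item difficulty. This is a simplification for illustration: the questionnaire uses a five-point scale, so the actual analysis would rely on a polytomous Rasch model, and the function name is hypothetical. The input values are the logit measures reported in this section.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Dichotomous Rasch model: P(endorse) = exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


mean_person = 1.62    # person mean in logits
mean_item = 0.0       # item mean in logits (scale origin)
hardest_item = 1.84   # CR5
easiest_item = -1.04  # easiest item

# Endorsement probability for a teacher of average ability on each reference item
for label, delta in [("mean item", mean_item),
                     ("hardest item (CR5)", hardest_item),
                     ("easiest item", easiest_item)]:
    print(f"{label}: {rasch_probability(mean_person, delta):.2f}")
```

Under this simplified model, a teacher located at the person mean has a probability above 0.5 on an item of average difficulty and slightly below 0.5 on CR5, which agrees with the reading that the items are, on the whole, not difficult for this sample.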

Item's Fit and Dimensionality Analysis
Item fit and misfit analyses were also conducted. In Figure 4, the findings show that one item in the questionnaire did not fit the Rasch measurement model. Item no. 5 has a mean square (MNSQ) value of 3.68, which lies outside the infit range of 0.4 < x < 1.5 [24]. The remaining 25 items in the CS questionnaire fell within the acceptable range of 0.4 to 1.5. Furthermore, the items' dimensionality was inspected using the principal component analysis performed in Winsteps. Rasch measurement requires the measures to explain at least 40% of the raw variance, and the unexplained variance in the first contrast should not exceed 15% [28,29]. In Figure 5, the raw variance explained by the measures is 49.7%, slightly lower than the value expected by the model (51.7%). The unexplained variance of 6.7% was accepted because it is below the maximum value of 15%.
In addition, the communication validity, which represents the structure calibration calculated from the rating scale used by the instrument (e.g. a Likert scale), was examined. Rasch analysis helps to determine the validity of the scale by 'zero setting' and calibrating the rating scale used. Rasch analysis also verifies the probability of even spreading (i.e. equal intervals) between the specified scale points [30].
Based on the findings, the pattern of the observed responses started at -2.19 logits and increased in one direction to +3.16 logits. This shows that the pattern of the teachers' responses can be considered normal, since it increases from negative to positive values. In this analysis, the values of deviation between scale points 1 and 2, 2 and 3, and 3 and 4 are 4.0, 4.0 and 4.5, respectively. These results confirm the validity of the scale, indicating that teachers differentiated between the scale categories.
In this sense, teachers clearly understood the difference between the scale points and knew how to answer the questions by rating their answers on the given scale. This result confirms that the validity of the structure calibration is accepted, as the values of deviation are more than 1.4 and less than 5 (1.4 < s < 5) [23].
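The acceptance criteria quoted in this section can be summarised as three simple checks. The sketch below is illustrative only; the function names are hypothetical and the thresholds are the ones cited above (infit MNSQ between 0.4 and 1.5, at least 40% of raw variance explained, at most 15% unexplained variance in the first contrast, and deviation values between 1.4 and 5).

```python
from typing import Sequence


def infit_ok(mnsq: float, low: float = 0.4, high: float = 1.5) -> bool:
    """An item fits if its mean-square value lies strictly inside (low, high)."""
    return low < mnsq < high


def dimensionality_ok(explained_pct: float, first_contrast_pct: float,
                      min_explained: float = 40.0, max_contrast: float = 15.0) -> bool:
    """Unidimensionality check on raw variance explained and first-contrast variance."""
    return explained_pct >= min_explained and first_contrast_pct <= max_contrast


def calibration_ok(gaps: Sequence[float], low: float = 1.4, high: float = 5.0) -> bool:
    """Every gap between adjacent rating-scale calibrations must satisfy low < s < high."""
    return all(low < s < high for s in gaps)


# Values reported in this section
print(infit_ok(3.68))                   # False: item no. 5 misfits
print(dimensionality_ok(49.7, 6.7))     # True
print(calibration_ok([4.0, 4.0, 4.5]))  # True
```

Keeping the thresholds as default arguments makes it straightforward to rerun the same checks under a stricter convention, for example a narrower infit range.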

Discussion
To examine the psychometric properties of the collegial supervision practice items, labelled the Standard Framework of CS for Malaysian Secondary Schools (SFCSMSS), a total of 357 secondary teachers were asked to give feedback on the SFCSMSS items. In the first phase, the items were analysed quantitatively using Rasch analysis to determine the reliability of the items in the CS questionnaire. The teachers' feedback was then analysed to measure the items' internal consistency using principal component analysis, item-person distribution, and fit and dimensionality analyses. In answering the objectives of the study, Rasch measurement model analysis was used throughout: first for the reliability analysis, then for the principal component analysis and item-person distribution, and finally for the fit and dimensionality analyses.
In the reliability analysis, the Rasch analyses indicated that the SFCSMSS items have values that are acceptable and sufficient, with high consistency in measuring secondary teachers' collegial supervision practice. The results show that the SFCSMSS items can be considered acceptable measures of collegial practice among teachers in Malaysian secondary schools, and that teachers' collegial practice can be examined and investigated using these items. In the item analysis, items were classified into five classifications: too difficult, least difficult, answerable, easy, and too easy to answer. As for person separation, the Rasch analysis of the SFCSMSS items revealed four major classifications according to the teachers' demographics, from senior teachers to novice or less experienced teachers. In addition, the Rasch analysis showed the segregation of teachers according to how much they had been supervised by their school administrators: teachers who always received supervision, teachers who received a medium amount of supervision, teachers who received the least amount of supervision, and teachers who never received any type of supervision. The results imply that the person sample is large enough to confirm the item difficulty hierarchy of the instrument [24]. In sum, both reliability values indicate a sufficient sample for determining the item difficulty index of each item [23,25].
To address the second objective, which concerns the psychometric properties of the items, the analyses conducted comprised principal component analysis, item-person distribution, and inspection of the fit and dimensionality of the SFCSMSS items. In the Rasch item and person distribution, the 357 teachers were placed on the right side of the distribution map while all 26 items were plotted on the left side, both along the logit scale. A "logit" scale was used to express item difficulty on a linear scale that extends from negative infinity to positive infinity [26]. The item-person distribution map thus matched the items with teachers' abilities in answering them. On the logit scale, the results indicated that only two items were classified as difficult for teachers to respond to; the difficulty of these items exceeds teachers' abilities. In addition, four items are considered easy items. Thus, 20 of the 26 items are matched with teachers' abilities.
Additionally, a total of 11 items from the Rasch analysis were reported as matched with secondary teachers' abilities in answering the SFCSMSS items. Hence, the SFCSMSS items are assumed to match teachers' abilities, and it is assumed that teachers could understand the items well and answer all the questions correctly. An in-depth analysis of the items' descriptions shows that the most difficult items came mostly from the reflective inquiry construct, which has four such items. The other four constructs, namely collegial relations, teacher growth, teacher's province and teacher collaboration, have items that matched teachers' abilities as well as items below teachers' abilities, which are labelled easy items. In the item fit and misfit analysis, only one item was reported as falling outside the acceptable Rasch measurement range; the other 25 items are within the acceptable range. In the principal component analysis, the raw variance accounted for was also acceptable, which further indicates the internal consistency of the 26 SFCSMSS items.
We acknowledge the limitations of this study. Firstly, the study is limited to the feedback provided by 357 secondary teachers, who do not represent all secondary teachers in Malaysian schools. To generalise the findings, future studies should replicate this study with a larger sample in order to obtain the overall perceptions of secondary teachers about their practice of collegial supervision, including whether CS is considered beneficial to their instructional tasks and whether it enhances their competencies.
The second limitation relates to the items used in the questionnaire, which are considered very simple and were analysed with descriptive statistics to determine collegial practice within the secondary school setting. Future studies could replicate this study in other school contexts that also practise the collegial supervision approach, such as technical and vocational schools, religious-based schools, primary schools and even international schools.

Conclusions
Based on the comprehensive analysis using Winsteps, this study has established evidence that the items measuring secondary school CS practice exhibited acceptable values across the sampled secondary schools. From these analyses, teachers were classified according to their abilities in answering the items in the questionnaire, which reflected their performance in understanding and responding to the items. The Rasch analysis indicated a linkage between the items' difficulty and teachers' understanding across the 26 items in the questionnaire. Thus, Rasch analysis is considered a suitable method for measuring whether item difficulty matches teachers' understanding and the probability of teachers responding on the provided scale. As such, Rasch analysis potentially provides researchers with a mechanism for monitoring the categorisation of respondents.