Measuring University Students’ Perceived Self-efficacy in Science Communication in Middle and High Schools

Service learning typically involves university students in teaching and learning activities for middle and high school students; however, measurement of university students’ self-efficacy in science communication is still lacking. In this study, an instrument to measure university students’ perceived self-efficacy in communicating science to middle and high school students was developed and validated using a sample of 104 university students (19 graduate students and 85 undergraduate students). The rating scale Rasch model and the Winsteps computer program were used to analyze the students’ responses to the pilot and final revised instruments. The results revealed that the final revised instrument, which contains 20 items with four response categories, is well-targeted, and that measures from this instrument are reasonably valid and reliable. Issues associated with using the instrument are also discussed.


Introduction
In the US, there is a long history of involving university students in middle and high school science education. A good example is the NSF funding program called Graduate STEM (Science, Technology, Engineering and Mathematics) Fellows in K-12 Education (GK-12). Through interactions with teachers and students in middle and high schools, graduate STEM fellows improve their science communication and teaching skills while enriching STEM content and instruction for their partners. Over the years, the idea has been expanded to placing university students (graduate and undergraduate) in middle and high school classrooms in order to learn science communication, teaching skills, leadership, teamwork, and civic engagement. This form of university student learning has also been called service learning.
There is well-established evidence of the benefits of placing university students in middle and high school classrooms. For example, teachers involved in the GK-12 program have reported increased STEM content knowledge (e.g., Gamse et al. [1]), use of more effective pedagogical techniques [2], and greater access to STEM resources [3], to name just a few. For middle and high school students, in a recent evaluation of the GK-12 program [4], a majority of teachers indicated that the program had positive effects on their students' STEM knowledge and skills. STEM students working in middle and high school classrooms have reported gains as well. In another recent evaluation of the GK-12 program [5], a majority of current and former graduate students indicated that their GK-12 experience benefited their ability to conduct various activities requiring communication, teaching, and teamwork skills. A majority of their college faculty advisors also concurred that the GK-12 program helps their students develop skills in these areas.
While the benefits of GK-12 and similar service learning programs have been reported as described above, measurement of university students' gains using standardized measurement instruments is still lacking. Our study intends to fill this gap. It focuses on university students' perceived self-efficacy in science communication.

Self-efficacy
According to Bandura [15], self-efficacy is a person's belief in one's capabilities to organize and execute the courses of action required to produce certain attainments, and people will not attempt to do things if they do not believe they can produce certain results [15,16]. In other words, self-efficacy can affect the initiation of behavior, the amount of effort expended and the persistence of behavior in spite of challenges and negative experiences [17]. Other researchers have reached the same conclusion. Self-efficacy not only affects one's cognitive, motivational and affective processes [18] but also determines how the person approaches tasks and responds to set-backs and what the person will do with the skills and knowledge he/she has [19]. The more students succeed, the more they believe they can succeed (self-efficacy), and therefore, the more they do succeed [20].
Self-efficacy is context-specific rather than a stable trait. It is therefore thought to have a direct effect on performance in specific contexts. Self-efficacy judgment varies based on the level of skill and perseverance required to achieve a given task in a given context [17,21-23]. Ormrod [24] pointed out that, while self-efficacy is similar to self-concept or self-esteem, an important distinction is that self-efficacy is domain, task, or situation specific. Examples provided by Salas [20] are that a teacher may have a strong sense of self-efficacy in teaching mathematics but weaker self-efficacy in teaching English, or a student may have high self-efficacy when performing mathematics skills but low self-efficacy in language arts. Self-efficacy is related to perceived specific abilities rather than generalized self-beliefs [25]. Bursal and Yigit [26] proposed that self-efficacy beliefs should be extended to specific subject areas since they are context and subject matter dependent.
Over the past decades, many scholars have studied self-efficacy in educational settings and have found that self-efficacy has a great influence on teaching and learning processes (Armor et al., 1976) [22,26,28-38]. Bandura [39] pointed out that educational activities can influence a person's self-efficacy and, therefore, that these activities should utilize methods which can increase self-efficacy. Jones [18] found that self-efficacy can be developed through experience; for example, when one sees that someone like himself/herself succeeds following sustained effort, he/she will believe he/she can succeed too. Other research studies have demonstrated that, when training for a specific skill, high self-efficacy is positively correlated with performance [19,40]. Dellinger et al. [21] proposed that self-efficacy be represented in a causal model of interactions among self and society, with internal personal factors and the external environment as reciprocating factors. They argued that internal personal factors (cognitive, affective and biological events) and the external environment influence behaviors, while the environment is impacted by behaviors and personal factors, and personal factors are impacted by behaviors and the environment.
In summary, communication is essentially as much a matter of listening as it is of talking and, to be effective, each party must have some understanding of the other: "To be effective with any audience, communication must be an interactive process…" [43]. In order to engage the audience, science communicators must identify the audience's preconceptions or alternative conceptions of science. The process of participation and engagement in science is a contextual one [44].
Accordingly, in this study, university students' perceived self-efficacy in science communication was defined as university students' beliefs in their capabilities to help middle and high school students understand science. In our study, science communication is not just about university students' knowledge and understanding of science; it is also about their knowledge of their audience, namely middle and high school students. Specifically, we intend to develop a standardized instrument for measuring university students' perceived self-efficacy in communicating science to middle and high school students. The specific research questions are:
1. What is the validity evidence for supporting the use of the measurement instrument to measure university students' perceived self-efficacy in communicating science to middle and high school students?
2. What is the reliability evidence for supporting the use of the measurement instrument to measure university students' perceived self-efficacy in communicating science to middle and high school students?

Participants
The participants were 87 university students, including 68 undergraduate students, one master's student, and 18 doctoral students, who took part in an NSF-funded project over three years (2011-2013). Most of them were in STEM fields (i.e., biological science, chemistry, geological and earth sciences, geography). Due to IRB protocol, no information on students' ages, gender, racial identities, etc. was collected. These students were assigned to go to local middle and high schools every week to work with students and teachers in science by engaging in such activities as assisting teachers in teaching lessons and finding relevant resources, helping students understand science, and leading small group activities with students in or after class. These students completed the pilot instrument after they had completed at least one semester of placement in middle and high schools between 2011 and 2013. Seventeen additional undergraduate students completed the revised instrument after they had completed their placement in local middle and high schools in December 2013.

Procedure
The development of the instrument for measuring university students' perceived self-efficacy in science communication followed a construct modeling approach [45,46]. The construct modeling approach to developing a measurement instrument starts with a clearly defined construct, which "precipitates an idea or a concept that is the theoretical object of our interest in the respondent…" [46], operationalized by progress variables. Assessment tasks are then derived from the defined progress variables, and data collected from pilot-testing and field-testing are used to examine the fit between the progress variables and data using Rasch modeling [47,48].
In our study, the construct of science communication self-efficacy was defined as the university students' beliefs in their capabilities to help middle and high school students understand science. We used a Likert-scale [49] type question format. Using response scales to collect attitude data has a long history in science education. For each Likert-scale item, respondents are asked to specify their level of agreement with a given statement, usually expressed in a format such as: strongly disagree, disagree, neutral, agree, strongly agree [47]. The pilot measurement instrument contained 20 items with five response categories to describe respondents' levels of self-efficacy in communicating science. Response categories were coded as 1 through 5 on an ordinal scale: 1-Nothing, 2-Very Little, 3-Some Influence, 4-Quite a Bit, and 5-A Great Deal. The items related to three major aspects of the progress variable on science communication to middle and high school students: understanding students, developing science content, and explaining the content.

Data Analysis
Student responses to the 20-item pilot measurement instrument were analyzed using the rating scale Rasch model [50]. In the past 30 years, Rasch measurement has been increasingly used in a wide variety of disciplines [51], and is becoming the convention for developing quality measurement instruments in all social sciences [52]. As a one-parameter logistic model within item response theory (IRT), the Rasch model provides information on construct validity through fit statistics [52]. When there is good model-data fit, measures produced by the instrument are on an interval scale. These interval scale measures have precise measurement errors for both individual items and subjects, allowing inferential statistical analyses to be conducted with more power. Compared with classical test theory (CTT), Rasch models have several advantages [53]. For example, while CTT analyses attach less importance to the functioning of specific items [54], Rasch analyses can identify poor patterns of item and person performance, i.e., inform how well the data fit the model, and detect weak, biased, or redundant items [55,56]. Embretson and Reise [57] also state that IRT models have advantages over the CTT model, including: (a) an IRT trait level estimate can be derived from any items for which properties are known, (b) item properties are directly linked to test behaviors, and (c) the independent variables, trait level and item properties, can be estimated separately without additional data (p. 61).
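To make the rating scale model concrete, the following is a minimal sketch of how category probabilities can be computed for one person and one item under the standard Andrich formulation, in which all items share one set of threshold parameters. The function name and the parameter values are illustrative assumptions, not estimates from this study, and the sketch does not reproduce how Winsteps estimates the parameters.

```python
import math

def rating_scale_probabilities(theta, delta, taus):
    """Category probabilities under the Andrich rating scale model.

    theta: person measure in logits
    delta: item difficulty in logits
    taus:  threshold parameters shared by all items, one per step
           between adjacent categories
    Returns probabilities for categories 0..len(taus).
    """
    # Cumulative sums of (theta - delta - tau_j); the lowest category
    # corresponds to an empty sum, i.e. 0.
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values chosen only for illustration.
probs = rating_scale_probabilities(theta=0.5, delta=0.0, taus=[-1.5, 0.0, 1.5])
```

With three thresholds the function returns four probabilities that sum to one, matching a four-category scale. A person whose measure exceeds the item difficulty is more likely to choose the higher categories.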
In terms of reliability, we used the person separation index and the item separation index provided by Winsteps to evaluate the reliability of measures. The person separation index is an estimate of the adjusted person standard deviation divided by the average measurement error, and indicates how well the instrument can discriminate persons on the measured variable. The item separation index estimates, in standard error units, the spread of items along the measurement construct [58]. A separation index greater than two is considered adequate [47].
In regard to the substantive aspect of validity, our evaluation of the instrument focused on item quality as proposed in Liu and Boone's [51] framework of validity evidence. According to Liu and Boone [51], "if assessment data fit the Rasch model well, then there is evidence to claim that the originally hypothesized dimension or construct exists, and is assessed by the instrument, thus providing evidence for content and construct validity" [51]. We examined item quality indices (i.e., the mean square residual and the standardized mean square residual) for each item from the rating scale model as implemented in the Winsteps computer program [59]. The mean square residual (MNSQ) and the standardized mean square residual (ZSTD) are typically used as fit indicators to examine how well each item accords with the unidimensional Rasch model. Item MNSQ has an expected value of 1.0 and a range from zero to infinity. Mean-squares greater than 1.0 indicate that the data are less predictable than the model expects (underfit); e.g., a mean-square of 1.4 indicates that there is 40% more randomness in the data than modeled. Mean-squares less than 1.0 indicate that the data fit better than expected (overfit); e.g., a mean-square of 0.6 indicates a 40% deficiency in Rasch-model-predicted randomness, which implies 100*(1-0.6)/0.6 = 67% more ambiguity in the inferred measure than modeled (high discrimination). Based on Linacre's suggestion (Linacre, 2010), items fit the model when their MNSQs fall within the range of 0.6 to 1.4 (for rating scales). ZSTD values are within the range of -2 to +2 (Liu, 2010) when there is a good fit; a positive z-residual indicates that responses are worse than expected, and a negative z-residual indicates that responses are better than expected [60]. Item-measure correlations (point-measure correlation/PTMEA) were also examined in this study; a zero or negative point-measure correlation indicates a rating scale with reversed direction [61].
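The fit indices described above can be sketched as follows, assuming the model-expected score and model variance of each response are already available from a fitted Rasch model. The function names and the example values are hypothetical; the formulas follow the usual definitions of outfit (an unweighted mean of squared standardized residuals) and infit (an information-weighted mean square), and the 0.6 to 1.4 range is the one adopted in this study.

```python
def fit_mean_squares(observed, expected, variance):
    """Infit and outfit mean squares for one item.

    observed: responses given by each person to the item
    expected: model-expected response for each person
    variance: model variance of each response
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variance)) / len(observed)
    # Infit: summed squared residuals divided by summed model variance.
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit

def classify_mnsq(mnsq, low=0.6, high=1.4):
    """Label a mean square against the acceptance range used in the text."""
    if mnsq > high:
        return "underfit"
    if mnsq < low:
        return "overfit"
    return "acceptable"

# Hypothetical responses for two persons, for illustration only.
infit, outfit = fit_mean_squares([1, 0], [0.5, 0.5], [0.25, 0.25])
labels = [classify_mnsq(v) for v in (0.5, 1.0, 1.5)]
```

In this toy case the observed residuals match the modeled variance exactly, so both mean squares equal the expected value of 1.0.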
The Rasch model constructs a one-dimensional measurement system regardless of the fact that empirical data always reflect more than one latent dimension [62]. In this study, PCA (principal component analysis) was applied to standardized residuals to identify possible additional dimensions in the scale [63]. A variance greater than or equal to 50% for the Rasch dimension can be considered good [64], and scale unidimensionality can also be assumed if the second dimension (first contrast) has the strength of fewer than 3 items (in terms of eigenvalues) and the variance unexplained by the first contrast is less than 5% [63]. However, there is no agreement on criteria for establishing the existence of a secondary dimension when working with standardized residual-based PCA [61,65-69].
We also used Rasch analyses to verify and improve the functioning of the rating scale categorization [70], because how effectively an instrument's rating scale structure represents a construct is a substantive aspect of validity evidence [71], and an effective structure increases the accuracy and precision of the resulting measures, the likelihood of measure stability, and the soundness of related inferences for future samples [70,72].

Pilot-study
Item and Person Separation and Reliability
Based on the analysis of the pilot instrument, item separation was 3.33 (reliability = 0.92) and person separation was 2.56 (reliability = 0.87); both were acceptable. The means of the infit mean squares (MNSQ), at 1.01, and the outfit mean squares (MNSQ), at 0.99, were very close to the expected value of one. The mean infit ZSTD and outfit ZSTD were both inside the conventionally acceptable range of -2 to +2.
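The separation and reliability values reported here are linked by a standard Rasch relationship, reliability = G^2 / (1 + G^2), where G is the separation index. The short sketch below applies that relationship to the pilot values as a consistency check; the function name is illustrative.

```python
def separation_to_reliability(separation):
    """Convert a Rasch separation index G into a reliability coefficient."""
    g_squared = separation ** 2
    return g_squared / (1.0 + g_squared)

# Pilot-study separation indices reported in the text.
item_reliability = round(separation_to_reliability(3.33), 2)    # 0.92
person_reliability = round(separation_to_reliability(2.56), 2)  # 0.87
```

Both results agree with the reliabilities reported for the pilot instrument. Under the same relationship, the adequacy criterion of a separation index greater than two corresponds to a reliability above 0.80.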

Person Ability and Item Difficulty Measures
The Wright map of items and subjects (Figure 2) shows that students' self-efficacy had a wide range of variation (person ability measures ranged from -1.34 to 3.33 logits). However, the item difficulty measures ranged from -0.68 to 0.84 logits, a narrower range than that of the person ability measures. Most items clustered along the middle to lower end of the subjects' communication efficacy range, and no item was available for subjects with higher science communication efficacy. There were three items at the low end of the scale, and only one student fell below them, suggesting that those three items needed to be improved or removed. Item 16, "Lead small group activities/discussions with students after school or during weekends" (0.84 logits), item 14, "Facilitate out-of-school science learning activities" (0.75 logits), item 10, "Develop out-of-school science learning activities" (0.74 logits), and item 19, "Tutor students after school or during weekends" (0.72 logits) were the four hardest items to endorse, indicating that respondents felt relatively less self-efficacy in explaining science content on weekends or out of school. These findings suggested that the items of the pilot instrument as a whole were relatively easy for the respondents; thus, more difficult items needed to be added for students with higher efficacy.

Fit Statistics for Items
The purpose of the fit statistics is to aid in measurement quality control, that is, to identify which data meet Rasch model specifications and which do not (RMT http://www.rasch.org/rmt/rmt103a.htm). Inspection of the fit statistics for all pilot 20 items (seen in Table 1

Unidimensionality
From Table 1, we see that the factor loadings of the 20 items ranged from -0.56 to 0.68. Items 10, 11, 12, 13, 15, 17, and 18 had contrast loadings over 0.40, and items 3, 4, 6, and 7 produced factor loadings of less than -0.50, suggesting that they might measure an additional dimension. The Rasch measures accounted for 39.1% of the total variance, below the expected norm. The first component had an eigenvalue of 3.5, representing 17.6% of the total variance. The eigenvalues of components 2 to 5 were 3.2, 2.8, 1.9, and 1.2, respectively, and the proportions of total variance accounted for by components 2 to 5 were 9.9%, 8.4%, 5.7%, and 3.7%, indicating that the unidimensionality of the items was not ideal.

Rating Scale Category Structure
Our item category frequencies had a good spread, meeting expectations [74]. The measure for category 1 was -2.80, meaning that the average agreeability estimate for persons answering 1 across all items was -2.80 logits. For categories 2, 3, 4, and 5, the category agreeability estimates were -1.30 logits, -0.13 logits, 1.25 logits, and 3.10 logits, respectively. The estimates thus increased monotonically with category, meeting the requirement of the rating scale design.
The step calibrations of the 20 items increased monotonically by 0.51 logits, 1.30 logits, and 1.37 logits; however, the category thresholds for category 2 ("very little") and category 3 ("some influence") were too close for respondents to differentiate, suggesting that the respondents did not reliably distinguish between the two categories [58].

Item and Instrument Revisions
Based on the Rasch analysis results for the pilot-study instrument, a number of improvements were made to the instrument. Specifically, in order to accurately measure the perceived self-efficacy in science communication of the university students with the highest ability levels, we added four new items: new item 17, "Explain a difficult science concept to students"; new item 18, "Explain current research to teachers"; new item 19, "Facilitate student learning in museums"; and new item 20, "Explain science to parents".
Four pilot items had similar measures and more or less poor fit indicators: the item "Understand middle and high school students' science background knowledge" (-0.41 logits, infit ZSTD = -2.4 and outfit ZSTD = -2.3), the item "Understand middle and high school students' interest in science" (-0.43 logits, infit ZSTD = -4.2 and outfit ZSTD = -4.1, infit MNSQ = 0.47 and outfit MNSQ = 0.49), the item "Understand middle and high school students' social and cultural backgrounds" (-0.34 logits), and the item "Understand middle and high school students' attention span" (-0.53 logits, infit ZSTD = 2.5 and outfit ZSTD = 2.9, infit MNSQ = 1.43 and outfit MNSQ = 1.51). Based on this analysis, and considering that "Understand middle and high school students' social and cultural backgrounds" and "Understand middle and high school students' attention span" (one of the easiest items to endorse in the pilot study) may be less related to the measured construct, we removed these two items from the instrument. We also removed the item "Lead small group activities/discussions with students after school or during weekends" (infit ZSTD = 2.2 and outfit ZSTD = 2.2) and the item "Tutor students after school or during weekends" (infit ZSTD = 2.8 and outfit ZSTD = 2.8, infit MNSQ = 1.44, outfit MNSQ = 1.45), which were the hardest items to endorse in the pilot study, because they not only fit the model poorly but also pertained to "weekends" activities that were not central to the measured construct.
According to Linacre [70], "For a five category rating scale, advances of at least 1.0 logits between step calibrations are needed in order for that scale to be equivalent to four dichotomies…when the advance is less than 1.0 logits …redefining the categories to have wider substantive meaning or combining categories may be indicated" [70]. Other researchers also report that collapsing one or two categories will increase the test reliability [72,74]; e.g., Stone and Wright [75] found in their survey of perceived fear that combining five categories into three increased the test reliability. Therefore, we collapsed the rating scale categories from five to four. The new categories became: 1-Little, 2-Some, 3-Quite a bit, and 4-A Great Deal.
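Linacre's guideline can be applied mechanically, as in the brief sketch below. It assumes that the three pilot values reported earlier (0.51, 1.30, and 1.37 logits) are the advances between successive step calibrations; the function name is hypothetical.

```python
def narrow_step_advances(advances, min_advance=1.0):
    """Return (position, advance) pairs that fall below the guideline.

    advances:    distances in logits between successive step calibrations
    min_advance: minimum advance suggested for a five-category scale
    """
    return [(i, a) for i, a in enumerate(advances, start=1) if a < min_advance]

# Advances reported for the five-category pilot rating scale.
pilot_advances = [0.51, 1.30, 1.37]
flagged = narrow_step_advances(pilot_advances)
```

Under this reading, only the first advance falls short of 1.0 logits, which is consistent with the decision to combine categories at the lower end of the scale.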

Field-testing
The revised instrument again included 20 items, which were then completed by 17 university students. Responses from these 17 university students were combined with the responses from the 87 university students in the pilot study after the pilot responses were recoded as follows: 1 was coded as 1, 2 and 3 were coded as 2, 4 as 3, and 5 as 4. The combined responses were then submitted to Rasch analysis again. The findings reported next are based on this analysis.
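The recoding described above amounts to a simple lookup, sketched here under the assumption that pilot responses are stored as one list of integer codes per respondent, with None marking a missing answer. The names are illustrative, and matching pilot items to revised items is a separate step not shown here.

```python
# Mapping from five pilot categories to four revised categories,
# following the recoding stated in the text.
RECODE = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4}

def recode_pilot_responses(responses):
    """Recode five-category pilot responses into the four-category scale.

    responses: list of rows, one per respondent, each a list of codes
               from 1 to 5, or None for a missing response
    """
    return [
        [RECODE[code] if code is not None else None for code in row]
        for row in responses
    ]

# Hypothetical responses from two pilot respondents.
recoded = recode_pilot_responses([[1, 2, 3, 4, 5], [5, None, 3, 2, 1]])
```

After recoding, the pilot data use the same four codes as the revised instrument, so both sets of responses can be analyzed together.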
For the revised instrument, the person separation index was 2.77, with an equivalent Cronbach's reliability coefficient (α value) of 0.88. The item separation index was 2.94, and its corresponding Cronbach's α value was 0.90, indicating reliable item and person estimation. Rasch measurement also produces a standard error of measurement (SEM) as an additional measure of reliability for each individual person and item measure. Persons and items with measures closer to their means have smaller SEMs than those further from the means. SEM values for persons and items were small, ranging from 0.14 to 0.33.
Figure 3 presents the Wright map of the revised instrument. University students' perceived self-efficacy measures have a wider range of variation, from -2.33 logits to 5.92 logits, and the revised item measures also have a wider range, from -0.97 logits to 1.23 logits. The two most difficult items (item 20 and item 19) were new items (1.23 logits and 1.12 logits), and item 17 (0.39 logits) and item 18 (0.60 logits) were both above the mean of the items, indicating that the four new items were relatively difficult, as intended. However, there was still one gap located near two standard deviations from the mean of the items; fifteen university students had a lower perceived self-efficacy than any item could assess. Another gap existed at the top of the continuum, where 14 university students with higher perceived self-efficacy were located.
Table 2 presents fit statistics for the final 20 items in the revised instrument. Infit MNSQs ranged from 0.65 to 1.29, whereas outfit MNSQs ranged from 0.69 to 1.31; both ranges were regarded as acceptable. Infit ZSTDs and outfit ZSTDs all ranged from -2.0 to +2.0, with the exception of item 2 (infit ZSTD = -3.0 and outfit ZSTD = -2.5) and item 6 (infit ZSTD = 1.8 and outfit ZSTD = 2.2). All the items exhibited strong positive point-measure correlations (PTMEA), ranging from 0.50 to 0.70.
Measures from the revised instrument accounted for 43.9% of the total variance, which is 4.8 percentage points higher than for the pilot instrument but still below the expected norm. The first component had an eigenvalue of 3.2, representing 16.0% of the total variance. The eigenvalues of components 2 to 5 were 2.6, 2.1, 1.7, and 1.5, respectively, and the proportions of total variance accounted for by components 2 to 5 were 9.9%, 8.4%, 5.7%, and 3.7%, indicating that the unidimensionality of the items was not ideal. From Table 2, we see that the factor loadings of the 20 items ranged from -0.62 to 0.67. Items 9, 10, 11, and 14 had contrast loadings over 0.40, and items 3 and 5 produced factor loadings of less than -0.50, suggesting that they still might measure an additional dimension.
Table 3 presents the category structure statistics. As shown in Table 3, with four categories instead of five, each category count satisfied the criterion of a minimum of 10 observations [70]. The average category measures were ordered and increased monotonically from -1.01 logits to 1.60 logits. The outfit MNSQ ranged from 0.96 to 1.02, indicating expected category usage [70]. In addition, the category threshold calibrations increased monotonically with category, and the distances between them were all more than 1.1 logits, meeting the guidelines given by Linacre [70]. Inspecting the category probability curves (see Figure 4), we see that each category represented a distinct region of the underlying construct; thus, collapsing categories 1 and 2 had indeed improved our rating scale diagnostics.

Discussion
The purpose of this study was to develop a standardized instrument for measuring university students' perceived self-efficacy in communicating science. In order to evaluate the validity and reliability of the instrument, we first conducted a pilot study and examined the person separation index and item separation index, person ability and item difficulty measures, item quality indices, unidimensionality, and the functioning of rating scale categorization of the pilot instrument using Rasch model analysis. Then, based on the analyses of the results, we improved the pilot instrument and examined the revised instrument.
Based on the findings presented above, our revised instrument appeared to be highly reliable, as indicated by the Rasch reliability statistics. Overall, items fit the Rasch model well, suggesting that there is evidence for the construct validity of the revised instrument measures. Examination of the person-item map revealed that the revised instrument's item difficulty measures covered a wider range than before, but that this range was still narrower than the range of person ability measures, with an absence of items at the high end of the scale, suggesting that more difficult items need to be added.
Item 20, "Explain science to parents" (1.23 logits) and item 19, "Facilitate student learning in museums" (1.12 logits) ranked at the top of the difficulty scale. Parents have a strong influence on children's development; they not only influence children's in-school achievement but also make decisions about children's out-of-school activities [76]. Communicating science with parents is a good way to help their children better understand science, but it may also be difficult, requiring years of teaching experience and skill. Because most university students have little teaching experience, it can be expected that they feel relatively less self-efficacy on "Explain science to parents". Middle and high school students enjoy visits to museums and can benefit greatly from them; yet, in order to maximize the benefits of a visit, more work needs to be done by teachers. For example, Carr [77] provided some useful guidelines for facilitating student learning in museums: (a) children in museums should have opportunities to interpret open questions about the meaning of evidence. (b) children in museums should have opportunities to construct knowledge, rather than receive it. (c) children in museums should have sustained encounters with process, ambiguity, collaboration, and mystery, encounters leading to grounded knowledge of how thinking happens... [77]. Therefore, in order to "Facilitate student learning in museums", one needs to think carefully about what a museum, especially a science museum, has to offer and how it can be related to what students are learning about science. This may also explain why most of our respondents, as university students, had less efficacy on this item. According to Bandura [40], self-efficacy beliefs come from four main sources: (1) mastery experiences, (2) vicarious experiences, (3) verbal persuasion, and (4) physiological indexes. Among these sources, mastery experiences were claimed to be the most important source of self-efficacy; the results above are consistent with this claim.
Although the unidimensionality of the revised instrument is still less than ideal, it is common in the literature involving Rasch analysis for the reported variance accounted for by Rasch measures based on PCA to be less than 50% [63,79,80] (Cervellione et al., 2009), and the 43.9% of variance accounted for by Rasch measures in this study is reasonable and not unusual.
After two categories were collapsed, the four categories provided better functioning of the scale. The preferred number of response categories on a Likert scale has been much discussed in the research methodology literature [82]. Some researchers have argued that in a five-category Likert scale there is a middle box representing a neutral category, and that respondents tend to choose the middle box for various reasons [83]. Since in an attitude Likert scale we do not know exactly where attitudes turn from slightly positive to slightly negative [81], it may be better to use an even number of categories for attitude scales.
Overall, despite the aforementioned limitations, the results suggest that the revised 20-item instrument of university students' perceived self-efficacy in science communication is well-targeted at university students. Measures from this instrument are reasonably valid and reliable, and are thus appropriate for assessing university students' perceived self-efficacy in science communication.