Surgical Theatre (Operating Room) Measure STEEM (OREEM) Scoring Overestimates Educational Environment: the 1-to-L Bias

The surgical theatre educational environment measures STEEM, OREEM and mini-STEEM for students (student-STEEM) contain a hitherto disregarded systematic overestimation (OE) caused by an inaccurate percentage calculation. The aim of the present study was to investigate the magnitude of this systematic bias and to suggest a correction for it. After an initial theoretical exploration of the problem, published scores were retrieved from the literature and corrected using statistical theorems. Overestimations (the differences between pseudo-percentages and real percentages) were plotted against real percentages. The reported STEEM overall mean score of 74.4% (pseudo-percentage) was corrected to 67.9% (real percentage), thus eliminating the 6.4% OE. The corresponding figures for OREEM and student-STEEM are 73.6%, 67.0%, 6.6% and 69.1%, 61.4%, 7.7% respectively. A total of 45 overestimated scores were retrieved and corrected. The OE (range 2.8 to 13.6%, mean 7.3%) showed a perfect (r = -1, p < 0.001) negative linear relation with the real percentages (RP), namely OE = 20 - 0.2*RP. No uncorrected score can be less than 20%. The non-0-based 1-to-5 coding overestimates STEEM, OREEM and student-STEEM educational environment scores when they are expressed as percentages, owing to the '1-to-5 bias', or rather the 1-to-L bias, where L denotes the number of points (response options) in the Likert scale. The worse the educational environment, the greater the overestimation, which reduces the instruments' usefulness exactly when alarm bells should be ringing. Hence, question coding should always be zero (0) based, as proposed by Likert. The 1-to-L bias applies to any questionnaire in any field of research.


Introduction-Theoretical Exploration
Many questionnaires assessing the educational environment, as perceived by the participants, have been developed: the DREEM for undergraduates [1,2], the PHEEM for hospital-based junior doctors [3], the ATEEM for anesthetists in the surgical theatre [4], the STEEM [5] and OREEM [6] for surgeons in the surgical theatre / operating room, and the mini-STEEM [7], a short version of STEEM for undergraduates, hereinafter referred to as student-STEEM or sSTEEM. DREEM, PHEEM and ATEEM use a five-point 0-to-4 Likert scale to code individual questions. In contrast, the other three (STEEM, OREEM and sSTEEM) use a five-point 1-to-5 scale. This raises a problem when the scores are expressed as percentages: it introduces error into the assessment by overestimating the quality of the educational environment, especially when it is (very) poor.
DREEM consists of 50 questions, each scored 0-to-4 (strongly disagree to strongly agree), thus giving an overall score range of 0-to-200 (50*0 to 50*4). PHEEM and ATEEM consist of 40 questions, each scored 0-4, giving an overall score range of 0-160 (40*0 to 40*4). STEEM and OREEM also consist of 40 questions, but these are scored 1-5, giving an overall score range of 40-200 (40*1 to 40*5), and sSTEEM consists of 13 questions, each scored 1-5, giving an overall score range of 13-65 (13*1 to 13*5). Each of the inventories is divided into a different number of subscales containing different numbers of questions, which gives many subscale score ranges. To interpret a score obtained after administering any of the instruments, the score range is usually divided into four equal zones, the lowest of which represents a very poor educational environment, the second a poor, the third a good, and the fourth a very good one [2]. However, one then has to remember all these score ranges, interpretation zones and cut-points.
The reported 'percentages' (74.4% for STEEM, 73.6% for OREEM and 69.1% for sSTEEM) are obviously the quotients of the corresponding divisions: 148.7/200 = 0.7435, 147.2/200 = 0.736, and 44.9/65 = 0.6908. That is, the actual overall mean scores (OMSs) were divided by the upper limit of the corresponding range, overlooking that its lower limit was not zero, but 40 in STEEM and OREEM and 13 in sSTEEM. However, a true percentage equals the OMS divided by the upper limit if and only if the lower limit is zero. Otherwise, the quotient is a pseudo-percentage, not a percentage. The expression "148.7/200 (74.4%)" is quite misleading; more accurately, it should be "148.7 in the range 40-200 (or 74.4 in the range 20-100)", but not 74.4%. Anybody seeing "148.7" or any other number automatically understands a point within a range from 0 to an upper limit, and anybody seeing a percentage automatically understands a point within the standard range 0-100%. However, 74.4% is a point within the 20-100 range, i.e., it is not really a percentage, and this causes the problem (Figure 1). Using the reported STEEM OMS as an example, the first two graphs in Figure 1 clarify what has been reported, while the next two clarify what should be reported.
The problem originated when the 0-based 0-4 coding was moved to the right by just 1 point (1 instead of 0, 2 instead of 1, etc.) and the 1-5 range was obtained. Then, on adding 40 questions to produce the 40-question overall score, the 0-160 range was moved by 40 points and the 40-200 range was obtained. The consequences are explored in Table 1, using five different scenarios in a STEEM administration to a single participant (or a group of participants) who selects exclusively the same option in all forty questions: 'strongly disagree' (scenario A), 'disagree' (B), 'uncertain' (C), 'agree' (D) or 'strongly agree' (E).
Coding the answers 1-5, as the papers in question did, the 40-based actual OMS is 40 (scenario A), 80 (B), 120 (C), 160 (D) and 200 (E). Dividing them by the maximum possible score (200), the supposedly standard but in reality pseudostandard OMS becomes 20%, 40%, 60%, 80% and 100% respectively (Table 1). Obviously, no such OMS can be less than 20%, the score for the worst imaginable environment, where the participant chose 'strongly disagree' in all questions (scenario A). The score range is not 0%-100%, but only 20%-100%: the worst fifth has been cut off, i.e., any non-0-based 'standard' OMS is a pseudostandard.
Coding the answers 0-4, as DREEM, PHEEM and ATEEM did (and STEEM, OREEM and sSTEEM should), the initial non-standard OMS becomes 0, 40, 80, 120 and 160 respectively, and the corresponding (real) standard OMS 0%, 25%, 50%, 75% and 100%, ranging over a real percent scale. The differences between pseudostandard and standard OMS (namely, 20%, 15%, 10%, 5% and 0% respectively) are the corresponding overestimations, decreasing from 20% to 0% as the standard OMS increases from 0% to 100%, at a constant rate of -0.2, the minus sign indicating their inverse relation: the greater the standard OMS, the smaller the overestimation, and vice versa. The maximum overestimation is 20% and the minimum 0%, in the worst and the best imaginable environment respectively (scenarios A and E, with standard OMS 0 and 100). In other words:

$$OE = 20 - 0.2 \times M_{0\%} \qquad \{1\}$$

where OE is the overestimation and M0% the real standard mean score, both in percentage points.
Having revealed and explained the overestimation introduced by the 1-to-5 bias, our aim was to correct all reported scores in the original papers and explore the degree of over-appraisal.
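The five scenarios described above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch for a 40-question, five-point instrument; the function and variable names are our own and do not come from the original papers.

```python
# Sketch of the five single-option scenarios (A-E) for a 40-question,
# 5-point instrument such as STEEM or OREEM. Names are illustrative only.
Q = 40  # number of questions
L = 5   # number of points on the Likert scale


def scenario_scores(option_index):
    """option_index: 0 = 'strongly disagree' ... 4 = 'strongly agree'."""
    score_1_based = Q * (option_index + 1)          # 1-to-5 coding, range 40-200
    score_0_based = Q * option_index                # 0-to-4 coding, range 0-160
    pseudo_pct = 100 * score_1_based / (Q * L)      # divided by upper limit 200
    real_pct = 100 * score_0_based / (Q * (L - 1))  # divided by upper limit 160
    return pseudo_pct, real_pct, pseudo_pct - real_pct


for label, k in zip("ABCDE", range(L)):
    pseudo, real, oe = scenario_scores(k)
    print(f"{label}: pseudostandard={pseudo:.0f}%  standard={real:.0f}%  OE={oe:.0f}%")
```

Each printed overestimation equals 20 - 0.2 times the standard score, which is the linear relation stated in the text.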

Correction
All overall, subscale and question pseudostandard scores anywhere in the original papers [5-7,10] were retrieved (Table 2) and corrected, using the following statistical theorems [11]:

$$M(C+X) = C + M(X) \qquad \{2\}$$
$$SD(C+X) = SD(X) \qquad \{3\}$$
$$M(CX) = C \cdot M(X) \qquad \{4\}$$
$$SD(CX) = |C| \cdot SD(X) \qquad \{5\}$$

That is, if a constant C is added to all individual values of a variable X, the mean (M) of the new variable C+X equals the mean of variable X plus C {2}, while the standard deviation (SD) of C+X equals the SD of X {3}. And if all individual values of a variable X are multiplied by a constant C, the mean of variable CX equals the mean of variable X multiplied by C {4}, while the SD of CX equals the SD of X multiplied by the absolute value of C {5}.
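The four theorems can be checked numerically on any set of responses. The sketch below uses a small hypothetical sample of 1-to-5 coded answers (invented for illustration, not data from the papers) and Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical 1-to-5 coded responses; illustrative only, not study data.
x = [1, 2, 2, 3, 4, 4, 5, 5]

c_shift = -1   # shifts 1-to-5 coding to 0-to-4 coding
c_scale = 25   # rescales a single 0-to-4 question to the 0-100 range

shifted = [c_shift + v for v in x]
scaled = [c_scale * v for v in shifted]

# Adding a constant shifts the mean by that constant and leaves the SD unchanged.
mean_shift_ok = math.isclose(statistics.mean(shifted), statistics.mean(x) + c_shift)
sd_shift_ok = math.isclose(statistics.stdev(shifted), statistics.stdev(x))

# Multiplying by a constant multiplies the mean by C and the SD by |C|.
mean_scale_ok = math.isclose(statistics.mean(scaled), c_scale * statistics.mean(shifted))
sd_scale_ok = math.isclose(statistics.stdev(scaled), abs(c_scale) * statistics.stdev(shifted))
```

All four flags evaluate to true for this sample; the same holds for any other list of numbers.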
The first two theorems were used to rescale the non-0-based (40-based, 13-based, 1-based etc.) reported mean scores and standard deviations to 0-based values, where C = -Q (Q being the number of questions per scale or subscale): subtracting Q from the reported non-0-based actual scores gives the 0-based actual scores by theorem {2}, while the 0-based standard deviations equal the reported non-0-based standard deviations by theorem {3}. The next two theorems were used to transform these 0-based values to the standard range 0-100, where C = 100/(4Q) = 25/Q (see notes in Table 2 for details): multiplying the 0-based values by 25/Q gives their equivalents in the standard range 0-100 by theorems {4} and {5}. Finally, the coefficient of variation (CV = SD/M), which is unchanged when all values are multiplied by a constant, was used to verify our transformations. These theorems are presented graphically in Figure 1.
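The correction procedure described above amounts to a shift followed by a rescaling. A minimal sketch follows, assuming 1-to-L coding throughout; the function name `correct_score` and its parameter names are hypothetical, chosen here for illustration.

```python
def correct_score(mean_n, q, sd_n=None, points=5):
    """Convert a reported 1-to-L coded mean (and optional SD) to the 0-100 scale.

    mean_n : reported non-0-based mean score (range q .. q * points)
    q      : number of questions in the scale or subscale
    sd_n   : reported non-0-based standard deviation, if available
    points : number of points L on the Likert scale
    """
    mean_0 = mean_n - q                      # shift by C = -q (SD is unaffected)
    factor = 100 / (q * (points - 1))        # equals 25/q when points == 5
    real_pct = mean_0 * factor               # rescale the mean by C = factor
    sd_pct = None if sd_n is None else sd_n * abs(factor)  # rescale the SD by |C|
    pseudo_pct = 100 * mean_n / (q * points)  # what the papers reported
    return {
        "real_pct": real_pct,
        "sd_pct": sd_pct,
        "pseudo_pct": pseudo_pct,
        "oe": pseudo_pct - real_pct,
    }


# Overall mean scores reported in the original papers.
steem = correct_score(148.7, q=40)
oreem = correct_score(147.2, q=40)
ssteem = correct_score(44.9, q=13, sd_n=7.1)
```

For STEEM this gives a corrected score of 67.9375 against a reported 74.35, i.e., the overestimation of about 6.4 percentage points discussed in the text.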

Overestimation and Interpretation
The overestimation was calculated as the difference between reported non-0-based pseudostandard mean scores and corrected 0-based standard mean scores, and regressed against corrected standard mean scores. Dividing the standard scale 0-100 into four equal zones (0-24.9, 25-49.9, 50-74.9, 75-100) [2], we compared the distribution of the pseudostandard and the real standard mean scores in these zones. The same distribution of 33 DREEM standard overall mean scores from a recent review [12] was also compared to both.

Notes to Table 2.
Abbreviations. In the first column: Si / Qi = the subscale / question i, i = 1, 2, 3, ... In the last column: OE = overestimation. In the paper: OMS / SMS / QMS = overall / subscale / question mean score.
Interpretation. 75-100 very good, 50-74.9 good, 25-49.9 poor, 0-24.9 very poor (no such score was reported).
Symbols. L (in honor of Likert) = the number of points (anchors) of an L-point Likert scale for question coding; in all educational environment measures L = 5: 'strongly disagree', 'disagree', 'uncertain', 'agree', 'strongly agree'. Q = the number of questions per scale, subscale or question. B = the bottom (lower) limit of a score range. U = the upper limit of a score range. M = mean score (scores that changed interpretation zone after correction are in bold). SD = standard deviation. C = constant; |C| = the absolute value of C. Any symbol with a subscript (e.g. Mn, M0, M% etc.) denotes the symbol in a non-0-based (n), a 0-based (0), or the standard 0-100 (%) scale.
Calculations. Columns L to Mn% appear as given in the corresponding papers, except for numbers in italics, which were calculated by us.

Reported and Corrected Scores
The results are shown in Table 2. Column L presents the number of points in the L-point Likert scale; in all three questionnaires L = 5. Column Q presents the number of questions per scale, subscale or question; the OREEM and STEEM consist of 40 questions, their first subscale consists of 13 questions, etc. The next two columns present the lower (Bn) and upper (Un) limits of the range within which any non-0-based score could be reported; for example, 40-200 for the overall STEEM / OREEM scores, 13-65 for the overall sSTEEM score, 1-5 for any single question score, etc. The columns Mn and SDn present the reported non-0-based overall, subscale or question non-standard mean scores and standard deviations; for example, the non-standard OMS was 147.2 for OREEM (no SD was reported), 148.7 for STEEM (no SD was reported), and 44.9 for sSTEEM with an SD of 7.1. The columns Mn% and SDn% present the corresponding mean scores and standard deviations reported as standard, in the erroneously supposed standard (0-100) but in fact pseudostandard (20-100) scale; for example, the overall means reported as 'standard' were 73.6%, 74.4% and 69.1%; no SD was reported, except for sSTEEM (10.9% overall, 20.7% first question etc.). The following six columns present the 0-based corrected values. Columns B0 and U0 present the lower and upper limits of the range within which any 0-based score could be found. Columns M0 and SD0 present the 0-based mean scores and standard deviations, and columns M0% and SD0% present mean scores and standard deviations in the standard 0-100 scale. Finally, the last column presents the overestimations (OE).

Overestimation and Its Implication on Score Interpretation
Figure 2 reveals a perfect (r = -1) negative linear relation between the overestimation and the corrected standard mean score (M0%), which is not obvious in the last column (OE) of Table 2. Because of this perfect linearity, we can predict no overestimation at all if M0% = 100 (this makes sense: there is no room for improvement), but a 20% overestimation if M0% = 0, i.e., the 1-to-5 bias adds up to 20% overestimation as M0% moves from 100% to 0%, at a rate of 0.2 per M0% unit. This is exactly what was theoretically predicted in Table 1 and formula {1}. Thus, the uncorrected OREEM, STEEM, and sSTEEM scores will never fall within the worst fifth, 0-20% (Figure 3), although this might be the situation as appraised by the survey participants. The uncorrected 'standard' mean score, i.e., the pseudostandard mean score (Mn%), can never be less than 20%, since this 20% is entirely attributable to the overestimation. In real life, it would be almost impossible to find a score indicating a very poor environment, i.e., within the worst quarter: in order to obtain an uncorrected (pseudostandard) 25% you would only need a real standard score of about 6%; the 19% overestimation makes up the rest (Figure 3).
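The floor at 20% and the "25% from about 6%" example follow directly from the linear relation between pseudostandard and standard scores. A short sketch, with hypothetical helper names and a `points` parameter that generalizes the five-point case as an assumption of ours:

```python
def pseudo_from_real(real_pct, points=5):
    """Pseudo-percentage produced by 1-to-L coding for a given real percentage."""
    return 100 / points + real_pct * (points - 1) / points


def real_from_pseudo(pseudo_pct, points=5):
    """Real percentage behind a reported pseudo-percentage."""
    return (pseudo_pct - 100 / points) * points / (points - 1)


floor = pseudo_from_real(0)      # lowest pseudostandard score that can be reported
needed = real_from_pseudo(25)    # real score behind a reported 25%
overestimation = 25 - needed     # share of the reported 25% due to the bias
```

With five points, `floor` is 20, `needed` is 6.25 and `overestimation` is 18.75, which round to the 6% and 19% quoted in the text.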
All standard mean scores were overestimated (Table 2, last column OE) by 2.8%-13.6% (mean 7.2%, median 6.7%). In addition, one in three of them (15/45 = 33%) had erroneously been placed in a higher interpretation zone. Namely, almost one in four (11/45 = 24%), which were in the 'good' zone, had been interpreted as if they were in the 'very good' zone, and about one in ten (4/45 = 9%), which were in the 'poor' zone, had been interpreted as if they were in the 'good' one.
Table 3 presents the distribution in the four interpretation zones (quarters) of all 45 reported non-0-based pseudostandard mean scores and of the corrected 0-based standard ones. Almost three times as many pseudostandard as standard scores (17 versus 6) were found in the top interpretation zone (p = 0.016). The high fraction of reported pseudostandard scores in the top quarter (38%, about thirteen times the 3% DREEM equivalent from a recent review [12]; p = 0.001) was eliminated after the appropriate correction (p = 0.241).
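How a score can change interpretation zone after correction may be illustrated as follows. This is only a sketch: the zone boundaries are the quarters used in the paper, the helper names are our own, and the score of 78 is a hypothetical value, not one taken from Table 2.

```python
def zone(pct):
    """Interpretation zone on the 0-100 scale (quarters: 0-24.9, 25-49.9, 50-74.9, 75-100)."""
    if pct >= 75:
        return "very good"
    if pct >= 50:
        return "good"
    if pct >= 25:
        return "poor"
    return "very poor"


def real_from_pseudo_5pt(pseudo_pct):
    # Inverse of pseudo = 20 + 0.8 * real for a five-point 1-to-5 coding.
    return (pseudo_pct - 20) / 0.8


reported = 78.0                              # hypothetical pseudostandard score
corrected = real_from_pseudo_5pt(reported)   # about 72.5
changed_zone = zone(reported) != zone(corrected)
```

Here the reported score falls in the 'very good' zone while the corrected one falls in the 'good' zone, the kind of shift counted in the 11/45 cases above.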

Discussion
The non-0-based 1-5 question coding of STEEM, OREEM and student-STEEM introduces an overestimation of up to 20% in standard (percent) scores when assessing the quality of surgical educational environments, which, to date, has escaped observation. The worse the quality of the environment, the greater the overestimation, embellishing things exactly where we need the warning bells to ring, i.e., in poor areas and especially in very poor ones. This reduces the usefulness of these otherwise very valuable instruments. Removing the 1-to-5 bias eliminates this latent defect. This does not mean that other, possibly coexisting, biases in reporting [13] have also been eliminated. However, it does mean that there is no reason to believe that DREEM respondents have reporting biases different from those affecting STEEM, OREEM and sSTEEM responses. Surgical educational environment quality, as assessed by participants, appears to worsen after elimination of the 1-to-5 bias; in reality, it had previously been erroneously overestimated.
A 0-based Likert scale should always be used when coding question response options, so that the most negative point would be coded as '0' [14], as originally recommended by Likert [15].

Conclusion
The non-0-based question coding in the STEEM, OREEM and student-STEEM questionnaires overestimates the quality of the educational environment due to the 1-to-5 bias, or rather the 1-to-L bias, where L indicates the number of points of the L-point Likert scale. Any non-0-based 'standard' score is a pseudostandard. The worse the educational environment, the greater the overestimation, embellishing things exactly when the alarm bell should be ringing. To raise the usefulness of these otherwise very good instruments, question coding should always be 0-based, i.e., the most negative point should be coded as '0', as originally recommended by Likert.
It is worth noting that, more generally, this applies to any Likert scale (L = 2, 3, 4 etc.) and any questionnaire in any field of study (education, quality of life, psychology, economics etc.), in order to avoid misleading statements or assumptions leading to inadequate political, economic, scientific or other related decisions.
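For an arbitrary L-point scale, the size of the 1-to-L bias can be derived from the definitions of the pseudostandard and standard scores. The derivation below is our own sketch, using the symbols Q, L, Mn, Mn% and M0% as defined in the notes to Table 2; it reduces to OE = 20 - 0.2*RP for L = 5.

```latex
% Q questions, each coded 1..L; reported mean M_n lies in [Q, QL].
\begin{align*}
M_{n\%} &= 100\,\frac{M_n}{QL}
  && \text{(pseudostandard score)}\\
M_{0\%} &= 100\,\frac{M_n - Q}{Q(L-1)}
  && \text{(standard score)}\\
M_n &= Q + \frac{M_{0\%}}{100}\,Q(L-1)
  && \text{(solving the second line for } M_n\text{)}\\
M_{n\%} &= \frac{100}{L} + \frac{L-1}{L}\,M_{0\%}
  && \text{(substituting into the first line)}\\
OE &= M_{n\%} - M_{0\%} = \frac{100 - M_{0\%}}{L}
  && \text{(overestimation)}
\end{align*}
% For L = 5: OE = 20 - 0.2 M_{0%}; the floor of the pseudostandard scale is 100/L.
```

Under this sketch, the maximum overestimation is 100/L percentage points, so scales with fewer points are affected more strongly.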

Figure 1. Pseudoscores and Pseudopercentages versus Real Scores and Real Percentages. The first two graphs clarify what has been reported, while the next two clarify what should be reported. In the top graph, the arrow extending from 40 to 200 indicates the non-standard 40-based overall STEEM range, while the dotted vertical line at 148.7 indicates the non-standard 40-based OMS. In the second graph, their transformation to supposedly standard percent values is presented; the arrow extending from 20 to 100 (100*40/200 to 100*200/200) indicates the pseudostandard 20-based overall STEEM range, and the dotted vertical line at 74.4 indicates the 20-based pseudostandard OMS that was reported as standard. In fact, the worst fifth (0-20) of the real standard range 0-100 has been cut off and the values reported as standard start from 20%, i.e., they have been compressed to the right and are thus pseudostandard. That is where the overestimation comes from. In the third graph, the arrow and the dotted vertical line have been moved to the left by exactly 40 points. The arrow now indicates the non-standard 0-based overall STEEM range 0-160, while the dotted vertical line at 108.7 indicates the non-standard 0-based OMS. All these new points equal the reported ones minus 40 (0 = 40-40; 160 = 200-40; 108.7 = 148.7-40). In the last graph, the 0-160 range has been rescaled to fit the standard 0-based 0-100 range, where the OMS becomes 67.9 (108.7*100/160 = 67.9375), i.e., the 0-based standard OMS (67.9) equals the 0-based actual OMS (108.7) multiplied by the constant C = 100/160. Therefore, the reported non-0-based pseudostandard OMS of 74.4% overestimates the 0-based standard OMS of 67.9% by 6.4% (= 74.35% - 67.9375%).

Table 1. Five different scenarios from the worst (A) to the best (E) option distribution. OE: overestimation.
Note 1: A participant or a set of participants chooses the same option in all forty questions, either exclusively 'strongly disagree' (scenario A) or exclusively 'disagree' (B) or exclusively 'uncertain' (C) or exclusively 'agree' (D) or exclusively 'strongly agree' (E).

Figure 3. Magnitude of the 1-to-L bias in comparison to the real standard mean score