Gaussian Distribution on Validity Testing to Analyze the Acceptance Tolerance and Significance Level

Some research requires data homogeneity. Wide dispersion of the data can steer a study toward absurd conclusions, while outliers create unrealistic homogeneity. A study can reject extreme data as outliers and estimate a trimmed arithmetic mean, but when the data dispersion is too wide, the outliers cannot be identified. This study evaluates the confidence interval and compares it with the acceptance tolerance. Three types of invalidity in data gathering are examined: outliers, too wide dispersion, and distracted central tendency.


Introduction
Measurement is a key observation activity for gathering numerical expressions of the evidence dimensions required in a quantitative study. The measurement result depends on four main elements of measurement: the measured object, the measuring instrument, the measuring operator, and the measuring method. Sampling ensures that the measured objects are a representative part of the evidence population. Non-static dimensions of moving objects are more difficult to measure than static dimensions of motionless objects. The attributes of measuring instruments, such as units, scales, limits and technical specifications (accuracy, precision, resolution and sensitivity), contribute to rigorous measurement. Thorough measurement requires the conscientiousness of the measuring operator, and complicated measuring methods sometimes trigger negligence in measurement.
Many studies exploit questionnaires as measuring instruments to gather information. Questionnaire responses depend on the selected respondents as a representative sample of the population [1,2]. Two types of errors can occur in the selection of respondents: coverage error and sampling error [3-6] (see Figure 1). Coverage error is an error in listing the population in a sampling frame that misses some members of the population (undercoverage) or includes non-members of the population (overcoverage). Sampling error is a randomization error in sample selection such that some subsets of the population are not represented. Respondents of questionnaires play multiple roles in the measurement: besides being involved as measuring operators, respondents occasionally act as measured objects or measuring instruments. Respondent judgment is therefore the essence of measurement using questionnaires. Human judgment introduces measurement bias, especially on subjective questions. Many measurement biases are caused by respondent personality tendencies, such as strictness, leniency and central tendency [7-11] (see Figure 2, measurement bias due to respondent personality tendency). Strictness indicates a severe respondent who tends to rate lower than actual performance. Leniency indicates a mild respondent who tends to rate higher than actual performance. Central tendency indicates an indecisive respondent who tends to rate around the medium level, neglecting actual performance.
Errors during the information gathering process raise measurement bias when a questionnaire is applied as a measuring instrument. There are three types of errors: random error, systematic error and illegitimate error [6,12-14]. Random error arises from random circumstances or other unpredictable factors, for example the respondent's mental state when answering the questionnaire, which distorts the information. Systematic error arises from traceable factors, for example ambiguous questions that cause a wide variety of respondents' answers. Illegitimate error arises from carelessness or impropriety in the measuring operation, for example letting a respondent answer a question even though he does not understand it.
Errors in determining the type, order and content of the questions also contribute to measurement bias. The main principle of preparing a good questionnaire is ensuring that respondents interpret the questions as the researcher intends. Questionnaire comprehension problems arise because of the diversity of respondents' backgrounds [15]. Different respondents may conceive the same question with different meanings, especially if the questions are translated from other languages that are culturally diverse [16]. Survey guidelines urge researchers to write short, simple and clear questions and to avoid flawed questions. Flawed questions contain unfamiliar words, ambiguous terms, technical jargon, obscure notions, vague meanings, confusing sentences, etc. [15,17-27].
Uncertainty in questionnaire results increases when the questionnaire asks about the attitudes, opinions or perceptions of respondents [22,28-33]. Table 1 denotes the four fundamental levels of measurement scales, their properties and the commonly applicable types of statistical analysis [23,25]. To measure attitudes, opinions or perceptions, many studies use dichotomous-ordinal or polytomous-ordinal scales [24,25,34-37] with 2 to 11 scale points. Attitude measurement constructs questionnaire scales using rating-scale techniques such as Likert, Guttman, Thurstone and semantic differential [19,24,25,27,38-42]. Some attitude scales are illustrated in Figure 3.

The two fundamental criteria of measurement in scientific research are reliability and validity [19,20,43-48]. The sources of error mentioned above lead to problems of measurement reliability and validity. Reliability refers to the extent to which repeated measurement provides consistent results: consistency of measurement across time, across measured objects, and across measuring operators. Validity refers to the extent to which the measure reflects the variable it is intended to measure: conformity of measurement across measuring operators, across measuring instruments and across measuring methods.
The ideal validity test compares the measured evidence to a measurement conducted by a qualified operator using a calibrated instrument and standard methods. In measurement using questionnaires, it is impractical to acquire calibrated questions and to assign trained respondents. The researcher can enhance the questions, but not the respondents, while respondent judgment is included as part of the measuring instrument. The diversity of respondents becomes another barrier to validity in measurement using questionnaires.

Objectives
This study is concerned with the validity of measurement using questionnaires. It aims to apply the Gaussian distribution for validity testing. It assumes that the valid value corresponds to the opinion of most respondents. Since the mode, median and mean coincide in the Gaussian distribution, it assigns the mean of each questionnaire variable as the reference target.
The Gaussian distribution forms a symmetrical bell-shaped curve. For a single measured object, good measurement should gather data tending to cluster around a specific value at the center of the curve. If data deviate significantly from the center of the curve, measurement bias can be suspected. The study also aims to inspect three types of measurement bias: outliers, too wide dispersion, and distracted central tendency.

Methods
The goodness of measurement has two essential criteria: reliability and validity. Many studies [45-48] have classified several types of reliability and validity (see Figure 4). Reliability tests fall into three categories: stability, equivalence and internal consistency. Validity tests fall into three categories: translation validity, construct validity and criterion validity.

Reliability
Reliability refers to the consistency of measurement. It represents the degree to which the repeated measurement gathers consistent data across time and across other various elements.
Stability reliability is the consistency of repeated measurement across time with the same measured objects, the same measuring instruments and the same measuring operators. It is assessed through test-retest reliability or parallel-form reliability. For test-retest reliability, each respondent is asked to answer the same questionnaire twice at different times (a significant time apart). For parallel-form reliability, each respondent is asked to answer two sets of similar questionnaires (with a different question order) at once.
Equivalence reliability is the consistency among multiple measuring operators or among alternative forms. It is assessed through alternative-form reliability or inter-rater reliability. For alternative-form reliability, each respondent is asked to answer a pair of questionnaires (different but comparable versions of the questions). For inter-rater reliability, the questionnaire responses are compared between respondents.
Internal consistency reliability is the consistency among two or more measuring instruments that measure the same measured objects. It is assessed through split-half reliability, inter-item reliability or item-to-total reliability. For split-half reliability, each pair of similar questions (different versions of the questions) is placed in the first and second halves of the questionnaire, or at even and odd positions. For inter-item reliability, several questions that actually measure the same variable are spread through the questionnaire. For item-to-total reliability, a set of formative or reflective variables is compared to their sum or average.
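As an illustration of inter-item reliability, a commonly used statistic (not discussed in this paper) is Cronbach's alpha; the following is a minimal sketch with hypothetical 5-point Likert responses, not the paper's data:

```python
# Hypothetical illustration of inter-item reliability via Cronbach's alpha.
# Rows are respondents; columns are items intended to measure the same variable.
def cronbach_alpha(scores):
    k = len(scores[0])                      # number of items
    def var(xs):                            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[j] for row in scores]) for j in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

responses = [  # hypothetical answers from 5 respondents to 4 related items
    [4, 4, 5, 4],
    [3, 3, 3, 2],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
]
alpha_value = cronbach_alpha(responses)
```

Values of alpha close to 1 indicate that the items consistently measure the same variable; here the four items move together across respondents, so alpha is high.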

Validity
Validity refers to the exactness of measurement. It represents the degree to which the measurement gathers the unbiased data.
Translation validity is the exactness with which the measuring instrument is translated into the intended measure. It is assessed through face validity or content validity. For face validity, human judgment is used to assess the surface appearance of the measurement. For content validity, the content domain is examined to ensure the questions cover the intended measure.
Construct validity is the exactness of the measurement in inducing inferences related to relevant theories or logical hypotheses. It is assessed through homogeneity validity, convergence validity or evidence validity. For homogeneity validity, the gathered data congregate around a specific score. For convergence validity, the data measured by two or more measuring instruments are compared (sometimes requiring scale conversion). For evidence validity, the gathered data are compared to theoretical propositions.
Criterion validity is the exactness of the measurement in differentiating measured objects on referents or criteria. It is assessed through concurrent validity, predictive validity, convergent validity or divergent validity. For concurrent validity, the scale distinguishes measured objects that are known to be different. For predictive validity, the measurement enables interpolation or extrapolation with respect to a predictive criterion. For convergent validity, the gathered data of two or more variables are highly correlated, as in the referent. For discriminant or divergent validity, the gathered data of two or more variables are uncorrelated, as in the referent.

Table 2. Evolution of validity standards:
• Standards for Educational and Psychological Tests and Manuals [54]. Types: criterion-related, construct-related, content-related.
• Standards for Educational and Psychological Tests [55]. Aspects: criterion-related, construct-related, content-related.
• Standards for Educational and Psychological Testing [56]. Categories: criterion-related, construct-related, content-related.
• Standards for Educational and Psychological Testing [57]. Sources of evidence: content, response processes, internal structure, relations to other variables, consequences of testing.
• Standards for Educational and Psychological Tests [58]. Sources of evidence: content, response processes, internal structure, relations to other variables, consequences of testing.

Table 2 shows the evolution of the validity standards published by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Since 1999, these standards refer to types of validity evidence rather than distinct types of validity. They emphasize that reliability and validity are functions of the interpretations of measurement results for the intended uses, not of the measurement itself.

Conceptual Thinking
The Gaussian distribution, or normal distribution, is a very common and well-known continuous probability distribution. In 1809, Carl Friedrich Gauss developed the well-fitting formula for the distribution of measurement errors in scientific study. The Gaussian distribution is also related to the central limit theorem.
The Gaussian distribution, with probability density function f(x) and parameters mean µ and standard deviation σ, has the following properties:
• It is symmetric; its mean, median and mode coincide at the point x = µ.
• Its density is log-concave and infinitely differentiable.
• It has a smooth, symmetrical bell-shaped curve.
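For reference, the density f(x) referred to above is the standard Gaussian probability density function (a well-known result, stated here for completeness):

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```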
Because of the central limit theorem, Gaussian distribution is widely used in scientific study. Most scientific studies, which require statistical method to examine differences between means, apply Gaussian distribution. In 1924, Walter Andrew Shewhart developed the statistical quality control chart based on principles of Gaussian distribution.
This study applies the Gaussian distribution to test the validity of measurement using questionnaires. Given an acceptance tolerance, it verifies whether or not the measurement contains bias. Following the principles of applying the Gaussian distribution to Shewhart process control, it treats acceptance tolerances like specification limits and significance levels of the confidence interval like control limits. Figure 5 shows the conceptual framework of this study. The validity testing applies the Gaussian distribution to examine the sources of evidence of measurement bias. It requires determining an acceptance tolerance, which is compared with the distribution of the measured data under the Gaussian assumption. It checks three types of measurement errors: outliers, too wide dispersion, and distracted central tendency.

The variables in this study are defined by the following notation: x_(i,j) is the j-th respondent's answer to the i-th question, where i = 1, ..., m indexes the questions in the questionnaire and j = 1, ..., n indexes the respondents. Writing x̄_i for the mean answer to the i-th question and t for the given acceptable tolerance, the acceptance tolerance is defined by adding and subtracting the tolerance from the mean:

x̄_i − t ≤ x_(i,j) ≤ x̄_i + t

Figure 6 shows the conceptual logic of applying the Gaussian distribution to validity testing. Every data point that lies outside the acceptance tolerance is an outlier. A platykurtic distribution indicates an error of too wide dispersion: the high standard deviation widens the empirical range and enlarges the rejection region beyond the acceptable significance level. A bimodal distribution signals an error of distracted central tendency, since a significantly separated mean and mode raise the standard deviation.
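The per-question check described above (mean, standard deviation, symmetric acceptance tolerance, and the probability of the rejection region under a Gaussian assumption) can be sketched as follows; the function names and the example answers are illustrative, not from the paper:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def validity_check(answers, tol):
    """For one question: mean, standard deviation, acceptance tolerance
    bounds, outliers, and the probability of the rejection region outside
    [mean - tol, mean + tol] under a Gaussian assumption."""
    n = len(answers)
    mean = sum(answers) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in answers) / n)
    lower, upper = mean - tol, mean + tol
    outliers = [x for x in answers if x < lower or x > upper]
    # The tolerance is symmetric about the mean, so the two-sided
    # rejection probability is 2 * (1 - Phi(tol / std)).
    sig = 2.0 * (1.0 - normal_cdf(tol / std)) if std > 0 else 0.0
    return mean, std, (lower, upper), outliers, sig

# Hypothetical 5-point Likert answers from 10 respondents to one question.
mean, std, bounds, outliers, sig = validity_check([3, 3, 4, 3, 2, 3, 4, 3, 3, 3], 1.5)
```

With an acceptable significance level α, the question is accepted as valid when sig < α and no answers fall outside the tolerance bounds.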

Results
We distributed questionnaires gathering respondent opinions about a service provider's performance to demonstrate a numerical example of the validity test. The questionnaire contained 12 questions on a 5-point Likert scale and was answered by 30 respondents. We recommend an acceptable tolerance of about 1.3 to 1.7 for a 5-point Likert scale. The recapitulation of the gathered data is shown in Table 3; the content of each cell indicates the number of respondents whose answers correspond to the column label for each question.

The measured data of each question were processed to obtain the average, the standard deviation, the acceptance tolerance and the probability of the rejection region. Table 4 shows the calculation results. We use 1.5 as the acceptable tolerance and 0.10 as the acceptable significance level. According to the probability of the rejection region, 6 of the 12 questions are valid (sig_i < α), but two of them have outliers.

We recalculated the measured data of the questions with outliers after removing the outliers (see Table 5). After removal, the probability of the rejection region decreases below the acceptable significance level, sig_i < α. This also holds for the two questions, the 5th and the 11th, which were previously valid but contained outliers. The other questions are not valid and have to be revised because of mistakes in question design.
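The effect of outlier removal described above can be demonstrated with a self-contained sketch; the answer vector is hypothetical, not the paper's data, while the tolerance 1.5 and significance level 0.10 match the example:

```python
import math

def rejection_probability(answers, tol):
    """Two-sided Gaussian probability outside [mean - tol, mean + tol]."""
    mean = sum(answers) / len(answers)
    std = math.sqrt(sum((x - mean) ** 2 for x in answers) / len(answers))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(tol / std / math.sqrt(2.0))))

tol, alpha = 1.5, 0.10                       # values used in the example above
answers = [4, 4, 5, 4, 4, 5, 4, 4, 1, 4]     # hypothetical answers; 1 is an outlier
before = rejection_probability(answers, tol)

mean = sum(answers) / len(answers)
kept = [x for x in answers if abs(x - mean) <= tol]  # drop answers outside the tolerance
after = rejection_probability(kept, tol)
```

Here the question fails before removal (before > α, because the outlier inflates the standard deviation) and becomes valid afterwards (after < α), mirroring what happened to the 5th and 11th questions.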

Discussion
The validity testing results of the numerical example are illustrated in Figure 7.

Fig. 7D, 7G and 7I show that the distributions of the 4th, 7th and 9th questions are leptokurtic. The probabilities of their rejection regions are 0.017, 0.034 and 0.002, respectively (see Table 4). All respondent opinions are concentrated around the average.

Fig. 7A and 7B show that the distributions of the 1st and 2nd questions are mesokurtic but with a lower peak. Their rejection-region probabilities are 0.170 and 0.135, respectively. In Fig. 7A, respondent opinions are spread almost equally over 3 choices; the outliers in the 1st question contribute to the higher probability of the rejection region. In Fig. 7B, respondent opinions spread over all choices; this is an error of too wide dispersion, which leads to an unreasonable conclusion because it combines significantly contradictory answers.

Fig. 7C, 7F, 7H and 7J show that the distributions of the 3rd, 6th, 8th and 10th questions are platykurtic, with rejection-region probabilities of 0.363, 0.269, 0.295 and 0.361, respectively. The empirical histograms briefly show the presence of bimodality; a platykurtic shape can be caused by the presence of bimodality or by the existence of outliers. The bimodality is clearest in Fig. 7C and 7J, indicating that the 3rd and 10th questions have an error of distracted central tendency. The platykurtic shape in Fig. 7F and 7H is influenced more by the outliers in the 6th and 8th questions. Fig. 7F also shows that the 6th question has an error of too wide dispersion.
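The kurtosis-based reading of Figure 7 can be checked numerically. Below is a minimal sketch of population excess kurtosis with hypothetical answer sets (not the paper's data): a concentrated, leptokurtic set versus a widely spread, platykurtic set.

```python
def excess_kurtosis(xs):
    """Population excess kurtosis: E[(x - mean)^4] / var^2 - 3.
    Positive => leptokurtic (peaked); negative => platykurtic (flat)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3.0

peaked = [3, 3, 3, 3, 4, 3, 3, 2, 3, 3]    # answers concentrated at one value
spread = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]    # answers spread over all choices
lepto = excess_kurtosis(peaked)
platy = excess_kurtosis(spread)
```

The concentrated set yields positive excess kurtosis while the spread set yields negative excess kurtosis, matching the leptokurtic and platykurtic patterns discussed above.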
Outlier errors occur due to several factors, especially respondent human error. They can be triggered by the respondent failing to understand the question or by questions beyond the respondent's capability. Questions that contain jargon, slang, foreign language, uncommon terms or abbreviations will lead to respondent misunderstanding. Outlier errors can also be induced by respondent bias, such as strictness, leniency, a recency effect or a memorable-experience effect, or caused by errors in respondent selection or in the response process.
Errors of too wide dispersion occur due to an unclear scale or unclear question content. They can be caused by mistakes in question design, such as biased, emotional or vague questions.
Errors of distracted central tendency occur due to mistakes in question design, such as confusing questions, ambiguous questions, double-barreled questions, double-negative questions, and hypothetical questions.
The application of Gaussian distribution principles in validity testing helps to detect measurement bias in questionnaire responses caused by mistakes in question design, respondent selection, respondent behavior and the response process. It diagnoses the mistakes by checking for errors of outliers, too wide dispersion, and distracted central tendency.

Conclusions
In questionnaires, the sources of evidence of measurement bias consist of question design, respondent selection, respondent behavior and the response process. Gaussian distribution principles are applied in validity testing to examine the evidence of bias. The test regards the empirical bell-shaped curve and compares it with the given acceptance tolerance to find the probability of the rejection region. The measurements are valid if the probability of the rejection region is less than the acceptable significance level.