Power Comparisons of Normality Tests Based on L-moments and Classical Tests

Normality tests are used in statistical analysis to determine whether a normal distribution is acceptable as a model for the data analysed. A wide range of available tests employs different properties of the normal distribution to compare empirical and theoretical distributions. In the present paper, we perform a Monte Carlo simulation to analyse test power. We compare commonly known and applied tests (standard and robust versions of the Jarque-Bera test, Lilliefors test, chi-square goodness-of-fit test, Shapiro-Francia test, Cramer-von Mises goodness-of-fit test, Shapiro-Wilk test, D'Agostino test, and Anderson-Darling test) to a test based on robust L-moments. In this test of Jarque-Bera type, the moment characteristics of skewness and kurtosis are replaced with their robust versions, L-skewness and L-kurtosis. Distributions with heavy tails (lognormal, Weibull, loglogistic and Student) are used to draw random samples to show the performance of the tests when applied to data with outliers. Sample sizes from small (10 observations) up to large (200 observations) are analysed. Our results concerning the properties of the classical tests are in line with the conclusions of other recent articles. We concentrate on the properties of the test based on L-moments. This normality test is comparable to well-performing and reliable tests; however, it is outperformed by the most powerful Shapiro-Wilk and Shapiro-Francia tests. It works well for the (symmetric) Student distribution, comparably with the most frequently used Jarque-Bera tests. As expected, the test is robust to the presence of outliers, in contrast to sensitive tests based on product moments or correlations. The test turns out to be reliable across a wide range of situations.


Introduction
In parametric statistics, the assumption about the probability distribution of data is crucial. An optimal, or at least acceptable, choice of a probability distribution allows a wide range of powerful parametric procedures to be applied, while an inappropriate choice can result in misleading or completely erroneous and even dangerous outcomes. There are numerous statistical methods based on the normal distribution. The issue can be approached theoretically, using the theoretical background of the data, or experimentally, by exploring the available data. In addition to exploratory graphs describing the empirical distribution, statistical tests have been designed to determine whether a random sample is drawn from the selected distribution (the normal one in this paper), or whether this assumption is at least reasonable. These tests make it possible to assess the suitability of the selected distribution as a probabilistic model for the analysed data, rather than to choose the best of multiple options. Some of them are intended for the normal distribution only (being based on its specific properties); others are more generally applicable goodness-of-fit tests.
If the goal is to select a proper test, we can employ a commonly used application-specific one along with practice-proven knowledge and software. It is necessary to consider individually the power of each test, depending on the real shape of the distribution and the sample size. However, it is very difficult to assess the power of tests analytically, so recommendations are often supported by simulation studies. In the present paper, we use Monte Carlo simulation to compare the power of the most commonly used goodness-of-fit tests for the normal distribution. A normality test based on robust L-moments, a robust version of the Jarque-Bera test, is compared to more frequently applied tests. To investigate the power of a test, in our simulation we draw data from different distributions (symmetric short- and long-tailed as well as asymmetric ones) and evaluate the relative frequency of rejection of the hypothesis of normality.
Many statistical tests have been developed to verify whether a sample is drawn from a normal distribution. In our simulation, we compare the performance of ten commonly used tests, including the test relying on L-moments (part 3.1). We give original references to these tests in part 3.2, together with some remarks on their properties (type one errors and power of the tests).
The L-moment-based test, both Jarque-Bera tests, the Lilliefors, Anderson-Darling, D'Agostino, Shapiro-Francia and Shapiro-Wilk tests are intended only for testing normality, while the chi-square goodness-of-fit, Anderson-Darling and Cramer-von Mises tests are generally applicable as goodness-of-fit tests for a fully specified distribution (with given parameters) or a family of distributions. We compare the test based on L-moments (part 3.1), tests comparing sample skewness and kurtosis characteristics (Jarque-Bera, robust Jarque-Bera and D'Agostino), tests comparing the empirical and theoretical cumulative distribution functions (Lilliefors, Anderson-Darling and Cramer-von Mises), and tests based on the correlation coefficient (Shapiro-Francia and Shapiro-Wilk).
The choice of the most powerful test in various situations is of interest not only to statisticians, but also to users of statistical methods who apply the tests in their research work. There is a huge spectrum of recommendations available in the statistical literature. The most common way to approach the problem is Monte Carlo simulation ([16], [26], [24], [4], [27], and [23]). The authors obtain results for a special choice of distributions and subsequently generalize them to broader classes of distributions (symmetric, asymmetric, heavy-tailed). Special attention should be paid to the sample size, as the quality of a test depends strongly on it. Usually the idea behind a test makes it most powerful either for small or for large samples; moreover, for small samples the asymptotic approximations are not applicable. According to Yap and Sim [26], descriptive and graphical information supplemented with formal normality tests can aid in drawing conclusions about the distribution of a variable. In their study, for symmetric short-tailed distributions, the D'Agostino and Shapiro-Wilk tests showed the best results; the power of the Jarque-Bera and D'Agostino tests was comparable to that of the Shapiro-Wilk test for symmetric long-tailed distributions. For asymmetric distributions, the authors, based on their research, also recommend the Shapiro-Wilk test. The Shapiro-Wilk test was also found to perform well for small samples (10-150 observations) under various deviations from normality (Student, chi-squared or beta distribution). For a symmetric distribution with high sample kurtosis (symmetric long-tailed), the researcher can use the JB, SW or AD test. Razali and Wah [20] show that the Shapiro-Wilk test is the most powerful normality test, followed by the Anderson-Darling, Lilliefors and Kolmogorov-Smirnov tests. However, the power of all four tests is still low for small sample sizes.
In the paper, we investigate the above-mentioned normality test properties focusing on the L-moment-based test. An overview of L-moments and the underlying rationale are provided in section 2. In part 3, all tests and reasons for their construction are introduced. The results of the simulation are given in tables and figures in section 4. In the last part of the article, we make some suggestions emerging from the simulation. All computations are performed using the R [19] software.

Methods
Let X_1, X_2, ..., X_n be a random sample of size n from a continuous probability distribution of a random variable X with distribution function F and quantile function Q(P) = F^{-1}(P). The hypothesis of normality (X ~ N(µ; σ^2)) assumes that F(x) = Φ((x − µ)/σ), where Φ is the cumulative distribution function of the standard normal distribution. Suppose X_{i:j} denotes the i-th order statistic in a sample of size j. Let us denote by s^2 the sample variance (based on a random sample x_1, x_2, ..., x_n of size n) and by a_3 and a_4 the sample coefficients of skewness and kurtosis, respectively,

s^2 = (1/n) Σ_{i=1}^{n} (x_i − x̄)^2,   a_3 = (1/n) Σ_{i=1}^{n} (x_i − x̄)^3 / s^3,   a_4 = (1/n) Σ_{i=1}^{n} (x_i − x̄)^4 / s^4,

where x̄ is the mean of the sample. The (theoretical) coefficient of skewness α_3 of every normal distribution is equal to 0, and the coefficient of kurtosis α_4 equals 3. These properties are used throughout the paper, especially in the Jarque-Bera test.
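As an illustration (ours, not from the paper), the sample coefficients a_3 and a_4 can be computed directly from their definitions; for a large normal sample they should be close to the theoretical values 0 and 3:

```python
import numpy as np

def sample_skew_kurt(x):
    """Sample coefficients of skewness a3 and kurtosis a4
    (moment estimators with the 1/n convention, as in the Jarque-Bera statistic)."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    s2 = np.mean((x - m) ** 2)            # sample variance (1/n version)
    a3 = np.mean((x - m) ** 3) / s2 ** 1.5
    a4 = np.mean((x - m) ** 4) / s2 ** 2
    return a3, a4

rng = np.random.default_rng(1)
a3, a4 = sample_skew_kurt(rng.standard_normal(100_000))
print(f"a3 = {a3:.3f}, a4 = {a4:.3f}")   # near 0 and 3 for normal data
```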

L-moments
In [12], the formula for the evaluation of L-moments (for k = 1, 2, ...) based on the quantile function Q(.) is given in the form

λ_k = ∫_0^1 Q(P) P*_{k−1}(P) dP,

where P*_r is the r-th shifted Legendre polynomial, P*_r(P) = Σ_{l=0}^{r} p*_{r,l} P^l, the polynomial coefficients being

p*_{r,l} = (−1)^{r−l} (r choose l) ((r + l) choose l),   l ≤ r.
If the expected value of the probability distribution is finite, then all L-moments are finite [12]. It follows that for all distributions with a finite expected value, the second, third and fourth L-moments are also finite, allowing for the application of the L-skewness and L-kurtosis. Therefore, we can use L-moments and methods based on them even for distributions without a finite variance or higher moments. In this paper, we employ the Student distribution with degrees of freedom ν = 1, ..., 20. The expected value is finite for ν > 1; for the Cauchy distribution (ν = 1), the expected value is undefined. The L-moments of the Student distribution are thus finite for all degrees of freedom greater than 1, whereas the variance is defined only for ν > 2, and the skewness and kurtosis coefficients for ν > 3 and ν > 4, respectively. For k = 1, 2, 3, 4 we obtain ([12], [15]) the first four L-moments λ_1, λ_2, λ_3, and λ_4. As robust counterparts of the normalized product moment characteristics of variability and shape (the variation coefficient and the coefficients of skewness α_3 and kurtosis α_4), we define the following statistics based on L-moments, the L-coefficient of variation τ, L-skewness τ_3 and L-kurtosis τ_4, given by

τ = λ_2/λ_1,   τ_3 = λ_3/λ_2,   τ_4 = λ_4/λ_2.

As previously described, these normalized values are correctly defined for all distributions with a finite expected value, and according to [12] they are bounded:

|τ_3| < 1,   (5τ_3^2 − 1)/4 ≤ τ_4 < 1.

The bounded area of all possible points (τ_3, τ_4) is shown in Figure 2.
The boundedness of L-moment ratios is an advantage: it is easier to interpret a statistic limited to the interval [−1, 1] than the conventional skewness coefficient, which may take arbitrarily large values. We obtain λ_3 = τ_3 = 0 for all symmetric distributions with a finite expected value, a property corresponding to the zero standard coefficient of skewness of symmetric distributions. For all normal distributions, the L-skewness equals 0 and the L-kurtosis 0.1226; compare this with the classical skewness and kurtosis coefficients of 0 and 3, respectively.
The sample counterparts (based on a random sample of size n) to the L-moments are the sample L-moments l_k, k = 1, 2, ..., obtained in [12] by replacing the quantile function Q in the definition with the sample quantile function Q̂. The sample L-skewness t_3 and sample L-kurtosis t_4 are then defined as

t_3 = l_3/l_2,   t_4 = l_4/l_2.
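A minimal sketch (ours) of the sample L-moments via the usual unbiased probability-weighted-moment estimators; for a large standard normal sample, t_3 should be near 0 and t_4 near 0.1226:

```python
import numpy as np

def sample_lmoments(x):
    """First four sample L-moments l1..l4 from the unbiased
    probability-weighted moments b0..b3 of the ordered sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)                               # ranks 1..n
    b0 = x.mean()
    b1 = np.sum((i - 1) * x) / (n * (n - 1))
    b2 = np.sum((i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3, l4

rng = np.random.default_rng(7)
l1, l2, l3, l4 = sample_lmoments(rng.standard_normal(200_000))
t3, t4 = l3 / l2, l4 / l2          # sample L-skewness and L-kurtosis
print(f"t3 = {t3:.4f}, t4 = {t4:.4f}")   # near 0 and 0.1226 for normal data
```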

Tests of normality
Let us test the null hypothesis that the data are sampled from a normal distribution.

Test of normality based on L-moments
The test proposed in [11] is based on transformations of the sample L-skewness and L-kurtosis and a comparison with the values for normal distributions (τ_3 and τ_4 in (13)). The transformed L-skewness has approximately the standard normal distribution, and the same applies to the transformed L-kurtosis. Moreover, the two transformed statistics are approximately independent. It follows that the test statistic, the sum of their squares, has an approximate chi-square distribution with two degrees of freedom.
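We do not reproduce the exact transformation constants of [11]; as a hedged sketch of the same idea, the null distribution of (t_3, t_4) under normality can be calibrated by simulation and the two standardized statistics combined into an approximately chi-square criterion (all function names and constants below are ours):

```python
import numpy as np

def t3_t4(x):
    """Sample L-skewness and L-kurtosis via probability-weighted moments."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) * x) / (n * (n - 1))
    b2 = np.sum((i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l3 / l2, l4 / l2

def lmoment_normality_test(x, n_null=2000, seed=0):
    """Standardize (t3, t4) against a simulated normal null of the same
    sample size; the sum of squares is approximately chi-square(2)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    null = np.array([t3_t4(rng.standard_normal(n)) for _ in range(n_null)])
    mean, std = null.mean(axis=0), null.std(axis=0)
    z3, z4 = (np.asarray(t3_t4(x)) - mean) / std
    stat = z3 ** 2 + z4 ** 2
    null_stats = np.sum(((null - mean) / std) ** 2, axis=1)
    pvalue = np.mean(null_stats >= stat)        # Monte-Carlo p-value
    return stat, pvalue

rng = np.random.default_rng(3)
stat, p = lmoment_normality_test(rng.standard_t(df=2, size=100))
print(f"stat = {stat:.2f}, p = {p:.3f}")   # heavy tails: a large statistic is expected
```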
In terms of the test power comparison, the authors in [11] show that the test relying on t_3 and t_4 has the greatest power for distributions with high kurtosis in small samples and relatively high power in medium-sized and large samples. The test shows consistently high power, outperforming other normality tests for a number of distributions.

Tests of normality used in the simulation
The Jarque-Bera test for normality [14] is one of the most common goodness-of-fit tests utilized in economics, not only in distribution fitting but also in regression diagnostics and time series models. The test compares the sample coefficients of skewness and kurtosis with the corresponding values for a normal distribution. This generally preferred test resembles the L-moment-based test, the latter using robust L-moments instead of the classical product ones. The classical sample moments (sample variance, sample coefficients of skewness and kurtosis) are highly sensitive to outliers, as the third and fourth powers of differences from the mean are used to evaluate them. The authors in [9] propose a modification of the Jarque-Bera test that utilizes an outlier-robust estimate of variability, using, instead of the standard deviation s given as the root of (1), the mean absolute deviation from the sample median. Simulation studies indicate that the robust Jarque-Bera test shows higher or similar power in detecting heavy-tailed alternatives compared to the classical JB test. The Lilliefors test [18] is a variant of the Kolmogorov-Smirnov (KS) test for testing data normality without a fully specified distribution. The limitation of the KS test is its high sensitivity to extreme values, the LF correction rendering the test less conservative. It is sometimes argued, however, that the test has low power and therefore should not be seriously considered for testing normality [24].
The Shapiro-Wilk test [21] is based on the correlation of weighted order statistics from the sample with the corresponding normal scores. According to [24], some researchers recommend the SW test as the best choice for testing the normality of data. It is sensitive to non-normality even for samples smaller than 20 observations, particularly for asymmetric and long-tailed distributions. Having compared the Shapiro-Wilk, Lilliefors and Anderson-Darling tests, the Monte Carlo simulation in [20] revealed that the SW test (followed by the AD test) has the best power for a given level of significance.
The Shapiro-Francia test [22] is well-established and intuitively appealing for testing departures from normality. The test statistic is the squared correlation between the ordered sample values and the (approximated) expected ordered quantiles from the standard normal distribution. The SF test is among the most powerful omnibus tests of non-normality available, quick to compute yet not inferior in power.
The Cramer-von Mises and Anderson-Darling [1] tests also compare the hypothetical and empirical distribution functions, their test statistics being based on the integrated squared difference rather than the maximum absolute difference. A weighting function included in the test statistic determines whether the emphasis is placed on differences near the median or in the distribution tails. Compared with the Cramer-von Mises distance, the Anderson-Darling distance puts more weight on observations in the tails of the distribution.
The asymptotic chi-square goodness-of-fit test is based on a comparison of hypothetical and empirical frequencies over selected intervals. It is applicable to a fully specified as well as an estimated distribution, the situation affecting the degrees of freedom of the chi-squared distribution.
The D'Agostino test [3] is a test of normality applicable to samples of size fifty or larger which possesses the desirable omnibus property. Simulation results of the test power against possible alternatives at a sample size of 50 indicate that the test compares favourably with the Shapiro-Wilk test.
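For reference, several of the tests above have counterparts in Python's scipy.stats (the paper's computations were done in R; the sample and parameters below are our own illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100)   # a clearly non-normal sample

sw = stats.shapiro(x)                  # Shapiro-Wilk
jb = stats.jarque_bera(x)              # Jarque-Bera
da = stats.normaltest(x)               # D'Agostino's K^2 (skewness + kurtosis)
ad = stats.anderson(x, dist='norm')    # Anderson-Darling (critical values, no p-value)

print(f"Shapiro-Wilk      p = {sw.pvalue:.2e}")
print(f"Jarque-Bera       p = {jb.pvalue:.2e}")
print(f"D'Agostino        p = {da.pvalue:.2e}")
print(f"Anderson-Darling  A2 = {ad.statistic:.2f} (5% critical value {ad.critical_values[2]:.3f})")
```

All four tests clearly reject normality here, as expected for a strongly skewed lognormal sample.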

Results
We compare the power of tests based on a simulation study.
The Weibull (W) distribution with a positive shape parameter a and positive scale parameter δ is described by the probability density function

f(x) = (a/δ) (x/δ)^{a−1} exp(−(x/δ)^a),   x > 0.

The skewness decreases with the shape parameter a, as does the kurtosis. In the simulation, the parameters are a = 0.25, 0.5, 1, 2 and δ = 1; the coefficient of skewness exceeds 3 for a = 0.25 and 0.5 and is below 3 for a = 1 and 2 (Table 1). The loglogistic (LL) distribution is defined by a positive shape parameter γ and a scale parameter δ,

f(x) = (γ/δ) (x/δ)^{γ−1} [1 + (x/δ)^γ]^{−2},   x > 0.

The parameters are δ = 1 and γ = 1, 2, 5, 6. No product moments are defined for γ = 1; the expected value is finite for γ > 1, the variance for γ > 2, and the third product moment for γ > 3. Figure 1 shows that the densities are decreasing (for γ = 1) and unimodal and positively skewed for higher parameter values. The values of the parameters and the classical and L-moment characteristics of the selected distributions are given in Tables 1 and 2; the pairs (τ_3, τ_4) are plotted in Figure 2, where it is easier to assess the agreement or difference between the probability distributions.
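As a quick numeric check (ours, using scipy's standard parameterizations; the loglogistic is called fisk there), the skewness of the Weibull distribution and a loglogistic moment:

```python
from scipy import stats

# Classical skewness of Weibull(shape=a, scale=1) for the simulated shapes;
# it decreases as the shape parameter a grows.
for a in (0.25, 0.5, 1.0, 2.0):
    g1 = float(stats.weibull_min.stats(a, moments='s'))
    print(f"a = {a}: skewness = {g1:.3f}")

# Loglogistic (scipy's 'fisk') with shape gamma = 2: the mean is finite
# (pi/2 for unit scale) while the variance is not.
mean = float(stats.fisk.stats(2.0, moments='m'))
print(f"loglogistic mean (gamma = 2): {mean:.4f}")
```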
In the case of the lognormal and Weibull distributions, all tests can be applied to the selected data. The choice of parameters for the loglogistic distribution was made to violate the assumptions of some tests (see Table 2).
Using the Student distribution (t) with degrees of freedom ν for ν = 1, 2, . . . , 20, we obtain E(X) = 0 for ν > 1. From the formula for the variance of the Student distribution, Var(X) = ν/(ν − 2) for ν > 2, we obtain values ranging from 3 for ν = 3 to 1.11 for ν = 20. The coefficient of skewness is 0 for ν > 3, as the distribution is symmetric. For ν > 4, the excess kurtosis equals 6/(ν − 4), taking values from 6 for ν = 5 to 0.375 for ν = 20. The Student distribution converges to the normal with increasing degrees of freedom, and the normal approximation is commonly considered acceptable for values greater than 30.
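These values can be checked directly (our illustration) with scipy:

```python
from scipy import stats

# Variance nu/(nu - 2) and excess kurtosis 6/(nu - 4) of the Student
# distribution; both approach the normal limits 1 and 0 as nu grows.
for nu in (5, 10, 20):
    var = float(stats.t.stats(nu, moments='v'))
    exkurt = float(stats.t.stats(nu, moments='k'))
    print(f"nu = {nu:2d}: variance = {var:.3f}, excess kurtosis = {exkurt:.3f}")
```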
We made 30,000 draws from the selected distributions and evaluated p-values for all tests applicable to the particular data. The relative frequency of rejection of the null hypothesis was recorded as an estimate of the power of the test, since the null hypothesis was not valid.
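A scaled-down sketch of this procedure (ours; the paper used 30,000 draws per setting in R) could look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

def power_estimate(sampler, test, n, alpha=0.05, n_rep=1000):
    """Estimate test power as the relative frequency of rejecting the
    normality hypothesis over repeated samples from a non-normal law."""
    rejections = sum(test(sampler(n)).pvalue < alpha for _ in range(n_rep))
    return rejections / n_rep

# e.g. Shapiro-Wilk power against a lognormal alternative at n = 50
p = power_estimate(lambda n: rng.lognormal(0.0, 1.0, n), stats.shapiro, n=50)
print(f"estimated power: {p:.3f}")
```

The same wrapper can be pointed at any scipy test returning a p-value, which mirrors the paper's tabulated rejection frequencies.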
All computations were made in the free R environment [19], which provides a powerful tool for both statistical modelling and simulations. The packages lmoments [13], lmom [15] and lmomco [2] were used for the evaluation of L-moments, moments [17] for standard product moments, and actuar [5] for the loglogistic distribution.

Results of the simulation study
The simulation results for the lognormal, Weibull and loglogistic distributions are displayed in Tables 3, 4 and 5, respectively. We present the relative frequencies of rejection of the normality hypothesis for all distributions, including those that do not meet the requirements for the characteristics used in the test criterion calculation (for the Cauchy distribution, neither product moments nor L-moments exist; all tests based on skewness and kurtosis assume a finite fourth product moment). The results for sample sizes 10, 20, 50, 100, and 200 are presented in the tables, including the chi-squared goodness-of-fit results, although it is an asymptotic test and its small-sample outcomes should be read with this in mind. In the last column of each table, the difference between the minimum and maximum rejection frequencies is shown to demonstrate the gap between the best and the worst test performance for a particular distribution and sample size. The lognormal distribution is unimodal and positively skewed, with the coefficients of skewness and kurtosis increasing in the parameter σ (and independent of µ). Therefore, the power is lower for smaller parameter values, especially for the tests based on, inter alia, skewness. There are some tests with poor results (Lilliefors, chi-squared), a group of others that yield average outcomes (Cramer-von Mises, robust Jarque-Bera, L-moments-based, Anderson-Darling), and the best performing ones (Shapiro-Wilk, Shapiro-Francia and D'Agostino, the latter exhibiting excellent properties). The Jarque-Bera test outperforms its robust version, the latter variant being of inferior quality to the robust test based on L-moments. For sample sizes of 100 and 200 observations, there are almost no type-two errors and the frequencies of correct decisions are close to one.
For the Weibull distribution with the selected parameterization, the skewness decreases with the shape parameter a, as does the kurtosis (Table 1). Table 4 displays the results, the performance of all tests declining with an increasing parameter. For high values of the parameter, larger samples are needed to distinguish between the Weibull and normal distributions. In this case, the D'Agostino test is not comparable to the most powerful pair of the Shapiro-Wilk and Shapiro-Francia tests, as it is for the lognormal distribution. Both Jarque-Bera tests are in the group of poor-performing ones, together with the chi-square (except for 10 observations and a = 2) and Lilliefors tests.
For the loglogistic distribution, no product moments are defined for γ = 1. Thus, we use parameters implying no finite product moments (γ = 1), only a finite expected value (γ = 2), and finite higher-order moments (γ = 5 and 6). Figure 1 shows that the densities are decreasing (for γ = 1) and unimodal and positively skewed for higher parameter values. Problems with the existence of product moments (Table 2) are to be considered carefully: since we test normality, we nevertheless apply the tests even where this is problematic given the nature of the distribution. The Anderson-Darling test performs well for the small value γ = 2. The three other tests, D'Agostino, Shapiro-Wilk and Shapiro-Francia, are also strong, the latter in particular maintaining a consistently high performance. Figures 3, 4 and 5 illustrate the dependence of test power on the degrees of freedom of the Student distribution. Unlike the other distributions analysed, the Student distribution is symmetric, converging quickly to the standard normal distribution, although we still do not use the normal approximation even for the highest number of degrees of freedom, ν = 20. The graph lines of the same type, referring to the same test, are shifted from the left (sample size 10) to the right (sample size 200). For small numbers of degrees of freedom, it is easier to distinguish between the Student and normal distributions, the power of all tests being relatively high. With increasing degrees of freedom, the two distributions become increasingly similar and thus more difficult to distinguish. For small samples and heavy-tailed distributions, robust tests have the highest power, the one based on L-moments outperforming the robust Jarque-Bera test. The Shapiro-Francia and D'Agostino tests (the latter for higher degrees of freedom) can handle heavy tails and the resulting outliers in the data.

Conclusion
Testing the normality of data is an essential part of statistical analysis, and the issue receives considerable attention in the literature. In the present article, Monte Carlo simulations are employed to investigate the ability of commonly used normality tests to detect non-normality of the data. We apply the L-moment-based test by Harri and Coble [11] and compare its properties to widely known and frequently used tests of normality.
Our results are in line with the recommendations of others, and we assess the qualities of the test based on L-moments. All tests in the study are readily implemented in R.
The simulation results show that for symmetric short-tailed distributions, the D'Agostino and Shapiro-Wilk tests are the most powerful. For symmetric long-tailed distributions, the powers of the Jarque-Bera and D'Agostino tests are comparable with that of the Shapiro-Wilk test. As for asymmetric distributions, the Shapiro-Wilk test, followed by the Anderson-Darling test, has the highest power. The Jarque-Bera and robust Jarque-Bera tests achieve similar performance for all simulated distributions except the Student one.
The test based on L-moments performs similarly to the average tests, being, however, outperformed by the Shapiro-Wilk and Shapiro-Francia tests. In the case of a symmetric distribution with long tails (the Student distribution with few degrees of freedom), the robust tests (the L-moment-based and robust Jarque-Bera ones) exceed the performance of the other tests. This property is expected, as outliers occur in the samples and some tests are sensitive to such values (especially tests based on correlations or product moments).
The test relying on L-moments is suitable for all distributions analysed, and along with the robust Jarque-Bera it is the most powerful for the Student distribution, outperforming the latter test for asymmetric distributions and being comparable for long tails. Hence, we consider the L-moment-based test a generally acceptable high-performance option for long-tail data applicable to small samples as well.