On The Gaussian Approximation To Bayesian Posterior Distributions

The present article derives the minimal number $N$ of observations needed to consider a Bayesian posterior distribution as Gaussian. Two examples are presented. In the first one, a chi-squared distribution, the observable $x$ as well as the parameter $\xi$ are defined on the whole real axis; in the second one, the binomial distribution, the observable $x$ is an integer while the parameter $\xi$ is defined on a finite interval of the real axis. The required minimal $N$ is high in the first case and low for the binomial model. In both cases the precise definition of the measure $\mu$ on the scale of $\xi$ is crucial.


General Notions and Definitions
Bayesian statistics distinguishes the observations x 1 , . . . , x N from the parameter ξ that "conditions" them. Bayes' theorem - given below - expresses the uncertainty about ξ via a probability distribution of ξ conditioned by the observations x 1 , . . . , x N . This so-called posterior distribution becomes ever narrower with increasing N . As a consequence, the "true value" of ξ is approached more and more closely. At the same time the posterior approaches a Gaussian distribution. This is a consequence of the fact that the posterior after N observations is essentially the N -th power of the posterior from one observation.
The present article, based on Bayesian statistics, derives the minimal N needed for the Gaussian approximation. In the first part, comprising Sects. 2 and 3, the variables x and ξ are defined on the whole real axis. A chi-squared distribution provides an example. In the second part, i.e. in Sects. 4 and 5, the parameter ξ is defined on a finite interval of the real axis. The so-called trigonometric distribution serves as example.
We explain some notions used throughout the following text. Bayes' theorem [4] states the posterior distribution of ξ conditioned by x : the posterior is proportional to the product p(x|ξ) µ(ξ) of the model p and the prior µ . The Bayesian prior distribution µ(ξ) serves also as the measure of integration over ξ , see Eq. (33) and the explanation there. The quantity m(x) = ∫ dξ p(x|ξ) µ(ξ) normalises the posterior to unity. Equation (2) gives the posterior in the case of a single observation x . There is usually more than one observation, but in the present text there is only one parameter ξ . The Fisher information [16,17,18,19] is the expectation value of the quantity [∂/∂ξ ln p(x|ξ)] 2 . The Fisher information is always positive. When p is a Gaussian distribution with width σ then F is the inverse σ^{-2} of the squared width.
Form invariance is a symmetry relation between the observation x and the condition ξ . It is defined by a group - in the sense of the theory of Lie groups [24,32] - of transformations that leave p(x|ξ) unchanged when applied to both x and ξ . See Chap. 6 of [25] and the textbooks [1,7].
In the examples of the present text the symmetry shows up in the fact that the model p(x|ξ) depends on the difference x − ξ . The symmetry group then consists of all possible simultaneous translations of x and ξ by the same shift. Many statistical models can be reformulated so as to display translational form invariance. In Sect. 2 the details of translational form invariance are discussed. It is taken as the starting point for a Gaussian approximation to the posterior distribution since the Gaussian itself is form invariant under translations. The invariant measure of the group of translations is identified with the prior distribution. Here, this implies µ(ξ) ≡ const .
In Sects. 2 and 4 two different cases are considered: continuous x and ξ within a chi-squared model, and dichotomous x within the binomial model. Both cases are widespread and of practical interest for many applications.
A likelihood function L N (ξ) is proportional to the probability density p N (x|ξ) of N observations x , considered as a function of ξ while x is given. When - in the present context of Eq. (5) - the domain of definition of ξ extends over the whole real axis, the likelihood function possesses a maximum: since the posterior is normalised, it must tend to zero when |ξ| goes to infinity.
The value ξ ML , where the maximum occurs, is called the maximum-likelihood or ML estimator of the "true value" of ξ .
For every series of observations x there is an ML estimator ξ ML = ξ ML (x) . In Sects. 2.4 and 4.4 one will see that ξ ML is the sufficient statistic [37] of the model - a notion widely discussed in the development of the Rasch model [53,49,55,13,14,21,25].
It might appear that the constancy of the prior is nothing but the "principle of indifference", well known in the history of statistical and Bayesian reasoning, and well known also for the difficulties it causes when probability densities are transformed [11,20]. Note that in the present paper the constancy of the prior occurs as a consequence of the well-defined group-theoretical property of "form invariance" [26].
In Sects. 3.1 and 5 the likelihood function of a model p(x|ξ) is compared to the Gaussian model
$$p(x|\xi) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\Big(-\frac{(x-\xi)^2}{2\sigma^2}\Big)\,. \qquad (6)$$
The N -fold Gaussian model, i.e. the distribution of N observations x = (x 1 , . . . , x N ) , is
$$p_N(x|\xi) = (2\pi\sigma^2)^{-N/2}\,\exp\Big(-\sum_{k=1}^{N}\frac{(x_k-\xi)^2}{2\sigma^2}\Big)\,. \qquad (7)$$
This yields the posterior distribution
$$G_N(\xi|x) = \Big(\frac{N}{2\pi\sigma^2}\Big)^{1/2}\,\exp\Big(-\frac{N(\xi-\langle x\rangle)^2}{2\sigma^2}\Big)\,, \qquad (8)$$
where
$$\langle x\rangle = \frac{1}{N}\sum_{k=1}^{N} x_k \qquad (9)$$
is the average, see appendix I. The Fisher information of the Gaussian (6) is F = σ −2 . With increasing number N of observations the posterior of any model p with the prior (5) assumes the Gaussian form and becomes ever narrower, tending towards Dirac's delta distribution. We shall determine the minimal N which allows one to approximate a given posterior by the Gaussian (8).
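As a quick numerical illustration of Eqs. (7)-(9), the following sketch builds the posterior of N Gaussian observations with a constant prior on a grid and compares it to the Gaussian (8). The values σ = 1, N = 25 and the simulated data are arbitrary choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, N = 1.0, 25                        # width of the single-event Gaussian, number of events
xi_true = 0.3                             # "true value" of the parameter, arbitrary for the demo
x = rng.normal(xi_true, sigma, size=N)    # N observations

xi = np.linspace(-2.0, 2.0, 4001)         # grid on the scale of xi, constant prior mu(xi)
dxi = xi[1] - xi[0]
log_post = np.array([-0.5 * np.sum((x - z) ** 2) / sigma**2 for z in xi])
post = np.exp(log_post - log_post.max())
post /= post.sum() * dxi                  # normalised posterior P_N(xi|x)

x_bar = x.mean()                          # average <x> of Eq. (9), the ML estimator
gauss = np.sqrt(N / (2 * np.pi * sigma**2)) * np.exp(-N * (xi - x_bar) ** 2 / (2 * sigma**2))

print(np.max(np.abs(post - gauss)))       # close to zero: the posterior agrees with the Gaussian (8)
```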

Form Invariance Along the Real Axis
The present section considers a statistical model p(x|ξ) where both the observable x and the parameter ξ are defined on the whole real axis. Furthermore, x and ξ shall be related via form invariance. These two properties allow a general approximation to the logarithmic likelihood function, which contains the sum over ln p(x k − ξ) . These logarithms are N random numbers since the x k are random, whence the sum over the ln p(x k − ξ) will have a Gaussian distribution for sufficiently large N . Yet the central limit theorem does not allow one to answer the question of how large N must be for the Gaussian approximation to apply. However, the sum over the ln p(x k − ξ) becomes equal to the N -fold (negative) Kullback-Leibler divergence H(ξ ML |ξ) , see Ref. [38] and Eq. (24). The Kullback-Leibler divergence is a quantity that measures the distance between the distributions p(x|ξ ML ) and p(x|ξ) . It will be represented by a Taylor expansion with respect to ξ . With increasing N its terms of higher order become negligible as compared to the term of second order. This leads to the criterion for the Gaussian approximation.

The prior distribution
The above-mentioned two properties of the model p(x|ξ) entail that p depends on the difference x − ξ and only on this difference. Thus the model reads
$$p(x|\xi) = p(x-\xi)\,. \qquad (11)$$
Neither Bayes [4] nor Laplace [39], who independently established Bayes' theorem, gave a prescription to obtain the prior distribution. Form invariance gives us the prescription: The prior shall be invariant under the symmetry group of translations [32,34,35,25].
Under the translation
$$\xi' = \xi + a \qquad (12)$$
the prior transforms as a density, i.e. µ(ξ) goes over into
$$\mu'(\xi') = \mu(\xi'-a)\,. \qquad (13)$$
This shall be independent of a ; thus µ is constant as foreseen in Eq. (5). See also Eq. (33) in Sect. 2.3.
For N events x , conditioned by one and the same value of ξ , the model is
$$p_N(x|\xi) = \prod_{k=1}^{N} p(x_k-\xi)\,. \qquad (14)$$
The posterior is
$$P_N(\xi|x) = \frac{p_N(x|\xi)\,\mu(\xi)}{m(x)}\,, \qquad (15)$$
where
$$m(x) = \int d\xi\; p_N(x|\xi)\,\mu(\xi)\,. \qquad (16)$$
Since µ is constant its value drops out of the posterior and P becomes
$$P_N(\xi|x) = \frac{\prod_{k} p(x_k-\xi)}{\int d\xi\,\prod_{k} p(x_k-\xi)}\,. \qquad (17)$$
One can numerically calculate this expression, determine the shortest interval in which ξ is found with the probability of 99.73 percent, and thus obtain an error interval for ξ , see Chap. 3 of [25]. In Sect. 2.4 this procedure is replaced by considering the logarithm of the likelihood function. This will show that the posterior P N of Eq. (17) tends towards a Gaussian with increasing N .

The Maximum-Likelihood Estimator
The expression p N (x|ξ) of Eq. (14) is called a likelihood function L N (ξ) when it is considered as a function of ξ , while x is given.
A likelihood function in the context of translational invariance possesses a maximum since it is a normalisable function defined all along the real axis. When there are several maxima, one must look for the absolute maximum or even redefine the model such that there is an absolute maximum. The place ξ ML , where the maximum occurs, depends on the observed x . Hence, ξ ML = ξ ML (x) is a function of x . The value ξ ML is called the maximum likelihood (ML) estimator of the parameter ξ . For the example of the chi-squared model in Sect. 3.2 the ML estimator is given in appendix B.
The ML estimator has been introduced by R.A. Fisher [15,16,2]. It estimates the "true value" of ξ which conditions the observations x . For every finite N , however, the true value remains hidden. With N → ∞ the ML estimator converges to it. An example is given by the Gaussian model (7); the average (9) is its ML estimator. Neyman and Scott [45] have argued against ML estimation. Their argument says that a bias may remain between the expectation value of the ML estimator and the true value of ξ . This has caused a considerable debate [58,56]. The argument of Neyman and Scott is bound to the distinction between a "structural" and an "incidental" parameter. The model they studied is form invariant, which means that there is a Lie group of transformations that leaves it invariant. Each of the two parameters describes a subgroup. The elements of the different subgroups do not commute with each other. One of the subgroups describes translations, the other one describes dilations, see, e.g., Sect. 7.3 of [25]. The present article describes a more basic situation. We also have two parameters; however, the present subgroups are identical. Both are translational.
At least in the present context of translational form invariance such a bias becomes arbitrarily small with increasing N - as is shown for the chi-squared model of Eq. (45) in Sect. 3.2. It possesses such a bias; but the fact that the posterior tends to a Gaussian with increasing N implies that the bias goes to zero. Future research should show whether this holds also for the model discussed by Neyman and Scott.
We study the transition of the posterior to a Gaussian distribution with the help of the logarithm of P N given in Eq. (15). Since both µ(ξ) and m(x) are independent of ξ , this likelihood function is proportional to the product of the p(x k − ξ) , i.e.
$$L_N(\xi) \propto \prod_{k=1}^{N} p(x_k-\xi)\,.$$
Therefore the logarithm of L N is - up to an additive constant - the sum over the logarithms ln p(x k − ξ) ,
$$\ln L_N(\xi) = \sum_{k=1}^{N} \ln p(x_k-\xi) + {\rm const}\,.$$
For sufficiently large N the sum over the logarithms is expressed by the expectation value of ln p(x − ξ) , which in principle requires
$$\ln L_N(\xi) = N\int dx\; p(x-\xi_{\rm true})\,\ln p(x-\xi) + {\rm const}\,.$$
This integral is the expectation value of ln p(x − ξ) taken with the distribution conditioned by the "true value" of ξ . Jaynes and Bretthorst [31] have called it the "asymptotic" likelihood function, asymptotic in the sense of N → ∞ . The true value of ξ remains, however, hidden for every finite N . We replace it by the ML estimator ξ ML obtained from the observations x = (x 1 , . . . , x N ) . Then the "asymptotic" likelihood function becomes
$$\ln L_N(\xi) = N\int dx\; p(x-\xi_{\rm ML})\,\ln p(x-\xi) + {\rm const}\,. \qquad (21)$$
We want to define const such that ln L N becomes a Kullback-Leibler distance. This is reached when const is set to
$${\rm const} = -\,N\int dx\; p(x-\xi_{\rm ML})\,\ln p(x-\xi_{\rm ML})\,.$$
Then the "asymptotic" likelihood function becomes
$$\ln L_N(\xi) = N\int dx\; p(x-\xi_{\rm ML})\,\ln\frac{p(x-\xi)}{p(x-\xi_{\rm ML})}\,. \qquad (23)$$
We call this integral the functional
$$H(\xi_{\rm ML}|\xi) = \int dx\; p(x-\xi_{\rm ML})\,\ln\frac{p(x-\xi)}{p(x-\xi_{\rm ML})}\,. \qquad (24)$$
It is the negative of the Kullback-Leibler divergence [38,27,9] between the distributions conditioned by ξ ML and by ξ .
Thus the logarithmic likelihood function becomes
$$\ln L_N(\xi) = N\,H(\xi_{\rm ML}|\xi)\,. \qquad (25)$$
The Taylor expansion of H with respect to ξ will yield our criterion by requiring that terms of higher than the second order be negligible.
How large must N be for ln L N to become equal to the expectation value (23)? We follow the assumption that there is such an N . For this and larger N equation (23) holds. The logarithmic likelihood function of the Gaussian (8) is
$$\ln L_N(\xi) = N\,\Big[-\frac{(\langle x\rangle-\xi)^2}{2\sigma^2}\Big]\,, \qquad (26)$$
where the quantity in rectangular brackets is the functional (24) for the Gaussian (6),
$$H(\langle x\rangle|\xi) = -\frac{(\langle x\rangle-\xi)^2}{2\sigma^2}\,. \qquad (27)$$
The quantity < x > , given in Eq. (9), is the ML estimator of the Gaussian model. One sees that Eqs. (21) and (25) are fulfilled by the Gaussian model.
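For a non-Gaussian case, the relation (25) can be checked numerically. The sketch below simulates data from the chi-squared-type model p(x − ξ) = exp((x − ξ) − e^{x−ξ}) introduced in Eq. (45) of Sect. 3.2, and compares the left-hand side of (25), evaluated from the sample, with the functional (24) evaluated by quadrature; the ML estimator of this model is taken from appendix B. The chosen parameter values are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)

# Model of Eq. (45): p(x - xi) = exp((x - xi) - exp(x - xi)).
# Sampling: with u = x - xi, the quantity exp(u) is exponentially distributed with mean 1.
ln_p = lambda u: u - np.exp(u)
p = lambda u: np.exp(ln_p(u))

xi_true, N = 0.0, 20_000
x = xi_true + np.log(rng.exponential(1.0, size=N))
xi_ml = np.log(np.mean(np.exp(x)))              # ML estimator of this model, cf. appendix B

def H(xi_ml_, xi):
    """Functional (24), the negative Kullback-Leibler divergence, by quadrature."""
    f = lambda u: p(u) * (ln_p(u + xi_ml_ - xi) - ln_p(u))
    return quad(f, -20, 5)[0]

for xi in (-0.3, 0.1, 0.4):
    lhs = np.mean(ln_p(x - xi) - ln_p(x - xi_ml))   # (ln L_N(xi) - ln L_N(xi_ML)) / N
    print(round(lhs, 6), round(H(xi_ml, xi), 6))    # the two columns agree
```

For this model the agreement is exact up to quadrature error, not merely asymptotic, because the sample enters the left-hand side only through ξ_ML, the sufficient statistic.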
Equation (103) in appendix A shows that the maximum value of H(ξ ML |ξ) - and thus of the logarithmic likelihood - occurs at ξ = ξ ML for every distribution of the form p(x − ξ) .
The translational invariance of p given by Eq. (11) leads to a translational invariance of H since
$$H(\xi_{\rm ML}|\xi) = \int dx\; p(x-\xi_{\rm ML})\,\ln\frac{p(x-\xi)}{p(x-\xi_{\rm ML})}
 = \int dx'\; p(x'-\xi_{\rm ML}+\xi)\,\ln\frac{p(x')}{p(x'-\xi_{\rm ML}+\xi)}\,. \qquad (28)$$
This result has been obtained by substituting
$$x' = x-\xi\,. \qquad (29)$$
Thus H(ξ ML |ξ) depends on the difference ξ ML − ξ and only on this difference.
When there are isolated points where p(x') or p(x' − ξ ML + ξ) vanish, then appendix H shows that the integral exists despite the divergence of the integrand in (28).

The Fisher Information
The information that carries the name of R.A. Fisher [17] is a central concept of statistics and estimation theory. It provides the prior, if form invariance holds. The Fisher information F is defined as the expectation value of the squared derivative of the logarithmic likelihood function, see Refs. [16,17,18,19] and Sect. 4.2 of [21], i.e.
$$F(\xi) = \int dx\; p(x|\xi)\,\Big[\frac{\partial}{\partial\xi}\ln p(x|\xi)\Big]^2
 = -\int dx\; p(x|\xi)\,\frac{\partial^2}{\partial\xi^2}\ln p(x|\xi)\,. \qquad (30)$$
The two expressions equal each other because p is normalised to unity for every ξ , see appendix C. Note that the definition (30) refers to a single observation, see [42]. Other authors, however, define the Fisher information for multiple observations [47]. The integration in (30) runs over x ; for the translation model (11) one may substitute x' = x − ξ , whereupon ξ drops out of the integral. Thus the Fisher information of the model (11) is independent of ξ . This is a consequence of form invariance under translations.
By the definitions of H in (24) and F in (30) one recognises that
$$\frac{\partial^2}{\partial\xi^2}H(\xi_{\rm ML}|\xi)\Big|_{\xi=\xi_{\rm ML}} = -F\,. \qquad (32)$$
Since F is positive, the second derivative of H is seen to be negative at ξ = ξ ML . This means that the second derivative of L(ξ) is negative, as it should be at a maximum.
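Both lines of Eq. (30) and the relation (32) can be verified numerically for a concrete translation model. The sketch below does so for the model p(x − ξ) = exp((x − ξ) − e^{x−ξ}) of Eq. (45), used as an example in Sect. 3.2; the derivatives of ln p written in the code follow from that specific form.

```python
import numpy as np
from scipy.integrate import quad

# Translation model of Eq. (45): p(x - xi) = exp((x - xi) - exp(x - xi)), written in u = x - xi
ln_p  = lambda u: u - np.exp(u)
p     = lambda u: np.exp(ln_p(u))
dlnp  = lambda u: -1.0 + np.exp(u)      # d/dxi ln p(x - xi)
d2lnp = lambda u: -np.exp(u)            # d^2/dxi^2 ln p(x - xi)

# First line of Eq. (30): expectation value of the squared score
F1 = quad(lambda u: p(u) * dlnp(u) ** 2, -20, 5)[0]
# Second line of Eq. (30): minus the expectation value of the second derivative
F2 = -quad(lambda u: p(u) * d2lnp(u), -20, 5)[0]

# Eq. (32): the curvature of H(xi_ML|xi) at xi = xi_ML equals -F
def H(alpha):                            # alpha = xi_ML - xi; H depends only on alpha
    return quad(lambda u: p(u) * (ln_p(u + alpha) - ln_p(u)), -20, 5)[0]

h = 1e-2
H2 = (H(h) - 2 * H(0.0) + H(-h)) / h ** 2

print(round(F1, 6), round(F2, 6), round(-H2, 6))   # all three close to 1
```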
The prior distribution µ is proportional to the square root of F ,
$$\mu(\xi) \propto \sqrt{F(\xi)}\,, \qquad (33)$$
according to Hartigan [26] and Jaynes [30]. A measure is the inverse of a unit of length on the scale of ξ .
By identifying the integration measure on the scale of ξ with the Bayesian prior distribution, we follow Jeffreys [32], as was done by other authors [33,29,35,48]. The prior is independent of N . Since it is independent of ξ and ξ ML , a given difference ξ ML − ξ describes the same distance everywhere on the scale of ξ . See also [3,5,60,59,25] and footnote 1.

The Posterior of a Form Invariant Model
From Eq. (25) and the fact that the prior is constant it follows that the posterior distribution is
$$P_N(\xi|x) = {\cal N}_N\,\exp\big(N\,H(\xi_{\rm ML}|\xi)\big)\,, \qquad (34)$$
where ${\cal N}_N$ normalises P N to unity. The posterior from N observations has the form of the N -th power of the posterior for N = 1 . Therefore with increasing N the posterior is more and more restricted to the immediate neighborhood of the maximum at ξ = ξ ML . At this maximum H(ξ ML |ξ ML ) vanishes, since H equals the negative Kullback-Leibler distance between the distributions p(x − ξ ML ) and p(x − ξ) , which is zero when the two distributions coincide. Thus for ξ = ξ ML the exponential in (34) has the value of unity,
$$\exp\big(N\,H(\xi_{\rm ML}|\xi_{\rm ML})\big) = 1\,. \qquad (35)$$
With increasing N the curvature of the likelihood function exp(N H(ξ ML |ξ)) at its maximum increases according to N F , where F is the Fisher information at N = 1 . The Fisher information is independent of ξ in the present context.
Thus the likelihood function behaves as the Gaussian (8), see Eqs. (26) and (27). It does so in a suitable interval around the maximum value (35) of the likelihood. Thus within that interval the likelihood becomes a Gaussian function.
One also sees that the posterior depends on the observations x via and only via the estimator ξ ML = ξ ML (x) ; this means that ξ ML (x) is the sufficient statistic [36]. The notion of sufficient statistic has been widely discussed in the development of the Rasch model of item response theory [53,49,13,14,55,21,25]. We shall come to it in Sect. 4.5 where the binomial model is treated.

The Criterion for the Gaussian Approximation
A criterion will now be formulated that allows one to find the minimal N so that the Gaussian distribution (8) can be accepted as approximating the posterior P N in Eq. (34).
The Gaussian approximation is justified whenever the Taylor expansion of ln L N (ξ) up to the second order in ξ is "sufficiently precise". This is the expansion of H(ξ ML |ξ) with respect to ξ at the point ξ = ξ ML . In Sect. 3.1 the Taylor expansion is written down and our understanding of "sufficiently precise" is defined. This yields the desired criterion. In Sect. 3.2, the criterion is applied to a chi-squared model.

Formulation of the Criterion
The Taylor expansion of H up to an arbitrary order n is
$$H(\xi_{\rm ML}|\xi) = \sum_{\nu=0}^{n}\frac{1}{\nu!}\,\frac{\partial^{\nu}}{\partial\xi^{\nu}}H(\xi_{\rm ML}|\xi)\Big|_{\xi=\xi_{\rm ML}}\,(\xi-\xi_{\rm ML})^{\nu} \;+\; R\,. \qquad (36)$$
The quantity R is the remainder. The zero-th order, ν = 0 , of this expansion vanishes. We expand up to n = 2 and choose Lagrange's remainder among several versions suggested in Sect. 0.317 of [22]. This is
$$R = \frac{1}{3!}\,\frac{\partial^{3}}{\partial\xi^{3}}H(\xi_{\rm ML}|\xi)\Big|_{\xi=\xi_{\rm ML}+\Theta(\xi-\xi_{\rm ML})}\,(\xi-\xi_{\rm ML})^{3}\,. \qquad (37)$$
The value of Θ remains open except for the fact that it lies between 0 and 1 .
After N observations the Gaussian (8) is accepted as a valid approximation if the remainder R is negligible for all ξ in the interval
$$|\xi-\xi_{\rm ML}| \le 3\sigma\,, \qquad (38)$$
containing 99.73 percent of the area of the Gaussian function. We call it the 3σ-interval of the Gaussian approximation of the model p . Here,
$$\sigma^{-2} = N\,F \qquad (39)$$
is the expected curvature of the logarithmic likelihood function. The Fisher information together with the measure µ has been defined in Sect. 2.3.
The term of ν = 0 in the expansion (36) vanishes according to the definition (24) of the functional H . The first derivative ∂/∂ξ H vanishes at ξ = ξ ML because H attains its maximum value for ξ = ξ ML . The second derivative ∂²/∂ξ² H at ξ = ξ ML yields the negative Fisher information according to Eq. (32). The third derivative - in the remainder - is calculated with the help of the translational invariance (28) of H . This gives
$$H^{(3)}(\xi_{\rm ML}|\xi) \;\equiv\; \frac{\partial^{3}}{\partial\xi^{3}}H(\xi_{\rm ML}|\xi)\,, \qquad (40)$$
a function of the difference ξ ML − ξ alone. The superscript (3) denotes the threefold derivative with respect to ξ .
In order to decide whether the remainder R is small enough to be neglected, we must find the largest absolute value |H (3) | max of (40) within the interval (38). The largest value of (ξ − ξ ML ) 3 occurs at the upper end of the interval (38), whence the maximum absolute value of the remainder is
$$|R|_{\rm max} = \frac{|H^{(3)}|_{\rm max}}{3!}\,(3\sigma)^{3}\,. \qquad (41)$$
This shall be small compared to the second-order term everywhere in the interval (38). Thus we require
$$\frac{|H^{(3)}|_{\rm max}}{3!}\,(3\sigma)^{3} \;\ll\; \frac{F}{2}\,(3\sigma)^{2}\,. \qquad (42)$$
This gives, by use of (39),
$$|H^{(3)}|_{\rm max} \;\ll\; F^{3/2}\,\sqrt{N}\,. \qquad (43)$$
We consider this condition to be fulfilled if
$$10\,|H^{(3)}|_{\rm max} \;\le\; F^{3/2}\,\sqrt{N}\,. \qquad (44)$$
The last inequality is our criterion for the validity of the Gaussian approximation. Replacing the requirement (43) by (44), we follow a convention: a step of one "order of magnitude" is usually considered to realise the requirement that one object be large compared to another one.
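In code, the criterion (44) amounts to a small search over N. The helper below is a sketch of that logic under the convention of Eqs. (41)-(44); the function name and the toy input are ours, and the bound on |H'''| must be supplied by the model at hand, as is done for the two examples of Sects. 3.2 and 5.

```python
def minimal_N(F, H3_max_of_sigma, N_max=100_000):
    """Smallest N for which the third-order remainder (41) stays below one tenth of
    the second-order term over the 3-sigma interval (38); sigma is (N*F)**(-1/2).
    H3_max_of_sigma(sigma) must return max |H'''| over |xi - xi_ML| <= 3*sigma."""
    for N in range(1, N_max + 1):
        sigma = (N * F) ** -0.5
        remainder = H3_max_of_sigma(sigma) * (3 * sigma) ** 3 / 6.0
        second_order = F * (3 * sigma) ** 2 / 2.0
        if remainder <= 0.1 * second_order:
            return N
    raise ValueError("criterion not met up to N_max")

# Toy usage with a made-up, sigma-independent bound |H'''| = 1 and F = 1:
# the criterion then reduces to sqrt(N) >= 10, i.e. N = 100.
print(minimal_N(F=1.0, H3_max_of_sigma=lambda sigma: 1.0))
```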

The Chi-Squared Model
As an example let us consider the model
$$p(x-\xi) = \exp\big((x-\xi) - e^{\,x-\xi}\big)\,. \qquad (45)$$
It is normalised to unity, see Sect. 8.312, no. 10 of [22]. This is a chi-squared distribution with two degrees of freedom, transformed such that it depends on the difference x − ξ between event and parameter, see Eq. (114) in appendix D.
The ML estimator is ξ ML = x , see Eq. (106) in appendix B.
In the expansion (36) the terms of order ν = 0, 1 vanish according to Sect. 3.1. The derivatives of H are obtained as follows. The first line of (28) with (45) gives
$$\frac{\partial}{\partial\xi}H(\xi_{\rm ML}|\xi) = \int dx\; p(x-\xi_{\rm ML})\,\frac{\partial}{\partial\xi}\big[(x-\xi)-e^{\,x-\xi}\big]
 = \int dx\; p(x-\xi_{\rm ML})\,(-1) \;+\; \int dx\; p(x-\xi_{\rm ML})\,e^{\,x-\xi}\,. \qquad (46)$$
In the last line the first term on the r.h.s. has the value −1 due to the normalisation of the model (45). Whence, we obtain
$$\frac{\partial}{\partial\xi}H(\xi_{\rm ML}|\xi) = -1 + \int dx\; p(x-\xi_{\rm ML})\,e^{\,x-\xi}\,. \qquad (47)$$
The substitution
$$x' = x - \xi_{\rm ML} \qquad (48)$$
turns the remaining integral into
$$e^{\,\xi_{\rm ML}-\xi}\int dx'\;\exp\big(x'-e^{x'}\big)\,e^{x'}\,. \qquad (49)$$
For ξ = ξ ML this becomes
$$\frac{\partial}{\partial\xi}H(\xi_{\rm ML}|\xi)\Big|_{\xi=\xi_{\rm ML}} = -1 + \int dx'\;\exp\big(x'-e^{x'}\big)\,e^{x'} = -1 + \Gamma(2)\,, \qquad (50)$$
see appendix E. This yields
$$\frac{\partial}{\partial\xi}H(\xi_{\rm ML}|\xi)\Big|_{\xi=\xi_{\rm ML}} = 0\,. \qquad (51)$$
This result was expected since H(ξ ML |ξ) has an extreme value at ξ = ξ ML .
The second derivative of H with respect to ξ is
$$\frac{\partial^{2}}{\partial\xi^{2}}H(\xi_{\rm ML}|\xi) = -\int dx\; p(x-\xi_{\rm ML})\,e^{\,x-\xi} = -\,e^{\,\xi_{\rm ML}-\xi}\,. \qquad (52)$$
The definition of H(ξ ML |ξ) in the first line of Eq. (28) together with the substitution (48) yields a symmetry relation of the functional H ,
$$H(\xi_{\rm ML}|\xi) = \int dx'\;\exp\big(x'-e^{x'}\big)\,\ln\frac{p(x'+\xi_{\rm ML}-\xi)}{p(x')} = (\xi_{\rm ML}-\xi) + 1 - e^{\,\xi_{\rm ML}-\xi}\,, \qquad (53)$$
which depends on ξ ML and ξ only via their difference. The last line of Eq. (52) yields the derivatives
$$\frac{\partial^{\nu}}{\partial\xi^{\nu}}H(\xi_{\rm ML}|\xi) = -\,(-1)^{\nu}\,e^{\,\xi_{\rm ML}-\xi}\,,\qquad \nu \ge 2\,. \qquad (54)$$
One especially obtains the Fisher information
$$F = -\,\frac{\partial^{2}}{\partial\xi^{2}}H(\xi_{\rm ML}|\xi)\Big|_{\xi=\xi_{\rm ML}} = 1\,. \qquad (55)$$
Equation (54) gives the maximum |H (3) | max of the absolute value of (40) in the interval (38) to be
$$|H^{(3)}|_{\rm max} = e^{\,3\sigma}\,. \qquad (56)$$
Then Eq. (44) yields the criterion
$$10\,e^{\,3\sigma} \;\le\; \sqrt{N}\,,\qquad \sigma = (N F)^{-1/2}\,. \qquad (57)$$
Thus the posterior of the chi-squared distribution (45) with two degrees of freedom can be considered Gaussian when the number N of events is larger than or equal to 160 . This shows that an intuition based on the central limit theorem, where a current "rule of thumb" says N ≈ 30 , can be misleading. See Sect. 7.4.2 of [54] or Hogg et al. [28].
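The numbers above can be reproduced with a few lines of code. The sketch below evaluates F and E[e^u] for the model (45) by quadrature and then searches for the smallest N that satisfies the criterion; the bound |H'''|_max = exp(3σ) follows from Eq. (54). The minimal N found this way comes out at about 160; the exact integer depends on how the marginally fulfilled inequality is rounded.

```python
import numpy as np
from scipy.integrate import quad

# Chi-squared model of Eq. (45): p(x - xi) = exp((x - xi) - exp(x - xi))
p = lambda u: np.exp(u - np.exp(u))

# Fisher information of one observation: F = E[(exp(u) - 1)^2]
F = quad(lambda u: p(u) * (np.exp(u) - 1.0) ** 2, -20, 5)[0]
# E[exp(u)] = Gamma(2) = 1, needed for the third derivative of H
E_eu = quad(lambda u: p(u) * np.exp(u), -20, 5)[0]

def gaussian_ok(N):
    sigma = (N * F) ** -0.5
    H3_max = np.exp(3 * sigma) * E_eu          # |H'''| is largest at xi_ML - xi = 3*sigma
    # third-order remainder <= one tenth of the second-order term at the edge of (38)
    return H3_max * (3 * sigma) ** 3 / 6 <= 0.1 * F * (3 * sigma) ** 2 / 2

N_min = next(N for N in range(1, 10_000) if gaussian_ok(N))
print(round(F, 6), N_min)                       # F = 1, N_min close to 160
```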
To find the necessary N , we have not used the article by Berry [6] on the accuracy of the Gaussian approximation, because Berry does not discuss the measure on the scales of the variables.

The Binomial Model
In the following Sects. 4 and 5 the binomial model q(x|ξ) is considered. It is also called the model of the "simple alternative" since its event x is restricted to two possible values, x = 0 or x = 1 , while ξ is a continuous variable which conditions the probability to find x . As a consequence, form invariance cannot be expressed by the difference between x and ξ . Yet form invariance under translations is found by reformulating the model such that it depends on the difference between the ML estimator ξ ML and the parameter ξ , and only on this difference.

Definition of the Binomial Model
The binomial model is given by
$$q(x|\xi) = R(\xi)^{\,x}\,\big(1-R(\xi)\big)^{1-x}\,,\qquad x = 0, 1\,. \qquad (59)$$
Here, x = 1 can be interpreted as the "correct" and x = 0 as the "false" answer to a given question. Instead of correct or false, the two possibilities may also mean one or the other side of a thrown coin. The function R of the continuous variable ξ is called item response function.
The name of item response function (IRF) is due to authors who discussed the ideas of "item response theory" [43,12,41,44,53,50,51,52] applied in intelligence tests as well as educational tests, e.g. the PISA tests [46]. Differing from them we define the IRF by the requirement that the measure µ on the scale of ξ should be constant. This means that the Fisher information F shall be independent of ξ , see Eq. (33). The Fisher information of the binomial model is defined much as in Eq. (4), except for the integral over x which becomes a sum over x = 0, 1 . We make use of the second version of Eq. (30) and obtain a differential equation for R , namely $R'(\xi)^2 = F\,R(\xi)\,[1-R(\xi)]$ with constant F . It is solved by
$$R(\xi) = \cos^2\xi\,,\qquad -\pi/2 \le \xi \le \pi/2\,. \qquad (60)$$
The binomial model (59) with the IRF (60) is called the trigonometric model.

By the definition of the IRF (60) the Fisher information of the model (59) becomes
$$F(\xi) \equiv 4\,, \qquad (61)$$
see appendix J; whence the measure (33) on the scale of ξ is constant. G. Rasch did not consider the measure on the scale of the IRF; he derived the IRF by requiring the property of "specific objectivity" of any measurement [53,50,51,52]. By this property the score of correct answers became the sufficient statistic in his model. The trigonometric IRF, however, is compatible with specific objectivity, see Sect. 3.1 of [21].
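A short numerical check of Eq. (61): writing the binomial Fisher information as F(ξ) = R'(ξ)²/{R(ξ)[1 − R(ξ)]}, as in the differential equation above, and inserting the trigonometric IRF (60) gives the constant value 4 for any ξ in the open interval.

```python
import numpy as np

# Trigonometric IRF (60) and its derivative
R  = lambda xi: np.cos(xi) ** 2
dR = lambda xi: -np.sin(2 * xi)

# Binomial Fisher information F(xi) = R'^2 / (R (1 - R)), summed over x = 0, 1
fisher = lambda xi: dR(xi) ** 2 / (R(xi) * (1.0 - R(xi)))

for xi in (-1.2, -0.5, 0.3, 1.0):
    print(round(fisher(xi), 12))      # 4.0 everywhere, cf. Eq. (61) and appendix J
```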
For N events x = (x 1 , . . . , x N ) conditioned by one and the same ξ , the binomial model reads
$$q_N(x|\xi) = \prod_{k=1}^{N} R(\xi)^{\,x_k}\,\big(1-R(\xi)\big)^{1-x_k}\,. \qquad (63)$$
This can be rewritten with the help of the score s c of the answers that yield x = 1 as well as the number N − s c of the answers that yield x = 0 . One obtains
$$q_N(s_c|\xi) = \binom{N}{s_c}\,R(\xi)^{\,s_c}\,\big(1-R(\xi)\big)^{N-s_c}\,. \qquad (64)$$
The quantity $\binom{N}{s_c}$ is a binomial coefficient. By applying the binomial formula one finds that q N is normalised,
$$\sum_{s_c=0}^{N}\binom{N}{s_c}\,R^{\,s_c}\,(1-R)^{N-s_c} = 1\,. \qquad (65)$$
The posterior distribution will be obtained in Sect. 4.5.

The ML Estimator for the Trigonometric Model
Given the event s c in the framework of the distribution (64), the ML estimator is found by solving the ML equation
$$\frac{\partial}{\partial\xi}\ln q_N(s_c|\xi)\Big|_{\xi=\xi_{\rm ML}} = 0\,. \qquad (66)$$
This leads to
$$R'(\xi_{\rm ML})\,\Big[\frac{s_c}{R(\xi_{\rm ML})} - \frac{N-s_c}{1-R(\xi_{\rm ML})}\Big] = 0\,. \qquad (67)$$
With the IRF (60) this ML-equation becomes
$$0 = \frac{2\,\big(N\cos^2\xi_{\rm ML} - s_c\big)}{\cos\xi_{\rm ML}\,\sin\xi_{\rm ML}}\,, \qquad (68)$$
which is solved by ξ ML (s c ) such that
$$\cos^2\xi_{\rm ML} = \frac{s_c}{N}\,. \qquad (69)$$
The denominator of the r.h.s. of (68) seems to exclude the values ξ ML = ±π/2 and ξ ML = 0 because it vanishes there. These values correspond to the "uniform answers" s c = 0 and s c = N , respectively. We show that the uniform answers are not excluded. In the case of s c = 0 equation (68) need not be invoked: the likelihood function is proportional to $\sin^{2N}\xi$ , which attains its maximum at
$$\xi_{\rm ML} = \pm\pi/2\,. \qquad (70)$$
In the case of s c = N the likelihood function is proportional to $\cos^{2N}\xi$ and the maximum occurs at
$$\xi_{\rm ML} = 0\,. \qquad (71)$$
Hence the uniform answers, too, possess well-defined ML estimators; they lie at the centre and at the borders of the interval of definition of ξ .
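A small sketch of Eq. (69) in code: given N and the score s_c, the closed form cos²ξ_ML = s_c/N is compared with a brute-force maximisation of the likelihood (64). The choice of the negative branch for ξ_ML, like the example values of N and s_c, is a convention made only for this illustration.

```python
import numpy as np

# Log-likelihood of the score s_c out of N answers, Eq. (64), with the trigonometric IRF (60)
def log_L(xi, s_c, N):
    R = np.cos(xi) ** 2
    return s_c * np.log(R) + (N - s_c) * np.log(1.0 - R)

N, s_c = 20, 13
xi_ml = -np.arccos(np.sqrt(s_c / N))      # closed form of Eq. (69), negative branch

grid = np.linspace(-np.pi / 2 + 1e-6, -1e-6, 200_001)   # brute force on the same branch
xi_grid = grid[np.argmax(log_L(grid, s_c, N))]

print(round(xi_ml, 4), round(xi_grid, 4))  # the two values agree to the grid resolution
```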

The Likelihood Function of the Trigonometric Model in the case of N = 1
The likelihood function of the trigonometric model q N =1 with R according to (60) is form invariant. To show this we study again the two cases of s c = 0, 1 .
(i) Let s c = 0 be observed. Then the likelihood function is proportional to 1 − R(ξ) and the ML estimator is ξ ML = ±π/2 . Thus the likelihood function is
$$L_{N=1}(\xi) \propto \sin^2\xi\,,\qquad -\pi/2 < \xi < \pi/2\,. \qquad (72)$$
This can also be expressed as
$$L_{N=1}(\xi) \propto \cos^2(\xi-\xi_{\rm ML})\,. \qquad (73)$$
(ii) Now let s c = 1 be observed. Then the likelihood function is proportional to R(ξ) , whence the ML estimator is ξ ML = 0 , according to the case of Eq. (71). Thus the likelihood function is
$$L_{N=1}(\xi) \propto \cos^2\xi\,. \qquad (74)$$
Since now ξ ML = 0 this is again expressed by Eq. (73).
Thus, if N = 1 and R is given by (60), the likelihood function of the trigonometric model depends on the difference ξ − ξ ML and only on this difference. This is what we call form invariance under translations. In summary: the posterior of the trigonometric model q is obtained with a constant prior and it is form invariant for N = 1 .
The result (73) is found for every value of ξ ML given by Eq. (69). Thus it remains true for any number N of observations. The posterior of the trigonometric model is form invariant under translations in the sense that it depends on the difference ξ − ξ ML . However, the variables ξ and ξ ML do not "make the same use" of the interval −π/2 < ξ < π/2 given in Eq. (72): the variable ξ is defined everywhere on this interval, whereas the ML estimator ξ ML assumes only a finite number of values within that interval. In the following Sect. 4.4 the trigonometric model itself - not only its posterior - will be brought into translational form invariance.

The Trigonometric Model with Translational Invariance
For arbitrary N the ML estimator given by Eq. (69) is not restricted to two values. For sufficiently large N any value in the interval −π/2 ≤ ξ ML ≤ 0 can be approached arbitrarily closely. Is it possible to define the binomial model such that the model itself is translationally form invariant? Yes, this is possible and leads to the same Fisher information (61).
Consider the model
$$t(s|\xi) = {\cal N}^{-1}\cos^2(s-\xi)\,,\qquad -\pi/2 \le s \le \pi/2\,, \qquad (75)$$
which depends on the difference between the observation s and the condition ξ .
According to the argument of the preceding section on the posterior of the trigonometric model, the ML estimator comes arbitrarily close to every value in the interval −π/2 ≤ ξ ML ≤ π/2 since, for arbitrary N , the quantity cos²ξ ML assumes every rational value s c /N between 0 and 1 . Thus the trigonometric model can be extended to the model (75). The fact that ξ ML is defined within −π/2 ≤ ξ ML ≤ π/2 means that ξ is defined in the same interval.
We shall show that (75) is translationally form invariant and has the Fisher information (61) in agreement with the trigonometric model (59) with (60).
The normalisation ${\cal N}$ in (75) equals the integral over the domain of s ; it is independent of ξ because the domain given in (75) is an interval of length π , which is equal to one period of the periodic function cos²(s − ξ) . Thus ${\cal N}$ is given by the integral
$${\cal N} = \int_{-\pi/2}^{\pi/2} ds\;\cos^2(s-\xi)\,, \qquad (76)$$
which has the value
$${\cal N} = \frac{\pi}{2}\,, \qquad (77)$$
see appendix G.
The ML estimator ξ ML for a given event s occurs at
$$\xi_{\rm ML} = s\,, \qquad (78)$$
since the likelihood function t(s|ξ) of (75) becomes maximal when the argument of the cos²-function is zero.
The posterior of t(s|ξ) shall be a function of ξ ML − ξ as it is in the context of the model (45) in Sect. 3. This requires shifting - within the posterior - the value of ξ ML to the value of zero. We can do so since the measure on the scale of ξ is constant. The shift avoids that the distance |ξ ML − ξ| becomes larger than the length π of the interval in which ξ is defined. In the context of the model (45) such a shift was not needed since every difference ξ ML − ξ was contained in the infinite domain −∞ < ξ < ∞ , where ξ was defined.
Since ξ ML is the sufficient statistic of the model (75), the score s c of correct answers determines the sufficient statistic when N is given. In this sense the model (75) confirms the requirement of G. Rasch [53,49,55,13,14] that the score should be the sufficient statistic of the binomial model. See especially Chap. 4.6 of [21] and Chap. 12.3.1 of [25], where so-called Guttman schemes [23] are analysed.

The Fisher information of the model (75) is
$$F = -\int_{-\pi/2}^{\pi/2} ds\; t(s|\xi)\,\frac{\partial^2}{\partial\xi^2}\ln t(s|\xi) \qquad (79)$$
by Eq. (30). For the second derivative in the integrand one finds
$$\frac{\partial^2}{\partial\xi^2}\ln t(s|\xi) = -\,\frac{2}{\cos^2(s-\xi)}\,. \qquad (80)$$
Together with (77) this gives
$$F = \frac{2}{{\cal N}}\int_{-\pi/2}^{\pi/2} ds = 4\,. \qquad (81)$$
Thus the Fisher information of the translationally invariant model (75) agrees with the Fisher information (61) obtained from the binomial model (59).

The Functional H for the Trigonometric Model with Translational Invariance
We express the logarithmic likelihood function ln L N for N events given by the model (75) in analogy to Eq. (25) as
$$\ln L_N(\xi) = N\int ds\; t(s|\xi_{\rm ML})\,\ln\frac{t(s|\xi)}{t(s|\xi_{\rm ML})} = N\,H(\xi_{\rm ML}|\xi)\,. \qquad (82)$$
The integration extends over one period of the cos²-function. Inserting (75) into (82) one obtains
$$H(\xi_{\rm ML}|\xi) = \frac{2}{\pi}\int_{-\pi/2}^{\pi/2} ds\;\cos^2(s-\xi_{\rm ML})\,\ln\frac{\cos^2(s-\xi)}{\cos^2(s-\xi_{\rm ML})}\,. \qquad (83)$$
The functional H(ξ ML |ξ) exists and has the property (28) because the integral in (83) exists, although the function cos² s' with, e.g.,
$$s' = s - \xi \qquad (84)$$
vanishes at an isolated point s 0 , so that ln cos² s 0 diverges. Appendix H shows that the integral over an ε-interval that includes s 0 does exist. Here, ε may be arbitrarily small. The value of this integral is not proportional to ε ; instead it is proportional to ε ln ε . Although this is not as small as ε , it approaches zero when ε → 0 . Thus the logarithmic likelihood ln L N of Eq. (82) exists even if the integration runs over a point where ln cos² diverges.
Applying the substitution (84) to the integrand in (83) one obtains
$$H(\xi_{\rm ML}|\xi) = \frac{2}{\pi}\int_{-\pi/2}^{\pi/2} ds'\;\cos^2\big(s'-(\xi_{\rm ML}-\xi)\big)\,\ln\frac{\cos^2 s'}{\cos^2\big(s'-(\xi_{\rm ML}-\xi)\big)}\,. \qquad (85)$$
The substitution does not require a shift of the limits of integration because the integrand is periodic with a period of length π and the integration covers one period. Thus the functional H(ξ ML |ξ) for the translationally invariant trigonometric model has the property (28) which was found earlier for models defined on the entire real axis.
The fact that ln L N equals N times H(0| − α) , with α = ξ ML − ξ , shows that Eq. (34) holds here, too: The posterior distribution from N events of the trigonometric model is proportional to the N -th power of the posterior from one event.

The Gaussian Approximation to the Trigonometric Model with Translational Invariance
The present section establishes the criterion for the validity of the Gaussian approximation to the likelihood function of the trigonometric model with translational invariance. In analogy to the procedure in Sect. 3 we expand H(ξ ML |ξ) into a Taylor series with respect to ξ at ξ ML . Again the zero-th order of this expansion vanishes by the definition of H in Eq. (24). Again the likelihood function is related to H via Eq. (25). Using the abbreviation (120), α = ξ ML − ξ , introduced in appendix F, we obtain
$$H(\xi_{\rm ML}|\xi) = H(0|\alpha)\,. \qquad (86)$$
Appendix F furthermore shows that the expression (86) is mirror symmetric with respect to ξ ML − ξ = 0 , i.e.
$$H(0|\alpha) = H(0|-\alpha)\,. \qquad (87)$$
Therefore all odd derivatives of H(0|ξ ML − ξ) vanish at ξ ML − ξ = 0 , thus
$$\frac{\partial^{2\nu+1}}{\partial\alpha^{2\nu+1}}H(0|\alpha)\Big|_{\alpha=0} = 0\,,\qquad \nu = 0, 1, 2, \dots\,. \qquad (88)$$
By consequence, the remainder of the present expansion is not of the third order as it is in Sect. 3.1; it rather is of the fourth order.
All even derivatives ∂ 2ν /∂α 2ν H(0|α) exist. Therefore in the present context the Taylor expansion (36) becomes
$$H(0|\alpha) = -\,\frac{F}{2}\,\alpha^2 + \frac{\alpha^4}{4!}\,\frac{\partial^4}{\partial\alpha'^4}H(0|\alpha')\Big|_{\alpha'=\Theta\alpha}\,,\qquad 0<\Theta<1\,. \qquad (89)$$
In analogy to Sect. 3.1 the Gaussian approximation is considered valid when the term of fourth order is negligible as compared to the term of second order for all ξ − ξ ML in the interval (38). See FIG. 2. For these values of ξ we have to find the maximum of the absolute value of the fourth derivative in Eq. (89). We call it |H (4) | max . In analogy to Eq. (42) the Gaussian approximation is accepted when
$$\frac{|H^{(4)}|_{\rm max}}{4!}\,(3\sigma)^{4} \;\ll\; \frac{F}{2}\,(3\sigma)^{2}\,, \qquad (90)$$
i.e.
$$|H^{(4)}|_{\rm max} \;\ll\; \frac{4}{3}\,F\,\sigma^{-2}\,. \qquad (91)$$
As in Eq. (39), σ −2 equals N F , and we obtain the condition
$$|H^{(4)}|_{\rm max} \;\ll\; \frac{4}{3}\,N\,F^{2}\,. \qquad (92)$$
In appendix F the value of |H (4) | max is found to be
$$|H^{(4)}|_{\rm max} = 16\,. \qquad (93)$$
The value of the Fisher information F is given by Eq. (81). Whence, the condition (92) for the Gaussian approximation becomes
$$16 \;\ll\; \frac{4}{3}\,N\cdot 16\,,\qquad\mbox{i.e.}\qquad 1 \;\ll\; \frac{4}{3}\,N\,. \qquad (94)$$
We consider it to be fulfilled if
$$\frac{4}{3}\,N \;\ge\; 10\,, \qquad (95)$$
that is
$$N \;\ge\; 8\,. \qquad (96)$$
Thus the condition (96) for the Gaussian approximation to the binomial model is much more easily fulfilled than the corresponding condition (44) for the chi-squared model. The reason is: in Sect. 3.1 the remainder was of third order; in the present case it is of fourth order.
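The numbers F = 4, |H⁽⁴⁾|max = 16 and N ≥ 8 can be reproduced numerically. The sketch below evaluates the Fisher information of the model (75) and the functional H(0|α) of Eq. (83) by quadrature, estimates the fourth derivative at α = 0 by a finite difference, and applies the fourth-order criterion; the step size h and the integration settings are arbitrary choices made for this check.

```python
import numpy as np
from scipy.integrate import quad

# Translationally invariant trigonometric model (75): t(s|xi) = (2/pi) cos^2(s - xi)
t = lambda u: (2.0 / np.pi) * np.cos(u) ** 2

# Fisher information of one observation: F = E[(2 tan u)^2] = 4, cf. Eq. (81)
F = quad(lambda u: t(u) * (2.0 * np.tan(u)) ** 2, -np.pi / 2, np.pi / 2)[0]

def H(alpha):
    """Functional H(0|alpha) of Eq. (83); the log singularity is integrable (appendix H)."""
    if alpha == 0.0:
        return 0.0
    sing = [z for z in (alpha - np.pi / 2, alpha + np.pi / 2) if -np.pi / 2 < z < np.pi / 2]
    val, _ = quad(lambda u: t(u) * (np.log(np.cos(u - alpha) ** 2) - np.log(np.cos(u) ** 2)),
                  -np.pi / 2, np.pi / 2, points=sing, limit=200)
    return val

h = 0.05                                   # finite-difference step for the fourth derivative
H4 = (H(2 * h) - 4 * H(h) + 6 * H(0.0) - 4 * H(-h) + H(-2 * h)) / h ** 4

def gaussian_ok(N):
    sigma = (N * F) ** -0.5
    # fourth-order remainder <= one tenth of the second-order term at the edge of (38)
    return abs(H4) * (3 * sigma) ** 4 / 24 <= 0.1 * F * (3 * sigma) ** 2 / 2

N_min = next(N for N in range(1, 1000) if gaussian_ok(N))
print(round(F, 4), round(H4, 2), N_min)    # close to 4, close to 16, and N_min = 8
```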
The step from (94) to (95) is due to the common idea that a given positive x is large compared to y > 0 when x is one order of magnitude larger than y . This idea is commonly used for approximations in mathematics [8] and physics [10]. If a higher accuracy is required, this rule can easily be adapted.
In Chap. 12 of [25] a version of Item Response Theory is presented which makes use of the trigonometric IRF (60).
In that context a (simulated) competence test was discussed which asked 20 questions. From the present result (96) it follows that the estimated person-parameters have a Gaussian posterior distribution.
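As an illustration of this claim, the sketch below compares, for N = 20 questions and an arbitrarily chosen score, the exact posterior of the trigonometric model (64) (constant prior, one branch of ξ) with the Gaussian approximation centred at ξ_ML and having the width (NF)^{-1/2}, F = 4.

```python
import numpy as np

N, s_c = 20, 13                                   # 20 questions, example score s_c = 13
xi = np.linspace(-np.pi / 2 + 1e-9, -1e-9, 400_001)
dxi = xi[1] - xi[0]

# Exact posterior with constant prior, Eq. (64): proportional to cos^(2 s_c) * sin^(2(N - s_c))
log_post = s_c * np.log(np.cos(xi) ** 2) + (N - s_c) * np.log(np.sin(xi) ** 2)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dxi

# Gaussian approximation: centre from Eq. (69), width sigma = (N F)^(-1/2) with F = 4
xi_ml = -np.arccos(np.sqrt(s_c / N))
sigma = (4 * N) ** -0.5

mean = np.sum(xi * post) * dxi
std = np.sqrt(np.sum((xi - mean) ** 2 * post) * dxi)
in_3sigma = np.sum(post[np.abs(xi - xi_ml) <= 3 * sigma]) * dxi

print(round(xi_ml, 3), round(mean, 3))            # centre of the exact posterior vs. xi_ML
print(round(sigma, 4), round(std, 4))             # Gaussian width vs. exact posterior width
print(round(in_3sigma, 4))                        # probability content of the 3-sigma interval
```

For these inputs the exact and the approximate widths lie close together, and the 3σ-interval carries approximately the nominal 99.73 percent of the probability, in line with N = 20 ≥ 8.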
In Sect. 3.2 as well as in the present section the condition for the Gaussian approximation is independent of the value of ξ ML . In the case of the binomial model, this means that the Gaussian approximation is valid even for uniform or close to uniform patterns of answers.

Conclusions
The present text presents a criterion for the validity of the Gaussian approximation to the likelihood function of a statistical model p when N observations x = (x 1 , . . . , x N ) have been collected.
We require that a statistical model possesses a symmetry between the observed quantity x and the parameter ξ . This symmetry is defined in terms of a Lie group. It has been called form invariance. It allows one to formally specify the prior distribution - required by Bayes' theorem but not specified by Bayes. The model p can then be parameterised such that it depends on the difference between the observed quantity x and the parameter ξ which conditions the observations. The model p(x − ξ) remains invariant when both quantities are shifted by the same amount. We call this property translational form invariance. Then the likelihood function L N (ξ) is shown to depend on the difference ξ ML − ξ between the maximum likelihood estimator ξ ML and the parameter ξ . This means that the Bayesian posterior distribution, too, depends on ξ ML − ξ . One can even shift the scale of ξ such that the ML estimator is found at ξ ML = 0 .
A total of N observations conditioned by one and the same parameter ξ is generally expected to lead to a Gaussian likelihood function for sufficiently large N . The question of how large N must be in order to justify the Gaussian approximation is answered in Sects. 3 and 5 for two quite different examples. The basic idea is that a valid Gaussian approximation to the posterior distribution means that the error assigned to the parameter ξ equals σ = (N F ) −1/2 within the interval |ξ ML − ξ| < 3σ . Here, F is the Fisher information yielded by the model p . The probability for ξ to lie outside this 3σ-interval is neglected. A stricter condition for the Gaussian approximation would be achieved by requiring an interval larger than 3σ for its validity.
The example of Sect. 3 is a version of the chi-squared model with two degrees of freedom. This distribution, as well as its posterior, has a considerable skewness. In this case one needs N = 160 observations for the Gaussian approximation to be acceptable. The large number of required observations is due to the skewness.
The example of Sect. 5 is the trigonometric model, a specific form of the binomial model based on form invariance. It strongly differs from the chi-squared model since the observations are not taken from a continuum of real numbers but rather from the alternative of 0 or 1 . The likelihood function turns out to exhibit a mirror symmetry - as does the Gaussian. This helps to approach the Gaussian distribution. We find: for N ≥ 8 the posterior of the trigonometric model can be considered as Gaussian. This holds for every value of the ML estimator. It is a favorable result for the application of the binomial model to competence tests such as the PISA studies [46].
In general, we see a practical interest in our results since the normal distribution is the basis of parametric methods in applied statistics, widely used in many areas (education, medicine, science, etc.). Whether or not the normal distribution is applicable is therefore of interest for practitioners in these fields.

A Comparing Two Distributions
We show that for any two normalised distributions p and q the inequality
$$\sum_{l} p_l\,\ln\frac{q_l}{p_l} \;\le\; 0 \qquad (97)$$
holds, provided that p and q are labelled by the integers l and are normalised according to
$$\sum_{l} p_l = \sum_{l} q_l = 1\,. \qquad (98)$$
The l.h.s. of (97) attains its maximum value of zero when and only when p l and q l agree for every l .
The inequality (97) is a consequence of the inequality
$$\ln s \;\le\; s - 1\,. \qquad (99)$$
The linear function s − 1 is tangent to the function ln s . The common point lies at s = 1 , where both functions vanish. Setting s = q l /p l , the inequality entails
$$\ln\frac{q_l}{p_l} \;\le\; \frac{q_l}{p_l} - 1\,, \qquad (100)$$
hence
$$p_l\,\ln\frac{q_l}{p_l} \;\le\; q_l - p_l \qquad (101)$$
for every l . The quantity on the l.h.s. has also been introduced by Campbell in Sect. 1 of Chap. 5 of [9]. Summing (101) over l yields the inequality (97). When the distributions p and q agree with each other, the l.h.s. of (97) vanishes. Then and only then the expression assumes its maximum value.
One can interpret p l as the probability p(x l − ξ ML ) ∆x contained in a bin centered at the value x l of the real variable x and having the width ∆x . Here, p shall be a normalised probability density. Similarly one can interpret q l as the probability p(x l − ξ) ∆x . In the limit of ∆x → 0 the inequality (97) then yields the inequality
$$\int dx\; p(x-\xi_{\rm ML})\,\ln\frac{p(x-\xi)}{p(x-\xi_{\rm ML})} \;\le\; 0\,, \qquad (102)$$
that is
$$H(\xi_{\rm ML}|\xi) \;\le\; H(\xi_{\rm ML}|\xi_{\rm ML}) = 0\,. \qquad (103)$$

B The ML Estimator of the Chi-Squared Model
The ML estimator of the chi-squared model (45) is calculated.
Up to an additive constant (independent of ξ ) the logarithmic likelihood function is given by
$$\ln L_N(\xi) = \sum_{k=1}^{N}\big[(x_k-\xi) - e^{\,x_k-\xi}\big]\,. \qquad (104)$$
The ML estimator ξ ML solves the ML equation
$$\frac{\partial}{\partial\xi}\ln L_N(\xi)\Big|_{\xi=\xi_{\rm ML}} = \sum_{k=1}^{N}\big[-1 + e^{\,x_k-\xi_{\rm ML}}\big] = 0\,. \qquad (105)$$
The solution is
$$\xi_{\rm ML} = \ln\Big(\frac{1}{N}\sum_{k=1}^{N} e^{\,x_k}\Big)\,. \qquad (106)$$
For a single observation ( N = 1 ) this reduces to ξ ML = x .

C Two Versions of the Fisher Information
It shall be shown that the two lines of Eq. (30) agree with each other. Let us start from the second line, which we write as
$$F(\xi) = -\int dx\; p(x|\xi)\,\frac{\partial^2}{\partial\xi^2}\ln p(x|\xi)\,. \qquad (108)$$
This expression can be rewritten
$$F(\xi) = -\int dx\; p(x|\xi)\,\Big[\frac{1}{p}\frac{\partial^2 p}{\partial\xi^2} - \Big(\frac{1}{p}\frac{\partial p}{\partial\xi}\Big)^2\Big]
 = -\,\frac{\partial^2}{\partial\xi^2}\int dx\; p(x|\xi) \;+\; \int dx\; p(x|\xi)\,\Big[\frac{\partial}{\partial\xi}\ln p(x|\xi)\Big]^2\,. \qquad (109)$$
The first one of the two integrals in the last line vanishes since p is normalised to unity for every ξ . The second integral in the last line corresponds to the first line of Eq. (30).

D The Chi-Squared Model
Each of the quantities x k , where k = 1, 2 , shall have a Gaussian distribution centred at zero with one and the same root mean square value σ . The chi-squared model χ² 2 (T |σ) with two degrees of freedom is the distribution of the quantity
$$T = x_1^2 + x_2^2\,. \qquad (110)$$
It is given by
$$\chi^2_2(T|\sigma) = \frac{1}{2\sigma^2}\,\exp\Big(-\frac{T}{2\sigma^2}\Big)\,, \qquad (111)$$
see e.g. Eq. (4.34) of Ref. [25]. This distribution is normalised to unity. The transformations
$$z = \ln T\,,\qquad \zeta = \ln(2\sigma^2) \qquad (112)$$
lead to
$$\tilde\chi(z|\zeta) = \chi^2_2(T|\sigma)\,\frac{dT}{dz}\,, \qquad (113)$$
where T and σ must be expressed by z and ζ . This gives
$$\tilde\chi(z|\zeta) = \exp\big((z-\zeta) - e^{\,z-\zeta}\big)\,, \qquad (114)$$
which corresponds to Eq. (45).

E Derivatives of the Functional H for the Chi-Squared Distribution
The integral in the first line of Eq. (50) yields Γ(2) . It is a special case of the formula given in Sect. 8.312, no. 10 of Ref. [22]. The value of the Gamma function required in Eq. (50) is Γ(2) = 1 .


F Derivatives of the Functional H for the Trigonometric Model

According to Eq. (83) the functional H of the trigonometric model with translational invariance is determined by the integral
$$\int_{-\pi/2}^{\pi/2} ds\;\cos^2(s-\xi_{\rm ML})\,\ln\cos^2(s-\xi)\,.$$
With the abbreviation
$$\alpha = \xi_{\rm ML} - \xi\,, \qquad (120)$$
substitution of the integration variable yields the two equivalent forms
$$H(0|\alpha) = \frac{2}{\pi}\int_{-\pi/2}^{\pi/2} ds\;\cos^2 s\;\ln\frac{\cos^2(s-\alpha)}{\cos^2 s} \qquad (121)$$
and
$$H(0|\alpha) = \frac{2}{\pi}\int_{-\pi/2}^{\pi/2} ds\;\cos^2 s\;\ln\frac{\cos^2(s+\alpha)}{\cos^2 s}\,. \qquad (123)$$
The substitutions do not require a shift of the limits of integration because the integrand is periodic with a period of π . Although cos(s + α) vanishes at a point within the domain of integration, the integral (123) exists and can be obtained as if the integrand were simply undefined at this isolated point, see appendix H.
Comparing (121) with (123) shows that H(0|−α) is a mirror-symmetrical function of the difference α . By consequence, all odd derivatives vanish at α = 0 , We calculate the derivatives with respect to α . The first derivative is the basis of all higher ones. It must be rewritten in order to see that all derivatives exist. Starting from Eq. (123) we find ds cos 2 s ln cos 2 (s − α) − cos 2 s ln cos 2 s = N ∂ ∂α π/2 −π/2 ds cos 2 s ln cos 2 (s − α) . ds cos 2 (s + α) ln cos s .
The square [. . .] 2 of a binomial expression displays two squares and a mixed term. Here, the mixed term, as a function of s , is antisymmetric with respect to s = 0 . Therefore the integral over the mixed term vanishes and we obtain ds cos 2 s cos 2 α + sin 2 s sin 2 α ln cos s .
This integrand, as a function of s , is symmetric with respect to s = 0 . Therefore we have by help of the identity sin(2α) = 2 cos α sin α .
The maximum of the absolute value of the fourth derivative is found at α = 0 which means Note that the second derivative in Eq. (141) for α = 0 gives the value which is, up to its sign, the Fisher information (81) of the trigonometric model with translational invariance.

G The Normalisation of the Trigonometric Model with Translational Invariance
By partial integration of the cos²-function one finds
$$\int_{-\pi/2}^{\pi/2} ds\;\cos^2(s-\xi) = \frac{\pi}{2}\,.$$

H Integrating over a Logarithmic Divergence
In the ε-interval −ε < s − s 0 < ε , ε > 0 , the absolute value of the integrand of (83) is bounded by a constant times |ln|s − s 0 || , and
$$\int_{s_0-\varepsilon}^{s_0+\varepsilon} ds\;\ln|s-s_0| = 2\,\varepsilon\,(\ln\varepsilon - 1)\,,$$
which is of the order ε ln ε , and this contributes a negligible amount to the expression (82) when ε is small. Thus ln L N and the functional H(ξ ML |ξ) exist.

I The Likelihood Function of a Gaussian Model
The N -fold Gaussian model is given in Eq. (7). We write it as
$$p_N(x|\xi) = (2\pi\sigma^2)^{-N/2}\,\exp\Big(-\frac{N}{2\sigma^2}\big[\langle x^2\rangle - 2\xi\langle x\rangle + \xi^2\big]\Big)$$
by introducing the averages
$$\langle x\rangle = \frac{1}{N}\sum_{k=1}^{N}x_k\,,\qquad \langle x^2\rangle = \frac{1}{N}\sum_{k=1}^{N}x_k^2\,.$$
It is not difficult to factorise this according to
$$p_N(x|\xi) = (2\pi\sigma^2)^{-N/2}\,\exp\Big(-\frac{N}{2\sigma^2}\big[\langle x^2\rangle - \langle x\rangle^2\big]\Big)\,\exp\Big(-\frac{N(\xi-\langle x\rangle)^2}{2\sigma^2}\Big)\,.$$
The posterior distribution G N (ξ|x) is given by the factor that depends on ξ , i.e.
$$G_N(\xi|x) = \Big(\frac{N}{2\pi\sigma^2}\Big)^{1/2}\,\exp\Big(-\frac{N(\xi-\langle x\rangle)^2}{2\sigma^2}\Big)\,,$$
in agreement with Eq. (8). The maximum of the likelihood function occurs at ξ = ⟨x⟩ .

J The Fisher Information of the Binomial Model
The Fisher information of the binomial model (59) shall be independent of ξ . This means that the expression
$$F(\xi) = \frac{[R'(\xi)]^2}{R(\xi)\,[1-R(\xi)]} \qquad (161)$$
be independent of ξ . Here, R' is the derivative of R(ξ) . Thus the numerator of (161) should be proportional to the denominator. This is reached when we set
$$R(\xi) = \cos^2\xi\,,\qquad -\pi/2 \le \xi \le \pi/2\,, \qquad (162)$$
and obtain F (ξ) ≡ 4 .