The Way of Pooling p-values

Pooling p-values arises both in practice (in science and engineering applications) and in statistical theory. The p-value (sometimes p value) is a probability used as a statistical decision quantity: in practical applications, it is used to decide whether the collected data confirm or disconfirm the experimenter's hypothesis about the "reality" of a phenomenon. It is a real number, the determination of a Random Variable (uniformly distributed under the Null Hypothesis) related to the data provided by the measurement of a phenomenon. Almost all statistical software provides p-values when statistical hypotheses are considered, e.g. in Analysis of Variance and regression methods. Combining the p-values from various samples is a crucial matter, because the number of degrees of freedom (df) of the samples we want to combine influences our decision: forgetting this can have dangerous consequences. One way of pooling p-values is provided by a formula of Fisher; unfortunately, this method does not consider the number of degrees of freedom. We will show other ways of doing that, and we will prove that theory is more important than any formula which does not consider the phenomenon on which we have to decide: the distribution of the Random Variables is fundamental in order to pool data from various samples. Managers, professors and scholars should remember Deming's "profound knowledge" and Juran's ideas; profound knowledge means "understanding variation (type of variation)" in any process, production or managerial; not understanding variation causes the cost of poor quality (more than 80% of the sales value) and does not permit real improvement.


Introduction
Let's consider a "continuous" probability distribution F(t) = P[T ≤ t], T being a random variable. U = F(T) is a random variable uniformly distributed on (0, 1): G(u) = P[U ≤ u] = u. The p-value is the probability 1 - G(u) = P[U > u]; WHEN F(t) is related to a statistical test, THEN the p-value is the result of that statistical test.
The p-value (sometimes p value) [1,2] is a statistical decision quantity: it is used, in practice, to decide if we have to believe that our data confirm or disconfirm our hypothesis about the "reality" of a phenomenon.
It is a real number, the determination of a Random Variable related to the data provided by the measurement of a phenomenon.
In this section, we provide the general ideas about Hypothesis Testing, while in section 2 we describe the p-values.
We consider here only the parametric version of the method.
We ask the reader to look at figure 1, which shows a typical situation that anybody can be confronted with.
For example, let's assume that we want to decide about the following simple case: we want to test whether people weigh (in mean) more than 75 kg, assuming that we want to risk only α = 5% and that the standard deviation of the weight is σ = 8 kg.
Connect this case to figure 1, which refers to the general setting of "Hypothesis Testing". There you see the (above mentioned) risk α (not its value, 5%) related to the Hypothesis H0 (left unspecified in the case), and the Hypothesis H1 (value not shown): people weigh (in mean) more than 75 kg; regarding the Probability Model, the only information we have is that the standard deviation is σ = 8 kg.
Before arriving at the decision based on the data in our hands, we describe the general framework of figure 1.
Any statistical decision refers to a Probability Model assumed (on either technical or theoretical grounds) to rule the data we are going to collect. The Probability Model is generally specified by the Cumulative Distribution, which depends on various "parameters". To explain figure 1, we consider only 1 parameter.
Let θ be the parameter we want to "test"; prior to any collection of data we should state two Hypotheses and a probability α, named the "significance level":
1. The "Null Hypothesis", named H0, where we assume, before any collection of data, a value for the parameter θ; we indicate it with the symbol θ0; θ0 is a number, while θ is the symbol of the parameter: we write H0: [θ = θ0].
2. The "Alternative Hypothesis", named H1, where we assume, before any collection of data, another value for the parameter θ; we indicate it with the symbol θ1; θ1 is a number different from θ0, while θ is the symbol of the parameter: we write H1: [θ = θ1].
3. The probability α, the "significance level" that we assume before any collection and analysis of the data, is the probability that we accept of being wrong IF, after the collection and the analysis of the data, we claim that the Null Hypothesis H0: [θ = θ0] is Rejected, when actually (and nobody knows it!) the Null Hypothesis H0: [θ = θ0] should not be Rejected.
From points 1, 2, 3, using the Theory, we can, before any collection and analysis of the data, find two items:
1. a "formula", named Test Statistic s, that will provide us with a number after the analysis of the data;
2. an interval of the real line (real numbers) C, named Critical Region (or Rejection Region), such that we Reject the Null Hypothesis H0 when s falls in C.
Let's assume that we collect the data and analyse them, according to the Theory (probability and statistics), and compute the number s (computed from the Decision Function): IF s ∈ C, then we, according to the Theory, should Reject the Null Hypothesis H0: [θ = θ0]; IF s ∉ C, we Accept [but really we do not reject] the Null Hypothesis H0: [θ = θ0].
NOTICE that the Test Statistic s is the "determination (= estimate)" of the estimator S, which is a Random Variable! Figure 1 clearly shows that, in order to take a decision, we need the Probability Model (at the top of the figure); the model has to be suitable to the analysis of our data for the parameter we want to test! Moreover, above the yellow box, you see the Hypothesis H1 and not the risk β; this depends on the fact that in various cases β is not stated and one wants to compute the power of the test, 1 - β.
For example, in the above case, the parameter we want to test is the "mean weight" μ; we have to assume a Probability Model depending on the parameter μ (and other parameters).
Deming and Juran [3,4,5] warned us about this ("It is a hazard to copy". "It is necessary to understand the theory of what one wishes to do or to make". "Without theory, experience has no meaning". "A figure without a theory tells nothing". "The result is that hundreds of people are learning what is wrong. I make this statement on the basis of experience, seeing every day the devastating effects of incompetent teaching and faulty applications"), and so did Galetto [6]. Notice that the sample size n can be computed when both risks α (related to H0) and β (related to H1) are stated. If β is not stated, as in figure 1, the sample size is defined by the experimenter.
From these last two points, according to the Theory, anybody can derive the Probability Model for decision: since we assumed N(μ, σ²) as the Probability Model of the data, in figure 1 the Probability Model for decision is the Normal distribution with mean μ and variance σ²/n, N(μ, σ²/n).
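To make the framework concrete, here is a minimal sketch of the weight example in Python (a sketch, assuming the figures used in this paper: σ = 8 kg, α = 0.05, the sample size n = 25 and the observed mean x̄ = 79.25 used in the next section; scipy provides the Normal distribution):

    import math
    from scipy.stats import norm

    mu0, sigma, n, alpha = 75.0, 8.0, 25, 0.05   # H0 value, std dev, sample size, risk
    xbar = 79.25                                 # observed (empirical) mean

    # Critical Region C for the one-sided test H0: [mu = 75] vs H1: [mu > 75]:
    # Reject H0 when the sample mean exceeds the critical value c.
    c = mu0 + norm.ppf(1 - alpha) * sigma / math.sqrt(n)   # c ≈ 77.63 kg
    print("Reject H0" if xbar > c else "Accept H0")        # Reject H0: 79.25 > 77.63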

p-values
In section 1, we saw the general ideas of Hypothesis Testing (see figure 1). The decision there depends on the Test Statistic s and on the Critical Region C, which is related to the risk α (significance level): the Test Statistic s and the Critical Region C were the two statistical decision quantities. The Test Statistic s is a real number, the determination of a Random Variable S, depending on the Probability Model.
Now we see a different statistical decision quantity: the p-value (sometimes p value) [1,2]. It is used to decide if we have to believe that our data confirm or disconfirm our hypothesis about the "reality" of a phenomenon. It is a real number, determination of a Random Variable related to the data provided by the measurement of a phenomenon.
For the same purpose, almost every statistical software package provides the p-value (probability value).
In our example, it is the probability that the Random Variable Mean (generally indicated with X̄) is greater than the empirical (observed) mean x̄ = 79.25, that is

p-value = P[X̄ > 79.25 | H0] = ∫_{79.25}^∞ f(x) dx ≈ 0.004,    (1)

where f(x) is the density of N(μ0, σ²/n). From the example we derive the definition of the p-value: it is the probability, under the Null Hypothesis H0 (opposed to the Alternative Hypothesis H1 about the distribution of the random variable), that the variate (RV) to be observed takes values equal to or more extreme than the value observed.
The p-value can be viewed as an index of the "strength of evidence" against the Null Hypothesis H0; a small p-value indicates an unlikely event and, hence, an unlikely hypothesis.
Let f(x) be the pdf of the estimator T and t the determination of T; after the elaboration of the collected data, we can compute the integral π = ∫_t^∞ f(x) dx, which is a real number.
The value π of the integral is the determination of the Random Variable Π, related to the "parameter p-value".
The author thinks the opposite. Let's see why. IF, with other 25 data, x̄ = 77.75, then x̄ ∈ C and, AGAIN in this new case, we reject H0 at the 5% significance level, the same as before. BUT, from (1), the new p-value (estimate of the same parameter p-value) is 0.0428: the "strength of evidence" is less than before… What is the true p-value for H0 (people weigh, in mean, μ0 = 75 kg) based on both samples? We do not know. What is the true significance level? α = 0.05 for both decisions! Notice that the example used the normal distribution, and the decision function depended on that.
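A numerical check of the point just made (a sketch, with the means 79.25 and 77.75 from the text, and σ = 8, n = 25 as above):

    from math import sqrt
    from scipy.stats import norm

    # Same test, two batches of 25 data: same decision, different p-values.
    p_first  = norm.sf(79.25, loc=75.0, scale=8.0 / sqrt(25))  # ≈ 0.0040 -> Reject H0
    p_second = norm.sf(77.75, loc=75.0, scale=8.0 / sqrt(25))  # ≈ 0.0428 -> Reject H0 again

Both p-values are below α = 0.05, so both batches lead to Reject H0 at the same significance level, yet the two estimates of the "parameter p-value" differ by an order of magnitude.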
For example, the same numerical procedure cannot be used (but Six Sigma Professionals do not know it) for the following case, presented in the next section.

Pooling p-values
To explain the point, we consider the data of table 1, drawn from a reliability test on the same type of systems.
They are the times to failure (hours) of 5 failed items in sample 1 (sample size 13) and of 10 failed items in sample 2 (sample size 17).
According to figure 1, we set H0 = [θ ≤ θ0 = 100 h], where θ is the MTTF (Mean Time To Failure) of the items, and α = 0.05; the probability model depends on the exponential distribution; the alternative hypothesis is H1 = [θ1 > 100 h].
We have to find the Test Statistic and the Critical Region. According to [7][8][9][10][11][12][13][14][15][16], the test statistic is the "Total Time On Test" t1, the determination of the RV T1(t); the "Total Time On Test" is the sum of all the times accumulated by the items on test. In this case, the Probability Model for decision f(x), in figure 1, is the Erlang distribution with shape 5 and scale θ0 (mean 5θ0 = 500 h), and the decision is Reject H0, because the Critical Region is C = {t > 915.3} and t1 = 925.3 ∈ C. The p-value, 0.047 = ∫_{925.3}^∞ f(x) dx, is the determination of a Random Variable: we call Π the "Random Variable p-value". The p-value 0.047 is the estimate of the "parameter p-value".
Consider now sample 2: there are 10 failures among the 17 items on test.
The p-value is 0.063 = ∫_{1521.5}^∞ f(x) dx, where now f(x) is the Erlang distribution with shape 10 and t2 = 1521.5 h is the Total Time On Test. This confirms that we have to Accept H0, because 0.063 > 0.05. We have two contradictory decisions with risk 5%, because different samples (as is obvious) provide different data, and any statistical analysis depends on the data.
Notice that we used the same value α = 0.05 for both tests.
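A minimal numerical sketch of the two tests (assuming, as in the Theory cited above, that under H0 the Total Time On Test of r failures is Erlang/Gamma distributed with shape r and scale θ0 = 100 h; scipy's gamma is used):

    from scipy.stats import gamma

    theta0, alpha = 100.0, 0.05          # MTTF under H0 (hours) and significance level

    def ttt_test(r, ttt):
        # One-sided test based on the Total Time On Test (TTT) of r failures.
        crit = gamma.ppf(1 - alpha, a=r, scale=theta0)   # boundary of the Critical Region C
        p = gamma.sf(ttt, a=r, scale=theta0)             # p-value of the observed TTT
        return crit, p

    print(ttt_test(5, 925.3))    # sample 1: crit ≈ 915.4 h, p ≈ 0.047 -> Reject H0
    print(ttt_test(10, 1521.5))  # sample 2: crit ≈ 1570.5 h, p ≈ 0.063 -> Accept H0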
Since we have two independent comparisons, we have to consider the Bonferroni Correction: in order to avoid a lot of spurious positives, the α value needs to be lowered to account for the number of comparisons being performed. In this case, we have two comparisons, so that α_modified = 0.05/2 = 0.025. With this modification, we have to Accept H0, because 0.047 > 0.025 and 0.063 > 0.025; with this idea the "two decisions" are no longer contradictory.
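In code form, the correction is one line (values taken from the two tests above):

    alpha, m = 0.05, 2                                 # overall risk, number of comparisons
    alpha_mod = alpha / m                              # Bonferroni-corrected level: 0.025
    print([p > alpha_mod for p in (0.047, 0.063)])     # [True, True] -> Accept H0 both times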
To decide which of the "two (contradictory) decisions" could be the right one, we test a new hypothesis: that the two samples come from the same exponential population and can therefore be pooled. According to [7][8][9][10][11][12][13][14][15][16], the test statistic is the Ratio of the Total Times On Test, t1/t2, the determination of the RV T1(t)/T2(t); in this case the Probability Model for decision f(x), in figure 1, is related to the F distribution, since 2Ti/θ is chi-square distributed with 2ri degrees of freedom. The two samples turn out to be poolable and, testing H0 on the pooled Total Time On Test (15 failures, t1 + t2 = 2446.8 h), we should Reject H0…. The two samples (combined) tell us that the "true mean" should be > 100 h.
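A sketch of this combined analysis, under the assumptions just stated (the F distribution for the ratio and the pooled Erlang test follow from the chi-square property above; the numbers come from table 1):

    from scipy.stats import f as fdist, gamma

    r1, t1 = 5, 925.3        # sample 1: number of failures and Total Time On Test (h)
    r2, t2 = 10, 1521.5      # sample 2
    theta0 = 100.0           # MTTF under H0

    # Homogeneity: (T1/r1)/(T2/r2) follows an F distribution with (2*r1, 2*r2) dfs.
    ratio = (t1 / r1) / (t2 / r2)                 # ≈ 1.22
    p_ratio = fdist.sf(ratio, 2 * r1, 2 * r2)     # well above 0.05 -> the samples are poolable

    # Pooled test of H0 with 15 failures and TTT = t1 + t2 = 2446.8 h.
    p_pooled = gamma.sf(t1 + t2, a=r1 + r2, scale=theta0)   # ≈ 0.016 -> Reject H0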

Comparison of "Pooling p-values" and Theory
In this section, we test a new "Null Hypothesis": H0 = [Fisher's method and Reliability Theory provide the same decisions about combining two samples as in § 3, based on exponentially distributed data, with 5 failures out of 13 items and with 10 failures out of 17 items].
The dfs are important both for exponential data and for normal data. We consider these two cases.
Using Fisher's method, two small p-values P1 and P2 combine, through the statistic χ² = -2(ln P1 + ln P2), which under H0 is chi-square distributed with 4 degrees of freedom, into a smaller "combined" p-value. The yellow-green boundary defines the region where the "combined" p-value is below 0.05. For example, if both p-values are around 0.10, or if one is around 0.04 and the other around 0.25, the "combined" p-value is around 0.05. The comparison of Fisher's method and the Theory [7][8][9][10][11][12][13][14][15][16] is shown in figure 3.
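A numerical check on the two p-values of § 3 (0.047 and 0.063), contrasting Fisher's combination with the pooled Erlang test used above:

    from math import log
    from scipy.stats import chi2, gamma

    p1, p2 = 0.047, 0.063                    # the two p-values of § 3

    # Fisher's method: -2*sum(ln p_i) is chi-square with 2k dfs (here k = 2 p-values).
    x2 = -2.0 * (log(p1) + log(p2))
    p_fisher = chi2.sf(x2, df=4)             # ≈ 0.020

    # Theory: pool the Total Times On Test; the dfs enter through the Erlang shape.
    p_theory = gamma.sf(925.3 + 1521.5, a=15, scale=100.0)   # ≈ 0.016

    print(p_fisher > p_theory)               # True: Fisher's p-value is the higher one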
It is clear that Fisher's p-value is higher than the value computed from the Theory, which takes the dfs into account! That means wrong decisions, depending on the data… We should reject the "Null Hypothesis" H0 = [Fisher's method and Reliability Theory provide the same decisions about combining two samples as in § 3, based on exponentially distributed data, with 5 failures out of 13 items and with 10 failures out of 17 items], and we should likewise reject the analogous "Null Hypothesis" for Normally distributed data with the same numbers of degrees of freedom.
The same holds for other distributions of the data and for different numbers of degrees of freedom. This is in line with W. E. Deming's "profound knowledge" [3,4] and Juran's ideas [5].

Conclusions
We saw that Theory is more important than formulae which do not consider the phenomenon on which we have to decide (the distribution of the data).
The number of degrees of freedom of the samples we want to combine influences our decision: forgetting this can have dangerous consequences.
Managers, professors and scholars should remember Deming's "profound knowledge" [3,4] and Juran's ideas [5]: profound knowledge means "understanding variation (type of variation)" in any process, production or managerial; not understanding variation causes the cost of poor quality and does not permit real improvement.