Uncoupling Multidimensional Contingency Tables

A parsimonious and robust new method, based on information theory, for analyzing multidimensional contingency tables is presented. It swiftly reveals the important relations between dependent and independent variables and detects confounding effects in a straightforward manner. The method, in its simplicity, could replace logistic regression and log-linear analysis, which, in coping with their limitations and defects, have grown complicated and convoluted.


Method
When planning a study of cause and effect, one first selects an effect Y and probable causes X_i, and then designs a study that lets one find out whether the presumed causes X_i actually had a relevant influence on Y. One must always contend with the fact that further variables may also have an effect on Y or the X_i; therefore the study almost always includes measurements of further variables X_i that might need to be controlled. In experimental designs one can control the X-variables that might have an effect through direct control or randomization, but in survey studies, observational by design, such direct control is impossible and must be replaced by statistical control.
The basis of a simple new method of analysis is the entropy H of a distribution,

H = −Σ_i p_i ln(p_i),

where summation is over all k categories with p_i > 0. H is readily interpreted as a definition of variance for categorical variables, with H = 0 when all cases are concentrated on a single category and H reaching its maximum of ln(k) for a rectangular distribution. For analyzing this variance the method relies on the terseness ζ (zeta) introduced by Preuss and Vorkauf [1]. ζ is a coefficient of the closeness of relations between a complete set of variables, or a coefficient of total correlation:

ζ = 1 − Σ_{i=1}^{k} H(X_i | X_1, …, X_{i−1}, X_{i+1}, …, X_k) / H(X_1, X_2, …, X_k)

It is defined for tables with any number of dimensions; it is normalized to 1, independent of the base of the logarithm and, especially, independent of the sample size N. It is therefore comparable across tables of different size and dimensionality, a quality that the usual measures, especially χ², do not achieve. Comparisons based on ζ need no corrections such as a division by degrees of freedom, and there is no need to add a constant like 1/2 to every cell frequency to arrive at a solution for sparse tables. These qualities were decisive for choosing ζ for an analysis of multivariate tables, where the many sub-tables of different size and dimensionality of a high-dimensional table have to be compared.
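As an illustration, the two definitions can be sketched in a few lines of Python. This is not the author's Uncoupling program, just a minimal reimplementation of H and ζ (natural logarithms are used, but ζ is independent of the base):

```python
import numpy as np

def entropy(p):
    """H = -sum(p_i * ln p_i), summed over the cells with p_i > 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def terseness(table):
    """Terseness: zeta = 1 - sum_i H(X_i | rest) / H(X_1, ..., X_k),
    where `table` is a k-dimensional array of cell frequencies."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    h_all = entropy(p)
    if h_all == 0.0:  # all cases in a single cell: no variance to analyze
        return 0.0
    # H(X_i | rest) = H(all) - H(rest); summing out axis i yields the
    # marginal table of the remaining variables.
    h_cond = sum(h_all - entropy(p.sum(axis=i)) for i in range(p.ndim))
    return 1.0 - h_cond / h_all
```

For two perfectly correlated variables, e.g. the table [[1, 0], [0, 1]], ζ = 1; for independent variables, e.g. [[1, 1], [1, 1]], ζ = 0.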
A method is introduced to find the contribution of the correlation between any subset of the variables to the correlation between all variables. This is done by combining the categories of two or more variables (the subset) into one composite variable; for this operation Preuss [2] coined the term uncoupling. For instance, the interdependence of X_i = [A, B] and X_j = [1, 2, 3] is eliminated by combining the values of X_i and X_j into the composite variable [A1, A2, A3, B1, B2, B3] with 2 × 3 = 6 categories. This uncoupling operation removes the interdependence of X_i and X_j while leaving every cell frequency unchanged.
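In code, uncoupling is a purely mechanical rearrangement: the axes of the frequency array that belong to the subset are merged into a single composite axis. A minimal numpy sketch (the helper name `uncouple` is mine, mirroring the paper's terminology):

```python
import numpy as np

def uncouple(table, axes):
    """Merge the variables on `axes` into one composite variable.

    No frequency is changed or summed out; e.g. merging X_i = [A, B] with
    X_j = [1, 2, 3] turns two axes of lengths 2 and 3 into one composite
    axis [A1, A2, A3, B1, B2, B3] of length 6."""
    t = np.asarray(table, dtype=float)
    axes = sorted(axes)
    keep = [a for a in range(t.ndim) if a not in axes]
    t = np.transpose(t, axes + keep)  # composite axes first
    merged = int(np.prod(t.shape[:len(axes)]))
    return t.reshape((merged,) + t.shape[len(axes):])
```

Applied to a 2 × 3 × 2 table, uncouple(table, [0, 1]) yields a 6 × 2 table holding exactly the same twelve cell frequencies.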
The data structure is analyzed by calculating ∆ζ for each pair, triple, … of uncoupled variables:

∆ζ = ζ(original table) − ζ(table with the variables 'some' uncoupled)

∆ζ is the loss of terseness when the dependence of two or more variables ('some') is suppressed by uncoupling. It can be interpreted as the contribution of the correlation of some variables to the total correlation of all variables. The contribution ∆ζ to the total correlation seems ideal for quantifying the strength of effect, the measure of the relative importance of independent variables that Kruskal and Majors [3] demanded in preference to the frequent misuse of significance tests that they deplored.
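To make the scan over subsets concrete, here is a self-contained Python sketch (my own restatement of the H, ζ and uncoupling definitions, not the author's program): ∆ζ compares the terseness of the full table with the terseness of the table after uncoupling one subset, and the analysis simply repeats this for every pair, triple, … of variables.

```python
import numpy as np
from itertools import combinations

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def terseness(table):
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    h_all = entropy(p)
    if h_all == 0.0:
        return 0.0
    h_cond = sum(h_all - entropy(p.sum(axis=i)) for i in range(p.ndim))
    return 1.0 - h_cond / h_all

def uncouple(table, axes):
    t = np.asarray(table, dtype=float)
    axes = sorted(axes)
    keep = [a for a in range(t.ndim) if a not in axes]
    t = np.transpose(t, axes + keep)
    return t.reshape((int(np.prod(t.shape[:len(axes)])),) + t.shape[len(axes):])

def delta_zeta(table, subset):
    """Loss of terseness when the variables in `subset` are uncoupled."""
    return terseness(table) - terseness(uncouple(table, list(subset)))

def all_delta_zetas(table):
    """Delta-zeta for every pair, triple, ... of variables."""
    k = np.asarray(table).ndim
    return {s: delta_zeta(table, s)
            for r in range(2, k + 1) for s in combinations(range(k), r)}
```

For a table in which X_1 = X_2 while X_3 is independent of both, uncoupling the pair (X_1, X_2) removes all of the correlation (∆ζ = ζ), whereas ∆ζ for either pair involving X_3 is zero.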
As this simple operation of uncoupling variables to eliminate their interdependence is at the core of the new analysis, the analysis was also named uncoupling.
A further measure occasionally used in the analysis, also based on the entropy, is γ_Y, the uncertainty coefficient (Press et al. [4]), which is defined for two-dimensional tables only. But this restriction to two dimensions can be relaxed: by declaring any one of several X_i as the dependent variable Y and by uncoupling all other X_i into one composite X, γ_Y in effect becomes a multivariate extension of the uncertainty coefficient, measuring the proportion of variance (entropy) in the variable Y that is explained by the remaining variables combined through uncoupling in the composite X. This extension was called separability by Preuss & Vorkauf [1]. γ_Y has only a supporting role in the Uncoupling analysis. When it is very small for any X_i viewed as dependent Y in turn (γ_Y < 0.01, say), one may decide to simplify the analysis by excluding this X_i; this may be helpful in larger problems with many variables, such as in case-control studies with many hypothetical causes.
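A sketch of the separability under the same conventions (again my restatement; `y_axis` names the variable treated as dependent):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def separability(table, y_axis):
    """gamma_Y = (H(Y) - H(Y | X)) / H(Y), where X is the composite of all
    variables other than Y. No explicit reshape is needed here, because the
    joint entropy of the remaining axes already treats them as one
    composite variable."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    rest = tuple(i for i in range(p.ndim) if i != y_axis)
    h_y = entropy(p.sum(axis=rest))      # marginal entropy of Y
    if h_y == 0.0:
        return 0.0
    h_x = entropy(p.sum(axis=y_axis))    # marginal entropy of composite X
    h_y_given_x = entropy(p) - h_x       # H(Y | X) = H(X, Y) - H(X)
    return (h_y - h_y_given_x) / h_y
```

γ_Y = 1 when Y is fully determined by the other variables, and γ_Y = 0 when Y is independent of them.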
The Berkeley admissions data of Bickel et al. [5] have been intensively studied by many authors, because they show a clear gender bias in the total data; a bias that vanishes when the departments are taken into account. These data may serve as a first demonstration of the use of uncoupling. The two tables in table 1, one the original 2 × 6 × 2, the other (2 × 6) × 2 with gender and department uncoupled, contain the same 2 × 6 × 2 = 24 cell frequencies; no variable was summed out or aggregated. This preservation of the complete original data is the salient feature of uncoupling, making it preferable to aggregation. As uncoupling only removes the dependence between uncoupled variables and keeps all data intact, the method complies with Fisher's [6] demand to use all of the data, not aggregated sub-tables: "In inductive reasoning the whole of the data, or the available axioms, or the available observations, has to be taken into account." The result of Uncoupling's analysis in table 2 is very concise: we find a strong Department × Gender interaction: women tend to apply to different departments than men. There is also a strong Department × Admission interaction: departments differ strongly in their admission rates. The Gender × Admission interaction is too small to be worth mentioning; the gender bias observed when departments were summed out was due to confounding.
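Tables 1 and 2 are not reproduced here, but the Berkeley counts are widely available (they ship, for example, as R's UCBAdmissions data set). Assuming those counts, the following self-contained sketch (my restatement of ζ and ∆ζ, not the author's program) reproduces the qualitative pattern just described: the Gender × Admission contribution is far smaller than the Department × Gender and Department × Admission contributions.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def terseness(table):
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    h_all = entropy(p)
    h_cond = sum(h_all - entropy(p.sum(axis=i)) for i in range(p.ndim))
    return 1.0 - h_cond / h_all

def uncouple(table, axes):
    t = np.asarray(table, dtype=float)
    axes = sorted(axes)
    keep = [a for a in range(t.ndim) if a not in axes]
    t = np.transpose(t, axes + keep)
    return t.reshape((int(np.prod(t.shape[:len(axes)])),) + t.shape[len(axes):])

def delta_zeta(table, subset):
    return terseness(table) - terseness(uncouple(table, list(subset)))

# Gender (male, female) x Department (A..F) x Admission (admitted, rejected);
# counts assumed as in R's UCBAdmissions.
berkeley = np.array([
    [[512, 313], [353, 207], [120, 205], [138, 279], [53, 138], [22, 351]],
    [[ 89,  19], [ 17,   8], [202, 391], [131, 244], [94, 299], [24, 317]],
])

dz_gender_dept = delta_zeta(berkeley, (0, 1))   # strong
dz_dept_admit = delta_zeta(berkeley, (1, 2))    # strong
dz_gender_admit = delta_zeta(berkeley, (0, 2))  # negligible
```

The negligible Gender × Admission ∆ζ is the confounding story of table 2 in a single number.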

Some applications
The following exemplary applications to a number of different types of study are intended to indicate the wide range of applicability. They should make the reader familiar with the method and its implications.

Drug Use of High School Seniors
A study on drug use (A = alcohol, a = no; C = cigarettes, c = no; M = marijuana, m = no) of male and female, white and non-white students (cited from Agresti [7]) produced the data of table 3. We see a clear preponderance over expectation for the complete pattern of abstinence (acm) and the complete pattern of drug use (ACM); the only mixed pattern with an observed count higher than expectation is Acm, the use of alcohol only. Uncoupling's analysis of terseness is given in table 4: for the full table, ζ = .0995. Uncoupling the triple of all three substances accounts for 96% of the terseness, and the three substantial pairwise effects C × M, A × C, and A × M all concern correlations between the substances used. As gender and race do not enter any of the substantial pairwise effects, one can simply eliminate these two control variables from the final model to obtain the final result of Uncoupling's analysis: C × M, A × C, A × M.
To confirm the legitimacy of this elimination, I tested the relation between substance use and gender/race by creating an uncoupled composite variable gender × race = [WhiteFemale, WhiteMale, OtherFemale, OtherMale] and cross-tabulating it with the usage pattern = [acm, . . . , ACM]. I judged the resulting separability γ = .006 for predicting the usage pattern from gender × race to be small enough to be ignored, although χ² still indicated a significant difference from zero.
In his log-linear analysis using G² for model selection, Agresti arrived at a more complex model, as many terms were statistically significant even when an effect was of negligible importance. His reduced model is A×C, A×M, C×M, A×Gender, A×Race, M×Gender, Race×Gender.
A log-linear analysis with SPSS [8], using their automated model selection procedure, resulted in a different model: A×C×Race×Gender, A×M, C×M×Gender, M×Race.
Both Agresti's and SPSS's model selections are perfectly admissible, yet they produce quite different results. As one is free, e.g., either to eliminate one interaction after the other or to eliminate all third-order interactions in one sweep, the result of model selection becomes subject to the investigator's personal judgment and thus less objective.
Uncoupling orders the effects along an unequivocal scale of correlation instead of significance tests and needs no subjective decisions. It only requires a specified minimum effect size, a "smallest meaningful difference", and uses the size of the terseness loss ∆ζ in reducing the saturated model to arrive at a meaningful, parsimonious set of variables and interactions, namely C×M, A×C, and M×A.
One may ask why the concept of the smallest meaningful effect is hardly ever used in testing hypotheses or in parsing the saturated model in log-linear analysis, although we all are familiar with it in the context of power calculations.
Uncoupling's subsequent post factum significance testing has no consequence for model building; it merely makes us aware that an apparently strong effect may not be trustworthy with a feeble sample size, or, as in the case of these data, that a feeble effect is exaggerated by a large sample.
The choice is: with Uncoupling, just a few lines of unequivocal program output; with traditional methods, lengthy operations filling many pages of output in sequential simplifications of the saturated model, where another researcher may reach different conclusions without committing any error.

A Problem too Hard for Logistic Regression
In a flyer advertising Cytel's LogXact [9] program for exact solutions of logistic regression, the following problem is presented: "Diaphragm Use and Urinary Tract Infection. 130 college women with urinary tract infections were compared to 109 matched, uninfected controls. How is urinary tract infection related to age and contraceptive use?" How to read their table: the first column on the left tells us that there were 2 women in total (1 infected) with these characteristics: age < 24, did not use an oral contraceptive, did not use a condom, did not use a lubricated condom, did use a spermicide, and did not use a diaphragm.
"Challenge: Try fitting a logistic regression model to these data with all six covariates included. Conventional asymptotic logistic regression cannot meaningfully fit this model when the variable for diaphragm use is included. Only the exact methods in LogXact 4 can pin down the effect of this potentially important variable".
Clearly, this data set poses difficulties, as the frequency table is rather sparse: only 44 of the possible 2⁷ = 128 cell frequencies are greater than zero. More difficulty stems from the perfect discrimination between diaphragm use and infection.
Where standard logistic regression found no solution for the variable diaphragm, and LogXact was able to arrive at a solution identifying diaphragm as a significant predictor of urinary tract infection, Uncoupling had no problem whatsoever finding a solution when I took the challenge; just like LogXact, it identified diaphragm as an important predictor. Uncoupling's simple and robust computation of ζ contains no iterative algorithm that may converge slowly or not at all.
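Why the computation cannot fail is easy to see: ζ is a closed-form ratio of entropies, and empty cells simply drop out of the sums. A small sketch with hypothetical counts (not Cytel's data) shows ζ remaining finite under the perfect one-way discrimination that breaks the maximum-likelihood iteration of logistic regression:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def terseness(table):
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    h_all = entropy(p)
    h_cond = sum(h_all - entropy(p.sum(axis=i)) for i in range(p.ndim))
    return 1.0 - h_cond / h_all

# Hypothetical diaphragm x infection counts with perfect one-way
# discrimination: every diaphragm user is infected. The zero cell sends a
# logistic-regression coefficient to infinity, but zeta stays finite.
quasi_separated = [[30, 0],     # diaphragm: infected, not infected
                   [100, 109]]  # no diaphragm
z = terseness(quasi_separated)
```

Even full two-way separation, e.g. [[130, 0], [0, 109]], poses no problem; it simply yields the maximal ζ = 1.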

Radelet's Death Penalty Data
A well-known problem is Radelet's study (cited from Agresti [7]) on Florida death penalties: is the sentence influenced by the defendant's race when the victim's race is controlled for? The data are shown in table 5. If the victim was white, black defendants received a death penalty more often than white defendants, and this was also the case when the victim was black. Yet, when collapsing the table by ignoring the victim's race (summing out the victim's race), in the total white defendants received the death penalty more often (cf. the arrows in table 5).
The primary question was: "How strongly does the color of the defendant determine the penalty?", and we get two conflicting answers when we compare the total result with the result within victims' race. The puzzling reversal of trend in the collapsed table is known as Simpson's paradox, a phenomenon that cannot occur in designed experiments where all variables are orthogonal to each other.
In this low-dimensional example Simpson's paradox was recognizable by visual inspection of the cross-tabulation. In the analysis with Uncoupling the non-orthogonality is detected more quickly by studying the ∆ζ in table 6: whereas the terseness of the complete table was ζ = .2577, uncoupling X_1 = defendant and X_2 = victim reduces the terseness by 95%, ∆ζ = .2436.
Black defendants tend to have killed black victims and white defendants tend to have killed white victims, and this non-orthogonality produces the baffling paradox.
This annoying interdependence of X_1 and X_2 is eliminated by uncoupling, combining the values of X_1 (race of defendant) and X_2 (race of victim) into the composite variable victim/defendant = [W/W, W/B, B/W, B/B].
Terseness is reduced to just ζ = .0141 for the 2 × [2 × 2] table in which X_1 and X_2 are uncoupled. We should revise our original question and ask: "How is the sentence determined by the composite of victim's and defendant's race?" The separability of predicting the death sentence from the composite variable, γ_sentence = .0505, answers the revised question. We might go on to look at white and black defendants separately and find that γ_sentence is a rather small .0113 for white defendants versus a strong .1612 for black defendants: the black defendant's sentence is strongly influenced by the race of his victim. This finding is rarely mentioned in published analyses.
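Table 5 itself is not reproduced in this text; assuming the 1976–1987 Florida counts as printed in Agresti [7] (white defendant/white victim: 53 death, 414 other; white/black: 0 and 16; black/white: 11 and 37; black/black: 4 and 139), a self-contained sketch (my restatement of the definitions) reproduces the figures quoted above:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def terseness(table):
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    h_all = entropy(p)
    h_cond = sum(h_all - entropy(p.sum(axis=i)) for i in range(p.ndim))
    return 1.0 - h_cond / h_all

def uncouple(table, axes):
    t = np.asarray(table, dtype=float)
    axes = sorted(axes)
    keep = [a for a in range(t.ndim) if a not in axes]
    t = np.transpose(t, axes + keep)
    return t.reshape((int(np.prod(t.shape[:len(axes)])),) + t.shape[len(axes):])

# Defendant (white, black) x Victim (white, black) x Sentence (death, other),
# counts assumed from Agresti [7].
radelet = np.array([
    [[53, 414], [0, 16]],
    [[11, 37], [4, 139]],
])

z_full = terseness(radelet)        # zeta of the complete table
unc = uncouple(radelet, [0, 1])    # the 2 x [2 x 2] uncoupled table
z_unc = terseness(unc)             # zeta after uncoupling
dz = z_full - z_unc                # loss of terseness

# Separability of the sentence from the composite victim/defendant variable:
p = radelet / radelet.sum()
h_s = entropy(p.sum(axis=(0, 1)))                    # H(Sentence)
h_s_given_dv = entropy(p) - entropy(p.sum(axis=2))   # H(S | composite)
gamma = (h_s - h_s_given_dv) / h_s
```

The sketch agrees with the values in the text to the printed precision: ζ = .2577, ∆ζ = .2436, ζ = .0141 for the uncoupled table, and γ_sentence = .0505.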
It is my conviction that the summing out of the control variable X_2 = victim, in an effort to produce a summary, amounts to an illegitimate act that produces Simpson's paradox (cf. Fisher's demand above [6]). In this extreme case, where summing out produced the paradox, you will probably agree, but I would like to propose a general rule banning the summing out of control variables whenever they are involved in a sizeable ∆ζ effect. The error involved in collapsing tables when an effect is insignificant, routinely done in parsing log-linear models, is only gradually less severe than when a very large X_i–X_j relationship produces Simpson's paradox.

Byssinosis, an epidemiological example
Let us now turn to a complex data set with six variables by Higgins and Koch [10], as shown in table 7. The complete 3 × 3 × 2 × 2 × 2 × 2 table is difficult to assess. When one tries to find the main factors leading to byssinosis, a lung disease caused by exposure to cotton dust, one has to take into account many interrelations between the possibly illness-inducing variables. Higgins and Koch devised a laborious χ²-based set of rules designed to find the important factors; they concluded that dustiness of the workplace is the most important determinant of illness, gender of employee is 2nd, and smoking is in 3rd place. From the content of the study, it seems curious that the length of employment, and therefore the length of exposure to dust, came in 4th place only. Could it be that some confounding relation has suppressed the relation between length of exposure and byssinosis? The ∆ζ in table 8 should provide an answer to this question. Reassuringly, the order of the pairs that include byssinosis is 1st dust, 2nd length of employment, and 3rd smoking. This appears more plausible for a lung disease.
But the largest ∆ζ occurs for the uncoupling of race and length of employment, which is responsible for almost half of the terseness ζ = .0984 of the whole table: non-whites have a much higher turnover.
This non-orthogonality has the effect that the clear increase of byssinosis with length of employment (and therefore exposure) seen within each race is reduced when race is summed out (table 9). Here, collapsing the table by summing out race did not yet produce a reversal of trend as in Simpson's paradox, but it is an error that led Higgins and Koch to underestimate the effect of length of exposure on developing byssinosis, producing an "attenuated Simpson".
The error of summing out will affect any of the statistical models usually applied in the analysis of such data, as in the last resort they all use summaries of partially collapsed tables to arrive at their estimates of main effects. Fortunately, collapsing tables by summing out variables is not needed; uncoupling can successfully replace it without producing confounded results, as it does not discard data but merely rearranges them.

A program for the analysis
A program Uncoupling (Windows) is available from the author. It starts by computing the separabilities γ with each variable in turn regarded as the dependent variable.
In the main part, terseness ζ is computed with all pairs and higher tuples of variables uncoupled. All combinations of variables are analyzed.
The program expects input either in the form of raw data (one row per case) or in the form of contingency tables (one row per cell). Internally, the program uses the dBase format, but one can enter the tables in the form of a text file.

Optionally, one can request bootstrapped error estimates.