13 The Analysis of Categorical Data
categorical data = counts in various categories
Fisher's Exact Test
H0: there is no sex bias (any differences observed are due to randomization)
We denote the counts in the table and on the margins as follows: \(N_{ij}\) for the interior counts, \(n_{i.}\) and \(n_{.j}\) for the row and column totals, and \(n_{..}\) for the grand total.
According to \(H_0\) the margins are fixed: There are 24 females, 24 males, 35 supervisors who choose to promote, and 13 who choose not to.
Also, the process of randomization determines the counts in the interior of the table (denoted by capital letters since they are random) subject to the constraints of the margins. With these constraints, there is only 1 degree of freedom in the interior of the table; once any one interior count is fixed, the others are determined.
Consider \(N_{11}\), the number of males who are promoted. Under \(H_0\), the distribution of \(N_{11}\) is that of the number of successes in 24 draws without replacement from a population of 35 successes and 13 failures; that is, the distribution of \(N_{11}\) is hypergeometric, \(H(N, n, r)=H(48, 24, 35)\). We thus get
\[p(n_{11})=\frac{\binom{35}{n_{11}}\binom{48-35}{24-n_{11}}}{\binom{48}{24}}=\frac{\binom{n_{1.}}{n_{11}}\binom{n_{..}-n_{1.}}{n_{.1}-n_{11}}}{\binom{n_{..}}{n_{.1}}}=\frac{\binom{n_{1.}}{n_{11}}\binom{n_{2.}}{n_{21}}}{\binom{n_{..}}{n_{.1}}}\]
We will use \(N_{11}\) as the test statistic for testing the null hypothesis. The preceding hypergeometric probability distribution is the null distribution of \(N_{11}\); a two-sided test rejects for extreme values of \(N_{11}\).
Don't forget to use \(F(21-1)\) instead of \(F(21)\), since the distribution is discrete and 21 must be included in the tail probability: \(P(N_{11}\ge 21)=1-F(20)\).
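A minimal sketch of these calculations in Python (assuming scipy is available); the table counts are those of the example above, and the cutoff 21 is the one in the note:

```python
from scipy.stats import hypergeom

# Null distribution of N11: hypergeom(M, n, N) with population
# M = 48 files, n = 35 "successes" (promotions), N = 24 draws (males).
null = hypergeom(48, 35, 24)

# p(n11) = C(35, n11) C(13, 24 - n11) / C(48, 24); the support runs
# from max(0, 24 - 13) = 11 up to min(24, 35) = 24.
for n11 in range(11, 25):
    print(n11, null.pmf(n11))

# Discreteness: P(N11 >= 21) = 1 - F(21 - 1), so that 21 is included.
print("P(N11 >= 21) =", 1 - null.cdf(20))
```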
The Chi-Square Test of Homogeneity
Suppose we have independent observations from \(J\) multinomial distributions each of which has \(I\) cells, and that we want to test whether the cell probabilities of the multinomials are equal - that is, to test the homogeneity of the multinomial distributions
The six counts for Sense and Sensibility will be modeled as a realization of a multinomial random variable with unknown cell probabilities and total count 375; the counts for the other works will be similarly modeled as independent multinomial random variables.
(one of the works was written by an admirer)
\(\downarrow\)
Thus we must consider comparing \(J\) multinomial distributions each having \(I\) categories
\(\pi_{ij}\) = probability of the \(i\)th category of the \(j\)th multinomial
\(H_0:\pi_{i1}=\pi_{i2}=\cdots=\pi_{iJ}\) for all \(i\)
i.e., a given word ("with", for example) is equally likely in all the works
We may view this as a goodness-of-fit test: Does the model prescribed by the null hypothesis fit the data? To test goodness of fit, we will compare observed values with expected values as in Chapter 9, using likelihood ratio statistics or Pearson's chi-square statistic. We will assume that the data consist of independent samples from each multinomial distribution, and we will denote
\(n_{ij}\) = the count in the \(i\)th category of the \(j\)th multinomial
Under \(H_0\), each of the \(J\) multinomials has the same probability for the \(i\)th category, say \(\pi_i\). The following theorem shows that the mle of \(\pi_i\) is simply \(n_{i.}/n_{..}\), which is an obvious estimate.
Theorem: Under \(H_0\), the mle of \(\pi_i\) is\[\hat{\pi}_{i}=\frac{n_{i.}}{n_{..}}\]for all \(i\).
\(n_{i.}\) = total number of responses in the \(i\)th category
\(n_{..}\) = grand total number of responses
The expected count in the \(i\)th category of the \(j\)th multinomial is the estimated cell probability times the total count of the \(j\)th multinomial:\[E_{ij}=n_{.j}\hat{\pi}_{i}=\frac{n_{.j}n_{i.}}{n_{..}}\]
The chi-square statistic is therefore\[X^{2}=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(O_{ij}-E_{ij})^{2}}{E_{ij}}=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(n_{ij}-n_{i.}n_{.j}/n_{..})^{2}}{n_{i.}n_{.j}/n_{..}}\]
df = #independent counts - #independent parameters est. from the data
Each multinomial has \(I-1\) independent counts since its total is fixed, giving \(J(I-1)\) independent counts in all, and \(I-1\) independent parameters have been estimated. We thus get\[df=J(I-1)-(I-1)=(I-1)(J-1)\]
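A minimal sketch of the homogeneity computation, implementing the formulas above directly; the counts matrix is hypothetical (\(I=3\) word categories as rows, \(J=2\) works as columns), and any table of the right shape would do:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical counts n_ij: I = 3 word categories, J = 2 works.
counts = np.array([[14, 16],
                   [133, 116],
                   [99, 129]])

row = counts.sum(axis=1)            # n_i.
col = counts.sum(axis=0)            # n_.j
n = counts.sum()                    # n_..

expected = np.outer(row, col) / n   # E_ij = n_i. n_.j / n_..
X2 = ((counts - expected) ** 2 / expected).sum()

I, J = counts.shape
df = (I - 1) * (J - 1)
print(X2, df, chi2.sf(X2, df))      # statistic, df, upper-tail p-value
```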
The Chi-Square Test of Independence
contingency table: a sample of size \(n\) cross-classified in a table with \(I\) rows and \(J\) columns
The joint distribution of the counts \(n_{ij}\) is multinomial with cell probabilities \(\pi_{ij}\)
marginal probabilities
\[\pi_{.j}=\sum_{i=1}^{I}\pi_{ij}\]
\[\pi_{i.}=\sum_{j=1}^{J}\pi_{ij}\]
If the row and column probabilities are independent of each other,\[\pi_{ij}=\pi_{i.}\pi_{.j}\]
"Finns det en association mellan college och att bara vara gift en gång eller är de oberoende?"
"Kommer böckerna från samma multinomialfördelning?"
We thus consider the following null hypothesis:
\(H_{0}:\pi_{ij}=\pi_{i.}\pi_{.j}\) (for all \(i,j\))
Under \(H_0\), the mle of \(\pi_{ij}\) is\[\hat{\pi}_{ij}=\hat{\pi}_{i.}\hat{\pi}_{.j}=\frac{n_{i.}}{n}\times\frac{n_{.j}}{n}\]
Under \(H_A\), the mle of \(\pi_{ij}\) is simply\[\tilde{\pi}_{ij}=\frac{n_{ij}}{n}\]
These estimates can be used to form a likelihood ratio test or an asymptotically equivalent Pearson's chi-square test,\[X^{2}=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(O_{ij}-E_{ij})^{2}}{E_{ij}}\]\[=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(n_{ij}-n_{i.}n_{.j}/n)^{2}}{n_{i.}n_{.j}/n}\]
df = (I-1)(J-1)
\(O_{ij}\) = observed counts \(n_{ij}\)
\(E_{ij}\) = the fitted counts = \(n\hat{\pi}_{ij}=\frac{n_{i.}n_{.j}}{n}\)
Test of independence: assumes only the grand total is fixed.
Test of homogeneity: assumes the margins (the \(J\) sample sizes) are fixed.
(In Fisher's exact test, both sets of margins are fixed.)
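In practice the whole computation can be delegated to scipy's chi2_contingency, which carries out the same Pearson \(X^2\) calculation with \(df=(I-1)(J-1)\) and also returns the fitted counts; the 2 × 2 table below is hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[550, 61],        # hypothetical cross-classified counts
                  [681, 144]])

# correction=False gives the plain Pearson statistic (no Yates correction).
X2, p_value, df, expected = chi2_contingency(table, correction=False)
print(X2, df, p_value)
print(expected)                      # fitted counts n_i. n_.j / n
```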
Matched-Pairs Designs
As with experiments involving continuous data, pairing can control for extraneous sources of variability and can increase the power of the test
A straightforward chi-square analysis showed that there was no significant difference.
BUT THIS WAS WRONG: the data consisted of sibling pairs (one with the disease, one without), which violated the assumption of independence between the multinomials.
The RIGHT WAY is to take the pairing into account:
Viewed in this way, the data are a sample of size 85 from a multinomial distribution with four cells.
We can represent the probabilities in the table as follows (rows = patient, columns = sibling; \(T\) = tonsillectomy):\[\begin{array}{c|cc} & T & \overline{T}\\\hline T & \pi_{11} & \pi_{12}\\ \overline{T} & \pi_{21} & \pi_{22}\end{array}\]
The appropriate null hypothesis states that the probabilities of tonsillectomy and no tonsillectomy are the same for patients and siblings - that is, \(\pi_{1.}=\pi_{.1}\) and \(\pi_{2.}=\pi_{.2}\), or\[\pi_{11}+\pi_{12}=\pi_{11}+\pi_{21}\]\[\pi_{12}+\pi_{22}=\pi_{21}+\pi_{22}\]
These equations simplify to \(\pi_{12}=\pi_{21}\)
\(\downarrow\)
\(H_0:\pi_{12}=\pi_{21}\)
Under the null hypothesis, the off-diagonal probabilities are equal, and under the alternative they are not
McNemar's Test
Under the null hypothesis, the mle's of the cell probabilities are\[\hat{\pi}_{11}=\frac{n_{11}}{n}\]\[\hat{\pi}_{22}=\frac{n_{22}}{n}\]\[\hat{\pi}_{12}=\hat{\pi}_{21}=\frac{n_{12}+n_{21}}{2n}\]
The contributions to the chi-square statistic from the \(n_{11}\) and \(n_{22}\) cells are equal to zero; the remainder of the statistic is\[X^{2}=\frac{\left(n_{12}-(n_{12}+n_{21})/2\right)^{2}}{(n_{12}+n_{21})/2}+\frac{\left(n_{21}-(n_{12}+n_{21})/2\right)^{2}}{(n_{12}+n_{21})/2}=\frac{(n_{12}-n_{21})^{2}}{n_{12}+n_{21}}\]
df = 1
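A minimal sketch of McNemar's test using the simplified formula above; only the discordant counts \(n_{12}\) and \(n_{21}\) enter, and the values below are hypothetical:

```python
from scipy.stats import chi2

n12, n21 = 15, 7                    # hypothetical discordant pair counts
X2 = (n12 - n21) ** 2 / (n12 + n21)
print(X2, chi2.sf(X2, df=1))        # compare to chi-square with df = 1
```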
Odds Ratios
How do we estimate odds and the odds ratio?
3 possible sampling designs
retrospective study: take a sample of diseased individuals (\(D\)) and a control sample (\(\overline{D}\)), then determine who had been exposed
Notice that this contingency table contains more information than the previous one
prospective study: take a sample of exposed individuals (\(X\)) and a control sample (\(\overline{X}\)), then follow up and observe who develops the disease
\[\textrm{odds}(A)=\frac{P(A)}{1-P(A)}\]
This implies that\[P(A)=\frac{\textrm{odds}(A)}{1+\textrm{odds}(A)}\]so odds of 2 (or 2 to 1), for example, correspond to \(P(A)=2/3\)
\(X\) = exposed
\(D\) = diseased
\(\overline{X}\), \(\overline{D}\) = the complementary events (unexposed, not diseased)
\[\textrm{odds}(D|X)=\frac{P(D|X)}{1-P(D|X)}\]\[\textrm{odds}(D|\overline{X})=\frac{P(D|\overline{X})}{1-P(D|\overline{X})}\]
The odds ratio\[\Delta=\frac{\textrm{odds}(D|X)}{\textrm{odds}(D|\overline{X})}\]is a measure of the influence of exposure on subsequent disease
simple random sampling: all the cell probabilities can be estimated (write \(\pi_{ij}\) for the probability of exposure status \(i\) and disease status \(j\), with 1 = yes and 0 = no), and we get\[\Delta=\frac{\pi_{11}\pi_{00}}{\pi_{01}\pi_{10}}\]
This is difficult if the disease is rare (few diseased individuals will appear in the sample).
In a prospective study, the data allow us to estimate and compare \(P(D|X)\) and \(P(D|\overline{X})\) and, hence, the odds ratio
In a retrospective study, we can directly estimate \(P(X|D)\) and \(P(X|\overline{D})\). We cannot estimate the joint probabilities, nor \(P(D|X)\) and \(P(D|\overline{X})\), but we can estimate \(\Delta\)
We can show that this equals \(\Delta=\frac{\textrm{odds}(X|D)}{\textrm{odds}(X|\overline{D})}\): since \(\textrm{odds}(X|D)=\pi_{11}/\pi_{01}\) and \(\textrm{odds}(X|\overline{D})=\pi_{10}/\pi_{00}\), their ratio is again \(\pi_{11}\pi_{00}/(\pi_{01}\pi_{10})\).
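A minimal sketch with hypothetical counts, checking that the prospective and retrospective forms of the sample odds ratio agree:

```python
# Hypothetical 2 x 2 counts, indexed (exposure, disease), 1 = yes, 0 = no.
n11, n10 = 90, 910                  # exposed: diseased / not diseased
n01, n00 = 40, 960                  # unexposed: diseased / not diseased

# Sample version of Delta = pi11 pi00 / (pi01 pi10).
print((n11 * n00) / (n01 * n10))

# Retrospective form odds(X|D) / odds(X|Dbar) gives the same number.
print((n11 / n01) / (n10 / n00))
```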