Chapter 8 Proportions and chi-squared

This chapter looks at methods used for analyzing relationships in categorical data. The variable of interest is not a single continuous variable (e.g. how income or weight varies between groups) but the relative count or proportion of observations that fall into each category.

A key point is that the chi-squared test used in these cases is equivalent to a test of nested Poisson regression models.

8.1 Goodness of fit

A test of goodness-of-fit establishes whether an observed frequency distribution differs from a theoretical distribution. For example, we could test the hypothesis that a random sample with 44 men and 56 women has been drawn from a population in which men and women are equal in frequency, i.e. with the theoretical distribution of 50 men and 50 women.
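This example can be carried out directly with R’s chisq.test by supplying the hypothesised proportions through its p argument (a brief illustration; this code is not part of the original example):

# 44 men and 56 women, tested against hypothesised proportions of 0.5 each
chisq.test(c(44, 56), p = c(0.5, 0.5))
# Gives X-squared = 1.44, df = 1, p-value of about 0.23, so the 50/50 hypothesis is not rejected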

Sample dataset:

Assume we have one categorical variable (mood) with three possible levels (‘happy’, ‘sad’ or ‘meh’). We have a sample of 220 observations in total, as generated by the code below. We also add dummy variables to represent ‘happy’ and ‘sad’ for use in the linear model.

# dplyr provides the pipe (%>%), mutate() and if_else() used below
library(dplyr)

mydata_chi1 <- data.frame(mood = c('happy', 'sad', 'meh'),
                          counts = c(60, 90, 70)) %>% 
  # Dummy variables to be used in the linear model
  mutate(mood_happy = if_else(mood == 'happy', 1, 0)) %>% 
  mutate(mood_sad = if_else(mood == 'sad', 1, 0))
Table 8.1: Categorical data with one variable

  mood    counts   mood_happy   mood_sad
  happy       60            1          0
  sad         90            0          1
  meh         70            0          0

In this example, we may want to test whether the proportions of people with each mood in the underlying population are equal, based on our observations of 220 people.

Chi-squared test:

The goodness-of-fit test is based on the following test statistic:

\(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\)

where \(O_i\) is the observed count in each category and \(E_i\) is the expected count in each category. The test statistic follows a chi-squared (\(\chi^2\)) distribution where the degrees of freedom are equal to the number of categories minus one, i.e. \(d.f. = rows - 1\) (in this example, df = 2).

A worked example of a goodness-of-fit test is provided in this video by Khan Academy.

R’s built-in chi-squared test, chisq.test, compares the proportion of counts in each category with the expected proportions. By default, the expected proportions in each category are assumed to be equal.

# Built-in test
chisq.test(mydata_chi1$counts)
## 
##  Chi-squared test for given probabilities
## 
## data:  mydata_chi1$counts
## X-squared = 6.3636, df = 2, p-value = 0.04151

In this example, we would reject the null hypothesis that the proportions of people with each mood are all equal (p = 0.04151).
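For intuition, the same statistic and p-value can be reproduced by hand from the formula above (a short base-R sketch, not part of the original example):

# Manual goodness-of-fit calculation for the mood counts
observed <- c(60, 90, 70)
expected <- rep(sum(observed) / length(observed), 3)   # equal expected counts: 220 / 3
chi_sq   <- sum((observed - expected)^2 / expected)    # 6.3636
pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)  # 0.04151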

Equivalent linear model:

The equivalent linear model is a Poisson regression. This is a type of Generalized Linear Model (GLM). Poisson regressions model the natural log of the dependent variable, and so are sometimes referred to as log-linear models. The following model specification can be used:

\(log(y) = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + ...\)

where in this example \(x_1\) and \(x_2\) would be dummy variables representing ‘happy’ and ‘sad’ moods.

There are two reasons why we need to use a log-linear model rather than the OLS linear regression model, as explained in the book ‘Broadening Your Statistical Horizons’ by Julie Legler and Paul Roback:

  1. Since a Poisson random variable is a count, its minimum value is zero and, in theory, the maximum is unbounded. The line fitted by an OLS model is certain to yield negative predictions of the dependent variable \(y\) for some values of the independent variable \(x\); however, \(y\) can only take values from 0 to \(\infty\).

  2. The equal variance assumption in linear regression inference is violated, because as the expected value of a Poisson variable increases so too does the variance. With a Poisson distribution, the variance is equal to the expected value of the dependent variable, that is \(E(Y) = VAR(Y)\).

These issues are addressed by taking the natural log of the dependent variable. It is assumed that the resulting logarithm of the dependent variable can be modeled by a linear combination of the independent variable(s) used in the model.

Because we are using the natural log of the dependent variable, the coefficients from our regression, \(\beta_1\) and \(\beta_2\), are interpreted as the percentage differences in the number of people whose moods are happy or sad, relative to the reference level of ‘meh’ (see footnote 1). Because of this, the model can be used as a test for differences of proportions; that is, it tests whether there is a significant difference in the percentage of people in each category.

To assess whether the differences between categories (moods) are statistically significant we compare two nested models, just as we did with the ANOVA and ANCOVA tests in the previous chapter. In this case we compare a ‘full’ Poisson regression model, which includes our dummy variables, and a ‘null’ model which does not. Instead of using the F-test to compare nested models (as was the case with ANOVA/ANCOVA), here we use a ‘score test’, specifically the Rao test. This gives us a test statistic that follows a \(\chi^2\) distribution, with the degrees of freedom being equal to the additional parameters in the full model compared to the null model.

The code is as follows:

full = glm(counts ~ 1 + mood_happy + mood_sad, data=mydata_chi1, family=poisson())
null = glm(counts ~ 1,  data=mydata_chi1, family=poisson())
anova(null, full, test='Rao')
## Analysis of Deviance Table
## 
## Model 1: counts ~ 1
## Model 2: counts ~ 1 + mood_happy + mood_sad
##   Resid. Df Resid. Dev Df Deviance    Rao Pr(>Chi)  
## 1         2     6.2697                              
## 2         0     0.0000  2   6.2697 6.3636  0.04151 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
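As an aside (not shown in the original output), the fitted coefficients of the full model can be inspected with coef(full). Because the full model is saturated here, it reproduces the observed counts exactly, so the coefficients are simply logs of the observed counts and count ratios:

coef(full)
# (Intercept) = log(70)    =  4.248 (approx.)  (log of the count at the reference level, 'meh')
# mood_happy  = log(60/70) = -0.154 (approx.)
# mood_sad    = log(90/70) =  0.251 (approx.)  (the beta_2 = 0.2513 referred to in footnote 1)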

Comparison:

As seen above, the p-value is identical under both the built-in chi-squared test and the equivalent linear model:

Table 8.2: Goodness-of-fit: chi-squared test and equivalent linear model

  model        Chi-squared   df   p.value
  chisq.test        6.3636    2   0.04151
  glm               6.3636    2   0.04151

8.2 Contingency tables

Contingency tables are used in statistics to summarize the relationship between two categorical variables, and to test whether the frequency distribution differs between two or more groups. This can be tested using R’s built-in chi-squared test (chisq.test), similar to the goodness-of-fit test above, or again by comparing two nested linear models (Poisson regressions), one of which includes dummy variables representing the interaction between the two categories in our table.

Sample dataset:

As an example, we might like to test whether a subject’s mood (happy, sad or meh) is related to sex (male or female). We can create the following data set:

mydata_chi2 <- data.frame(
  sex = c('male','female'),
  happy = c(70,100),
  meh = c(32,30),
  sad = c(120,110)
)
Table 8.3: A contingency table

  sex      happy   meh   sad
  male        70    32   120
  female     100    30   110

Chi-squared test:

As shown above, a contingency table is a table that lists the frequencies of occurrence for categories of two variables. The first variable is shown in rows, and the second variable is shown in columns.

Contingency tables can be used to assess whether the proportion of observations in one category depends on, or is contingent upon, the other category in the table. There are actually two types of tests:

  • A test of homogeneity. This tests the null hypothesis that different populations have the same proportions of some characteristics. The key difference from the test of independence is that there are multiple populations that the data is drawn from. The null hypothesis is that the proportion of X is the same in all populations studied.
  • A test of independence. A test of independence tests the null hypothesis that there is no association between the two variables in a contingency table where the data is all drawn from one population. The null hypothesis is that X and Y are independent.

Both these tests involve exactly the same mathematical procedure and differ only in terms of the hypothesis being tested. Some further reading can be found here.

These tests are based on a similar test statistic to the goodness-of-fit test:

\(\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\)

where \(O_i\) is the observed count in each category and \(E_i\) is the expected count in each category. The test statistic follows a chi-squared (\(\chi^2\)) distribution where the degrees of freedom are found by \(d.f. = (rows - 1)(columns - 1 )\).

A worked example of a chi-squared test is provided in this video by Khan Academy.

R’s built-in chi-squared test, chisq.test, can be used to assess whether the distribution of observations in one category is contingent on the distribution of observations in the other category. Note that first we must convert our data to a matrix and drop the first column (in this case sex):

# Convert data to matrix format, needed for the built-in chi-squared test
mydata_chi2_matrix <- mydata_chi2 %>% 
  select(-sex) %>% 
  as.matrix()

Now carry out the chi-squared test:

# Built-in chi-squared
chisq.test(mydata_chi2_matrix)
## 
##  Pearson's Chi-squared test
## 
## data:  mydata_chi2_matrix
## X-squared = 5.0999, df = 2, p-value = 0.07809

Based on this p-value we would not reject the null hypothesis that sex and mood are independent at the 0.05 level of significance.
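As with the goodness-of-fit test, the statistic can be reproduced by hand: the expected count in each cell is the row total multiplied by the column total, divided by the grand total (a short sketch, not part of the original example):

# Manual chi-squared calculation for the 2 x 3 contingency table
observed <- mydata_chi2_matrix
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
chi_sq   <- sum((observed - expected)^2 / expected)   # 5.0999
pchisq(chi_sq, df = (nrow(observed) - 1) * (ncol(observed) - 1), lower.tail = FALSE)  # 0.07809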

Equivalent linear model:

The equivalent linear model is a Poisson regression with interaction terms between the two sets of dummy variables (in this case one set of dummy variables is for mood, the other is for sex).

Here we are testing for the interaction between the two categories, just as we did with the two-way ANOVA (though here our dependent variable is the natural log of the count of observations, rather than the value of a single continuous variable). The test is equivalent to the following linear model, which is expressed using matrix notation:

\(log(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \qquad H_0: \beta_3 = 0\)

Here \(X_1\) and \(X_2\) represent the two categories in our model (in this example, mood and sex). \(\beta_3\) is a vector that relates to the interactions of the categories. Here, it will be a vector comprising two values, corresponding to the combinations of the sex and mood dummy variables (sex_male with mood_happy, and sex_male with mood_meh). The null hypothesis (\(\beta_3 = 0\)) is that all values in this vector are zero, i.e. that there is no interaction between sex and mood when explaining the distribution of observations in each category.

We begin by expressing our data in ‘long’ format, including dummy variables to identify the observations that are ‘happy’ or ‘meh’ and those that are male. The following data, stored in a data frame called mydata_chi2_long, will be used in our linear model:

Table 8.4: Contingency table data in long format with dummy variables

  sex      mood    Freq   mood_happy   mood_meh   sex_male
  male     happy     70            1          0          1
  female   happy    100            1          0          0
  male     meh       32            0          1          1
  female   meh       30            0          1          0
  male     sad      120            0          0          1
  female   sad      110            0          0          0
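The data frame mydata_chi2_long used below can be built from mydata_chi2 along the following lines (a sketch using tidyr and dplyr; the original reshaping code is not shown, so treat this as one possible implementation):

library(tidyr)  # for pivot_longer()

mydata_chi2_long <- mydata_chi2 %>% 
  pivot_longer(cols = c(happy, meh, sad), names_to = 'mood', values_to = 'Freq') %>% 
  # Dummy variables for the linear model
  mutate(mood_happy = if_else(mood == 'happy', 1, 0),
         mood_meh   = if_else(mood == 'meh', 1, 0),
         sex_male   = if_else(sex == 'male', 1, 0))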

The test involves a comparison of two nested linear models: a full model which includes the interaction terms between the two sets of dummy variables, and the null model which excludes the interaction terms. Again we use the (Rao) score test.

full = glm(Freq ~ 1 + mood_happy + mood_meh + sex_male + mood_happy*sex_male + mood_meh*sex_male, 
           data = mydata_chi2_long, family = poisson())
null = glm(Freq ~ 1 + mood_happy + mood_meh + sex_male, 
           data = mydata_chi2_long, family = poisson())
anova(null, full, test = 'Rao') 
## Analysis of Deviance Table
## 
## Model 1: Freq ~ 1 + mood_happy + mood_meh + sex_male
## Model 2: Freq ~ 1 + mood_happy + mood_meh + sex_male + mood_happy * sex_male + 
##     mood_meh * sex_male
##   Resid. Df Resid. Dev Df Deviance    Rao Pr(>Chi)  
## 1         2     5.1199                              
## 2         0     0.0000  2   5.1199 5.0999  0.07809 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the only difference between the nested models is the two interaction terms. Intuitively, we are testing whether the interaction of the two categories is statistically significant, i.e. whether the interaction between mood and sex can explain the distribution of observations over and above the individual effects of mood and sex alone.

Comparison:

As summarized in the table below, the linear model gives the same test statistic as the built-in chi-squared test, has the same degrees of freedom (which is equal to the number of interaction terms), and the same p-value.

Table 8.5: Contingency table: chi-squared test and linear model

  model        Chi-squared   df   p.value
  chisq.test        5.0999    2   0.07809
  glm               5.0999    2   0.07809

  1. Technically, the percentage difference is equal to \(e^{\beta_1} - 1\) and \(e^{\beta_2} - 1\), where \(\beta_1\) and \(\beta_2\) are continuously compounded rates of growth. For example, in the goodness-of-fit example above, the coefficient on ‘sad’ is \(\beta_2 = 0.2513\). This means the proportion of people in the ‘sad’ category is higher by \(e^{0.2513} - 1 = 0.286 = 28.6\%\) than the proportion of people in the ‘meh’ category. This matches the relative difference in the counts of people in these two categories, i.e. 90 ‘sad’ people versus 70 ‘meh’ people (90/70 = 1.286).