The chi-square test is a procedure for testing whether two categorical variables are related to each other in any way.

## Chi-Square Test - Example

A scientist wants to know whether education level is related to marital status for all people in some country. He collects data on a simple random sample of n = 300 people, part of which is shown below.

## Chi-Square Test - Observed Frequencies

A good first step for these data is inspecting the contingency table of marital status by education. Such a table -shown below- displays the frequency distribution of marital status for each education category separately. So let's take a look at it.

The numbers in this table are known as the **observed frequencies**. They tell us an awful lot about our data. For instance,

- there are 4 marital status categories and 5 education levels;
- we succeeded in collecting data on our entire sample of n = 300 respondents (bottom right cell);
- we have 84 respondents with a Bachelor’s degree (bottom row, middle);
- we have 30 divorced respondents (last column, middle);
- we have 9 divorced respondents with a Bachelor’s degree.
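
For readers who want to build such a table themselves, the sketch below counts cells and marginal totals from raw records. The six records are hypothetical placeholders, not the tutorial's data, and the function name `crosstab` is our own choice.

```python
from collections import Counter

# Hypothetical (education, marital status) records; NOT the tutorial's data.
records = [
    ("Middle school", "Never married"),
    ("Middle school", "Married"),
    ("Bachelor", "Married"),
    ("Bachelor", "Divorced"),
    ("Bachelor", "Married"),
    ("PhD", "Married"),
]

def crosstab(pairs):
    """Return cell counts, column totals, row totals and the grand total."""
    pairs = list(pairs)
    cells = Counter(pairs)                                # one count per table cell
    col_totals = Counter(edu for edu, _ in pairs)         # totals per education level
    row_totals = Counter(status for _, status in pairs)   # totals per marital status
    return cells, col_totals, row_totals, len(pairs)

cells, col_totals, row_totals, n = crosstab(records)
print(cells[("Bachelor", "Married")], col_totals["Bachelor"], row_totals["Married"], n)
```

The cell counts correspond to the observed frequencies, and the two sets of totals correspond to the bottom row and last column of the contingency table.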

## Chi-Square Test - Column Percentages

Although our contingency table is a great starting point, it doesn't really show us whether education level and marital status are related. This question is answered more easily from a slightly different table as shown below.

This table shows -for each education level separately- the (column) percentages of respondents that fall into each marital status category. If we inspect the first row, we see that 46% of respondents with middle school never married. If we move rightwards (towards higher education levels), we see this percentage decrease: only 18% of respondents with a PhD degree never married (top right cell).

Conversely, note that 64% of PhD respondents are married (second row). If we move towards the lower education levels (leftwards), we see this percentage decrease to 31% for respondents having just middle school.

In short, more **highly educated respondents seem to marry more** often than less educated respondents.
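
Column percentages like these are simply each cell divided by its column total. A minimal sketch, assuming the table is stored as a list of rows; the 2 x 2 counts are hypothetical and only serve to show the arithmetic.

```python
def column_percentages(table):
    """table[r][c] holds counts; return each cell as a percentage of its column total."""
    n_rows, n_cols = len(table), len(table[0])
    col_totals = [sum(table[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [100.0 * table[r][c] / col_totals[c] for c in range(n_cols)]
        for r in range(n_rows)
    ]

# Hypothetical counts: rows = never married, married; columns = two education levels.
counts = [[12, 5],
          [8, 15]]
pct = column_percentages(counts)
print(pct)
```

Each column of the result adds up to 100%, which is what makes the education levels comparable despite their different sizes.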

## Chi-Square Test - Clustered Bar Chart

Our last table seems to show a clear relation between marital status and education. This becomes much more apparent by visualizing this table as a clustered bar chart, shown below.

If we move from top to bottom (highest to lowest education) in this chart, we see the white bar (never married) increase. Marital status is clearly associated with education level: the lower someone’s education, the smaller the chance he’s married. Statistically, we say that education “says something” about marital status *in our sample*.

## Chi-Square Test - Null Hypothesis

The descriptive statistics we discussed so far are extremely useful and show that marital status is clearly related to education in our sample. However, we can't conclude that this holds for our entire population. The fundamental problem here is that **samples often differ from the populations** from which they were drawn. If marital status is completely unrelated to education in our population, then we may still observe some relation in our sample by mere chance.

The **idea behind the chi-square test** is basically this: we'll first simply assume that education and marital status are independent in our population. This is our null hypothesis. We'll then try to refute this by showing that our sample results are very unlikely under our null hypothesis. If so, we'll reject the null hypothesis and conclude that the variables are related in our population after all.

## Chi-Square Test - Statistical Independence

Before we continue, let's first make sure we understand what “independence” really means in the first place. The chart below visualizes what independence between education and marital status in our population -which may or may not hold- would look like.

The point here is that frequency distributions of marital status are identical over education levels. For instance, 50% of people never married, regardless of whether they only have middle school or a PhD degree. Statistically, we say that **education level “says nothing” about marital status**. Although we don't demonstrate it here, the reverse holds too: marital status doesn't say anything about education either.
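
In formula form, this notion of independence can be written as below. The symbols are our own notation, not the tutorial's: M is marital status, E is education level, and m and e run over their categories (assuming every education level actually occurs in the population).

```latex
P(M = m \mid E = e) = P(M = m) \quad \text{for all } m, e
\qquad \Longleftrightarrow \qquad
P(M = m,\ E = e) = P(M = m)\, P(E = e) \quad \text{for all } m, e
```

The left-hand statement says that the distribution of marital status is the same within every education level. The right-hand statement is symmetric in M and E, which is why the reverse holds too.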

## Chi-Square Test - Expected Frequencies

If education and marital status are independent in our population, then we expect them to be roughly independent in our sample too. Given our sample frequency distributions for education and marital status, perfect independence would imply the contingency table shown below.

This table shows the expected frequencies for our sample under the null hypothesis. These are the most likely frequencies for our sample if education and marital status are statistically independent in our population.

Note that **many frequencies are non-integers**, for instance 11.7 respondents with middle school who never married. Although there's no such thing as “11.7 respondents” in the real world, such non-integer frequencies are just fine mathematically. We won't bother about them.

**Where do these numbers come from?** Well, for each cell, we multiply the marginal frequencies and divide them by our grand total of 300 respondents. For example, 39 respondents have middle school and 90 respondents never married. The expected frequency for respondents with middle school who never married is (39 * 90 / 300 =) 11.7.
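
The calculation just described can be sketched in a few lines of Python. The function name `expected_frequency` is our own; the three numbers are the ones quoted in the paragraph above.

```python
def expected_frequency(row_total, column_total, grand_total):
    """Expected cell count under independence: row total * column total / grand total."""
    return row_total * column_total / grand_total

# From the text: 90 never married, 39 with middle school, 300 respondents in total.
expected = expected_frequency(90, 39, 300)
print(round(expected, 1))  # 11.7
```

Repeating this for all 20 combinations of marginal totals yields the full table of expected frequencies.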

## The Chi-Square Statistic

We're now starting to get somewhere. We first created a table with observed frequencies. These are the 20 frequencies (5 education levels * 4 marital statuses) we observed in our sample. We then created a second table holding the corresponding expected frequencies. These are the 20 frequencies that we should observe if the null hypothesis (education and marital status independent in our population) is true.

Minor differences between observed and expected frequencies can be expected due to mere sampling fluctuation. Large differences, however, suggest that the null hypothesis perhaps wasn't true after all. In order to express such differences in a single number, we take the difference between each observed frequency and its expected counterpart, square it and divide it by the expected frequency. The sum of these 20 numbers is known as the **chi-square statistic** and indicates the total difference between all observed and expected frequencies.
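
As a rough sketch, the usual Pearson form of this statistic, the sum over all cells of (observed - expected)² / expected, could be computed as follows. The function name `chi_square` and both small tables are hypothetical illustrations, not the tutorial's data; the sketch assumes every row and column total is greater than zero.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for r, row in enumerate(observed):
        for c, obs in enumerate(row):
            exp = row_totals[r] * col_totals[c] / n   # expected count under independence
            total += (obs - exp) ** 2 / exp           # this cell's contribution
    return total

related = [[10, 20],
           [20, 10]]      # hypothetical: rows clearly differ
unrelated = [[10, 20],
             [20, 40]]    # hypothetical: rows are exactly proportional

print(chi_square(related), chi_square(unrelated))
```

The proportional table yields a statistic of zero because its observed and expected frequencies coincide, while the other table yields a clearly positive value.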

The precise calculations for expected frequencies and the chi-square statistic are shown in the formulas in this GoogleSheet, a screenshot of which is shown below.

## Simulation Study Chi-Square Values

We found a chi-square value of 23.57 in our sample. **Is that a normal value?** An intuitive approach to this question is a simulation study: we created fake data containing a large population of people. We made sure that these fake data have the same frequency distributions over education and marital status as our sample. However, we made these two variables perfectly independent. In other words: the null hypothesis holds perfectly for our simulated population.

We then drew 1,000 samples (having the same sample size of n = 300 as our actual sample) from this population and calculated the chi-square statistic for each sample. The outcomes are visualized in the histogram below.
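
A simulation along these lines could be sketched as below. This is not the code behind the histogram: the marginal proportions are placeholders we made up (the text does not list the full marginal distributions), and instead of building a large fake population we draw each respondent's two variables independently, which amounts to sampling from an infinitely large population in which the null hypothesis holds.

```python
import random

def chi_square(observed):
    """Pearson chi-square statistic; cells with an expected count of 0 are skipped."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for r, row in enumerate(observed):
        for c, obs in enumerate(row):
            exp = row_totals[r] * col_totals[c] / n
            if exp > 0:
                total += (obs - exp) ** 2 / exp
    return total

def simulate_chi_squares(row_probs, col_probs, n, n_samples, seed=0):
    """Draw n_samples samples of size n with independent rows and columns."""
    rng = random.Random(seed)
    n_rows, n_cols = len(row_probs), len(col_probs)
    results = []
    for _ in range(n_samples):
        table = [[0] * n_cols for _ in range(n_rows)]
        # Independent draws for the two variables enforce the null hypothesis.
        rows = rng.choices(range(n_rows), weights=row_probs, k=n)
        cols = rng.choices(range(n_cols), weights=col_probs, k=n)
        for r, c in zip(rows, cols):
            table[r][c] += 1
        results.append(chi_square(table))
    return results

# ASSUMED marginal proportions (placeholders): 4 marital statuses, 5 education levels.
marital_probs = [0.30, 0.50, 0.10, 0.10]
education_probs = [0.13, 0.22, 0.28, 0.22, 0.15]

sims = simulate_chi_squares(marital_probs, education_probs, n=300, n_samples=1000)
share_above = sum(x > 23.57 for x in sims) / len(sims)
print(sum(sims) / len(sims), share_above)
```

Here `share_above` is the share of simulated chi-square values exceeding the 23.57 found in the actual sample. Its exact value depends on the seed and the assumed proportions.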

## Simulation Study - Conclusion

What does this histogram tell us? Well, if our variables are perfectly independent in our population, then most chi-square values in a sample of n = 300 should fall between 5 and 20. Our chi-square value of 23.57 may be considered very large.

If we look at the results more closely, we see that only 2% of our chi-square values are larger than 23.57 (figure below). Conclusion: **if our null hypothesis is true, then a chi-square value of 23.57 is very unlikely**. The null hypothesis probably wasn't true after all: education and marital status are related in our population.

## Chi-Square Distribution

Our simulation told us that 23.57 is an unusually high chi-square value under the null hypothesis. However, we usually come to this conclusion differently: given some reasonable assumptions, the **chi-square statistic follows a mathematical function**: the chi-square probability density function, usually called the chi-square distribution. The figure below shows what it looks like for our sample.

First, note that our simulated histogram nicely follows this theoretical curve -as it should. Second, our theoretical p-value of 0.023 is roughly the same as the 0.02 that our simulation came up with. In short, both approaches conclude that education and marital status are unlikely to be unrelated in our population.

## Chi-Square Test - Degrees of Freedom

The chi-square distribution in the previous figure says “df = 12” in its title. Df is short for degrees of freedom, a number that determines the exact shape of our curve. A thorough discussion of degrees of freedom is beyond the scope of this tutorial. For two categorical variables, df = (i - 1) * (j - 1) where i and j are the number of levels of the variables involved. Education has 5 categories and marital status has 4. Therefore, df = (5 - 1) * (4 - 1) = 12.
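
Both the degrees of freedom and the theoretical p-value mentioned earlier can be checked with a short sketch. It relies on a closed-form expression for the upper tail of the chi-square distribution that only holds for even df, which happens to cover df = 12; for the general case one would normally use a statistics library (for instance `scipy.stats.chi2.sf`). The function names are our own.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """df for a two-way table: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_p_value_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!, valid for even df only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0          # the i = 0 term
    for i in range(1, df // 2):
        term *= half / i            # (x/2)^i / i!, built up incrementally
        total += term
    return math.exp(-half) * total

df = degrees_of_freedom(4, 5)       # 4 marital statuses, 5 education levels
p = chi_square_p_value_even_df(23.57, df)
print(df, round(p, 3))  # 12 0.023
```

The result agrees with the p-value of 0.023 reported above for a chi-square value of 23.57.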

## Chi-Square Test in Practice

This tutorial tried to explain the reasoning behind a chi-square test using a simulation study and a GoogleSheet. In the real world, however, we usually have some statistical software (such as SPSS, SAS or Stata) run the entire procedure for us. The figure below shows the output generated by SPSS.

## This tutorial has 10 comments

## By Tubal Kumar Benys on November 5th, 2017

Very Nice explanation.. I had no clarity before now its total clarity

## By Hady Shaaban on December 17th, 2016

Excellent supportive illustration as usual :)

## By AFUYE OLUWATAYO SAMUEL on December 15th, 2016

very fantastic,amiable and well educative

## By Ruben Geert van den Berg on November 11th, 2016

Hi Remya!

Both Fisher's exact test and Yates' continuity corrected chi-square value are appropriate only if the marginal distributions (simple frequency counts) for both variables are fixed: if we draw repeated samples, they cannot change from sample to sample. This is rarely the case in practice, with the exception of certain classification tasks. Generally, the Pearson chi-square statistic is the option of choice.

The assumption of all expected frequencies > 5 is still heavily debated. It's been suggested that it's not entirely relevant for sample sizes > 20 or so. Don't take it too literally. If you think it may really be a problem, merge some categories as explained in SPSS - Merge Categories of Categorical Variable.

Hope that helps!

## By Remya Unnikrishnan on November 11th, 2016

Sir,

Can you please explain the situations when we have to use Continuity test and fisher's exact test value instead of pearson chi square value.

If the sample size is high and more than one cell has frequency <5 in 3x2,4x3,3x3 etc is it ok for taking pearson value..

Please do reply.

Thank you