Cohen’s Kappa (Statistics) - The Complete Guide
SPSS tutorials website header logo SPSS TUTORIALS VIDEO COURSE BASICS ANOVA REGRESSION FACTOR

Cohen’s Kappa – What & Why?

Cohen’s kappa is a measure that indicates to what extent
2 ratings agree better than chance level.

Cohen’s Kappa - Quick Example

Two pediatricians observe N = 50 children. They independently diagnose each child. The data thus obtained are in this Googlesheet, partly shown below.

Cohens Kappa Raw Data In Googlesheet

As we readily see, our raters agree on some children and disagree on others. So what we'd like to know is: to what extent do our raters agree on these diagnoses? An obvious approach is to compute the proportion of children on whom our raters agree. We can easily do so by creating a contingency table or “crosstab” for our raters as shown below.

Cohens Kappa Contingency Table

Note that the diagonal elements in green are the numbers of children on whom our raters agree. The observed agreement proportion \(P_a\) is easily calculated as

$$P_a = \frac{27 + 2 + 2 + 2 + 1}{50} = 0.68$$

This means that our raters diagnose 68% out of 50 children similarly. Now, this may seem pretty good but what if our raters would diagnose children as (un)healthy
by simply flipping coins?
Such diagnoses would be pretty worthless, right? Nevertheless, we'd expect an agreement proportion of \(P_a\) = 0.50 in this case: our raters would agree on 50% of children just by chance.

A solution to this problem is to correct for such a chance-level agreement proportion. Cohen’s kappa does just that.

Cohen’s Kappa - Formulas

First off, how many children are diagnosed similarly? For this, we simply add up the diagonal elements (in green) in the table below.

Cohens Kappa Contingency Table

This results in

$$\Sigma{o_{ij}} = 27 + 2 + 2 + 2 + 1 = 34$$

where \(o_{ij}\) denotes the observed frequencies on the diagonal.

Second, if our raters would not agree to any extent at all, how many similar diagnoses should we expect due to mere chance? Such expected frequencies are calculated as

$$e_{ij} = \frac{o_i\cdot o_j}{N}$$

where

Like so, the expected frequency for rater A = “mentally healthy” (n = 35) and rater B = “mentally healthy” (n = 33) is

$$e_{ij} = \frac{35\cdot 33}{50} = 23.1$$

The table below shows this and all other expected frequencies.

Cohens Kappa Expected Frequencies

The diagonal elements in green show all expected frequencies for both raters giving similar diagnoses by mere chance. These add up to

$$\Sigma{e_{ij}} = 23.1 + 0.72 + 0.32 + 0.30 + 0.08 = 24.52$$

Finally, Cohen’s kappa (denoted as \(\boldsymbol \kappa\), the Greek letter kappa) is computed as3

$$\kappa = \frac{\Sigma{o_{ij}} - \Sigma{e_{ij}}}{N - \Sigma{e_{ij}}}$$

For our example, this results in

$$\kappa = \frac{34 - 24.52}{50 - 24.52} = 0.372$$

as confirmed by our Googlesheet shown below.

Cohens Kappa Example Calculation

An alternative formula for Cohen’s kappa is

$$\kappa = \frac{P_a - P_c}{1 - P_c}$$

where

For our data, this results in

$$\kappa = \frac{0.68 - 0.49}{1 - 0.49} = 0.372$$

This formula also sheds some light on what Cohen’s kappa really means:

$$\kappa = \frac{\text{actual performance - chance performance}}{\text{perfect performance - chance performance}}$$

which comes down to

$$\kappa = \frac{\text{actual improvement over chance}}{\text{maximum possible improvement over chance}}$$

Cohen’s Kappa - Interpretation

Like we just saw, Cohen’s kappa basically indicates the extent to which observed agreement is better than chance agreement. Technically, agreement could be worse than chance too, resulting in Cohen’s kappa < 0. In short, Cohen’s kappa can run from -1.0 through 1.0 (both inclusive) where

Another way to think of Cohen’s kappa is the proportion of disagreement reduction compared to chance. For our example, we expected N = 25.48 different diagnoses by chance. Since \(\kappa\) = .372, this is reduced to

$$25.48 - (25.48 \cdot 0.372) = 16$$

different (disagreeing) diagnoses by our raters.

With regard to effect size, there's no clear consensus on rules of thumb. However, Twisk (2016)4 more or less proposes that

Cohen’s Kappa in SPSS

In SPSS, Cohen’s kappa is found under Analyze SPSS Menu Arrow Descriptive statistics SPSS Menu Arrow Crosstabs as shown below.

Cohens Kappa SPSS Crosstabs

The output (below) confirms that \(\kappa\) = .372 for our example.

Cohens Kappa In SPSS Output

Keep in mind that the significance level is based on the null hypothesis that \(\kappa\) = 0.0. However, concluding that kappa is probably not zero is pretty useless because zero doesn't come anywhere close to an acceptable value for kappa. A confidence interval would have been much more useful here but -sadly- SPSS doesn't include it.

When (Not) to Use Cohen’s Kappa?

Cohen’s kappa is mostly suitable for comparing 2 ratings if both ratings

For ordinal or quantitative variables, Cohen’s kappa is not your best option. This is because it only distinguishes between the ratings agreeing or disagreeing. So let's say we have answer categories such as

  1. very bad;
  2. somewhat bad;
  3. neutral;
  4. somewhat good;
  5. very good.

If 2 raters rate some item as “very bad” and “somewhat bad”, they slightly disagree. If they rate it as “very bad” and “very good”, they disagree much more strongly but Cohen’s kappa ignores this important difference: in both scenarios they simply “disagree”. A measure that does take into account how much raters disagree is weighted kappa. This is therefore a more suitable measure for ordinal variables.

Cohen’s kappa is an association measure for 2 nominal variables. For testing if this association is zero (both variables independent), we often use a chi-square independence test. Effect size measures for this test are

These measures can basically be seen as correlations for nominal variables. So how do they differ from Cohen’s kappa? Well,

whereas the other measures don't.

Second, if both ratings are ordinal, then weighted kappa is a more suitable measure than Cohen’s kappa.1 This measure takes into account (or “weights”) how much raters disagree. Weighted kappa was introduced in SPSS version 27 under Analyze SPSS Menu Arrow Scale SPSS Menu Arrow Weighted Kappa as shown below.

SPSS Weighted Kappa Menu

Lastly, if you have 3(+) raters instead of just two, use Fleiss-multirater-kappa. This measure is available in SPSS version 28(+) from Analyze SPSS Menu Arrow Scale SPSS Menu Arrow Reliability Analysis Note that this is the same dialog as used for Cronbach’s alpha.

References

  1. Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.
  2. Warner, R.M. (2013). Applied Statistics (2nd. Edition). Thousand Oaks, CA: SAGE.
  3. Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
  4. Twisk, J.W.R. (2016). Inleiding in de Toegepaste Biostatistiek [Introduction to Applied Biostatistics]. Houten: Bohn Stafleu van Loghum.
  5. Van den Brink, W.P. & Mellenberg, G.J. (1998). Testleer en testconstructie [Test science and test construction]. Amsterdam: Boom.
  6. Fleiss, J.L., Levin, B. & Cho Paik, M. (2003). Statistical Methods for Rates and Proportions (3d. Edition). Hoboken, NJ: Wiley.

Cohen’s D – Effect Size for T-Test

Cohen’s D is the difference between 2 means
expressed in standard deviations.

Why Do We Need Cohen’s D?

Children from married and divorced parents completed some psychological tests: anxiety, depression and others. For comparing these 2 groups of children, their mean scores were compared using independent samples t-tests. The results are shown below.

SPSS T Test Output Table

Some basic conclusions are that

However, what we really want to know is are these small, medium or large differences? This is hard to answer for 2 reasons:

A solution to both problems is using the standard deviation as a unit of measurement like we do when computing z-scores. And a mean difference expressed in standard deviations -Cohen’s D- is an interpretable effect size measure for t-tests.

Cohen’s D - Formulas

Cohen’s D is computed as
$$D = \frac{M_1 - M_2}{S_p}$$
where

But precisely what is the “pooled estimated population standard deviation”? Well, the independent-samples t-test assumes that the 2 groups we compare have the same population standard deviation. And we estimate it by “pooling” our 2 sample standard deviations with

$$S_p = \sqrt{\frac{(N_1 - 1) \cdot S_1^2 + (N_2 - 1) \cdot S_2^2}{N_1 + N_2 - 2}}$$

Fortunately, we rarely need this formula: SPSS, JASP and Excel readily compute a t-test with Cohen’s D for us.

Cohen’s D in JASP

Running the exact same t-tests in JASP and requesting “effect size” with confidence intervals results in the output shown below.

Cohens D Output Jasp

Note that Cohen’s D ranges from -0.43 through -2.13. Some minimal guidelines are that

And there we have it. Roughly speaking, the effects for

We'll go into the interpretation of Cohen’s D into much more detail later on. Let's first see how Cohen’s D relates to power and the point-biserial correlation, a different effect size measure for a t-test.

Cohen’s D and Power

Very interestingly, the power for a t-test can be computed directly from Cohen’s D. This requires specifying both sample sizes and α, usually 0.05. The illustration below -created with G*Power- shows how power increases with total sample size. It assumes that both samples are equally large.

Power Versus Sample Size For Cohens D

If we test at α = 0.05 and we want power (1 - β) = 0.8 then

Cohen’s D and Overlapping Distributions

The assumptions for an independent-samples t-test are

  1. independent observations;
  2. normality: the outcome variable must be normally distributed in each subpopulation;
  3. homogeneity: both subpopulations must have equal population standard deviations and -hence- variances.

If assumptions 2 and 3 are perfectly met, then Cohen’s D implies which percentage of the frequency distributions overlap. The example below shows how some male population overlaps with some 69% of some female population when Cohen’s D = 0.8, a large effect.

Cohens D Overlapping Distributions

The percentage of overlap increases as Cohen’s D decreases. In this case, the distribution midpoints move towards each other. Some basic benchmarks are included in the interpretation table which we'll present in a minute.

Cohen’s D & Point-Biserial Correlation

An alternative effect size measure for the independent-samples t-test is \(R_{pb}\), the point-biserial correlation. This is simply a Pearson correlation between a quantitative and a dichotomous variable. It can be computed from Cohen’s D with
$$R_{pb} = \frac{D}{\sqrt{D^2 + 4}}$$

For our 3 benchmark values,

Alternatively, compute \(R_{pb}\) from the t-value and its degrees of freedom with
$$R_{pb} = \sqrt{\frac{t^2}{t^2 + df}}$$

Cohen’s D - Interpretation

The table below summarizes the rules of thumb regarding Cohen’s D that we discussed in the previous paragraphs.

Cohen’s DInterpretation Rpb% overlapRecommended N
d = 0.2Small effect ± 0.100± 92%788
d = 0.5Medium effect ± 0.243± 80%128
d = 0.8Large effect ± 0.371± 69%52

Cohen’s D for SPSS Users

Cohen’s D is available in SPSS versions 27 and higher. It's obtained from Analyze SPSS Menu Arrow Compare Means SPSS Menu Arrow Independent Samples T Test as shown below.

Cohens D In SPSS

For more details on the output, please consult SPSS Independent Samples T-Test.

If you're using SPSS version 26 or lower, you can use Cohens-d.xlsx. This Excel sheet recomputes all output for one or many t-tests including Cohen’s D and its confidence interval from

The input for our example data in divorced.sav and a tiny section of the resulting output is shown below.

Cohens D Excel Tool Screenshot

Note that the Excel tool doesn't require the raw data: a handful of descriptive statistics -possibly from a printed article- is sufficient.

SPSS users can easily create the required input from a simple MEANS command if it includes at least 2 variables. An example is

*Create table with N, mean and SD for test scores by divorced for copying into Excel.

means anxi to anti by divorced
/cells count mean stddev.

Copy-pasting the SPSS output table as Excel preserves the (hidden) decimals of the results. These can be made visible in Excel and reduce rounding inaccuracies.

SPSS Output Table Copy As Excel

Final Notes

I think Cohen’s D is useful but I still prefer R2, the squared (Pearson) correlation between the independent and dependent variable. Note that this is perfectly valid for dichotomous variables and also serves as the fundament for dummy variable regression.

The reason I prefer R2 is that it's in line with other effect size measures: the independent-samples t-test is a special case of ANOVA. And if we run a t-test as an ANOVA, η2 (eta squared) = R2 or the proportion of variance accounted for by the independent variable. This raises the question: why should we use a different effect size measure
if we compare 2 instead of 3+ subpopulations?
I think we shouldn't.

This line of reasoning also argues against reporting 1-tailed significance for t-tests: if we run a t-test as an ANOVA, the p-value is always the 2-tailed significance for the corresponding t-test. So why should you report a different measure for comparing 2 instead of 3+ means?

But anyway, that'll do for today. If you've any feedback -positive or negative- please drop us a comment below. And last but not least:

thanks for reading!

Chi-Square Goodness-of-Fit Test – Simple Tutorial

A chi-square goodness-of-fit test examines if a categorical variable
has some hypothesized frequency distribution in some population.
The chi-square goodness-of-fit test is also known as

Example - Testing Car Advertisements

A car manufacturer wants to launch a campaign for a new car. They'll show advertisements -or “ads”- in 4 different sizes. For ad each size, they have 4 ads that try to convey some message such as “this car is environmentally friendly”. They then asked N = 80 people which ad they liked most. The data thus obtained are in this Googlesheet, partly shown below.

Chi Square Goodness Of Fit Test Raw Data

So which ads performed best in our sample? Well, we can simply look up which ad was preferred by most respondents: the ad having the highest frequency is the mode for each ad size.

So let's have a look at the frequency distribution for the first ad size -ad1- as visualized in the bar chart shown below.

Observed Frequencies and Bar Chart

Chi Square Goodness Of Fit Test Bar Chart Frequencies

The observed frequencies shown in this chart are

  1. Safe and Family Friendly: 6
  2. Luxurious and Masculine: 29
  3. Environmentally Friendly: 16
  4. Spacious and Convenient: 29

Note that ad1 has a bimodal distribution: ads 2 and 4 are both winners with 29 votes. However, our data only hold a sample of N = 80. So can we conclude that ads 2 and 4
also perform best in the entire population?
The chi-square goodness-of-fit answers just that. And for this example, it does so by trying to reject the null hypothesis that all ads perform equally well in the population.

Null Hypothesis

Generally, the null hypothesis for a chi-square goodness-of-fit test is simply

$$H_0: P_{01}, P_{02},...,P_{0m},\; \sum_{i=0}^m\biggl(P_{0i}\biggr) = 1$$

where \(P_{0i}\) denote population proportions for \(m\) categories in some categorical variable. You can choose any set of proportions as long as they add up to one. In many cases, all proportions being equal is the most likely null hypothesis.
For a dichotomous variable having only 2 categories, you're better off using

Anyway, for our example, we'd like to show that some ads perform better than others. So we'll try to refute that our 4 population proportions are all equal and -hence- 0.25.

Expected Frequencies

Now, if the 4 population proportions really are 0.25 and we sample N = 80 respondents, then we expect each ad to be preferred by 0.25 · 80 = 20 respondents. That is, all 4 expected frequencies are 20. We need to know these expected frequencies for 2 reasons:

Assumptions

The chi-square goodness-of-fit test requires 2 assumptions2,3:

  1. independent observations;
  2. for 2 categories, each expected frequency \(Ei\) must be at least 5.
    For 3+ categories, each \(Ei\) must be at least 1 and no more than 20% of all \(Ei\) may be smaller than 5.

The observations in our data are independent because they are distinct persons who didn't interact while completing our survey. We also saw that all \(Ei\) are (0.25 · 80 =) 20 for our example. So this second assumption is met as well.

Formulas

We'll first compute the \(\chi^2\) test statistic as

$$\chi^2 = \sum\frac{(O_i - E_i)^2}{E_i}$$

where

For ad1, this results in

$$\chi^2 = \frac{(16 - 20)^2}{20} + \frac{(29 - 20)^2}{20} + \frac{(9 - 20)^2}{20} + \frac{(29 - 20)^2}{20} = 18.7 $$

If all assumptions have been met, \(\chi^2\) approximately follows a chi-square distribution with \(df\) degrees of freedom where

$$df = m - 1$$

for \(m\) frequencies. Since we have 4 frequencies for 4 different ads,

$$df = 4 - 1 = 3$$

for our example data. Finally, we can simply look up the significance level as

$$P(\chi^2(3) > 18.7) \approx 0.00032$$

We ran these calculations in this Googlesheet shown below.

Chi Square Goodness Of Fit Test Significance Test

So what does this mean? Well, if all 4 ads are equally preferred in the population, there's a 0.00032 chance of finding our observed frequencies. Since p < 0.05, we reject the null hypothesis. Conclusion: some ads are preferred by more people than others in the entire population of readers.

Right, so it's safe to assume that the population proportions are not all equal. But precisely how different are they? We can express this in a single number: the effect size.

Effect Size - Cohen’s W

The effect size for a chi-square goodness-of-fit test -as well as the chi-square independence test- is Cohen’s W. Some rules of thumb1 are that

Cohen’s W is computed as

$$W = \sqrt{\sum_{i = 1}^m\frac{(P_{oi} - P_{ei})^2}{P_{ei}}}$$

where

For ad1, the null hypothesis states that all expected proportions are 0.25. The observed proportions are computed from the observed frequencies (see screenshot below) and result in

$$W = \sqrt{\frac{(0.2 - 0.25)^2}{0.25} +\frac{(0.3625 - 0.25)^2}{0.25} +\frac{(0.075 - 0.25)^2}{0.25} +\frac{(0.3625 - 0.25)^2}{0.25} } = $$

$$W = \sqrt{0.234} = 0.483$$

We ran these computations in this Googlesheet shown below.

Chi Square Goodness Of Fit Test Effect Size Cohens W

For ad1, the effect size \(W\) = 0.483. This indicates a large overall difference between the observed and expected frequencies.

Power and Sample Size Calculation

Now that we computed our effect size, we're ready for our last 2 steps. First off, what about power? What's the probability demonstrating an effect if

The chart below -created in G*Power- answers just that.

Chi Square Goodness Of Fit Test Power Versus Effect Size Chart

Some basic conclusions are that

These outcomes are not too great: we only have a 0.60 probability of rejecting the null hypothesis if the population effect size is medium and N = 80. However, we can increase power by increasing the sample size. So which sample sizes do we need if

The chart below shows how required sample sizes decrease with increasing effect sizes.

Chi Square Goodness Of Fit Test Sample Size For Power Chart

Under the aforementioned conditions, we have power ≥ 0.80

References

  1. Cohen, J (1988). Statistical Power Analysis for the Social Sciences (2nd. Edition). Hillsdale, New Jersey, Lawrence Erlbaum Associates.
  2. Siegel, S. & Castellan, N.J. (1989). Nonparametric Statistics for the Behavioral Sciences (2nd ed.). Singapore: McGraw-Hill.
  3. Warner, R.M. (2013). Applied Statistics (2nd. Edition). Thousand Oaks, CA: SAGE.

Simple Introduction to Confidence Intervals

A confidence interval is a range of values
that encloses a parameter with a given likelihood.
So let's say we've a sample of 200 people from a population of 100,000. Our sample data come up with a correlation of 0.41 and indicate that the 95% confidence interval for this correlation
runs from 0.29 to 0.52.
This means that

So basically, a confidence interval tells us how much our sample correlation is likely to differ from the population correlation we're after.

Confidence Intervals - Example

El Hierro is the smallest Canary island and has 8,077 inhabitants of 18 years or over. A scientist wants to know their average yearly income. He asks a sample of N = 100. The table below presents his findings.

Confidence Interval Descriptive Statistics

Based on these 100 people, he concludes that the average yearly income for all 8,077 inhabitants is probably between $25,630 and $32,052. So how does that work?

Confidence Intervals - How Does it Work?

Let's say the tax authorities have access to the yearly incomes of all 8,077 inhabitants. The table below shows some descriptive statistics.

Confidence Interval Population Parameters

Now, a scientist who samples 100 of these people can compute a sample mean income. This sample mean probably differs somewhat from the $32,383 population mean. Another scientist could also sample 100 people and come up with another different mean. And so on: if we'd draw 100 different samples, we'd probably find 100 different means. In short, sample means fluctuate over samples. So how much do they fluctuate? This is expressed by the standard deviation of sample means over samples, known as the standard error -SE- of the mean. SE is calculated as
$$SE = \frac{\sigma}{\sqrt{N}}$$
so for our data that'll be $$SE = \frac{$22,874}{\sqrt{100}} = $2,287.$$
Right. Now, statisticians also figured out the exact frequency distribution of sample means: the sampling distribution of the mean. For our data, it's shown below.

Confidence Interval Population Parameters

Our graph tells us that 95% of all samples will come up with a mean between roughly $27,808 and $36,958. This is basically the mean ± 2SE:

In practice, however, we usually don't know the population mean. So we estimate it from sample data. But how much is a sample mean likely to differ from its population counterpart? Well, we just saw that a sample mean has a 95% probability of falling within ± 2SE of the population mean.
Now, we don't know SE because it depends on the (unknown) population standard deviation. However, we can estimate SE from the sample standard deviation. By doing so, most samples will come up with roughly the correct SE. As a result, the 95% of samples whose means fall within ± 2SE
typically have confidence intervals enclosing the population mean
as illustrated below.

Confidence Intervals - Illustration

Confidence Interval Mean Sampling distribution and confidence intervals. Note that the interval for sample 3 does not contain the population mean μ. This holds for 5% of all CI’s.

Now, a sample having a mean within ±2SE may have a confidence interval not containing the population mean. This may happen if it underestimates the population standard deviation. The reverse may occur too.
However, the sample standard deviation is an unbiased estimator: on average it is exactly correct. So for all samples, exactly 95% of all 95% confidence intervals
contain the parameter they estimate.
Just as promised.

Confidence Intervals - Basic Properties

Right, so a confidence interval is basically a likely range of values for a parameter such as a population correlation, mean or proportion. Therefore, wider confidence intervals indicate less precise estimates for such parameters.

Three factors determine the width of a confidence interval. Everything else equal,

Confidence Intervals or Statistical Significance?

If both are available, confidence intervals. Why? Well, confidence intervals give the same -and more- information than statistical significance. Some examples:

Confidence Interval Versus Statistical Significance One Sample T Test

So should we stop reporting statistical significance altogether in favor of confidence intervals? Probably not. Confidence intervals are not available for nonparametric tests such as ANOVA or the chi-square independence test. If we compare 2 means, a single confidence interval for the difference tells it all. But that's not going to work for comparing 3 or more means...

Formulas and Example Calculations

Statistical software such as SPSS, Stata or SAS computes confidence intervals for us so there's no need to bother about any formulas or calculations. Do you want to know anyway? Then let's go: we computed the confidence interval for our example in this Googlesheet (downloadable as Excel) as shown below.

Confidence Interval Calculation Googlesheets

So how does it work? Well, first off, our sample data came up with the descriptive statistics shown below.

Confidence Interval Descriptive Statistics

We estimate the standard error of the mean as $$SE_{mean} = \frac{S}{\sqrt{N}}$$
so that'll be $$SE_{mean} = \frac{$16,185}{\sqrt{100}} = $1,6185.$$
Next, $$T = \frac{M - \mu}{SE_{mean}}$$
This formula tries to tell you that the difference between the sample mean \(M\) and the population mean \(\mu\) divided by \(SE_{mean}\) follows a t distribution. We're really just standardizing the mean difference here into a z-score (T).
Finally, we need the degrees of freedom given by $$Df = N - 1$$
so that'll be $$Df = 100 - 1 = 99.$$
So between which t-values do we find 95% of all (standardized) mean differences? We can look this up in Google sheets as shown below.

Inverse T Distribution Google Sheets

This tells us that a proportion of 0.025 (or 2.5%) of all t-values < -1.984. Because the t-distribution is symmetrical, a proportion of 0.975 of t-values > 1.984. These critical t-values are visualized below.

Critical T Value Df 99

The illustration tells us that our previous rule of thumb of roughly ±2SE is ±1.984SE for this example: 95% of all standardized mean differences are between -1.984 and 1.984. Finally, the 95% confidence interval is $$M - T_{0.975} \cdot SE_{mean} \lt \mu \lt M + T_{0.975} \cdot SE_{mean} $$
so that'll be $$$28,841 - 1.984 \cdot $1,619 \lt \mu \lt $28,841 + 1.984 \cdot $1,619$$
which results in $$$25,630 \lt \mu \lt $32,052.$$
Thanks for reading.

Covariance – Quick Introduction

Covariance - What is It?

A covariance is basically an unstandardized correlation. That is: a covariance is a number that indicates to what extent 2 variables are linearly related. In contrast to a (Pearson) correlation, however, a covariance depends on the scales of both variables involved as expressed by their standard deviations.

The figure below visualizes some correlations and covariances as scatterplots.

Covariances In Scatterplots

x1 and y1 are basically unrelated. The covariance and correlation are both close to zero;
x2 and y2 are strongly related but not linearly at all. The covariance and correlation are zero.
x3 and y3 are negatively related. The covariance and correlation are both negative;
x4 and y4 are positively related. The covariance and correlation are both positive;
x5 and y5 are strongly positively related. Because they have the same standard deviations as x4 and y4, the correlation and covariance both increase;
x6 and y6 are identical to x5 and y5 except that their standard deviations are 1.0 instead of 2.0. This shrinks the covariance with a factor 4.0 but does not affect the correlation.

Comparing plots and emphasizes that covariances are scale dependent whereas correlations aren't. This may make you wonder why should I ever compute a covariance
instead of a correlation?

Covariance or Correlation?

First off, the precise relation between a covariance and correlation is given by

$$S_{xy} = r_{xy} \cdot s_x \cdot s_y$$

where

This formula shows that a covariance can be seen as a correlation that's “weighted” by the product of the standard deviations of the 2 variables involved: everything else equal, larger standard deviations result in larger covariances.

This feature may be desirable for comparing associations among variable pairs. This only makes sense if all variables are measured on identical scales such as dollars, seconds or kilos. Some analyses that require covariances are the following:

1. Cronbach’s alpha is usually computed on covariances instead of correlations. This is because scale scores are computed as sums or means over unstandardized variables. Therefore, variables with larger SD's have more impact on scale scores. This is why associations among such variables also have more weight in the computation of Cronbach's alpha.

2. In factor analysis, a covariance matrix is sometimes analyzed instead of a correlation matrix. If so, associations among variables have more impact on the factor solution insofar as these variables have larger SD's.

3. Some analyses need to meet the assumption of equal covariance matrices over subpopulations. An example is MANOVA, in which the Box test -basically a multivariate expansion of Levene's test- is often used for testing this assumption.

4. Somewhat surprisingly, ANCOVA -meaning analysis of covariance- does not involve computing covariances.

So those are some analyses that involve covariances. So how are these computed? Well, which formula to use depends on which type of data you're analyzing.

Sample Covariance Formula

If your data contain a sample from a much larger population (usually the case), the sample covariance is computed as

$$S_{xy} = \frac{\sum\limits_{i = 1}^N(X_i - \overline{X})(Y_i - \overline{Y})}{N - 1}$$

where

Let's now get a grip on this formula by using it in a calculation example.

Covariance Calculation Example

The table below contains the weights in grams of 10 babies at birth (X) and at age 12 months (Y). What's the covariance between X and Y?

ID12345678910
X3777327937603579413830673438405944933517
Y869578449532880795377073887311465118378604

First off,

Therefore,

$$S_{xy} = \frac{(3777 - 3711)\cdot(8695 - 9227)\;+\;...\;+\;(3517 - 3711)\cdot(8604 - 9227)}{10 - 1}$$

$$S_{xy} = \frac{66 \cdot -532\;+\;...\;+\;-194 \cdot -623}{10 - 1}$$

$$S_{xy} = \frac{5189622}{10 - 1} = 576625$$

You can look up the entire calculation in this Googlesheet, partly shown below.

Compute Covariance In Googlesheets

Population Covariance Formula

If your data hold the entire population you'd like to study, you can compute the covariance as

$$\sigma_{xy} = \frac{\sum\limits_{i = 1}^N(X_i - \mu_x)(Y_i - \mu_Y)}{N}$$

where

Software for Computing Covariances

Both sample and population covariances are easily computed in Googlesheets and Excel. This Googlesheet, partly shown below, contains a couple of examples.

Covariance Formulas In Googlesheets

A full covariance matrix for several variables is easily obtained from SPSS. However, “covariance” in SPSS always refers to the sample covariance because the population covariance is completely absent from SPSS. Pretty poor for a “statistical package”. But anyway: the only menu based option for this is Analyze SPSS Menu Arrow Correlate SPSS Menu Arrow Bivariate as illustrated below.

Covariances In SPSS Correlations Options Dialog

A much better option, however, is using SPSS syntax like we did in Cronbach’s Alpha in SPSS. This is faster and results in a much nicer table layout as shown below.

Covariance Matrix From SPSS

Two quick notes are in place here:

Just like a correlation matrix, a covariance matrix is symmetrical: the covariance between X and Y is obviously equal to that between Y and X.

The main diagonal contains the covariances between each variable and itself. These are simply the variances (squared standard deviations) of our variables. This last point implies that we can compute a correlation matrix from a covariance matrix
but not reversely.
For example, the correlation between our first 2 variables is

$$r_{xy} = \frac{576625}{\sqrt{183629} \cdot \sqrt{2170571}} = 0.913$$

Right. I guess that should do regarding covariances. If you've any feedback, please throw us a comment below. Other than that:

thanks for reading!

Chi-Square Independence Test – What and Why?

Chi-Square Independence Test - What Is It?

The chi-square independence test evaluates if
two categorical variables are related in some population.
Example: a scientist wants to know if education level and marital status are related for all people in some country. He collects data on a simple random sample of n = 300 people, part of which are shown below.

Chi-Square Test - Raw Data View

Chi-Square Test - Observed Frequencies

A good first step for these data is inspecting the contingency table of marital status by education. Such a table -shown below- displays the frequency distribution of marital status for each education category separately. So let's take a look at it.

Chi-Square Test - Contingency Table

The numbers in this table are known as the observed frequencies. They tell us an awful lot about our data. For instance,

Chi-Square Test - Column Percentages

Although our contingency table is a great starting point, it doesn't really show us if education level and marital status are related. This question is answered more easily from a slightly different table as shown below.

Chi-Square Test - Column Percentages

This table shows -for each education level separately- the percentages of respondents that fall into each marital status category. Before reading on, take a careful look at this table and tell me is marital status related to education level and -if so- how? If we inspect the first row, we see that 46% of respondents with middle school never married. If we move rightwards (towards higher education levels), we see this percentage decrease: only 18% of respondents with a PhD degree never married (top right cell).

Reversely, note that 64% of PhD respondents are married (second row). If we move towards the lower education levels (leftwards), we see this percentage decrease to 31% for respondents having just middle school. In short, more highly educated respondents marry more often than less educated respondents.

Chi-Square Test - Stacked Bar Chart

Our last table shows a relation between marital status and education. This becomes much clearer by visualizing this table as a stacked bar chart, shown below.

Chi-Square Independence Test - Stacked Bar Chart Showing Dependence

If we move from top to bottom (highest to lowest education) in this chart, we see the dark blue bar (never married) increase. Marital status is clearly associated with education level.The lower someone’s education, the smaller the chance he’s married. That is: education “says something” about marital status (and reversely) in our sample. So what about the population?

Chi-Square Test - Null Hypothesis

The null hypothesis for a chi-square independence test is that two categorical variables are independent in some population. Now, marital status and education are related -thus not independent- in our sample. However, we can't conclude that this holds for our entire population. The basic problem is that samples usually differ from populations.

If marital status and education are perfectly independent in our population, we may still see some relation in our sample by mere chance. However, a strong relation in a large sample is extremely unlikely and hence refutes our null hypothesis. In this case we'll conclude that the variables were not independent in our population after all.

So exactly how strong is this dependence -or association- in our sample? And what's the probability -or p-value- of finding it if the variables are (perfectly) independent in the entire population?

Chi-Square Test - Statistical Independence

Before we continue, let's first make sure we understand what “independence” really means in the first place. In short, independence means that one variable doesn't
“say anything” about another variable.
A different way of saying the exact same thing is that independence means that the relative frequencies of one variable
are identical over all levels of some other variable.
Uh... say again? Well, what if we had found the chart below?

Chi-Square Independence Test - Stacked Bar Chart Showing Statistical Independence

What does education “say about” marital status? Absolutely nothing! Why? Because the frequency distributions of marital status are identical over education levels: no matter the education level, the probability of being married is 50% and the probability of never being married is 30%.

In this chart, education and marital status are perfectly independent. The hypothesis of independence tells us which frequencies we should have found in our sample: the expected frequencies.

Expected Frequencies

Expected frequencies are the frequencies we expect in a sample
if the null hypothesis holds.
If education and marital status are independent in our population, then we expect this in our sample too. This implies the contingency table -holding expected frequencies- shown below.

Chi-Square Test - Expected Frequencies

These expected frequencies are calculated as $$eij = \frac{oi\cdot oj}{N}$$
where

So for our first cell, that'll be $$eij = \frac{39 \cdot 90}{300} = 11.7$$
and so on. But let's not bother too much as our software will take care of all this.

Note that many expected frequencies are non integers. For instance, 11.7 respondents with middle school who never married. Although there's no such thing as “11.7 respondents” in the real world, such non integer frequencies are just fine mathematically. So at this point, we've 2 contingency tables:

The screenshot below shows both tables in this GoogleSheet (read-only). This sheet demonstrates all formulas that are used for this test.

Observed and Expected Frequencies in GoogleSheet

Residuals

Insofar as the observed and expected frequencies differ, our data deviate more from independence. So how much do they differ? First off, we subtract each expected frequency from each observed frequency, resulting in a residual. That is, $$rij = oij - eij$$
For our example, this results in (5 * 4 =) 20 residuals. Larger (absolute) residuals indicate a larger difference between our data and the null hypothesis. We basically add up all residuals, resulting in a single number: the χ2 (pronounce “chi-square”) test statistic.

Test Statistic

The chi-square test statistic is calculated as
$$\chi^2 = \Sigma{\frac{(oij - eij)^2}{eij}}$$
so for our data $$\chi^2 = \frac{(18 - 11.7)^2}{11.7} + \frac{(36 - 27)^2}{27} + ... + \frac{(6 - 5.4)^2}{5.4} = 23.57$$

Again, our software will take care of all this. But if you'd like to see the calculations, take a look at this GoogleSheet.

Chi-Square Test Statistic in GoogleSheet

So χ2 = 23.57 in our sample. This number summarizes the difference between our data and our independence hypothesis. Is 23.57 a large value? What's the probability of finding this? Well, we can calculate it from its sampling distribution but this requires a couple of assumptions.

Chi-Square Test Assumptions

The assumptions for a chi-square independence test are

  1. independent observations. This usually -not always- holds if each case in SPSS holds a unique person or other statistical unit. Since this is the case for our data, we'll assume this has been met.
  2. For a 2 by 2 table, all expected frequencies > 5.However, for a 2 by 2 table, a z-test for 2 independent proportions is preferred over the chi-square test.

    For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.

If these assumptions hold, our χ2 test statistic follows a χ2 distribution. It's this distribution that tells us the probability of finding χ2 > 23.57.

Chi-Square Test - Degrees of Freedom

We'll get the p-value we're after from the chi-square distribution if we give it 2 numbers:

The degrees of freedom is basically a number that determines the exact shape of our distribution. The figure below illustrates this point.

Chi-Square Distributions with Different DF

Right. Now, degrees of freedom -or df- are calculated as $$df = (i - 1) \cdot (j - 1)$$
where

so in our example $$df = (5 - 1) \cdot (4 - 1) = 12.$$
And with df = 12, the probability of finding χ2 ≥ 23.57 ≈ 0.023.We simply look this up in SPSS or other appropriate software. This is our 1-tailed significance. It basically means, there's a 0.023 (or 2.3%) chance of finding this association in our sample if it is zero in our population.

Chi-Square Distribution with 1-Tailed P-Value

Since this is a small chance, we no longer believe our null hypothesis of our variables being independent in our population. Conclusion: marital status and education are related
in our population.
Now, keep in mind that our p-value of 0.023 only tells us that the association between our variables is probably not zero. It doesn't say anything about the strength of this association: the effect size.

Effect Size

For the effect size of a chi-square independence test, consult the appropriate association measure. If at least one nominal variable is involved, that'll usually be Cramér’s V (a sort of Pearson correlation for categorical variables). In our example Cramér’s V = 0.162. Since Cramér’s V takes on values between 0 and 1, 0.162 indicates a very weak association. If both variables had been ordinal, Kendall’s tau or a Spearman correlation would have been suitable as well.

Reporting

For reporting our results in APA style, we may write something like “An association between education and marital status was observed, χ2(12) = 23.57, p = 0.023.”

Chi-Square Independence Test - Software

You can run a chi-square independence test in Excel or Google Sheets but you probably want to use a more user friendly package such as

The figure below shows the output for our example generated by SPSS.

Chi-Square Test - SPSS Output

For a full tutorial (using a different example), see SPSS Chi-Square Independence Test.

Thanks for reading!

Cramér’s V – What and Why?

Cramér’s V is a number between 0 and 1 that indicates how strongly two categorical variables are associated. If we'd like to know if 2 categorical variables are associated, our first option is the chi-square independence test. A p-value close to zero means that our variables are very unlikely to be completely unassociated in some population. However, this does not mean the variables are strongly associated; a weak association in a large sample size may also result in p = 0.000.

Cramér’s V - Formula

A measure that does indicate the strength of the association is Cramér’s V, defined as

$$\phi_c = \sqrt{\frac{\chi^2}{N(k - 1)}}$$

where

Cramér’s V - Examples

A scientist wants to know if music preference is related to study major. He asks 200 students, resulting in the contingency table shown below.

Cramers V Crosstab Counts

These raw frequencies are just what we need for all sort of computations but they don't show much of a pattern. The association -if any- between the variables is easier to see if we inspect row percentages instead of raw frequencies. Things become even clearer if we visualize our percentages in stacked bar charts.

Cramér’s V - Independence

In our first example, the variables are perfectly independent: \(\chi^2\) = 0. According to our formula, chi-square = 0 implies that Cramér’s V = 0. This means that music preference “does not say anything” about study major. The associated table and chart make this clear.

Cramers V Crosstab Unassociated Percentages Cramers V Unassociated Variables Chart

Note that the frequency distribution of study major is identical in each music preference group. If we'd like to predict somebody’s study major, knowing his music preference does not help us the least little bit. Our best guess is always law or “other”.

Cramér’s V - Moderate Association

A second sample of 200 students show a different pattern. The row percentages are shown below.

Cramers V Crosstab Medium Association

This table shows quite some association between music preference and study major: the frequency distributions of studies are different for music preference groups. For instance, 60% of all students who prefer pop music study psychology. Those who prefer classical music mostly study law. The chart below visualizes our table.

Cramers V Medium Association Chart

Note that music preference says quite a bit about study major: knowing the former helps a lot in predicting the latter. For these data

It follows that

$$\phi_c = \sqrt{\frac{113}{200(3)}} = 0.43.$$

which is substantial but not super high since Cramér’s V has a maximum value of 1.

Cramér’s V - Perfect Association

In a third -and last- sample of students, music preference and study major are perfectly associated. The table and chart below show the row percentages.

Cramers V Crosstab Perfect Association Cramers V Perfect Association Chart

If we know a student’s music preference, we know his study major with certainty. This implies that our variables are perfectly associated. Do notice, however, that it doesn't work the other way around: we can't tell with certainty someone’s music preference from his study major but this is not necessary for perfect association: \(\chi^2\) = 600 so

$$\phi_c = \sqrt{\frac{600}{200(3)}} = 1,$$

which is the very highest possible value for Cramér’s V.

Alternative Measures

Cramér’s V - SPSS

In SPSS, Cramér’s V is available from Analyze SPSS Menu Arrow Descriptive Statistics SPSS Menu Arrow Crosstabs. Next, fill out the dialog as shown below.

Cramers V from SPSS Crosstabs

Warning: for tables larger than 2 by 2, SPSS returns nonsensical values for phi without throwing any warning or error. These are often > 1, which isn't even possible for Pearson correlations. Oddly, you can't request Cramér’s V without getting these crazy phi values.

Final Notes

Cramér’s V is also known as Cramér’s phi (coefficient)5. It is an extension of the aforementioned phi coefficient for tables larger than 2 by 2, hence its notation as \(\phi_c\). It's been suggested that its been replaced by “V” because old computers couldn't print the letter \(\phi\).3

Thank you for reading.

References

  1. Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.
  2. Field, A. (2013). Discovering Statistics with IBM SPSS Newbury Park, CA: Sage.
  3. Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
  4. Slotboom, A. (1987). Statistiek in woorden [Statistics in words]. Groningen: Wolters-Noordhoff.
  5. Sheskin, D. (2011). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: Chapman & Hall/CRC.

Cochran’s Q Test in SPSS – Quick Tutorial

Cochran Q Test - What Is It?

SPSS Cochran Q test is a procedure for testing if the proportions of 3 or more dichotomous variables are equal in some population. These outcome variables have been measured on the same people or other statistical units.

SPSS Cochran Q Test Example

The principal of some university wants to know whether three exams are equally difficult. Fifteen students took these exams and their results are in exam-results.sav.

1. Quick Data Check

It's always a good idea to take a quick look at what the data look like before proceeding to any statistical tests. We'll open the data and inspect some histograms by running FREQUENCIES with the syntax below. Note the TO keyword in step 3.

*1. Set default directory.

cd 'd:downloaded'. /*or wherever data file is located.

*2. Open data.

get file 'exam-results.sav'.

*3. Quick check.

frequencies test_1 to test_3
/format notable
/histogram.
SPSS Cochran Q Test Histogram

The histograms indicate that the three variables are indeed dichotomous (there could have been some “Unknown” answer category but it doesn't occur). Since N = 15 for all variables, we conclude there's no missing values. Values 0 and 1 represent “Failed” and “Passed”.We suggest you RECODE your values if this is not the case. We therefore readily see that the proportions of students succeeding range from .53 to .87.

2. Assumptions Cochran Q Test

Cochran's Q test requires only one assumption:

3. Running SPSS Cochran Q Test

SPSS Cochran Q Test Dialog SPSS Cochran Q Test Dialog

We'll navigate to Analyze SPSS Menu Arrow Nonparametric Tests SPSS Menu Arrow Legacy Dialogs SPSS Menu Arrow K Related Samples...

We move our test variables under Test Variables,
select Descriptive under Statistics,
select Cochran's Q under Test Type and
click Paste
This results in the syntax below which we then run in order to obtain our results.

*Run Cochran Q test.

NPAR TESTS
/COCHRAN=test_1 test_2 test_3
/STATISTICS DESCRIPTIVES
/MISSING LISTWISE.

4. SPSS Cochran Q Test Output

SPSS Cochran Q Test Output

The first table (Descriptive Statistics) presents the descriptives we'll report. Do not report the results from DESCRIPTIVES instead.The reason is that the significance test is (necessarily) based on cases without missing values on any of the test variables. The descriptives obtained from Cochran's test are therefore limited to such complete cases too.
Since N = 15, the descriptives once again confirm that there are no missing values and
the proportions range from .53 to .87.Again, proportions correspond to means if 0 and 1 are used as values.

SPSS Cochran Q Test Output

The table Test Statistics presents the result of the significance test.
The p-value (“Asymp. Sig.”) is .093; if the three tests really are equally difficult in the population, there's still a 9.3% chance of finding the differences we observed in this sample. Since this chance is larger than 5%, we do not reject the null hypothesis that the tests are equally difficult.

5. Reporting Cochran's Q Test Results

When reporting the results from Cochran's Q test, we first present the aforementioned descriptive statistics. Cochran's Q statistic follows a chi-square distribution so we'll report something like “Cochran's Q test did not indicate any differences among the three proportions, χ2(2) = 4.75, p = .093”.