- Kruskal-Wallis Test Example
- Kruskal-Wallis Test Assumptions
- Kruskal-Wallis Test Formulas
- Kruskal-Wallis Post Hoc Tests
- APA Reporting a Kruskal-Wallis Test

A Kruskal-Wallis test examines whether 3(+) populations have equal mean ranks on some outcome variable. The figure below illustrates the basic idea.

- First off, our scores are ranked ascendingly, regardless of group membership.
- Now, if scores are *not* related to group membership, then the mean ranks should be roughly equal over groups.
- If these mean ranks are very different in our sample, then some groups tend to have higher scores than other groups in our population as well: scores are related to group membership.
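For readers who'd like to see this ranking step in code, here's a minimal Python sketch (not part of the original tutorial; the scores and group labels are made up for illustration) that ranks scores across groups and then averages the ranks per group:

```python
from scipy.stats import rankdata

# Hypothetical scores for three groups (illustrative data, not the tutorial's).
scores = [2.1, 3.4, 1.8, 5.0, 4.2, 2.7, 3.9, 1.5, 4.8]
groups = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

# Rank all scores ascendingly, regardless of group membership.
ranks = rankdata(scores)  # tied scores would receive averaged ranks

# Average the ranks within each group.
mean_ranks = {}
for g in sorted(set(groups)):
    group_ranks = [r for r, grp in zip(ranks, groups) if grp == g]
    mean_ranks[g] = sum(group_ranks) / len(group_ranks)

print(mean_ranks)
```

Roughly equal mean ranks would suggest no relation between scores and group membership; clearly unequal mean ranks suggest the opposite.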

## Kruskal-Wallis Test - Purposes

The Kruskal-Wallis test is a distribution-free alternative to ANOVA: we basically want to know if 3+ populations have equal means on some variable. However,

- ANOVA is not suitable if the dependent variable is ordinal;
- ANOVA requires the dependent variable to be normally distributed in each subpopulation, especially if sample sizes are small.

The Kruskal-Wallis test is a suitable alternative for ANOVA if sample sizes are small and/or the dependent variable is ordinal.

## Kruskal-Wallis Test Example

A hospital runs a quick pilot on 3 vaccines: they administer each to N = 5 participants. After a week, they measure the amount of antibodies in the participants’ blood. The data thus obtained are in this Googlesheet, partly shown below.

Now, we'd like to know if some vaccines trigger more antibodies than others in the underlying populations. Since antibodies is a quantitative variable, ANOVA seems the right choice here.

However, ANOVA requires antibodies to be normally distributed in each subpopulation. Due to our minimal sample sizes, we can't rely on the central limit theorem like we usually do (or *should* anyway). On top of that, our sample sizes are too small to examine normality. Just to emphasize this point, the histograms for antibodies by group are shown below.

If anything, the bottom two histograms seem slightly positively skewed. This makes sense because the amount of antibodies has a lower bound of zero but no upper bound. However, speculations regarding the population distributions don't get any more serious than that.

A particularly **bad idea** here is trying to demonstrate normality by running

- a Shapiro-Wilk normality test and/or
- a Kolmogorov-Smirnov test.

Due to our tiny sample sizes, these tests are unlikely to reject the null hypothesis of normality. However, that's merely due to their lack of power and doesn't say anything about the population distributions. Put differently: a different null hypothesis (our variable following a uniform or Poisson distribution) would probably not be rejected either for the exact same data.

In short: ANOVA really requires normality for tiny sample sizes but we don't know if it holds. So we can't trust ANOVA results. And that's why we should use a Kruskal-Wallis test instead.
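The point that normality tests lack power at tiny sample sizes is easy to demonstrate with a small simulation. The sketch below (our own illustration, not from the tutorial) repeatedly draws n = 5 observations from a clearly non-normal (uniform) population and counts how often Shapiro-Wilk rejects normality:

```python
import numpy as np
from scipy.stats import shapiro

# How often does Shapiro-Wilk (alpha = 0.05) reject normality
# for decidedly non-normal (uniform) data with only n = 5 observations?
rng = np.random.default_rng(seed=1)
n_sims = 2000
rejections = 0
for _ in range(n_sims):
    sample = rng.uniform(0, 1, size=5)  # uniform population: not normal
    stat, p = shapiro(sample)
    if p < 0.05:
        rejections += 1

# The rejection rate stays low: the test has very little power at n = 5.
print(f"Rejection rate: {rejections / n_sims:.3f}")
```

So a non-significant Shapiro-Wilk test on 5 observations tells us next to nothing about the population distribution.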

## Kruskal-Wallis Test - Null Hypothesis

The null hypothesis for a Kruskal-Wallis test is that the mean ranks on some outcome variable are equal across 3+ populations. Note that the outcome variable must be ordinal or quantitative in order for “mean ranks” to be meaningful.

Many textbooks propose an **incorrect** null hypothesis such as:

- some outcome variable has equal medians over 3+ populations or
- some outcome variable follows identical distributions over 3+ populations.

So why are these incorrect? Well, the Kruskal-Wallis formula uses only 2 statistics: rank sums and the sample sizes on which they're based. It completely ignores everything else about the data, including medians and frequency distributions. Neither of these affects whether or not the null hypothesis is rejected.

If that still doesn't convince you, we'll perhaps add some example data files to this tutorial. These illustrate that wildly different medians or frequency distributions don't always result in a “significant” Kruskal-Wallis test (or reversely).

## Kruskal-Wallis Test Assumptions

A Kruskal-Wallis test requires 3 assumptions^{1,5,8}:

- independent observations;
- the dependent variable must be quantitative or ordinal;
- sufficient sample sizes (say, each n_{i} ≥ 5) *unless* the exact significance level is computed.

Regarding the last assumption, exact p-values for the Kruskal-Wallis test can be computed. However, this is rarely done because it often requires very heavy computations. Some exact p-values are also found in *Use of Ranks in One-Criterion Variance Analysis*.

Instead, most software computes approximate (or “asymptotic”) p-values based on the chi-square distribution. This approximation is sufficiently accurate if the sample sizes are large enough. There's no real consensus with regard to required sample sizes: some authors^{1} propose each n_{i} ≥ 4 while others^{6} suggest each n_{i} ≥ 6.

## Kruskal-Wallis Test Formulas

First off, we rank the values on our dependent variable ascendingly, regardless of group membership. We did just that in this Googlesheet, partly shown below.

Next, we compute the sum over all ranks for each group separately.

We then enter a) our sample sizes and b) our rank sums into the following formula:

$$Kruskal\;Wallis\;H = \frac{12}{N(N + 1)}\sum\limits_{i = 1}^k\frac{R_i^2}{n_i} - 3(N + 1)$$

where

- \(N\) denotes the total sample size;
- \(k\) denotes the number of groups we're comparing;
- \(R_i\) denotes the rank sum for group \(i\);
- \(n_i\) denotes the sample size for group \(i\).

For our example, that'll be

$$Kruskal\;Wallis\;H = \frac{12}{15(15 + 1)}(\frac{55^2}{5}+\frac{20^2}{5}+\frac{45^2}{5}) - 3(15 + 1) =$$

$$Kruskal\;Wallis\;H = 0.05\cdot(605 + 80 + 405) - 48 = 6.50$$

H approximately follows a chi-square (written as χ^{2}) distribution with

$$df = k - 1$$

degrees of freedom (\(df\)) for \(k\) groups. For our example,

$$df = 3 - 1 = 2$$

so our significance level is

$$\chi^2(2) = 6.50, p \approx 0.039.$$
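These hand calculations are easy to verify in code. The Python sketch below (our own addition; the tutorial itself uses SPSS) plugs the rank sums and sample sizes from the example into the H formula and looks up the approximate p-value from the chi-square distribution:

```python
from scipy.stats import chi2

# Rank sums and group sizes from the vaccine example.
R = [55, 20, 45]  # rank sums for vaccines A, B and C
n = [5, 5, 5]     # sample size per group
N = sum(n)        # total sample size: 15

# Kruskal-Wallis H statistic.
H = 12 / (N * (N + 1)) * sum(r**2 / ni for r, ni in zip(R, n)) - 3 * (N + 1)

# Approximate p-value from the chi-square distribution with k - 1 df.
df = len(R) - 1
p = chi2.sf(H, df)

print(f"H = {H:.2f}, df = {df}, p = {p:.3f}")  # H = 6.50, df = 2, p = 0.039
```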

The SPSS output for our example, shown below, confirms our calculations.

So what do we conclude now? Well, assuming alpha = 0.05, we reject our null hypothesis: the population mean ranks of antibodies are not equal among vaccines. In normal language, our 3 vaccines do *not* perform equally well. Judging from the mean ranks, vaccine B seems to perform worse than its competitors: its mean rank is lower, which means that it triggered fewer antibodies than the other vaccines.

## Kruskal-Wallis Post Hoc Tests

Thus far, we concluded that the amounts of antibodies differ among our 3 vaccines. So precisely which vaccine differs from which vaccine? To find out, we'll compare each vaccine to each *other* vaccine. This procedure is generally known as running post hoc tests.

Contrary to popular belief, Kruskal-Wallis post hoc tests are **not** equivalent to Bonferroni corrected Mann-Whitney tests. Instead, each possible pair of groups is compared using the following formula:

$$Z_{kw} = \frac{\overline{R}_i - \overline{R}_j}{\sqrt{\frac{N(N + 1)}{12}(\frac{1}{n_i}+\frac{1}{n_j})}}$$

where

- our test statistic, \(Z_{kw}\), approximately follows a standard normal distribution;
- \(\overline R_i\) denotes the mean rank for group \(i\);
- \(N\) denotes the total sample size (including groups not used in this pairwise comparison);
- \(n_i\) denotes the sample size for group \(i\).

For comparing vaccines A and B, that'll be

$$Z_{kw} = \frac{11 - 4}{\sqrt{\frac{15(15 + 1)}{12}(\frac{1}{5}+\frac{1}{5})}} \approx 2.475 $$

$$P(|Z_{kw}| > 2.475) \approx 0.013$$

A Bonferroni correction is usually applied to this p-value because we're running multiple comparisons on (partly) the same observations. The number of pairwise comparisons for \(k\) groups is

$$N_{comp} = \frac{k (k - 1)}{2}$$

Therefore, the Bonferroni corrected p-value for our example is

$$P_{Bonf} = 0.013 \cdot \frac{3 (3 - 1)}{2} \approx 0.040$$
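The pairwise comparison of vaccines A and B can be reproduced with a short Python sketch (again our own addition, using the mean ranks from the example):

```python
from math import sqrt
from scipy.stats import norm

# Mean ranks and group sizes for vaccines A and B (from the example).
mean_rank_A, mean_rank_B = 11, 4
n_i = n_j = 5
N = 15  # total sample size, including the group not in this comparison
k = 3   # number of groups

# Pairwise z statistic for the Kruskal-Wallis post hoc comparison.
z = (mean_rank_A - mean_rank_B) / sqrt(N * (N + 1) / 12 * (1 / n_i + 1 / n_j))

# Two-sided p-value, Bonferroni corrected for k(k - 1)/2 = 3 comparisons.
p = 2 * norm.sf(abs(z))
p_bonf = min(p * k * (k - 1) / 2, 1.0)

print(f"z = {z:.3f}, p = {p:.3f}, Bonferroni corrected p = {p_bonf:.3f}")
```

Note that the correction simply multiplies the uncorrected p-value by the number of comparisons (capping the result at 1).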

The screenshot from SPSS (below) confirms these findings.

Oddly, the difference between mean ranks, \(\overline{R}_i - \overline{R}_j\), is denoted as “Test Statistic”. The actual test statistic, \(Z_{kw}\), is denoted as “Std. Test Statistic”.

## APA Reporting a Kruskal-Wallis Test

For APA reporting our example analysis, we could write something like “a Kruskal-Wallis test indicated that the amount of antibodies differed over vaccines, H(2) = 6.50, p = 0.039.”

Although the APA doesn't mention it, we encourage reporting the mean ranks and perhaps some other descriptive statistics in a separate table as well.

Right, so that'll do. We'll soon update our tutorial on How to Run a Kruskal-Wallis Test in SPSS? as well. For now,

**thanks for reading!**

## References

1. Van den Brink, W.P. & Koele, P. (2002). *Statistiek, deel 3* [Statistics, part 3]. Amsterdam: Boom.
2. Warner, R.M. (2013). *Applied Statistics* (2nd ed.). Thousand Oaks, CA: SAGE.
3. Agresti, A. & Franklin, C. (2014). *Statistics. The Art & Science of Learning from Data*. Essex: Pearson Education Limited.
4. Field, A. (2013). *Discovering Statistics with IBM SPSS Statistics*. Newbury Park, CA: Sage.
5. Howell, D.C. (2002). *Statistical Methods for Psychology* (5th ed.). Pacific Grove, CA: Duxbury.
6. Siegel, S. & Castellan, N.J. (1989). *Nonparametric Statistics for the Behavioral Sciences* (2nd ed.). Singapore: McGraw-Hill.
7. Slotboom, A. (1987). *Statistiek in woorden* [Statistics in words]. Groningen: Wolters-Noordhoff.
8. Kruskal, W.H. & Wallis, W.A. (1952). Use of ranks in one-criterion variance analysis. *Journal of the American Statistical Association, 47*, 583-621.

## Comments

## By Jon K Peck on April 28th, 2021

Each of the multiple testing adjustment methods has its own set of assumptions about the relationship between the tests. This link discusses the positive dependency assumption of BH and argues that it is likely to be satisfied in the multiple testing case.

https://stats.stackexchange.com/questions/111756/the-meaning-of-positive-dependency-as-a-condition-to-use-the-usual-method-for

## By YY Ma on April 29th, 2021

Thanks for sharing. Very helpful.

The post emphasizes the normality assumption of ANOVA.

But why does it not mention the equal variance assumption of ANOVA?

It looks like the K-W test is also suitable for groups with unequal variances.

## By Ruben Geert van den Berg on April 29th, 2021

Good question!

Short answer: violation of the homogeneity of variances assumption is not serious enough to justify abandoning ANOVA altogether.

First off, we only need homogeneity for (sharply) unequal sample sizes. If sample sizes are roughly equal, then ANOVA is robust to a violation of homogeneity.

Now if sample sizes are sharply unequal and homogeneity is violated, then a "normal" ANOVA is risky. However, there's an alternative ANOVA (the Welch test) that deals with exactly this problem. We cover this in SPSS ANOVA - Levene’s Test “Significant”. But, indeed, a Kruskal-Wallis test is also a suitable alternative for the latter scenario.

One reason for *not* using it is that it does not have a clearly defined effect size. This, in turn, complicates power and sample size estimations.

Another reason for *not* using Kruskal-Wallis is that it can only handle a single factor, whereas ANOVA easily handles all types of factorial designs.

Hope that helps!

Ruben

SPSS tutorials

## By YY Ma on April 30th, 2021

Great thanks.

A very detailed and instructive answer.

It clears all my confusion about K-W test.

## By YY Ma on May 20th, 2021

I find it is popular to check the normality of residuals. For example, JASP uses this method.

In my view, an advantage of this approach is that we can pool all the residuals to identify the distribution. When sample sizes are too small, it looks helpful.

Is my personal opinion right?