A Spearman rank correlation is a number between -1 and +1 that indicates to what extent 2 variables are monotonically related.

- Spearman Correlation - Example
- Spearman Rank Correlation - Basic Properties
- Spearman Rank Correlation - Assumptions
- Spearman Correlation - Example II
- Spearman Correlation - Formulas and Calculation
- Spearman Rank Correlation - Software

## Spearman Correlation - Example

A sample of 1,000 companies was asked about their number of employees and their revenue over 2018. To make these questions easier to answer, they were offered answer categories. After completing the data collection, the contingency table below shows the results.

The question we'd like to answer is: is company size related to revenue? A good look at our contingency table shows the obvious: companies with more employees typically make more revenue. But note that this relation is not perfect: there are 60 companies with 1 employee making $50,000 - $99,999, while there are 89 companies with 2-5 employees making $0 - $49,999. This relation becomes clearer if we visualize our results in the chart below.

The chart shows an indisputable positive **monotonic relation** between size and revenue: *larger* companies tend to make *more* revenue than smaller companies. Next question:
how strong is the relation?
The first option that comes to mind is computing the Pearson correlation between company size and revenue. However, that's not going to work because we don't have company size or revenue in our data. We only have size and revenue *categories*. Company size and revenue are ordinal variables in our data: we know that 2-5 employees is larger than 1 employee but we don't know *how much* larger.

So which numbers can we use to calculate how strongly ordinal variables are related? Well, we can assign **ranks** to our categories as shown below.

As a last step, we simply compute the Pearson correlation between the size and revenue ranks. This results in a
Spearman rank correlation (Rs) = 0.81.
This tells us that our variables are strongly *monotonically* related. But in contrast to a normal Pearson correlation, we do not know to what extent the relation is linear.
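This ranking-then-correlating procedure can be sketched in a few lines of Python. This is a minimal illustration (the function names and data are mine, not from the tutorial); it uses mean ranks for ties, as most software does:

```python
def ranks(values):
    """Rank values 1..n, assigning mean ranks to ties."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        # mean of the 1-based positions this value occupies in sorted order
        positions = [i + 1 for i, s in enumerate(ordered) if s == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]

def pearson(x, y):
    """Plain Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's Rs: the Pearson correlation over the (mean) ranks."""
    return pearson(ranks(x), ranks(y))
```

For instance, `ranks([10, 20, 20, 30])` yields `[1.0, 2.5, 2.5, 4.0]`: the two tied values share the mean of ranks 2 and 3.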

## Spearman Rank Correlation - Basic Properties

As we just saw, a Spearman correlation is simply a Pearson correlation computed on ranks instead of data values or categories. This results in the following basic properties:

- Spearman correlations always lie between -1 and +1;
- Spearman correlations are suitable for all but nominal variables. However, when both variables are either metric or dichotomous, Pearson correlations are usually the better choice;
- Spearman correlations indicate monotonic (rather than linear) relations;
- Spearman correlations are hardly affected by outliers. However, outliers should be excluded from analyses rather than used to decide whether Spearman or Pearson correlations are preferable;
- Spearman correlations serve the exact same purposes as Kendall’s tau.
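The outlier claim is easy to verify numerically. The sketch below (made-up data, pure Python, no ties) shows one extreme value dragging the Pearson correlation far below 1, while the Spearman correlation stays at a perfect 1 because the data remain perfectly monotonic:

```python
def rank(values):
    """Rank values 1..n (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def pearson(x, y):
    """Plain Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = list(range(1, 11))            # 1, 2, ..., 10
y = list(range(1, 10)) + [1000]   # 1, 2, ..., 9 plus one extreme outlier

r_pearson = pearson(x, y)               # pulled far below 1 by the outlier
r_spearman = pearson(rank(x), rank(y))  # still exactly 1: ranks ignore the outlier's size
```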

## Spearman Rank Correlation - Assumptions

- The Spearman correlation itself only assumes that both variables are at least **ordinal variables**. This excludes only nominal variables.
- The statistical significance test for a Spearman correlation assumes **independent observations** or (more precisely) independent and identically distributed variables.

## Spearman Correlation - Example II

A company needs to determine the expiration date for milk. They therefore take a tiny drop each hour and analyze the number of bacteria it contains. The results are shown below.

For bacteria versus time,

- the Pearson correlation is 0.58 but
- the Spearman correlation is 1.00.

There is a **perfect monotonic relation** between time and bacteria: with each hour that passes, the number of bacteria grows. However, the relation is strongly nonlinear, as reflected by the much lower Pearson correlation.

This example nicely illustrates the difference between these correlations. However, I'd argue against reporting a Spearman correlation here. Instead, model this curvilinear relation with a (probably exponential) function. This will likely predict the number of bacteria with pinpoint precision.
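As a sketch of that suggestion: with made-up counts that roughly double every hour (the tutorial's actual numbers are not reproduced here), an exponential model can be fit by ordinary least squares on the log counts:

```python
import math

hours = [0, 1, 2, 3, 4, 5]
bacteria = [10, 21, 44, 89, 180, 362]  # made-up counts, roughly doubling hourly

# Fit ln(bacteria) = a + b * hours by ordinary least squares,
# which corresponds to the exponential model bacteria = exp(a) * exp(b) ** hours.
n = len(hours)
mx = sum(hours) / n
my = sum(math.log(c) for c in bacteria) / n
b = sum((h - mx) * (math.log(c) - my) for h, c in zip(hours, bacteria)) \
    / sum((h - mx) ** 2 for h in hours)
a = my - b * mx

growth_per_hour = math.exp(b)  # estimated multiplication factor per hour
```

Here `growth_per_hour` comes out near 2, recovering the doubling that was built into the data.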

## Spearman Correlation - Formulas and Calculation

First off, an example calculation, exact significance levels and critical values are given in this Googlesheet (shown below).

Right. Now, computing Spearman’s rank correlation always starts off with replacing scores by their ranks (use mean ranks for ties). Spearman’s correlation is now computed as the Pearson correlation over the (mean) ranks.

Alternatively, compute Spearman correlations with
$$R_s = 1 - \frac{6\cdot \Sigma \;D^2}{n^3 - n}$$

where \(D\) denotes the difference between the 2 ranks for each observation.
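Assuming no ties, this shortcut formula takes only a few lines to implement. A minimal sketch (the helper names are mine) that reproduces the rank-based Pearson computation exactly:

```python
def spearman_no_ties(x, y):
    """Spearman's Rs via the shortcut Rs = 1 - 6 * sum(D^2) / (n^3 - n).
    Only valid when neither variable contains ties."""
    n = len(x)

    def rank(values):  # ranks 1..n; assumes all values are distinct
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r

    # D is the difference between the 2 ranks for each observation
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n ** 3 - n)
```

For example, `spearman_no_ties([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])` gives ΣD² = 4 and hence Rs = 1 - 24/120 = 0.8.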

For **reasonable sample sizes** of N ≥ 30, statistical significance can be tested (approximately) with the t distribution. In this case, the test statistic
$$T = \frac{R_s \cdot \sqrt{N - 2}}{\sqrt{1 - R^2_s}}$$

follows a t-distribution with
$$Df = N - 2$$

degrees of freedom.

This approximation is inaccurate for **smaller sample sizes** of N < 30. In this case, look up the (exact) significance level from the table given in this Googlesheet. These exact p-values are based on a permutation test that we may discuss some other time. Or not.
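The t approximation above amounts to one line of arithmetic. A minimal sketch (the function name is mine):

```python
import math

def spearman_t(rs, n):
    """t statistic for testing a Spearman correlation against zero.
    Approximate only; recommended for N >= 30. The two-sided p-value
    follows from a t distribution with N - 2 degrees of freedom."""
    t = rs * math.sqrt(n - 2) / math.sqrt(1 - rs ** 2)
    return t, n - 2
```

Plugging in the SPSS example further below (Rs = 0.77, N = 6) reproduces t ≈ 2.41 with 4 degrees of freedom, although N = 6 is far too small for this approximation to be trusted.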

## Spearman Rank Correlation - Software

Spearman correlations can be computed in Googlesheets or Excel but statistical software is a much easier option. JASP (which is freely downloadable) comes up with the correct Spearman correlation and its significance level as shown below.

SPSS also comes up with the correct correlation. However, its significance level is based on the t-distribution:
$$t = \frac{0.77\cdot\sqrt{4}}{\sqrt{(1 - 0.77^2)}} = 2.42$$

and
$$t(4) = 2.42,\;p = 0.072 $$

Again, this approximation is only accurate for larger sample sizes of N ≥ 30. For N = 6, it is wildly off as shown below.

Thanks for reading.

## THIS TUTORIAL HAS 15 COMMENTS:

## By Jon K Peck on September 13th, 2022

I have no control over what happens inside an R module used by an extension command, but I try to anticipate problematic input and issue a clearer error message than the typical R module provides. In the case of HETCOR, which uses the R hetcor module, that procedure sometimes fails in a way that can't be anticipated when it can't find a latent normality construct. The NANs usually come from convergence problems in its algorithm rather than being related to missing data.

hetcor computes a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables. For nominal variables, this would only be applicable when they are dichotomies.

If you can send me data and syntax, I'll see if I can find a reason for the failure in that case.

For polyserial correlation, the procedure tries to find a latent underlying bivariate normal distribution involving the ordinal variables.

Here's a quote from Wikipedia for polychoric.

This technique is frequently applied when analysing items on self-report instruments such as personality tests and surveys that often use rating scales with a small number of response options (e.g., strongly disagree to strongly agree). The smaller the number of response categories, the more a correlation between latent continuous variables will tend to be attenuated. Lee, Poon & Bentler (1995) have recommended a two-step approach to factor analysis for assessing the factor structure of tests involving ordinally measured items. Kiwanuka and colleagues (2022) have also illustrated the application of polychoric correlations and polychoric confirmatory factor analysis in nursing science. This aims to reduce the effect of statistical artifacts, such as the number of response scales or skewness of variables leading to items grouping together in factors. In some disciplines, the statistical technique is rarely applied; however, some scholars [1] have demonstrated how it can be used as an alternative to the Pearson correlation.

## By YY on September 14th, 2022

“Spearman correlation is simply a Pearson correlation computed on ranks”.

Is the normality assumption required theoretically for the ranks when we test the Spearman’s correlation coefficient using the t test?

## By Ruben Geert van den Berg on September 14th, 2022

First off, in contrast to popular belief, Pearson correlations don't require normality.

Only the p-values and confidence intervals for Pearson correlations require normality. For reasonable sample sizes (say N > 25 or so) this is not an issue due to the central limit theorem.

So what about Spearman correlations? Assuming no ties, each rank 1, 2, ..., N occurs just once. This implies that ranks always follow a uniform distribution and knowing the sample size, we always know their exact distribution (and it's obviously not normal).

This also implies that Spearman correlations can only take on a limited number of values (the number of possible combinations of 2 rank variables), some of which are more likely than others under the null hypothesis. The exact significance level (rarely used for reasonable sample sizes) is based on precisely this.

Precisely this is the reason why distribution-free tests (usually misleadingly called "nonparametric tests") don't require distributional assumptions such as normality or homogeneity: the sampling distributions of ranks are always known, regardless of how the original values are distributed.

If ties are present, this reasoning only holds approximately, so we still apply it with some corrections.
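To illustrate (my own toy code, not part of the original comment): for small N without ties, the exact two-sided p-value can be found by brute force over all N! rank permutations:

```python
import itertools

def exact_spearman_p(x, y):
    """Exact two-sided permutation p-value for Spearman's Rs (no ties).
    Enumerates all n! orderings, so only feasible for small samples."""
    n = len(x)

    def rank(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r

    def rs(rx, ry):
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * d2 / (n ** 3 - n)

    rx, ry = rank(x), rank(y)
    observed = rs(rx, ry)
    # p-value: share of rank orderings at least as extreme as the observed Rs
    perms = list(itertools.permutations(ry))
    extreme = sum(abs(rs(rx, p)) >= abs(observed) - 1e-12 for p in perms)
    return observed, extreme / len(perms)
```

For five perfectly monotonic pairs, only the identity ordering and its full reversal are as extreme as the observed Rs = 1, so p = 2/120 ≈ 0.017.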

I hope this answers your question.

Keep up the good work!

Ruben

SPSS tutorials

## By Ruben Geert van den Berg on September 14th, 2022

First off, I think the point regarding attenuation appears very valid and relevant.

Jöreskog and Sörbom have articulated this point very strongly. But (sadly) all of my students from recent years simply used the (probably attenuated) Pearson correlations for their factor analyses on ordinal variables.

"For nominal variables, this would only be applicable when they are dichotomies."

Funny, I've been arguing for years that dichotomous variables should be defined as a separate measurement level (dichotomous, nominal,...) because they usually involve different analyses than all other variables.

But in any case, HETCOR reports polychoric correlations for non-binary nominal variables. As these are nonsensical, I think that's a bad thing.

Perhaps better to ask the user to set them to ordinal/scale or actually check if they're dichotomous before proceeding?

## By Jon K Peck on September 14th, 2022

As the hetcor doc says,

hetcor computes a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables.

We don't issue warnings when ordinary correlations are computed on nominal variables with 3+ values, or when they are used as scale variables in regression. We have to leave some things up to the users. There aren't enough guard rails available in the world.

The detectCores function is imported from the parallel package a