A Pearson correlation is a number between -1 and +1 that indicates
to which extent 2 variables are linearly related. The Pearson correlation is also known as the “product moment correlation coefficient” (PMCC) or simply “correlation”.

Pearson correlations are only suitable for quantitative variables (including dichotomous variables).

For ordinal variables, use the Spearman correlation or Kendall’s tau and
for nominal variables, use Cramér’s V.

Correlation Coefficient - Example

We asked 40 freelancers for their yearly incomes over 2010 through 2014. Part of the raw data are shown below.

Today’s question is: is there any relation between income over 2010
and income over 2011? Well, a splendid way for finding out is inspecting a scatterplot for these two variables: we'll represent each freelancer by a dot. The horizontal and vertical positions of each dot indicate a freelancer’s income over 2010 and 2011. The result is shown below.

Pearson Correlation Coefficient - Scatterplot Incomes

Our scatterplot shows a strong relation between income over 2010 and 2011: freelancers who had a low income over 2010 (leftmost dots) typically had a low income over 2011 as well (lower dots) and vice versa. Furthermore, this relation is roughly linear; the main pattern in the dots is a straight line.
The extent to which our dots lie on a straight line indicates the strength of the relation. The Pearson correlation is a number that indicates the exact strength of this relation.

Correlation Coefficients and Scatterplots

A correlation coefficient indicates the extent to which dots in a scatterplot lie on a straight line. This implies that we can usually estimate correlations pretty accurately from nothing more than scatterplots. The figure below nicely illustrates this point.

Pearson Correlation Coefficient - Multiple Scatterplots

Correlation Coefficient - Basics

Some basic points regarding correlation coefficients are nicely illustrated by the previous figure. The least you should know is that

Correlations are never lower than -1. A correlation of -1 indicates that the data points in a scatter plot lie exactly on a straight descending line; the two variables are perfectly negatively linearly related.
A correlation of 0 means that two variables don't have any linear relation whatsoever. However, some non linear relation may exist between the two variables.
Correlation coefficients are never higher than 1. A correlation coefficient of 1 means that two variables are perfectly positively linearly related; the dots in a scatter plot lie exactly on a straight ascending line.

Correlation Coefficient - Perfect Linear Relations

Correlation Coefficient - Interpretation Caveats

When interpreting correlations, you should keep some things in mind. An elaborate discussion deserves a separate tutorial but we'll briefly mention two main points.

Correlations may or may not indicate causal relations. Reversely, causal relations from some variable to another variable may or may not result in a correlation between the two variables.
Correlations are very sensitive to outliers; a single unusual observation may have a huge impact on a correlation. Such outliers are easily detected by a quick inspection a scatterplot.

Correlation Coefficient - Software

Most spreadsheet editors such as Excel, Google sheets and OpenOffice can compute correlations for you. The illustration below shows an example in Googlesheets.

Correlation Coefficient - Correlation Matrix

Keep in mind that correlations apply to pairs of variables. If you're interested in more than 2 variables, you'll probably want to take a look at the correlations between all different variable pairs. These correlations are usually shown in a square table known as a correlation matrix. Statistical software packages such as SPSS create correlations matrices before you can blink your eyes. An example is shown below.

Correlation Coefficient - SPSS Correlation Matrix

Note that the diagonal elements (in red) are the correlations between each variable and itself. This is why they are always 1.
Also note that the correlations beneath the diagonal (in grey) are redundant because they're identical to the correlations above the diagonal. Technically, we say that this is a symmetrical matrix.
Finally, note that the pattern of correlations makes perfect sense: correlations between yearly incomes become lower insofar as these years lie further apart.

Pearson Correlation - Formula

If we want to inspect correlations, we'll have a computer calculate them for us. You'll rarely (probably never) need the actual formula. However, for the sake of completeness, a Pearson correlation between variables X and Y is calculated by
$$r_{XY} = \frac{\sum_{i=1}^n(X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^n(X_i - \overline{X})^2}\sqrt{\sum_{i=1}^n(Y_i - \overline{Y})^2}}$$
The formula basically comes down to dividing the covariance by the product of the standard deviations. Since a coefficient is a number divided by some other number our formula shows why we speak of a correlation coefficient.

Correlation - Statistical Significance

The data we've available are often -but not always- a small sample from a much larger population. If so, we may find a non zero correlation in our sample
even if it's zero in the population. The figure below illustrates how this could happen.

Scatterplot Showing Sample Correlation if Population Correlation is Zero

If we ignore the colors for a second, all 1,000 dots in this scatterplot visualize some population. The population correlation -denoted by ρ- is zero between test 1 and test 2.
Now, we could draw a sample of N = 20 from this population for which the correlation r = 0.95. Reversely, this means that a sample correlation of 0.95 doesn't prove with certainty that there's a non zero correlation in the entire population. However, finding r = 0.95 with N = 20 is extremely unlikely if ρ = 0. But precisely how unlikely? And how do we know?

Correlation - Test Statistic

If ρ -a population correlation- is zero, then the probability for a given sample correlation -its statistical significance- depends on the sample size. We therefore combine the sample size and r into a single number, our test statistic t: $$T = R\sqrt{\frac{(n - 2)}{(1 - R^2)}}$$

Now, T itself is not interesting. However, we need it for finding the significance level for some correlation. T follows a t distribution with ν = n - 2 degrees of freedom but only if some assumptions are met.

Correlation Test - Assumptions

The statistical significance test for a Pearson correlation requires 3 assumptions:

independent observations;
the population correlation, ρ = 0;
normality: the 2 variables involved are bivariately normally distributed in the population. However, this is not needed for a reasonable sample size -say, N ≥ 20 or so.The reason for this lies in the central limit theorem.

Pearson Correlation - Sampling Distribution

In our example, the sample size N was 20. So if we meet our assumptions, T follows a t-distribution with df = 18 as shown below.

Pearson Correlation - T-Distribution with DF = 18

This distribution tells us that there's a 95% probability that -2.1 < t < 2.1, corresponding to -0.44 < r < 0.44. Conclusion: if N = 20, there's a 95% probability of finding -0.44 < r < 0.44. There's only a 5% probability of finding a correlation outside this range. That is, such correlations are statistically significant at α = 0.05 or lower: they are (highly) unlikely and thus refute the null hypothesis of a zero population correlation.
Last, our sample correlation of 0.95 has a p-value of 1.55e^-10 -one to 6,467,334,654. We can safely conclude there's a non zero correlation in our entire population.

Thanks for reading!

THIS TUTORIAL HAS 32 COMMENTS:

By Sandor on September 2nd, 2019

Beside of SPSS this correlation coefficient calculator can help you in quick r, p-value, and visualization.
https://www.answerminer.com/calculators/correlation-test
By Anna K on September 24th, 2019

Hi! Thank you for excellent tutorial. I have 1 question: why if n=20, df=18 there is 95% probability that -2.1 < t < 2.1??? (I look into t-distribution tables and do not understand why |2.1| - maybe |1.7| instead?) Thanx!
By Ruben Geert van den Berg on September 24th, 2019

Hi Anna!

First off, express the probability as a proportion instead of a percentage. So that'll be 0.95, right?

The remaining 0.05 is divided over the left and right tails so each tail holds a proportion of 0.025.

For probability distributions, try Googlesheet. The function you need is either t.dist or t.inv.

These always give left tail probabilities.

So for the left tail, try =t.inv(0.025,18). This gives -2.1.

For the right tail, try (1 - 0.025 =) =t.inv(0.975,18). That'll be 2.1.

Hope that helps!

SPSS tutorials
By Garang Ngong on December 1st, 2019

Very helpful
By Robert on January 14th, 2020

Excellent overview! Concise but lucid.

1 … 5 … 7

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Pearson Correlations – Quick Introduction

Correlation Coefficient - Example

Correlation Coefficients and Scatterplots

Correlation Coefficient - Basics

Correlation Coefficient - Interpretation Caveats

Correlation Coefficient - Software

Correlation Coefficient - Correlation Matrix

Pearson Correlation - Formula

Correlation - Statistical Significance

Correlation - Test Statistic

Correlation Test - Assumptions

Pearson Correlation - Sampling Distribution

Tell us what you think!

THIS TUTORIAL HAS 32 COMMENTS:

By Sandor on September 2nd, 2019

By Anna K on September 24th, 2019

By Ruben Geert van den Berg on September 24th, 2019

By Garang Ngong on December 1st, 2019

By Robert on January 14th, 2020