
SPSS Correlation Analysis Tutorial


Correlation Test - What Is It?

A (Pearson) correlation is a number between -1 and +1 that indicates to what extent 2 quantitative variables are linearly related. It's best understood by looking at some scatterplots.

SPSS Correlation Analysis Nice Scatterplot

In short, a correlation of -1 indicates a perfect negative linear relation, a correlation of 0 indicates no linear relation at all, and a correlation of +1 indicates a perfect positive linear relation. The further a correlation lies from zero, the stronger the linear relation.

Null Hypothesis

A correlation test (usually) tests the null hypothesis that the population correlation is zero. Data often contain just a sample from a (much) larger population: I surveyed 100 customers (sample) but I'm really interested in all my 100,000 customers (population). Sample outcomes typically differ somewhat from population outcomes. So finding a non-zero correlation in my sample does not prove that 2 variables are correlated in my entire population; if the population correlation is really zero, I may easily find a small correlation in my sample. However, finding a strong correlation in this case is very unlikely and suggests that my population correlation wasn't zero after all.

Correlation Test - Assumptions

Computing and interpreting correlation coefficients themselves does not require any assumptions. However, the statistical significance test for a correlation assumes that the observations are independent and that both variables follow a bivariate normal distribution in the population. The normality assumption mostly matters for small samples.

SPSS - Quick Data Check

Let's run some correlation tests in SPSS now. We'll use adolescents.sav, a data file which holds psychological test data on 128 children between 12 and 14 years old. Part of its variable view is shown below.

SPSS Adolescents Data Variable View

Now, before running any correlations, let's make sure our data are plausible in the first place. Since all 5 variables are metric, we'll quickly inspect their histograms by running the syntax below.

*Quick data check: histograms over all relevant variables.

frequencies iq to wellb
/format notable
/histogram.

Histogram Output

Our histograms tell us a lot: our variables have between 5 and 10 missing values. Their means are close to 100 with standard deviations around 15 -which is good because that's how these tests have been calibrated. One thing bothers me, though, and it's shown below.

SPSS Correlation Analysis Histogram With Outlier

It seems like somebody scored zero on some tests -which is not plausible at all. If we ignore this, our correlations will be severely biased. Let's sort our cases, see what's going on and set some missing values before proceeding.

*Inspect case with low iq / anxi scores.

sort cases by iq.

*One case has zero on both tests. Set as missing value before proceeding.

missing values iq anxi (0).
SPSS Correlation Outlier In Data View

If we now rerun our histograms, we'll see that all distributions look plausible. Only now should we proceed to running the actual correlations.
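For completeness, this recheck is just the same FREQUENCIES command we ran earlier.

*Recheck histograms after setting the missing values.

frequencies iq to wellb
/format notable
/histogram.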

Running a Correlation Test in SPSS

Let's first navigate to Analyze → Correlate → Bivariate as shown below.

SPSS Analyze Correlate Bivariate

Move all relevant variables into the variables box. You probably don't want to change anything else here.

SPSS Correlations Dialog

Clicking Paste results in the syntax below. Let's run it.

SPSS CORRELATIONS Syntax

*Correlations as pasted from menu.

CORRELATIONS
/VARIABLES=iq depr anxi soci wellb
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE.

*Shorter version, creates exact same output.

correlations iq to wellb
/print nosig.

Correlation Output

SPSS Correlation Test Output

By default, SPSS always creates a full correlation matrix. Each correlation appears twice: above and below the main diagonal. The correlations on the main diagonal are the correlations between each variable and itself -which is why they are all 1 and not interesting at all. The 10 correlations below the diagonal are what we need. As a rule of thumb, a correlation is statistically significant if its “Sig. (2-tailed)” < 0.05.

Now let's take a close look at our results: the strongest correlation is between depression and overall well-being: r = -0.801. It's based on N = 117 children and its 2-tailed significance is reported as p = 0.000, which really means p < 0.0005. In other words: if the actual population correlation is zero, the probability of finding this sample correlation -or a stronger one- is practically zero.

Note that IQ does not correlate with anything. Its strongest correlation is 0.152 with anxiety but p = 0.11, so it's not statistically significantly different from zero. That is, there's a 0.11 probability of finding this sample correlation if the population correlation is zero. This correlation is too weak to reject the null hypothesis.

Like so, our 10 correlations indicate to what extent each pair of variables is linearly related. Finally, note that each correlation is computed on a slightly different N -ranging from 111 to 117. This is because SPSS uses pairwise deletion of missing values for correlations by default.
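If you'd rather have every correlation based on the exact same cases, the CORRELATIONS command also supports listwise deletion. A minimal sketch:

*Sketch: same correlation matrix but with listwise deletion of missing values.

correlations iq to wellb
/print nosig
/missing listwise.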

Scatterplots

Strictly, we should inspect the scatterplots among all our variables as well. After all, variables that don't correlate could still be related in some non-linear fashion. But for more than 5 or 6 variables, the number of possible scatterplots explodes, so we often skip inspecting them. However, see SPSS - Create All Scatterplots Tool.
The syntax below creates just one scatterplot, to give an idea of what our relation looks like. The result doesn't show anything unexpected, though.

*Simple scatterplot for wellbeing by depression.

graph
/scatter wellb with depr
/subtitle "Correlation = - 0.8 | N = 128".
SPSS Correlation Analysis Nice Scatterplot
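If you do want a quick look at all variable pairs at once, one option -not used in this tutorial- is a matrix scatterplot via the legacy GRAPH command. A minimal sketch:

*Sketch: scatterplot matrix over all 5 variables.

graph
/scatterplot(matrix) = iq depr anxi soci wellb.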

Reporting a Correlation Test

The figure below shows the most basic format recommended by the APA for reporting correlations. Importantly, make sure the table indicates which correlations are statistically significant at p < 0.05 and perhaps p < 0.01. Also see SPSS Correlations in APA Format.

Correlation Table in APA format

If possible, report the confidence intervals for your correlations as well. Oddly, SPSS doesn't include those. However, see SPSS Confidence Intervals for Correlations Tool.
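As a stopgap, you could approximate such an interval yourself via the Fisher z transformation. The sketch below -which is not the tool mentioned above- does so for the depression / well-being correlation (r = -0.801, N = 117).

*Sketch: approximate 95% confidence interval for a correlation via the Fisher z transformation.

data list free / r n.
begin data
-.801 117
end data.
compute z = .5 * ln((1 + r) / (1 - r)).
compute se = 1 / sqrt(n - 3).
compute r_lo = (exp(2 * (z - 1.96 * se)) - 1) / (exp(2 * (z - 1.96 * se)) + 1).
compute r_hi = (exp(2 * (z + 1.96 * se)) - 1) / (exp(2 * (z + 1.96 * se)) + 1).
list r n r_lo r_hi.

For this correlation, the interval comes out somewhere around -0.86 to -0.72.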

Thanks for reading!

Pearson Correlations – Quick Introduction

A Pearson correlation is a number between -1 and +1 that indicates to what extent 2 variables are linearly related. The Pearson correlation is also known as the “product moment correlation coefficient” (PMCC) or simply “correlation”.

Pearson correlations are only suitable for quantitative variables (including dichotomous variables).

Correlation Coefficient - Example

We asked 40 freelancers for their yearly incomes over 2010 through 2014. Part of the raw data is shown below.

Correlation Coefficient - Data View

Today's question is: is there any relation between income over 2010 and income over 2011? Well, a splendid way to find out is to inspect a scatterplot for these two variables: we'll represent each freelancer by a dot. The horizontal and vertical positions of each dot indicate a freelancer's income over 2010 and 2011. The result is shown below.

Pearson Correlation Coefficient - Scatterplot Incomes

Our scatterplot shows a strong relation between income over 2010 and 2011: freelancers who had a low income over 2010 (leftmost dots) typically had a low income over 2011 as well (lower dots) and vice versa. Furthermore, this relation is roughly linear; the main pattern in the dots is a straight line.
The extent to which our dots lie on a straight line indicates the strength of the relation. The Pearson correlation is a number that indicates the exact strength of this relation.

Correlation Coefficients and Scatterplots

A correlation coefficient indicates the extent to which dots in a scatterplot lie on a straight line. This implies that we can usually estimate correlations pretty accurately from nothing more than scatterplots. The figure below nicely illustrates this point.

Pearson Correlation Coefficient - Multiple Scatterplots

Correlation Coefficient - Basics

Some basic points regarding correlation coefficients are nicely illustrated by the previous figure. The least you should know is that a correlation of 0 means the variables are not linearly related at all, correlations of -1 and +1 indicate perfectly negative and perfectly positive linear relations, and the further a correlation lies from 0, the stronger the linear relation.

Correlation Coefficient - Perfect Linear Relations

Correlation Coefficient - Interpretation Caveats

When interpreting correlations, you should keep some things in mind. An elaborate discussion deserves a separate tutorial but we'll briefly mention two main points: correlation does not imply causation, and a Pearson correlation only captures linear relations -so a correlation of 0 does not rule out a strong non-linear relation between two variables.

Correlation Coefficient - Software

Most spreadsheet editors such as Excel, Google Sheets and OpenOffice can compute correlations for you. The illustration below shows an example in Google Sheets.

Correlation Coefficient in Google Sheet

Correlation Coefficient - Correlation Matrix

Keep in mind that correlations apply to pairs of variables. If you're interested in more than 2 variables, you'll probably want to take a look at the correlations between all different variable pairs. These correlations are usually shown in a square table known as a correlation matrix. Statistical software packages such as SPSS create correlation matrices before you can blink your eyes. An example is shown below.

Correlation Coefficient - SPSS Correlation Matrix

Note that the diagonal elements (in red) are the correlations between each variable and itself. This is why they are always 1.
Also note that the correlations beneath the diagonal (in grey) are redundant because they're identical to the correlations above the diagonal. Technically, we say that this is a symmetrical matrix.
Finally, note that the pattern of correlations makes perfect sense: the further apart two years lie, the lower the correlation between their incomes.
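By the way, in SPSS syntax such a matrix takes a single command. The variable names below are just placeholders for the five income variables:

*Sketch: correlation matrix for the five income variables (hypothetical variable names).

correlations income2010 income2011 income2012 income2013 income2014.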

Pearson Correlation - Formula

If we want to inspect correlations, we'll have a computer calculate them for us. You'll rarely (probably never) need the actual formula. However, for the sake of completeness, a Pearson correlation between variables X and Y is calculated by
$$r_{XY} = \frac{\sum_{i=1}^n(X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^n(X_i - \overline{X})^2}\sqrt{\sum_{i=1}^n(Y_i - \overline{Y})^2}}$$
The formula basically comes down to dividing the covariance by the product of the standard deviations. Since a coefficient is one number divided by another, this also explains why we speak of a correlation coefficient.
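In more compact notation -with cov(X,Y) denoting the covariance and s_X, s_Y the standard deviations- the same formula reads
$$r_{XY} = \frac{\text{cov}(X,Y)}{s_X\,s_Y}$$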

Correlation - Statistical Significance

The data we have available are often -but not always- a small sample from a much larger population. If so, we may find a non-zero correlation in our sample even if it's zero in the population. The figure below illustrates how this could happen.

Scatterplot Showing Sample Correlation if Population Correlation is Zero

If we ignore the colors for a second, all 1,000 dots in this scatterplot visualize some population. The population correlation -denoted by ρ- between test 1 and test 2 is zero.
Now, we could draw a sample of N = 20 from this population for which the correlation r = 0.95. Conversely, this means that a sample correlation of 0.95 doesn't prove with certainty that there's a non-zero correlation in the entire population. However, finding r = 0.95 with N = 20 is extremely unlikely if ρ = 0. But precisely how unlikely? And how do we know?
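If you want to see this happen for yourself, the sketch below simulates a comparable situation: it generates 1,000 cases on 2 uncorrelated test scores and then correlates a random sample of N = 20 drawn from them. The variable names test1 and test2 are made up for this illustration and the sample correlation will differ from run to run.

*Sketch: population with a (roughly) zero correlation, then a small sample from it (made-up variables test1 / test2).

set seed = 1234.
input program.
loop #i = 1 to 1000.
compute test1 = rv.normal(100, 15).
compute test2 = rv.normal(100, 15).
end case.
end loop.
end file.
end input program.
execute.

*Correlation over all 1,000 cases: very close to zero.

correlations test1 test2.

*Correlation over a random sample of N = 20: may differ noticeably from zero.

sample 20 from 1000.
correlations test1 test2.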

Correlation - Test Statistic

If ρ -a population correlation- is zero, then the probability for a given sample correlation -its statistical significance- depends on the sample size. We therefore combine the sample size and r into a single number, our test statistic T: $$T = R\sqrt{\frac{n - 2}{1 - R^2}}$$

Now, T itself is not interesting. However, we need it for finding the significance level for some correlation. T follows a t distribution with ν = n - 2 degrees of freedom but only if some assumptions are met.
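As a quick sketch -using the r = 0.95 and N = 20 from our example- this is how you could let SPSS turn a correlation into T and a 2-tailed p-value:

*Sketch: compute T and its 2-tailed p-value from a correlation and a sample size.

data list free / r n.
begin data
.95 20
end data.
compute t = r * sqrt((n - 2) / (1 - r ** 2)).
compute df = n - 2.
compute p = 2 * (1 - cdf.t(abs(t), df)).
formats p (e12.3).
list r n t df p.

For these numbers, T comes out around 12.9, which is huge for 18 degrees of freedom.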

Correlation Test - Assumptions

The statistical significance test for a Pearson correlation requires 3 assumptions: the variables are quantitative, the observations are independent and both variables follow a bivariate normal distribution in the population. The normality assumption mostly matters for small samples.

Pearson Correlation - Sampling Distribution

In our example, the sample size N was 20. So if we meet our assumptions, T follows a t-distribution with df = 18 as shown below.

Pearson Correlation - T-Distribution with DF = 18

This distribution tells us that there's a 95% probability that -2.1 < t < 2.1, corresponding to -0.44 < r < 0.44. Conclusion: if ρ = 0 and N = 20, there's a 95% probability of finding -0.44 < r < 0.44 and only a 5% probability of finding a correlation outside this range. That is, such correlations are statistically significant at α = 0.05: they are (highly) unlikely if the population correlation is zero and thus refute the null hypothesis.
Last, our sample correlation of 0.95 has a p-value of 1.55e-10 -roughly 1 in 6,467,334,654. We can safely conclude there's a non-zero correlation in our entire population.
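For the record, the -0.44 < r < 0.44 range follows from solving the formula for our test statistic for R at the critical t values of roughly ±2.1:
$$r_{crit} = \frac{t_{crit}}{\sqrt{t_{crit}^2 + (n - 2)}} = \frac{2.1}{\sqrt{2.1^2 + 18}} \approx 0.44$$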

Thanks for reading!

Creating APA Style Correlation Tables in SPSS

Introduction & Practice Data File

When running correlations in SPSS, we get the p-values as well. In some cases, we don't want that: if our data hold an entire population, such p-values are actually nonsensical. For some stupid reason, we can't get correlations without significance levels from the correlations dialog. However, this tutorial shows 2 ways to get them anyway. We'll use adolescents-clean.sav throughout.

Correlation Matrix in APA Format (Correlation Table as Recommended by the APA)

Option 1: FACTOR

A reasonable option is navigating to Analyze → Dimension Reduction → Factor as shown below.

SPSS Factor Analysis Menu

Next, we'll move iq through wellb into the variables box and follow the steps outlined in the next screenshot.

SPSS Correlation Matrix From Factor

Clicking Paste results in the syntax below. It'll create a correlation matrix without significance levels or sample sizes. Note that FACTOR uses listwise deletion of missing values by default but we can easily change this to pairwise deletion. Also, we can shorten the syntax quite a bit in case we need more than one correlation matrix.

Correlation Matrix from FACTOR Syntax

*Syntax pasted from Analyze - Dimension Reduction - Factor.

FACTOR
/VARIABLES iq depr anxi soci wellb
/MISSING pairwise /* WATCH OUT HERE: DEFAULT IS LISTWISE! */
/ANALYSIS iq depr anxi soci wellb
/PRINT CORRELATION EXTRACTION
/CRITERIA MINEIGEN(1) ITERATE(25)
/EXTRACTION PC
/ROTATION NOROTATE
/METHOD=CORRELATION.

*Can be shortened to...

factor
/variables iq to wellb
/missing pairwise
/print correlation.

*...or even...

factor
/variables iq to wellb
/print correlation.

*but this last version uses listwise deletion of missing values.

Result

SPSS Correlation Matrix From Factor Command

When using pairwise deletion, we no longer see the sample sizes used for each correlation. We may not want those in our table but perhaps we'd like to say something about them in our table title.
More importantly, we've no idea which correlations are statistically significant and which aren't. Our second approach deals nicely with both issues.

Option 2: Adjust Default Correlation Table

The fastest way to create correlations is simply running correlations iq to wellb. However, we sometimes want to have statistically significant correlations flagged. We'll do so by adding just one line.

*Create full correlation matrix and flag statistically significant correlations.

correlations iq to wellb
/print nosig.

This results in a standard correlation matrix with all sample sizes and p-values. However, we'll now make everything except the actual correlations invisible.

Adjusting Our Pivot Table Structure

We first right-click our correlation table and navigate to Edit Content → In Separate Window as shown below.

SPSS Pivot Table Manager

Select Pivoting Trays from the Pivot menu.

SPSS Pivoting Trays Menu

Drag and drop the Statistics (row) dimension into the LAYER area and close the pivot editor.

SPSS Pivoting Trays

Result

SPSS Correlation Matrix Without Significance Final Result

Same Results Faster?

If you like the final result, you may wonder if there's a faster way to accomplish it. Well, there is: the Python syntax below makes the adjustment on all pivot tables in your output. So make sure there are only correlation tables in your output before running it. It may crash otherwise.

*Move last row dimension into layer for all tables in output window. This syntax requires the SPSS Python Essentials.

begin program.
import SpssClient
SpssClient.StartClient()
oDoc = SpssClient.GetDesignatedOutputDoc()
oItems = oDoc.GetOutputItems()
# Loop over all output items, starting from the most recent one.
for index in range(oItems.Size()):
    oItem = oItems.GetItemAt(oItems.Size() - index - 1)
    # Only pivot tables are adjusted; all other output items are left alone.
    if oItem.GetType() == SpssClient.OutputItemType.PIVOT:
        pTable = oItem.GetSpecificType()
        pManager = pTable.PivotManager()
        nRows = pManager.GetNumRowDimensions()
        # Move a row dimension of the pivot table into the layer area.
        rDim = pManager.GetRowDimension(0)
        rDim.MoveToLayer(0)
SpssClient.StopClient()
end program.

Well, that's it. Hope you liked this tutorial and my script -I actually run it from my toolbar pretty often.

Thanks for reading!