Finding & Excluding Outliers in SPSS - 3 Easy Options

How to Find & Exclude Outliers in SPSS?

Summary

Outliers are basically values that fall outside of a normal range for some variable. But what's a “normal range”? This is subjective and may depend on substantive knowledge and prior research. Alternatively, there are some rules of thumb as well. These are less subjective but don't always result in better decisions, as we're about to see.

In any case: we usually want to exclude outliers from data analysis. So how to do so in SPSS? We'll walk you through 3 methods, using life-choices.sav, partly shown below.

SPSS Life Choices Data Variable View

In this tutorial, we'll find outliers for these reaction time variables.

During this tutorial, we'll focus exclusively on reac01 to reac05, the reaction times in milliseconds for 5 choice trials offered to the respondents.

Method I - Histograms

Let's first try to identify outliers by running some quick histograms over our 5 reaction time variables. Doing so from SPSS’ menu is discussed in Creating Histograms in SPSS. A faster option, though, is running the syntax below.

*Create frequency tables with histograms for 5 reaction time variables.

frequencies reac01 to reac05
/histogram.

Result

Let's take a good look at the first of our 5 histograms shown below.

SPSS Outliers In Histogram

The “normal range” for this variable seems to run from 500 through 1500 ms. It seems that 3 scores lie outside this range. So are these outliers? Honestly, different analysts will make different decisions here. Personally, I'd settle for only excluding the score ≥ 2000 ms. So what's the right way to do so? And what about the other variables?

Excluding Outliers from Data

The right way to exclude outliers from data analysis is to specify them as user missing values. So for reaction time 1 (reac01), running missing values reac01 (2000 thru hi). excludes reaction times of 2000 ms and higher from all data analyses and editing. So what about the other 4 variables?

The histograms for reac02 and reac03 don't show any outliers.

For reac04, we see some low outliers as well as a high outlier. We can find which values these are in the bottom and top of its frequency distribution as shown below.

SPSS Outliers In Frequency Table

If we see any outliers in a histogram, we may look up the exact values in the corresponding frequency table.

We can exclude all of these outliers in one go by running missing values reac04 (lo thru 400,2085). By the way: “lo thru 400” means the lowest value in this variable (its minimum) through 400 ms.
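The steps so far can be collected into one small syntax sketch. The cutoffs are the ones chosen above; the DESCRIPTIVES check and the commented-out undo line are additions for illustration only.

```spss
*Sketch: set the outliers chosen above as user missing values.

missing values reac01 (2000 thru hi).
missing values reac04 (lo thru 400, 2085).

*Check: N should drop and min and max should now fall within the retained range.

descriptives reac01 reac04.

*To undo, clear the user missing values again by specifying empty parentheses.
*missing values reac01 reac04 ().
```

Because user missing values only flag data rather than delete them, this approach stays reversible: the original scores remain in the file.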

For reac05, we see several low and high outliers. The obvious thing to do seems to be running something like missing values reac05 (lo thru 400,2000 thru hi). But sadly, this only triggers the following error:

>Error # 4818 in column 46. Text: hi
>There are too many values specified.
>The limit is three individual values or
>one value and one range of values.
>Execution of this command stops.

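The problem here is that you can't specify both a low and a high range of missing values in SPSS. Since this is what you typically need to do, this is one of the biggest stupidities still found in SPSS today. A workaround for this problem is to recode the entire low range into some huge value such as 999999999, add a value label to this value that lists the original values, and then specify only a high range of missing values that includes 999999999.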

The syntax below does just that and reruns our histograms to check if all outliers have indeed been correctly excluded.

*Change low outliers to 999999999 for reac05.

recode reac05 (lo thru 400 = 999999999).

*Add value label to 999999999.

add value labels reac05 999999999 '(Recoded from 95 / 113 / 397 ms)'.

*Set range of high missing values.

missing values reac05 (2000 thru hi).

*Rerun frequency tables after excluding outliers.

frequencies reac01 to reac05
/histogram.

Result

First off, note that none of our 5 histograms show any outliers anymore; they're now excluded from all data analysis and editing. Also note the bottom of the frequency table for reac05 shown below.

SPSS Report Outliers In Frequency Table

Low outliers after recoding and labelling are listed under Missing.

Even though we had to recode some values, we can still report precisely which outliers we excluded for this variable due to our value label.

Before proceeding to boxplots, I'd like to mention 2 worst practices for excluding outliers:

Sadly, supervisors sometimes force their students to take this road anyway. If so, SELECT IF permanently removes entire cases from your data.

Method II - Boxplots

If you ran the previous examples, you need to close and reopen life-choices.sav before proceeding with our second method.

We'll create a boxplot as discussed in Creating Boxplots in SPSS - Quick Guide: we first navigate to Analyze → Descriptive Statistics → Explore as shown below.

SPSS Analyze Descriptive Statistics Explore

Next, we'll fill in the dialogs as shown below.

SPSS Find Outliers In Boxplot Dialogs

Completing these steps results in the syntax below. Let's run it.

*Create boxplot and outlier summary.

EXAMINE VARIABLES=reac01 reac02 reac03 reac04 reac05
/PLOT BOXPLOT
/COMPARE VARIABLES
/STATISTICS EXTREME
/MISSING PAIRWISE
/NOTOTAL.

Result

Quick note: if you're not sure about interpreting boxplots, read up on Boxplots - Beginners Tutorial first.

SPSS Outliers In Boxplots Result

Our boxplot indicates some potential outliers for all 5 variables. But let's just ignore these and exclude only the extreme values that are observed for reac01, reac04 and reac05.
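For reference, the distinction between "potential outliers" and "extreme values" can be written out. By SPSS' boxplot convention, values more than 1.5 box lengths from the box are flagged as outliers and values more than 3 box lengths away as extreme values. The symbols \(Q_1\) and \(Q_3\) (lower and upper box edges) and \(IQR = Q_3 - Q_1\) (box length) are shorthand introduced here.

$$\text{outlier: } X < Q_1 - 1.5 \cdot IQR \;\text{ or }\; X > Q_3 + 1.5 \cdot IQR$$

$$\text{extreme value: } X < Q_1 - 3 \cdot IQR \;\text{ or }\; X > Q_3 + 3 \cdot IQR$$

So every extreme value is also an outlier, but only the most distant outliers count as extreme.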

So, precisely which values should we exclude? We find them in the Extreme Values table. I like to copy-paste this into Excel. Now we can easily boldface all values that are extreme values according to our boxplot.

Boldface Outliers In Excel

Copy-pasting the Extreme Values table into Excel allows you to easily boldface the exact outliers that we'll exclude.

Finally, we set these extreme values as user missing values with the syntax below. For a step-by-step explanation of this routine, look up Excluding Outliers from Data.

*Recode range of low outliers into huge value for reac05.

recode reac05 (lo thru 113 = 999999999).

*Label new value with original values.

add value labels reac05 999999999 '(Recoded from 95 / 113 ms)'.

*Set (ranges of) missing values for reac01, reac04 and reac05.

missing values
reac01 (2065)
reac04 (17,2085)
reac05 (1647 thru hi).

*Rerun boxplot and check if all extreme values are gone.

EXAMINE VARIABLES=reac01 reac02 reac03 reac04 reac05
/PLOT BOXPLOT
/COMPARE VARIABLES
/STATISTICS EXTREME
/MISSING PAIRWISE
/NOTOTAL.

Method III - Z-Scores (with Reporting)

A common approach to excluding outliers is to look up which values correspond to high z-scores. Again, there are different rules of thumb for which z-scores should be considered outliers. Today, we settle for the rule that |z| ≥ 3.29 indicates an outlier. The basic idea here is that if a variable is perfectly normally distributed, then only 0.1% of its values will fall outside this range.
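The 0.1% follows directly from the standard normal distribution. As a quick check, with \(\Phi\) denoting the standard normal cumulative distribution function (notation introduced here):

$$P(|z| \ge 3.29) = 2 \cdot \bigl(1 - \Phi(3.29)\bigr) \approx 2 \cdot 0.0005 = 0.001$$

So roughly 1 in 1,000 values is expected beyond this cutoff if normality holds exactly.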

So what's the best way to do this in SPSS? Well, the first 2 steps are super simple: we compute z-scores for all 5 variables and then inspect their minima and maxima.

Funnily, both steps are best done with a simple DESCRIPTIVES command as shown below.

*Create z-scores for reac01 to reac05.

descriptives reac01 to reac05
/save.

*Check min and max for z-scores.

descriptives zreac01 to zreac05.

Result

SPSS Find Outliers Based On Z Scores

Minima and maxima for our newly computed z-scores.

Basic conclusions from this table are that

But which original values correspond to these high absolute z-scores? For each variable, we can run 2 simple steps:

The syntax below does just that but uses TEMPORARY and SELECT IF for filtering out non-outliers.

*Find which values to exclude.

temporary.
select if(abs(zreac01) >= 3.29).
frequencies reac01.

temporary.
select if(abs(zreac04) >= 3.29).
frequencies reac04.

temporary.
select if(abs(zreac05) >= 3.29).
frequencies reac05.

*Save output because tables needed for reporting which outliers are excluded.

output save outfile = 'outlier-tables-01.spv'.

Result

SPSS Report Outliers Based On Z Scores

Finding outliers by filtering out all non-outliers based on their z-scores.

Note that each frequency table only contains a handful of outliers for which |z| ≥ 3.29. We'll now exclude these values from all data analyses and editing with the syntax below. For a detailed explanation of these steps, see Excluding Outliers from Data.

*Recode ranges of low outliers into 999999999.

recode reac04 (lo thru 107 = 999999999).
recode reac05 (lo thru 113 = 999999999).

*Label new values with original values.

add value labels reac04 999999999 '(Recoded from 17 / 107 ms)'.
add value labels reac05 999999999 '(Recoded from 95 / 113 ms)'.

*Set (ranges of) missing values for reac01, reac04 and reac05.

missing values
reac01 (1659 thru hi)
reac04 (1601 thru hi )
reac05 (1776 thru hi).

*Check if all outliers are indeed user missing values now.

temporary.
select if(abs(zreac01) >= 3.29).
frequencies reac01.

temporary.
select if(abs(zreac04) >= 3.29).
frequencies reac04.

temporary.
select if(abs(zreac05) >= 3.29).
frequencies reac05.

Method III - Z-Scores (without Reporting)

We can greatly speed up the z-score approach we just discussed but this comes at a price: we won't be able to report precisely which outliers we excluded. If that's ok with you, the syntax below almost fully automates the job.

*Create z-scores for reac01 to reac05.

descriptives reac01 to reac05
/save.

*Recode original values into 999999999 if |z-score| >= 3.29.

do repeat #ori = reac01 to reac05 / #z = zreac01 to zreac05.
if(abs(#z) >= 3.29) #ori = 999999999.
end repeat print.

*Add value labels.

add value labels reac01 to reac05 999999999 '(Excluded because |z| >= 3.29)'.

*Set missing values.

missing values reac01 to reac05 (999999999).

*Check how many outliers were excluded.

frequencies reac01 to reac05.

Result

The frequency table below tells us that 4 outliers having |z| ≥ 3.29 were excluded for reac04.

SPSS Exclude Outliers Based On Z Scores Result

Under Missing we see the number of excluded outliers but not the exact values.

Sadly, we're no longer able to tell precisely which original values these correspond to.
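If you'd like to keep the speed but still be able to report the excluded values, one possible workaround is to copy the original values into backup variables before overwriting them. This is only a sketch: the names ori01 to ori05 are made up for this example, and the first block must run before the DO REPEAT recode above.

```spss
*Sketch: copy original values into backup variables (hypothetical names ori01 to ori05).
*Run this BEFORE recoding outliers into 999999999.

do repeat #ori = reac01 to reac05 / #copy = ori01 ori02 ori03 ori04 ori05.
compute #copy = #ori.
end repeat.
execute.

*After recoding and setting missing values, list the excluded originals for reac04.
*Note that MISSING also selects any cases that were already missing on reac04.

temporary.
select if (missing(reac04)).
frequencies ori04.
```

The price is 5 extra variables in the data file, which you may want to delete once the excluded values have been reported.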

Final Notes

Thus far, I deliberately avoided the discussion of precisely which values should be considered outliers for our data. I feel that simply making a decision and being fully explicit about it is more constructive than endless debate.

I therefore blindly followed some rules of thumb for the boxplot and z-score approaches. As I warned earlier, these don't always result in good decisions: for the data at hand, reaction times below some 500 ms can't be taken seriously. However, the rules of thumb don't always exclude these.

As for most of data analysis, using common sense is usually a better idea...

Thanks for reading!

SPSS Mediation Analysis – The Complete Guide

Example

A scientist wants to know which factors affect general well-being among people suffering illnesses. In order to find out, she collects some data on a sample of N = 421 cancer patients. These data -partly shown below- are in wellbeing.sav.

SPSS Wellbeing Variable View

Now, our scientist believes that well-being is affected by pain as well as fatigue. On top of that, she believes that fatigue itself is also affected by pain. In short: pain partly affects well-being through fatigue. That is, fatigue mediates the effect from pain onto well-being as illustrated below.

Simple Mediation Analysis Diagram

The lower half illustrates a model in which fatigue would (erroneously) be left out. This is known as the “total effect model” and is often compared with the mediation model above it.

How to Examine Mediation Effects?

Now, let's suppose for a second that all expectations from our scientist are exactly correct. If so, then what should we see in our data? The classical approach to mediation (see Baron & Kenny, 1986) says that

So how to find out if our data is in line with these statements? Well, all paths are technically just b-coefficients. We'll therefore run 3 (separate) regression analyses:

SPSS B-Coefficients Output

Paths c’ and b in basic SPSS regression output

SPSS Regression Dialogs

So let's first run the regression analysis for effect \(a\) (X onto mediator) in SPSS: we'll open wellbeing.sav and navigate to the linear regression dialogs as shown below.

SPSS Analyze Regression Linear

For a fairly basic analysis, we'll fill out these dialogs as shown below.

SPSS Mediation Analysis Dialogs

Completing these steps results in the SPSS syntax below. I suggest you shorten the pasted version a bit.

*EFFECT A (X ONTO MEDIATOR).
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT fatigue /* MEDIATOR */
/METHOD=ENTER pain /* X */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).


*SHORTEN TO SOMETHING LIKE...
REGRESSION
/STATISTICS COEFF CI(95) R
/DEPENDENT fatigue /* MEDIATOR */
/METHOD=ENTER pain /* X */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).

A second regression analysis estimates effects \(b\) and \(c\,'\). The easiest way to run it is to copy, paste and edit the first syntax as shown below.

*EFFECTS B (MEDIATOR ONTO Y) AND C' (X ONTO Y, DIRECT).

REGRESSION
/STATISTICS COEFF CI(95) R
/DEPENDENT wellb /* Y */
/METHOD=ENTER pain fatigue /* X AND MEDIATOR */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).

We'll use the syntax below for the third (and final) regression which estimates \(c\), the total effect.

*EFFECT C (X ONTO Y, TOTAL).

REGRESSION
/STATISTICS COEFF CI(95) R
/DEPENDENT wellb /* Y */
/METHOD=ENTER pain /* X */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).

SPSS Mediation Analysis Output

For our mediation analysis, we really only need the 3 coefficients tables. I copy-pasted them into this Googlesheet (read-only, partly shown below).

SPSS Mediation Analysis Effects Googlesheets

So what do we conclude? Well, all requirements for mediation are met by our results:

The diagram below summarizes these results.

Mediation Analysis Summary

Note that both \(c\) and \(c\,'\) are significant. This is often called partial mediation: fatigue partially mediates the effect from pain onto well-being: adding it decreases the effect but doesn't nullify it altogether.

Besides partial mediation, we sometimes find full mediation. This means that \(c\) is significant but \(c\,'\) isn't: the effect is fully mediated and thus disappears when the mediator is added to the regression model.
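For linear regression models fitted to the same cases, the paths are tied together by a simple identity that makes the partial versus full distinction easier to see. This is a standard result, stated here as a reminder rather than read from the output above.

$$c = c\,' + a \cdot b$$

So the total effect splits into a direct part \(c\,'\) and an indirect part \(a \cdot b\). Under full mediation, \(c\,'\) is close to zero and nearly all of \(c\) runs through \(a \cdot b\). Under partial mediation, both parts are clearly nonzero.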

APA Reporting Mediation Analysis

Mediation analysis is often reported as separate regression analyses as in “the first step of our analysis showed that the effect of pain on fatigue was significant, b = 0.09, p < .001...” Some authors also include t-values and degrees of freedom (df) for b-coefficients. For some very dumb reason, SPSS does not report degrees of freedom in its coefficients table but you can compute them as

$$df = N - k - 1$$

where \(N\) denotes the sample size and \(k\) the number of predictors in the regression model.

Like so, we could report “the second step of our analysis showed that the effect of fatigue on well-being was also significant, b = -0.53, t(418) = -3.89, p < .001...” Note that this second model has \(k\) = 2 predictors (pain and fatigue), so df = 421 - 2 - 1 = 418.
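Alternatively, SPSS does print these degrees of freedom if you request the ANOVA table: they are listed in its Residual row. A minimal sketch, reusing the second regression from this tutorial with ANOVA added to /STATISTICS:

```spss
*Sketch: request the ANOVA table to read off the residual df.

REGRESSION
/STATISTICS COEFF CI(95) R ANOVA
/DEPENDENT wellb
/METHOD=ENTER pain fatigue.
```

With N = 421 complete cases and k = 2 predictors, the Residual row should show df = 421 - 2 - 1 = 418.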

Next Steps - The Sobel Test

In our analysis, the indirect effect of pain via fatigue onto well-being consists of two separate effects, \(a\) (pain onto fatigue) and \(b\) (fatigue onto well-being). Now, the entire indirect effect \(ab\) is simply computed as

$$\text{indirect effect} \;ab = a \cdot b$$

This makes perfect sense: if wage \(a\) is $30 per hour and tax \(b\) is $0.20 per dollar income, then I'll pay $30 · 0.20 = $6.00 tax per hour, right?

For our example, \(ab\) = 0.09 · (-0.53) = -0.049: for every unit increase in pain, well-being decreases by an average 0.049 units via fatigue. But how do we obtain the p-value and confidence interval for this indirect effect? There are 2 basic options:

The second approach assumes \(ab\) is normally distributed with

$$se_{ab} = \sqrt{a^2se^2_b + b^2se^2_a + se^2_a se^2_b}$$

where

\(se_{ab}\) denotes the standard error of \(ab\) and so on.
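Given \(se_{ab}\), the test itself is a plain z-test. The formulas below are the usual normal-theory ones rather than something read off the calculator; \(\Phi\), the standard normal cumulative distribution function, is notation introduced here.

$$z = \frac{ab}{se_{ab}}$$

$$p = 2 \cdot \bigl(1 - \Phi(|z|)\bigr)$$

$$95\%\;CI = ab \pm 1.96 \cdot se_{ab}$$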

For the actual calculations, I suggest you try our Sobel Test Calculator.xlsx, partly shown below.

Sobel Test Calculation Tool Example

So what does this tell us? Well, our indirect effect is significant, B = -0.049, p = .002, 95% CI [-0.08, -0.02].

Next Steps - Index of Mediation

Our research variables (such as pain & fatigue) were measured on different scales without clear units of measurement. This renders it impossible to compare their effects. The solution is to report standardized coefficients known as β (Greek letter “beta”).

Our SPSS output already includes beta for most effects but not for \(ab\). However, we can easily compute it as

$$\beta_{ab} = \frac{ab \cdot SD_x}{SD_y}$$

where

\(SD_x\) is the sample-standard-deviation of our X variable and so on.

This standardized indirect effect is known as the index of mediation. For computing it, we may run something like DESCRIPTIVES pain wellb. in SPSS. After copy-pasting the resulting table into this Googlesheet, we'll compute \(\beta_{ab}\) with a quick formula as shown below.

SPSS Mediation Analysis Summary Table Googlesheets

Adding the output from our Sobel test calculator to this sheet results in a very complete and clear summary table for our mediation analysis.

Final Notes

Mediation analysis in SPSS can be done with or without the PROCESS macro. Some reasons for not using PROCESS are that

SPSS Process Dialog

So why does anybody use PROCESS? Some reasons may be that

Right. I hope this tutorial has been helpful for running, reporting and understanding mediation analysis in SPSS. This is perhaps not the easiest topic but remember that practice makes perfect.

Thanks for reading!

Skewness – What & Why?

Skewness is a number that indicates to what extent a variable is asymmetrically distributed.

Positive (Right) Skewness Example

A scientist has 1,000 people complete some psychological tests. For test 5, the test scores have skewness = 2.0. A histogram of these scores is shown below.

Positively Skewed Distribution

The histogram shows a very asymmetrical frequency distribution. Most people score 20 points or lower but the right tail stretches out to 90 or so. This distribution is right skewed.

If we move to the right along the x-axis, we go from 0 to 20 to 40 points and so on. So towards the right of the graph, the scores become more positive. Therefore, right skewness is positive skewness which means skewness > 0. This first example has skewness = 2.0 as indicated in the top right corner of the graph. The scores are strongly positively skewed.

Negative (Left) Skewness Example

Another variable -the scores on test 2- turns out to have skewness = -1.0. Its histogram is shown below.

Negatively Skewed Distribution

The bulk of scores are between 60 and 100 or so. However, the left tail is stretched out somewhat. So this distribution is left skewed.

Right: to the left, to the left. If we follow the x-axis to the left, we move towards more negative scores. This is why left skewness is negative skewness. And indeed, skewness = -1.0 for these scores. Their distribution is left skewed. However, it is less skewed -or more symmetrical- than our first example which had skewness = 2.0.

Symmetrical Distribution Implies Zero Skewness

Finally, symmetrical distributions have skewness = 0. The scores on test 3 -having skewness = 0.1- come close.

Symmetrical Distribution

Now, observed distributions are rarely precisely symmetrical. This is mostly seen for some theoretical sampling distributions. Some examples are

These distributions are all exactly symmetrical and thus have skewness = 0.000...

Population Skewness - Formula and Calculation

If you'd like to compute skewnesses for one or more variables, just leave the calculations to some software. But -just for the sake of completeness- I'll list the formulas anyway.

If your data contain your entire population, compute the population skewness as:

$$Population\;skewness = \Sigma\biggl(\frac{X_i - \mu}{\sigma}\biggr)^3\cdot\frac{1}{N}$$

where \(X_i\) is each individual score, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation and \(N\) is the population size.

For an example calculation using this formula, see this Googlesheet (shown below).

Population Skewness Calculation Example Googlesheet

It also shows how to obtain population skewness directly by using =SKEW.P(...) where “.P” means “population”. This confirms the outcome of our manual calculation. Sadly, neither SPSS nor JASP computes population skewness: both are limited to sample skewness.
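As a minimal worked example, take the N = 4 made-up scores 1, 2, 3 and 6 (not the data from the Googlesheet). Their mean is 3, their deviations from the mean are -2, -1, 0 and 3, and the population standard deviation is \(\sqrt{14/4} \approx 1.871\), so its cube is roughly 6.548. Filling this in gives

$$Population\;skewness = \frac{(-2)^3 + (-1)^3 + 0^3 + 3^3}{4 \cdot 6.548} = \frac{18}{26.19} \approx 0.69$$

The single high score of 6 stretches out the right tail, hence the positive outcome.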

Sample Skewness - Formula and Calculation

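If your data hold a simple random sample from some population, use

$$Sample\;skewness = \frac{N\cdot\Sigma(X_i - \overline{X})^3}{S^3(N - 1)(N - 2)}$$

where \(X_i\) is each individual score, \(\overline{X}\) is the sample mean, \(S\) is the sample standard deviation and \(N\) is the sample size.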

An example calculation is shown in this Googlesheet (shown below).

Sample Skewness Calculation Example Googlesheet

An easier option for obtaining sample skewness is using =SKEW(...), which confirms the outcome of our manual calculation.
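As a minimal worked example, take the N = 4 made-up scores 1, 2, 3 and 6 (not the data from the Googlesheet). Their mean is 3, the cubed deviations sum to 18 and the sample standard deviation is \(\sqrt{14/3} \approx 2.160\), so \(S^3 \approx 10.08\). This gives

$$Sample\;skewness = \frac{4 \cdot 18}{10.08 \cdot 3 \cdot 2} = \frac{72}{60.49} \approx 1.19$$

The population formula yields roughly 0.69 for these same scores. Working through both formulas shows that the two are linked by a factor that depends only on \(N\):

$$Sample\;skewness = Population\;skewness \cdot \frac{\sqrt{N(N - 1)}}{N - 2}$$

For N = 4, this factor is about 1.73, but for N = 1,000 it is only about 1.0015. So the two versions differ a lot in tiny samples and hardly at all in large ones.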

Skewness in SPSS

First off, “skewness” in SPSS always refers to sample skewness: it quietly assumes that your data hold a sample rather than an entire population. There are plenty of options for obtaining it. My favorite is via MEANS because the syntax and output are clean and simple. The screenshots below guide you through.

SPSS Skewness Via Means Menu SPSS Skewness Via Means Dialog

The syntax can be as simple as

means v1 to v5
/cells skew.

A very complete table -including means, standard deviations, medians and more- is run from

means v1 to v5
/cells count min max mean median stddev skew kurt.

The result is shown below.

Skewness in SPSS Output Table
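MEANS is only one of the options. Two alternatives are sketched below, assuming the same variables v1 to v5. The FREQUENCIES variant also requests the standard error of skewness (SESKEW) and suppresses the frequency tables themselves.

```spss
*Sketch: sample skewness via DESCRIPTIVES.

descriptives v1 to v5
/statistics mean stddev skewness.

*Sketch: sample skewness and its standard error via FREQUENCIES, without frequency tables.

frequencies v1 to v5
/format notable
/statistics skewness seskew.
```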

Skewness - Implications for Data Analysis

Many analyses -ANOVA, t-tests, regression and others- require the normality assumption: variables should be normally distributed in the population. The normal distribution has skewness = 0. So observing substantial skewness in some sample data suggests that the normality assumption is violated.

Such violations of normality are no problem for large sample sizes -say N > 20 or 25 or so. In this case, most tests are robust against such violations. This is due to the central limit theorem. In short, for large sample sizes, skewness is no real problem for statistical tests.

However, skewness is often associated with large standard deviations. These may result in large standard errors and low statistical power. Like so, substantial skewness may decrease the chance of rejecting some null hypothesis in order to demonstrate some effect. In this case, a nonparametric test may be a wiser choice as it may have more power.

Violations of normality do pose a real threat for small sample sizes of -say- N < 20 or so. With small sample sizes, many tests are not robust against a violation of the normality assumption. The solution -once again- is using a nonparametric test because these don't require normality.

Last but not least, there isn't any statistical test for examining if population skewness = 0. An indirect way for testing this is a normality test such as

However, when normality is really needed -with small sample sizes- such tests have low power: they may not reach statistical significance even when departures from normality are severe. Like so, they mainly provide you with a false sense of security.
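For completeness, normality tests can be run from EXAMINE. The sketch below assumes a variable named v1 as in the MEANS examples above. The NPPLOT keyword adds a “Tests of Normality” table (Kolmogorov-Smirnov and Shapiro-Wilk) plus normal Q-Q plots to the output.

```spss
*Sketch: normality tests and Q-Q plots for v1.

examine variables = v1
/plot npplot
/statistics none.
```

Given the low power mentioned above, the Q-Q plot is often at least as informative as the p-values.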

And that's about it, I guess. If you've any remarks -either positive or negative- please throw in a comment below. We do love a bit of discussion.

Thanks for reading!

How to Draw Regression Lines in SPSS?

Summary & Example Data

This tutorial walks you through different options for drawing (non)linear regression lines for either all cases or subgroups. All examples use bank-clean.sav, partly shown below.

SPSS Bank Clean Variable View

Method A - Legacy Dialogs

A simple option for drawing linear regression lines is found under Graphs → Legacy Dialogs → Scatter/Dot as illustrated by the screenshots below.

SPSS Scatterplot Dialogs

Completing these steps results in the SPSS syntax below. Running it creates a scatterplot to which we can easily add our regression line in the next step.

*SCATTERPLOT FROM GRAPHS - LEGACY DIALOGS - SCATTER/DOT.

GRAPH
/SCATTERPLOT(BIVAR)=whours WITH salary
/MISSING=LISTWISE.

For adding a regression line, first double click the chart to open it in a Chart Editor window. Next, click the “Add Fit Line at Total” icon as shown below.

SPSS Add Regression Line To Scatterplot

You can now simply close the fit line dialog and Chart Editor.

Result

SPSS Linear Regression Line In Scatterplot

The linear regression equation is shown in the label on our line: y = 9.31E3 + 4.49E2*x which means that

$$Salary' = 9,310 + 449 \cdot Hours$$

Note that 9.31E3 is scientific notation for 9.31 · \(10^3\) = 9,310 (with some rounding).

You can verify this result and obtain more detailed output by running a simple linear regression from the syntax below.

*SIMPLE LINEAR REGRESSION - ALL CASES.

regression
/dependent salary
/method enter whours.

When doing so, you'll also have significance levels and/or confidence intervals. Finally, note that a linear relation seems a very poor fit for these variables. So let's explore some more interesting options.
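If you'd like to work with the fitted values themselves, REGRESSION can save them as a new variable. A minimal sketch; the variable name pred_salary is made up for this example.

```spss
*Sketch: rerun the linear regression and save predicted salaries.

regression
/dependent salary
/method enter whours
/save pred(pred_salary).
```

Each case then holds roughly 9,310 + 449 times its weekly hours. For a respondent working 40 hours, that comes down to about 9,310 + 449 · 40 = 27,270.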

Method B - Chart Builder

For SPSS versions 25 and higher, you can obtain scatterplots with fit lines from the chart builder. Let's do so for job type groups separately: simply navigate to Graphs → Chart Builder and fill out the dialogs as shown below.

SPSS Draw Separate Regression Lines From Chart Builder

This results in the syntax below. Let's run it.

*SCATTERPLOT WITH LINEAR FIT LINES FOR SEPARATE GROUPS.

GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=whours salary jtype MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
/FITLINE TOTAL=NO SUBGROUP=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: whours=col(source(s), name("whours"))
DATA: salary=col(source(s), name("salary"))
DATA: jtype=col(source(s), name("jtype"), unit.category())
GUIDE: axis(dim(1), label("On average, how many hours do you work per week?"))
GUIDE: axis(dim(2), label("Gross monthly salary"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label("Current job type"))
GUIDE: text.title(label("Scatter Plot of Gross monthly salary by On average, how many hours do ",
    "you work per week? by Current job type"))
SCALE: cat(aesthetic(aesthetic.color.interior), include(
"1", "2", "3", "4", "5"))
ELEMENT: point(position(whours*salary), color.interior(jtype))
END GPL.

Result

SPSS Regression Lines Separate Groups

First off, this chart is mostly used for

Sadly, the styling for this chart is awful but we could have fixed this with a chart template if we hadn't been so damn lazy.

Anyway, note that R-square -a common effect size measure for regression- is between good and excellent for all groups except upper management. This handful of cases may be the main reason for the curvilinearity we see if we ignore the existence of subgroups.

Running the syntax below verifies the results shown in this plot and results in more detailed output.

*SORT AND SPLIT FILE.

sort cases by jtype.
split file layered by jtype.

*SIMPLE LINEAR REGRESSION.

regression
/dependent salary
/method enter whours.

*END SPLIT FILE.

split file off.

Method C - CURVEFIT

Scatterplots with (non)linear fit lines and basic regression tables are very easily obtained from CURVEFIT. Just navigate to Analyze → Regression → Curve Estimation and fill out the dialog as shown below.

SPSS Curve Estimation Dialog

If you'd like to see all models, change /MODEL=LINEAR to /MODEL=ALL after pasting the syntax.

*CURVEFIT - ALL MODELS.

TSET NEWVAR=NONE.
CURVEFIT
/VARIABLES=salary WITH whours
/CONSTANT
/MODEL=ALL /* CHANGE THIS LINE MANUALLY */
/PLOT FIT.

Result

SPSS Linear Nonlinear Regression Lines In Scatterplot

Despite the poor styling of this chart, most curves seem to fit these data better than a linear relation. This can somewhat be verified from the basic regression table shown below.

SPSS Curvefit Coefficients Output

Especially the cubic model seems to fit nicely. Its equation is

$$Salary' = -13114 + 1883 \cdot hours - 80 \cdot hours^2 + 1.17 \cdot hours^3$$

Sadly, this output is rather limited: do all predictors in the cubic model seriously contribute to r-squared? The syntax below results in more detailed output and verifies our initial results.

*QUICK REPLICATION CUBIC MODEL.

compute whours2 = whours**2.
compute whours3 = whours**3.

regression
/dependent salary
/method forward whours whours2 whours3.
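Note that FORWARD only enters predictors that meet its entry criterion, so some terms may be left out of the final model. If you'd rather force all three terms in and see what each one adds, a hierarchical variant is sketched below. It assumes whours2 and whours3 were computed as above; CHANGE requests R-square change statistics for each block.

```spss
*Sketch: enter linear, quadratic and cubic terms in 3 successive blocks.

regression
/statistics coeff r change
/dependent salary
/method enter whours
/method enter whours2
/method enter whours3.
```

The Model Summary table then shows whether the quadratic and cubic terms each add significantly to r-squared.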

Method D - Regression Variable Plots

Regression Variable Plots is an SPSS extension that's mostly useful for

I believe this extension is preinstalled with SPSS version 26 onwards. If not, it's supposedly available from STATS_REGRESS_PLOT but I used to have some trouble installing it on older SPSS versions.

Anyway: if installed, navigating to Graphs → Regression Variable Plots should open the dialog shown below.

SPSS Regression Variable Plots Dialogs

Completing these steps results in the syntax below. Let's run it.

*FIT CUBIC MODELS FOR SEPARATE GROUPS (BAD IDEA).

STATS REGRESS PLOT YVARS=salary XVARS=whours COLOR=jtype
/OPTIONS CATEGORICAL=BARS GROUP=1 INDENT=15 YSCALE=75
/FITLINES CUBIC APPLYTO=GROUP.

Result

SPSS Non Linear Regression Lines Separate Groups

Most groups don't show strong deviations from linearity. The main exception is upper management which shows a rather bizarre curve.

However, keep in mind that these are only a handful of observations; the curve is the result of overfitting. It (probably) won't replicate in other samples and can't be taken seriously.

Method E - All Scatterplots Tool

Most methods we discussed so far are pretty good for creating a single scatterplot with a fit line. However, we often want to check several such plots for things like outliers, homoscedasticity and linearity. This is especially relevant for

A very simple tool for precisely these purposes is downloadable from and discussed in SPSS - Create All Scatterplots Tool.

SPSS Create All Scatterplots Tool Dialog 2

Final Notes

Right, so those are the main options for obtaining scatterplots with fit lines in SPSS. I hope you enjoyed this quick tutorial as much as I have.

If you've any remarks, please throw me a comment below. And last but not least:

thanks for reading!

SPSS Mediation Analysis with PROCESS

Introduction

A study investigated general well-being among a random sample of N = 421 hospital patients. Some of these data are in wellbeing.sav, partly shown below.

SPSS Wellbeing Variable View

One investigator believes that

That is, the relation from pain onto well-being is thought to be mediated by fatigue, as visualized below (top half).

Simple Mediation Analysis Diagram

Besides this indirect effect through fatigue, pain could also directly affect well-being (top half, path \(c\,'\)).

Now, what would happen if this model were correct and we'd (erroneously) leave fatigue out of it? Well, in this case the direct and indirect effects would be added up into a total effect (path \(c\), lower half). If all these hypotheses are correct, we should see the following in our data:

One approach to such a mediation analysis is a series of (linear) regression analyses as discussed in SPSS Mediation Analysis Tutorial. An alternative, however, is using the SPSS PROCESS macro as we'll demonstrate below.

Quick Data Checks

Rather than blindly jumping into some advanced analyses, let's first see if our data look plausible in the first place. As a quick check, let's inspect the histograms of all variables involved. We'll do so from the SPSS syntax below. For more details, consult Creating Histograms in SPSS.

*QUICK CHECK DISTRIBUTIONS / OUTLIERS / MISSING VALUES.

frequencies pain fatigue wellb
/format notable
/histogram.

Result

First off, note that all variables have N = 421 so there are no missing values. It's important to make sure of this because PROCESS can only handle cases that are complete on all variables involved in the analysis.

Second, there seem to be some slight outliers. This especially holds for fatigue as shown below.

SPSS Outlier In Histogram

I think these values still look pretty plausible and I don't expect them to have a major impact on our analyses. Although disputable, I'll leave them in the data for now.

SPSS PROCESS Dialogs

First off, make sure you have PROCESS installed as covered in SPSS PROCESS Macro Tutorial. After opening our data in SPSS, let's navigate to Analyze → Regression → PROCESS v4.2 by Andrew F. Hayes as shown below.

SPSS Analyze Regression Process 42

For a simple mediation analysis, we fill out the PROCESS dialogs as shown below.

SPSS Process Dialogs Simple Mediation

After completing these steps, you can either

We discussed this last option in SPSS PROCESS Macro Tutorial. This may take you a couple of minutes but it'll pay off in the end. Our final syntax is shown below.

*CREATE TABLES INSTEAD OF TEXT FOR PROCESS OUTPUT.

set mdisplay tables.

*READ PROCESS DEFINITION.

insert file = 'd:/downloaded/DEFINE-PROCESS-42.sps'.

*RUN PROCESS MODEL 4 (SIMPLE MEDIATION).

!PROCESS
y=wellb
/x=pain
/m=fatigue
/stand = 1 /* INCLUDE STANDARDIZED (BETA) COEFFICIENTS */
/total = 1 /* INCLUDE TOTAL EFFECT MODEL */
/decimals=F10.4
/boot=5000
/conf=95
/model=4
/seed = 20221227. /* MAKE BOOTSTRAPPING REPLICABLE */

SPSS PROCESS Output

Let's first look at path \(a\): this is the effect from \(X\) (pain) onto \(M\) (fatigue). We find it in the output if we look for OUTCOME VARIABLE fatigue as shown below.

SPSS Process Output Path A

For path \(a\), b = 0.09, p < .001: on average, higher pain scores are associated with more fatigue and this is highly statistically significant. This outcome is as expected if our mediation model is correct.

SPSS PROCESS Output - Paths B and C’

Paths \(b\) and \(c\,'\) are found in a single table. It's the one for which OUTCOME VARIABLE is \(Y\) (well-being) and which includes b-coefficients for both \(X\) (pain) and \(M\) (fatigue).

SPSS Process Output Paths B C

Note that path \(b\) is highly significant, as expected from our mediation hypotheses. Path \(c\,'\) (the direct effect) is also significant but our mediation model does not require this.

SPSS PROCESS Output - Path C

Some (but not all) authors also report the total effect, path \(c\). It is found in the table that has OUTCOME VARIABLE \(Y\) (well-being) that does not have a b-coefficient for the mediator.

SPSS Process Output Path C

Mediation Summary Diagram & Conclusion

The 4 main paths we examined thus far suffice for a classical mediation analysis. We summarized them in the figure below.

Mediation Analysis Summary

As hypothesized, paths \(a\) and \(b\) are both significant. Also note that the direct effect is closer to zero than the total effect. This makes sense because the (negative) direct effect is the (negative) total effect minus the (negative) indirect effect.

A final point is that the direct effect is still significant: the indirect effect only partly accounts for the relation from pain onto well-being. This is known as partial mediation. A careful conclusion could thus be that the effect from pain onto well-being is partially mediated by fatigue.

Indirect Effect and Index of Mediation

Thus far, we established mediation by examining paths \(a\) and \(b\) separately. A more modern approach, however, focuses mostly on the entire indirect effect which is simply

$$\text{indirect effect } ab = a \cdot b$$

For our example, \(ab\) is the change in \(Y\) (well-being) associated with a 1-unit increase in \(X\) (pain) through \(M\) (fatigue). This indirect effect is shown in the table below.
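As a quick sanity check, the indirect effect can be recomputed by hand from the rounded path coefficients reported earlier (a = 0.09 and b = -0.53). The Python snippet below is just an illustration and not part of PROCESS. Because it uses rounded inputs, its result may differ slightly from the unrounded value in the PROCESS output.

```python
# Rounded path coefficients as reported in the text of this tutorial.
a = 0.09    # path a: pain -> fatigue
b = -0.53   # path b: fatigue -> well-being, controlling for pain

# The indirect effect is simply the product of both paths.
indirect = a * b
print(f"indirect effect ab = {indirect:.3f}")
```

This comes out at roughly -0.048: through fatigue, a 1-unit increase in pain is associated with a decrease of about 0.05 units in well-being.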

SPSS Process Output Indirect Effect

Note that PROCESS does not compute a p-value or a formula-based confidence interval (CI) for \(ab\). Instead, it estimates a CI by bootstrapping. This CI may be slightly different in your output because it's based on random sampling.

Importantly, the 95% CI [-0.08, -0.02] does not contain zero. This tells us that p < .05 even though we don't have an exact p-value. An alternative to bootstrapping that does come up with a p-value here is the Sobel test.

PROCESS also reports the standardized b-coefficient for \(ab\). This is usually denoted as β and is completely unrelated to (1 - β) or power in statistics. This number, 0.04, is known as the index of mediation and is often interpreted as an effect size measure.

A huge stupidity in this table is that b is denoted as “Effect” rather than “coeff” as in the other tables. To add to the confusion, “Effect” refers to either b or β. Denoting b as b and β as β would have been highly preferable here.

APA Reporting Mediation Analysis

Mediation analysis is often reported as separate regression analyses: “the first step of our analysis showed that the effect of pain on fatigue was significant, b = 0.09, p < .001...” Some authors also include t-values and degrees of freedom (df) for b-coefficients. For some dumb reason, PROCESS does not report degrees of freedom but you can compute them as

$$df = N - k - 1$$

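where \(N\) denotes the total sample size and \(k\) denotes the number of predictors in the regression model.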

Like so, we could report “the second step of our analysis showed that the effect of fatigue on well-being was also significant, b = -0.53, t(419) = -3.89, p < .001...”

Final Notes

First off, mediation is inherently a causal model: \(X\) causes \(M\) which, in turn, causes \(Y\). Nevertheless, mediation analysis does not usually support any causal claims. A rare exception could be \(X\) being a (possibly dichotomous) manipulation variable. In most cases, however, we can merely conclude that our data do (not) contradict some (causal) mediation model. This is not quite the strong conclusion we'd usually like to draw.

A second point is that I dislike the verbose text reporting suggested by the APA. As shown below, a simple table presents our results much more clearly and concisely.

SPSS Mediation Analysis Summary Table

Lastly, we feel that our example analysis would have been stronger if we had standardized all variables into z-scores prior to running PROCESS. The simple reason is that unstandardized values are uninterpretable for variables such as pain, fatigue and so on. What does a pain score of 60 mean? Low? Medium? High?

In contrast: a pain z-score of -1 means one standard deviation below the mean. If these scores are normally distributed, this is roughly the 16th percentile.
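The 16th percentile follows directly from the standard normal distribution. A minimal Python sketch, assuming perfectly normally distributed scores (which real pain scores may not be):

```python
from statistics import NormalDist

z = -1.0

# Proportion of a standard normal distribution falling below z, as a percentage.
percentile = NormalDist().cdf(z) * 100
print(f"{percentile:.1f}")  # 15.9
```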

This point carries over to our regression coefficients: b-coefficients are not interpretable because we don't know how much a “unit” is for our (in)dependent variables. Therefore, reporting only β coefficients makes much more sense.

Now, we do have these standardized coefficients in our output. However, most confidence intervals apply to the unstandardized coefficients. This can be fixed by standardizing all variables prior to running PROCESS.

Thanks for reading!

Power (Statistics) – The Ultimate Beginners Guide

In statistics, power is the probability of rejecting
a false null hypothesis.

Power - Minimal Example

Now, given a sample size of N = 10 and a population correlation ρ = 0.10, what's the probability of correctly rejecting the null hypothesis? This probability is known as power and denoted as (1 - β) in statistics. For the aforementioned example, (1 - β) is only .058 (roughly 6%) as shown below.

Gpower Example Single Correlation

If a population correlation ρ = .10 and
we sample N = 10 respondents, then
we need to find an absolute sample correlation of | r | > .63 for rejecting H0 at α = .05.
The probability of finding this is only .058.
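The critical correlation of .63 can be reproduced from a critical t-value. The sketch below assumes the usual t-test for a single correlation with df = N - 2 and takes the two-sided critical value t = 2.306 for df = 8 from a standard t-table. Neither number is stated in the text above, so treat both as assumptions.

```python
import math

n = 10           # sample size from the example
df = n - 2       # assumed: t-test for a single correlation
t_crit = 2.306   # assumed: two-sided critical t for alpha = .05 and df = 8

# Invert t = r * sqrt(df) / sqrt(1 - r**2) to find the smallest significant |r|.
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)
print(round(r_crit, 3))
```

The result rounds to the | r | > .63 mentioned above.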

So even though H0 is false, we're unlikely to actually reject it. Not rejecting a false H0 is known as committing a type II error.

Type I and Type II Errors

Any null hypothesis may be true or false and we may or may not reject it. This results in the 4 scenarios outlined below.

Decision | Reality: H0 is true | Reality: H0 is false
Reject H0 | Type I error (probability = α) | Correct decision (probability = (1 - β) = power)
Retain H0 | Correct decision (probability = (1 - α)) | Type II error (probability = β)

As you probably guess, we usually want the power for our tests to be as high as possible. But before taking a look at factors affecting power, let's first try and understand how a power calculation actually works.

Power Calculation Example

A pharmaceutical company wants to demonstrate that their medicine against high blood pressure actually works. They expect the following:

Given these considerations, what's the power for this study? Or -alternatively- what's the probability of rejecting H0 that the mean blood pressure is equal between treated and untreated populations?

Obviously, nobody knows the outcomes for this study until it's finished. However, we do know the most likely outcomes: they're our population estimates. So let's for a moment pretend that we'll find exactly these and enter them into a t-test calculator.

Power For T-Test Excel Example Compute t-test for expected sample sizes, means and SD's in Excel

We expect p = 0.023 so we expect to reject H0.
This is based on a t-distribution with df = 38 degrees of freedom (total sample size N = 40 - 2).
We expect to find t = 2.37 if the population mean difference is 6 mmHg (160 - 154).

Now, this expected (or average) t = 2.37 under the alternative hypothesis Ha is known as a noncentrality parameter or NCP. The NCP tells us how t is distributed under some exact alternative hypothesis and thus allows us to estimate the power for some test. The figure below illustrates how this works.

Central Noncentral T-Distribution For Power

A minor note here is that we'd also reject H0 if t < -2.02 but this probability is almost zero for our first scenario. The exact calculation can be replicated from the SPSS syntax below.

*Enter chosen alpha and expected NCP as raw data.
data list free/alpha ncp.
begin data
0.05 2.37
end data.

*Compute left (lct) and right (rct) critical t-values and power.
compute lct = idf.t(0.5 * alpha,38).
compute rct = idf.t(1 - (0.5 * alpha),38).
compute lprob = ncdf.t(lct,38,ncp).
compute rprob = 1 - ncdf.t(rct,38,ncp).
compute power = lprob + rprob.
execute.

*Show 3 decimal places for all values.
formats all (f8.3).
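If you'd like to double check this calculation without noncentral distributions, a simulation gets you close. The Python sketch below is not part of the original analysis. It assumes 2 groups of n = 20 (total N = 40), population means of 160 and 154 mmHg, a common population SD of 8 mmHg (picked because it reproduces the expected t = 2.37) and a critical value of | t | > 2.024 (which rounds to the 2.02 mentioned above). It draws many samples, runs a pooled-variance t-test on each and returns the proportion of rejections.

```python
import math
import random


def simulate_power(n_per_group=20, mean_1=160.0, mean_2=154.0, sd=8.0,
                   t_crit=2.024, reps=10_000, seed=1):
    """Estimate power as the proportion of simulated samples that reject H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        g1 = [rng.gauss(mean_1, sd) for _ in range(n_per_group)]
        g2 = [rng.gauss(mean_2, sd) for _ in range(n_per_group)]
        m1 = sum(g1) / n_per_group
        m2 = sum(g2) / n_per_group
        # Pooled SD over both groups (equal group sizes assumed).
        ss1 = sum((v - m1) ** 2 for v in g1)
        ss2 = sum((v - m2) ** 2 for v in g2)
        pooled_sd = math.sqrt((ss1 + ss2) / (2 * n_per_group - 2))
        se = pooled_sd * math.sqrt(2 / n_per_group)
        t = (m1 - m2) / se
        # Two-sided test: reject for large positive or negative t.
        if abs(t) > t_crit:
            rejections += 1
    return rejections / reps


power = simulate_power()
print(round(power, 3))
```

Under these assumptions, the proportion of rejections should land close to the power of 0.637 reported for this example, give or take some simulation error.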

Power and Effect Size

Like we just saw, estimating power requires specifying

In the previous example, our scientists had an exact alternative hypothesis because they had very specific ideas regarding population means and standard deviations. In most applied studies, however, we're pretty clueless about such population parameters. This raises the question: how do we get an exact alternative hypothesis?

For most tests, the alternative hypothesis can be specified as an effect size measure: a single number combining several means, variances and/or frequencies. Like so, we proceed from requiring a bunch of unknown parameters to a single unknown parameter.

What's even better: widely agreed upon rules of thumb are available for effect size measures. An overview is presented in this Googlesheet, partly shown below.

Effect Size Rules Of Thumb

In applied studies, we often use G*Power for estimating power. The screenshot below replicates our power calculation example for the blood pressure medicine study.

Gpower Example Independent Samples T-Test G*Power computes both effect size and power from two means and SD's

Note that estimating power in G*Power only requires

a single estimated effect size measure. Optionally, G*Power computes it for you, given your sample means and SD's.
the alpha level -often 0.05- used for testing the null hypothesis &
one or more sample sizes

Let's now take a look at how these 3 factors relate to power.

Factors Affecting Power

The figure below gives a quick overview of how 3 factors relate to power.

Factors Affecting Power In Statistics

Let's now take a closer look at each of them.

Power & Alpha Level

Everything else equal, increasing alpha increases power. For our example calculation, power increases from 0.637 to 0.753 if we test at α = 0.10 instead of 0.05.

Sampling Distributions Power Versus Alpha

A higher alpha level results in smaller (absolute) critical values: we already reject H0 if t > 1.69 instead of t > 2.02. So the light blue area, indicating (1 - β), increases. We basically require a smaller deviation from H0 for statistical significance.

However, increasing alpha comes at a cost: it increases the probability of committing a type I error (rejecting H0 when it's actually true). Therefore, testing at α > 0.05 is generally frowned upon. In short, increasing alpha basically just decreases one problem by increasing another one.

Power & Effect Size

Everything else equal, a larger effect size results in higher power. For our example, power increases from 0.637 to 0.869 if we believe that Cohen’s D = 1.0 rather than 0.8.

Power Versus Effect Size Sampling Distributions

A larger effect size results in a larger noncentrality parameter (NCP). Therefore, the distributions under H0 and HA lie further apart. This increases the light blue area, indicating the power for this test.

Keep in mind, though, that we can estimate but not choose some population effect size. If we overestimate this effect size, we'll overestimate the power for our test accordingly. Therefore, we can't usually increase power by increasing an effect size.

An arguable exception is increasing an effect size by modifying a research design or analysis. For example, (partial) eta squared for a treatment effect in ANOVA may increase by adding a covariate to the analysis.

Power & Sample Size

Everything else equal, larger sample size(s) result in higher power. For our example, increasing the total sample size from N = 40 to N = 80 increases power from 0.637 to 0.912.

Power Versus Sample Size Sampling Distributions

The increase in power stems from our distributions lying further apart. This reflects an increased noncentrality parameter (NCP). But why does the NCP increase with larger sample sizes?

Well, recall that for a t-distribution, the NCP is the expected t-value under HA. Now, t is computed as

$$t = \frac{\overline{X_1} - \overline{X_2}}{SE}$$

where \(SE\) denotes the standard error of the mean difference. In turn, \(SE\) is computed as

$$SE = S_w\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where \(S_w\) denotes the estimated population SD of the outcome variable. This formula shows that as sample sizes increase, \(SE\) decreases and therefore t (and hence the NCP) increases.
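To see this with numbers, the Python sketch below plugs the blood pressure example into both formulas. It assumes equal group sizes and \(S_w\) = 8 mmHg. This SD is not stated here: it was picked because it reproduces the expected t = 2.37 for N = 40.

```python
import math


def expected_t(mean_diff, s_w, n1, n2):
    """Expected t-value (the NCP) for a given mean difference and sample sizes."""
    se = s_w * math.sqrt(1 / n1 + 1 / n2)
    return mean_diff / se


mean_diff = 160 - 154   # expected mean difference in mmHg
s_w = 8.0               # assumed population SD (not given in the text)

t_40 = expected_t(mean_diff, s_w, 20, 20)   # total N = 40
t_80 = expected_t(mean_diff, s_w, 40, 40)   # total N = 80
print(round(t_40, 2), round(t_80, 2))
```

Doubling both sample sizes multiplies the expected t-value by the square root of 2: from roughly 2.37 to roughly 3.35 under these assumptions.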

On top of this, degrees of freedom increase (from df = 38 to df = 78 for our example). This results in slightly smaller (absolute) critical t-values but this effect is very modest.

In short, increasing sample size(s) is a sound way to increase the power for some test.

Power & Research Design

Apart from sample size, effect size & α, research design may also affect power. Although there's no exact formulas, some general guidelines are that

3 Main Reasons for Power Calculations

Power calculations in applied research serve 3 main purposes:

Gpower Types Of Power Analyses Different types of power analysis are made simple by G*Power

Software for Power Calculations - G*Power

G*Power is freely downloadable software for running the aforementioned and many other power calculations. Among its features are

Linear Regression Power Sample Size Plot Required sample sizes for multiple linear regression, given desired power,
chosen α and 3 estimated effect sizes

Altogether, we think G*Power is amazing software and we highly recommend using it. The only disadvantage we can think of is that it requires rather unusual effect size measures. Some examples are

This is awkward because the APA and (perhaps therefore) most journal articles typically recommend reporting

These are also the measures we typically obtain from statistical packages such as SPSS or JASP. Fortunately, G*Power converts some measures and/or computes them from descriptive statistics like we saw in this screenshot.

Software for Power Calculations - SPSS

In SPSS, observed power can be obtained from the GLM, UNIANOVA and (deprecated) MANOVA procedures. Keep in mind that GLM - short for General Linear Model- is very general indeed: it can be used for a wide variety of analyses including

Observed Power In SPSS Glm Select Observed power from Analyze - General Linear Model -
Univariate - Options

Other power calculations (required sample sizes or estimating power prior to data collection) were added to SPSS version 27, released in 2020.

Power Analysis In SPSS 27 Power Analysis as found in SPSS version 27 onwards

In my opinion, SPSS power analysis is a pathetic attempt to compete with G*Power. If you don't believe me, just try running a couple of power analyses in both programs simultaneously. If you do believe me, ignore SPSS power analysis and just go for G*Power.

Thanks for reading.

SPSS PROCESS Macro Tutorial

What is PROCESS?

PROCESS is a freely downloadable SPSS tool for estimating regression models with mediation and/or moderation effects. An example of such a model is shown below.

Simple Mediation Analysis No Total Effect Diagram

This model can fairly easily be estimated without PROCESS as discussed in SPSS Mediation Analysis Tutorial. However, using PROCESS has some advantages (as well as disadvantages) over a more classical approach. So how to get PROCESS and how does it work?

Those who want to follow along may download and open wellbeing.sav, partly shown below.

SPSS Wellbeing Variable View

Note that this tutorial focuses on becoming proficient with PROCESS. The example analysis will be covered in a future tutorial.

Downloading & Installing PROCESS

PROCESS can be downloaded here (scroll down to “PROCESS macro for SPSS, SAS, and R”). The download comes as a .zip file which you first need to unzip. After doing so, in SPSS, navigate to Extensions → Utilities → Install Custom Dialog (Compatibility Mode). Select “process.spd” and click “Open” as shown below.

SPSS Install Custom Dialog Compatibility

This should work for most SPSS users on recent versions. If it doesn't, consult the installation instructions that are included with the download.

Running PROCESS

If you successfully installed PROCESS, you'll find it in the regression menu as shown below.

SPSS Analyze Regression Process

For a very basic mediation analysis, we fill out the dialog as shown below.

SPSS Process Dialog Simple Mediation

Y refers to the dependent (or “outcome”) variable;

X refers to the independent variable or “predictor” in a regression context;

For simple mediation, select model 4. We'll have a closer look at model numbers in a minute;

Just for now, let's click “Ok”.

Result

SPSS Process Macro Output As Text

The first thing that may strike you, is that the PROCESS output comes as plain text. This is awkward because formatting it is very tedious and you can't adjust any decimal places. So let's fix that.

Creating Tables instead of Text Output

If you're using SPSS version 24 or higher, run the following SPSS syntax: set mdisplay tables. After doing so, running PROCESS will result in normal SPSS output tables rather than plain text as shown below.

SPSS Process Macro Output As Tables

Note that you can readily copy-paste these tables into Excel and/or adjust their decimal places.

Using PROCESS with Syntax

First off: whatever you do in SPSS, save your syntax. Now, like any other SPSS dialog, PROCESS has a Paste button for pasting its syntax. However, a huge stupidity from the programmers is that doing so results in some 6,140 (!) lines of syntax. I'll add the first lines below.

/* PROCESS version 4.0 */.
/* Written by Andrew F Hayes */.
/* www.afhayes.com */.
/* www.processmacro.org */.
/* Copyright 2017-2021 by Andrew F Hayes */.
/* Documented in http://www.guilford.com/p/hayes3 */.
/* THIS CODE SHOULD BE DISTRIBUTED ONLY THROUGH PROCESSMACRO.ORG */.

You can run and save this syntax but having over 6,140 lines is awkward. Now, this huge syntax basically consists of 2 parts: a (huge) macro definition and a (short) macro call.

The macro call is at the very end of the pasted syntax (use the Ctrl + End shortcut in your syntax window) and looks as follows.

PROCESS
y=wellb
/x=pain
/m=fatigue                                
/decimals=F10.4                                
/boot=5000
/conf=95    
/model=4.

After you run the (huge) macro definition just once during your session, you only need one (short) macro call for every PROCESS model you'd like to run.

A nice way to implement this, is to move the entire macro definition into a separate SPSS syntax file. Those who want to try this can download DEFINE-PROCESS-40.sps.

Although technically not mandatory, macro names should really start with exclamation marks. Therefore, we replaced DEFINE PROCESS with DEFINE !PROCESS in line 2,983 of this file. The final trick is that we can run this huge syntax file without opening it by using the INSERT command. Like so, the syntax below replicates our entire first PROCESS analysis.

*READ HUGE SYNTAX CONTAINING MACRO DEFINITION.

insert file = 'd:/downloaded/DEFINE-PROCESS-40.sps'.

*RERUN FIRST PROCESS ANALYSIS.

!PROCESS
y=wellb
/x=pain
/m=fatigue                                
/decimals=F10.4                                
/boot=5000
/conf=95    
/model=4.

Note: for replicating this, you may need to replace d:/downloaded by the folder where DEFINE-PROCESS-40.sps is located on your computer.

PROCESS Model Numbers

As we speak, PROCESS implements 94 models. An overview of the most common ones is shown in this Googlesheet (read-only), partly shown below.

Process Model Numbers

For example, if we have an X, Y and 2 mediator variables, we may hypothesize parallel mediation as illustrated below.

Parallel Mediation Diagram

However, you could also hypothesize that mediator 1 affects mediator 2 which, in turn, affects Y. If you want to test this serial mediation effect, select model 6 in PROCESS.

Serial Mediation Diagram

For moderated mediation, things get more complicated: the moderator could act upon any combination of paths a, b or c’. If you believe the moderator only affects path c’, choose model 5 as shown below.

Moderated Mediation Diagram

An overview of all model numbers is given in this book.

PROCESS & Dummy Coding

A quick overview of variable types for PROCESS is shown in this Googlesheet (read-only), partly shown below.

SPSS Process Measurement Levels

Keep in mind that PROCESS is entirely based on linear regression. This requires that dependent variables are quantitative (interval or ratio measurement level). This includes mediators, which act as both dependent and independent variables.

All other variables

X and moderator variables W and Z can only be dummy coded within PROCESS as shown below.

SPSS Process Dummy Coding

Covariates must be dummy coded before using PROCESS. For a handy tool, see SPSS Create Dummy Variables Tool.

Making Bootstrapping Replicable

Some PROCESS models rely on bootstrapping for reporting confidence intervals. Very basically, bootstrapping comes down to

Like so, a 95% bootstrapped CI for some parameter consists of the [2.5th - 97.5th] percentiles for some statistic over the bootstrap samples.
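For those who like to see the idea in code: below is a minimal Python sketch of a percentile bootstrap CI. It is not how PROCESS is implemented. The function name bootstrap_ci, its seed argument and the example scores are all made up for illustration.

```python
import random


def bootstrap_ci(data, statistic, n_boot=5000, conf=95, seed=None):
    """Percentile bootstrap CI for some statistic (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n cases with replacement, n_boot times, and sort the statistics.
    boot_stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (100 - conf) / 200   # 0.025 for a 95% CI
    lower = boot_stats[round(tail * n_boot)]
    upper = boot_stats[round((1 - tail) * n_boot) - 1]
    return lower, upper


def mean(values):
    return sum(values) / len(values)


scores = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]   # hypothetical data

ci_same_seed_1 = bootstrap_ci(scores, mean, seed=20221227)
ci_same_seed_2 = bootstrap_ci(scores, mean, seed=20221227)
ci_no_seed = bootstrap_ci(scores, mean)
print(ci_same_seed_1, ci_same_seed_2, ci_no_seed)
```

Calling the function twice with the same seed returns identical intervals, whereas leaving the seed out typically gives a slightly different interval on each run.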

Now, due to the random nature of bootstrapping, running a PROCESS model twice typically results in slightly different CI's. This is undesirable but a fix is to add a /SEED subcommand to the macro call as shown below.

!PROCESS
y=wellb
/x=pain
/m=fatigue                                
/decimals=F10.4                                
/boot=5000
/conf=95    
/model=4
/seed = 20221227. /*MAKE BOOTSTRAPPED CI'S REPLICABLE*/

The random seed can be any positive integer. Personally, I tend to use the current date in YYYYMMDD format (20221227 is 27 December, 2022). An alternative is to run something like SET SEED 20221227. before running PROCESS. In this case, you need to prevent PROCESS from overruling this random seed, which you can do by replacing set seed = !seed. by *set seed = !seed. in line 3,022 of the macro definition.

Strengths & Weaknesses of PROCESS

A first strength of PROCESS is that it can save a lot of time and effort. This holds especially true for more complex models such as serial and moderated mediation.

Second, the bootstrapping procedure implemented in PROCESS is thought to have higher power and more accuracy than alternatives such as the Sobel test.

A weakness, though, is that PROCESS does not generate regression residuals. These are often used to examine model assumptions such as linearity and homoscedasticity as discussed in Linear Regression in SPSS - A Simple Example.

Another weakness of PROCESS is that some very basic models are not possible at all in PROCESS. A simple example is parallel moderation as illustrated below.

Parallel Moderation Diagram

This can't be done because PROCESS is limited to a single X variable. Using just SPSS, estimating this model is a piece of cake. It's a tiny extension of the model discussed in SPSS Moderation Regression Tutorial.

A technical weakness is that PROCESS generates over 6,000 lines of syntax when pasted. The reason this happens is that PROCESS is built on 2 long deprecated SPSS techniques:

I hope this will soon be fixed. There's really no need to bother SPSS users with 6,000 lines of source code.

Thanks for reading!

SPSS Label Cleaning Tool

We sometimes receive data files with annoying prefixes or suffixes in variable and/or value labels. This tutorial presents a simple tool for removing these and some other “cleaning” operations.

Example Data File

All examples in this tutorial use dirty-labels.sav. As shown below, its labels are far from ideal.

SPSS Label Cleaning Tool Variable View

Some variable labels have suffixes that are irrelevant to the final data.
All value labels are prefixed by the values that represent them.
Variable and value labels have underscores instead of spaces.

Our tool deals with precisely such issues. Let's try it.

Prerequisites and Installation

First off, this tool requires SPSS version 24 or higher. Next, the SPSS Python 3 essentials must be installed, which is normally the case with recent SPSS versions.

Next, click SPSS_TUTORIALS_CLEAN_LABELS.spe for downloading our tool. You can install it by dragging & dropping it into a data editor window. Alternatively, navigate to Extensions → Install local extension bundle as shown below.

SPSS Extensions Install Local Extension Bundle

In the dialog that opens, navigate to the downloaded .spe file and select it. SPSS now throws a message that “The extension was successfully installed under Transform - SPSS tutorials - Clean Labels”.

Example I - Text Replacement over Variable and Value Labels

Let's first replace all underscores by spaces in both variable and value labels. We'll open Transform → SPSS tutorials - Clean Labels and fill out the dialog as shown below.

SPSS Label Cleaning Tool Dialog 1

Completing these steps results in the syntax below. Let's run it.

*Replace underscores by spaces in all value and variable labels.

SPSS TUTORIALS CLEAN_LABELS VARIABLES=v1 v2 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19
    v20 v21 v22 FIND='_' REPLACEBY=' '
/OPTIONS OPERATION=FIREPCONT PROCESS=BOTH ACTION=BOTH.

Results

First note that all underscores were replaced by spaces in all variable and value labels. This was done by creating and running

commands. We chose to have these commands printed to our output window as shown below.

SPSS Label Cleaning Tool Output 1

SPSS already ran this syntax but you can also copy-paste it into a syntax window. Like so, the adjustments can be replicated on any SPSS version with or without our tool installed. If there's a lot of syntax, consider moving it into a separate file and running it with INSERT.

Example II - Remove Suffix from Variable Labels

Some variable labels end with “ (proceed to question...” We'll remove these suffixes because they don't convey any interesting information and merely clutter up our output tables and charts.

Again, we start off at Transform → SPSS tutorials - Clean Labels and fill out the dialog as shown below.

SPSS Label Cleaning Tool Dialog 2

Quick tip: you can shorten the resulting syntax by using

We did just that in the syntax below.

*Remove " (proceed" and characters succeeding it from all variable labels.

SPSS TUTORIALS CLEAN_LABELS VARIABLES=all FIND=' (proceed' REPLACEBY=' '
/OPTIONS OPERATION=FIOCSUC PROCESS=VARLABS ACTION=RUN.

Note that running this syntax removes “ (proceed” and all characters that follow this expression from all variable labels.

Example III - Remove Prefix from Value Labels

Another issue we sometimes encounter is value labels being prefixed with the values representing them as shown below.

SPSS Label Cleaning Tool Variable View

Removing “= ” (mind the space) and all characters preceding it from all value labels fixes the problem. The syntax below -created from Transform → SPSS tutorials - Clean Labels- does just that.

*Remove "= " and characters preceding it from all value labels.

SPSS TUTORIALS CLEAN_LABELS VARIABLES=all FIND='= ' REPLACEBY=' '
/OPTIONS OPERATION=FIOCPRE PROCESS=VALLABS ACTION=RUN.
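Conceptually, the 3 operations used in these examples boil down to simple string manipulations. The Python sketch below mimics them on some made-up labels. It is not the extension's actual source code: the function names are hypothetical, we assume that only the first occurrence of the search text counts for the prefix and suffix operations, and we ignore the REPLACEBY option for those two.

```python
def replace_all(label, find, replace_by):
    """Example I: replace every occurrence of some text."""
    return label.replace(find, replace_by)


def remove_from_first_occurrence(label, find):
    """Example II: drop the search text and everything after it."""
    head, _, _ = label.partition(find)
    return head


def remove_up_to_first_occurrence(label, find):
    """Example III: drop the search text and everything before it."""
    _, found, tail = label.partition(find)
    # Leave the label untouched if the search text does not occur.
    return tail if found else label


# Hypothetical labels resembling those in the example data.
print(replace_all("Overall_job_rating", "_", " "))
print(remove_from_first_occurrence("Marital status (proceed to question 5)", " (proceed"))
print(remove_up_to_first_occurrence("1 = Very bad", "= "))
```

Labels that don't contain the search text come out unchanged in all 3 functions.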

Result

After our third and final example, all value and variable labels are nice, short and clean.

SPSS Label Cleaning Tool Results 1

So that'll wrap up the examples of our label cleaning tool.

Final Notes

I hope you'll find our tool as helpful as we do. This first version performs 4 cleaning operations that we recently needed for our daily work. We'll probably build in some more options when we (or you?) need them.

So if you've any suggestions or other remarks, please throw us a comment below. Other than that,

thanks for reading!

Kruskal-Wallis Test – Simple Tutorial

A Kruskal-Wallis test tests if 3(+) populations have
equal mean ranks on some outcome variable.
The figure below illustrates the basic idea.

Kruskal Wallis Test Diagram

Kruskal-Wallis Test - Purposes

The Kruskal-Wallis test is a distribution free alternative for an ANOVA: we basically want to know if 3+ populations have equal means on some variable. However,

The Kruskal-Wallis test is a suitable alternative for ANOVA if sample sizes are small and/or the dependent variable is ordinal.

Kruskal-Wallis Test Example

A hospital runs a quick pilot on 3 vaccines: they administer each to N = 5 participants. After a week, they measure the amount of antibodies in the participants’ blood. The data thus obtained are in this Googlesheet, partly shown below.

Kruskal Wallis Test Example Data

Now, we'd like to know if some vaccines trigger more antibodies than others in the underlying populations. Since antibodies is a quantitative variable, ANOVA seems the right choice here.

However, ANOVA requires antibodies to be normally distributed in each subpopulation. And due to our minimal sample sizes, we can't rely on the central limit theorem like we usually do (or should anyway). And on top of that, our sample sizes are too small to examine normality. Just to emphasize this point, the histograms for antibodies by group are shown below.

Split Histogram Antibodies

If anything, the bottom two histograms seem slightly positively skewed. This makes sense because the amount of antibodies has a lower bound of zero but no upper bound. However, speculations regarding the population distributions don't get any more serious than that.

A particularly bad idea here is trying to demonstrate normality by running

Due to our tiny sample sizes, these tests are unlikely to reject the null hypothesis of normality. However, that's merely due to their lack of power and doesn't say anything about the population distributions. Put differently: a different null hypothesis (our variable following a uniform or Poisson distribution) would probably not be rejected either for the exact same data.

In short: ANOVA really requires normality for tiny sample sizes but we don't know if it holds. So we can't trust ANOVA results. And that's why we should use a Kruskal-Wallis test instead.

Kruskal-Wallis Test - Null Hypothesis

The null hypothesis for a Kruskal-Wallis test is that the mean ranks on some outcome variable are equal across 3+ populations.
Note that the outcome variable must be ordinal or quantitative in order for “mean ranks” to be meaningful.

Many textbooks propose an incorrect null hypothesis, such as equal population medians or identical population distributions.

So why are these incorrect? Well, the Kruskal-Wallis formula uses only 2 statistics: rank sums and the sample sizes on which they're based. It completely ignores everything else about the data, including medians and frequency distributions. Neither of these affects whether the null hypothesis is rejected.
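The point that only ranks matter can be illustrated with a small sketch. The data below are hypothetical (they are not the vaccine data) and contain no ties; the helper names are made up for this illustration. Exponentiating every value changes the medians and the shape of each distribution drastically, yet it leaves the ranks, and therefore H, untouched.

```python
import math

def rank_sums(groups):
    # Rank all values jointly (1 = smallest); this sketch assumes no tied values.
    pooled = sorted(value for group in groups for value in group)
    rank_of = {value: position + 1 for position, value in enumerate(pooled)}
    return [sum(rank_of[value] for value in group) for group in groups]

def kruskal_h(groups):
    # H uses nothing but the rank sums and the group sizes.
    n_total = sum(len(group) for group in groups)
    sums = rank_sums(groups)
    weighted = sum(r * r / len(group) for r, group in zip(sums, groups))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical scores for 3 groups.
original = [[1.0, 2.5, 4.0], [0.5, 3.0, 6.0], [5.0, 7.0, 9.0]]
# A monotone transformation: medians and distributions change, ranks don't.
transformed = [[math.exp(value) for value in group] for group in original]

h_original = kruskal_h(original)
h_transformed = kruskal_h(transformed)
print(h_original == h_transformed)
```

Both calls work on the same rank sums (10, 12 and 23), so the two H values are identical.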

If that still doesn't convince you, we'll perhaps add some example data files to this tutorial. These illustrate that wildly different medians or frequency distributions don't always result in a “significant” Kruskal-Wallis test (or the reverse).

Kruskal-Wallis Test Assumptions

A Kruskal-Wallis test requires 3 assumptions [1, 5, 8]:

  1. independent observations;
  2. the dependent variable must be quantitative or ordinal;
  3. sufficient sample sizes (say, each \(n_i \ge 5\)) unless the exact significance level is computed.

Regarding the last assumption, exact p-values for the Kruskal-Wallis test can be computed. However, this is rarely done because it often requires very heavy computations. Some exact p-values are also found in Use of Ranks in One-Criterion Variance Analysis.

Instead, most software computes approximate (or “asymptotic”) p-values based on the chi-square distribution. This approximation is sufficiently accurate if the sample sizes are large enough. There's no real consensus with regard to required sample sizes: some authors [1] propose each \(n_i \ge 4\) while others [6] suggest each \(n_i \ge 6\).
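To make the idea of an exact p-value concrete, here's a minimal sketch that enumerates every possible way of dividing the ranks 1 through N over the groups and counts how often H is at least as large as the observed value. It assumes no ties; the function names are made up for this illustration and the 3 groups of 2 observations are hypothetical. It also shows why the computations get heavy: 3 groups of 5 observations already yield 756,756 arrangements.

```python
from itertools import combinations

def h_from_rank_sums(rank_sums, sizes):
    n_total = sum(sizes)
    weighted = sum(r * r / n for r, n in zip(rank_sums, sizes))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

def exact_p(h_observed, sizes):
    # Proportion of all rank arrangements with H >= the observed H.
    n_total = sum(sizes)
    hits = 0
    total = 0

    def assign(remaining, group, sums):
        nonlocal hits, total
        if group == len(sizes):
            total += 1
            # Small tolerance guards against floating point noise.
            if h_from_rank_sums(sums, sizes) >= h_observed - 1e-9:
                hits += 1
            return
        for chosen in combinations(sorted(remaining), sizes[group]):
            assign(remaining - set(chosen), group + 1, sums + [sum(chosen)])

    assign(set(range(1, n_total + 1)), 0, [])
    return hits / total

# Hypothetical tiny design: 3 groups of 2 observations, no overlap between groups.
sizes = [2, 2, 2]
h_max = h_from_rank_sums([3, 7, 11], sizes)
p_exact = exact_p(h_max, sizes)
print(round(h_max, 3), round(p_exact, 4))
```

For this tiny design, only 6 out of 90 arrangements reach the maximum H, so even the most extreme outcome possible has an exact p-value above 0.05.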

Kruskal-Wallis Test Formulas

First off, we rank the values on our dependent variable ascendingly, regardless of group membership. We did just that in this Googlesheet, partly shown below.

Kruskal Wallis Test Example Data

Next, we compute the sum over all ranks for each group separately.

Kruskal Wallis Test Descriptive Statistics

We then enter a) our sample sizes and b) our rank sums into the following formula:

$$Kruskal\;Wallis\;H = \frac{12}{N(N + 1)}\sum\limits_{i = 1}^k\frac{R_i^2}{n_i} - 3(N + 1)$$

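where \(N\) denotes the total sample size, \(k\) the number of groups, \(R_i\) the rank sum for group \(i\) and \(n_i\) the sample size for group \(i\).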

For our example, that'll be

$$Kruskal\;Wallis\;H = \frac{12}{15(15 + 1)}(\frac{55^2}{5}+\frac{20^2}{5}+\frac{45^2}{5}) - 3(15 + 1) =$$

$$Kruskal\;Wallis\;H = 0.05\cdot(605 + 80 + 405) - 48 = 6.50$$

\(H\) approximately follows a chi-square (written as \(\chi^2\)) distribution with

$$df = k - 1$$

degrees of freedom (\(df\)) for \(k\) groups. For our example,

$$df = 3 - 1 = 2$$

so our significance level is

$$\chi^2(2) = 6.50, p \approx 0.039.$$
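The calculation above can be reproduced with a few lines of Python. This is a sketch using only the rank sums (55, 20 and 45) and sample sizes from the example; the function names are made up for this illustration. For the p-value, it relies on the fact that the upper tail of a chi-square distribution with exactly 2 degrees of freedom equals \(e^{-x/2}\). Other degrees of freedom need a statistics library such as SciPy (`scipy.stats.chi2.sf`).

```python
import math

def kruskal_wallis_h(rank_sums, sizes):
    # H from rank sums and sample sizes, without a correction for ties.
    n_total = sum(sizes)
    weighted = sum(r ** 2 / n for r, n in zip(rank_sums, sizes))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

def chi2_upper_tail_df2(x):
    # Only valid for df = 2: P(chi-square > x) = exp(-x / 2).
    return math.exp(-x / 2)

h = kruskal_wallis_h([55, 20, 45], [5, 5, 5])
p = chi2_upper_tail_df2(h)
print(round(h, 2), round(p, 3))  # 6.5 0.039
```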

The SPSS output for our example, shown below, confirms our calculations.

Kruskal Wallis Test Apa Reporting

So what do we conclude now? Well, assuming alpha = 0.05, we reject our null hypothesis: the population mean ranks of antibodies are not equal among vaccines. In normal language, our 3 vaccines do not perform equally well. Judging from the mean ranks, it seems vaccine B performs worse than its competitors: its mean rank is lower and this means that it triggered fewer antibodies than the other vaccines.

Kruskal-Wallis Post Hoc Tests

Thus far, we concluded that the amounts of antibodies differ among our 3 vaccines. So precisely which vaccine differs from which vaccine? To find out, we'll compare each vaccine to each other vaccine. This procedure is generally known as running post-hoc tests.

Contrary to popular belief, Kruskal-Wallis post-hoc tests are not equivalent to Bonferroni corrected Mann-Whitney tests. Instead, each possible pair of groups is compared using the following formula:

$$Z_{kw} = \frac{\overline{R}_i - \overline{R}_j}{\sqrt{\frac{N(N + 1)}{12}(\frac{1}{n_i}+\frac{1}{n_j})}}$$

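where \(\overline{R}_i\) and \(\overline{R}_j\) denote the mean ranks for groups \(i\) and \(j\), \(n_i\) and \(n_j\) their sample sizes and \(N\) the total sample size.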

For comparing vaccines A and B, that'll be

$$Z_{kw} = \frac{11 - 4}{\sqrt{\frac{15(15 + 1)}{12}(\frac{1}{5}+\frac{1}{5})}} \approx 2.475 $$

$$P(|Z_{kw}| > 2.475) \approx 0.013$$

A Bonferroni correction is usually applied to this p-value because we're running multiple comparisons on (partly) the same observations. The number of pairwise comparisons for \(k\) groups is

$$N_{comp} = \frac{k (k - 1)}{2}$$

Therefore, the Bonferroni corrected p-value for our example is

$$P_{Bonf} = 0.013 \cdot \frac{3 (3 - 1)}{2} \approx 0.040$$
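The comparison of vaccines A and B can be retraced with the sketch below. It uses the mean ranks (11 and 4) and sample sizes from the example and gets the two-sided p-value from the standard normal distribution via `math.erfc`. The function names are made up for this illustration and no correction for ties is applied.

```python
import math

def pairwise_z(mean_rank_i, mean_rank_j, n_i, n_j, n_total):
    # Difference in mean ranks divided by its standard error.
    standard_error = math.sqrt(n_total * (n_total + 1) / 12 * (1 / n_i + 1 / n_j))
    return (mean_rank_i - mean_rank_j) / standard_error

def two_sided_p(z):
    # P(|Z| > |z|) for a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

def number_of_comparisons(k):
    # All possible pairs among k groups.
    return k * (k - 1) // 2

z = pairwise_z(11, 4, 5, 5, 15)
p = two_sided_p(z)
# Bonferroni: multiply by the number of comparisons, capped at 1.
p_bonferroni = min(1.0, p * number_of_comparisons(3))
print(round(z, 3), round(p, 3), round(p_bonferroni, 3))
```

Note that the corrected p-value of roughly 0.040 comes from multiplying the unrounded p-value by 3.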

The screenshot from SPSS (below) confirms these findings.

Kruskal Wallis Test Post Hoc Tests Output SPSS

Oddly, the difference between mean ranks, \(\overline{R}_i - \overline{R}_j\), is denoted as “Test Statistic”. The actual test statistic, \(Z_{kw}\), is denoted as “Std. Test Statistic”.

APA Reporting a Kruskal-Wallis Test

For APA reporting our example analysis, we could write something like “a Kruskal-Wallis test indicated that the amount of antibodies differed over vaccines, H(2) = 6.50, p = 0.039.”

Kruskal Wallis Test Apa Reporting

Although the APA doesn't mention it, we encourage reporting the mean ranks and perhaps some other descriptive statistics in a separate table as well.

Reporting Kruskal Wallis Test Descriptives

Right, so that should do. If you've any questions or remarks, please throw me a comment below. Other than that:

Thanks for reading!

References

  1. Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.
  2. Warner, R.M. (2013). Applied Statistics (2nd ed.). Thousand Oaks, CA: SAGE.
  3. Agresti, A. & Franklin, C. (2014). Statistics. The Art & Science of Learning from Data. Essex: Pearson Education Limited.
  4. Field, A. (2013). Discovering Statistics with IBM SPSS Statistics. Newbury Park, CA: Sage.
  5. Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
  6. Siegel, S. & Castellan, N.J. (1989). Nonparametric Statistics for the Behavioral Sciences (2nd ed.). Singapore: McGraw-Hill.
  7. Slotboom, A. (1987). Statistiek in woorden [Statistics in words]. Groningen: Wolters-Noordhoff.
  8. Kruskal, W.H. & Wallis, W.A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.

SPSS – Kendall’s Concordance Coefficient W

Kendall’s Concordance Coefficient W is a number between 0 and 1 that indicates interrater agreement.
So let's say we had 5 people rank 6 different beers as shown below. We obviously want to know which beer is best, right? But could we also quantify how much these raters agree with each other? Kendall’s W does just that.

SPSS Beertest Dataview

Kendall’s W - Example

So let's take a really good look at our beer test results. The data (shown above) are in beertest.sav. For answering which beer was rated best, a Friedman test would be appropriate because our rankings are ordinal variables. A second question, however, is to what extent all 5 judges agree on their beer rankings. If our judges don't agree at all on which beers were best, then we can't possibly take their conclusions very seriously. Now, we could say that “our judges agreed to a large extent” but we'd like to be more precise and express the level of agreement in a single number. This number is known as Kendall’s Coefficient of Concordance W [2, 3].

Kendall’s W - Basic Idea

Let's consider the 2 hypothetical situations depicted below: perfect agreement and perfect disagreement among our raters. I invite you to stare at it and think for a minute.

Kendalls Concordance Coefficient - Basic Idea

As we see, the extent to which raters agree is indicated by the extent to which the column totals differ. We can express the extent to which numbers differ as a number: the variance or standard deviation.
Kendall’s W is defined as

$$W = \frac{Variance\,over\,column\,totals}{Maximum\,possible\,variance\,over\,column\,totals}$$

As a result, Kendall’s W is always between 0 and 1. For instance, our perfect disagreement example has W = 0; because all column totals are equal, their variance is zero.
Our perfect agreement example has W = 1 because the variance among column totals is equal to the maximal possible variance. No matter how you rearrange the rankings, you can't possibly increase this variance any further. Don't believe me? Give it a go then.
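The definition above translates directly into a short sketch. The rankings below are hypothetical and are not the beer test data: 5 raters who all rank 6 items identically, and 6 raters whose rankings are shifted by one position each so that all column totals come out equal. The function name is made up for this illustration and tied ranks are assumed not to occur.

```python
from statistics import pvariance

def kendalls_w(rankings):
    # rankings: one list per rater, each a permutation of 1..n (no ties assumed).
    n_raters = len(rankings)
    n_items = len(rankings[0])
    column_totals = [sum(rater[item] for rater in rankings) for item in range(n_items)]
    # Under perfect agreement, the column totals are n_raters * 1, ..., n_raters * n_items.
    max_totals = [n_raters * rank for rank in range(1, n_items + 1)]
    return pvariance(column_totals) / pvariance(max_totals)

# Hypothetical perfect agreement: every rater gives the same ranking.
agreement = [[1, 2, 3, 4, 5, 6] for _ in range(5)]
# Hypothetical perfect disagreement: cyclically shifted rankings, equal column totals.
disagreement = [[(rater + item) % 6 + 1 for item in range(6)] for rater in range(6)]

w_agreement = kendalls_w(agreement)
w_disagreement = kendalls_w(disagreement)
print(w_agreement, w_disagreement)
```

The first call returns 1 and the second returns 0, matching the two extremes described above.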
So what about our actual beer data? We'll quickly find out with SPSS.

Kendall’s W in SPSS

We'll get Kendall’s W from SPSS’ Nonparametric Tests menu. The screenshots below walk you through.

SPSS Kendalls Concordance Coefficient - Main Menu SPSS Kendalls W - Main Dialog

Note: SPSS thinks our rankings are nominal variables. This is because they contain few distinct values. Fortunately, this won't interfere with the current analysis. Completing these steps results in the syntax below.

Kendall’s W - Basic Syntax

*Kendall's W from nonparametric tests - legacy dialogs - k related samples.

NPAR TESTS
/KENDALL=beer_a beer_b beer_c beer_d beer_e beer_f
/MISSING LISTWISE.

Kendall’s W - Output

SPSS Kendalls W - Output

And there we have it: Kendall’s W = 0.78. Our beer judges agree with each other to a reasonable but not super high extent. Note that we also get a table with the (column) mean ranks that tells us which beer was rated most favorably.

Average Spearman Correlation over Judges

Another measure of concordance is the average over all possible Spearman correlations among all judges [1]. It can be calculated from Kendall’s W with the following formula:

$$\overline{R}_s = {kW - 1 \over k - 1}$$

where \(\overline{R}_s\) denotes the average Spearman correlation and \(k\) the number of judges. For our example, this comes down to

$$\overline{R}_s = {5(0.781) - 1 \over 5 - 1} = 0.726$$
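As a quick check on the arithmetic, the conversion can be written as a one-line function. This is just a sketch of the formula above; the function name is made up for this illustration.

```python
def average_spearman(w, k):
    # Average Spearman correlation over all pairs of k judges, derived from Kendall's W.
    return (k * w - 1) / (k - 1)

r_bar = average_spearman(0.781, 5)
print(round(r_bar, 3))  # 0.726
```

The formula also shows that W = 1 implies an average correlation of 1, whereas W = 0 implies an average correlation of -1 / (k - 1), which is -0.25 for 5 judges.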

We could verify this by running and averaging all possible Spearman correlations in SPSS. We'll leave that for a next tutorial, however, as doing so properly requires some highly unusual (but interesting) syntax.

Thank you for reading!

References

  1. Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
  2. Slotboom, A. (1987). Statistiek in woorden [Statistics in words]. Groningen: Wolters-Noordhoff.
  3. Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.