How to Find & Exclude Outliers in SPSS?
- Method I - Histograms
- Excluding Outliers from Data
- Method II - Boxplots
- Method III - Z-Scores (with Reporting)
- Method III - Z-Scores (without Reporting)
Summary
Outliers are basically values that fall outside of a normal range for some variable. But what's a “normal range”? This is subjective and may depend on substantive knowledge and prior research. Alternatively, there are some rules of thumb as well. These are less subjective but don't always result in better decisions, as we're about to see.
In any case: we usually want to exclude outliers from data analysis. So how to do so in SPSS? We'll walk you through 3 methods, using life-choices.sav, partly shown below.
In this tutorial, we'll find outliers for these reaction time variables.
During this tutorial, we'll focus exclusively on reac01 to reac05, the reaction times in milliseconds for 5 choice trials offered to the respondents.
Method I - Histograms
Let's first try to identify outliers by running some quick histograms over our 5 reaction time variables. Doing so from SPSS’ menu is discussed in Creating Histograms in SPSS. A faster option, though, is running the syntax below.
frequencies reac01 to reac05
/histogram.
Result
Let's take a good look at the first of our 5 histograms shown below.
The “normal range” for this variable seems to run from 500 through 1500 ms. It seems that 3 scores lie outside this range. So are these outliers? Honestly, different analysts will make different decisions here. Personally, I'd settle for only excluding the score ≥ 2000 ms. So what's the right way to do so? And what about the other variables?
Excluding Outliers from Data
The right way to exclude outliers from data analysis is to specify them as user missing values. So for reaction time 1 (reac01), running missing values reac01 (2000 thru hi). excludes reaction times of 2000 ms and higher from all data analyses and editing. So what about the other 4 variables?
The histograms for reac02 and reac03 don't show any outliers.
For reac04, we see some low outliers as well as a high outlier. We can find which values these are in the bottom and top of its frequency distribution as shown below.
If we see any outliers in a histogram, we may look up the exact values in the corresponding frequency table.
We can exclude all of these outliers in one go by running missing values reac04 (lo thru 400,2085). By the way: “lo thru 400” means the lowest value in this variable (its minimum) through 400 ms.
For reac05, we see several low and high outliers. The obvious thing to do seems to run something like missing values reac05 (lo thru 400,2000 thru hi). But sadly, this only triggers the following error:
>There are too many values specified.
>The limit is three individual values or
>one value and one range of values.
>Execution of this command stops.
The problem here is that
you can't specify a low and a high
range of missing values in SPSS.
Since this is what you typically need to do, this is one of the biggest stupidities still found in SPSS today. A workaround for this problem is to
- RECODE the entire low range into some huge value such as 999999999;
- add the original values to a value label for this value;
- specify only a high range of missing values that includes 999999999.
The syntax below does just that and reruns our histograms to check if all outliers have indeed been correctly excluded.
recode reac05 (lo thru 400 = 999999999).
*Add value label to 999999999.
add value labels reac05 999999999 '(Recoded from 95 / 113 / 397 ms)'.
*Set range of high missing values.
missing values reac05 (2000 thru hi).
*Rerun frequency tables after excluding outliers.
frequencies reac01 to reac05
/histogram.
Result
First off, note that none of our 5 histograms show any outliers anymore; they're now excluded from all data analysis and editing. Also note the bottom of the frequency table for reac05 shown below.
Low outliers after recoding and labelling are listed under Missing.
Even though we had to recode some values, we can still report precisely which outliers we excluded for this variable due to our value label.
Before proceeding to boxplots, I'd like to mention 2 worst practices for excluding outliers:
- removing outliers by changing them into system missing values. After doing so, we no longer know which outliers we excluded. Also, we're clueless why values are system missing as they don't have any value labels.
- removing entire cases -often respondents- because they have 1(+) outliers. Such cases typically have mostly “normal” data values that we can use just fine for analyzing other (sets of) variables.
Sadly, supervisors sometimes force their students to take this road anyway. If so, SELECT IF permanently removes entire cases from your data.
Method II - Boxplots
If you ran the previous examples, you need to close and reopen life-choices.sav before proceeding with our second method.
We'll create a boxplot as discussed in Creating Boxplots in SPSS - Quick Guide: we first navigate to the appropriate menu as shown below.
Next, we'll fill in the dialogs as shown below.
Completing these steps results in the syntax below. Let's run it.
EXAMINE VARIABLES=reac01 reac02 reac03 reac04 reac05
/PLOT BOXPLOT
/COMPARE VARIABLES
/STATISTICS EXTREME
/MISSING PAIRWISE
/NOTOTAL.
Result
Quick note: if you're not sure about interpreting boxplots, read up on Boxplots - Beginners Tutorial first.
Our boxplot indicates some potential outliers for all 5 variables. But let's just ignore these and exclude only the extreme values that are observed for reac01, reac04 and reac05.
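As a point of reference: SPSS boxplots flag cases more than 1.5 box lengths (interquartile ranges) beyond the box as outliers (circles) and cases more than 3 box lengths away as extreme values (asterisks). The Python sketch below illustrates these fences on some made-up data; note that its simple interpolated quartiles may differ slightly from the hinges SPSS computes.

```python
from statistics import quantiles

def boxplot_flags(data):
    """Classify values as outliers (beyond 1.5 IQR from the box)
    or extremes (beyond 3 IQR), mimicking SPSS boxplot flags."""
    q1, _, q3 = quantiles(data, n=4)  # simple interpolated quartiles
    iqr = q3 - q1
    extremes = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
    outliers = [x for x in data
                if (q1 - 3 * iqr <= x < q1 - 1.5 * iqr)
                or (q3 + 1.5 * iqr < x <= q3 + 3 * iqr)]
    return outliers, extremes

# Toy data: 100 lies far beyond the upper fence.
outliers, extremes = boxplot_flags(list(range(1, 13)) + [100])
print(outliers, extremes)
```

This is only a rough stand-in for the Extreme Values table, but it makes the fence logic explicit.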
So, precisely which values should we exclude? We find them in the Extreme Values table. I like to copy-paste this into Excel. Now we can easily boldface all values that are extreme values according to our boxplot.
Copy-pasting the Extreme Values table into Excel allows you to easily boldface the exact outliers that we'll exclude.
Finally, we set these extreme values as user missing values with the syntax below. For a step-by-step explanation of this routine, look up Excluding Outliers from Data.
recode reac05 (lo thru 113 = 999999999).
*Label new value with original values.
add value labels reac05 999999999 '(Recoded from 95 / 113 ms)'.
*Set (ranges of) missing values for reac01, reac04 and reac05.
missing values
reac01 (2065)
reac04 (17,2085)
reac05 (1647 thru hi).
*Rerun boxplot and check if all extreme values are gone.
EXAMINE VARIABLES=reac01 reac02 reac03 reac04 reac05
/PLOT BOXPLOT
/COMPARE VARIABLES
/STATISTICS EXTREME
/MISSING PAIRWISE
/NOTOTAL.
Method III - Z-Scores (with Reporting)
A common approach to excluding outliers is to look up which values correspond to high z-scores. Again, there are different rules of thumb for which z-scores should be considered outliers. Today, we'll settle for |z| ≥ 3.29 as indicating an outlier. The basic idea here is that if a variable is perfectly normally distributed, then only 0.1% of its values will fall outside this range.
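That 0.1% figure is easy to verify: it's simply the two-tailed probability of |z| ≥ 3.29 under a standard normal distribution. A quick Python check:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Two-tailed probability of |z| >= 3.29 under perfect normality.
p = 2 * (1 - normal_cdf(3.29))
print(round(p, 4))  # roughly 0.001, i.e. 0.1%
```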
So what's the best way to do this in SPSS? Well, the first 2 steps are super simple:
- we add z-scores for all relevant variables to our data and
- see if their minima or maxima meet |z| ≥ 3.29.
Funnily, both steps are best done with a simple DESCRIPTIVES command as shown below.
descriptives reac01 to reac05
/save.
*Check min and max for z-scores.
descriptives zreac01 to zreac05.
Result
Minima and maxima for our newly computed z-scores.
Basic conclusions from this table are that
- reac01 has at least 1 high outlier;
- reac02 and reac03 don't have any outliers;
- reac04 and reac05 both have at least 1 low and 1 high outlier.
But which original values correspond to these high absolute z-scores? For each variable, we can run 2 simple steps:
- FILTER away cases having |z| < 3.29 (all non-outliers);
- run a frequency table -now containing only outliers- on the original variable.
The syntax below does just that but uses TEMPORARY and SELECT IF for filtering out non-outliers.
temporary.
select if(abs(zreac01) >= 3.29).
frequencies reac01.
temporary.
select if(abs(zreac04) >= 3.29).
frequencies reac04.
temporary.
select if(abs(zreac05) >= 3.29).
frequencies reac05.
*Save output because tables needed for reporting which outliers are excluded.
output save outfile = 'outlier-tables-01.spv'.
Result
Finding outliers by filtering out all non outliers based on their z-scores.
Note that each frequency table only contains a handful of outliers for which |z| ≥ 3.29. We'll now exclude these values from all data analyses and editing with the syntax below. For a detailed explanation of these steps, see Excluding Outliers from Data.
recode reac04 (lo thru 107 = 999999999).
recode reac05 (lo thru 113 = 999999999).
*Label new values with original values.
add value labels reac04 999999999 '(Recoded from 17 / 107 ms)'.
add value labels reac05 999999999 '(Recoded from 95 / 113 ms)'.
*Set (ranges of) missing values for reac01, reac04 and reac05.
missing values
reac01 (1659 thru hi)
reac04 (1601 thru hi )
reac05 (1776 thru hi).
*Check if all outliers are indeed user missing values now.
temporary.
select if(abs(zreac01) >= 3.29).
frequencies reac01.
temporary.
select if(abs(zreac04) >= 3.29).
frequencies reac04.
temporary.
select if(abs(zreac05) >= 3.29).
frequencies reac05.
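For what it's worth, the flag-by-z-score routine we just ran in SPSS boils down to a few lines of code. The Python sketch below mimics it on some made-up reaction times (not life-choices.sav):

```python
from statistics import mean, stdev

def flag_outliers(values, cutoff=3.29):
    """Return values whose absolute sample z-score meets the cutoff,
    mirroring DESCRIPTIVES /SAVE followed by SELECT IF in SPSS."""
    m, s = mean(values), stdev(values)  # stdev uses N - 1 (sample SD)
    return [x for x in values if abs((x - m) / s) >= cutoff]

# Toy reaction times in ms: 20 normal trials plus one huge value.
times = [1000] * 20 + [10000]
print(flag_outliers(times))  # only the 10000 ms trial is flagged
```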
Method III - Z-Scores (without Reporting)
We can greatly speed up the z-score approach we just discussed but this comes at a price: we won't be able to report precisely which outliers we excluded. If that's ok with you, the syntax below almost fully automates the job.
descriptives reac01 to reac05
/save.
*Recode original values into 999999999 if z-score >= 3.29.
do repeat #ori = reac01 to reac05 / #z = zreac01 to zreac05.
if(abs(#z) >= 3.29) #ori = 999999999.
end repeat print.
*Add value labels.
add value labels reac01 to reac05 999999999 '(Excluded because |z| >= 3.29)'.
*Set missing values.
missing values reac01 to reac05 (999999999).
*Check how many outliers were excluded.
frequencies reac01 to reac05.
Result
The frequency table below tells us that 4 outliers having |z| ≥ 3.29 were excluded for reac04.
Under Missing we see the number of excluded outliers but not the exact values.
Sadly, we're no longer able to tell precisely which original values these correspond to.
Final Notes
Thus far, I deliberately avoided discussing precisely which values should be considered outliers for our data. I feel that simply making a decision and being fully explicit about it is more constructive than endless debate.
I therefore blindly followed some rules of thumb for the boxplot and z-score approaches. As I warned earlier, these don't always result in good decisions: for the data at hand, reaction times below some 500 ms can't be taken seriously. However, the rules of thumb don't always exclude these.
As for most of data analysis, using common sense is usually a better idea...
Thanks for reading!
SPSS Mediation Analysis – The Complete Guide
- How to Examine Mediation Effects?
- SPSS Regression Dialogs
- SPSS Mediation Analysis Output
- APA Reporting Mediation Analysis
- Next Steps - The Sobel Test
- Next Steps - Index of Mediation
Example
A scientist wants to know which factors affect general well-being among people suffering illnesses. In order to find out, she collects some data on a sample of N = 421 cancer patients. These data -partly shown below- are in wellbeing.sav.
Now, our scientist believes that well-being is affected by pain as well as fatigue. On top of that, she believes that fatigue itself is also affected by pain. In short: pain partly affects well-being through fatigue. That is, fatigue mediates the effect from pain onto well-being as illustrated below.
The lower half illustrates a model in which fatigue would (erroneously) be left out. This is known as the “total effect model” and is often compared with the mediation model above it.
How to Examine Mediation Effects?
Now, let's suppose for a second that all expectations from our scientist are exactly correct. If so, then what should we see in our data? The classical approach to mediation (see Baron & Kenny, 1986) says that
- \(a\) (from pain to fatigue) should be significant;
- \(b\) (from fatigue to well-being) should be significant;
- \(c\) (from pain to well-being) should be significant;
- \(c\,'\) (direct effect) should be closer to zero than \(c\) (total effect).
So how to find out if our data is in line with these statements? Well, all paths are technically just b-coefficients. We'll therefore run 3 (separate) regression analyses:
- regression from pain onto fatigue tells us if \(a\) is significant;
- multiple linear regression from pain and fatigue onto well-being tells us if \(b\) and \(c\,'\) are significant;
- regression from pain onto well-being tells us if \(c\) is significant and/or different from \(c\,'\).
Paths c’ and b in basic SPSS regression output
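As a side note, these 3 regressions are tied together by an exact algebraic identity: the total effect equals the direct effect plus the indirect effect, \(c = c\,' + a \cdot b\). The pure-Python sketch below (toy data, not wellbeing.sav) verifies this by solving the centered normal equations:

```python
def slope(x, y):
    """OLS slope of y on a single predictor x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def slopes2(x, m, y):
    """OLS slopes (c', b) of y on two predictors x and m (with intercept),
    solved from the centered normal equations."""
    mx, mm, my = (sum(v) / len(v) for v in (x, m, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    det = sxx * smm - sxm ** 2
    c_prime = (sxy * smm - smy * sxm) / det
    b = (smy * sxx - sxy * sxm) / det
    return c_prime, b

# Toy data: x = "pain", m = "fatigue", y = "well-being".
x = [1, 2, 3, 4, 5, 6, 7, 8]
m = [2, 1, 4, 3, 6, 5, 8, 7]
y = [9, 8, 7, 7, 5, 6, 3, 2]
a = slope(x, m)                  # path a
c = slope(x, y)                  # total effect c
c_prime, b = slopes2(x, m, y)    # direct effect c' and path b
print(abs(c - (c_prime + a * b)) < 1e-10)  # True: c = c' + ab exactly
```

This identity is why comparing \(c\) with \(c\,'\) tells us something about the indirect effect.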
SPSS Regression Dialogs
So let's first run the regression analysis for effect \(a\) (X onto mediator) in SPSS: we'll open wellbeing.sav and navigate to the linear regression dialogs as shown below.
For a fairly basic analysis, we'll fill out these dialogs as shown below.
Completing these steps results in the SPSS syntax below. I suggest you shorten the pasted version a bit.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT fatigue /* MEDIATOR */
/METHOD=ENTER pain /* X */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).
*SHORTEN TO SOMETHING LIKE...
REGRESSION
/STATISTICS COEFF CI(95) R
/DEPENDENT fatigue /* MEDIATOR */
/METHOD=ENTER pain /* X */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).
A second regression analysis estimates effects \(b\) and \(c\,'\). The easiest way to run it is to copy, paste and edit the first syntax as shown below.
REGRESSION
/STATISTICS COEFF CI(95) R
/DEPENDENT wellb /* Y */
/METHOD=ENTER pain fatigue /* X AND MEDIATOR */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).
We'll use the syntax below for the third (and final) regression which estimates \(c\), the total effect.
REGRESSION
/STATISTICS COEFF CI(95) R
/DEPENDENT wellb /* Y */
/METHOD=ENTER pain /* X */
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS HISTOGRAM(ZRESID).
SPSS Mediation Analysis Output
For our mediation analysis, we really only need the 3 coefficients tables. I copy-pasted them into this Googlesheet (read-only, partly shown below).
So what do we conclude? Well, all requirements for mediation are met by our results:
- effects \(a\), \(b\) and \(c\) are all statistically significant. This is because their “Sig.” or p < .05;
- the direct effect \(c\,'\) = -0.17 and thus closer to zero than the total effect \(c\) = -0.22.
The diagram below summarizes these results.
Note that both \(c\) and \(c\,'\) are significant. This is often called partial mediation: fatigue partially mediates the effect from pain onto well-being; adding it decreases the effect but doesn't nullify it altogether.
Besides partial mediation, we sometimes find full mediation. This means that \(c\) is significant but \(c\,'\) isn't: the effect is fully mediated and thus disappears when the mediator is added to the regression model.
APA Reporting Mediation Analysis
Mediation analysis is often reported as separate regression analyses as in “the first step of our analysis showed that the effect of pain on fatigue was significant, b = 0.09, p < .001...” Some authors also include t-values and degrees of freedom (df) for b-coefficients. For some very dumb reason, SPSS does not report degrees of freedom but you can compute them as
$$df = N - k - 1$$
where
- \(N\) denotes the total sample size (N = 421 in our example) and
- \(k\) denotes the number of predictors in the model (1 or 2 in our example).
Like so, we could report “the second step of our analysis showed that the effect of fatigue on well-being was also significant, b = -0.53, t(418) = -3.89, p < .001...” (with N = 421 and k = 2 predictors, df = 421 - 2 - 1 = 418).
Next Steps - The Sobel Test
In our analysis, the indirect effect of pain via fatigue onto well-being consists of two separate effects, \(a\) (pain onto fatigue) and \(b\) (fatigue onto well-being). Now, the entire indirect effect \(ab\) is simply computed as
$$\text{indirect effect} \;ab = a \cdot b$$
This makes perfect sense: if wage \(a\) is $30 per hour and tax \(b\) is $0.20 per dollar income, then I'll pay $30 · $0.20 = $6.00 tax per hour, right?
For our example, \(ab\) = 0.09 · -0.53 = -0.049: for every unit increase in pain, well-being decreases by an average 0.049 units via fatigue. But how do we obtain the p-value and confidence interval for this indirect effect? There are 2 basic options:
- the modern literature favors bootstrapping as implemented in the PROCESS macro which we'll discuss later;
- the Sobel test (also known as “normal theory” approach).
The second approach assumes \(ab\) is normally distributed with
$$se_{ab} = \sqrt{a^2se^2_b + b^2se^2_a + se^2_a se^2_b}$$
where
\(se_{ab}\) denotes the standard error of \(ab\) and so on.
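If you prefer code over a spreadsheet, this formula translates directly into Python. Note that the standard errors below are made-up illustration values, not the ones from our actual SPSS output:

```python
from math import erf, sqrt

def sobel(a, se_a, b, se_b):
    """Standard error, z and two-tailed p for the indirect effect ab."""
    ab = a * b
    se_ab = sqrt(a**2 * se_b**2 + b**2 * se_a**2 + se_a**2 * se_b**2)
    z = ab / se_ab
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed
    return ab, se_ab, z, p

# a and b come from our output; se_a and se_b are HYPOTHETICAL here.
ab, se_ab, z, p = sobel(a=0.09, se_a=0.02, b=-0.53, se_b=0.12)
print(round(ab, 4), round(se_ab, 4), round(z, 2), round(p, 4))
```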
For the actual calculations, I suggest you try our Sobel Test Calculator.xlsx, partly shown below.
So what does this tell us? Well, our indirect effect is significant, B = -0.049, p = .002, 95% CI [-0.08, -0.02].
Next Steps - Index of Mediation
Our research variables (such as pain & fatigue) were measured on different scales without clear units of measurement. This renders it impossible to compare their effects. The solution is to report standardized coefficients known as β (Greek letter “beta”).
Our SPSS output already includes beta for most effects but not for \(ab\). However, we can easily compute it as
$$\beta_{ab} = \frac{ab \cdot SD_x}{SD_y}$$
where
\(SD_x\) is the sample-standard-deviation of our X variable and so on.
This standardized indirect effect is known as the index of mediation. For computing it, we may run something like DESCRIPTIVES pain wellb. in SPSS. After copy-pasting the resulting table into this Googlesheet, we'll compute \(\beta_{ab}\) with a quick formula as shown below.
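The formula itself is a one-liner. In the Python sketch below, the standard deviations are made-up illustration values, not the actual output from DESCRIPTIVES pain wellb.:

```python
def index_of_mediation(ab, sd_x, sd_y):
    """Standardized indirect effect: beta_ab = ab * SD_x / SD_y."""
    return ab * sd_x / sd_y

# ab = -0.049 comes from our analysis; the SDs here are HYPOTHETICAL.
print(round(index_of_mediation(-0.049, sd_x=2.0, sd_y=1.5), 4))
```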
Adding the output from our Sobel test calculator to this sheet results in a very complete and clear summary table for our mediation analysis.
Final Notes
Mediation analysis in SPSS can be done with or without the PROCESS macro. Some reasons for not using PROCESS are that
- many people find PROCESS difficult to use and dislike its output format;
- PROCESS can't create regression residuals and the associated plots for checking regression assumptions such as linearity, homoscedasticity and normality;
- the PROCESS output does not include adjusted r-squared;
- PROCESS does not offer pairwise exclusion of missing values.
So why does anybody use PROCESS? Some reasons may be that
- PROCESS uses bootstrapping rather than the Sobel test. This is said to result in higher power and more accurate confidence intervals. Sadly, bootstrapping does not yield a p-value for the indirect effect whereas the Sobel test does;
- using PROCESS may save a lot of work for more complex models (parallel, serial and moderated mediation);
- if needed, PROCESS handles dummy coding for the X variable and moderators (if any);
- PROCESS doesn't require the additional calculations that we implemented in our Googlesheet: it calculates everything you need in one go.
Right. I hope this tutorial has been helpful for running, reporting and understanding mediation analysis in SPSS. This is perhaps not the easiest topic but remember that practice makes perfect.
Thanks for reading!
Skewness – What & Why?
Skewness is a number that indicates to what extent
a variable is asymmetrically distributed.
- Positive (Right) Skewness Example
- Negative (Left) Skewness Example
- Population Skewness - Formula and Calculation
- Sample Skewness - Formula and Calculation
- Skewness in SPSS
- Skewness - Implications for Data Analysis
Positive (Right) Skewness Example
A scientist has 1,000 people complete some psychological tests. For test 5, the test scores have skewness = 2.0. A histogram of these scores is shown below.
The histogram shows a very asymmetrical frequency distribution. Most people score 20 points or lower but the right tail stretches out to 90 or so. This distribution is right skewed.
If we move to the right along the x-axis, we go from 0 to 20 to 40 points and so on. So towards the right of the graph, the scores become more positive. Therefore,
right skewness is positive skewness
which means skewness > 0. This first example has skewness = 2.0 as indicated in the right top corner of the graph. The scores are strongly positively skewed.
Negative (Left) Skewness Example
Another variable -the scores on test 2- turns out to have skewness = -1.0. Its histogram is shown below.
The bulk of scores are between 60 and 100 or so. However, the left tail is stretched out somewhat. So this distribution is left skewed.
Right: to the left, to the left. If we follow the x-axis to the left, we move towards more negative scores. This is why
left skewness is negative skewness.
And indeed, skewness = -1.0 for these scores. Their distribution is left skewed. However, it is less skewed -or more symmetrical- than our first example which had skewness = 2.0.
Symmetrical Distribution Implies Zero Skewness
Finally, symmetrical distributions have skewness = 0. The scores on test 3 -having skewness = 0.1- come close.
Now, observed distributions are rarely precisely symmetrical. This is mostly seen for some theoretical sampling distributions. Some examples are
- the (standard) normal distribution;
- the t distribution and
- the binomial distribution if p = 0.5.
These distributions are all exactly symmetrical and thus have skewness = 0.000...
Population Skewness - Formula and Calculation
If you'd like to compute skewnesses for one or more variables, just leave the calculations to some software. But -just for the sake of completeness- I'll list the formulas anyway.
If your data contain your entire population, compute the population skewness as:
$$Population\;skewness = \Sigma\biggl(\frac{X_i - \mu}{\sigma}\biggr)^3\cdot\frac{1}{N}$$
where
- \(X_i\) is each individual score;
- \(\mu\) is the population mean;
- \(\sigma\) is the population standard deviation and
- \(N\) is the population size.
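If you'd like to verify the formula without a spreadsheet, here's a direct Python translation. For [1, 2, 3, 4, 5] it returns 0 (perfect symmetry) and for [1, 1, 1, 5] it returns roughly 1.15 (right skewed):

```python
from math import sqrt

def population_skewness(x):
    """Skewness treating x as the entire population (like =SKEW.P)."""
    n = len(x)
    mu = sum(x) / n
    sigma = sqrt(sum((xi - mu) ** 2 for xi in x) / n)  # population SD
    return sum(((xi - mu) / sigma) ** 3 for xi in x) / n

print(population_skewness([1, 2, 3, 4, 5]))          # 0.0
print(round(population_skewness([1, 1, 1, 5]), 4))   # 1.1547
```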
For an example calculation using this formula, see this Googlesheet (shown below).
It also shows how to obtain population skewness directly by using =SKEW.P(...) where “.P” means “population”. This confirms the outcome of our manual calculation. Sadly, neither SPSS nor JASP compute population skewness: both are limited to sample skewness.
Sample Skewness - Formula and Calculation
If your data hold a simple random sample from some population, use
$$Sample\;skewness = \frac{N\cdot\Sigma(X_i - \overline{X})^3}{S^3(N - 1)(N - 2)}$$
where
- \(X_i\) is each individual score;
- \(\overline{X}\) is the sample mean;
- \(S\) is the sample-standard-deviation and
- \(N\) is the sample size.
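Again, the formula translates directly into Python. For the toy data [1, 1, 1, 5], the sample skewness works out to exactly 2.0:

```python
from math import sqrt

def sample_skewness(x):
    """Skewness treating x as a simple random sample (like =SKEW)."""
    n = len(x)
    mean = sum(x) / n
    s = sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))  # sample SD
    return n * sum((xi - mean) ** 3 for xi in x) / (s**3 * (n - 1) * (n - 2))

print(sample_skewness([1, 1, 1, 5]))  # 2.0
```

Note that this is somewhat larger than the population skewness for the same values: the sample formula corrects for the fact that small samples tend to understate skewness.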
An example calculation is shown in this Googlesheet (shown below).
An easier option for obtaining sample skewness is using =SKEW(...), which confirms the outcome of our manual calculation.
Skewness in SPSS
First off, “skewness” in SPSS always refers to sample skewness: it quietly assumes that your data hold a sample rather than an entire population. There's plenty of options for obtaining it. My favorite is via MEANS because the syntax and output are clean and simple. The screenshots below guide you through.
The syntax can be as simple as
means v1 to v5
/cells skew.
A very complete table -including means, standard deviations, medians and more- is run from
means v1 to v5
/cells count min max mean median stddev skew kurt.
The result is shown below.
Skewness - Implications for Data Analysis
Many analyses -ANOVA, t-tests, regression and others- require the normality assumption: variables should be normally distributed in the population. The normal distribution has skewness = 0. So observing substantial skewness in some sample data suggests that the normality assumption is violated.
Such violations of normality are no problem for large sample sizes -say N > 20 or 25 or so. In this case, most tests are robust against such violations. This is due to the central limit theorem. In short,
for large sample sizes, skewness is
no real problem for statistical tests.
However, skewness is often associated with large standard deviations. These may result in large standard errors and low statistical power. Like so, substantial skewness may decrease the chance of rejecting some null hypothesis in order to demonstrate some effect. In this case, a nonparametric test may be a wiser choice as it may have more power.
Violations of normality do pose a real threat
for small sample sizes
of -say- N < 20 or so. With small sample sizes, many tests are not robust against a violation of the normality assumption. The solution -once again- is using a nonparametric test because these don't require normality.
Last but not least, there isn't any statistical test for examining if population skewness = 0. An indirect way for testing this is a normality test such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test.
However, when normality is really needed -with small sample sizes- such tests have low power: they may not reach statistical significance even when departures from normality are severe. Like so, they mainly provide you with a false sense of security.
And that's about it, I guess. If you've any remarks -either positive or negative- please throw in a comment below. We do love a bit of discussion.
Thanks for reading!
How to Draw Regression Lines in SPSS?
- Method A - Legacy Dialogs
- Method B - Chart Builder
- Method C - CURVEFIT
- Method D - Regression Variable Plots
- Method E - All Scatterplots Tool
Summary & Example Data
This tutorial walks you through different options for drawing (non)linear regression lines for either all cases or subgroups. All examples use bank-clean.sav, partly shown below.
Method A - Legacy Dialogs
A simple option for drawing linear regression lines is found under the legacy dialogs as illustrated by the screenshots below.
Completing these steps results in the SPSS syntax below. Running it creates a scatterplot to which we can easily add our regression line in the next step.
GRAPH
/SCATTERPLOT(BIVAR)=whours WITH salary
/MISSING=LISTWISE.
For adding a regression line, first double click the chart to open it in a Chart Editor window. Next, click the “Add Fit Line at Total” icon as shown below.
You can now simply close the fit line dialog and Chart Editor.
Result
The linear regression equation is shown in the label on our line: y = 9.31E3 + 4.49E2*x which means that
$$Salary' = 9,310 + 449 \cdot Hours$$
Note that 9.31E3 is scientific notation for 9.31 · 10³ = 9,310 (with some rounding).
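If you'd like to double-check the chart label, the Python snippet below evaluates the (rounded) regression equation:

```python
# 9.31E3 is just scientific notation: Python parses it the same way.
assert 9.31E3 == 9310.0 and 4.49E2 == 449.0

def predicted_salary(hours):
    """Fitted line from the chart label (coefficients rounded by SPSS)."""
    return 9310 + 449 * hours

print(predicted_salary(40))  # 27270: predicted salary for a 40-hour week
```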
You can verify this result and obtain more detailed output by running a simple linear regression from the syntax below.
regression
/dependent salary
/method enter whours.
When doing so, you'll also have significance levels and/or confidence intervals. Finally, note that a linear relation seems a very poor fit for these variables. So let's explore some more interesting options.
Method B - Chart Builder
For SPSS versions 25 and higher, you can obtain scatterplots with fit lines from the chart builder. Let's do so for job type groups separately: simply navigate to the chart builder and fill out the dialogs as shown below.
This results in the syntax below. Let's run it.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=whours salary jtype MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
/FITLINE TOTAL=NO SUBGROUP=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: whours=col(source(s), name("whours"))
DATA: salary=col(source(s), name("salary"))
DATA: jtype=col(source(s), name("jtype"), unit.category())
GUIDE: axis(dim(1), label("On average, how many hours do you work per week?"))
GUIDE: axis(dim(2), label("Gross monthly salary"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label("Current job type"))
GUIDE: text.title(label("Scatter Plot of Gross monthly salary by On average, how many hours do ",
"you work per week? by Current job type"))
SCALE: cat(aesthetic(aesthetic.color.interior), include(
"1", "2", "3", "4", "5"))
ELEMENT: point(position(whours*salary), color.interior(jtype))
END GPL.
Result
First off, this chart is mostly used for
- inspecting homogeneity of regression slopes in ANCOVA and
- simple slopes analysis in moderation regression.
Sadly, the styling for this chart is awful but we could have fixed this with a chart template if we hadn't been so damn lazy.
Anyway, note that R-square -a common effect size measure for regression- is between good and excellent for all groups except upper management. This handful of cases may be the main reason for the curvilinearity we see if we ignore the existence of subgroups.
Running the syntax below verifies the results shown in this plot and results in more detailed output.
sort cases by jtype.
split file layered by jtype.
*SIMPLE LINEAR REGRESSION.
regression
/dependent salary
/method enter whours.
*END SPLIT FILE.
split file off.
Method C - CURVEFIT
Scatterplots with (non)linear fit lines and basic regression tables are very easily obtained from CURVEFIT. Just navigate to the curve estimation dialog and fill it out as shown below.
If you'd like to see all models, change /MODEL=LINEAR to /MODEL=ALL after pasting the syntax.
TSET NEWVAR=NONE.
CURVEFIT
/VARIABLES=salary WITH whours
/CONSTANT
/MODEL=ALL /* CHANGE THIS LINE MANUALLY */
/PLOT FIT.
Result
Despite the poor styling of this chart, most curves seem to fit these data better than a linear relation. This can somewhat be verified from the basic regression table shown below.
Especially the cubic model seems to fit nicely. Its equation is
$$Salary' = -13114 + 1883 \cdot hours - 80 \cdot hours^2 + 1.17 \cdot hours^3$$
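For a quick sanity check on this equation, the Python snippet below evaluates the (rounded) cubic coefficients for a couple of working-hours values:

```python
def predicted_salary(hours):
    """Cubic CURVEFIT model with the rounded coefficients shown above."""
    return -13114 + 1883 * hours - 80 * hours**2 + 1.17 * hours**3

# Predicted salaries climb steeply towards higher working hours.
for hours in (20, 30, 40, 50):
    print(hours, round(predicted_salary(hours)))
```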
Sadly, this output is rather limited: do all predictors in the cubic model seriously contribute to r-squared? The syntax below results in more detailed output and verifies our initial results.
compute whours2 = whours**2.
compute whours3 = whours**3.
regression
/dependent salary
/method forward whours whours2 whours3.
Method D - Regression Variable Plots
Regression Variable Plots is an SPSS extension that's mostly useful for
- creating several scatterplots and/or fit lines in one go;
- plotting nonlinear fit lines for separate groups;
- adding elements to and customizing these charts.
I believe this extension is preinstalled with SPSS version 26 onwards. If not, it's supposedly available from STATS_REGRESS_PLOT but I used to have some trouble installing it on older SPSS versions.
Anyway: if installed, navigating to the extension's menu entry should open the dialog shown below.
Completing these steps results in the syntax below. Let's run it.
STATS REGRESS PLOT YVARS=salary XVARS=whours COLOR=jtype
/OPTIONS CATEGORICAL=BARS GROUP=1 INDENT=15 YSCALE=75
/FITLINES CUBIC APPLYTO=GROUP.
Result
Most groups don't show strong deviations from linearity. The main exception is upper management which shows a rather bizarre curve.
However, keep in mind that these are only a handful of observations; the curve is the result of overfitting. It (probably) won't replicate in other samples and can't be taken seriously.
Method E - All Scatterplots Tool
Most methods we discussed so far are pretty good for creating a single scatterplot with a fit line. However, we often want to check several such plots for things like outliers, homoscedasticity and linearity. This is especially relevant for
A very simple tool for precisely these purposes is downloadable from and discussed in SPSS - Create All Scatterplots Tool.
Final Notes
Right, so those are the main options for obtaining scatterplots with fit lines in SPSS. I hope you enjoyed this quick tutorial as much as I have.
If you've any remarks, please throw me a comment below. And last but not least:
thanks for reading!
SPSS Mediation Analysis with PROCESS
- SPSS PROCESS Dialogs
- SPSS PROCESS Output
- Mediation Summary Diagram & Conclusion
- Indirect Effect and Index of Mediation
- APA Reporting Mediation Analysis
Introduction
A study investigated general well-being among a random sample of N = 421 hospital patients. Some of these data are in wellbeing.sav, partly shown below.
One investigator believes that
- pain increases fatigue and
- fatigue -in turn- decreases overall well-being.
That is, the relation from pain onto well-being is thought to be mediated by fatigue, as visualized below (top half).
Besides this indirect effect through fatigue, pain could also directly affect well-being (top half, path \(c\,'\)).
Now, what would happen if this model were correct and we'd (erroneously) leave fatigue out of it? Well, in this case the direct and indirect effects would be added up into a total effect (path \(c\), lower half). If all these hypotheses are correct, we should see the following in our data:
- assuming sufficient sample size, paths \(a\) and \(b\) should both be significant;
- path \(c\,'\) (direct effect) should be different from \(c\) (total effect).
One approach to such a mediation analysis is a series of (linear) regression analyses as discussed in SPSS Mediation Analysis Tutorial. An alternative, however, is using the SPSS PROCESS macro as we'll demonstrate below.
Quick Data Checks
Rather than blindly jumping into some advanced analyses, let's first see if our data look plausible in the first place. As a quick check, let's inspect the histograms of all variables involved. We'll do so from the SPSS syntax below. For more details, consult Creating Histograms in SPSS.
frequencies pain fatigue wellb
/format notable
/histogram.
Result
First off, note that all variables have N = 421 so there are no missing values. This is important to verify because PROCESS only handles cases that are complete on all variables involved in the analysis.
Second, there seem to be some slight outliers. This especially holds for fatigue as shown below.
I think these values still look pretty plausible and I don't expect them to have a major impact on our analyses. Although disputable, I'll leave them in the data for now.
SPSS PROCESS Dialogs
First off, make sure you have PROCESS installed as covered in SPSS PROCESS Macro Tutorial. After opening our data in SPSS, let's navigate to as shown below.
For a simple mediation analysis, we fill out the PROCESS dialogs as shown below.
After completing these steps, you can either
- click “Ok” and just run the analysis;
- click “Paste” and run the (huge) syntax that's pasted; or
- click “Paste”, rearrange the syntax and then run it.
We discussed this last option in SPSS PROCESS Macro Tutorial. This may take you a couple of minutes but it'll pay off in the end. Our final syntax is shown below.
set mdisplay tables.
*READ PROCESS DEFINITION.
insert file = 'd:/downloaded/DEFINE-PROCESS-42.sps'.
*RUN PROCESS MODEL 4 (SIMPLE MEDIATION).
!PROCESS
y=wellb
/x=pain
/m=fatigue
/stand = 1 /* INCLUDE STANDARDIZED (BETA) COEFFICIENTS */
/total = 1 /* INCLUDE TOTAL EFFECT MODEL */
/decimals=F10.4
/boot=5000
/conf=95
/model=4
/seed = 20221227. /* MAKE BOOTSTRAPPING REPLICABLE */
SPSS PROCESS Output
Let's first look at path \(a\): this is the effect from \(X\) (pain) onto \(M\) (fatigue). We find it in the output if we look for OUTCOME VARIABLE fatigue as shown below.
For path \(a\), b = 0.09, p < .001: on average, higher pain scores are associated with more fatigue and this is highly statistically significant. This outcome is as expected if our mediation model is correct.
SPSS PROCESS Output - Paths B and C’
Paths \(b\) and \(c\,'\) are found in a single table. It's the one for which OUTCOME VARIABLE is \(Y\) (well-being) and includes b-coefficients for both \(X\) (pain) and \(M\) (fatigue).
Note that path \(b\) is highly significant, as expected from our mediation hypotheses. Path \(c\,'\) (the direct effect) is also significant but our mediation model does not require this.
SPSS PROCESS Output - Path C
Some (but not all) authors also report the total effect, path \(c\). It is found in the table that has OUTCOME VARIABLE \(Y\) (well-being) but does not include a b-coefficient for the mediator.
Mediation Summary Diagram & Conclusion
The 4 main paths we examined thus far suffice for a classical mediation analysis. We summarized them in the figure below.
As hypothesized, paths \(a\) and \(b\) are both significant. Also note that the direct effect is closer to zero than the total effect. This makes sense because the (negative) direct effect is the (negative) total effect minus the (negative) indirect effect.
A final point is that the direct effect is still significant: the indirect effect only partly accounts for the relation from pain onto well-being. This is known as partial mediation. A careful conclusion could thus be that
the effect from pain onto well-being
is partially mediated by fatigue.
Indirect Effect and Index of Mediation
Thus far, we established mediation by examining paths \(a\) and \(b\) separately. A more modern approach, however, focuses mostly on the entire indirect effect which is simply
$$\text{indirect effect } ab = a \cdot b$$
For our example, \(ab\) is the change in \(Y\) (well-being) associated with a 1-unit increase in \(X\) (pain) through \(M\) (fatigue). This indirect effect is shown in the table below.
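As a quick sanity check, multiplying the (rounded) coefficients we found earlier, a = 0.09 and b = -0.53, should yield a value inside the bootstrapped CI that PROCESS reports for \(ab\). A minimal sketch in Python:

```python
a = 0.09    # path a: pain -> fatigue (rounded)
b = -0.53   # path b: fatigue -> well-being (rounded)

ab = a * b  # indirect effect as the product of paths a and b
print(round(ab, 4))  # -0.0477
```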
Note that PROCESS does not compute a p-value or parametric confidence interval (CI) for \(ab\). Instead, it estimates a CI by bootstrapping. This CI may be slightly different in your output because it's based on random sampling.
Importantly, the 95% CI [-0.08, -0.02] does not contain zero. This tells us that p < .05 even though we don't have an exact p-value. An alternative to bootstrapping that does come up with a p-value here is the Sobel test.
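PROCESS doesn't run the Sobel test for you but it's easy to sketch in Python. Note that the standard errors sa and sb below are hypothetical values for illustration only; the actual standard errors are found in the PROCESS coefficient tables.

```python
from scipy.stats import norm

def sobel_test(a, sa, b, sb):
    """Sobel z-test for an indirect effect ab (first-order delta method SE)."""
    se_ab = (b ** 2 * sa ** 2 + a ** 2 * sb ** 2) ** 0.5
    z = (a * b) / se_ab
    p = 2 * norm.sf(abs(z))  # two-tailed p-value
    return z, p

# a and b are our rounded coefficients; sa and sb are hypothetical SE's
z, p = sobel_test(a=0.09, sa=0.02, b=-0.53, sb=0.14)
```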
PROCESS also reports the standardized b-coefficient for \(ab\). This is usually denoted as β and is completely unrelated to (1 - β) or power in statistics. This number, 0.04, is known as the index of mediation and is often interpreted as an effect size measure.
A huge stupidity in this table is that b is denoted as “Effect” rather than “coeff” as in the other tables. Adding to the confusion, “Effect” refers to either b or β. Denoting b as b and β as β would have been highly preferable here.
APA Reporting Mediation Analysis
Mediation analysis is often reported as separate regression analyses: “the first step of our analysis showed that the effect of pain on fatigue was significant, b = 0.09, p < .001...” Some authors also include t-values and degrees of freedom (df) for b-coefficients. For some dumb reason, PROCESS does not report degrees of freedom but you can compute them as
$$df = N - k - 1$$
where
- \(N\) denotes the total sample size (N = 421 in our example) and
- \(k\) denotes the number of predictors in the model (1 or 2 in our example).
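The formula is easily verified for our example: in the fatigue model (step 1), pain is the only predictor (k = 1); in the well-being model (step 2), both pain and fatigue are predictors (k = 2). A quick check in Python:

```python
def df_residual(n, k):
    """Residual degrees of freedom for a linear regression with k predictors."""
    return n - k - 1

print(df_residual(421, 1))  # fatigue model: 419
print(df_residual(421, 2))  # well-being model: 418
```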
Like so, we could report “the second step of our analysis showed that the effect of fatigue on well-being was also significant, b = -0.53, t(418) = -3.89, p < .001...” (this model has k = 2 predictors, so df = 421 - 2 - 1 = 418).
Final Notes
First off, mediation is inherently a causal model: \(X\) causes \(M\) which, in turn, causes \(Y\). Nevertheless, mediation analysis does not usually support any causal claims. A rare exception could be \(X\) being a (possibly dichotomous) manipulation variable. In most cases, however, we can merely conclude that
our data do (not) contradict
some (causal) mediation model.
This is not quite the strong conclusion we'd usually like to draw.
A second point is that I dislike the verbose text reporting suggested by the APA. As shown below, a simple table presents our results much more clearly and concisely.
Lastly, we feel that our example analysis would have been stronger if we had standardized all variables into z-scores prior to running PROCESS. The simple reason is that unstandardized values are uninterpretable for variables such as pain, fatigue and so on. What does a pain score of 60 mean? Low? Medium? High?
In contrast: a pain z-score of -1 means one standard deviation below the mean. If these scores are normally distributed, this is roughly the 16th percentile.
This point carries over to our regression coefficients:
b-coefficients are not interpretable because
we don't know how much a “unit” is
for our (in)dependent variables. Therefore, reporting only β coefficients makes much more sense.
Now, we do have these standardized coefficients in our output. However, most confidence intervals apply to the unstandardized coefficients. This can be fixed by standardizing all variables prior to running PROCESS.
Thanks for reading!
Power (Statistics) – The Ultimate Beginners Guide
In statistics, power is the probability of rejecting
a false null hypothesis.
- Power Calculation Example
- Power & Alpha Level
- Power & Effect Size
- Power & Sample Size
- 3 Main Reasons for Power Calculations
- Software for Power Calculations - G*Power
Power - Minimal Example
- In some country, IQ and salary have a population correlation ρ = .10.
- A scientist examines a sample of N = 10 people and finds a sample correlation r = .15.
- He tests the (false) null hypothesis H0 that ρ = 0. This test comes up with a p-value of p = .68.
- Since p > .05, his chosen alpha level, he does not reject his (false) null hypothesis that ρ = 0.
Now, given a sample size of N = 10 and a population correlation ρ = 0.10, what's the probability of correctly rejecting the null hypothesis? This probability is known as power and denoted as (1 - β) in statistics. For the aforementioned example, (1 - β) is only .058 (roughly 6%) as shown below.
If a population correlation ρ = .10 and
we sample N = 10 respondents, then
we need to find an absolute sample correlation of | r | > .63 for rejecting H0 at α = .05.
The probability of finding this is only .058.
So even though H0 is false, we're unlikely to actually reject it. Not rejecting a false H0 is known as committing a type II error.
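The numbers in this example are easy to verify. A sketch in Python (assuming scipy is available; the power value uses the Fisher z approximation, so it may differ slightly from exact results):

```python
from math import atanh, sqrt
from scipy.stats import norm, t

n = 10
df = n - 2               # df for testing a correlation
r, rho = 0.15, 0.10

# two-tailed p-value for the observed sample correlation r = .15
t_obs = r * sqrt(df / (1 - r ** 2))
p = 2 * t.sf(abs(t_obs), df)              # ~ .68

# smallest |r| that reaches significance at alpha = .05
t_crit = t.ppf(0.975, df)
r_crit = t_crit / sqrt(t_crit ** 2 + df)  # ~ .63

# power for rho = .10 via the Fisher z approximation
delta = atanh(rho) * sqrt(n - 3)
power = norm.sf(norm.ppf(0.975) - delta) + norm.cdf(-norm.ppf(0.975) - delta)
# ~ .058
```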
Type I and Type II Errors
Any null hypothesis may be true or false and we may or may not reject it. This results in the 4 scenarios outlined below.
| | Reality: H0 is true | Reality: H0 is false |
|---|---|---|
| Decision: reject H0 | Type I error Probability = α | Correct decision Probability = (1 - β) = power |
| Decision: retain H0 | Correct decision Probability = (1 - α) | Type II error Probability = β |
As you can probably guess, we usually want the power for our tests to be as high as possible. But before taking a look at the factors affecting power, let's first try to understand how a power calculation actually works.
Power Calculation Example
A pharmaceutical company wants to demonstrate that their medicine against high blood pressure actually works. They expect the following:
- the average blood pressure in some untreated population is 160 mmHg;
- they expect their medicine to lower this to roughly 154 mmHg;
- the standard deviation should be around 8 mmHg (both populations);
- they plan to use an independent samples t-test at α = 0.05 with N = 20 for either subsample.
Given these considerations, what's the power for this study? Or -alternatively- what's the probability of rejecting H0 that the mean blood pressure is equal between treated and untreated populations?
Obviously, nobody knows the outcomes for this study until it's finished. However, we do know the most likely outcomes: they're our population estimates. So let's for a moment pretend that we'll find exactly these and enter them into a t-test calculator.
Compute t-test for expected sample sizes, means and SD's in Excel
We expect p = 0.023 so we expect to reject H0.
This is based on a t-distribution with df = 38 degrees of freedom (the total sample size of N = 40 minus 2).
We expect to find t = 2.37 if the population mean difference is 6 mmHg (160 - 154).
Now, this expected (or average) t = 2.37 under the alternative hypothesis Ha is known as a noncentrality parameter or NCP. The NCP tells us how t is distributed under some exact alternative hypothesis and thus allows us to estimate the power for some test. The figure below illustrates how this works.
- First off, our H0 is tested using a central t-distribution with df = 38;
- If we test at α = 0.05 (2-tailed), we'll reject H0 if t < -2.02 (left critical value) or if t > 2.02 (right critical value);
- If our alternative hypothesis HA is exactly true, t follows a noncentral t-distribution with df = 38 and NCP = 2.37;
- Under this noncentral t-distribution, the probability of finding t > 2.02 ≈ 0.637. So this is roughly the probability of rejecting H0 -or the power (1 - β)- for our first scenario.
A minor note here is that we'd also reject H0 if t < -2.02 but this probability is almost zero for our first scenario. The exact calculation can be replicated from the SPSS syntax below.
data list free/alpha ncp.
begin data
0.05 2.37
end data.
*Compute left (lct) and right (rct) critical t-values and power.
compute lct = idf.t(0.5 * alpha,38).
compute rct = idf.t(1 - (0.5 * alpha),38).
compute lprob = ncdf.t(lct,38,ncp).
compute rprob = 1 - ncdf.t(rct,38,ncp).
compute power = lprob + rprob.
execute.
*Show 3 decimal places for all values.
formats all (f8.3).
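If you prefer Python over SPSS, the same calculation can be replicated with scipy's (non)central t-distributions:

```python
from scipy.stats import t, nct

alpha, df, ncp = 0.05, 38, 2.37

# critical t-values under the central t-distribution
lct = t.ppf(alpha / 2, df)       # ~ -2.02
rct = t.ppf(1 - alpha / 2, df)   # ~  2.02

# power: probability of exceeding either critical value
# under the noncentral t-distribution with NCP = 2.37
power = nct.cdf(lct, df, ncp) + nct.sf(rct, df, ncp)
print(round(power, 3))  # ~ 0.637
```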
Power and Effect Size
Like we just saw, estimating power requires specifying
- an exact null hypothesis and
- an exact alternative hypothesis.
In the previous example, our scientists had an exact alternative hypothesis because they had very specific ideas regarding population means and standard deviations. In most applied studies, however, we're pretty clueless about such population parameters. This raises the question: how do we get an exact alternative hypothesis?
For most tests, the alternative hypothesis can be specified as an effect size measure: a single number combining several means, variances and/or frequencies. Like so, we proceed from requiring a bunch of unknown parameters to a single unknown parameter.
What's even better: widely agreed upon rules of thumb are available for effect size measures. An overview is presented in this Googlesheet, partly shown below.
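For instance, Cohen's d for an independent samples t-test is simply the mean difference divided by the (assumed common) population SD. For the blood pressure example, this comes down to

```python
def cohens_d(m1, m2, sd):
    """Cohen's d for two population means sharing a common SD."""
    return (m1 - m2) / sd

print(cohens_d(160, 154, 8))  # 0.75
```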
In applied studies, we often use G*Power for estimating power. The screenshot below replicates our power calculation example for the blood pressure medicine study.
G*Power computes both effect size and power from two means and SD's
Note that estimating power in G*Power only requires
- a single estimated effect size measure (optionally, G*Power computes it for you, given your sample means and SD's);
- the alpha level -often 0.05- used for testing the null hypothesis;
- one or more sample sizes.
Let's now take a look at how these 3 factors relate to power.
Factors Affecting Power
The figure below gives a quick overview of how 3 factors relate to power.
Let's now take a closer look at each of them.
Power & Alpha Level
Everything else equal, increasing alpha increases power. For our example calculation, power increases from 0.637 to 0.753 if we test at α = 0.10 instead of 0.05.
A higher alpha level results in smaller (absolute) critical values: we already reject H0 if t > 1.69 instead of t > 2.02. So the light blue area, indicating (1 - β), increases. We basically require a smaller deviation from H0 for statistical significance.
However, increasing alpha comes at a cost: it increases the probability of committing a type I error (rejecting H0 when it's actually true). Therefore, testing at α > 0.05 is generally frowned upon. In short, increasing alpha merely trades one problem for another.
Power & Effect Size
Everything else equal, a larger effect size results in higher power. For our example, power increases from 0.637 to 0.869 if we believe that Cohen’s D = 1.0 rather than 0.8.
A larger effect size results in a larger noncentrality parameter (NCP). Therefore, the distributions under H0 and HA lie further apart. This increases the light blue area, indicating the power for this test.
Keep in mind, though, that we can estimate but not choose some population effect size. If we overestimate this effect size, we'll overestimate the power for our test accordingly. Therefore, we can't usually increase power by increasing an effect size.
An arguable exception is increasing an effect size by modifying a research design or analysis. For example, (partial) eta squared for a treatment effect in ANOVA may increase by adding a covariate to the analysis.
Power & Sample Size
Everything else equal, larger sample size(s) result in higher power. For our example, increasing the total sample size from N = 40 to N = 80 increases power from 0.637 to 0.912.
The increase in power stems from our distributions lying further apart. This reflects an increased noncentrality parameter (NCP). But why does the NCP increase with larger sample sizes?
Well, recall that for a t-distribution, the NCP is the expected t-value under HA. Now, t is computed as
$$t = \frac{\overline{X_1} - \overline{X_2}}{SE}$$
where \(SE\) denotes the standard error of the mean difference. In turn, \(SE\) is computed as
$$SE = S_w\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
where \(S_w\) denotes the estimated population SD of the outcome variable. This formula shows that as sample sizes increase, \(SE\) decreases and therefore t (and hence the NCP) increases.
On top of this, degrees of freedom increase (from df = 38 to df = 78 for our example). This results in slightly smaller (absolute) critical t-values but this effect is very modest.
In short, increasing sample size(s) is a sound way to increase the power for some test.
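These formulas are easily verified numerically for our example (mean difference 6 mmHg, Sw = 8 mmHg): doubling both group sizes shrinks \(SE\) and thus increases the expected t (the NCP) accordingly.

```python
from math import sqrt

def expected_t(mean_diff, sw, n1, n2):
    """Expected t under HA (the noncentrality parameter)
    for an independent samples t-test."""
    se = sw * sqrt(1 / n1 + 1 / n2)  # standard error of the mean difference
    return mean_diff / se

print(round(expected_t(6, 8, 20, 20), 2))  # 2.37 (n = 20 per group)
print(round(expected_t(6, 8, 40, 40), 2))  # 3.35 (n = 40 per group)
```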
Power & Research Design
Apart from sample size, effect size & α, the research design may also affect power. Although there are no exact formulas, some general guidelines are that
- everything else equal, within-subjects designs tend to have more power than between-subjects designs;
- for ANCOVA, including one or two covariates tends to increase power for demonstrating a treatment effect;
- for multiple regression, power for each separate predictor tends to decrease as more predictors are added to the model.
3 Main Reasons for Power Calculations
Power calculations in applied research serve 3 main purposes:
- compute the required sample size prior to data collection. This involves estimating an effect size and choosing α (usually 0.05) and the desired power (1 - β), often 0.80;
- estimate power before collecting data for some planned analyses. This requires specifying the intended sample size, choosing an α and estimating which effect sizes are expected. If the estimated power is low, the planned study may be cancelled or proceed with a larger sample size;
- estimate power after data have been collected and analyzed. This calculation is based on the actual sample size, α used for testing and observed effect size.
Different types of power analysis are made simple by G*Power
Software for Power Calculations - G*Power
G*Power is freely downloadable software for running the aforementioned and many other power calculations. Among its features are
- computing effect sizes from descriptive statistics (mostly sample means and standard deviations);
- computing power, required sample sizes, required effect sizes and more;
- creating plots that visualize how power, effect size and sample size relate for many different statistical procedures. The figure below shows an example for multiple linear regression.
Required sample sizes for multiple linear regression, given desired power, chosen α and 3 estimated effect sizes
Altogether, we think G*Power is amazing software and we highly recommend using it. The only disadvantage we can think of is that it requires rather unusual effect size measures. Some examples are
- Cohen’s f for ANOVA and
- Cohen’s W for a chi-square test.
This is awkward because the APA and (perhaps therefore) most journal articles typically recommend reporting
- (partial) eta-squared for ANOVA and
- the contingency coefficient or (better) Cramér’s V for a chi-square test.
These are also the measures we typically obtain from statistical packages such as SPSS or JASP. Fortunately, G*Power converts some measures and/or computes them from descriptive statistics like we saw in this screenshot.
Software for Power Calculations - SPSS
In SPSS, observed power can be obtained from the GLM, UNIANOVA and (deprecated) MANOVA procedures. Keep in mind that GLM -short for General Linear Model- is very general indeed: it can be used for a wide variety of analyses including
- (multiple) linear regression;
- t-tests;
- ANCOVA (analysis of covariance);
- repeated measures ANOVA.
Select Observed power from Analyze - General Linear Model - Univariate - Options
Other power calculations (computing required sample sizes or estimating power prior to data collection) were added in SPSS version 27, released in 2020.
Power Analysis as found in SPSS version 27 onwards
In my opinion, SPSS power analysis is a pathetic attempt to compete with G*Power. If you don't believe me, just try running a couple of power analyses in both programs simultaneously. If you do believe me, ignore SPSS power analysis and just go for G*Power.
Thanks for reading.
SPSS PROCESS Macro Tutorial
- Downloading & Installing PROCESS
- Creating Tables instead of Text Output
- Using PROCESS with Syntax
- PROCESS Model Numbers
- PROCESS & Dummy Coding
- Strengths & Weaknesses of PROCESS
What is PROCESS?
PROCESS is a freely downloadable SPSS tool for estimating regression models with mediation and/or moderation effects. An example of such a model is shown below.
This model can fairly easily be estimated without PROCESS as discussed in SPSS Mediation Analysis Tutorial. However, using PROCESS has some advantages (as well as disadvantages) over a more classical approach. So how to get PROCESS and how does it work?
Those who want to follow along may download and open wellbeing.sav, partly shown below.
Note that this tutorial focuses on becoming proficient with PROCESS. The example analysis will be covered in a future tutorial.
Downloading & Installing PROCESS
PROCESS can be downloaded here (scroll down to “PROCESS macro for SPSS, SAS, and R”). The download comes as a .zip file which you first need to unzip. After doing so, in SPSS, navigate to Select “process.spd” and click “Open” as shown below.
This should work for most SPSS users on recent versions. If it doesn't, consult the installation instructions that are included with the download.
Running PROCESS
If you successfully installed PROCESS, you'll find it in the regression menu as shown below.
For a very basic mediation analysis, we fill out the dialog as shown below.
Y refers to the dependent (or “outcome”) variable;
X refers to the independent variable or “predictor” in a regression context;
For simple mediation, select model 4. We'll have a closer look at model numbers in a minute;
Just for now, let's click “Ok”.
Result
The first thing that may strike you, is that the PROCESS output comes as plain text. This is awkward because formatting it is very tedious and you can't adjust any decimal places. So let's fix that.
Creating Tables instead of Text Output
If you're using SPSS version 24 or higher, run the following SPSS syntax: set mdisplay tables. After doing so, running PROCESS will result in normal SPSS output tables rather than plain text as shown below.
Note that you can readily copy-paste these tables into Excel and/or adjust their decimal places.
Using PROCESS with Syntax
First off: whatever you do in SPSS, save your syntax. Now, like any other SPSS dialog, PROCESS has a Paste button for pasting its syntax. However, a huge stupidity from the programmers is that doing so results in some 6,140 (!) lines of syntax. I'll add the first lines below.
/* Written by Andrew F Hayes */.
/* www.afhayes.com */.
/* www.processmacro.org */.
/* Copyright 2017-2021 by Andrew F Hayes */.
/* Documented in http://www.guilford.com/p/hayes3 */.
/* THIS CODE SHOULD BE DISTRIBUTED ONLY THROUGH PROCESSMACRO.ORG */.
You can run and save this syntax but having over 6,140 lines is awkward. Now, this huge syntax basically consists of 2 parts:
- a macro definition of some 6,130 lines: this consists of the formulas and computations that are performed on the input (variables, models and so on) that the SPSS user specifies;
- a macro call of some 10 lines: this tells SPSS to run the macro and which input to use.
The macro call is at the very end of the pasted syntax (use the Ctrl + End shortcut in your syntax window) and looks as follows.
y=wellb
/x=pain
/m=fatigue
/decimals=F10.4
/boot=5000
/conf=95
/model=4.
After you run the (huge) macro definition just once during your session, you only need one (short) macro call for every PROCESS model you'd like to run.
A nice way to implement this, is to move the entire macro definition into a separate SPSS syntax file. Those who want to try this can download DEFINE-PROCESS-40.sps.
Although technically not mandatory, macro names should really start with exclamation marks. Therefore, we replaced DEFINE PROCESS with DEFINE !PROCESS in line 2,983 of this file. The final trick is that we can run this huge syntax file without opening it by using the INSERT command. Like so, the syntax below replicates our entire first PROCESS analysis.
insert file = 'd:/downloaded/DEFINE-PROCESS-40.sps'.
*RERUN FIRST PROCESS ANALYSIS.
!PROCESS
y=wellb
/x=pain
/m=fatigue
/decimals=F10.4
/boot=5000
/conf=95
/model=4.
Note: for replicating this, you may need to replace d:/downloaded by the folder where DEFINE-PROCESS-40.sps is located on your computer.
PROCESS Model Numbers
As we speak, PROCESS implements 94 models. An overview of the most common ones is shown in this Googlesheet (read-only), partly shown below.
For example, if we have an X, Y and 2 mediator variables, we may hypothesize parallel mediation as illustrated below.
However, you could also hypothesize that mediator 1 affects mediator 2 which, in turn, affects Y. If you want to test this serial mediation effect, select model 6 in PROCESS.
For moderated mediation, things get more complicated: the moderator could act upon any combination of paths a, b or c’. If you believe the moderator only affects path c’, choose model 5 as shown below.
An overview of all model numbers is given in this book.
PROCESS & Dummy Coding
A quick overview of variable types for PROCESS is shown in this Googlesheet (read-only), partly shown below.
Keep in mind that PROCESS is entirely based on linear regression. This requires that dependent variables are quantitative (interval or ratio measurement level). This includes mediators, which act as both dependent and independent variables.
All other variables
- may be quantitative;
- may be dichotomous (preferably coded as 0-1);
- or must be dummy coded (nominal and ordinal variables).
Only X and the moderator variables W and Z can be dummy coded within PROCESS itself as shown below.
Covariates must be dummy coded before using PROCESS. For a handy tool, see SPSS Create Dummy Variables Tool.
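Outside SPSS, dummy coding a nominal variable simply comes down to creating (c - 1) indicator variables for its c categories. A quick sketch with pandas (the variable and its categories are made up):

```python
import pandas as pd

# hypothetical nominal covariate with c = 3 categories
jtype = pd.Series(['nurse', 'doctor', 'admin', 'nurse'], name='jtype')

# c - 1 = 2 dummy variables; the dropped first category serves as reference
dummies = pd.get_dummies(jtype, prefix='jtype', drop_first=True)
print(dummies.columns.tolist())  # ['jtype_doctor', 'jtype_nurse']
```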
Making Bootstrapping Replicable
Some PROCESS models rely on bootstrapping for reporting confidence intervals. Very basically, bootstrapping comes down to
- drawing a simple random sample (with replacement) from the data;
- computing statistics (for PROCESS, these are b-coefficients) on this new sample;
- repeating this procedure many (typically 1,000 - 10,000) times;
- examining to what extent each statistic fluctuates over these bootstrap samples.
Like so, a 95% bootstrapped CI for some parameter consists of the [2.5th - 97.5th] percentiles for some statistic over the bootstrap samples.
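The steps above can be sketched in a few lines of Python. Note that the sample below is made up and that we bootstrap its mean for simplicity; PROCESS bootstraps b-coefficients instead.

```python
import numpy as np

rng = np.random.default_rng(seed=20221227)  # fixed seed: replicable CI's
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])  # made-up scores

boot_means = []
for _ in range(5000):
    # steps 1-3: draw a bootstrap sample (with replacement) and
    # compute its statistic, many times
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means.append(resample.mean())

# step 4: the 95% percentile CI runs from the 2.5th to the 97.5th
# percentile over the bootstrap samples
ci = np.percentile(boot_means, [2.5, 97.5])
```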
Now, due to the random nature of bootstrapping, running a PROCESS model twice typically results in slightly different CI's. This is undesirable but a fix is to add a /SEED subcommand to the macro call as shown below.
y=wellb
/x=pain
/m=fatigue
/decimals=F10.4
/boot=5000
/conf=95
/model=4
/seed = 20221227. /*MAKE BOOTSTRAPPED CI'S REPLICABLE*/
The random seed can be any positive integer. Personally, I tend to use the current date in YYYYMMDD format (20221227 is 27 December, 2022). An alternative is to run something like SET SEED 20221227. before running PROCESS. In this case, you need to prevent PROCESS from overruling this random seed, which you can do by replacing set seed = !seed. by *set seed = !seed. in line 3,022 of the macro definition.
Strengths & Weaknesses of PROCESS
A first strength of PROCESS is that it can save a lot of time and effort. This holds especially true for more complex models such as serial and moderated mediation.
Second, the bootstrapping procedure implemented in PROCESS is thought to have higher power and more accuracy than alternatives such as the Sobel test.
A weakness, though, is that PROCESS does not generate regression residuals. These are often used to examine model assumptions such as linearity and homoscedasticity as discussed in Linear Regression in SPSS - A Simple Example.
Another weakness is that some very basic models are not possible at all in PROCESS. A simple example is parallel moderation as illustrated below.
This can't be done because PROCESS is limited to a single X variable. Using just SPSS, estimating this model is a piece of cake. It's a tiny extension of the model discussed in SPSS Moderation Regression Tutorial.
A technical weakness is that PROCESS generates over 6,000 lines of syntax when pasted. The reason this happens is that PROCESS is built on 2 long deprecated SPSS techniques:
- the front end is an SPSS custom dialog (.spd) file. These have long been replaced by SPSS extension bundles (.spe files);
- the actual syntax is wrapped into a macro. SPSS macros have been deprecated in favor of Python ages ago.
I hope this will soon be fixed. There's really no need to bother SPSS users with 6,000 lines of source code.
Thanks for reading!
SPSS Label Cleaning Tool
We sometimes receive data files with annoying prefixes or suffixes in variable and/or value labels. This tutorial presents a simple tool for removing these and some other “cleaning” operations.
- Prerequisites and Installation
- Example I - Text Replacement over Variable and Value Labels
- Example II - Remove Suffix from Variable Labels
- Example III - Remove Prefix from Value Labels
Example Data File
All examples in this tutorial use dirty-labels.sav. As shown below, its labels are far from ideal.
- Some variable labels have suffixes that are irrelevant to the final data.
- All value labels are prefixed by the values that represent them.
- Variable and value labels have underscores instead of spaces.
Our tool deals with precisely such issues. Let's try it.
Prerequisites and Installation
First off, this tool requires SPSS version 24 or higher. Next, the SPSS Python 3 essentials must be installed, which is normally the case with recent SPSS versions.
Next, click SPSS_TUTORIALS_CLEAN_LABELS.spe for downloading our tool. You can install it by dragging & dropping it into a data editor window. Alternatively, navigate to
as shown below.
In the dialog that opens, navigate to the downloaded .spe file and select it. SPSS now throws a message that “The extension was successfully installed under Transform - SPSS tutorials - Clean Labels”.
Example I - Text Replacement over Variable and Value Labels
Let's first replace all underscores by spaces in both variable and value labels. We'll open
and fill out the dialog as shown below.
Completing these steps results in the syntax below. Let's run it.
SPSS TUTORIALS CLEAN_LABELS VARIABLES=v1 v2 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19
v20 v21 v22 FIND='_' REPLACEBY=' '
/OPTIONS OPERATION=FIREPCONT PROCESS=BOTH ACTION=BOTH.
Results
First note that all underscores were replaced by spaces in all variable and value labels. This was done by creating and running
- VARIABLE LABELS and
- ADD VALUE LABELS
commands. We chose to have these commands printed to our output window as shown below.
SPSS already ran this syntax but you can also copy-paste it into a syntax window. Like so, the adjustments can be replicated on any SPSS version with or without our tool installed. If there's a lot of syntax, consider moving it into a separate file and running it with INSERT.
Example II - Remove Suffix from Variable Labels
Some variable labels end with “ (proceed to question...” We'll remove these suffixes because they don't convey any interesting information and merely clutter up our output tables and charts.
Again, we start off at
and fill out the dialog as shown below.
Quick tip: you can shorten the resulting syntax by using
- TO for specifying a range of variables such as V1 TO V5;
- ALL for specifying all variables in the active dataset.
We did just that in the syntax below.
SPSS TUTORIALS CLEAN_LABELS VARIABLES=all FIND=' (proceed' REPLACEBY=' '
/OPTIONS OPERATION=FIOCSUC PROCESS=VARLABS ACTION=RUN.
Note that running this syntax removes “ (proceed to” and all characters that follow this expression from all variable labels.
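Conceptually, this suffix removal boils down to simple string handling. The sketch below shows the idea in plain Python; the variable label is made up purely for illustration, and inside SPSS the extension does this work for you.

```python
def remove_suffix(label, find=" (proceed"):
    """Remove the first occurrence of `find` and everything after it."""
    pos = label.find(find)
    return label[:pos] if pos != -1 else label

# Hypothetical variable label, purely for illustration:
print(remove_suffix("Marital status (proceed to question 5)"))
# → Marital status
```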
Example III - Remove Prefix from Value Labels
Another issue we sometimes encounter are value labels being prefixed with the values representing them as
shown below.
Removing “= ” (mind the space) and all characters preceding it from all value labels fixes the problem. The syntax below -created from the same dialog- does just that.
SPSS TUTORIALS CLEAN_LABELS VARIABLES=all FIND='= ' REPLACEBY=' '
/OPTIONS OPERATION=FIOCPRE PROCESS=VALLABS ACTION=RUN.
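Again, the underlying idea is easily sketched in plain Python. The value label below is hypothetical, and we assume the prefix ends at the first occurrence of “= ”:

```python
def remove_prefix(label, find="= "):
    """Remove everything up to and including the first occurrence of `find`."""
    pos = label.find(find)
    return label[pos + len(find):] if pos != -1 else label

# Hypothetical value label, purely for illustration:
print(remove_prefix("1 = Strongly agree"))
# → Strongly agree
```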
Result
After our third and final example, all variable and value labels are nice, short and clean.
So that'll wrap up the examples of our label cleaning tool.
Final Notes
I hope you'll find our tool as helpful as we do. This first version performs 4 cleaning operations that we recently needed for our daily work. We'll probably build in some more options when we (or you?) need them.
So if you've any suggestions or other remarks, please throw us a comment below. Other than that,
thanks for reading!
Kruskal-Wallis Test – Simple Tutorial
- Kruskal-Wallis Test Example
- Kruskal-Wallis Test Assumptions
- Kruskal-Wallis Test Formulas
- Kruskal-Wallis Post Hoc Tests
- APA Reporting a Kruskal-Wallis Test
A Kruskal-Wallis test tests if 3(+) populations have
equal mean ranks on some outcome variable.
The figure below illustrates the basic idea.
- First off, our scores are ranked ascendingly, regardless of group membership.
- Now, if scores are not related to group membership, then the mean ranks should be roughly equal over groups.
- If the mean ranks are very different in our sample, then scores are probably related to group membership in our population as well: some groups tend to have higher scores than other groups.
Kruskal-Wallis Test - Purposes
The Kruskal-Wallis test is a distribution-free alternative for ANOVA: we basically want to know if 3+ populations have equal means on some variable. However,
- ANOVA is not suitable if the dependent variable is ordinal;
- ANOVA requires the dependent variable to be normally distributed in each subpopulation, especially if sample sizes are small.
The Kruskal-Wallis test is a suitable alternative for ANOVA if sample sizes are small and/or the dependent variable is ordinal.
Kruskal-Wallis Test Example
A hospital runs a quick pilot on 3 vaccines: they administer each to N = 5 participants. After a week, they measure the amount of antibodies in the participants’ blood. The data thus obtained are in this Googlesheet, partly shown below.
Now, we'd like to know if some vaccines trigger more antibodies than others in the underlying populations. Since antibodies is a quantitative variable, ANOVA seems the right choice here.
However, ANOVA requires antibodies to be normally distributed in each subpopulation. And due to our minimal sample sizes, we can't rely on the central limit theorem like we usually do (or should anyway). And on top of that,
our sample sizes are too small to examine normality.
Just to emphasize this point, the histograms for antibodies by group are shown below.
If anything, the bottom two histograms seem slightly positively skewed. This makes sense because the amount of antibodies has a lower bound of zero but no upper bound. However, speculations regarding the population distributions don't get any more serious than that.
A particularly bad idea here is trying to demonstrate normality by running
- a Shapiro-Wilk normality test and/or
- a Kolmogorov-Smirnov test.
Due to our tiny sample sizes, these tests are unlikely to reject the null hypothesis of normality. However, that's merely due to their lack of power and doesn't say anything about the population distributions. Put differently: a different null hypothesis (our variable following a uniform or Poisson distribution) would probably not be rejected either for the exact same data.
In short: ANOVA really requires normality for tiny sample sizes but we don't know if it holds. So we can't trust ANOVA results. And that's why we should use a Kruskal-Wallis test instead.
Kruskal-Wallis Test - Null Hypothesis
The null hypothesis for a Kruskal-Wallis test is that
the mean ranks on some outcome variable
are equal across 3+ populations.
Note that the outcome variable must be ordinal or quantitative in order for “mean ranks” to be meaningful.
Many textbooks propose an incorrect null hypothesis such as:
- some outcome variable has equal medians over 3+ populations or
- some outcome variable follows identical distributions over 3+ populations.
So why are these incorrect? Well, the Kruskal-Wallis formula uses only 2 statistics: rank sums and the sample sizes on which they're based. It completely ignores everything else about the data -including medians and frequency distributions. Neither of these affects whether the null hypothesis is (not) rejected.
If that still doesn't convince you, we may add some example data files to this tutorial. These illustrate that wildly different medians or frequency distributions don't always result in a “significant” Kruskal-Wallis test (or the reverse).
Kruskal-Wallis Test Assumptions
A Kruskal-Wallis test requires 3 assumptions1,5,8:
- independent observations;
- the dependent variable must be quantitative or ordinal;
- sufficient sample sizes (say, each \(n_i \geq 5\)) unless the exact significance level is computed.
Regarding the last assumption, exact p-values for the Kruskal-Wallis test can be computed. However, this is rarely done because it often requires very heavy computations. Some exact p-values are also found in Use of Ranks in One-Criterion Variance Analysis.
Instead, most software computes approximate (or “asymptotic”) p-values based on the chi-square distribution. This approximation is sufficiently accurate if the sample sizes are large enough. There's no real consensus with regard to required sample sizes: some authors1 propose each \(n_i \geq 4\) while others6 suggest each \(n_i \geq 6\).
Kruskal-Wallis Test Formulas
First off, we rank the values on our dependent variable ascendingly, regardless of group membership. We did just that in this Googlesheet, partly shown below.
Next, we compute the sum over all ranks for each group separately.
We then enter a) our sample sizes and b) our rank sums into the following formula:
$$Kruskal\;Wallis\;H = \frac{12}{N(N + 1)}\sum\limits_{i = 1}^k\frac{R_i^2}{n_i} - 3(N + 1)$$
where
- \(N\) denotes the total sample size;
- \(k\) denotes the number of groups we're comparing;
- \(R_i\) denotes the rank sum for group \(i\);
- \(n_i\) denotes the sample size for group \(i\).
For our example, that'll be
$$Kruskal\;Wallis\;H = \frac{12}{15(15 + 1)}(\frac{55^2}{5}+\frac{20^2}{5}+\frac{45^2}{5}) - 3(15 + 1) =$$
$$Kruskal\;Wallis\;H = 0.05\cdot(605 + 80 + 405) - 48 = 6.50$$
\(H\) approximately follows a chi-square (written as \(\chi^2\)) distribution with
$$df = k - 1$$
degrees of freedom (\(df\)) for \(k\) groups. For our example,
$$df = 3 - 1 = 2$$
so our significance level is
$$\chi^2(2) = 6.50, p \approx 0.039.$$
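If you'd like to verify these calculations outside SPSS, the sketch below reproduces them in plain Python from the rank sums shown earlier. Note that the exp(-H / 2) shortcut for the chi-square right-tail probability only holds for df = 2:

```python
from math import exp

# Rank sums and group sizes for vaccines A, B and C
rank_sums = [55, 20, 45]
n = [5, 5, 5]
N = sum(n)  # total sample size: 15

# Kruskal-Wallis H as per the formula above
H = 12 / (N * (N + 1)) * sum(R ** 2 / ni for R, ni in zip(rank_sums, n)) - 3 * (N + 1)

# Right-tail chi-square probability; for df = 2 this simplifies to exp(-H / 2)
df = len(n) - 1
p = exp(-H / 2)

print(round(H, 2), round(p, 3))
# → 6.5 0.039
```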
The SPSS output for our example, shown below, confirms our calculations.
So what do we conclude now? Well, assuming alpha = 0.05, we reject our null hypothesis: the population mean ranks of antibodies are not equal among vaccines. In normal language, our 3 vaccines do not perform equally well. Judging from the mean ranks, it seems vaccine B performs worse than its competitors: its mean rank is lower and this means that it triggered fewer antibodies than the other vaccines.
Kruskal-Wallis Post Hoc Tests
Thus far, we concluded that the amounts of antibodies differ among our 3 vaccines. So precisely which vaccine differs from which vaccine? To find out, we'll compare each vaccine to each other vaccine. This procedure is generally known as running post-hoc tests.
Contrary to popular belief, Kruskal-Wallis post-hoc tests are not equivalent to Bonferroni-corrected Mann-Whitney tests. Instead, each possible pair of groups is compared using the following formula:
$$Z_{kw} = \frac{\overline{R}_i - \overline{R}_j}{\sqrt{\frac{N(N + 1)}{12}(\frac{1}{n_i}+\frac{1}{n_j})}}$$
where
- our test statistic, \(Z_{kw}\), approximately follows a standard normal distribution;
- \(\overline R_i\) denotes the mean rank for group \(i\);
- \(N\) denotes the total sample size (including groups not used in this pairwise comparison);
- \(n_i\) denotes the sample size for group \(i\).
For comparing vaccines A and B, that'll be
$$Z_{kw} = \frac{11 - 4}{\sqrt{\frac{15(15 + 1)}{12}(\frac{1}{5}+\frac{1}{5})}} \approx 2.475 $$
$$P(|Z_{kw}| > 2.475) \approx 0.013$$
A Bonferroni correction is usually applied to this p-value because we're running multiple comparisons on (partly) the same observations. The number of pairwise comparisons for \(k\) groups is
$$N_{comp} = \frac{k (k - 1)}{2}$$
Therefore, the Bonferroni corrected p-value for our example is
$$P_{Bonf} = 0.013 \cdot \frac{3 (3 - 1)}{2} \approx 0.040$$
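The same numbers are easily verified in plain Python, computing the two-sided standard normal p-value via the complementary error function:

```python
from math import sqrt, erfc

# Mean ranks and group sizes for vaccines A and B;
# N and k include vaccine C, which is not part of this pairwise comparison
R_a, R_b = 11, 4
n_a = n_b = 5
N, k = 15, 3

z = (R_a - R_b) / sqrt(N * (N + 1) / 12 * (1 / n_a + 1 / n_b))
p = erfc(abs(z) / sqrt(2))            # two-sided standard normal p-value
p_bonf = min(p * k * (k - 1) / 2, 1)  # Bonferroni: multiply by number of comparisons

print(round(z, 3), round(p, 3), round(p_bonf, 2))
# → 2.475 0.013 0.04
```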
The screenshot from SPSS (below) confirms these findings.
Oddly, the difference between mean ranks, \(\overline{R}_i - \overline{R}_j\), is denoted as “Test Statistic”.
The actual test statistic, \(Z_{kw}\), is denoted as “Std. Test Statistic”.
APA Reporting a Kruskal-Wallis Test
For APA reporting our example analysis, we could write something like
“a Kruskal-Wallis test indicated that the amount of antibodies
differed over vaccines, H(2) = 6.50, p = 0.039.”
Although the APA doesn't mention it, we encourage reporting the mean ranks and perhaps some other descriptive statistics in a separate table as well.
Right, so that should do. If you've any questions or remarks, please throw me a comment below. Other than that:
Thanks for reading!
References
- Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.
- Warner, R.M. (2013). Applied Statistics (2nd. Edition). Thousand Oaks, CA: SAGE.
- Agresti, A. & Franklin, C. (2014). Statistics. The Art & Science of Learning from Data. Essex: Pearson Education Limited.
- Field, A. (2013). Discovering Statistics with IBM SPSS Statistics. Newbury Park, CA: Sage.
- Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
- Siegel, S. & Castellan, N.J. (1989). Nonparametric Statistics for the Behavioral Sciences (2nd ed.). Singapore: McGraw-Hill.
- Slotboom, A. (1987). Statistiek in woorden [Statistics in words]. Groningen: Wolters-Noordhoff.
- Kruskal, W.H. & Wallis, W.A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.
SPSS – Kendall’s Concordance Coefficient W
Kendall’s Concordance Coefficient W is a number between 0 and 1
that indicates interrater agreement.
So let's say we had 5 people rank 6 different beers as shown below. We obviously want to know which beer is best, right? But could we also quantify how much these raters agree with each other? Kendall’s W does just that.
Kendall’s W - Example
So let's take a really good look at our beer test results. The data -shown above- are in beertest.sav. For answering which beer was rated best, a Friedman test would be appropriate because our rankings are ordinal variables. A second question, however, is to what extent all 5 judges agree on their beer rankings. If our judges don't agree at all on which beers were best, then we can't possibly take their conclusions very seriously. Now, we could say that “our judges agreed to a large extent” but we'd like to be more precise and express the level of agreement in a single number. This number is known as Kendall’s Coefficient of Concordance W.2,3
Kendall’s W - Basic Idea
Let's consider the 2 hypothetical situations depicted below: perfect agreement and perfect disagreement among our raters. I invite you to stare at them and think for a minute.
As we see, the extent to which raters agree is indicated by the extent to which the column totals differ. We can express the extent to which numbers differ as a number: the variance or standard deviation.
Kendall’s W is defined as
$$W = \frac{Variance\,over\,column\,totals}{Maximum\,possible\,variance\,over\,column\,totals}$$
As a result, Kendall’s W is always between 0 and 1. For instance, our perfect disagreement example has W = 0; because all column totals are equal, their variance is zero.
Our perfect agreement example has W = 1 because the variance among column totals is equal to the maximal possible variance. No matter how you rearrange the rankings, you can't possibly increase this variance any further. Don't believe me? Give it a go then.
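For those who'd like to see this in action, the sketch below computes W as the ratio just defined. The rankings are made up (3 raters, 3 items, purely for illustration) and are not our beer data:

```python
def kendalls_w(rankings):
    """Kendall's W: variance of the column (item) rank totals divided by
    the maximum variance these totals can reach (perfect agreement)."""
    k = len(rankings)     # number of raters
    m = len(rankings[0])  # number of items ranked by each rater
    totals = [sum(col) for col in zip(*rankings)]

    def variance(xs):
        xs = list(xs)
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    # Under perfect agreement, the totals are k*1, k*2, ..., k*m,
    # so their variance is k squared times the variance of 1..m.
    max_variance = k ** 2 * variance(range(1, m + 1))
    return variance(totals) / max_variance

# Hypothetical rankings, purely for illustration:
agreement = [[1, 2, 3]] * 3                       # all raters rank identically
disagreement = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]  # all column totals equal

print(round(kendalls_w(agreement), 3), round(kendalls_w(disagreement), 3))
# → 1.0 0.0
```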
So what about our actual beer data? We'll quickly find out with SPSS.
Kendall’s W in SPSS
We'll get Kendall’s W from SPSS’ menu. The screenshots below walk you through.
Note: SPSS thinks our rankings are nominal variables. This is because they contain few distinct values. Fortunately, this won't interfere with the current analysis. Completing these steps results in the syntax below.
Kendall’s W - Basic Syntax
NPAR TESTS
/KENDALL=beer_a beer_b beer_c beer_d beer_e beer_f
/MISSING LISTWISE.
Kendall’s W - Output
And there we have it: Kendall’s W = 0.78. Our beer judges agree with each other to a reasonable but not super high extent. Note that we also get a table with the (column) mean ranks that tells us which beer was rated most favorably.
Average Spearman Correlation over Judges
Another measure of concordance is the average over all possible Spearman correlations among all judges.1 It can be calculated from Kendall’s W with the following formula:
$$\overline{R}_s = {kW - 1 \over k - 1}$$
where \(\overline{R}_s\) denotes the average Spearman correlation and \(k\) the number of judges.
For our example, this comes down to
$$\overline{R}_s = {5(0.781) - 1 \over 5 - 1} = 0.726$$
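This one-liner is trivial to check yourself. Here's the computation in Python, using W = 0.781 (Kendall’s W before rounding to 0.78) and k = 5 judges:

```python
def avg_spearman(w, k):
    """Average Spearman correlation over all judge pairs, from Kendall's W."""
    return (k * w - 1) / (k - 1)

print(round(avg_spearman(0.781, 5), 3))
# → 0.726
```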
We could also verify this by running and averaging all possible Spearman correlations in SPSS. We'll leave that for a future tutorial, however, as doing so properly requires some highly unusual -but interesting- syntax.
Thank you for reading!
References
- Howell, D.C. (2002). Statistical Methods for Psychology (5th ed.). Pacific Grove CA: Duxbury.
- Slotboom, A. (1987). Statistiek in woorden [Statistics in words]. Groningen: Wolters-Noordhoff.
- Van den Brink, W.P. & Koele, P. (2002). Statistiek, deel 3 [Statistics, part 3]. Amsterdam: Boom.