Mahalanobis Distances in SPSS - A Quick Guide

Summary

In SPSS, you can compute (squared) Mahalanobis distances as a new variable in your data file. To do so, navigate to Analyze > Regression > Linear and open the “Save” subdialog as shown below.

SPSS Mahalanobis Distances Regression Dialog

Keep in mind that Mahalanobis distances are computed only over the independent variables. The dependent variable does not affect them unless it has any missing values. In that case, the situation becomes rather complicated, as I'll cover near the end of this article.

Mahalanobis Distances - Basic Reasoning

Before analyzing any data, we need to know whether they're plausible in the first place. One aspect of doing so is checking for outliers: observations that are substantially different from the other observations. One approach here is to inspect each variable separately and the main options for doing so are

SPSS Outliers In Histogram

Now, when analyzing multiple variables simultaneously, a better alternative is to check for multivariate outliers: combinations of scores on 2 or more variables that are extreme or unusual. Precisely how extreme or unusual a combination of scores is, is usually quantified by its Mahalanobis distance.

The basic idea here is to add up how much each score differs from the mean while taking into account the (Pearson) correlations among the variables. So why is that a good idea? Well, let's first take a look at the scatterplot below, showing 2 positively correlated variables.

Mahalanobis Distance Scatterplot A

The highlighted observation has rather high z-scores on both variables. However, this makes sense: a positive correlation means that cases scoring high on one variable tend to score high on the other variable too. The (squared) Mahalanobis distance \(D^2\) = 7.67 and this is well within a normal range.

So let's now compare this to the second scatterplot shown below.

Mahalanobis Distance Scatterplot B

The highlighted observation has a rather high z-score on variable A but a rather low one on variable B. This is highly unusual for variables that are positively correlated. This observation is therefore a clear multivariate outlier: its (squared) Mahalanobis distance \(D^2\) = 18.03, p < .0005. Two final points on these scatterplots are the following:

Mahalanobis Distances - Formula and Properties

Software for applied data analysis (including SPSS) usually computes squared Mahalanobis distances as

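\(D^2_i = (\mathbf{x_i} - \overline{\mathbf{x}})'\;\mathbf{S}^{-1}\;(\mathbf{x_i} - \overline{\mathbf{x}})\)

where \(\mathbf{x_i}\) is the vector of scores of case \(i\) on the variables involved, \(\overline{\mathbf{x}}\) is the vector of sample means of these variables and \(\mathbf{S}\) is their sample covariance matrix.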
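As a worked illustration with hypothetical numbers (not the ones from the scatterplots shown earlier): for two standardized variables with correlation \(r\), the covariance matrix is simply the correlation matrix and the formula reduces to

\(D^2 = \dfrac{z_1^2 - 2\,r\,z_1 z_2 + z_2^2}{1 - r^2}\)

Assuming \(r = 0.5\), a case with z-scores \((2, 2)\) has \(D^2 = (4 - 4 + 4)\,/\,0.75 \approx 5.33\), whereas a case with z-scores \((2, -2)\) has \(D^2 = (4 + 4 + 4)\,/\,0.75 = 16\). So the same absolute z-scores result in a much larger distance when their combination runs against the correlation.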

Some basic properties are that

Finding Mahalanobis Distances in SPSS

In SPSS, you can use the linear regression dialogs to compute squared Mahalanobis distances as a new variable in your data file. To do so, navigate to Analyze > Regression > Linear and open the “Save” subdialog as shown below.

SPSS Mahalanobis Distances Regression Dialog
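
For those who prefer syntax over the menu, the dialog choices can be sketched roughly as follows. The variable names y, x1, x2 and x3 are hypothetical placeholders, as is the name mah_1 for the new variable, so replace them with the variables in your own data.

```spss
*Sketch: save squared Mahalanobis distances as new variable mah_1.
*Variable names y, x1, x2 and x3 are hypothetical placeholders.

regression
/missing listwise
/dependent y
/method=enter x1 x2 x3
/save mahal(mah_1).
```

After running this, mah_1 should appear as the last variable in the active dataset.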

Again, Mahalanobis distances are computed only over the independent variables. Although this is in line with most textbooks, it makes more sense to me to include the dependent variable as well. You could do so by

Finally, if you have any missing values on either the dependent or any of the independent variables, things get rather complicated. I'll discuss the details at the end of this article.

Critical Values Table for Mahalanobis Distances

After computing and inspecting (squared) Mahalanobis distances, you may wonder: how large is too large? Sadly, there's no simple rule of thumb here but most textbooks suggest that (squared) Mahalanobis distances for which p < .001 are suspicious for reasonable sample sizes. Since p also depends on the number of variables involved, we created a handy overview table in this Google Sheet, partly shown below.

Critical Values Mahalanobis Distances
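
As an alternative to looking up critical values, p-values can be computed in SPSS itself. The sketch below relies on the common approach of comparing squared Mahalanobis distances to a chi-square distribution with degrees of freedom equal to the number of variables involved. It assumes that the distances are in a variable named mah_1 and were computed over 3 independent variables. Both are assumptions for illustration only.

```spss
*Sketch: p-values for squared Mahalanobis distances.
*Assumes distances are in mah_1 and were computed over 3 independent variables.

compute p_mah = 1 - cdf.chisq(mah_1, 3).
execute.

*Flag cases for which p < .001 and move them to the top of the file.

compute mah_flag = (p_mah < .001).
sort cases by mah_flag (d) mah_1 (d).
```

The critical value itself could be obtained with idf.chisq(.999, 3), which is roughly 16.27 for 3 variables.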

Mahalanobis Distances & Missing Values

Missing values on either the dependent or any of the independent variables may affect Mahalanobis distances. Precisely when and how depends on which option you choose for handling missing values in the linear regression dialogs as shown below.

SPSS Linear Regression Missing Values Dialog

If you select listwise exclusion,

If you select pairwise exclusion,

If you select replace with mean,
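
In syntax, these three options correspond to the /MISSING subcommand of REGRESSION with the keywords LISTWISE, PAIRWISE and MEANSUBSTITUTION. A minimal sketch with hypothetical variable names (y, x1, x2, x3 and mah_pairwise) is shown below for pairwise exclusion.

```spss
*Sketch: choose a missing values option via the /missing subcommand.
*Replace PAIRWISE by LISTWISE or MEANSUBSTITUTION for the other two options.
*Variable names are hypothetical placeholders.

regression
/missing pairwise
/dependent y
/method=enter x1 x2 x3
/save mahal(mah_pairwise).
```

Saving the distances under a different name for each option makes it easy to compare the results side by side.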


SPSS – Reorder Variables from Syntax

While working in SPSS, it's pretty common to reorder your variables. This tutorial shows how to do so the right way. We encourage you to try the examples for yourself by downloading and opening hotel_evaluation.sav, a screenshot of which is shown below.

SPSS Reorder Variables by Syntax Data View

SPSS Reorder Variables Example 1

Right, the most common way for reordering variables in SPSS is by running ADD FILES. Before explaining how it works, let's first show that it works in the first place. So suppose we'd like to swap the variables fname and sex. Running the syntax below does just that.

*Reorder variables example 1.

add files file *
/keep id sex fname all.

execute.

Result

SPSS Reorder Variables by Syntax Data View

How Does It Work?

Remember that ADD FILES merges data sources holding different cases but similar variables. Note that in SPSS syntax, an * addresses the active dataset (often the only data that's open in SPSS).
Now, if we specify this as the only data source, it gets merged with nothing. This means that running add files file *. does absolutely nothing whatsoever. However, this command becomes useful if we add a /keep subcommand which tells SPSS which variables to keep (not delete) and in which order.
In our first example, we decided to keep id sex fname. Note that the original order here was id fname sex.
Finally, all is a special keyword that addresses all other variables in their original order. Omitting all deletes these variables from our data. The final result of all this is that we basically do nothing except keep all variables in a different order than previously, which is exactly what we were intending.
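
To see what the all keyword does, you could run the variation below, which is identical to the first example except that all is omitted. Since it removes variables from the active dataset, it is best tried on a copy of the data.

```spss
*Caution: omitting ALL removes every variable not listed after /keep.
*Try this on a copy of the data only.

add files file *
/keep id sex fname.
execute.
```

After running it, only id, sex and fname should be left in the data.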

SPSS Reorder Variables Example 2

Right, we just swapped two variables in our data. But what about moving entire blocks of variables? For instance, suppose we want to have variables q1 through q5 right after fname in our data. Well, this doesn't pose any challenge whatsoever if we use SPSS’ TO keyword. The syntax below shows just how easily it's done.

*Move entire block of variables forwards within file.

add files file *
/keep id to fname q1 to q5 all.
execute.

Result

SPSS Reorder Variables by Syntax Data View

SPSS Reorder Variables Example 3

Recap: if we have some data open in SPSS, we can easily reorder our variables by using an ADD FILES command with a keep subcommand.
Is that the only way to get it done? Well, no. If we're done with a dataset, we may use the exact same keep subcommand in a save command as shown below (first example). This is never really necessary but it may save a tiny amount of syntax and processing time.
Just for the sake of completeness, if we know in advance which variables are present in our data, we may also reorder them while opening the data file with get.

SPSS Reorder Variables Syntax Examples

*1. Reorder variables when saving data file.

save outfile '10_all_data_prepared.sav'
/keep q1 to q5 all.


*2. Reorder variables when opening data file.

get file 'hotel_evaluation.sav'
/keep q1 to q5 all.

SPSS SORT VARIABLES

There's just one more thing that we need to mention: SPSS (starting from version 16) also has a SORT VARIABLES command. With it, you can sort variables according to variable name, variable type, variable label, format or a couple of other properties.
We obviously needed to mention this command here. However, we very rarely use it in practice. The reason is that the aforementioned sorting options virtually never correspond to the order we desire for our variables. Those who like to give it a quick try anyway, may run something like sort variables by name.
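
A few variations are sketched below. The (a) and (d) keywords request ascending and descending order respectively.

```spss
*Sort variables alphabetically by variable name, ascending.

sort variables by name (a).

*Sort variables by variable label, descending.

sort variables by label (d).
```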

SPSS Reorder Variables - Final Note

In some cases, you may need to sort your variables in a structured manner that's more complex than covered by this tutorial. An example is sorting (many) variables according to subscripts in their variable names. The way to get such cases done efficiently is by using Python in SPSS.

SPSS Datasets Tutorial 1 – Basics

Introduction

SPSS dataset logic is not always logical. However, for working proficiently with datasets, just a handful of basics is sufficient. These are explained in this tutorial.

This tutorial focuses on working with SPSS datasets. For a definition and some background on datasets, see SPSS Datasets.

Working with SPSS Datasets

*Set working directory and open data file.

cd 'd:/downloads'.
get file 'idols.sav'.

Untitled Datasets

An Untitled Dataset in SPSS

Named Datasets

*Open idols.sav and apply name to dataset.

get file 'idols.sav'.
dataset name idols_data.

*Open service_provider.sav and apply name to dataset.

get file 'service_provider.sav'.
dataset name service_data.

*Compute test variable.

compute test_0 = 0.
exe.

Now you have two open datasets. The first didn't close upon opening the second because a name ("idols_data") was applied to it.

The Active Dataset

The Active Dataset in SPSS

*Compute test variable in idols_data.

dataset activate idols_data.
compute test_1 = 1.
exe.

Activating idols_data before the COMPUTE command ensures that the new variable will be created in this dataset.
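
If you lose track of which datasets are open, DATASET DISPLAY lists them in the output window. The sketch below also switches to the other dataset. The variable name test_2 is just a hypothetical example.

```spss
*List all open datasets in the output window.

dataset display.

*Switch to service_data and compute a test variable there.

dataset activate service_data.
compute test_2 = 2.
exe.
```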

Closing SPSS Datasets

*Close datasets. Alternatively, use "dataset close all." instead of the two lines below.

dataset close idols_data.
dataset close service_data.

*Get rid of the last open dataset.

new file.

SPSS Missing Values Functions

Most real world data contain some (or many) missing values. It's always a good idea to inspect the amount of missingness in order to avoid unpleasant surprises later on. For this purpose, SPSS has some missing values functions that are mostly used with COMPUTE, IF and DO IF. This tutorial demonstrates how to use them effectively. We'll do so by using the last 5 variables in hospital.sav.

SPSS Hospital Data

Setting User Missing Values

Before discussing SPSS missing values functions, we'll first set 6 as a user missing value for the last 5 variables by running the line of syntax below.

missing values doctor_rating to facilities_rating (6).

SPSS Missing Values Functions

| Expression | Meaning | Returns |
|---|---|---|
| MISSING | Evaluate whether value is system missing or user missing | True or false |
| SYSMIS | Evaluate whether value is system missing | True or false |
| NMISS | Return number of missing values over variables | Numeric value |
| NVALID | Return number of valid values over variables | Numeric value |

SPSS MISSING Function

SPSS MISSING function evaluates whether a value is missing (either a user missing value or a system missing value). For example, we'll flag cases that have a missing value on doctor_rating with the syntax below. If the COMPUTE command puzzles you, see Compute A = B = C for an explanation.

*1. Flag cases having a missing value on doctor_rating.

compute mis_1 = missing(doctor_rating).

*2. Move flagged cases to top of file.

sort cases mis_1 (d).

SPSS Missing Function Result

SPSS SYSMIS Function

SPSS SYSMIS function evaluates whether a value is system missing. For example, the syntax below uses IF to replace all system missing values by 99. We'll then label it, specify it as user missing and run a quick check with FREQUENCIES.

*1. Change system missing values to 99.

if sysmis(doctor_rating) doctor_rating = 99.

*2. Add value label 99.

add value labels doctor_rating 99 'Recoded system missing value'.

*3. Specify 6 and 99 as user missings.

missing values doctor_rating (6,99).

*4. Quick check.

frequencies doctor_rating.

SPSS SYSMIS function result

SPSS NMISS Function

SPSS NMISS function counts missing values within cases over variables. Cases with many missing values may be suspicious and you may want to exclude them from analysis with FILTER or SELECT IF. The syntax below runs a quick scan for such cases.

*1. Compute variable indicating missings per case.

compute mis_2 = nmiss(doctor_rating to facilities_rating).

*2. Apply variable label. Tip: indicate number of variables involved here.

variable labels mis_2 'Number of missing values over doctor_rating to facilities_rating (5 variables)'.

*3. Quick check.

frequencies mis_2.

SPSS NMISS Function Result
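
Excluding such cases with FILTER could look like the sketch below. The cutoff of 2 missing values and the variable name filt_1 are arbitrary choices for illustration, not recommendations.

```spss
*Sketch: exclude cases having more than 2 missing values from analyses.
*The cutoff of 2 is an arbitrary choice for illustration.

compute filt_1 = (mis_2 <= 2).
filter by filt_1.

*Run an analysis on the filtered cases only.

frequencies doctor_rating.

*Switch the filter off again.

filter off.
```

Unlike SELECT IF, FILTER doesn't delete any cases from the data, so it's easily undone.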

SPSS NVALID Function

SPSS NVALID function counts the number of valid values over variables. It is equivalent to the number of variables minus NMISS over those variables. Note that the dot operator is a faster alternative for excluding cases from statistical functions (such as MEAN and SUM).

*Count valid values over doctor_rating to facilities_rating (5 variables).

compute valid_1 = nvalid(doctor_rating to facilities_rating).
exe.
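
Both remarks above can be sketched in syntax as follows. The names check_1 and mean_1 are hypothetical, and the minimum of 3 valid values in mean.3 is an arbitrary choice.

```spss
*Check: valid plus missing values should add up to the number of variables (5).

compute check_1 = valid_1 + nmiss(doctor_rating to facilities_rating).

*Dot operator: compute means only for cases having at least 3 valid values.

compute mean_1 = mean.3(doctor_rating to facilities_rating).
exe.

*Quick check.

frequencies check_1.
```

Cases with fewer than 3 valid values get a system missing value on mean_1.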