- Summary
- Mahalanobis Distances - Basic Reasoning
- Mahalanobis Distances - Formula and Properties
- Finding Mahalanobis Distances in SPSS
- Critical Values Table for Mahalanobis Distances
- Mahalanobis Distances & Missing Values

## Summary

In SPSS, you can compute (squared) Mahalanobis distances as a new variable in your data file. To do so, open the linear regression dialog and then its “Save” subdialog as shown below.

Keep in mind that Mahalanobis distances are computed *only* over the independent variables. The dependent variable does not affect them unless it has missing values. In that case, the situation becomes rather complicated, as I'll cover near the end of this article.

## Mahalanobis Distances - Basic Reasoning

Before analyzing any data, we need to know whether they're plausible at all. One aspect of doing so is checking for **outliers**: observations that are *substantially different* from the other observations. One approach is to inspect each variable separately. The main options for doing so are

- inspecting histograms;
- inspecting boxplots or
- inspecting z-scores.

Now, when analyzing multiple variables simultaneously, a better alternative is to check for **multivariate outliers**: *combinations* of scores on 2 or more variables that are extreme or unusual. Precisely *how* extreme or unusual a combination of scores is, is usually quantified by its Mahalanobis distance.

The basic idea here is to add up how much each score differs from the mean while taking into account the (Pearson) correlations among the variables. So why is that a good idea? Well, let's first take a look at the scatterplot below, showing 2 positively correlated variables.

The highlighted observation has rather high z-scores on both variables. However, this makes sense: a positive correlation means that cases scoring high on one variable tend to score high on the other variable too. The (squared) Mahalanobis distance \(D^2\) = 7.67 and this is well within a normal range.

So let's now compare this to the second scatterplot shown below.

The highlighted observation has a rather high z-score on variable A but a rather low one on variable B. This is highly unusual for variables that are positively correlated. Therefore, this observation is a *clear* multivariate outlier: its (squared) Mahalanobis distance \(D^2\) = 18.03, *p* < .0005.

Two final points on these scatterplots are the following:

- the (univariate) z-scores **fail** to detect that the highlighted observation in the second scatterplot is highly unusual;
- this observation has a huge impact on the correlation between the variables and is thus an **influential data point**. Again, this is detected by the (squared) Mahalanobis distance but not by z-scores, histograms or even boxplots.
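The contrast between the two scatterplots can be sketched in a few lines of Python. For two standardized variables with Pearson correlation \(r\), the squared distance reduces to \(D^2 = (z_1^2 - 2 r z_1 z_2 + z_2^2) / (1 - r^2)\). The correlation of 0.8 and the z-scores of ±2 below are made-up illustration values, not the data behind the scatterplots, and the function name is hypothetical.

```python
def squared_distance_2d(z1, z2, r):
    """Squared Mahalanobis distance for two z-scores with correlation r."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return (z1 ** 2 - 2 * r * z1 * z2 + z2 ** 2) / (1 - r ** 2)


r = 0.8  # hypothetical positive correlation

# Same absolute z-scores, but in line with vs. against the correlation
same_direction = squared_distance_2d(2.0, 2.0, r)       # high on both variables
opposite_direction = squared_distance_2d(2.0, -2.0, r)  # high on A, low on B

print(round(same_direction, 2), round(opposite_direction, 2))
```

Both cases have identical absolute z-scores, yet the second squared distance is roughly nine times the first (about 4.4 versus 40), because it runs against the positive correlation.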

## Mahalanobis Distances - Formula and Properties

Software for applied data analysis (including SPSS) usually computes *squared* Mahalanobis distances as

\(D^2_i = (\mathbf{x}_i - \overline{\mathbf{x}})'\;\mathbf{S}^{-1}\;(\mathbf{x}_i - \overline{\mathbf{x}})\)

where

- \(D^2_i\) denotes the squared Mahalanobis distance for case \(i\);
- \(\mathbf{x}_i\) denotes the vector of scores for case \(i\);
- \(\overline{\mathbf{x}}\) denotes the vector of means (centroid) over all cases;
- \(\mathbf{S}\) denotes the covariance matrix over all variables.
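As a rough illustration of the formula, the sketch below computes squared Mahalanobis distances in plain Python. It assumes that \(\mathbf{S}\) is the sample covariance matrix with \(n - 1\) in the denominator. The data and all function names are made up for this example. Instead of inverting \(\mathbf{S}\) explicitly, it solves \(\mathbf{S}\mathbf{w} = \mathbf{x}_i - \overline{\mathbf{x}}\) for each case, which gives the same result.

```python
def mean_vector(rows):
    """Centroid: the mean of each variable over all cases."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 in the denominator)."""
    n = len(rows)
    means = mean_vector(rows)
    k = len(means)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
            for a in range(k)]


def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gauss-Jordan elimination with pivoting."""
    k = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def squared_mahalanobis(rows):
    """D^2 for every case: (x_i - mean)' S^-1 (x_i - mean)."""
    means = mean_vector(rows)
    cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        diff = [x - m for x, m in zip(row, means)]
        weighted = solve(cov, diff)  # equals S^-1 (x_i - mean)
        distances.append(sum(d * w for d, w in zip(diff, weighted)))
    return distances


# Hypothetical scores for 6 cases on 2 variables
data = [(2.0, 3.0), (4.0, 5.0), (5.0, 4.0), (7.0, 8.0), (8.0, 9.0), (9.0, 3.0)]
d2 = squared_mahalanobis(data)
print([round(value, 3) for value in d2])
```

A quick sanity check under this assumption: the squared distances over all cases add up to \((n - 1) \cdot k\), so 10 for the 6 cases and 2 variables used here.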

Some **basic properties** are that

- Mahalanobis distances can (theoretically) **range** from zero to infinity;
- Mahalanobis distances are **standardized**: they are scale independent so they are unaffected by any linear transformations to the variables they're computed on;
- Mahalanobis distances for a **single variable** are equal to (absolute) z-scores;
- *squared* Mahalanobis distances computed over *k* variables follow a **χ²-distribution** with *df* = *k* under the assumption of multivariate normality.
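Two of these properties are easy to check numerically. The sketch below uses made-up data and hypothetical function names, and it assumes the sample covariance matrix with \(n - 1\) in the denominator. It first rescales both variables linearly and compares the squared distances, and then compares the single-variable case with squared z-scores.

```python
import math


def squared_distances_2d(points):
    """Squared Mahalanobis distances for 2 variables (explicit 2x2 inverse)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    return [(syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det for x, y in points]


def squared_distances_1d(values):
    """Squared Mahalanobis distances for 1 variable: (x - mean)^2 / variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return [(v - mean) ** 2 / variance for v in values]


def z_scores(values):
    """Ordinary z-scores based on the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical scores for 6 cases on 2 variables
points = [(2.0, 3.0), (4.0, 5.0), (5.0, 4.0), (7.0, 8.0), (8.0, 9.0), (9.0, 3.0)]

# Property 1: linear transformations of the variables leave D^2 unchanged
original = squared_distances_2d(points)
rescaled = squared_distances_2d([(100 * x + 5, 2 - 0.3 * y) for x, y in points])
scale_free = all(math.isclose(a, b, rel_tol=1e-9)
                 for a, b in zip(original, rescaled))

# Property 2: for a single variable, D^2 equals the squared z-score
first_variable = [x for x, _ in points]
matches_z = all(math.isclose(d2, z ** 2, rel_tol=1e-9)
                for d2, z in zip(squared_distances_1d(first_variable),
                                 z_scores(first_variable)))

print(scale_free, matches_z)
```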

## Finding Mahalanobis Distances in SPSS

In SPSS, you can use the linear regression dialogs to compute squared Mahalanobis distances as a new variable in your data file. To do so, open the linear regression dialog and then its “Save” subdialog as shown below.

Again, Mahalanobis distances are computed only over the *independent* variables. Although this is in line with most textbooks, it makes more sense to me to include the dependent variable as well. You could do so by

- adding the actual dependent variable to the independent variables and
- temporarily using an alternative dependent variable that is neither a constant, nor has any missing values.

Finally, if you have any **missing values** on either the dependent or any of the independent variables, things get rather complicated. I'll discuss the details at the end of this article.

## Critical Values Table for Mahalanobis Distances

After computing and inspecting (squared) Mahalanobis distances, you may wonder: how large is *too* large? Sadly, there's no simple rule of thumb here, but most textbooks suggest that (squared) Mahalanobis distances for which *p* < .001 are suspicious for reasonable sample sizes. Since *p* also depends on the number of variables involved, we created a handy overview table in this Google Sheet, partly shown below.
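Such critical values can also be approximated directly from the χ²-distribution with *df* equal to the number of variables. The sketch below is a dependency-free Python version: `chi2_sf` and `critical_value` are hypothetical helper names, the upper-tail probability is built from the standard recurrence for the incomplete gamma function, and the critical value is found by simple bisection. In practice, a statistics library would normally do this for you.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)              # exact result for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))   # exact result for df = 1
    while a < df / 2.0:                       # step up 2 degrees of freedom
        q += math.exp(-h) * h ** a / math.gamma(a + 1.0)
        a += 1.0
    return q


def critical_value(df, alpha=0.001):
    """Smallest D^2 for which p falls below alpha, found by bisection."""
    low, high = 0.0, 1.0
    while chi2_sf(high, df) > alpha:
        high *= 2.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if chi2_sf(mid, df) > alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Critical values at p < .001 for 1 through 5 variables
for k in range(1, 6):
    print(k, round(critical_value(k), 3))

# p-values for the two scatterplot examples (2 variables, so df = 2)
p_first = chi2_sf(7.67, 2)
p_second = chi2_sf(18.03, 2)
print(round(p_first, 4), round(p_second, 5))
```

For 2 variables this gives a critical value of roughly 13.82, so the \(D^2\) = 18.03 from the second scatterplot exceeds it while the \(D^2\) = 7.67 from the first does not.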

## Mahalanobis Distances & Missing Values

Missing values on either the dependent or any of the independent variables may affect Mahalanobis distances. Precisely when and how depends on which option you choose for handling missing values in the linear regression dialogs as shown below.

If you select **listwise exclusion**,

- Mahalanobis distances are computed for all cases that have zero missing values on the **independent** variables;
- missing values on the **dependent** variable may affect the Mahalanobis distances. This is because these are based on the listwise complete covariance matrix over the dependent as well as the independent variables.

If you select **pairwise exclusion**,

- Mahalanobis distances are computed for all cases that have zero missing values on the **independent** variables;
- missing values on the **dependent** variable do not affect the Mahalanobis distances in any way.

If you select **replace with mean**,

- missing values on the **dependent** and **independent** variables are replaced with the (variable) means before SPSS proceeds with any further computations;
- Mahalanobis distances are computed for **all** cases, regardless of any missing values;
- \(D^2\) = 0 for cases having missing values on all independent variables. This makes sense because \(\mathbf{x}_i - \overline{\mathbf{x}}\) results in a vector of zeroes after replacing all missing values by means.
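The last point can be illustrated with a tiny Python sketch. It is not a reproduction of what SPSS does internally: the data, the function names and the placeholder matrix standing in for \(\mathbf{S}^{-1}\) are all made up. It only shows that a case with means filled in on every independent variable ends up with a zero difference vector, so its \(D^2\) is zero whatever the covariance matrix looks like.

```python
def observed_means(rows):
    """Mean of each variable, ignoring missing values (coded as None)."""
    means = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        means.append(sum(observed) / len(observed))
    return means


def quadratic_form(diff, matrix):
    """diff' * matrix * diff for a plain list and a list of lists."""
    k = len(diff)
    return sum(diff[a] * matrix[a][b] * diff[b]
               for a in range(k) for b in range(k))


# Hypothetical independent variables; the third case is missing on both
rows = [[2.0, 3.0], [4.0, 5.0], [None, None], [7.0, 8.0], [9.0, 3.0]]

means = observed_means(rows)
imputed = [[m if v is None else v for v, m in zip(row, means)] for row in rows]

# Difference from the centroid for the case that was missing on everything
diff = [x - m for x, m in zip(imputed[2], means)]

# Any matrix will do here: a zero vector always yields a zero quadratic form
placeholder_inverse = [[2.0, -0.5], [-0.5, 1.0]]
d2_all_missing = quadratic_form(diff, placeholder_inverse)

print(diff, d2_all_missing)
```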

## References

- Hair, J.F., Black, W.C., Babin, B.J. et al. (2006). *Multivariate Data Analysis*. New Jersey: Pearson Prentice Hall.
- Warner, R.M. (2013). *Applied Statistics* (2nd ed.). Thousand Oaks, CA: SAGE.
- Pituch, K.A. & Stevens, J.P. (2016). *Applied Multivariate Statistics for the Social Sciences* (6th ed.). New York: Routledge.
- Field, A. (2013). *Discovering Statistics with IBM SPSS Statistics*. Newbury Park, CA: Sage.
- Howell, D.C. (2002). *Statistical Methods for Psychology* (5th ed.). Pacific Grove, CA: Duxbury.
- Agresti, A. & Franklin, C. (2014). *Statistics. The Art & Science of Learning from Data*. Essex: Pearson Education Limited.
