SPSS Missing Values Tutorial

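SPSS has two types of missing values. In short, system missing values are values that are truly absent from the data, whereas user missing values are present in the data but must be excluded from calculations and analyses.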

The remainder of this tutorial explains what missing values are, shows how to track them down and shows the right way to deal with them. You can follow along by downloading and opening hospital.sav.

System Missing Values

System missing values are values that are truly absent from the data. In data view they are shown as empty cells holding just a tiny dot. In rare cases, this is also seen if SPSS is unable to display a value due to its format. In case of doubt you can find out by running FREQUENCIES over the variable. If we scroll to the last variables in our data, we first encounter a system missing value in “nurse_rating” as shown in the screenshot.
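
As a minimal sketch of that check, the syntax below runs a frequency table over nurse_rating. Any system missing values are counted in the missing section of the table rather than among the valid values.

```spss
*Check nurse_rating for system missing values.

frequencies nurse_rating.
```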

SPSS system missing values in data view

System Missing Values - Possible Causes

System missing values are common in real world data. Some reasons why they occur are the following:

User Missing Values

User missing values are values that are present in the data but must be excluded from calculations and analyses. In order to do so, the (SPSS) user needs to specify them as missing. We'll briefly point out the two scenarios that require this and then discuss them a little more in depth.

User Missing Values in Ordinal Variables

Say we want to know the average “doctor_rating”. Strictly, calculations are not allowed on ordinal variables but they are very common nevertheless. Also, see Assumption of Equal Intervals. Now, before doing anything whatsoever with a variable, we first want to know what's in there. In order to do so, we run FREQUENCIES with the syntax below. The table that appears in the output viewer window is shown in the following screenshot.

*1. Show both data values and value labels in output.

set tnumbers both.

*2. Run frequency table of doctor_rating.

frequencies doctor_rating.

First note that the frequency table has three main sections:

valid (non missing) values;
missing values, in this case only 4 system missing values;
all values, both valid and missing.

Second, note that higher values reflect more positive attitudes. However, this doesn't hold for 6 (“Not applicable or don't want to answer”) which does not reflect a more positive attitude than 5 (“Very satisfied”).
Now if two respondents score 2 (“Dissatisfied”) and 6 (“Not applicable or don't want to answer”) on this variable, their average will be (2 + 6) / 2 = 4 which means “Satisfied”. However, these two respondents are, on average, clearly not satisfied. The proper calculation for these two respondents is 2 / 1 = 2 which means “Dissatisfied”. This is accomplished by excluding the value 6 from the calculation altogether. This is done in SPSS by running the following line: missing values doctor_rating(6). After doing so, we rerun our frequency table. The result is shown in the screenshot below.

Note that the valid values section is now limited to the values that we do want to include in calculations.
The value 6 (“Not applicable or don't want to answer”) is shown in the missing values section. This confirms that it's now a user missing value and will be excluded from all calculations.
In short, we can detect user missing values in ordinal variables by inspecting their frequency tables containing their values and value labels. After specifying zero or more values as missing, we rerun these tables to confirm that we're good to go.
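
The steps in this section can be sketched as one short piece of syntax. The DESCRIPTIVES command at the end is an addition that isn't part of the original example; it's only there to show a mean being computed after 6 has been excluded.

```spss
*1. Specify 6 as a user missing value for doctor_rating.

missing values doctor_rating (6).

*2. Rerun the frequency table to confirm 6 now appears under missing values.

frequencies doctor_rating.

*3. Compute the mean over the remaining valid values only (added for illustration).

descriptives variables = doctor_rating
/statistics = mean.
```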

User Missing Values in Metric Variables

A second main class of user missing values consists of unlikely values. For example, we asked respondents about their monthly salary and we see somebody filled out “1000000”. There are a number of explanations for this:

In any case, such extreme values may have a huge impact on our final results. Excluding them from all calculations avoids this and is accomplished by specifying them as missing.
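
A minimal sketch of that step is shown below. The variable name salary and the cutoff of 100000 are hypothetical: hospital.sav may not contain such a variable, and a sensible cutoff depends on your own data.

```spss
*Hypothetical example: the variable salary and the cutoff 100000 are assumptions.

*Option 1. Specify a single unlikely value as user missing.

missing values salary (1000000).

*Option 2. Specify all values from 100000 upward as user missing.

missing values salary (100000 thru hi).
```

Each MISSING VALUES command replaces any user missing values set earlier for that variable, so run only the option you need.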
We can inspect variables for unlikely values by running their frequency tables. However, metric variables often (not necessarily) have many distinct values, resulting in huge tables. If so, it may be hard to see which values are to be considered extreme.
A more insightful approach is running and inspecting their histograms. The syntax below shows the easiest way to run histograms for entry_date and entry_time.

*Run histograms for entry_date and entry_time.

frequencies entry_date entry_time
/format notable
/histogram.

*Note that "/format notable" suppresses the actual frequency tables.

The screenshot shows the histogram of entry_time. First, we don't see any extreme values; all values are between 00:00 and 23:59 hours. This doesn't have to be the case for SPSS time variables. A typo can result in a value outside this normal range.

Second, the distribution seems very plausible, with the bulk of time values being within office hours. Conclusion: we don't need to specify any user missing values for entry_time.

Missing Values - Conclusion

System and user missing values are virtually always treated the same by SPSS. SPSS simply proceeds with its calculations and uses all non missing values that are present in the variables involved.
However, exactly how SPSS proceeds differs between different commands and functions. This may sometimes lead to surprising outcomes. For an example, see SPSS Sum - Cautionary Note.
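
As a sketch of such a difference, the syntax below adds up two ratings in two ways. The new variable names total_sum and total_plus are made up for this illustration.

```spss
*SUM function: adds up the non missing values, so a total results if at least one rating is valid.

compute total_sum = sum(doctor_rating, nurse_rating).

*Plus operator: results in a system missing value if either rating is missing.

compute total_plus = doctor_rating + nurse_rating.
execute.
```

For a respondent with a valid doctor_rating but a missing nurse_rating, total_sum equals the doctor_rating whereas total_plus is system missing.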
