SPSS Missing Values Tutorial
SPSS has two types of missing values. In short,
- system missing values are values that are completely absent from the data and
- user missing values are values that are present in the data but must be excluded from calculations.
The remainder of this tutorial explains what missing values are, shows how to track them down and shows the right way to deal with them. You can follow along by downloading and opening hospital.sav.
System Missing Values
System missing values are values that are truly absent from the data. In data view they are shown as empty cells holding just a tiny dot. In rare cases, a dot also appears when SPSS is unable to display a value due to its format. In case of doubt you can find out by running FREQUENCIES over the variable. If we scroll to the last variables in our data, we first encounter a system missing value in “nurse_rating” as shown in the screenshot.
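As a minimal sketch of that check, the syntax below runs a frequency table over nurse_rating. It assumes hospital.sav is the active dataset; any system missing values are then counted in the missing section of the table rather than among the valid values.

```spss
*Frequency table for nurse_rating, with system missing values listed under Missing.
frequencies nurse_rating.
```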
System Missing Values - Possible Causes
System missing values are common in real world data. Some reasons why they occur are the following:
- some questions weren't offered to all respondents;
- some respondents skipped some of the questions;
- a technical failure occurred.
User Missing Values
User missing values are values that are present in the data but must be excluded from calculations and analyses. To have SPSS exclude them, the (SPSS) user needs to specify them as missing. We'll briefly point out the two scenarios that require this and then discuss them a little more in depth.
- Ordinal variables may contain values that reflect answers such as “don't know” and “no opinion”;
- Metric variables may contain extremely high or low values that possibly don't correspond to reality.
User Missing Values in Ordinal Variables
Say we want to know the average “doctor_rating”. Strictly, calculations are not allowed on ordinal variables but they are very common nevertheless. Also, see Assumption of Equal Intervals. Now, before doing anything whatsoever with a variable, we first want to know what's in there. To find out, we run FREQUENCIES with the syntax below. The table that appears in the Output Viewer window is shown in the following screenshot.
*1. Show both values and value labels in output tables.
set tnumbers both.
*2. Run frequency table of doctor_rating.
frequencies doctor_rating.
First note that the frequency table has three main sections:
- valid (non missing) values;
- missing values, in this case only 4 system missing values;
- all values, both valid and missing.
Second, note that higher values reflect more positive attitudes. However, this doesn't hold for 6 (“Not applicable or don't want to answer”) which does not reflect a more positive attitude than 5 (“Very satisfied”).
Now if two respondents score 2 (“Dissatisfied”) and 6 (“Not applicable or don't want to answer”) on this variable, their average will be (2 + 6) / 2 = 4 which means “Satisfied”. However, these two respondents are, on average, clearly not satisfied. The proper calculation for these two respondents is 2 / 1 = 2 which means “Dissatisfied”. This is accomplished by excluding the value 6 from the calculation altogether, which is done in SPSS by running the following line:
missing values doctor_rating (6).
After doing so, we rerun our frequency table. The result is shown in the screenshot below.
Note that the valid values section is now limited to the values that we do want to include in calculations.
The value 6 (“Not applicable or don't want to answer”) is shown in the missing values section. This confirms that it's now a user missing value and will be excluded from all calculations.
In short, we can detect user missing values in ordinal variables by inspecting their frequency tables containing their values and value labels. After specifying zero or more values as missing, we rerun these tables to confirm that we're good to go.
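The steps in this section can be sketched in one go as follows. This is only an illustration and assumes hospital.sav is the active dataset. The DESCRIPTIVES command at the end is an addition, included to show that the mean is now based on the valid values only.

```spss
*Specify 6 as a user missing value for doctor_rating.
missing values doctor_rating (6).
*Rerun the frequency table to confirm that 6 has moved to the missing values section.
frequencies doctor_rating.
*The mean is now computed over the valid values only.
descriptives doctor_rating /statistics mean.
```

Should the specification turn out to be wrong, running MISSING VALUES with empty parentheses for the variable removes it again.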
User Missing Values in Metric Variables
A second main class of user missing values consists of unlikely values. For example, say we asked respondents about their monthly salary and we see that somebody filled out “1000000”. There are a number of explanations for this:
- the respondent happened to be Bill Gates;
- the respondent didn't give a serious answer;
- some weird typo was made;
- dollar cents, rather than dollars, were filled in.
In any case, such extreme values may have a huge impact on our final results. Excluding them from all calculations avoids this and is accomplished by specifying them as missing.
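As a rough sketch of how this could look, the syntax below sets a range of implausibly high values as user missing. Both the variable name salary and the cutoff of 100000 are assumptions chosen purely for illustration. Neither is taken from hospital.sav.

```spss
*Hypothetical example in which salary and the cutoff 100000 are assumptions.
*Treat all values from 100000 through the highest value as user missing.
missing values salary (100000 thru hi).
*Inspect the mean and range of the remaining valid values.
descriptives salary /statistics mean min max.
```

For numeric variables, MISSING VALUES accepts up to three individual values, or one range plus one individual value, so a single cutoff like this fits within that limit.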
We can inspect variables for unlikely values by running their frequency tables. However, metric variables often (though not necessarily) have many distinct values, resulting in huge tables. If so, it may be hard to see which values should be considered extreme.
A more insightful approach is running and inspecting their histograms. The syntax below shows the easiest way to run a histogram for entry_date and entry_time.
frequencies entry_date entry_time
/format notable
/histogram.
*Note that "/format notable" suppresses the actual frequency tables.
The screenshot shows the histogram of entry_time. First, we don't see any extreme values; all values are between 00:00 and 23:59 hours. This doesn't have to be the case for SPSS time variables. A typo can result in a value outside this normal range.
Second, the distribution seems very plausible, with the bulk of time values being within office hours. Conclusion: we don't need to specify any user missing values for entry_time.
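If a numeric check alongside the histogram is wanted, one possible approach (an addition to this tutorial's steps, again assuming hospital.sav is the active dataset) is to request just the minimum and maximum of entry_time:

```spss
*Report the lowest and highest values of entry_time as a quick range check.
descriptives entry_time /statistics min max.
```

A minimum or maximum outside the normal range of a day would point to a typo worth investigating.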
Missing Values - Conclusion
System and user missing values are virtually always treated the same by SPSS. SPSS simply proceeds with its calculations and uses all non missing values that are present in the variables involved.
However, exactly how SPSS proceeds differs between different commands and functions. This may sometimes lead to surprising outcomes. For an example, see SPSS Sum - Cautionary Note.
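One well-known instance of such a difference, sketched below, is adding two variables with the + operator versus the SUM function. The new variable names rating_plus and rating_sum are hypothetical, and this sketch is not necessarily the example the linked note discusses.

```spss
*The + operator returns system missing if either rating is missing.
compute rating_plus = doctor_rating + nurse_rating.
*The SUM function adds up whatever valid values are present.
compute rating_sum = sum(doctor_rating, nurse_rating).
execute.
```

For a respondent with a valid doctor_rating but a missing nurse_rating, rating_plus will be system missing whereas rating_sum will simply equal doctor_rating.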