You can't readily use categorical variables as predictors in linear regression: you need to break them up into dichotomous variables known as dummy variables.

The ideal way to create these is our dummy variables tool. If you don't want to use this tool, then this tutorial shows the right way to do it manually.

- Example I - Any Numeric Variable
- Example II - Numeric Variable with Adjacent Integers
- Example III - String Variable with Conversion
- Example IV - String Variable without Conversion

## Example Data File

This tutorial uses staff.sav throughout. Part of this data file is shown below.

## Example I - Any Numeric Variable

Let's first create dummy variables for marit, short for marital status. Our first step is to run a basic frequency table with frequencies marit. The result is shown below.

So how to break up marital status into dummy variables? First off, we always omit one category, the **reference category**. You may choose any category as the reference category.

So for this example, we choose 5 (Widowed). This implies that we'll create 3 dummy variables representing categories 1, 2 and 4 (note that 3 does not occur in this variable).

The syntax below shows how to create and label our 3 dummy variables. Let's run it.

***Create dummy variables for categories 1, 2 and 4.**

compute marit_1 = (marit = 1).

compute marit_2 = (marit = 2).

compute marit_4 = (marit = 4).

***Apply variable labels to dummy variables.**

variable labels

marit_1 'Marital Status = Never Married'

marit_2 'Marital Status = Currently Married'

marit_4 'Marital Status = Divorced'.

***Quick check of first dummy variable.**

frequencies marit_1.

## Results

First off, note that we created 3 nicely labelled dummy variables in our active dataset.

The table below shows the frequency distribution for our first dummy variable.

Note that our dummy variable holds 3 distinct values:

- respondents whose marital status is **not** “never married” score 0;
- respondents whose marital status is “never married” score 1;
- respondents whose marital status is a missing value (and therefore unknown) have a system missing value.

We may now check the results more thoroughly by running crosstabs marit by marit_1 to marit_4. Doing so creates 3 contingency tables, the first of which is shown below.

On our dummy variable,

- respondents having other marital statuses than “never married” all score 0;
- respondents who “never married” all score 1;
- we've a sample size of N = 170 (this table only includes respondents without missing values on either variable).
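As an extra check next to the contingency tables, we could also verify that nobody scores 1 on more than one dummy variable. The syntax below is a minimal sketch of this idea and is not part of the original tutorial. The variable name marit_check is hypothetical, and the sketch assumes that marit_1 through marit_4 are adjacent in the active dataset.

```spss
*Sketch: add up the dummy variables for each respondent (marit_check is a hypothetical name).
compute marit_check = sum(marit_1 to marit_4).

*Valid cases should only hold 0 (reference category) or 1 (any other category).
frequencies marit_check.

*Remove the helper variable again.
delete variables marit_check.
```

If this frequency table holds any value larger than 1, then at least one of the dummy variables was not created correctly.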

Optionally, a final -*very thorough*- check is to compare ANOVA results for the original variable to regression results using our dummy variables. The syntax below does just that, using monthly salary as the dependent variable.

***Minimal regression using dummy variables.**

regression

/dependent salary

/method enter marit_1 to marit_4.

***Minimal ANOVA using original variable.**

oneway salary by marit.

Note that both analyses result in **identical** ANOVA tables. We'll discuss ANOVA versus dummy variable regression more thoroughly in a future tutorial.
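If you'd like to compare the b-coefficients as well, a possible alternative is running the ANOVA with UNIANOVA and requesting parameter estimates. The syntax below is a sketch under the assumption that UNIANOVA uses the last category, here 5 (Widowed), as its reference category, which is its default behavior. In that case, its parameter estimates should match the regression coefficients for our dummy variables.

```spss
*Sketch: ANOVA on the original variable with parameter estimates.
*The last category of marit (5 = Widowed) is assumed to be the reference category.
unianova salary by marit
/print parameter
/design marit.
```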

## Example II - Numeric Variable with Adjacent Integers

We'll now create dummy variables for region. Again, we start off by inspecting a minimal frequency table which we'll create by running frequencies region. This results in the table below.

We'll choose 1 (“North”) as our reference category. We'll therefore create dummy variables for categories 2 through 5. Since these are adjacent integers, we can speed things up by using DO REPEAT as shown below.

***Create dummy variables for region categories 2 through 5.**

do repeat #vals = 2 to 5 / #vars = region_2 to region_5.

recode region (#vals = 1)(lo thru hi = 0) into #vars.

end repeat print.

***Apply variable labels to new variables.**

variable labels

region_2 'Region = East'

region_3 'Region = South'

region_4 'Region = West'

region_5 'Region = Top 4 City'.

***Quick check.**

crosstabs region by region_2 to region_5.

A careful inspection of the resulting tables confirms that all results are correct.
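If you're new to DO REPEAT, it may help to see what the loop in this example stands for. Written out in full, it should be equivalent to the four RECODE commands below. Because the loop ends with end repeat print, SPSS also prints these expanded commands in the output window.

```spss
*The DO REPEAT loop written out as separate commands.
recode region (2 = 1)(lo thru hi = 0) into region_2.
recode region (3 = 1)(lo thru hi = 0) into region_3.
recode region (4 = 1)(lo thru hi = 0) into region_4.
recode region (5 = 1)(lo thru hi = 0) into region_5.
```

Within each RECODE, the first matching rule wins: the category of interest becomes 1 and all other values become 0.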

## Example III - String Variable with Conversion

Sadly, our first 2 methods don't work for string variables such as jtype (short for “job type”). The easiest solution is to convert it into a numeric variable as discussed in SPSS Convert String to Numeric Variable. The syntax below uses AUTORECODE to get the job done.

***Convert jtype into numeric variable.**

autorecode jtype

/into njtype.

***Check result.**

frequencies njtype.

***Set missing values.**

missing values njtype (1,2).

***Recheck result.**

frequencies njtype.

## Result

Since njtype -short for “numeric job type”- is a numeric variable, we can now use method I or method II for breaking it up into dummy variables.
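As a minimal sketch, the syntax below applies method I to njtype. It assumes that the valid categories ended up as codes 3 through 6 (codes 1 and 2 were set as user missing values above) and it arbitrarily uses category 3 as the reference category. Please check the frequency table for njtype before running it, since the actual codes may differ.

```spss
*Sketch: dummy variables for njtype, assuming valid codes 3 through 6.
*Category 3 is (arbitrarily) chosen as the reference category.
compute njtype_4 = (njtype = 4).
compute njtype_5 = (njtype = 5).
compute njtype_6 = (njtype = 6).

*Check the results.
crosstabs njtype by njtype_4 to njtype_6.
```

Value labels for njtype show which job type each code represents, so these can be used for setting variable labels on the new dummy variables.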

## Example IV - String Variable without Conversion

Converting string variables into numeric ones is the easy way to create dummy variables for them. Without this conversion, the process is cumbersome because SPSS doesn't handle missing values for string variables properly. However, the syntax below gets the job done correctly.

***Inspect frequencies.**

frequencies jtype.

***Change '(Unknown)' into 'NA'.**

recode jtype ('(Unknown)' = 'NA').

***Set user missing values.**

missing values jtype ('','NA').

***Reinspect frequencies.**

frequencies jtype.

***Create dummy variables for string variable.**

if(not missing(jtype)) jtype_1 = (jtype = 'IT').

if(not missing(jtype)) jtype_2 = (jtype = 'Management').

if(not missing(jtype)) jtype_3 = (jtype = 'Sales').

if(not missing(jtype)) jtype_4 = (jtype = 'Staff').

***Apply variable labels to dummy variables.**

variable labels

jtype_1 'Job type = IT'

jtype_2 'Job type = Management'

jtype_3 'Job type = Sales'

jtype_4 'Job type = Staff'.

***Check results.**

crosstabs jtype by jtype_1 to jtype_4.
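Note that this example created a dummy variable for every job type. As in Example I, one category must still be left out as the reference category when running the actual regression. The sketch below assumes that IT, Management, Sales and Staff are the only valid job types and it arbitrarily uses IT (jtype_1) as the reference category.

```spss
*Sketch: regression with IT (jtype_1) omitted as the reference category.
regression
/dependent salary
/method enter jtype_2 to jtype_4.
```

If all four dummy variables are entered under this assumption, they are perfectly collinear and SPSS should exclude one of them from the model.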

## Final Notes

Creating dummy variables for numeric variables can be done fast and easily. Setting proper variable labels, however, always takes a bit of work. String variables require some extra step(s) but are pretty doable as well.

Nevertheless, the easiest option is our SPSS Create Dummy Variables Tool as it takes perfect care of everything.

Hope you found this tutorial helpful! Let us know by throwing a comment below.

**Thanks for reading!**

## THIS TUTORIAL HAS 29 COMMENTS:

## By Steven Bortels on April 5th, 2021

How do you make a dummy variable for a Numeric variable? In this case, it's age. I have a median of 38 years old and I want 18-38 = 0 and 39-82 = 1. How do I do that?

Thank you in advance for your feedback!

## By Ruben Geert van den Berg on April 5th, 2021

Hi Steven!

It's usually not a good idea to dummify such a quantitative variable. You're basically throwing away most of the information it contains. Why don't you simply use age in years as your predictor? Usually a better idea.

But anyway, an easy option is

RECODE AGE (LO THRU 38 = 0)(LO THRU HI = 1) INTO AGEGROUP.

If there's any missing values in AGE, those will be missing values in AGEGROUP too (which is what you usually want).

Hope that helps!

SPSS tutorials

## By Steven Bortels on April 5th, 2021

Thank you for your quick reply!

How do I paste the command "RECODE AGE (LO THRU 38 = 0)(LO THRU HI = 1) INTO AGEGROUP" in SPSS? I am unfamiliar with its syntax.

Best regards,

Steven Bortels

## By Ruben Geert van den Berg on April 5th, 2021

Just copy-paste the syntax into your syntax window, replace "AGE" by the variable name of your age variable and run it.

If that doesn't ring a bell, you should seriously read up on our SPSS Syntax Introduction.