
Extract Digits from String Variable

Recently, one of our clients used a text field to ask for his respondents’ ages. The resulting age variable is in age-in-string.sav, partly shown below.

SPSS Extract Digits From String

I hope you realize that this looks nasty:

To add insult to injury, the data contain 3,895 cases, so fixing things manually is not feasible. However, we can still fix most values quickly with syntax.

Inspect Frequency Table

Let's first see which problematic values we're dealing with by running a basic frequency table with the syntax below.

*Check which (string) values are present in age.

frequencies age
/format dfreq.

Result

If we scroll down our table a bit, we'll see some problematic values as shown below.

SPSS Frequency Table Age

This table shows us 2 important things:

most values that can be corrected start off with 2 digits;
at least one value is preceded by a leading space.

Let's first remove any leading spaces. We'll simply do so by running compute age = ltrim(age).

Extract Leading Digits

We'll now extract any leading digits from our string variable with the syntax below.

*Create new string variable of length 3 -assume that nobody is older than 999 years....
string nage (a3).

*Loop over characters in age and pass into nage if they are digits.
loop #ind = 1 to char.length(age).
do if(char.index('0123456789',char.substr(age,#ind,1)) > 0).
compute nage = concat(rtrim(nage),char.substr(age,#ind,1)).
else.
break.
end if.
end loop.
execute.

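So what we basically do here is loop over the characters in age from left to right, append each character to nage as long as it is a digit, and break out of the loop as soon as we hit a character that is not a digit.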

This last condition is needed for values such as “55 and will become 56 on 3/9”. We need to make sure that no digits after “55” are added to our new variable. Otherwise, we'd collect the digits “555639”, an age perhaps only plausible for Fred Flintstone.
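For readers who want to cross-check the logic outside SPSS, the loop can be sketched in Python. This is only an illustration, not part of the original syntax: the function name leading_digits, the max_len parameter (standing in for the A3 width of nage) and the example values are assumptions made up for this sketch.

```python
def leading_digits(age: str, max_len: int = 3) -> str:
    """Return the run of digits at the start of `age`, ignoring leading spaces."""
    digits = ""
    for ch in age.lstrip(" "):       # plays the role of ltrim(age)
        if ch in "0123456789":       # plays the role of char.index('0123456789', ...) > 0
            digits += ch
        else:
            break                    # stop at the first non-digit character
    return digits[:max_len]          # hypothetical stand-in for the A3 width of nage


# Made-up example values, not taken from age-in-string.sav.
examples = ["55 and will become 56 on 3/9", " 34", "twenty", "41 years"]
results = [leading_digits(value) for value in examples]
print(results)
```

Values that don't start with a digit come back as an empty string, which corresponds to the cases that remain unconverted in SPSS.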

Inspect Which Values Couldn't be Converted

Let's now inspect which original age values could not be converted. We'll rerun our frequency distribution but we'll restrict it to respondents whose new age value is still empty.

*Include only respondents without nage in next table.

temporary.
select if (nage = '').

*Check which age values weren't converted yet.

frequencies age
/format dfreq.

Result

Surprisingly, a quick scroll down our table shows that we can reasonably convert only a single unconverted age value: “Will become 56 on the 3rd of September:-)”

SPSS Adjust Single Data Value

It is probably safe to infer from this statement that this person was 55 years old at questionnaire completion. We'll set his age to 55 with a simple IF command. We'll then run a quick final check.

*Manually correct single age value.

if(char.index(age,'Will become 56') > 0) nage = '55'.

*Recheck which age values weren't converted yet.

temporary.
select if (nage = '').

frequencies age
/format dfreq.

Final Frequency Table

As shown below, our minimal corrections resulted in a mere 148 (out of 3,895) unconverted ages. A quick scroll down our table shows that no further conversions are possible.

SPSS Valid Values For Frequencies

We'll now convert our new age variable into numeric with ALTER TYPE and inspect the result.

*Convert nage to numeric.

alter type nage(f3).

*Check age distribution.

frequencies nage
/histogram.

*Exclude nage = 99 from all analyses and/or editing.

missing values nage (99).

Inspect Final Results

First off, note that our final age variable has N = 148 missing values, just as expected. It is important to check this because ALTER TYPE may result in missing values without throwing any error or warning.
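The idea behind this check can be sketched in Python: count how many values were empty before the conversion and compare that with the number of missing values afterwards. The helper to_number and the sample values are hypothetical and only illustrate the comparison; they don't reproduce how ALTER TYPE works internally.

```python
def to_number(value: str):
    """Convert a string of ASCII digits to int; return None if that's not possible."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return int(value)
    return None


# Made-up sample of extracted age strings (empty = nothing could be extracted).
nage_strings = ["55", "", "34", "", "99"]

n_empty_before = sum(1 for s in nage_strings if s.strip() == "")
converted = [to_number(s) for s in nage_strings]
n_missing_after = sum(1 for v in converted if v is None)

# If these two counts differ, the conversion silently lost some values.
print(n_empty_before, n_missing_after)
```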

Next, a histogram over our final age values is shown below.

SPSS Histogram Age Distribution

Although the age distribution looks plausible, the x-axis runs up to 120 years. SPSS often applies a 20% margin on both sides so this may indicate an age around 100 years.

Closer inspection shows that somebody reported an age of 99 years. As we think that's not plausible for the current study, we set it as a user missing value.

Done.

Thanks for reading!

SPSS String Variables – Quick Tutorial

SPSS CHAR.SUBSTR - Example

Working with string variables in SPSS is pretty straightforward if one masters some basic string functions. This tutorial will quickly walk you through the important ones.

SPSS Main String Functions

SPSS Syntax Example

We asked respondents to type in their first name, surname prefix and last name. We'd like to combine these into full names and correct some irregularities such as incorrect casing and double spaces. For creating some test data, close all open datasets and run the syntax below.

*Create mini test dataset.

set unicode off.
data list free/s1 s2 s3 (3a20).
begin data
'ANNEKE' ' VAN DEN ' 'BERG' 'daan' '' 'balvert' 'a' '' 'b'
end data.

1. Correcting First Names

*1. Declare new string variables.

string n1 to n4 (a20).

*2. Extract first letter of first name.

compute n1 = char.substr(s1,1,1).
exe.

*3. Convert to upper case.

compute n1 = upcase(n1).
exe.

*4. Substitution: use substring function within upcase function.

compute n1 = upcase(char.substr(s1,1,1)).
exe.

*5. Extract remaining letters and convert to lower case.

compute n1 = lower(char.substr(s1,2)).
exe.

*6. Substitution: concatenate results from previous attempts.

compute n1 = concat(upcase(char.substr(s1,1,1)),lower(char.substr(s1,2))).
exe.

2. Correcting Surname Prefixes

*1. Remove leading spaces.

compute n2 = ltrim(s2).
exe.

*2. Substitution: remove leading spaces and convert to lower case.

compute n2 = lower(ltrim(s2)).
exe.

*3. Replace double spaces by single spaces.

compute n2 = replace(n2,'  ',' ').
exe.

3. Combining First and Last Names

*1. Reuse capitalization syntax used for first name on last name.

compute n3 = concat(upcase(char.substr(s3,1,1)),lower(char.substr(s3,2))).
exe.

*2. If rtrim is omitted, concat doesn't seem to work: n1 is padded with trailing spaces so nothing else fits into n4.

compute n4 = concat(n1,n2,n3).
exe.

*3. Correct concatenation but spaces should be inserted.

compute n4 = concat(rtrim(n1),rtrim(n2),rtrim(n3)).
exe.

*4. Final concatenation.

compute n4 = concat(rtrim(n1),' ',rtrim(n2),' ',rtrim(n3)).
exe.

*5. Replace double spaces by single spaces.

compute n4 = replace(n4,'  ',' ').
exe.
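The name-building steps above can be summarized in a short Python sketch for readers who want to verify the logic in another language. The function names capitalize_name and full_name are made up for this illustration. Unlike the single REPLACE in the SPSS syntax, the sketch keeps collapsing spaces until no double spaces remain, which is a small extension.

```python
def capitalize_name(name: str) -> str:
    """Upper-case the first letter and lower-case the rest."""
    name = name.strip()
    return name[:1].upper() + name[1:].lower()


def full_name(first: str, prefix: str, last: str) -> str:
    """Combine first name, surname prefix and last name into one string."""
    parts = [capitalize_name(first), prefix.strip().lower(), capitalize_name(last)]
    combined = " ".join(parts)        # an empty prefix yields a double space here
    while "  " in combined:           # collapse double spaces into single ones
        combined = combined.replace("  ", " ")
    return combined


# Same three cases as the mini test dataset.
names = [
    full_name("ANNEKE", " VAN DEN ", "BERG"),
    full_name("daan", "", "balvert"),
    full_name("a", "", "b"),
]
print(names)
```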

4. Flag Single Letter Names

*1. Find short first/last names from separate name components.

compute flag_1a = char.length(s1).
compute flag_1b = char.length(s3).
exe.

*2. Find short first/last names from combined names.

compute flag_2a = char.index(n4,' ') -1.
compute flag_2b = char.length(n4) - char.rindex(rtrim(n4),' ').
exe.
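As a rough cross-check of the second approach, the same arithmetic can be written in Python. The function names are hypothetical. Note that Python positions are 0-based whereas CHAR.INDEX and CHAR.RINDEX are 1-based, so the offsets differ slightly from the SPSS expressions.

```python
def first_name_length(full: str) -> int:
    """Number of characters before the first space."""
    # The 0-based position of the first space equals the count of characters before it.
    return full.find(" ")


def last_name_length(full: str) -> int:
    """Number of characters after the last space."""
    full = full.rstrip()              # ignore trailing padding, like rtrim(n4)
    return len(full) - (full.rfind(" ") + 1)


flags = {
    name: (first_name_length(name), last_name_length(name))
    for name in ["Anneke van den Berg", "Daan Balvert", "A B"]
}
print(flags)
```

A value of 1 in either position flags a single-letter first or last name.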

SPSS String Variables Basics

For working proficiently with SPSS string variables, it greatly helps to understand some string basics. This tutorial explains what SPSS string variables are and demonstrates their main properties.
We encourage you to follow along by downloading and opening string_basics.sav. The syntax we use can be copy-pasted or downloaded here.

SPSS String Variable Basics Data File

SPSS String Variables - What Are They?

String variables are one of SPSS' two variable types. What really defines a string variable is the way its values are stored internally. We won't go into this technical matter here but those who really want to know may consult our Unicode tutorial. A simpler definition is that string variables are variables that hold zero or more text characters.
String values are always treated as text, even if they contain only numbers. Some surprising consequences of this are shown towards the end of this tutorial.

SPSS String Format

String variables in SPSS usually have an “A” format, where “A” denotes “Alphanumeric”. You can see this by running display dictionary. after opening the data. The result, shown in the screenshot below, confirms that we have two string variables with A3 and A8 formats.

SPSS String Variable Formats

The numeric suffixes (3 and 8 here) are the numbers of bytes that the values can hold. Starting from SPSS version 16, some characters may consist of two bytes. This is explained in Unicode mode. If you don't want to go into details, just choose string lengths that are twice the number of characters they need to contain to stay on the safe side.
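The difference between characters and bytes can be illustrated with a small Python sketch, assuming UTF-8 encoding (which is what SPSS uses in Unicode mode). The example strings are made up. The sketch also suggests that some characters, such as the euro sign, may need more than two bytes, so doubling the length may not always be enough for such text.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes needed to store `text` in UTF-8."""
    return len(text.encode("utf-8"))


# Escapes are used so the exact characters are unambiguous:
# \u00eb is an e with diaeresis, \u20ac is the euro sign.
samples = ["Stefan", "Zo\u00eb", "\u20ac10"]

# Map each sample to (number of characters, number of bytes).
sizes = {text: (len(text), utf8_bytes(text)) for text in samples}
print(sizes)
```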

SPSS String Command

Commands that pass values into variables, most notably COMPUTE and IF, can be used for both existing and new numeric variables. However, they can't be used for new string variables; you must first create one or more new, empty string variables before you can pass values into them. This is done with the STRING command. Its most basic use is STRING variable_names (A10). As explained earlier, A10 means that the new variable can hold values of up to 10 bytes. The syntax below creates a new string variable in our test data.

*1. Create empty new string variable with string command.

string string_3(a10).

*2. Pass values into new string variable.

compute string_3 = 'Hello'.
exe.

SPSS String Function

SPSS' string function converts numeric values to string values. Its most basic use is compute s2 = string(s1,f1). where s2 is a string variable, s1 is a numeric variable or value and f1 is the numeric format to be used.
With regard to our test data, the syntax below shows how to convert numeric_1 into (previously created) string_3. In order to capture all three digits, we need to specify f3 as the format.

*Convert numeric_1 to (existing) string variable with string function.

compute string_3 = string(numeric_1,f3).
exe.

Quotes Around String Values

If you use string values in syntax, put quotes around them. For example, say we want to flag all cases whose name is “Stefan”. The screenshot shows the desired result. The syntax below demonstrates the wrong way and then the right way to do so. A faster way to do this is compute find_stefan = string_2 = 'Stefan'. Compute A = B = C explains how this works.

*1. Compute empty flag variable.

compute find_stefan = 0.
exe.

*2. Wrong way: without quotes Stefan is thought to be variable name.

if string_2 = Stefan find_stefan = 1.
exe.

*3. Right way: quotes around Stefan.

if string_2 = 'Stefan' find_stefan = 1.
exe.

Result

SPSS String Variable Flag Cases Flagging Cases Whose Name is Stefan

Note that the second step triggers SPSS error #4285: due to the omitted quotes, SPSS thinks that Stefan refers to a variable name and doesn't find it in the data.

String Values are Case Sensitive

Now let's create a similar flag variable for cases called “Chrissy”. After running step 2 in the syntax below, you can see in data view that no cases have been flagged; it uses the wrong casing. Step 3, using the correct casing, does flag “Chrissy” correctly.

*1. Compute empty flag variable.

compute find_chrissy = 0.
exe.

*2. Line below doesn't flag any cases because 'chrissy' is not the same as 'Chrissy'.

if string_2 = 'chrissy' find_chrissy = 1.
exe.

*3. Right way: 'Chrissy' instead of 'chrissy'.

if string_2 = 'Chrissy' find_chrissy = 1.
exe.

SPSS String Variables - System Missing Values

There's no such thing as a system missing value in a string variable; string values consisting of zero characters, which are called empty strings, are valid values in SPSS. Also note that you don't see a dot (indicating a system missing value) in an empty cell of a string variable. We can confirm this by running FREQUENCIES: frequencies string_2. Note that the empty string value is among the valid values.

Result

SPSS String Variable No System Missing Values

User Missing Values in String Variables

Over the years, we've seen many forum questions (and some heated debates) regarding user missing values in string variables. Well, running missing values string_2(''). specifies the empty string as a user missing value. This can be confirmed by rerunning its frequency table; the empty string is now in the missing values section as shown by the screenshot.

Result

SPSS String Variable No System Missing Values

Sorting on String Variables

String values are seen as text, even if they consist of only numbers. A consequence is that string values are sorted alphabetically. To see what this means, run sort cases by string_1.

SPSS String Variable Sorted Alphabetically Alphabetical Sorting of string_1

If this result puzzles you, represent the numbers 0 through 9 by letters a through j. Clearly, “bb” (= 11) comes before “c” (= 2) if sorted alphabetically.
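The same effect shows up in most programming languages. A minimal Python sketch with made-up values (not the actual contents of string_1) compares sorting digits as text with sorting them as numbers:

```python
# Made-up string values that contain only digits.
values = ["2", "11", "1", "10", "3"]

# Sorted as text: compared character by character, so "11" precedes "2".
as_text = sorted(values)

# Sorted as numbers: each value is converted to an integer before comparing.
as_numbers = sorted(values, key=int)

print(as_text)
print(as_numbers)
```

So if numeric order is what you need, convert the values to numbers before sorting.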

No Calculations on String Variables

Because string values are seen as text, you can't do any calculations on them. For instance a COMPUTE command with some numeric function like compute string_1 = string_1 * 2. will trigger SPSS error #4307. It basically tries to tell us that our command crashed because a string variable was used in a calculation.

SPSS Error #4307

In a similar vein, most procedures involve calculations and thus won't run on string variables either. For example, descriptives string_1. won't produce any other results than a warning that the command crashed because only string variables were involved.