SPSS TUTORIALS BASICS ANOVA REGRESSION FACTOR CORRELATION

# Extract Digits from String Variable

Recently, one of our clients used a text field for asking his respondents’ ages. The resulting age variable is in age-in-string.sav, partly shown below.

I hope you realize that this looks nasty:

• age is a string variable so we can't compute its mean, standard deviation or any other statistic;
• we can't readily convert age into a numeric variable because it contains more than just numbers;
• a simple text replacement won't remove all such undesired characters.

For adding injury to insult, the data contain 3,895 cases so doing things manually is not feasible. However, we'll quickly fix things anyway.

## Inspect Frequency Table

Let's first see which problematic values we're dealing with anyway. So let's run a basic frequency table with the syntax below.

*Check which (string) values are present in age.

frequencies age
/format dfreq.

## Result

If we scroll down our table a bit, we'll see some problematic values as shown below.

This table shows us 2 important things:

most values that can be corrected start off with 2 digits;
at least one value is preceded by a leading space.

Let's first remove any leading spaces. We'll simply do so by running compute age = ltrim(age).

## Extract Leading Digits

We'll now extract any leading digits from our string variable with the syntax below.

*Create new string variable of length 3 -assume that nobody is older than 999 years....
string nage (a3).

*Loop over characters in age and pass into nage if they are digits.
loop #ind = 1 to char.length(age).
do if(char.index('0123456789',char.substr(age,#ind,1)) > 0).
compute nage = concat(rtrim(nage),char.substr(age,#ind,1)).
else.
break.
end if.
end loop.
execute.

So what we very basically do here is

• we create a new string variable;
• we LOOP through all characters in age;
• we evaluate if each character is a digit: char.index returns 0 if the character can't be found in '0123456789'.
• if the character is a digit (DO IF), we'll add it to the end of our new string variable;
• if the character is not a digit (ELSE), BREAK ends the loop for that particular respondent.

This last condition is needed for values such as “55 and will become 56 on 3/9” We need to make sure that no digits after “55” are added to our new variable. Otherwise, we'll end up with “555639” -an age perhaps only plausible for Fred Flintstone.

## Inspect Which Values Couldn't be Converted

Let's now inspect which original age values could not be converted. We'll rerun our frequency distribution but we'll restrict it to respondents whose new age value is still empty.

*Include only respondents without nage in next table.

temporary.
select if (nage = '').

*Check which age values weren't converted yet.

frequencies age
/format dfreq.

## Result

Surprisingly, a quick scroll down our table shows that we can reasonably convert only a single unconverted age value: “Will become 56 on the 3rd of September:-)”

It is probably safe to infer from this statement that this person was 55 years old at questionnaire completion. We'll set his age to 55 with a simple IF command. We'll then run a quick final check.

*Manually correct single age value.

if(char.index(age,'Will become 56') > 0) nage = '55'.

*Recheck which age values weren't converted yet.

temporary.
select if (nage = '').

frequencies age
/format dfreq.

## Final Frequency Table

As shown below, our minimal corrections resulted in a mere 148 (out of 3,895) unconverted ages. A quick scroll down our table shows that no further conversions are possible.

We'll now convert our new age variable into numeric with ALTER TYPE and inspect the result.

*Convert nage to numeric.

alter type nage(f3).

*Check age distribution.

frequencies nage
/histogram.

*Exclude nage = 99 from all analyses and/or editing.

missing values nage (99).

## Inspect Final Results

First off, note that our final age variable has N = 148 missing values -just as expected. It is important to check this because ALTER TYPE may result in missing values without throwing any error or warning.

Next, a histogram over our final age values is shown below.

Although the age distribution looks plausible, the x-axis runs up to 120 years. SPSS often applies a 20% margin on both sides so this may indicate an age around 100 years.

Closer inspection shows that somebody reported an age of 99 years. As we think that's not plausible for the current study, we set it as a user missing value.

Done.

# Tell us what you think!

*Required field. Your comment will show up after approval from a moderator.

# THIS TUTORIAL HAS 6 COMMENTS:

• ### By Jon K Peck on May 16th, 2021

The easiest way to find the numerical dates would be to use the SPSSINC TRANS extension command, which can be installed from the Extensions > Extension Hub menu. Then just this one command will do it.

spssinc trans result=agedigits1 to agedigits5
/formula "re.findall(r'\d\d', age)".

It is set up to find up to five two-digit numbers (case 3840 has five, because someone wrote a birth date). Unused result variables for a case are set to sysmis.
The expression \d\d is a regular expression that finds two successive digits. The re.findall funcrtion finds all such patterns. Regular expressions, although they take some learning, are a really effective way for finding patterns in strings.

• ### By Ruben Geert van den Berg on May 17th, 2021

Hi Jon, thanks for your comment!

I was expecting SPSSINC TRANS to be able to handle it but your syntax only triggered an error:

Only numeric variables are allowed.

I then tried

spssinc trans result=nage type=3
/formula "re.findall(r'\d\d\d', age)".

for extracting the first 3 digits. That didn't work either but I didn't receive any warning.

P.s. a value that particularly bothers me is '1985-09-15'. If I extract only 2 digits, the age will be 19 -which is plausible. I therefore want to extract 3 digits because age = 198 is an easily detected misconversion.

• ### By Jon K Peck on May 17th, 2021

I don't know why you would get an error. That code works on your data for me. As written, the code expects a string variable input and produces numeric output variables.

If you want to get more than two digits, you can write it like this.

spssinc trans result=agedigits1 to agedigits5
/formula "re.findall(r'\d\d\d*', age)".

That extracts fields with at least two digits, but it would capture a string of digits of any length.

Writing \d\d\d is valid - hence no error message - but it would only capture strings of three digits.

• ### By Ruben Geert van den Berg on May 17th, 2021

Sorry Jon, still no luck. SPSS 27 just throws a warning:

Warnings
local variable '_customfunction' referenced before assignment

The Help button doesn't seem to work either.

• ### By Jon Peck on May 19th, 2021

- Did you install this extension fresh with V27? I checked the Extension Hub and verified that the posted version is identical to what I am running. I seem to recall that when Development posted the initial Python 3 conversion with V27, there was a bug in it. I fixed that for them, so maybe reinstalling it from the Hub would fix this.

I also tested both the syntax and dialog box help, and they worked fine. I am running 27.0.1 (and 28, which is due shortly).

Sometimes a browser objects to file references in code on security grounds. The syntax help can be opened, however, as file://.../spssinc_trans/markdown.html where ... is replaced by the location where the extension is installed.