SPSS Unicode mode is a setting in which all text is encoded as UTF-8 (Unicode Transformation Format - 8 bit). Note that this tutorial leans heavily on Unicode.
SPSS Unicode Mode - What and Why
- Up to version 15, all character encoding in SPSS was based on code pages. SPSS using code pages is now referred to as SPSS code page mode.
- Starting from version 16, however, Unicode (as UTF-8) has been supported as well. SPSS using UTF-8 is referred to as Unicode mode. Note that this encoding doesn't only apply to string variables but to syntax files as well.
- For SPSS versions 21 and onwards, whether to use Unicode mode or not is explicitly asked when the program is first started.
SPSS Unicode and Variable Widths
- The single most important thing to understand about UTF-8 in SPSS is that any character may consist of 1, 2 or 3 bytes. Code page mode is restricted to single byte characters so that characters and bytes correspond.
- In SPSS, variable width is defined as the number of bytes (not characters) that may be used for string values. To stay on the safe side, one could use three times the number of characters as variable widths.
- When a data file that was saved in code page mode is opened in Unicode mode, SPSS automatically triples all string variable widths to ensure that they are long enough.
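The relation between characters and bytes can be checked outside SPSS. The snippet below is a minimal Python sketch, not SPSS syntax; the helper name `utf8_width` and the sample values are assumptions made up for illustration. It counts the UTF-8 bytes of a few 4-character strings and compares them with the "three times the number of characters" rule of thumb.

```python
def utf8_width(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical sample values, each 4 characters long.
# Escapes are used so the code points are unambiguous.
samples = [
    "ABCD",                      # plain ASCII: 1 byte per character
    "\u00c4BCD",                 # ÄBCD: the umlaut takes 2 bytes
    "\u20ac\u20ac\u20ac\u20ac",  # €€€€: each euro sign takes 3 bytes
]

for value in samples:
    chars = len(value)
    width = utf8_width(value)
    # For these samples the byte count never exceeds 3 bytes per character.
    print(f"{value}: {chars} characters, {width} bytes, safe width {3 * chars}")
```

For these samples, a width of three times the character count is always sufficient, which is the reasoning behind tripling the widths.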
SPSS Unicode Mode and String Functions
- Basic string functions in SPSS (such as INDEX) apply to bytes (not characters). Since code page mode uses only single byte characters, bytes and characters correspond and basic string functions can safely be used. In Unicode mode they can only be used on string values that don't hold any multibyte characters.
- SPSS' character string functions (such as CHAR.INDEX) apply to characters (not bytes). They can always be used (both in Unicode as well as in code page mode).
- Lastly, RTRIM is automatically applied in Unicode mode. Including it in your syntax anyway is the safest option since this keeps your syntax valid in both Unicode mode and code page mode.
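The difference between byte-based and character-based positions can be illustrated with a small Python sketch. This is not how SPSS implements INDEX or CHAR.INDEX; the functions `byte_index` and `char_index` are hypothetical helpers that only mimic the two ways of counting (1-based, 0 when nothing is found).

```python
def byte_index(haystack: str, needle: str) -> int:
    """1-based position of `needle` counted in UTF-8 bytes; 0 if absent."""
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1


def char_index(haystack: str, needle: str) -> int:
    """1-based position of `needle` counted in characters; 0 if absent."""
    return haystack.find(needle) + 1


plain = "ABCD"             # single byte characters only
umlauts = "\u00c4\u00c4CD"  # ÄÄCD: two 2-byte characters, then C and D

# Without multibyte characters, both ways of counting agree.
print(byte_index(plain, "C"), char_index(plain, "C"))

# With multibyte characters, the byte position runs ahead of the
# character position.
print(byte_index(umlauts, "C"), char_index(umlauts, "C"))
```

As long as a value holds only single byte characters, the two results coincide; once multibyte characters precede the match, only the character-based position says where the match sits in the text as a reader sees it.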
Switching Between Unicode and Code Page Mode
- If Unicode is to be used,
SET UNICODE ON.
will switch on Unicode mode. Note that this command can only be run if there are no open datasets.
- Alternatively,
SET UNICODE OFF.
switches SPSS into code page mode.
- Finally,
SHOW UNICODE.
will show whether SPSS is in Unicode mode or code page mode.
THIS TUTORIAL HAS 14 COMMENTS:
By nita on July 24th, 2017
I did not find the Unicode (universal character set) option in the General tab of IBM SPSS Statistics 20?
By Ruben Geert van den Berg on July 24th, 2017
Hi Nita!
Why don't you do it by syntax? Close all data, output and syntax and just run
set unicode on.
or
set unicode off.
Via the menu, try Edit -> Options -> Language -> Character Encoding...
Hope that helps!
By Thomas Buhl on January 30th, 2020
Thank you for the clear explanation. It helped me to solve a miracle ;-)
Just a small remark:
"SPSS' character string functions (such as CHAR.INDEX) apply to characters (not bytes). They can always be used (both in Unicode as well as in code page mode.)" seems not to be true for CHAR.SUBSTR.
Here is a MWE with special characters (German Umlaute):
new file.
set unicode = ON.
show unicode.
data list free
/text (a8).
begin data
ABCD
ÄBCD
ÄÄCD
ÄÄÄD
ÄÄÄÄ
end data.
string newText (a3).
comp newText = CHAR.SUBSTR (LTRIM(RTRIM(text)), 1,3).
exe.
list var = all.
exe.
new file.
set unicode = off.
show unicode.
data list free
/text (a8).
begin data
ABCD
ÄBCD
ÄÄCD
ÄÄÄD
ÄÄÄÄ
end data.
string newText (a3).
comp newText = CHAR.SUBSTR (LTRIM(RTRIM(text)), 1,3).
exe.
list var = all.
exe.
By Ruben Geert van den Berg on February 2nd, 2020
Hi Thomas!
Sorry for my late reaction, just back from holidays.
In any case, your examples are very interesting indeed. I've no clue why things don't work as they're supposed to in Unicode mode. Any suggestions?