SPSS tutorials website header logo SPSS TUTORIALS VIDEO COURSE BASICS ANOVA REGRESSION FACTOR

SPSS – Open CSV Data File

Summary

This tutorial walks you through opening a .csv data file in SPSS. We chose a .csv file as obtained from Google Analytics (GA). Since many readers may be unfamiliar with GA, the screenshot below gives a basic idea of what the data look like before we export them as a .csv file.

SPSS open .csv data - Google Analytics 1

The original data file -as generated by GA- can be downloaded here. We chose it because opening it in SPSS causes some complications that you'll typically encounter with real life data. We'll take our time to explain and resolve these issues.

Notepad++

Right, so where do we start? Well, .csv files are typically associated with spreadsheet editors such as MS Excel or OpenOffice Calc. However, a better idea is to inspect the file with notepad++ or a similar text editor. As we'll see in a minute, this allows us to gather some essential pieces of information on the file contents. The screenshot below shows what the data look like in Notepad++.

Contents of the CSV file

SPSS open .csv data - Notepad++ 1

Note that the first 6 lines of the file are a basic description of the data. We'll remove them after inspecting the rest of the data.
Line 7 contains variable names. Note that most of them are not suitable for SPSS because they contain spaces or start with a percent sign. We'll clean them up manually.
The actual data start on line 8, which will be the second line of data after removing the first 6 lines.
Note how data values are separated by commas (hence “comma separated values” or .csv file). These commas are known as the delimiter.
If a value contains a comma, there's double quotes around it. This is the text qualifier. These quotes indicate that commas between them are not meant as delimiters.

More Contents of the CSV file

SPSS open .csv data - Notepad++ 2

If we quickly scroll down the file (tip: use the pgdn key on your keyboard), we see that line 318 is the last line containing raw data values.
Lines 319 and below contain summary statistics which are not part of the data we want. We therefore remove them.

SPSS open .csv data - Notepad++ 3

Last but not least, note “UTF-8” in the status bar. This is the file encoding. It basically tells how characters are mapped onto the bits (ones and zeroes) that the file really consists of.

Cleaning Up the CSV File

We now clean up the .csv file like we just discussed. After doing so, we'll save it with a different file name. This enables us to compare our manually edited file with the original file as GA generated it. The final .csv file that we'll open in SPSS can be downloaded from browsers.csv.

Opening the CSV File in SPSS

We're now ready to open our .csv file in SPSS. For doing so, we'll first launch SPSS and navigate to File SPSS Menu Arrow Open SPSS Menu Arrow Data.

SPSS open .csv data 1

We saw in Notepad++ that the file encoding is UTF-8. When we select the file, SPSS may switch to “Local Encoding” here. If that happens, set it back to UTF-8 before proceeding.
Click Paste instead of Open.

Step 1

SPSS open .csv data 2

Step 2

SPSS open .csv data 3

Remember that variable names make up the first line of our .csv file.

Step 3

SPSS open .csv data 4

Step 4

SPSS open .csv data 5

Here we specify the delimiter we identified when we inspected the file in Notepad++.
Like we saw earlier, the only text qualifier in our raw data is a double quote.

Step 5

SPSS open .csv data 6

We could specify some variable formats here. However, taking care of that later yields better results faster.

Step 6

SPSS open .csv data 7

Initial Syntax

Following the previous steps results in the syntax below. Readers who'd like to copy-paste the syntax from this tutorial may either correct the /FILE specification (second line of syntax) or create a folder d:/project_00/edit and move browsers.csv into this folder.
Before running this syntax, we remove the last line; we didn't ask for any DATASET NAME ... command and running it just gets in our way.

*Initial syntax pasted from menu - needs some corrections.

GET DATA /TYPE=TXT
/FILE="D:\project_00\edit\browsers.csv"
/ENCODING='UTF8'
/DELCASE=LINE
/DELIMITERS=","
/QUALIFIER='"'
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
screen_resolution A9
sessions F5.0
perc_new_sessions A7
new_users F5.0
bounce_rate A7
pages_session F4.2
avg_session_duration A8
goal_conversion_rate A5
goal_completions F1.0
goal_value A7.
CACHE.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.

System Missing Values

SPSS open .csv data - System Missing Values

After running the syntax, let's take a look at data view. Note that there's some system missing values in sessions. The reason for this is that some values contain commas. If we take a close look at the syntax, we see that it contains sessions F5.0 This means that the values of the variable sessions are interpreted according to a f format (standard numeric) which doesn't understand commas. The solution is replacing this line by sessions comma6 We do so for all standard numeric variables. Don't worry if you're not sure which variables that refers to; we'll present the entire corrected syntax in a minute.

String Variables

The other variables may look fine at first, but inspecting variable view shows many of them to be string variables. If we think about the data for a minute, we conclude that screen_resolution should be the only string variable here.

SPSS open .csv data - String Variables

This is the same problem -and the same solution- as the system missing values we saw earlier. For instance, we find perc_new_sessions A7 in our syntax. The A format indicates a string variable and SPSS has chosen this format here because of the percentage signs in the data. The solution is replacing the line by perc_new_sessions pct7.2 In a similar vein, we choose this format for all percentage variables. Again, we'll present the entire corrected syntax in a minute.

Time Variables

Our data contain one time variable: avg_session_duration. Our syntax addresses it as avg_session_duration A8 Our time values are in hh:mm:ss format. In SPSS, this is the time8 format (where 8 denotes the number of characters including the colons). We therefore replace the previous line by avg_session_duration time8

Final Corrected Syntax

After making all modifications as previously explained, we end up with the syntax below. This is the final syntax we'll use for opening the .csv file in SPSS.

*Corrected syntax. Works for all variables except goal_value.

GET DATA /TYPE=TXT
/FILE="D:\project_00\edit\browsers.csv"
/ENCODING='UTF8'
/DELCASE=LINE
/DELIMITERS=","
/QUALIFIER='"'
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
screen_resolution A9
sessions comma6
perc_new_sessions pct7.2
new_users comma6
bounce_rate pct7.2
pages_session F4.2
avg_session_duration time8
goal_conversion_rate pct7.2
goal_completions F1.0
goal_value A7.
CACHE.
EXECUTE.

Euro Signs

Attentive readers may notice that we didn't prevent goal_value from ending up as a string variable. This is because the previous steps couldn't prevent this. For the sake of completeness, we'll fix it with the syntax below.
As explained elsewhere, we always make sure that ALTER TYPE doesn't create any system missing values. We do so in step 3.

Syntax for Fixing Euro Signs

*1. Chop off leading Euro sign.

compute goal_value = char.substr(goal_value,2).
execute.

*2. Convert to numeric variable.

alter type goal_value(f4.2).

*3. Confirm that all 311 values are non missing.

descriptives goal_value.

*4. Set cca (Custom Currency A) to Euro.

set cca '-,€ ,,'.

*5. Set goal_value to cca.

formats goal_value(cca6.2).

*Done.

Result

SPSS open .csv data - ALTER TYPE check

Extracting Screen Width

That's it. We now have a perfect SPSS file holding our .csv data. If that's all you wanted to know, you can stop reading right here, right now.
Right, note that the values of screen_resolution consist of two numbers separated by an x. The first number is the screen width in pixels and we'll extract it by running the syntax below.

*1. Compute position of "x".

compute pos_x = char.index(screen_resolution,'x').
execute.

*2. Create new, empty string variable.

string width(a5).

*3. Compute width = characters of screen_resolution up to position of "x".

compute width = char.substr(screen_resolution,1,pos_x - 1).
execute.

*4. Change width to numeric variable.

alter type width(f5).

*5. Make sure ALTER TYPE doesn't create system missing values.

descriptives width.

*Done.

Cumulative Percentages of Sessions per Screen Width

Finally. We're ready to create the table we're after: we'd like to know the cumulative percentages of sessions per screen width, from largest to smallest. Note that most cases represent many sessions. In order to have SPSS interpret each case as the number of sessions it represents, we'll use WEIGHT as shown below.
Finally, we'll run a FREQUENCIES table which we'll sort according to descending screen width.

SPSS Syntax for Cumulative Percentages

*1. Have each record count as the number of sessions it represents.

weight by sessions.

*2. Create frequencies table, sort by descending widths.

frequencies width/format dvalue.

Checksum

SPSS open .csv data - Checksum

Due to weighting our cases, SPSS reports 11,607 cases. This is the exact same number of sessions that we saw in the Google Analytics (GA) interface (very first screenshot). We can see this as a checksum: these numbers being identical strongly suggests that our SPSS data perfectly correspond to the original GA data.

Purpose

The purpose of this study was to choose a minimal screen width that our website design should support. Due to the nature of our website, very few visitors use a small screen width (smartphone or tablet). Should we invest in supporting mobile devices instead of writing more tutorials? Our final table suggests we shouldn't.

Conclusion

SPSS open .csv data - Cumulative PercentagesMost Interesting Sections from Final Table

The second table is rather large but this doesn't bother us. If we inspect the cumulative percentages and scroll down, we conclude that 69.9% of all sessions used a screen width of 1,366 pixels or more. A screen width of at least 1,280 pixels was used in 91.2% of all sessions.
Finally, if we make sure our website works well at a screen width of 1,024 pixels or more, we support 95.6% of all visits. Our final decision was to support screen widths of 800 pixels or more because it's the smallest our design could reasonably handle.

Tell us what you think!

*Required field. Your comment will show up after approval from a moderator.

THIS TUTORIAL HAS 3 COMMENTS: