This tutorial walks through running nice tables and charts for investigating the association between categorical or dichotomous variables. If statistical assumptions are met, these may be followed up by a chi-square test.
As an example, we'll see whether sector_2010 and sector_2011 in freelancers.sav are associated in any way.
SPSS Quick Data Check
Before doing anything else, let's first just take a quick look at both variables separately. In the syntax below, we first ensure we'll see both values and value labels in our output tables (step 1). Next, we run a basic FREQUENCIES command.
set tnumbers both.
*2. Run frequencies.
frequencies sector_2010 sector_2011.
RECODE System Missing Values
Both variables contain values from 1 through 5 plus system missing values. Since both variables are nominal, we may include these system missings as just another category. This keeps the N nice and constant over analyses and results in cleaner tables.For nicer tables, you may remove “Valid” with a Python script and apply styling with an SPSS table template (.stt file). The syntax below shows how to do so with RECODE.
recode sector_2010 sector_2011 (sysmis = 6).
*2. Explain what formerly missing value means.
add value labels sector_2010 sector_2011 6 '(Unknown)'.
*3. Show only value labels in output.
set tnumbers labels.
*4. Run clean frequency tables.
frequencies sector_2010 sector_2011.
SPSS CROSSTABS for Both Variables
Thus far, we only had a look at both variables separately. In order to see how they're associated, we'll inspect their contingency table obtained from CROSSTABS. Displaying column percentages without frequencies is our preferred option here.
crosstabs sector_2011 by sector_2010
/cells column.
Conclusion: the variables are strongly related.Again, note that we're only describing the data at hand. We're not making any attempt to generalize these results to any larger population. Roughly, most people who worked in a sector in 2010 stayed in the same sector in 2011. For example, 60% of respondents who worked in industry in 2010 stayed in industry. Another 20% moved to finance and the final 20% moved to “other”.
SPSS Clustered Bar Chart Creation
We'll now visualize the contents of the previous table. An option here is a split bar chart but we'll go for a clustered bar chart instead. The screenshots below walk you through the process.
SPSS Clustered Bar Chart Syntax
GRAPH
/BAR(GROUPED)=COUNT BY sector_2011 BY sector_2010
/TITLE='Sector in 2010 by sector in 2011 (N = 40)'.
SPSS Clustered Bar Chart Styling
Although our chart is technically correct, it looks appalling. Its default color scheme basically just looks like a bad joke from the software developers. A fast way to prettify this and similar charts is building and applying an SPSS chart template (.sgt file). Our final result after doing so is shown in the last screenshot.
SPSS Clustered Bar Chart Example
Conclusion: as with the contingency table, we don't see much of a clear pattern here except for people tending to stay in the same sector as the previous year.
THIS TUTORIAL HAS 14 COMMENTS:
By Abass Alhassan on July 4th, 2015
The tutorial has been helpful to me. thanks.
By Garabasa on December 10th, 2015
This is very useful, Many thanks to the Authors.
By Alishan on January 5th, 2016
Very nice tutorial we wnat to learn more many thanks
By Wendy on March 7th, 2017
Hi Ruben
Thanks for a great tutorial! My question is about post hoc testing. I've run a Chi-squared on 3 age groups and 3 body composition groups. It's come up as significant (0.005), but how do I work out in which specific group combinations the differences are? Apologies if you've answered this question previously. Thanks!
By Ruben Geert van den Berg on March 8th, 2017
Hi Wendy, thanks for the compliment!
I'm actually working on new tutorials on popular tests (including chi-square independence) and the answer to your question will be included: add adjusted standardized residuals (ARESID in your syntax) to your CROSSTAB. For reasonable sample sizes, they roughly follow a standard normal distribution. This means that any cell with an ARESID outside the range -2 through 2 is statistically significantly different from what should be expected from mere sampling fluctuation and thus signifies a "real" difference.
Does that make any sense?
Best,
Ruben