When reporting on data, we usually summarize our variables. The most basic way of doing so is presenting their frequency distributions. Visualizing these as histograms tends to give immediate insight into what a variable ‘looks like’, especially when many distinct values are present.
However, frequency distributions often provide more detail than we actually need. Sometimes a single number that summarizes a variable, such as its average, is all we want.
Such summary statistics also facilitate comparing variables; it's much easier to compare just two numbers than two entire frequency tables. The same holds for groups of cases (rows of data), for example male versus female respondents.
Which summary statistics are appropriate for a given variable depends on its measurement level. We therefore work through the measurement levels in order, from nominal to metric.
A summary statistic appropriate for any variable (including nominal variables) is the mode. The mode is the value that has the highest frequency. Note that in some cases the highest frequency occurs for two (or even more) values, in which case a variable has two (or more) modes. If we return to the employee data presented in a previous tutorial, we thus see that the mode for “Education Type” is “Law”. It has a frequency of 4 and all of the other education types have lower frequencies.
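Finding the mode amounts to counting how often each value occurs and picking the value (or values) with the highest count. The sketch below illustrates this in Python. The list `education` holds made-up values, not the actual employee data from the previous tutorial; they were only chosen so that “Law” occurs 4 times.

```python
from collections import Counter
from statistics import multimode

# Hypothetical education types for 10 employees (illustration only,
# not the tutorial's actual employee data).
education = ["Law", "Economy", "Law", "Other", "Law",
             "Economy", "Technical", "Law", "Other", "Technical"]

# Frequency of each distinct value.
frequencies = Counter(education)

# All values that share the highest frequency; a list with more than
# one element means the variable has more than one mode.
modes = multimode(education)

print(frequencies.most_common())
print(modes)
```

Using `multimode` rather than `mode` keeps the case of two or more modes visible instead of silently returning only one of them.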
For variables with many distinct values, the mode often refers to a range of values (“the modal income is between € 2000,- and € 2250,-”). How to group values into such ranges is subjective; a different mode may be found if values are grouped differently.
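The effect of grouping can be made concrete with a small sketch. The incomes below are invented for illustration, and `modal_bin` is a hypothetical helper, not part of any library. It groups values into ranges of a given width (starting at 0) and reports the range with the highest frequency.

```python
from collections import Counter

# Hypothetical monthly incomes in euros (illustration only).
incomes = [1550, 1600, 1700, 1800, 1900, 2050,
           2100, 2150, 2200, 2550, 2600, 2900]

def modal_bin(values, width):
    """Return (lower, upper) of the most frequent range of the given width.

    Ranges start at multiples of `width`; the lower bound is included,
    the upper bound is not.
    """
    lower_bounds = Counter((value // width) * width for value in values)
    lower, _count = lower_bounds.most_common(1)[0]
    return lower, lower + width

print(modal_bin(incomes, 250))  # ranges of width 250
print(modal_bin(incomes, 500))  # ranges of width 500
```

For these values, ranges of width 250 put the mode between 2000 and 2250, whereas ranges of width 500 put it between 1500 and 2000: the same data, a different mode.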
Percentiles are appropriate for ordinal and metric variables. The nth percentile is the value that separates the n% of cases with the lowest values from the remaining cases.
Let's consider the frequency distribution for “Experience”. Note that the values are sorted from low to high and we added cumulative percentages to the table.
| Experience (Years) | Frequency | Percentage | Cumulative Percentage |
From the cumulative percentages we see that 50% of our respondents have 1 or 2 years of experience. It follows that the other 50% have 3 or more years of experience. Hence, the 50th percentile lies between 2 and 3 (years of experience). In a similar vein, the 90th percentile lies between 5 and 9.
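The reasoning from cumulative percentages can be sketched in Python. The values in `experience` are hypothetical: they were chosen to agree with the statements made in this tutorial and are not the actual table. The helper `first_value_reaching` is likewise an assumption made for illustration.

```python
from collections import Counter

# Hypothetical years of experience for 10 employees, chosen to agree
# with the statements in the text (not the tutorial's actual table).
experience = [1, 1, 1, 2, 2, 3, 3, 4, 5, 9]

counts = Counter(experience)
n_cases = len(experience)

# Build value -> cumulative percentage, with values sorted low to high.
cumulative = {}
running = 0
for value in sorted(counts):
    running += counts[value]
    cumulative[value] = 100 * running / n_cases

for value, cum_pct in cumulative.items():
    print(f"{value} years: {counts[value]} cases, cumulative {cum_pct:.0f}%")

def first_value_reaching(pct):
    """Smallest value whose cumulative percentage is at least pct."""
    return next(v for v, c in cumulative.items() if c >= pct)

print(first_value_reaching(50))
print(first_value_reaching(90))
```

When the cumulative percentage hits the requested percentage exactly, as it does here for 50% and 90%, the percentile lies between the value found and the next higher value in the table.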
The exact “value between 5 and 9” may be obtained by applying an interpolation formula. We'll skip that since we won't calculate percentile values manually anyway; for now it's sufficient that you understand what a percentile is.
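For readers who are curious anyway: software typically offers more than one interpolation rule, and they need not agree. As a sketch, Python's standard `statistics.quantiles` function supports an "exclusive" and an "inclusive" method. Neither is necessarily the formula meant in this tutorial, and the `experience` values are hypothetical, chosen only to agree with the statements in the text.

```python
from statistics import quantiles

# Hypothetical years of experience (illustration only).
experience = [1, 1, 1, 2, 2, 3, 3, 4, 5, 9]

# n=10 returns 9 cut points: the 10th, 20th, ..., 90th percentiles.
deciles_exclusive = quantiles(experience, n=10, method="exclusive")
deciles_inclusive = quantiles(experience, n=10, method="inclusive")

# The last cut point is the 90th percentile.
p90_exclusive = deciles_exclusive[-1]
p90_inclusive = deciles_inclusive[-1]

print(p90_exclusive, p90_inclusive)
```

For these values the two methods return different results, but both lie between 5 and 9, which is all the cumulative percentages tell us.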
The median is the 50th percentile. It is all too often confused with the average (which we'll discuss later). For example, the statement that “half of all people make less money than the average income” is not necessarily true. The correct version is “half of all people make less money than the median income”.
We see exactly this in the “experience” table we just presented; the average number of years is 3.1. No less than 70% of all employees have less experience than the average number of years.
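The difference between the median and the average is easy to check numerically. The sketch below again uses hypothetical `experience` values that were chosen to agree with the figures quoted in the text; they are not the tutorial's actual data.

```python
from statistics import mean, median

# Hypothetical years of experience (illustration only).
experience = [1, 1, 1, 2, 2, 3, 3, 4, 5, 9]

avg = mean(experience)
med = median(experience)

# Share of cases with a value strictly below the average / the median.
share_below_avg = sum(x < avg for x in experience) / len(experience)
share_below_med = sum(x < med for x in experience) / len(experience)

print(f"average = {avg}, share below average = {share_below_avg:.0%}")
print(f"median  = {med}, share below median  = {share_below_med:.0%}")
```

A single employee with 9 years of experience pulls the average up, while the median still splits the cases in half.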
We got a bit ahead of ourselves by already mentioning the average. We'll discuss it along with some other summary statistics in the next tutorial.