Histogram - What Is It?
A histogram is a chart that shows frequencies for equal-width intervals of values of a metric variable.
Histogram - Example
A company wants to know how monthly salaries are distributed over 1,110 employees having operational, middle or higher management level jobs. The screenshot below shows what their raw data look like.
Since these salaries are partly based on commissions, basically every employee has a slightly different salary. Now how can we gain some insight into the salary distribution?
Histogram Versus Bar Chart
We first try and run a bar chart of monthly salaries. The result is shown below.
Our bar chart is pretty worthless. The only thing we learn from it is that most salaries occur just once and some twice. The main problem here is that a bar chart shows the frequency with which each distinct value occurs in the data.
Importantly, note that the first interval is ($832 - $802 =) $30 wide. The last interval represents ($1206 - $1119 =) $87. But both are equally wide in millimeters on your screen. This tells us that the x-axis doesn't have a linear scale, which renders this chart unsuitable for a metric variable such as monthly salary.
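A minimal sketch of what the bar chart computes may make both problems concrete. The salaries below are hypothetical and chosen only so that the first and last gaps match the $30 and $87 mentioned above; they are not the company's data.

```python
from collections import Counter

# Hypothetical salaries for illustration only: almost every value is distinct.
salaries = [802, 832, 845, 845, 861, 878, 903, 903, 927, 1119, 1206]

# A bar chart draws one bar per distinct value; its height is that value's frequency.
value_counts = Counter(salaries)
for value, count in sorted(value_counts.items()):
    print(f"${value}: {'#' * count}")

# Neighbouring bars sit equally far apart on screen,
# but the dollar gaps between them differ, so the axis is not linear.
distinct = sorted(value_counts)
gaps = [b - a for a, b in zip(distinct, distinct[1:])]
print("gaps between neighbouring bars:", gaps)
```

Nearly all bars have a height of 1, and the gaps between neighbouring bars vary widely, which is why such a chart says little about the salary distribution.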
Histogram - Basic Example
Since our bar chart wasn't any good, we now try and run a histogram on our data. The result is shown below.
This chart looks much more useful but how was it generated? Well, we assigned each employee’s salary to a $25 interval ($800 - $825, $825 - $850 and so on). Next, we looked up the number of employees that fall within each such interval. We visualize these frequencies by bars in a chart.
Importantly, the x-axis of our chart has a linear scale: each $25 interval corresponds to the same width in millimeters even if it contains zero employees. The chart we end up with is known as a histogram and, as we'll see in a minute, it's a very useful one.
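The binning steps described above can be sketched as follows. This is an illustration under assumptions rather than the exact procedure behind the chart: the salaries are hypothetical, the function name `bin_counts` is ours, and we assume each interval includes its lower boundary but not its upper one (the tutorial doesn't say which way boundaries are handled).

```python
def bin_counts(values, start, width, n_bins):
    """Count values per interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)  # index of the interval that v falls into
        if 0 <= i < n_bins:            # ignore values outside the axis range
            counts[i] += 1
    return counts

# Hypothetical salaries for illustration only.
salaries = [802, 832, 845, 845, 861, 878, 903, 903, 927, 1119, 1206]

counts = bin_counts(salaries, start=800, width=25, n_bins=17)
for i, c in enumerate(counts):
    lower = 800 + 25 * i
    # Empty intervals still get a row: that is what keeps the axis linear.
    print(f"${lower:>4} - ${lower + 25:>4} | {'#' * c}")
```

Note that intervals without any employees are still printed, just like empty intervals still take up space on the histogram's x-axis.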
Histogram - Bin Width
The bin width is the width of the intervals whose frequencies we visualize in a histogram. Our first example used a bin width of $25; the first bar represents the number of salaries between $800 and $825 and so on. This bin width of $25 is a rather arbitrary choice. The figure below shows histograms over the exact same data, using different bin widths.
Although different bin widths seem reasonable, we feel $10 is rather narrow and $100 is rather wide for the data at hand. Either $25 or $50 seems more suitable.
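To get a feel for how the bin width changes a histogram, the sketch below bins one and the same variable with the four widths mentioned above. The salaries are synthetic (drawn from a normal distribution with an arbitrary mean and spread), so the numbers it prints say nothing about the company's data.

```python
import math
import random

random.seed(1)
# Synthetic salaries for illustration only; the mean (950) and spread (80) are arbitrary.
salaries = [random.gauss(950, 80) for _ in range(1110)]

# Round the axis limits outward to multiples of $100 so every width below divides the range.
start = math.floor(min(salaries) / 100) * 100
stop = math.floor(max(salaries) / 100) * 100 + 100

def histogram(values, start, stop, width):
    """Frequencies for intervals [edge, edge + width) between start and stop."""
    counts = [0] * ((stop - start) // width)
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

bins_per_width = {}
for width in (10, 25, 50, 100):
    counts = histogram(salaries, start, stop, width)
    bins_per_width[width] = len(counts)
    print(f"width ${width:>3}: {len(counts):>3} bins, tallest bar = {max(counts)}")
```

Narrow bins result in many low bars and a jagged chart; wide bins result in a few tall bars that may hide the shape of the distribution.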
Histograms - Why Are They So Useful?
Why are histograms so useful? Well, first of all, charts are much more visual than tables; after looking at a chart for 10 seconds, you can tell much more about your data than after inspecting the corresponding table for 10 seconds. Generally, charts convey information about our data faster than tables, albeit less accurately.
On top of that, histograms also give us much more complete information about our data. Keep in mind that you can reasonably estimate a variable’s mean, standard deviation, skewness and kurtosis from a histogram. However, you can't estimate a variable’s histogram from the aforementioned statistics. We'll illustrate this with an example.
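The first half of that claim can be sketched in a few lines: if we pretend that every value sits at the midpoint of its interval, a histogram's frequencies suffice for approximating the mean and standard deviation. The frequencies below are hypothetical and the midpoint approach is just one common approximation.

```python
import math

# Hypothetical histogram: lower boundaries of $25 intervals and their frequencies.
width = 25
lower_edges = [800, 825, 850, 875, 900]
frequencies = [2, 5, 8, 4, 1]

# Treat every salary as if it were equal to the midpoint of its interval.
midpoints = [edge + width / 2 for edge in lower_edges]
n = sum(frequencies)

mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
sd = math.sqrt(variance)

print(f"estimated mean: {mean:.2f}")
print(f"estimated standard deviation: {sd:.2f}")
```

The estimates are only approximate because the exact values within each interval are lost. The narrower the bins, the closer they tend to come to the statistics computed over the raw data.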
Histogram Versus Descriptive Statistics
Let's say we find two age variables in our data and we're not sure which one we should use. We compare some basic descriptive statistics for both variables and they look almost identical.
So can we conclude that both age variables have roughly similar distributions? If you think so, take a look at their histograms shown below.
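The same point can be sketched with made-up numbers. The two tiny age variables below are not the ones from our data; we constructed them so that their means and standard deviations are exactly equal while their distributions are not.

```python
import statistics
from collections import Counter

# Two made-up age variables, constructed to have identical means and standard deviations.
age_a = [30, 40, 50, 50, 50, 50, 50, 50, 60, 70]  # a single peak around 50
age_b = [40, 40, 40, 40, 40, 60, 60, 60, 60, 60]  # two peaks and nobody aged 50

histograms = {}
for name, ages in (("age_a", age_a), ("age_b", age_b)):
    mean = statistics.mean(ages)
    sd = statistics.stdev(ages)
    print(f"{name}: mean = {mean}, standard deviation = {sd:.2f}")
    # Frequencies for 10-year intervals, keyed by each interval's lower boundary.
    histograms[name] = Counter(age // 10 * 10 for age in ages)
    for lower in range(30, 80, 10):
        print(f"  {lower} - {lower + 10} | {'#' * histograms[name][lower]}")
```

Both variables report the same mean and standard deviation, yet the text histograms look nothing alike.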
Split Histogram - Frequencies
Each of the 1,110 employees in our data has a job level: operational, middle management or higher management. If we want to compare the salary distributions between these three groups, we may inspect a split histogram: we create a separate histogram for each job level and these three histograms have identical axes. The result is shown below.
Our split histogram totally sucks. The problem is that the group sizes are very unequal and these relate linearly to the surface areas of our histograms. The result is that the surface area for higher management (n = 10) is only 1% of the surface area for “operational” (n = 1,000). The histogram for higher management is so small that it's no longer visible.
Split Histogram - Percentages
We just saw how a split histogram with frequencies is useless for the data at hand. Does this mean that we can't compare salary distributions over job levels? Nope. If we choose percentages within job level groups, then each histogram will have the same surface area of 100%. The result is shown below.
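Converting frequencies into percentages within groups boils down to dividing each bar by its own group's total. A minimal sketch, assuming hypothetical frequencies over five identical salary intervals per job level:

```python
# Hypothetical frequencies per job level; every group uses the same five intervals.
frequencies = {
    "operational": [200, 500, 250, 50, 0],
    "middle management": [0, 10, 40, 40, 10],
    "higher management": [0, 0, 1, 4, 5],
}

def to_percentages(counts):
    """Express each frequency as a percentage of its own group's total."""
    total = sum(counts)
    return [100 * c / total for c in counts]

percentages = {level: to_percentages(counts) for level, counts in frequencies.items()}

for level, pcts in percentages.items():
    # Every group now adds up to 100%, regardless of its size.
    print(f"{level:<18} n = {sum(frequencies[level]):>4} ", [round(p, 1) for p in pcts])
```

Since every group's percentages add up to 100%, the smallest group gets bars that are just as visible as those for the largest group.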
Histogram - Final Notes
This tutorial aimed at explaining what histograms are and how they differ from bar charts. In our opinion, histograms are among the most useful charts for metric variables. With the right software (such as SPSS), you can create and inspect histograms very fast and doing so is an excellent way to get to know your data.