Simple Introduction to Confidence Intervals

A confidence interval is a range of values that encloses a parameter with a given likelihood.
So let's say we have a sample of 200 people from a population of 100,000. Our sample data come up with a correlation of 0.41 and indicate that the 95% confidence interval for this correlation runs from 0.29 to 0.52.
This means that

So basically, a confidence interval tells us how much our sample correlation is likely to differ from the population correlation we're after.

Confidence Intervals - Example

El Hierro is the smallest Canary island and has 8,077 inhabitants of 18 years or over. A scientist wants to know their average yearly income. He asks a sample of N = 100. The table below presents his findings.

Confidence Interval Descriptive Statistics

Based on these 100 people, he concludes that the average yearly income for all 8,077 inhabitants is probably between $25,630 and $32,052. So how does that work?

Confidence Intervals - How Does it Work?

Let's say the tax authorities have access to the yearly incomes of all 8,077 inhabitants. The table below shows some descriptive statistics.

Confidence Interval Population Parameters

Now, a scientist who samples 100 of these people can compute a sample mean income. This sample mean probably differs somewhat from the $32,383 population mean. Another scientist could also sample 100 people and come up with yet another mean. And so on: if we drew 100 different samples, we'd probably find 100 different means. In short, sample means fluctuate over samples. So how much do they fluctuate? This is expressed by the standard deviation of sample means over samples, known as the standard error (SE) of the mean. SE is calculated as
$$SE = \frac{\sigma}{\sqrt{N}}$$
so for our data that'll be $$SE = \frac{\$22,874}{\sqrt{100}} = \$2,287.$$
Right. Now, statisticians also figured out the exact frequency distribution of sample means: the sampling distribution of the mean. For our data, it's shown below.

Confidence Interval Population Parameters

Our graph tells us that 95% of all samples will come up with a mean between roughly $27,808 and $36,958. This is basically the mean ± 2SE: $$\$32,383 \pm 2 \cdot \$2,287 \approx \$27,808 \text{ to } \$36,958.$$
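As a rough check on these numbers, the short Python sketch below recomputes SE and the mean ± 2SE range from the population figures quoted above. The variable names are ours, and the normal approximation via `statistics.NormalDist` is an assumption on our part, not necessarily how the graph was produced.

```python
from math import sqrt
from statistics import NormalDist

# Population figures quoted in the text
pop_mean = 32_383
pop_sd = 22_874
n = 100

# Standard error of the mean: sigma / sqrt(N)
se = pop_sd / sqrt(n)

# Rule-of-thumb range: population mean +/- 2 SE
lower = pop_mean - 2 * se
upper = pop_mean + 2 * se

# Share of sample means within +/- 2 SE,
# ASSUMING the sampling distribution is normal
share_within_2se = NormalDist().cdf(2) - NormalDist().cdf(-2)

print(f"SE = {se:,.0f}")
print(f"range = {lower:,.0f} to {upper:,.0f}")
print(f"share within 2 SE = {share_within_2se:.3f}")
```

Under that assumption, slightly more than 95% of sample means fall within 2 SE. The multiplier that gives exactly 95% is about 1.96, which is why ± 2SE is only a rule of thumb.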

In practice, however, we usually don't know the population mean. So we estimate it from sample data. But how much is a sample mean likely to differ from its population counterpart? Well, we just saw that a sample mean has a 95% probability of falling within ± 2SE of the population mean.
Now, we don't know SE because it depends on the (unknown) population standard deviation. However, we can estimate SE from the sample standard deviation. By doing so, most samples will come up with roughly the correct SE. As a result, the 95% of samples whose means fall within ± 2SE typically have confidence intervals enclosing the population mean, as illustrated below.

Confidence Intervals - Illustration

Sampling distribution and confidence intervals. Note that the interval for sample 3 does not contain the population mean μ. This holds for 5% of all CIs.

Now, a sample having a mean within ±2SE may have a confidence interval not containing the population mean. This may happen if it underestimates the population standard deviation. The reverse may occur too.
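However, these two errors balance out. The sample variance is an unbiased estimator of the population variance, and the t distribution used for computing the interval accounts for the extra uncertainty in the estimated SE. So, provided the usual assumptions hold, 95% of all 95% confidence intervals contain the parameter they estimate.
Just as promised.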
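This coverage claim can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it treats incomes as normally distributed (real incomes are skewed), takes the population mean and standard deviation from the text, borrows the critical t value of 1.984 for 99 degrees of freedom, and uses an arbitrary 2,000 repeated samples.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is reproducible

POP_MEAN = 32_383   # population mean quoted in the text
POP_SD = 22_874     # population standard deviation quoted in the text
N = 100             # sample size
T_CRIT = 1.984      # critical t value for df = 99, as used later in the text
N_SAMPLES = 2_000   # hypothetical number of repeated samples


def interval_covers_mean():
    """Draw one sample and report whether its 95% CI contains POP_MEAN."""
    # ASSUMPTION: incomes are normally distributed
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = mean(sample)
    se = stdev(sample) / sqrt(N)  # SE estimated from the sample SD
    return m - T_CRIT * se < POP_MEAN < m + T_CRIT * se


coverage = sum(interval_covers_mean() for _ in range(N_SAMPLES)) / N_SAMPLES
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The printed share should land close to 95%. It won't be exactly 95% because the number of simulated samples is finite.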

Confidence Intervals - Basic Properties

Right, so a confidence interval is basically a likely range of values for a parameter such as a population correlation, mean or proportion. Therefore, wider confidence intervals indicate less precise estimates for such parameters.

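Three factors determine the width of a confidence interval. Everything else equal, a higher confidence level (say 99% instead of 95%), a larger standard deviation and a smaller sample size each result in a wider interval.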
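For a confidence interval around a mean, the width depends on the confidence level, the standard deviation and the sample size. The sketch below illustrates this with a hypothetical helper, `ci_width`. It assumes a normal critical value rather than a t value to keep things short, and all inputs except the sample standard deviation of $16,185 from the El Hierro example are made up.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sd, n, confidence=0.95):
    """Approximate width of a confidence interval for a mean."""
    # ASSUMPTION: normal critical value instead of the t distribution
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sd / sqrt(n)


baseline = ci_width(sd=16_185, n=100)                            # El Hierro sample SD
larger_sample = ci_width(sd=16_185, n=400)                       # hypothetical: 4x the sample size
higher_confidence = ci_width(sd=16_185, n=100, confidence=0.99)  # hypothetical: 99% level
larger_sd = ci_width(sd=32_370, n=100)                           # hypothetical: twice the SD

print(f"baseline:          {baseline:,.0f}")
print(f"4x sample size:    {larger_sample:,.0f}")
print(f"99% confidence:    {higher_confidence:,.0f}")
print(f"2x std. deviation: {larger_sd:,.0f}")
```

Note that quadrupling the sample size only halves the width, because N enters the formula through its square root.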

Confidence Intervals or Statistical Significance?

If both are available, confidence intervals. Why? Well, confidence intervals give the same -and more- information than statistical significance. Some examples:

Confidence Interval Versus Statistical Significance One Sample T Test
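One way to see the overlap between the two: for a two-sided one-sample t-test at α = 0.05, the null hypothesis that the population mean equals some value is rejected exactly when that value falls outside the 95% confidence interval. The sketch below applies this to the El Hierro interval. The function name and the hypothesized means are made up for illustration.

```python
def significant_at_5_percent(ci_lower, ci_upper, hypothesized_mean):
    """True if a two-sided test at alpha = 0.05 would reject the hypothesized mean.

    Relies on the duality between a 95% CI and a two-sided test at alpha = 0.05.
    """
    return not (ci_lower <= hypothesized_mean <= ci_upper)


CI_LOWER, CI_UPPER = 25_630, 32_052  # 95% CI from the El Hierro example

for mu0 in (30_000, 35_000):  # hypothetical null-hypothesis values
    rejected = significant_at_5_percent(CI_LOWER, CI_UPPER, mu0)
    verdict = "significant" if rejected else "not significant"
    print(f"H0: mean = {mu0:,} -> {verdict}")
```

The interval answers this question for every hypothesized value at once, and on top of that shows how precise the estimate is.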

So should we stop reporting statistical significance altogether in favor of confidence intervals? Probably not. Confidence intervals are not available for every test: ANOVA and the chi-square independence test, for instance, don't come with a single confidence interval. If we compare 2 means, a single confidence interval for the difference tells it all. But that's not going to work for comparing 3 or more means...

Formulas and Example Calculations

Statistical software such as SPSS, Stata or SAS computes confidence intervals for us so there's no need to bother about any formulas or calculations. Do you want to know anyway? Then let's go: we computed the confidence interval for our example in this Google Sheet (downloadable as Excel) as shown below.

Confidence Interval Calculation Googlesheets

So how does it work? Well, first off, our sample data came up with the descriptive statistics shown below.

Confidence Interval Descriptive Statistics

We estimate the standard error of the mean as $$SE_{mean} = \frac{S}{\sqrt{N}}$$
so that'll be $$SE_{mean} = \frac{\$16,185}{\sqrt{100}} = \$1,619.$$
Next, $$T = \frac{M - \mu}{SE_{mean}}$$
This formula tells us that the difference between the sample mean \(M\) and the population mean \(\mu\) divided by \(SE_{mean}\) follows a t distribution. We're really just standardizing the mean difference here: T works like a z-score, except that it's based on the estimated rather than the true standard error.
Finally, we need the degrees of freedom given by $$Df = N - 1$$
so that'll be $$Df = 100 - 1 = 99.$$
So between which t-values do we find 95% of all (standardized) mean differences? We can look this up in Google Sheets as shown below.

Inverse T Distribution Google Sheets

This tells us that a proportion of 0.025 (or 2.5%) of all t-values are smaller than -1.984. Because the t-distribution is symmetrical, a proportion of 0.025 of t-values are larger than 1.984 as well. These critical t-values are visualized below.

Critical T Value Df 99
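For readers without a spreadsheet at hand, the critical values can also be approximated from scratch. The sketch below is one possible approach and not the method used in the sheet: it integrates the t density numerically (Simpson's rule) and finds the quantile by bisection. The function names, the step count and the search bracket are our own choices.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2_000):
    """P(T <= x), via Simpson's rule on the density between 0 and x."""
    h = x / steps  # negative for x < 0, which makes the integral negative
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Value t with P(T <= t) = p, found by bisection."""
    lo, hi = -50.0, 50.0  # bracket chosen to be ample for df = 99
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


df = 100 - 1
t_lower = t_quantile(0.025, df)
t_upper = t_quantile(0.975, df)
print(f"critical t-values for df = {df}: {t_lower:.3f} and {t_upper:.3f}")
```

For very small degrees of freedom and extreme probabilities, the fixed bracket may be too narrow. A statistics library is the safer choice for anything beyond a quick check.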

The illustration tells us that our previous rule of thumb of roughly ±2SE is ±1.984SE for this example: 95% of all standardized mean differences are between -1.984 and 1.984. Finally, the 95% confidence interval is $$M - T_{0.975} \cdot SE_{mean} \lt \mu \lt M + T_{0.975} \cdot SE_{mean} $$
so that'll be $$\$28,841 - 1.984 \cdot \$1,619 \lt \mu \lt \$28,841 + 1.984 \cdot \$1,619$$
which results in $$\$25,630 \lt \mu \lt \$32,052.$$
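The whole calculation fits in a few lines of Python. The sketch below is a minimal illustration using the sample statistics and the critical t-value from this example. The function name `confidence_interval` is ours and not taken from any particular package.

```python
from math import sqrt


def confidence_interval(sample_mean, sample_sd, n, t_crit):
    """Confidence interval for a mean: M +/- t * S / sqrt(N)."""
    se = sample_sd / sqrt(n)  # estimated standard error of the mean
    margin = t_crit * se      # half the width of the interval
    return sample_mean - margin, sample_mean + margin


# Sample statistics and critical t-value (df = 99) from the example
lower, upper = confidence_interval(
    sample_mean=28_841, sample_sd=16_185, n=100, t_crit=1.984
)
print(f"95% CI: {lower:,.0f} to {upper:,.0f}")
```

For a different confidence level, only `t_crit` needs to change.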
Thanks for reading.
