Confidence Intervals | Alper Kokcu

A confidence interval is a range of values that describes the uncertainty surrounding an estimate.

Frequentist vs Bayesian

Confidence interval (Frequentist)
Credible interval (Bayesian)

Confidence intervals describe the uncertainty of an estimate for quantities such as:

The average return on investment for a stock portfolio.
The average maintenance costs for factory machinery.
The percentage of customers who will register for rewards programs.
The percentage of website visitors who will click on an ad.

Estimate types

Point estimate – uses a single value to estimate a population parameter.

Point estimates might be:
Population mean of penguin weight: 30 lbs.
Population proportion of voters: 55%

Interval estimate – uses a range of values to estimate a population parameter.

Confidence intervals might be:
Penguin weight: 95% CI [28, 32]
Voters: 99% CI [51, 57]

Typically, data professionals use confidence intervals rather than point estimates to share results. A point estimate can be useful, but a single value like 30 pounds does not express the uncertainty built into any estimate. This uncertainty arises because the estimate is based on a random sample, and a different sample would give a slightly different value.

Confidence interval

A confidence interval includes:

sample statistic
margin of error
confidence level

Interval – sample statistic +/- margin of error.
Margin of error – the maximum expected difference between a population parameter and a sample estimate.

Margin of error = z-score * SE

Confidence level – describes the likelihood that a particular sampling method will produce a confidence interval that includes the population parameter.

Common confidence levels are 90, 95, and 99%. 95% is a popular choice. This is a choice based on tradition and statistical research and education. We can adjust the confidence level to meet the requirements of our analysis.

Imagine we are a data professional working for a fashion company. Our manager asks us to estimate sales revenue for the new line of spring clothing. When we meet with stakeholders, we might say, “I think we’ll do $1 million in sales.” or we might say, “Based on a 95% confidence level, I estimate that our sales revenue will be between $950,000 and $1,050,000.”

The first statement offers a point estimate. The second statement provides a confidence level and an interval estimate and communicates the uncertainty in the estimate. It gives our stakeholders more information and helps them make more informed decisions about issues related to future sales revenue.

Interpret confidence intervals

The confidence level expresses the uncertainty of the estimation process, and confidence intervals are one of the most misunderstood concepts in statistics. For that reason, let’s start with the common misinterpretations.

Here are four common misinterpretations of this concept.

1- A 95% confidence interval means that there is a 95% probability that the population mean falls within the interval we constructed. This is not necessarily true.

Like any population parameter, the population mean is a constant, not a random variable. While the value of the sample mean varies from sample to sample, the value of the population mean does not change. The probability that a constant falls within any given range of values is always 0% or 100%. It either falls within the range of values, or it doesn’t.

2- A 95% confidence interval implies that 95% of all possible sample means fall within the range of the interval. This is not necessarily true.

For example, our 95% confidence interval for mean penguin weight is between 28 pounds and 32 pounds. Imagine we take repeated samples of 100 penguins and calculate the mean weight for each sample. It’s possible that over 5% of our sample means will be less than 28 pounds or greater than 32 pounds.

3- A 95% confidence interval refers to the percentage of data values that fall within the interval. This is not necessarily true.

A 95% confidence interval shows a range of values that likely includes the actual population mean. This is not the same as a range that contains 95% of the data values in the population.

For example, our 95% confidence interval for the mean penguin weight is between 28 pounds and 32 pounds. It may not be accurate to say that 95% of all weight values fall within this interval. It’s possible that over 5% of the penguin weights in the population are outside this interval, either less than 28 pounds or greater than 32 pounds.
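A quick simulation can make this third point concrete. The population below is hypothetical: a mean of 30 lbs and a standard deviation of 10 lbs are assumptions chosen only so that a sample of 100 penguins would give roughly the [28, 32] interval used above. Under those assumptions, only a small share of individual weights (about 16%) lands inside the interval.

```python
import random

# Hypothetical population, not from the original data:
# normally distributed weights with mean 30 lbs and standard deviation 10 lbs.
rng = random.Random(1)
weights = [rng.gauss(30, 10) for _ in range(100_000)]

# The 95% confidence interval for the MEAN weight from the text.
lower, upper = 28, 32

# Share of individual penguin weights that fall inside that interval.
share_inside = sum(lower <= w <= upper for w in weights) / len(weights)
print(f"Share of individual weights inside [28, 32]: {share_inside:.1%}")
```

The interval is narrow because it describes where the mean is likely to be, not where individual values are spread.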

4- A confidence interval accounts for every possible source of error in our results. This is not true. While every confidence interval includes a margin of error, many other kinds of errors can enter into statistical analysis. For example, the questions in a survey may be poorly designed or sampling bias may affect the sample data.

The margin of error is a useful measure of uncertainty and makes our estimate more reliable, but it’s not the only possible source of error in our analysis. So when we’re interpreting a confidence interval, we need to remember that the uncertainty lies in an estimation process based on random sampling.

A 95% confidence level refers to the success rate of that process. In other words, we can expect 95% of the random intervals we generate to capture the population parameter.

The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling.
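The long-term success rate can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a known mean of 30 and standard deviation of 10 (hypothetical values), and the function name coverage_rate is made up for illustration. We draw many samples, build a 95% interval from each one, and count how often the interval contains the true mean.

```python
import random
from statistics import NormalDist

def coverage_rate(true_mean=30.0, sigma=10.0, n=100, trials=2000,
                  confidence_level=0.95, seed=0):
    """Share of simulated intervals that contain the true population mean."""
    rng = random.Random(seed)
    # Two-sided z-score for the chosen confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sigma / n ** 0.5  # margin of error = z-score * SE
    hits = 0
    for _ in range(trials):
        # Draw one random sample and compute its mean.
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # Did this sample's interval capture the true mean?
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Share of intervals that captured the true mean: {rate:.3f}")
```

The printed share should come out close to 0.95. Any single interval either contains the true mean or it doesn’t, and the 95% describes the method across many repetitions.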

Construct a confidence interval for a proportion

Steps for constructing a confidence interval.

Identify a sample statistic. > This is a sample proportion.
Choose a confidence level. > Mostly 95%
Find the margin of error. > z-score * SE
Calculate the interval.

The table below shows the z-scores that correspond to popular confidence levels.

Confidence Level	Z-score
90%	1.645
95%	1.96
99%	2.58
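The z-scores in the table can be reproduced from the standard normal distribution. The sketch below uses NormalDist from Python’s standard library, and the helper name z_score is only an illustration. Note that the 99% value comes out as 2.576, which is commonly rounded to 2.58 as in the table.

```python
from statistics import NormalDist

def z_score(confidence_level: float) -> float:
    """Two-sided critical value of the standard normal distribution."""
    # Split the leftover probability equally between the two tails.
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_score(level):.3f}")
```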

The formula for the standard error of the proportion is SE(p̂) = sqrt(p̂ * (1-p̂) / n)
(the square root of the sample proportion times 1 minus the sample proportion divided by the sample size)

Suppose a poll of 100 voters finds that 55% of them plan to vote for our candidate.
Our sample proportion = 0.55
Our sample size = 100
SE will be about = sqrt(0.55 * (1-0.55) / 100) = 0.05
So margin of error will be = 1.96 * 0.05 = 0.098

Upper & lower limits of our interval
Upper limit = Sample proportion + margin of error = 0.55 + 0.098 = 0.648 = 64.8%
Lower limit = Sample proportion – margin of error = 0.55 – 0.098 = 0.452 = 45.2%

So our Confidence interval is
95% CI [45.2, 64.8]
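The same calculation can be sketched in a few lines of Python. The function name proportion_ci is hypothetical, and the sketch uses the normal approximation described above with the z-score passed in as a parameter. It does not round the standard error to 0.05 along the way, but the limits still round to the same values.

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Confidence interval for a proportion: p_hat +/- z * SE."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    margin = z * se                     # margin of error
    return p_hat - margin, p_hat + margin

lower, upper = proportion_ci(0.55, 100)
print(f"95% CI [{lower:.1%}, {upper:.1%}]")
```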

While our confidence interval mostly lies above 50 percent, this isn’t necessarily a reason to be optimistic about the upcoming election, since the lower limit of 45.2 percent falls below 50 percent. Based on the confidence interval, losing the election is still a possibility.

The campaign team may want to invest more in TV or social media advertising to ensure victory. Or if the campaign team wants a more accurate estimate of the election results, they may request another poll with a larger sample size.

Changing sample size

As the sample size gets larger, the confidence interval gets narrower. With a sample of 100, the interval covers 19.6 percentage points. With a sample size of 1,000, the interval covers 6.2 percentage points.

Sample size = 100: [45.2, 64.8], width = 19.6 percentage points
Sample size = 1000: [50.9, 57.1], width = 6.2 percentage points

This is because as our sample size increases, our margin of error decreases. If we could sample every member of the population, the margin of error would be zero.
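A short sketch shows how the width shrinks as the sample grows. It keeps the sample proportion fixed at 0.55 for every sample size, which is an assumption, so it reports widths rather than endpoints. The sample size of 10,000 is an extra illustrative value that is not part of the example above. Because the standard error is not rounded here, the width for a sample of 100 comes out as 19.5 rather than 19.6.

```python
from math import sqrt

def interval_width(p_hat, n, z=1.96):
    """Full width of the interval: twice the margin of error."""
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)

# Assumption: the sample proportion stays at 0.55 for every sample size.
widths = {n: interval_width(0.55, n) for n in (100, 1000, 10000)}
for n, w in widths.items():
    print(f"n = {n:>6}: width = {w * 100:.1f} percentage points")
```

Since the sample size sits under a square root, making the sample ten times larger shrinks the interval by a factor of about 3.2, not 10.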

Construct a confidence interval for a mean

Steps for constructing a confidence interval.

Identify a sample statistic. > Here we’ll work with the sample mean
Choose a confidence level. > Mostly 95%
Find the margin of error. > z-score * SE
Calculate the interval.

The formula for the standard error of the mean is SE = σ / sqrt(n)
(the population standard deviation divided by the square root of the sample size)
(“σ” when standard deviation of population is known, otherwise “s”)

Battery life

Sample mean: 20.5 hrs
Sample standard deviation: 1.7 hrs
Population standard deviation: 1.5 hrs

Our sample mean = 20.5
Our sample size = 100
SE will be about = 1.5 / sqrt(100) = 0.15
So margin of error will be = 1.96 * 0.15 = 0.294

Upper & lower limits of our interval
Upper limit = Sample mean + margin of error = 20.5 + 0.294 = 20.794 hrs =~ 20:48
Lower limit = Sample mean – margin of error = 20.5 – 0.294 = 20.206 hrs =~ 20:12

So our Confidence interval is
95% CI [20:12, 20:48]
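The battery life calculation can be sketched the same way. The function names mean_ci and to_hh_mm are hypothetical helpers, and the sketch assumes the population standard deviation of 1.5 hours is known, as in the example.

```python
from math import sqrt

def mean_ci(sample_mean, sigma, n, z=1.96):
    """Confidence interval for a mean when sigma is known."""
    margin = z * sigma / sqrt(n)  # margin of error = z-score * SE
    return sample_mean - margin, sample_mean + margin

def to_hh_mm(hours):
    """Format decimal hours as hours:minutes, rounded to the nearest minute."""
    total_minutes = round(hours * 60)
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"

lower, upper = mean_ci(20.5, 1.5, 100)
print(f"95% CI [{lower:.3f}, {upper:.3f}] hours")
print(f"95% CI [{to_hh_mm(lower)}, {to_hh_mm(upper)}]")
```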

The lower limit of our interval 20 hours and 12 minutes is above the company’s goal of 20 hours. This helps the marketing team feel confident about advertising the battery life of the cell phone to be at least 20 hours.

Changing confidence level

As the confidence level gets higher, the confidence interval gets wider. With a confidence level of 95%, the interval covers 36 minutes. With a confidence level of 99%, it covers 46 minutes.

Confidence level = 95%
[20:12, 20:48] = 36 min

Confidence level = 99%
[20:07, 20:53] = 46 min

This is because a wider confidence interval is more likely to include the actual population parameter.
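The effect of the confidence level on the width can be sketched with the z-scores from the table earlier in this post. The helper name width_minutes is hypothetical. Without rounding the endpoints to whole minutes first, the 95% width comes out at about 35 minutes rather than 36.

```python
from math import sqrt

# z-scores for common confidence levels, as listed in the table above.
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.58}

def width_minutes(sigma, n, z):
    """Full interval width in minutes for a mean measured in hours."""
    return 2 * z * sigma / sqrt(n) * 60

# Battery example: sigma = 1.5 hours, n = 100.
widths = {level: width_minutes(1.5, 100, z) for level, z in Z_SCORES.items()}
for level, w in widths.items():
    print(f"{level:.0%} confidence: width = {w:.1f} minutes")
```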

Note that in this example we know that the population standard deviation is 1.5 hours. However, in practice, the population standard deviation is often unknown and has to be estimated based on the sample standard deviation. This is because it’s difficult to get complete data on a large population. If we don’t know the population standard deviation, this changes the calculations for the confidence interval.

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.