When looking at polls in the news, you may notice a margin of error attached to the reported numbers. The margin of error quantifies the uncertainty we attach to a statistic estimated from data, and the confidence interval, formed by adding and subtracting the margin of error from the statistic, represents the range of values in which the true value of the parameter being estimated could plausibly lie. Margins of error and confidence intervals can be computed for many of the statistics we estimate.
In lecture we cover how confidence intervals are computed using probabilistic methods. Here, we will use a computational technique called bootstrapping for computing these intervals. Even though confidence intervals for the statistics we discuss in this course can be computed exactly, that is not always possible in general. A closed-form formula for the margin of error may not exist, either because it has not yet been derived or because it is intractable. Additionally, bootstrapping may be preferable even when a formula for the margin of error does exist: the formula may be very sensitive to the assumptions under which it was derived, while bootstrapping can be a more robust means of computing the margin of error.
Earlier in this lecture, we examined techniques for computationally obtaining a confidence interval for the true mean of a Normal distribution. Unfortunately, doing so required knowing the true mean, which is never the case in practice (otherwise there would be no reason for the investigation). Bootstrapping follows the same idea, but instead of drawing from the Normal distribution to estimate the standard error, it draws simulated data sets from the distribution of the data set under investigation (the empirical distribution). We re-estimate the statistic on each simulated data set to obtain a distribution of the statistic in question, and we use this distribution to estimate the margin of error we should attach to the statistic.
Suppose x is a vector containing the data set under investigation. We can sample from the empirical distribution of x via sample(x, size = length(x), replace = TRUE). (I specify size = length(x) to draw simulated data sets that are the same size as x, which is what should be done when bootstrapping to obtain confidence intervals, but in principle one can draw simulated data sets of any size from the empirical distribution.) We can then compute the statistic of interest on each simulated data set and use quantile() to obtain a confidence interval.
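Putting these pieces together, a minimal sketch of a percentile-bootstrap helper might look like the following. The names boot_ci, stat, and R are mine, introduced for illustration only (this helper does not come from any package), and the function is one plausible way to package the recipe rather than a canonical implementation.

# Hypothetical percentile-bootstrap helper: x is the data vector, stat
# computes the statistic of interest, R is the number of simulated data
# sets, and level is the confidence level
boot_ci <- function(x, stat = mean, R = 2000, level = 0.95) {
  # Re-estimate the statistic on R simulated data sets drawn from the
  # empirical distribution of x
  boot_stats <- replicate(R, stat(sample(x, size = length(x), replace = TRUE)))
  # Take the matching quantiles of the bootstrap distribution
  alpha <- 1 - level
  quantile(boot_stats, c(alpha / 2, 1 - alpha / 2))
}

A call such as boot_ci(x, stat = median) would then give a bootstrapped 95% interval for the median, with no new derivation required.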
Let’s demonstrate by estimating the mean of the determinations of copper in wholemeal flour, in parts per million, contained in the chem data set from the MASS package.
library(MASS)
chem
## [1] 2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70
## [12] 2.20 5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60
## [23] 3.70 3.70
# First, the sample mean of chem
mean(chem)
## [1] 4.280417
# To demonstrate how we sample from the empirical distribution of chem, I
# simulate once, and also compute the mean of the simulated data set
chem_sim <- sample(chem, size = length(chem), replace = TRUE)
chem_sim
## [1] 3.77 2.40 3.70 2.20 2.20 3.03 2.20 2.40 3.10 2.20 2.40
## [12] 3.40 28.95 3.50 3.40 3.77 2.50 3.77 2.40 3.70 3.70 28.95
## [23] 28.95 2.90
mean(chem_sim)
## [1] 6.22875
# Now let's approximate the sampling distribution of the mean by simulating
# 2000 means from the empirical distribution (results will vary from run to
# run; set.seed() could be called first for reproducibility)
mean_boot <- replicate(2000, {
  mean(sample(chem, size = length(chem), replace = TRUE))
})
# The 95% confidence interval
quantile(mean_boot, c(0.025, 0.975))
## 2.5% 97.5%
## 3.0025 6.5750
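The bootstrap distribution can be summarized in other ways as well. As a sketch (I omit the output, since the resampled values change from run to run), we could estimate the standard error of the mean directly from mean_boot, form the corresponding normal-approximation interval, and compare both with the classical t-based interval from t.test(); given the 28.95 outlier in chem, these intervals need not agree.

# The bootstrap estimate of the standard error of the mean is the standard
# deviation of the bootstrap distribution
sd(mean_boot)
# A normal-approximation 95% interval: the statistic plus or minus a margin
# of error of about two bootstrap standard errors
mean(chem) + c(-1, 1) * qnorm(0.975) * sd(mean_boot)
# The classical t-based interval, for comparison
t.test(chem)$conf.int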