R Functions for Confidence Intervals

Confidence intervals and hypothesis testing are closely related concepts, so the R functions that find confidence intervals are also the functions that perform hypothesis testing (all of these functions being in the stats package, which is loaded by default). If desired, it is possible to extract just the desired confidence interval from the output of these functions, though when used interactively, the confidence interval and the result of some hypothesis test are often shown together.

When constructing confidence intervals, there are two parameters to be aware of: conf.level and alternative. conf.level specifies the desired confidence level associated with the interval, \(C\). Thus, conf.level = .99 tells the function computing the confidence interval that a 99% confidence interval is desired. (The default behavior is to construct a 95% confidence interval.) alternative determines whether a two-sided interval, an upper bound, or a lower bound is computed. alternative = "two.sided" creates a two-sided confidence interval (this is usually the default behavior, though, so specifying this may not be necessary). alternative = "greater" computes a one-sided \(100C\)% confidence lower bound, and alternative = "less" returns a one-sided \(100C\)% confidence upper bound. (The parameter alternative gets its name and behavior from the alternative hypothesis when performing hypothesis testing, which we will discuss next lecture.)

I now discuss different confidence intervals intended to capture different population parameters.

Population Mean, Single Sample

We saw a procedure for constructing a confidence interval for the population mean earlier, and that procedure assumed the population standard deviation \(\sigma\) is known. That assumption is unrealistic, and instead we base our confidence intervals on the random variable \(T = \frac{\overline{X} - \mu}{\frac{S}{\sqrt{n}}}\). If the data \(X_i\) is drawn from some Normal distribution, then \(T\) will follow Student’s \(t\) distribution with \(\nu = n - 1\) degrees of freedom: \(T \sim t(n - 1)\). Thus the confidence interval is constructed using critical values from the \(t\) distribution, and the resulting confidence interval is:

\[\bar{x} \pm t_{\frac{\alpha}{2},n - 1}\frac{s}{\sqrt{n}} \equiv \left(\bar{x} - t_{\frac{\alpha}{2},n - 1}\frac{s}{\sqrt{n}}; \bar{x} + t_{\frac{\alpha}{2},n - 1}\frac{s}{\sqrt{n}}\right)\]

where if \(T \sim t(\nu)\), \(P\left(T \geq t_{\alpha,\nu}\right) = \alpha\).
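
This interval is straightforward to compute by hand with qt(), which returns \(t\) critical values. A quick sketch, assuming vers holds the versicolor sepal lengths from the built-in iris data set (the document uses vers without showing its construction, so this definition is an assumption that matches the output shown below):

```r
# Manual t confidence interval for the mean
# Assumption: vers is the versicolor sepal length data from iris
vers <- iris$Sepal.Length[iris$Species == "versicolor"]
n <- length(vers)
xbar <- mean(vers)
s <- sd(vers)
alpha <- 0.05
t_crit <- qt(1 - alpha/2, df = n - 1)  # critical value t_{alpha/2, n-1}
ci <- xbar + c(-1, 1) * t_crit * s / sqrt(n)
ci
```

This reproduces the interval that t.test(vers) reports below.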

One should check the Normality assumption made by this confidence interval before using it. A Q-Q plot is a good way to do so, and a Q-Q plot checking for Normality can be created using the qqnorm() function. qqnorm(x) creates a Q-Q plot checking Normality for a data vector x, and qqline(x) will add a line by which to judge how well the distribution fits the data (points lying close to the line suggest a good fit, while strange patterns like an S-shaped curve suggest a poor fit).

I check how well the Normal distribution describes the versicolor iris flower sepal length data below. It is clear that the assumption of Normality holds quite well for this data.

qqnorm(vers)
qqline(vers)

That said, the Normality assumption turns out not to be crucial for these confidence intervals, and the methods are robust enough to work well even when that assumption does not hold. If a distribution is roughly symmetric with no outliers, it’s probably safe to build a confidence interval using methods based on the \(t\) distribution, and even when these two conditions do not quite hold, there are circumstances where the \(t\) distribution will still work well.

The function t.test() will construct a confidence interval for the true mean \(\mu\), and I demonstrate its use below.

# Construct a 95% two-sided confidence interval
t.test(vers)
## 
##  One Sample t-test
## 
## data:  vers
## t = 81.318, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  5.789306 6.082694
## sample estimates:
## mean of x 
##     5.936
# Construct a 99% upper confidence bound
t.test(vers, conf.level = 0.99, alternative = "less")
## 
##  One Sample t-test
## 
## data:  vers
## t = 81.318, df = 49, p-value = 1
## alternative hypothesis: true mean is less than 0
## 99 percent confidence interval:
##      -Inf 6.111551
## sample estimates:
## mean of x 
##     5.936
# t.test creates a list containing all computed stats, including the
# confidence interval. I can extract the CI by accessing the conf.int
# variable in the list
tt1 <- t.test(vers, conf.level = 0.9)
tt2 <- t.test(vers, conf.level = 0.95)
tt3 <- t.test(vers, conf.level = 0.99)
str(tt1)
## List of 9
##  $ statistic  : Named num 81.3
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 49
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 6.14e-54
##  $ conf.int   : atomic [1:2] 5.81 6.06
##   ..- attr(*, "conf.level")= num 0.9
##  $ estimate   : Named num 5.94
##   ..- attr(*, "names")= chr "mean of x"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "mean"
##  $ alternative: chr "two.sided"
##  $ method     : chr "One Sample t-test"
##  $ data.name  : chr "vers"
##  - attr(*, "class")= chr "htest"
# Create a graphic comparing these CIs
library(ggplot2)
# The end goal is to create an object easily handled by ggplot. I start by
# making a list with my data
conf_int_dat <- list(`90% CI` = as.list(tt1$conf.int), `95% CI` = as.list(tt2$conf.int), 
    `99% CI` = as.list(tt3$conf.int))
# Use melt and cast from reshape package to make a data frame I can work
# with
library(reshape)
melted_cid <- melt(conf_int_dat)
melted_cid
##      value L2     L1
## 1 5.813616  1 90% CI
## 2 6.058384  2 90% CI
## 3 5.789306  1 95% CI
## 4 6.082694  2 95% CI
## 5 5.740370  1 99% CI
## 6 6.131630  2 99% CI
plot_conf_int_dat <- cast(melted_cid, L1 ~ L2)
names(plot_conf_int_dat) <- c("Type", "Lower", "Upper")
plot_conf_int_dat
##     Type    Lower    Upper
## 1 90% CI 5.813616 6.058384
## 2 95% CI 5.789306 6.082694
## 3 99% CI 5.740370 6.131630
# Now, create desired graphic
ggplot(plot_conf_int_dat) + geom_segment(aes(x = Lower, xend = Upper, y = Type, 
    yend = Type), size = 2)

# It's clear that larger confidence levels lead to larger intervals, as can
# be seen in this graphic

Population Proportion, Single Sample

Suppose we collect categorical data, and encode observations that have some trait being tracked (“successes”) with 1 and the rest with 0. Each observation is then a Bernoulli random variable, and our objective is to estimate the population proportion of successes, \(p\). The number of successes in the sample, \(\sum_{i = 1}^{n} X_i\), is a binomial random variable, and the natural estimator for the population proportion is the sample proportion, \(\hat{p} = \frac{1}{n}\sum_{i = 1}^{n} X_i\).

Computing a confidence interval for \(p\) has been revisited many times, each method trying to rectify faults of others (there is even an R package, binom, devoted to providing all the different methods for computing confidence intervals for the population proportion). A method that used to be popular, based on the Normal approximation for the distribution of the sample proportion \(\hat{p}\), is the confidence interval:

\[\hat{p} \pm z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \equiv \left(\hat{p} - z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}; \hat{p} + z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right)\]
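
For reference, the interval above can be computed directly with qnorm(). A sketch using, say, 420 successes in 1001 trials (note that prop.test(), discussed next, will give slightly different numbers because it applies a continuity correction by default):

```r
# Normal-approximation (Wald) interval for a proportion
x <- 420; n <- 1001
phat <- x / n
z_crit <- qnorm(0.975)                      # z_{alpha/2} for a 95% interval
me <- z_crit * sqrt(phat * (1 - phat) / n)  # margin of error
ci <- phat + c(-1, 1) * me
ci
```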

The function prop.test() will compute confidence intervals using this method, having the format prop.test(x, n), where x is the number of successes and n the sample size.

A June CNN-ORC poll found that 420 of the 1001 individuals surveyed would vote for Donald Trump if the election were held that day. Let’s use this data to get a confidence interval for the proportion of voters who would vote for Trump.

prop.test(420, 1001)
## 
##  1-sample proportions test with continuity correction
## 
## data:  420 out of 1001, null probability 0.5
## X-squared = 25.574, df = 1, p-value = 4.256e-07
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3888812 0.4509045
## sample estimates:
##         p 
## 0.4195804

prop.test() found a confidence interval with a margin of error of roughly 3% (the target margin of error of the survey).

This confidence interval was popular in textbooks for years, but it is flawed to the point that few now advocate its use. The primary reason is that the interval is not guaranteed to capture the true \(p\) with the probability we specify, behaving erratically for certain combinations of sample size and more-or-less extreme values of \(p\). It’s thus considered unreliable, and alternatives are preferred.

The Hmisc package contains a function for computing confidence intervals for the population proportion, binconf(), that allows a few other confidence intervals for the population proportion to be computed. The score-based (Wilson) method is advocated by the author of the textbook used in this course, and is supported by binconf(). The interface of the function differs from that of similar functions in the stats (base R) package (it was not written by the same authors); instead of specifying conf.level, one specifies alpha in binconf(), with \(\alpha = 1 - C\) (the default value for alpha is .05, corresponding to a 95% confidence interval). Additionally, binconf() will not directly compute one-sided confidence bounds like prop.test() will. But otherwise usage is similar, with binconf(x, n, method = "wilson") computing a 95% score confidence interval for a data set in which, out of n trials, there were x successes.

Let’s use binconf() to compute the score confidence interval for the proportion of voters supporting Donald Trump, obtaining both a 95% and 99% confidence interval.

library(Hmisc)
# 95% CI
binconf(420, 1001, method = "wilson")
##   PointEst     Lower     Upper
##  0.4195804 0.3893738 0.4504019
# 99% CI
binconf(420, 1001, alpha = 0.01, method = "wilson")
##   PointEst     Lower     Upper
##  0.4195804 0.3800618 0.4601581

The 95% confidence interval using the score method is not the same as the result from prop.test(), but not drastically different either.
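
In fact, the discrepancy is mostly due to prop.test()'s continuity correction; with correct = FALSE, prop.test() computes the same Wilson score interval. A quick check:

```r
# prop.test without continuity correction yields the Wilson score interval
ci <- prop.test(420, 1001, correct = FALSE)$conf.int
ci  # matches binconf(420, 1001, method = "wilson")
```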

binom.test() supports the exact confidence interval, which uses the CDF of the binomial distribution rather than a Normal approximation. The exact confidence interval can be computed via binom.test(x, n) (with all other common parameters such as conf.level and alternative being supported). binconf() also supports the exact confidence interval, where one merely sets method = "exact" to compute it.

binom.test(420, 1001)
## 
##  Exact binomial test
## 
## data:  420 and 1001
## number of successes = 420, number of trials = 1001, p-value =
## 4.029e-07
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3887849 0.4508502
## sample estimates:
## probability of success 
##              0.4195804
# 99% confidence lower bound for Trump support, using binom.test
binom.test(420, 1001, alternative = "greater", conf.level = 0.99)
## 
##  Exact binomial test
## 
## data:  420 and 1001
## number of successes = 420, number of trials = 1001, p-value = 1
## alternative hypothesis: true probability of success is greater than 0.5
## 99 percent confidence interval:
##  0.3831763 1.0000000
## sample estimates:
## probability of success 
##              0.4195804
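
The exact (Clopper-Pearson) interval inverts the binomial CDF, which can be written in closed form using beta quantiles. A sketch verifying this identity against the binom.test() output above:

```r
# Clopper-Pearson 95% interval via beta quantiles
x <- 420; n <- 1001; alpha <- 0.05
lower <- qbeta(alpha / 2, x, n - x + 1)
upper <- qbeta(1 - alpha / 2, x + 1, n - x)
c(lower, upper)  # same as binom.test(420, 1001)$conf.int
```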

Paired Sample Confidence Interval for Difference in Means

Suppose you have paired data, \(X_i\) and a corresponding \(Y_i\). Examples of paired data include weight before (\(X_i\)) and after (\(Y_i\)) a weight-loss program, or the political views of a husband (\(X_i\)) and wife (\(Y_i\)). There are two different means for \(X_i\) and \(Y_i\), denoted \(\mu_X\) and \(\mu_Y\), and we want to know how these means compare.

Since the data is paired, we create a confidence interval using the differences of each \(X_i\) and \(Y_i\), \(D_i = X_i - Y_i\), and the confidence interval will describe the location of the mean difference \(\mu_D\). Thus, for paired data, after computing the differences, the procedure for computing a confidence interval is the same as the univariate case.

One approach for computing the confidence interval for the mean difference of paired data is with t.test(x - y), where x and y are vectors containing the paired data. Another approach would be to use t.test(x, y, paired = TRUE).
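
The two calls are equivalent, since the paired test works on the differences internally. A quick check on simulated paired data (the numbers here are hypothetical, just to illustrate):

```r
# t.test(x - y) and t.test(x, y, paired = TRUE) give the same interval
set.seed(42)
x <- rnorm(20, mean = 100, sd = 10)   # e.g., weight before a program
y <- x - rnorm(20, mean = 2, sd = 3)  # weight after; paired with x
ci_diff   <- t.test(x - y)$conf.int
ci_paired <- t.test(x, y, paired = TRUE)$conf.int
all.equal(ci_diff, ci_paired)  # TRUE
```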

I illustrate by examining whether stricter enforcement of speed limit laws helps reduce the number of accidents. The Traffic data set (MASS) records the effect of Swedish speed limits on accidents in an experiment where specific days in 1961 and 1962 saw speed limits applied, with days matched across the two years. This pairing means a matched pairs analysis can be used to see whether the speed limit laws helped reduce accidents.

library(MASS)
library(reshape)
head(Traffic)
##   year day limit  y
## 1 1961   1    no  9
## 2 1961   2    no 11
## 3 1961   3    no  9
## 4 1961   4    no 20
## 5 1961   5    no 31
## 6 1961   6    no 26
# The Traffic data set needs to be formed into a format that is more easily
# worked with. The following code reshapes the data into a form that allows
# for easier comparison of paired days
new_Traffic <- with(Traffic, data.frame(year = year, day = day, limit = as.character(limit), 
    y = as.character(y)))
melt_Traffic <- melt(new_Traffic, id.vars = c("year", "day"), measure.vars = c("limit", 
    "y"))
form_Traffic <- cast(melt_Traffic, day ~ year + variable)
form_Traffic <- with(form_Traffic, data.frame(day = day, limit_61 = `1961_limit`, 
    accidents_61 = as.numeric(as.character(`1961_y`)), limit_62 = `1962_limit`, 
    accidents_62 = as.numeric(as.character(`1962_y`))))
head(form_Traffic)
##   day limit_61 accidents_61 limit_62 accidents_62
## 1   1       no            9       no            9
## 2   2       no           11       no           20
## 3   3       no            9       no           15
## 4   4       no           20       no           14
## 5   5       no           31       no           30
## 6   6       no           26       no           23
# Now subset the data so that only days where speed limits were enforced
# differently between the two years remain
diff_Traffic <- subset(form_Traffic, select = c("accidents_61", "accidents_62", 
    "limit_61", "limit_62"), subset = limit_61 != limit_62)
head(diff_Traffic)
##    accidents_61 accidents_62 limit_61 limit_62
## 11           29           17       no      yes
## 12           40           23       no      yes
## 13           28           16       no      yes
## 14           17           20       no      yes
## 15           15           13       no      yes
## 16           21           13       no      yes
# Get a vector for accidents when the speed limit was not enforced, and a
# vector for accidents for when it was
accident_no_limit <- with(diff_Traffic, c(accidents_61[limit_61 == "no"], accidents_62[limit_62 == 
    "no"]))
accident_limit <- with(diff_Traffic, c(accidents_61[limit_61 == "yes"], accidents_62[limit_62 == 
    "yes"]))
accident_limit
##  [1] 19 19  9 21 22 23 14 19 15 13 22 42 29 21 12 16 17 23 16 20 13 13  9
## [24] 10 27 12  7 11 15 29 17 17 15 25  9 16 25 25 16 22 21 17 26 41 25 12
## [47] 17 24 26 16 15 12 22 24 16 25 14 15  9
accident_no_limit
##  [1] 29 40 28 17 15 21 24 15 32 22 24 11 27 37 32 25 20 40 21 18 35 21 25
## [24] 34 42 27 34 47 36 15 26 21 39 39 21 15 17 20 24 30 25  8 21 10 14 18
## [47] 26 38 31 12  8 22 17 31 49 23 14 25 24
# It took a lot of work, but finally we have the data in a format we want.
# Now, the first thing I check is whether the differences are Normally
# distributed
qqnorm(accident_limit - accident_no_limit)
qqline(accident_limit - accident_no_limit)

# Aside from a couple observations, the Normal distribution seems to fit
# very well. t procedures are safe to use.
t.test(accident_limit, accident_no_limit, paired = TRUE)
## 
##  Paired t-test
## 
## data:  accident_limit and accident_no_limit
## t = -3.7744, df = 58, p-value = 0.0003794
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.856470 -3.024886
## sample estimates:
## mean of the differences 
##               -6.440678

The above analysis suggests that enforcing speed limits reduces the number of accidents, since zero is not in the interval.

Test for Difference in Population Means Between Two Samples

Analyzing paired data is simple since the analysis reduces to the univariate case. Two independent samples, though, are not as simple.

Suppose we have two independent data sets, with \(X_1, ..., X_n\) the first and \(Y_1, ..., Y_m\) the second, each drawn from two distinct populations. We typically wish to know whether the means of the two populations, \(\mu_X\) and \(\mu_Y\), differ. Thus, we wish to obtain a confidence interval for the difference in means of the populations, \(\mu_X - \mu_Y\).

The natural estimator for \(\mu_X - \mu_Y\) is \(\overline{X} - \overline{Y}\), so the point estimate of the difference is \(\bar{x} - \bar{y}\). The margin of error, though, depends on whether we believe the populations are homoskedastic or heteroskedastic. Homoskedastic populations have equal variances, so \(\sigma_X^2 = \sigma_Y^2\). Heteroskedastic populations, though, have different variances, so \(\sigma_X^2 \neq \sigma_Y^2\). Between the two, homoskedasticity is a much stronger assumption, so unless you have a reason for believing otherwise, you should assume heteroskedasticity. (It is the correct assumption when the populations are, in fact, heteroskedastic, and if they are actually homoskedastic, the resulting confidence interval will still be reasonably accurate.)

For homoskedastic populations, the confidence interval is:

\[\bar{x} - \bar{y} \pm t_{\frac{\alpha}{2},n + m - 2}s\sqrt{\frac{1}{n} + \frac{1}{m}} \equiv \left(\bar{x} - \bar{y} - t_{\frac{\alpha}{2},n + m - 2}s\sqrt{\frac{1}{n} + \frac{1}{m}} ; \bar{x} - \bar{y} + t_{\frac{\alpha}{2},n + m - 2}s\sqrt{\frac{1}{n} + \frac{1}{m}}\right)\]

where \(s\) is the pooled sample standard deviation, \(s = \sqrt{\frac{(n - 1)s_X^2 + (m - 1)s_Y^2}{n + m - 2}}\), which combines the two samples’ variances into a single estimate of the common standard deviation.
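
A sketch of this interval computed by hand on simulated data (the numbers are hypothetical), checked against t.test() with var.equal = TRUE:

```r
# Pooled two-sample t confidence interval, computed manually
set.seed(1)
x <- rnorm(25, mean = 5, sd = 2)
y <- rnorm(30, mean = 4, sd = 2)
n <- length(x); m <- length(y)
# Pooled standard deviation: weighted average of the sample variances
sp <- sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2))
t_crit <- qt(0.975, df = n + m - 2)
ci <- (mean(x) - mean(y)) + c(-1, 1) * t_crit * sp * sqrt(1/n + 1/m)
ci
```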

Confidence intervals for heteroskedastic data are more complicated, since the \(t\) distribution is no longer the correct distribution to use to describe the behavior of the random variable from which the confidence interval is derived. That said, the \(t\) distribution can approximate the true distribution when the degrees of freedom used is:

\[\nu = \frac{\left(\frac{s_X^2}{n} + \frac{s_Y^2}{m}\right)^2}{\frac{\left(s_X^2/n\right)^2}{n - 1} + \frac{\left(s_Y^2/m\right)^2}{m - 1}}\]

with \(s_X^2\) and \(s_Y^2\) being the variances of the data sets \(x_1,...,x_n\) and \(y_1,...,y_m\), respectively. Then the confidence interval is approximately:

\[\bar{x} - \bar{y} \pm t_{\frac{\alpha}{2}, \nu}\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}\]

t.test() can compute confidence intervals in both these cases. The function call is similar to the case of paired data, though without setting paired = TRUE. Instead, set var.equal = TRUE for homoskedastic populations, and var.equal = FALSE (the default) for heteroskedastic populations.
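
The Welch degrees of freedom formula above can also be checked against what t.test() reports. A sketch using the built-in ToothGrowth data set (the same data analyzed below):

```r
# Welch-Satterthwaite degrees of freedom, computed manually
x <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
y <- ToothGrowth$len[ToothGrowth$supp == "VC"]
n <- length(x); m <- length(y)
vx <- var(x) / n; vy <- var(y) / m
nu <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))
nu  # t.test(x, y) reports this as df (about 55.31)
```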

Let’s illustrate this by determining whether orange juice or a vitamin C supplement leads to greater tooth growth in guinea pigs (using the ToothGrowth data set).

# First, let's get the data into separate vectors
split_len <- split(ToothGrowth$len, ToothGrowth$supp)
OJ <- split_len$OJ
VC <- split_len$VC
# Gets CI for difference in means
t.test(OJ, VC)
## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
# Is there homoskedasticity? Let's check a boxplot
boxplot(len ~ supp, data = ToothGrowth)

# These populations look homoskedastic; spread does not differ drastically
# between the two. A CI that assumes homoskedasticity is thus computed.
t.test(OJ, VC, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1670064  7.5670064
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
# Not much changed (only slightly narrower).

Difference in Population Proportions

If we have two populations, each with a respective population proportion representing the proportion of that population possessing some trait, we can construct a confidence interval for the difference in the proportions, \(p_X - p_Y\). The form of the confidence interval depends on which type is being used; I will not discuss the details of how these confidence intervals are computed, but simply note that some of the functions we have seen for single-sample confidence intervals, such as prop.test(), support comparing proportions (binconf() and binom.test() do not). The only difference is that rather than passing a single count and a single sample size to these functions, you pass a vector containing the counts of successes in the two populations, and a vector containing the two sample sizes (naturally, in the same order).

In the next example, I compare the proportions of men and women with melanoma who die from the disease, using the Melanoma data set (MASS).

mel_split <- split(Melanoma$status, Melanoma$sex)
# Get logical vectors for whether patient died from melanoma. Group 0 is
# women and group 1 is men, and the code 1 in the data indicates death from
# melanoma
fem_death <- mel_split[["0"]] == 1
man_death <- mel_split[["1"]] == 1
# Vector containing the number of deaths for both men and women
deaths <- c(Female = sum(fem_death), Male = sum(man_death))
# Vector containing sample sizes
size <- c(Female = length(fem_death), Male = length(man_death))
deaths/size
##    Female      Male 
## 0.2222222 0.3670886
# prop.test, with the 'classic' CLT-based CI
prop.test(deaths, size)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  deaths out of size
## X-squared = 4.3803, df = 1, p-value = 0.03636
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.283876633 -0.005856138
## sample estimates:
##    prop 1    prop 2 
## 0.2222222 0.3670886