Lecture 6

Inferential Statistics via Computational Methods

The objective of inferential statistics is to describe population parameters using samples. In the lecture, we will study extensively inferential statistics constructed from probabilistic models, and in the lab we will see how to use R to perform analyses based on these methods. Prior to that, though, we explore techniques based on computational methods such as simulation and bootstrapping. Not only do computational methods provide a good first step for thinking about later methods, they are also very useful when one does not have a probability model available for inference, perhaps because such a model would be very difficult to construct.

Simulation

Let \(X_1, \ldots, X_n\) be independent Normal random variables, each with mean 100 and standard deviation 16, so \(X_i \sim N(100, 16^2)\). We know that the sample mean \(\overline{X} = \frac{1}{n}\sum_{i = 1}^{n} X_i\) of random variables is itself a random variable. What distribution does \(\overline{X}\) follow? While we can compute this distribution exactly, here we show how it could be simulated.
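(For reference, the exact answer is \(\overline{X} \sim N\left(100, \frac{16^2}{n}\right)\); the simulation results below should be consistent with this.)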

We could simulate a single sample mean for a sample of size 10 with the following code:

set.seed(6222016)
rand_dat <- rnorm(10, mean = 100, sd = 16)
mean(rand_dat)
## [1] 101.0052

Obviously 101.00519 is not 100, which we know to be the true mean of the data, nor should we ever expect the sample mean to hit the true mean exactly. But is it “close” to the true mean? It’s hard to say; we would need to know how much variability is possible.

We would like to generate a number of sample means for data coming from the distribution in question. An initial approach may involve a for loop, which is not the best approach in R. Another approach may involve sapply(), like so:

sim_means <- sapply(1:100, function(throwaway) mean(rnorm(10, mean = 100, sd = 16)))

While better than a for loop, this solution uses sapply() in a way it was not designed for. We have to write a function with a parameter it never uses, and we also have to pass a vector whose length corresponds to the number of simulated values we want, when in truth it is not the vector we want but its length. A better solution is the replicate() function. The call replicate(n, expr) will repeat the expression expr n times and return a vector containing the results. I show an example for simulated means below:

sim_means <- replicate(100, mean(rnorm(10, mean = 100, sd = 16)))
sim_means
##   [1] 108.35565 100.66750 108.19234  95.89265 101.20817 105.76444 104.89860
##   [8] 104.22216 102.06229  94.31038  99.55700  94.44461 102.06794  89.95609
##  [15] 101.17615 100.63431  95.18799 104.70427 107.69503 104.04101 107.86346
##  [22] 103.06723  97.43702 102.05654  96.08053 101.36263 103.91291  96.41028
##  [29]  96.81695 102.50987 104.38500  99.17687  97.74994  92.30596  99.26339
##  [36] 101.67710 102.39232 105.79338 106.72293  99.22185  95.99618  98.91788
##  [43] 101.01230  96.25258  91.20321 104.81500  96.05610 106.60535  93.97403
##  [50] 103.89452 107.67794 105.31947  99.81982  99.84518  98.65894  93.04366
##  [57]  99.36829  97.92127 105.95773  93.25400  97.91482 100.05437  93.78229
##  [64] 105.21112  97.93044 100.19164  95.70649  97.18679  93.44727 101.54060
##  [71]  98.67888 103.98927  95.20378 101.56096 108.71335 105.40760 105.08654
##  [78] 102.58549  95.86958  97.53956  92.42496 103.78224 102.66197  93.44900
##  [85]  98.92777  97.65882  96.09172 108.16021 100.97451  97.53069  98.50505
##  [92] 102.83923  91.39683  98.91011 102.07212  99.36859 100.66825 102.34920
##  [99]  99.55977 105.28871

Looking at the simulated means, we see a lot of variability, with 53% of the means being more than 3 away from the true mean. The only way to reduce this variability would be to increase the sample size.
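The 53% figure can be checked directly from the simulated values:

# Proportion of simulated means more than 3 away from the true mean of 100
mean(abs(sim_means - 100) > 3)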

sim_means2 <- replicate(100, mean(rnorm(20, mean = 100, sd = 16)))
sim_means2
##   [1]  99.69189  95.21360 105.52372 104.45792  93.52940  91.62906 105.26088
##   [8] 100.36841 105.73402 102.46795  99.09634  98.09586 103.41291 102.24939
##  [15]  95.19029  97.46231 101.84167 106.39167  99.76708  96.41046  94.57369
##  [22] 100.30044 107.70845 105.03082 103.03096  96.89259 101.63428 106.77924
##  [29]  93.78310  99.20373  99.21529  96.04547 100.02290 100.23826  98.78401
##  [36]  96.60201  97.90106  95.12226  98.53885 101.81787 107.67086 102.87544
##  [43] 102.54591 100.30953 104.98270  99.21484  96.38804 100.71951 100.54088
##  [50]  96.36608 102.28295 103.48824 104.53096 100.48312  97.77604 100.40682
##  [57]  99.59780  95.34798  97.79904  94.79671 103.44435 103.94647 101.33339
##  [64]  98.72290  98.94278 102.88865  97.95571  98.96574 101.67445  98.99390
##  [71] 101.10229 100.38276 100.99239 102.65524 106.64794  95.42338 103.30498
##  [78] 104.93425 108.15721  99.01630  96.63749  96.91546 101.56492 104.53852
##  [85]  99.35392  97.48784  98.20285  92.61517 106.96066 100.54894 100.45520
##  [92]  99.29380  99.20205 104.19171 103.59921  97.42151  95.17621 106.17301
##  [99]  98.95303 106.53663

Now, only 46% of the simulated means are more than 3 away from the true mean. If we repeat this for ever-increasing sample sizes, we can see the distribution of the sample means concentrating around the true mean.

sim_means3 <- replicate(100, mean(rnorm(50, mean = 100, sd = 16)))
sim_means4 <- replicate(100, mean(rnorm(100, mean = 100, sd = 16)))
boxplot(list("n10" = sim_means, "n20" = sim_means2, "n50" = sim_means3, "n100" = sim_means4))
abline(h = 100, col = "blue")

As can be seen in the chart, larger sample sizes have smaller variability around the true mean. This is what we want; we want estimators for the mean to be both accurate (they center around the correct result, not elsewhere) and precise (they are consistently near the correct answer). The term for the first property is unbiasedness, where \(\mathbb{E}\left[\overline{X}\right] = \mu\), and the term for the second property is minimum variance.
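We can also quantify the gain in precision directly: the standard deviation of the simulated means shrinks as the sample size grows, roughly tracking the theoretical value \(16/\sqrt{n}\):

# Empirical standard deviations of the simulated sample means
sapply(list("n10" = sim_means, "n20" = sim_means2,
            "n50" = sim_means3, "n100" = sim_means4), sd)
# Theoretical standard deviations, 16 / sqrt(n)
16 / sqrt(c(10, 20, 50, 100))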

Recall that for the Normal distribution, the mean is also the median. This suggests the sample median as an alternative to the sample mean for estimating the same parameter. How do the two compare? Let’s simulate them and find out!

library(ggplot2)

# Simulate sample medians
sim_medians <- replicate(100, median(rnorm(10, mean = 100, sd = 16)))
sim_medians2 <- replicate(100, median(rnorm(20, mean = 100, sd = 16)))
sim_medians3 <- replicate(100, median(rnorm(50, mean = 100, sd = 16)))
sim_medians4 <- replicate(100, median(rnorm(100, mean = 100, sd = 16)))

# Make a data frame to contain the data for the sake of easier plotting
dat <- data.frame(stat = c(sim_means, sim_medians, sim_means2, sim_medians2, sim_means3, sim_medians3, sim_means4, sim_medians4),
                  type = rep(c("mean", "median"), times = 4, each = 100),
                  n = as.factor(rep(c(10, 20, 50, 100), each = 2 * 100)))
head(dat)
##        stat type  n
## 1 108.35565 mean 10
## 2 100.66750 mean 10
## 3 108.19234 mean 10
## 4  95.89265 mean 10
## 5 101.20817 mean 10
## 6 105.76444 mean 10
# Using ggplot2 to make the graphic comparing distributions
ggplot(dat, aes(y = stat, x = type, color = type, fill = type)) +
  geom_hline(yintercept = 100, color = "blue", alpha = .5) + # Horizontal line for true center
  geom_violin(width = .5, alpha = .3) + # Violin plot
  stat_summary(fun.data = "median_hilow") + # A statistical summary, showing the median and (by default) the 2.5th/97.5th percentiles
  facet_grid(. ~ n) + # Split based on sample size
  theme_bw() +  # Sometimes I don't like the grey theme
  ggtitle("Comparison of distribution of simulated sample means and medians")

From the above graphic, one can see that the sample median has more variance than the sample mean, and is thus not minimum variance. The sample mean is a more reliable way to estimate the unknown \(\mu\) than the sample median.

These analyses suggest that when trying to estimate the value of a parameter, we should follow these principles:

  1. We should use unbiased estimators. If this is not possible, we should use estimators that are at least consistent (that is, the bias approaches 0 as the sample size grows large).
  2. We should use estimators that vary as little as possible. This in turn implies that we should use as large a sample size as possible when estimating unknown parameters.

Clearly, \(\overline{X}\) is a random variable. With that said, what is its distribution? We could explore the possibility that \(\overline{X}\) is Normally distributed by looking at a Q-Q plot, which compares sample quantiles to the theoretical quantiles the data would have if they followed some particular distribution. To check whether \(\overline{X}\) is Normally distributed, we can use the qqnorm() function: the call qqnorm(x) creates a Q-Q plot comparing the distribution of x to the Normal distribution. I next create such a plot.

qqnorm(sim_means4)
# A line, for comparison
qqline(sim_means4)

If the points in the plot fit closely to a straight line, then the distributions are similar. In this case, it seems that the Normal distribution fits the data well (as it should).
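For contrast, here is a sketch of what a poor fit looks like, using data simulated from a skewed (Exponential) distribution; the points bend away from the reference line:

# Q-Q plot for clearly non-Normal (skewed) data, for comparison
skewed_dat <- rexp(100, rate = 1)
qqnorm(skewed_dat)
qqline(skewed_dat)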

Simulation is frequently used for estimating probabilities that are otherwise difficult to compute by hand. The idea is to simulate the phenomenon in question and count how often the event of interest happened in the simulated data. If done correctly, the sample proportion should approach the true probability as more simulations are done. Here, I demonstrate by estimating, for each of the sample sizes investigated, the probability of being “close” to the true mean, both for the sample mean and the sample median (these probabilities are not necessarily difficult to compute by hand, and you can investigate for yourself, using the Central Limit Theorem, whether the estimated probabilities are close to the true probabilities).

library(reshape)
dat_list <- list(
  # Probabilities of being "close" (within 1 of the true mean) for sample means
  "mean" = list("10"  = mean(abs(sim_means  - 100) < 1),
                "20"  = mean(abs(sim_means2 - 100) < 1),
                "50"  = mean(abs(sim_means3 - 100) < 1),
                "100" = mean(abs(sim_means4 - 100) < 1)),
  # Probabilities of being "close" for sample medians
  "median" = list("10"  = mean(abs(sim_medians  - 100) < 1),
                  "20"  = mean(abs(sim_medians2 - 100) < 1),
                  "50"  = mean(abs(sim_medians3 - 100) < 1),
                  "100" = mean(abs(sim_medians4 - 100) < 1)))
# Convert this list into a workable data frame. Here I use the melt function in reshape, which will create a data frame where one column is the values stored in the list, and the others represent the values of the tiers.
nice_df <- melt(dat_list)
# Data cleanup
names(nice_df) <- c("Probability", "Size", "Statistic")
nice_df$Size <- factor(nice_df$Size, levels = c("10", "20", "50", "100"))
# The actual probabilities
nice_df
##   Probability Size Statistic
## 1        0.15   10      mean
## 2        0.24   20      mean
## 3        0.40   50      mean
## 4        0.49  100      mean
## 5        0.11   10    median
## 6        0.22   20    median
## 7        0.21   50    median
## 8        0.37  100    median
# A plot of the probabilities
ggplot(nice_df, aes(y = Probability, x = Statistic)) +
  # Instead of plotting points, I print each probability as a text label.
  geom_text(aes(label = paste0(Probability * 100, "%"))) +
  facet_grid(. ~ Size) +
  theme_bw()

The above information suggests that the sample mean tends to be closer than the sample median to the true mean. This can be proven mathematically, but the simulation approach is much easier, although not nearly as precise.
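As suggested in the parenthetical remark earlier, the probabilities for the sample mean can be computed exactly (since \(\overline{X} \sim N(100, 16^2/n)\) for Normal data) and compared with the simulated estimates:

# Exact probability that the sample mean lies within 1 of the true mean,
# for each of the sample sizes used above
ns <- c(10, 20, 50, 100)
pnorm(101, mean = 100, sd = 16 / sqrt(ns)) - pnorm(99, mean = 100, sd = 16 / sqrt(ns))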

We could also use sample quantiles to estimate true quantiles for a statistic, and thus get some sense as to where the true parameter may lie. I demonstrate in the code block below.

quant_list <- list(
  "mean" = list("10"  = as.list(quantile(sim_means,  c(.05, .95))),
                "20"  = as.list(quantile(sim_means2, c(.05, .95))),
                "50"  = as.list(quantile(sim_means3, c(.05, .95))),
                "100" = as.list(quantile(sim_means4, c(.05, .95)))),
  "median" = list("10"  = as.list(quantile(sim_medians,  c(.05, .95))),
                  "20"  = as.list(quantile(sim_medians2, c(.05, .95))),
                  "50"  = as.list(quantile(sim_medians3, c(.05, .95))),
                  "100" = as.list(quantile(sim_medians4, c(.05, .95)))))
quant_df <- cast(melt(quant_list), L1 + L2 ~ L3)
names(quant_df) <- c("Statistic", "Size", "Lower", "Upper")
quant_df$Size <- factor(quant_df$Size, levels = c("10", "20", "50", "100"))

quant_df
##   Statistic Size    Lower    Upper
## 1      mean   10 93.01272 107.7034
## 2      mean  100 97.37297 102.6705
## 3      mean   20 94.78556 106.6545
## 4      mean   50 97.43711 103.6452
## 5    median   10 89.42588 108.9288
## 6    median  100 96.60708 102.9936
## 7    median   20 94.47622 107.2607
## 8    median   50 95.24452 104.3953
ggplot(quant_df) +
  geom_segment(aes(x = Lower, xend = Upper, y = Statistic, yend = Statistic), size = 1.2) +
  facet_grid(Size ~ .) +
  xlab(NULL)

We can see that in all cases the true mean lies between the \(5^{\text{th}}\) and \(95^{\text{th}}\) quantiles, and that the sample mean has a narrower range than the sample median, thus giving a more precise description as to where the true mean lies.
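The widths of these intervals make this comparison explicit:

# Interval widths; the sample mean's intervals are narrower at every sample size
quant_df$Width <- quant_df$Upper - quant_df$Lower
quant_df[order(quant_df$Size), c("Statistic", "Size", "Width")]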

Significance Testing

Statistical inference’s raison d’être is separating signal from noise. More precisely, statistical inference is used to determine whether there is enough evidence to detect an effect, or to locate a parameter, in the presence of “noise” (random effects).

Suppose a researcher is trying to determine if a new drug under study lowers blood pressure. The researcher randomly assigns study participants to control and treatment groups. He stores his results in R vectors like so:

control <- c(124, 155, 120, 116)
treatment <- c(120, 108, 132, 112)

Is there a difference between control and treatment? Let’s check.

mean(treatment) - mean(control)
## [1] -10.75

A difference of -10.75 looks promising, but there is lots of variation in the data. Is a difference of -10.75 large enough to conclude there is a difference?

Suppose it is not. Assume instead that the treatment had no effect, and the observed difference is due only to the random assignment of participants to the control and treatment groups. If that were the case, many other random assignments would produce differences just as extreme as, or more extreme than, the one observed.

Let’s investigate this possibility. We will use the combn() function to find all possible combinations of assigning individuals to the treatment group, with the rest going to the control group. combn(vec, n) will find all ways to choose n elements from vec, storing the result in a matrix with each column representing one possible combination. We will assume that we will be pooling the control and treatment groups into a single vector with 8 elements.

# Find all ways to choose the four indices that will represent a new random
# assignment to the treatment group. This is all we need to know to
# determine who is assigned to both control and treatment.
assn <- combn(1:(length(control) + length(treatment)), 4)
# Look at the result
assn[, 1:4]
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2
## [3,]    3    3    3    3
## [4,]    4    5    6    7
# How many combinations are possible?
ncol(assn)
## [1] 70

Now, let’s actually investigate the difference and determine if it is “large”.

# Lump all data into a single vector
blood_pressure <- c(control, treatment)
blood_pressure
## [1] 124 155 120 116 120 108 132 112
# To demonstrate the idea of this analysis, let's consider just one new
# assignment of control and treatment, using the 4th column of assn. The new
# assignment is in ind.
ind <- assn[, 4]
# If blood_pressure[ind] is the treatment group, blood_pressure[-ind] is the
# control group.
blood_pressure[ind]
## [1] 124 155 120 132
blood_pressure[-ind]
## [1] 116 120 108 112
# What is the new difference?
mean(blood_pressure[ind]) - mean(blood_pressure[-ind])
## [1] 18.75
# Now, let's do this for all possible combinations, using apply
diffs <- apply(assn, 2, function(ind) {
    mean(blood_pressure[ind]) - mean(blood_pressure[-ind])
})

Now we can decide how “unusual” our initial difference between treatment and control of -10.75 is. Since we are trying to determine if the treatment reduces blood pressure, we decide that if this particular assignment is unusually negative, there would be evidence that the treatment works. So we count how many assignments have differences more negative than the one observed.

sum(diffs < mean(treatment) - mean(control))
## [1] 11
mean(diffs < mean(treatment) - mean(control))
## [1] 0.1571429

It seems that about 16% of assignments to control and treatment result in differences more “extreme” than the one observed. This is not convincing evidence that the difference we saw reflects any significant effect from the new drug; there is too much noise in the data to reach such a conclusion.
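A histogram of the 70 permutation differences makes this concrete, with the observed difference marked by a vertical line:

# Permutation distribution of the difference in means, with the observed
# difference marked
hist(diffs, xlab = "Difference in means", main = "Permutation distribution")
abline(v = mean(treatment) - mean(control), col = "blue", lwd = 2)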

Let’s consider another problem. In the ToothGrowth data set, we can see the results of an experiment where different supplements were used to examine the tooth growth of guinea pigs. Is there a difference?

Let’s first see what the difference is.

# Find the means
supp_means <- aggregate(len ~ supp, data = ToothGrowth, mean)
supp_means
##   supp      len
## 1   OJ 20.66333
## 2   VC 16.96333
# The difference in means (VC - OJ)
diff(supp_means$len)
## [1] -3.7

There is a difference; it appears that the supplement VC (vitamin C as ascorbic acid) is associated with less tooth growth than the supplement OJ (orange juice). But is this difference significant?

In this case, our earlier trick will not work. There are 60 observations in ToothGrowth, 30 of which are in the OJ group. This means that there are \({60 \choose 30} \approx 1.18 \times 10^{17}\) possible random assignments to the two groups. R will surely choke on that much computation, so we must use a new approach.

Rather than examine all possible permutations, let’s randomly select permutations and see how often in our random sample we get results more extreme than what was observed. The principle is the same as what was done before, but it’s probabilistic rather than exhaustive.

# Let's randomly assign to VC supplement just once for proof of concept;
# each sample has 30 observations, so we will randomly select 30 to be the
# new VC group
ind <- sample(1:60, size = 30)
with(ToothGrowth, mean(len[ind]) - mean(len[-ind]))
## [1] 0.7066667
# We will now do this many times
sim_diffs <- replicate(2000, {
    ind <- sample(1:60, size = 30)
    with(ToothGrowth, mean(len[ind]) - mean(len[-ind]))
})
# Proportion of random assignments with a more extreme (more negative) difference
mean(sim_diffs < diff(supp_means$len))
## [1] 0.0325

Only 3.25% of the random assignments produce a difference as extreme as the one observed, which seems to be convincing evidence that the VC supplement yields, on average, less tooth growth than OJ.
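For comparison with the model-based methods covered later in the lecture, a one-sided two-sample \(t\)-test of the same question gives a p-value in the same neighborhood (shown here only as a preview):

# One-sided Welch t-test: is mean tooth growth greater for OJ than for VC?
t.test(len ~ supp, data = ToothGrowth, alternative = "greater")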

Bootstrapping

When looking at polls in the news, you may notice a margin of error attached to the numbers in the poll. The margin of error quantifies the uncertainty we attach to a statistic estimated from data, and the confidence interval, found by adding and subtracting the margin of error from the statistic, represents the range of values in which the true value of the parameter being estimated could plausibly lie. These are computed for many statistics we estimate.

We cover in the lecture how confidence intervals are computed using probabilistic methods. Here, we will use a computational technique called bootstrapping for computing these intervals. Even though the statistics we discuss in this course could have confidence intervals computed exactly, this is not always the case. We may not always have a formula for the margin of error in a closed form, either due to it simply not being derived yet or because it’s intractable. Additionally, bootstrapping may be preferred even if a formula for a margin of error exists, because bootstrapping may be a more robust means of computing this margin of error when compared to a procedure very sensitive to the assumptions under which it was derived.

Earlier in this lecture, we examined techniques for obtaining, computationally, a confidence interval for the location of the true mean of a Normal distribution. Unfortunately, doing so required knowing the true distribution of the data (including the true mean), which is never the case in practice (otherwise there would be no reason for the investigation). Bootstrapping does what we did earlier, but instead of simulating new data sets from the Normal distribution, it simulates them by resampling from the data set under investigation (that is, by drawing from the empirical distribution). We re-estimate the statistic on each simulated data set to obtain a distribution of the statistic in question, and use this distribution to estimate the margin of error we should attach to the statistic.

Suppose x is a vector containing the data set under investigation. We can sample from the empirical distribution of x via sample(x, size = length(x), replace = TRUE) (I specify size = length(x) to draw simulated data sets the same size as x, which is what should be done when bootstrapping to obtain confidence intervals, though in principle one can draw simulated data sets of any size from the empirical distribution). We can then compute the statistic of interest from each simulated data set, and use quantile() to obtain a confidence interval.

Let’s demonstrate by estimating the mean of the copper determinations in wholemeal flour (in parts per million) contained in the chem data set from the MASS package.

library(MASS)
chem
##  [1]  2.90  3.10  3.40  3.40  3.70  3.70  2.80  2.50  2.40  2.40  2.70
## [12]  2.20  5.28  3.37  3.03  3.03 28.95  3.77  3.40  2.20  3.50  3.60
## [23]  3.70  3.70
# First, the sample mean of chem
mean(chem)
## [1] 4.280417
# To demonstrate how we sample from the empirical distribution of chem, I
# simulate once, and also compute the mean of the simulated data set
chem_sim <- sample(chem, size = length(chem), replace = TRUE)
chem_sim
##  [1]  3.77  2.40  3.70  2.20  2.20  3.03  2.20  2.40  3.10  2.20  2.40
## [12]  3.40 28.95  3.50  3.40  3.77  2.50  3.77  2.40  3.70  3.70 28.95
## [23] 28.95  2.90
mean(chem_sim)
## [1] 6.22875
# Now let's obtain a standard error by simulating 2000 means from the
# empirical distribution
mean_boot <- replicate(2000, {
    mean(sample(chem, size = length(chem), replace = TRUE))
})
# The 95% confidence interval
quantile(mean_boot, c(0.025, 0.975))
##   2.5%  97.5% 
## 3.0025 6.5750
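
The same recipe applies to statistics without a convenient closed-form standard error. As a sketch, here is a bootstrap confidence interval for the median of chem (exact endpoints will vary from run to run):

# Bootstrap 2000 medians from the empirical distribution of chem
median_boot <- replicate(2000, {
    median(sample(chem, size = length(chem), replace = TRUE))
})
# The 95% confidence interval for the median
quantile(median_boot, c(0.025, 0.975))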