One of the most important theorems in all of statistics is the central limit theorem, which describes the distribution of the sample mean as the sample size grows large. Let \(X_1, X_2, \ldots, X_n\) (\(n\) representing the sample size) be independent and identically distributed, or iid, random variables, each with mean \(\mu\) and standard deviation \(\sigma\). "iid" means that every data point is drawn from the exact same distribution and does not depend on the value of any other \(X_i\) in the data set. Let \(\overline{X} = \frac{1}{n}\sum_{i = 1}^{n} X_i\) be the sample mean of this data set. Then:
\[\overline{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \text{ as } n \to \infty\]
This says that if the above conditions hold, the distribution of the sample mean \(\overline{X}\) increasingly resembles a Normal distribution as the sample size grows, regardless of the original distribution of \(X_i\). This fact is the basis of much of statistical inference.
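Equivalently, the theorem is often stated in a standardized form: the centered and scaled sample mean converges in distribution to the standard Normal distribution.
\[\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \text{ as } n \to \infty\]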
We can show the central limit theorem at work in R. Let \(X_i \sim \operatorname{EXP}(5)\), the exponential distribution with mean 5 (that is, rate \(1/5\)). For this distribution, \(\mu = 5\) and \(\sigma = \mu = 5\). If the sample size is \(n\), the central limit theorem says that the distribution of \(\overline{X}\) should begin to resemble the Normal distribution with mean \(\mu = 5\) and standard deviation:
\[\frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{n}}\]
as \(n\) grows large.
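For example, at the largest sample size used below, \(n = 50\), this standard deviation is:
\[\frac{5}{\sqrt{50}} \approx 0.707\]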
Let’s consider samples of size 5, 10, 20, and 50. I first generate 200 simulated means from samples of each of these sizes.
# We will loop over the vector c(5, 10, 20, 50), each of these
# representing a sample size. lapply() applies a function (the second
# argument) to each element of this vector, and stores the result as a list,
# which we will call sim_means. The (anonymous) function will take only one
# argument, n, representing the sample size. It will return a vector of
# simulated means when the sample size is n.
sim_means <- lapply(c(5, 10, 20, 50), function(n) {
  # I now simulate 200 data sets of size n from the exponential distribution,
  # setting rate = 1/5 for exponentially distributed random variables with
  # mean 5, and store the resulting data sets in a list called sim_data_sets.
  # (I add a throwaway argument because something will be passed to the
  # function, but I don't want to use it)
  sim_data_sets <- lapply(1:200, function(throwaway) rexp(n, rate = 1/5))
  # Return a vector containing means for each of these data sets.
  sapply(sim_data_sets, mean)
})
# Let's look at what we just made
str(sim_means)
## List of 4
## $ : num [1:200] 3.51 4.13 2.59 2.91 4.67 ...
## $ : num [1:200] 5.38 3.29 5.43 2.68 4.21 ...
## $ : num [1:200] 4.01 2.83 6.64 6.62 7.09 ...
## $ : num [1:200] 7.06 5.27 5.61 4.31 4.78 ...
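As a quick sanity check (this step is my addition, not part of the original analysis), we can compare the empirical standard deviation of each vector of simulated means against the theoretical value \(5/\sqrt{n}\) predicted by the central limit theorem; the two vectors should be close.
# Empirical standard deviations of the simulated means for n = 5, 10, 20, 50
sapply(sim_means, sd)
# Theoretical standard deviations 5/sqrt(n) from the central limit theorem
5/sqrt(c(5, 10, 20, 50))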
# First, I plot an estimated density function for simulated exponential
# random variables with mean 5 (rate 1/5)
plot(density(rexp(200, rate = 1/5)),
     xlab = "quantile", ylab = "density", main = "Estimated density curves",
     col = "gray", type = "l", xlim = c(-5, 15),
     ylim = c(0, dnorm(5, mean = 5, sd = 5/sqrt(50)) + 0.1))
# I add density curves for the simulated means at each sample size:
# n = 5 (green), n = 10 (blue), n = 20 (purple), n = 50 (red)
lines(density(sim_means[[1]]), col = "green")
lines(density(sim_means[[2]]), col = "blue")
lines(density(sim_means[[3]]), col = "purple")
lines(density(sim_means[[4]]), col = "red")
# Finally, add a Normal curve showing the distribution that the central
# limit theorem predicts for the sample mean at the largest size, n = 50
clt_norm_q <- seq(0, 10, length = 1000)
lines(clt_norm_q, dnorm(clt_norm_q, mean = 5, sd = 5/sqrt(50)), col = "black")
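The plot has no legend, which makes the curves hard to tell apart; we could add one like the following (the placement and labels here are my choice, not part of the original plot).
# Label each curve by the sample size (or distribution) it represents
legend("topright",
       legend = c("EXP(5) data", "n = 5", "n = 10", "n = 20", "n = 50",
                  "CLT Normal (n = 50)"),
       col = c("gray", "green", "blue", "purple", "red", "black"),
       lty = 1)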