R Functions for Statistical Testing

R has functions for performing many common statistical tests, many of which come with any R installation in the stats package. All of the testing functions I consider in this lecture already appeared when we discussed confidence intervals. Depending on the test, there may be additional parameters to specify, such as the value of the population mean under the null hypothesis, but otherwise little has changed. Many of these functions share a parameter, alternative, that determines the alternative hypothesis. For a two-sided test (that is, when the alternative hypothesis is that \(\theta \neq \theta_0\)), set alternative = "two.sided" (this is the default, so setting it explicitly may not be necessary). If the alternative hypothesis says \(\theta > \theta_0\), set alternative = "greater", and if the alternative hypothesis says \(\theta < \theta_0\), set alternative = "less".
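As a quick sketch of how alternative affects the result (the simulated data and seed below are purely illustrative):

```r
# Illustrative only: simulated data to show how 'alternative' changes the
# p-value reported by t.test()
set.seed(101)  # arbitrary seed, for reproducibility
x <- rnorm(25, mean = 0.3)

p_two     <- t.test(x, mu = 0, alternative = "two.sided")$p.value
p_greater <- t.test(x, mu = 0, alternative = "greater")$p.value
p_less    <- t.test(x, mu = 0, alternative = "less")$p.value

# For a continuous test statistic, the two one-sided p-values sum to 1,
# and the two-sided p-value is twice the smaller one-sided p-value
c(p_two, p_greater, p_less)
```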

The statistical testing functions in stats report the results of a statistical test along with many other important quantities, such as the estimate of the parameter under investigation, the sample size, the degrees of freedom, the value of the test statistic, the p-value, and even the corresponding confidence interval. They do not state whether to reject or fail to reject the null hypothesis; you are expected to decide from the reported p-value, based on the significance level you have chosen.

stats includes some functions for Type II error analysis (often in the form of power analysis, which amounts to the same thing), but this analysis is difficult in general. Functions for study planning may be found in other packages.

Test for Location of Population Mean

Suppose you wish to test the hypotheses:

\[H_0: \mu = \mu_0\] \[H_A: \left\{\begin{array}{l} \mu < \mu_0 \\ \mu \neq \mu_0 \\ \mu > \mu_0 \end{array}\right.\]

Furthermore, you assume (perhaps after using qqnorm() or some other procedure to check) that your data is Normally distributed, and thus it is safe to use the \(t\)-test (or your sample size is large enough for the test to be safe anyway). The function t.test() allows you to test these hypotheses. The call t.test(x, mu = mu0) will test whether the data stored in the vector x comes from a population with mean mu0 or whether the true mean differs from mu0 (remember: change the form of the alternative hypothesis with the parameter alternative), using the test statistic:

\[t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}\]

I test the hypotheses regarding tree girth in the trees data set mentioned earlier in this lecture, but this time using t.test(), at a significance level of \(\alpha = .05\).

# Check the normality assumption
qqnorm(trees$Girth)
qqline(trees$Girth)

# Data appears Normally distributed, so it's safe to use t.test
t.test(trees$Girth, mu = 12)
## 
##  One Sample t-test
## 
## data:  trees$Girth
## t = 2.2149, df = 30, p-value = 0.0345
## alternative hypothesis: true mean is not equal to 12
## 95 percent confidence interval:
##  12.09731 14.39947
## sample estimates:
## mean of x 
##  13.24839

With a p-value of 0.0345, I reject the null hypothesis in favor of the alternative.
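As a sanity check, the reported statistic and p-value can be reproduced by hand from the formula above (a sketch relying only on the built-in trees data):

```r
# Reproduce the t statistic and two-sided p-value by hand
mu0 <- 12
x <- trees$Girth
n <- length(x)

t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

c(t = t_manual, p = p_manual)
```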

Test for Value of a Proportion

Suppose you wish to test the hypotheses:

\[H_0: p = p_0\] \[H_A: \left\{\begin{array}{l} p < p_0 \\ p \neq p_0 \\ p > p_0 \end{array}\right.\]

Suppose your sample size is large enough to invoke the Central Limit Theorem to describe the sampling distribution of the statistic \(\hat{p} = \frac{1}{n}\sum_{i = 1}^{n} X_i\), and thus use the test statistic:

\[z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\]

You can use prop.test() to test the hypotheses if these conditions hold. The call prop.test(x, n, p = p0) will test whether the true proportion is p0 when there were x successes out of n trials (both x and n are single non-negative integers).

Suppose that out of 1118 survey participants, 562 favored Hillary Clinton over Donald Trump (this data is fictitious). I test whether the candidates are tied or whether Hillary Clinton is winning below:

prop.test(562, 1118, alternative = "greater", p = 0.5)
## 
##  1-sample proportions test with continuity correction
## 
## data:  562 out of 1118, null probability 0.5
## X-squared = 0.022361, df = 1, p-value = 0.4406
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.477664 1.000000
## sample estimates:
##         p 
## 0.5026834

With a p-value of 0.4406, I soundly fail to reject the null hypothesis.
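The \(z\) statistic from the formula above can also be computed by hand and compared to prop.test(). Note that prop.test() applies a continuity correction by default and reports the chi-squared statistic X-squared, which is the square of \(z\); I disable the correction here so the two agree exactly (a quick sketch):

```r
# Hand computation of the z statistic for the survey example
x <- 562; n <- 1118; p0 <- 0.5
phat <- x / n
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)

# With correct = FALSE, prop.test() reports z squared as X-squared
chisq <- prop.test(x, n, p = p0, correct = FALSE)$statistic
c(z_squared = z^2, X_squared = unname(chisq))
```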

Suppose that your sample size is not large enough to use the above procedure (or you would simply rather not risk it), and you instead want to use the binomial distribution to perform an exact test of the null and alternative hypotheses. binom.test() will allow you to perform such a test, with a function call similar to prop.test().

Suppose you wish to know the political alignment of your Facebook friends. You conduct a survey, randomly selecting 15 friends from your friends list and determining their political affiliation, with a “success” being a friend who shares your political party. You find that 10 of your friends have the same political views as you, and use this to test whether most of your friends agree with you. You use a significance level of \(\alpha = .1\) to decide whether to reject the null hypothesis.

binom.test(10, 15, p = .5, alternative = "greater")
## 
##  Exact binomial test
## 
## data:  10 and 15
## number of successes = 10, number of trials = 15, p-value = 0.1509
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.4225563 1.0000000
## sample estimates:
## probability of success 
##              0.6666667

Since the p-value of 0.1509 is greater than the significance level, you fail to reject the null hypothesis.
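The exact p-value reported above can be reproduced directly from the binomial tail, since under \(H_0\) the number of successes follows a Binomial(15, 0.5) distribution:

```r
# P(X >= 10) when X ~ Binomial(15, 0.5); this is the exact one-sided p-value
p_tail <- pbinom(9, size = 15, prob = 0.5, lower.tail = FALSE)
round(p_tail, 4)   # 0.1509
```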

Testing for Difference Between Population Means Using Paired Data

Suppose you have data sets from two populations, \(X_i\) and \(Y_i\), each population possibly having its own mean, and you wish to test:

\[H_0: \mu_X = \mu_Y \equiv \mu_X - \mu_Y = 0\] \[H_A: \left\{\begin{array}{l} \mu_X < \mu_Y \equiv \mu_X - \mu_Y < 0 \\ \mu_X \neq \mu_Y \equiv \mu_X - \mu_Y \neq 0 \\ \mu_X > \mu_Y \equiv \mu_X - \mu_Y > 0 \end{array}\right.\]

Since the data is paired, testing these hypotheses is similar to inference with univariate data after you find \(D_i = X_i - Y_i\) and replace \(\mu_X - \mu_Y\) with \(\mu_D\).

If the differences \(D_i\) are Normally distributed, you can use the \(t\)-test to decide whether to reject \(H_0\) (you would also use the \(t\)-test if your data was not Normally distributed but the sample size was large enough that little would change if a more exact test were used). In R, use t.test() setting the parameter paired = TRUE. By default the parameter mu is 0, so you do not need to change it unless you want to test for a difference between the two populations other than zero. (In other words, you would be testing whether the samples differ by a certain amount as opposed to whether they differ at all.)
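To see the equivalence described above, here is a short sketch with simulated data (the numbers are purely illustrative): a paired \(t\)-test gives exactly the same result as a one-sample \(t\)-test on the differences.

```r
# Simulated paired data, for illustration only
set.seed(110)
x <- rnorm(20, mean = 5)
y <- x + rnorm(20, mean = -1)   # paired with x by construction

paired_test <- t.test(x, y, paired = TRUE)
diff_test   <- t.test(x - y, mu = 0)

c(paired_test$p.value, diff_test$p.value)   # identical
```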

Below I test whether enforcing speed limits reduces accidents, using the Swedish study we saw in the lecture on confidence intervals (in fact, you may see that most of the following code is identical to that lecture’s code). I will reject \(H_0\) if the p-value is less than 0.1.

library(MASS)
library(reshape)
head(Traffic)
##   year day limit  y
## 1 1961   1    no  9
## 2 1961   2    no 11
## 3 1961   3    no  9
## 4 1961   4    no 20
## 5 1961   5    no 31
## 6 1961   6    no 26
# The Traffic data set needs to be formed into a format that is more easily worked with. The following code reshapes the data into a form that allows for easier comparison of paired days
new_Traffic <- with(Traffic, data.frame("year" = year, "day" = day, "limit" = as.character(limit), "y" = as.character(y)))
melt_Traffic <- melt(new_Traffic, id.vars = c("year", "day"), measure.vars = c("limit", "y"))
form_Traffic <- cast(melt_Traffic, day ~ year + variable)
form_Traffic <- with(form_Traffic, data.frame("day" = day, "limit_61" = `1961_limit`, "accidents_61" = as.numeric(as.character(`1961_y`)), "limit_62" = `1962_limit`, "accidents_62" = as.numeric(as.character(`1962_y`))))
head(form_Traffic)
##   day limit_61 accidents_61 limit_62 accidents_62
## 1   1       no            9       no            9
## 2   2       no           11       no           20
## 3   3       no            9       no           15
## 4   4       no           20       no           14
## 5   5       no           31       no           30
## 6   6       no           26       no           23
# Now we subset this data to keep only days where the speed limit was enforced in one year but not the other
diff_Traffic <- subset(form_Traffic, select = c("accidents_61", "accidents_62", "limit_61", "limit_62"), subset = limit_61 != limit_62)
head(diff_Traffic)
##    accidents_61 accidents_62 limit_61 limit_62
## 11           29           17       no      yes
## 12           40           23       no      yes
## 13           28           16       no      yes
## 14           17           20       no      yes
## 15           15           13       no      yes
## 16           21           13       no      yes
# Get a vector for accidents when the speed limit was not enforced, and a vector for accidents for when it was
accident_no_limit <- with(diff_Traffic, c(accidents_61[limit_61 == "no"], accidents_62[limit_62 == "no"]))
accident_limit <- with(diff_Traffic, c(accidents_61[limit_61 == "yes"], accidents_62[limit_62 == "yes"]))
accident_limit
##  [1] 19 19  9 21 22 23 14 19 15 13 22 42 29 21 12 16 17 23 16 20 13 13  9
## [24] 10 27 12  7 11 15 29 17 17 15 25  9 16 25 25 16 22 21 17 26 41 25 12
## [47] 17 24 26 16 15 12 22 24 16 25 14 15  9
accident_no_limit
##  [1] 29 40 28 17 15 21 24 15 32 22 24 11 27 37 32 25 20 40 21 18 35 21 25
## [24] 34 42 27 34 47 36 15 26 21 39 39 21 15 17 20 24 30 25  8 21 10 14 18
## [47] 26 38 31 12  8 22 17 31 49 23 14 25 24
# It took a lot of work, but finally we have the data in a format we want. Now, the first thing I check is whether the differences are Normally distributed
qqnorm(accident_limit - accident_no_limit)
qqline(accident_limit - accident_no_limit)

# Aside from a couple observations, the Normal distribution seems to fit very well. t procedures are safe to use.
t.test(accident_limit, accident_no_limit, paired = TRUE, alternative = "less")
## 
##  Paired t-test
## 
## data:  accident_limit and accident_no_limit
## t = -3.7744, df = 58, p-value = 0.0001897
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -3.588289
## sample estimates:
## mean of the differences 
##               -6.440678

With a p-value of 0.0002, we soundly reject \(H_0\).

Test for Difference in Mean Between Populations With Independent Samples

Again, consider the hypotheses from the previous section, only now the data is not paired; we have two samples, \(X_1, ..., X_n\) and \(Y_1, ..., Y_m\) (I assume both are drawn from Normal distributions, so we may use \(t\) procedures). Our test statistic depends on whether we believe our populations are homoskedastic (common standard deviation \(\sigma\)) or heteroskedastic (possibly differing standard deviations \(\sigma_X\) and \(\sigma_Y\)) under the null hypothesis (whether they are homoskedastic or heteroskedastic when the alternative hypothesis is true does not matter). If we believe that, under \(H_0\), the populations are homoskedastic, our test statistic will be:

\[t = \frac{\bar{x} - \bar{y}}{s\sqrt{\frac{1}{n} + \frac{1}{m}}}\]

where \(s\) is the pooled sample standard deviation. The degrees of freedom of the \(t\) distribution used to compute the p-value are \(\nu = n + m - 2\).

If we do not assume that the populations are homoskedastic under \(H_0\) (so we either believe they are heteroskedastic or that we have no reason to believe they are homoskedastic and thus assume heteroskedasticity by default), our test statistic will be:

\[t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}}\]

The \(t\) distribution is used as an approximation of the true distribution this test statistic follows under \(H_0\), with degrees of freedom:

\[\nu = \frac{\left(\frac{s_X^2}{n} + \frac{s_Y^2}{m}\right)^2}{\frac{\left(s_X^2/n\right)^2}{n - 1} + \frac{\left(s_Y^2/m\right)^2}{m - 1}}\]

t.test() allows for testing these hypotheses. There is no need to set paired = FALSE (that is the default), and you set var.equal = TRUE only if you want a test assuming that the populations are homoskedastic under \(H_0\). (If this assumption holds, the homoskedastic test is more powerful than the heteroskedastic test, but the difference is only minor.)

Let’s test whether orange juice increases the tooth growth of guinea pigs, using the ToothGrowth data set. I use a significance level of \(\alpha = .05\) to decide whether to reject \(H_0\).

# First, let's get the data into separate vectors
split_len <- split(ToothGrowth$len, ToothGrowth$supp)
OJ <- split_len$OJ
VC <- split_len$VC
# Perform statistical test
t.test(OJ, VC, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
# Is there homoskedasticity? Let's check a boxplot
boxplot(len ~ supp, data = ToothGrowth)

# These populations look homoskedastic; spread does not differ drastically
# between the two. A test that assumes homoskedasticity is thus performed.
t.test(OJ, VC, var.equal = TRUE, alternative = "greater")
## 
##  Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 58, p-value = 0.0302
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4708204       Inf
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
# Not much changed.

In both versions of the test, I reject the null hypothesis.
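Both statistics (and the Welch degrees of freedom) can be reproduced by hand from the formulas above, as a sanity check:

```r
# Hand computation of the pooled and Welch t statistics for ToothGrowth
split_len <- split(ToothGrowth$len, ToothGrowth$supp)
x <- split_len$OJ; y <- split_len$VC
n <- length(x); m <- length(y)

# Pooled (homoskedastic) statistic
s_pool <- sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2))
t_pool <- (mean(x) - mean(y)) / (s_pool * sqrt(1/n + 1/m))

# Welch (heteroskedastic) statistic and its approximate degrees of freedom
se2 <- var(x) / n + var(y) / m
t_welch  <- (mean(x) - mean(y)) / sqrt(se2)
df_welch <- se2^2 / ((var(x) / n)^2 / (n - 1) + (var(y) / m)^2 / (m - 1))

round(c(t_pool = t_pool, t_welch = t_welch, df_welch = df_welch), 4)
```

Since the two samples here are the same size, the pooled and Welch statistics coincide; only the degrees of freedom (and hence the p-values) differ.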

Testing for Difference in Population Proportion

Suppose you have two data sets of Bernoulli random variables, \(X_1, ..., X_n\) and \(Y_1, ..., Y_m\), and you wish to know whether the population proportion of “successes” for the first population, \(p_X\), differs from the corresponding proportion in the second population, \(p_Y\). Your hypotheses are:

\[H_0: p_X = p_Y \equiv p_X - p_Y = 0\] \[H_A: \left\{\begin{array}{l} p_X < p_Y \equiv p_X - p_Y < 0 \\ p_X \neq p_Y \equiv p_X - p_Y \neq 0 \\ p_X > p_Y \equiv p_X - p_Y > 0 \end{array}\right.\]

If \(m\) and \(n\) are reasonably large, the Central Limit Theorem can be used to obtain a test statistic:

\[z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n} + \frac{1}{m} \right)}}\]

where \(\hat{p}\) is the pooled sample proportion (that is, the proportion of “successes” from the combined sample). Under \(H_0\), this test statistic follows a standard Normal distribution (when applying the Central Limit Theorem).

In R, prop.test() can be used to conduct this test, using a call of the form prop.test(x, n), where x is a vector containing the number of successes in each sample and n is a vector containing the corresponding sample sizes.
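To connect the formula to the function, the statistic can be computed by hand (the counts below are purely illustrative; prop.test() reports the squared statistic as X-squared, and applies a continuity correction unless correct = FALSE):

```r
# Illustrative counts only
x <- c(40, 60)    # successes in each sample
n <- c(100, 120)  # sample sizes
phat   <- x / n
p_pool <- sum(x) / sum(n)   # pooled proportion of successes

z <- (phat[1] - phat[2]) / sqrt(p_pool * (1 - p_pool) * (1/n[1] + 1/n[2]))

# Without the continuity correction, X-squared is exactly z squared
chisq <- prop.test(x, n, correct = FALSE)$statistic
c(z_squared = z^2, X_squared = unname(chisq))
```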

Below, I use prop.test() and the Melanoma data set to test whether males and females are equally likely to die from melanoma after being diagnosed, or whether they differ. I use a significance level of \(\alpha = .05\).

# First, I obtain sample sizes, using the fact that sex == 1 for males and 0
# for females
n <- c(Female = nrow(Melanoma) - sum(Melanoma$sex), Male = sum(Melanoma$sex))
# Now, find how many in each group died from melanoma
x <- aggregate(Melanoma$status == 1, list(Melanoma$sex), sum)
x <- c(Female = x[1, "x"], Male = x[2, "x"])
prop.test(x, n)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  x out of n
## X-squared = 4.3803, df = 1, p-value = 0.03636
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.283876633 -0.005856138
## sample estimates:
##    prop 1    prop 2 
## 0.2222222 0.3670886

With a p-value of 0.0364, I reject the null hypothesis; men and women are not equally likely to die from melanoma.