Families of Distributions

R facilitates both description and simulation of random variables from various families of probability distributions, using similar formats. Each family will have four R functions associated with it, preceded by a character and followed by the identifier of the family.

  • d functions are associated with the probability density function (pdf) or probability mass function (pmf) of a distribution. An input vector of quantiles will get the value of the pdf/pmf at those quantiles.
  • p functions are associated with the cumulative distribution function (cdf) of a probability distribution. Thus they describe the probability \(P(X \leq x)\), or the probability of being less than the input. This is often referred to as the lower tail of the distribution. If instead you want the upper tail, or \(P(X > x)\), you can set lower.tail = FALSE in a p function to obtain the upper tail.
  • q functions are associated with the quantiles of the probability distribution, and thus are the inverse of the cdf. There is likewise a lower.tail parameter for the q function.
  • r functions generate random values drawn from the distribution.

I give examples below for some distributions, starting with some discrete distributions.

Bernoulli

A Bernoulli random variable is a random variable that is either 1 or 0, and is described by the probability of obtaining 1, often denoted \(p\). Thus we denote a Bernoulli random variable with:

\[X \sim \operatorname{Ber}(p)\]

Note that if \(p = \frac{1}{2}\), this is the model of a single fair coin flip.

Bellow, I plot the pdf’s and cdf’s of a random variable \(X \sim \operatorname{Ber}(\frac{1}{2})\), and also simulate some values.

# I will be plotting a lot of pmf's in this document, so I create a function
# to help save effort.  The first argument, q, represents the quantiles of
# the random variable (the values that are possible). The second argument
# represents the value of the pmf at each q (and thus should be of the same
# length); in other words, for each q, p is the probability of seeing q
plot_pmf <- function(q, p) {
    # This will plot a series of horizontal lines at q with height p, setting
    # the y limits to a reasonable heights
    plot(q, p, type = "h", xlab = "x", ylab = "probability", main = "pmf", ylim = c(0, 
        max(p) + 0.1))
    # Usually these plots have a dot at the end of the line; the point function
    # will add these dots to the plot created above
    points(q, p, pch = 16, cex = 2)
}

# Plot the pmf of a Bernoulli rv
ber_q <- c(0, 1)
ber_pmf <- dbinom(ber_q, 1, 0.5)
plot_pmf(ber_q, ber_pmf)

# Plot the cdf of a Bernoulli rv
ber_q2 <- seq(-1, 2, length = 1000)
ber_cdf <- pbinom(ber_q2, 1, 0.5)
# Plot the cdf with a line plot
plot(ber_q2, ber_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# The quartiles of a Bernoulli distribution
qbinom(seq(0, 1, by = 0.25), 1, 0.5)
## [1] 0 0 0 1 1
# Generate 10 random values from a Bernoulli distribution
rbinom(10, 1, 0.5)
##  [1] 1 0 0 1 1 0 0 0 1 1

Binomial

A binomial random variable is specified by two parameters, \(n\) and \(p\), and is denoted by:

\[X \sim \operatorname{BINOM}(n, p)\]

Binomial random variables represent the number of times \(n\) Bernoulli random variables are 1 when the probability of seeing 1 is \(p\) (another way to think of this is that if a coin flip is a Bernoulli random variable, then the number of times the coin lands heads-up when flipped \(n\) times is a binomial random variable).

The functions for the binomial distribution are dbinom(), pbinom(), qbinom(), and rbinom(). All of these have arguments size and prob for specifying the parameters n and p of the binomial distribution respectively.

# Let's consider a model where a fair coin is flipped five times.  The pmf
# of the binomial rv:
binom_q <- 0:5
binom_pmf <- dbinom(binom_q, size = 5, prob = 0.5)
plot_pmf(binom_q, binom_pmf)

# The cdf of a binomial rv
binom_q2 <- seq(-1, 6, length = 1000)
binom_cdf <- pbinom(binom_q2, size = 5, prob = 0.5)
plot(binom_q2, binom_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# The quartiles of this binomial rv
qbinom(seq(0, 1, by = 0.25), size = 5, prob = 0.5)
## [1] 0 2 2 3 5
# 10 simulated binomial rv
rbinom(10, size = 5, prob = 0.5)
##  [1] 3 3 3 5 4 3 2 2 1 2

Geometric

A geometric random variable is specified by one parameter, \(p\), and is denoted by:

\[X \sim \operatorname{GEOM}(p)\]

This random variable represents how many times a Bernoulli random variable with parameter \(p\) is created until a 1 is seen. (In other words, the number of times a coin is flipped until it lands heads-up is a geometric random variable). The name comes from the fact that the pmf of a geometric random variable is a geometric series, or \(p(x) = p(1 - p)^{x - 1}\).

The functions for working with geometric random variables are dgeom(), pgeom(), qgeom(), and rgeom(). All of these have an argument prob used to represent the parameter p of the geometric distribution.

# In these examples we will work with a geometric rv where p = 1. I now show
# part of its pmf
geom_q <- 0:20
geom_pmf <- dgeom(geom_q, prob = 0.1)
plot_pmf(geom_q, geom_pmf)

# The cdf of this variable
geom_q2 <- seq(-1, 21, length = 1000)
geom_cdf <- pgeom(geom_q2, prob = 0.1)
plot(geom_q2, geom_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# Quartiles of a geometric rv
qgeom(seq(0, 1, by = 0.25), prob = 0.1)
## [1]   0   2   6  13 Inf
# 10 simulated geometric rv's
rgeom(10, prob = 0.1)
##  [1] 19  3 11  8 10  4 14 13 11  0

Poisson

A Poisson random variable is specified by one parameter, \(\lambda\), and is denoted by:

\[X \sim \operatorname{POI}(\lambda)\]

Poisson random variables are used to model counts of an event that occur in a finite frame of time. They are, in some sense, the inverse of a geometric random variable; while a geometric random variable represents the length of time to see one “success”, a Poisson random variable represents how many “successes” given a certain length of time. The parameter \(\lambda\) represents the average “success” rate. Some examples of Poisson random variables include:

  • The number of calls a call center receives in a work day.
  • How many goals a soccer team scores in a game.
  • The number of deaths from horse kicks in the Prussian army annually.

The R functions for working with Poisson random variables are dpois(), ppois(), qpois(), and rpois(). The argument lambda corresponds to the parameter \(\lambda\).

# For these examples, the average success rate is 4.  Creating a pmf:
pois_q <- 0:20
pois_pmf <- dpois(pois_q, lambda = 4)
plot_pmf(pois_q, pois_pmf)

# Creating a cdf
pois_q2 <- seq(-1, 21, length = 1000)
pois_cdf <- ppois(pois_q2, lambda = 4)
plot(pois_q2, pois_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# Quartiles of the Poisson rv
qpois(seq(0, 1, by = 0.25), lambda = 4)
## [1]   0   3   4   5 Inf
# Simulating Poisson rv's
rpois(10, lambda = 4)
##  [1] 4 5 2 4 4 5 3 3 2 0

Uniform

The first continuous random variable I consider here is a uniformly distributed random variable, which is described by two parameters, \(a\) and \(b\), and represented with:

\[X \sim \operatorname{UNIF}(a, b)\]

\(a\) is the smallest value possible a uniformly distributed rv could take, and \(b\) the largest. Every number between \(a\) and \(b\) is equally weighted. It is a very important random variable, since it describes the probability for any random variable the probability that the random variable is a certain quantile of its distribution. Additionally, computers simulate uniform random variables, and using them you can simulate every other random variable.

The dunif(), punif(), qunif(), and runif() functions are used for working with uniform rv’s in R. They all have arguments min and max that correspond to parameters \(a\) and \(b\) mentioned above.

Being a continuous random variable, rather than having a pmf, a probability density function (pdf) is used for describing the random variable.

# Throughout I will set a = 0 and b = 1. This could be considered the
# standard Uniform random variable.  The pdf of a uniform rv:
unif_q <- seq(-1, 2, length = 1000)
unif_pdf <- dunif(unif_q, min = 0, max = 1)
plot(unif_q, unif_pdf, type = "l", xlab = "x", ylab = "density", main = "pdf", 
    ylim = c(0, max(unif_pdf) + 0.1))

# The cdf of a uniform rv
unif_cdf <- punif(unif_q, min = 0, max = 1)
plot(unif_q, unif_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# Quartiles of a uniform rv
qunif(seq(0, 1, by = 0.25), min = 0, max = 1)
## [1] 0.00 0.25 0.50 0.75 1.00
# Simulated uniform rv's
runif(10, min = 0, max = 1)
##  [1] 0.7562467 0.9696724 0.8479128 0.2943741 0.9111995 0.2399134 0.7144876
##  [8] 0.3392606 0.8428848 0.8038884

Exponential

The exponential distribution is specified by a parameter \(\mu\) and is described by:

\[X \sim \operatorname{EXP}(\mu)\]

\(\mu\) describes both the mean and the standard deviation of an exponential random variable. These are used to model waiting times, such as how long it takes a server to service a a customer and move on to the next one in the queue. Thus, \(\mu\) is the average length of time to service one customer and move on to the next. One could also consider instead \(\frac{1}{\mu}\), which is the rate at which customers are serviced. So if a call center takes on average half a minute to service a customer, the average rate at which it services its customers is two customers per minute. (You may notice some similarity between an exponential random variable and a geometric random variabe; the exponential rv is the continuous version of the geometric rv.)

The function dexp(), pexp(), qexp(), and rexp() are used for working with exponential random variables in R. You do not directly specify the mean wait time, \(\mu\), in R; instead you specify the rate, \(\frac{1}{\mu}\), with the rate argument. Thus if you wanted a mean of \(\mu = 6\), you would set rate = 1/6.

# Here we will model an exponential rv where the mean is 6. This means that
# the rate is 1/6.  The pdf of an exponential rv:
exp_q <- seq(-1, 12, length = 1000)
exp_pdf <- dexp(exp_q, rate = 1/6)
plot(exp_q, exp_pdf, type = "l", xlab = "x", ylab = "density", main = "pdf", 
    ylim = c(0, max(exp_pdf) + 0.1))

# The cdf of an exponential rv
exp_cdf <- pexp(exp_q, rate = 1/6)
plot(exp_q, exp_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# Quartiles of an exponential rv
qexp(seq(0, 1, by = 0.25), rate = 1/6)
## [1] 0.000000 1.726092 4.158883 8.317766      Inf
# Simulated exponential rv's
rexp(10, rate = 1/6)
##  [1]  4.181027  1.381246  2.254482  2.814817  1.777893 15.998047 13.546735
##  [8] 13.562386  4.017361  4.070043

Normal

The Normal distribution is specified by two parameter, \(\mu\) and \(\sigma\), and is described by:

\[X \sim N(\mu, \sigma)\]

\(\mu\) is the mean of a Normal rv, and \(\sigma\) is the rv’s standard deviation. The Normal distribution describes the distribution of “errors”, or now far away some random variable is from its mean. It is perhaps the most important distribution in probability and statistics.

The functions dnorm(), pnorm(), qnorm(), and rnorm() are used for working with Normal rv’s. They have two arguments, mean and sd, used for specifying the parameters \(\mu\) and \(\sigma\), respectively.

# We will set the mean to 10 and the standard deviation to 2 when working
# with Normal rv's in this example.  The pdf of a Normal rv:
norm_q <- seq(4, 16, length = 1000)
norm_pdf <- dnorm(norm_q, mean = 10, sd = 2)
plot(norm_q, norm_pdf, type = "l", xlab = "x", ylab = "density", main = "pdf", 
    ylim = c(0, max(norm_pdf) + 0.1))

# The cdf of a Normal rv:
norm_cdf <- pnorm(norm_q, mean = 10, sd = 2)
plot(norm_q, norm_cdf, type = "l", xlab = "x", ylab = "probability", main = "cdf", 
    ylim = c(0, 1.1))

# The quartiles of a Normal rv
qnorm(seq(0, 1, by = 0.25), mean = 10, sd = 2)
## [1]     -Inf  8.65102 10.00000 11.34898      Inf
# Simulated Normal rv's
rnorm(10, mean = 10, sd = 2)
##  [1] 12.541297 11.221620  7.411588 13.483226 12.046666  8.065685 10.076769
##  [8] 10.338442  7.755875  6.666931