Random Number Generation

Computers cannot create random numbers. They are Turing-complete, and a Turing machine is deterministic, not random. Instead, computers generate pseudo-random numbers. They start with a seed, an initial number in a sequence, and from there generate further numbers in the sequence that, without knowing the initial seed, appear to be random. That said, if the seed were known along with the generating process, the random sequence could be recreated exactly. Usually the system time is used to initialize a random number generator, since this may look random to a user. In R, you can use the set.seed() function to set a random seed. With no parameters, R will reset the random seed. If you wish to set the seed to a specific (numeric) value, you can do so with set.seed(seed). (You do not have to set a seed; R will automatically pick one at the start of a session, though if you want reproducible results, you should set the seed and communicate what it is.) For this document, I set the seed below.

set.seed(52716)

Random number generators generate uniformly distributed random numbers. Once they can generate random numbers from the uniform distribution, random numbers following other distributions can be generated as well, so only one random number generator is needed.

The sample() function allows for some basic random value generation. Two arguments are required. The first is a vector containing the values that could be generated. This is your sample space, the possible values that could be generated by the random process. The second is the number of random values to generate. The call sample(x, size) will generate a random vector consisting of values from x of length size.

# Here I simulate the classic ball-in-bag probability model where random
# balls are pulled from a bag, without replacement. There are 32 red balls
# and 16 blue balls. I pull five balls from the bag.
sample(rep(c("red", "blue"), times = c(32, 16)), 5)
## [1] "blue" "red"  "red"  "red"  "red"

By default, sample() will use a model where random values are generated without replacement. This means that once a value is observed, it cannot be observed again (in the ball-and-bag model, this equates to pulling a ball from the bag and not putting the ball back in; once drawn, it cannot be drawn again). This naturally means that you must have size < length(x), or an error will be thrown; you can’t draw random values when you run out! To switch to a model with replacement (i.e. balls are put back in the bag after being drawn), set the parameter replace = TRUE.

# A model without replacement
ball_pull1 <- sample(rep(c("red", "blue"), times = c(32, 16)), 32 + 16)
ball_pull1
##  [1] "red"  "red"  "red"  "red"  "blue" "blue" "blue" "blue" "red"  "red" 
## [11] "blue" "red"  "red"  "red"  "red"  "red"  "blue" "blue" "red"  "red" 
## [21] "red"  "red"  "red"  "blue" "red"  "red"  "red"  "blue" "red"  "red" 
## [31] "blue" "red"  "blue" "blue" "red"  "red"  "red"  "blue" "red"  "red" 
## [41] "blue" "red"  "red"  "red"  "red"  "blue" "red"  "blue"
# Since I pulled all balls out of the bag, I get exactly 32 reds and 16
# blues
table(ball_pull1)
## ball_pull1
## blue  red 
##   16   32
# Now with replacement
ball_pull2 <- sample(rep(c("red", "blue"), times = c(32, 16)), 32 + 16, replace = TRUE)
ball_pull2
##  [1] "red"  "red"  "red"  "blue" "red"  "red"  "red"  "red"  "blue" "red" 
## [11] "red"  "red"  "red"  "red"  "red"  "blue" "red"  "red"  "red"  "red" 
## [21] "red"  "red"  "red"  "red"  "red"  "blue" "red"  "red"  "blue" "red" 
## [31] "red"  "red"  "red"  "red"  "red"  "red"  "blue" "red"  "red"  "blue"
## [41] "red"  "red"  "blue" "red"  "blue" "red"  "red"  "blue"
# While close to the true frequencies, I don't see red balls exactly 32
# times, or blue balls exactly 16 times.
table(ball_pull2)
## ball_pull2
## blue  red 
##   10   38

By default, sample() will assume that each element in x is equally likely. This can be changed by setting prob to a vector assigning probabilities to each outcome.

# We can change the sampling with replacement model by making the sample
# space only c('red', 'blue'), and setting the probability of seeing these
# respective values to those consistent with our model.
sample(c("red", "blue"), 5, replace = TRUE, prob = c(32/48, 16/48))
## [1] "red"  "red"  "red"  "red"  "blue"