Applying a Function Over a Collection

Often we wish to apply a function not to a single object or variable but to a collection of them, so that we get multiple values back. For example, if we want the powers of two with exponents from one to ten, we could compute them with the following:

2^(1:10)
##  [1]    2    4    8   16   32   64  128  256  512 1024

Similarly, we could take the square roots of the numbers from 0 to 1, in steps of 0.1, with:

sqrt(seq(0, 1, by = 0.1))
##  [1] 0.0000000 0.3162278 0.4472136 0.5477226 0.6324555 0.7071068 0.7745967
##  [8] 0.8366600 0.8944272 0.9486833 1.0000000
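
The calls above rely on vectorization: arithmetic operators and functions such as sqrt() act on every element of a vector at once. One detail worth checking when writing such expressions is operator precedence. The short sketch below (the variable names are only for illustration) shows that ^ binds more tightly than :, so the exponent range needs its own parentheses.

```r
# ^ has higher precedence than :, so this is read as (2^1):10
without_parens <- 2^1:10
# Parentheses make the whole sequence 1:10 the exponent
with_parens <- 2^(1:10)

without_parens  # the integers 2 through 10
with_parens     # the powers 2, 4, 8, ..., 1024

# Vectorized functions return one result per input element
length(sqrt(seq(0, 1, by = 0.1)))
```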

It may not be this simple though. For example, suppose we have a data frame, which I construct below:

library(MASS)
cdat <- subset(Cars93, select = c(Min.Price, Price, Max.Price, MPG.city, MPG.highway, 
    EngineSize, Horsepower, RPM))
head(cdat)
##   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1      12.9  15.9      18.8       25          31        1.8        140
## 2      29.2  33.9      38.7       18          25        3.2        200
## 3      25.9  29.1      32.3       20          26        2.8        172
## 4      30.8  37.7      44.6       19          26        2.8        172
## 5      23.7  30.0      36.2       22          30        3.5        208
## 6      14.2  15.7      17.3       22          31        2.2        110
##    RPM
## 1 6300
## 2 5500
## 3 5500
## 4 5500
## 5 5700
## 6 5200

I want the mean of each variable in cdat. mean(cdat) will not work; the mean() function does not know how to handle the different variables in a data frame.
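
To see the failure concretely, here is a small sketch using a made-up data frame (toy_df and its columns are hypothetical, not part of Cars93). In current versions of R, calling mean() on a data frame is expected to give NA with a warning rather than one mean per column.

```r
# A hypothetical two-column data frame, for illustration only
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# mean() does not know what to do with a whole data frame: it warns and
# returns NA (the warning is suppressed here to keep the output short)
result <- suppressWarnings(mean(toy_df))
is.na(result)

# mean() works as expected on a single column
mean(toy_df$a)
```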

We may instead try a for loop, like so:

# Make an empty vector
cdat_means <- c()
# This starts a for loop
for (vec in cdat) {
    # For every vector in cdat (called vec in the body of the loop), the code in
    # the loop will be executed. Compute the mean of vec, and add it to
    # cdat_means
    cdat_means <- c(cdat_means, mean(vec))
}
names(cdat_means) <- names(cdat)
cdat_means
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161
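
The loop above grows cdat_means by one element on every pass, which typically means R builds a new, longer vector each time. A common variation, sketched below with a hypothetical data frame toy_df standing in for cdat, creates the result vector at its final length first and fills it in by position.

```r
# A hypothetical data frame standing in for cdat
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 8))

# Preallocate a numeric vector with one slot per column
toy_means <- numeric(length(toy_df))
# seq_along(toy_df) gives the column positions 1, 2, 3
for (i in seq_along(toy_df)) {
    # toy_df[[i]] extracts the i-th column as a plain vector
    toy_means[i] <- mean(toy_df[[i]])
}
names(toy_means) <- names(toy_df)
toy_means
```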

A good R programmer will try to avoid for loops as much as possible. One reason is that for loops in R are slow compared with loops in compiled languages. Since R is an interpreted language and includes many features that make it easier to work with interactively and to write code in, R programs tend to run slower than equivalent programs in other languages. This is the price R pays for being interactive and much easier to write code for than compiled languages like C, C++, or Java. (A lot of R functions run fast because the function is actually an interface to a function written in C, C++, or Fortran.) Another reason R programmers avoid for loops is that there is often an alternative without a loop that is easier to both write and understand.
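
As a small illustration of the second reason, the sketch below computes a sum of squares twice, once with a loop and once with vectorized arithmetic. The vector x is made up for the example. Timings are not shown because they depend on the machine and the R version, though system.time() can be wrapped around either version to compare them.

```r
x <- 1:1000

# Loop version: accumulate one squared value at a time
total_loop <- 0
for (value in x) {
    total_loop <- total_loop + value^2
}

# Vectorized version: square every element at once, then sum
total_vec <- sum(x^2)

# Check that both approaches give the same answer
total_loop == total_vec
```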

How could we rewrite the above code without using for? We could use the function sapply() and the call sapply(v, f), where v is either a vector or list with the items you wish to iterate over, and f is a function to apply to each item. (Remember that a data frame is a list of vectors of equal length.) When f returns a single value for each item, a vector containing the results is returned.

# A function to check if a number is even
even <- function(x) {
    # If x is divisible by 2 (the remainder is 0 when x is divided by 2), x is
    # even and the result is TRUE. Otherwise, the result is FALSE.
    x%%2 == 0
}

# Which numbers between 1 and 10 are even?
sapply(1:10, even)
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
# The means of the vectors in cdat (remember that a data frame is a list of
# equal length vectors)
sapply(cdat, mean)
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161
# We can pass sapply an anonymous function, which is an unnamed function
# passed as an argument to some other function, used for some evaluation. I
# illustrate below by passing to sapply a function that computes the range
# of each of the variables in cdat.
sapply(cdat, function(vec) {
    diff(range(vec))
})
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##        38.7        54.5        72.1        31.0        30.0         4.7 
##  Horsepower         RPM 
##       245.0      2700.0
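
When the function returns more than one value per item, sapply() will usually arrange the results in a matrix with one column per item. The sketch below uses a hypothetical data frame toy_df in place of cdat.

```r
# A hypothetical data frame, for illustration only
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# range() returns two numbers (minimum and maximum) for each column, so
# sapply() returns a matrix with two rows and one column per variable
toy_ranges <- sapply(toy_df, range)
toy_ranges
dim(toy_ranges)
```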

The lapply() function works like the sapply() function, except that lapply() always returns a list, whereas sapply() simplifies the result to a vector (or a matrix) when it can.
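
The difference is easiest to see side by side. The following is a minimal sketch, again with a hypothetical data frame toy_df rather than cdat.

```r
# A hypothetical data frame, for illustration only
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# lapply() keeps each result as a separate list element
means_list <- lapply(toy_df, mean)
str(means_list)

# sapply() simplifies the same results to a named numeric vector
means_vec <- sapply(toy_df, mean)
means_vec

# unlist() flattens the list form, which can then be compared with the
# vector form
all.equal(unlist(means_list), means_vec)
```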

Alternatively, if we have a function f(x) that knows how to work with an object x, we could vectorize f so it can work on a vector or list of objects like x. We can use the Vectorize() function for this task with a call like vf <- Vectorize(f), where f is the function to vectorize, and vf is the new, vectorized version of f. The example below does what we did for cdat with both a for loop and sapply(), but now does so with a vectorized version of mean().

vmean <- Vectorize(mean)
vmean(cdat)
##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161
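
Vectorize() is arguably most useful for functions that only make sense for one value at a time. The sketch below uses a hypothetical function describe_sign(), which relies on if and therefore cannot process a whole vector on its own.

```r
# A hypothetical function that only handles a single number, because if ()
# expects a single TRUE or FALSE condition
describe_sign <- function(x) {
    if (x > 0) "positive" else "non-positive"
}

# The vectorized version applies describe_sign() to each element in turn
vdescribe_sign <- Vectorize(describe_sign)
vdescribe_sign(c(-2, 0, 3))
```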

Now suppose you have a data frame d, which contains information from different samples representing different populations. You wish to apply a function f() to data stored in d$x, and d$y determines which sample each row of the data frame (and thus, each entry of d$x) came from. You want f() to be applied to the data in each sample, separately. You can do so with the aggregate() function in a call of the form aggregate(x ~ y, data = d, f). I illustrate with the iris dataset below.

# The structure of iris
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# The mean sepal length by species of iris
aggregate(Sepal.Length ~ Species, data = iris, mean)
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
# The five-number summary of sepal length for each species of iris
aggregate(Sepal.Length ~ Species, data = iris, quantile)
##      Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50%
## 1     setosa           4.300            4.800            5.000
## 2 versicolor           4.900            5.600            5.900
## 3  virginica           4.900            6.225            6.500
##   Sepal.Length.75% Sepal.Length.100%
## 1            5.200             5.800
## 2            6.300             7.000
## 3            6.900             7.900
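
The formula interface also accepts more than one response variable. As a sketch, the call below puts a dot (.) on the left-hand side, which stands for every column not named on the right, to get the mean of all four measurements by species in one step.

```r
# Means of every measurement column, computed separately for each species
iris_means <- aggregate(. ~ Species, data = iris, mean)
iris_means

# The result is an ordinary data frame with one row per species
dim(iris_means)
```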

Let’s now consider matrices. Perhaps we have a matrix and we wish to apply a function to each row or to each column of the matrix. The apply() function allows us to do just that in a call of the form apply(mat, m, f), where mat is the matrix with data, f is the function to apply, and m is the margin to apply f() over. For matrices, a value of 1 for m will lead to the function being applied to each row, and a value of 2 to each column. I illustrate with a data set recording ethnicity counts for selected Utah public schools (to see how this data set was created, view the source code of this document).

school_race_dat
##                  Entheos Academy Kearns Entheos Academy Magna
## Native American                       0                     0
## Asian                                 4                     5
## Black                                 1                     5
## Hispanic                            145                   201
## Pacific Islander                     15                     3
## White                               334                   273
## Multiple Race                        23                    15
##                  Jim Bridger School Sunset Ridge Middle Copper Hills High
## Native American                   4                   5                 9
## Asian                             6                  25                50
## Black                            12                  19                42
## Hispanic                        216                 322               551
## Pacific Islander                 12                  28                28
## White                           314                1124              1924
## Multiple Race                     7                  50               102
##                  Thomas Jefferson Jr High Kearns High
## Native American                        11          39
## Asian                                  13          49
## Black                                  17          53
## Hispanic                              260         937
## Pacific Islander                       42          99
## White                                 394        1138
## Multiple Race                           2          10
# Get row sums
apply(school_race_dat, 1, sum)
##  Native American            Asian            Black         Hispanic 
##               68              152              149             2632 
## Pacific Islander            White    Multiple Race 
##              227             5501              209
# Column sums
apply(school_race_dat, 2, sum)
##   Entheos Academy Kearns    Entheos Academy Magna       Jim Bridger School 
##                      522                      502                      571 
##      Sunset Ridge Middle        Copper Hills High Thomas Jefferson Jr High 
##                     1573                     2706                      739 
##              Kearns High 
##                     2325
# Row sums and column sums are actually used frequently, so there are
# specialized functions for these
rowSums(school_race_dat)
##  Native American            Asian            Black         Hispanic 
##               68              152              149             2632 
## Pacific Islander            White    Multiple Race 
##              227             5501              209
colSums(school_race_dat)
##   Entheos Academy Kearns    Entheos Academy Magna       Jim Bridger School 
##                      522                      502                      571 
##      Sunset Ridge Middle        Copper Hills High Thomas Jefferson Jr High 
##                     1573                     2706                      739 
##              Kearns High 
##                     2325
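
apply() is not limited to built-in functions such as sum(). The sketch below uses a small made-up matrix toy_counts (not the school data) and an anonymous function to turn each column into proportions. It also checks that apply() with sum agrees with rowSums().

```r
# A hypothetical 2 x 3 matrix of counts, for illustration only
toy_counts <- matrix(c(10, 30, 5, 15, 20, 20), nrow = 2,
    dimnames = list(c("group1", "group2"), c("A", "B", "C")))

# Divide each column by its total; the result has the same shape as
# toy_counts
col_props <- apply(toy_counts, 2, function(col) {
    col / sum(col)
})
col_props

# Each column of proportions should sum to 1
colSums(col_props)

# apply() with sum should give the same totals as the specialized rowSums()
all.equal(apply(toy_counts, 1, sum), rowSums(toy_counts))
```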