Often we wish to apply a function not to a single object but to a collection of objects, obtaining one value for each. For example, to get two raised to each power from one to ten, we could use the following:
2^(1:10)
##  [1]    2    4    8   16   32   64  128  256  512 1024
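The parentheses around the sequence matter here, because ^ has higher precedence than the : operator. The short sketch below (the variable names are only for illustration) compares the two forms:

```r
# Without parentheses, ^ is evaluated first, so this is (2^1):10, i.e. 2:10
no_parens <- 2^1:10
# With parentheses, 2 is raised to each of the powers 1 through 10
with_parens <- 2^(1:10)

no_parens
with_parens
```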
Similarly, we could take the square roots of the numbers from 0 to 1, in steps of 0.1, with:
sqrt(seq(0, 1, by = 0.1))
## [1] 0.0000000 0.3162278 0.4472136 0.5477226 0.6324555 0.7071068 0.7745967
## [8] 0.8366600 0.8944272 0.9486833 1.0000000
It may not be this simple though. For example, suppose we have a data frame, which I construct below:
library(MASS)
cdat <- subset(Cars93, select = c(Min.Price, Price, Max.Price, MPG.city, MPG.highway,
    EngineSize, Horsepower, RPM))
head(cdat)
##   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
## 1      12.9  15.9      18.8       25          31        1.8        140 6300
## 2      29.2  33.9      38.7       18          25        3.2        200 5500
## 3      25.9  29.1      32.3       20          26        2.8        172 5500
## 4      30.8  37.7      44.6       19          26        2.8        172 5500
## 5      23.7  30.0      36.2       22          30        3.5        208 5700
## 6      14.2  15.7      17.3       22          31        2.2        110 5200
I want the mean of all the variables in cdat. mean(cdat) will not work; the mean() function does not know how to handle the different variables in a data frame.
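To see what "will not work" looks like in practice, here is a minimal sketch using a small hypothetical data frame (df is not part of the original example). In recent versions of R, I would expect mean() to fall back to its default method, which issues a warning and returns NA rather than a vector of means:

```r
# A small hypothetical data frame used only for illustration
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# mean() does not compute per-column means for a data frame; the default
# method warns that the argument is not numeric or logical and returns NA
result <- suppressWarnings(mean(df))
result
```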
We may instead try a for loop, like so:
# Make an empty vector
cdat_means <- c()
# This starts a for loop
for (vec in cdat) {
    # For every vector in cdat (called vec in the body of the loop), the code
    # in the loop will be executed. Compute the mean of vec, and add it to
    # cdat_means.
    cdat_means <- c(cdat_means, mean(vec))
}
names(cdat_means) <- names(cdat)
cdat_means
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
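The loop above grows cdat_means by one element on each pass. A common variant allocates the whole result vector first and then fills it in by position. The sketch below shows that pattern on a small hypothetical data frame (df and its columns are made up for illustration):

```r
# A small hypothetical data frame standing in for cdat
df <- data.frame(price = c(15.9, 33.9, 29.1), mpg = c(25, 18, 20))

# Allocate a numeric vector with one slot per column
df_means <- numeric(ncol(df))

# seq_along(df) gives the column positions 1, 2, ...
for (i in seq_along(df)) {
    # df[[i]] extracts the i-th column as a plain vector
    df_means[i] <- mean(df[[i]])
}
names(df_means) <- names(df)
df_means
```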
A good R programmer will try to avoid for loops as much as possible. One reason is that for loops in R are slow compared with loops in compiled languages. R is an interpreted language with many features that support interactive use and make code easier to write, so R programs tend to run more slowly than equivalent programs in compiled languages like C, C++, or Java. This is the price R pays for that convenience. (A lot of R functions run fast because the function is actually an interface to a function written in C, C++, or FORTRAN.) Another reason R programmers avoid for loops is that there is often a loop-free alternative that is easier both to write and to understand.
How could we rewrite the above code without using for? We could use the function sapply() in a call of the form sapply(v, f), where v is either a vector or a list with the items you wish to iterate over, and f is a function to apply to each item. (Remember that a data frame is a list of vectors of equal length.) A vector containing the results is returned.
# A function to check if a number is even
even <- function(x) {
    # If x is divisible by 2 (the remainder is 0 when x is divided by 2), x is
    # even and the result is TRUE. Otherwise, the result is FALSE.
    x %% 2 == 0
}
# Which numbers between 1 and 10 are even?
sapply(1:10, even)
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
# The means of the vectors in cdat (remember that a data frame is a list of
# equal length vectors)
sapply(cdat, mean)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
# We can pass sapply an anonymous function, which is an unnamed function
# passed as an argument to some other function, used for some evaluation. I
# illustrate below by passing to sapply a function that computes the range
# of each of the variables in cdat.
sapply(cdat, function(vec) {
    diff(range(vec))
})
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 38.7 54.5 72.1 31.0 30.0 4.7
## Horsepower RPM
## 245.0 2700.0
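When the function returns more than one value per item, sapply() simplifies the results to a matrix rather than a vector, with one column per item. A minimal sketch, using a small hypothetical data frame (df is made up for illustration):

```r
# A small hypothetical data frame used only for illustration
df <- data.frame(a = c(1, 5, 3), b = c(10, 40, 20))

# range() returns two values (minimum and maximum) for each column, so the
# result is a matrix with two rows and one column per variable
rng <- sapply(df, range)
rng
dim(rng)
```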
The lapply() function works exactly like the sapply() function, except that lapply() returns a list rather than a vector.
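The difference is easiest to see side by side. The sketch below uses a small hypothetical data frame (df is not from the original example) and shows that flattening the list with unlist() recovers what sapply() returns:

```r
# A small hypothetical data frame used only for illustration
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# lapply() returns a list with one element per column
means_list <- lapply(df, mean)
str(means_list)

# sapply() returns the same values simplified to a named vector
means_vec <- sapply(df, mean)
means_vec

# Flattening the list gives the same named vector
unlist(means_list)
```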
Alternatively, if we have a function f(x) that knows how to work with an object x, we could vectorize f so it can work on a vector or list of objects like x. We can use the Vectorize() function for this task with a call like vf <- Vectorize(f), where f is the function to vectorize, and vf is the new, vectorized version of f. The example below does what we did for cdat with both a for loop and sapply(), but now does so with a vectorized version of mean().
vmean <- Vectorize(mean)
vmean(cdat)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
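Vectorize() is perhaps most useful for a function that genuinely handles only one value at a time. The sketch below uses a hypothetical function, sign_label(), written just for this illustration; it relies on if, which expects a single TRUE or FALSE:

```r
# A hypothetical function that handles one number at a time, because the
# condition in if () must be a single TRUE or FALSE
sign_label <- function(x) {
    if (x < 0) "negative" else "non-negative"
}

# The vectorized version applies sign_label() to each element in turn
vsign_label <- Vectorize(sign_label)
labels <- vsign_label(c(-2, 0, 3))
labels
```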
Now suppose you have a data frame d, which contains information from different samples representing different populations. You wish to apply a function f() to data stored in d$x, and d$y determines which sample each row of the data frame (and thus, each entry of d$x) came from. You want f() to be applied to the data in each sample, separately. You can do so with the aggregate() function in a call of the form aggregate(x ~ y, data = d, f). I illustrate with the iris dataset below.
# The structure of iris
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# The mean sepal length by species of iris
aggregate(Sepal.Length ~ Species, data = iris, mean)
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
# The five-number summary of sepal length for each species of iris
aggregate(Sepal.Length ~ Species, data = iris, quantile)
##      Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50% Sepal.Length.75% Sepal.Length.100%
## 1     setosa           4.300            4.800            5.000            5.200             5.800
## 2 versicolor           4.900            5.600            5.900            6.300             7.000
## 3  virginica           4.900            6.225            6.500            6.900             7.900
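One way to think about what aggregate() does is "split the data by group, then apply the function to each piece." The sketch below is not how aggregate() is implemented; it is only an equivalent computation built from split() and sapply(), and it should give the same group means as the aggregate() call with mean:

```r
# Split the sepal lengths into a list with one vector per species
by_species <- split(iris$Sepal.Length, iris$Species)

# Apply mean() to each piece; the result is a named vector
species_means <- sapply(by_species, mean)
species_means

# The same values computed by aggregate(), for comparison
agg <- aggregate(Sepal.Length ~ Species, data = iris, mean)
agg
```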
Let’s now consider matrices. Perhaps we have a matrix and we wish to apply a function across the rows of the matrix or the columns of the matrix. The apply() function allows us to do just that in a call of the form apply(mat, m, f), where mat is the matrix with data, f the function to apply, and m the margin to apply f() over. For matrices, a value of 1 for m will lead to the function being applied across rows, and a value of 2 across columns. I illustrate with a data set recording the ethnic makeup of selected Utah public schools (to see how this data set was created, view the source code of this document).
school_race_dat
##                  Entheos Academy Kearns Entheos Academy Magna
## Native American                       0                     0
## Asian                                 4                     5
## Black                                 1                     5
## Hispanic                            145                   201
## Pacific Islander                     15                     3
## White                               334                   273
## Multiple Race                        23                    15
##                  Jim Bridger School Sunset Ridge Middle Copper Hills High
## Native American                   4                   5                 9
## Asian                             6                  25                50
## Black                            12                  19                42
## Hispanic                        216                 322               551
## Pacific Islander                 12                  28                28
## White                           314                1124              1924
## Multiple Race                     7                  50               102
##                  Thomas Jefferson Jr High Kearns High
## Native American                        11          39
## Asian                                  13          49
## Black                                  17          53
## Hispanic                              260         937
## Pacific Islander                       42          99
## White                                 394        1138
## Multiple Race                           2          10
# Get row sums
apply(school_race_dat, 1, sum)
## Native American Asian Black Hispanic
## 68 152 149 2632
## Pacific Islander White Multiple Race
## 227 5501 209
# Column sums
apply(school_race_dat, 2, sum)
## Entheos Academy Kearns Entheos Academy Magna Jim Bridger School
## 522 502 571
## Sunset Ridge Middle Copper Hills High Thomas Jefferson Jr High
## 1573 2706 739
## Kearns High
## 2325
# Row sums and column sums are actually used frequently, so there are
# specialized functions for these
rowSums(school_race_dat)
## Native American Asian Black Hispanic
## 68 152 149 2632
## Pacific Islander White Multiple Race
## 227 5501 209
colSums(school_race_dat)
## Entheos Academy Kearns Entheos Academy Magna Jim Bridger School
## 522 502 571
## Sunset Ridge Middle Copper Hills High Thomas Jefferson Jr High
## 1573 2706 739
## Kearns High
## 2325
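The function passed to apply() does not have to be a built-in one such as sum(). The sketch below uses a small hypothetical matrix (m, with made-up row and column names) to show apply() with an anonymous function, and to check that apply(m, 1, sum) agrees with rowSums():

```r
# A small hypothetical matrix used only for illustration
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Margin 1: one result per row
row_totals <- apply(m, 1, sum)
row_totals

# Margin 2 with an anonymous function: convert each column to proportions
# of its own total, so every column of the result sums to 1
col_props <- apply(m, 2, function(col) col / sum(col))
col_props
```

Because the anonymous function returns a vector as long as the column it receives, the result here is a matrix with the same shape as m.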