Lecture 4
Working with Data Frames
Data frames are such a key tool for R users that packages are written solely for the accessing and manipulation of data in data frames. Thus they deserve more discussion.
Often we wish to work with multiple variables stored in a data frame, but while the $
notation is convenient, even it can grow tiresome with complicated computations. The function with()
can help simplify code. The first argument of with()
is a data frame, and the second argument is a command to evaluate.
d <- mtcars[1:10, ]
# We wish to know which cars have mpg within the first and third quartile.
# Here's a first approach that is slightly cumbersome
d[d$mpg > quantile(d$mpg, 0.25) & d$mpg < quantile(d$mpg), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# We can use the with function to clean things up
d[with(d, mpg > quantile(mpg, 0.25) & mpg < quantile(mpg)), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Often users don’t want all the data in a data frame, but only a subset of it. The which()
could be used to get the desired rows and a vector the desired columns, but this can quickly become cumbersome. Alternatively, use the subset()
function for this task. The data frame is the first argument passed to subset()
. Next, pass information to the subset
parameter to decide on what rows to include, or the select
parameter to choose the columns. Names of variables in the data frame can be used in subset()
like in with()
; you don’t need to use $
notation to choose the variable from within the data frame. Additionally, unlike when selecting with vectors, you can use :
to choose all columns between two names, not just numbers, and you can use -
in front of a vector of names to declare columns you don’t want.
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# Notice that I do not list the names as strings
subset(mtcars, select = c(mpg, cyl), subset = mpg > quantile(mpg, 0.9))
## mpg cyl
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Lotus Europa 30.4 4
# Other ways to select columns Using : on column names selects columns
# between the names on either side
subset(mtcars, select = hp:qsec, subset = !is.na(mpg) & mpg > quantile(mpg,
0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)
## hp drat wt qsec
## Hornet Sportabout 175 3.15 3.440 17.02
## Merc 450SE 180 3.07 4.070 17.40
## Merc 450SL 180 3.07 3.730 17.60
## Dodge Challenger 150 2.76 3.520 16.87
## Pontiac Firebird 175 3.08 3.845 17.05
## Ford Pantera L 264 4.22 3.170 14.50
# Using - on a vector of names selects all columns except those in a vector
subset(mtcars, select = -c(drat, wt, qsec), subset = !is.na(mpg) & mpg > quantile(mpg,
0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)
## mpg cyl disp hp vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 0 0 3 2
## Merc 450SE 16.4 8 275.8 180 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 0 0 3 3
## Dodge Challenger 15.5 8 318.0 150 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 0 1 5 4
# Here is the above without using subset; notice how complicated the command
# is
mtcars[!is.na(mtcars$mpg) & mtcars$mpg > quantile(mtcars$mpg, 0.25) & mtcars$mpg <
quantile(mtcars$mpg, 0.75) & mtcars$cyl == 8, !(names(mtcars) %in% c("drat",
"wt", "qsec"))]
## mpg cyl disp hp vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 0 0 3 2
## Merc 450SE 16.4 8 275.8 180 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 0 0 3 3
## Dodge Challenger 15.5 8 318.0 150 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 0 1 5 4
There are many other details about working with data frames that are common parts of an analysts workflow, such as reshaping a data frame (keeping the same information stored in a data frame but changing the data frame’s structure) and merging (combining information in two data frames). Read the textbook for more information and examples of these very important ideas. The entire process of bringing data into a workable format is called data cleaning, a significant and often underappreciated part of an analyst’s job.
Applying a Function Over a Collection
Often we wish to apply a function not to a single object or variable but instead a collection so we can get multiple values. For example, if we want all powers of two from one to ten, we could do so with the following:
2^1:10
## [1] 2 3 4 5 6 7 8 9 10
A similar idea is that we could take the square root of numbers between 0 and 1 with:
sqrt(seq(0, 1, by = 0.1))
## [1] 0.0000000 0.3162278 0.4472136 0.5477226 0.6324555 0.7071068 0.7745967
## [8] 0.8366600 0.8944272 0.9486833 1.0000000
It may not be this simple though. For example, suppose we have a data frame, which I construct below:
library(MASS)
cdat <- subset(Cars93, select = c(Min.Price, Price, Max.Price, MPG.city, MPG.highway,
EngineSize, Horsepower, RPM))
head(cdat)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1 12.9 15.9 18.8 25 31 1.8 140
## 2 29.2 33.9 38.7 18 25 3.2 200
## 3 25.9 29.1 32.3 20 26 2.8 172
## 4 30.8 37.7 44.6 19 26 2.8 172
## 5 23.7 30.0 36.2 22 30 3.5 208
## 6 14.2 15.7 17.3 22 31 2.2 110
## RPM
## 1 6300
## 2 5500
## 3 5500
## 4 5500
## 5 5700
## 6 5200
I want the mean of all the variables in cdat
. mean(cdat)
will not work; the mean()
function does not know how to handle the different variables in a data frame.
We may instead try a for
loop, like so:
# Make an empty vector
cdat_means <- c()
# This starts a for loop
for (vec in cdat) {
# For ever vector in cdat (called vec in the body of the loop), the code in
# the loop will be executed Compute the mean of vec, and add it to
# cdat_means
cdat_means <- c(cdat_means, mean(vec))
}
names(cdat_means) <- names(cdat)
cdat_means
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
A good R programmer will try to avoid for
loops as much as possible. One reason is that for
loops in R are slow, unlike in other languages. Since R is an interpreted language and also includes many features for interacting with R and writing code easier, R programs are going to be slower than in other languages. This is the price R pays for being interactive and much easier to write code for than compiled languages like C, C++, or Java. (A lot of R functions run fast because the function is actually an interface for a function written in C, C++, or FORTRAN.) Another reason R programmers avoid for
loops is that there is often an alternative not using a loop that easier to both write and understand.
How could we rewrite the above code without using for
? We could use the function sapply()
and the call sapply(v, f)
, where v
is either a vector or list with the items you wish to iterate over, and f
is a function to apply to each item. (Remember that a data frame is a list of vectors of equal length.) A vector is returned containing the result.
# A function to check if a number is even
even <- function(x) {
# If x is divisible by 2 (the remainder is 0 when x is divided by 2), x is
# even and the result is TRUE. Otherwise, the result is FALSE.
x%%2 == 0
}
# Which numbers between 1 and 10 are even?
sapply(1:10, even)
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
# The means of the vectors in cdat (remember that a data frame is a list of
# equal length vectors)
sapply(cdat, mean)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
# We can pass sapply an anonymous function, which is an unnamed function
# passed as an argument to some other function, used for some evaluation. I
# illustrate below by passing to sapply a function that computes the range
# of each of the variables in cdat.
sapply(cdat, function(vec) {
diff(range(vec))
})
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 38.7 54.5 72.1 31.0 30.0 4.7
## Horsepower RPM
## 245.0 2700.0
The lapply()
function works exactly like the sapply()
function, except lapply()
returns a list rather than a vector.
Alternatively, if we have a function f(x)
that knows how to work with an object x
, we could vectorize f
so it can work on a vector or list of objects like x
. We can use the Vectorize()
function for this task with a call like vf <- Vectorize(f)
, where f
is the function to vectorize, and vf
is the new, vectorized version of f
. The example below does what we did for cdat
with both a for
loop and sapply()
, but now does so with a vectorized version of mean()
.
vmean <- Vectorize(mean)
vmean(cdat)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
Now suppose you have a data frame d
, which contains information from different samples representing different populations. You wish to apply a function f()
to data stored in d$x
, and d$y
determines which sample each row of the data frame (and thus, each entry of d$x
) came from. You want f()
to be applied to the data in each sample, separately. You can do so with the aggregate()
function in a call of the form aggregate(x ~ y, data = d, f)
. I illustrate with the iris
dataset below.
# The struture of iris
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# The mean sepal length by species of iris
aggregate(Sepal.Length ~ Species, data = iris, mean)
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
# The five-number summary of sepal length for each species of iris
aggregate(Sepal.Length ~ Species, data = iris, quantile)
## Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50%
## 1 setosa 4.300 4.800 5.000
## 2 versicolor 4.900 5.600 5.900
## 3 virginica 4.900 6.225 6.500
## Sepal.Length.75% Sepal.Length.100%
## 1 5.200 5.800
## 2 6.300 7.000
## 3 6.900 7.900
Let’s now consider matrices. Perhaps we have a matrix and we wish to apply a function across the rows of the matrix or the columns of the matrix. The apply()
function allows us to do just that in a call of the form apply(mat, m, f)
, where mat
is the matrix with data, f
the function to apply, and m
the margin to apply f()
over. For matrices, a value of 1
for m
will lead to the function being applied across rows, and a value of 2
across columns. I illustrate with a data set recording the ethnicity of selected Utah publich schools (to see how this data set was created, view the source code of this document).
## Loading required package: methods
school_race_dat
## Entheos Academy Kearns Entheos Academy Magna
## Native American 0 0
## Asian 4 5
## Black 1 5
## Hispanic 145 201
## Pacific Islander 15 3
## White 334 273
## Multiple Race 23 15
## Jim Bridger School Sunset Ridge Middle Copper Hills High
## Native American 4 5 9
## Asian 6 25 50
## Black 12 19 42
## Hispanic 216 322 551
## Pacific Islander 12 28 28
## White 314 1124 1924
## Multiple Race 7 50 102
## Thomas Jefferson Jr High Kearns High
## Native American 11 39
## Asian 13 49
## Black 17 53
## Hispanic 260 937
## Pacific Islander 42 99
## White 394 1138
## Multiple Race 2 10
# Get row sums
apply(school_race_dat, 1, sum)
## Native American Asian Black Hispanic
## 68 152 149 2632
## Pacific Islander White Multiple Race
## 227 5501 209
# Column sums
apply(school_race_dat, 2, sum)
## Entheos Academy Kearns Entheos Academy Magna Jim Bridger School
## 522 502 571
## Sunset Ridge Middle Copper Hills High Thomas Jefferson Jr High
## 1573 2706 739
## Kearns High
## 2325
# Row sums and column sums are actually used frequently, so there are
# specialized functions for these
rowSums(school_race_dat)
## Native American Asian Black Hispanic
## 68 152 149 2632
## Pacific Islander White Multiple Race
## 227 5501 209
colSums(school_race_dat)
## Entheos Academy Kearns Entheos Academy Magna Jim Bridger School
## 522 502 571
## Sunset Ridge Middle Copper Hills High Thomas Jefferson Jr High
## 1573 2706 739
## Kearns High
## 2325
Using External Data
R would not be very useful if we had no way of loading in and saving data. R has means for reading data from spreadsheets such as .xls
or .xlsx
files made by Microsoft Excel. Functions for reading Excel files can be found in the xlsx or gdata packages.
Common plain-text formats for reading data include the comma-separated values format (.csv
), tab-separated values format (.tsv
), and the fixed-width format (.fwf
). These files can be read in using the read.csv()
, read.table()
, and the read.fwf()
functions (with read.csv()
being merely a front-end for read.table()
). All of these functions parse a plain-text data file and return a data frame with the contents. Keep in mind that R will guess what type of data is stored in the file. Usually it makes a good guess, but this is not guaranteed and you may need to do some more data cleaning or give R more instructions on how to interpret the file.
In order to load a file, you must specify the location of the file. If the file is on your hard drive, there are a few ways to do so:
You could use the
file.choose()
command to browse your system and locate the file. Once done, you will have a text string describing the location of the file on your system.Any R session has a working directory, which is where R looks first for files. You can see the current working directory with
getwd()
, and change the working directory withsetwd(path)
, wherepath
is a string for the location of the directory you wish to set as the new working directory.
Let’s assume we’re loading in a .csv
file (the approach is similar for other formats). The command df <- read.csv("myfile.csv")
instructs R to read myfile.csv
(which is presumably in the working directory, since we did not specify a full path; if it were not, we would either change the working directory or pass the full path to the function, which may look something like read.csv("C:/path/to/myfile.csv")
, or read.csv("/path/to/myfile.csv")
, depending on the system) and store the resulting data frame in df
. Once done, df
will now be ready for us to use.
Suppose that the data file is on the Internet. You can pass the url of the file to read.csv()
and R will read the file online and make it available to you in your session. I demonstrate below:
# Total Primary Energy Consumption by country and region, for years 1980
# through 2008; in Quadrillion Btu (CSV Version). Dataset from data.gov,
# from the Department of Energy's dataset on total primary energy
# consumption. Download and load in the dataset
energy <- read.csv("http://en.openei.org/doe-opendata/dataset/d9cd39c5-492e-4e82-8765-12e0657eeb4e/resource/3c42d852-567e-4dda-a39c-2bfadf309da5/download/totalprimaryenergyconsumption.csv",
stringsAsFactors = FALSE)
# R did not parse everything correctly; turn some variables numeric
energy[2:30] <- lapply(energy[2:30], as.numeric)
# We want energy data for North American countries, from 2000 to 2008
us_energy <- subset(energy, select = X2000:X2008, subset = Country %in% c("Canada",
"United States", "Mexico"))
us_energy
## X2000 X2001 X2002 X2003 X2004 X2005 X2006
## 2 13.07669 12.87847 13.10786 13.52061 13.83128 14.16374 13.81736
## 4 6.37958 6.32931 6.32936 6.50563 6.48998 6.80188 7.36271
## 6 99.25385 96.53415 98.03879 98.31384 100.49743 100.60722 99.90566
## X2007 X2008
## 2 14.07179 14.02923
## 4 7.27651 7.30898
## 6 101.67563 99.53011
Naturally you can export data frames into common formats as well. write.csv()
, write.table()
, and write.fwf()
will write data into comma-separated value, tab-separated value, and fixed width formats. Their syntax is similar. To save a .csv
file, issue the command write.csv(df, file = "myfile.csv")
, where df
is the data frame to save and file
where to save it, which could be just a file name (resulting in the file being saved in the working directory), or an absolute path.
my_data <- data.frame(var1 = 1:10, var2 = paste("word", 1:10))
write.csv(my_data, file="my_data.csv")
There are other formats R can read and write to. The foreign package allows R to read data files created for other statistical software packages such as SAS or Stata. The XML package allows R to read XML and HTML files. You can also read JSON files or data stored in Google Sheets. Refer to the textbook for more information.