Lecture 4

Working with Data Frames

Data frames are such a key tool for R users that packages are written solely for the accessing and manipulation of data in data frames. Thus they deserve more discussion.

Often we wish to work with multiple variables stored in a data frame, but while the $ notation is convenient, even it can grow tiresome with complicated computations. The function with() can help simplify code. The first argument of with() is a data frame, and the second argument is a command to evaluate.

d <- mtcars[1:10, ]
# We wish to know which cars have mpg within the first and third quartile.
# Here's a first approach that is slightly cumbersome
d[d$mpg > quantile(d$mpg, 0.25) & d$mpg < quantile(d$mpg), ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

# We can use the with function to clean things up
d[with(d, mpg > quantile(mpg, 0.25) & mpg < quantile(mpg)), ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Often users don’t want all the data in a data frame, but only a subset of it. The which() could be used to get the desired rows and a vector the desired columns, but this can quickly become cumbersome. Alternatively, use the subset() function for this task. The data frame is the first argument passed to subset(). Next, pass information to the subset parameter to decide on what rows to include, or the select parameter to choose the columns. Names of variables in the data frame can be used in subset() like in with(); you don’t need to use $ notation to choose the variable from within the data frame. Additionally, unlike when selecting with vectors, you can use : to choose all columns between two names, not just numbers, and you can use - in front of a vector of names to declare columns you don’t want.

names(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

# Notice that I do not list the names as strings
subset(mtcars, select = c(mpg, cyl), subset = mpg > quantile(mpg, 0.9))

##                 mpg cyl
## Fiat 128       32.4   4
## Honda Civic    30.4   4
## Toyota Corolla 33.9   4
## Lotus Europa   30.4   4

# Other ways to select columns Using : on column names selects columns
# between the names on either side
subset(mtcars, select = hp:qsec, subset = !is.na(mpg) & mpg > quantile(mpg, 
    0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)

##                    hp drat    wt  qsec
## Hornet Sportabout 175 3.15 3.440 17.02
## Merc 450SE        180 3.07 4.070 17.40
## Merc 450SL        180 3.07 3.730 17.60
## Dodge Challenger  150 2.76 3.520 16.87
## Pontiac Firebird  175 3.08 3.845 17.05
## Ford Pantera L    264 4.22 3.170 14.50

# Using - on a vector of names selects all columns except those in a vector
subset(mtcars, select = -c(drat, wt, qsec), subset = !is.na(mpg) & mpg > quantile(mpg, 
    0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)

##                    mpg cyl  disp  hp vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175  0  0    3    2
## Merc 450SE        16.4   8 275.8 180  0  0    3    3
## Merc 450SL        17.3   8 275.8 180  0  0    3    3
## Dodge Challenger  15.5   8 318.0 150  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175  0  0    3    2
## Ford Pantera L    15.8   8 351.0 264  0  1    5    4

# Here is the above without using subset; notice how complicated the command
# is
mtcars[!is.na(mtcars$mpg) & mtcars$mpg > quantile(mtcars$mpg, 0.25) & mtcars$mpg < 
    quantile(mtcars$mpg, 0.75) & mtcars$cyl == 8, !(names(mtcars) %in% c("drat", 
    "wt", "qsec"))]

##                    mpg cyl  disp  hp vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175  0  0    3    2
## Merc 450SE        16.4   8 275.8 180  0  0    3    3
## Merc 450SL        17.3   8 275.8 180  0  0    3    3
## Dodge Challenger  15.5   8 318.0 150  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175  0  0    3    2
## Ford Pantera L    15.8   8 351.0 264  0  1    5    4

There are many other details about working with data frames that are common parts of an analysts workflow, such as reshaping a data frame (keeping the same information stored in a data frame but changing the data frame’s structure) and merging (combining information in two data frames). Read the textbook for more information and examples of these very important ideas. The entire process of bringing data into a workable format is called data cleaning, a significant and often underappreciated part of an analyst’s job.

Applying a Function Over a Collection

Often we wish to apply a function not to a single object or variable but instead a collection so we can get multiple values. For example, if we want all powers of two from one to ten, we could do so with the following:

2^1:10

## [1]  2  3  4  5  6  7  8  9 10

A similar idea is that we could take the square root of numbers between 0 and 1 with:

sqrt(seq(0, 1, by = 0.1))

##  [1] 0.0000000 0.3162278 0.4472136 0.5477226 0.6324555 0.7071068 0.7745967
##  [8] 0.8366600 0.8944272 0.9486833 1.0000000

It may not be this simple though. For example, suppose we have a data frame, which I construct below:

library(MASS)
cdat <- subset(Cars93, select = c(Min.Price, Price, Max.Price, MPG.city, MPG.highway, 
    EngineSize, Horsepower, RPM))
head(cdat)

##   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1      12.9  15.9      18.8       25          31        1.8        140
## 2      29.2  33.9      38.7       18          25        3.2        200
## 3      25.9  29.1      32.3       20          26        2.8        172
## 4      30.8  37.7      44.6       19          26        2.8        172
## 5      23.7  30.0      36.2       22          30        3.5        208
## 6      14.2  15.7      17.3       22          31        2.2        110
##    RPM
## 1 6300
## 2 5500
## 3 5500
## 4 5500
## 5 5700
## 6 5200

I want the mean of all the variables in cdat. mean(cdat) will not work; the mean() function does not know how to handle the different variables in a data frame.

We may instead try a for loop, like so:

# Make an empty vector
cdat_means <- c()
# This starts a for loop
for (vec in cdat) {
    # For ever vector in cdat (called vec in the body of the loop), the code in
    # the loop will be executed Compute the mean of vec, and add it to
    # cdat_means
    cdat_means <- c(cdat_means, mean(vec))
}
names(cdat_means) <- names(cdat)
cdat_means

##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161

A good R programmer will try to avoid for loops as much as possible. One reason is that for loops in R are slow, unlike in other languages. Since R is an interpreted language and also includes many features for interacting with R and writing code easier, R programs are going to be slower than in other languages. This is the price R pays for being interactive and much easier to write code for than compiled languages like C, C++, or Java. (A lot of R functions run fast because the function is actually an interface for a function written in C, C++, or FORTRAN.) Another reason R programmers avoid for loops is that there is often an alternative not using a loop that easier to both write and understand.

How could we rewrite the above code without using for? We could use the function sapply() and the call sapply(v, f), where v is either a vector or list with the items you wish to iterate over, and f is a function to apply to each item. (Remember that a data frame is a list of vectors of equal length.) A vector is returned containing the result.

# A function to check if a number is even
even <- function(x) {
    # If x is divisible by 2 (the remainder is 0 when x is divided by 2), x is
    # even and the result is TRUE. Otherwise, the result is FALSE.
    x%%2 == 0
}

# Which numbers between 1 and 10 are even?
sapply(1:10, even)

##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

# The means of the vectors in cdat (remember that a data frame is a list of
# equal length vectors)
sapply(cdat, mean)

##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161

# We can pass sapply an anonymous function, which is an unnamed function
# passed as an argument to some other function, used for some evaluation. I
# illustrate below by passing to sapply a function that computes the range
# of each of the variables in cdat.
sapply(cdat, function(vec) {
    diff(range(vec))
})

##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##        38.7        54.5        72.1        31.0        30.0         4.7 
##  Horsepower         RPM 
##       245.0      2700.0

The lapply() function works exactly like the sapply() function, except lapply() returns a list rather than a vector.

Alternatively, if we have a function f(x) that knows how to work with an object x, we could vectorize f so it can work on a vector or list of objects like x. We can use the Vectorize() function for this task with a call like vf <- Vectorize(f), where f is the function to vectorize, and vf is the new, vectorized version of f. The example below does what we did for cdat with both a for loop and sapply(), but now does so with a vectorized version of mean().

vmean <- Vectorize(mean)
vmean(cdat)

##   Min.Price       Price   Max.Price    MPG.city MPG.highway  EngineSize 
##   17.125806   19.509677   21.898925   22.365591   29.086022    2.667742 
##  Horsepower         RPM 
##  143.827957 5280.645161

Now suppose you have a data frame d, which contains information from different samples representing different populations. You wish to apply a function f() to data stored in d$x, and d$y determines which sample each row of the data frame (and thus, each entry of d$x) came from. You want f() to be applied to the data in each sample, separately. You can do so with the aggregate() function in a call of the form aggregate(x ~ y, data = d, f). I illustrate with the iris dataset below.

# The struture of iris
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# The mean sepal length by species of iris
aggregate(Sepal.Length ~ Species, data = iris, mean)

##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588

# The five-number summary of sepal length for each species of iris
aggregate(Sepal.Length ~ Species, data = iris, quantile)

##      Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50%
## 1     setosa           4.300            4.800            5.000
## 2 versicolor           4.900            5.600            5.900
## 3  virginica           4.900            6.225            6.500
##   Sepal.Length.75% Sepal.Length.100%
## 1            5.200             5.800
## 2            6.300             7.000
## 3            6.900             7.900

Let’s now consider matrices. Perhaps we have a matrix and we wish to apply a function across the rows of the matrix or the columns of the matrix. The apply() function allows us to do just that in a call of the form apply(mat, m, f), where mat is the matrix with data, f the function to apply, and m the margin to apply f() over. For matrices, a value of 1 for m will lead to the function being applied across rows, and a value of 2 across columns. I illustrate with a data set recording the ethnicity of selected Utah publich schools (to see how this data set was created, view the source code of this document).

## Loading required package: methods

school_race_dat

##                  Entheos Academy Kearns Entheos Academy Magna
## Native American                       0                     0
## Asian                                 4                     5
## Black                                 1                     5
## Hispanic                            145                   201
## Pacific Islander                     15                     3
## White                               334                   273
## Multiple Race                        23                    15
##                  Jim Bridger School Sunset Ridge Middle Copper Hills High
## Native American                   4                   5                 9
## Asian                             6                  25                50
## Black                            12                  19                42
## Hispanic                        216                 322               551
## Pacific Islander                 12                  28                28
## White                           314                1124              1924
## Multiple Race                     7                  50               102
##                  Thomas Jefferson Jr High Kearns High
## Native American                        11          39
## Asian                                  13          49
## Black                                  17          53
## Hispanic                              260         937
## Pacific Islander                       42          99
## White                                 394        1138
## Multiple Race                           2          10

# Get row sums
apply(school_race_dat, 1, sum)

##  Native American            Asian            Black         Hispanic 
##               68              152              149             2632 
## Pacific Islander            White    Multiple Race 
##              227             5501              209

# Column sums
apply(school_race_dat, 2, sum)

##   Entheos Academy Kearns    Entheos Academy Magna       Jim Bridger School 
##                      522                      502                      571 
##      Sunset Ridge Middle        Copper Hills High Thomas Jefferson Jr High 
##                     1573                     2706                      739 
##              Kearns High 
##                     2325

# Row sums and column sums are actually used frequently, so there are
# specialized functions for these
rowSums(school_race_dat)

##  Native American            Asian            Black         Hispanic 
##               68              152              149             2632 
## Pacific Islander            White    Multiple Race 
##              227             5501              209

colSums(school_race_dat)

##   Entheos Academy Kearns    Entheos Academy Magna       Jim Bridger School 
##                      522                      502                      571 
##      Sunset Ridge Middle        Copper Hills High Thomas Jefferson Jr High 
##                     1573                     2706                      739 
##              Kearns High 
##                     2325

Using External Data

R would not be very useful if we had no way of loading in and saving data. R has means for reading data from spreadsheets such as .xls or .xlsx files made by Microsoft Excel. Functions for reading Excel files can be found in the xlsx or gdata packages.

Common plain-text formats for reading data include the comma-separated values format (.csv), tab-separated values format (.tsv), and the fixed-width format (.fwf). These files can be read in using the read.csv(), read.table(), and the read.fwf() functions (with read.csv() being merely a front-end for read.table()). All of these functions parse a plain-text data file and return a data frame with the contents. Keep in mind that R will guess what type of data is stored in the file. Usually it makes a good guess, but this is not guaranteed and you may need to do some more data cleaning or give R more instructions on how to interpret the file.

In order to load a file, you must specify the location of the file. If the file is on your hard drive, there are a few ways to do so:

You could use the file.choose() command to browse your system and locate the file. Once done, you will have a text string describing the location of the file on your system.
Any R session has a working directory, which is where R looks first for files. You can see the current working directory with getwd(), and change the working directory with setwd(path), where path is a string for the location of the directory you wish to set as the new working directory.

Let’s assume we’re loading in a .csv file (the approach is similar for other formats). The command df <- read.csv("myfile.csv") instructs R to read myfile.csv (which is presumably in the working directory, since we did not specify a full path; if it were not, we would either change the working directory or pass the full path to the function, which may look something like read.csv("C:/path/to/myfile.csv"), or read.csv("/path/to/myfile.csv"), depending on the system) and store the resulting data frame in df. Once done, df will now be ready for us to use.

Suppose that the data file is on the Internet. You can pass the url of the file to read.csv() and R will read the file online and make it available to you in your session. I demonstrate below:

# Total Primary Energy Consumption by country and region, for years 1980
# through 2008; in Quadrillion Btu (CSV Version). Dataset from data.gov,
# from the Department of Energy's dataset on total primary energy
# consumption.  Download and load in the dataset
energy <- read.csv("http://en.openei.org/doe-opendata/dataset/d9cd39c5-492e-4e82-8765-12e0657eeb4e/resource/3c42d852-567e-4dda-a39c-2bfadf309da5/download/totalprimaryenergyconsumption.csv", 
    stringsAsFactors = FALSE)
# R did not parse everything correctly; turn some variables numeric
energy[2:30] <- lapply(energy[2:30], as.numeric)
# We want energy data for North American countries, from 2000 to 2008
us_energy <- subset(energy, select = X2000:X2008, subset = Country %in% c("Canada", 
    "United States", "Mexico"))
us_energy

##      X2000    X2001    X2002    X2003     X2004     X2005    X2006
## 2 13.07669 12.87847 13.10786 13.52061  13.83128  14.16374 13.81736
## 4  6.37958  6.32931  6.32936  6.50563   6.48998   6.80188  7.36271
## 6 99.25385 96.53415 98.03879 98.31384 100.49743 100.60722 99.90566
##       X2007    X2008
## 2  14.07179 14.02923
## 4   7.27651  7.30898
## 6 101.67563 99.53011

Naturally you can export data frames into common formats as well. write.csv(), write.table(), and write.fwf() will write data into comma-separated value, tab-separated value, and fixed width formats. Their syntax is similar. To save a .csv file, issue the command write.csv(df, file = "myfile.csv"), where df is the data frame to save and file where to save it, which could be just a file name (resulting in the file being saved in the working directory), or an absolute path.

my_data <- data.frame(var1 = 1:10, var2 = paste("word", 1:10))
write.csv(my_data, file="my_data.csv")

There are other formats R can read and write to. The foreign package allows R to read data files created for other statistical software packages such as SAS or Stata. The XML package allows R to read XML and HTML files. You can also read JSON files or data stored in Google Sheets. Refer to the textbook for more information.