Lecture 2
R Data Structures
As mentioned before, vectors are hardly the only data structure in R. There are other very important data structures R uses.
Lists
A list is a generalized vector in R. An R vector requires that all data stored in the vector be of the same type. A list has no such requirement. You can easily create lists with numbers, strings, vectors, functions, and other lists all in one object. Lists in R are created with the list() function, where the elements of the list are separated by commas (note that lists don’t flatten vectors like c() does; every item separated by a comma gets its own index in the list).
# Let's make a list of mixed type!
l1 <- list(1, "fraggle rock", c("henry", "margaret", "donna"), list(1:2, paste("Test",
1:10)))
l1
## [[1]]
## [1] 1
##
## [[2]]
## [1] "fraggle rock"
##
## [[3]]
## [1] "henry" "margaret" "donna"
##
## [[4]]
## [[4]][[1]]
## [1] 1 2
##
## [[4]][[2]]
## [1] "Test 1" "Test 2" "Test 3" "Test 4" "Test 5" "Test 6" "Test 7"
## [8] "Test 8" "Test 9" "Test 10"
# This list has no names for its elements; we could specify some using
# names()
names(l1) <- c("num", "char", "vec", "inner_list")
# We can also assign names when we create the list
l2 <- list(char = "monday", vec = c("and", "but", "or"))
l2
## $char
## [1] "monday"
##
## $vec
## [1] "and" "but" "or"
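The contrast between list() and c() described earlier can be sketched with a few small objects. The names flat, nested, mixed_vec, and mixed_list are made up for this illustration.

```r
# c() flattens its arguments into one vector
flat <- c(1, c(2, 3), 4)
length(flat)            # 4 elements

# list() keeps every comma-separated argument as a single element
nested <- list(1, c(2, 3), 4)
length(nested)          # 3 elements; the second is the whole vector c(2, 3)

# c() coerces mixed types to a common type, while list() leaves them alone
mixed_vec <- c(1, "a")
mixed_list <- list(1, "a")
class(mixed_vec)        # "character"
class(mixed_list[[1]])  # "numeric"
```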
How do we reference the objects stored in a list? We have a few options:
- If we wish that the object returned by the reference also be a list, we can use single-bracket notation like we did with vectors, like l1[x], where x is any means for selecting elements of the list (number, string, vector, boolean vector, etc.).
- If we want the object stored at x, we can use double-bracket notation, like l1[[x]], where x is either a single number or a single string (x cannot be a vector selecting several elements in this case). The difference between l1[x] and l1[[x]] may be subtle, but it’s very important. l1[x] is a list, and l1[[x]] is an object stored in a list. (This difference is also true for vectors; vec[x] is a vector, and vec[[x]] is an object stored in a vector. Rarely does this make a difference, but sometimes it does, like when the vector is a vector of functions.)
- If the elements of the list are named, instead of referencing them with l1[["x"]] (x is the name of the element), we can use $ notation, like l1$x. This is usually how named elements are referenced.
# This is a list
l1[1:3]
## $num
## [1] 1
##
## $char
## [1] "fraggle rock"
##
## $vec
## [1] "henry" "margaret" "donna"
is.list(l1[1:3])
## [1] TRUE
# This is the item stored in the third position of the list
l1[[3]]
## [1] "henry" "margaret" "donna"
# This is not a list
is.list(l1[[3]])
## [1] FALSE
# Notice the difference
l1[3]
## $vec
## [1] "henry" "margaret" "donna"
# We can also reference by name
l2["vec"]
## $vec
## [1] "and" "but" "or"
l2[["vec"]]
## [1] "and" "but" "or"
# An alternative way to reference the contents of an element by name
l2$vec
## [1] "and" "but" "or"
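One place where the single-bracket versus double-bracket distinction matters in practice is a list of functions. The sketch below uses a hypothetical list called funs; it also shows that a vector of length greater than one inside double brackets indexes recursively into nested lists rather than selecting several elements.

```r
# A hypothetical list of functions, used only for illustration
funs <- list(average = mean, total = sum)

# Double brackets (or $) return the function itself, which can be called
funs[["average"]](c(1, 2, 3, 6))  # 3
funs$total(c(1, 2, 3, 6))         # 12

# Single brackets return a list of length one, which cannot be called
piece <- funs["average"]
is.list(piece)      # TRUE
is.function(piece)  # FALSE

# A longer vector inside [[ ]] walks down into nested lists
deep_list <- list(a = list(b = "deep"))
deep_list[[c("a", "b")]]          # "deep"
```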
More complex objects in R are often simply lists with a specific structure, thus making lists very important.
Matrices
An R matrix is much like an R vector (in fact, internally they are the same, with matrices having additional attributes for dimension). A matrix is two-dimensional, with a row and column dimension. Like a vector, matrices only allow data of a single type. There are a few ways to make matrices in R:
- The rbind() function takes an arbitrary number of vectors as inputs (all of equal length), and creates a matrix where each input vector is a row of the matrix. cbind() is exactly like rbind() except that the vectors become columns rather than rows.
- The matrix() function takes a single vector input and turns that vector into a matrix. You can set either the nrow parameter or the ncol parameter to the number of rows or columns respectively that you desire your matrix to have (it is not necessary to specify both, though not illegal either so long as the product of the dimensions equals the length of the input vector). By default, R will fill the matrix by column; this means that it will fill the first column with the first contents of your input vector in sequence, then the next column with remaining elements, and so on until the matrix is filled and the contents of the input vector “exhausted.” Changing the byrow parameter to byrow = TRUE changes this behavior, and R will fill the matrix by rows rather than columns.
Both the rows and the columns of a matrix can be named, though you don’t use the names() function for seeing or changing these names. Instead, use the rownames() or colnames() function for accessing or modifying the row names and column names, respectively.
You can get the dimensions of a matrix with the dim() function. nrow() returns the number of rows of a matrix, and ncol() the number of columns. length() returns the number of elements in the matrix (so the product of the dimensions).
# Using rbind to make a matrix
mat1 <- rbind(c("jim bridger", "meadowbrook", "elwood"), c("copper hills", "kearns",
"west jordan"), c("university of utah", "byu", "westminster"), c("slcc",
"snow", "suu"))
# Likewise with cbind
mat2 <- cbind(c("jim bridger", "meadowbrook", "elwood"), c("copper hills", "kearns",
"west jordan"), c("university of utah", "byu", "westminster"), c("slcc",
"snow", "suu"))
mat1
## [,1] [,2] [,3]
## [1,] "jim bridger" "meadowbrook" "elwood"
## [2,] "copper hills" "kearns" "west jordan"
## [3,] "university of utah" "byu" "westminster"
## [4,] "slcc" "snow" "suu"
mat2
## [,1] [,2] [,3] [,4]
## [1,] "jim bridger" "copper hills" "university of utah" "slcc"
## [2,] "meadowbrook" "kearns" "byu" "snow"
## [3,] "elwood" "west jordan" "westminster" "suu"
dim(mat1) # The dimensions of mat1
## [1] 4 3
nrow(mat1) # The number of rows of mat1
## [1] 4
ncol(mat1) # The number of columns of mat1
## [1] 3
length(mat1) # The number of elements stored in mat1
## [1] 12
# Using matrix(); byrow = FALSE is the default, so mat3 and mat4 are
# identical
mat3 <- matrix(1:10, nrow = 2)
mat4 <- matrix(1:10, nrow = 2, byrow = FALSE)
mat3
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
mat4
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
# Naming matrix dimensions
rownames(mat3) <- c("odds", "evens")
colnames(mat3) <- c("first", "second", "third", "fourth", "fifth")
mat3
## first second third fourth fifth
## odds 1 3 5 7 9
## evens 2 4 6 8 10
# Internally, matrices are glorified vectors
as.vector(mat1)
## [1] "jim bridger" "copper hills" "university of utah"
## [4] "slcc" "meadowbrook" "kearns"
## [7] "byu" "snow" "elwood"
## [10] "west jordan" "westminster" "suu"
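The effect of the byrow parameter described earlier can be seen in a minimal sketch. The objects by_col and by_row are names chosen just for this illustration.

```r
by_col <- matrix(1:6, nrow = 2)                # default: fill by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill by row instead

by_col[1, ]  # 1 3 5
by_row[1, ]  # 1 2 3

# Filling by row gives the same matrix as filling a 3-row matrix by column
# and then transposing it
identical(by_row, t(matrix(1:6, nrow = 3)))    # TRUE
```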
To access the elements of the matrix, you could do so with mat[x], where x is a vector. This will treat the matrix mat like a vector. Sometimes this is the behavior you want, but most of the time you probably wish to access the data using the matrix’s rows and columns (otherwise you would have made a vector).
R uses the notation [,] for referencing elements in a matrix. Thus you can reference objects in a matrix with mat[x,y], where x is a vector specifying the desired rows, and y a vector specifying the desired columns. All the rules for referencing elements of a vector apply to x and y, with the additional rule that leaving a dimension blank will lead to everything in that dimension being included. Thus, mat[,y] results in a matrix with all the rows of mat and columns determined by y, and mat[x,] a matrix with all the columns of mat and rows determined by x.
# Get the (1,2) entry of mat1
mat1[1, 2]
## [1] "meadowbrook"
# The first row of mat1; notice that this is a vector
mat1[1, ]
## [1] "jim bridger" "meadowbrook" "elwood"
# The second column of mat1; notice that this is also a vector
mat1[, 2]
## [1] "meadowbrook" "kearns" "byu" "snow"
# We can preserve the matrix structure (in other words, not turn the result
# into a vector) by adding an additional comma and specifying the option
# drop=FALSE
mat1[1, , drop = FALSE]
## [,1] [,2] [,3]
## [1,] "jim bridger" "meadowbrook" "elwood"
mat1[, 2, drop = FALSE]
## [,1]
## [1,] "meadowbrook"
## [2,] "kearns"
## [3,] "byu"
## [4,] "snow"
# A small 2x3 submatrix of mat1
mat1[1:2, 1:3]
## [,1] [,2] [,3]
## [1,] "jim bridger" "meadowbrook" "elwood"
## [2,] "copper hills" "kearns" "west jordan"
# The third odd number in 1 to 10
mat3["odds", "third"]
## [1] 5
# The first and third even numbers in 1 to 10
mat3["evens", c("first", "third")]
## first third
## 2 6
Matrices generalize to arrays, which can have more than two dimensions. For example, if arr is a three-dimensional array, we may access an element in it with arr[1, 4, 3]. We will not discuss arrays any further than this.
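For concreteness, here is a minimal sketch of a three-dimensional array. The dimensions (2, 4, 3) are an arbitrary choice made so that the index arr[1, 4, 3] exists.

```r
# Fill a 2 x 4 x 3 array with the numbers 1 to 24 (first index varies fastest)
arr <- array(1:24, dim = c(2, 4, 3))
dim(arr)      # 2 4 3

arr[1, 4, 3]  # 23
arr[, , 1]    # the first 2 x 4 "slice", itself a matrix
```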
Data Frames
An R data frame stores data in a tabular format. Technically, a data frame is a list of vectors of equal length, so a data frame is a list. But since each “column” of the data frame has equal length, it also looks like a matrix where each column can differ in type (so one column could be numeric data, another character data, yet another factor data, etc.). Thus we can reference the data in a data frame like it is a list or like it is a matrix.
- The matrix style of referencing data frame data is like df[x,y], where x is the rows of the data frame and y the columns. All the rules for using this notation with matrices apply to data frames. The result is generally another data frame (selecting a single column this way returns a vector unless drop = FALSE is given).
- The list style for referencing a data frame references only the columns, not the rows. So df[x] will select the columns of df specified by x, and the result is another data frame. df[[x]] refers to the vector stored in column x of df; this is a vector, not a data frame. More commonly, though, we refer to a column of a data frame we want with the dollar notation; rather than use df[["x"]], we use df$x to get the column vector x in df.
To create a data frame, we have options:
- We could use the data.frame() function, where each vector passed will become a column in the data frame.
- We could use the as.data.frame() function on an object easily coerced into a data frame, like a matrix or a list.
Some examples are shown below.
# Making a data frame with data.frame
df1 <- data.frame(numbers = 1:5, letters = c("a", "b", "c", "d", "e"))
df1
## numbers letters
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
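# Notice that the character vector was automatically made a factor vector!
# (This is the default in R versions before 4.0.0; from 4.0.0 on,
# stringsAsFactors defaults to FALSE and the column stays character.)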
str(df1)
## 'data.frame': 5 obs. of 2 variables:
## $ numbers: int 1 2 3 4 5
## $ letters: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
colnames(mat2) <- c("elementary", "high school", "university", "local")
# Make a data frame out of a matrix. If we don't want to turn character
# strings into factors, set stringsAsFactors to FALSE (this also works in
# data.frame)
df2 <- as.data.frame(mat2, stringsAsFactors = FALSE)
df2
## elementary high school university local
## 1 jim bridger copper hills university of utah slcc
## 2 meadowbrook kearns byu snow
## 3 elwood west jordan westminster suu
str(df2)
## 'data.frame': 3 obs. of 4 variables:
## $ elementary : chr "jim bridger" "meadowbrook" "elwood"
## $ high school: chr "copper hills" "kearns" "west jordan"
## $ university : chr "university of utah" "byu" "westminster"
## $ local : chr "slcc" "snow" "suu"
newlist <- list(first = c("Tamara", "Danielle", "John", "Kent"), last = c("Garvey",
"Wu", "Godfrey", "Morgan"))
# Making a data frame from a list
df3 <- as.data.frame(newlist, stringsAsFactors = FALSE)
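The two referencing styles for data frames can be compared side by side. This is only a sketch: the data frame people and its age values are invented for the illustration.

```r
# A small made-up data frame
people <- data.frame(first = c("Tamara", "Danielle", "John"),
                     age = c(34, 29, 41),
                     stringsAsFactors = FALSE)

# Matrix style: rows and columns
people[2, "age"]           # a single value, 29
people[people$age > 30, ]  # a data frame with the rows where age exceeds 30

# List style: columns only
people["first"]            # single bracket: a one-column data frame
people[["first"]]          # double bracket: a character vector
people$first               # dollar notation: the same character vector
```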
Working with Data Frames
Data frames are such a key tool for R users that packages are written solely for the accessing and manipulation of data in data frames. Thus they deserve more discussion.
Often we wish to work with multiple variables stored in a data frame, but while the $ notation is convenient, even it can grow tiresome with complicated computations. The function with() can help simplify code. The first argument of with() is a data frame, and the second argument is a command to evaluate.
d <- mtcars[1:10, ]
# We wish to know which cars have mpg between the first and third quartiles.
# Here's a first approach that is slightly cumbersome
d[d$mpg > quantile(d$mpg, 0.25) & d$mpg < quantile(d$mpg, 0.75), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# We can use the with function to clean things up
d[with(d, mpg > quantile(mpg, 0.25) & mpg < quantile(mpg, 0.75)), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Often users don’t want all the data in a data frame, but only a subset of it. The which() function could be used to get the desired rows, and a vector to pick the desired columns, but this can quickly become cumbersome. Alternatively, use the subset() function for this task. The data frame is the first argument passed to subset(). Next, pass information to the subset parameter to decide on what rows to include, or the select parameter to choose the columns. Names of variables in the data frame can be used in subset() like in with(); you don’t need to use $ notation to choose the variable from within the data frame. Additionally, unlike when selecting with vectors, you can use : to choose all columns between two names, not just numbers, and you can use - in front of a vector of names to declare columns you don’t want.
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# Notice that I do not list the names as strings
subset(mtcars, select = c(mpg, cyl), subset = mpg > quantile(mpg, 0.9))
## mpg cyl
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Lotus Europa 30.4 4
# Other ways to select columns: using : on column names selects columns
# between the names on either side
subset(mtcars, select = hp:qsec, subset = !is.na(mpg) & mpg > quantile(mpg,
0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)
## hp drat wt qsec
## Hornet Sportabout 175 3.15 3.440 17.02
## Merc 450SE 180 3.07 4.070 17.40
## Merc 450SL 180 3.07 3.730 17.60
## Dodge Challenger 150 2.76 3.520 16.87
## Pontiac Firebird 175 3.08 3.845 17.05
## Ford Pantera L 264 4.22 3.170 14.50
# Using - on a vector of names selects all columns except those in a vector
subset(mtcars, select = -c(drat, wt, qsec), subset = !is.na(mpg) & mpg > quantile(mpg,
0.25) & mpg < quantile(mpg, 0.75) & cyl == 8)
## mpg cyl disp hp vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 0 0 3 2
## Merc 450SE 16.4 8 275.8 180 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 0 0 3 3
## Dodge Challenger 15.5 8 318.0 150 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 0 1 5 4
# Here is the above without using subset; notice how complicated the command
# is
mtcars[!is.na(mtcars$mpg) & mtcars$mpg > quantile(mtcars$mpg, 0.25) & mtcars$mpg <
quantile(mtcars$mpg, 0.75) & mtcars$cyl == 8, !(names(mtcars) %in% c("drat",
"wt", "qsec"))]
## mpg cyl disp hp vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 0 0 3 2
## Merc 450SE 16.4 8 275.8 180 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 0 0 3 3
## Dodge Challenger 15.5 8 318.0 150 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 0 1 5 4
There are many other details about working with data frames that are common parts of an analyst’s workflow, such as reshaping a data frame (keeping the same information stored in a data frame but changing the data frame’s structure) and merging (combining information in two data frames). Read the textbook for more information and examples of these very important ideas. The entire process of bringing data into a workable format is called data cleaning, a significant and often underappreciated part of an analyst’s job.
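As a small taste of merging, the sketch below uses the base R function merge() on two invented data frames, students and scores, that share an id column. It is only meant to show the idea; the textbook covers the details.

```r
# Two made-up data frames that share the key column "id"
students <- data.frame(id = c(1, 2, 3),
                       name = c("Ann", "Bo", "Cy"),
                       stringsAsFactors = FALSE)
scores <- data.frame(id = c(2, 3, 4), score = c(85, 77, 91))

# By default merge() keeps only the ids present in both data frames
both <- merge(students, scores, by = "id")
both$id        # 2 3

# all.x = TRUE keeps every row of the first data frame, filling gaps with NA
keep_students <- merge(students, scores, by = "id", all.x = TRUE)
keep_students$score  # NA 85 77
```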
Applying a Function Over a Collection
Often we wish to apply a function not to a single object or variable but instead to a collection so we can get multiple values. For example, if we want the first ten powers of two, we could get them with the following (the parentheses are needed because ^ is evaluated before :):
2^(1:10)
## [1] 2 4 8 16 32 64 128 256 512 1024
A similar idea is that we could take the square root of numbers between 0 and 1 with:
sqrt(seq(0, 1, by = 0.1))
## [1] 0.0000000 0.3162278 0.4472136 0.5477226 0.6324555 0.7071068 0.7745967
## [8] 0.8366600 0.8944272 0.9486833 1.0000000
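Operator precedence deserves some care in vectorized expressions like the powers of two above. As a brief sketch, ^ and unary minus both bind more tightly than the : operator, so the sequence has to be wrapped in parentheses when it is meant to be the exponent.

```r
2^1:10    # parsed as (2^1):10, so this is the sequence 2, 3, ..., 10
2^(1:10)  # the first ten powers of two: 2, 4, 8, ..., 1024

-2:2      # unary minus is applied first: -2 -1 0 1 2
-(2:4)    # negate the whole sequence instead: -2 -3 -4
```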
It may not be this simple though. For example, suppose we have a data frame, which I construct below:
library(MASS)
cdat <- subset(Cars93, select = c(Min.Price, Price, Max.Price, MPG.city, MPG.highway,
EngineSize, Horsepower, RPM))
head(cdat)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower
## 1 12.9 15.9 18.8 25 31 1.8 140
## 2 29.2 33.9 38.7 18 25 3.2 200
## 3 25.9 29.1 32.3 20 26 2.8 172
## 4 30.8 37.7 44.6 19 26 2.8 172
## 5 23.7 30.0 36.2 22 30 3.5 208
## 6 14.2 15.7 17.3 22 31 2.2 110
## RPM
## 1 6300
## 2 5500
## 3 5500
## 4 5500
## 5 5700
## 6 5200
I want the mean of all the variables in cdat. mean(cdat) will not work; the mean() function does not know how to handle the different variables in a data frame.
We may instead try a for loop, like so:
# Make an empty vector
cdat_means <- c()
# This starts a for loop
for (vec in cdat) {
    # For every vector in cdat (called vec in the body of the loop), the code
    # in the loop will be executed. Compute the mean of vec, and add it to
    # cdat_means
    cdat_means <- c(cdat_means, mean(vec))
}
names(cdat_means) <- names(cdat)
cdat_means
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
A good R programmer will try to avoid for loops as much as possible. One reason is that for loops in R are slow compared to loops in compiled languages. Since R is an interpreted language and also includes many features that make interacting with R and writing code easier, R programs are going to be slower than programs in those languages. This is the price R pays for being interactive and much easier to write code for than compiled languages like C, C++, or Java. (A lot of R functions run fast because the function is actually an interface for a function written in C, C++, or FORTRAN.) Another reason R programmers avoid for loops is that there is often an alternative not using a loop that is easier to both write and understand.
How could we rewrite the above code without using for? We could use the function sapply() and the call sapply(v, f), where v is either a vector or list with the items you wish to iterate over, and f is a function to apply to each item. (Remember that a data frame is a list of vectors of equal length.) A vector is returned containing the result.
# A function to check if a number is even
even <- function(x) {
# If x is divisible by 2 (the remainder is 0 when x is divided by 2), x is
# even and the result is TRUE. Otherwise, the result is FALSE.
x%%2 == 0
}
# Which numbers between 1 and 10 are even?
sapply(1:10, even)
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
# The means of the vectors in cdat (remember that a data frame is a list of
# equal length vectors)
sapply(cdat, mean)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
# We can pass sapply an anonymous function, which is an unnamed function
# passed as an argument to some other function, used for some evaluation. I
# illustrate below by passing to sapply a function that computes the range
# of each of the variables in cdat.
sapply(cdat, function(vec) {
diff(range(vec))
})
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 38.7 54.5 72.1 31.0 30.0 4.7
## Horsepower RPM
## 245.0 2700.0
The lapply() function works exactly like the sapply() function, except lapply() returns a list rather than a vector.
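The difference in return types is easy to see on a tiny example. The list nums below is invented for the illustration; the last two calls sketch a related detail, namely that sapply() returns a matrix when every result has the same length greater than one, and falls back to a list when the result lengths differ.

```r
nums <- list(a = 1:3, b = c(10, 20))

sapply(nums, sum)    # a named numeric vector: a = 6, b = 30
lapply(nums, sum)    # a named list holding the same two values

sapply(nums, range)  # every result has length 2, so we get a 2 x 2 matrix
sapply(nums, rev)    # results have different lengths, so we get a list
```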
Alternatively, if we have a function f(x) that knows how to work with an object x, we could vectorize f so it can work on a vector or list of objects like x. We can use the Vectorize() function for this task with a call like vf <- Vectorize(f), where f is the function to vectorize, and vf is the new, vectorized version of f. The example below does what we did for cdat with both a for loop and sapply(), but now does so with a vectorized version of mean().
vmean <- Vectorize(mean)
vmean(cdat)
## Min.Price Price Max.Price MPG.city MPG.highway EngineSize
## 17.125806 19.509677 21.898925 22.365591 29.086022 2.667742
## Horsepower RPM
## 143.827957 5280.645161
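Vectorize() is also handy for functions that only make sense for a single value at a time. The sketch below uses a hypothetical function describe(), whose if statement expects a single number, and wraps it so that it can be called on a whole vector.

```r
# A hypothetical function that handles one number at a time
describe <- function(x) {
    if (x %% 2 == 0) "even" else "odd"
}

# The vectorized version applies describe() to each element in turn
vdescribe <- Vectorize(describe)
vdescribe(1:4)  # "odd" "even" "odd" "even"
```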
Now suppose you have a data frame d, which contains information from different samples representing different populations. You wish to apply a function f() to data stored in d$x, and d$y determines which sample each row of the data frame (and thus, each entry of d$x) came from. You want f() to be applied to the data in each sample, separately. You can do so with the aggregate() function in a call of the form aggregate(x ~ y, data = d, f). I illustrate with the iris dataset below.
# The structure of iris
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# The mean sepal length by species of iris
aggregate(Sepal.Length ~ Species, data = iris, mean)
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
# The five-number summary of sepal length for each species of iris
aggregate(Sepal.Length ~ Species, data = iris, quantile)
## Species Sepal.Length.0% Sepal.Length.25% Sepal.Length.50%
## 1 setosa 4.300 4.800 5.000
## 2 versicolor 4.900 5.600 5.900
## 3 virginica 4.900 6.225 6.500
## Sepal.Length.75% Sepal.Length.100%
## 1 5.200 5.800
## 2 6.300 7.000
## 3 6.900 7.900
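The function passed to aggregate() does not need to be a built-in one. As a sketch, the made-up data frame d below has a measurement x and a sample label y, matching the notation used above, and an anonymous function computes the spread within each sample.

```r
# Made-up data: three measurements from sample "a" and two from sample "b"
d <- data.frame(x = c(1, 2, 3, 10, 20),
                y = c("a", "a", "a", "b", "b"))

# Mean of x within each sample: 2 for "a" and 15 for "b"
sample_means <- aggregate(x ~ y, data = d, mean)

# An anonymous function works too: largest minus smallest value per sample,
# which is 2 for "a" and 10 for "b"
sample_spread <- aggregate(x ~ y, data = d, function(v) max(v) - min(v))
```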
Let’s now consider matrices. Perhaps we have a matrix and we wish to apply a function across the rows of the matrix or the columns of the matrix. The apply() function allows us to do just that in a call of the form apply(mat, m, f), where mat is the matrix with data, f the function to apply, and m the margin to apply f() over. For matrices, a value of 1 for m will lead to the function being applied across rows, and a value of 2 across columns. I illustrate with a data set recording the ethnicity of students at selected Utah public schools (to see how this data set was created, view the source code of this document).
school_race_dat
## Entheos Academy Kearns Entheos Academy Magna
## Native American 0 0
## Asian 4 5
## Black 1 5
## Hispanic 145 201
## Pacific Islander 15 3
## White 334 273
## Multiple Race 23 15
## Jim Bridger School Sunset Ridge Middle Copper Hills High
## Native American 4 5 9
## Asian 6 25 50
## Black 12 19 42
## Hispanic 216 322 551
## Pacific Islander 12 28 28
## White 314 1124 1924
## Multiple Race 7 50 102
## Thomas Jefferson Jr High Kearns High
## Native American 11 39
## Asian 13 49
## Black 17 53
## Hispanic 260 937
## Pacific Islander 42 99
## White 394 1138
## Multiple Race 2 10
# Get row sums
apply(school_race_dat, 1, sum)
## Native American Asian Black Hispanic
## 68 152 149 2632
## Pacific Islander White Multiple Race
## 227 5501 209
# Column sums
apply(school_race_dat, 2, sum)
## Entheos Academy Kearns Entheos Academy Magna Jim Bridger School
## 522 502 571
## Sunset Ridge Middle Copper Hills High Thomas Jefferson Jr High
## 1573 2706 739
## Kearns High
## 2325
# Row sums and column sums are actually used frequently, so there are
# specialized functions for these
rowSums(school_race_dat)
## Native American Asian Black Hispanic
## 68 152 149 2632
## Pacific Islander White Multiple Race
## 227 5501 209
colSums(school_race_dat)
## Entheos Academy Kearns Entheos Academy Magna Jim Bridger School
## 522 502 571
## Sunset Ridge Middle Copper Hills High Thomas Jefferson Jr High
## 1573 2706 739
## Kearns High
## 2325
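apply() is not limited to sums. The sketch below uses a small made-up matrix m so the results are easy to check by hand. Note that when the applied function returns a vector, apply() stores each result as a column, so row-wise results come back transposed.

```r
# A made-up 2 x 3 matrix: row 1 is (1, 3, 2) and row 2 is (5, 4, 6)
m <- matrix(c(1, 5, 3, 4, 2, 6), nrow = 2)

apply(m, 1, max)  # largest value in each row: 3 6
apply(m, 2, max)  # largest value in each column: 5 4 6

# Share of each row's total; the result has one column per row of m
row_share <- apply(m, 1, function(r) r / sum(r))
dim(row_share)    # 3 2
```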
Using External Data
R would not be very useful if we had no way of loading in and saving data. R has means for reading data from spreadsheets such as .xls or .xlsx files made by Microsoft Excel. Functions for reading Excel files can be found in the xlsx or gdata packages.
Common plain-text formats for reading data include the comma-separated values format (.csv), tab-separated values format (.tsv), and the fixed-width format (.fwf). These files can be read in using the read.csv(), read.table(), and read.fwf() functions (with read.csv() being merely a front-end for read.table()). All of these functions parse a plain-text data file and return a data frame with the contents. Keep in mind that R will guess what type of data is stored in the file. Usually it makes a good guess, but this is not guaranteed and you may need to do some more data cleaning or give R more instructions on how to interpret the file.
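One way to give R more instructions is the colClasses argument, which states the type of each column up front. The sketch below reads from a small made-up string through the text argument instead of a file, so it can be run anywhere; the column names id, zip, and score are invented for the illustration.

```r
# Made-up CSV content; the zip column has a leading zero worth keeping
csv_text <- "id,zip,score
1,02139,3.5
2,84112,4.0"

# Letting R guess: zip is read as an integer and the leading zero is lost
guessed <- read.csv(text = csv_text)
class(guessed$zip)  # "integer"

# Telling R the column types keeps zip as text
typed <- read.csv(text = csv_text,
                  colClasses = c("integer", "character", "numeric"))
typed$zip           # "02139" "84112"
```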
In order to load a file, you must specify the location of the file. If the file is on your hard drive, there are a few ways to do so:
- You could use the file.choose() command to browse your system and locate the file. Once done, you will have a text string describing the location of the file on your system.
- Any R session has a working directory, which is where R looks first for files. You can see the current working directory with getwd(), and change the working directory with setwd(path), where path is a string for the location of the directory you wish to set as the new working directory.
Let’s assume we’re loading in a .csv file (the approach is similar for other formats). The command df <- read.csv("myfile.csv") instructs R to read myfile.csv and store the resulting data frame in df. Since we did not specify a full path, the file is presumably in the working directory; if it were not, we would either change the working directory or pass the full path to the function, which may look something like read.csv("C:/path/to/myfile.csv") or read.csv("/path/to/myfile.csv"), depending on the system. Once done, df will be ready for us to use.
Suppose that the data file is on the Internet. You can pass the URL of the file to read.csv() and R will read the file online and make it available to you in your session. I demonstrate below:
# Total Primary Energy Consumption by country and region, for years 1980
# through 2008; in Quadrillion Btu (CSV Version). Dataset from data.gov,
# from the Department of Energy's dataset on total primary energy
# consumption. Download and load in the dataset
energy <- read.csv("http://en.openei.org/doe-opendata/dataset/d9cd39c5-492e-4e82-8765-12e0657eeb4e/resource/3c42d852-567e-4dda-a39c-2bfadf309da5/download/totalprimaryenergyconsumption.csv",
stringsAsFactors = FALSE)
# R did not parse everything correctly; turn some variables numeric
energy[2:30] <- lapply(energy[2:30], as.numeric)
# We want energy data for North American countries, from 2000 to 2008
us_energy <- subset(energy, select = X2000:X2008, subset = Country %in% c("Canada",
"United States", "Mexico"))
us_energy
## X2000 X2001 X2002 X2003 X2004 X2005 X2006
## 2 13.07669 12.87847 13.10786 13.52061 13.83128 14.16374 13.81736
## 4 6.37958 6.32931 6.32936 6.50563 6.48998 6.80188 7.36271
## 6 99.25385 96.53415 98.03879 98.31384 100.49743 100.60722 99.90566
## X2007 X2008
## 2 14.07179 14.02923
## 4 7.27651 7.30898
## 6 101.67563 99.53011
Naturally you can export data frames into common formats as well. write.csv(), write.table(), and write.fwf() will write data into comma-separated value, tab-separated value (for write.table(), set sep = "\t", since its default separator is a space), and fixed-width formats, respectively. write.csv() and write.table() are part of base R, while write.fwf() comes from the gdata package. Their syntax is similar. To save a .csv file, issue the command write.csv(df, file = "myfile.csv"), where df is the data frame to save and file is where to save it, which could be just a file name (resulting in the file being saved in the working directory), or an absolute path.
my_data <- data.frame(var1 = 1:10, var2 = paste("word", 1:10))
write.csv(my_data, file="my_data.csv")
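A quick way to check an export is to read the file back in. The sketch below writes to temporary files created with tempfile() so that nothing is left in the working directory; the names small_data, with_names, without_names, and round_trip are chosen for this illustration. It also shows that write.csv() stores the row names as an extra column unless row.names = FALSE is given.

```r
small_data <- data.frame(var1 = 1:3, var2 = paste("word", 1:3),
                         stringsAsFactors = FALSE)

# Default behavior: row names are written as an unnamed first column,
# which read.csv() then labels "X"
with_names <- tempfile(fileext = ".csv")
write.csv(small_data, file = with_names)
names(read.csv(with_names))  # "X" "var1" "var2"

# With row.names = FALSE the file contains only the actual variables
without_names <- tempfile(fileext = ".csv")
write.csv(small_data, file = without_names, row.names = FALSE)
round_trip <- read.csv(without_names, stringsAsFactors = FALSE)
all.equal(round_trip, small_data)  # TRUE
```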
There are other formats R can read and write to. The foreign package allows R to read data files created for other statistical software packages such as SAS or Stata. The XML package allows R to read XML and HTML files. You can also read JSON files or data stored in Google Sheets. Refer to the textbook for more information.