Data frames are such a key tool for R users that entire packages are written solely for accessing and manipulating the data in them, so they deserve further discussion.
Often we wish to work with several variables stored in a data frame. The $ notation is convenient, but even it can grow tiresome in complicated computations. The function with() can help simplify code. The first argument of with() is a data frame, and the second argument is an expression to evaluate; within that expression, the data frame’s columns can be referred to directly by name.
d <- mtcars[1:10, ]
# We wish to know which cars have mpg strictly between the first and third
# quartiles. Here's a first approach that is slightly cumbersome
d[d$mpg > quantile(d$mpg, 0.25) & d$mpg < quantile(d$mpg, 0.75), ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# We can use the with function to clean things up
d[with(d, mpg > quantile(mpg, 0.25) & mpg < quantile(mpg, 0.75)), ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
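Because with() simply returns the value of the expression it evaluates, it is not limited to building row filters. The following is a minimal sketch of that idea; the variable names ratio and in_middle are chosen here only for illustration and are not part of the original example.

```r
d <- mtcars[1:10, ]

# with() evaluates the expression using the columns of d as variables and
# returns the result; here, a numeric vector with one value per row of d
ratio <- with(d, mpg / wt)

# The same mechanism can produce a logical vector suitable for row selection
in_middle <- with(d, mpg > quantile(mpg, 0.25) & mpg < quantile(mpg, 0.75))

# Count how many of the ten cars fall strictly between the two quartiles
sum(in_middle)
```

Storing the logical vector under a name, as done here, can make a later call such as d[in_middle, ] easier to read than repeating the whole condition inside the brackets.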
Often users want only a subset of the data in a data frame rather than all of it. The which() function could be used to get the desired rows and a vector to pick the desired columns, but this can quickly become cumbersome. Alternatively, use the subset() function for this task. The data frame is the first argument passed to subset(). Next, pass a condition to the subset parameter to decide which rows to include, and/or a specification to the select parameter to choose the columns. Names of variables in the data frame can be used in subset() just as in with(); you don’t need to use $ notation to refer to a variable within the data frame. Additionally, unlike when selecting with vectors, you can use : between two column names (not just numbers) to choose all columns between them, and you can use - in front of a vector of names to declare columns you don’t want.
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# Notice that I do not list the names as strings
subset(mtcars, select = c(mpg, cyl), subset = mpg > quantile(mpg, 0.9))
##                 mpg cyl
## Fiat 128       32.4   4
## Honda Civic    30.4   4
## Toyota Corolla 33.9   4
## Lotus Europa   30.4   4
# Other ways to select columns: using : on column names selects all columns
# between (and including) the names on either side
subset(mtcars, select = hp:qsec,
       subset = !is.na(mpg) & mpg > quantile(mpg, 0.25) &
           mpg < quantile(mpg, 0.75) & cyl == 8)
##                    hp drat    wt  qsec
## Hornet Sportabout 175 3.15 3.440 17.02
## Merc 450SE        180 3.07 4.070 17.40
## Merc 450SL        180 3.07 3.730 17.60
## Dodge Challenger  150 2.76 3.520 16.87
## Pontiac Firebird  175 3.08 3.845 17.05
## Ford Pantera L    264 4.22 3.170 14.50
# Using - on a vector of names selects all columns except those in the vector
subset(mtcars, select = -c(drat, wt, qsec),
       subset = !is.na(mpg) & mpg > quantile(mpg, 0.25) &
           mpg < quantile(mpg, 0.75) & cyl == 8)
##                    mpg cyl  disp  hp vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175  0  0    3    2
## Merc 450SE        16.4   8 275.8 180  0  0    3    3
## Merc 450SL        17.3   8 275.8 180  0  0    3    3
## Dodge Challenger  15.5   8 318.0 150  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175  0  0    3    2
## Ford Pantera L    15.8   8 351.0 264  0  1    5    4
# Here is the above without using subset; notice how complicated the command
# is
mtcars[!is.na(mtcars$mpg) & mtcars$mpg > quantile(mtcars$mpg, 0.25) &
           mtcars$mpg < quantile(mtcars$mpg, 0.75) & mtcars$cyl == 8,
       !(names(mtcars) %in% c("drat", "wt", "qsec"))]
##                    mpg cyl  disp  hp vs am gear carb
## Hornet Sportabout 18.7   8 360.0 175  0  0    3    2
## Merc 450SE        16.4   8 275.8 180  0  0    3    3
## Merc 450SL        17.3   8 275.8 180  0  0    3    3
## Dodge Challenger  15.5   8 318.0 150  0  0    3    2
## Pontiac Firebird  19.2   8 400.0 175  0  0    3    2
## Ford Pantera L    15.8   8 351.0 264  0  1    5    4
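For comparison, the which() approach mentioned earlier might look like the following sketch. It is only one possible way to write it; the names rows and cols are illustrative, and the columns are listed by hand as a character vector rather than with hp:qsec.

```r
# Row positions of the 8-cylinder cars whose mpg lies strictly between the
# first and third quartiles; which() returns only the positions where the
# condition is TRUE, so comparisons that evaluate to NA are skipped
rows <- which(mtcars$mpg > quantile(mtcars$mpg, 0.25) &
                  mtcars$mpg < quantile(mtcars$mpg, 0.75) &
                  mtcars$cyl == 8)

# Columns to keep, given as a character vector of names
cols <- c("hp", "drat", "wt", "qsec")

mtcars[rows, cols]
```

This selects the same rows and columns as the subset() call with select = hp:qsec, but every column name must be spelled out and every variable prefixed with mtcars$.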
There are many other details about working with data frames that are common parts of an analyst’s workflow, such as reshaping a data frame (keeping the same information stored in a data frame but changing the data frame’s structure) and merging (combining information in two data frames). Read the textbook for more information and examples of these very important ideas. The entire process of bringing data into a workable format is called data cleaning, a significant and often underappreciated part of an analyst’s job.
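To give a rough feel for merging and reshaping, here is a minimal sketch using the base R functions merge() and reshape(). The data frames scores, groups, and wide, and all of their values, are made up purely for illustration; the textbook may well present these ideas with different tools.

```r
# Two small, made-up data frames that share an "id" column
scores <- data.frame(id = c(1, 2, 3), score = c(90, 85, 77))
groups <- data.frame(id = c(1, 2, 4), group = c("a", "b", "a"))

# Merging: keep the rows whose id appears in both data frames
merged <- merge(scores, groups, by = "id")

# Reshaping: start from a "wide" data frame with one column per time point
wide <- data.frame(id = c(1, 2), t1 = c(5.1, 4.8), t2 = c(5.6, 5.0))

# Convert to "long" form, with one row per combination of id and time;
# the information is the same, only the structure changes
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2"), v.names = "value",
                timevar = "time", times = 1:2, idvar = "id")
```

By default merge() keeps only ids present in both inputs, so ids 3 and 4 are dropped here; its all, all.x, and all.y arguments control whether unmatched rows are retained instead.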