Lecture 2

Univariate Data Analysis

Vectors

Univariate data is usually stored as vectors. The most basic function for creating a vector is the c() function, where the arguments of the function (separated by commas) become the contents of the vector. It generally doesn’t matter what the type of data the contents of the vector are so long as they are all the same.

# A numeric vector of fictitious data
num_vec <- c(10, 13, -1, 0.02, 0, -3.31)
# A vector of character data
char_vec <- c("joe", "phil", "jan", "denise", "tom")
# A vector of boolean values
bool_vec <- c(TRUE, TRUE, FALSE)
# A vector of functions? WHAT IS THIS MADNESS?
func_vec <- c(mean, sum, sd)
# But this does not create a vector of vectors; all vectors are flattened into one vector (it is possible to make a vector of vectors, but it's tricky)
not_a_vec_of_vecs <- c(c(1,2,3),c(4,54),c(10,2,-6))

# If I want to see the contents of a vector, just type its name into the interpreter
not_a_vec_of_vecs
## [1]  1  2  3  4 54 10  2 -6

When numbers, strings, boolean values, and other similar objects are assigned to a variable, that variable is interpreted as being a vector.

im_a_vector <- 5
is.vector(im_a_vector)
## [1] TRUE

To access the contents of a vector, you can use [] notation, like vec[x]. x identifies the elements of the vector you want. This is called indexing.

Some notes about x:

  • Items in a vector can always be found via an integer. vec[1] will get the first element of the vector, and vec[5] the fifth. Also, instead of specifying what indices you do want, you can specify the indices you don’t want with negative integers. For example, vec[-1] is vec with all except for vec[1]. vec[0] is an empty vector that is of the same type as vec.
  • Some vectors have named elements. In that case, you can access elements of the vector by name, using a character string. For example, if cities is a vector of city populations which is indexed with city names, cities["Salt Lake City"] will find the population of Salt Lake City in the vector. If you want to see the names of the elements, use the names() function. (Not surprisingly, the object returned by names() is also a vector, specifically a character vector.)
  • x can be another vector, and thus you can index a vector with another vector so long as the values contained in x are a valid means of indexing. We could index a vector with vec[c(1,2,3)] or cities[c("Salt Lake City", "Provo")].
  • x could be a vector of booleans. Every TRUE in x will lead to the element in vec corresponding to where the TRUE is located in x will be included, and every element in vec where x is FALSE will not be included. (While this implies that x must be the same length as vec, this is not necessarily true; if x is shorter, the contents of x will be recycled until R has made a decition for each element of vec whether to include it or not. I discuss recycling more later.)
  • vec[] will return the entire vector vec. This is sometimes useful.
# You can assign names to elements using name = value notation in c()
friendly_vector <- c(jon = 1, tony = 4, janet = 5)
names(friendly_vector)
## [1] "jon"   "tony"  "janet"
friendly_vector[1]
## jon 
##   1
friendly_vector[c(2,3)]
##  tony janet 
##     4     5
friendly_vector[-2]
##   jon janet 
##     1     5
friendly_vector["tony"]
## tony 
##    4
friendly_vector[c("jon", "tony")]
##  jon tony 
##    1    4
friendly_vector[c(TRUE, TRUE, FALSE)]
##  jon tony 
##    1    4
# You can rename elements in a vector like so:
names(friendly_vector) <- c("jack", "jill", "dick")
friendly_vector["jack"]
## jack 
##    1

You can also use indexing to change the values of a vector, or even expand it. vec[x] <- val will replace the contents of vec at x with val if val is the same type as the rest of the data in vec and x is any valid means of indexing vec. x could consist of indices that do not already exist in vec, in which case those indices will be added to vec, thus expanding it. If x consists of integer indices outside the range of existing integer indices, these other indices will be created as well and filled with NA’s.

You can also delete values from the vector with vec[x] <- NULL

# friendly_vector was defined in an earlier code block
# Change existing values
friendly_vector["jack"] <- 16
friendly_vector[2] <- 19
friendly_vector
## jack jill dick 
##   16   19    5
# Adding new values
friendly_vector[4] <- 21
friendly_vector
## jack jill dick      
##   16   19    5   21
friendly_vector["danielle"] <- 0
friendly_vector
##     jack     jill     dick          danielle 
##       16       19        5       21        0
friendly_vector[c(6, 7, 8)] <- 12
friendly_vector
##     jack     jill     dick          danielle                            
##       16       19        5       21        0       12       12       12
friendly_vector[20] <- 18

While vectors must contain data of one type, there is a special type that can be included in any vector: NA. This value represents “missing information”. NA isn’t like other values and needs to be handled with care. The function is.na() identifies these values in vectors.

not_finished <- c(1, 4, 5, NA, 2, 2)
not_finished
## [1]  1  4  5 NA  2  2
# If I want to access the non-NA parts of the vector, I can do so like this
not_finished[!is.na(not_finished)]
## [1] 1 4 5 2 2

There are other special types in R resembling but dfferent from NA. NULL is a lot like NA but usually means that something in R is unavailable (whereas NA is more akin to missing data in a dataset). Inf and -Inf are special values denoting infinity and negative infinity respectively. These are, in some sense, numeric, and represent values that are very large (or very small, in the case of -Inf), and can occur when dividing non-zero numbers by zero. NaN effectively means “not a number”, and occurs when some numerical error occurs, like dividing zero by zero. Again, these cases must be handled with care, and there are special functions like is.na() for detecting them in vectors.

There are other ways to create vectors other than with the c() function. Some common ways are listed below:

  • The : constructor can be used to create sequences. For example, 1:10 will create a vector of numbers from 1 to 10, incrementing by 1. 10:0 creates a vector starting at 10 and ending at 0, decrementing by 1. You can also replace the endpoints with variables, or expressions in parentheses, like 1:n or 1:(2 + 2).
  • Sometimes you may want to create a sequence but want more control over incrementation, or how many elements in the vector you want. In this case, use the seq() function. You can make a sequence of numbers incrementing by 2 going from 1 to 100 with seq(1, 100, by = 2), or a sequence of numbers between 0 and 1 with length 100 with seq(0, 1, length = 100). (See help("seq") to see all the many ways to create sequences with seq().)
  • The rep() function can make vectors with repeating elements. Let’s say I want to repeat the character values “"a", "b", and "c" three times total. I can do so with rep(c("a", "b", "c"), times = 3). Alternatively, if I wanted to repeat these values each three times, I would do so with rep(c("a", "b", "c"), each = 3). (See help("rep") to see all the many ways to create sequences with rep().)
  • Sometimes I want to create character vectors where I have pasted together strings from separate vectors. For example, if I want a character vector containing the names of one hundred treatments, it may be tedious to type c("Treatment 1", "Treatment 2", ..., "Treatment 100"). The paste() function makes this much easier. I can create such a vector with paste("Treatment", 1:100); each of the elements from both vectors will be pasted together into a new vector. By default, these elements will be separated with a space character, but I can change this behavior by specifying the sep parameter in paste(). For example, if I want "Treatment_1" rather than "Treatment 1", I can do so with paste("Treatment", 1:100, sep = "_").

I demonstrate these techniques below.

# Create a vector of numbers from 1 to 10
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# In reverse
10:1
##  [1] 10  9  8  7  6  5  4  3  2  1
# Getting creative
1:(2 + 2)
## [1] 1 2 3 4
# Another way to make a sequence
seq(-1, 1, by = 0.5)
## [1] -1.0 -0.5  0.0  0.5  1.0
# Yet another way to make a sequence
seq(0, 20, length = 3)
## [1]  0 10 20
# A repeating sequence of letters
rep(c("a", "b", "c"), times = 3)
## [1] "a" "b" "c" "a" "b" "c" "a" "b" "c"
# A sequence of repeating letters
rep(c("a", "b", "c"), each = 3)
## [1] "a" "a" "a" "b" "b" "b" "c" "c" "c"
# A quick way to make a character vector
paste("Treatment", 1:5)
## [1] "Treatment 1" "Treatment 2" "Treatment 3" "Treatment 4" "Treatment 5"

You can do mathematical operations with vectors, such as +, -, *, /, ^, and others. Operations are often applied component-wise, using R’s recycling behavior. Recycling occurs when two vectors are not of equal length. Let’s say you have a vector vec1 that is longer than vec2, and you do an operation such as vec1 + vec2. Let’s say length(vec1) == 12 and length(vec2) == 4. At first, the resulting vector will add, component-wise, each element from each vector; think vec1[1] + vec2[1], vec1[2] + vec2[2], vec1[3] + vec2[3], and vec1[4] + vec2[4]. After the fourth index, though, there are no more elements in vec2, so R will start from the beginning of vec2 and keep going, resulting in vec1[5] + vec2[1], vec1[6] + vec2[2], and so on. R will continue to do this until it has reached the end of vec1. You would get the same result for vec2 + vec1; it doesn’t matter which side of + has the longer vector. If the longer vector is not a multiple of the shorter vector, though, R will throw a warning message.

vec1 <- c(0, 0, 10, -1, 5, 6)
# R treats "1" as a vector of length 1, and thus recycles
vec1 + 1
## [1]  1  1 11  0  6  7
vec1 * 2
## [1]  0  0 20 -2 10 12
vec1 ^ 2
## [1]   0   0 100   1  25  36
vec2 <- 1:2
# Another case of recycling behavior
vec1 + vec2
## [1]  1  2 11  1  6  8
# Same result if I switch the order
vec2 + vec1
## [1]  1  2 11  1  6  8
# R will complain if the length of the longer vector is not a multiple of the length of the shorter object, though it will still produce a result
vec1 / 1:4
## Warning in vec1/1:4: longer object length is not a multiple of shorter
## object length
## [1]  0.000000  0.000000  3.333333 -0.250000  5.000000  3.000000
vec1 ^ vec2
## [1]  0  0 10  1  5 36
# Does NOT get same result because ^ is not commutative
vec2 ^ vec1
## [1]  1.0  1.0  1.0  0.5  1.0 64.0

Of course, when working with vectors of the same length, recycling isn’t a concern.

# These vectors are the same length, so no need to worry about recycling
vec1 <- 1:3
vec2 <- 10:12
vec1 + vec2
## [1] 11 13 15
vec1 * vec2
## [1] 10 22 36
vec1 ^ vec2
## [1]      1   2048 531441
# That said, it's not hard to guess what will happen when one of the objects is length one (like a vector)
(1:10) ^ 2 # First ten squares
##  [1]   1   4   9  16  25  36  49  64  81 100
2 ^ (1:10) # First ten powers of two
##  [1]    2    4    8   16   32   64  128  256  512 1024

Recycling occurs elsewhere in R. We saw recycling behavior when indexing a vector with a vector of booleans earlier. There are other instances in R as well.

Some functions are vector-valued. For example, when passed a vector, sqrt() will take the square root of every element in the vector.

vec <- 1:5
sqrt(vec)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
exp(vec)
## [1]   2.718282   7.389056  20.085537  54.598150 148.413159
vec[3] <- NA
# Creates a vector of booleans
is.na(vec)
## [1] FALSE FALSE  TRUE FALSE FALSE

Another important set of operators are logical operators, which return boolean data. Such operators include:

  • ==: This detects equality (notice that this is not =, which is an assignment operator). If or when the object on the left equals the object on the right, the result will be TRUE; otherwise, it will be FALSE. For vectors, this does not return whether the two vectors are identical, but when one vector is equal to the other component-wise.
  • < and >: These detects “less” or “greater than”, like in mathematics. This is intended for numeric data, but can be used for other types of data as well (although rarely, and probably not well).
  • <= and >=: These detect “less than or equal to” or “greater than or equal to”.
  • !=: This detects “not equal to”, and is the opposite of ==.
  • &: This is logical “and”, and is true when the boolean or statment on the left is true and the boolean or statement on the right is true. Thus, (1 == 1) & (2 >= 1) == TRUE and (1 == 2) & (2 >= 1) == FALSE.
  • |: This is logical “or”, and is true when the boolean or statement on the left is true or the boolean or statement on the right is true. Thus, (1 == 1) | (2 >= 1) == TRUE and (1 == 2) | (2 >= 1) == TRUE.
  • !: This is logical “not”, negating any truth statement. So !TRUE == FALSE and !(1 == 2) == TRUE.
  • %in%: This logical operator is unique compared to the others considered here, since this is actually a function. The argument on the right-hand side of this operator must be a vector, and this operator whether elements on the left-hand side are in the vector on the right-hand side (in logic, this is \(x \in S\) where \(S\) is a set). Thus, 3 %in% c(1,2,3) == TRUE and 3 %in% c(1, 2) == FALSE.

Like other operators, these utilize recycling and may return vectors. Examples are shown below.

vec1 <- c(1, 4, 21, 22, -5)
vec2 <- c(1, 2, 3, 4, -5)
# True only in the first and last positions
vec1 == vec2
## [1]  TRUE FALSE FALSE FALSE  TRUE
vec1 < 4  # True for first and last elements
## [1]  TRUE FALSE FALSE FALSE  TRUE
vec2 <= 4  # True in first, second, and last elements
## [1] TRUE TRUE TRUE TRUE TRUE
!(vec1 <= 4)  # Inverse of above
## [1] FALSE FALSE  TRUE  TRUE FALSE
# %% is the modulus operator, returning the remainder when the number on the
# left is divided by the number on the right. vec1 %% 2 == 0 is a way to
# detect even numbers (since the remainder when dividing an even number by 2
# must be zero).
(vec1 > 20) & !(vec1%%2 == 0)  # Only true in third position
## [1] FALSE FALSE  TRUE FALSE FALSE
(vec1 > 20) | !(vec1%%2 == 0)  # Not true in second or fourth
## [1]  TRUE FALSE  TRUE  TRUE  TRUE
1:4 %in% vec1  # One and four are in vec1
## [1]  TRUE FALSE FALSE  TRUE
# Some useful functions are the any() and all() functions, which take
# boolean vectors as arguments and return whether anywhere the vector is
# true or whether the vector is true everywhere, respectively. In other
# words, any() will 'or' all elements of the vector, and all() will 'and'
# all elements in the vector.
any(1:4 %in% vec1)  # Is there a number between 1 and 4 in vec1?
## [1] TRUE
all(1:4 %in% vec1)  # Are all numbers between 1 and 4 in vec1?
## [1] FALSE

Many other functions return logical data, notably the is.object() family of functions like is.vector() or is.na().

While a vector of booleans for when a condition is true is nice, sometimes you may not want whether a statement is true for each element of a vector, but for which elements of a vector a statement is true. In this case, the which() function will tell you the indices of where an input vector is TRUE.

which(c(TRUE, FALSE, FALSE, TRUE, FALSE))
## [1] 1 4
which(15:25 > 20)
## [1]  7  8  9 10 11
# I'm going to create a dataset with NA's, and use which() to find the NA's
# and also the 'good' data
data_vec <- 1:100
data_vec[c(21, 33, 49, 61, 62)] <- NA
which(is.na(data_vec))  # Where are the NA's
## [1] 21 33 49 61 62
which(!is.na(data_vec))  # Where is the non-NA data?
##  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
## [18]  18  19  20  22  23  24  25  26  27  28  29  30  31  32  34  35  36
## [35]  37  38  39  40  41  42  43  44  45  46  47  48  50  51  52  53  54
## [52]  55  56  57  58  59  60  63  64  65  66  67  68  69  70  71  72  73
## [69]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
## [86]  91  92  93  94  95  96  97  98  99 100
# How much data is NA?
length(which(is.na(data_vec)))
## [1] 5
# Alternatively, I can sum a boolean vector to find the same answer (FALSE
# is treated as 0 and TRUE as 1)
sum(is.na(data_vec))
## [1] 5

A special type of vector not discussed earlier is a factor vector, which is similar to a character vector but requires that each value of the vector be in a list of levels, which describe the values a factor vector can take. (R also views vactors differently internally than character vectors.) Factor vectors are thus used for storing categorical data. You can create a factor with the factor() function.

When manipulating factor data, there is an additional restriction to the ones discussed already: you cannot add a value that is not in the levels of the factor (aside from NA). You would have to add the value to the levels of the factor first, then add the value. Also, beware that changing the levels will also change the values stored in the factor. You can set the levels of a factor with levels(x) <- y, where x is the factor vector and y a vector representing the new levels of the factor.

# Create a factor of color data
color_data <- c("blue", "red", "blue", "blue", "blue", "red", "red", "blue")
# Create the factor, declaring the levels; notice that I included a category
# that is not in the data vector
colcat1 <- factor(color_data, levels = c("red", "green", "blue"))
colcat1
## [1] blue red  blue blue blue red  red  blue
## Levels: red green blue
# If I do not declare the levels, R will use the values stored in the vector
# to guess what they are
colcat2 <- factor(color_data)
# Now I change data; for the first, no complaint if I add 'green'
colcat1[1] <- "green"
colcat1
## [1] green red   blue  blue  blue  red   red   blue 
## Levels: red green blue
# But 'green' is not in the levels of colcat2, so an warning is issued and
# NA is added instead
colcat2[1] <- "green"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "green"): invalid factor level,
## NA generated
colcat2
## [1] <NA> red  blue blue blue red  red  blue
## Levels: blue red
# I can see the levels of these with levels()
levels(colcat1)
## [1] "red"   "green" "blue"
levels(colcat2)
## [1] "blue" "red"
# I can rearrange all the categories by changing the levels
levels(colcat1) <- c("blue", "red", "green")
colcat1
## [1] red   blue  green green green blue  blue  green
## Levels: blue red green
# Here I add a new level to those specified for colcat2
levels(colcat2)[3] <- "green"
# Now I can add 'green' to colcat2 without complaint
colcat2[1] <- "green"
colcat2
## [1] green red   blue  blue  blue  red   red   blue 
## Levels: blue red green

Functions

Functions are one of the most important objects in R. In fact, R follows a programming paradigm called functional programming, where most operations are seen as the evaluation of functions. You can think of functions as miniature programs for performing certain tasks.

In R, functions have two parts:

  • Arguments are the values passed to the function for evaluation. In the statement f(x, y), the variables between the parentheses, x and y, are the arguments of the function. A function will typically expect an argument to be of a particular type and may throw an error when that type is not received, but the expected type could be anything from a vector to a data frame to even another function. Some functions have many arguments but have default values assigned to most of them so that the user specifies those arguments only if they wish to change the function’s default behavior.

  • The body of a function is the part of the function that performs computations (involving the parameters) and possibly returning an output. The output of the function is either the last value computed (and not assigned to a variable) or a value returned by the return() function.

To create a named function, use the syntax my_func <- function(arguments) { function body }. Below I create a function.

# Let's create a function that returns the length of the longest vector
# passed to it. It will take two vectors x and y as arguments, and return
# the length of the longer vector
longest_length <- function(x, y) {
    l1 <- length(x)  # This is a local variable that will be visible only in the function, not the rest of R
    l2 <- length(y)
    max(l1, l2)  # This is the last unevaluated computation and thus is returned by the function
}
# Note that longest_length is a variable storing a function, and
# longest_length() is a function call

# Testing it out
vec1 <- c("bob", "jim", "margaret", "danny")
vec2 <- c(22, -9)
longest_length(vec1, vec2)
## [1] 4
# Functions will check the position of objects passed and assign them to the
# arguments in the same position in the function definition. Alternatively
# (and very useful when there are many arguments not all of which are
# specified), you can use = to set specific arguments by name.
longest_length(y = vec2, x = vec1)
## [1] 4

Functions are extremely important objects in R and the key to R programming. Appendix A of the Verzani textbook discusses function programming in more detail.

Visually Exploring a Dataset

R has many techniques built-in for visually analyzing datasets, and many packages that do so even better, such as the very popular ggplot2 package (I used ggplot2 for all the graphics for my first report on Utah’s gender gap in wages written for Voices for Utah Children, which you can read here). Here I will discuss how to make a few basic graphics for visually analyzing a dataset.

Stem-and-leaf plot

Use the stem() function to create a stem-and-leaf plot.

stem(rivers)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##    0 | 4
##    2 | 011223334555566667778888899900001111223333344455555666688888999
##    4 | 111222333445566779001233344567
##    6 | 000112233578012234468
##    8 | 045790018
##   10 | 04507
##   12 | 1471
##   14 | 56
##   16 | 7
##   18 | 9
##   20 | 
##   22 | 25
##   24 | 3
##   26 | 
##   28 | 
##   30 | 
##   32 | 
##   34 | 
##   36 | 1

The display suggests that the rivers dataset is very right-skewed. Let’s see if this agrees with other plots.

Dotplot

You can use stripchart() to make dotplots. The default behavior of this function doesn’t produce a very useful plot (truth be told, plotting functions in other packages make better dotplots in general than stripchart()), so set method = "stack" to enable stacking. I also set pch = 20 to change the plotting character used from the default square to a filled-in circle, which is easier to read.

Many R plotting functions have parameters main, xlab, and ylab. These are so you set the title of the plot, and the names of the labels. I do so in the plot I create as well.

# Make a dotplot of the rivers data
stripchart(rivers, method="stack", pch = 20,
           # Adding axis labels
           main = "Length of North American Rivers",
           xlab = "Length")

The rivers dataset is slightly large, with 141 observations. The plots used so far may not result in very good graphics for datasets like this. I next consider graphics that may handle larger datasets better.

Histogram

The hist() function creates histograms in R. Calling hist(x) for some dataset x will create a histogram R thinks is appropriate. R will automatically choose classes and the number of classes to used based on built-in algorithms. I show an example below:

hist(rivers)

We can change the parameters of the hist() function to have more control over the result. For example:

  • R automatically chooses axis names and the main title, which usually are not very good names. We can change these defaults to reasonable labels by setting the main, xlab, and ylab parameters.

  • By default, R will plot the frequency rather than the relative frequency of each class. The result is indistinguishable until you wish to overlay a histogram with a smooth curve or otherwise be more creative with the chart. Set freq = FALSE to show relative frequencies, or probabilities.

  • R creates left-inclusive histograms by default. If we want right-inclusive histograms, set right = TRUE.

  • The breaks parameters is used for setting where the break points are located. If we set breaks to an integer, this will tell R how many classes to use. For example, if we want \(\sqrt{n}\) classes for the rivers dataset, we can do so by setting breaks = round(sqrt(length(x))). (That said, the method R uses for determining how many classes to use is usually better than the \(\sqrt{n}\) rule.)

  • Do we really like white bars in our histogram? The col parameter can change their color. For example, set col = "gray" for gray bars.

There are many more parameters for the hist() function than this; I invite you to read the function’s documentation for more details. Here is another histogram for the rivers dataset, this one changing some parameters.

hist(rivers, main = "Distribution of the Lengths of Major North American Rivers",
     xlab = "Length of River",
     freq = FALSE,
     right = TRUE,
     breaks = round(sqrt(length(rivers))), # Using the sqrt(n) rule
     col = "gray")

Density Curve

A density curve is another way to view a distribution where the end result is a smooth curve, It can be interpreted like a histogram, but it avoids disadvantages that come with choosing discrete classes. You can create a density curve with plot(density(x)).

plot(density(rivers), main = "Distribution of Lengths of Rivers")

# It is possible to superimpose a density plot on top of a histogram to see the relationship. Just be sure that the histogram is showing relative frequencies rather than frequencies. Here is an example.
hist(rivers, main = "Distribution of the Lengths of Major North American Rivers",
     xlab = "Length of River",
     freq = FALSE,
     right = TRUE,
     breaks = round(sqrt(length(rivers))), # Using the sqrt(n) rule
     col = "gray",
     ylim = c(0, 0.0025)) # ylim sets the limits of the vertical axis
lines(density(rivers))

Boxplot

You can create a boxplot in R with the boxplot() function.

boxplot(rivers)

One of the major advantages of boxplots is being able to compare distributions of different samples. To do so, issue a call to boxplot() of the form boxplot(x ~ y), where x is a data vector with all samples combined, and y is a character or factor vector identifying the sample each element of x belongs to.

# I will be working with the iris dataset for this example
# Looking at the data
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
iris$Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
boxplot(iris$Sepal.Length ~ iris$Species)

Bar Chart

For a categorical dataset, we can create a bar chart using the barplot() function, along with the summary() function. This is a natural way to visualize categorical data.

library(UsingR)
central.park.cloud  # Looking at the data
##  [1] partly.cloudy partly.cloudy partly.cloudy clear         partly.cloudy
##  [6] partly.cloudy clear         cloudy        partly.cloudy clear        
## [11] cloudy        partly.cloudy cloudy        cloudy        clear        
## [16] partly.cloudy partly.cloudy clear         clear         clear        
## [21] clear         cloudy        cloudy        cloudy        cloudy       
## [26] cloudy        clear         partly.cloudy clear         clear        
## [31] partly.cloudy
## Levels: clear partly.cloudy cloudy
# For this object (a factor vector), summary will give the counts of each
# category.
summary(central.park.cloud)
##         clear partly.cloudy        cloudy 
##            11            11             9
# Shows the cloudiness of different days in Central Park
barplot(summary(central.park.cloud))

Numerical Summaries

A numerical summary tries to describe some aspect of a dataset using numbers. Two classes of numerical summaries include measures of location and measures of spread. There are many other numerical summaries (the textbook used in this lab describes measures of skewness and kurtosis in addition to measures of location and spread), but we will focus on these two.

First, a great way to obtain appropriate numerical summaries for a dataset is with the summary() function. This is a generic function that changes its behavior depending on the object passed to it. If x is a numeric vector, summary(x) will compute the five-number summary of x (minimum, first quartile, median, third quartile, maximum) in addition to the mean of x. On the other hand, if x is a factor or character vector, summary(x) will compute the frequency of each category in x. Thus, if you have a dataset you are unfamiliar with, summary() is a good way to see some basic information about it.

# Some basic information about rivers, a numeric dataset
summary(rivers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   310.0   425.0   591.2   680.0  3710.0
# Frequencies for cloud conditions in central.park.cloud, a factor vector
summary(central.park.cloud)
##         clear partly.cloudy        cloudy 
##            11            11             9

Of course it is possible to compute specific numeric summaries.

The mean of a dataset is defined as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

where \(x_i\) is the \(i^{\text{th}}\) observation and \(n\) the size of the dataset. The mean can be computed in R using the mean() function. To obtain the trimmed mean, you can set the trim parameter to a value between 0 and 0.5; this sets how much of the data to “trim” from either end of the ordered dataset. If trim = 0 (the default), no data is trimmed from either end, but if trim = 0.5, as much data is possible is trimmed from either end and the median is computed.

# The mean executive pay (in $10,000's)
mean(exec.pay)
## [1] 59.88945
# There are very large outliers, though; what happens to the mean if we trim
# 10% of the data from either side of the (ordered) dataset?
mean(exec.pay, trim = 0.1)
## [1] 29.96894

The median is the number that splits the dataset in half after being ordered. You can compute the median in R using the median() function.

# The median executive pay; compare to the mean or trimmed mean
median(exec.pay)
## [1] 27

The \((100\alpha)^{\text{th}}\) percentile is the number such that \(100\alpha\%\) of the dataset lies below and \(100(1-\alpha)\%\) above the number. The quantile() function in R computes percentiles (also referred to as quantiles). quantile(x) will effectively find the five-number summary of the dataset x, including the first and third quartiles. Alternatively, one can call quantile(x, p), where p is a vector (or perhaps just a number that will be interpreted as a vector) specifying with percentiles are wanted (these are numbers between 0 and 1).

The fivenum() function will find five-number summaries outright, but one may as well call quantile() with the default parameters (the presentation is better anyway).

# The five-number summary using fivenum
fivenum(exec.pay)
## [1]    0.0   14.0   27.0   41.5 2510.0
# The same information using quantile
quantile(exec.pay)
##     0%    25%    50%    75%   100% 
##    0.0   14.0   27.0   41.5 2510.0
# What is the 99th percentile of executive pay?
quantile(exec.pay, .99)
##    99% 
## 906.62
# The deciles of exec.pay, breaking the dataset into 10 parts
quantile(exec.pay, seq(0, 1, by = .1))
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90% 
##    0.0    9.0   12.6   16.0   22.0   27.0   31.0   38.0   48.0   91.4 
##   100% 
## 2510.0

Now let’s discuss measures of spread. The range of a dataset is the difference between the largest and smallest values:

\[\text{range} = \max_i x_i - \min_i x_i\]

R’s range() function will find the maximum and minimum of the dataset, but won’t difference them. This is not a problem, though; simply call diff() on the result of range to subtract the minimum from the maximum, like diff(range(x)), where x is the dataset.

# The largest and smallest executive pay
range(exec.pay)
## [1]    0 2510
# The range
diff(range(exec.pay))
## [1] 2510

Another (more common) means for numerically describing the spread of a dataset is with the standard deviation or its square, the variance. The variance is defined as follows:

\[s^2 = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_i - \bar{x})^2\]

The standard deviation is merely the square root of the variance.

\[s = \sqrt{s^2}\]

In R, the function var() finds a dataset’s variance, and sd() finds the standard deviation.

# The variance of exec.pay
var(exec.pay)
## [1] 42867.03
# We could take the square root of the variance to get the standard
# deviation
sqrt(var(exec.pay))
## [1] 207.0435
# Or we could just use sd
sd(exec.pay)
## [1] 207.0435

For categorical data, the simplest way to numerically summarize the data is with a table. We can create one with the table() function

table(central.park.cloud)
## central.park.cloud
##         clear partly.cloudy        cloudy 
##            11            11             9
# Create a frequency table by dividing a table by the sample size (i.e. the
# length of the data vector)
table(central.park.cloud)/length(central.park.cloud)
## central.park.cloud
##         clear partly.cloudy        cloudy 
##     0.3548387     0.3548387     0.2903226