So far we have examined only relationships between quantitative variables. We can also examine relationships between categorical variables.
Numerically, we analyze categorical data with tables of counts, where each cell of the table contains the number of observations having a particular combination of values of the categorical variables in question. We usually want to consider the joint distribution of the variables as well as the margins of the table. The xtabs() function lets us quickly construct joint distribution tables using formula notation. xtabs(~ x + y, data = d) will construct a table of counts for variables x and y, with data stored in d. One could extend this table to as many variables as desired; for example, xtabs(~ x + y + z, data = d) constructs a three-dimensional array examining the relationship between variables x, y, and z. When creating such an array, you may want to use the ftable() function to view the information in the table in a more legible format. For example, if we saved the three-way xtabs() output in a variable tab, ftable(tab, row.vars = 2, col.vars = c(1, 3)) will create a table where the variable associated with dimension 2 (y) is shown in the rows, and the variables associated with dimensions 1 and 3 (x and z) are shown in the columns.
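To make the generic calls above concrete, here is a minimal sketch on a small made-up data frame. The data frame d and its values are invented purely for illustration (they do not come from any real data set); only xtabs() and ftable() are used, as described above.

```r
# Hypothetical toy data: six observations of three categorical variables
d <- data.frame(
  x = c("a", "a", "b", "b", "b", "a"),
  y = c("u", "v", "u", "u", "v", "v"),
  z = c("p", "q", "p", "q", "p", "p")
)

# Two-way table of counts for x and y
tab_xy <- xtabs(~ x + y, data = d)
tab_xy

# Three-way table; prints as one x-by-y slice per level of z
tab <- xtabs(~ x + y + z, data = d)

# Flattened view: y in the rows, combinations of x and z in the columns
flat <- ftable(tab, row.vars = 2, col.vars = c(1, 3))
flat
```

Because the data are so small, each count can be checked by hand against the rows of d.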
I demonstrate constructing tables this way below, exploring the Cars93 data set (from the MASS package).
# Load MASS, which provides the Cars93 data set
library(MASS)
# Two-way table exploring origin and type
tab1 <- xtabs(~Origin + Type, data = Cars93)
tab1
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           7    11      10     7      8   5
##   non-USA       9     0      12    14      6   4
# A three-way table
tab2 <- xtabs(~Origin + Type + Cylinders, data = Cars93)
# The following output is hard to parse
tab2
## , , Cylinders = 3
##
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     0      0   0
##   non-USA       0     0       0     3      0   0
##
## , , Cylinders = 4
##
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           7     0       4     7      4   0
##   non-USA       8     0       3    11      4   1
##
## , , Cylinders = 5
##
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     0      0   0
##   non-USA       0     0       1     0      0   1
##
## , , Cylinders = 6
##
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     7       5     0      3   5
##   non-USA       1     0       7     0      1   2
##
## , , Cylinders = 8
##
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     4       1     0      1   0
##   non-USA       0     0       1     0      0   0
##
## , , Cylinders = rotary
##
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     0      0   0
##   non-USA       0     0       0     0      1   0
# This is easier to read
ftable(tab2, row.vars = 2, col.vars = c(1, 3))
##         Origin    USA                    non-USA
##         Cylinders   3  4  5  6  8 rotary       3  4  5  6  8 rotary
## Type
## Compact             0  7  0  0  0      0       0  8  0  1  0      0
## Large               0  0  0  7  4      0       0  0  0  0  0      0
## Midsize             0  4  0  5  1      0       0  3  1  7  1      0
## Small               0  7  0  0  0      0       3 11  0  0  0      0
## Sporty              0  4  0  3  1      0       0  4  0  1  0      1
## Van                 0  0  0  5  0      0       0  1  1  2  0      0
# A four-way table
tab3 <- xtabs(~Origin + Type + Cylinders + Man.trans.avail, data = Cars93)
ftable(tab3, row.vars = c(2, 4), col.vars = c(1, 3))
##                         Origin    USA                    non-USA
##                         Cylinders   3  4  5  6  8 rotary       3  4  5  6  8 rotary
## Type    Man.trans.avail
## Compact No                          0  2  0  0  0      0       0  0  0  0  0      0
##         Yes                         0  5  0  0  0      0       0  8  0  1  0      0
## Large   No                          0  0  0  7  4      0       0  0  0  0  0      0
##         Yes                         0  0  0  0  0      0       0  0  0  0  0      0
## Midsize No                          0  4  0  4  1      0       0  0  0  3  1      0
##         Yes                         0  0  0  1  0      0       0  3  1  4  0      0
## Small   No                          0  0  0  0  0      0       0  0  0  0  0      0
##         Yes                         0  7  0  0  0      0       3 11  0  0  0      0
## Sporty  No                          0  0  0  0  0      0       0  0  0  0  0      0
##         Yes                         0  4  0  3  1      0       0  4  0  1  0      1
## Van     No                          0  0  0  4  0      0       0  0  0  2  0      0
##         Yes                         0  0  0  1  0      0       0  1  1  0  0      0
When faced with a table, one often wishes to know the marginal distributions; a marginal distribution is the distribution of just one of the variables, ignoring the others. We can obtain the margins of the tables produced by xtabs() with margin.table(). The call margin.table(tbl, margin = i) will find the marginal distribution of tbl for margin i, which may be 1 or 2 for a two-way table (corresponding to rows and columns, respectively), but could be higher for tables with more dimensions. I demonstrate margin.table() below:
margin.table(tab3, margin = 1)
## Origin
##     USA non-USA
##      48      45
margin.table(tab3, margin = 2)
## Type
## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9
margin.table(tab3, margin = 3)
## Cylinders
##      3      4      5      6      8 rotary
##      3     49      2     31      7      1
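The same idea extends to joint margins and to proportions. The sketch below assumes Cars93 is available from the MASS package and rebuilds tab1 and tab3 so that it runs on its own; prop.table() and addmargins() are base R functions not otherwise used in this section, so treat this as one possible approach rather than the only one.

```r
library(MASS)  # provides Cars93

tab1 <- xtabs(~Origin + Type, data = Cars93)
tab3 <- xtabs(~Origin + Type + Cylinders + Man.trans.avail, data = Cars93)

# A vector of margins keeps several variables at once; summing tab3 over
# Cylinders and Man.trans.avail recovers the Origin-by-Type counts in tab1
origin_type <- margin.table(tab3, margin = c(1, 2))
all(origin_type == tab1)

# Append row and column totals to the two-way table
addmargins(tab1)

# Convert counts to proportions: the joint distribution, then the
# distribution of Type within each Origin (each row sums to 1)
joint <- prop.table(tab1)
type_given_origin <- prop.table(tab1, margin = 1)
round(type_given_origin, 2)
```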
We have a few options for visualizing data in a two-way table (tables with more dimensions are more complex and demand more of visualization techniques). One option is a stacked bar plot, where the height of each bar corresponds to the marginal distribution of one variable and the sub-bars show the breakdown by the second variable. Often better are side-by-side bar plots, which do not stack the breakdown categories one atop the other but instead place them side by side in close proximity. barplot() can create both kinds of plot.
# Stacked bar plot: one bar per Type, split by Origin
barplot(tab1)
# Side-by-side bar plot
barplot(tab1, beside = TRUE)
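The default plots carry no legend, so it is hard to tell which bar belongs to which origin. A minimal sketch of a labelled version follows; it assumes Cars93 comes from MASS, and the titles, axis labels, and colors are arbitrary choices made for illustration.

```r
library(MASS)  # provides Cars93
tab1 <- xtabs(~Origin + Type, data = Cars93)

# Rows of tab1 (Origin) become the bars within each group;
# columns (Type) label the groups along the x-axis
mids <- barplot(tab1, beside = TRUE,
                legend.text = TRUE,  # legend entries taken from the row names
                col = c("steelblue", "orange"),
                xlab = "Type", ylab = "Number of models",
                main = "Cars93: type by origin")

# Transposing the table groups the bars by Origin instead
barplot(t(tab1), beside = TRUE, legend.text = TRUE,
        xlab = "Origin", ylab = "Number of models")
```

Which variable to group by depends on the comparison you care about: grouping by Type makes it easy to compare origins within a type, while the transposed version compares types within an origin.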
Another option is a mosaic plot, which shows the frequency of each combination of variables as the area of a rectangle. This can be created in R using the function mosaicplot(), as below.
mosaicplot(tab1)
mosaicplot(tab2)
mosaicplot(tab3)
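Mosaic plots of three or more variables get crowded quickly, so it can help to start from a labelled two-way plot. The sketch below again assumes Cars93 from MASS, and the title is an arbitrary choice. The shade = TRUE call is optional; it colors the tiles by residuals from a log-linear independence model.

```r
library(MASS)  # provides Cars93
tab1 <- xtabs(~Origin + Type, data = Cars93)

# Each rectangle's area reflects the cell's share of all observations
cell_share <- prop.table(tab1)
round(cell_share, 3)

# Labelled mosaic plot with colored tiles
mosaicplot(tab1, main = "Cars93: origin by type", color = TRUE)

# Optional: shade tiles by residuals under an independence model
mosaicplot(tab1, main = "Cars93: origin by type", shade = TRUE)
```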