Telling stories with data using the grammar of graphics

Different types of graphs may, at first glance, appear completely distinct. But in fact, graphs share many common elements, such as a coordinate system and geometric shapes that represent the data. By making different visual choices (Cartesian or polar coordinates; points, lines, or bars to represent data), you can use graphs to highlight different aspects of the same data. For example, here are three ways of displaying the same data:

plot of chunk three plots

The pie chart focuses the reader on large percentages, and encourages the reader to think of the total (here, the cut of different diamonds) as a finite quantity that is being apportioned to different groups. The stacked bar plot provides the same information, but makes it easier to accurately gauge how large each category is. The histogram splits the categories horizontally, and draws attention to how the categories are ordered. It encourages the reader to think about the distribution rather than disconnected categories, and provides a sense of scale.

We often talk about types of graphs – bar plots, pie charts, scatterplots – as though they are unrelated, but most graphs share many aspects of their structure. We can think of graphs as visual representations of (possibly transformed) data, along with labels (like axes and legends) that make the meaning clear. Much like the grammar of a language allows you to combine words into meaningful sentences, a grammar of graphics provides a structure for combining graphical elements into figures that display data in a meaningful way. The grammar of graphics was originally introduced by Leland Wilkinson in the late 1990s, and was popularized by Hadley Wickham with ggplot, an R graphics library based on it.

I started using ggplot a few years ago. The syntax felt foreign for a long time, so I decided to learn about the theory behind the grammar, to get an intuition for the concepts underlying the code. Understanding this theory helped me understand ggplot and think more deeply about graphics, and I hope this introduction does the same for others. ggplot is the best-known implementation of the grammar, and thanks to its success, the grammar is now being implemented in a variety of other languages and libraries, including Python, Julia, and D3. I’ll be using R code for purposes of illustration here.

In the grammar of a language, words have different parts of speech, which perform different roles in the sentence. Analogously, the grammar of graphics separates a graphic into different layers. These are layers in a literal sense – you can think of them as transparency sheets for an overhead projector, each containing a piece of the graphic, which can be arranged and combined in a variety of ways. But what is in a layer? Let’s think about an example. Say we have a dataset with an independent variable, x, and a dependent variable, y. If we perform a simple linear regression, we can also calculate the predicted values for y at specific values of x (we can call these predictions y’). Using these data, we want to make a scatterplot with a line of best fit. What are the elements of this plot?

We have:

The data itself (x, y, and the best fit prediction, y’)
Dots on the scatterplot representing the relationship between x and y
The line representing the relationship between x and y’ (the line of best fit)
The scaling of the data (linear)
The coordinate system (Cartesian)

What if we want to make a histogram of the distribution of x? Then we have:

The data itself (x)
Bars representing the frequency of x at different values of x
The scaling of the data (linear)
The coordinate system (Cartesian)

Clearly there are many similar components between these graphs, and for most graphs, these elements do a pretty good job describing what a plot will look like. These are our “parts of speech”, the pieces of a graphic that we can use to tell a story. So let’s look at each one of them and try to understand what they do.
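
As a rough sketch of how those parts map onto code, the scatterplot-with-fit example might be assembled as below. The simulated data frame d, its column names, and the seed are made up for illustration, and the scale and coordinate calls simply spell out what ggplot would use by default.

```r
library(ggplot2)

## hypothetical data: x, y, and the fitted prediction (y' in the text)
set.seed(1)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(50)
d$y_pred <- fitted(lm(y ~ x, data = d))

p <- ggplot(data = d, aes(x = x)) +     # the data
    geom_point(aes(y = y)) +            # dots: x versus y
    geom_line(aes(y = y_pred)) +        # line: x versus y'
    scale_x_continuous() +              # linear scaling (the default)
    scale_y_continuous() +
    coord_cartesian()                   # Cartesian coordinates (the default)
plot(p)
```

Each line after ggplot() corresponds to one item in the first list above; dropping the geom_line() call, for instance, leaves a plain scatterplot.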

Data

Before it’s possible to talk about a graphical grammar, it’s important to know the format of the data you’re working with. After all, it contains all of the information you’re trying to convey. The grammar speaks in terms of data as “tidy” rows of individual observations. Here’s a sample of data in this format, taken from ggplot’s sample dataset diamonds.

##   carat       cut color clarity depth table price
## 1  0.23     Ideal     E     SI2  61.5    55   326
## 2  0.21   Premium     E     SI1  59.8    61   326
## 3  0.23      Good     E     VS1  56.9    65   327
## 4  0.29   Premium     I     VS2  62.4    58   334
## 5  0.31      Good     J     SI2  63.3    58   335
## 6  0.24 Very Good     J    VVS2  62.8    57   336

Here, each row represents observations of a single diamond. This seems like an obvious format, but not all datasets have this structure by default. Count data can be stored as a matrix. For example, you might imagine a matrix of locations and the number of birds spotted there:

##        cardinal blue jay chickadee
## site_1        5        0         1
## site_2        4        0         2

Here, each matrix element, rather than each row, represents a single observation. Using the package reshape2, we can transform the matrix into a format that is compatible with the grammar. In this case we transform the matrix into a list of observations and store the value in the new column count.

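library(ggplot2)
library(reshape2)
## rebuild the bird count matrix shown above (matrices fill column by column)
birdmat <- matrix(c(5, 4, 0, 0, 1, 2), nrow = 2,
    dimnames = list(c("site_1", "site_2"), c("cardinal", "blue jay", "chickadee")))
melt(birdmat, value.name = "count")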

##     Var1      Var2 count
## 1 site_1  cardinal     5
## 2 site_2  cardinal     4
## 3 site_1  blue jay     0
## 4 site_2  blue jay     0
## 5 site_1 chickadee     1
## 6 site_2 chickadee     2

Other representations of data include summary tables, or the storage of different columns in different variables. ggplot is fairly picky about data formatting, but in return it gives us the power to make significant changes to how we are plotting the data without changing the data object itself. To understand the grammar of graphics, it helps to think of the one-observation-one-row dataset as a fixed entity that we can view in different ways.
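
For comparison, the same matrix-to-rows conversion can be sketched in base R alone. The birdmat object is rebuilt here from the example matrix above, and as.data.frame(as.table(...)) is a swapped-in alternative to melt(), not the method used in this post.

```r
## rebuild the example count matrix: one row per site, one column per species
birdmat <- matrix(c(5, 4, 0, 0, 1, 2), nrow = 2,
                  dimnames = list(c("site_1", "site_2"),
                                  c("cardinal", "blue jay", "chickadee")))

## as.table() treats the matrix as a contingency table;
## as.data.frame() then returns one row per cell
birds_long <- as.data.frame(as.table(birdmat), responseName = "count")
birds_long
```

Either way, the result has one row per site-species observation, which is the shape the grammar expects.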

Geoms

The most obvious part of a graph is the visual display of the data itself. This is often a basic geometric object like a point, line, or bar, so in ggplot, each of these elements is called a “geom”. You can display multiple pieces of information by layering geoms (scatterplot layer + line of best fit), or you can explore the same data by visualizing it with different types of geoms.

## fit a linear regression for the relationship between carat and price
## the "fitted" column here is calculating the predicted price for
## each value of carat.
diamonds$fitted <- fitted(lm(price ~ carat, data = diamonds))
## ggplot can actually fit simple models like a linear regression
## using geom_smooth(). Since this only works for a limited set of
## models, I prefer to do the model fitting outside of the plotting.
g <- ggplot(data = diamonds, aes(x = carat, y = price)) +
    geom_point(alpha=.3) + geom_line(aes(y = fitted)) +
    scale_y_continuous(limits = c(0, 20000))
plot(g)

plot of chunk geom_layers

## three ways of looking at the distribution of diamond price,
## conditional on the quality of the cut
g <- ggplot(diamonds, aes(x = cut, y = price, fill = cut))

## jittering the points helps prevent overplotting
gjitter <- g + geom_jitter()

## box and whiskers plot
gbox <- g + geom_boxplot()

## a fancier version of the boxplot, which shows the whole distribution,
## not just quantiles
gviol <- g + geom_violin()

plot of chunk plot_three

Scaling

Sometimes it’s useful to transform or rescale data. Our eyes are good at seeing linear relationships, so if a relationship is log-linear, it makes sense to simply change the scale. Similarly, it is common to fit a regression line to log-transformed data, and it makes sense to plot this as a linear relationship, rather than plotting a curved fit on a linear scale. The same logic applies for other transformations. In the example below, the distribution looks close to unimodal (that is, it has a single peak) until we log-transform it. I like to think of this as a transformation of our view of the data, rather than a transformation of the dataset itself. Since we are treating the dataset as a fixed entity, it makes sense to apply transformations while plotting rather than altering the dataset itself.

# define linear histogram
g <- ggplot(data = diamonds, aes(x = price)) +
    geom_histogram()

# apply logarithmic scale
g2 <- g + scale_x_log10()

plot of chunk plot_logunlog
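
One way to see the difference between transforming the view and transforming the data is to compare the two approaches side by side. This is a sketch only: the object names and bins = 30 are arbitrary choices for illustration.

```r
library(ggplot2)

## transform the view: the data are untouched, and the axis is
## labelled in the original units (dollars)
g_scale <- ggplot(diamonds, aes(x = price)) +
    geom_histogram(bins = 30) +
    scale_x_log10()

## transform the data inside the mapping: the histogram is computed on
## log10(price), so the axis reads in log units, which are harder to interpret
g_data <- ggplot(diamonds, aes(x = log10(price))) +
    geom_histogram(bins = 30)
```

In both versions every diamond is still counted; only the first keeps the axis in units a reader can interpret directly.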

Coords

We’re very used to thinking using the Cartesian coordinate system, but sometimes polar coordinates make sense. Pie charts are a common (albeit controversial) use of polar coordinates, and there are other flashy graphics that use them. You may also want to use a map projection, or flip the coordinates for a horizontal bar graph rather than a vertical one.

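## cut is categorical, so count rows per category with geom_bar();
## geom_histogram() requires a continuous x in current versions of ggplot
flip <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
    geom_bar() +
    coord_flip()
plot(flip)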

plot of chunk flip_coords

Polar coordinates can sometimes be misleading or confusing to the eye, but if your data are fundamentally cyclic rather than linear in nature, they are a useful option to have.
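
To illustrate how a coordinate system changes the picture without changing the data or the geom, here is a sketch that turns a stacked bar of diamond cuts into a pie chart. The factor(1) placeholder for x and the object names are illustrative choices.

```r
library(ggplot2)

## a single stacked bar: one x position, filled by cut
bar <- ggplot(diamonds, aes(x = factor(1), fill = cut)) +
    geom_bar(width = 1)

## the same layer drawn in polar coordinates, with bar height mapped
## to angle, gives a pie chart
pie <- bar + coord_polar(theta = "y")
```

The counts behind the two plots are the same; only the coordinate system that draws them differs.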

Groups and Facets

Facets are a way to split data into subplots based on another factor in the data. In my opinion, facets are one of the most compelling reasons for using the grammar of graphics. Let’s say we’re looking at diamond carat versus diamond price. This is pretty easy to plot in most programming languages.

## base R
plot(diamonds$carat, diamonds$price, type = 'p')

plot of chunk carat_price

## ggplot
g <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
plot(g)

plot of chunk carat_price

But we also have information about the cut. Bigger, heavier diamonds probably cost more, but not if they’re cut poorly. So we can create different plots for each cut, to tease out this relationship. In base R, you would have to write out separate commands for each cut. Basically, each cut category gets treated as an entirely separate dataset, and each plot as a separate unit. But using the grammar of graphics, the facet is simply another layer to apply to one dataset. Look at how much nicer the ggplot code (and output!) looks:

## base R
par(mfrow = c(2,3)) ## sets up multi-plot window
## plot each facet separately
## also notice that we have to specify the axes limits for x using
## xlim, or the plots would be differently scaled
plot(diamonds$carat[diamonds$cut == 'Fair'],
     diamonds$price[diamonds$cut == 'Fair'], type = 'p',
     xlim = c(0,5))
plot(diamonds$carat[diamonds$cut == 'Good'],
     diamonds$price[diamonds$cut == 'Good'], type = 'p',
     xlim = c(0,5))
plot(diamonds$carat[diamonds$cut == 'Very Good'],
     diamonds$price[diamonds$cut == 'Very Good'], type = 'p',
     xlim = c(0,5))
plot(diamonds$carat[diamonds$cut == 'Premium'],
     diamonds$price[diamonds$cut == 'Premium'], type = 'p',
     xlim = c(0,5))
plot(diamonds$carat[diamonds$cut == 'Ideal'],
     diamonds$price[diamonds$cut == 'Ideal'], type = 'p',
     xlim = c(0,5))

plot of chunk cut_facet

## ggplot
## we can just reuse 'g' from the previous example, and add facetting
## also note that ggplot keeps identical axis limits for all facets,
## because it's treating the facets as parts of a single dataset
g <- g + facet_grid(. ~ cut)
plot(g)

plot of chunk cut_facet_2

Now let’s say we also want to split the data by clarity. Having each cut-clarity combination get its own separate plot would be a lot of plots, so let’s color the points by the level of clarity. Base R treats each point color as a different set of points, and each subplot in the window as a separate unit. But color is like an adjective in the grammar: an aesthetic element which modifies the other layers.

## base R
## set up subplot window
par(mfrow = c(2,3))

## we can make things more concise by looping
for(i in levels(diamonds$cut)){
    ## color number for points
    colnum <- 1
    for(j in levels(diamonds$clarity)){
        ## for each new plot, the first call needs to create the plot;
        ## otherwise we layer on points
        if(colnum == 1){
            ## fix ylim as well as xlim, so that points from later
            ## clarity levels are not clipped by the first subset's range
            plot(diamonds$carat[diamonds$cut == i & diamonds$clarity == j],
                 diamonds$price[diamonds$cut == i & diamonds$clarity == j],
                 type = 'p', col = colnum, xlim = c(0,5),
                 ylim = range(diamonds$price))
        } else {
            points(diamonds$carat[diamonds$cut == i & diamonds$clarity == j],
                   diamonds$price[diamonds$cut == i & diamonds$clarity == j],
                   col = colnum)
        }
        colnum <- colnum + 1
    }
}

plot of chunk clarity_color

## ggplot
## The only change we need to make is to add an aesthetic in ggplot()
g <- ggplot(diamonds, aes(x = carat, y = price, colour = clarity)) +
    geom_point() + facet_grid(. ~ cut)
plot(g)

plot of chunk clarity_color_2

Clearly, the base code is clunky and verbose next to the ggplot version. With a grammar to work with, we can communicate our intentions to ggplot in a clearer, more concise way. To anthropomorphize heavily, base R graphics happily plots data, but it doesn’t “understand” anything about how data are structured. If you want to split your data into groups by another variable, you have to tell it specifically what to do with each data chunk; it doesn’t understand what “split by another variable” means. But the grammar of graphics provides a common language between the computer and the user. ggplot understands the concept of a dataset, and how its rows and columns are related. It understands that a pie chart and a stacked bar chart are the same plot with different coordinate systems. It understands the idea of faceted plots as a single visual unit. And most importantly, it makes those ideas and relationships visible to the user, so that it is simple to switch between different visual elements to represent the data.

Communication is the core idea at work here. When you have a structured language for graphics, it’s a lot easier to think and talk about them. It’s a great mental framework for when you’re trying to decide how to display data for yourself or for a presentation. If you aren’t sure how you want to display your data, it provides a concise, consistent way to move between different possible representations.

Making graphical choices

Choices of geometry matter. Points suggest that each data point is its own unit, independent of or distinct from the other points. Lines highlight the relationship between data points, rather than the points themselves. Similarly, bars, rectangles, and distributions emphasize that data points are parts of some larger category, and we are using the data to estimate the size or spread of that category.

This is why I find it useful to plot my data multiple ways before I present it to others. What do I learn from each plot? What information is excluded or difficult to see? How much detail am I providing, relative to what my audience needs to understand? What I find so powerful about grammar of graphics-style plotting is that, once my data are properly formatted, it is easy to slice, group, and facet my data in a variety of ways, swapping between geoms and aesthetics to explore my data. It’s not just a syntax but a different way to think about data, and a powerful tool for exploration and understanding.
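
As a small illustration of swapping geoms over a fixed dataset, the sketch below draws the same monthly counts three ways. The data frame sales and its values are invented for this example.

```r
library(ggplot2)

## invented data: one observation per month
sales <- data.frame(month = 1:6, units = c(12, 15, 11, 18, 21, 19))
base <- ggplot(sales, aes(x = month, y = units))

p_points <- base + geom_point()  # each month as its own unit
p_line   <- base + geom_line()   # emphasizes the change between months
p_bars   <- base + geom_col()    # emphasizes the size of each month's total
```

The data and aesthetic mappings are defined once in base, so trying another representation is a one-line change.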
