Where are we?

Prerequisite: Loading packages

library(mdsr) 
# library(tidyverse)

The textbook offers the mdsr package for R, which contains all of the data sets referenced in the book. In particular, loading mdsr also loads the mosaic package, which in turn loads dplyr and ggplot2. The mosaic package includes data sets and utilities from Project MOSAIC (http://mosaic-web.org) that are used to teach mathematics, statistics, computation, and modeling. Both dplyr and ggplot2 are part of the tidyverse.

The tidyverse (https://www.tidyverse.org/) is an opinionated collection of R packages designed for data science, managed by a group of people including Hadley Wickham, statistician and chief scientist at RStudio, Inc. See the excerpt of the Tidyverse slides at Tidyverse Slide.

ggplot2

The ggplot2 package is our primary tool for data visualization. It implements the grammar of graphics described in the book “The Grammar of Graphics” by Leland Wilkinson, now chief scientist at h2o, Inc. The four elements of graphics identified by Yau (Visual Cues, Coordinate System, Scale and Context) are also found in the grammar of graphics, albeit under different names. Thus, it is essential to understand this taxonomy of graphics in order to use ggplot2.

ggplot2::mpg data example

We will follow the examples in R for Data Science. Let’s first look at the data set. mpg contains observations collected by the US Environmental Protection Agency on 38 models of car. mpg is a tibble, which is a simplified data.frame, modified to handle large data better. For now, it is okay to think of a tibble as a data.frame.

class(mpg)
## [1] "tbl_df"     "tbl"        "data.frame"
mpg
## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
##  1         audi         a4   1.8  1999     4   auto(l5)     f    18    29
##  2         audi         a4   1.8  1999     4 manual(m5)     f    21    29
##  3         audi         a4   2.0  2008     4 manual(m6)     f    20    31
##  4         audi         a4   2.0  2008     4   auto(av)     f    21    30
##  5         audi         a4   2.8  1999     6   auto(l5)     f    16    26
##  6         audi         a4   2.8  1999     6 manual(m5)     f    18    26
##  7         audi         a4   3.1  2008     6   auto(av)     f    18    27
##  8         audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
##  9         audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

Among the variables in mpg are:

  1. displ, a car’s engine size, in litres.

  2. hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To learn more about mpg, open its help page by running ?mpg.
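Before plotting, it can help to check the size of the data and see the columns that the printout above truncates. The following is a minimal sketch; it assumes ggplot2 is loaded directly (rather than through mdsr) so that it runs on its own.

```r
library(ggplot2)  # provides the mpg data set (assumption: loaded directly, not via mdsr)

dim(mpg)                          # number of rows and columns
names(mpg)                        # all variable names, including fl and class
summary(mpg[c("displ", "hwy")])   # ranges of the two variables plotted next
```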

Creating a ggplot

To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.
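To see that ggplot() alone produces an empty plot and that each geom adds one layer, we can count the layers stored in the plot object. This is only a sketch: the object names p_empty and p_points are made up for illustration, and peeking at the layers field relies on how ggplot2 stores plots internally rather than on a documented interface.

```r
library(ggplot2)

p_empty  <- ggplot(data = mpg)   # coordinate system only, no layers yet
p_points <- p_empty +
  geom_point(mapping = aes(x = displ, y = hwy))   # add one layer of points

length(p_empty$layers)    # 0
length(p_points$layers)   # 1
```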

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

In connection to the four elements of data graphics,

  1. The coordinate system is set by ggplot(), by default to the Cartesian coordinate system;

  2. The visual cue used is position, set by mapping = aes(x = ..., y = ...), paired with the use of geom_point();

  3. The scale is chosen automatically to cover the range of the data;

  4. Context is (minimally) given by the axis labels.
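The four elements listed above can be made explicit by writing out what ggplot2 otherwise chooses for us. The sketch below should draw the same scatterplot as before, except for the axis labels, whose wording is my own choice; the name p_explicit is also just for illustration.

```r
library(ggplot2)

p_explicit <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +   # visual cue: position
  coord_cartesian() +                               # coordinate system (the default)
  scale_x_continuous() +                            # scale for x (default for numeric data)
  scale_y_continuous() +                            # scale for y
  labs(x = "Engine displacement (litres)",          # context: more informative axis labels
       y = "Highway fuel efficiency (mpg)")
p_explicit
```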

A graphing template

To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
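As one possible way of filling in the template, take <DATA> = mpg, <GEOM_FUNCTION> = geom_boxplot and <MAPPINGS> = x = class, y = hwy. The choice of a boxplot here is only an illustration, and p_box is a made-up name.

```r
library(ggplot2)

# Template filled in: data = mpg, geom = boxplot, mappings = class on x, hwy on y
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))
p_box
```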

Adding more visual cues

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

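Try mapping the class variable to other visual cues: size, shape, alpha (transparency), or fill (with shape = 22 set manually). In the code below, the base plot, with the x and y mappings given in ggplot(), is stored in g, so each geom only needs the extra aesthetic.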

g <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
g + geom_point(mapping = aes(size = class))
g + geom_point(mapping = aes(shape = class))
g + geom_point(mapping = aes(alpha = class))
g + geom_point(mapping = aes(fill = class), shape = 22)
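When shape is mapped to class, ggplot2 warns that its default shape palette handles at most six discrete values; class has seven, so one class is not drawn. One way around this, sketched below, is to supply the shapes yourself with scale_shape_manual(). The shape codes 0 to 6 are an arbitrary choice, and p_shapes is a made-up name.

```r
library(ggplot2)

g <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Seven hollow shapes (codes 0 to 6), one for each level of class
p_shapes <- g +
  geom_point(mapping = aes(shape = class)) +
  scale_shape_manual(values = 0:6)
p_shapes
```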

The visual cue is the aesthetic, and it is mapped to a variable inside aes(). You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue squares:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue", shape = 15)

Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes().
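A common slip is to put the manual setting inside aes(). The sketch below contrasts the two; the names p_set and p_mapped are made up for illustration.

```r
library(ggplot2)

# Outside aes(): color is set manually, so every point is drawn in blue
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a variable with a single value, so the
# points get the first default palette color and a legend appears
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

p_set
p_mapped
```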

Listing all available colors, shapes, line types, etc., is outside the scope of this course. Web references include Cookbook for R and Tian Zheng’s “Colors in R”. The ColorBrewer scales are also useful; they are documented online at http://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth. Because these details may change rapidly, I generally recommend googling “R shape codes”, “R color codes”, etc., for reference.
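As a small illustration of the ColorBrewer scales, ggplot2 provides scale_color_brewer(), which picks colors from a named ColorBrewer palette. The palette "Set2" below is an arbitrary choice for a categorical variable, and p_brewer is a made-up name.

```r
library(ggplot2)

# Color the points by class using the ColorBrewer palette "Set2"
p_brewer <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  scale_color_brewer(palette = "Set2")
p_brewer
```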

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

g <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
g + facet_wrap(~ class, nrow = 2)

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.

g + facet_grid(drv ~ cyl)

Note that I am not retyping ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)). The result is already stored in the R object g, so you can simply reuse it.

Finally, use facet_grid() with a . in place of one variable to facet into columns only (or rows only) based on drv:

g + facet_grid(. ~ drv)
g + facet_grid(drv ~ . )
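In ggplot2 3.0.0 and later, the same facets can also be written without a formula by passing the faceting variables through vars(). The sketch below assumes such a version is installed; the names p_cols and p_rows are made up for illustration.

```r
library(ggplot2)

g <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p_cols <- g + facet_grid(cols = vars(drv))   # same layout as facet_grid(. ~ drv)
p_rows <- g + facet_grid(rows = vars(drv))   # same layout as facet_grid(drv ~ .)

p_cols
p_rows
```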

ggplot2 continues in Lec 6.