Implementing the grammar of graphics using ggplot2
library(mdsr)
# library(tidyverse)
The textbook offers the mdsr package for R, which contains all of the data sets referenced in this book. In particular, loading mdsr also loads the mosaic package, which in turn loads dplyr and ggplot2. The mosaic package includes data sets and utilities from Project MOSAIC (http://mosaic-web.org) that are used to teach mathematics, statistics, computation and modeling. The packages dplyr and ggplot2 are part of the tidyverse.
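If mdsr is not installed, the plotting functions used below are also available by loading ggplot2 on its own. Here is a minimal sketch, assuming only that ggplot2 is installed; the variable name attached is purely illustrative. It checks that the package is attached to the current session.

```r
# Assumption: ggplot2 is installed; mdsr and mosaic are not needed for this check.
library(ggplot2)

# search() lists attached packages as "package:<name>"
attached <- "package:ggplot2" %in% search()
attached
```

If this prints TRUE, functions such as ggplot() and the mpg data set can be used without a package prefix.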
The tidyverse (https://www.tidyverse.org/) is an opinionated collection of R packages designed for data science, managed by a group of people including Hadley Wickham, statistician and chief scientist at RStudio, Inc. See the excerpt of the Tidyverse slides at Tidyverse Slide.
The ggplot2 package is the primary tool for data visualization, and implements the grammar of graphics described in the book “The Grammar of Graphics” by Leland Wilkinson, now chief scientist at h2o, Inc. The four elements of graphics identified by Yau (Visual Cues, Coordinate System, Scale and Context) are also found in the grammar of graphics, albeit under different names. Thus, it is essential to understand the taxonomy of graphics in order to use ggplot2.
ggplot2::mpg data example
We will follow the examples in R for Data Science. Let’s first look at the data set. mpg contains observations collected by the US Environmental Protection Agency on 38 models of car. mpg is a tibble, which is a simplified data.frame modified to handle large data better. For now, it is okay to think of a tibble as a data.frame.
class(mpg)
## [1] "tbl_df" "tbl" "data.frame"
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
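As a quick check that a tibble really can be treated as a data.frame, the sketch below runs a few base R functions on mpg. The name mpg_df is my own; converting with as.data.frame() is optional and only shown in case a plain data.frame is preferred.

```r
library(ggplot2)  # provides the mpg data set

is.data.frame(mpg)             # a tibble passes the data.frame check
dim(mpg)                       # number of rows and columns
mpg_df <- as.data.frame(mpg)   # drop the tibble classes, keep the data
class(mpg_df)
head(mpg_df, 3)                # first three rows, printed the base R way
```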
Among the variables in mpg are:
displ, a car’s engine size, in litres.
hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
To learn more about mpg, open its help page by running ?mpg.
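Before plotting, it can help to glance at the two variables numerically. This is a small sketch using base R only; the name two_vars is illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Keep only the two variables described above
two_vars <- mpg[, c("displ", "hwy")]

summary(two_vars)           # minimum, quartiles, mean, maximum of each
length(unique(mpg$model))   # number of distinct car models in the data
```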
To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph.
You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.
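To see the layering in action, here is a small sketch that builds a plot step by step and counts the layers at each stage. The object names p0, p1, p2 and layer_counts are illustrative, and geom_smooth() with method = "lm" is just one possible choice of a second geom.

```r
library(ggplot2)

p0 <- ggplot(data = mpg)                                   # empty graph, no layers
p1 <- p0 + geom_point(mapping = aes(x = displ, y = hwy))   # scatterplot layer
p2 <- p1 + geom_smooth(mapping = aes(x = displ, y = hwy),  # straight-line fit on top
                       method = "lm")

# Count the layers stored in each plot object
layer_counts <- sapply(list(p0, p1, p2), function(p) length(p$layers))
layer_counts
```

Typing p2 at the console draws the scatterplot with the fitted line over it.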
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.
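The mapping can be given either to the geom function or to ggplot() itself, in which case every layer inherits it. The sketch below builds the same scatterplot both ways; p_local and p_global are names I made up for illustration.

```r
library(ggplot2)

# Mapping supplied to the geom: it applies to this layer only
p_local <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Mapping supplied to ggplot(): it is inherited by every layer added later
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

names(p_local$layers[[1]]$mapping)   # aesthetics stored on the layer
names(p_global$mapping)              # aesthetics stored on the plot
```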
In connection with the four elements of data graphics:
ggplot() (by default) sets the coordinate system to the Cartesian coordinate system;
the visual cue is position, set by mapping = aes(x = ..., y = ...) and paired with the use of geom_point();
the scale is chosen automatically to suit the range of the data;
context is (minimally) given by the axis labels.
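Each of these defaults can also be written out explicitly. The following sketch spells out the four elements for the same scatterplot; the label text is my own wording, and p_explicit is an illustrative name.

```r
library(ggplot2)

p_explicit <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +   # visual cue: position
  coord_cartesian() +                               # coordinate system
  scale_x_continuous() +                            # scale for the x-axis
  scale_y_continuous() +                            # scale for the y-axis
  labs(x = "Engine displacement (litres)",          # context: axis labels
       y = "Highway fuel efficiency (mpg)")
```

Apart from the more descriptive axis labels, p_explicit should look the same as the earlier plot, since the added components are the defaults.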
To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
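As one way of filling in the template, the sketch below swaps in geom_boxplot() as the geom function and maps class to x and hwy to y. The choice of geom and variables is only an example, and p_box is an illustrative name.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_boxplot, <MAPPINGS> = x = class, y = hwy
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))
```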
You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.
You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Try mapping the class variable to other visual cues: size, shape, alpha (transparency), or fill (with shape = 22 set manually).
g <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
g + geom_point(mapping = aes(size = class))
g + geom_point(mapping = aes(shape = class))
g + geom_point(mapping = aes(alpha = class))
g + geom_point(mapping = aes(fill = class), shape = 22)
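One caveat when trying shape: by default ggplot2 uses at most six shapes for a discrete variable, while class has seven values, so one class is left unplotted and a warning is issued. A possible workaround, sketched below, is to supply the shapes yourself with scale_shape_manual(). The particular shape codes are an arbitrary choice of mine.

```r
library(ggplot2)

n_classes <- length(unique(mpg$class))      # number of distinct car classes
shape_codes <- c(15, 16, 17, 18, 3, 4, 8)   # arbitrary codes, one per class

p_shapes <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(shape = class)) +
  scale_shape_manual(values = shape_codes)  # replaces the six-shape default
```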
Each visual cue is an aesthetic, and it is mapped to a variable with aes(). You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue squares:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue", shape = 15)
Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes().
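A common slip is to put the manual setting inside aes(). The sketch below contrasts the two; p_set and p_mapped are illustrative names. In the second plot, "blue" is treated as a constant variable mapped to color, so the points are drawn in a default palette color rather than blue and a legend appears.

```r
library(ggplot2)

# Setting: color is a fixed property of the layer, outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping by mistake: "blue" becomes a one-level variable inside aes()
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

unlist(p_set$layers[[1]]$aes_params)   # fixed properties of the first layer
names(p_mapped$layers[[1]]$mapping)    # mapped aesthetics of the first layer
```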
Listing all available colors, shapes, linetypes, etc., is out of the scope of this course. Web references include Cookbook for R and Tian Zheng’s “Colors in R”. The ColorBrewer scales are also useful; they are documented online at http://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth. Since these things may change rapidly, I generally recommend googling “R shape codes”, “R color codes”, etc., for reference.
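R itself can also list the color names it knows, and ggplot2 can apply a ColorBrewer palette directly. The following is a small sketch; the palette name "Set2" is just one example, and p_brewer is an illustrative name.

```r
library(ggplot2)

head(colors())         # first few built-in color names (from grDevices)
length(colors())       # how many named colors are available
"blue" %in% colors()   # check whether a particular name is valid

# Apply a ColorBrewer palette to the class colors
p_brewer <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  scale_color_brewer(palette = "Set2")
```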
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.
g <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
g + facet_wrap(~ class, nrow = 2)
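To make the point about formulas concrete, the sketch below stores the formula in a variable, inspects it, and then passes it to facet_wrap(). The names f and p_wrap are illustrative.

```r
library(ggplot2)

f <- ~ class   # a one-sided formula object
class(f)       # the R class of this object
all.vars(f)    # the variable names it refers to

# The stored formula can be passed to facet_wrap() like a literal one
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(f, nrow = 2)
```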
To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.
g + facet_grid(drv ~ cyl)
Note that I am not retyping ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)). The result was already stored in the R object g, and you can simply reuse it.
Finally, use facet_grid() with a . in place of one variable to facet into columns only (or rows only) based on drv:
g + facet_grid(. ~ drv)
g + facet_grid(drv ~ . )
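Recent versions of ggplot2 also accept the rows and cols arguments with vars() in place of a formula, which some find easier to read. The sketch below is an alternative spelling of the facet_grid() calls above, assuming a ggplot2 version that supports this interface. The names g_cols, g_rows and g_both are illustrative.

```r
library(ggplot2)

g <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

g_cols <- g + facet_grid(cols = vars(drv))                     # like . ~ drv
g_rows <- g + facet_grid(rows = vars(drv))                     # like drv ~ .
g_both <- g + facet_grid(rows = vars(drv), cols = vars(cyl))   # like drv ~ cyl
```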