the grammar of graphics
a grammar of data manipulation
dplyr
This section explores the main functions in dplyr
which Hadley Wickham describes as a grammar of data manipulation—the counterpoint to his grammar of graphics in ggplot2
.
The github repo for dplyr
not only houses the R code, but also vignettes for various use cases. The introductory vignette is a good place to start and can by viewed by typing the following on the command line: vignette("dplyr", package = "dplyr")
or by opening the dplyr
file in the vignettes directory of the dplyr
repo. The material for this section is extracted from Hadley Wickham’s Introduction to dplyr Vignette, R for data science, and MDSR.
dplyr
was designed to:
dplyr
operates on data frames, but it also operates on tibbles, a trimmed-down version of a data frame (tbl_df
) that provides better checking and printing. Tibbles are particularly good for large data sets since they only print the first 10 rows and the first 7 columns by default although additional information is provided about the rows and columns.
We will use ggplot2::presidential
data frame.
library(dplyr)
library(ggplot2)
presidential
## # A tibble: 11 x 4
## name start end party
## <chr> <date> <date> <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy 1961-01-20 1963-11-22 Democratic
## 3 Johnson 1963-11-22 1969-01-20 Democratic
## 4 Nixon 1969-01-20 1974-08-09 Republican
## 5 Ford 1974-08-09 1977-01-20 Republican
## 6 Carter 1977-01-20 1981-01-20 Democratic
## 7 Reagan 1981-01-20 1989-01-20 Republican
## 8 Bush 1989-01-20 1993-01-20 Republican
## 9 Clinton 1993-01-20 2001-01-20 Democratic
## 10 Bush 2001-01-20 2009-01-20 Republican
## 11 Obama 2009-01-20 2017-01-20 Democratic
The variable names in presidential
are self explanatory, but note that presidential
does not print like a regular data frame. This is because it is a tibble, which is designed for data with a lot of rows and/or columns, i.e., big data. The print
function combines features of head
and str
. str
gives the inheritance path along with a summary of the data frame. For brevity we will use class()
to give the inheritance path:
class(presidential)
## [1] "tbl_df" "tbl" "data.frame"
See how traditional data frame prints out.
MASS::Boston
as_tibble(MASS::Boston)
dplyr
provides a suite of verbs for data manipulation:
filter()
: select rows (observations) in a data frame;arrange()
: reorder rows in a data frame;select()
: select columns (variables) in a data frame;mutate()
: add new columns to a data frame;summarise()
: collapses a data frame to a single row;See texbook figures 4.1–4.5 for a graphical illustration of these operations.
These can all be used in conjunction with group_by()
which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
Wickham’s approach is inspired by his desire to blur the boundaries between R and the ubiquitous relational database querying syntax SQL. In fact, the five verbs, when combined, exhibit a close similarity to SQL query statements (at least for data analysis purpose). Thus, mastering dplyr
data wrangling verbs have become a gateway to analyzing big data, through relational database management system and beyond. The real power of dplyr
is that it abstracts the data source, i.e., whether it is a data frame, a database, or Spark.
In fact, the statistical package “SAS” have always had a powerful “data step” that does about the same thing, since 1970s.
dplyr
also includes the powerful workflow operator called “pipes”, found in e.g. Unix shell script. We will see the pipe in the first example below.
presidential
To retrieve only the names and party affiliations of these presidents, we would use select()
. The first argument to the select()
function is the data frame, followed by an arbitrarily long list of column names, separated by commas.
select(presidential, name, party)
To retrive only the Repulbican presidents, we use filter()
. The first argument to filter()
is a data frame, and subsequent arguments are logical conditions that are evaluated on any involved columns.
filter(presidential, party == "Republican")
Naturally, combining the filter()
and select()
commands enables one to drill down to very specific pieces of information. For example, we can find which Democratic presidents served since Watergate.
select(filter(presidential, start > 1973 & party == "Democratic"), name)
In the syntax demonstrated above, the filter()
operation is nested inside the select()
operation. Each of the five verbs takes and returns a data frame, which makes this type of nesting possible. These long expressions become very difficult to read. Instead, we recommend the use of the %>%
(pipe) operator.
presidential %>%
filter(start > 1973 & party == "Democratic") %>%
select(name)
Notice how the expression dataframe %>% filter(condition)
is equivalent to filter(dataframe, condition)
.
The above pipeline reads
Take
presidential
data frame, then filter the Democrate presidents whose start year is greater than 1973. Then select the variablename
.
Frequently, in the process of conducting our analysis, we will create, re-define, and rename some of our variables. The functions mutate()
and rename()
provide these capabilities.
While we have the raw data on when each of these presidents took and relinquished office, we don’t actually have a numeric variable giving the length of each president’s term.
mypresidents <- presidential %>%
mutate(term_length = end - start)
head(mypresidents,2)
## # A tibble: 2 x 5
## name start end party term_length
## <chr> <date> <date> <chr> <time>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
# textbook should have used mutate(term.length = interval(start, end) / dyears(1))
In this situation, it is generally considered good style to create a new object rather than clobbering the one that comes from an external source. To preserve the existing presidential data frame, we save the result of mutate()
as a new object called mypresidents.
The function sort() will sort a vector, but not a data frame. The function that will sort a data frame is called arrange().
To sort our presidential data frame by the length of each president’s term, we specify that we want the column term_length
in descending order.
mypresidents %>% arrange(desc(term_length))
A number of presidents completed either one or two full terms, and thus have the exact same term length (4 or 8 years, respectively). To break these ties, we can further sort by start
.
mypresidents %>% arrange(desc(term_length), start)
Our last of the five verbs for single-table analysis is summarize()
, which is nearly always used in conjunction with group_by()
. The previous four verbs provided us with means to manipulate a data frame in powerful and flexible ways. But the extent of the analysis we can perform with these four verbs alone is limited. On the other hand, summarize()
with group_by()
enables us to make comparisons.
When used alone, summarize()
collapses a data frame into a single row. We have to specify how we want to reduce an entire column of data into a single value.
mypresidents %>%
summarize(
N = n(),
first_year = min(year(start)),
last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
days = sum(term_length) / 365.25,
avg_term_length = mean(term_length)
)
In this example, the function n()
simply counts the number of rows. This is almost always useful information. The next variable determines the first year that one of these presidents assumed office. This is the smallest year in the start
column. The variable num_dems
simply counts the number of rows in which the value of the party
variable was "Democratic"
.
This begs the question of whether Democratic or Republican presidents served a longer average term during this time period. To figure this out, we can just execute summarize()
again, but this time, instead of the first argument being the data frame mypresidents, we will specify that the rows of the mypresidents
data frame should be grouped by the values of the party
variable. In this manner, the same computations as above will be carried out for each party separately.
mypresidents %>%
group_by(party) %>%
summarize(
N = n(),
avg_term_length = mean(term_length)
)
# Compare the intermediate data.frame group_by(mypresidents,party) with mypresidents
The pipe, %>%, comes from the magrittr
package by Stefan Milton Bache. Packages in the tidyverse
load %>% for you automatically.
Keyboard shortcut to type %>%
is
filter()
The first argument of the function filter()
is the data set (usually supplied through pipes).
The second argument of filter()
is a logical vector: i.e. a vector consisting of TRUE
and FALSE
. Only rows where the conditon evalutes to TRUE
are kept.
The logical vector is created by comparing one or more variables.
Basic logical operators are >
, >=
, <
, <=
, !=
(not equal), and ==
(equal).
For set comparison, use x %in% Y
, which is true is x
is an element of the set Y
.
When you combine two or more comparions, use Boolean operators: &
(and), |
(or), !
(not),
Suppose that x
is a variable with four observations. What is the resulting logical vector?
x <- c(2,1,3,0)
x == 0
!(x == 0)
x == 0 | x == 1
x %in% c(0,1)
One important feature of R that can make comparison tricky are missing values, or NA
s (“not availables”).
NA
represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
All of the following operations return NA
x <- NA
x > 5
x + 10
x == NA
x == x
To check whether elements of x
is NA
, use is.na(x)
.
For example, to filter out all observations with missing values:
<DATA_FRAME> %>% filter(!is.na(<VARIABLE>)
The presidential
data set has only four variables, so selecting variables makes a little sense. To select variables for data sets with a large number of variables, there are a few handy options.
To demonstrate, load nycflights13::flights
data set. There are 19 variables (not terribly large).
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Selecting a few varibles is easy: select(flights, year, month, day)
To select all variables from variable year
to variable arr_time
: select(flights, year:arr_time)
(so you don’t have to type all variable names)
To select all variables except year, month, day
: select(flights, -(year:day))
There are a number of helper functions you can use within select()
:
starts_with("abc")
: matches names that begin with “abc”.
ends_with("xyz")
: matches names that end with “xyz”.
contains("ijk")
: matches names that contain “ijk”.
num_range("x", 1:3)
matches x1, x2 and x3.