STAT 1291: Data Science

Where are we?

the grammar of graphics
a grammar of data manipulation

Data manipulation with `dplyr`

This section explores the main functions in dplyr which Hadley Wickham describes as a grammar of data manipulation—the counterpoint to his grammar of graphics in ggplot2.

The github repo for dplyr not only houses the R code, but also vignettes for various use cases. The introductory vignette is a good place to start and can by viewed by typing the following on the command line: vignette("dplyr", package = "dplyr") or by opening the dplyr file in the vignettes directory of the dplyr repo. The material for this section is extracted from Hadley Wickham’s Introduction to dplyr Vignette, R for data science, and MDSR.

dplyr was designed to:

provide commonly used data manipulation tools;
have fast performance for in-memory operations;
abstract the interface between the data manipulation operations and the data source.

dplyr operates on data frames, but it also operates on tibbles, a trimmed-down version of a data frame (tbl_df) that provides better checking and printing. Tibbles are particularly good for large data sets since they only print the first 10 rows and the first 7 columns by default although additional information is provided about the rows and columns.

We will use ggplot2::presidential data frame.

library(dplyr)
library(ggplot2) 
presidential

## # A tibble: 11 x 4
##          name      start        end      party
##         <chr>     <date>     <date>      <chr>
##  1 Eisenhower 1953-01-20 1961-01-20 Republican
##  2    Kennedy 1961-01-20 1963-11-22 Democratic
##  3    Johnson 1963-11-22 1969-01-20 Democratic
##  4      Nixon 1969-01-20 1974-08-09 Republican
##  5       Ford 1974-08-09 1977-01-20 Republican
##  6     Carter 1977-01-20 1981-01-20 Democratic
##  7     Reagan 1981-01-20 1989-01-20 Republican
##  8       Bush 1989-01-20 1993-01-20 Republican
##  9    Clinton 1993-01-20 2001-01-20 Democratic
## 10       Bush 2001-01-20 2009-01-20 Republican
## 11      Obama 2009-01-20 2017-01-20 Democratic

The variable names in presidential are self explanatory, but note that presidential does not print like a regular data frame. This is because it is a tibble, which is designed for data with a lot of rows and/or columns, i.e., big data. The print function combines features of head and str. str gives the inheritance path along with a summary of the data frame. For brevity we will use class() to give the inheritance path:

class(presidential)

## [1] "tbl_df"     "tbl"        "data.frame"

See how traditional data frame prints out.

MASS::Boston
as_tibble(MASS::Boston)

Single Table Verbs

dplyr provides a suite of verbs for data manipulation:

filter(): select rows (observations) in a data frame;
arrange(): reorder rows in a data frame;
select(): select columns (variables) in a data frame;
mutate(): add new columns to a data frame;
summarise(): collapses a data frame to a single row;

See texbook figures 4.1–4.5 for a graphical illustration of these operations.

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:

The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.

A bit of side story

Wickham’s approach is inspired by his desire to blur the boundaries between R and the ubiquitous relational database querying syntax SQL. In fact, the five verbs, when combined, exhibit a close similarity to SQL query statements (at least for data analysis purpose). Thus, mastering dplyr data wrangling verbs have become a gateway to analyzing big data, through relational database management system and beyond. The real power of dplyr is that it abstracts the data source, i.e., whether it is a data frame, a database, or Spark.

In fact, the statistical package “SAS” have always had a powerful “data step” that does about the same thing, since 1970s.

dplyr also includes the powerful workflow operator called “pipes”, found in e.g. Unix shell script. We will see the pipe in the first example below.

First example

select variables and filter rows

presidential

To retrieve only the names and party affiliations of these presidents, we would use select(). The first argument to the select() function is the data frame, followed by an arbitrarily long list of column names, separated by commas.

select(presidential, name, party)

To retrive only the Repulbican presidents, we use filter(). The first argument to filter() is a data frame, and subsequent arguments are logical conditions that are evaluated on any involved columns.

filter(presidential, party == "Republican")

Naturally, combining the filter() and select() commands enables one to drill down to very specific pieces of information. For example, we can find which Democratic presidents served since Watergate.

select(filter(presidential, start > 1973 & party == "Democratic"), name)

In the syntax demonstrated above, the filter() operation is nested inside the select() operation. Each of the five verbs takes and returns a data frame, which makes this type of nesting possible. These long expressions become very difficult to read. Instead, we recommend the use of the %>% (pipe) operator.

presidential %>%
  filter(start > 1973 & party == "Democratic") %>%
  select(name)

Notice how the expression dataframe %>% filter(condition) is equivalent to filter(dataframe, condition).

The above pipeline reads

Take presidential data frame, then filter the Democrate presidents whose start year is greater than 1973. Then select the variable name.

mutate variables to create new ones

Frequently, in the process of conducting our analysis, we will create, re-define, and rename some of our variables. The functions mutate() and rename() provide these capabilities.

While we have the raw data on when each of these presidents took and relinquished office, we don’t actually have a numeric variable giving the length of each president’s term.

mypresidents <- presidential %>%
  mutate(term_length = end - start) 
head(mypresidents,2)

## # A tibble: 2 x 5
##         name      start        end      party term_length
##        <chr>     <date>     <date>      <chr>      <time>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican   2922 days
## 2    Kennedy 1961-01-20 1963-11-22 Democratic   1036 days

# textbook should have used mutate(term.length = interval(start, end) / dyears(1))

In this situation, it is generally considered good style to create a new object rather than clobbering the one that comes from an external source. To preserve the existing presidential data frame, we save the result of mutate() as a new object called mypresidents.

arrange rows

The function sort() will sort a vector, but not a data frame. The function that will sort a data frame is called arrange().

To sort our presidential data frame by the length of each president’s term, we specify that we want the column term_length in descending order.

mypresidents %>% arrange(desc(term_length))

A number of presidents completed either one or two full terms, and thus have the exact same term length (4 or 8 years, respectively). To break these ties, we can further sort by start.

mypresidents %>% arrange(desc(term_length), start)

summarize entire data set or for each group

Our last of the five verbs for single-table analysis is summarize(), which is nearly always used in conjunction with group_by(). The previous four verbs provided us with means to manipulate a data frame in powerful and flexible ways. But the extent of the analysis we can perform with these four verbs alone is limited. On the other hand, summarize() with group_by() enables us to make comparisons.

When used alone, summarize() collapses a data frame into a single row. We have to specify how we want to reduce an entire column of data into a single value.

mypresidents %>%
  summarize(
    N = n(), 
    first_year = min(year(start)), 
    last_year = max(year(end)),
    num_dems = sum(party == "Democratic"),
    days = sum(term_length) / 365.25,
    avg_term_length = mean(term_length)
    )

In this example, the function n() simply counts the number of rows. This is almost always useful information. The next variable determines the first year that one of these presidents assumed office. This is the smallest year in the start column. The variable num_dems simply counts the number of rows in which the value of the party variable was "Democratic".

This begs the question of whether Democratic or Republican presidents served a longer average term during this time period. To figure this out, we can just execute summarize() again, but this time, instead of the first argument being the data frame mypresidents, we will specify that the rows of the mypresidents data frame should be grouped by the values of the party variable. In this manner, the same computations as above will be carried out for each party separately.

mypresidents %>%
  group_by(party) %>% 
  summarize(
    N = n(), 
    avg_term_length = mean(term_length)
    )
# Compare the intermediate data.frame group_by(mypresidents,party) with mypresidents

The pipe

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. Packages in the tidyverse load %>% for you automatically.

Keyboard shortcut to type %>% is

Cmd + Shift + M (Mac)
Ctrl + Shift + M (Windows)

Supplement

Comparisons for `filter()`

The first argument of the function filter() is the data set (usually supplied through pipes).

The second argument of filter() is a logical vector: i.e. a vector consisting of TRUE and FALSE. Only rows where the conditon evalutes to TRUE are kept.

The logical vector is created by comparing one or more variables.

Basic logical operators are >, >=, <, <=, != (not equal), and == (equal).
For set comparison, use x %in% Y, which is true is x is an element of the set Y.
When you combine two or more comparions, use Boolean operators: & (and), | (or), ! (not),

Suppose that x is a variable with four observations. What is the resulting logical vector?

x <- c(2,1,3,0)
x == 0 
!(x == 0)
x == 0 | x == 1 
x %in% c(0,1)

Handling missing values

One important feature of R that can make comparison tricky are missing values, or NAs (“not availables”).

NA represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

All of the following operations return NA

x <- NA
x > 5
x + 10 
x == NA  
x == x

To check whether elements of x is NA , use is.na(x).

For example, to filter out all observations with missing values:

<DATA_FRAME> %>% filter(!is.na(<VARIABLE>)

Select many variables

The presidential data set has only four variables, so selecting variables makes a little sense. To select variables for data sets with a large number of variables, there are a few handy options.

To demonstrate, load nycflights13::flights data set. There are 19 variables (not terribly large).

library(nycflights13)
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Selecting a few varibles is easy: select(flights, year, month, day)
To select all variables from variable year to variable arr_time: select(flights, year:arr_time) (so you don’t have to type all variable names)
To select all variables except year, month, day: select(flights, -(year:day))
There are a number of helper functions you can use within select():
- starts_with("abc"): matches names that begin with “abc”.
- ends_with("xyz"): matches names that end with “xyz”.
- contains("ijk"): matches names that contain “ijk”.
- num_range("x", 1:3) matches x1, x2 and x3.

STAT 1291: Data Science

Lecture 9 - Data Wrangling: one table (1/2)

Sungkyu Jung

Where are we?

Data manipulation with `dplyr`

Single Table Verbs

A bit of side story

First example

select variables and filter rows

mutate variables to create new ones

arrange rows

summarize entire data set or for each group

The pipe

Supplement

Comparisons for `filter()`

Handling missing values

Select many variables

STAT 1291: Data Science

Lecture 9 - Data Wrangling: one table (1/2)

Sungkyu Jung

Where are we?

Data manipulation with dplyr

Single Table Verbs

A bit of side story

First example

select variables and filter rows

mutate variables to create new ones

arrange rows

summarize entire data set or for each group

The pipe

Supplement

Comparisons for filter()

Handling missing values

Select many variables

Data manipulation with `dplyr`

Comparisons for `filter()`