%>%
: “then”dplyr
In the previous lectures, we illustrated how the five verbs can be chained to perform operations on a single table. This single table is reminiscent of a single well-organized spreadsheet. But in the same way that a workbook can contain multiple spreadsheets, we will often work with multiple tables.
Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important. Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases.
“Big Data” often involves storing really big pieces of information, fast processing of data and computation-intensive statistical learning. It requires large storage, large memory and parallel computing. In almost all instances, it involves a database, because:
The real power of dplyr
is that it abstracts the data source, i.e., whether it is a data frame, a database, or a Spark database (a “Lightning-fast cluster computing” platform) or multidimensional arrays.
Databases: Currently dplyr
supports the three most popular open source databases (sqlite
, mysql
and postgresql
), and Google’s bigquery
.
Spark: The sparklyr
package is the basis for data manipulation and machine learning based on a data frame workflow. This approach has limitations, e.g., with graph algorithms, but it covers most use cases. The rsparkling
package with its support for h2o
delves even deeper into machine learning, e.g., deep learning. An alternative approach, officially supported by Spark, is the SparkR
package.
Data cubes: tbl_cube()
provides an experimental interface to multidimensional arrays or data cubes. Potentially this could be used for deep learning algorithms, e.g., see TensorFlow.
In dplyr
, there are three families of verbs that work with two tables at a time:
This discussion assumes that you have tidy data, where the rows are observations and the columns are variables. We will primarily discuss mutating joins, which are used most often.
nycflights13
datalibrary(tidyverse)
library(nycflights13)
The package contains Airline on-time data for all flights departing NYC in 2013 (in flights
). Also includes useful ‘metadata’ on airlines
, airports
, weather
, and planes
.
Take a look at (a part of) flights
data.
flights %>%
filter(month == 1 & day == 1, abs(dep_delay) > 30) %>%
select(dep_time,arr_time,carrier:dest)
## # A tibble: 106 x 7
## dep_time arr_time carrier flight tailnum origin dest
## <int> <int> <chr> <int> <chr> <chr> <chr>
## 1 732 1011 UA 1111 N37456 EWR MCO
## 2 749 939 MQ 3737 N508MQ EWR ORD
## 3 811 1047 MQ 4576 N531MQ LGA CLT
## 4 826 1136 AA 443 N3GVAA JFK MIA
## 5 848 1001 MQ 3944 N942MQ JFK BWI
## 6 903 1045 MQ 4655 N532MQ LGA BNA
## 7 909 1331 AA 655 N5EXAA JFK STT
## 8 953 1320 UA 222 N586UA EWR LAX
## 9 957 1056 UA 856 N534UA EWR BOS
## 10 1025 1258 UA 501 N437UA EWR MCO
## # ... with 96 more rows
It is difficult to read the tabel because it includes lots of “codes”. To decipher, we need a codebook, or metadata:
airlines
lets you look up the full carrier name from its abbreviated codeairports
gives information about each airport, identified by the faa
airport codeplanes
gives information about each plane, identified by its tailnum
weather
gives the weather at each NYC airport for each hourhead(airlines,3)
## # A tibble: 3 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
Looking up the airlines
codebook, we find that carrier AA
stands for American Airlines Inc. This is possible because a unique identifier, the variable carrier
appearsin both data tables. Unique identifiers are called keys.
One way to show the relationships between the different tables is with a drawing:
Different pairs of tables have different keys. For nycflights13
:
flights
connects to airlines
through the carrier
variable.
flights
connects to planes
via a single variable, tailnum
.
flights
connects to airports
in two ways: via the origin
and dest
variables.
flights
connects to weather
via origin
(the location), and year
, month
, day
and hour
(the time).
Mutating joins allow you to combine variables from multiple tables. For example, take the nycflights13
data. In one table we have flight information with an abbreviation for carrier, and in another we have a mapping between abbreviations and full names. You can use a join to add the carrier names to the flight data:
library("nycflights13")
# Drop unimportant variables so it's easier to understand the join results.
flights2 <- flights %>% select(year:day, hour, origin, dest,
tailnum, carrier)
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
flights2 %>%
left_join(airlines)
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows, and 1 more variables: name <chr>
In addition to x
and y
, each mutating join takes an argument by
that controls which variables are used to match observations in the two tables. There are several ways to specify it.
NULL
, the default. dplyr
will will use all variables that appear in both tables, a natural join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin.weather
## # A tibble: 26,130 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## 3 EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858
## 4 EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936
## 5 EWR 2013 1 1 4 37.94 24.08 57.04 240 14.96014
## 6 EWR 2013 1 1 6 39.02 26.06 59.37 270 10.35702
## 7 EWR 2013 1 1 7 39.02 26.96 61.63 250 8.05546
## 8 EWR 2013 1 1 8 39.02 28.04 64.43 240 11.50780
## 9 EWR 2013 1 1 9 39.92 28.04 62.21 250 12.65858
## 10 EWR 2013 1 1 10 39.02 28.04 64.43 260 12.65858
## # ... with 26,120 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
flights2 %>% left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA NA NA NA
## 2 2013 1 1 5 LGA IAH N24211 UA NA NA NA
## 3 2013 1 1 5 JFK MIA N619AA AA NA NA NA
## 4 2013 1 1 5 JFK BQN N804JB B6 NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL 39.92 26.06 57.33
## 6 2013 1 1 5 EWR ORD N39463 UA NA NA NA
## 7 2013 1 1 6 EWR FLL N516JB B6 39.02 26.06 59.37
## 8 2013 1 1 6 LGA IAD N829AS EV 39.92 26.06 57.33
## 9 2013 1 1 6 JFK MCO N593JB B6 39.02 26.06 59.37
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.92 26.06 57.33
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
by = "x"
. Like a natural join, but uses only some of the common variables. For example, flights and planes have year columns, but they mean different things so we only want to join by tailnum
.flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999
## 2 2013 1 1 5 LGA IAH N24211 UA 1998
## 3 2013 1 1 5 JFK MIA N619AA AA 1990
## 4 2013 1 1 5 JFK BQN N804JB B6 2012
## 5 2013 1 1 6 LGA ATL N668DN DL 1991
## 6 2013 1 1 5 EWR ORD N39463 UA 2012
## 7 2013 1 1 6 EWR FLL N516JB B6 2000
## 8 2013 1 1 6 LGA IAD N829AS EV 1998
## 9 2013 1 1 6 JFK MCO N593JB B6 2004
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA
## # ... with 336,766 more rows, and 7 more variables: type <chr>,
## # manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## # speed <int>, engine <chr>
Note that the year columns in the output are disambiguated with a suffix.
by = c("x" = "a")
. This will match variable x
in table x
to variable a
in table b
. The variables from use will be used in the output.Each flight has an origin and destination airport, so we need to specify which one we want to join to:
flights2 %>% left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows, and 7 more variables: name <chr>, lat <dbl>,
## # lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>
flights2 %>% left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows, and 7 more variables: name <chr>, lat <dbl>,
## # lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>
There are four types of mutating join, which differ in their behavior when a match is not found. We’ll illustrate each with a simple example:
(df1 <- data_frame(x = c(1, 2), y = 2:1))
## # A tibble: 2 x 2
## x y
## <dbl> <int>
## 1 1 2
## 2 2 1
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
## # A tibble: 2 x 3
## x a b
## <dbl> <dbl> <chr>
## 1 1 10 a
## 2 3 10 a
inner_join(x, y)
only includes observations that match in both x
and y
.
df1 %>% inner_join(df2) %>% knitr::kable()
## Joining, by = "x"
x | y | a | b |
---|---|---|---|
1 | 2 | 10 | a |
left_join(x, y)
includes all observations in x
, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.
df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
right_join(x, y)
includes all observations in y
. It’s equivalent to left_join(y, x)
, but the columns will be ordered differently.
df1 %>% right_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 3 NA 10 a
df2 %>% left_join(df1)
## Joining, by = "x"
## # A tibble: 2 x 4
## x a b y
## <dbl> <dbl> <chr> <int>
## 1 1 10 a 2
## 2 3 10 a NA
full_join()
includes all observations from x
and y
.
df1 %>% full_join(df2)
## Joining, by = "x"
## # A tibble: 3 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
## 3 3 NA 10 a
The left, right and full joins are collectively know as outer joins. When a row doesn’t match in an outer join, the new variables are filled in with missing values.
While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:
df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 5 x 3
## x y z
## <dbl> <int> <chr>
## 1 1 1 a
## 2 1 1 b
## 3 1 2 a
## 4 1 2 b
## 5 2 3 a
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13
dataset that don’t have a matching tail number in the planes table:
library("nycflights13")
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
If you’re worried about what observations your joins will match, start with a semi_join()
or anti_join()
. semi_join()
and anti_join()
never duplicate; they only remove observations.
df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
# Four rows to start with:
df1 %>% nrow()
## [1] 4
# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()
## [1] 4
df1 %>% inner_join(df2, by = "x")
## # A tibble: 4 x 3
## x y z
## <dbl> <int> <chr>
## 1 1 1 a
## 2 1 1 b
## 3 1 2 a
## 4 1 2 b
# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()
## [1] 2
df1 %>% semi_join(df2, by = "x")
## # A tibble: 2 x 2
## x y
## <dbl> <int>
## 1 1 1
## 2 1 2
The final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:
intersect(x, y)
: return only observations in both x
and y
union(x, y)
: return unique observations in x
and y
setdiff(x, y)
: return observations in x
, but not in y
.Given this simple data:
df1 <- data_frame(x = 1:2, y = c(1L, 1L))
df2 <- data_frame(x = 1:2, y = 1:2)
The four possibilities are:
intersect(df1, df2)
## # A tibble: 1 x 2
## x y
## <int> <int>
## 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
## # A tibble: 3 x 2
## x y
## <int> <int>
## 1 1 1
## 2 2 1
## 3 2 2
setdiff(df1, df2)
## # A tibble: 1 x 2
## x y
## <int> <int>
## 1 2 1
setdiff(df2, df1)
## # A tibble: 1 x 2
## x y
## <int> <int>
## 1 2 2
Each two-table verb has a straightforward SQL equivalent. The correspondences between R and SQL are:
inner_join()
: SELECT * FROM x JOIN y ON x.a = y.a
left_join()
: SELECT * FROM x LEFT JOIN y ON x.a = y.a
right_join()
: SELECT * FROM x RIGHT JOIN y ON x.a = y.a
full_join()
: SELECT * FROM x FULL JOIN y ON x.a = y.a
semi_join()
: SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
anti_join()
: SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
intersect(x, y)
: SELECT * FROM x INTERSECT SELECT * FROM y
union(x, y)
: SELECT * FROM x UNION SELECT * FROM y
setdiff(x, y)
: SELECT * FROM x EXCEPT SELECT * FROM y
x
and y
don’t have to be tables in the same database. If you specify copy = TRUE
, dplyr
will copy the y
table into the same location as the x
variable. This is useful if you’ve downloaded a summarized dataset and determined a subset for which you now want the full data.
You should review the coercion rules, e.g., factors are preserved only if the levels match exactly and if their levels are different the factors are coerced to character.
At this time, dplyr
does not provide any functions for working with three or more tables.