Where are we?

What are tidy data?

Data can be as simple as a column of numbers in a spreadsheet file or as complex as the electronic medical records collected by a hospital. You need a small set of standard tools to prepare data sets in a way that facilitates analysis.

Examples

library(tidyverse)

Are the following examples tidy?

1. Table 5.3, MDSR—results from the Minneapolis mayoral election.

Table 5.3


It is not in tidy form, though the display is attractive and neatly laid out. Table 5.3 violates the first rule for tidy data.

  1. Rule 1: The rows, called cases, each must represent the same underlying attribute, that is, the same kind of thing. That’s not true in Table 5.3. For most of the table, the rows represent a single precinct. But other rows give ward or city-wide totals. The first two rows are captions describing the data, not cases.
  2. Rule 2: Each column is a variable containing the same type of value for each case.

To make the data tidy, certain rows of the spreadsheet need to be removed.
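As a rough sketch of what removing such rows can look like, the block below builds a small made-up table with the same flaw (precinct rows mixed with subtotal and total rows) and filters it. The table `raw_tbl`, its column names and all of its values are hypothetical; they are not taken from Table 5.3.

```r
library(dplyr)

# Hypothetical miniature of a spreadsheet-style table: precinct rows mixed with
# a ward subtotal and a city-wide total. All names and values are made up.
raw_tbl <- tibble(
  ward     = c("1", "1", "1 Subtotal", "2", "City Total"),
  precinct = c("P-01", "P-02", NA, "P-01", NA),
  ballots  = c(100, 150, 250, 200, 450)
)

# Keep only the rows that describe a single precinct, so that every case
# is the same kind of thing (Rule 1).
tidy_tbl <- raw_tbl %>%
  filter(!is.na(precinct))

tidy_tbl
```

Totals are not lost by dropping them: they can be recomputed from the precinct rows with `group_by()` and `summarise()` whenever they are needed.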

2. Table 5.6, MDSR—runners’ performance over time in a 10-mile race.

Table 5.6


Is the dataset tidy? The answer depends on which analysis you have in mind. If your answer is no, you would need to separate the two pieces of information in the first column.

What is the meaning of a case here? It is tempting to think that a case is a person. After all, it is people who run road races. But notice that individuals appear more than once: Jane Poole ran each year from 2003 to 2007. This suggests that a case is a runner in one year's race, identified by the combination of name.yob and year.
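The following is a minimal sketch of both ideas, using tidyr's `separate()` and `unite()`. Only the column names `name.yob` and `year` come from the text above; the rows, the `minutes` column and the underscore separator are assumptions made for illustration, not the contents of Table 5.6.

```r
library(dplyr)
library(tidyr)

# Made-up rows: one row per runner per year, with name and year of birth
# packed into a single column (separator "_" is an assumption).
runs <- tibble(
  name.yob = c("Runner A_1970", "Runner A_1970", "Runner B_1985"),
  year     = c(2003, 2004, 2003),
  minutes  = c(92.5, 90.1, 78.4)
)

# Split the first column into its two pieces of information.
runners <- runs %>%
  separate(name.yob, into = c("name", "yob"), sep = "_", convert = TRUE)

# Check that (name, yob, year) identifies a case: no combination repeats.
dupes <- runners %>%
  count(name, yob, year) %>%
  filter(n > 1)

# unite() is the inverse: paste the two columns back into one.
reunited <- runners %>%
  unite("name.yob", name, yob, sep = "_")
```

Here `dupes` has zero rows, which is consistent with treating a case as one runner in one year's race.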

3. UCBAdmissions data

UCBAdmissions contains aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and sex. The data are well organized, but the object is a three-dimensional table (array), not a data.frame.

data("UCBAdmissions")
UCBAdmissions[,,1:2]
## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8

Is the following better?

(narrow_tbl <- as_tibble(UCBAdmissions))
## # A tibble: 24 x 4
##       Admit Gender  Dept     n
##       <chr>  <chr> <chr> <dbl>
##  1 Admitted   Male     A   512
##  2 Rejected   Male     A   313
##  3 Admitted Female     A    89
##  4 Rejected Female     A    19
##  5 Admitted   Male     B   353
##  6 Rejected   Male     B   207
##  7 Admitted Female     B    17
##  8 Rejected Female     B     8
##  9 Admitted   Male     C   120
## 10 Rejected   Male     C   205
## # ... with 14 more rows

Compare the first four rows of narrow_tbl (above) with the following slice of the original data set.

(wide_tbl <- UCBAdmissions[,,1])
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19

wide_tbl is better for viewing. But is it tidy? Is narrow_tbl tidy?

We will use the R package tidyr (loaded as part of the tidyverse) to transform a wide table into a narrow table (and vice versa), and to separate a column into two (and the inverse), using the following verbs.

  • gather(), spread(), separate() and unite().
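As a hedged illustration of the first two verbs, the sketch below reshapes the Dept A slice of UCBAdmissions from narrow to wide and back. The object names `wide_again` and `narrow_again` are made up for this example. Note that recent tidyr releases mark `gather()` and `spread()` as superseded by `pivot_longer()` and `pivot_wider()`, although the older verbs remain available.

```r
library(dplyr)
library(tidyr)

data("UCBAdmissions")
narrow_tbl <- as_tibble(UCBAdmissions)

# Narrow -> wide: one column per value of Gender, filled with the counts in n.
wide_again <- narrow_tbl %>%
  filter(Dept == "A") %>%
  spread(key = Gender, value = n)

# Wide -> narrow: stack the Female and Male columns back into
# a key column (Gender) and a value column (n).
narrow_again <- wide_again %>%
  gather(key = "Gender", value = "n", Female, Male)

wide_again
narrow_again
```

The wide result has one row per admission outcome, which matches the layout of `UCBAdmissions[,,1]`; the narrow result has one row per combination of outcome and gender.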

tidyr verbs to make data suitable for use with software

We will use the slides Data Wrangling with R by Garrett Grolemund, pp 9–77, to see how the four verbs work for data sets.