Data can be as simple as a column of numbers in a spreadsheet file or as complex as the electronic medical records collected by a hospital. To prepare such data sets for analysis, you need a small set of standard tools.
library(tidyverse)
Are the following examples tidy?
It is not in tidy form, though the display is attractive and neatly laid out. Table 5.3 violates the first rule for tidy data.
To make the data tidy, certain rows of the spreadsheet need to be removed.
Is the dataset tidy? The answer depends on which analysis you have in mind. If your answer is no, then you would need to separate the two pieces of information in the first column.
What is the meaning of a case here? It is tempting to think that a case is a person. After all, it is people who run road races. But notice that individuals appear more than once: Jane Poole ran each year from 2003 to 2007. This suggests that a case is a runner in one year’s race, identified by the combination of name.yob and year.
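One way to test a candidate definition of a case is to check whether it identifies each row uniquely. The sketch below uses a small made-up table (the runner labels are hypothetical, not the race results discussed above) and assumes only the column names name.yob and year from the text.

```r
library(tidyverse)

# Hypothetical rows for illustration only (not the actual race results).
runners <- tibble(
  name.yob = c("runner a 1970", "runner a 1970", "runner b 1982"),
  year     = c(2003, 2004, 2003)
)

# name.yob alone repeats, so a person cannot be the case.
person_is_key <- n_distinct(runners$name.yob) == nrow(runners)

# Count rows per (name.yob, year); any count above 1 would be a duplicated case.
dups <- runners %>%
  count(name.yob, year) %>%
  filter(n > 1)

person_is_key  # FALSE for this toy table
nrow(dups)     # 0 for this toy table
```

If dups has no rows, the pair (name.yob, year) is a workable identifier for a case in that table.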
UCBAdmissions data
UCBAdmissions contains aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and sex. The data are well organized, but the object is a three-dimensional table (array), not a data.frame.
data("UCBAdmissions")
UCBAdmissions[,,1:2]
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
Is the following better?
(narrow_tbl <- as_tibble(UCBAdmissions))
## # A tibble: 24 x 4
## Admit Gender Dept n
## <chr> <chr> <chr> <dbl>
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
## 7 Admitted Female B 17
## 8 Rejected Female B 8
## 9 Admitted Male C 120
## 10 Rejected Male C 205
## # ... with 14 more rows
Compare the first four rows of narrow_tbl (above) with the following slice of the original data set.
(wide_tbl <- UCBAdmissions[,,1])
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
wide_tbl is better for viewing. But is it tidy? Is narrow_tbl tidy?
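A few quick checks can inform these questions without settling them. The sketch below rebuilds both objects as defined above and tests whether each is a data frame and whether narrow_tbl has exactly one row per combination of Admit, Gender and Dept.

```r
library(tidyverse)

data("UCBAdmissions")
narrow_tbl <- as_tibble(UCBAdmissions)
wide_tbl   <- UCBAdmissions[, , 1]

# wide_tbl is a two-way slice of the array, not a data frame.
wide_is_df <- is.data.frame(wide_tbl)

# narrow_tbl is a data frame with one row per cell of the original array.
narrow_is_df   <- is.data.frame(narrow_tbl)
one_row_a_cell <- nrow(narrow_tbl) == prod(dim(UCBAdmissions))

# No applicants are lost in the conversion.
same_total <- sum(narrow_tbl$n) == sum(UCBAdmissions)

c(wide_is_df, narrow_is_df, one_row_a_cell, same_total)
```

In wide_tbl the levels of Gender are spread across column headers, whereas in narrow_tbl each variable has its own column.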
We will use the R package tidyr to transform a wide table into a narrow table (and vice versa), and to separate a column into two (and its inverse), using the following verbs: gather(), spread(), separate() and unite().
tidyr verbs to make data suitable for use with software
We will use the slides Data Wrangling with R by Garrett Grolemund, pp 9–77, to see how the four verbs work for data sets.
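As a minimal sketch of the four verbs, the code below applies them to narrow_tbl from the UCBAdmissions example. These functions are provided by tidyr, which is loaded with the tidyverse. The intermediate names (wide, narrow_again, united, separated) and the combined column name Gender_Dept are chosen here for illustration only.

```r
library(tidyverse)

data("UCBAdmissions")
narrow_tbl <- as_tibble(UCBAdmissions)

# spread(): narrow -> wide, one column per level of Admit.
wide <- narrow_tbl %>%
  spread(key = Admit, value = n)

# gather(): wide -> narrow, stacking the Admitted and Rejected columns.
narrow_again <- wide %>%
  gather(key = "Admit", value = "n", Admitted, Rejected)

# unite(): paste two columns into one.
united <- narrow_tbl %>%
  unite(col = "Gender_Dept", Gender, Dept, sep = "_")

# separate(): split the combined column back into two.
separated <- united %>%
  separate(col = Gender_Dept, into = c("Gender", "Dept"), sep = "_")

dim(wide)
dim(narrow_again)
names(separated)
```

Recent versions of tidyr also offer pivot_longer() and pivot_wider() as successors to gather() and spread(); the older verbs remain available and match the slides.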