The ultimate objective in data science is to extract meaning from data.
A typical data science project looks something like this:
In general, data used in EDA are not used in confirmation.
Statistical methodologies are used both in exploration and confirmation.
Statistical methodologies refer to the collective knowledge about
A statistical methodology called t-test is in fact more than the formula! For example,
A key in better understanding this, thus to correctly use statistical methods, is to think of data as a sample in a population.
We have considered data as being fixed.
Statistical methodology is governed by a broader point of view.
Yes, the data we have in hand are fixed, but the methodology assumes that the cases are drawn from a much larger set of potential cases. The given data are a sample of a larger population.
Suppose you were asked to help develop a travel policy for business travelers based in New York City. Imagine that the traveler has a meeting in San Francisco (airport code SFO) at a specified time \(t\). The policy to be formulated will say how much earlier than t an acceptable flight should arrive in order to avoid being late to the meeting due to a flight delay.
Pretend that we already have the complete population of flights. For this purpose, we’re going to use the set of 336,776 flights in 2013 in the nycflights13
package, which gives airline delays from New York City airports in 2013.
More realistically, the problem would be to develop a policy for this year based on the sample of data that have already been collected.
We’re going to simulate this situation by drawing a sample from the population of flights into SFO. First we set up the population.
library(mdsr)
library(nycflights13)
SF <- flights %>%
filter(dest == "SFO", !is.na(arr_delay))
We work with just a sample from this population. For now, we’ll set the sample size to be \(n = 25\) cases.
set.seed(101)
Sample25 <- SF %>%
sample_n(size = 25)
Here, we used
sample_n()
to select random rows from a table,set.seed()
to specify how to mimic random selection by pseudo-random generator.We are interested in the arrival delay. Focus on the variable arr_delay
, and compute some useful statistics, and visualize the distribution.
favstats( ~ arr_delay, data = Sample25)
## min Q1 median Q3 max mean sd n missing
## -50 -23 -7 4 124 -2.96 35.33822 25 0
Sample25 %>% ggplot(aes(x=arr_delay)) + geom_density()
Notice that most of delays are less than 50 minutes, but the maximum delay is 124 minutes. A simple way to set the policy is to look for the longest flight delay (in our data), and insist that travel be arranged to deal with this delay.
Naive policy: book a flight that is scheduled to arrive at least 124 minutes before.
Let us see how well this policy works. Since in our imaginary situation we have the entire population, we can get an answer of this. In particular, we can derive how often the travelers from NYC will be late for a meeting in San Francisco, when the naive policy is used.
SF %>%
mutate(is.late = arr_delay > 124) %>%
summarize(prop.late = mean(is.late))
## # A tibble: 1 x 1
## prop.late
## <dbl>
## 1 0.02937827
This seems reasonable, if being late or missing a meeting with 2.9% of chance is acceptable.
Now consider another policy, utilizing a 2-standard deviation rule. If the arrival delay is normally distributed, then only 2.27% of flights will have delay greater than \(\mu + 2\sigma\) (mean + 2 standard deviation ). However, in a bit more realistic situation, we do not have the population; we only have the sample in Sample25
. We do not know \(\mu\) and \(\sigma\), which must be estimated from the sample.
Sample25 %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))
## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -2.96 35.33822
Policy #2: book a flight that is scheduled to arrive at least 67.7164459 minutes before.
How does this policy work? Again, since we have the population, we can see how often the travelers will miss their meetings:
SF %>%
mutate(is.late = arr_delay > 68) %>%
summarize(prop.late = mean(is.late))
## # A tibble: 1 x 1
## prop.late
## <dbl>
## 1 0.06930843
It’s much worse than 2.27%. About 7% of time, the traveler will be late or miss the meeting. What has gone wrong? To answer this, we need to consider the following:
How reliable a sample statistic? For example, the sample standard deviation, computed from our sample, is 35 minutes. This is much smaller than the population standard deviation, about 47 minutes. The sample mean is about -3 minutes, which is close to the population mean, 2.7 minutes. (We will still be in the playground world where we have the population in hand.)
If we were to collect a new sample from the population, how similar would the sample statistic on that new sample be to the same statistic calculated on the original sample?
SF %>% sample_n(size = 25) %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))
## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -3.84 38.32719
If we draw many different samples from the population, each of size \(n\), and calculated the sample statistic on each of those samples, how similar would the sample statistic be across all the samples? With the population in hand, it’s easy to figure this out; use sample_n()
many times and calculate the sample statistic on each trial.
Trials <- list()
set.seed(101)
for(i in 1:100){
Trials[[i]] <- SF %>%
sample_n(size = 25) %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))
}
Trials_df <-
bind_rows(Trials) %>%
mutate(policy.cut = mean + 2 * sd)
(There are several new functions we used above: for(){...}
and dplyr::bind_rows()
.)
head(Trials_df,10)
## # A tibble: 10 x 3
## mean sd policy.cut
## <dbl> <dbl> <dbl>
## 1 -2.96 35.33822 67.71645
## 2 -3.84 38.32719 72.81438
## 3 15.76 61.00019 137.76038
## 4 -10.08 21.66587 33.25174
## 5 0.12 41.76454 83.64908
## 6 24.16 80.86134 185.88268
## 7 -2.64 27.98940 53.33881
## 8 13.08 53.84075 120.76151
## 9 7.80 66.44170 140.68341
## 10 16.56 60.34214 137.24427
These numbers vary a lot! The large variability of the sample statistic policy.cut
show how unreliable the statistic is.
favstats( ~ policy.cut, data = Trials_df)
## min Q1 median Q3 max mean sd n missing
## 14.45086 55.8069 85.53907 107.2871 449.4002 90.02024 54.42503 100 0
Trials_df %>% ggplot(aes(x=policy.cut)) + geom_density()
We were lucky in the first sample Sample25
that provided only 7% chance of being late. This could be more than 20% (with a 14-minute policy), or could be only 0.0003% with a ridiculous 449-minute (7.5 hr) policy.
This discussion is only possible because we play god by knowing the population. In a more realistic situation, the sample is only available to us; is quantifying the uncertainty ever possible?
Is normal distribution assumed for the population appropriate?
Are there (unsual or extreme) outliers?
Should the policy depend on the airlines, the time of the year, day of the week, and time of the day?
The data may contain some missing values. Is discarding the missing values a good practice?
Iteration is a powerful tool for reducing duplication, when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Consider a toy data frame consisting of 10 observations of 4 variables.
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
We want to compute the median of each column. You could do with copy-and-paste:
median(df$a)
median(df$b)
median(df$c)
median(df$d)
Instead, we could use a for loop.
output <- vector("double", 4) # 1. output
for (i in 1:4) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
output
## [1] -0.00468744 -0.20602777 -0.43395171 0.46391441
Every for loop has three components:
The output: output <- vector("double", length(x))
. Before you start the loop, you must always allocate sufficient space for the output. A general way of creating an empty vector of given length is the vector()
function. In the earlier example, we created an empty list, using the list()
function.
The sequence: i in 1:4
. This determines what to loop over: each run of the for loop will assign i to a different value from the sequence 1:4
.
The body: output[[i]] <- median(df[[i]])
. This is the code that does the work. It’s run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]])
, the second will run output[[2]] <- median(df[[2]])
, and so on.