Where are we?

Statistical foundation

Data Science workflow

A typical data science project looks something like this:

(from r4ds)

(from r4ds)

Exploratory data analysis (EDA) vs Confirmatory data analysis


Exploration

  • To investigate the data integrity
  • To generate hypotheses (e.g. “smoking is related to cancer”, “shape and size of diamonds predict price”)
  • Both visualization and modeling are important tools in EDA
  • EDA typically involves many iterations of “transform-visualize-model”

Confirmation

  • To confirm that the fitted model is appropriate
  • To confirm the hypothesis is true
  • Prediction accuracy vs. statistical inference
  • Use visualization to graphically convey the conclusion

In general, data used in EDA are not used in confirmation.


Statistical methodologies are used both in exploration and confirmation.

Statistical methods

Statistical methodologies refer to the collective knowledge about


Example: “t-test”

  • Observations: \(X_1,\ldots,X_n\).
  • Hypothesis: The mean of \(X_i\) is \(\mu = 0\)
  • The t-test: Reject the hypothesis if \(t > 1.96\) or \(t < -1.96\), where \[ t = \sqrt{n}\frac{\mbox{mean}(X) }{\mbox{std}(X)} \]

A statistical methodology called t-test is in fact more than the formula! For example,

  • model: observations are centered at \(\mu\) (unknown), with certain amount of error.
  • fit: the center and amount of error are fitted by computing \(\mbox{mean}(X)\) and \(\mbox{std}(X)\)
  • assumption: observations are numeric and follow a normal distribution
  • uncertainty quantification : We are 95% confident that true \(\mu\) is in the interval \(\mbox{mean}(X) \pm 1.96 \cdot \mbox{std}(X)/\sqrt{n}\)
  • strength of evidence: the smaller p-value, the stronger evidence against the hypothesis

A key in better understanding this, thus to correctly use statistical methods, is to think of data as a sample in a population.

Samples and Populations

We have considered data as being fixed.

Statistical methodology is governed by a broader point of view.

Yes, the data we have in hand are fixed, but the methodology assumes that the cases are drawn from a much larger set of potential cases. The given data are a sample of a larger population.

(from my lecture notes for mathematical statistics)

(from my lecture notes for mathematical statistics)

Sampling from the population

Suppose you were asked to help develop a travel policy for business travelers based in New York City. Imagine that the traveler has a meeting in San Francisco (airport code SFO) at a specified time \(t\). The policy to be formulated will say how much earlier than t an acceptable flight should arrive in order to avoid being late to the meeting due to a flight delay.

Pretend that we already have the complete population of flights. For this purpose, we’re going to use the set of 336,776 flights in 2013 in the nycflights13 package, which gives airline delays from New York City airports in 2013.

More realistically, the problem would be to develop a policy for this year based on the sample of data that have already been collected.


We’re going to simulate this situation by drawing a sample from the population of flights into SFO. First we set up the population.

library(mdsr)
library(nycflights13)
SF <- flights %>%
  filter(dest == "SFO", !is.na(arr_delay))

We work with just a sample from this population. For now, we’ll set the sample size to be \(n = 25\) cases.

set.seed(101)
Sample25 <- SF %>%
  sample_n(size = 25)

Here, we used


We are interested in the arrival delay. Focus on the variable arr_delay, and compute some useful statistics, and visualize the distribution.

favstats( ~ arr_delay, data = Sample25)
##  min  Q1 median Q3 max  mean       sd  n missing
##  -50 -23     -7  4 124 -2.96 35.33822 25       0
Sample25 %>% ggplot(aes(x=arr_delay)) + geom_density()


Notice that most of delays are less than 50 minutes, but the maximum delay is 124 minutes. A simple way to set the policy is to look for the longest flight delay (in our data), and insist that travel be arranged to deal with this delay.

Naive policy: book a flight that is scheduled to arrive at least 124 minutes before.


Let us see how well this policy works. Since in our imaginary situation we have the entire population, we can get an answer of this. In particular, we can derive how often the travelers from NYC will be late for a meeting in San Francisco, when the naive policy is used.

SF %>% 
  mutate(is.late = arr_delay > 124) %>%
  summarize(prop.late = mean(is.late))
## # A tibble: 1 x 1
##    prop.late
##        <dbl>
## 1 0.02937827

This seems reasonable, if being late or missing a meeting with 2.9% of chance is acceptable.


Now consider another policy, utilizing a 2-standard deviation rule. If the arrival delay is normally distributed, then only 2.27% of flights will have delay greater than \(\mu + 2\sigma\) (mean + 2 standard deviation ). However, in a bit more realistic situation, we do not have the population; we only have the sample in Sample25. We do not know \(\mu\) and \(\sigma\), which must be estimated from the sample.

Sample25 %>%
  summarize(mean = mean(arr_delay), sd = sd(arr_delay))
## # A tibble: 1 x 2
##    mean       sd
##   <dbl>    <dbl>
## 1 -2.96 35.33822

Policy #2: book a flight that is scheduled to arrive at least 67.7164459 minutes before.


How does this policy work? Again, since we have the population, we can see how often the travelers will miss their meetings:

SF %>% 
  mutate(is.late = arr_delay > 68) %>%
  summarize(prop.late = mean(is.late))
## # A tibble: 1 x 1
##    prop.late
##        <dbl>
## 1 0.06930843

It’s much worse than 2.27%. About 7% of time, the traveler will be late or miss the meeting. What has gone wrong? To answer this, we need to consider the following:

  1. Was the assumption appropriate?
  2. What do we know about the uncertainty in the estimation of \(\mu\) and \(\sigma\)?

Uncertainty quantification by sampling distribution

How reliable a sample statistic? For example, the sample standard deviation, computed from our sample, is 35 minutes. This is much smaller than the population standard deviation, about 47 minutes. The sample mean is about -3 minutes, which is close to the population mean, 2.7 minutes. (We will still be in the playground world where we have the population in hand.)

If we were to collect a new sample from the population, how similar would the sample statistic on that new sample be to the same statistic calculated on the original sample?

SF %>% sample_n(size = 25) %>%
  summarize(mean = mean(arr_delay), sd = sd(arr_delay))
## # A tibble: 1 x 2
##    mean       sd
##   <dbl>    <dbl>
## 1 -3.84 38.32719

If we draw many different samples from the population, each of size \(n\), and calculated the sample statistic on each of those samples, how similar would the sample statistic be across all the samples? With the population in hand, it’s easy to figure this out; use sample_n() many times and calculate the sample statistic on each trial.


Trials <- list()
set.seed(101)
for(i in 1:100){
  Trials[[i]] <- SF %>% 
    sample_n(size = 25) %>%
    summarize(mean = mean(arr_delay), sd = sd(arr_delay))
}

Trials_df <- 
  bind_rows(Trials) %>% 
  mutate(policy.cut = mean + 2 * sd)

(There are several new functions we used above: for(){...} and dplyr::bind_rows().)


head(Trials_df,10)
## # A tibble: 10 x 3
##      mean       sd policy.cut
##     <dbl>    <dbl>      <dbl>
##  1  -2.96 35.33822   67.71645
##  2  -3.84 38.32719   72.81438
##  3  15.76 61.00019  137.76038
##  4 -10.08 21.66587   33.25174
##  5   0.12 41.76454   83.64908
##  6  24.16 80.86134  185.88268
##  7  -2.64 27.98940   53.33881
##  8  13.08 53.84075  120.76151
##  9   7.80 66.44170  140.68341
## 10  16.56 60.34214  137.24427

These numbers vary a lot! The large variability of the sample statistic policy.cut show how unreliable the statistic is.

favstats( ~ policy.cut, data = Trials_df)
##       min      Q1   median       Q3      max     mean       sd   n missing
##  14.45086 55.8069 85.53907 107.2871 449.4002 90.02024 54.42503 100       0
Trials_df %>% ggplot(aes(x=policy.cut)) + geom_density()


We were lucky in the first sample Sample25 that provided only 7% chance of being late. This could be more than 20% (with a 14-minute policy), or could be only 0.0003% with a ridiculous 449-minute (7.5 hr) policy.

This discussion is only possible because we play god by knowing the population. In a more realistic situation, the sample is only available to us; is quantifying the uncertainty ever possible?

Other issues

Assumption made on the population

Is normal distribution assumed for the population appropriate?

Are there (unsual or extreme) outliers?

Conditioning

Should the policy depend on the airlines, the time of the year, day of the week, and time of the day?

Missing values

The data may contain some missing values. Is discarding the missing values a good practice?

A short excursion for iterations by “for” loops

Iteration is a powerful tool for reducing duplication, when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

Consider a toy data frame consisting of 10 observations of 4 variables.

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

We want to compute the median of each column. You could do with copy-and-paste:

median(df$a)
median(df$b)
median(df$c)
median(df$d)

Instead, we could use a for loop.

output <- vector("double", 4)         # 1. output
for (i in 1:4) {                      # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
## [1] -0.00468744 -0.20602777 -0.43395171  0.46391441

Every for loop has three components:

  1. The output: output <- vector("double", length(x)). Before you start the loop, you must always allocate sufficient space for the output. A general way of creating an empty vector of given length is the vector() function. In the earlier example, we created an empty list, using the list() function.

  2. The sequence: i in 1:4. This determines what to loop over: each run of the for loop will assign i to a different value from the sequence 1:4.

  3. The body: output[[i]] <- median(df[[i]]). This is the code that does the work. It’s run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.