Where are we?

Sampling distributions

Recall our little playground from last lecture. We treat SF as the population, and Sample25 as the sample (or the data at hand).

library(mdsr)
library(nycflights13)
SF <- flights %>%
  filter(dest == "SFO", !is.na(arr_delay))

set.seed(101)
Sample25 <- SF %>%
  sample_n(size = 25)

This time, we are simply interested in estimating the mean arrival delay, using the average of all available values of arr_delay (i.e. the values in the sample).

Sample25 %>% with(mean(arr_delay))
## [1] -2.96

How reliable this statistic is? (Here, a statistic is a value computed from a sample.) To answer this, we can see how much the statistic vary from a sample to another. Since we have the population, we collect many of the sample statistics.

Trials <- list() 
for(i in 1:500){
  Trials[[i]] <- SF %>% 
    sample_n(size = 25) %>%
    summarize(mean = mean(arr_delay))
}
Trials_df <- bind_rows(Trials)
favstats( ~ mean, data = Trials_df)
##     min    Q1 median   Q3   max    mean       sd   n missing
##  -17.64 -3.85   1.78 9.14 45.48 3.05112 10.18587 500       0
Trials_df %>% ggplot(aes(x=mean)) + geom_density()

To discuss reliability, it helps to have some standardized vocabulary.

mean(Trials_df$mean) + sd(Trials_df$mean) * 1.96 * c(-1,1)  
## [1] -16.91319  23.01543

The reliability of a sample statistic is typically measured by

  1. the mean of the statistic (mean of the sampling distribution). Better if it is closer to the truth.
  2. the standard error of the statistic (standard of the sampling distribution). Better if it is small.

In above example, the standard error 10.1858725 seems to be too large to be acceptable. Equivalently, the confidence interval -16.91319, 23.01543 seems to be too wide. (The mean of the statistic is close to the truth.)

If we increase the sample size \(n\), the standard error will decrease. To see this, repeat the experiment with \(n = 25,50,100\) and \(200\), and see if the standard error decreases.

Trials.all <-list()
nvec = c(25,50,100,200)
for(j in seq_along(nvec)){
  Trials_n <- list() 
  n <- nvec[j]
  for(i in 1:500){
    Trials_n[[i]] <- SF %>% 
    sample_n(size = n) %>%
    summarize(mean = mean(arr_delay), n = n)
  }
  Trials.all[[j]] <- bind_rows(Trials_n)
}
Trials.all <- bind_rows(Trials.all)
Trials.all %>% ggplot(aes(x=mean)) + geom_density() + facet_wrap(~n)

Trials.all %>% group_by(n) %>% summarize(error = sd(mean))
## # A tibble: 4 x 2
##       n    error
##   <dbl>    <dbl>
## 1    25 9.325801
## 2    50 6.466642
## 3   100 4.787764
## 4   200 3.365816

An important question that statistical methods allow you to address is what size of sample \(n\) is needed to get a result with an acceptable reliability (For example, the standard error less than 5 minutes is an acceptable reliability). In this case, using \(n \ge 100\) is enough.

The figure above also demonstrates two fundamental results in mathematical statistics: Law of large numbers and Central Limit Theorem.

In fact, under suitable assumptions, \[\mbox{Standard error}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}},\] where \(\sigma\) is the population standard deviation.

Bootstrap sampling distribution

In the previous examples, we had access to the population data and so we could find the sampling distribution by repeatedly sampling from the population. In practice, however, we have only one sample and not the entire population. The bootstrap is a statistical method that allows us to approximate the sampling distribution even without access to the population.

The logical leap involved in the bootstrap is to think of our sample itself as if it were the population. Just as in the previous examples we drew many samples from the population, now we will draw many new samples from our original sample. This process is called resampling: drawing a new sample from an existing sample.

I will use blackboard to demonstrate a bootstrap procedure.

The boostrap procedure of creating \(B = 500\) bootstrap statistics from Sample25 in R is shown below.

n <- nrow(Sample25)
boot <- list() 
for(i in 1:500){
    boot[[i]] <- Sample25 %>% 
      sample_n(size = n, replace = TRUE) %>%
      summarize(mean = mean(arr_delay))
  }
boot_df <- bind_rows(boot)
boot_df %>% ggplot(aes(x = mean)) + 
  geom_density() + 
  labs(title = "Bootstrap distribution of mean(arr_delay) from Sample25")

The distribution of values in the bootstrap trials is called the bootstrap sampling distribution. It’s not exactly the same as the sampling distribution, but for moderate to large sample sizes it has been proven to approximate those aspects of the sampling distribution that we care most about, such as the standard error.

The bootstrap estimate of the standard error is the standard deviation of the bootstrap distribution:

sd(boot_df$mean)
## [1] 7.092224

Bootstrap standard error estimates for large sample

The following code sample for population SF only once for each of the sample sizes 25–200, and use the sample (without reference to the population) to estimate the standard error.

Trials.bootstrap <-list()
nvec = c(25,50,100,200)
for(j in seq_along(nvec)){
  n <- nvec[j]
  sample_df <- SF %>% sample_n(size = n)
  Trials_n <- list() 
  for(i in 1:500){
    Trials_n[[i]] <- sample_df %>% 
      sample_n(size = n, replace = TRUE) %>%
      summarize(mean = mean(arr_delay), n = n)
  }
  Trials.bootstrap[[j]] <- bind_rows(Trials_n)
}
bind_rows(Trials.bootstrap) %>% 
  group_by(n) %>% 
  summarize(error = sd(mean))
## # A tibble: 4 x 2
##       n     error
##   <dbl>     <dbl>
## 1    25 15.989628
## 2    50  4.791018
## 3   100  6.224418
## 4   200  3.271068

These standard error estimates, computed by bootstrapping the sample, are indeed quite close to the standard error, computed from the population.

In this example, the sample statistic is the mean. In practice, the statistics of interest are more complex, e.g. the coefficient estimate in linear regression model. The bootsrapping procedure can be applied to almost all situations, in order to quantify the uncertainty in a statistic.