In previous lectures, we experimented on a travel policy using the SF
data (below) as the population. An example of policy was to book a flight that is scheduled to arrive at least 124 minutes before.
library(dplyr)
library(nycflights13)
SF <- flights %>%
filter(dest == "SFO", !is.na(arr_delay))
There, we assumed that the population is homogeneous; that is, a flight is just like any other flight. This is certainly not true, since, as we will see shortly, flight delays are more likely to occur if the airport is busy, which depends on the time of day, day of the week and month of the year. If we know flights in July are more frequently delayed than flights in October, should the policy depend on the month? This can be done by modeling. A model, explaining a variable (arrival delay) by other variables (time of the year), is necessarily a conditional statement: “arrival delay pattern depends on time of the year”.
Many of the “useful” models, in statistics, statistical learning, machine learning, or data mining, are categorized into one of the following:
We will discuss three classes of models from Categories 1, 2, and 4 above. Before we get into it, let us first further explore the data SF
to see some evidence on how modeling, or conditioning, helps better shape your conclusion.
We begin by considering time of day: first using standard box-and-whiskers to show the (conditional) distribution of arrival delays per each hour; second with a kind of statistical model called a linear model that lets us track the mean arrival delay over the course of the day.
library(ggplot2)
SF %>%
ggplot(aes(x = hour, y = arr_delay)) +
geom_boxplot(alpha = 0.1, aes(group = hour)) + geom_smooth(method = "lm") +
xlab("Scheduled hour of departure") + ylab("Arrival delay (minutes)") +
coord_cartesian(ylim = c(-30, 120))
Data transformation might help us better understand the pattern. For this, create a new variable long_delay
whose value is 1 TRUE
if the delay is over 60 minutes. Instead of looking at the whole distribution of arrival delay, focus on the binary variable long_delay
.
SF %>%
mutate(long_delay = arr_delay > 60) %>%
ggplot(aes(x = hour, fill= long_delay)) +
geom_bar(position = "fill") +
xlab("Scheduled hour of departure") + ylab("Proportion of long delay")
How about in terms of day of the week, month of the year, and different airlines?
SF_longdelay <- SF %>%
mutate(long_delay = arr_delay > 60)
SF_longdelay %>%
ggplot(aes(x = as.factor(month), fill= long_delay)) +
geom_bar(position = "fill") +
xlab("Month") + ylab("Proportion of long delay")
library(lubridate, quietly = TRUE, warn.conflicts = FALSE)
SF_longdelay %>%
mutate(day = ymd(paste0(year, "-", month, "-", day)),
dow = wday(day, label = TRUE)
) %>%
ggplot(aes(x = dow, fill= long_delay)) +
geom_bar(position = "fill") +
xlab("Day of the week") + ylab("Proportion of long delay")
SF_longdelay %>%
ggplot(aes(x = origin, fill= long_delay)) +
geom_bar(position = "fill") +
xlab("Origin") + ylab("Proportion of long delay")
SF_longdelay %>%
ggplot(aes(x = carrier, fill= long_delay)) +
geom_bar(position = "fill") +
xlab("Airlines") + ylab("Proportion of long delay")
It makes a lot of sense to use these variables to model the patterns of arrival delay. Is the visual pattern really there? Can modeling help us identifying patterns that are not easily visible?
We use a different data set mosaicData::RailTrail
to make examples for regression. The Pioneer Valley Planning Commission (PVPC) collected data north of Chestnut Street in Florence, Massachusetts for a ninety day period. Data collectors set up a laser sensor that recorded when a rail-trail user passed the data collection station.
library(mosaicData)
glimpse(RailTrail)
## Observations: 90
## Variables: 10
## $ hightemp <int> 83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41,...
## $ lowtemp <int> 50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49,...
## $ avgtemp <dbl> 66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67....
## $ spring <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...
## $ summer <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, ...
## $ fall <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...
## $ cloudcover <dbl> 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, ...
## $ precip <dbl> 0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.0...
## $ volume <int> 501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 4...
## $ weekday <fctr> 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,...
The PVPC wants to understand the relationship between daily ridership (i.e., the number of riders and walkers who use the bike path on any given day) and a collection of explanatory variables, including the temperature, rainfall, cloud cover, and day of the week.
In a simple linear regression model, there is a single quantitative explanatory variable. It seems reasonable that the high temperature for the day (hightemp
, measured in degrees Fahrenheit) might be related to ridership (volume
), so we will explore that first.
RailTrail %>% ggplot(aes(x = hightemp, y = volume)) + geom_point()
A simple linear regression model is of the form \[ \mbox{volume}_i = \beta_0 + \beta_1 \mbox{hightemp}_i + \epsilon_i,\] where \(\beta_0,\beta_1 \in (-\infty,\infty)\) are the (unknown) parameters of the model. This is not just one model, rather a family of models, where different values of \(\beta_0,\beta_1\), i.e. parameters, indicate different members in the family.
Here, \(\epsilon_i\) stands for the unexplained error associated with the \(i\)th case. Often, the errors are assume to have mean 0 and variance \(\sigma^2\). The model has a nice interpretation.
Fitting a regression model amounts to find estimates of \(\beta_0,\beta_1\) (or to find a suitable member in the family). Delve into this idea by sampling some members of the family.
RTsim <- RailTrail %>% transmute(x = hightemp, y = volume)
models <- tibble(
b0 = runif(500, -300, 900),
b1 = runif(500, -50, 50)
)
ggplot(RTsim, aes(x, y)) +
geom_abline(aes(intercept = b0, slope = b1), data = models, alpha = 1/4) +
geom_point()
Each line segment is given by the formula \(y = b_0 + b_1 x\), and is a member of the linear regression model family. A lot of these do not follow the apparent trend in the scatter plot. How do we choose a “good fit”?
There are million different ways to fit a model to data, but a standard technique is called least-squares.
Here is the idea behind the least-squares fit. We choose two of the models in the family:
Compare the errors of the fit on the left and the right:
## x y error.left error.right
## 1 41 193 -124.16667 -23.69772
## 2 41 287 -218.16667 70.30228
## 3 42 181 -112.16667 -41.39960
## 4 44 200 -131.16667 -33.80336
## 5 46 189 -120.16667 -56.20712
## 6 49 129 -60.16667 -133.31275
The sum of squares of errors of a model is
mod.compare %>%
summarize(
SSE.left = sum(error.left^2),
SSE.right = sum(error.right^2)
)
## SSE.left SSE.right
## 1 9904438 955214.1
SSE.right
is smaller than SSE.left
. Thus the model on the right is better than the modelon the left.
Of the five hundred models shown above, you can repeat this process of computing the SSE, and look for the model with the least sum of squared of errors. That is the best model among your collection of 500 models.
If you do this for all possible models (not just for the handful of 500 models), and find the “least-square” model, that is the best model fitted to your data. Mathematically, this is an optimization problem:
Find \(\hat\beta_0, \hat\beta_1\) such that \[\sum_{i=1}^n(y_i - (\hat\beta_0 + \hat\beta_1x_i))^2 \le \sum_{i=1}^n(y_i - (\beta_0 + \beta_1x_i))^2\] for all \(\beta_0,\beta_1\) in the family.
Mathematics sometimes gives you an analytic solution: The formula you have learned in introductory statistics for \(\hat\beta_0, \hat\beta_1\) is the solution of the above optimization problem.
In R, lm()
computes \(\hat\beta_0, \hat\beta_1\) (and a lot more):
mod <- lm(y ~ x, data = RTsim)
coef(mod)
## (Intercept) x
## -17.079281 5.701878
Many sophisticted statistical methods or “machine learning” algorithms are no unlike the simple linear regression. Essentially,
any statistic method optimizes a goodness-of-fit measure over a family of models.
Thus far, we have fit a model and interpreted its estimated coefficients.
Take one step back. We have fit a model using the data set we have. More often than not, a data set is merely a sample from a population. What can we say about the population from our fitted model?
Uncertainty quantification arises in two ways:
These are easy to answer if you know the population (well, except a few things). If you do not know the population, it is still possible, using bootstrap. Recall “sampling distribution” vs “bootstrap sampling distribution” from Lecture 15.
You never know about the population. Instead you make an assumption that you know certain things about the population. In our case, we make the assumption that the members of the population following the rule:
\[y = \beta_0 + \beta_1 x + \epsilon_i, \epsilon_i \sim N(0,\sigma^2).\]
Even though you do not know the exact values of the parameters (\(\beta_0,\beta_1,\sigma^2\)) of this ruel/model, this is a big assumption!
(It is also typical to make assumptions on the sampling procedure, e.g. each member in the sample is picked independent from other members.)
In this case, we use our understanding of the normal distribution to make inferences about the true value of regression coefficients. In particular, we can test a hypothesis about \(\beta_1\) (most commonly that it is equal to zero) and find a confidence interval (range of plausible values) for it.
mod1 <- lm(volume ~ hightemp, data = RailTrail)
summary(mod1)
##
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -254.562 -57.800 8.737 57.352 314.035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.079 59.395 -0.288 0.774
## hightemp 5.702 0.848 6.724 1.71e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 104.2 on 88 degrees of freedom
## Multiple R-squared: 0.3394, Adjusted R-squared: 0.3319
## F-statistic: 45.21 on 1 and 88 DF, p-value: 1.705e-09
Likewise, the margin or error in the prediction can be found, for each value of “\(x\)”, explaining that the true value lies in the (vertical) interval with 95% of confidence.
RailTrail %>%
ggplot(aes(x=hightemp, y=volume)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
The previous case (that is, you know assume that population is normally distributed) is in fact very common. In case you do not want to make such an assumption, bootstrap gives a computationally heavy, but distribution-free answer.
This can be easily done in R for a simple model like linear regression. We note that it is conceptually easy to implement bootstrap using R packages dplyr
and broom
. We compute the bootstrap standard error for the estimate of \(\beta_1\) (the coefficient) below.
library(broom)
boot <- RailTrail %>%
bootstrap(100) %>%
do(tidy(lm(volume ~ hightemp, .)))
head(boot)
## # A tibble: 6 x 6
## # Groups: replicate [3]
## replicate term estimate std.error statistic p.value
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 (Intercept) -44.475469 72.4412522 -0.6139522 5.408308e-01
## 2 1 hightemp 6.331875 0.9834591 6.4383714 6.171510e-09
## 3 2 (Intercept) 17.953224 69.1283710 0.2597085 7.956953e-01
## 4 2 hightemp 4.988566 0.9610643 5.1906682 1.333282e-06
## 5 3 (Intercept) 4.475398 62.1638279 0.0719936 9.427704e-01
## 6 3 hightemp 5.591848 0.9020890 6.1987772 1.791636e-08
boot %>% ungroup() %>%
filter(term == "hightemp") %>%
summarize(se = sd(estimate, na.rm = TRUE))
## # A tibble: 1 x 1
## se
## <dbl>
## 1 0.8014019
The bootstrap standard error is indeed quite close to the standard error estimated assuming the normality.
Suppose that instead of using temperature as our explanatory variable for ridership on the rail trail, we only considered whether it was a weekday or not a weekday (e.g., weekend or holiday). The indicator variable weekday
is binary (or dichotomous) in that it only takes on the values 0 and 1. (Such variables are sometimes called indicator variables: weekday = 1
if weekday, 0
if weekend.) This new linear regression model has the form:
\[ \mbox{volume}_i = \beta_0 + \beta_1 \mbox{weekday}_i + \epsilon_i.\] There are now only two cases in the model: weekday = 1
or 0
.
RailTrail %>%
ggplot(aes(x = weekday, y= volume)) +
geom_boxplot()
RailTrail %>%
group_by(weekday) %>%
summarise(mean_volume = mean(volume))
## # A tibble: 2 x 2
## weekday mean_volume
## <fctr> <dbl>
## 1 0 430.7143
## 2 1 350.4194
The least-squares fit does reflect our finding:
coef(lm(volume ~ weekday, data = RailTrail))
## (Intercept) weekday1
## 430.71429 -80.29493
By default, the lm()
function will pick the alphabetically lowest value of the categorical predictor as the reference group and create indicators for the other levels (in this case “0”). As a result the intercept is now the predicted number of trail crossings on a weekend. The interpretation of the model is that on a weekday, 80 fewer riders are expected than on a weekend or holiday.
Multiple regression is a natural extension of simple linear regression that incorporates multiple explanatory (or predictor) variables. It has the general form:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i.\]
The estimated coefficients are now interpreted as “conditional on” the other variables???each \(\beta_j\) reflects the predicted change in \(y\) associated with a one-unit increase in \(x_{ji}\), conditional upon the rest of the explanatory variables. This type of model can help to disentangle more complex relationships between three or more variables.
We try some examples. Once you decide which variables to include in the model, R function lm()
gives the least-squares fit for \(\beta_j\):
mod1 <- lm(volume ~ hightemp + weekday, data = RailTrail)
mod2 <- lm(volume ~ hightemp + cloudcover + weekday, data = RailTrail)
mod3 <- lm(volume ~ . , data = RailTrail)
Here, the model formula volume ~ .
is the full model, i.e. a model with all available explanatory variables in the data set.
We now encounter another layer of complexity (First layer was the fitting itself; oftentimes fitting a model is merely an educated trial-and-error procedure.):
Among the three families of models
mod1, mod2, mod3
, which model family should we use?
In multiple linear regression, this question is translated to a variable selection problem: Among the \(p\) variables, which variables to use?
mod3
are not estimated. This is caused by a problem called colinearity. To cope with this, one may choose to alter the fitting procedure (e.g. regularized or sparse regression).The degrees of freedom in the choice of models are now almost infinite. Even sticking with the multiple linear regression, you will need to decide
Can’t decide? We can use the data (just like we did for bootstrap) to make informed decisions on these two questions. The way we seek for answers is again an educated trial-and-error procedure, which tends to be very computationally expensive. Statisticians are now required to take both inferential and computational aspects of model fitting. This is modern statistics. Some other people call it data science.
We will see some basics of data-driven model fitting later.
Nonparametric regression assumes a simple (not necessarily linear) model: \[y_i = f(x_i) + \epsilon_i.\] where \(f\) is a smooth function. The family of models is parametrized by the function \(f\), which inself is parameterized by a near infinite number of parameters. “Nonparametric” regression is named so because there are the infinite number of parameters to fit. As a result, the fit is very flexible.
scat_ex <- RTsim %>% ggplot(aes(x,y)) + geom_point()
scat_ex + geom_smooth(span = 10)
## `geom_smooth()` using method = 'loess'
scat_ex + geom_smooth(span = 1/5)
## `geom_smooth()` using method = 'loess'
In the above example, ggplot2
automatically use a local polynomial regression implemented in loess()
. There are other well-developed methods, such as kernel regression implemented in ksmooth()
. The choice of fitting methods comes from the choice of model families.
Notice that for two different values of span
, the fit looks quite different. The span
is an example of tuning parameter. Data-driven choice for the value of the tunining parameter is well-studied and we will glimpse over it later.
Our previous examples had quantitative (or continuous) outcomes. What happens when we are interested in modeling a dichotomous outcome? For example, we might model the probability of long delay as a function of airtime (We are back to the “SF” example).
It would be very awkward to use a linear model to fit the following relationship.
# SF_longdelay <- SF %>% mutate(long_delay = arr_delay > 60)
SF_longdelay %>% ggplot(aes(x = air_time, y = long_delay)) +
geom_point()
The main difficulty is caused by the fact that the response has only two possible values \(\{0,1\}\) as opposed to \((-\infty,\infty)\) in the simple linear regression. Logistic regression can be used in this case, which is an example of generalized linear regression. Generalized linear regression transforms the \(y\)-coordinate (and a lot more) so that a highly interpretable linear model can be used.