fun with R

## Why That Didn't Work

There are a few items related to the analysis of time series with R that will have you scratching your head.
The issues mentioned below are meant to help get you past the sticky points.
Most of the issues are related to the `stats`

package, which is essentially a base package in that it is included with R, and loaded automatically when you start R. If you use other time series packages that have scripts with the same or similar names, then these issues might not apply. Then again, they just may apply because mistakes tend to propagate.

## Issue 1: how will R end?

The issue below has become a real pain as the commercial enterprise that makes RStudio influences the R Foundation, which is a nonprofit organization. Older folks saw this happen with R's father, S-PLUS. Anybody using S-PLUS right now?

A recent issue with the package `dplyr`

and `lag`

from the `stats`

package came to my attention: you can read about it in how will R end... not with a bang but a package. The bottom line is, if you are working with time
series and you load `dplyr`

, then you should know what it breaks... just be careful.

```
# if I do this
library(dplyr)
# I will see this
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# it's a package fight!
```

To be safe, load packages consciously and watch for masked objects warnings. I would say avoid loading `dplyr`

if
you're analyzing time series interactively (the advantage of using R vs batch mode programs such as SAS).
One fix if you're analyzing time series (or teaching a class) is to (tell students to) do the following after loading all the packages needed:

filter = stats::filter lag = stats::lagNow back to our regularly scheduled list of screw ups.

⛔ Oh yeah, so you're probably wondering how? ... every package will nullify every other package until one day, you load R and it masks itself in an infinite do loop ... ⛔

## Issue 2: don't use auto.arima

This is another issue with packages, this time the`forecast`

package. I tend to see this kind of thing often now - don't look at the data, don't check residuals or any other kind of output ... just black box away:
# I don't know anything about fitting models to time series data, # so I used auto.arima from the forecast package. library(forecast) this.sux <- auto.arima(aTimeSeries) plot(forecast(this.sux))Done! Man, I am a bad ass data sci-un-tist!! Now I can leave work... I hope my boss appreciates my hard work.

Here's a little test of

`auto.arima()`

success rates (not very good to say the least):
Success rates of automated ARIMA fitting. The original (Windows) version of `astsa`

had
an automatic fitting procedure, which even for its time sucked.... it's why there is not one in the R package.
There's no excuse not to learn what you're doing before you do it.
A good idea when thinking about these kinds of ideas is to extend it to the extreme. Let's see, how about this:

```
# I don't know anything about flying a plane, so
# I'll use auto.pilot from the Trio_Avionics package.
library(Trio_Avionics)
crash_burn <- auto.pilot(this_plane)
forecast(my_life_expectancy)
AutoPilot Report
You Have Just Crashed Into A Mountain
Black Box Search Begins Now
Funeral In Four Days
```

That was fun... did I mention that the dude's boss didn't appreciate his hard work and fired him the next day? Not only because he used auto.arima, but because the dude is generally a dumb ass.
Let's analyze a real data set. I used to use

`AirPassengers`

because it gave a terrible result. That was changed
somehow in `auto.arima`

, but not generally - specifically for that data set!! Well, anything else I try is equally as bad. I'll use a seasonal series from `astsa`

called `UnempRate`

, monthly US unemployment %s from 1948 thru 2016.
```
library(astsa)
library(forecast)
auto.arima(UnempRate)
Series: UnempRate
ARIMA(5,1,5)(1,0,0)[12]
Coefficients:
ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3 ma4 ma5 sar1
0.8299 0.0002 -0.1877 0.5283 -0.5790 -0.7431 0.0733 0.2183 -0.6230 0.7694 0.8827
s.e. 0.0942 0.0930 0.1124 0.0932 0.0665 0.0768 0.0701 0.0851 0.0736 0.0562 0.0233
sigma^2 estimated as 0.07499: log likelihood=-104.71
AIC=233.42 AICc=233.81 BIC=290.02
```

If you look at the residual analysis, you'll see the model sucks (in a technical sense).
Using `approximate=FALSE`

doesn't change the result by much. Even without looking at the residuals, it's clear that model is a little crazy. And if you know what you're doing, you can beat it without much trouble by starting with Box and Jenkins catchall model ARIMA(0,1,1)(0,1,1)[12].
Let's do another one - we could go on forever:

set.seed(666) x = rnorm(1000) # WHITE NOISE forecast::auto.arima(x) # BLACK BOX # partial output Series: x ARIMA(2,0,1) with zero mean Coefficients: ar1 ar2 ma1 -0.9744 -0.0477 0.9509 s.e. 0.0429 0.0321 0.0294 sigma^2 estimated as 0.9657: log likelihood=-1400 AIC=2808.01 AICc=2808.05 BIC=2827.64ROTFLMAO: an ARMA(2,1)

The bottom line is, it's not that hard to fit ARIMA models if you know what you're doing. And if you don't know what you're doing, maybe you shouldn't... do it, that is... well, unless you want to be president of the US... but that's another story.

## Issue 3: when is the intercept the mean?

When fitting ARIMA models, R calls the estimate of the mean, the estimate of the intercept. This is ok if there's no AR term, but not if there is an AR term.For example, if x(t) = α + φ*x(t-1) + w(t) is stationary, then taking expectations, μ = α + φ*μ or α = μ*(1-φ). So, the intercept, α, is not the mean, μ, unless φ= 0. In general, the mean and the intercept are the same only when there is no AR term. Here's a numerical example:

# generate an AR(1) with mean 50 set.seed(66) # so you can reproduce these results x = arima.sim(list(order=c(1,0,0), ar=.9), n=100) + 50 mean(x)The result is telling you the estimated model is[1] 50.60668 # the sample mean is closearima(x, order = c(1, 0, 0))Coefficients: ar1 intercept # <--here is the problem0.8971 50.6304 # <--or here, one of these has to changes.e. 0.0409 0.8365

x(t) = 50.6304 + .8971*x(t-1) + w(t)

whereas, it should be telling you the estimated model is

x(t)−50.6304 = .8971*[x(t-1)−50.6304] + w(t)

or

x(t) = 5.21 + .8971*x(t-1) + w(t).

Note that 5.21 = 50.6304*(1-.8971) is the intercept... if it's not the intercept, then what is it??

So how did this happen? I don't know, but I'll guess... if you write an AR(1) model as

x(t) = μ + φ*[x(t-1) - μ] + w(t)

then μ looks like an intercept. Don't drink and code at the same time, kids.

The easy thing to do is simply change "intercept" to "mean":

Coefficients: ar1 mean 0.8971 50.6304 s.e. 0.0409 0.8365

## Issue 4: your arima is drifting

When fitting ARIMA models with R, a constant term is NOT included in the model if there is any differencing. The best R will do by default is fit a mean if there is no differencing [type`?arima`

for details].
What's wrong with this? Well (with a time series in x), for example:
arima(x, order = c(1, 1, 0)) # (1)will not produce the same result as

arima(diff(x), order = c(1, 0, 0)) # (2)because in (1), R will fit the model [with ∇x(s) = x(s)-x(s-1)]

∇x(t)= φ*∇x(t-1) + w(t) (no constant)

whereas in (2), R will fit the model

∇x(t) = α + φ*∇x(t-1) + w(t). (constant)

If there's drift (i.e., α is NOT zero), the two fits can be extremely different and using (1) will lead to an incorrect fit and consequently bad forecasts (see Issue 3 below). If α is NOT zero, then what you have to do to correct (1) is use xreg as follows:

arima(x, order = c(1, 1, 0), xreg=1:length(x)) # (1+)Why does this work? In symbols, xreg = t and consequently, R will replace x(t) with y(t) = x(t)-β*t; that is, it will fit the model

∇y(t)= φ*∇y(t-1) + w(t),

or

∇[x(t) - β*t] = φ*∇[x(t-1) - β*(t-1)] + w(t).

Simplifying,

∇x(t) = α + φ*∇x(t-1) + w(t) where α = β*(1-φ).

If you want to see the differences, generate a random walk with drift and try to fit an ARIMA(1,1,0) model to it. Here's how:

set.seed(1) # so you can reproduce the results v = rnorm(100,1,1) # v contains 100 iid N(1,1) variates x = cumsum(v) # x is a random walk with drift = 1 plot.ts(x) # pretty picture... arima(x, order = c(1, 1, 0)) #(1)Let me explain what's going on here. The model generating the data isCoefficients: ar1 0.6031 s.e. 0.0793arima(diff(x), order = c(1, 0, 0)) #(2)Coefficients: ar1 interceptarima(x, order = c(1, 1, 0), xreg=1:length(x)) #(1+)<-- remember, this is the mean of diff(x)-0.0031 1.1163and NOT the intercepts.e. 0.1002 0.0897Coefficients: ar1 1:length(x)<-- this is the intercept of the model-0.0031 1.1169for diff(x)... got a headache?s.e. 0.1002 0.0897

x(t) = 1 + x(t-1) + w(t)

where w(t) is N(0,1) noise. Another way to write this is

[x(t)-x(t-1)] = 1 + 0*[x(t-1)-x(t-2)] + w(t)

or

∇x(t) = 1 + 0*∇x(t-1) + w(t)

so, if you fit an AR(1) to ∇x(t), the estimates should be, approximately, ar1 = 0 and intercept = 1.

Note that (1) gives the WRONG answer because it's forcing the regression to go through the origin. But, (2) and (1+) give the correct answers expressed in two different ways.

## Issue 5: you call that a forecast?

If you want to get predictions from an ARIMA(p,d,q) fit when there is differencing (i.e., d > 0), then the previous issue continues to be a problem. Here's an example using the global temperature data from Edition 2 of the text. In what you'll see below, the*first method*gives the

*wrong*results and the

*second method*gives the

*correct*results. If you use

`sarima`

and `sarima.for`

, then you'll avoid these problems.
##-- WRONG --## fit1 = arima(gtemp, order=c(1,1,1)) fore1 = predict(fit1, 15) ##-- Right --## nobs = length(gtemp) fit2 = arima(gtemp, order=c(1,1,1), xreg=1:nobs) fore2 = predict(fit2, 15, newxreg=(nobs+1):(nobs+15)) ##-- Plot --## par(mfrow=c(2,1)) ts.plot(gtemp,fore1$pred, col=1:2, main="WRONG") ts.plot(gtemp,fore2$pred, col=1:2, main="RIGHT")

## Issue 6: the wrong p-values

If you use`tsdiag()`

for diagnostics after an ARIMA fit, you will
get a graphic that looks like this:
The p-values shown for the Ljung-Box statistic plot are incorrect because the degrees of freedom used to calculate the p-values are lag instead of lag - (p+q). That is, the procedure being used does NOT take into account the fact that the residuals are from a fitted model.

## Issue 7: lead from behind

You have to be careful when working with lagged components of a time series. Note that lag(x) is a FORWARD shift and lag(x,-1) is a BACKWARD shift (unless you happen to load`dplyr`

... ).
Try a small example:
x = ts(1:5) cbind(x, lag(x), lag(x,-1))In other words, if you have a series x(t), thenTime Series: Start = 0 End = 6 Frequency = 1 x lag(x) lag(x, -1) 0 NA 1 NA 1 1 2 NA 2 2 3 1 3 3 4 2 ## in this row, x is 3, lag(x) is 4, lag(x,-1) is 2 4 4 5 3 5 5 NA 4 6 NA NA 5

y(t) = lag{x(t)} = x(t+1), and NOT x(t-1).

In fact, this is reasonable in that y(t) actually does "lag" x(t) by one time period. But, it seems awkward, and it's not typical of other programs. As long as you know the convention, you'll be ok ...

... well, then I started wondering how this plays out in other things. If you do a lag plot

x = rnorm(500) lag.plot(x)you get x(t) [vertical axis] vs x(t+1) [horizontal axis], but your brain is saying it's x(t-1) on the horizontal.

So, I started playing around with some other commands. In what you'll see next, I'm using two simultaneously measured series presented in the text called soi and rec... it doesn't matter what they are for this demonstration. First, I entered the command

acf(cbind(soi, rec)) # and I got:Before you scroll down, try to figure out what the graphs are giving you (in particular, the off-diagonal plots ... and yes they're CCFs, but what's the lead-lag relationship in each plot???) ...

.

.

.

.

.

.

.

... here you go:

The jpg is messy, but you'll get the point... the writing is mine. When you see something like 'rec "leads" here', it means rec comes in time before soi, and so on. Anyway, to be consistent, shouldn't the graph in the 2nd row, 1st column be corr{rec(t+Lag}, soi(t)} for positive values of Lag ... or ... shouldn't the title be

**soi & rec**?? oops.

Now, try this

ccf(soi,rec) # and you getWhat you're seeing is corr{soi(t+Lag), rec(t)} versus Lag. So on the positive side of Lag, rec leads, and on the negative side of Lag, soi leads.

We're not done with this yet. If you want to do a regression of x on lag(x,-1), for example, you have to "tie" them together first. Here's an example.

x = arima.sim(list(order=c(1,0,0), ar=.9), n=20) # 20 obs from an AR(1) xb1 = lag(x,-1) ## you wouldn't regress x on lag(x) because that would be progress ;-) ##-- WRONG: cor(x,xb1) # correlation # [1] 1 ... is one? lm(x~xb1) # regression # Coefficients: # (Intercept) xb1 # 6.112e-17 1.000e+00 ## do it the WRONG way and you will think x(t)=x(t-1) ##-- RIGHT: y=cbind(x,xb1) # tie them together first lm(y[,1]~y[,2]) # regression # Coefficients: # (Intercept) y[, 2] # 0.5022 0.7315 ##-- OR: y=ts.intersect(x,xb1) # tie them together first this way lm(y[,1]~y[,2]) # regression # Coefficients: # (Intercept) y[, 2] # 0.5022 0.7315 cor(y[,1],y[,2]) # correlation # [1] 0.842086 ##--By the way, (Intercept) is used correctly here.R does warn you about this (type

`?lm`

and scroll down to "Using time series"), so consider
this a heads-up, rather than an issue. See our little tutorial for more info on this.
## Issue 8: u-g-l-y, you ain't got no alibi

When you're trying to fit an ARMA model to data, one of the first things you do is look at the ACF and PACF of the data. Let's try this for a simulated MA(1) process. Here's how:MA1 = arima.sim(list(order=c(0,0,1), ma=.5), n=100) par(mfcol=c(2,1)) acf(MA1,20) pacf(MA1,20) and here's the output:What's wrong with this picture? First, the two graphs are on different scales. The ACF axis goes from -.2 to 1, whereas the PACF axis goes from -.2 to .4. Also, the lag axis on the ACF plot starts at 0 (the 0 lag ACF is always 1 so you have to ignore it or put your thumb over it), whereas the lag axis on the PACF plot starts at 1. So, instead of getting a nice picture by default, you get a messy picture. Ok, the remedy is as follows:

par(mfrow=c(2,1)) acf(MA1,20,xlim=c(1,20)) # set the x-axis limits to start at 1 then # look at the graph and note the y-axis limits pacf(MA1,20,ylim=c(-.2,1)) # then use those limits here #and here's the output:Looks nice, but who wants to get carpal tunnel syndrome sooner than necessary? Not me... and hence

`acf2`

was born.