an R time series quick fix

The data sets used in this tutorial are available in astsa, the R package for the text. A detailed tutorial (and more!) is available in Appendix R of the text. This page is basically the quick fix from Edition 2 updated a bit.

You can copy-and-paste the R commands (multiple lines are ok) from this page into R. Printed output is blue, and you wouldn't want to paste those lines into R, would you?

This quick fix is meant for people who are just starting to use R for time series analysis. This is not a lesson in time series analysis. Also, the analyses performed on this page are simply demonstrations; they are not meant to be optimal or complete in any way. This is done intentionally so as not to spoil the fun you'll have working on the problems in the text.

If you're new to R/Splus, I suggest reading R for Beginners (a pdf file) first. Another good read for exploring time series is Econometrics in R (a pdf file). You may also want to poke around the Quick-R website.

◊ Baby steps... your first R session. Get comfortable, then start her up and try some simple addition:

2+2
 [1] 4

Ok, now you're an expert useR. It's time to move on to time series. What you'll see in the following examples should be enough to get you through the first four chapters of the text.

We're going to get astsa now:

install.packages("astsa") # install it ... you'll be asked to choose the closest CRAN mirror
require(astsa)            # then load it (has to be done at the start of each session)

Let's play with the Johnson & Johnson data set. It's included in astsa as jj, that dynOmite character from Good Times. First, look at it.

data(jj)  # load the data
jj        # print it to the screen
        Qtr1  Qtr2  Qtr3  Qtr4
  1960  0.71  0.63  0.85  0.44
  1961  0.61  0.69  0.92  0.55
    .     .     .     .     .
    .     .     .     .     .  
  1979 14.04 12.96 14.85  9.99
  1980 16.20 14.67 16.02 11.61
and you see that jj is a collection of 84 numbers called a time series object. You can see/remove your objects this way:
ls()         # list your objects
 [1] "what"  "me"  "worry"  "jj"
rm(worry)    # remove your worry object

If you're a Matlab (or similar) user, you may think jj is an 84 × 1 vector, but it's not. It has order and length, but no dimensions (no rows, no columns). R calls these kinds of objects "vectors" so you have to be careful. In R, "matrices" have dimensions but "vectors" do not.

jj[1]       # the first element
 [1] 0.71
jj[84]      # the last element
 [1] 11.61
jj[1:4]     # the first 4 elements
 [1] 0.71 0.63 0.85 0.44
jj[-(1:80)] # everything EXCEPT the first 80 elements
 [1] 16.20 14.67 16.02 11.61
length(jj)  # the number of elements 
 [1] 84
dim(jj)     # but no dimensions ...
 NULL
nrow(jj)    # ... no rows
 NULL
ncol(jj)    # ... and no columns
 NULL
#-- if you want it to be a column vector (in R, a matrix), an easy way to go is:
jjm = as.matrix(jj)
dim(jjm)   
 [1] 84  1 

Now, let's make a monthly time series object that starts in June of the year 2163.

options(digits=2)  # the default is 7, but it's more than I want now
?options           # to see your options (it's the help file)
?rnorm             # we're using it on the next line
(zardoz = ts(rnorm(48), start=c(2163,6), frequency=12))
          Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
  2163                                     0.856 -0.355  0.245 -0.321  0.321  0.957 -0.867
  2164  0.358 -0.359  1.384 -0.041  1.107  0.784 -0.241  0.610  0.657 -0.737  1.026  1.023
  2165 -0.080 -0.686  1.445  0.066 -0.970  1.043 -0.516 -1.868 -1.783  1.741 -1.011 -1.044
  2166 -0.674  1.269 -0.104 -0.916  2.219 -0.587 -0.532  0.749  1.892  0.097 -2.192 -1.129
  2167 -0.322 -0.104 -0.240 -1.204  0.573

# use window() if you want a part of a ts object
(oz = window(zardoz, start=2164, end=c(2165,12)))
          Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
  2164  0.358 -0.359  1.384 -0.041  1.107  0.784 -0.241  0.610  0.657 -0.737  1.026  1.023
  2165 -0.080 -0.686  1.445  0.066 -0.970  1.043 -0.516 -1.868 -1.783  1.741 -1.011 -1.044
Note that the Johnson & Johnson data are quarterly earnings, hence the series has frequency=4. The time series zardoz is monthly data, hence it has frequency=12. You also get some nice things with the ts object, for example, the corresponding time values:
time(jj)   
          Qtr1    Qtr2    Qtr3    Qtr4
  1960 1960.00 1960.25 1960.50 1960.75
  1961 1961.00 1961.25 1961.50 1961.75
    .     .     .     .     .     .
    .     .     .     .     .     .
  1979 1979.00 1979.25 1979.50 1979.75
  1980 1980.00 1980.25 1980.50 1980.75
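
While we're at it, a ts object also knows its own frequency and where each observation falls in the seasonal cycle. These base R extractors are worth knowing (expected output sketched in the comments):

frequency(jj)      # 4, because the data are quarterly
cycle(jj)          # which quarter (1-4) each observation belongs to
start(zardoz)      # c(2163, 6) - June 2163
end(zardoz)        # c(2167, 5) - May 2167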

Now try a plot of the data:

plot(jj, ylab="Earnings per Share", main="J & J")   

with the result being:

[figure: time plot of jj - earnings per share]

Try these and see what happens:

plot(jj, type="o", col="blue", lty="dashed")
plot(diff(log(jj)), main="logged and diffed") 
and while you're here, check out plot.ts and ts.plot:
x = -5:5                  # sequence of integers from -5 to 5
y = 5*cos(x)              # guess
par(mfrow=c(3,2))         # multifigure setup: 3 rows, 2 cols
#---  plot:
plot(x, main="plot(x)")
plot(x, y, main="plot(x,y)")
#---  plot.ts:
plot.ts(x, main="plot.ts(x)")
plot.ts(x, y, main="plot.ts(x,y)")
#---  ts.plot:
ts.plot(x, main="ts.plot(x)")
ts.plot(ts(x), ts(y), col=1:2, main="ts.plot(x,y)")  # note- x and y are ts objects 
#---  the help files [? and help() are the same]:
?plot.ts
help(ts.plot)
?par        # might as well skim the graphical parameters help file while you're here
[figure: the 3x2 grid comparing plot, plot.ts, and ts.plot]
Note that if your data are a time series object, plot() will do the trick (for a simple time plot, that is). Otherwise, plot.ts() will coerce the graphic into a time plot.
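
For example, here's a tiny demo of that difference (try it):

z = rnorm(10)    # z is a plain vector with no time series attributes
plot(z)          # a scatterplot of z against its index
plot.ts(z)       # the same numbers coerced into a time plot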

How about filtering/smoothing the Johnson & Johnson series using a two-sided moving average? Let's try this:
  fjj(t) = ⅛ jj(t-2) + ¼ jj(t-1) + ¼ jj(t) + ¼ jj(t+1) + ⅛ jj(t+2)
and we'll add a lowess (?lowess - you know the routine) fit for fun.
k = c(.5,1,1,1,.5)            # k is the vector of weights
(k = k/sum(k))       
  [1] 0.125 0.250 0.250 0.250 0.125
fjj = filter(jj, sides=2, k)  # ?filter for help [but you knew that already]
plot(jj)
lines(fjj, col="red")         # adds a line to the existing plot
lines(lowess(jj), col="blue", lty="dashed")
... and the result:
[figure: jj (black) with the moving average fjj (red) and the lowess fit (blue dashed)]
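
By the way, filter() can also do one-sided (causal) smoothing that uses only present and past values - a little sketch using the same weights k from above:

fjj1 = filter(jj, sides=1, k)   # uses jj(t-4), ..., jj(t) only
plot(jj)
lines(fjj1, col="green")        # notice it lags behind the two-sided version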

Let's difference the logged data and call it dljj. Then we'll play with dljj:

dljj = diff(log(jj))        # difference the logged data
plot(dljj)                  # plot it (not shown)
shapiro.test(dljj)          # test for normality 
  Shapiro-Wilk normality test
  data:  dljj 
  W = 0.9725, p-value = 0.07211
sanity.clause(dljj)         # test for sanity
  dljj is crazy  
  SC = 45.57, p-value = 0.0003    
  
# I couldn't fool you - there ain't no sanity clause  

Now a histogram and a Q-Q plot, one on top of the other (but in a nice way):

par(mfrow=c(2,1))        # set up the graphics 
hist(dljj, prob=TRUE, 12)   # histogram    
lines(density(dljj))     # smooth it - ?density for details 
qqnorm(dljj)             # normal Q-Q plot  
qqline(dljj)             # add a line    

and the results:

[figure: histogram with density, and normal Q-Q plot of dljj]

Let's check out the correlation structure of dljj using various techniques. First, we'll look at a grid of scatterplots of dljj(t-lag) vs dljj(t) for lag=1,2,...,9.

lag.plot(dljj, 9, do.lines=FALSE)  
lag1.plot(dljj, 9)  # if you have astsa loaded (not shown) 
# why the do.lines=FALSE? Because you get a phase plane if it's TRUE 
#   a little phase plane aside - try this on your own
x = cos(2*pi*1:100/4) + .2*rnorm(100)
plot.ts(x)
dev.new()
lag.plot(x, 4)

Back to Johnson's lag plot. Notice the large positive correlation at lags 4 and 8 and the negative correlations at a few other lags:

[figure: the grid of lag plots of dljj]

Now let's take a look at the ACF and PACF of dljj:

par(mfrow=c(2,1)) # The power of accurate observation is commonly called cynicism 
                  #     by those who have not got it. - George Bernard Shaw
acf(dljj, 20)     # ACF to lag 20 - no graph shown... keep reading
pacf(dljj, 20)    # PACF to lag 20 - no graph shown... keep reading
# !!NOTE!! acf2 on the line below is ONLY available in astsa and tsa3
acf2(dljj)        # this is what you'll see below
[figure: acf2 output - the ACF and PACF of dljj]
Note that the LAG axis is in terms of frequency, so 1,2,3,4,5 correspond to lags 4,8,12,16,20 because frequency=4 here. If you don't like this type of labeling, you can replace dljj in any of the above by ts(dljj, freq=1); e.g., acf(ts(dljj, freq=1), 20)

Moving on, let's try a structural decomposition of log(jj) = trend + season + error using lowess.
plot(dog <- stl(log(jj), "per"))   

Here's what you get:

[figure: the stl decomposition of log(jj) - seasonal, trend, and remainder]
If you want to inspect the residuals, for example, they're in dog$time.series[,3], the third column of the resulting series (the seasonal and trend components are in columns 1 and 2). Check out the ACF of the residuals, acf(dog$time.series[,3]); the residuals aren't white - not even close. You can do a little (very little) better using a local seasonal window, plot(dog <- stl(log(jj), s.window=4)), as opposed to the global one used by specifying "per". Type ?stl for details. There's also something called StructTS that will fit parametric structural models. We don't use these functions in the text when we present structural modeling in Chapter 6 because we prefer to use our own programs.

◊ This is a good time to explain $. In the above, dog is an object containing a bunch of things (technical term). If you type dog, you'll see the components, and if you type summary(dog) you'll get a little summary of the results. One of the components of dog is time.series, which contains the resulting series (seasonal, trend, remainder). To see this component of the object dog, you type dog$time.series (and you'll see 3 series, the last of which contains the residuals). And that's the story of $ ... you'll see more examples as we move along.
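
For instance, assuming dog is still sitting there from the stl() fit above:

head(dog$time.series)          # peek at the three component series
plot(dog$time.series[,3])      # the remainder component (the residuals)
acf(dog$time.series[,3], 20)   # ... and its ACF - not white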


And now, we'll do some of Problem 2.1. We're going to fit the regression
   log(jj) = β*time + α1*Q1 + α2*Q2 + α3*Q3 + α4*Q4 + ε
where Qi is an indicator of the quarter i = 1,2,3,4. Then we'll inspect the residuals.

Q = factor(rep(1:4,21))     # make (Q)uarter factors [that's repeat 1,2,3,4, 21 times]
trend = time(jj)-1970   # not necessary to "center" time, but the results look nicer
reg = lm(log(jj)~0+trend+Q, na.action=NULL)  # run the regression without an intercept
#-- the na.action statement is to retain time series attributes
summary(reg)
     Call:
lm(formula = log(jj) ~ 0 + trend + Q, na.action = NULL)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.29318 -0.09062 -0.01180  0.08460  0.27644 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
trend 0.167172   0.002259   74.00   <2e-16 ***
Q1    1.052793   0.027359   38.48   <2e-16 ***
Q2    1.080916   0.027365   39.50   <2e-16 ***
Q3    1.151024   0.027383   42.03   <2e-16 ***
Q4    0.882266   0.027412   32.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.1254 on 79 degrees of freedom
Multiple R-squared: 0.9935,     Adjusted R-squared: 0.9931 
F-statistic:  2407 on 5 and 79 DF,  p-value: < 2.2e-16  
You can view the model matrix (with the dummy variables) this way:
model.matrix(reg)

    trend Q1 Q2 Q3 Q4   (remember trend is time centered at 1970)
1  -10.00  1  0  0  0
2   -9.75  0  1  0  0
3   -9.50  0  0  1  0
4   -9.25  0  0  0  1
5   -9.00  1  0  0  0
6   -8.75  0  1  0  0
7   -8.50  0  0  1  0
8   -8.25  0  0  0  1
.     .    .  .  .  .
.     .    .  .  .  .
81  10.00  1  0  0  0
82  10.25  0  1  0  0
83  10.50  0  0  1  0
84  10.75  0  0  0  1

Now check out what happened. Look at a plot of the observations and their fitted values:

plot(log(jj), type="o")   # the data in black with little dots 
lines(fitted(reg), col=2) # the fitted values in bloody red - or use lines(reg$fitted, col=2)
you get:
[figure: log(jj) (black, with dots) and the fitted values (red)]

... and a plot of the residuals and the ACF of the residuals:

par(mfrow=c(2,1))
plot(resid(reg))      # residuals - reg$resid is same as resid(reg) 
acf(resid(reg),20)    # acf of the resids 
and you get:
[figure: the residuals and the ACF of the residuals]
Do those residuals look white? [Ignore the 0-lag correlation, it's always 1.]

You have to be careful when you regress one time series on lagged components of another using lm(). There is a package called dynlm that makes it easy to fit lagged regressions, and I'll discuss that right after this example. If you use lm(), then what you have to do is "tie" the series together using ts.intersect. If you don't tie the series together, they won't be aligned properly. Here's an example regressing weekly cardiovascular mortality (cmort) on particulate pollution (part) at the present value and lagged four weeks (about a month). For details about the data set, see Chapter 2.
require(astsa)
data(cmort, part)
ded = ts.intersect(cmort,part,part4=lag(part,-4), dframe=TRUE) # align them in a data frame
fit = lm(cmort~part+part4, data=ded, na.action=NULL)           # now the regression will work
summary(fit)                            
  Call:
  lm(formula = cmort ~ part + part4, data = ded, na.action = NULL)

  Residuals:
     Min     1Q Median     3Q    Max 
  -22.74  -5.37  -0.41   5.27  37.85 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   69.010      1.375   50.19  < 2e-16 ***
  part           0.151      0.029    5.23  2.6e-07 ***
  part4          0.263      0.029    9.07  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  
  Residual standard error: 8.3 on 501 degrees of freedom
  Multiple R-squared: 0.309,      Adjusted R-squared: 0.306 
  F-statistic:  112 on 2 and 501 DF,  p-value: <2e-16  
Note: There was no need to rename lag(part,-4) to part4, it's just an example of what you can do.
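
In case you're curious about what happens if you don't tie the series together first: lag() only changes the time attributes, not the underlying numbers, so lm() sees two identical columns of data. A sketch of the wreckage (the lagged coefficient should come back NA because the columns are aliased):

summary(lm(cmort~part + lag(part,-4), na.action=NULL))   # don't do this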

An alternative to the above is the package dynlm which has to be installed, of course (like we did for astsa up there at the beginning). After the package is installed, you can do the previous example as follows:
require(dynlm)                          # load the package
fit = dynlm(cmort~part + lag(part,-4))  # assumes cmort and part are ts objects, which they are
# fit = dynlm(cmort~part + L(part,4))  is the same thing.
summary(fit)
  Call:
  dynlm(formula = cmort ~ part + lag(part, -4))
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -22.7429  -5.3677  -0.4136   5.2694  37.8539 
  
  Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   69.01020    1.37498  50.190  < 2e-16 ***
  part           0.15140    0.02898   5.225 2.56e-07 ***
  lag(part, -4)  0.26297    0.02899   9.071  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  
  Residual standard error: 8.323 on 501 degrees of freedom
  Multiple R-Squared: 0.3091,     Adjusted R-squared: 0.3063 
  F-statistic: 112.1 on 2 and 501 DF,  p-value: < 2.2e-16 


Well, it's time to simulate. The workhorse for ARIMA simulations is arima.sim(). Here are some examples; no output is shown here so you're on your own.
# some AR1s 
x1 = arima.sim(list(order=c(1,0,0), ar=.9), n=100) 
x2 = arima.sim(list(order=c(1,0,0), ar=-.9), n=100)
par(mfrow=c(2,1))
plot(x1, main=(expression(AR(1)~~~phi==+.9)))  # ~ is a space and == is equal  
plot(x2, main=(expression(AR(1)~~~phi==-.9)))
dev.new()           # open another graphics device if you wish
acf2(x1, 20)
dev.new()           # and another
acf2(x2, 20)  # recall acf2 belongs to astsa...  
#  ... if it's not loaded, then you would have to do something like:
par(mfrow=c(2,1)); acf(x1); pacf(x1) 

# an MA1  
x = arima.sim(list(order=c(0,0,1), ma=.8), n=100)
plot(x, main=(expression(MA(1)~~~theta==.8)))
dev.new()
acf2(x)

# an AR2 
x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=100) 
plot(x, main=(expression(AR(2)~~~phi[1]==1~~~phi[2]==-.9)))
dev.new()
acf2(x)
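# if you want to see how the sample values stack up against the truth,
#   ARMAacf() in base R computes the population ACF and PACF of this AR(2):
ARMAacf(ar=c(1,-.9), lag.max=24)              # true ACF (lags 0 to 24)
ARMAacf(ar=c(1,-.9), lag.max=24, pacf=TRUE)   # true PACF - zero after lag 2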

# an ARIMA(1,1,1) 
x = arima.sim(list(order=c(1,1,1), ar=.9, ma=-.5), n=200)
plot(x, main=(expression(ARIMA(1,1,1)~~~phi==.9~~~theta==-.5)))
dev.new()         # the process is not stationary, so there is no population [P]ACF ...
acf2(x, 30)   # but look at the sample values to see how they differ from the examples above
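
One more simulation tip: your simulated series won't match anyone else's (or your own next run) unless you set the random seed first, e.g.,

set.seed(90210)   # the seed value is arbitrary - any integer makes the run reproducible
x = arima.sim(list(order=c(1,0,0), ar=.9), n=100)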


◊ Next, we're going to do some ARIMA estimation. This gets a bit tricky because R is not useR friendly when it comes to fitting ARIMA models. Much of the story is spelled out in our R Issues page. I'll be as gentle as I can at first.

First, we'll fit an ARMA model to some simulated data (with diagnostics and forecasting):
x = arima.sim(list(order=c(1,0,1), ar=.9, ma=-.5), n=100) # simulate some data
(x.fit = arima(x, order = c(1, 0, 1)))   # fit the model and print the results

Call: arima(x = x, order = c(1, 0, 1))
  Coefficients:
           ar1      ma1  intercept <-- NOT the intercept - see R Issue 1 
        0.8465  -0.5021     0.5006
  s.e.  0.0837   0.1356     0.3150
  sigma^2 estimated as 1.027:  log likelihood = -143.44,  aic = 294.89
... diagnostics:
tsdiag(x.fit, gof.lag=20) # don't use this - it's partially WRONG

... and the output

[figure: tsdiag(x.fit) diagnostics - see R Issue 4]

... forecast 10 ahead:

x.fore = predict(x.fit, n.ahead=10)  
# plot the forecasts
U = x.fore$pred + 2*x.fore$se
L = x.fore$pred - 2*x.fore$se
minx=min(x,L)
maxx=max(x,U)
ts.plot(x,x.fore$pred,col=1:2, ylim=c(minx,maxx))
lines(U, col="blue", lty="dashed")
lines(L, col="blue", lty="dashed") 

... and here's the plot of the data and the forecasts (with error bounds):

[figure: the data with 10-ahead forecasts and ±2 standard error bounds]


That wasn't too bad... but hold on. We're going to work with the global temperature data from Chapter 2.
require(astsa)   # if it's not loaded already
data(gtemp)
plot(gtemp)      # graph the series (not shown here) 
... long story short, the data appear to be an ARIMA(1,1,1) with a drift of about +.6°C per century. Let's fit the model:
arima(gtemp, order=c(1,1,1))   
   Call:  arima(x = gtemp, order = c(1, 1, 1))
   Coefficients:
            ar1      ma1
         0.2256  -0.7158
   s.e.  0.1235   0.0792
   sigma^2 estimated as 0.009539:  log likelihood = 116.83,  aic = -227.65
So what's wrong? .... well, there's no estimate of the drift!! With no drift, the global warming hypothesis is kaput (technical term)... that is, the temps are just basically taking a random walk. How do you get the estimate of drift?... do this:
arima(diff(gtemp), order=c(1,0,1))  # diff the data and fit an arma to the diffed data
   Call: arima(x = diff(gtemp), order = c(1, 0, 1))
   Coefficients:
            ar1      ma1  intercept  
         0.2570  -0.7854     0.0064  <-- this is the mean of the diff-ed temps, which is the drift
   s.e.  0.1177   0.0707     0.0025
   sigma^2 estimated as 0.009163:  log likelihood = 119.34,  aic = -230.68
What happened? The two runs should have given the same results, but the default models for the two cases are different. I won't go into detail here because the details can be found on the R Issues page (see Issues 2 and 3). And, of course, this problem continues if you try to do forecasting. There are remedies. One remedy is to use sarima from astsa:
sarima(gtemp, 1, 1, 1) # partial output shown (easy, huh?)
   Coefficients:
            ar1      ma1  constant
         0.2569  -0.7853    0.0064
   s.e.  0.1178   0.0708    0.0025
   sigma^2 estimated as 0.009163:  log likelihood = 119.34,  aic = -230.68
   $AIC
   [1] -3.646393   
   $AICc
   [1] -3.628547
   $BIC
   [1] -4.580219

 yep - you get the CORRECT diagnostics at no extra charge    
[figure: sarima residual diagnostics]
If you want to do ARIMA forecasting, sarima.for is included in astsa. Also, you might be wondering about the difference between aic and AIC above. For that you have to read the text or just don't worry about it, they only differ by constants that don't change from model to model.
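
For example, forecasting the global temps 10 years ahead with the model above is one line (it draws the forecast plot for you):

sarima.for(gtemp, 10, 1, 1, 1)   # n.ahead=10 from the ARIMA(1,1,1) with constant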

◊ You may not have understood all the details of this example, but at least you should realize that you may get into trouble when fitting ARIMA models with R. In particular, you should come away from this realizing that, in R, arima(x, order=c(1,1,1)) is different than arima(diff(x), order=c(1,0,1)) and arima calls the estimate of the mean the intercept. Oh yeah, and tsdiag() is wrong. Again, much of the story is spelled out in our R Issues page.
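
If you're stuck with base R, one workaround (a sketch - check it against the sarima fit above) is to hand arima() the time index as an external regressor; the regressor gets differenced along with the data, leaving a constant, i.e., the drift:

arima(gtemp, order=c(1,1,1), xreg=time(gtemp))   # the xreg coefficient estimates the drift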


And now for some regression with autocorrelated errors. This can be accomplished in at least two different ways. The hard and slow way is to use gls() from the package nlme (details). But we'll do it the easy and quick way (you can never be too easy or too quick). We're going to fit the model Mt = α + βt + γPt + et, where Mt and Pt are the mortality (cmort) and particulates (part) series, and et is autocorrelated error. First, do an OLS fit and check the residuals:
trend = time(cmort)       # assumes cmort and part are there from previous examples
fit.lm = lm(cmort~trend + part)  # ols
acf2(resid(fit.lm))       # check acf and pacf of the resids     
# resids appear to be AR(2) 
Now we'll use arima() to fit the full model:
(fit = arima(cmort, order=c(2,0,0), xreg=cbind(trend, part))) 
   Coefficients:
            ar1     ar2  intercept    trend    part
         0.3980  0.4135  3132.7085  -1.5449  0.1503
   s.e.  0.0405  0.0404   854.6662   0.4328  0.0211
   sigma^2 estimated as 28.99:  log likelihood = -1576.56,  aic = 3165.13 

Box.test(resid(fit), 12, type="Ljung")  # check whiteness via Ljung-Box-Pierce statistic                  
  X-squared = 8.2535, df = 12, p-value = 0.765                           
pchisq(8.2535, 10, lower=FALSE)    # the p-value using df = 12-2 (the resids are from an AR(2) fit)
  [1] 0.6040905
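
And for the record, the gls() route mentioned above would look something like this (a sketch, assuming nlme is installed; the estimates should be in the same ballpark as the arima() fit):

require(nlme)
fit.gls = gls(cmort~trend + part, correlation=corARMA(p=2), method="ML")
summary(fit.gls)
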
◊ On ARMAX: If you want to fit an ARMAX model you have to do it via a state space model... the details are in Chapter 6 of the text. Note that you can't use xreg in arima() to fit an ARMAX model (it was never claimed you could). You can only use it to do regression with autocorrelated errors.

Finally, a spectral analysis quicky:
x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=2^8) # some data
(u = polyroot(c(1,-1,.9)))             # x is AR(2) w/complex roots
  [1] 0.5555556+0.8958064i 0.5555556-0.8958064i
Arg(u[1])/(2*pi)                       # dominant frequency around .16:
  [1] 0.1616497
par(mfcol=c(2,2))
plot.ts(x, main="da data")  
mvspec(x, spans=c(5,5), plot=TRUE, taper=.1, log="no") # nonparametric spectral estimate                           
# spec.pgram(x, spans=c(5,5), log="no")  # same using stats package; also see spectrum()
spec.ar(x, log="no")                     # parametric spectral estimate
arma.spec(ar = c(1,-.9), log="no")       # model spectral density 

and the graph:

[figure: the data, nonparametric and parametric spectral estimates, and the model spectrum]

To get a raw periodogram do this:
per = abs(fft(x-mean(x)))^2/length(x)  # Mod() and abs() same here 
# if you want a nice picture: 
freq = (1:length(x)-1)/length(x)
plot(freq, per, type="h")
Note that fft() returns a complex vector of the same length as x, so per has length(x) values (print per to see this). If you only want to display the periodogram from 0 to .5, then you have to figure that out on your own because I'm leaving now.