R time series quick fix

This is just a brief stroll down time seRies lane. My advice is to open R and play along with the tutorial. Hopefully, you have installed R and found the icon on your desktop that looks like an R... well, it is an R. If you're using Linux, then stop looking because it's not there... just open a terminal and enter R (or install RStudio).

If you want more on time series graphics, particularly using ggplot2, see the Graphics Quick Fix.

The quick fix is meant to expose you to basic R time series capabilities and is rated fun for people ages 8 to 80. This is NOT meant to be a lesson in time series analysis, but if you want one, you might try this easy short course:

EZ Online Time Series R Course (on DataCamp)


◊ Baby steps... your first R session. Get comfortable, then start her up and try some simple addition:

2+2
 [1] 4

Ok, now you're an expert useR... let's get astsa:

install.packages("astsa")  # install it ... 
library(astsa)             # then load it (has to be done at the start of each session)
data()                     # list the data sets in the packages you have loaded

Now that you're loaded, we can start... let's go!

First, we'll play with the Johnson & Johnson data set. It's included in astsa as jj, that dynOmite character from Good Times. Take a look at it:

jj        # print it to the screen
        Qtr1  Qtr2  Qtr3  Qtr4
  1960  0.71  0.63  0.85  0.44
  1961  0.61  0.69  0.92  0.55
    .     .     .     .     .
    .     .     .     .     .  
  1979 14.04 12.96 14.85  9.99
  1980 16.20 14.67 16.02 11.61

and you see that jj is a collection of 84 numbers called a time series object. To see/remove your objects:

ls()         # list your objects
 [1] "what"  "me"  "worry"  "jj"
rm(worry)    # remove your worry object

If you are a Matlab (or similar) user, you may think jj is an 84 × 1 vector, but it's not. It has order and length, but no dimensions (no rows, no columns). R calls these kinds of objects "vectors" so you have to be careful. In R, "matrices" have dimensions but "vectors" do not - they just sort of dangle about in cyberspace.

jj[1]       # the first element
  [1] 0.71
jj[84]      # the last element
  [1] 11.61
jj[1:4]     # the first 4 elements
  [1] 0.71 0.63 0.85 0.44
jj[-(1:80)] # everything EXCEPT the first 80 elements
  [1] 16.20 14.67 16.02 11.61
length(jj)  # the number of elements 
  [1] 84
dim(jj)     # but no dimensions ...
  NULL
nrow(jj)    # ... no rows
  NULL
ncol(jj)    # ... and no columns
  NULL
#-- if you want it to be a column vector (in R, a matrix), an easy way to go is:
jjm = as.matrix(jj)
dim(jjm)   
 [1] 84  1 
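
Note that the quarterly dates are gone from jjm - as.matrix keeps the numbers but drops the ts attributes. If you ever need to go back, a minimal sketch is to rebuild the attributes yourself (we know from above that jj starts in 1960 and has frequency 4):

jj2 = ts(as.vector(jjm), start=1960, frequency=4)   # back to a quarterly ts object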

Now, let's make a monthly time series object that starts in June of the year 2293. We enter the Vortex...

options(digits=2)  # the default is 7, but it's more than I want now
?options           # to see your options (it's the help file)
?rnorm             # we're using it on the next line
(zardoz = ts(rnorm(48), start=c(2293,6), frequency=12))
          Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
  2293                                     0.856 -0.355  0.245 -0.321  0.321  0.957 -0.867
  2294  0.358 -0.359  1.384 -0.041  1.107  0.784 -0.241  0.610  0.657 -0.737  1.026  1.023
  2295 -0.080 -0.686  1.445  0.066 -0.970  1.043 -0.516 -1.868 -1.783  1.741 -1.011 -1.044
  2296 -0.674  1.269 -0.104 -0.916  2.219 -0.587 -0.532  0.749  1.892  0.097 -2.192 -1.129
  2297 -0.322 -0.104 -0.240 -1.204  0.573

# use window() if you want a part of a ts object
(oz = window(zardoz, start=2294, end=c(2295,12)))
          Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
  2294  0.358 -0.359  1.384 -0.041  1.107  0.784 -0.241  0.610  0.657 -0.737  1.026  1.023
  2295 -0.080 -0.686  1.445  0.066 -0.970  1.043 -0.516 -1.868 -1.783  1.741 -1.011 -1.044

Note that the Johnson & Johnson data are quarterly earnings, hence the series has frequency=4. The time series zardoz is monthly data, hence it has frequency=12. You also get some useful things with the ts object, for example:

time(jj)   
          Qtr1    Qtr2    Qtr3    Qtr4
  1960 1960.00 1960.25 1960.50 1960.75
  1961 1961.00 1961.25 1961.50 1961.75
    .     .     .     .     .     .
    .     .     .     .     .     .
  1979 1979.00 1979.25 1979.50 1979.75
  1980 1980.00 1980.25 1980.50 1980.75
  
cycle(jj)  
       Qtr1 Qtr2 Qtr3 Qtr4
  1960    1    2    3    4
  1961    1    2    3    4
    .     .    .    .    .     
    .     .    .    .    .     
  1979    1    2    3    4
  1980    1    2    3    4
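
A few more extractors come free with any ts object (these are base R, not astsa):

frequency(jj)   # observations per unit of time
  [1] 4
start(jj)       # time of the first observation
  [1] 1960    1
end(jj)         # time of the last observation
  [1] 1980    4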

Now try a plot of the Johnson & Johnson data:

plot(jj, ylab="Earnings per Share", main="J & J")

# with the result being (close to):
[figure: time plot of jj]

The graph shown is a little fancier than the one the code will give. For details, see the Graphics Quick Fix page. The same goes for the rest of the plots you will see here.

Try these and see what happens:

plot(jj, type="o", col="blue", lty="dashed")
plot(diff(log(jj)), main="logged and diffed") 

and while you're here, check out plot.ts and ts.plot:

x = -5:5                  # sequence of integers from -5 to 5
y = 5*cos(x)              # guess
par(mfrow=c(3,2))         # multifigure setup: 3 rows, 2 cols
#---  plot:
plot(x, main="plot(x)")
plot(x, y, main="plot(x,y)")
#---  plot.ts:
plot.ts(x, main="plot.ts(x)")
plot.ts(x, y, main="plot.ts(x,y)")
#---  ts.plot:
ts.plot(x, main="ts.plot(x)")
ts.plot(ts(x), ts(y), col=1:2, main="ts.plot(x,y)")  # note- x and y are ts objects 
#---  the help files [? and help() are the same]:
?plot.ts
help(ts.plot)
?par        # might as well skim the graphical parameters help file while you're here

[figure: the six plots produced above]

Note that if your data are a time series object, plot() will do the trick (for a simple time plot, that is). Otherwise, plot.ts() will coerce the graphic into a time plot.
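
For instance, strip the ts attributes and compare:

plot(as.numeric(jj))      # just points against their index - not a time plot
plot.ts(as.numeric(jj))   # coerced to a time plot (the time axis is 1 to 84 now)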

How about filtering/smoothing the Johnson & Johnson series using a two-sided moving average? Let's try this:
  fjj(t) = ⅛ jj(t-2) + ¼ jj(t-1) + ¼ jj(t) + ¼ jj(t+1) + ⅛ jj(t+2)
and we'll add a lowess (?lowess - you know the routine) fit for fun.

k = c(.5,1,1,1,.5)            # k is the vector of weights
(k = k/sum(k))       
  [1] 0.125 0.250 0.250 0.250 0.125
fjj = filter(jj, sides=2, k)  # ?filter for help [but you knew that already]
plot(jj)
lines(fjj, col="red")         # adds a line to the existing plot
lines(lowess(jj), col="blue", lty="dashed")

#... and the result:
[figure: jj with the filtered series (red) and the lowess fit (blue dashed)]

Let's difference the logged data and call it dljj. Then we'll play with dljj:

dljj = diff(log(jj))        # difference the logged data
plot(dljj)                  # plot it (not shown)

shapiro.test(dljj)          # test for normality 
  Shapiro-Wilk normality test
  data:  dljj 
  W = 0.9725, p-value = 0.07211
  
sanity.clause(dljj)         # test for sanity
  dljj is crazy  
  SC = 45.57, p-value = 0.0003    
  
# I couldn't fool you - there ain't no sanity clause  

Now a histogram and a Q-Q plot, one on top of the other (but in a nice way):

par(mfrow=c(2,1))        # set up the graphics 
hist(dljj, prob=TRUE, breaks=12)   # histogram with 12 bins
lines(density(dljj))     # smooth it - ?density for details 
qqnorm(dljj)             # normal Q-Q plot  
qqline(dljj)             # add a line    

# and the results:
[figure: histogram with density estimate and normal Q-Q plot of dljj]

Let's check out the correlation structure of dljj using various techniques. First, we'll look at a grid of scatterplots of dljj(t) versus lagged values.

lag1.plot(dljj, 4)  # this is the astsa version of lag.plot in the stats package 

[figure: grid of scatterplots of dljj(t) versus its lagged values]

The lines are a lowess fit and the sample acf is blue in the box.

Now let's take a look at the ACF and PACF of dljj:

acf2(dljj)   # astsa gives both in one swell foop ... or use acf() and pacf() individually 
        ACF  PACF               ACF  PACF
 [1,] -0.51 -0.51       [11,] -0.22 -0.03
 [2,]  0.07 -0.26       [12,]  0.45 -0.03
 [3,] -0.40 -0.70       [13,] -0.21  0.04
 [4,]  0.73  0.27       [14,] -0.04 -0.08
 [5,] -0.37  0.16       [15,] -0.15 -0.04
 [6,]  0.00 -0.11       [16,]  0.35 -0.04
 [7,] -0.25 -0.01       [17,] -0.14 -0.04
 [8,]  0.56  0.11       [18,] -0.08 -0.06
 [9,] -0.28  0.05       [19,] -0.09 -0.02
[10,] -0.01  0.12       [20,]  0.27  0.01

[figure: sample ACF and PACF of dljj]

Note that the LAG axis is in terms of frequency, so 1,2,3,4,5 correspond to lags 4,8,12,16,20 because frequency=4 here. If you don't like this type of labeling, you can replace dljj in any of the above by ts(dljj, freq=1); e.g., acf(ts(dljj, freq=1), 20)

Moving on, let's try a structural decomposition of log(jj) = trend + season + error using loess (which is what stl does).

plot(dog <- stl(log(jj), "per"))   

[figure: stl decomposition of log(jj)]

If you want to inspect the residuals, for example, they're in dog$time.series[,3], the third column of the resulting series (the seasonal and trend components are in columns 1 and 2). Check out the ACF of the residuals, acf(dog$time.series[,3]); the residuals aren't white, not even close. You can do a little (very little) better using a local seasonal window, as opposed to the global one obtained by specifying "per". Type ?stl for details. There's also something called StructTS that will fit parametric structural models. We don't use these functions in the text when we present structural modeling in Chapter 6 because we prefer to use our own programs.
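
For example, here's a minimal sketch of the local-window version (the seasonal window width of 15 below is just a number to play with, not a recommendation; see ?stl):

dog2 = stl(log(jj), s.window=15)   # loess seasonal window instead of "per"
acf(dog2$time.series[,3])          # residual ACF - a little (very little) better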

◊ This is a good time to explain $. In the above, dog is an object containing a bunch of things (technical term). If you type dog, you'll see the components, and if you type summary(dog) you'll get a little summary of the results. One of the components of dog is time.series, which contains the resulting series (seasonal, trend, remainder). To see this component of the object dog, you type dog$time.series (and you'll see 3 series, the last of which contains the residuals). And that's the story of $ ... you'll see more examples as we move along.
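
For example, try these on the dog:

names(dog)              # the names of the components in the object
head(dog$time.series)   # first few rows of the (seasonal, trend, remainder) series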


And now we'll do a problem from Chapter 2. We're going to fit the regression
   log(jj) = β*time + α1*Q1 + α2*Q2 + α3*Q3 + α4*Q4 + ε
where Qi is an indicator of the quarter i = 1,2,3,4. Then we'll inspect the residuals.

Q      = factor(cycle(jj))        # make (Q)uarter factors
trend  = time(jj) - 1970          # not necessary to "center" time, but the results look nicer
reg    = lm(log(jj)~ 0 + trend + Q, na.action=NULL)  # run the regression without an intercept
#-- the na.action statement is to retain time series attributes

  summary(reg)
       Call:
  lm(formula = log(jj) ~ 0 + trend + Q, na.action = NULL)
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -0.29318 -0.09062 -0.01180  0.08460  0.27644 
  
  Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
  trend 0.167172   0.002259   74.00   <2e-16 ***
  Q1    1.052793   0.027359   38.48   <2e-16 ***
  Q2    1.080916   0.027365   39.50   <2e-16 ***
  Q3    1.151024   0.027383   42.03   <2e-16 ***
  Q4    0.882266   0.027412   32.19   <2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1   # (I see stars)
 
# Here's my idea for significance stars, the number of stars should be equal to 1/p.value, 
# so if the p.value is .01, you get 100 stars,  
# if the p.value is .05, you get 20 stars,
# and if the p.value is 0, you get an infinite number of stars. 
# Thus, the little table above would be something like:
  Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
  trend 0.167172   0.002259   74.00   <2e-16 ************************************************************* ...
  Q1    1.052793   0.027359   38.48   <2e-16 ************************************************************* ...
  Q2    1.080916   0.027365   39.50   <2e-16 ************************************************************* ...
  Q3    1.151024   0.027383   42.03   <2e-16 ************************************************************* ...
  Q4    0.882266   0.027412   32.19   <2e-16 ************************************************************* ...
  ---
# You would need lots of room, but it really makes the point.  
# By the way, you can't get more stars, but you can turn them off:
 
options(show.signif.stars=FALSE) 

# back to the output
  
  Residual standard error: 0.1254 on 79 degrees of freedom
  Multiple R-squared: 0.9935,     Adjusted R-squared: 0.9931 
  F-statistic:  2407 on 5 and 79 DF,  p-value: < 2.2e-16  

You can view the model matrix (with the dummy variables) this way:

model.matrix(reg)

    trend Q1 Q2 Q3 Q4   # remember trend is time centered at 1970
1  -10.00  1  0  0  0
2   -9.75  0  1  0  0
3   -9.50  0  0  1  0
4   -9.25  0  0  0  1
5   -9.00  1  0  0  0
6   -8.75  0  1  0  0
7   -8.50  0  0  1  0
8   -8.25  0  0  0  1
.     .    .  .  .  .
.     .    .  .  .  .
81  10.00  1  0  0  0
82  10.25  0  1  0  0
83  10.50  0  0  1  0
84  10.75  0  0  0  1

Now check out what happened. Look at a plot of the observations and their fitted values:

plot(log(jj), type="o")    # the data in black with little dots 
lines(fitted(reg), col=2)  # the fitted values in bloody red 

[figure: log(jj) in black with dots, fitted values in red]

which shows that a plot of the data with the fit superimposed is not worth the cyberspace it takes up.

But a plot of the residuals and the ACF of the residuals is worth its weight in joules:

par( mfrow = c(2,1) )
plot( resid(reg) )       # residuals
acf( resid(reg), 20 )    # acf of the resids 

[figure: residuals of the fit and their sample ACF]

Do those residuals look white? [Ignore the 0-lag correlation, it's always 1.] Hint: The answer is NO ... so the regression above is nugatory. So what's the remedy? Sorry, you'll have to take the class because this is not a lesson in time series... I warned you up at the top.

You have to be careful when you regress one time series on lagged components of another using lm(). There is a package called dynlm that makes it easy to fit lagged regressions, and I'll discuss that right after this example. If you use lm(), then what you have to do is "tie" the series together using ts.intersect. If you don't tie the series together, they won't be aligned properly. Here's an example regressing weekly cardiovascular mortality (cmort) on particulate pollution (part) at the present value and lagged four weeks (about a month). For details about the data set, see Chapter 2. Make sure astsa is loaded.

ded = ts.intersect(cmort, part, part4 = lag(part, -4))          # align the series first
fit = lm(cmort~ part+part4, data=ded, na.action=NULL)           # now the regression will work
summary(fit)                            
  Call:
  lm(formula = cmort ~ part + part4, data = ded, na.action = NULL)

  Residuals:
     Min     1Q Median     3Q    Max 
  -22.74  -5.37  -0.41   5.27  37.85 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   69.010      1.375   50.19  < 2e-16 ***
  part           0.151      0.029    5.23  2.6e-07 ***
  part4          0.263      0.029    9.07  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  
  Residual standard error: 8.3 on 501 degrees of freedom
  Multiple R-squared: 0.309,      Adjusted R-squared: 0.306 
  F-statistic:  112 on 2 and 501 DF,  p-value: <2e-16  

Note: There was no need to rename lag(part,-4) to part4; it's just an example of what you can do.
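
If you're wondering why the tying matters: lag() only shifts the time stamps (the tsp attribute) and not the stored values, and lm() ignores ts attributes, so without ts.intersect the "lagged" regressor would be the same numbers as part itself. A quick check:

all.equal(as.numeric(part), as.numeric(lag(part,-4)))  # TRUE - the values don't move, only the dates do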

An alternative to the above is the package dynlm which has to be installed, of course (like we did for astsa up there at the beginning). After the package is installed, you can do the previous example as follows:

library(dynlm)                          # load the package
fit = dynlm(cmort~part + lag(part,-4))  # assumes cmort and part are ts objects, which they are
# fit = dynlm(cmort~part + L(part,4))  is the same thing.
summary(fit)
  Call:
  dynlm(formula = cmort ~ part + lag(part, -4))
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -22.7429  -5.3677  -0.4136   5.2694  37.8539 
  
  Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   69.01020    1.37498  50.190  < 2e-16 ***
  part           0.15140    0.02898   5.225 2.56e-07 ***
  lag(part, -4)  0.26297    0.02899   9.071  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  
  Residual standard error: 8.323 on 501 degrees of freedom
  Multiple R-Squared: 0.3091,     Adjusted R-squared: 0.3063 
  F-statistic: 112.1 on 2 and 501 DF,  p-value: < 2.2e-16 

Well, it's time to simulate. The workhorse for ARIMA simulations is arima.sim(). Here are some examples; no output is shown here so you're on your own.

# some AR1s 
x1 = arima.sim(list(order=c(1,0,0), ar=.9), n=100) 
x2 = arima.sim(list(order=c(1,0,0), ar=-.9), n=100)
par(mfrow=c(2,1))
plot(x1, main=(expression(AR(1)~~~phi==+.9)))  # ~ is a space and == is equal  
plot(x2, main=(expression(AR(1)~~~phi==-.9)))
dev.new()           # open another graphics device if you wish
acf2(x1, 20)
dev.new()           # and another
acf2(x2, 20)        # recall acf2 belongs to astsa...  
#  ... if it's not loaded, then you would have to do something like:
par(mfrow=c(2,1)); acf(x1); pacf(x1) 

# an MA1  
x = arima.sim(list(order=c(0,0,1), ma=.8), n=100)
plot(x, main=(expression(MA(1)~~~theta==.8)))
dev.new()
acf2(x)

# an AR2 
x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=100) 
plot(x, main=(expression(AR(2)~~~phi[1]==1~~~phi[2]==-.9)))
dev.new()
acf2(x)

# an ARIMA(1,1,1) 
x = arima.sim(list(order=c(1,1,1), ar=.9, ma=-.5), n=200)
plot(x, main=(expression(ARIMA(1,1,1)~~~phi==.9~~~theta==-.5)))
dev.new()     # the process is not stationary, so there is no population [P]ACF in terms of lag only...
acf2(x, 30)   # but look at the sample values to see how they differ from the examples above

Using astsa it's easy to fit an ARIMA model:

sarima(gtemp, 1, 1, 1)   # fit an ARIMA(1,1,1) to the gtemp series 
  Coefficients:
           ar1      ma1  constant
        0.2570  -0.7854    0.0064
  s.e.  0.1177   0.0707    0.0025 
  sigma^2 estimated as 0.009163:  log likelihood = 119.34,  aic = -230.68
  
  $degrees_of_freedom          # for estimating sigma^2
   [1] 127
  
  $ttable                      # a t-table with 2-sided p-values but no stars
           Estimate     SE  t.value p.value
  ar1        0.2570 0.1177   2.1829  0.0309
  ma1       -0.7854 0.0707 -11.1050  0.0000
  constant   0.0064 0.0025   2.5778  0.0111
  
  $AIC                         # some important information criteria
   [1] -3.646395
  $AICc
   [1] -3.628549
  $BIC
   [1] -4.580221
  
#  and you get the diagnostics for no extra charge   
[figure: sarima residual diagnostics]

You might be wondering about the difference between aic and AIC above. For that you have to read the text or just don't worry about it because it's not worth ruining your day thinking about it. And yes, those residuals look white.

If you want to do ARIMA forecasting, sarima.for is included in astsa.

sarima.for(gtemp, 10, 1, 1, 1)   # forecast gtemp for 10 years ahead based on an ARIMA(1,1,1)  
  
 $pred
  Time Series:
  Start = 2010 
  End = 2019 
  Frequency = 1 
  [1] 0.552 0.553 0.558 0.564 0.570 0.576 0.583 0.589 0.596 0.602

 $se
  Time Series:
  Start = 2010 
  End = 2019 
  Frequency = 1 
  [1] 0.096 0.106 0.111 0.114 0.118 0.121 0.124 0.127 0.130 0.133

[figure: sarima.for forecast of gtemp; gray ribbons are ±1 and ±2 root MSPE bounds]
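
Those ribbons aren't mysterious: sarima.for returns the predictions and their root MSPEs (the $pred and $se shown above), so you can build the bounds yourself:

fore = sarima.for(gtemp, 10, 1, 1, 1)
U = fore$pred + 2*fore$se   # upper ±2 root MSPE bound
L = fore$pred - 2*fore$se   # lower ±2 root MSPE bound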

And now for some regression with autocorrelated errors. We're going to fit the model Mt = α + βt + γPt + et where Mt and Pt are the mortality (cmort) and particulates (part) series, and et is autocorrelated error. First, do an OLS fit and check the residuals:

trend = time(cmort) - mean(time(cmort))       # best to center time
fit.lm = lm(cmort ~ trend + part)             # ols
acf2( resid(fit.lm) )                         # check acf and pacf of the resids - not shown     
# resids appear to be AR(2) 

Now fit the model

sarima(cmort, 2,0,0, xreg = cbind(trend, part)) 

 Coefficients:
          ar1     ar2  intercept    trend    part
       0.3980  0.4135    81.6325  -1.5449  0.1503
 s.e.  0.0405  0.0404     1.6084   0.4328  0.0211
 sigma^2 estimated as 28.99:  log likelihood = -1576.56,  aic = 3165.13

 $degrees_of_freedom
  [1] 503
 $ttable
           Estimate     SE t.value p.value
 ar1         0.3980 0.0405  9.8363   0e+00
 ar2         0.4135 0.0404 10.2269   0e+00
 intercept  81.6325 1.6084 50.7549   0e+00
 trend      -1.5449 0.4328 -3.5700   4e-04
 part        0.1503 0.0211  7.1192   0e+00
 $AIC
  [1] 4.386799
 $AICc
  [1] 4.391066
 $BIC
  [1] 3.428437

The residual analysis (not shown) looks perfect.

Here's an ARMAX model, Mt = β0 + φ1 Mt-1 + φ2 Mt-2 + β1 t + β2 Tt-1 + β3 Pt + β4 Pt-4 + et, where et is possibly autocorrelated. First we try an ARMAX(p=2, q=0), then look at the residuals, and realize there's no correlation left, so we're done.

trend  = time(cmort) - mean(time(cmort))
u = ts.intersect(M = cmort, M1 = lag(cmort,-1), M2 = lag(cmort,-2), T1 = lag(tempr,-1), P = part, 
                 P4 = lag(part, -4), trend)
sarima(u[,1], 0,0,0, xreg=u[,2:7]) 
  $ttable                                  
            Estimate     SE t.value p.value
  intercept  40.2932 4.6285  8.7055  0.0000
  M1          0.3190 0.0372  8.5826  0.0000
  M2          0.3157 0.0391  8.0846  0.0000
  T1         -0.1961 0.0307 -6.3828  0.0000
  P           0.1210 0.0186  6.5102  0.0000
  P4          0.0210 0.0180  1.1633  0.2453
  trend      -0.4837 0.0955 -5.0642  0.0000
  
# No autocorrelation in the residuals, but it looks like P4 is not significant. 
# Rerun without P4 and the results are perfect.
  
sarima(u[,1], 0,0,0, xreg=u[,2:7][,-5])  
  $ttable                                  
            Estimate     SE t.value p.value
  intercept  40.2077 4.6341  8.6765       0
  M1          0.3222 0.0371  8.6806       0
  M2          0.3166 0.0391  8.0985       0
  T1         -0.1947 0.0307 -6.3325       0
  P           0.1321 0.0160  8.2666       0
  trend      -0.4806 0.0956 -5.0275       0
# and now we're done

Finally, a spectral analysis quickie:

x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=2^8) # some data
 (u = polyroot(c(1,-1,.9)))            # x is AR(2) w/complex roots                          
  [1] 0.5555556+0.8958064i 0.5555556-0.8958064i
 Arg(u[1])/(2*pi)                      # dominant frequency around .16:
  [1] 0.1616497 
par(mfcol=c(2,2))
plot.ts(x, main="da data")  
mvspec(x, spans=c(5,5), plot=TRUE, taper=.1, log="no") # nonparametric spectral estimate                           
spec.ar(x, log="no")                     # parametric spectral estimate
arma.spec(ar = c(1,-.9), log="no")       # model spectral density 

[figure: the data, nonparametric and parametric spectral estimates, and the model spectral density]

That's all for now. If you want more on time series graphics, see the Graphics Quick Fix page.