The data sets used in this tutorial are available in astsa, the R package for the text. A detailed tutorial (and more!) is available in Appendix R of the text. This page is basically the quick fix from Edition 2 updated a bit.
You can copy-and-paste the R commands (multiple lines are ok) from this page into R.
This quick fix is meant for people who are just starting to use R for time series analysis. This is not a lesson in time series analysis.
If you're new to R/Splus, I suggest reading
R for Beginners (a pdf file) first.
Another good read for exploring time series
is Econometrics in R (a pdf file).
You may also want to poke around the QuickR website.
◊
Baby steps... your first R session. Get comfortable, then
start her up and try some simple addition:
2+2 [1] 5
Ok, now you're an expert useR.
It's time to move on to time series. What you'll see in the following examples should be
enough to get you through the first four chapters of the text.
We're going to get astsa now:
install.packages("astsa") # install it ... you'll be asked to choose the closest CRAN mirror require(astsa) # then load it (has to be done at the start of each session)
Let's play with the Johnson & Johnson data set.
It's included in astsa as
data(jj) # load the data - you don't have to do this anymore with astsa, but you do in general jj # print it to the screen Qtr1 Qtr2 Qtr3 Qtr4 1960 0.71 0.63 0.85 0.44 1961 0.61 0.69 0.92 0.55 . . . . . . . . . . 1979 14.04 12.96 14.85 9.99 1980 16.20 14.67 16.02 11.61and you see that jj is a collection of 84 numbers called a time series object. You can see/remove your objects this way:
ls() # list your objects [1] "what" "me" "worry" "jj" rm(worry) # remove your worry object
If you're a Matlab (or similar) user, you may think jj is an 84 × 1 vector, but it's not. It has order and length, but no dimensions (no rows, no columns). R calls these kinds of objects "vectors" so you have to be careful. In R, "matrices" have dimensions but "vectors" do not.
jj[1] # the first element [1] 0.71 jj[84] # the last element [1] 11.61 jj[1:4] # the first 4 elements [1] 0.71 0.63 0.85 0.44 jj[-(1:80)] # everything EXCEPT the first 80 elements [1] 16.20 14.67 16.02 11.61 length(jj) # the number of elements [1] 84 dim(jj) # but no dimensions ... NULL nrow(jj) # ... no rows NULL ncol(jj) # ... and no columns NULL #-- if you want it to be a column vector (in R, a matrix), an easy way to go is: jjm = as.matrix(jj) dim(jjm) [1] 84 1
Now, let's make a monthly time series object that starts in June of the year 2293. We enter the Vortex...
options(digits=2) # the default is 7, but it's more than I want now ?options # to see your options (it's the help file) ?rnorm # we're using it on the next line (zardoz = ts(rnorm(48), start=c(2293,6), frequency=12)) Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 2293 0.856 -0.355 0.245 -0.321 0.321 0.957 -0.867 2294 0.358 -0.359 1.384 -0.041 1.107 0.784 -0.241 0.610 0.657 -0.737 1.026 1.023 2295 -0.080 -0.686 1.445 0.066 -0.970 1.043 -0.516 -1.868 -1.783 1.741 -1.011 -1.044 2296 -0.674 1.269 -0.104 -0.916 2.219 -0.587 -0.532 0.749 1.892 0.097 -2.192 -1.129 2297 -0.322 -0.104 -0.240 -1.204 0.573 # use window() if you want a part of a ts object (oz = window(zardoz, start=2293, end=c(2295,12))) Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 2294 0.358 -0.359 1.384 -0.041 1.107 0.784 -0.241 0.610 0.657 -0.737 1.026 1.023 2295 -0.080 -0.686 1.445 0.066 -0.970 1.043 -0.516 -1.868 -1.783 1.741 -1.011 -1.044Note that the Johnson and Johnson data are quarterly earnings, hence it has frequency=4. The time series zardoz is monthly data, hence it has frequency=12. You also get some nice things with the ts object, for example:
time(jj) Qtr1 Qtr2 Qtr3 Qtr4 1960 1960.00 1960.25 1960.50 1960.75 1961 1961.00 1961.25 1961.50 1961.75 . . . . . . . . . . . . 1979 1979.00 1979.25 1979.50 1979.75 1980 1980.00 1980.25 1980.50 1980.75 cycle(jj) Qtr1 Qtr2 Qtr3 Qtr4 1960 1 2 3 4 1961 1 2 3 4 . . . . . . . . . . 1979 1 2 3 4 1980 1 2 3 4
Now try a plot of the Johnson y Johnson data:
plot(jj, ylab="Earnings per Share", main="J & J")
with the result being:
Try these and see what happens:
plot(jj, type="o", col="blue", lty="dashed") plot(diff(log(jj)), main="logged and diffed")and while you're here, check out plot.ts and ts.plot:
x = -5:5 # sequence of integers from -5 to 5 y = 5*cos(x) # guess par(mfrow=c(3,2)) # multifigure setup: 3 rows, 2 cols #--- plot: plot(x, main="plot(x)") plot(x, y, main="plot(x,y)") #--- plot.ts: plot.ts(x, main="plot.ts(x)") plot.ts(x, y, main="plot.ts(x,y)") #--- ts.plot: ts.plot(x, main="ts.plot(x)") ts.plot(ts(x), ts(y), col=1:2, main="ts.plot(x,y)") # note- x and y are ts objects #--- the help files [? and help() are the same]: ?plot.ts help(ts.plot) ?par # might as well skim the graphical parameters help file while you're here
k = c(.5,1,1,1,.5) # k is the vector of weights (k = k/sum(k)) [1] 0.125 0.250 0.250 0.250 0.125 fjj = filter(jj, sides=2, k) # ?filter for help [but you knew that already] plot(jj) lines(fjj, col="red") # adds a line to the existing plot lines(lowess(jj), col="blue", lty="dashed")... and the result:
Let's difference the logged data and call it dljj. Then we'll play with dljj:
dljj = diff(log(jj)) # difference the logged data plot(dljj) # plot it (not shown) shapiro.test(dljj) # test for normality Shapiro-Wilk normality test data: dljj W = 0.9725, p-value = 0.07211 sanity.clause(dljj) # test for sanity dljj is crazy SC = 45.57, p-value = 0.0003 # I couldn't fool you - there ain't no sanity clause
Now a histogram and a Q-Q plot, one on top of the other (but in a nice way):
par(mfrow=c(2,1)) # set up the graphics hist(dljj, prob=TRUE, 12) # histogram lines(density(dljj)) # smooth it - ?density for details qqnorm(dljj) # normal Q-Q plot qqline(dljj) # add a line
and the results:
Let's check out the correlation structure of dljj using various techniques. First, we'll look at a grid of scatterplots of dljj(t) versus lagged values.
lag1.plot(dljj, 9) # also investigate lag.plot
The lines are a lowess fit and the values in the box are the sample ACFs. Notice the large positive correlation at lags 4 and 8 and the negative correlations at a few other lags:
Now let's take a look at the ACF and PACF of dljj:
par(mfrow=c(2,1)) # The power of accurate observation is commonly called cynicism # by those who have not got it. - George Bernard Shaw acf(dljj, 20) # ACF to lag 20 - no graph shown... keep reading pacf(dljj, 20) # PACF to lag 20 - no graph shown... keep reading # !!NOTE!! acf2 on the line below is ONLY available in astsa and tsa3 acf2(dljj) # this is what you'll see below
plot(dog <- stl(log(jj), "per"))
Here's what you get:
◊ This is a good time to explain $. In the above, dog is an
object containing a bunch of things (technical term). If you type dog, you'll see
the components, and if you type summary(dog) you'll get a little summary of the results.
One of the components of dog is time.series, which contains the resulting
series (seasonal, trend, remainder). To see this component of the object dog, you
type dog$time.series (and you'll see 3 series, the last of which contains the residuals).
And that's the story of $ ... you'll see more examples
as we move along.
And now, we'll do some of Problem 2.1.
We're going to fit the regression
log(jj)= β*time + α_{1}*Q1 + α_{2}*Q2 + α_{3}*Q3 + α_{4}*Q4 + ε
where Qi is an indicator of the quarter i = 1,2,3,4.
Then we'll inspect the residuals.
Q = factor(cycle(jj)) # make (Q)uarter factors trend = time(jj)-1970 # not necessary to "center" time, but the results look nicer reg = lm(log(jj)~0+trend+Q, na.action=NULL) # run the regression without an intercept #-- the na.action statement is to retain time series attributes summary(reg) Call: lm(formula = log(jj) ~ 0 + trend + Q, na.action = NULL) Residuals: Min 1Q Median 3Q Max -0.29318 -0.09062 -0.01180 0.08460 0.27644 Coefficients: Estimate Std. Error t value Pr(>|t|) trend 0.167172 0.002259 74.00 <2e-16 *** Q1 1.052793 0.027359 38.48 <2e-16 *** Q2 1.080916 0.027365 39.50 <2e-16 *** Q3 1.151024 0.027383 42.03 <2e-16 *** Q4 0.882266 0.027412 32.19 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.1254 on 79 degrees of freedom Multiple R-squared: 0.9935, Adjusted R-squared: 0.9931 F-statistic: 2407 on 5 and 79 DF, p-value: < 2.2e-16You can view the model matrix (with the dummy variables) this way:
model.matrix(reg) trend Q1 Q2 Q3 Q4 (remember trend is time centered at 1970) 1 -10.00 1 0 0 0 2 -9.75 0 1 0 0 3 -9.50 0 0 1 0 4 -9.25 0 0 0 1 5 -9.00 1 0 0 0 6 -8.75 0 1 0 0 7 -8.50 0 0 1 0 8 -8.25 0 0 0 1 . . . . . . . . . . . . 81 10.00 1 0 0 0 82 10.25 0 1 0 0 83 10.50 0 0 1 0 84 10.75 0 0 0 1
Now check out what happened. Look at a plot of the observations and their fitted values:
plot(log(jj), type="o") # the data in black with little dots lines(fitted(reg), col=2) # the fitted values in bloody red - or use lines(reg$fitted, col=2)you get:
... and a plot of the residuals and the ACF of the residuals:
par(mfrow=c(2,1)) plot(resid(reg)) # residuals - reg$resid is same as resid(reg) acf(resid(reg),20) # acf of the residsand you get:
require(astsa) data(cmort, part) ded = ts.intersect(cmort,part,part4=lag(part,-4)) # align the series first fit = lm(cmort~part+part4, data=ded, na.action=NULL) # now the regression will work summary(fit) Call: lm(formula = cmort ~ part + part4, data = ded, na.action = NULL) Residuals: Min 1Q Median 3Q Max -22.74 -5.37 -0.41 5.27 37.85 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 69.010 1.375 50.19 < 2e-16 *** part 0.151 0.029 5.23 2.6e-07 *** part4 0.263 0.029 9.07 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 8.3 on 501 degrees of freedom Multiple R-squared: 0.309, Adjusted R-squared: 0.306 F-statistic: 112 on 2 and 501 DF, p-value: <2e-16Note: There was no need to rename lag(part,-4) to part4, it's just an example of what you can do.
require(dynlm) # load the package fit = dynlm(cmort~part + lag(part,-4)) # assumes cmort and part are ts objects, which they are # fit = dynlm(cmort~part + L(part,4)) is the same thing. summary(fit) Call: dynlm(formula = cmort ~ part + lag(part, -4)) Residuals: Min 1Q Median 3Q Max -22.7429 -5.3677 -0.4136 5.2694 37.8539 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 69.01020 1.37498 50.190 < 2e-16 *** part 0.15140 0.02898 5.225 2.56e-07 *** lag(part, -4) 0.26297 0.02899 9.071 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 8.323 on 501 degrees of freedom Multiple R-Squared: 0.3091, Adjusted R-squared: 0.3063 F-statistic: 112.1 on 2 and 501 DF, p-value: < 2.2e-16
# some AR1s x1 = arima.sim(list(order=c(1,0,0), ar=.9), n=100) x2 = arima.sim(list(order=c(1,0,0), ar=-.9), n=100) par(mfrow=c(2,1)) plot(x1, main=(expression(AR(1)~~~phi==+.9))) # ~ is a space and == is equal plot(x2, main=(expression(AR(1)~~~phi==-.9))) dev.new() # open another graphics device if you wish acf2(x1, 20) dev.new() # and another acf2(x2, 20) # recall acf2 belongs to astsa... # ... if it's not loaded, then you would have to do something like: par(mfrow=c(2,1)); acf(x1); pacf(x1) # an MA1 x = arima.sim(list(order=c(0,0,1), ma=.8), n=100) plot(x, main=(expression(MA(1)~~~theta==.8))) dev.new() acf2(x) # an AR2 x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=100) plot(x, main=(expression(AR(2)~~~phi[1]==1~~~phi[2]==-.9))) dev.new() acf2(x) # an ARIMA(1,1,1) x = arima.sim(list(order=c(1,1,1), ar=.9, ma=-.5), n=200) plot(x, main=(expression(ARIMA(1,1,1)~~~phi==.9~~~theta==-.5))) dev.new() # the process is not stationary, so there is no population [P]ACF ... acf2(x, 30) # but look at the sample values to see how they differ from the examples above
x = arima.sim(list(order=c(1,0,1), ar=.9, ma=-.5), n=100) # simulate some data (x.fit = arima(x, order = c(1, 0, 1))) # fit the model and print the results Call: arima(x = x, order = c(1, 0, 1)) Coefficients: ar1 ma1 intercept <-- NOT the intercept - see R Issue 1 0.8465 -0.5021 0.5006 s.e. 0.0837 0.1356 0.3150 sigma^2 estimated as 1.027: log likelihood = -143.44, aic = 294.89... diagnostics:
tsdiag(x.fit, gof.lag=20) # don't use this - it's partially WRONG
... and the output
<-- see R Issue 4 |
... forecast 10 ahead:
x.fore = predict(x.fit, n.ahead=10) # plot the forecasts U = x.fore$pred + 2*x.fore$se L = x.fore$pred - 2*x.fore$se minx=min(x,L) maxx=max(x,U) ts.plot(x,x.fore$pred,col=1:2, ylim=c(minx,maxx)) lines(U, col="blue", lty="dashed") lines(L, col="blue", lty="dashed")
... and here's the plot of the data and the forecasts (with error bounds):
require(astsa) # if it's not loaded already plot(gtemp) # graph the series (not shown here)... long story short, the data appear to be an ARIMA(1,1,1) with a drift of about +.6 ^{o}C per century. Let's fit the model:
arima(gtemp, order=c(1,1,1)) Call: arima(x = gtemp, order = c(1, 1, 1)) Coefficients: ar1 ma1 0.2256 -0.7158 s.e. 0.1235 0.0792 sigma^2 estimated as 0.009539: log likelihood = 116.83, aic = -227.65So what's wrong? .... well, there's no estimate of the drift!! With no drift, the global warming hypothesis is kaput (technical term)... that is, the temps are just basically taking a random walk. How do you get the estimate of drift?... do this:
arima(diff(gtemp), order=c(1,0,1)) # diff the data and fit an arma to the diffed data Call: arima(x = diff(gtemp), order = c(1, 0, 1)) Coefficients: ar1 ma1 intercept 0.2570 -0.7854 0.0064 <-- this is the mean of the diff-ed temps, which is the drift s.e. 0.1177 0.0707 0.0025 sigma^2 estimated as 0.009163: log likelihood = 119.34, aic = -230.68What happened? The two runs should have given the same results, but the default models for the two cases are different. I won't go into detail here because the details can be found on the R Issues page (see Issues 2 and 3). And, of course, this problem continues if you try to do forecasting. There are remedies. One remedy is to use sarima from astsa:
sarima(gtemp, 1, 1, 1) # partial output shown (easy, huh?) Coefficients: ar1 ma1 constant 0.2569 -0.7853 0.0064 s.e. 0.1178 0.0708 0.0025 sigma^2 estimated as 0.009163: log likelihood = 119.34, aic = -230.68 $AIC [1] -3.646393 $AICc [1] -3.628547 $BIC [1] -4.580219 yep - you get the CORRECT diagnostics at no extra chargeIf you want to do ARIMA forecasting, sarima.for is included in astsa. Also, you might be wondering about the difference between aic and AIC above. For that you have to read the text or just don't worry about it, they only differ by constants that don't change from model to model.
trend = time(cmort) # assumes cmort and part are there from previous examples fit.lm = lm(cmort~trend + part) # ols acf2(resid(fit.lm)) # check acf and pacf of the resids # resids appear to be AR(2)Now we'll use arima() to fit the full model:
(fit = arima(cmort, order=c(2,0,0), xreg=cbind(trend, part))) Coefficients: ar1 ar2 intercept trend part 0.3980 0.4135 3132.7085 -1.5449 0.1503 s.e. 0.0405 0.0404 854.6662 0.4328 0.0211 sigma^2 estimated as 28.99: log likelihood = -1576.56, aic = 3165.13 Box.test(resid(fit), 12, type="Ljung", fitdf = 2) # check whiteness via Q- statistic X-squared = 8.2535, df = 10, p-value = 0.6041 # fitdf=2 because the resids from an ar2 fit◊ On ARMAX: If you want to fit an ARMAX model you have to do it via a state space model... the details are in Chapter 6 of the text. Note that you can't use xreg in arima() to fit an ARMAX model (it was never claimed you could). You can only use it to do regression with autocorrelated errors.
x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=2^8) # some data (u = polyroot(c(1,-1,.9))) # x is AR(2) w/complex roots [1] 0.5555556+0.8958064i 0.5555556-0.8958064i Arg(u[1])/(2*pi) # dominant frequency around .16: [1] 0.1616497 par(mfcol=c(2,2)) plot.ts(x, main="da data") mvspec(x, spans=c(5,5), plot=TRUE, taper=.1, log="no") # nonparametric spectral estimate # spec.pgram(x, spans=c(5,5), log="no") # same using stats package; also see spectrum() spec.ar(x, log="no") # parametric spectral estimate arma.spec(ar = c(1,-.9), log="no") # model spectral density
and the graph:
per = abs(fft(x-mean(x)))^2/length(x) # Mod() and abs() same here # if you want a nice picture: freq = (1:length(x)-1)/length(x) plot(freq, per, type="h")Note that fft() returns a time series of length(x) (print per to see this). If you only want to display the periodogram from 0 to .5, then you have to figure that out on your own because I'm leaving now.