Dr. H. Block's STAT 1301(2300)

FALL 2007, TH 11:00 - 12:15, CL 244B

________________________________________________

Last updated September 12, 2007

________________________________________________________________________

Office Hours and Contact Information

Office hours: T 10:00-10:45, H 2:20-3:45, others by appointment

Office: 2703 CL

Phone: 624-8369

Email: hwb@stat.pitt.edu

Grader: Chunhsiung Lu (email: ch186@pitt.edu)

C. Lu's office hours: M 10:30-11:30, H 1:00-2:00 (2617 CL)

FINAL EXAM DATE & TIME: Monday, December 10, 10:00 - 11:50 AM

Textbooks

1. G. Der and B. Everitt (2002). A Handbook of Statistical Analyses Using SAS (Second Edition).

2. P. Dalgaard (2002). Statistics and Computing: Introductory Statistics with R (ISwR).

Software

SAS is available for a nominal fee (usually $10) from Software Licensing. R is free and is available at www.R-project.org.

There is online help for SAS at http://support.sas.com/publishing/.

Final: Monday, December 10. Open book and notes (including HWs). Covers Chapters 1-7 of the SAS book and Chapters 1-5 of the R text.


Assignments

Number   Date Due   Assignment               Comments

SAS Assignments

#1       9/6        p. 51/2.1-2.6            Run programs and describe your findings
#2       9/13       p. 78/3.2-3.3            Programs, outputs (+logs), comments required
                                             (misprint in 3.2: quantities multiplied, not subtracted)
#3       9/20       p. 99/4.1, 4.3-4.5       Programs, outputs (+logs), comments required
#4       9/27       p. 116/5.1, 5.2          (as above)
#5       10/11      p. 130/6.1-6.4           (as above)
                    p. 142/7.1-7.3

R Assignments

#6       10/18      p. 44/1.1-1.3
#7       10/25      p. 44/1.4, 1.5, 1.7, 1.8
#8       11/1       p. 55/2.1-2.4
#9       11/8       p. 80/3.2-3.5
#10      11/27      p. 93/4.1-4.5
#11      12/4       p. 110/5.1-5.3

Solutions

Solutions to Chapter 2 problems:

2.1

Use significance level α = 0.05.

The null hypothesis is that “Mortal” follows a normal distribution. Because the p-value > 0.05, we fail to reject it, so the data are consistent with “Mortal” being normally distributed.

For “Hardness”, however, the p-value < 0.05, so we reject the null hypothesis that “Hardness” follows a normal distribution. We conclude that “Hardness” is not normally distributed.

You can also check the normal probability plots, which lead to the same conclusions.
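
The idea behind a normal plot can be sketched outside SAS. The Python snippet below is only an illustration, not the SAS output used for this homework: it pairs the sorted sample values with the quantiles of a fitted normal distribution. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, the function name is hypothetical, and the sample values are made up. Points that lie close to a straight line are consistent with normality.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n are an assumed convention.
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

sample = [12.1, 9.8, 11.4, 10.3, 10.9, 8.7, 11.9]  # made-up values, not the course data
points = normal_qq_points(sample)
for q, x in points:
    print(f"{q:8.3f} {x:8.3f}")
```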

 

2.2

The boxplots show that the distributions of “Mortal” and “Hardness” differ by location (North or South): “Mortal” tends to be higher in the North, whereas “Hardness” tends to be higher in the South.

 

2.3

Comparing the two candidate distributions (lognormal and exponential), the exponential fits “Hardness”. The null hypothesis is that “Hardness” follows an exponential distribution, and because the p-value > 0.05 we fail to reject it. The data are therefore consistent with an exponential distribution, which you can also check with the plot.

You should also look at the test for the lognormal distribution, where you would reach the opposite conclusion.

 

2.4

The kernel density estimates do not show any significant extrapolations in either distribution.

 

2.5

The 3-D plots show that the bivariate distributions differ between the two locations.

 

2.6

The fitted lines for the two locations are approximately parallel. Their slopes are negative, so the two variables are negatively correlated in both locations. In addition, for a given “Hardness”, the points for the North lie somewhat higher than those for the South.

 

Solutions to Chapter 3 Problems:

3.2

Comparing the original residual, r, and adj r, we can conclude the following:

 

  1. Original residual: It shows only the difference between each observed value and its expected value. Comparing these differences across cells is not meaningful; the only useful information is the sign.
  2. r: This expresses the difference relative to the original values, so comparing cells is meaningful. A larger r means a larger change within that cell.
  3. Adj r (adjusted residual): Same interpretation as r. In addition, a larger value means a larger change in that cell relative to the whole table, because the denominator is adjusted for the size of the corresponding row and column.
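
The three quantities compared above can be computed side by side. The following Python sketch assumes that r is the usual Pearson residual, (observed - expected)/sqrt(expected), and that adj r divides r by sqrt((1 - row share) × (1 - column share)). These are the standard definitions and are an assumption about what the textbook uses. The function name is hypothetical and the counts are made up.

```python
from math import sqrt

def cell_residuals(table):
    """Return (raw, pearson, adjusted) residual tables for a two-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    raw, pearson, adjusted = [], [], []
    for i, row in enumerate(table):
        raw.append([])
        pearson.append([])
        adjusted.append([])
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            diff = observed - expected
            r = diff / sqrt(expected)
            # The row and column factors are multiplied together (assumed form).
            scale = sqrt((1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            raw[i].append(diff)
            pearson[i].append(r)
            adjusted[i].append(r / scale)
    return raw, pearson, adjusted

counts = [[10, 20], [30, 40]]  # made-up counts, not the homework data
raw, pearson, adjusted = cell_residuals(counts)
for name, tab in (("raw", raw), ("r", pearson), ("adj r", adjusted)):
    print(name, [[round(v, 3) for v in row] for row in tab])
```

In a 2 × 2 table all four adjusted residuals have the same absolute value, which is a quick sanity check on the calculation.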

 

 

3.3

For these data, we shouldn’t apply the chi-square test, the likelihood-ratio test, or Fisher’s exact test.

First, most of the expected cell counts are below 5. With counts this small, the chi-square approximation behind the chi-square and likelihood-ratio tests is poor, so their p-values cannot be trusted.

Second, most of the row values are smaller than 10. With counts this small, Fisher’s exact test is conservative.

See also p. 73 of the textbook.
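
The expected-count check can be done by hand or with a few lines of code. The Python sketch below computes the expected count for each cell as (row total × column total)/n and reports the share of cells below 5. The function names are hypothetical and the table is made up, so it only illustrates the check and is not the homework data.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]

def share_below(expected, threshold=5.0):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)

counts = [[3, 1, 2], [2, 4, 1], [1, 2, 3]]  # made-up counts
expected = expected_counts(counts)
small_share = share_below(expected)
print(f"{small_share:.0%} of cells have an expected count below 5")
```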

Here, I’d suggest using the Mantel-Haenszel test. It shows no significant difference among the cells.

This is only my own view, and I know it differs somewhat from what Dr. Block taught in class, so I am happy to discuss it if you want to. As far as I remember, I didn’t take points off if you used Fisher’s test. If I did, you are welcome to ask for them back.

Chapter 4 Solutions

4.1

The best five-variable model by Cp selection contains Age, Ed, Ex1, U2 and X. This agrees with the result of stepwise selection.
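
As a reminder, Mallows' Cp for a candidate model with p parameters (intercept included) is usually written as below, where SSE_p is the candidate model's error sum of squares, s^2 is the mean squared error of the full model, and n is the number of observations. This is the standard form and is given here as a reminder, not as a quotation from the textbook. Candidate models with Cp close to p are preferred.

```latex
C_p = \frac{\mathrm{SSE}_p}{s^2} - (n - 2p)
```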

 

4.3

This question is not very clear.

Regression diagnostics involve several different steps. One is model selection; for example, we learned stepwise and Cp selection from the textbook. Another is checking the fitted model, for example through its residuals (this is what the book discusses). Either approach is acceptable here, although I would prefer the second one, following the book.

Model selection:

There are plenty of methods for model selection. You can also use Forward (F), Backward (B), MAXR, MINR, etc. The following are some other options that you can try on your own.

Fit Statistics

ADJRSQ: computes adjusted R²

AIC: computes Akaike's information criterion

B: computes parameter estimates for each model

BIC: computes Sawa's Bayesian information criterion

CP: computes Mallows' Cp statistic

GMSEP: computes estimated MSE of prediction assuming multivariate normality

JP: computes Jp, the final prediction error

MSE: computes MSE for each model

PC: computes Amemiya's prediction criterion

RMSE: displays root MSE for each model

SBC: computes the SBC statistic

SP: computes Sp statistic for each model

SSE: computes error sum of squares for each model

Fitted model:

You can use tests or plots: for example, residuals versus the explanatory variables or the expected values, studentized or semi-studentized residuals compared with some other value, a normal probability plot, a q-q plot, etc.

4.4

In the alternative model, the VIF for the new variable Avex = (Ex0 + Ex1)/2 is slightly smaller than the VIF for Ex1 when Ex1 is used alone (Ex0 dropped). Both models are quite good at reducing the effect of Ex0 and Ex1. Of course, you can also try model selection and see how it goes.
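
For intuition about what a VIF measures, recall that the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the other predictors. The Python sketch below covers only the special case of a model with exactly two predictors, where R² is the squared correlation between them. It is a simplified illustration under that assumption, not the multi-predictor VIF that SAS reports. The function names are hypothetical and the values are made up.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x_a = [1, 2, 3, 4, 5]      # made-up values
x_b = [1, 2, 3, 4, 6]      # nearly collinear with x_a
x_c = [2, 1, 3, 1, 2]      # uncorrelated with x_a

vif_collinear = vif_two_predictors(x_a, x_b)
vif_unrelated = vif_two_predictors(x_a, x_c)
print(round(vif_collinear, 2), round(vif_unrelated, 2))
```

A VIF near 1 indicates little collinearity, while a large value signals that the two predictors carry almost the same information, which is the situation with Ex0 and Ex1.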

 

4.5

Based on the problem, we probably need to set up two models: one with the interaction and one without it.

I know someone asked me about this before, and my answer was to set up only the model with the interaction, so most students checked only that model. That is still fine for this problem. Just state clearly which model you use and how you judge whether it fits.

Here, I would use plots like Displays 4.11 and 4.13 on pp. 98-99, which are easy and commonly used. It is also OK to use the SAS code in the textbook. However, it is hard to tell from the plots alone whether the model fits, so you can also check the ANOVA table and give your comments.

As you can see, the residual plot for the model without interaction is not that good: the residuals are not evenly spread. Moreover, the normal probability plot is not a straight line.

For the model with interaction, the residual plot is much better, although there appear to be some outliers or influential observations; further remedial procedures are needed to confirm this. The normal probability plot is also much better than for the first model, but normality should still be checked with other tests. See the following plots.

(a) Without interaction:

 

(b) With interaction:

 

 Chapter 5 Solutions

1. The output agrees with the result from the Scheffé method. That is, for both Duncan’s and the Bonferroni tests, the mean for drug X is significantly different from the means for drugs Y and Z.

 

2.

a. For the diet effect, the mean and the range for diet present are somewhat lower than those for diet absent. It is therefore reasonable to test the difference between these two diet levels with the Scheffé, Duncan’s, or Bonferroni tests.

b. For the drug effect, the mean and the range for drug X are somewhat lower than those for drugs Y and Z. This supports the results of the tests in the book and in 5.1.

c. For the biofeed effect, the mean and the range for biofeed present are somewhat lower than those for biofeed absent. It is therefore reasonable to test the difference between these two biofeed levels with the Scheffé, Duncan’s, or Bonferroni tests.
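
The Bonferroni comparisons mentioned in these solutions rest on a simple adjustment: with m pairwise comparisons, each one is tested at level α/m so that the overall error rate stays at most α. The Python sketch below only lists the pairs and computes that per-comparison level. It does not run the tests themselves, and the function name is hypothetical.

```python
from itertools import combinations

def bonferroni_pairs(groups, alpha=0.05):
    """Return all pairwise comparisons and the Bonferroni per-comparison level."""
    pairs = list(combinations(groups, 2))
    return pairs, alpha / len(pairs)

pairs, per_test_alpha = bonferroni_pairs(["X", "Y", "Z"])
for a, b in pairs:
    print(f"compare drug {a} with drug {b} at level {per_test_alpha:.4f}")
```

With three drugs there are three pairwise comparisons, so each is tested at 0.05/3.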

 

 

Data Sets

Informats Data

Alicia Grossman 13 c 10-28-2003 7.8 6.5 7.2 8.0 7.9
Matthew Lee 9 D 10-30-2003 6.5 5.9 6.8 6.0 8.1
Elizabeth Garcia 10 C 10-29-2003 8.9 7.9 8.5 9.0 8.8
Lori Newcombe 6 D 10-30-2003 6.7 5.6 4.9 5.2 6.1
Jose Martinez 7 d 10-31-2003 8.9 9.510.0 9.7 9.0
Brian Williams 11 C 10-29-2003 7.8 8.4 8.5 7.9 8.0

Proc Tabulate Data

Silent Lady Maalea sail sch 75.00
America II Maalea sail yac 32.95
Aloha Anai Lahaina sail cat 62.00
Ocean Spirit Maalea power cat 22.00
Anuenue Maalea sail sch 47.50
Hana Lei Maalea power cat 28.99
Leilani Maalea power yac 19.99
Kalakaua Maalea power cat 29.50
Reef Runner Lahaina power yac 29.95
Blue Dolphin Maalea sail cat 42.95

ODS Data

Big Zac, red, 80, 5
Delicious, red, 80, 3
Dinner Plate, red, 90, 2
Goliath, red, 85, 1.5
Mega Tom, red, 80, 2
Big Rainbow, yellow, 90, 1.5
Pineapple, yellow, 85, 2