| UP |
Missing cell (unbalanced)
Firstly, we need to create a Data frame. On this dataframe we execute an unbalanced two-way ANOVA. Description of the Data-set can be found in NKNW96. In short it describes the hours of relief of hay fever as function of the amount of ingredient A ( low, medium or high) and amount of ingredient B ( low, medium or high) in a compound medicine. See page 919 problem 22.9; one cell is missing.
Steps to follow:
| 1) |
Create the data-frame on the data ( textfile Ch22pr09; right click on the file and save on C:\Temp ) |
| 2) | Create factors in the frame |
| 3) | Execute the ANOVA two-way with interaction;
Check the the graphically the predictions, residuals, QQ-plot, Cook distance plot...for every ANOVA and note your observations |
| 4) | Balanced and fixed factors; so SS-decomposition is useful and correct
The General Linear Test or classical F tests can be done. |
| 5) | If appropiate execute multiple comparison or test specific contrast questions. |
We create a data-frame by typing (or copy/paste) the red letters after > followed by <ENTER> The blue letters are the answer from the R-programme. We check the content by typing in the name of the data-frame. Remark systems like R originate in Unix in which they discriminate uppercase and undercase and replace the usual backslash "\" by two "\\".
>Ch22pr09<-read.table("c:\\temp\\ch22pr09.txt")<ENTER>
Or directly reading via the URL:
Ch22pr09<-read.table("http://www.biw.kuleuven.be/vakken/statisticsbyR/datasetsTXT/CH22PR09.txt")
Ch22pr09
V1 V2 V3 V4
1 2.4 1
1 1
2 2.7 1
1 2
3 2.3 1
1 3
(rest of dataset)
The variable names are V1 to V4. We change them to meaningful names by:
names(Ch22pr09)<-c('hours','A','B','nr'); Ch22pr09
hours A B
nr
1
2.4 1 1 1
2
2.7 1 1 2
(rest of dataset)
We almost ready with our data-preparation and can make our variables more accessible by the attach statement:
attach(Ch22pr09)
We could create factors by the factor function in the lm-formula. It is safer to declare them now:
A<-factor(A) ; B<-factor(B); is.factor(A)
[1] TRUE
Now in the further analysis the variables are known as e.g. A and B and the full name with object (name of data-frame) and property (name of Variable) with $ as separator is not necessary anymore .
We are finally in business ( compare this to e.g. the datastep in SAS and others). R is a bit rough here as the volunteers of free-ware software are more keen on the statistics.
An useful visualisation is by calling the interaction plot function with (Factor A, Factor B,Response Var ) as arguments
interaction.plot(A,B,hours)
We miss out the line with empty cell. In the function "interaction.plot" we can override some defaults to see a maximum of points:
interaction.plot(A,B,hours,type="p",legend=F)

Cell 2:2 is missing. Cell 1:2 and 1:3 have a similar average.
Box-plots on one screen:
layout(matrix(c(1,2),1,2)); boxplot(hours~A);boxplot(hours~B)
What is your observation?
Let's calculate the estimated " μ 's" by:
MUidot=tapply(hours,A,mean);MUidot
1
2
3
3.883333 7.287500 9.833333
MUdotj=tapply(hours,B,mean);MUdotj
1
2
3
4.633333 7.437500 8.983333
MUij=tapply(hours,A:B,mean);MUij
1:1 1:2
1:3 2:1 2:2
2:3 3:1 3:2
3:3
2.475 4.600 4.575 5.450 NA 9.125 5.975 10.275 13.250
As can be seen cell "2:2" is missing. We do not have any observations at the combined medium level for A and B.
Now we can go for the Full model: two-way factor with interaction ANOVA. The data are unbalanced so the decompostion of SSTR in SSA SSB and SSAB is not useful. Rather we execute the GLT if we wish to test parts of the model.
ResMedFull<-lm(hours~A+B+A:B)
anova(ResMedFull)
Analysis of Variance Table
Response: hours
Df Sum Sq Mean Sq
F value Pr(>F)
A
2 213.520 106.760 1666.50
< 2.2e-16 ***
B
2 117.560 58.780
917.54 < 2.2e-16 ***
A:B 3
28.374 9.458
147.64 1.338e-15 ***
Residuals 24 1.538
0.064
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
As it is unbalanced; it is not correct to use the Sum of Squares-decompostion, but we have to use the GLT:
We accept the one-factor model or the two-factor; depending on our judgement. For the model we accept we have to check the residuals. We can plot them on one page. We subdivide 4 windows on one page organized 2 by 2 by the layout command:
layout(matrix(c(1,2,3,4),2,2))
plot(ResMedFull)

Are we happy with the residuals??? In other words are the assumptions on the
errors in the model correct?
Problem 22.9 a (see Neter et al, 1996; pag 919) a requires to study a number of comparisons.
Calculate the differences (remark that "1:3" means the index of the vector; or in practical terms level 1 for A with level 3 for B; while 1:3 without " means element 1 to 3 in the vector):
D1=MUij["1:3"]-MUij["1:1"]
D2=MUij["2:3"]-MUij["2:1"]
D3=MUij["3:3"]-MUij["3:1"]
L1=D1-D2; L2=D1-D3
D1;D2;D3;L1;L2
We are requested to compare by the Bonferroni procedure at the 90% family
confidence coefficient.
dfMSE=ResMedFull$df.residual;dfMSE
[1] 24
MSE=sum(ResMedFull$residuals^2)/dfMSE;MSE
[1] 0.0640625
For unbalanced the formulas for estimators and their estimated variances can be
found on pag 898 and 899 (Table 8.9) in Neter et al, 1996.
The number of observations per cell:
nij=tapply(hours,A:B,length);nij
1:1 1:2 1:3 2:1 2:2 2:3 3:1 3:2
3:3
4 4 4
4 NA 4 4 4 4
The estimated variance on D1, D2 and D3 is equal ( same observations per cell) and by formula 22.18:
sDestim=sqrt(MSE*(1/nij["1:3"]+1/nij["1:1"]));sDestim
1:3
0.1789728
The estimated variance on L1and L2 is also equal ( same observations per cell) and by formula 22.19.
This requires also the contrast coefficients; for L1 it should be:
cij=c(-1,0,1,1,0,-1,0,0,0); sum(cij)
[1] 0
For the empty cell "2:2" see section 22.4. Problem is that we have exclude
this cell in the calculation; otherwise formula 22.19 cannot be evaluated.
cijE=c(cij[1:4],cij[6:9])
nijE=c(nij[1:4],nij[6:9])
sLestim=sqrt(t(cijE^2)%*%(1/nijE)*MSE);sLestim
[,1]
[1,] 0.2531057
Bonferroni at alfa 10% with 5 contrasts simultaneously:
alfa=0.1
ncomp=5
tscore=qt(1-alfa/2/ncomp,dfMSE);tscore
[1] 2.492159
Now all confidence intervals can be calculated.
UP (for Proeftechniek)
Remark (niet voor "proeftechniek") : Creating dummies (indicator variables) can be done in several ways; the method in Neter et al, 1996 is just one method. By default R uses an alternative method, but allows several other methods by overriding the default. To add to the confusion R calls the indicator variable system contrasts. In what follows the method like in Neter et al, 1996, is applied in R and a X matrix is shown
XMat= model.matrix(lmRes)
mRes =lm(hours~A+B+A:B,contrasts=list(A="contr.sum",B="contr.sum"))
XMat= model.matrix(lmRes)
# shows the Xmat
XMat
| UP |
31 March, 2003 by Guido Wyseure