UP

Two Way ANOVA

Missing cell (unbalanced)

Firstly,  we need to create a Data frame. On this dataframe we execute an unbalanced two-way ANOVA.  Description of the Data-set can be found in NKNW96. In short it describes the hours of relief of hay fever as function of the amount of ingredient A ( low, medium or high) and amount of ingredient B ( low, medium or high) in a compound medicine. See page 919 problem 22.9; one cell is missing.

Steps to follow:

1)
Create the data-frame on the data ( textfile  Ch22pr09; right click on the file and save on C:\Temp )
2) Create factors in the frame
3) Execute the ANOVA two-way with interaction;

Check the the graphically the predictions, residuals, QQ-plot, Cook distance plot...for every ANOVA and note your observations  

4) Balanced and fixed factors; so SS-decomposition is useful and correct

The General Linear Test or classical F tests can be done.

5) If appropiate execute multiple comparison or test specific contrast questions.

We create a data-frame by typing (or copy/paste)  the red letters after > followed by  <ENTER> The blue letters are the answer from the R-programme. We check the content by typing in the name of the data-frame. Remark systems like R originate in Unix in which they discriminate uppercase and undercase and replace the usual backslash "\" by two  "\\".

>Ch22pr09<-read.table("c:\\temp\\ch22pr09.txt")<ENTER>

Or directly reading via the URL:

Ch22pr09<-read.table("http://www.biw.kuleuven.be/vakken/statisticsbyR/datasetsTXT/CH22PR09.txt")

Ch22pr09

          V1 V2 V3 V4
1       2.4    1    1    1
2       2.7    1    1    2
3       2.3    1    1    3


(rest of dataset)

The variable names are V1 to V4. We change them to meaningful names by:

names(Ch22pr09)<-c('hours','A','B','nr'); Ch22pr09

              hours A   B   nr
1               2.4  1   1    1
2               2.7  1   1    2

(rest of dataset)

We almost  ready with our data-preparation and can make our variables more accessible by the attach statement:

attach(Ch22pr09)
 

We could create factors by the factor function in the lm-formula. It is safer to declare them now:

A<-factor(A) ; B<-factor(B); is.factor(A)
[1] TRUE
 

Now in the further analysis the variables are known as e.g. A and B and the full name with object (name of data-frame) and property (name of Variable) with $ as separator is not necessary anymore .

We are finally in business ( compare this to e.g. the datastep in SAS and others). R is a bit rough here as the volunteers of  free-ware software are more keen on the statistics. 


An useful visualisation is by calling the interaction plot function with (Factor A, Factor B,Response Var ) as arguments

 interaction.plot(A,B,hours)

We miss out the line with empty cell. In the function "interaction.plot" we can override some defaults to see a maximum of points:

 interaction.plot(A,B,hours,type="p",legend=F)

Cell 2:2 is missing. Cell 1:2 and 1:3 have a similar average.

Box-plots on one screen:

layout(matrix(c(1,2),1,2)); boxplot(hours~A);boxplot(hours~B)

What is your observation?


Let's calculate the estimated " μ 's" by:

MUidot=tapply(hours,A,mean);MUidot
             1              2              3
3.883333 7.287500 9.833333

MUdotj=tapply(hours,B,mean);MUdotj
             1              2              3
4.633333 7.437500 8.983333

MUij=tapply(hours,A:B,mean);MUij
     1:1    1:2     1:3      2:1    2:2     2:3     3:1       3:2       3:3
2.475 4.600 4.575 5.450   NA 9.125 5.975 10.275 13.250

As can be seen cell "2:2" is missing. We do not have any observations at the combined  medium level for A and B.

Now we can go for the Full model: two-way factor with interaction ANOVA.  The data are unbalanced so the decompostion of SSTR in SSA SSB and SSAB is not useful. Rather we execute the GLT if we wish to test parts of the model.

  ResMedFull<-lm(hours~A+B+A:B)
  anova(ResMedFull)
 Analysis of Variance Table

Response: hours
               Df     Sum Sq   Mean Sq      F value          Pr(>F)   
A               2   213.520    106.760    1666.50     < 2.2e-16 ***
B               2   117.560       58.780     917.54      < 2.2e-16 ***
A:B            3    28.374          9.458     147.64     1.338e-15 ***
Residuals  24     1.538          0.064                     
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

As it is unbalanced; it is not correct to use the Sum of Squares-decompostion, but we have to use the GLT:

We accept the one-factor model or the two-factor; depending on our judgement. For the model we accept we have to check the residuals. We can plot them on one page. We subdivide 4 windows on one page organized 2 by  2 by the layout command:

layout(matrix(c(1,2,3,4),2,2))
plot(ResMedFull)


Are we happy with the residuals??? In other words are the assumptions on the errors in the model correct?


Problem 22.9 a (see Neter et al, 1996; pag 919) a requires to study a number of comparisons.

Calculate the differences (remark that "1:3" means the index of the vector; or in practical terms level 1 for A with level 3 for B; while 1:3 without " means element 1 to 3 in the vector):

D1=MUij["1:3"]-MUij["1:1"]
D2=MUij["2:3"]-MUij["2:1"]
D3=MUij["3:3"]-MUij["3:1"]

L1=D1-D2; L2=D1-D3
D1;D2;D3;L1;L2

We are requested to compare by the Bonferroni procedure at the 90% family confidence coefficient.

dfMSE=ResMedFull$df.residual;dfMSE
[1] 24
MSE=sum(ResMedFull$residuals^2)/dfMSE;MSE
[1] 0.0640625
For unbalanced the formulas for estimators and their estimated variances can be found on pag 898 and 899 (Table 8.9) in Neter et al, 1996.

The number of observations per cell:

nij=tapply(hours,A:B,length);nij
     1:1 1:2 1:3 2:1 2:2 2:3 3:1 3:2 3:3
       4    4    4    4 NA   4   4    4    4

The estimated variance on D1, D2 and D3 is equal ( same observations per cell) and by formula 22.18:

sDestim=sqrt(MSE*(1/nij["1:3"]+1/nij["1:1"]));sDestim
            1:3
0.1789728

The estimated variance on L1and L2 is also equal ( same observations per cell) and by formula 22.19.

This requires also the contrast coefficients; for L1 it should be:

cij=c(-1,0,1,1,0,-1,0,0,0); sum(cij)
[1] 0
For the empty cell "2:2" see section 22.4.  Problem is that we have exclude this cell in the calculation; otherwise formula 22.19 cannot be evaluated.

cijE=c(cij[1:4],cij[6:9])
nijE=c(nij[1:4],nij[6:9])
sLestim=sqrt(t(cijE^2)%*%(1/nijE)*MSE);sLestim

                  [,1]
[1,] 0.2531057

 

Bonferroni at alfa 10% with 5 contrasts simultaneously:

 alfa=0.1
 ncomp=5
 tscore=qt(1-alfa/2/ncomp,dfMSE);tscore

[1] 2.492159
Now all confidence intervals can be calculated.


UP (for Proeftechniek)


Remark (niet voor "proeftechniek") : Creating dummies (indicator variables) can be done in several ways; the method in Neter et al, 1996 is just one method. By default R uses an alternative method, but allows several other methods by overriding the default. To add to the confusion R calls the indicator variable system contrasts. In what follows the method like in Neter et al, 1996, is applied in R and a X matrix is shown

XMat= model.matrix(lmRes)
mRes =lm(hours~A+B+A:B,contrasts=list(A="contr.sum",B="contr.sum"))
XMat= model.matrix(lmRes)
# shows the Xmat
XMat       


UP

31 March, 2003 by Guido Wyseure