UP

Two Way ANOVA

unbalanced

Firstly, we need to create a Data frame. On this dataframe we execute an unbalanced two-way ANOVA.  Description of the Data-set can be found in NKNW96. In short it describes the days a treatment is effective as function of the duration of a treatment and the weightgain of the patient.

Steps to follow:

1)
Create the data-frame on the data (textfile  Ch22pr05; right click on the file and save on C:\Temp )
2) Create factors in the frame
3) Execute the ANOVA two-way with interaction;

Check the the graphically the predictions, residuals, QQ-plot, Cook distance plot...for every ANOVA and note your observations  

4) Balanced and fixed factors; so SS-decomposition is useful and correct

The General Linear Test or classical F tests can be done.

5) If appropiate execute multiple comparison or test specific contrast questions.

We create a data-frame by typing (or copy/paste)  the red letters after > followed by  <ENTER> The blue letters are the answer from the R-programme. We check the content by typing in the name of the data-frame. Remark systems like R originate in Unix in which they discriminate uppercase and undercase and replace the usual backslash "\" by two  "\\".

>Ch22pr05<-read.table("c:\\temp\\ch22pr05.txt")<ENTER>

Or directly reading via the URL:

Ch22pr05<-read.table("http://www.biw.kuleuven.be/vakken/statisticsbyR/datasetsTXT/CH22PR05.txt")

Ch22pr05

         V1    V2    V3    V4
1          0      1       1      1
2          2      1       1      2
3          1      1       1      3
4          3      1       1      4

(rest of dataset)

The variable names are V1 to V4. We change them to meaningful names by:

>names(Ch22pr05)<-c('days','duration','weightgn','nr'); Ch22pr05  <ENTER>

             days duration weightgn nr
1                0           1            1   1
2                2           1            1   2
3                1           1            1   3
(rest of dataset)

We transfrom the Y variable for better residuals and create a new vector:

 trans<- log10(Ch22pr05$days+1);trans
[1] 0.0000000       0.4771213      0.3010300     0.6020600  (rest of vector)

> is.vector(trans)
[1] TRUE

We can then combine the data.frame "Ch22pr05" and the vector "trans" into a new data.frame 'Medical'

> Medical<-cbind(Ch22pr05,trans);Medical
               days duration weightgn nr             trans
1                  0           1            1   1   0.0000000
2                  2           1            1   2   0.4771213
3                  1           1            1   3   0.3010300

(rest of dataset)

We almost  ready with our data-preparation and can make our variables more accessible by the attach statement:

>attach(Medical)
 

We could create factors by the factor function in the lm-formula. It is safer to declare them now:

> duration<-factor(duration) ; weightgn<-factor(weightgn)
> is.factor(duration)
[1] TRUE
 

Now in the further analysis the variables are known as e.g. duration and the full name with object (name of data-frame) and property (name of Variable) with $ as separator like Medical$duration  are not necessary anymore .

We are finally in business ( compare this to e.g. the datastep in SAS and others). R is a bit rough here as the volunteers of  free-ware software are more keen on the statistics. 


An useful visualisation is by calling the interaction plot function with (Factor A, Factor B , Response Var) as arguments

>  interaction.plot(duration,weightgn,trans) <ENTER>

Parallel lines or not? Look and judge carefully. What is your opinion?


Now we can go for the Full model: two-way factor with interaction ANOVA (the data are unbalanced so the decompostion of SSTR in SSA SSB and SSAB is not useful). Rahter we execute the GLT. In R just execute the Full model and the Reduced model and then ask execute the 'anova' function.

   ResMedFull<-lm(trans~duration+weightgn+duration:weightgn)
   ResMedNoint<-lm(trans~duration+weightgn)
  anova(ResMedFull,ResMedNoint)

Analysis of Variance Table

Model 1: trans ~ duration + weightgn + duration:weightgn
Model 2: trans ~ duration + weightgn
Res.     Df       RSS       Df      Sum of Sq               F       Pr(>F)
1          51   5.3383
2          53   5.4393       -2          -0.1010      0.4826          0.62

We can accept the H0 or the reduced model ( no interaction). Is that conform to the interaction plot?

Having excluded the interaction we check whether both factors are important.

We use the 'anova' function in another way:

> anova(ResMedNoint)
Analysis of Variance Table

Response: trans
                    Df      Sum Sq     Mean Sq        F value        Pr(>F)
duration          1       0.4071        0.4071        3.9668       0.05157     .
weightgn         2       3.0692        1.5346        14.9529     7.093e-06 ***
Residuals      53       5.4393        0.1026
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

We see that the 'duration' is borderline just less then the 5%.

As it is unbalanced; it is not correct to use the Sum of Squares-decompostion, but we have to use the GLT:

> ResMedW<-lm(trans~weightgn)
> anova(ResMedNoint,ResMedW)

Analysis of Variance Table

Model 1: trans ~ duration + weightgn
Model 2: trans ~ weightgn
Res.    Df            RSS         Df        Sum of Sq             F            Pr(>F)
1        53          5.4393
2        54          5.8410        -1        -0.4017       3.9143          0.05309 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 

As it is almost balanced the mistake would be small.. However, the second approach is the correct one. The GLT is equivalent to the type III SS (called orthogonal SS ; which means in pratical terms the SS extra by adding this term after all other terms)

We accept the one-factor model or the two-factor; depending on our judgement. For the model we accept we have to check the residuals. We can plot them on one page. We subdivide 4 windows on one page organized 2 by  2 by the layout command:

> layout(matrix(c(1,2,3,4),2,2))
> plot(ResMedW)


Are we happy with the residuals??? In other words are the assumptions on the errors in the model correct?


Check the residuals for the untransformed dependent variable.

ResMedNoTrans<-lm(days~duration+weightgn)

Plot(ResMedNoTrans)

what do you think; is the transformation appropiate or not?


For unbalanced ANOVA we can use the regression approach (NKNW; 96 pag 832)

The dataset is prepared in regressionapproach. Right click and save on "c:/temp"..

RegCh22pr05<-read.table("c:\\temp\\regressionapproach.txt",header=TRUE);RegCh22pr05

We can "regress"; we use the dummy variables as numeric variables not factors.

 RegFullMod=lm( transform~ XA1+ XA2+ XB1+ XA1B1+ XA2B1,RegCh22pr05)

RegRedlMod=lm( transform~ XA1+ XA2+ XB1,RegCh22pr05)
and can compare to the reduced model without interaction by the allmighty GLT:

anova(RegRedMod,RegFullMod)
Analysis of Variance Table

Model 1: transform ~ XA1 + XA2 + XB1
Model 2: transform ~ XA1 + XA2 + XB1 + XA1B1 + XA2B1
                    Res.Df        RSS    Df    Sum of Sq          F      Pr(>F)
                1        53      5.438
                2        51      5.337       2     0.101       0.4826         0.62

Coefficients of full model:

RegFullMod

Call:
lm(formula = transform ~ XA1 + XA2 + XB1 + XA1B1 + XA2B1, data = RegCh22pr05)

Coefficients:
(Intercept)       XA1          XA2          XB1      XA1B1     XA2B1
0.69130   -0.27488   -0.01279    0.08404   -0.05706     0.01357
 

Compare to earlier results...
 

UP (for Proeftechniek)


Remark (niet voor "proeftechniek") : Creating dummies (indicator variables) can be done in several ways; the method in Neter et al, 1996 is just one method. By default R uses an alternative method, but allows several other methods by overriding the default. To add to the confusion R calls the indicator variable system contrasts. In what follows the method like in Neter et al, 1996, is applied in R and a matrix calculation is performed.

lmRes =lm(trans~weightgn+duration+duration:weightgn,contrasts=list(duration="contr.sum",weightgn="contr.sum"))
XMat= model.matrix(lmRes); XMat        

### shows the X matrix

YMat<-matrix(trans); YMat                     

### shows the Y matrix

 bMat2<-crossprod(solve(crossprod(XMat,XMat)),crossprod(XMat,YMat)) ; bMat2
                    [,1]
[1,] 0.69139158
[2,] -0.27491981
[3,] -0.01281334
[4,] 0.08406665
[5,] -0.05705599
[6,] 0.01354753


Compare the coefficients from the regression to indicatorvariables ( see above horizontal line):

Call:
lm(formula = transform ~ XA1 + XA2 + XB1 + XA1B1 + XA2B1, data = RegCh22pr05)

Coefficients:
(Intercept)       XA1          XA2          XB1      XA1B1     XA2B1
0.69130   -0.27488   -0.01279    0.08404   -0.05706     0.01357
The fourth  significant digit is different! This is due to a different algorithm to invert the matrix.


UP

31 March, 2003 by Guido Wyseure