| UP |
unbalanced
Firstly, we need to create a Data frame. On this dataframe we execute an unbalanced two-way ANOVA. Description of the Data-set can be found in NKNW96. In short it describes the days a treatment is effective as function of the duration of a treatment and the weightgain of the patient.
Steps to follow:
| 1) |
Create the data-frame on the data (textfile Ch22pr05; right click on the file and save on C:\Temp ) |
| 2) | Create factors in the frame |
| 3) | Execute the ANOVA two-way with interaction;
Check the the graphically the predictions, residuals, QQ-plot, Cook distance plot...for every ANOVA and note your observations |
| 4) | Balanced and fixed factors; so SS-decomposition is useful and correct
The General Linear Test or classical F tests can be done. |
| 5) | If appropiate execute multiple comparison or test specific contrast questions. |
We create a data-frame by typing (or copy/paste) the red letters after > followed by <ENTER> The blue letters are the answer from the R-programme. We check the content by typing in the name of the data-frame. Remark systems like R originate in Unix in which they discriminate uppercase and undercase and replace the usual backslash "\" by two "\\".
>Ch22pr05<-read.table("c:\\temp\\ch22pr05.txt")<ENTER>
Or directly reading via the URL:
Ch22pr05<-read.table("http://www.biw.kuleuven.be/vakken/statisticsbyR/datasetsTXT/CH22PR05.txt")
Ch22pr05
V1
V2 V3 V4
1 0
1 1 1
2 2
1 1 2
3 1
1 1 3
4 3
1 1 4
(rest of dataset)
The variable names are V1 to V4. We change them to meaningful names by:
>names(Ch22pr05)<-c('days','duration','weightgn','nr'); Ch22pr05 <ENTER>
days duration weightgn nr
1
0 1
1 1
2
2 1
1 2
3
1 1
1 3
(rest of dataset)
We transfrom the Y variable for better residuals and create a new vector:
trans<- log10(Ch22pr05$days+1);trans
[1] 0.0000000
0.4771213 0.3010300
0.6020600 (rest of vector)
> is.vector(trans)
[1] TRUE
We can then combine the data.frame "Ch22pr05" and the vector "trans" into a new data.frame 'Medical'
> Medical<-cbind(Ch22pr05,trans);Medical
days duration weightgn nr
trans
1
0 1
1 1 0.0000000
2
2 1
1 2 0.4771213
3
1 1
1 3 0.3010300
(rest of dataset)
We almost ready with our data-preparation and can make our variables more accessible by the attach statement:
>attach(Medical)
We could create factors by the factor function in the lm-formula. It is safer to declare them now:
> duration<-factor(duration) ; weightgn<-factor(weightgn)
> is.factor(duration)
[1] TRUE
Now in the further analysis the variables are known as e.g. duration and the full name with object (name of data-frame) and property (name of Variable) with $ as separator like Medical$duration are not necessary anymore .
We are finally in business ( compare this to e.g. the datastep in SAS and others). R is a bit rough here as the volunteers of free-ware software are more keen on the statistics.
An useful visualisation is by calling the interaction plot function with (Factor A, Factor B , Response Var) as arguments
> interaction.plot(duration,weightgn,trans) <ENTER>
Parallel lines or not? Look and judge carefully. What is your opinion?
Now we can go for the Full model: two-way factor with interaction ANOVA (the data are unbalanced so the decompostion of SSTR in SSA SSB and SSAB is not useful). Rahter we execute the GLT. In R just execute the Full model and the Reduced model and then ask execute the 'anova' function.
ResMedFull<-lm(trans~duration+weightgn+duration:weightgn)
ResMedNoint<-lm(trans~duration+weightgn)
anova(ResMedFull,ResMedNoint)
Analysis of Variance Table
Model 1: trans ~ duration + weightgn + duration:weightgn
Model 2: trans ~ duration + weightgn
Res. Df RSS
Df Sum of Sq
F Pr(>F)
1 51 5.3383
2 53 5.4393
-2 -0.1010
0.4826 0.62
We can accept the H0 or the reduced model ( no interaction). Is that conform to
the interaction plot?
Having excluded the interaction we check whether both factors are important.
We use the 'anova' function in another way:
> anova(ResMedNoint)
Analysis of Variance Table
Response: trans
Df Sum Sq Mean Sq
F value Pr(>F)
duration 1
0.4071 0.4071
3.9668 0.05157
.
weightgn 2
3.0692 1.5346
14.9529 7.093e-06 ***
Residuals 53
5.4393 0.1026
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.'
0.1 ` ' 1
We see that the 'duration' is borderline just less then the 5%.
As it is unbalanced; it is not correct to use the Sum of Squares-decompostion, but we have to use the GLT:
> ResMedW<-lm(trans~weightgn)
> anova(ResMedNoint,ResMedW)
Analysis of Variance Table
Model 1: trans ~ duration + weightgn
Model 2: trans ~ weightgn
Res. Df
RSS Df
Sum of Sq
F Pr(>F)
1 53 5.4393
2 54
5.8410 -1
-0.4017 3.9143
0.05309 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
As it is almost balanced the mistake would be small.. However, the second approach is the correct one. The GLT is equivalent to the type III SS (called orthogonal SS ; which means in pratical terms the SS extra by adding this term after all other terms)
We accept the one-factor model or the two-factor; depending on our judgement. For the model we accept we have to check the residuals. We can plot them on one page. We subdivide 4 windows on one page organized 2 by 2 by the layout command:
> layout(matrix(c(1,2,3,4),2,2))
> plot(ResMedW)

Are we happy with the residuals??? In other words are the assumptions on the
errors in the model correct?
Check the residuals for the untransformed dependent variable.
ResMedNoTrans<-lm(days~duration+weightgn)
Plot(ResMedNoTrans)
what do you think; is the transformation appropiate or not?
For unbalanced ANOVA we can use the regression approach (NKNW; 96 pag 832)
The dataset is prepared in regressionapproach. Right click and save on "c:/temp"..
RegCh22pr05<-read.table("c:\\temp\\regressionapproach.txt",header=TRUE);RegCh22pr05
We can "regress"; we use the dummy variables as numeric variables not factors.
RegFullMod=lm( transform~ XA1+ XA2+ XB1+ XA1B1+ XA2B1,RegCh22pr05)
RegRedlMod=lm( transform~ XA1+ XA2+ XB1,RegCh22pr05)
and can compare to the reduced model without interaction by the allmighty
GLT:
anova(RegRedMod,RegFullMod)
Analysis of Variance Table
Model 1: transform ~ XA1 + XA2 + XB1
Model 2: transform ~ XA1 + XA2 + XB1 + XA1B1 + XA2B1
Res.Df RSS Df
Sum of Sq F
Pr(>F)
1 53
5.438
2 51
5.337 2 0.101
0.4826 0.62
Coefficients of full model:
RegFullMod
Call:
lm(formula = transform ~ XA1 + XA2 + XB1 + XA1B1 + XA2B1, data = RegCh22pr05)
Coefficients:
(Intercept) XA1
XA2 XB1
XA1B1 XA2B1
0.69130 -0.27488 -0.01279 0.08404
-0.05706 0.01357
Compare to earlier results...
UP (for Proeftechniek)
Remark (niet voor "proeftechniek") : Creating dummies (indicator variables) can be done in several ways; the method in Neter et al, 1996 is just one method. By default R uses an alternative method, but allows several other methods by overriding the default. To add to the confusion R calls the indicator variable system contrasts. In what follows the method like in Neter et al, 1996, is applied in R and a matrix calculation is performed.
lmRes
=lm(trans~weightgn+duration+duration:weightgn,contrasts=list(duration="contr.sum",weightgn="contr.sum"))
XMat= model.matrix(lmRes); XMat
### shows the X matrix
YMat<-matrix(trans); YMat
### shows the Y matrix
bMat2<-crossprod(solve(crossprod(XMat,XMat)),crossprod(XMat,YMat))
; bMat2
[,1]
[1,] 0.69139158
[2,] -0.27491981
[3,] -0.01281334
[4,] 0.08406665
[5,] -0.05705599
[6,] 0.01354753
Compare the coefficients from the regression to indicatorvariables ( see
above horizontal line):
Call:
lm(formula = transform ~ XA1 + XA2 + XB1 + XA1B1 + XA2B1, data = RegCh22pr05)
Coefficients:
(Intercept) XA1
XA2 XB1
XA1B1 XA2B1
0.69130 -0.27488 -0.01279 0.08404
-0.05706 0.01357
The fourth significant digit is different! This is due to a
different algorithm to invert the matrix.
| UP |
31 March, 2003 by Guido Wyseure