UP

Two Way ANOVA

One per Cell

First we need to create a Data frame. On this dataframe we execute an balanced two-way ANOVA.  The dataset is about cultivars of Alfafa and time of sowing.


1996 SOUTH DAKOTA ALFALFA CULTIVAR YIELD TEST

In 1996 werd een reeks van 7 variëteiten Alfalfa (luzerne of  Medicago sativa) getest op hun opbrengst.  In totaal werden de 7 variëteiten  ingezaaid op 28 velden.  Er werd gezaaid op vier verschillende tijdstippen, waarbij per variëteit random één veldje werd gekozen om in te zaaien.  De veldjes werden op hetzelfde ogenblik geoogst, waarbij de opbrengst werd gemeten (de voorgestelde gegevens zijn ton Droge Stof per acre)

The data are:
 

variëteit zaaitijd 1 zaaitijd 2 zaaitijd 3 zaaitijd 4
 
ICI 630 1.73 1.38 1.17 1.14
ICI 631 1.78 1.36 1.14 1.11
Viking 1 1.71 1.29 1.11 1.12
Multi-plier 1.72 1.32 1.04 0.98
Flagship 75 1.71 1.29 1.10 1.07
Proof 1.73 1.27 1.09 1.04
ICI Brand 645 1.62 1.32 0.99 0.97

Is het zaaitijdstip van belang ?  Is er een verschil tussen variëteiten ?  Welke variëteit geeft de hoogste opbrengst ?  Welk zaaitijdstip zorgt voor de hoogste opbrengst ?  Geef de onderstellingen die aan de hypothesetesten vooraf gaan. Is het in dit geval aannemelijk deze onderstellingen te maken ?
Vergelijk  variëteit 'ICI Brand 645' tegenover het gemiddelde van de zes andere variëteiten.

 


Steps to follow:

1)
Create the data-frame on the data ( we first create a text file from a textfile  Luzerne (alfafa)  right click on the file and save on C:\Temp )
2) Create factors in the frame
3) Execute the ANOVA two-way with interaction(try!!  ?) ;

Check the the graphically the predictions, residuals, QQ-plot, Cook distance plot...for every ANOVA and note your observations  

4) Balanced and fixed factors; so SS-decomposition is useful and correct

( The General Linear Test or classical F tests can be done but ANOVA-table can be used directly).

5) If appropiate execute multiple comparison or test specific contrast questions.

We create a data-frame by typing (or copy/paste)  the red letters after > followed by  <ENTER> The blue letters are the answer from the R-programme. We check the content by typing in the name of the data-frame. Remark systems like R originate in Unix in which they discriminate uppercase and undercase and replace the usual backslash "\" by two  "\\".


>luzerne<-read.table("c:\\temp\\luzerne.txt",header=T);luzerne
<ENTER>

Or directly from the URL:

>luzerne<-read.table("http://www.biw.kuleuven.be/vakken/statisticsbyR/anovabyRr/luzerne.txt",header=T);luzerne

      variety    time   yield
1        630        1   1.73
2        631        1   1.78
3     viking        1   1.71
4       multi        1   1.72
(rest of dataset)

We almost  ready with our data-preparation and can make our variables more accessible by the attach statement:

>attach(luzerne)

We could create factors by the factor function in the lm-formula, but it is safer to declare them in advance:

The variety is recognized by R as a factor:

> is.factor(variety)
[1] TRUE
But the numeric values for the sowing time are not!

> is.factor(time)
[1] FALSE

So, we ensure

> time<-factor(time)

Now in the further analysis the variables are known as e.g. variety and time and the full name with object (name of data-frame) and property (name of Variable) with $ as separator like luzerne$time  are not necessary anymore . If we do not indicate to R that time is a factor we get rubbish results. Other statistical programmes will equally go haywire as time = "1", "2" and "3" etc has no numerical meaning.

We are finally in business ( compare this to e.g. the datastep in SAS and others). R is a bit rough here as the volunteers of  free-ware software are more keen on the statistics then on the user interface.... 


An useful visualisation is by the command "interaction.plot (Factor.A, Factor.B , Response.Var)" as arguments:

>  interaction.plot(time,variety,yield) <ENTER>

Parallel lines or not? Look and judge carefully. What is your opinion?

If we are interested in one factor like the variety or the sewing time a "boxplot(Factor~Response.Var) might be useful:

> boxplot(yield~variety)

>  boxplot(yield~time)


As the lines in the plot are not parallel. We can go for the Full model: two-way factor with interaction ANOVA (the data are balanced so the decompostion of SSTR in SSA SSB and SSAB is meaningful). If necessary we reduce the model.

> ResFullMod<-lm(yield~variety+time+variety:time);ResFullMod

We use the 'anova' function to get the ANOVA-table ( type I SS or sequential extra sum of squares):

> anova(ResFullMod)
Analysis of Variance Table

Response: yield
                       Df     Sum Sq     Mean Sq    F value Pr(>F)
variety              6    0.04984       0.00831
time                  3    1.90641       0.63547
variety:time     18    0.02279       0.00127
Residuals          0    0.00000

We see that the SSE becomes zero and that the F score are not possible. We lack degrees of freedom to study interaction in case of one observation per cell.

The addivity test ( Neter et al (1996), pag 882 can be used to examen the interaction.

So we can only study the main factors without any interaction

> ResFullMod<-lm(yield~variety+time);anova(ResFullMod)
Analysis of Variance Table

Response: yield
                    Df      Sum Sq     Mean Sq       F value             Pr(>F)
variety            6     0.04984      0.00831       6.5624      0.0008353 ***
time               3      1.90641     0.63547     502.0025      < 2.2e-16 ***
Residuals     18      0.02279     0.00127
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

What can we conclude? Can we use the anova-table? Which factors have significantly different average?

> layout(matrix(c(1,2,3,4),2,2))
> plot(ResFullMod)


Are we happy with the residuals??? In other words are the assumptions on the errors in the model correct?


We can apply Tukey multiple comparisons to find group in variety:

> TukeyHSD(aov(yield~variety+time), "variety")
All combinations are compared. A lot quicker is to look to a plot of the differences:

 layout(matrix(c(1),1,1))
> plot(TukeyHSD(aov(yield~variety+time), "variety"))

We can see which varieties are different according to the Tukey multiple comparisons method.

Remark: TukeyHSD works on a list made by the function aov. This function is an application for lm limited to ANOVA.


We can apply Tukey multiple comparisons to find groups in sowing dates:

> TukeyHSD(aov(yield~variety+time), "time", ordered = TRUE)
All combinations are compared. A lot quicker is to look to a plot of the differences:

 layout(matrix(c(1),1,1))
> plot(TukeyHSD(aov(yield~variety+time), "time", ordered = TRUE))

We can see which sowing dates are different according to the Tukey multiple comparisons method.

Remark: ordered has a meaning here; the sowing dates are consecutively.


Tukey Additivity test ( Neter et al, 1996, pag 882 ).

R does not contain this test in the base package. Jim Robison-Cox in his stat 412 course (very slow web-site) programmed a short function in R to execute this test:

tukeys.add.test <- function(y,A,B){
## Y is the response vector
## A and B are factors used to predict the mean of y
## Note the ORDER of arguments: Y first, then A and B
   dname <- paste(deparse(substitute(A)), "and", deparse(substitute(B)),
                  "on",deparse(substitute(y)) )
   A <- factor(A); B <- factor(B)
   ybar.. <- mean(y)
   ybari. <- tapply(y,A,mean)
   ybar.j <- tapply(y,B,mean)
   len.means <- c(length(levels(A)), length(levels(B)))
   SSAB <- sum( rep(ybari. - ybar.., len.means[2]) * 
                rep(ybar.j - ybar.., rep(len.means[1], len.means[2])) *
                tapply(y, interaction(A,B), mean))^2 / 
                  ( sum((ybari. - ybar..)^2) * sum((ybar.j - ybar..)^2))
   aovm <- anova(lm(y ~ A+B))
   SSrem <- aovm[3,2] - SSAB
   dfdenom <- aovm[3,1] - 1
    STATISTIC <- SSAB/SSrem*dfdenom
    names(STATISTIC) <- "F"
    PARAMETER <- c(1, dfdenom)
    names(PARAMETER) <- c("num df", "denom df")
    D <- sqrt(SSAB/  ( sum((ybari. - ybar..)^2) * sum((ybar.j - ybar..)^2)))
    names(D) <- "D estimate"
    RVAL <- list(statistic = STATISTIC, parameter = PARAMETER, 
        p.value = 1 - pf(STATISTIC, 1,dfdenom), estimate = D,
        method = "Tukey's one df F test for Additivity", 
        data.name = dname)
    attr(RVAL, "class") <- "htest"
    return(RVAL)
   }

Copy paste the red-lines in R. Alternatively save tukey.add.test.R on your PC and execute File/Source R code from that location.

After nd execute:

 tukeys.add.test(yield,variety,time)

data: variety and time on yield
F = 1.8688,   num df = 1,   denom df = 17,   p-value = 0.1894
   sample estimates:
           D estimate
         0.8154772

See Neter et al, 1996, pag 882 -884 for interpretation.


We wish to compare "ICI Brand 645" with the average of the six other varieties. See page on contrasts#one per cell


UP

31 March, 2003 by Guido Wyseure