
Multiple comparisons

First, we need to create a data-frame on which we execute the ANOVA. The dataset is the example from NKNW (1996), Table 17.4, p. 729.

Steps to follow:

1) Create the data-frame from the data (text file Table17.2; right-click on the file and save it in C:\Temp)
2) Create factors in the data-frame
3) Execute the ANOVA

For every analysis, check the predictions, residuals, QQ-plot, Cook's distance plot, etc. graphically and note your observations.

4) If appropriate, execute multiple comparisons or test specific contrasts.

We create a data-frame by typing (or copying and pasting) the red text after the > prompt, followed by <ENTER>. The blue text is the answer from R. We check the content by typing the name of the data-frame. Note that systems like R originate in Unix: they distinguish uppercase from lowercase, and in a file path the usual backslash "\" must be written twice ("\\") or replaced by a forward slash "/".

> RustDF<-read.table("C:/Temp/ch17ta02.txt")
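The path conventions mentioned above can be checked without any file on disk. The following is only a small sketch; the object names `p_forward`, `p_escaped`, `p_built` and `RustDemo` are made up for this illustration.

```r
# Three spellings of the same Windows path (file name as used in this tutorial).
p_forward <- "C:/Temp/ch17ta02.txt"                    # forward slashes
p_escaped <- "C:\\Temp\\ch17ta02.txt"                  # every backslash doubled
p_built   <- file.path("C:", "Temp", "ch17ta02.txt")   # let R insert the separators

# The doubled backslash is only an escape: the string stores a single "\".
nchar("C:\\Temp")

# file.path() joins its arguments with "/".
identical(p_built, p_forward)

# R is case-sensitive: RustDemo and rustdemo are different names.
RustDemo <- 1
exists("rustdemo")
```

Any of the three spellings can be passed to read.table().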

An alternative approach is to read the file directly from the web server:

> RustDF<-read.table("http://www.biw.kuleuven.be/vakken/statisticsbyR/datasetsTXT/CH17TA02.txt")

> RustDF

    V1 V2 V3
1 43.9  1  1
2 39.0  1  2
3 46.7  1  3
4 43.8  1  4
5 44.2  1  5

(rest of dataset)

We can give names to the variables:

> names(RustDF)=c("Rust","Brand","Replication")
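As a variation, the names can also be supplied while reading. The sketch below uses only the five rows printed above, typed inline, so it runs without the file; `RustHead` and `txt` are names introduced here for illustration and the result is not the full dataset.

```r
# The first five rows as printed above (the real file contains more rows).
txt <- "43.9 1 1
39.0 1 2
46.7 1 3
43.8 1 4
44.2 1 5"

# col.names gives the variables their names at reading time.
RustHead <- read.table(text = txt,
                       col.names = c("Rust", "Brand", "Replication"))

str(RustHead)         # Rust is numeric, Brand and Replication are integer codes
mean(RustHead$Rust)   # average of these five values only
```

With the real file, the same col.names argument can be added to the read.table() call shown earlier.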

We can make our variables more accessible by the attach statement:

> attach(RustDF)<ENTER>

Although we can create factors with the factor function inside the lm-formula, it is somewhat neater to do it now, and it protects us against mistakes:

>  Brand=factor(Brand)
> is.factor(Brand)
[1] TRUE

It is very important to ensure that qualitative variables are not used as numeric values! This is a common mistake with all software.
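What goes wrong can be seen in the degrees of freedom. The sketch below uses simulated data, not the rust data: the brand means, the standard deviation and the names `sim`, `wrong` and `right` are all invented for the illustration.

```r
set.seed(1)   # simulated values, NOT the rust inhibitor data
sim <- data.frame(
  Brand = rep(1:4, each = 10),   # brand stored as a plain number
  Rust  = rnorm(40, mean = rep(c(43, 89, 68, 40), each = 10), sd = 2.5)
)

# Brand as a number: R fits a straight line through the codes 1..4 (1 df).
wrong <- aov(Rust ~ Brand, data = sim)

# Brand as a factor: one mean per brand (4 - 1 = 3 df).
sim$BrandF <- factor(sim$Brand)
right <- aov(Rust ~ BrandF, data = sim)

summary(wrong)[[1]]$Df   # 1 38
summary(right)[[1]]$Df   # 3 36
```

A Df of 1 for a factor with four levels is the sign that the variable was treated as numeric.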


Before the analysis a visualisation is useful and important.

A boxplot is a popular way of comparing groups:

>  boxplot(Rust ~ Brand)
Result: [figure: boxplot of Rust by Brand]


The boxplot suggests that the differences between brands are substantial. However, we wish to know whether they are significant and which groups differ significantly.


Now we can execute the single-factor ANOVA. Note that we use aov() and not lm(). aov() calls lm() internally and returns an object tailored to ANOVA, which the Tukey multiple comparison (TukeyHSD) requires.

Is there a difference somewhere between the brands?

> summary(fm1 <- aov(Rust ~ Brand))
            Df  Sum Sq Mean Sq F value    Pr(>F)
Brand        3 15953.5  5317.8  866.12 < 2.2e-16 ***
Residuals   36   221.0     6.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Conclusion of the F-test: at least one brand mean differs from the others. The next question is: where are the differences?
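The F value and its p-value follow directly from the sums of squares and degrees of freedom in the table. A short check using only the printed (rounded) numbers, so the result agrees with the table up to rounding; the variable names are chosen here for illustration.

```r
# Values copied from the ANOVA table above (rounded as printed).
ss_brand <- 15953.5; df_brand <- 3
ss_resid <- 221.0;   df_resid <- 36

ms_brand <- ss_brand / df_brand   # mean square between brands
ms_resid <- ss_resid / df_resid   # mean square error (MSE)

f_value <- ms_brand / ms_resid    # F statistic
p_value <- pf(f_value, df_brand, df_resid, lower.tail = FALSE)

c(ms_brand = ms_brand, ms_resid = ms_resid, F = f_value, p = p_value)
```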

> plot(fm1)

These plots check the basic assumptions of the classic ANOVA model: errors that are normally distributed, independent and of constant variance.
Homoscedasticity (constant variance of the error) is very important, i.e. the error in the model must not depend on the factor levels. The QQ-plot checks the normal distribution. Outliers can be detected in the Cook's distance plot.

If we are not satisfied, we should consider transformations, removal of suspect outliers etc...
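The graphical checks can be complemented by a few numbers. This is a sketch on simulated data (invented means, not the rust data); the choice of the Bartlett and Shapiro-Wilk tests and the 4/n cut-off for Cook's distance are assumptions of this example, the latter only a common rule of thumb.

```r
set.seed(2)   # simulated values, NOT the rust inhibitor data
sim <- data.frame(
  Brand = factor(rep(1:4, each = 10)),
  Rust  = rnorm(40, mean = rep(c(43, 89, 68, 40), each = 10), sd = 2.5)
)
fit <- aov(Rust ~ Brand, data = sim)

# Constant variance across brands (homoscedasticity).
bt <- bartlett.test(Rust ~ Brand, data = sim)

# Normality of the residuals, the numeric counterpart of the QQ-plot.
sw <- shapiro.test(residuals(fit))

# Cook's distance per observation; flag values above the 4/n rule of thumb.
cd <- cooks.distance(fit)
suspect <- which(cd > 4 / nrow(sim))

c(bartlett_p = bt$p.value, shapiro_p = sw$p.value)
suspect
```

Small p-values or flagged observations are a reason to look again at the plots, not an automatic reason to remove data.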


Multiple comparisons by TukeyHSD (must be applied to an object created by aov())

(the classical Tukey method: family of all pairwise comparisons)

Avoid looking for differences if the ANOVA null hypothesis is not rejected. In that case the ANOVA test gives no evidence that the averages differ. However, it is possible that among many groups just one group differs from all the others while the null hypothesis is still not rejected. A boxplot will often help you to recognise such dangerous cases. Note also that with many groups (e.g. 20), unadjusted pairwise t-tests at alpha = 5% will very likely show at least one significant difference, typically between the lowest and the highest average, even when no real difference exists.
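The risk with many groups can be made concrete. The sketch below first uses the simple formula 1 - (1 - alpha)^m, which assumes independent tests (pairwise comparisons are not independent, so it is only a rough guide), and then runs a small simulation in which all 20 groups truly have the same mean. Group size, number of simulations and variable names are choices made for this example.

```r
alpha <- 0.05
k <- 20                     # number of groups
m <- choose(k, 2)           # number of pairwise comparisons

# Rough guide assuming independent tests: very close to 1.
fwer_indep <- 1 - (1 - alpha)^m

# Simulation: no real differences, unadjusted pairwise t-tests.
set.seed(3)
n_sim <- 200
g <- factor(rep(seq_len(k), each = 10))
any_sig <- replicate(n_sim, {
  y <- rnorm(length(g))     # every group has mean 0
  p <- pairwise.t.test(y, g, p.adjust.method = "none")$p.value
  min(p, na.rm = TRUE) < alpha   # at least one "significant" pair?
})
fwer_sim <- mean(any_sig)   # share of data sets with a false finding

c(comparisons = m, independent = fwer_indep, simulated = fwer_sim)
```

Replacing "none" by "bonferroni" in p.adjust.method brings the simulated rate back to about alpha or below.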

The Tukey multiple comparison procedure can be carried out with the function TukeyHSD. This function is available in base R (package stats).

> fm1Tukey=TukeyHSD(fm1,"Brand") ; fm1Tukey
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Rust ~ Brand)

$Brand
      diff        lwr         upr
2-1  46.30  43.315536  49.2844635
3-1  24.81  21.825536  27.7944635
4-1  -2.67  -5.654464   0.3144635
3-2 -21.49 -24.474464 -18.5055365
4-2 -48.97 -51.954464 -45.9855365
4-3 -27.48 -30.464464 -24.4955365


Differences between brands are significant at the 5% level if the confidence interval around the estimated difference does not contain zero.

This can be visualised by plotting the TukeyHSD result.

> plot(fm1Tukey)
 

We see that all brands differ from each other, with the exception of brand 1 compared to brand 4.
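The same reading can be done programmatically. The sketch below retypes the intervals printed above into a matrix (so it runs without the data file) and flags the pairs whose interval excludes zero; with a real TukeyHSD result the matrix would be fm1Tukey$Brand. The names `tk` and `excludes_zero` are introduced here.

```r
# Intervals copied from the TukeyHSD output above.
tk <- matrix(
  c( 46.30,  43.315536,  49.2844635,
     24.81,  21.825536,  27.7944635,
     -2.67,  -5.654464,   0.3144635,
    -21.49, -24.474464, -18.5055365,
    -48.97, -51.954464, -45.9855365,
    -27.48, -30.464464, -24.4955365),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("2-1", "3-1", "4-1", "3-2", "4-2", "4-3"),
                  c("diff", "lwr", "upr"))
)

# Significant at the family-wise 5% level: interval entirely above or below 0.
excludes_zero <- tk[, "lwr"] > 0 | tk[, "upr"] < 0

rownames(tk)[excludes_zero]    # pairs that differ
rownames(tk)[!excludes_zero]   # "4-1"
```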


Multiple comparisons by the three major methods proposed by NKNW (1996)

  1. Bonferroni (a specified number of comparisons)
  2. Scheffé (all contrasts)
  3. Tukey (all pairwise comparisons, same method as TukeyHSD)
Method worked out by Jim Robison-Cox. See multiple comparison JIMRC.
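To compare the three methods for this example, the critical multipliers can be computed with base R quantile functions. This is a sketch using the standard textbook formulas, not the worked-out method referred to above; it takes r = 4 brands and 36 residual degrees of freedom from the ANOVA table and assumes a balanced design with n = 10 observations per brand (40 observations in total).

```r
alpha <- 0.05
r     <- 4               # number of brands
df_e  <- 36              # residual degrees of freedom (ANOVA table)
g     <- choose(r, 2)    # 6 pairwise comparisons

# Multipliers of the standard error of a difference between two means.
B  <- qt(1 - alpha / (2 * g), df_e)                # Bonferroni, g comparisons
S  <- sqrt((r - 1) * qf(1 - alpha, r - 1, df_e))   # Scheffe, all contrasts
Tk <- qtukey(1 - alpha, r, df_e) / sqrt(2)         # Tukey, all pairs

# Half-width of the Tukey interval, from the printed residual sum of squares.
mse     <- 221.0 / 36
n       <- 10            # assumption: 10 observations per brand
se_diff <- sqrt(mse * 2 / n)
tukey_halfwidth <- Tk * se_diff

c(Tukey = Tk, Bonferroni = B, Scheffe = S, halfwidth = tukey_halfwidth)
```

For all pairwise comparisons Tukey gives the smallest multiplier here, and the half-width agrees with the TukeyHSD intervals above (upr minus diff).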

For contrasts in one-way ANOVA, see contrasts#one way ANOVA.



31 March, 2003 by Guido Wyseure