Up

Regression  

by R

First we need to create a Data frame. On this dataframe we execute the regression. This example is similar to the first example. Main difference is that we start from a spreadsheet.

Steps to follow:

1)
Create the data frame on the data ( we first create a text file from a spreadsheet WQijse.xls  , right click to save on C:\temp)
2) Plot the data and look at the graph;
3) Execute the regression;
4) Interpretation of the results.

We select the block of data (without the titles in the first row); copy them to the clipboard and paste (paste special if needed) them on a empty worksheet in cell A1. This worksheet we than save on "C:\temp" under the name "WQIJSE.txt" with the "File/save As" as a Text (Tab-delimited) (*.txt). The data will be saved on "C:\temp" exactly as they appear in the spreadsheet; make sure you perform the right formatting and centering of the cells is often useful. Check on  "#"'s or too many digits, be careful with dates.. Remark that alternative approaches are possible. Common headache are the delimiters in the textfile, empty rows and columns and the headerlines, which can cause havoc.

The function "read.table()" has several possibilities. Use "help(read.table)" to know the default values and which changes are necessary. The argument list contains variables with assigned values.

> help(read.table)<ENTER>

Gives the help window and under the title usage the list of arguments.

read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
row.names, col.names, as.is = FALSE, na.strings = "NA",
colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#")

As an example:  header = FALSE means that by default there are no headers. Only if header = TRUE we need to specify this.

A small trick is to open in R the "wqijse.txt" file by "file/display file" and to inspect the file.

We create a data-frame by typing the red letters after > followed by  <ENTER> The blue letters are the answer from the R-programme. We check the content by typing in the name of the data-frame. Remark systems like R originate in Unix in which they discriminate uppercase and undercase and replace the usual backslash "\" by two  "\\".

> WQ<-read.table("c:\\temp\\wqijse.txt")<ENTER>
> WQ<ENTER>
                V1                V2     V3      V4      V5     V6
1              25-Mar-91    25        3   1991     9.3     8.5
2              23-Apr-91    23         4  1991     9.0      9.7
3              24-May-91    24        5  1991    13.5    10.4
4              15-Jun-91      15        6  1991    13.5      7.4
(rest of dataset)

The variable names are V1 to V6. We change them to meaningful names by:

> names(WQ)<-c('Date','Day','Month','Year','Temp','Oxygen') <ENTER>

> WQ<ENTER>
         Date            Day   Month    Year   Temp   Oxygen
1       25-Mar-91    25           3    1991      9.3          8.5
2       23-Apr-91    23            4    1991     9.0          9.7
3       24-May-91   24            5    1991    13.5       10.4        (rest of dataset)

Remark: Uppercase and undercase are important in. So 'Oxygen' is not identical to 'oxygen'.

We like to see the data in a plot

Syntax  Y ~X  , name of the dataframe (names of the variables should precise)

> plot(Oxygen~Temp,WQ)<ENTER>

The plot shows that a linear relation might be appropiate.


We now excecute the regression ( is done by the command lm ; linear model):

> WQReg<-lm(Oxygen~Temp,WQ) <ENTER>

The result is an object WQReg. This object WQReg  (select a different name as the data WQ) is stored int eh memory as a list. Do not forget undercase and uppercase are relevant. So 'WQReg' is not identical to 'WQreg'; last is unknown.


Certain attributes of  WQReg can be displayed (zie table). 

QQ-plot of  residuals plot(WQReg)
residual dot plot plot(WQReg) na vorige
Estimation summary(WQReg)
ANOVA-table anova(WQReg)
regression-coefficients WQReg$coefficients or  coefficients(WQReg)
residuals WQReg$residuals or residuals(WQReg)
predicted values WQReg$fitted.values or fitted.values(WQReg) or predict(WQReg)
degrees of freedom of error term WQReg$df.residual or df.residual(WQReg)

Examples:

> plot(WQReg) <ENTER>
Hit <Return> to see next plot:
<ENTER>
Hit <Return> to see next plot:

(Note: R shows subsequently a number of graphs serving for diagnostics in a different window; after every <ENTER> follows a new graph.

> summary(WQReg) <ENTER>
some output....

> anova(WQReg) <ENTER>
some output....

> coefficients(WQReg) <ENTER>
(Intercept) Temp 
12.4151595 -0.3090217 

> residuals(WQReg) <ENTER>
                  1                   2                 3                   4                   5                   6 
-1.04125785 0.06603565 2.15663324 -0.84336676 0.39655709 -0.83050797 
and so on....until end

               55                56                  57                 58                 59 
1.38542649 3.94761155 -0.09749688 -0.63185544 -0.14298604 

> fitted.values(WQReg) <ENTER>
some output....

> predict(WQReg) <ENTER>
identical as above

> df.residual(WQReg) <ENTER>
[1] 57


With the plot comands: 

>plot(WQReg$residuals)<ENTER>

>plot(WQReg$fitted.values~WQ$Temp,type="l")<ENTER>

Do we get different graphs. Remark the '$' sign between the object and the part of object.


An interesting plot gives simultaneously the regression-line and the observed points.

>> plot(WQ$Oxygen~WQ$Temp,pch=4)<ENTER> this graph infact we made already, but this time we use "X" as a symbol for observations.

> abline(WQReg)<ENTER>  immediately after the previous commands draws on the previous graph also a regression line.

The same result could have been obtained by:

> lines(WQReg$fitted.value~WQ$Temp)
in stead of abline. Remark that this allows to plot several lines and points on the existing plot.


Questions to be answered:

Answers:

1) observational

2)

> WQReg$coefficients
(Intercept)                  Temp
12.9965476      -0.3798286

3)

> Estim<- WQReg$coefficients[1]+WQReg$coefficients[2]*7; Estim
(Intercept)
10.33775

4)

> anova(WQReg)
Analysis of Variance Table

Response: Oxygen
                  Df          Sum Sq    Mean Sq      F value           Pr(>F)
Temp            1          78.024       78.024      86.079     2.627e-11    ***
Residuals     38         34.444         0.906
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

MSE is Mean Sq for Residuals: 0.906

5)

 > Residual<-9.1-(WQReg$coefficients[1]+WQReg$coefficients[2]*8);Residual
(Intercept)
-0.8579191

6)

 > WQReg$coefficients[1]+WQReg$coefficients[2]*10.72
(Intercept)
8.924785

In principle the regression line goes through the average X and Y of the sample so:

> mean(Oxygen)
[1] 8.8925
 

Is also a good answer.


Remember a source of frustation is the undercase uppercase issue; (r i.p.v. R)

> plot(WQreg)<ENTER>
Error in plot(WQreg) : Object "WQreg" not found

If doubts we can ask which "objects" we have. 

> objects()<ENTER>
[1] "WQ" "WQReg"

Up

16 July 2002 by Guido Wyseure