
Formulas in R

  1. Do not forget that R is case-sensitive: "mean", "Mean" and "mEan" are three different names!

  2. R treats the "." in a name as an ordinary character, e.g. "Mean.of.X"

  3. Formulas can be used in function calls or as part of a function definition
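
Points 1 and 2 can be tried out with a few lines like the following. This is only a sketch; the object names total, Total and Mean.of.X are made up for the illustration.

```r
# Two names that differ only in case refer to two different objects.
total <- 10
Total <- 20
total == Total            # FALSE: 'total' and 'Total' are unrelated objects

# The dot is an ordinary character inside a name.
Mean.of.X <- mean(c(2, 4, 6))
Mean.of.X                 # 4
```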

Syntax of formulas*

The formula specifies the model for functions such as "lm". A formula describes how a response variable, such as "y", is modelled.

The response stands to the left of the ~ operator and the model is specified to the right. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.

A model consists of a series of terms separated by + operators.
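
As an illustration, a formula can be taken apart in R to see the response and the terms. This is a minimal sketch: y, a and b are hypothetical variable names and do not have to exist, because terms() only inspects the formula and does not evaluate the variables.

```r
f <- y ~ a + b

class(f)      # "formula"
f[[2]]        # left-hand side (the response): y
f[[3]]        # right-hand side (the model): a + b

# The terms of the model, i.e. the pieces separated by +
labels_f <- attr(terms(f), "term.labels")
labels_f      # "a" "b"
```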

To avoid confusion, the function I() is used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula y ~ a + I(b+c), the term b+c is interpreted as one single variable, the sum of b and c, so this is a model with only two variables: a and I(b+c). It is not a model with three variables a, b and c. The model with three variables would be y ~ a + b + c.

* If this text is read for ANOVA, see the ANOVA footnote at the end; it is less relevant for regression.
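
The difference between the two formulas can be checked by listing their terms. The sketch below assumes a small made-up data frame d (its values are arbitrary) to show that the I(b + c) column really is the arithmetic sum.

```r
# Terms of the two formulas; the variables need not exist for terms()
two_terms   <- attr(terms(y ~ a + I(b + c)), "term.labels")
three_terms <- attr(terms(y ~ a + b + c), "term.labels")
two_terms     # "a" "I(b + c)"
three_terms   # "a" "b" "c"

# Made-up data to inspect the column that I(b + c) produces
d  <- data.frame(a = 1:4, b = c(2, 0, 1, 3), c = c(1, 1, 5, 2))
mm <- model.matrix(~ a + I(b + c), data = d)
colnames(mm)  # "(Intercept)" "a" "I(b + c)"
sum_column <- unname(mm[, "I(b + c)"])
sum_column    # 3 1 6 5, the same as d$b + d$c
```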

Additional features important for regression:

The - operator removes terms and is used to remove the intercept term: y~x - 1 is a line through the origin.
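
A minimal sketch of the effect, using made-up x and y values (not data from this course): the fit without -1 has an intercept and a slope, the fit with -1 has a slope only. Writing 0 + x instead of x - 1 is an equivalent way to drop the intercept.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)        # made-up values

with_intercept <- lm(y ~ x)
through_origin <- lm(y ~ x - 1)

names(coef(with_intercept))             # "(Intercept)" "x"
names(coef(through_origin))             # "x"
attr(terms(y ~ x - 1), "intercept")     # 0: no intercept in the model

# y ~ 0 + x specifies the same model
same_fit <- all.equal(coef(lm(y ~ 0 + x)), coef(through_origin))
same_fit                                # TRUE
```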

While formulae usually involve just variable and factor names, they can also involve arithmetic expressions. The formula log(y) ~ a + log(x) is quite legal. When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use. This confusion is avoided by the I( ) function, as in the example of a fourth-degree polynomial. This model can be specified as y ~ x + I(x^2) + I(x^3) + I(x^4).
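
What goes wrong without I() can be seen by listing the terms: in a formula, x^2 is read symbolically and collapses to x. The sketch below uses a second-degree polynomial and made-up data constructed exactly as y = 1 + 2x + 3x^2, so the fitted coefficients can be recognised.

```r
without_I <- attr(terms(y ~ x + x^2), "term.labels")
with_I    <- attr(terms(y ~ x + I(x^2)), "term.labels")
without_I    # "x"            (the square has disappeared)
with_I       # "x" "I(x^2)"

# Made-up data lying exactly on a parabola
x <- 1:6
y <- 1 + 2 * x + 3 * x^2
quad <- lm(y ~ x + I(x^2))
round(coef(quad), 6)   # intercept 1, x 2, I(x^2) 3
```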

Formulas as an argument to a function

R has a specific function, "lm", for the statistical linear model. It can fit both regression and ANOVA models.

  1. Regression models use quantitative variables in the model

  2. ANOVA models use qualitative factors

  3. ANCOVA models use both quantitative variables and qualitative factors (it is important to define the qualitative variables as factors)

The function factor is used to encode a vector as a factor (the names category and enumerated type are also used for factors). The function is.factor can be used to test whether an object is a factor.
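
A short sketch of factor and is.factor in an ANCOVA-type model. The data frame d, its column names and its values are made up for the illustration.

```r
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),                 # quantitative variable
  g = c("A", "A", "A", "B", "B", "B"),     # qualitative variable, still text
  y = c(1.2, 1.9, 3.1, 5.0, 6.1, 6.8),
  stringsAsFactors = FALSE
)

is.factor(d$g)        # FALSE: a character vector
d$g <- factor(d$g)    # encode it as a factor
is.factor(d$g)        # TRUE
levels(d$g)           # "A" "B"

# Quantitative x and qualitative g in one model
ancova <- lm(y ~ x + g, data = d)
names(coef(ancova))   # "(Intercept)" "x" "gB"
```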

> help(lm)

A window appears with help; part is shown below:

some text before.....

Usage:

          lm(formula, data, subset, weights, na.action,
          method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
          singular.ok = TRUE, contrasts = NULL, offset = NULL, ...)

Arguments:

          formula: a symbolic description of the model to be fit. The details of
          model specification are given below.

         data: an optional data frame containing the variables in the model.
         By default the variables are taken from
        `environment(formula)', typically the environment from which
         `lm' is called.
 

and more.....
 

 

For most applications the argument list contains the formula and the data.frame.
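
A typical call therefore looks like the sketch below. It uses cars, a data set that ships with R (columns speed and dist), instead of the course data, so that it runs without any download. The subset argument from the usage above is shown as an optional extra.

```r
# Formula and data frame only
fit_all <- lm(dist ~ speed, data = cars)
coef(fit_all)

# Optional: restrict the fit to part of the data
fit_fast <- lm(dist ~ speed, data = cars, subset = speed > 10)

nobs(fit_all)    # 50, all rows of cars
nobs(fit_fast)   # only the rows with speed > 10
```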

Defining an object which is a formula

The formula itself can be treated as an object. We can use the same data as in the first regression problem: CH01TA01.txt. Save it somewhere, for example in "C:\temp\", and read it into a data.frame:

> Toluca<-read.table("c://temp//Ch01Ta01.txt")

or read it directly from the web page:

> Toluca<-read.table("http://www.biw.kuleuven.be//vakken//statisticsbyR//datasetsTXT//CH01TA01.txt")

This data.frame can be used in R:

> names(Toluca)<-c("x","y")

Now we define an object which holds the simple linear regression formula and give it the name "SLR":

> SLR<-y~x

The object SLR can now be used wherever this formula is needed:

> plot(SLR,data=Toluca)
> SLRresult<-lm(SLR,data=Toluca)
> SLRresult


Call:
lm(formula = SLR, data = Toluca)

Coefficients:
          (Intercept)        x
                62.37    3.57


The formula is stored among the other elements of the list SLRresult. The call records the name SLR, and SLR itself holds the formula:

> SLRresult$call$formula
SLR

> SLR
y ~ x
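
The formula can also be recovered from the fitted object itself with formula(). The sketch below repeats the idea with the built-in cars data set so that it runs without the course file; the names f and fit are chosen for the illustration only.

```r
f   <- dist ~ speed
fit <- lm(f, data = cars)

fit$call$formula      # f: the name that was used in the call
formula(fit)          # dist ~ speed: the formula itself
class(f)              # "formula"
all.vars(f)           # "dist" "speed"
```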
 

A very strong point of R is that everything is treated as an object and can be stored, recalled and used as such. Initially this does not make life easier for a new user, but when it comes to comparing models (stored in objects) it quickly offers an advantage.
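
As a sketch of that advantage, several formulas can be kept in a list, fitted in one go and then compared. The example assumes the built-in cars data set and two candidate models chosen only for illustration.

```r
# Candidate models stored as formula objects
models <- list(
  linear    = dist ~ speed,
  quadratic = dist ~ speed + I(speed^2)
)

# Fit every formula to the same data
fits <- lapply(models, function(f) lm(f, data = cars))

# Compare the stored fits
sapply(fits, AIC)
anova(fits$linear, fits$quadratic)
```

Adding a further candidate model only requires one more entry in the list.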

== FOOTNOTE ANOVA (not for regression): more complex operators, useful for advanced ANOVA, ANCOVA, etc. ==

In addition to + , a number of other operators are useful in model formulae:

  1. The  : operator is interpreted as the interaction of all the variables and factors appearing in this term. 
  2. The * operator denotes factor crossing: a*b is interpreted as a+b+a:b.
  3. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions.
  4. The %in% operator indicates that the terms on its left are nested within those on the right. For example a+b%in%a expands to the formula a+a:b.
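
The expansions of the first three operators can be checked by asking R for the terms of a formula. In this sketch expand is a small helper defined here for convenience, and a, b, c and y are hypothetical names that need not exist.

```r
# Helper (defined here, not part of R): list the terms of a formula
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ a:b)             # "a:b"
expand(y ~ a * b)           # "a" "b" "a:b"
expand(y ~ (a + b + c)^2)   # "a" "b" "c" "a:b" "a:c" "b:c"
```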

===========not for regression !!!==================================================================



10 May 2005 by Guido Wyseure