Up

Data-storage for statistics

Statistical consultancy is only meaningful if your data comply with this page!

Introduction

Important: enter and structure the data, so that the step to the required data-storage is straightforward and easy.

A little bit of planning and a few principles are required.  Please read on.

It is common practice to enter data in a spreadsheet (Excel, Openoffice, Quattro, Lotus ). A spreadsheet is extremely flexible and is suitable to perform some initial screening and exploration of the data. Danger is that data-entry in spreadsheet can be unplanned, unstructured, if not completely chaotic. Too often the data are entered in a collection of separate tables and in an organization, which leads to a lot of unnecessary work when starting a statistical analysis. The poor structure is often mainly inspired by the final presentation in a text and not oriented to analysis.

In addition it is common practice to use statistics like "painting the new house" (= last and not so crucial job by the contractor before handing over the new house to the contracting person). Statistics just like painting requires a lot more respect then an attempt to hide some of the construction flaws (i.e. design flaws in the data-collection).

The major mistake is that tables are made in a format close to that  for the final report (dissertation, thesis etc) and not in a data-structure for statistical analysis.

Good statistics requires two important planning steps:

  1. planning in the data collection setup (design of experiments, design of observational studies)
  2. planning the data-entry phase (see also discplined use of spreadsheet for data entry as part of DFID-programme at  Reading university).

A shortcoming in the first step cannot be corrected without new data-collection. Retrospective statistics is then called in for window-dressing A shortcoming in the second will lead to excessive work and possibly a wrong analysis. Typical mistake is the pair-wise comparison with two-sample t-test. An ANOVA-analysis followed (if apropiate) by a multiple comparisons procedure is a lot quicker and more correct. Another and similar mistake is to execute several linear regressions and limit the analysis to mentioning the regression coefficients and the coefficient of determination and discussing in a qualitative way the graphs. An ANCOVA would be quicker and can decide whether slopes/intercepts are significantly different or not.

Sometimes this poor structure of a data-set is used as a (very poor) excuse not to execute the proper statistical procedure!!! Do not waste the time of a statistical advisor and your own time with such claims; this is an indication of  sloppiness and lack of professional self-respect.

Application for statistical software in general and R in particular

Most software have rather rigid data-storage requirement. SAS, as an example, uses a SAS-dataset. The major reason is to ensure that the data are properly passed on to the apropiate statistical procedure without misinterpretation.

In most analysis by R an object of the type "data.frame"  is used. A lot of procedures require a data.frame in their argument list. See data-structures for a more complete information of different data-structures.  Similar is  that the result of an analysis by R can often stored in a list.

R Object Main use Several modes allowed in one object
data.frame dataset for analysis numeric, character, complex or logical
list results; information from analysis numeric, character, complex, logical, function, expresion and  formula

In the data-frame vectors of equal length are present. The vectors can contain different "modes". A factor (with qualitative levels) is used in ANOVA and ANCOVA and a vector with numeric (quantitative) values are used in Regression. For some practical reasons factor levels can be represented by numbers (1, 2, 3 etc). With most statistical computer programmes one has to be extremely careful to force the analysis to use the level numbers as qualitative levels.

A software cannot guess your intention when a numeric value should be used as a qualitative level. Example: the color of a package is indicated by numbers: 1, 2, 3 etc... and is initially a numeric vector by reading in numbers into the data.frame;

ColorPackage = factor (ColorPackage)

avoids any confusion for R. ColorPackage is forced to be a factor. In a factor the numbers

Principles for data-entry

Much better and efficient is to prepare a simple data-base.  The tool in not important. It can be done within a spreadsheet (or for a large dataset within a specific database program or even a text-file for a small dataset...). The principles of the data-structure are in fact independent of the software!!!

This database should respect some very simple but crucial and essential rules:

  1. all the data are stored in one table only (avoid several separate tables);
  2. every observation is a record (every replication has its own separate record);
  3. every record has its own line (row) with fields for every different variable;
  4. every different variable has is own specific field;
  5. a specific field is always in the same column of field number;
  6. do not NEVER use any empty columns or rows for readability but start your database in the top left (cell A1 for a spreadsheet).
  7.  if for part of the dataset a variable is not observed; then in that field the string for missing data is filled in ( e.g. for R "NA" by default);

As a result a very simple rectangular data-structure is prepared. With for every record the same number of fields and in the same order. Such data-structure is easy to enter into any statistical program.

Less crucial (software like R allows a high flexibility by overwriting defaults) rules:

  1. the first line (row)  in the spreadsheet contains the titles of the variables
  2. titles of variables should be legal character-strings in the software you plan to use (e.g. for most software no spaces in a title)
  3. in some software  you could use some comment lines with the units of the variables, relevant short information and a general title on top of your dataset ( like in R lines starting by #); make sure the software you plan to use can handle this lines (often just by skipping )

Note: Avoid Excel for ANOVA analysis; it only works correctly for balanced and does not warn you if your data are unbalanced!


Up

10 May 2005 by Guido Wyseure