|
Statistical consultancy is only meaningful if your data comply with this page!
| Important: enter and structure the data,
so that the step to the required data-storage
is straightforward and easy. A little bit of planning and a few principles are required. Please read on. |
It is common practice to enter data in a spreadsheet (Excel, Openoffice, Quattro, Lotus ). A spreadsheet is extremely flexible and is suitable to perform some initial screening and exploration of the data. Danger is that data-entry in spreadsheet can be unplanned, unstructured, if not completely chaotic. Too often the data are entered in a collection of separate tables and in an organization, which leads to a lot of unnecessary work when starting a statistical analysis. The poor structure is often mainly inspired by the final presentation in a text and not oriented to analysis.
In addition it is common practice to use statistics like "painting the new house" (= last and not so crucial job by the contractor before handing over the new house to the contracting person). Statistics just like painting requires a lot more respect then an attempt to hide some of the construction flaws (i.e. design flaws in the data-collection).
The major mistake is that tables are made in a format close to that for the final report (dissertation, thesis etc) and not in a data-structure for statistical analysis.
Good statistics requires two important planning steps:
A shortcoming in the first step cannot be corrected without new data-collection. Retrospective statistics is then called in for window-dressing A shortcoming in the second will lead to excessive work and possibly a wrong analysis. Typical mistake is the pair-wise comparison with two-sample t-test. An ANOVA-analysis followed (if apropiate) by a multiple comparisons procedure is a lot quicker and more correct. Another and similar mistake is to execute several linear regressions and limit the analysis to mentioning the regression coefficients and the coefficient of determination and discussing in a qualitative way the graphs. An ANCOVA would be quicker and can decide whether slopes/intercepts are significantly different or not.
Sometimes this poor structure of a data-set is used as a (very poor) excuse not to execute the proper statistical procedure!!! Do not waste the time of a statistical advisor and your own time with such claims; this is an indication of sloppiness and lack of professional self-respect.
Most software have rather rigid data-storage requirement. SAS, as an example, uses a SAS-dataset. The major reason is to ensure that the data are properly passed on to the apropiate statistical procedure without misinterpretation.
In most analysis by R an object of the type "data.frame" is used. A lot of procedures require a data.frame in their argument list. See data-structures for a more complete information of different data-structures. Similar is that the result of an analysis by R can often stored in a list.
| R Object | Main use | Several modes allowed in one object |
| data.frame | dataset for analysis | numeric, character, complex or logical |
| list | results; information from analysis | numeric, character, complex, logical, function, expresion and formula |
In the data-frame vectors of equal length are present. The vectors can contain different "modes". A factor (with qualitative levels) is used in ANOVA and ANCOVA and a vector with numeric (quantitative) values are used in Regression. For some practical reasons factor levels can be represented by numbers (1, 2, 3 etc). With most statistical computer programmes one has to be extremely careful to force the analysis to use the level numbers as qualitative levels.
A software cannot guess your intention when a numeric value should be used as a qualitative level. Example: the color of a package is indicated by numbers: 1, 2, 3 etc... and is initially a numeric vector by reading in numbers into the data.frame;
ColorPackage = factor (ColorPackage)
avoids any confusion for R. ColorPackage is forced to be a factor. In a factor the numbers
Much better and efficient is to prepare a simple data-base. The tool in not important. It can be done within a spreadsheet (or for a large dataset within a specific database program or even a text-file for a small dataset...). The principles of the data-structure are in fact independent of the software!!!
This database should respect some very simple but crucial and essential rules:
As a result a very simple rectangular data-structure is prepared. With for every record the same number of fields and in the same order. Such data-structure is easy to enter into any statistical program.
Less crucial (software like R allows a high flexibility by overwriting defaults) rules:
Note: Avoid Excel for ANOVA analysis; it only works correctly for balanced and does not warn you if your data are unbalanced!
| Up |
10 May 2005 by Guido Wyseure