[R] SAS or R software

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Sun Dec 19 13:34:37 CET 2004


On 19-Dec-04 Tim Churches wrote:
> Shawn Way wrote:
> 
>>I've seen multiple comments about MS Excel's precision and accuracy.
>>Can you please point me in the right direction in locating information
>>about these?
>>  
>>
> As always, Google is your friend, but see for example 
> http://www.nwpho.org.uk/sadb/Poisson%20CI%20in%20spreadsheets.pdf

There is a huge literature on this topic, some of it published
in journals, much of it floating around on the web and in the
archives of mailing lists.

Tim's reference above is interesting, but only one example.
The McCullough and Wilson reference given there, though now
somewhat dated, identifies many of the classic problems.
Googling on

  mccullough wilson excel

will throw up a host of followups.

Informed statistical comment on the problems of Excel encountered
by serious users can be found by browsing in the mailing list
ASSUME (Association of Statistics Specialists Using Microsoft Excel):

  http://www.jiscmail.ac.uk/lists/assume.html

The most recent serious issue reported there is the RAND() bug:
see ASSUME archives for Dec 2003 followed up in the March 2004
archives. The latter point to a statement from Microsoft:

  SYMPTOMS
  When you use the RAND function in Microsoft Office Excel 2003,
  the RAND function may return negative numbers.
  CAUSE
  This problem may occur when you try to use many random numbers,
  and you update the RAND function multiple times. For example,
  this problem may occur when you update your Excel worksheet by
  pressing F9 ten times or more.
  RESOLUTION
  This problem is fixed in the Microsoft Excel 2003 Hotfix Package
  that is dated February 29, 2004. 

  http://support.microsoft.com/default.aspx?scid=kb;en-us;834520

  (and the deeper you probe in this, the worse it gets).


While using Excel for statistics has some limited value in the
context of initiating to statistics students whose IT experience
is limited to exposure to courses on Excel and Word, and the
teacher wants to build on such experience, I think that Excel
should never be used for serious statistical work, for several
reasons.

1. The many reported (and some allegedly fixed) bugs in
   calculation and algorithms necessarily provoke suspicion
   that others still exist or may have been introduced. One
   simply cannot trust the results without checking.

2. Too many things can be done silently and invisibly, "behind
   the spreadsheet", by Excel. Changes to data and differences
   between what is shown on the spreadsheet and what goes into
   exported files can arise without the user being aware of them.

   A particularly frightening example is the "sort" disaster
   reported to ASSUME (8 Dec 2003) by Allan Reese.

3. Excel has some value as a straightforward data entry pad.
   However, I have seen far too many cases where sloppy usage
   has led to the resulting spreadsheet containing "information"
   which is either superfluous or wrong, in ways which would
   not be obvious to the user.

   For example, a "missing data" cell, if blank, may be interpreted
   as having value zero. Some people enter "." for missing data,
   but often are not consistent. If inadvertently a space is
   entered in a cell outside the intended row/column range of the
   data (or, I suspect, even if the spreadsheet cursor wanders
   outside the range) then when the sheet is exported (e.g. as
   "CSV") these extra rows and columns will be included.

   In one case I received an Excel spredsheet with hundreds of
   such extra rows and dozens of extra columns, together with
   dozens of cases where " " and "." had been used inconsistently,
   all this over and over on each of about 6 "worksheet" pages;
   not to mention data in the wrong columns etc. It took about 4
   days of continuous work to clean this up.

   To be frank, for entering complex data the discipline enforced
   by a properly designed Data Entry Form in a database package
   would avoid such problems altogether, and such should be used.
   The illusion of success that Excel gives the user is a most
   treacherous danger and frankly I simply do not, in the first
   instance, trust data in a spreadsheet.

4. The use of formulae in cells to generate cell values can
   cause all sorts of problems. One to especially watch out for
   is that a formula may have been wrongly or inappropriately
   "copied" from one column to another or from one worksheet
   to another. You can of course check this by moving the cell
   cursor to such cells and noting what the formula is, but as
   you can imagine this is a horribly uhpleasant process (and
   by the way take care that you don't inadvertently alter it
   while you're doing this!). Also see Allan Reese's "sort"
   disaster above, which was formula-induced.

I could go on. I've written at length already because many
readers of R-help may be in situations where they necessarily
receive data in Excel files, or have to use Excel themselves,
and may not yet have become aware of the risks. So I'm writing
as a warning to them: Don't trust Excel, but if you must use it
then check everything, make sure it's what it should be, and
make sure that it stays that way when the spreadsheet is
accessed (as in (3) or (4) above, things may change invisibly).

Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 19-Dec-04                                       Time: 12:34:37
------------------------------ XFMail ------------------------------




More information about the R-help mailing list