SUMMARY: [R] Comparison of SAS & R/Splus
Paul, David A
paulda at BATTELLE.ORG
Fri Sep 5 01:06:02 CEST 2003
My thanks to Drs. Armstrong, Bates, Harrell, Liaw, Lumley,
Prager, Schwartz, and Mr. Wang for their replies. I have
pasted my original message and their replies below.
After viewing http://www.itl.nist.gov/div898/strd/ as suggested
by Dr. Schwartz, it occurred to me that it might be educational
to search for some data repositories on Google. I was able to find
some, though I'm sure many of the R listserv readers are already
aware of them:
http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLOther.html
http://www.ldeo.columbia.edu/datarep/
http://data.geocomm.com/
http://libraries.mit.edu/gis/data/repository.html
http://nssdc.gsfc.nasa.gov/
-david paul
-----Original Message-----
I am one of only 5 or 6 people in my organization making the
effort to include R/Splus as an analysis tool in everyday
work - the rest of my colleagues use SAS exclusively.
Today, one of them asserted that he believes the numerical
algorithms in SAS are superior to those in Splus and R -- i.e.,
optimization routines are faster in SAS, the SAS Institute has
teams of excellent numerical analysts who ensure its superiority
to anything freely available, PROC NLMIXED is more flexible than
nlme() in the sense that it allows a much wider array of error
structures than can be used in R/Splus, etc.
I obviously do not subscribe to these views and would like
to refute them, but I am not a numerical analyst and am still
a novice at R/Splus. Do there exist refereed papers comparing the
numerical capabilities of these platforms? If not, are there
other resources I might look up and pass along to my colleagues?
---------------------------
This link might give you some insight, but SAS is not one of the
packages benchmarked here.
http://www.sciviews.org/other/benchmark.htm
[Whit Armstrong]
---------------------------
I don't have papers comparing the numerical capabilities, but I say
bunk to your colleagues. The last time I looked, SAS still relied
on the out-of-date Gauss-Jordan sweep operator in many key places,
instead of the QR decomposition that R and S-Plus use in regression.
And because SAS is closed source, it is impossible to see how it
really does the calculations in some cases.
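As a small illustration of the QR point, here is a minimal R sketch
(simulated data, for illustration only): lm() fits via a QR
decomposition of the model matrix, and solving the least-squares
problem directly with qr() reproduces the same coefficients.

## simulate a small regression problem
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)       # lm() uses a QR decomposition internally
X <- model.matrix(fit)

coef(fit)              # coefficients from lm()
qr.coef(qr(X), y)      # same coefficients from an explicit QR solve
str(fit$qr)            # the QR object lm() keeps in the fitted model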
See http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf Section
1.6 for a comparison of S and SAS (though this doesn't address
numerical reliability). Overall, in my estimation, SAS is about 11
years behind R and S-Plus in statistical capabilities (last year it
was about 10 years behind).
Frank Harrell
SAS User, 1969-1991
---
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
---------------------------
Too bad your colleagues weren't at the "State of Statistical
Software" session at JSM. I was there. It was so packed that
people ran out of standing room. The three speakers were all R
advocates (Jan De Leeuw, Luke Tierney and Duncan Temple Lang).
The most interesting thing (to me) about the session was that the
discussant was a person from SAS (first name Wolfgang). I just
had to hear what he'd say.
The SAS person essentially said that the numerical accuracy of
R (probability functions, especially) is unmatched because the
routines were written by authority figures in the area. (That's
one advantage he said R has. He also said that the fact that the
code is open, so that even SAS is looking at the R source, is, to
him, a disadvantage. He obviously missed the point of open source.)
One of the criticisms he had of R, compared to SAS, was that R may
not have undergone extensive QA testing. He
said that SAS now probably has only a handful of PROC developers
(not exactly the "team" your colleague imagined), but 5-6 times
more software testers.
I think hearing it from the horse's mouth beats reading journal
articles for this sort of thing. There was a recent article in
The American Statistician bashing the numerical instability and
poor quality of the RNG in JMP (a SAS product). SAS posted a
"white paper" on their web site refuting some of those claims
(though they did change the RNG to the Mersenne Twister in JMP 5),
comparing JMP with Excel and SAS. I must say that comparison isn't
convincing, as neither Excel nor SAS can really be trusted as a
gold standard.
Andy [Liaw]
---------------------------
In follow up to Frank's reply, allow me to point you to some
additional papers and articles on numerical accuracy issues. I
have not reviewed these in some time and they may be a bit dated
relative to current versions. These do not cover R specifically,
but do address S-Plus and SAS. This is not an exhaustive list by
any means, but many of the papers do have other references that
may be of value.
1. http://www.stat.uni-muenchen.de/~knuesel/elv/accuracy.html
2. http://www.amstat.org/publications/tas/mccull-1.pdf
3. http://www.amstat.org/publications/tas/mccull.pdf
4. http://www.npl.co.uk/ssfm/download/documents/cmsc06_00.pdf
Another option: NIST has reference datasets available for
comparison at:
http://www.itl.nist.gov/div898/strd/
These would allow you to conduct your own comparisons if you desire.
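For example, a minimal sketch of such a comparison ("norris.dat" is a
placeholder for the StRD "Norris" linear regression data saved locally;
the certified coefficients and residual standard deviation to compare
against are listed on that page):

## fit one of the StRD linear regression problems and print the
## estimates to full precision for comparison with the certified values
dat <- read.table("norris.dat", col.names = c("y", "x"))  # response first
fit <- lm(y ~ x, data = dat)
print(coef(fit), digits = 15)           # compare to the certified B0, B1
print(summary(fit)$sigma, digits = 15)  # compare to the certified residual sd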
HTH,
Marc Schwartz
(Also a former SAS user)
---------------------------
I can't speak to the optimisation routines, but I have found this:
when I was doing my MSc thesis, using tree-based models and neural
networks for classification, I discovered something interesting.
The Tree node in SAS Enterprise Miner (SAS EM) is far more efficient
than the rpart package. Using the same (or at least very similar)
parameter settings, SAS EM can produce a tree in about 1 minute while
it would take rpart 5-6 minutes (same data, same machine). Having
said that, I still prefer rpart as it can draw a beautiful tree,
whereas it is very difficult to fit the graphical tree produced by
SAS EM onto one A4 page -- in the end I had to use the text tree.
However, the Neural Network node in SAS EM is less efficient than
nnet: fitting a neural network in R with nnet is much faster.
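For anyone who wants to try the R side of that comparison, here is a
minimal sketch (using the built-in iris data, since the thesis data are
not available, and with arbitrary tuning settings):

## classification tree and neural network in R
library(rpart)
library(nnet)

tree.fit <- rpart(Species ~ ., data = iris, method = "class")
plot(tree.fit)   # the tree that fits nicely on one page
text(tree.fit)

system.time(     # time a single-hidden-layer network fit
  nnet.fit <- nnet(Species ~ ., data = iris, size = 4,
                   decay = 5e-4, maxit = 200, trace = FALSE)
)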
Cheers,
Kevin [Wang]
---------------------------
I suspect it will be difficult to find the answer to your colleagues'
assertions without doing your own studies. How important is it to you
to settle this disagreement?
One could always name the many leading statisticians who contribute to
R, but I don't think that name-dropping settles anything.
Nonetheless, even if SAS were faster, that would be only part of the
issue. As you know, R offers vastly better exploratory graphics, better
graphics overall, far more flexible programming, user extensibility, and
more natural programming access to the results of previous
computations. So even if your colleagues were right in their
assertions, they would be overlooking many capabilities of the S
language that are not readily available in SAS.
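To make the point about programming access concrete, a tiny sketch
(using the built-in trees data): every result in the S language is an
object whose pieces can be pulled out and reused in later computations.

## a fitted model is just an object; its pieces feed later computations
fit <- lm(Volume ~ Girth + Height, data = trees)
coef(fit)                        # reuse the coefficients...
r2 <- summary(fit)$r.squared     # ...or any summary quantity
plot(fitted(fit), resid(fit))    # ...or diagnostics, directly
fit2 <- update(fit, . ~ . - Height)
anova(fit2, fit)                 # compare nested fits programmatically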
IMO, SAS shines in its ability to read files in almost any format, to
handle gigantic data sets without burping, and to produce formatted
cross-tabulations and other highly structured text reports. However, if
your colleagues work at all in data exploration, they are ignoring
important tools by not exploring R or S-Plus.
Michael Prager, Ph.D.
NOAA Center for Coastal Fisheries and Habitat Research Beaufort,
North Carolina 28516 http://shrimp.ccfhrb.noaa.gov/~mprager/
DISCLAIMER: Opinions expressed are personal, not official. No
government endorsement of any commercial product is made or implied.
---------------------------
Although they are out of date, there are some comparisons of accuracy in
McCullough, B. D. (1998), "Assessing the reliability of statistical
software: Part I", The American Statistician, 52, 358-366.
McCullough, B. D. (1999), "Assessing the reliability of statistical
software: Part II", The American Statistician, 53, 149-159.
Regarding PROC NLMIXED versus nlme, there are a lot of differences
between them. I don't think that PROC NLMIXED will handle nested
random effects, while nlme does. However, nlme assumes the underlying
noise is Gaussian, while PROC NLMIXED allows Gaussian, binomial, or
Poisson responses. PROC NLMIXED uses adaptive Gaussian quadrature to
evaluate the marginal log-likelihood, whereas nlme uses a less accurate
evaluation but better parameterizations of the variance of the random
effects. I think
it would be difficult to declare one to be superior to the other.
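As a small illustration of the nested-random-effects point, a sketch
using the Oats data that ships with the nlme package, with the linear
lme() for brevity (nlme() itself accepts a similar multilevel
specification via a named list of random-effects formulas):

## random intercepts for Block, and for Variety nested within Block
library(nlme)
fit <- lme(yield ~ nitro + Variety, data = Oats,
           random = ~ 1 | Block/Variety)
summary(fit)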
[Douglas Bates]
---------------------------
While I don't subscribe to the general claim, they have a point about
PROC NLMIXED. It does more accurate calculations for generalised linear
mixed models than are currently available in R/S-PLUS, and for logistic
random-effects models the difference can sometimes be large enough to
matter.
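For reference, a common route to such fits in R at the time is glmmPQL
in the MASS package (shown here with its bacteria example data); it uses
a penalized quasi-likelihood approximation, which is where the accuracy
difference mentioned above can matter.

## logistic random-intercept model via penalized quasi-likelihood
library(MASS)
fit <- glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,
               family = binomial, data = bacteria)
summary(fit)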
-Thomas [Lumley]