[R] Reasons to Use R

Wensui Liu liuwensui at gmail.com
Tue Apr 10 23:25:54 CEST 2007


Greg,
As far as I understand, SAS probably handles large data more
efficiently than S+/R. Do you have any idea why?

On 4/10/07, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> > Bi-Info (http://members.home.nl/bi-info)
> > Sent: Monday, April 09, 2007 4:23 PM
> > To: Gabor Grothendieck
> > Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> > Subject: Re: [R] Reasons to Use R
>
> [snip]
>
> > So what's the big deal about S using files where R uses
> > memory? I don't get the point. Isn't there enough swap space
> > for S? (Who cares anyway: it works, doesn't it?) Or are there
> > any problems with S and large datasets? I don't get it. You use
> > them, Greg. So you might discuss that issue.
> >
> > Wilfred
> >
> >
>
> This is my understanding of the issue (not anything official).
>
> If you use up all the memory while in R, then the OS will start swapping
> memory to disk.  The OS does not know which parts of memory correspond
> to which objects, so it is entirely possible that a chunk swapped to
> disk contains parts of several different data objects.  When you need
> one of those objects again, everything needs to be swapped back in.
> This is very inefficient.
>
> S-PLUS occasionally runs into the same problem, but since it does some
> of its own swapping to disk it can be more efficient by swapping single
> data objects (data frames, etc.).  Also, since S-PLUS is already saving
> everything to disk, it does not actually need to do a full swap; it can
> just see that a particular data frame has not been used for a
> while, know that it is already saved on the disk, and unload it from
> memory without having to write it to disk first.
>
> The g.data package for R has some of this functionality of keeping data
> on the disk until needed.
>
> The better approach for large data sets is to only have some of the data
> in memory at a time and to automatically read just the parts that you
> need.  So for big datasets it is recommended to have the actual data
> stored in a database and use one of the database connection packages to
> only read in the subset that you need.  The SQLiteDF package is
> working on automating this process for R.  The bigdata module for
> S-PLUS and the biglm package for R also have ways of doing some of
> the common analyses using chunks of data at a time.  This idea is not
> new.  There was a program in the late 1970s and 80s called Rummage by
> Del Scott (I guess technically it still exists, I have a copy on a 5.25"
> floppy somewhere) that used the approach of specifying the model you
> wanted to fit first, then the data file.  Rummage would then figure out
> which sufficient statistics were needed and read the data in chunks,
> compute the sufficient statistics on the fly, and not keep more than a
> couple of lines of the data in memory at once.  Unfortunately it did not
> have much of a user interface, so when memory was cheap and datasets
> were only medium sized it did not compete well.  I guess it was just a
> bit too far ahead of its time.
>
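
The chunked, sufficient-statistics idea described above can be sketched
roughly as follows. This is only an illustration under stated assumptions,
not how Rummage, biglm, or SQLiteDF actually work: it is written in Python
rather than R, it fits only a simple straight line, and the table name
"obs", the columns "x" and "y", and both function names are hypothetical.
The data sit in an SQLite database and are read a chunk at a time, so only
the running sums needed for the fit stay in memory.

```python
import sqlite3


def make_demo_db(n=1000):
    """Build an in-memory SQLite table (hypothetical name 'obs') with y = 2 + 3x."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO obs VALUES (?, ?)",
        ((float(i), 2.0 + 3.0 * i) for i in range(n)),
    )
    conn.commit()
    return conn


def fit_line_in_chunks(conn, chunk_size=100):
    """Fit y = a + b*x by least squares, holding at most chunk_size rows at once."""
    # Sufficient statistics for simple linear regression.
    n = sx = sy = sxx = sxy = 0.0
    cur = conn.execute("SELECT x, y FROM obs")
    while True:
        rows = cur.fetchmany(chunk_size)  # read the next chunk only
        if not rows:
            break
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    # Solve the normal equations from the accumulated sums.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


conn = make_demo_db()
intercept, slope = fit_line_in_chunks(conn)
print(intercept, slope)
```

The same pattern extends to multiple regression by accumulating X'X and
X'y per chunk instead of the five scalar sums, though the raw sums used
here can lose precision on badly scaled data.
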
> Hope this helps,
>
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at intermountainmail.org
> (801) 408-8111
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)


