[R] Reasons to Use R

Gabor Grothendieck ggrothendieck at gmail.com
Wed Apr 11 01:05:41 CEST 2007


I think SAS was developed at a time when computer memory was
much smaller than it is now and the legacy of that is its better
usage of computer resources.

On 4/10/07, Wensui Liu <liuwensui at gmail.com> wrote:
> Greg,
> As far as I understand, SAS is probably more efficient at handling large
> data than S+/R. Do you have any idea why?
>
> On 4/10/07, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > > -----Original Message-----
> > > From: r-help-bounces at stat.math.ethz.ch
> > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> > > Bi-Info (http://members.home.nl/bi-info)
> > > Sent: Monday, April 09, 2007 4:23 PM
> > > To: Gabor Grothendieck
> > > Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> > > Subject: Re: [R] Reasons to Use R
> >
> > [snip]
> >
> > > So what's the big deal about S using files instead of memory
> > > like R? I don't get the point. Isn't there enough swap space
> > > for S? (Who cares
> > > anyway: it works, doesn't it?) Or are there any problems with S
> > > and large datasets? I don't get it. You use them, Greg. So
> > > you might discuss that issue.
> > >
> > > Wilfred
> > >
> > >
> >
> > This is my understanding of the issue (not anything official).
> >
> > If you use up all the memory while in R, then the OS will start swapping
> > memory to disk, but the OS does not know what parts of memory correspond
> > to which objects, so it is entirely possible that the chunk swapped to
> > disk contains parts of different data objects, so when you need one of
> > those objects again, everything needs to be swapped back in.  This is
> > very inefficient.
> >
> > S-PLUS occasionally runs into the same problem, but since it does some
> > of its own swapping to disk it can be more efficient by swapping single
> > data objects (data frames, etc.).  Also, since S-PLUS is already saving
> > everything to disk, it does not actually need to do a full swap, it can
> > just look and see that a particular data frame has not been used for a
> > while, know that it is already saved on the disk, and unload it from
> > memory without having to write it to disk first.
> >
> > The g.data package for R has some of this functionality of keeping data
> > on the disk until needed.
> >
> > The better approach for large data sets is to only have some of the data
> > in memory at a time and to automatically read just the parts that you
> > need.  So for big datasets it is recommended to have the actual data
> > stored in a database and use one of the database connection packages to
> > only read in the subset that you need.  The SQLiteDF package for R is
> > working on automating this process for R.  The bigdata module for
> > S-PLUS and the biglm package for R also have ways of doing some of
> > the common analyses using chunks of data at a time.  This idea is not
> > new.  There was a program in the late 1970s and 80s called Rummage by
> > Del Scott (I guess technically it still exists, I have a copy on a 5.25"
> > floppy somewhere) that used the approach of specifying the model you
> > wanted to fit first, then specifying the data file.  Rummage would then
> > figure out
> > which sufficient statistics were needed and read the data in chunks,
> > compute the sufficient statistics on the fly, and not keep more than a
> > couple of lines of the data in memory at once.  Unfortunately it did not
> > have much of a user interface, so when memory was cheap and datasets
> > only medium sized it did not compete well.  I guess it was just a bit
> > too far ahead of its time.
> >
> > Hope this helps,
> >
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at intermountainmail.org
> > (801) 408-8111
> >
> >
>
>
> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
>
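
Greg's suggestion above, to keep the full data in a database and read in only the subset you need, can be illustrated with a small sketch. It is written in Python with the standard sqlite3 module purely for illustration; it is not taken from any of the packages mentioned in the thread. The table name `visits`, its columns, and the helper `load_subset` are all hypothetical, and an in-memory SQLite database stands in for a real on-disk one.

```python
import sqlite3

# Hypothetical stand-in for a large on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (patient_id INTEGER, year INTEGER, cost REAL)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 2006, 120.0), (2, 2005, 300.0), (3, 2006, 75.5), (4, 2007, 50.0)],
)
conn.commit()


def load_subset(connection, year):
    """Fetch only the rows and columns needed for one analysis."""
    cursor = connection.execute(
        "SELECT patient_id, cost FROM visits WHERE year = ?", (year,)
    )
    return cursor.fetchall()


# Only the 2006 rows are ever brought into memory.
subset = load_subset(conn, 2006)
print(subset)
```

The filtering happens inside the database engine, so the analysis session holds only the selected rows rather than the whole table.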



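The chunked approach Greg describes for Rummage (work out which sufficient statistics the model needs, then accumulate them while reading the data a few lines at a time) might look roughly like the sketch below. It is a Python illustration under stated assumptions, not a reconstruction of Rummage or of biglm: the model is a simple straight-line fit, and the function names, column names `x` and `y`, and the sample data are all hypothetical.

```python
import csv
import io
from itertools import islice


def read_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from an iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def fit_line_in_chunks(csv_file, chunk_size=2):
    """Fit y = intercept + slope * x, holding one chunk in memory at a time.

    The sufficient statistics for this model are n, sum(x), sum(y),
    sum(x*x) and sum(x*y); nothing else from the data is retained.
    """
    n = sx = sy = sxx = sxy = 0.0
    reader = csv.DictReader(csv_file)
    for chunk in read_chunks(reader, chunk_size):
        for row in chunk:
            x, y = float(row["x"]), float(row["y"])
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
data = io.StringIO("x,y\n0,1\n1,3\n2,5\n3,7\n4,9\n")
intercept, slope = fit_line_in_chunks(data, chunk_size=2)
print(intercept, slope)
```

Memory use depends on the chunk size and the number of statistics, not on the number of rows, which is why the idea scales to files much larger than RAM.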