[Rd] Vector binding on harddisk

Henrik Bengtsson hb at stat.berkeley.edu
Thu Feb 14 17:57:32 CET 2008


On Thu, Feb 14, 2008 at 8:18 AM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
> On 2/14/2008 10:54 AM, Henrik Bengtsson wrote:
>  > On Thu, Feb 14, 2008 at 7:23 AM, Jeffrey Horner
>  > <jeff.horner at vanderbilt.edu> wrote:
>  >> Paul Gilbert wrote on 02/14/2008 09:14 AM:
>  >>
>  >> >
>  >>  > _ wrote:
>  >>  >> Hi all,
>  >>  >> Using big vectors (more than 4 GB) is unfortunately not possible under
>  >>  >> Windows or other OSes when there is not enough RAM.
>  >>  >
>  >>  > This is NOT true. It is not limited by RAM, but rather by RAM and swap
>  >>  > space.  With 500G hard disks at about $100, the more serious limitation
>  >>  > is a 32bit OS.  Speed is a different consideration, but I doubt that
>  >>  > taking over what the OS is supposed to do will be the real answer.
>  >>
>  >>  Umm... swap... yeah... My experience with any code that I write that
>  >>  needs to use swap is to reboot the machine and rewrite my code NOT to
>  >>  use swap.
>  >
>  > ...and the OS can never predict how you are going to access the data -
>  > as a developer you will always be able to write a much faster
>  > file-based "database" than a generic swap mechanism that you have
>  > little control over.
>
>  The original request from "_" was for R to provide a generic swap
>  mechanism.  Why would R be better at it than Linux, say?
>
>  It might make sense for "_" to write his own solution, but any general
>  purpose solution in R is likely to have "disadvantages".  Either it will
>  require too much work from the user (as he claims ff does), or it will
>  be too generic and slow (as the OS swap mechanism is).

But sometimes you don't have the option to store data in memory.
There is always an upper hardware limit, and today most people hit it
either at 2/3/4 GB or at 64 GB.

Databases are the most straightforward way to expand, but again it is
hard to beat a framework that stores (contiguous) data optimized for
reading and/or writing, for accessing the data by column or by row, or
by chunks of subtables, and so on.
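To make the idea concrete, here is a minimal sketch (in Python rather than R, using numpy.memmap as a stand-in for any file-backed vector; the file name, vector length, and chunk size are made up for the example) of a vector whose data lives on disk and is read and written in contiguous chunks, so only one chunk is resident in RAM at a time:

```python
import numpy as np

n = 1_000_000                 # vector length (example size)
fname = "big_vector.dat"      # backing file (hypothetical name)

# Create a file-backed vector: the data lives on disk, not in RAM.
v = np.memmap(fname, dtype="float64", mode="w+", shape=(n,))

# Write in contiguous chunks, so only one chunk is resident at a time.
chunk = 100_000
for start in range(0, n, chunk):
    v[start:start + chunk] = np.arange(start, start + chunk, dtype="float64")

v.flush()                     # make sure everything hits the disk

# Reading is also chunked; random access works too, at a seek cost.
total = 0.0
for start in range(0, n, chunk):
    total += v[start:start + chunk].sum()

print(total)                  # → 499999500000.0 (sum of 0..n-1)
```

Because the developer chooses the chunk size and access pattern, the I/O matches how the data is actually used, which is exactly what a generic OS swap mechanism cannot know in advance.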

Yes, the time to set this up from scratch can be substantial, but if
you plan to work with that amount of data for a while it is probably
worth investing the time sooner rather than later.  And don't get me
wrong, some database systems are fast, so make sure to look at those
options as well.

My $0.02

/Henrik

>
>  Duncan Murdoch
>
>
>
>  >
>  > /Henrik
>  >
>  >>
>  >>  Jeff
>  >>
>  >>
>  >>  >
>  >>  > Paul
>  >>  >
>  >>  >> Could it be possible to implement a new data type in R, like a
>  >>  >> vector, but instead of holding the information in memory, the data
>  >>  >> lies in a file?  If the data is accessed, the vector fetches the
>  >>  >> information automatically from the file.
>  >>  >> There is a package out there (named ff), but the access boundaries
>  >>  >> have to be declared by the user, which is a disadvantage.
>  >>  >>
>  >>  >> Greetings.
>  >>  >>
>  >>  >> ______________________________________________
>  >>  >> R-devel at r-project.org mailing list
>  >>  >> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
