[R] Any chance R will ever get beyond the 2^31-1 vector size limit?

Thomas Lumley tlumley at u.washington.edu
Fri Apr 16 02:05:05 CEST 2010



There is one thing that would definitely break. Quite a bit of compiled code relies on the fact that the R integer type, the type used to index arrays, and the C int type are all the same.  The C int type won't change, so if the type used to index arrays changes, the R integer type will end up different from at least one of them.

Suppose you currently write

    .C("foo", as.double(x), as.integer(length(x)))

to call a C function

    void foo(double* x, int* n)

If length(x) could be larger than 2^31-1, this isn't going to work: either as.integer() will fail, or it will succeed and produce a value that doesn't fit in a C int.
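
For concreteness, a minimal sketch of the failure mode at the R prompt ("foo" is just a placeholder routine name, as above):

    x <- double(10)                            # any numeric vector will do
    n <- 2^31                                  # one past what a C int can hold
    as.integer(n)                              # NA, with a coercion warning
    ## .C("foo", as.double(x), as.integer(n))  # the C side would receive NA, not a length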


    -thomas





On Thu, 15 Apr 2010, Matthew Keller wrote:

> Hi Duncan and R users,
>
> Duncan, thank you for taking the time to respond. I've had several
> other comments off the list, and I'd like to summarize what they had
> to say, although I won't give sources since I assume there was a
> reason why people chose not to respond to the whole list. The long and
> short of it is that there is hope for people who want R to get beyond
> the 2^31-1 vector size limit.
>
> First off, I received a couple of responses from people who wanted to
> commiserate and who asked me to summarize what I learned. Here you go.
>
> Second, the packages bigmemory and ff can both help with memory issues.
> I've had success using bigmemory before, and found it to be quite
> intuitive.
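
A minimal sketch of the bigmemory idiom, assuming the package is installed (the dimensions are made up for illustration):

    library(bigmemory)
    x <- big.matrix(nrow = 1e6, ncol = 100, type = "double")  # data held outside R's heap
    x[1, 1] <- 0.5                                            # indexed like an ordinary matrix
    x[1, 1]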
>
> Third, one knowledgeable responder doubted that changing the 2^31-1
> limit would 'break' old datasets. He said, "This might be true for
> isolated cases of objects stored in binary formats or in workspaces,
> but I don't see that as anywhere near as important as the change you
> (and we) would like to see."
>
> Fourth, another knowledgeable responder felt that, given the demand
> driven by the huge increases in dataset sizes, this limitation would
> likely be overcome within the next few years.
>
> Best,
>
> Matt
>
>
> On Fri, Apr 9, 2010 at 6:36 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
>> On 09/04/2010 7:38 PM, Matthew Keller wrote:
>>>
>>> Hi all,
>>>
>>> My institute will hopefully be working on cutting-edge genetic
>>> sequencing data by the Fall of 2010. The datasets will be tens of GB
>>> in size and growing. I'd like to use R to do primary analyses. This is
>>> OK, because we can just throw $ at the problem and get a machine with
>>> lots of RAM running 64-bit R. However, we are still running up against
>>> the fact that vectors in R cannot contain more than 2^31-1 elements. I
>>> know there are
>>> "ways around" this issue, and trust me, I think I've tried them all
>>> (e.g., bringing in portions of the data at a time; using large-dataset
>>> packages in R; using SQL databases, etc). But all these 'solutions'
>>> are, at the end of the day, much much more cumbersome,
>>> programming-wise, than just doing things in native R. Maybe that's
>>> just the cost of doing what I'm doing. But my questions, which may
>>> well be naive (I'm not a computer programmer), are:
>>>
>>> 1) Is there an *inherent* limit holding vectors to at most 2^31-1
>>> elements? I.e., in an alternative history of R's development, would it
>>> have been feasible for R not to have had this limitation?
>>
>> The problem is that we use "int" as a vector index.  On most platforms,
>> that's a signed 32 bit integer, with max value 2^31-1.
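
You can see that limit from within R itself; this is plain base R, nothing version-specific assumed:

    .Machine$integer.max               # 2147483647
    .Machine$integer.max == 2^31 - 1   # TRUE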
>>
>>
>>>
>>> 2) Is there any possibility that this limit will be overcome in future
>>> revisions of R?
>>
>>
>> Of course, R is open source.  You could rewrite all of the internal code
>> tomorrow to use 64 bit indexing.
>>
>> Will someone else do it for you?  Even that is possible.  One problem is
>> that this will make all of your data incompatible with older versions of R.
>>  And back to the original question:  are you willing to pay for the
>> development?  Then go ahead, you can have it tomorrow (or later, if your
>> budget is limited).  Are you waiting for someone else to do it for free?
>>  Then you need to wait for someone who knows how to do it to want to do it.
>>
>>
>>> I'm very very grateful to the people who have spent important parts of
>>> their professional lives developing R. I don't think anyone back in,
>>> say, 1995, could have foreseen that datasets would be far larger than
>>> 2^31-1 elements. For better or worse, however, in many fields of science, that is
>>> routinely the case today. *If* it's possible to get around this limit,
>>> then I'd like to know whether the R Development Team takes seriously
>>> the needs of large data users, or if they feel that (perhaps not
>>> mutually exclusively) developing such capacity is best left up to ad
>>> hoc R packages and alternative analysis programs.
>>
>> There are many ways around the limit today.  Put your data in a dataframe
>> with many columns each of length 2^31-1 or less.  Put your data in a
>> database, and process it a block at a time.  Etc.
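
A minimal sketch of the block-at-a-time pattern against a flat text file (the file name and block size here are illustrative):

    con <- file("big_data.txt", open = "r")
    total <- 0
    repeat {
        block <- scan(con, what = double(), nlines = 1e6, quiet = TRUE)
        if (length(block) == 0) break
        total <- total + sum(block)    # accumulate whatever summary is needed
    }
    close(con)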
>>
>> Duncan Murdoch
>>
>>>
>>> Best,
>>>
>>> Matt
>>>
>>>
>>>
>>
>>
>
>
>
> -- 
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

