[Rd] [R] Successive subsets from a vector?

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Aug 23 08:41:43 CEST 2006


On Tue, 22 Aug 2006, Thomas Lumley wrote:

> On Tue, 22 Aug 2006, hadley wickham wrote:
> 
> >> The loop method took 195 secs.  Just assigning to an answer of the correct
> >> length reduced this to 5 secs.  e.g. use
> >>
> >>     ADDRESSES <- character(length(VECTOR)-4)
> >>
> >> Moral: don't grow vectors repeatedly.
> >
> > Other languages (e.g. Java) grow the capacity of the vector independently
> > of the number of elements in it (I think Java doubles the capacity
> > whenever the vector is filled), which reduces the number of
> > reallocations from O(n) to O(log n).  I've always wondered why R
> > doesn't do this.
> >
> 
> (redirected to r-devel, a better location for wonder of this type)
> 
> This was apparently the intention at the beginning of time, thus the LENGTH 
> and TRUELENGTH macros in the source.
> 
> In many cases, though, there is duplication as well as length change, e.g.
>     x <- c(x, something)
> will set NAMED(x) to 2 by the second iteration, forcing duplication at 
> each subsequent iteration. The doubling strategy would still leave us with 
> O(n) behaviour, just with a smaller constant.
> 
> The only case I can think of where the doubling strategy actually helps a 
> lot is the one in Atte's example, assigning off the end of an existing 
> vector.  That wasn't legal in early versions of R (and I think most people 
> would agree that it shouldn't be encouraged).
> 
> A reAllocVector() function would clearly have some benefits, but not as 
> many as one would expect. That's probably why it hasn't been done (which 
> doesn't mean that it shouldn't be done).

The expectation was 'negligible' for this value of 'one'.  The only 
benefit which is clear to me is that short vectors are allocated in node 
classes, and so are often slightly over-allocated.  In the places where 
the doubling strategy is used it would allow the use of realloc, which 
might be a smidgen more efficient.  In any case, we have lengthgets, 
which is a bit more general and not always used when it could be.
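[To make the contrast concrete, here is a small sketch (an editorial addition, not part of the original exchange) comparing repeated concatenation with pre-allocation and with `length<-`, the user-level interface to lengthgets:]

```r
n <- 1e4

## Repeated concatenation: each c() call copies the whole vector,
## so total work grows quadratically in n.
grow <- character(0)
for (i in seq_len(n)) grow <- c(grow, "x")

## Pre-allocation: assign into a vector of known length -- O(n) total.
pre <- character(n)
for (i in seq_len(n)) pre[i] <- "x"

## length<- (lengthgets) can also resize in one step once the
## final length is known.
res <- character(0)
length(res) <- n
res[] <- "x"

stopifnot(identical(grow, pre), identical(pre, res))
```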

> Providing the ability to write assignment functions that don't duplicate 
> is a more urgent problem.

You mean, for end-users?  It can be done via primitives.

As I said in my reply on R-help, I don't see the original as a common 
problem at all.  About the only times when the number of entries is 
unknown in advance are when reading from a connection (and there the 
internal code does use a doubling strategy), and in an iterative 
procedure with no limit on the number of iterations (not usually good 
practice).
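[For that unbounded-iteration case, the doubling strategy can be written at user level; a hypothetical sketch (the function name and its arguments are illustrative, not from the original post):]

```r
## Hypothetical user-level doubling: over-allocate, double the buffer on
## overflow, truncate at the end.  Copying is O(n) amortized rather than
## O(n^2) for repeated c().
collect_until <- function(draw, stop_at) {
  buf <- numeric(16)   # initial capacity (arbitrary)
  n <- 0
  repeat {
    x <- draw()
    n <- n + 1
    if (n > length(buf)) length(buf) <- 2 * length(buf)  # double capacity
    buf[n] <- x
    if (x >= stop_at) break
  }
  length(buf) <- n     # drop the unused tail
  buf
}

set.seed(1)
out <- collect_until(function() runif(1), 0.99)
stopifnot(out[length(out)] >= 0.99, all(out[-length(out)] < 0.99))
```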

Now, we could add a c<-() (perhaps better named concat<-) function to R, 
used as

c(x) <- something

and do it efficiently, but would those who encounter the inefficiencies of 
repeated concatenation actually know to use it?  After all,

x = x + 1;

is common in C code, even in R's C code in the recent past (although my 
guess is that any decent compiler can optimize that to the same code as 
x++).
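[At user level such a replacement function could look like the sketch below (an editorial illustration; without primitive support it still duplicates, so it fixes the interface rather than the copying cost):]

```r
## Hypothetical concat<-: append via replacement-function syntax.
## At user level this still copies; a primitive could avoid that.
`concat<-` <- function(x, value) c(x, value)

x <- 1:3
concat(x) <- 4:5
stopifnot(identical(x, c(1:3, 4:5)))
```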

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
