[Rd] allocVector bug ?

Vladimir Dergachev vdergachev at rcgardis.com
Wed Nov 8 22:33:17 CET 2006


On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> > Hi Luke,
> >
> >
> > I generally agree with this, however I believe that the current logic
> > breaks down for large allocation sizes and my code ends up spending 70%
> > (and up) of computer time spinning inside the garbage collector (I run
> > oprofile to observe what is going on).
>
> Again please be careful about these sorts of statements.  I am sure
> there are bugs in the memory manager and places where things "break
> down" but this isn't one of them.  The memory manager is quite
> deliberately biased towards keeping the total allocation low, if
> necessary at the expense of some extra gc overhead.  This is needed if
> we want to use the same settings across a wide range of
> configurations, some of which have relatively little memory available
> (think student labs).  The memory manager does try to learn about the
> needs of a session, and as a result the triggering values get adjusted.  It
> is not true that every large allocation causes a gc.  This may be true
> _initially_, but once total memory usage stabilizes at a particular
> level it is no longer true (look at the way the heap limits are
> adjusted).
>
> This approach of adjusting based on usage within a session is
> reasonable and works well for longer sessions.  It may not work well
> for short scripts that need large allocations.  I doubt that any
> automated setting can work well in that situation while at the same
> time keeping memory usage in other settings low. So it may be useful
> to find ways of specifying a collection strategy appropriate for these
> situations. If you can send me a simplified version of your usage
> scenario then I will give this some thought and see if we can come up
> with some reasonable ways of allowing user code to tweak gc behavior
> for these situations.
>

Hi Luke, 
   
   Yes, I gladly concede the point that for a heuristic algorithm the notion 
of what is a "bug" is murky (besides crashes, etc., which is not what I am 
talking about).

   Here is why I called this a bug:

     1. My understanding is that each time gc() needs to increase memory it 
performs a full garbage collection run. Right ?

     2. This is not a problem with small memory sizes as they imply 
(presumably) a small number of objects.

     3. However, if one wants to allocate many objects (say columns in a data 
frame or just vectors) this results in a large penalty.

Example 1: This simulates allocation of a data.frame with some character 
columns which are assumed to be factors. On my system the first assignment is 
nearly instantaneous, while subsequent assignments take slightly less than 0.1 
seconds each.

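L <- list()
Chars <- as.character(1:100000)   # character data, standing in for a factor-like column
# Time 100 allocations of a one-million-element integer vector each
for (i in 1:100) L[[i]] <- system.time(assign(paste("test", i), 1:1000000))
Times <- do.call(rbind, L)        # one row of timings per allocation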

Example 2: Same as example 1 but we first grow the memory with fake 
allocation:

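L <- list()
Chars <- as.character(1:100000)   # character data, standing in for a factor-like column
Data <- 1:100000000               # fake allocation to grow the vector heap
rm(Data)                          # release it again; the heap limit stays raised
# Time 100 allocations of a one-million-element integer vector each
for (i in 1:100) L[[i]] <- system.time(assign(paste("test", i), 1:1000000))
Times <- do.call(rbind, L)        # one row of timings per allocation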

In this case the first 20 or so allocations are very quick (faster than 0.02 
sec) and then the garbage collector kicks in and the time rises to 0.08 
seconds each - still less than in Example 1.

This example is relevant because this sequence of allocations is exactly what 
happens when one uses read.table or scan (or a database query) to load data.

What is more, if the user then manipulates the loaded data by creating columns 
that are a combination of existing ones then this is very slow as well.

I looked more carefully at your code in src/main/memory.c, function 
AdjustHeapSize:

    R_VSize = VNeeded;
    if (vect_occup > R_VGrowFrac) {
        R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
        if (R_MaxVSize - R_VSize >= change)
            R_VSize += change;
    }

Could it be that R_NSize should be R_VSize ? This would explain why I see a 
problem when R_VSize >> R_NSize.
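
To make the question concrete, here is a rough sketch in C of how the two 
choices differ. It is a simplified model and not the code in memory.c: it 
ignores the R_VSize = VNeeded reset, the vect_occup test and the R_MaxVSize 
cap, and simply counts how many growth steps (each of which I assume costs a 
full collection) are needed to get from one heap size to another. The function 
name growth_steps and the flag scale_by_vsize are made up for illustration.

```c
#include <assert.h>

/* Count how many heap adjustments are needed to grow a vector heap from
 * `start` to at least `target` cells, when each adjustment adds
 *     incr_min + incr_frac * base
 * where base is either the fixed node-heap size `nsize` (as in the quoted
 * snippet) or the current vector-heap size (the alternative asked about).
 * Hypothetical model only; all parameters are supplied by the caller. */
unsigned long growth_steps(double start, double target,
                           double incr_min, double incr_frac,
                           double nsize, int scale_by_vsize)
{
    /* A strictly positive increment guarantees that the loop terminates. */
    assert(incr_min > 0.0 && incr_frac >= 0.0 && nsize >= 0.0 && start >= 0.0);

    double vsize = start;
    unsigned long steps = 0;

    while (vsize < target) {
        double base = scale_by_vsize ? vsize : nsize;
        vsize += incr_min + incr_frac * base;  /* one heap adjustment */
        steps++;
    }
    return steps;
}
```

With start 100, target 1000, incr_min 10, incr_frac 0.5 and nsize 20, the 
fixed base needs 45 steps while scaling by the current vector heap size needs 
6. In this model the step count grows linearly with the target size in the 
first case and roughly logarithmically in the second, which would be 
consistent with the slowdown I see when R_VSize >> R_NSize.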

                            thank you very much !

                                     Vladimir Dergachev


