[Rd] allocVector bug ?

Luke Tierney luke at stat.uiowa.edu
Thu Nov 9 18:21:09 CET 2006


On Wed, 8 Nov 2006, Vladimir Dergachev wrote:

> On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
>> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
>>> Hi Luke,
>>>
>>>
>>> I generally agree with this; however, I believe that the current logic
>>> breaks down for large allocation sizes, and my code ends up spending 70%
>>> (and up) of its time spinning inside the garbage collector (I ran
>>> oprofile to observe what is going on).
>>
>> Again please be careful about these sorts of statements.  I am sure
>> there are bugs in the memory manager and places where things "break
>> down" but this isn't one of them.  The memory manager is quite
>> deliberately biased towards keeping the total allocation low, if
>> necessary at the expense of some extra gc overhead.  This is needed if
>> we want to use the same settings across a wide range of
>> configurations, some of which have relatively little memory available
>> (think student labs).  The memory manager does try to learn about the
>> needs of a session, and as a result the triggering values get adjusted.  It
>> is not true that every large allocation causes a gc.  This may be true
>> _initially_, but once total memory usage stabilizes at a particular
>> level it is no longer true (look at the way the heap limits are
>> adjusted).
>>
>> This approach of adjusting based on usage within a session is
>> reasonable and works well for longer sessions.  It may not work well
>> for short scripts that need large allocations.  I doubt that any
>> automated setting can work well in that situation while at the same
>> time keeping memory usage in other settings low. So it may be useful
>> to find ways of specifying a collection strategy appropriate for these
>> situations. If you can send me a simplified version of your usage
>> scenario then I will give this some thought and see if we can come up
>> with some reasonable ways of allowing user code to tweak gc behavior
>> for these situations.
>>
>
> Hi Luke,
>
>   Yes, I gladly concede the point that for a heuristic algorithm the notion
> of what is a "bug" is murky (besides crashes, etc., which is not what I am
> talking about).
>
>   Here is why I called this a bug:
>
>     1. My understanding is that each time gc() needs to increase memory it
> performs a full garbage collection run. Right?

The allocation process does not call gc before every call to malloc.
It only calls gc if the allocation would cross a threshold level.
Those threshold levels are adjusted in an effort to compromise between
keeping memory footprint low and not calling gc too often.  The code
you quote below is part of this adjustment process.  If this process
is working properly then as memory use grows there will initially be
more gc activity and then less as the thresholds adjust.
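
For example, the current threshold levels show up in the "gc trigger"
column of the summary that gc() prints, so you can watch them move (a
rough sketch; the exact numbers will of course depend on your session):

gc()               # "gc trigger" column shows the current thresholds
x <- numeric(1e7)  # keep a large object live ...
gc()               # ... and the trigger levels are typically higher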

The command line arguments mentioned in ?Memory can be used to
influence some of this behavior.  Calling gcinfo(TRUE) will show you
when the internally triggered collections occur, and which level of
collection occurs.
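
For instance, something along these lines (an illustrative sketch only;
the exact trigger points will vary by session) shows that pattern --
many collections at first, fewer once the thresholds have adjusted:

gcinfo(TRUE)                        # report each triggered collection
for (i in 1:30) x <- numeric(1e6)   # repeated large allocations
gcinfo(FALSE)                       # back to silent collections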

>
>     2. This is not a problem with small memory sizes as they imply
> (presumably) small number of objects.
>
>     3. However, if one wants to allocate many objects (say columns in a data
> frame or just vectors) this results in large penalty
>
> Example 1: This simulates allocation of a data.frame with some character
> columns which are assumed to be factors. On my system the first assignment
> is nearly instantaneous, while subsequent assignments take slightly less
> than 0.1 seconds each.

I'm not sure these are quite doing what you intend.  You define Chars
but don't use it.  Also, system.time by default calls gc() before
doing the evaluation. Giving FALSE as the second argument may give you
a more realistic picture.
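
For example, adapting your loop (sketch):

L[[i]] <- system.time(assign(paste("test", i), 1:1000000), FALSE)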

>
> L<-list()
> Chars<-as.character(1:100000)
> for(i in 1:100)L[[i]]<-system.time(assign(paste("test", i), 1:1000000))
> Times<-do.call(rbind, L)
>
> Example 2: Same as example 1 but we first grow the memory with fake
> allocation:
>
> L<-list()
> Chars<-as.character(1:100000)
> Data<-1:100000000
> rm(Data)
> for(i in 1:100)L[[i]]<-system.time(assign(paste("test", i), 1:1000000))
> Times<-do.call(rbind, L)
>
> In this case the first 20 or so allocations are very quick (faster than 0.02
> sec) and then the garbage collector kicks in and the time rises to 0.08
> seconds each - still less than in Example 1.
>
> This example is relevant because this sequence of allocations is exactly what
> happens when one uses read.table or scan (or a database query) to load data.
>
> What is more, if the user then manipulates the loaded data by creating columns
> that are a combination of existing ones then this is very slow as well.
>
> I looked more carefully at your code in src/main/memory.c, function
> AdjustHeapSize:
>
> R_VSize = VNeeded;
> if (vect_occup > R_VGrowFrac) {
>     R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
>     if (R_MaxVSize - R_VSize >= change)
>         R_VSize += change;
> }
>
> Could it be that R_NSize should be R_VSize ? This would explain why I see a
> problem in case R_VSize>>R_NSize.

That does indeed look like a bug, and R_NSize should indeed be R_VSize --
well spotted, thanks.  I will need to experiment with this a bit more
to see if it can safely be changed.  It will increase the memory
footprint a bit.  Probably not by enough to matter, but if it does we
may need to adjust some of the tuning constants.

Best,

luke

>                            thank you very much !
>
>                                     Vladimir Dergachev
>
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


