[Rd] Resizing a named vector crashes R with gctorture(TRUE) (PR#13837)

Hervé Pagès hpages at fhcrc.org
Fri Jul 17 02:30:50 CEST 2009

Duncan Murdoch wrote:
> On 16/07/2009 5:06 PM, Hervé Pagès wrote:
>> Duncan Murdoch wrote:
>>> On 7/16/2009 2:34 PM, Hervé Pagès wrote:
>>>> Duncan Murdoch wrote:
>>>>> On 15/07/2009 10:15 PM, Hervé Pagès wrote:
>>>>>> I have to confess that I'm a little bit puzzled by how the
>>>>>> PROTECT/UNPROTECT mechanism is used in the C code of R.
>>>>>> Duncan, you say the problem you just fixed was an easy one.
>>>>>> I looked at the C code too and was able to recognize a pattern
>>>>>> that is indeed easy to identify as problematic:
>>>>>>    an unprotected call to allocVector() followed by a call
>>>>>>    that can trigger garbage collection (in that case another
>>>>>>    call to allocVector())
>>>>>> It only took me 1 minute to find another occurrence of this pattern.
>>>>>> It's in the do_grep() function (src/main/character.c, line 1168):
>>>>>>    > gctorture(TRUE)
>>>>>>    > grep("b", c(A="aa", B="aba"), value=TRUE)
>>>>>>      B
>>>>>>    "B"
>>>>>> Given that the overhead of PROTECTing the SEXP returned by
>>>>>> allocVector() can really be considered 0 (or almost), I'm
>>>>>> wondering why this is not done in a more systematic way.
>>>>> This is an explanation, not a justification:
>>>>> If you look at the history of that file, you'll see a hint:  line 
>>>>> 1168 was written in 1998, the other lines were written later, by 
>>>>> other people.  It is simply a matter of someone thinking something 
>>>>> was safe when it wasn't, and it's not clear who was wrong:  it may 
>>>>> have been safe when written, but susceptible to later changes.
>>>>>> Even when nothing between PROTECT(allocVector()) and the
>>>>>> corresponding UNPROTECT could trigger garbage collection
>>>>>> (e.g. PROTECT(allocVector()) is close to the return statement).
>>>>>> Because making exceptions like this can make your code
>>>>>> really hard to maintain in the long term.
>>>>> There are a lot of people who object to anything that slows R at 
>>>>> all. That puts pressure on anyone writing code to do it in a way 
>>>>> that wastes as few cycles as possible.  That in turn makes it 
>>>>> harder for someone else to analyze the code.  And overuse of 
>>>>> PROTECT also makes the code harder to read.
>>>> Most of the calls to allocVector() are currently protected. There is a
>>>> very small percentage of calls to allocVector() that are not. Most of
>>>> the times because people apparently decided that, at the time they 
>>>> wrote
>>>> the code, it didn't seem necessary. It doesn't matter if they were 
>>>> wrong
>>>> or write. My point is that this game is not worth it.
>>>> I bet if you protected all the calls to allocVector() you wouldn't 
>>>> notice
>>>> any slow down in R. What is guaranteed though is that you end up 
>>>> with code
>>>> that sooner or later will break because of some changes that are 
>>>> made to
>>>> the function itself or to another function called by your function 
>>>> (because
>>>> this other function is now calling gc and you were assuming that it 
>>>> wouldn't
>>>> do that).
>>>> And this kind of breakage is one of the worst kinds: if you are lucky,
>>>> you get a segfault, but if you are not, you don't notice anything and
>>>> get the wrong answer, like in the length<-() and grep() examples (and
>>>> you can safely assume that there are many other places like this in R).
>>>> I guess 99.999% of R users would happily trade a 0.001% slow down for
>>>> a correct result.
>>>> First make it right, then make it fast. And sorry, but you're not going
>>>> to make it fast by saving a few calls to PROTECT() here and here.
>>> As I said, I gave you an explanation, not a justification.  I 
>>> generally agree with you, but not everyone does.  For example, after 
>>> posting the first patch I received a private email suggesting that 
>>> the following PROTECT on xnames could be removed.  I didn't remove 
>>> it, because I think it is mostly harmless, and it's not worth my time 
>>> to analyze whether any particular PROTECT is unneeded.
>>> I generally agree with you, but I don't totally agree with you.  The 
>>> protection stack is not infinite, so any time you add a PROTECT you 
>>> have to be sure it will be removed in a relatively short time.  You 
>>> can't have something like
>>> for (i=0; i < n; i++) { PROTECT( ans <- allocVector(...) ) ; ... }
>>> because you are likely to blow the stack when n is large.  You need 
>>> the UNPROTECTs within the loop, but still after ans stops being 
>>> vulnerable to garbage collection.
>> Indeed, putting the UNPROTECT out of the loop would be a bad idea.
>> The only reason I see people would do this is because they have
>> some continue, break or return statements inside the loop and they
>> don't want to put the UNPROTECT before each of them.
>> But most of the times, it should be clear where to put the UNPROTECT,
>> and, if this is not the case, then that means that it was not clear
>> either that ans didn't need to be protected in the first place!
>>> It's hard to place them automatically.  And every extra 
>>> function/macro call adds to the obscurity of the code, so it's harder 
>>> to read it and know whether it really does what you wanted it to.  My 
>>> inclination is to over-PROTECT things, but not to PROTECT everything.
>> I didn't say everything should be protected. Just that
>> PROTECT(allocVector()) could be used in a more systematic way.
> Tell me the system.

OK, the system is the following. Here "you" is not you Duncan, but
the developer that is facing a protect-or-not-protect dilemma.

Every time you are tempted to write

   x = allocVector();

think about what will happen the day someone will come and
add the following line right after your line:

   y = allocVector();

Then cross your finger that s/he will remember to fix your code.
Alternatively you have the option to anticipate and make your
code safer in the long run. It's easy and at the same time you
show that you care more about long term maintainability than
saving an insignificant number of CPU cycles.

It's easy to imagine that people will be reluctant to change
things like:

   ans_elt = allocVector();
   SET_VECTOR_ELT(ans, i, ans_elt);

but I still think that the following is as good and not a lot
harder to read:

   PROTECT(ans_elt = allocVector());
   SET_VECTOR_ELT(ans, i, ans_elt);

But maybe you can have a short list of authorized exceptions
to the rule for these very simple cases.

Otherwise, when you see code like

   /* No protection needed as ExtractSubset does not allocate */
   result = allocVector(mode, n);
   PROTECT(result = ExtractSubset(x, result, indx, call));

it's nice to have a comment, but since 'result' gets finally
protected (ExtractSubset returns the same 'result' that was
passed to it), then why not just do:

   PROTECT(result = allocVector(mode, n));
   result = ExtractSubset(x, result, indx, call);

so you don't need to comment anything and the day someone
decides to allocate in ExtractSubset you are still good.

More generally, when a function is changed from being
non-allocating to be allocating, is the person in charge of
this change also supposed to come to every place where the
function is called and add the missing PROTECT/UNPROTECT?
Even worse, if it's a low-level routine that becomes an
allocating function, it could be that dozens or hundreds
of higher level functions now become allocating (being an
allocating function is a property that propagates to the
parents of the function), so the person can end up having to
check hundreds of places! The task might just become impossible.


> Duncan Murdoch
>> Thanks,
>> H.
>>> Duncan Murdoch
>>>> Cheers,
>>>> H.
>>>>> As an example: just below line 1168 there's another unprotected 
>>>>> allocVector of nm, but I think that one is safe, because it is 
>>>>> attached as an attribute to ans (which is now PROTECT'd) before 
>>>>> anything is done that could trigger gc.  And a few lines below 
>>>>> that, on another branch of the if, another unprotected but 
>>>>> safe-looking allocation.  Should I protect those?  Then I'd also 
>>>>> need to call UNPROTECT again, or keep a counter of PROTECT calls, 
>>>>> and the code would be a little harder to read.
>>>>> Thanks for tracking down these two bugs; I'll fix the grep bug 
>>>>> too.  If you feel like looking for more, it would be appreciated.  
>>>>> (Writing an automatic tool to analyze code and determine where new 
>>>>> ones are needed and where existing ones could be eliminated might 
>>>>> be a fun project, but there are too many fun projects.)
>>>>> Duncan Murdoch
>>>>>> Cheers,
>>>>>> H.
>>>>>> Hervé Pagès wrote:
>>>>>>> murdoch at stats.uwo.ca wrote:
>>>>>>>> On 15/07/2009 8:30 PM, murdoch at stats.uwo.ca wrote:
>>>>>>>>> On 15/07/2009 8:08 PM, Hervé Pagès wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>    > x <- c(a=10, b=20)
>>>>>>>>>>    > length(x) <- 1
>>>>>>>>>>    > x
>>>>>>>>>>     a
>>>>>>>>>>    10
>>>>>>>>>> But with gctorture turned on, I get:
>>>>>>>>>>    > gctorture(TRUE)
>>>>>>>>>>    > x <- c(a=10, b=20)
>>>>>>>>>>    > length(x) <- 1
>>>>>>>>>>    > x
>>>>>>>>>>      a
>>>>>>>>>>    "a"   <---- ???
>>>>>>>>>>    > x <- c(a=10, b=20)
>>>>>>>>>>    > length(x) <- 3
>>>>>>>>>>     *** caught segfault ***
>>>>>>>>>>    address (nil), cause 'unknown'
>>>>>>>>>>    Possible actions:
>>>>>>>>>>    1: abort (with core dump, if enabled)
>>>>>>>>>>    2: normal R exit
>>>>>>>>>>    3: exit R without saving workspace
>>>>>>>>>>    4: exit R saving workspace
>>>>>>>>>> This seems to have been around for a while (I get this with R 
>>>>>>>>>> 2.10,
>>>>>>>>>> 2.9 and 2.8). Note that I don't get this with an unnamed vector.
>>>>>>>>>> This problem affects the methods package. I found it while
>>>>>>>>>> troubleshooting the "Protection stack overflow" I reported 
>>>>>>>>>> earlier
>>>>>>>>>> (see 
>>>>>>>>>> https://stat.ethz.ch/pipermail/r-devel/2009-July/054030.html)
>>>>>>>>>> but I can't tell yet whether the 2 issues are related or not.
>>>>>>>>> That's clearly a bug (reproducible in today's R-devel build); 
>>>>>>>>> I've cc'd this reply to r-bugs.  I'll take a look and see if I 
>>>>>>>>> can track it down.
>>>>>>>> That's got to be the easiest low-level bug I've worked on in a 
>>>>>>>> while. Just a missing PROTECT.  Now fixed, about to be committed 
>>>>>>>> to R-devel.
>>>>>>> Thanks Duncan! And the "Protection stack overflow" issue that was 
>>>>>>> affecting
>>>>>>> the methods package is gone now :)
>>>>>>> Cheers,
>>>>>>> H.
>>>>>>>> Duncan Murdoch
>>>>>>>>>> It would be nice to see some reaction from the R developers
>>>>>>>>>> about these issues. Thanks in advance!
>>>>>>>>> You should post them as bug reports if they are as clearly bugs 
>>>>>>>>> as this one; otherwise they can easily get lost in the noise.  
>>>>>>>>> I'm not going to offer to look into the other one; I don't know 
>>>>>>>>> the insides of the methods package.
>>>>>>>>> Duncan Murdoch
>>>>>>>>>> H.
>>>>>>>>>> hpages at fhcrc.org wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>   > gctorture(TRUE)
>>>>>>>>>>>   > setGeneric("foo", function(x, y) standardGeneric("foo"))
>>>>>>>>>>>   [1] "foo"
>>>>>>>>>>>   > setMethod("foo", c("ANY", "ANY"),
>>>>>>>>>>>   +   function(x, y) cat("calling foo,ANY,ANY method\n")
>>>>>>>>>>>   + )
>>>>>>>>>>>   Error: protect(): protection stack overflow
>>>>>>>>>>> Sorry this is something I already reported one week ago here
>>>>>>>>>>>   https://stat.ethz.ch/pipermail/r-devel/2009-July/053973.html
>>>>>>>>>>> but I just had a 2nd look at it and realized that the problem
>>>>>>>>>>> can in fact be reproduced out of the .onLoad() hook. So I'm
>>>>>>>>>>> reporting it again with a different subject.
>>>>>>>>>>> See my sessionInfo() below. Thanks!
>>>>>>>>>>> H.
>>>>>>>>>>>> sessionInfo()
>>>>>>>>>>> R version 2.10.0 Under development (unstable) (2009-06-26 
>>>>>>>>>>> r48837)
>>>>>>>>>>> x86_64-unknown-linux-gnu
>>>>>>>>>>> locale:
>>>>>>>>>>>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>>>>>>>>>>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>>>>>>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
>>>>>>>>>>>  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
>>>>>>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>>>>> attached base packages:
>>>>>>>>>>> [1] stats     graphics  grDevices utils     datasets  
>>>>>>>>>>> methods   base
>>>>>>>>>>> ______________________________________________
>>>>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>>> ______________________________________________
>>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>> ______________________________________________
>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the R-devel mailing list