[Rd] Resizing a named vector crashes R with gctorture(TRUE) (PR#13837)

Duncan Murdoch murdoch at stats.uwo.ca
Thu Jul 16 23:51:18 CEST 2009


On 16/07/2009 5:06 PM, Hervé Pagès wrote:
> Duncan Murdoch wrote:
>> On 7/16/2009 2:34 PM, Hervé Pagès wrote:
>>> Duncan Murdoch wrote:
>>>> On 15/07/2009 10:15 PM, Hervé Pagès wrote:
>>>>> I have to confess that I'm a little bit puzzled by how the
>>>>> PROTECT/UNPROTECT mechanism is used in the C code of R.
>>>>> Duncan, you say the problem you just fixed was an easy one.
>>>>> I looked at the C code too and was able to recognize a pattern
>>>>> that is indeed easy to identify as problematic:
>>>>>
>>>>>    an unprotected call to allocVector() followed by a call
>>>>>    that can trigger garbage collection (in that case another
>>>>>    call to allocVector())
>>>>>
>>>>> It only took me 1 minute to find another occurrence of this pattern.
>>>>> It's in the do_grep() function (src/main/character.c, line 1168):
>>>>>
>>>>>    > gctorture(TRUE)
>>>>>    > grep("b", c(A="aa", B="aba"), value=TRUE)
>>>>>      B
>>>>>    "B"
>>>>>
>>>>> Given that the overhead of PROTECTing the SEXP returned by
>>>>> allocVector() can really be considered 0 (or almost), I'm
>>>>> wondering why this is not done in a more systematic way.
>>>> This is an explanation, not a justification:
>>>>
>>>> If you look at the history of that file, you'll see a hint:  line 
>>>> 1168 was written in 1998, the other lines were written later, by 
>>>> other people.  It is simply a matter of someone thinking something 
>>>> was safe when it wasn't, and it's not clear who was wrong:  it may 
>>>> have been safe when written, but susceptible to later changes.
>>>>
>>>>> Even when nothing between PROTECT(allocVector()) and the
>>>>> corresponding UNPROTECT could trigger garbage collection
>>>>> (e.g. PROTECT(allocVector()) is close to the return statement).
>>>>> Because making exceptions like this can make your code
>>>>> really hard to maintain in the long term.
>>>> There are a lot of people who object to anything that slows R at all. 
>>>> That puts pressure on anyone writing code to do it in a way that 
>>>> wastes as few cycles as possible.  That in turn makes it harder for 
>>>> someone else to analyze the code.  And overuse of PROTECT also makes 
>>>> the code harder to read.
>>> Most of the calls to allocVector() are currently protected. There is a
>>> very small percentage of calls to allocVector() that are not. Most of
>>> the times because people apparently decided that, at the time they wrote
>>> the code, it didn't seem necessary. It doesn't matter if they were wrong
>>> or write. My point is that this game is not worth it.
>>>
>>> I bet if you protected all the calls to allocVector() you wouldn't notice
>>> any slow down in R. What is guaranteed though is that you end up with 
>>> code
>>> that sooner or later will break because of some changes that are made to
>>> the function itself or to another function called by your function 
>>> (because
>>> this other function is now calling gc and you were assuming that it 
>>> wouldn't
>>> do that).
>>>
>>> And this kind of breakage is one of the worst kinds: if you are lucky,
>>> you get a segfault, but if you are not, you don't notice anything and
>>> get the wrong answer, like in the length<-() and grep() examples (and
>>> you can safely assume that there are many other places like this in R).
>>>
>>> I guess 99.999% of R users would happily trade a 0.001% slow down for
>>> a correct result.
>>>
>>> First make it right, then make it fast. And sorry, but you're not going
>>> to make it fast by saving a few calls to PROTECT() here and here.
>> As I said, I gave you an explanation, not a justification.  I generally 
>> agree with you, but not everyone does.  For example, after posting the 
>> first patch I received a private email suggesting that the following 
>> PROTECT on xnames could be removed.  I didn't remove it, because I think 
>> it is mostly harmless, and it's not worth my time to analyze whether any 
>> particular PROTECT is unneeded.
>>
>> I generally agree with you, but I don't totally agree with you.  The 
>> protection stack is not infinite, so any time you add a PROTECT you have 
>> to be sure it will be removed in a relatively short time.  You can't 
>> have something like
>>
>> for (i=0; i < n; i++) { PROTECT( ans <- allocVector(...) ) ; ... }
>> UNPROTECT(n);
>>
>> because you are likely to blow the stack when n is large.  You need the 
>> UNPROTECTs within the loop, but still after ans stops being vulnerable 
>> to garbage collection.
> 
> Indeed, putting the UNPROTECT out of the loop would be a bad idea.
> The only reason I see people would do this is because they have
> some continue, break or return statements inside the loop and they
> don't want to put the UNPROTECT before each of them.
> But most of the times, it should be clear where to put the UNPROTECT,
> and, if this is not the case, then that means that it was not clear
> either that ans didn't need to be protected in the first place!
> 
>> It's hard to place them automatically.  And 
>> every extra function/macro call adds to the obscurity of the code, so 
>> it's harder to read it and know whether it really does what you wanted 
>> it to.  My inclination is to over-PROTECT things, but not to PROTECT 
>> everything.
> 
> I didn't say everything should be protected. Just that
> PROTECT(allocVector()) could be used in a more systematic way.

Tell me the system.

Duncan Murdoch

> 
> Thanks,
> H.
> 
>> Duncan Murdoch
>>
>>
>>> Cheers,
>>> H.
>>>
>>>
>>>> As an example: just below line 1168 there's another unprotected 
>>>> allocVector of nm, but I think that one is safe, because it is 
>>>> attached as an attribute to ans (which is now PROTECT'd) before 
>>>> anything is done that could trigger gc.  And a few lines below that, 
>>>> on another branch of the if, another unprotected but safe-looking 
>>>> allocation.  Should I protect those?  Then I'd also need to call 
>>>> UNPROTECT again, or keep a counter of PROTECT calls, and the code 
>>>> would be a little harder to read.
>>>>
>>>> Thanks for tracking down these two bugs; I'll fix the grep bug too.  
>>>> If you feel like looking for more, it would be appreciated.  (Writing 
>>>> an automatic tool to analyze code and determine where new ones are 
>>>> needed and where existing ones could be eliminated might be a fun 
>>>> project, but there are too many fun projects.)
>>>>
>>>> Duncan Murdoch
>>>>
>>>>> Cheers,
>>>>> H.
>>>>>
>>>>>
>>>>> Hervé Pagès wrote:
>>>>>> murdoch at stats.uwo.ca wrote:
>>>>>>> On 15/07/2009 8:30 PM, murdoch at stats.uwo.ca wrote:
>>>>>>>> On 15/07/2009 8:08 PM, Hervé Pagès wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>    > x <- c(a=10, b=20)
>>>>>>>>>    > length(x) <- 1
>>>>>>>>>    > x
>>>>>>>>>     a
>>>>>>>>>    10
>>>>>>>>>
>>>>>>>>> But with gctorture turned on, I get:
>>>>>>>>>
>>>>>>>>>    > gctorture(TRUE)
>>>>>>>>>    > x <- c(a=10, b=20)
>>>>>>>>>    > length(x) <- 1
>>>>>>>>>    > x
>>>>>>>>>      a
>>>>>>>>>    "a"   <---- ???
>>>>>>>>>
>>>>>>>>>    > x <- c(a=10, b=20)
>>>>>>>>>    > length(x) <- 3
>>>>>>>>>
>>>>>>>>>     *** caught segfault ***
>>>>>>>>>    address (nil), cause 'unknown'
>>>>>>>>>
>>>>>>>>>    Possible actions:
>>>>>>>>>    1: abort (with core dump, if enabled)
>>>>>>>>>    2: normal R exit
>>>>>>>>>    3: exit R without saving workspace
>>>>>>>>>    4: exit R saving workspace
>>>>>>>>>
>>>>>>>>> This seems to have been around for a while (I get this with R 2.10,
>>>>>>>>> 2.9 and 2.8). Note that I don't get this with an unnamed vector.
>>>>>>>>>
>>>>>>>>> This problem affects the methods package. I found it while
>>>>>>>>> troubleshooting the "Protection stack overflow" I reported earlier
>>>>>>>>> (see https://stat.ethz.ch/pipermail/r-devel/2009-July/054030.html)
>>>>>>>>> but I can't tell yet whether the 2 issues are related or not.
>>>>>>>> That's clearly a bug (reproducible in today's R-devel build); 
>>>>>>>> I've cc'd this reply to r-bugs.  I'll take a look and see if I 
>>>>>>>> can track it down.
>>>>>>> That's got to be the easiest low-level bug I've worked on in a 
>>>>>>> while. Just a missing PROTECT.  Now fixed, about to be committed 
>>>>>>> to R-devel.
>>>>>> Thanks Duncan! And the "Protection stack overflow" issue that was 
>>>>>> affecting
>>>>>> the methods package is gone now :)
>>>>>>
>>>>>> Cheers,
>>>>>> H.
>>>>>>
>>>>>>> Duncan Murdoch
>>>>>>>
>>>>>>>>> It would be nice to see some reaction from the R developers
>>>>>>>>> about these issues. Thanks in advance!
>>>>>>>> You should post them as bug reports if they are as clearly bugs 
>>>>>>>> as this one; otherwise they can easily get lost in the noise.  
>>>>>>>> I'm not going to offer to look into the other one; I don't know 
>>>>>>>> the insides of the methods package.
>>>>>>>>
>>>>>>>> Duncan Murdoch
>>>>>>>>
>>>>>>>>> H.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> hpages at fhcrc.org wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>   > gctorture(TRUE)
>>>>>>>>>>   > setGeneric("foo", function(x, y) standardGeneric("foo"))
>>>>>>>>>>   [1] "foo"
>>>>>>>>>>   > setMethod("foo", c("ANY", "ANY"),
>>>>>>>>>>   +   function(x, y) cat("calling foo,ANY,ANY method\n")
>>>>>>>>>>   + )
>>>>>>>>>>   Error: protect(): protection stack overflow
>>>>>>>>>>
>>>>>>>>>> Sorry this is something I already reported one week ago here
>>>>>>>>>>   https://stat.ethz.ch/pipermail/r-devel/2009-July/053973.html
>>>>>>>>>> but I just had a 2nd look at it and realized that the problem
>>>>>>>>>> can in fact be reproduced out of the .onLoad() hook. So I'm
>>>>>>>>>> reporting it again with a different subject.
>>>>>>>>>>
>>>>>>>>>> See my sessionInfo() below. Thanks!
>>>>>>>>>> H.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> sessionInfo()
>>>>>>>>>> R version 2.10.0 Under development (unstable) (2009-06-26 r48837)
>>>>>>>>>> x86_64-unknown-linux-gnu
>>>>>>>>>>
>>>>>>>>>> locale:
>>>>>>>>>>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>>>>>>>>>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>>>>>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
>>>>>>>>>>  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
>>>>>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>>>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>>>
>>>>>>>>>> attached base packages:
>>>>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   
>>>>>>>>>> base
>>>>>>>>>>
>>>>>>>>>> ______________________________________________
>>>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>> ______________________________________________
>>>>>>>> R-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>> ______________________________________________
>>>>>>> R-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>



More information about the R-devel mailing list