[Rd] incorrect output and segfaults from sprintf with %*d (PR#13667)

Fri Apr 24 12:45:24 CEST 2009

>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Thu, 23 Apr 2009 12:40:22 +0200 (CEST) writes:

>>>>> "vQ" == Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>>     on Thu, 23 Apr 2009 11:49:54 +0200 writes:

    vQ> maechler at stat.math.ethz.ch wrote:
    >>> 
    vQ> sprintf has a documented limit on strings included in the output using the
    vQ> format '%s'.  It appears that there is a limit on the length of strings included
    vQ> with, e.g., the format '%d' beyond which surprising things happen (output
    vQ> modified for conciseness):
    >>> 

    vQ> ... and this limit is *not* documented.

    MM> well, it is basically (+ a few bytes ?)
    MM> the same  8192  limit that *is* documented.

indeed, I was right with that..

    vQ> gregexpr('1', sprintf('%9000d', 1))
    vQ> # [1] 9000 9801
    >>> 
    vQ> gregexpr('1', sprintf('%9000d', 1))
    vQ> # [1]  9000  9801 10602
    >>> 
    vQ> gregexpr('1', sprintf('%9000d', 1))
    vQ> # [1]  9000  9801 10602 11403
    >>> 
    vQ> gregexpr('1', sprintf('%9000d', 1))
    vQ> # [1]  9000  9801 10602 11403 12204
    >>> 
    vQ> ...
    >>> 

    vQ> Note that not only more than one '1' is included in the
    vQ> output, but also that the same functional expression (no
    vQ> side effects used beyond the interface) gives different
    vQ> results on each execution.  Analogous behaviour can be
    vQ> observed with '%nd' where n > 8200.

    vQ> The actual output above is consistent across separate sessions.
    >>> 
    vQ> With sufficiently large field width values, R segfaults:
    >>> 
    vQ> sprintf('%*d', 10^5, 1)
    vQ> # *** caught segfault ***
    vQ> # address 0xbfcfc000, cause 'memory not mapped'
    vQ> # Segmentation fault
    >>> 
    >>> 
    >>> Thank you, Wacek.
    >>> That's all ``interesting''  ... unfortunately, 
    >>> 
    >>> my version of  'man 3 sprintf' contains
    >>> 
    >>> 
    >>>>> BUGS
    >>>>> Because sprintf() and vsprintf() assume an arbitrarily
    >>>>> long string, callers must be careful not to overflow the
    >>>>> actual space; this is often impossible to assure. Note
    >>>>> that the length of the strings produced is
    >>>>> locale-dependent and difficult to predict.  Use
    >>>>> snprintf() and vsnprintf() instead (or asprintf() and vasprintf).

    vQ> yes, but this is c documentation, not r documentation.

    MM> Of course! ...  and I  *do*  apply it to R's C code [sprintf.c]
    MM> and hence am even concurring with you .. 

    vQ> it's applicable
    vQ> to a degree, since ?sprintf does say that sprintf is "a wrapper for the
    vQ> C function 'sprintf'".  however, in c you use a buffer and you usually
    vQ> have control over it's capacity, while in r this is a hidden
    vQ> implementational detail, which should not be visible to the user, or
    vQ> should cause an attempt to overflow the buffer to fail more gracefully
    vQ> than with a segfault.

    vQ> in r, sprintf('%9000d', 1) will produce a confused output with a count
    vQ> of 1's variable (!) across runs (while sprintf('%*d', 9000, 1) seems to
    vQ> do fine):

    vQ> gregexpr('1', sprintf('%*d', 9000, 1))
    vQ> # [1] 9000

    vQ> gregexpr('1', sprintf('%9000d', 1))
    vQ> # [1] 9000 9801 ..., variable across executions

    vQ> on one execution in a series i actually got this:

    vQ> Warning message:
    vQ> In gregexpr("1", sprintf("%9000d", 1)) :
    vQ> input string 1 is invalid in this locale

    vQ> while the very next execution, still in the same session, gave

    vQ> # [1]  9000  9801 10602

    vQ> with sprintf('%*d', 10000, 1) i got segfaults on some executions but
    vQ> correct output on others, while sprintf('%10000d', 1) is confused again.

    >>> (note the "impossible" part above)       
    >>> 

    vQ> yes, but it does also say "must be careful", and it seems that someone
    vQ> has not been careful enough.

    >>> and we haven't used  snprintf() yet, probably because it
    >>> requires the  C99 C standard, and AFAIK, we have only relatively
    >>> recently started to more or less rely on C99 in the R sources.
    >>> 

    vQ> while snprintf would help avoid buffer overflow, it may not be a
    vQ> solution to the issue of confused output.

    MM> I think it would / will.  We would be able to give warnings and
    MM> errors, by checking the  snprintf()  return codes.

My current working code gives an error for all the above
examples, e.g.,

 > sprintf('%9999d', 1)
 Error in sprintf("%9999d", 1) : 
   required resulting string length 9999 is > maximal 8191

it passes  'make check-devel' and I am inclined to commit that
code to R-devel (e.g. tomorrow). 

Yes, the documentation will also have to be amended, but apart
from that, would people see a big problem with the "8192" limit
which now is suddenly of greater importance
{{as I said all along;  hence my question to Wacek (and the
  R-develers)  if anybody found that limit too low}}

Of course, one could realloc()ate longer strings when needed
inside R's  sprintf() code,  but I reluctant to do that (render
the code yet more complicated ...) if nobody sees a need.

Martin

    >>> More precisely, I see that some windows-only code relies on
    >>> snprintf() being available  whereas in at least on non-Windows
    >>> section, I read   /* we cannot assume snprintf here */
    >>> 
    >>> Now such platform dependency issues and corresponding configure
    >>> settings I do typically leave to other R-corers with a much
    >>> wider overview about platforms and their compilers and C libraries.
    >>> 

    vQ> it looks like src/main/sprintf.c is just buggy, and it's plausible that
    vQ> the bug could be repaired in a platform-independent manner.

    MM> definitely.
    MM> In the mean time, I've actually found that what I first said on
    MM> the usability of snprintf() in R's code base was only partly correct.
    MM> There are other parts of R code where we use  snprintf() for all
    MM> platforms, hence we rely on its presence (and correct
    MM> implementation!) and so we can and I think should use it in
    MM> place of sprintf() in quite a few places inside R's sprintf.c

    >>> BTW,  
    >>> 1) sprintf("%n %g", 1,1)   also seg.faults
    >>> 

    vQ> as do

    vQ> sprintf('%n%g', 1, 1)
    vQ> sprintf('%n%')

    vQ> etc., while

    vQ> sprintf('%q%g', 1, 1)
    vQ> sprintf('%q%')

    vQ> work just fine.  strange, because per ?sprintf 'n' is not recognized as
    vQ> a format specifier, so the output from the first two above should be as
    vQ> from the last two above, respectively.  (and likewise in the %S case,
    vQ> discussed and bug-reported earlier.)

    MM> I have now fixed these bugs at least;
    MM> the more subtle  "%<too_large_n>d" ones are different, and
    MM> as I said, I'm convinced that a nice & clean fix for those will
    MM> start using snprintf().

    >>> 2) Did you have a true use case where  the  8192  limit was an
    >>> undesirable limit?

    vQ> how does it matter?  

    MM> well, we could increase it, if it did matter.
    MM> {{ you *could* have been more polite here, no?
    MM> it *was* after all a serious question that I asked! }}

    vQ> if you set a limit, be sure to consistently enforce
    vQ> it and warn the user on attempts to exceed it.  or write clearly in the
    vQ> docs that such attempts will cause the output to be silently truncated. 

    MM> Sure, I'm not at all disagreeing on that, and if you read this into my
    MM> posting, you misunderstand.

    MM> Martin

    [..........................]