[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Sun May 31 00:22:58 CEST 2009

Martin Maechler wrote:


>     vQ> the first question is, why does ER return the string as const?  it
>     vQ> appears that the returned pointer provides the address of a buffer used
>     vQ> internally in ER, which is allocated *statically*.  that is, each call
>     vQ> to ER operates on the same memory location, and each call to ER returns
>     vQ> the address of that same location.  i suspect this is intended to be a
>     vQ> smart optimization, to avoid heap- or stack-allocating a new buffer in
>     vQ> each call to ER, and deallocating it after use.  however, this appraoch
>     vQ> is problematic, in that any two calls to ER return the address of the
>     vQ> same piece of memory, and this may easily lead to data corruption. 
> Well, that would be ok if R could be used "threaded" / parallel / ...

this can cause severe problems even without concurrency, as one of my
examples hinted.

> and we all know that there are many other pieces of code {not
> just R's "own", but also in Fortran/C algorithms ..} that are
> not thread-safe.

absolutely.  again, ER is unsafe even in a sequential execution environment.

> "Yes, of course", R looks like a horrible piece of software 


> to
> some,  because of that
>     vQ> under the assumption that the content of this piece of memory is copied
>     vQ> before any destructive use, and that after the string is copied the
>     vQ> address is not further distributed, the hack is relatively harmless. 
>     vQ> this is what mkChar (via mkCharLenCE) does;  in SFR it copies the
>     vQ> content of s with memcpy, and wraps it into a SEXP that becomes the
>     vQ> return value from SFR.
> exactly.

but it should be made clear, by means of a comment, that ER is supposed
to be used in this way.  there is no hint at the interface level.

>     vQ> the original author of this hack seems to have had some concern about
>     vQ> exporting (from ER) the address of a static buffer, hence the returned
>     vQ> buffer is const.  in principle, this should prevent corruption of the
>     vQ> buffer's content in situations such as
>     vQ> // hypothetical
>     vQ> char *p1 = ER(...);
>     vQ> // p1 is some string returned from ER
>     vQ> char p2 = ER(...);
>     vQ> // p2 is some other string returned from ER
>     vQ> // some modifications performed on the string referred to by p1
>     vQ> p1[0] = 'x';
>     vQ> // p2[0] is 'x' -- possible data corruption
>     vQ> still worse in a scenario with concurrent calls to ER.
> (which will not happen in the near future)

unless you know a powerful and willing magician.

>     vQ> however, since the output from ER is const, this is no longer possible
>     vQ> -- at least, not without a deconstifying cast the petr style.  the
>     vQ> problem with petr's solution is not only that it modifies shared memory
>     vQ> purposefully qualified as const (by virtue of ER's return type), but
>     vQ> also that it effectively distributes the address for further use. 
>     vQ> unfortunately, like most of the r source code, ER is not appropriately
>     vQ> commented at the declaration and the definition, and without looking at
>     vQ> the code, one can hardly have any clue that ER always return the same
>     vQ> address of a static location.  while the original developer might be
>     vQ> careful enough not to misuse ER, in a large multideveloper project it's
>     vQ> hard expect that from others.  petr's function is precisely an example
>     vQ> of such misuse, and as it adds (again, without an appropriate comment) a
>     vQ> step of indirection; any use of petr's function other than what you have
>     vQ> in SFR (and can you guarantee no one will ever use DT for other
>     vQ> purposes?) is even more likely to end up in data corruption.
> you have a point here, and as a consequence, I'm proposing to
> put the following version of DT  into the source :
> ------------------------------------------------------------------------
> /* Note that we modify a  'const char*'  which is unsafe in general,
>  * but ok in the context of filtering an Encode*() value into mkChar(): */
> static const char* dropTrailing0(char *s, char cdec)
> {
>     char *p = s;
>     for (p = s; *p; p++) {
> 	if(*p == cdec) {
> 	    char *replace = p++;
> 	    while ('0' <= *p  &&  *p <= '9')
> 		if(*(p++) != '0')
> 		    replace = p;
> 	    while((*(replace++) = *(p++)))
> 		;
> 	    break;
> 	}
>     }
>     return s;
> }

the first line appears inessential;  to an informed programmer, taking a
string as char* (as opposed to const char*) means that it *may* be
modified within the call, irrespectively of whether it actually is, and
on what occasions, and one should not assume the string is not
destructively modified.

i think it is much more appropriate to comment (a) ER, with a warning to
the effect that it always returns the same address, hence the output
should be used immediately and never written to, (b) the use of ER in
SFR where it's output is cast to char* precisely for the purpose of
destructive modification, to the contrary of what (a) says.

i've attached a patch with an alternative comment.

> ------------------------------------------------------------------------
> so it has a comment along the lines you suggest, 


> *and* as static
> is not callable from outside coerce.c

indeed.  unfortunately, ER is callable from throughout the place.

>     vQ> one simple way to improve the code is as follows;  instead of (simplified)
>     vQ> const char* dropTrailing(const char* s, ...) {
>     vQ> const char *p = s;
>     vQ> char *replace;
>     vQ> ...
>     vQ> replace = (char*) p;
>     vQ> ...
>     vQ> return s; }
>     vQ> ...mkChar(dropTrailing(EncodeReal(...), ...) ...
>     vQ> you can have something like
>     vQ> const char* dropTrailing(char* s, ...) {
>     vQ> char *p = s, *replace;
>     vQ> ...
>     vQ> replace = p;
>     vQ> ...
>     vQ> return s; }
>     vQ> ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...
>     vQ> where it is clear, from DT's signature, that it may (as it purposefully
>     vQ> does, in fact) modify the content of s.  that is, you drop the
>     vQ> promise-not-to-modify contract in DT, and move the need for
>     vQ> deconstifying ER's return out of DT, making it more explicit.
>     vQ> however, this is still an ad hoc hack;  it still breaks the original
>     vQ> developer's assumption (if i'm correct) that the return from ER
>     vQ> (pointing to its internal buffer) should not be destructively modified
>     vQ> outside of ER.
> I think that's a misinterpretation of the history of ER. 

"if i'm correct"

> IIRC, the main issue when changing from 'char*' to 'const char*'
> in *many* places "simultaneously"
> was the introduction of hashes / cashes of strings (in
> mkChar(.)) in order to make R-level character handling (and
> storage of STRSXP/CHARSXP) much more efficient.

what does it have to do with ER returning a pointer to a static location?

>     vQ> (3) modify petr's solution along the lines above, i.e., have the input
>     vQ> in the signature non-const and deconst-cast the output from ER outside
>     vQ> of the call to DT.
> that's what I have adopted, as I'm sure you've noticed when you
> saw the code above.

surprisingly, sometimes i'm able to notice ;)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: coerce.diff
Type: text/x-diff
Size: 1106 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20090531/f692e569/attachment.bin>

More information about the R-devel mailing list