[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

Martin Maechler maechler at stat.math.ethz.ch
Sat May 30 19:32:52 CEST 2009

>>>>> "vQ" == Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>>     on Sat, 30 May 2009 11:16:43 +0200 writes:

    vQ> Martin Maechler wrote:
    >> Hi Waclav (and other interested parties),
    >> I have committed my working version of src/main/coerce.c
    >> so you can prepare your patch against that.

    vQ> some further investigation and reflections on the code in StringFromReal
    vQ> (henceforth SFR), src/main/coerce.c:315 (as in the
    vQ> patched version, now in r-devel).

    vQ> petr's elim_trailing (renamed to dropTrailing,
    vQ> henceforth referred to as DT) takes as input a const
    vQ> char*, and returns a const char*.

    vQ> const-ness of the return is not a problem; it is fed
    vQ> into mkChar, which (via mkCharLenCE) makes a local
    vQ> memcpy of the string, and there's no violation of the
    vQ> contract here.

    vQ> const-ness of the input is a consequence of the return
    vQ> type of EncodeReal (henceforth EC).  however, it is
    vQ> hardly ever, "in principle", a good idea to
    vQ> destructively modify const input (as DT does) if it
    vQ> comes from a function that explicitly provides it as
    vQ> const (as ER does).

yes, I think we've alway agreed on that "in principle".

    vQ> the first question is, why does ER return the string as const?  it
    vQ> appears that the returned pointer provides the address of a buffer used
    vQ> internally in ER, which is allocated *statically*.  that is, each call
    vQ> to ER operates on the same memory location, and each call to ER returns
    vQ> the address of that same location.  i suspect this is intended to be a
    vQ> smart optimization, to avoid heap- or stack-allocating a new buffer in
    vQ> each call to ER, and deallocating it after use.  however, this appraoch
    vQ> is problematic, in that any two calls to ER return the address of the
    vQ> same piece of memory, and this may easily lead to data corruption. 

Well, that would be ok if R could be used "threaded" / parallel / ...
and we all know that there are many other pieces of code {not
just R's "own", but also in Fortran/C algorithms ..} that are
not thread-safe.
"Yes, of course", R looks like a horrible piece of software to
some,  because of that

    vQ> under the assumption that the content of this piece of memory is copied
    vQ> before any destructive use, and that after the string is copied the
    vQ> address is not further distributed, the hack is relatively harmless. 
    vQ> this is what mkChar (via mkCharLenCE) does;  in SFR it copies the
    vQ> content of s with memcpy, and wraps it into a SEXP that becomes the
    vQ> return value from SFR.


    vQ> the original author of this hack seems to have had some concern about
    vQ> exporting (from ER) the address of a static buffer, hence the returned
    vQ> buffer is const.  in principle, this should prevent corruption of the
    vQ> buffer's content in situations such as

    vQ> // hypothetical
    vQ> char *p1 = ER(...);
    vQ> // p1 is some string returned from ER
    vQ> char p2 = ER(...);
    vQ> // p2 is some other string returned from ER

    vQ> // some modifications performed on the string referred to by p1
    vQ> p1[0] = 'x';
    vQ> // p2[0] is 'x' -- possible data corruption

    vQ> still worse in a scenario with concurrent calls to ER.
(which will not happen in the near future)

    vQ> however, since the output from ER is const, this is no longer possible
    vQ> -- at least, not without a deconstifying cast the petr style.  the
    vQ> problem with petr's solution is not only that it modifies shared memory
    vQ> purposefully qualified as const (by virtue of ER's return type), but
    vQ> also that it effectively distributes the address for further use. 

    vQ> unfortunately, like most of the r source code, ER is not appropriately
    vQ> commented at the declaration and the definition, and without looking at
    vQ> the code, one can hardly have any clue that ER always return the same
    vQ> address of a static location.  while the original developer might be
    vQ> careful enough not to misuse ER, in a large multideveloper project it's
    vQ> hard expect that from others.  petr's function is precisely an example
    vQ> of such misuse, and as it adds (again, without an appropriate comment) a
    vQ> step of indirection; any use of petr's function other than what you have
    vQ> in SFR (and can you guarantee no one will ever use DT for other
    vQ> purposes?) is even more likely to end up in data corruption.

you have a point here, and as a consequence, I'm proposing to
put the following version of DT  into the source :

/* Note that we modify a  'const char*'  which is unsafe in general,
 * but ok in the context of filtering an Encode*() value into mkChar(): */
static const char* dropTrailing0(char *s, char cdec)
    char *p = s;
    for (p = s; *p; p++) {
	if(*p == cdec) {
	    char *replace = p++;
	    while ('0' <= *p  &&  *p <= '9')
		if(*(p++) != '0')
		    replace = p;
	    while((*(replace++) = *(p++)))
    return s;


so it has a comment along the lines you suggest, *and* as static
is not callable from outside coerce.c

    vQ> one simple way to improve the code is as follows;  instead of (simplified)

    vQ> const char* dropTrailing(const char* s, ...) {
    vQ> const char *p = s;
    vQ> char *replace;
    vQ> ...
    vQ> replace = (char*) p;
    vQ> ...
    vQ> return s; }

    vQ> ...mkChar(dropTrailing(EncodeReal(...), ...) ...

    vQ> you can have something like

    vQ> const char* dropTrailing(char* s, ...) {
    vQ> char *p = s, *replace;
    vQ> ...
    vQ> replace = p;
    vQ> ...
    vQ> return s; }

    vQ> ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...
    vQ> where it is clear, from DT's signature, that it may (as it purposefully
    vQ> does, in fact) modify the content of s.  that is, you drop the
    vQ> promise-not-to-modify contract in DT, and move the need for
    vQ> deconstifying ER's return out of DT, making it more explicit.

    vQ> however, this is still an ad hoc hack;  it still breaks the original
    vQ> developer's assumption (if i'm correct) that the return from ER
    vQ> (pointing to its internal buffer) should not be destructively modified
    vQ> outside of ER.

I think that's a misinterpretation of the history of ER. 
IIRC, the main issue when changing from 'char*' to 'const char*'
in *many* places "simultaneously"
was the introduction of hashes / cashes of strings (in
mkChar(.)) in order to make R-level character handling (and
storage of STRSXP/CHARSXP) much more efficient.

    vQ> another issue is that even making the return from ER const does not
    vQ> protect against data corruption.  for example,

    vQ> const char *p1 = ER(...)
    vQ> // p1 is some string returned from ER
    vQ> const char *p2 = ER(...)
    vQ> // p2 is some other string returned from ER
    vQ> // but p1 == p2

    vQ> if p1 is used after the second call to ER, it's likely to lead to data
    vQ> corruption problems.  frankly, i'd consider the design of ER essentially
    vQ> flawed.  removing const from ER's return is obviously not an option;  it
    vQ> would certainly be safer to have it return a pointer to a heap-allocated
    vQ> buffer, but then the returned buffer would have to be freed at some
    vQ> point in the client code, and this would require a whole cascade of
    vQ> modifications to the r source code.  since it usually is more funny to
    vQ> introduce new bugs than repair old ones, i'd not expect r core to be
    vQ> ever willing to invest time in this.

    vQ> the following seem acceptable and relatively reasonable ways to address
    vQ> the issue:

    vQ> (1) noop:  leave as is.

    vQ> (2) minimal:  adopt the (inessential) changes suggested in my previous post:

    vQ> const char* dropTrailing(const char* s, ...) {
    vQ> char *p = (char*) s;
    vQ> // no further need for (char*) casting from p to replace
    vQ> ... }

    vQ> (3) modify petr's solution along the lines above, i.e., have the input
    vQ> in the signature non-const and deconst-cast the output from ER outside
    vQ> of the call to DT.

that's what I have adopted, as I'm sure you've noticed when you
saw the code above.

So, let's get to something more interesting,
"Happy Pentecoste!"

    vQ> (4) modify petr's solution to operate on a local, non-const copy of s:

    vQ> const char* dropTrailing(const char* s, ...) {
    vQ> int length = strlen(s);
    vQ> char *ss = malloc((length+1)*sizeof(char));
    vQ> memcpy(ss, s, length+1);
    vQ> // work on ss rather than on s
    vQ> ...
    vQ> return ss; }

    vQ> and don't forget to deallocate the return from DT after mkChar (or
    vQ> whoever calls DT) has no more need for it.  (an alternative is to use
    vQ> the allocCharsxp approach of mkCharLenCE, but that would be overdoing
    vQ> the job.)

    vQ> to sum up:  for a clean solution, i find (4) preferable.  for
    vQ> efficiency, (3) seems better, provided that you add a clear comment to
    vQ> the effect that the assumption of non-modifiable output from ER is
    vQ> purposefully violated.  (2) is acceptable provided that DT is
    vQ> appropriately documented (i.e., that it does not really treat s as const
    vQ> char*, but effectively as non-const char*), and with a note as above.

    vQ> a final, bitter remark:  i'm surprised that it's so easy to have an r
    vQ> core developer submit to r-devel a patch that violates an explicit
    vQ> demand on the immutability of a shared, statically allocated piece of
    vQ> memory. (even if it's the already existing code that actually is
    vQ> problematic.)

    vQ> no patch attached, you need to consider your options.  hope this helps
    vQ> anyway.

    vQ> best,
    vQ> vQ

More information about the R-devel mailing list