[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

Sat May 30 11:16:43 CEST 2009

Martin Maechler wrote:
> Hi Waclav (and other interested parties),
>
> I have committed my working version of src/main/coerce.c
> so you can prepare your patch against that.
>   

some further investigation and reflections on the code in StringFromReal
(henceforth SFR), src/main/coerce.c:315 (as in the patched version, now
in r-devel).

petr's elim_trailing (renamed to dropTrailing, henceforth referred to as
DT) takes as input a const char*, and returns a const char*.

const-ness of the return is not a problem;  it is fed into mkChar, which
(via mkCharLenCE) makes a local memcpy of the string, and there's no
violation of the contract here.

const-ness of the input is a consequence of the return type of
EncodeReal (henceforth EC).  however, it is hardly ever, "in principle",
a good idea to destructively modify const input (as DT does) if it comes
from a function that explicitly provides it as const (as ER does).

the first question is, why does ER return the string as const?  it
appears that the returned pointer provides the address of a buffer used
internally in ER, which is allocated *statically*.  that is, each call
to ER operates on the same memory location, and each call to ER returns
the address of that same location.  i suspect this is intended to be a
smart optimization, to avoid heap- or stack-allocating a new buffer in
each call to ER, and deallocating it after use.  however, this appraoch
is problematic, in that any two calls to ER return the address of the
same piece of memory, and this may easily lead to data corruption. 

under the assumption that the content of this piece of memory is copied
before any destructive use, and that after the string is copied the
address is not further distributed, the hack is relatively harmless. 
this is what mkChar (via mkCharLenCE) does;  in SFR it copies the
content of s with memcpy, and wraps it into a SEXP that becomes the
return value from SFR.

the original author of this hack seems to have had some concern about
exporting (from ER) the address of a static buffer, hence the returned
buffer is const.  in principle, this should prevent corruption of the
buffer's content in situations such as

    // hypothetical
    char *p1 = ER(...);
    // p1 is some string returned from ER
    char p2 = ER(...);
    // p2 is some other string returned from ER

    // some modifications performed on the string referred to by p1
    p1[0] = 'x';
    // p2[0] is 'x' -- possible data corruption

still worse in a scenario with concurrent calls to ER.

however, since the output from ER is const, this is no longer possible
-- at least, not without a deconstifying cast the petr style.  the
problem with petr's solution is not only that it modifies shared memory
purposefully qualified as const (by virtue of ER's return type), but
also that it effectively distributes the address for further use. 

unfortunately, like most of the r source code, ER is not appropriately
commented at the declaration and the definition, and without looking at
the code, one can hardly have any clue that ER always return the same
address of a static location.  while the original developer might be
careful enough not to misuse ER, in a large multideveloper project it's
hard expect that from others.  petr's function is precisely an example
of such misuse, and as it adds (again, without an appropriate comment) a
step of indirection; any use of petr's function other than what you have
in SFR (and can you guarantee no one will ever use DT for other
purposes?) is even more likely to end up in data corruption.

one simple way to improve the code is as follows;  instead of (simplified)

    const char* dropTrailing(const char* s, ...) {
       const char *p = s;
       char *replace;
       ...
       replace = (char*) p;
       ...
       return s; }

    ...mkChar(dropTrailing(EncodeReal(...), ...) ...

you can have something like

    const char* dropTrailing(char* s, ...) {
       char *p = s, *replace;
       ...
       replace = p;
       ...
       return s; }

    ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...

where it is clear, from DT's signature, that it may (as it purposefully
does, in fact) modify the content of s.  that is, you drop the
promise-not-to-modify contract in DT, and move the need for
deconstifying ER's return out of DT, making it more explicit.

however, this is still an ad hoc hack;  it still breaks the original
developer's assumption (if i'm correct) that the return from ER
(pointing to its internal buffer) should not be destructively modified
outside of ER.

another issue is that even making the return from ER const does not
protect against data corruption.  for example,

    const char *p1 = ER(...)
    // p1 is some string returned from ER
    const char *p2 = ER(...)
    // p2 is some other string returned from ER
    // but p1 == p2

if p1 is used after the second call to ER, it's likely to lead to data
corruption problems.  frankly, i'd consider the design of ER essentially
flawed.  removing const from ER's return is obviously not an option;  it
would certainly be safer to have it return a pointer to a heap-allocated
buffer, but then the returned buffer would have to be freed at some
point in the client code, and this would require a whole cascade of
modifications to the r source code.  since it usually is more funny to
introduce new bugs than repair old ones, i'd not expect r core to be
ever willing to invest time in this.

the following seem acceptable and relatively reasonable ways to address
the issue:

(1) noop:  leave as is.

(2) minimal:  adopt the (inessential) changes suggested in my previous post:

    const char* dropTrailing(const char* s, ...) {
       char *p = (char*) s;
       // no further need for (char*) casting from p to replace
       ... }

(3) modify petr's solution along the lines above, i.e., have the input
in the signature non-const and deconst-cast the output from ER outside
of the call to DT.

(4) modify petr's solution to operate on a local, non-const copy of s:

    const char* dropTrailing(const char* s, ...) {
       int length = strlen(s);
       char *ss = malloc((length+1)*sizeof(char));
       memcpy(ss, s, length+1);
       // work on ss rather than on s
       ...
       return ss; }

and don't forget to deallocate the return from DT after mkChar (or
whoever calls DT) has no more need for it.  (an alternative is to use
the allocCharsxp approach of mkCharLenCE, but that would be overdoing
the job.)

to sum up:  for a clean solution, i find (4) preferable.  for
efficiency, (3) seems better, provided that you add a clear comment to
the effect that the assumption of non-modifiable output from ER is
purposefully violated.  (2) is acceptable provided that DT is
appropriately documented (i.e., that it does not really treat s as const
char*, but effectively as non-const char*), and with a note as above.

a final, bitter remark:  i'm surprised that it's so easy to have an r
core developer submit to r-devel a patch that violates an explicit
demand on the immutability of a shared, statically allocated piece of
memory. (even if it's the already existing code that actually is
problematic.)

no patch attached, you need to consider your options.  hope this helps
anyway.

best,
vQ