[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

Uwe Ligges ligges at statistik.tu-dortmund.de
Sat May 30 15:24:46 CEST 2009



Wacek Kusnierczyk wrote:
> Martin Maechler wrote:
>> Hi Waclav (and other interested parties),
>>
>> I have committed my working version of src/main/coerce.c
>> so you can prepare your patch against that.
>>   
> 
> some further investigation and reflections on the code in StringFromReal
> (henceforth SFR), src/main/coerce.c:315 (as in the patched version, now
> in r-devel).
> 
> petr's elim_trailing (renamed to dropTrailing, henceforth referred to as
> DT) takes as input a const char*, and returns a const char*.
> 
> const-ness of the return is not a problem;  it is fed into mkChar, which
> (via mkCharLenCE) makes a local memcpy of the string, and there's no
> violation of the contract here.
> 
> const-ness of the input is a consequence of the return type of
> EncodeReal (henceforth EC).  however, it is hardly ever, "in principle",
> a good idea to destructively modify const input (as DT does) if it comes
> from a function that explicitly provides it as const (as ER does).
> 
> the first question is, why does ER return the string as const?  it
> appears that the returned pointer provides the address of a buffer used
> internally in ER, which is allocated *statically*.  that is, each call
> to ER operates on the same memory location, and each call to ER returns
> the address of that same location.  i suspect this is intended to be a
> smart optimization, to avoid heap- or stack-allocating a new buffer in
> each call to ER, and deallocating it after use.  however, this appraoch
> is problematic, in that any two calls to ER return the address of the
> same piece of memory, and this may easily lead to data corruption. 
> 
> under the assumption that the content of this piece of memory is copied
> before any destructive use, and that after the string is copied the
> address is not further distributed, the hack is relatively harmless. 
> this is what mkChar (via mkCharLenCE) does;  in SFR it copies the
> content of s with memcpy, and wraps it into a SEXP that becomes the
> return value from SFR.
> 
> the original author of this hack seems to have had some concern about
> exporting (from ER) the address of a static buffer, hence the returned
> buffer is const.  in principle, this should prevent corruption of the
> buffer's content in situations such as
> 
>     // hypothetical
>     char *p1 = ER(...);
>     // p1 is some string returned from ER
>     char p2 = ER(...);
>     // p2 is some other string returned from ER
> 
>     // some modifications performed on the string referred to by p1
>     p1[0] = 'x';
>     // p2[0] is 'x' -- possible data corruption
> 
> still worse in a scenario with concurrent calls to ER.
>   
> however, since the output from ER is const, this is no longer possible
> -- at least, not without a deconstifying cast the petr style.  the
> problem with petr's solution is not only that it modifies shared memory
> purposefully qualified as const (by virtue of ER's return type), but
> also that it effectively distributes the address for further use. 
> 
> unfortunately, like most of the r source code, ER is not appropriately
> commented at the declaration and the definition, and without looking at
> the code, one can hardly have any clue that ER always return the same
> address of a static location.  while the original developer might be
> careful enough not to misuse ER, in a large multideveloper project it's
> hard expect that from others.  petr's function is precisely an example
> of such misuse, and as it adds (again, without an appropriate comment) a
> step of indirection; any use of petr's function other than what you have
> in SFR (and can you guarantee no one will ever use DT for other
> purposes?) is even more likely to end up in data corruption.
> 
> one simple way to improve the code is as follows;  instead of (simplified)
> 
>     const char* dropTrailing(const char* s, ...) {
>        const char *p = s;
>        char *replace;
>        ...
>        replace = (char*) p;
>        ...
>        return s; }
> 
>     ...mkChar(dropTrailing(EncodeReal(...), ...) ...
> 
> you can have something like
> 
>     const char* dropTrailing(char* s, ...) {
>        char *p = s, *replace;
>        ...
>        replace = p;
>        ...
>        return s; }
> 
>     ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...
>   
> where it is clear, from DT's signature, that it may (as it purposefully
> does, in fact) modify the content of s.  that is, you drop the
> promise-not-to-modify contract in DT, and move the need for
> deconstifying ER's return out of DT, making it more explicit.
> 
> however, this is still an ad hoc hack;  it still breaks the original
> developer's assumption (if i'm correct) that the return from ER
> (pointing to its internal buffer) should not be destructively modified
> outside of ER.
> 
> another issue is that even making the return from ER const does not
> protect against data corruption.  for example,
> 
>     const char *p1 = ER(...)
>     // p1 is some string returned from ER
>     const char *p2 = ER(...)
>     // p2 is some other string returned from ER
>     // but p1 == p2
> 
> if p1 is used after the second call to ER, it's likely to lead to data
> corruption problems.  frankly, i'd consider the design of ER essentially
> flawed.  removing const from ER's return is obviously not an option;  it
> would certainly be safer to have it return a pointer to a heap-allocated
> buffer, but then the returned buffer would have to be freed at some
> point in the client code, and this would require a whole cascade of
> modifications to the r source code.  since it usually is more funny to
> introduce new bugs than repair old ones, i'd not expect r core to be
> ever willing to invest time in this.
> 
> the following seem acceptable and relatively reasonable ways to address
> the issue:
> 
> (1) noop:  leave as is.
> 
> (2) minimal:  adopt the (inessential) changes suggested in my previous post:
> 
>     const char* dropTrailing(const char* s, ...) {
>        char *p = (char*) s;
>        // no further need for (char*) casting from p to replace
>        ... }
> 
> (3) modify petr's solution along the lines above, i.e., have the input
> in the signature non-const and deconst-cast the output from ER outside
> of the call to DT.
> 
> (4) modify petr's solution to operate on a local, non-const copy of s:
> 
>     const char* dropTrailing(const char* s, ...) {
>        int length = strlen(s);
>        char *ss = malloc((length+1)*sizeof(char));
>        memcpy(ss, s, length+1);
>        // work on ss rather than on s
>        ...
>        return ss; }
> 
> and don't forget to deallocate the return from DT after mkChar (or
> whoever calls DT) has no more need for it.  (an alternative is to use
> the allocCharsxp approach of mkCharLenCE, but that would be overdoing
> the job.)
> 
> 
> to sum up:  for a clean solution, i find (4) preferable.  for
> efficiency, (3) seems better, provided that you add a clear comment to
> the effect that the assumption of non-modifiable output from ER is
> purposefully violated.  (2) is acceptable provided that DT is
> appropriately documented (i.e., that it does not really treat s as const
> char*, but effectively as non-const char*), and with a note as above.
> 
> a final, bitter remark:  i'm surprised that it's so easy to have an r
> core developer submit to r-devel a patch that violates an explicit
> demand on the immutability of a shared, statically allocated piece of
> memory. (even if it's the already existing code that actually is
> problematic.)


Well, I am not surprised at all: everybody is working hard (even on 
saturdays and sundays) and R is a thing that is also done, beside the 
real job as a recognized statistician (not a job as computer scientist, 
by the way). And I think we can be very happy with this status, because 
if the language was written by computer scientists only, it wouldn't 
have its best strengths where we need it: in statistics.

Please, we do not need the perfectly designed language (which does not 
exist anyway), but what we need is a language that is very easy to use 
in order to cope with all the actual problems in the world of statistics 
and beyond.

It is nice to see people submitting bug reports and even better to 
submit patches. Thank you very much for that! But I think it is not fair 
to blame the inventors and developers of the language again and again. I 
mean the language that has become so famous and indispensable among 
statisticians for some very good reasons.

Best,
Uwe Ligges



> no patch attached, you need to consider your options.  hope this helps
> anyway.
> 
> best,
> vQ
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list