[Rd] (PR#8192) [ subscripting sometimes loses names

Tim Hesterberg TimHesterberg at gmail.com
Sun Feb 1 18:25:50 CET 2009


>...
>Simon, no, the drop=FALSE argument has nothing to do with what
>Christian was talking about.  The kind of thing he meant is PR# 8192,
>"Subject: [ subscripting sometimes loses names":
>
>  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>
>In R, subscripting with "[" USUALLY retains names, but R has various
>edge cases where it (IMNSHO) inappropriately discards them.  This
>occurs with both .Primitive("[") and "[.data.frame".  This has been
>known for years, but I have not yet tried digging into R's
>implementation to see where and how the names are actually getting
>lost.
>
>Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
>in 2001 show similar buggy edge case behavior.  Older versions of
>S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
>behavior.  I presume that the original Bell Labs S had correct
>name-preserving behavior, and then the S-Plus developers broke it
>sometime along the way.

(Later comments on the thread pointed out the difference between
x[,1] for matrices and data frames.)

I rewrote the S-PLUS data frame code around then, to fix
various inconsistencies and improve efficiency.
This was probably my change, and I would do it again.

Note that the components of a data frame do not have names
attached to them; the row names are a separate object.
Extracting a component vector or matrix from a data frame should not
attach names to the result, because of:
* memory (attaching row names to an object can more than double the
  size of the object),
* speed,
* some objects cannot take names, and attaching them could change
  the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless,
  artifacts of an early design decision that all data frames have row names.
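
A minimal sketch of the behavior described above, as seen in R. The data
and the row names "a", "b", "c" are made up for illustration; the size
comparison at the end only shows that names add storage, and the exact
amount depends on the names and the R version.

```r
# Hypothetical example data; the row names are illustrative only.
df <- data.frame(x = c(10, 20, 30), row.names = c("a", "b", "c"))
m  <- as.matrix(df)

col_df <- df[, "x"]   # column extracted from the data frame
col_m  <- m[, "x"]    # same column extracted from the matrix

names(col_df)         # NULL: row names are not attached to the component
names(col_m)          # "a" "b" "c": matrix row names become names

# Attaching names costs memory on top of the data itself.
n <- 1e5
plain <- numeric(n)
named <- plain
names(named) <- as.character(seq_len(n))
object.size(named) > object.size(plain)   # TRUE
```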

Data frames differ from matrices in two ways that matter here:
* columns in matrices are all the same kind, and are simple objects
  (numeric, etc.), whereas components of data frames can be nearly
  arbitrary objects, and
* row names get added to a data frame whether a user wants them or not,
  whereas row names on a matrix have to be specified.
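
The second difference can be seen directly in R. This is a small sketch
with made-up data, not taken from the bug report:

```r
m  <- matrix(1:6, nrow = 3)   # no dimnames supplied
df <- data.frame(x = 1:3)     # no row names supplied either

rownames(m)    # NULL: a matrix has row names only if they were specified
rownames(df)   # "1" "2" "3": a data frame always reports row names
```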

A historical note - unique row names on data frames were a design
decision made when people worked with small data frames, and are
convenient for small data frames.  But they are a problem for large
data frames.  I was writing for all users, not just those with small
data frames and meaningful names.

I like R's 'automatic' row names.  This is a big help working with
huge data frames (and I do this often, at Google).  But this doesn't
go far enough; subscripting and other operations sometimes convert the
automatic names to real names, and check/enforce uniqueness, which is
a big waste of time when working with large data frames.  I'll comment
more on this in a new thread.
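
One way to check whether a data frame still has automatic row names is
the internal helper .row_names_info(), which returns a negative count for
automatic row names. The sketch below uses made-up data and assumes a
reasonably recent version of R; what the last call returns for the subset
may depend on the R version, so no result is shown for it.

```r
df <- data.frame(x = 1:5)
.row_names_info(df)       # -5: a negative value means automatic row names

named <- data.frame(x = 1:3, row.names = c("a", "b", "c"))
.row_names_info(named)    # 3: a positive value means real row names

# Subscripting rows keeps the original row labels on the result.
sub <- df[c(2, 4), , drop = FALSE]
rownames(sub)             # "2" "4"
.row_names_info(sub)      # the sign shows whether these are still automatic
```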

Tim Hesterberg


