[Rd] Quiz: How to get a "named column" from a data frame

Mon Aug 20 00:00:54 CEST 2012

On Sat, Aug 18, 2012 at 02:13:20PM -0400, Christian Brechbühler wrote:
> On 8/18/12, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> > On Sat, Aug 18, 2012 at 5:14 PM, Christian Brechbühler .... wrote:
> >> On Sat, Aug 18, 2012 at 11:03 AM, Martin Maechler
> >> <maechler at stat.math.ethz.ch> wrote:
> 
> >>> Consider this toy example, where the dataframe already has only
> >>> one column :
> >>>
> >>> > nv <- c(a=1, d=17, e=101); nv
> >>>   a   d   e
> >>>   1  17 101
> >>>
> >>> > df <- as.data.frame(cbind(VAR = nv)); df
> >>>   VAR
> >>> a   1
> >>> d  17
> >>> e 101
> >>>
> >>> Now how, can I get 'nv' back from 'df' ?   I.e., how to get

> >>> identical(nv, df[,1])
> >> [1] TRUE
> 
> > But it is not a solution in a current version of R!
> > though it's still interesting that   df[,1]  worked in some incantation of
> > R.
> 
> My mistake!  We disliked some quirks of indexing, so we've long had
> our own patch for "[.data.frame" in place, which I used inadvertently.

As I understand it, when doing 'df[,1]' on a data frame, Bell
Labs S and all versions of S-Plus prior to 3.4 always retained the
data frame's row names as the names on the result vector.  'df[,1]'
gave you a named vector identical to your 'nv' above.  Then in 1996
with S-Plus 3.4, Insightful broke that behavior, after which 'df[,1]'
returned a vector without any names.  I believe R copied that
late-1990s S-Plus behavior, but I don't know why exactly.
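
For concreteness, here is a minimal sketch of that current behavior,
reusing the toy 'nv' and 'df' from the quoted example above.  The
variable 'x' is just an illustrative name:

```r
# Rebuild the toy objects from the quoted example.
nv <- c(a = 1, d = 17, e = 101)
df <- as.data.frame(cbind(VAR = nv))

# Extract the single column the "obvious" way.
x <- df[, 1]

names(x)           # NULL: the row names are not carried over
identical(nv, x)   # FALSE: same values, but 'x' has no names
```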

When subscripting objects, R sometimes retains the object's dimnames
as names in the result, and sometimes not, which I find frustrating.
Personally, I think it would make much more sense if subscripting
ALWAYS retained any names it could, and worked as similarly as
possible across data frames, matrices, arrays, vectors, etc.  After
all, explicitly dropping names afterwards is trivial, while adding
them back on is not.
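
A small sketch of the kind of inconsistency I mean, again using the
toy data from the quoted example.  The matrix 'm' is an extra object
introduced here only for comparison, and this is an illustration, not
an exhaustive list of cases:

```r
nv <- c(a = 1, d = 17, e = 101)
m  <- cbind(VAR = nv)        # 3 x 1 matrix with row names a, d, e
df <- as.data.frame(m)       # same data as a data frame

names(nv[2:3])   # "d" "e"      : vector subscripting keeps names
names(m[, 1])    # "a" "d" "e"  : matrix column keeps the row names
names(df[, 1])   # NULL         : data frame column drops them

# Dropping names when they are unwanted is a single call.
names(unname(m[, 1]))   # NULL
```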

Back on 2005-10-19 with R 2.2.0, I gave a simple test of 15 cases; 4
of them dropped names during subscripting, the other 11 preserved them.
That's towards the end of the discussion here:

  https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=8192

Contrary to the initial tone of my old 2005 "bug" report, current R
subscripting behavior is of course NOT a bug, as AFAIK it's working as
the R Core Team intended.  However, I definitely consider the current
behavior a design infelicity.

Just now on stock R 2.15.1 (with --vanilla), I ran an updated version
of those same simple tests.  Of 22 subscripting test cases, 7 lose
names and 15 preserve them.  (If anyone's interested in the specific
tests, I can send them, or try to append them to that old 8192 feature
request.)

For what it's worth, at work, for years we ran various versions of
pre-namespace R using some ugly patches of "[" and "[.data.frame" to
force name retention during subscripting.  Since we were not using
namespaces at all, those "keep names" subscripting hacks were
affecting ALL R code we ran, not just our own custom code which needed
and expected the names to be retained.  Yet perhaps surprisingly, I
don't think I ever ran into a single case where the forced retention
of names broke any code.  We of course ran only a tiny sample of the
huge amount of code on CRAN, but that experience suggests that most R
code which expects un-named objects doesn't mind at all if names are
present.
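
For code that only needs named columns in a few places, a less
invasive route than patching "[.data.frame" globally is a small
wrapper.  The following is a minimal sketch under that assumption; the
function name 'named_col' is hypothetical, and this is NOT the patch
described above:

```r
# Hypothetical helper: extract one column of a data frame and attach
# the data frame's row names as the names of the result.
named_col <- function(df, j) {
  stopifnot(is.data.frame(df), length(j) == 1L)
  setNames(df[[j]], rownames(df))
}

nv <- c(a = 1, d = 17, e = 101)
df <- as.data.frame(cbind(VAR = nv))

identical(nv, named_col(df, "VAR"))   # TRUE
```

Note that for a data frame with automatic row names, this sketch
would attach "1", "2", ... as names, which may or may not be wanted.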

If anyone would genuinely like to add an option for name-preserving
subscripting to R, I'm willing to work on it, so please do let me know
your thoughts.  So far though, I've never dug into the guts of the
.Primitive("[") and "[.data.frame" functions to see how/why they
sometimes keep and sometimes discard names during subscripting.

-- 
Andrew Piskorski <atp at piskorski.com>