[Rd] RE: [R] Removing "row.names"

David James <dj@research.bell-labs.com>
Wed, 7 Feb 2001 13:50:26 -0500 (EST)


> Date: Wed, 7 Feb 2001 09:33:12 -0800 (PST)
> From: Thomas Lumley <tlumley@u.washington.edu>
> To: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> cc: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk>, R-devel@r-project.org
> Subject: Re: [Rd] RE: [R] Removing "row.names"
> 
> On Wed, 7 Feb 2001, Kurt Hornik wrote:
> 
> > >>>>> Thomas Lumley writes:
> > 
> > > On Wed, 7 Feb 2001, Kurt Hornik wrote:
> > >> >>>>> Peter Dalgaard BSA writes:
> > >> 
> > >> > Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
> > >> >> names(sampled) <- " "
> > >> >> and
> > >> >> dimnames(sampled)[[2]] <- " "
> > >> >> 
> > >> >> happily introduce non-unique variable names in the data frame.
> > >> >> 
> > >> >> Is the rule that row.names and names must be unique still on?
> > >> >> 
> > >> >> Argh ...
> > >> 
> > >> > Splus 3.4 dispatches on dimnames<-, but not on names<- with the
> > >> > following curious result:
> > >> 
> > >> >> d <- data.frame(a=1:3,b=4:6)
> > >> >> names(d)<-c(" "," ")
> > >> >> d
> > >> 
> > >> > 1 1 4
> > >> > 2 2 5
> > >> > 3 3 6
> > >> >> dimnames(d)[[1]] <- rep(" ",3)  
> > >> > Error in "dimnames<-.data.frame"(d, .A0): column names must be unique
> > >> > Dumped
> > >> 
> > >> > R dispatches similarly, but doesn't check the dimnames in
> > >> > dimnames<-.data.frame. It could do so quite easily. Just add 
> > >> 
> > >> > || any(duplicated(d[[1]])) || any(duplicated(d[[2]]))
> > >> 
> > >> > at the appropriate spot.
> > >> 
> > >> Thomas' view about what should be permitted seems to be different.
> > 
> > > I wouldn't object to making it hard to create duplicated names(), but
> > > I think it would be a bad idea to have data.frame() make up unique
> > > names if it's given non-unique ones.
> > 
> > Maybe `check.names' could also be used for uniqueness testing?
> > 
> > In any case, I think we should specify what *exactly* a data frame is.
> >
> 
> I think we should specify, and check.names is a logical way to
> allow/forbid non-unique columns.  
> 
> Having a new class would be messy: logically it shouldn't inherit from
> data.frame, data.frame should inherit from it, but that would be a real
> pain to set up.
> 

Data frames were originally meant to be used in modeling functions.
The opening paragraph in Chapter 3 (Data for Models) in the White Book
says:
 
  "This chapter describes the general structure for data that
  will be used throughout the book.  In particular, it introduces the
  data frame, a class of objects to represent the data typically encountered
  in fitting models."

However, data.frames may not be quite appropriate for representing
other types of tabular data (certainly a data.frame does not capture
the essence of, say, a "relational" table in the SQL sense, which doesn't
have the concept of row names).  Several manifestations of this problem are
coercing character data to factors "at the drop of a hat" (as someone wrote
here or in s-news), the row.names issue now being discussed, problems
including general objects in the "cells" of the data.frame, etc.
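
The manifestations listed above can be reproduced in a short session.
This is only a sketch, assuming a recent R release; how R behaved in 2001
is exactly what this thread was debating and may have differed.  The
object names (d, d2, d3, f, g) are illustrative only.

```r
# Column names: the assignment form does not enforce uniqueness.
d <- data.frame(a = 1:3, b = 4:6)
names(d) <- c("x", "x")
dup_after_assign <- anyDuplicated(names(d)) > 0

# data.frame() itself: check.names decides whether duplicates are repaired.
d2 <- data.frame(x = 1:3, x = 4:6)                       # check.names = TRUE by default
d3 <- data.frame(x = 1:3, x = 4:6, check.names = FALSE)  # duplicates kept

# Row names: duplicates are rejected with an error.
row_names_rejected <- inherits(
  tryCatch(suppressWarnings(`row.names<-`(d2, rep("r", 3))),
           error = function(e) e),
  "error"
)

# Character columns: coercion to factor is controlled by stringsAsFactors.
f <- data.frame(s = c("a", "b"), stringsAsFactors = TRUE)
g <- data.frame(s = c("a", "b"), stringsAsFactors = FALSE)
```

Inspecting names(d2) and names(d3) shows the effect of check.names, and
is.factor(f$s) versus is.character(g$s) shows the factor coercion.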

I think that the concept of a data.frame to represent data for fitting
models is fine, but we may have abused this concept (I certainly have).
We need other classes of tabular data objects in addition to (not as a
replacement for) data.frames, together with coercion methods and perhaps
other utilities.
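
One way to picture such an additional class, with coercion in both
directions, is the sketch below.  Everything in it is hypothetical: the
class name "plain_table" and the functions plain_table() and
as_plain_table() are invented for illustration and are not part of R or
of any proposal in this thread.

```r
# Hypothetical constructor: columns are stored as given, with no row names
# and no conversion of character vectors to factors.
plain_table <- function(...) {
  cols <- list(...)
  n <- unique(lengths(cols))
  if (length(n) > 1L) stop("all columns must have the same length")
  structure(cols, class = "plain_table")
}

# Coercion to a data frame, e.g. before passing the data to a modelling
# function.  check.names = TRUE asks data.frame() to make the column names
# syntactically valid and unique.
as.data.frame.plain_table <- function(x, ...) {
  data.frame(unclass(x), check.names = TRUE, stringsAsFactors = FALSE)
}

# Coercion in the other direction; as.list() on a data frame drops the
# row names.
as_plain_table <- function(df) {
  structure(as.list(df), class = "plain_table")
}

pt <- plain_table(id = 1:3, label = c("a", "b", "c"))
df <- as.data.frame(pt)
back <- as_plain_table(df)
```

Keeping the new class separate from data.frame, rather than inheriting
from it, leaves the existing modelling code untouched.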


David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636
