[R] subsetting a data.frame to the 'unique' of a column

Berton Gunter gunter.berton at gene.com
Thu Dec 23 18:47:41 CET 2004


Spencer's solution is considerably less efficient than using duplicated()
and subscripting: in a small example with 3 columns and 10000 rows, it took
5 times as long on my Windows setup.

The reason is that aggregate() is essentially a wrapper for tapply(), and
tapply() loops at the R level. duplicated() loops in C (and uses hashing, I
believe).
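
For anyone who wants to try a comparison of this kind, here is a minimal
sketch. The data frame, its size, the seed, and the object names are
assumptions made up for illustration, not the example timed above, and the
ratio you see will depend on your R version and machine.

```r
# Hypothetical test data: 3 columns, 10000 rows, key column 'a' with repeats.
set.seed(1)
n  <- 10000
DF <- data.frame(a = sample(1000, n, replace = TRUE),
                 b = seq_len(n),
                 c = rnorm(n))

# Approach 1: keep the first row for each distinct value of 'a'.
t_dup <- system.time(res_dup <- DF[!duplicated(DF$a), ])

# Approach 2: aggregate(), taking the first element of each group.
t_agg <- system.time(res_agg <- aggregate(DF[2:3], DF[1], function(x) x[1]))

# Both keep one row per distinct 'a'; aggregate() returns the groups sorted
# by 'a', so sort the duplicated() result the same way before comparing.
res_dup <- res_dup[order(res_dup$a), ]
stopifnot(nrow(res_dup) == length(unique(DF$a)),
          all(res_dup$a == res_agg$a),
          all(res_dup$b == res_agg$b))

# Elapsed seconds for each approach.
print(c(duplicated = t_dup[["elapsed"]], aggregate = t_agg[["elapsed"]]))
```

Note that duplicated() flags every occurrence after the first, so negating it
keeps the first row seen for each value, which matches what function(x) x[1]
picks out within each group.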

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Spencer Graves
> Sent: Thursday, December 23, 2004 9:06 AM
> To: Göran Broström
> Cc: Rudi Alberts; r-help at stat.math.ethz.ch
> Subject: Re: [R] subsetting a data.frame to the 'unique' of a column
> 
>       What about "aggregate"? 
> 
>  DF <- data.frame(a=c(1,1,2), b=1:3, c=letters[1:3])
>  aggregate(DF[2:3], DF[1], function(x)x[1])
>   a b c
> 1 1 1 1
> 2 2 3 3
> 
>       hope this helps.  spencer graves
> 
> Göran Broström wrote:
> 
> >On Thu, Dec 23, 2004 at 11:28:31AM -0800, Rudi Alberts wrote:
> >  
> >
> >>Hi,
> >>
> >>I often run into this problem:
> >>I have a data.frame with one column containing entries that are not
> >>unique. What I then want is a subset of the data.frame in which
> >>the entries in that column have become the 'unique' of the original
> >>column. 
> >>Normally I program around it by taking the unique of the column and
> >>making a new data.frame with it and filling the rest of the data.
> >>
> >>(By the way, when moving to the smaller data.frame for example 5 rows
> >>with the same value in that column will be replaced by one row for that
> >>value. I don't mind which of the rows now..)
> >>
> >>
> >>something like this, however, this gives me the complete df.
> >>
> >>df[df$colname %in% unique(df$colname),]
> >>
> >>or this, which doesn't work
> >>
> >>df[df$colname == unique(df$colname),]
> >>
> >>    
> >>
> > Use 'duplicated':
> >
> >  df[!duplicated(df$colname), ]
> >
> 
> -- 
> Spencer Graves, PhD, Senior Development Engineer
> O:  (408)938-4420;  mobile:  (408)655-4567
> 



