[R] selecting rows with more than x occurrences in a given column (data type is names)

Marc Schwartz marc_schwartz at comcast.net
Tue Mar 13 16:38:31 CET 2007


On Tue, 2007-03-13 at 10:32 -0500, Marc Schwartz wrote:
> On Tue, 2007-03-13 at 10:38 -0400, Mike Jasper wrote:
> > Despite a long search on the archives, I couldn't find how to do this.
> > Thanks in advance for what is likely a simple issue.
> > 
> > I have a data set where the first column is name (e.g., 'Joe Smith',
> > 'Jane Doe', etc). The following columns are data associated with that
> > person. I have many people with multiple rows. What I want is to get a
> > new data frame out with only the people who have more than x
> > occurrences in the first column.
> > 
> > Here's what I've done, that's not working:
> > 
> > Let's call my old data.frame "all.data"
> > 
> > table(all.data$names)>10
> > 
> > I get a list of names and TRUE/FALSE values. I then want to make a
> > list of the TRUEs and pass that to some subset type command like
> > 
> > dup.names=table(all.data$names)>10
> > 
> > new.data=(all.data[all.data$names==dup.names,])
> > 
> > That's not working because the dimensions are wrong (I think). But
> > even when I tried to do part of it manually (to troubleshoot) like
> > this
> > 
> > dup.names=c('Joe Smith','Jane Doe','etc')
> > 
> > I got warnings and it didn't work correctly. There must be a simple
> > way to do this that I'm just not seeing. Thanks.
> 
> 
> Something like this should work:
> 
>   NewDF <- subset(all.data, names %in% unique(names[duplicated(names)]))
> 
> See ?duplicated, ?unique and ?"%in%" for more information.
> 
> HTH,
> 
> Marc Schwartz

Ack...sorry about that. I misread the query as asking for names with any
duplicated occurrence, rather than names with more than x occurrences.
The solution provided by Dimitris is correct.
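
For the archives, here is a minimal sketch of a count-based selection.
It is not necessarily what Dimitris posted. The toy data, the threshold
x, and the names counts and keep.names are made up for illustration;
only all.data and its names column come from the original question.

```r
# Toy data standing in for the poster's all.data (hypothetical values)
all.data <- data.frame(
  names = c("Joe Smith", "Jane Doe", "Joe Smith",
            "Joe Smith", "Jane Doe", "Al Jones"),
  score = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

x <- 2  # threshold: keep names that occur in more than x rows

# Count rows per name, then keep the names whose count exceeds x
counts <- table(all.data$names)
keep.names <- names(counts)[counts > x]

# %in% tests each row's name for membership in keep.names;
# == would instead recycle keep.names along the column
new.data <- all.data[all.data$names %in% keep.names, ]

# For comparison, the duplicated() approach keeps every name that
# occurs more than once, regardless of x
dup.data <- subset(all.data, names %in% unique(names[duplicated(names)]))
```

With this toy data, new.data holds only the 'Joe Smith' rows, while
dup.data also includes 'Jane Doe', who appears twice. Using == in place
of %in% compares the two vectors position by position, which is the
likely source of the warnings mentioned in the original post.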

Marc


