[R] Select rows based on matching conditions and logical operators

William Dunlap wdunlap at tibco.com
Thu Jul 26 00:43:31 CEST 2012


And another way to drop the unneeded interaction levels is to supply
drop=TRUE to ave():
   > z <- ave(LETTERS[1:3], 1:3, 1:3, FUN=function(x)print(x), drop=TRUE)
   [1] "A"
   [1] "B"
   [1] "C"
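
The same idea can be sketched without relying on ave() forwarding drop=
to interaction(): build the grouping factor with interaction(..., drop=TRUE)
first and hand ave() that single factor.  This is only a minimal sketch
using the df1 data from earlier in this thread; the names g and keep are
illustrative and not from any of the earlier messages.

```r
# Example data posted earlier in the thread
df1 <- read.table(text = "
PGID PTID Year Visit Count
6755 53121 2009 1 0
6755 53121 2009 2 0
6755 53121 2009 3 0
6755 53122 2008 1 0
6755 53122 2008 2 0
6755 53122 2008 3 1
6755 53122 2009 1 0
6755 53122 2009 2 1
6755 53122 2009 3 2", header = TRUE)

# One grouping factor holding only the PTID/Year combinations that occur
# (illustrative name; drop = TRUE discards the unused combinations)
g <- interaction(df1$PTID, df1$Year, drop = TRUE)

# Within each group, flag the first row that attains the maximum Count
keep <- as.logical(ave(df1$Count, g,
                       FUN = function(x) seq_along(x) == which.max(x)))

# A single subscripting call selects the flagged rows
df1[keep, ]
```

As with f2, the selected rows stay in their original order and keep their
original row labels.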

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of William Dunlap
> Sent: Wednesday, July 25, 2012 3:37 PM
> To: Bert Gunter; Rui Barradas
> Cc: r-help
> Subject: Re: [R] Select rows based on matching conditions and logical operators
> 
> Any of those would work.  I wish ave() did that part of the job.
> I don't think there is any reason it shouldn't.  The following should
> only need to call FUN three times, not the nine it actually makes:
>    > z <- ave(LETTERS[1:3], 1:3, 1:3, FUN=function(x)print(x))
>    [1] "A"
>    character(0)
>    character(0)
>    character(0)
>    [1] "B"
>    character(0)
>    character(0)
>    character(0)
>    [1] "C"
>    > z
>    [1] "A" "B" "C"
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> 
> > -----Original Message-----
> > From: Bert Gunter [mailto:gunter.berton at gene.com]
> > Sent: Wednesday, July 25, 2012 3:04 PM
> > To: Rui Barradas
> > Cc: William Dunlap; r-help
> > Subject: Re: [R] Select rows based on matching conditions and logical operators
> >
> > Wouldn't
> >
> > > interaction(..., drop=TRUE)
> >
> > be the same, but terser in this situation?
> >
> > Also I tend to use paste() for this, i.e. instead of
> >
> > > interaction(v1,v2, drop=TRUE)
> >
> > simply
> >
> > > paste(v1,v2)
> >
> > Again, this seems shorter and simpler -- but are there good reasons to
> > prefer the use of interaction()?
> >
> > Cheers,
> > Bert
> >
> > On Wed, Jul 25, 2012 at 2:51 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > Hello,
> > >
> > > You're right, thanks.
> > > In my solution, I had tried to keep to the op as much as possible. A glance
> > > at it made me realize that one change only would do the job, and that was
> > > it, no performance worries.
> > > I particularly liked the interaction/droplevels trick.
> > >
> > > Rui Barradas
> > >
> > > On 25-07-2012 22:13, William Dunlap wrote:
> > >>
> > >> Rui,
> > >>    Your solution works, but it can be faster for large data.frames if you
> > >> compute the indices of the desired rows of the input data.frame and then
> > >> use one subscripting call to select the rows, instead of splitting the
> > >> input data.frame into a list of data.frames, extracting the desired row
> > >> from each component, and then calling rbind to put the rows together
> > >> again.  E.g., compare your approach, which I've put into the function f1
> > >>    f1 <- function (dataFrame)  {
> > >>        retval <- with(dataFrame, sapply(split(dataFrame, list(PTID,
> > >>            Year)), function(x) if (nrow(x))
> > >>            x[which.max(x$Count), ]))
> > >>        retval <- do.call(rbind, retval)
> > >>        rownames(retval) <- 1:nrow(retval)
> > >>        retval
> > >>    }
> > >> with one that computes a logical subscripting vector (by splitting just
> > >> the Count vector, not the whole data.frame)
> > >>    f2 <- function (dataFrame)  {
> > >>        keep <- as.logical(ave(dataFrame$Count,
> > >>            droplevels(interaction(dataFrame$PTID, dataFrame$Year)),
> > >>            FUN = function(x) if (length(x)) seq_along(x) == which.max(x)))
> > >>        dataFrame[keep, ]
> > >>    }
> > >>
> > >> They both compute the same thing, aside from the fact that the rows
> > >> are in a different order (f2 keeps the order of the original data.frame)
> > >> and f2 leaves the original row label with the row.
> > >>    > f1(df1)
> > >>      PGID  PTID Year Visit Count
> > >>    1 6755 53122 2008     3     1
> > >>    2 6755 53121 2009     1     0
> > >>    3 6755 53122 2009     3     2
> > >>    > f2(df1)
> > >>      PGID  PTID Year Visit Count
> > >>    1 6755 53121 2009     1     0
> > >>    6 6755 53122 2008     3     1
> > >>    9 6755 53122 2009     3     2
> > >> When there are a lot of output rows, f2 can be quite a bit faster.
> > >>
> > >> (I put the call to droplevels(interaction(...)) into the call to ave
> > >> because ave can waste a lot of time calling FUN for nonexistent
> > >> interaction levels.)
> > >>
> > >> Bill Dunlap
> > >> Spotfire, TIBCO Software
> > >> wdunlap tibco.com
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> > >>> On
> > >>> Behalf Of Rui Barradas
> > >>> Sent: Wednesday, July 25, 2012 10:24 AM
> > >>> To: kborgmann
> > >>> Cc: r-help
> > >>> Subject: Re: [R] Select rows based on matching conditions and logical
> > >>> operators
> > >>>
> > >>> Hello,
> > >>>
> > >>> Apart from the output order this does it.
> > >>> (I have changed 'df' to 'df1' because 'df' is an R function, the
> > >>> density of the F distribution.)
> > >>>
> > >>>
> > >>> df1 <- read.table(text="
> > >>> PGID PTID Year Visit  Count
> > >>> 6755 53121 2009 1 0
> > >>> 6755 53121 2009 2 0
> > >>> 6755 53121 2009 3 0
> > >>> 6755 53122 2008 1 0
> > >>> 6755 53122 2008 2 0
> > >>> 6755 53122 2008 3 1
> > >>> 6755 53122 2009 1 0
> > >>> 6755 53122 2009 2 1
> > >>> 6755 53122 2009 3 2", header=TRUE)
> > >>>
> > >>>
> > >>> df2 <- with(df1, sapply(split(df1, list(PTID, Year)),
> > >>>       function(x) if (nrow(x)) x[which.max(x$Count), ]))
> > >>> df2 <- do.call(rbind, df2)
> > >>> rownames(df2) <- 1:nrow(df2)
> > >>> df2
> > >>>
> > >>> Use which.max(), not which().
> > >>>
> > >>> Hope this helps,
> > >>>
> > >>> Rui Barradas
> > >>> On 25-07-2012 18:10, kborgmann wrote:
> > >>>>
> > >>>> Hi,
> > >>>> I have a dataset in which I would like to select rows based on matching
> > >>>> conditions and return the maximum value of a variable, else return one
> > >>>> row if duplicate counts exist.  My dataset looks like this:
> > >>>> PGID    PTID    Year     Visit  Count
> > >>>> 6755    53121   2009    1       0
> > >>>> 6755    53121   2009    2       0
> > >>>> 6755    53121   2009    3       0
> > >>>> 6755    53122   2008    1       0
> > >>>> 6755    53122   2008    2       0
> > >>>> 6755    53122   2008    3       1
> > >>>> 6755    53122   2009    1       0
> > >>>> 6755    53122   2009    2       1
> > >>>> 6755    53122   2009    3       2
> > >>>>
> > >>>> I would like to select rows if PTID and Year match and return the
> > >>>> maximum count, else return one row if counts are the same, such that
> > >>>> I get this output:
> > >>>> PGID    PTID    Year     Visit  Count
> > >>>> 6755    53121   2009    1       0
> > >>>> 6755    53122   2008    3       1
> > >>>> 6755    53122   2009    3       2
> > >>>>
> > >>>> I tried the following code and the output is almost correct, but
> > >>>> duplicate values were included:
> > >>>> df2<-with(df, sapply(split(df, list(PTID, Year)),
> > >>>> function(x) if (nrow(x)) x[which(x$Count==max(x$Count)),]))
> > >>>> df2<-do.call(rbind,df2)
> > >>>> rownames(df2)<-1:nrow(df2)
> > >>>>
> > >>>> Any suggestions?
> > >>>> Thanks much for your responses!
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> View this message in context:
> > >>>> http://r.789695.n4.nabble.com/Select-rows-based-on-matching-conditions-and-logical-operators-tp4637809.html
> > >>>> Sent from the R help mailing list archive at Nabble.com.
> > >>>>
> > >>>> ______________________________________________
> > >>>> R-help at r-project.org mailing list
> > >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>>> PLEASE do read the posting guide
> > >>>> http://www.R-project.org/posting-guide.html
> > >>>> and provide commented, minimal, self-contained, reproducible code.
> > >>>
> > >
> > >
> >
> >
> >
> > --
> >
> > Bert Gunter
> > Genentech Nonclinical Biostatistics
> >
> > Internal Contact Info:
> > Phone: 467-7374
> > Website:
> > http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-
> > biostatistics/pdb-ncb-home.htm


