[R] Subsetting for the ten highest values by group in a dataframe

Gabor Grothendieck ggrothendieck at gmail.com
Sat Jan 28 17:26:52 CET 2012


On Fri, Jan 27, 2012 at 2:26 PM, Sam Albers <tonightsthenight at gmail.com> wrote:
> Hello,
>
> I am looking for a way to subset a data frame by choosing the top ten
> maximum values from that dataframe. As well this occurs within some
> factor levels.
>
> ## I've used plyr here but I'm not married to this approach
> require(plyr)
>
> ## I've created a data.frame with two groups and then a id variable (y)
> df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B"))
>
> ## So using ddply I can find the highest value of x
> df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])
>
> ## Or the 2nd highest value
> df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])
>
> ## And so on.... but when I try to make a series of numbers like so
> ## to get the top ten values, I don't get a warning message but
> ## two values that don't really make sense to me
> df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])
>
> ## So no error message when I use the method above, which is clearly wrong.
> ## But I really am not sure how to diagnose the problem.
>
> ## Can anyone suggest a way to subset a data.frame with groups to
> select the top ten max values in that data.frame for each group?
>
> ## Thanks so much in advance?
>

(1) Here is a pure R solution.  The result of ave is a vector with the
rank within group of each row so we just take the ones that are 10 or
less:

df.top <- df[ave(-df$x, df$z, FUN = rank) <= 10, ]

We may or may not wish to use the ties.method= argument to rank
depending on how ties are to be handled.  The question did not
indicate that the result needed to be sorted but if it does then
df.top can be sorted like this:

df.top[with(df.top, order(z, -x)), ]

(2) Here is an sqldf solution that uses the default SQLite database:

library(sqldf)
sqldf("select * from df i
   where rowid in
    (select rowid from df where z = i.z order by x desc limit 10)
   order by i.z, i.x desc")

This always gives 10 in each group even if there are ties.  We can
omit the last order by clause if the result need not be sorted.

(3) and here is an sqldf using the PostgreSQL database (which has
additional simplifying features):

library(sqldf)
library(RPostgreSQL)

sqldf("select * from
	(select *, row_number() over  (partition by z order by x desc) rank from df) s
	where rank <= 10
	order by z, rank")

Depending on how we wish to handle ties we may need to replace
row_number() with rank().  Also, we can omit the last order by clause
if the result need not be sorted.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list