[R] Subsetting for the ten highest values by group in a dataframe

Sat Jan 28 14:55:18 CET 2012

On Fri, Jan 27, 2012 at 1:26 PM, Sam Albers <tonightsthenight at gmail.com> wrote:
> Hello,
>
> I am looking for a way to subset a data frame by choosing the ten
> highest values from that data frame, separately within each level
> of a factor.
>
> ## I've used plyr here but I'm not married to this approach
> require(plyr)
>
> ## I've created a data.frame with two groups and then an id variable (y)
> df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B"))
>
> ## So using ddply I can find the highest value of x
> df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])
>
> ## Or the 2nd highest value
> df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])
>
> ## And so on.... but when I try to make a series of numbers like so
> ## to get the top ten values, I don't get a warning message but
> ## two values that don't really make sense to me
> df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])

Well, sort(x, TRUE)[1:10] is a vector of ten values, and == compares it
with x element by element (recycling it along x), so you probably want

df.max <- ddply(df, c("z"), subset, x %in% sort(x, TRUE)[1:10])

but it would be better to do (e.g.)

df.max <- ddply(df, c("z"), subset, rank(-x) <= 10)

which will also make it possible to deal with ties in a principled way,
via the ties.method argument of rank().
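
To see the difference outside of ddply, here is a minimal base-R sketch on
a single made-up vector (the numbers and the names top3, keep.in, keep.min
and keep.first are illustrative only, not from the original data). It asks
for the top three instead of the top ten so the result is easy to check by
eye. Because rank() gives rank 1 to the smallest value, the sketch ranks -x
so that rank 1 is the largest.

```r
# Made-up values standing in for one group's x; two 7s tie for third place.
x <- c(5, 9, 7, 7, 3, 8)
top3 <- sort(x, decreasing = TRUE)[1:3]   # 9 8 7

# `==` pairs x with top3 position by position, recycling top3 to
# 9 8 7 9 8 7, so a match only happens when positions happen to line up.
# No warning is issued because length(x) is a multiple of length(top3).
x == top3                                 # FALSE FALSE TRUE FALSE FALSE FALSE

# `%in%` asks whether each element of x occurs anywhere in top3.
keep.in <- x[x %in% top3]                 # 9 7 7 8 (both tied 7s are kept)

# rank() on -x ranks from the top; ties.method decides what a tie means.
keep.min   <- x[rank(-x, ties.method = "min")   <= 3]  # 9 7 7 8
keep.first <- x[rank(-x, ties.method = "first") <= 3]  # 9 7 8
```

With the default ties.method = "average" both 7s would get rank 3.5 and
neither would pass the <= 3 cut-off, so it is worth choosing the method
deliberately rather than relying on the default.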

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/