[R] Advanced Filtering problem

Fri Jun 20 01:49:12 CEST 2008

Hi Tyler,

> I've attached 100 rows of a data frame I am working with.
> I have one factor, id, with 27 levels.  There are two columns of reference
> data, x and y (UTM coordinates), one column "date" in POSIXct format, and
> one column "diff" in times format (chron package).
>
> What I am trying to do is as follows:
> For each day of the year (date, irrespective of time), select that row for
> each id which contains the smallest "diff" value, resulting in an output
> containing in general one value per id per day.

There's a basic strategy that makes solving this type of problem much
easier.  I call it split-apply-combine.  The basic idea is that if you
had a single day, the problem would be pretty easy:

df <- read.csv("http://www.nabble.com/file/p18018170/subdata.csv")

oneday <- subset(df, day == "01-01-05")
oneday[which.min(oneday$diff), ]

# Let's make that into a function to make it easier to apply to all days

mindiff <- function(df) df[which.min(df$diff), ]
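
It's worth knowing what which.min does in the awkward cases, since mindiff inherits that behaviour. The following is a small sketch on made-up data (the id and diff values are invented for illustration, and diff is assumed to be numeric):

```r
mindiff <- function(df) df[which.min(df$diff), ]

# Ties: which.min returns the index of the first minimum
tied <- data.frame(id = c("a", "b", "c"), diff = c(0.3, 0.1, 0.1))
first_of_tie <- mindiff(tied)      # the row with id "b"

# Missing values are skipped when looking for the minimum
with_na <- data.frame(id = c("a", "b", "c"), diff = c(NA, 0.4, 0.2))
skips_na <- mindiff(with_na)       # the row with id "c"

# If every value is missing there is nothing to pick,
# so the result is a data frame with zero rows
all_na <- data.frame(id = c("a", "b"), diff = c(NA_real_, NA_real_))
empty <- mindiff(all_na)
```

So a piece with only missing diff values contributes no row to the final result, rather than causing an error.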

# Now we split up the data frame so that we have a data frame for
# each day

pieces <- split(df, df$day)

# And use lapply to apply that function to each piece:

results <- lapply(pieces, mindiff)

# Then finally join all the pieces back together

df_done <- do.call("rbind", results)
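
If the CSV isn't to hand, the same steps can be tried on a small stand-in data frame. This is only a sketch: the values below are made up, and it assumes a day column holding the date as text and a numeric diff column.

```r
# Toy stand-in for the attached data (values are invented)
df <- data.frame(
  day  = c("01-01-05", "01-01-05", "01-02-05", "01-02-05", "01-02-05"),
  diff = c(0.30, 0.10, 0.25, 0.40, 0.05),
  stringsAsFactors = FALSE
)

mindiff <- function(df) df[which.min(df$diff), ]

pieces  <- split(df, df$day)          # split: one data frame per day
results <- lapply(pieces, mindiff)    # apply: one row per day
df_done <- do.call("rbind", results)  # combine: a single data frame again

print(df_done)
```

Here df_done has two rows, one per day, holding the diff values 0.10 and 0.05.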

So we split the data frame into individual days, picked the correct
row for each day, and then joined all the pieces back together.  This
isn't the most efficient solution, but I think it's easy to see how
each part works, and how you can apply it to new situations.  If you
aren't familiar with lapply or do.call, it's worth having a look at
their examples to get a feel for how they work (although for this case
you can of course just copy and paste them without caring how they
work).
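
The code above splits on day only, while the question asks for one row per id per day. One way to get that, sketched here on made-up data, is to split on both variables at once. The data values and the derived day column are assumptions for illustration; diff is taken to be numeric (a chron times object is numeric underneath, but text read from a CSV would need converting first).

```r
# Toy data: two ids observed over two days (values are invented)
df <- data.frame(
  id   = c("a", "a", "b", "b", "a", "b"),
  date = as.POSIXct(c("2005-01-01 06:00:00", "2005-01-01 18:00:00",
                      "2005-01-01 07:30:00", "2005-01-01 19:30:00",
                      "2005-01-02 06:15:00", "2005-01-02 20:00:00"),
                    tz = "UTC"),
  diff = c(0.20, 0.05, 0.40, 0.10, 0.30, 0.15),
  stringsAsFactors = FALSE
)

# Calendar day, irrespective of the time of day
df$day <- format(df$date, "%Y-%m-%d", tz = "UTC")

mindiff <- function(piece) piece[which.min(piece$diff), ]

# Split on every id/day combination; drop = TRUE leaves out
# combinations that never occur in the data
pieces  <- split(df, list(df$id, df$day), drop = TRUE)
results <- lapply(pieces, mindiff)
df_done <- do.call("rbind", results)
rownames(df_done) <- NULL

print(df_done[, c("id", "day", "diff")])
```

The time zone passed to format decides where one day ends and the next begins, so it should match the zone the dates were recorded in.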

Hadley

-- 
http://had.co.nz/