[R] Subset by Factor by date

Sat Jun 14 14:09:10 CEST 2008

On Jun 14, 2008, at 2:59 AM, T.D.Rudolph wrote:

>
> I can't speak to the intricacies of the formula but when I run the
> ByDataFrame() function provided on a subsample of my data (n=50) it  
> returned
> only the very first id value in the output; the rest came out as  
> <NA>....
> This is not to say it has not properly selected the rows with min(x 
> $diff),
> but I have no way of verifying without the id membership in the  
> output.

And equally we can't help you with that without a reproducible  
example. Doesn't it do the right thing in the little sample I posted?  
It moves the id and day columns to the end. Without that, the only  
thing I can think of that might cause trouble is that you have a  
matrix instead of a data.frame, or otherwise the columns have some  
class I have not anticipated. Perhaps you can send me a part of your  
data off-list, if you can't post it here?

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College

> Charilaos Skiadas-3 wrote:
>>
>>
>> On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:
>>
>>>
>>> aggregate() is indeed a useful function in this case, but it only
>>> returns the
>>> columns by which it was grouped.  Is there a way I can use this  
>>> while
>>> simultaneously retaining all the other column values in the  
>>> dataframe?
>>>
>>> e.g. add superfluous (yet pertinent for later) column containing any
>>> information at all and retain it in the final output
>>
>> I had exactly this kind of need many times, and I have finally
>> created a function for it, which I hope to include soon in an
>> upcoming package. Here is a run of it (I added an extra "A" column
>> containing just the numbers 1:8):
>>
>>> DF
>>    id      day diff A
>> 1  1 01-01-09  0.5 1
>> 2  1 01-01-09  0.7 2
>> 3  2 01-01-09  0.2 3
>> 4  2 01-01-09  0.4 4
>> 5  1 01-02-09  0.1 5
>> 6  1 01-02-09  0.3 6
>> 7  2 01-02-09  0.3 7
>> 8  2 01-02-09  0.4 8
>>> byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
>>    diff A id      day
>> 1  0.5 1  1 01-01-09
>> 2  0.2 3  2 01-01-09
>> 3  0.1 5  1 01-02-09
>> 4  0.3 7  2 01-02-09
>>
>> Would that do what you want?
>>
>> I've appended the function byDataFrame, and its prerequisite, a
>> function parseIndexList. I'm not quite set on the names yet, but
>> anyway. Hope this helps. I haven't really tested it on large sets, it
>> might perform poorly. Any suggestions on speeding the code /
>> corrections are welcome.
>>
>> Haris Skiadas
>> Department of Mathematics and Computer Science
>> Hanover College
>>
>>
>>
>> parseIndexList <- function(indexList) {
>>    # browser()
>>    if (!is.list(indexList))
>>      indexList <- as.list(indexList)
>>    nI <- length(indexList)
>>    namelist <- vector("list", nI)
>>    names(namelist) <- names(indexList)
>>    extent <- integer(nI)
>>    nx <- length(indexList[[1]])
>>    one <- as.integer(1)
>>    group <- rep.int(one, nx)
>>    ngroup <- one
>>    for (i in seq.int(indexList)) {
>>        index <- as.factor(indexList[[i]])
>>        if (length(index) != nx)
>>            stop("arguments must have same length")
>>        namelist[[i]] <- sort(unique(indexList[[i]]))
>>        extent[i] <- length(namelist[[i]])
>>        group <- group + ngroup * (as.integer(index) - one)
>>        ngroup <- ngroup * nlevels(index)
>>    }
>>    nms <- do.call(expand.grid, namelist)
>>    ind <- unique(sort(group))
>>    res <- data.frame(index=ind, nms[ind, , drop=FALSE])
>>    return(list(cases=group, groups=res))
>> }
>>
>> byDataFrame <- function (data, INDEX, FUN, newnames,
>> omit.index.cols=TRUE, ...) {
>> # # Part of the code shamelessly stolen from tapply
>>    IND <- eval(substitute(INDEX), data)
>>    nms <- as.character(as.list(substitute(INDEX)))
>>    if (!is.list(IND)) {
>>      IND <- list(IND)
>>      names(IND) <- nms
>>    } else {
>>      names(IND) <- nms[-1]
>>    }
>>    funname <- paste(as.character(substitute(FUN)), collapse=".")
>>    indexInfo <- parseIndexList(IND)
>>    FUNx <- if (omit.index.cols) {
>>      omit.cols <- match(names(indexInfo$groups)[-1], names(data))
>>      function(x, ...) FUN(data[x, -omit.cols], ...)
>>    } else {
>>      function(x, ...) FUN(data[x, ], ...)
>>    }
>>    ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
>>    index <- as.numeric(names(ans))
>>    if (!is.data.frame(ans[[1]])) {
>>      ans <- lapply(ans, function(x) {
>>        dframe <- as.data.frame(t(x))
>>        if (is.null(names(x)))
>>          names(dframe) <- funname
>>        dframe
>>      })
>>    }
>>    lengths <- sapply(ans, nrow)
>>    ans <- do.call(rbind, ans)
>>    if (!missing(newnames))
>>      names(ans) <- newnames
>>    nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
>>    res <- cbind(ans, nms)
>>    res
>> }