[R] Faster Subsetting

Wed Sep 28 21:44:54 CEST 2016

"I'm surprised nobody suggested split()."

I did.

by() is a data-frame-oriented wrapper for tapply(), which in turn uses split().
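
As a rough sketch of that relationship (the per-group function f and the small data set here are made up for illustration and are not Harold's real processing step), the following runs the same summary three ways: split() followed by vapply(), by(), and tapply() on the single column. It assumes the per-group function returns a single number.

```r
# Small illustrative data set with the same structure as 'tmp' in the thread
set.seed(1)
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# Hypothetical per-group function standing in for the real processing
f <- function(d) mean(d$foo)

# Split the data frame once, then apply f to each piece
pieces <- split(tmp, tmp$id)
res_split <- vapply(pieces, f, numeric(1))

# by() does the grouping and the apply in one call
res_by <- by(tmp, tmp$id, f)

# tapply() works on a single vector rather than a whole data frame
res_tapply <- tapply(tmp$foo, tmp$id, mean)

# All three approaches should agree
same_by <- isTRUE(all.equal(unname(res_split), as.numeric(res_by)))
same_tapply <- isTRUE(all.equal(unname(res_split), as.numeric(res_tapply)))
```

In each case the grouping is computed once rather than once per id, which is the point of the split() suggestion.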

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Sep 28, 2016 at 11:53 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi,
>
> I'm surprised nobody suggested split(). Splitting the data.frame
> upfront is faster than repeatedly subsetting it:
>
>   tmp <- data.frame(id = rep(1:20000, each = 10), foo = rnorm(200000))
>   idList <- unique(tmp$id)
>
>   system.time(for (i in idList) tmp[which(tmp$id == i),])
>   #   user  system elapsed
>   # 16.286   0.000  16.305
>
>   system.time(split(tmp, tmp$id))
>   #   user  system elapsed
>   #  5.637   0.004   5.647
>
> Cheers,
> H.
>
> On 09/28/2016 09:09 AM, Doran, Harold wrote:
>>
>> I have an extremely large data frame (~13 million rows) that resembles the
>> structure of the object tmp below in the reproducible code. In my real data,
>> the variable 'id' may or may not be ordered, but I think that is
>> irrelevant.
>>
>> I have a process that requires subsetting the data by id and then running
>> each smaller data frame through a set of functions. Of the two examples
>> below, one uses indexing and the other an explicit call to subset(). Both
>> return the same result, but indexing is faster.
>>
>> The problem is that, in my real data, indexing must scan millions of rows
>> to evaluate the condition, which is expensive and a bottleneck in my code.
>> Can anyone recommend an approach that would be less expensive and
>> faster?
>>
>> Thank you
>> Harold
>>
>>
>> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>>
>> idList <- unique(tmp$id)
>>
>> ### Fast, but not fast enough
>> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>>
>> ### Not fast at all, a big bottleneck
>> system.time(replicate(500, subset(tmp, id == idList[1])))
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319