[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

Erik Iverson iverson at biostat.wisc.edu
Wed Aug 13 17:30:23 CEST 2008


I still don't understand what you are doing.  Can you make a small 
example that shows what you have and what you want?

Is ?split what you are after?
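
A minimal sketch of what split() would give, assuming the toy data frame
from the original message quoted below. The names L and idx are only
illustrative, and this has not been timed on the full 2.7 million rows:

```r
# Toy data mirroring the quoted example (values are random, so they vary).
set.seed(1)
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# One pass over the data: a named list with one element per factor level.
L <- split(df$col1, df$names)

# Look up a level by name instead of calling which() once per level.
L[["A"]]

# Row indices per level, if the positions are needed rather than the values.
idx <- split(seq_len(nrow(df)), df$names)
idx[["B"]]
```

The point of the sketch is that the grouping is done once for all levels,
rather than scanning the whole column again for each of the ~6000 levels.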

Emmanuel Levy wrote:
> Dear Peter and Henrik,
> 
> Thanks for your replies - this helps speed things up a bit, but I thought
> there would be something much faster.
> 
> What I mean is that I thought that a particular value of a level
> could be accessed instantly, similarly to a "hash" key.
> 
> Since I've got about 6000 levels in that data frame, it means that
> making a list L of the form
> L[[1]] = values of name "1"
> L[[2]] = values of name "2"
> L[[3]] = values of name "3"
> ...
> would take ~1 hour.
> 
> Best,
> 
> Emmanuel
> 
> 
> 
> 
> 2008/8/12 Henrik Bengtsson <hb at stat.berkeley.edu>:
>> To simplify:
>>
>> n <- 2.7e6;
>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>
>> # Identify 'A':s
>> t1 <- system.time(res <- which(x == "A"));
>>
>> # To compare a factor to a string, the factor is in practice
>> # coerced to a character vector.
>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>
>> # Interestingly enough, this seems to be faster (repeated many times)
>> # Don't know why.
>> print(t2/t1);
>>    user   system  elapsed
>> 0.632653 1.600000 0.754717
>>
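>> # Try to avoid coercing the factor by comparing it against the
>> # integer code of the level instead
>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>
>> # ...but gives no speed up (== on a factor still goes through its
>> # character levels, so this compares "A" with "1" and selects nothing)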
>> print(t3/t1);
>>    user   system  elapsed
>> 1.041667 1.000000 1.018182
>>
>> # But coercing the factor to integers does
>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>> print(t4/t1);
>>     user    system   elapsed
>> 0.4166667 0.0000000 0.3636364
>>
>> So, the latter seems to be the fastest way to identify those elements.
>>
>> My $.02
>>
>> /Henrik
>>
>>
>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <cowan.pd at gmail.com> wrote:
>>> Emmanuel,
>>>
>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
>>>> Dear All,
>>>>
>>>> I have a large data frame (2700000 rows and 14 columns), and I would like to
>>>> extract the information in a particular way illustrated below:
>>>>
>>>>
>>>> Given a data frame "df":
>>>>
>>>>> col1=sample(c(0,1),10, rep=T)
>>>>> names = factor(c(rep("A",5),rep("B",5)))
>>>>> df = data.frame(names,col1)
>>>>> df
>>>>   names col1
>>>> 1      A    1
>>>> 2      A    0
>>>> 3      A    1
>>>> 4      A    0
>>>> 5      A    1
>>>> 6      B    0
>>>> 7      B    0
>>>> 8      B    1
>>>> 9      B    0
>>>> 10     B    0
>>>>
>>>> I would like to transform it into the form:
>>>>
>>>>> index = c("A","B")
>>>>> col1[[1]]=df$col1[which(df$name=="A")]
>>>>> col1[[2]]=df$col1[which(df$name=="B")]
>>> I'm not sure I fully understand your problem; your example would not run for me.
>>>
>>> You could get a small speedup by omitting which(), since you can subset
>>> by a logical vector directly.
>>>
>>>> n <- 2700000
>>>> foo <- data.frame(
>>> +       one = sample(c(0,1), n, rep = T),
>>> +       two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>> +       )
>>>> system.time(out <- which(foo$two=="A"))
>>>   user  system elapsed
>>>  0.566   0.146   0.761
>>>> system.time(out <- foo$two=="A")
>>>   user  system elapsed
>>>  0.429   0.075   0.588
>>>
>>> You might also find use for unstack(), though I didn't see a speedup.
>>>> system.time(out <- unstack(foo))
>>>   user  system elapsed
>>>  1.068   0.697   2.004
>>>
>>> HTH
>>>
>>> Peter
>>>
>>>> My problem is that the command:  *** which(df$name=="A") ***
>>>> takes about 1 second because df is so big.
>>>>
>>>> I was thinking that a "level" could maybe be accessed instantly but I am not
>>>> sure about how to do it.
>>>>
>>>> I would be very grateful for any advice that would allow me to speed this up.
>>>>
>>>> Best wishes,
>>>>
>>>> Emmanuel
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
