[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

Wed Aug 13 18:11:15 CEST 2008

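If you want the row indices rather than the values, split the row numbers instead (x is the data frame from my earlier example, quoted below):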

> system.time(y <- split(seq(nrow(x)), x$name))
   user  system elapsed
   0.81    0.06    0.88
> str(y[1:10])
List of 10
 $ 1 : int [1:454] 6924 17503 26880 39197 42881 50835 57896 62624 65767 75359 ...
 $ 2 : int [1:440] 9954 25619 25761 33776 56651 60372 61042 63134 64414 64491 ...
 $ 3 : int [1:444] 5413 6831 15780 21652 29423 37000 38661 60977 72267 74839 ...
 $ 4 : int [1:455] 23859 24748 27221 34886 40538 41326 45065 79769 81783 83951 ...
 $ 5 : int [1:430] 2572 3514 9934 24969 33844 35409 38122 38161 40113 45593 ...
 $ 6 : int [1:443] 7145 25184 26348 31182 39965 44191 49114 52791 69855 74272 ...
 $ 7 : int [1:424] 4596 11762 24949 30324 57906 59043 64833 70769 88878 90594 ...
 $ 8 : int [1:480] 14809 17604 18958 28436 31449 45339 51829 57725 65243 73260 ...
 $ 9 : int [1:431] 10748 14579 27153 27685 31930 32593 34605 35680 35828 50490 ...
 $ 10: int [1:448] 5292 13049 21132 22673 22983 28324 40099 43709 55505 70957 ...
>
>
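
For what it's worth, here is a small self-contained sketch of the same idea on toy data, in case it helps to see the lookup step as well. The data frame, its sizes, and the names idx, rows_A and vals_A are made up for illustration; only split(), which() and seq_len() are standard R.

```r
# Toy data standing in for the large data frame (hypothetical names and sizes)
set.seed(1)
df <- data.frame(
  name  = factor(sample(rep(c("A", "B", "C"), times = c(5, 7, 8)))),
  value = runif(20)
)

# One pass over the data: row numbers grouped by level of 'name'
idx <- split(seq_len(nrow(df)), df$name)

# Looking up a level is now a list access by name instead of a full scan
rows_A <- idx[["A"]]

# These are the same rows that which() would return for that level
identical(rows_A, which(df$name == "A"))

# The values for that level, if they are needed as well
vals_A <- df$value[rows_A]
```

The list is named by the factor levels, so idx[["A"]] (or idx$A) picks out one group after the split has been paid for once, which is the point when there are thousands of levels.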

On Wed, Aug 13, 2008 at 9:09 AM, jim holtman <jholtman at gmail.com> wrote:
> split is probably what you are after.  Here is an example:
>
> > n <- 2700000
> > x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
> > # split it into 6000 lists
> > system.time(y <- split(x$value, x$name))
>    user  system elapsed
>    0.80    0.20    1.07
> > str(y[1:10])
> List of 10
>  $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
>  $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
>  $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
>  $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
>  $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
>  $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
>  $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
>  $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
>  $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
>  $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...
> >
>  Takes less than 1 second to split into 6000 lists.
>
> On Wed, Aug 13, 2008 at 9:03 AM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
>> Wow great! Split was exactly what was needed. It takes about 1 second
>> for the whole operation :D
>>
>> Thanks again - I can't believe I never used this function in the past.
>>
>> All the best,
>>
>> Emmanuel
>>
>>
>> 2008/8/13 Erik Iverson <iverson at biostat.wisc.edu>:
>>> I still don't understand what you are doing.  Can you make a small example
>>> that shows what you have and what you want?
>>>
>>> Is ?split what you are after?
>>>
>>> Emmanuel Levy wrote:
>>>>
>>>> Dear Peter and Henrik,
>>>>
>>>> Thanks for your replies - this helps speed up a bit, but I thought
>>>> there would be something much faster.
>>>>
>>>> What I mean is that I thought that a particular value of a level
>>>> could be accessed instantly, similarly to a "hash" key.
>>>>
>>>> Since I've got about 6000 levels in that data frame, it means that
>>>> making a list L of the form
>>>> L[[1]] = values of name "1"
>>>> L[[2]] = values of name "2"
>>>> L[[3]] = values of name "3"
>>>> ...
>>>> would take ~1hour.
>>>>
>>>> Best,
>>>>
>>>> Emmanuel
>>>>
>>>>
>>>>
>>>>
>>>> 2008/8/12 Henrik Bengtsson <hb at stat.berkeley.edu>:
>>>>>
>>>>> To simplify:
>>>>>
>>>>> n <- 2.7e6;
>>>>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>>>>
>>>>> # Identify 'A':s
>>>>> t1 <- system.time(res <- which(x == "A"));
>>>>>
>>>>> # To compare a factor to a string, the factor is in practice
>>>>> # coerced to a character vector.
>>>>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>>>>
>>>>> # Interestingly enough, this seems to be faster (repeated many times)
>>>>> # Don't know why.
>>>>> print(t2/t1);
>>>>>   user   system  elapsed
>>>>> 0.632653 1.600000 0.754717
>>>>>
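>>>>> # Try to avoid coercing the factor by comparing it against the
>>>>> # level's integer code instead
>>>>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>>>>
>>>>> # ...but this gives no speed up: `==` on a factor still compares the
>>>>> # level labels, so here it tests "A" == "1" and matches nothing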
>>>>> print(t3/t1);
>>>>>   user   system  elapsed
>>>>> 1.041667 1.000000 1.018182
>>>>>
>>>>> # But coercing the factor to integers does
>>>>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>>>>> print(t4/t1);
>>>>>    user    system   elapsed
>>>>> 0.4166667 0.0000000 0.3636364
>>>>>
>>>>> So, the latter seems to be the fastest way to identify those elements.
>>>>>
>>>>> My $.02
>>>>>
>>>>> /Henrik
>>>>>
>>>>>
>>>>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <cowan.pd at gmail.com> wrote:
>>>>>>
>>>>>> Emmanuel,
>>>>>>
>>>>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I have a large data frame (2700000 lines and 14 columns), and I would
>>>>>>> like to extract the information in a particular way illustrated below:
>>>>>>>
>>>>>>>
>>>>>>> Given a data frame "df":
>>>>>>>
>>>>>>> > col1=sample(c(0,1),10, rep=T)
>>>>>>> > names = factor(c(rep("A",5),rep("B",5)))
>>>>>>> > df = data.frame(names,col1)
>>>>>>> > df
>>>>>>>
>>>>>>>  names col1
>>>>>>> 1      A    1
>>>>>>> 2      A    0
>>>>>>> 3      A    1
>>>>>>> 4      A    0
>>>>>>> 5      A    1
>>>>>>> 6      B    0
>>>>>>> 7      B    0
>>>>>>> 8      B    1
>>>>>>> 9      B    0
>>>>>>> 10     B    0
>>>>>>>
>>>>>>> I would like to transform it into the form:
>>>>>>>
>>>>>>> > index = c("A","B")
>>>>>>> > col1[[1]]=df$col1[which(df$name=="A")]
>>>>>>> > col1[[2]]=df$col1[which(df$name=="B")]
>>>>>>
>>>>>> I'm not sure I fully understand your problem; your example would not run
>>>>>> for me.
>>>>>>
>>>>>> You could get a small speedup by omitting which(): you can subset
>>>>>> directly with the logical vector.
>>>>>>
>>>>>> > n <- 2700000
>>>>>> > foo <- data.frame(
>>>>>> +       one = sample(c(0,1), n, rep = T),
>>>>>> +       two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>>>>> +       )
>>>>>> > system.time(out <- which(foo$two=="A"))
>>>>>>    user  system elapsed
>>>>>>   0.566   0.146   0.761
>>>>>> > system.time(out <- foo$two=="A")
>>>>>>    user  system elapsed
>>>>>>   0.429   0.075   0.588
>>>>>>
>>>>>> You might also find use for unstack(), though I didn't see a speedup.
>>>>>>
>>>>>> > system.time(out <- unstack(foo))
>>>>>>    user  system elapsed
>>>>>>   1.068   0.697   2.004
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>> My problem is that the command:  *** which(df$name=="A") ***
>>>>>>> takes about 1 second because df is so big.
>>>>>>>
>>>>>>> I was thinking that a "level" could maybe be accessed instantly but I
>>>>>>> am not sure about how to do it.
>>>>>>>
>>>>>>> I would be very grateful for any advice that would allow me to speed
>>>>>>> this up.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>>
>>>>>>> Emmanuel
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?