[R] Thoughts for faster indexing

MacQueen, Don macqueen1 at llnl.gov
Thu Nov 21 16:42:06 CET 2013


I have some processes that do the same thing: iterate over subsets of a
data frame.
My data frame has ~250,000 rows and 30 variables, and there are about
6000 subsets.

A which() call like yours seems quite fast.

For example, wrapping unix.time() around the which() expression, I get

   user  system elapsed
  0.008   0.000   0.008

It's hard for me to imagine the single task of getting the indexes is slow
enough to be a bottleneck.
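
A minimal sketch of how a timing like this could be reproduced. The data
here are synthetic and only mimic the sizes mentioned above; the names d,
id, x and person are assumptions for illustration. system.time() is used
because unix.time() was an alias for it. Timing a batch of lookups as well
as a single one helps judge whether the total cost matters in a loop.

```r
# Synthetic stand-in for the data described above (sizes are assumptions).
set.seed(1)
n_rows <- 250000
n_ids  <- 6000
d <- data.frame(id = sample.int(n_ids, n_rows, replace = TRUE),
                x  = rnorm(n_rows))

person <- d$id[1]                      # an id known to be present

# Time a single lookup, as in the message above.
t_one <- system.time(j <- which(d$id == person))
print(t_one)

# Time a batch of lookups; only the first 200 ids are used to keep this
# quick. Scale the result up to estimate a full pass over all ids.
ids   <- unique(d$id)
t_200 <- system.time(for (p in head(ids, 200)) which(d$id == p))
print(t_200)
```

Actual timings depend on the machine and the R version, so none are shown
here.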



On the other hand, if the variable being used to identify subsets is a
factor with many levels (~6000 in my case), it is noticeably slower.

   user  system elapsed
  0.024   0.002   0.026
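
A rough sketch for comparing the two cases on synthetic data (all names
and sizes below are assumptions, not my real data). The third variant,
comparing against the factor's integer codes, is only an idea to try; I
have not measured it.

```r
set.seed(1)
n_rows <- 250000
n_ids  <- 6000
id_int <- sample.int(n_ids, n_rows, replace = TRUE)
id_fac <- factor(id_int)               # the same ids stored as a factor

person <- id_int[1]

# Lookup on the plain integer vector.
t_int <- system.time(j_int <- which(id_int == person))

# Lookup on the factor, comparing against the level label.
t_fac <- system.time(j_fac <- which(id_fac == as.character(person)))

# Lookup on the factor's integer codes (untested idea, see above).
code   <- match(as.character(person), levels(id_fac))
t_code <- system.time(j_code <- which(as.integer(id_fac) == code))

# Elapsed seconds for each variant.
print(sapply(list(integer = t_int, factor = t_fac, codes = t_code),
             function(t) t[["elapsed"]]))

# All three must select the same rows.
stopifnot(identical(j_int, j_fac), identical(j_int, j_code))
```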


I haven't tested it, and have no real expectation that it will make a
difference, but perhaps sorting by the index variable before iterating
will help (if you haven't already). These are not true indexes in the
sense used by relational database systems, so the physical row order
might still matter.
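
A sketch of what trying this could look like, again on synthetic data with
assumed names (d, id, x, person). The last part is a different technique
from the sorting idea, and not something tested here: it builds the row
indexes for every id in one pass with split(), so the loop can look them
up by name instead of calling which() each time.

```r
set.seed(1)
d <- data.frame(id = sample.int(6000, 250000, replace = TRUE),
                x  = rnorm(250000))

# Sort once by the id variable before iterating.
d_sorted <- d[order(d$id), ]
rownames(d_sorted) <- NULL

person <- d$id[1]
t_unsorted <- system.time(j_u <- which(d$id == person))
t_sorted   <- system.time(j_s <- which(d_sorted$id == person))
print(t_unsorted)
print(t_sorted)

# In the sorted data the matching rows form one contiguous block.
stopifnot(all(diff(j_s) == 1L))

# Different technique (an assumption, not from this message): compute
# the row indexes of every id once, then reuse them inside the loop.
rows_by_id <- split(seq_len(nrow(d)), d$id)
j_split    <- rows_by_id[[as.character(person)]]
stopifnot(identical(j_split, j_u))
```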


-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 11/20/13 12:16 PM, "Noah Silverman" <noahsilverman at g.ucla.edu> wrote:

>Hello,
>
>I have a fairly large data.frame (about 150,000 rows of 100
>variables). There are case IDs, and multiple entries for each ID, with a
>date stamp (i.e., records of people's activity).
>
>
>I need to iterate over each person (record ID) in the data set, and then
>process their data for each date.  The processing part is fast, the date
>part is fast.  Locating the records is slow.  I've even tried using
>data.table, with ID set as the index, and it is still slow.
>
>The line with the slow process (according to Rprof) is:
>
>
>j <- which( d$id == person )
>
>(I then process all the records indexed by j, which seems fast enough.)
>
>where d is my data.frame or data.table
>
>I thought that using the data.table indexing would speed things up, but
>not in this case.
>
>Any ideas on how to speed this up?
>
>
>Thanks!
>
>-- 
>Noah Silverman, M.S., C.Phil
>UCLA Department of Statistics
>8117 Math Sciences Building
>Los Angeles, CA 90095
>


