[R] Data Frame Search Slow

jim holtman jholtman at gmail.com
Tue Nov 22 21:36:39 CET 2011


Take a look at the 'data.table' package.  Here are some timings for the
lookup using a data frame, a matrix and a data.table:  the keyed data.table
gives the answer in less than 0.1 seconds of user time (about 0.15 seconds
elapsed).

> str(x.df)
'data.frame':   2500000 obs. of  4 variables:
 $ x  : Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 200683 388992 241029 305994 209907 112469 105656 233058 247529 416273 ...
> system.time(a <- x.df[[1]] %in% "AAAA")
   user  system elapsed
   0.33    0.00    0.39
> x.m <- as.matrix(x.df)
> str(x.m)
 chr [1:2500000, 1:4] "LMDC" "WFXC" "NUBQ" "RMOK" "LZVR" "GLCE" "GAZE" "NIFT" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "x" "x.1" "x.2" "x.3"
> system.time(a <- x.m[,1] %in% "AAAA")
   user  system elapsed
   0.50    0.00    0.51
> require(data.table)
> x.df <- data.table(x.df)
> setkey(x.df, x)
> system.time(a <- x.df["AAAA"])
   user  system elapsed
   0.05    0.03    0.13
> str(a)
Classes ‘data.table’ and 'data.frame':  7 obs. of  4 variables:
 $ x  : Factor w/ 1 level "AAAA": 1 1 1 1 1 1 1
 $ x.1: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.2: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 $ x.3: Factor w/ 455063 levels "AAAA","AAAB",..: 1 1 1 1 1 1 1
 - attr(*, "sorted")= chr "x"
> system.time(x.df["ABCD"])
   user  system elapsed
   0.08    0.02    0.16
>
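
For anyone who wants to reproduce the comparison, here is a minimal sketch
on synthetic data.  The table size, the column names and the random
four-letter keys are assumptions made for illustration, not the original
poster's data, and the timings will differ from machine to machine.  Unlike
the transcript above, the data frame lookup here also extracts the matching
rows, which is what the original question asked for.

```r
# Sketch only: synthetic data, smaller than the 2.5 million rows shown above.
library(data.table)

set.seed(1)
n <- 200000L  # assumed size, chosen so the example runs quickly

# Hypothetical keys: random four-letter strings such as "LMDC"
x <- do.call(paste0, replicate(4, sample(LETTERS, n, replace = TRUE),
                               simplify = FALSE))
x[1:3] <- "AAAA"  # make sure the key we look up occurs at least 3 times

df <- data.frame(x = x, val = seq_len(n), stringsAsFactors = FALSE)

# Vector scan: every lookup compares all n elements of the column
t.df <- system.time(hits.df <- df[df$x %in% "AAAA", ])

# Keyed data.table: setkey() sorts by x once, later lookups use binary search
dt <- as.data.table(df)
setkey(dt, x)
t.dt <- system.time(hits.dt <- dt["AAAA"])

# A key that is absent: nomatch = 0L drops it instead of returning an NA row
none <- dt[.("ZZZZ9"), nomatch = 0L]

print(rbind(data.frame = t.df, data.table = t.dt)[, 1:3])
```

The one-time cost of setkey() is not included in the data.table timing, so
the keyed approach pays off when many lookups are made against the same
table.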

On Tue, Nov 22, 2011 at 2:01 PM, TimothyDalbey <tmdalbey at gmail.com> wrote:
> Hey All,
>
> So - I promise to write a blog post on this topic and post it somewhere on
> the internet once I get to the bottom of this.  Basically, the set-up to the
> problem is like this:
>
> 1.  I have a data frame with dim (2547290, 4)
> 2.  I need to make SQL-like lookups on the data frame.  I have been using the
> following sort of syntax:
>
> a.dataframe[a.dataframe[[column_index]] %in% some_value, ]
>
> 3.  This process takes quite a lot of time (~2 seconds) on m1.small
> instances (AWS)
>
> So, I hope I can get that look-up/search logic quite a lot faster.  I have
> heard that using matrices is the way to do it but I haven't found any
> resources on performing that sort of operation specifically that have
> yielded better results.
>
> Thoughts, feelings and advice are more than welcome.
>
> Best,
> TMD
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


