[R] data frame subset too slow

jim holtman jholtman at gmail.com
Thu Dec 30 17:13:40 CET 2010


You should be using dat[[1]].  Here is an example with 80000 rows that
takes about 0.02 seconds to get the subset.

Please provide the output of str() on your data so we can see what it looks like.

> n <- 80000  # rows to create
> dat <- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n), runif(n))
> lst <- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n), runif(n))
> str(dat)
'data.frame':   80000 obs. of  5 variables:
 $ sample.1.200..n..TRUE.: int  39 116 69 163 51 125 144 32 28 4 ...
 $ runif.n.              : num  0.519 0.793 0.549 0.77 0.272 ...
 $ runif.n..1            : num  0.691 0.89 0.783 0.467 0.357 ...
 $ runif.n..2            : num  0.705 0.254 0.584 0.998 0.279 ...
 $ runif.n..3            : num  0.873 1 0.678 0.702 0.455 ...
> str(lst)
'data.frame':   80000 obs. of  5 variables:
 $ sample.1.100..n..TRUE.: int  38 83 38 70 77 44 81 55 32 1 ...
 $ runif.n.              : num  0.0621 0.7374 0.074 0.4281 0.0516 ...
 $ runif.n..1            : num  0.879 0.294 0.146 0.884 0.58 ...
 $ runif.n..2            : num  0.648 0.745 0.825 0.507 0.799 ...
 $ runif.n..3            : num  0.2523 0.1679 0.9728 0.0478 0.0967 ...
> system.time({
+ dat.sub <- dat[dat[[1]] %in% lst[[1]],]
+ })
   user  system elapsed
   0.02    0.00    0.01
> str(dat.sub)
'data.frame':   39803 obs. of  5 variables:
 $ sample.1.200..n..TRUE.: int  39 69 51 32 28 4 69 3 48 69 ...
 $ runif.n.              : num  0.5188 0.5494 0.2718 0.5566 0.0893 ...
 $ runif.n..1            : num  0.691 0.783 0.357 0.619 0.717 ...
 $ runif.n..2            : num  0.705 0.584 0.279 0.789 0.192 ...
 $ runif.n..3            : num  0.873 0.678 0.455 0.843 0.383 ...
>
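A minimal sketch (not from the original exchange; the object names are illustrative) of why the single-bracket form misbehaves: dat[1] is a one-column data.frame, so %in% falls back to slow list matching and returns a single TRUE/FALSE instead of one logical per row, while dat[[1]] extracts the column as a plain vector and %in% does a fast hash-based match.

```r
set.seed(1)
n   <- 80000
dat <- data.frame(key = sample(1:200, n, TRUE), x = runif(n))
# 'lst' rather than 'list', so we don't shadow base::list
lst <- data.frame(key = sample(1:100, n, TRUE), y = runif(n))

# Wrong: dat[1] is a one-column data.frame (a list), so %in%
# compares whole columns as list elements -- length-1 result,
# and the subset is silently wrong.
slow.idx <- dat[1] %in% lst[1]

# Right: dat[[1]] is the underlying vector, so %in% matches
# value-by-value -- one logical per row, and it is fast.
fast.idx <- dat[[1]] %in% lst[[1]]

length(slow.idx)   # 1, not nrow(dat)
length(fast.idx)   # 80000
dat.sub <- dat[fast.idx, ]
```

The same fix applies to named columns: dat$key or dat[["key"]] instead of dat["key"].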

On Thu, Dec 30, 2010 at 10:23 AM, Duke <duke.lists at gmx.com> wrote:
> Hi all,
>
> First, I don't have much experience with R, so be gentle. OK, I am dealing
> with a dataset (~ tens of thousands of lines, each line ~ 10 columns of
> data). I have to create some subsets of this data based on certain
> conditions (for example, the same first column as another dataset, etc.).
> Here is how I did it:
>
> # import data
> dat <- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
> list <- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
> # create sub data
> subdat <- dat[dat[1] %in% list[1],]
>
> So the third line creates a new data frame from the rows of dat whose first
> column also appears in list. There is no problem with the code, as it runs
> just fine with small test data. When I tried it with my real data (~80k
> lines, ~15 MB), it takes forever (a few hours). I don't know why it takes
> that long, but I think it shouldn't: even with a for loop in C++, I could
> get this done in, say, a few minutes.
>
> So does anyone have any ideas/advice/suggestions?
>
> Thanks so much in advance and Happy New Year to all of you.
>
> D.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
