[R] data frame subset too slow

Wed Jan 12 20:59:11 CET 2011

Sorry for the late response. I was away on vacation and unable to
keep working on the code.

Anyway, I was unable to provide the *str* output for that specific data,
since it sits inside a big package with lots of inputs and outputs.
Glancing quickly through the code, I had narrowed the problem down to the
data frame subset, which was a bad guess: it turned out that the data
frame was not the cause. After carefully checking through the package, I
found a double for loop. I replaced that double for loop, and the package
now runs in ~13 min instead of ~13 hrs on a similar dataset.
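
For anyone hitting the same wall: the actual loop from the package is not
shown in this thread, so the following is only a sketch of the general
pattern, using made-up data and hypothetical names (dat_ids, list_ids,
in_loop, in_vec). It contrasts a double for loop that tests membership
element by element with the vectorized %in% that Jim used in his example.

```r
# Illustrative data only; these are not the real gene lists.
set.seed(1)
dat_ids  <- sample(1:200, 2000, TRUE)
list_ids <- sample(1:100, 2000, TRUE)

# Double for loop: compare each element of dat_ids against list_ids
# one pair at a time, stopping at the first match.
in_loop <- logical(length(dat_ids))
for (i in seq_along(dat_ids)) {
  for (j in seq_along(list_ids)) {
    if (dat_ids[i] == list_ids[j]) {
      in_loop[i] <- TRUE
      break
    }
  }
}

# Vectorized equivalent: a single call that gives the same logical vector.
in_vec <- dat_ids %in% list_ids

stopifnot(identical(in_loop, in_vec))
```

The loop version does up to length(dat_ids) * length(list_ids) comparisons
in interpreted R code, which is the kind of cost that can grow from minutes
to hours as the inputs get larger.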

Thanks for all your help,

D.

On 12/30/10 11:40 AM, jim holtman wrote:
> If you want the data in the first column of the dataframe, then you
> should be using '[['.  Notice what comes back in each of these cases:
>
>> str(dat)
> 'data.frame':   80000 obs. of  5 variables:
>   $ sample.1.200..n..TRUE.: int  25 199 70 124 93 157 49 137 192 57 ...
>   $ runif.n.              : num  0.7725 0.0263 0.0728 0.7594 0.2792 ...
>   $ runif.n..1            : num  0.4304 0.8608 0.0882 0.5666 0.1721 ...
>   $ runif.n..2            : num  0.3797 0.1191 0.0481 0.3297 0.0649 ...
>   $ runif.n..3            : num  0.0895 0.0441 0.0403 0.9679 0.3986 ...
>> str(dat[1])
> 'data.frame':   80000 obs. of  1 variable:
>   $ sample.1.200..n..TRUE.: int  25 199 70 124 93 157 49 137 192 57 ...
>> str(dat[[1]])
>   int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
>> str(dat$sample.1.200..n..TRUE)
>   int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
>>   str(dat[,1])
>   int [1:80000] 25 199 70 124 93 157 49 137 192 57 ...
>
> You will get different classes of values.  We would really need to see
> the output of 'str' on your data structures to see what might be
> happening.  Your data is not that big, and most subsetting/extraction
> operations should take less than a second unless there is something
> funny in your data.  So please provide the 'str' output so we can see.
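
To make the difference concrete, here is a small sketch with a toy data
frame (the names a, b, wrong, right and the columns id, x are made up for
illustration and are not from the real data). With single brackets, %in%
sees a one-element list rather than a vector, so it returns a single
logical value instead of one value per row.

```r
a <- data.frame(id = c(1L, 2L, 3L, 4L), x = c(0.1, 0.2, 0.3, 0.4))
b <- data.frame(id = c(2L, 4L, 6L))

# a[1] is still a data frame (a list of length 1), so the result has
# length 1 and cannot serve as a per-row filter.
wrong <- a[1] %in% b[1]
length(wrong)

# a[[1]] is the integer vector itself, so the result has one entry per row.
right <- a[[1]] %in% b[[1]]
a.sub <- a[right, ]

# a$id extracts the same vector as a[[1]], so it behaves the same way.
stopifnot(identical(a$id %in% b$id, right))
```

Under these assumptions, extracting by name with $ should be just as
suitable as [[1]] for this kind of subsetting.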
>
>
> On Thu, Dec 30, 2010 at 11:28 AM, Duke<duke.lists at gmx.com>  wrote:
>> Hi Jim,
>>
>> Is this really a problem for me to use [1] instead of [[1]]? Will this make
>> it run slower? Also, if I use dat$V1 %in% list$V1, will it be fine?
>>
>> Anyway, my data and list are basically gene lists (tab delimited):
>>
>> $ head test.txt
>> Xkr4    chr1    -    3204562    3661579    3206102    3661429    3
>>   3204562,3411782,3660632,    3207049,3411982,3661579,
>> Rp1    chr1    -    4280926    4399322    4283061    4399268    4
>>   4280926,4341990,4342282,4399250,    4283093,4342162,4342918,4399322,
>> Rp1_2    chr1    -    4333587    4350395    4334680    4342906    4
>>   4333587,4341990,4342282,4350280,    4340172,4342162,4342918,4350395,
>> Sox17    chr1    -    4481008    4486494    4481796    4483487    5
>>   4481008,4483180,4483852,4485216,4486371,
>>   4482749,4483547,4483944,4486023,4486494,
>> Mrpl15    chr1    -    4763278    4775807    4764532    4775758    5
>>   4763278,4767605,4772648,4774031,4775653,
>>   4764597,4767729,4772814,4774186,4775807,
>> Mrpl15_2    chr1    -    4763278    4775807    4775807    4775807    4
>>   4763278,4767605,4772648,4775653,    4764597,4767729,4772814,4775807,
>> $ head list.txt
>> GeneNames    Chr    Start    End
>> 0610007C21Rik    chr5    31351012    31356996
>> 0610007L01Rik    chr5    130695613    130719635
>> 0610007L01Rik_2    chr5    130698204    130719635
>> 0610007P08Rik    chr13    63916627    64001609
>> 0610007P08Rik_2    chr13    63916641    63970963
>> 0610007P14Rik    chr12    87156404    87165495
>>
>> Thanks,
>>
>> D.
>>
>> On 12/30/10 11:13 AM, jim holtman wrote:
>>> You should be using dat[[1]].  Here is an example with 80000 rows that
>>> takes about 0.02 seconds to get the subset.
>>>
>>> Please provide the 'str' output showing what your data looks like.
>>>
>>>> n<- 80000  # rows to create
>>>> dat<- data.frame(sample(1:200, n, TRUE), runif(n), runif(n), runif(n),
>>>> runif(n))
>>>> lst<- data.frame(sample(1:100, n, TRUE), runif(n), runif(n), runif(n),
>>>> runif(n))
>>>> str(dat)
>>> 'data.frame':   80000 obs. of  5 variables:
>>>   $ sample.1.200..n..TRUE.: int  39 116 69 163 51 125 144 32 28 4 ...
>>>   $ runif.n.              : num  0.519 0.793 0.549 0.77 0.272 ...
>>>   $ runif.n..1            : num  0.691 0.89 0.783 0.467 0.357 ...
>>>   $ runif.n..2            : num  0.705 0.254 0.584 0.998 0.279 ...
>>>   $ runif.n..3            : num  0.873 1 0.678 0.702 0.455 ...
>>>> str(lst)
>>> 'data.frame':   80000 obs. of  5 variables:
>>>   $ sample.1.100..n..TRUE.: int  38 83 38 70 77 44 81 55 32 1 ...
>>>   $ runif.n.              : num  0.0621 0.7374 0.074 0.4281 0.0516 ...
>>>   $ runif.n..1            : num  0.879 0.294 0.146 0.884 0.58 ...
>>>   $ runif.n..2            : num  0.648 0.745 0.825 0.507 0.799 ...
>>>   $ runif.n..3            : num  0.2523 0.1679 0.9728 0.0478 0.0967 ...
>>>> system.time({
>>> + dat.sub<- dat[dat[[1]] %in% lst[[1]],]
>>> + })
>>>     user  system elapsed
>>>     0.02    0.00    0.01
>>>> str(dat.sub)
>>> 'data.frame':   39803 obs. of  5 variables:
>>>   $ sample.1.200..n..TRUE.: int  39 69 51 32 28 4 69 3 48 69 ...
>>>   $ runif.n.              : num  0.5188 0.5494 0.2718 0.5566 0.0893 ...
>>>   $ runif.n..1            : num  0.691 0.783 0.357 0.619 0.717 ...
>>>   $ runif.n..2            : num  0.705 0.584 0.279 0.789 0.192 ...
>>>   $ runif.n..3            : num  0.873 0.678 0.455 0.843 0.383 ...
>>> On Thu, Dec 30, 2010 at 10:23 AM, Duke<duke.lists at gmx.com>    wrote:
>>>> Hi all,
>>>>
>>>> First, I don't have much experience with R, so be gentle. I am dealing
>>>> with a dataset (~ tens of thousands of lines, each line ~ 10 columns of
>>>> data). I have to create subsets of this data based on certain conditions
>>>> (for example, same first column as another dataset, etc.). Here is how I
>>>> did it:
>>>>
>>>> # import data
>>>> dat<- read.table( "test.txt", header=TRUE, fill=TRUE, sep="\t" )
>>>> list<- read.table( "list.txt", header=TRUE, fill=TRUE, sep="\t" )
>>>> # create sub data
>>>> subdat<- dat[dat[1] %in% list[1],]
>>>>
>>>> So the third line creates a new data frame with the rows whose first
>>>> column appears in both dat and list. The code runs just fine on small
>>>> test data. When I tried it on my real data (~80k lines, ~15 MB), it
>>>> takes forever (a few hours). I don't know why it takes that long, but I
>>>> think it shouldn't. I think even with a for loop in C++, I could get
>>>> this done in, say, a few minutes.
>>>>
>>>> Does anyone have any idea/advice/suggestion?
>>>>
>>>> Thanks so much in advance and Happy New Year to all of you.
>>>>
>>>> D.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>