[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())

Peter Ehlers ehlers at ucalgary.ca
Mon May 30 20:02:05 CEST 2011


On 2011-05-29 23:08, Matthew Keller wrote:
> God this listserve is awesome. Thanks to everyone for their ideas.
> I'll speed & memory test tomorrow and change the code. Thanks again!

Since you're dealing with a vector of ~ 1e8 elements, you might
find that (at a probably small cost of time) you can reduce the
memory requirements by processing the vector in pieces:

## adjust n to suit trade-off between memory usage and time
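n <- 100
len <- length(x)
k <- ceiling(len / n)   # chunk size; the last chunk may be shorter
L <- vector("list", n)
for( i in 1:n ) {
   start <- (i - 1) * k + 1
   if (start > len) break   # x is used up before n chunks were needed
   y <- x[start:min(i * k, len)]
   L[[i]] <- gsub("^(.*?)\\..*$","\\1",y, perl=TRUE)
}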
newx <- unlist(L)
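
As a minimal sketch, the chunked idea can also be wrapped in a function and
checked against the one-shot gsub() result on a small vector. The function
name strip_after_dot_chunked and its n_chunks argument are made up for
illustration; whether chunking really lowers peak memory for a given x
should be measured rather than assumed.

```r
# Hypothetical helper: names and defaults are illustrative only.
strip_after_dot_chunked <- function(x, n_chunks = 100) {
  len <- length(x)
  if (len == 0L) return(character(0))
  k <- ceiling(len / n_chunks)          # elements per chunk
  starts <- seq(1, len, by = k)         # first index of each chunk
  out <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    # the last chunk is clipped at the end of x
    idx <- starts[i]:min(starts[i] + k - 1, len)
    out[[i]] <- gsub("^(.*?)\\..*$", "\\1", x[idx], perl = TRUE)
  }
  unlist(out, use.names = FALSE)
}

# Small check: chunked and unchunked results agree, even when the
# length of x is not a multiple of the number of chunks.
x_small <- rep(c("18x.6", "12x.9", "302x.3"), length.out = 1000)
res <- strip_after_dot_chunked(x_small, n_chunks = 7)
stopifnot(identical(res, gsub("^(.*?)\\..*$", "\\1", x_small, perl = TRUE)))
```

A larger n_chunks means smaller temporary vectors per iteration but more
loop overhead, so it is worth timing a few values on a subset first.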


Peter Ehlers

>
> Matt
>
> On Sun, May 29, 2011 at 6:44 PM, Ian Gow <iandgow at gmail.com> wrote:
>> Not a new approach, but some benchmark data (the perl=TRUE speeds up Jim's
>> suggestion):
>>
>>> x <- c('18x.6','12x.9','302x.3')
>>> y <- rep(x,100000)
>>> system.time(temp <- unlist(lapply(strsplit(y,".",fixed=TRUE),function(x) x[1])))
>>    user  system elapsed
>>   1.203   0.018   1.222
>>> system.time(temp2 <- gsub("^(.*?)\\..*$","\\1",y, perl=TRUE))
>>    user  system elapsed
>>   0.176   0.001   0.176
>>> identical(temp2, temp)
>> [1] TRUE
>>> system.time(temp3 <- gsub("^(.*)\\..*", '\\1', y))
>>    user  system elapsed
>>   0.292   0.001   0.291
>>> identical(temp3, temp)
>> [1] TRUE
>>> system.time(temp3 <- gsub("^(.*)\\..*", '\\1', y, perl=TRUE))
>>    user  system elapsed
>>   0.160   0.001   0.161
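
One caveat about the identical() checks above: they pass because every test
string contains exactly one dot. As a small sketch (the vector z is made up
for illustration), the lazy pattern "^(.*?)\\..*$" keeps the text before the
first dot, like the strsplit() version, while the greedy pattern "^(.*)\\..*"
keeps everything before the last dot.

```r
# Illustrative input: the second element has more than one dot.
z <- c("18x.6", "a.b.c")

first_piece <- unlist(lapply(strsplit(z, ".", fixed = TRUE), function(p) p[1]))
lazy   <- gsub("^(.*?)\\..*$", "\\1", z, perl = TRUE)  # stops at the first dot
greedy <- gsub("^(.*)\\..*", "\\1", z)                 # runs to the last dot

stopifnot(identical(lazy, first_piece))      # both give "18x" "a"
stopifnot(identical(greedy, c("18x", "a.b")))
```

If the real data can contain several dots per element, the lazy form is the
one that reproduces the original strsplit() result.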
>>
>> On 5/29/11 7:40 PM, "jim holtman" <jholtman at gmail.com> wrote:
>>
>>> Try this approach:
>>>
>>>> x <- c('18x.6','12x.9','302x.3')
>>>> gsub("^(.*)\\..*", '\\1', x)
>>> [1] "18x"  "12x"  "302x"
>>>
>>>
>>> On Sun, May 29, 2011 at 8:10 PM, Matthew Keller <mckellercran at gmail.com> wrote:
>>>> hi all,
>>>>
>>>> I'm full of questions today :). Thanks in advance for your help!
>>>>
>>>> Here's the problem:
>>>> x <- c('18x.6','12x.9','302x.3')
>>>>
>>>> I want to get a vector that is c('18x','12x','302x')
>>>>
>>>> This is easily done using this code:
>>>>
>>>> unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1]))
>>>>
>>>> So far so good. The problem is that x is a vector of length 132e6.
>>>> When I run the above code, it runs for > 30 minutes, and it takes > 23
>>>> Gb RAM (no kidding!).
>>>>
>>>> Does anyone have ideas about how to speed up the code above and (more
>>>> importantly) reduce the RAM footprint? I'd prefer not to change the
>>>> file on disk using, e.g., awk, but I will do that as a last resort.
>>>>
>>>> Best
>>>>
>>>> Matt
>>>>
>>>> --
>>>> Matthew C Keller
>>>> Asst. Professor of Psychology
>>>> University of Colorado at Boulder
>>>> www.matthewckeller.com
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?


