[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())

Mon May 30 02:41:53 CEST 2011

Hi Matt,

There are likely more efficient ways still, but replacing the
lapply(strsplit()) call with a single gsub() is about five times
faster for me:

x <- c('18x.6','12x.9','302x.3')

gsub("\\.(.+$)", "", x)
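To spell out what that pattern does: it deletes the first "." and
everything after it, provided at least one character follows the dot.
The sketch below compares it with a sub() variant using ".*" instead
of ".+". The `edge` vector is made up for illustration and is not part
of Matt's data.

```r
x <- c('18x.6', '12x.9', '302x.3')

# "\\." is a literal dot; ".+$" is one or more characters up to the end
out_gsub <- gsub("\\.(.+$)", "", x)

# sub() replaces only the first match, which is all that is needed here;
# ".*" (zero or more characters) also strips a dot that ends the string
out_sub <- sub("\\..*$", "", x)

identical(out_gsub, out_sub)

# hypothetical inputs, not from the original data
edge <- c("a.b.c", "nodot", "trailing.")
edge_gsub <- gsub("\\.(.+$)", "", edge)  # leaves the trailing dot in place
edge_sub  <- sub("\\..*$", "", edge)     # removes the trailing dot as well
```

If every element has exactly one dot followed by at least one
character, as in the example data, the two patterns give the same
result.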

x <- rep(x, 10^5)

> system.time(out1 <- unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1])))
   user  system elapsed
   2.89    0.03    2.96
> system.time(out2 <- gsub("\\.(.+$)", "", x))
   user  system elapsed
   0.57    0.00    0.59
> all.equal(out1, out2)
[1] TRUE
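
On the RAM side, strsplit() returns a list holding one small character
vector per element of x, and that list has to exist in full before
lapply() can pick out the first pieces. A possible way around it,
sketched here under assumptions and not timed on the full 132e6
vector, is to find the first "." with regexpr() and cut with substr(),
working through x in chunks so the temporaries stay small. The helper
name strip_after_dot and the chunk size are hypothetical choices.

```r
# hypothetical helper, not from the original thread
strip_after_dot <- function(x, chunk_size = 1e6) {
  n <- length(x)
  if (n == 0) return(character(0))
  out <- character(n)
  for (s in seq(1, n, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1, n)
    xi <- x[idx]
    # position of the first literal ".", or -1 when there is none
    pos <- as.integer(regexpr(".", xi, fixed = TRUE))
    # keep everything before the dot; keep the whole string if no dot
    end <- ifelse(pos > 0L, pos - 1L, nchar(xi))
    out[idx] <- substr(xi, 1L, end)
  }
  out
}

x <- rep(c('18x.6', '12x.9', '302x.3'), 10)
out3 <- strip_after_dot(x, chunk_size = 7)  # tiny chunk size, just to exercise the loop
all.equal(out3, gsub("\\.(.+$)", "", x))
```

Whether this actually beats gsub() on time or memory would need to be
checked with system.time() and gc() on the real data.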

Cheers,

Josh

On Sun, May 29, 2011 at 5:10 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> hi all,
>
> I'm full of questions today :). Thanks in advance for your help!
>
> Here's the problem:
> x <- c('18x.6','12x.9','302x.3')
>
> I want to get a vector that is c('18x','12x','302x')
>
> This is easily done using this code:
>
> unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1]))
>
> So far so good. The problem is that x is a vector of length 132e6.
> When I run the above code, it runs for > 30 minutes, and it takes > 23
> Gb RAM (no kidding!).
>
> Does anyone have ideas about how to speed up the code above and (more
> importantly) reduce the RAM footprint? I'd prefer not to change the
> file on disk using, e.g., awk, but I will do that as a last resort.
>
> Best
>
> Matt
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/