[R] Memoize and vectorize a custom function

Henrik Bengtsson hb at biostat.ucsf.edu
Fri Apr 27 18:15:01 CEST 2012


On Thu, Apr 26, 2012 at 3:21 PM, Kamil Slowikowski
<kslowikowski at gmail.com> wrote:
> My goal is simple: calcuate GC content of each sequence in a list of
> nucleotide
> sequences. I have figured out how to vectorize, but all my attempts at
> memoization failed.
>
> Can you show me how to properly memoize my function?
>
> There is a StackOverflow post on the subject of memoization, but it does not
> help me:
> http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
>
> I haven't been able to find any other discussions on this subject. Searching
> for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
> for those keywords at http://r-project.markmail.org/ does not return helpful
> discussions.
>
> Here's my data:
>
>    seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
>
> Some sequences are missing, so they're blank `""`.
>
> I have a function for calculating GC content:
>
>    > GC <- function(s) {
>        if (!is.character(s)) return(NA)
>        n <- nchar(s)
>        if (n == 0) return(NA)
>        m <- gregexpr('[GCSgcs]', s)[[1]]
>        if (m[1] < 1) return(0)
>        return(100.0 * length(m) / n)
>    }
>
> It works:
>
>    > GC('')
>    [1] NA
>    > GC('G')
>    [1] 100
>    > GC('GAG')
>    [1] 66.66667
>    > sapply(seqs, GC)
>                      G         C       CCC         T               TTCCT
>                C
>           NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
>     NA 100.00000
>          CTC
>     66.66667
>
> I want to memoize it. Then, I want to vectorize it. Should be easy, right?
>
> Apparently, I must have the wrong mindset for using the `memoise` or
> `R.cache`
> R packages:
>
>    > system.time(dummy <- sapply(rep(seqs,100), GC))
>       user  system elapsed
>      0.044   0.000   0.054
>    >
>    > library(memoise)
>    > GCm1 <- memoise(GC)
>    > system.time(dummy <- sapply(rep(seqs,100), GCm1))
>       user  system elapsed
>      0.164   0.000   0.173
>    >
>    > library(R.cache)
>    > GCm2 <- addMemoization(GC)
>    > system.time(dummy <- sapply(rep(seqs,100), GCm2))
>       user  system elapsed
>     10.601   0.252  10.926
>
> Notice that the memoized functions are several orders of magnitude slower.

About R.cache: All memoization by R.cache is currently done toward the
file system.  In other words, it is designed for larger objects (so
you cannot hold all of the cache in memory) and more computationally
expensive tasks.

/Henrik

>
> I tried the `hash` package, but things seem to be happening behind the
> scenes
> and I don't understand the output:
>
>    > cache <- hash()
>    > GCc <- function(s) {
>        if (!is.character(s) || nchar(s) == 0) {
>            return(NA)
>        }
>        if(exists(s, cache)) {
>            return(cache[[s]])
>        }
>        result <- GC(s)
>        cache[[s]] <<- result
>        return(result)
>    }
>    > sapply(seqs,GCc)
>    [[1]]
>    [1] NA
>
>    $G
>    [1] 100
>
>    $C
>    NULL
>
>    $CCC
>    [1] 100
>
>    $T
>    NULL
>
>    [[6]]
>    [1] NA
>
>    $TTCCT
>    [1] 40
>
>    [[8]]
>    [1] NA
>
>    $C
>    NULL
>
>    $CTC
>    [1] 66.66667
>
> At least I figured out how to vectorize:
>
>    > GCv <- Vectorize(GC)
>    > GCv(seqs)
>                      G         C       CCC         T               TTCCT
>                C
>      0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
> 0.00000 100.00000
>          CTC
>     66.66667
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list