[R] vectorized sub, gsub, grep, etc.

Bert Gunter bgunter.4567 at gmail.com
Wed Jul 29 17:02:45 CEST 2015


There is confusion here. apply() family functions are **NOT**
vectorization -- they ARE loops (at the interpreter level), just done
in "functionalized" form. Please read background material (John
Chambers's books, MASS, or numerous others) to improve your
understanding and avoid posting erroneous comments.

Cheers,
Bert


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Tue, Jul 28, 2015 at 3:00 PM, John Thaden <jjthaden at flash.net> wrote:
> Adam,    The method you propose gives a different result than the prior methods for these example vectors
> X <- c("ab", "cd", "ef")
> patt <- c("b", "cd", "a")
> repl <- c("B", "CD", "A")
>
> Old method 1
>
> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
> gives
>   b   cd    a
> "aB" "CD" "ef"
>
> Old method 2
>
> sub2 <- function(pattern, replacement, x) {
>     len <- length(x)
>     if (length(pattern) == 1)
>         pattern <- rep(pattern, len)
>     if (length(replacement) == 1)
>         replacement <- rep(replacement, len)
>     FUN <- function(i, ...) {
>         sub(pattern[i], replacement[i], x[i], fixed = TRUE)
>     }
>     idx <- 1:length(x)
>     sapply(idx, FUN)
> }
> sub2(patt, repl, X)
>  gives
> [1] "aB" "CD" "ef"
>
> Your method (I gave it the unique name "sub3")
>  sub3 <- function(pattern, replacement, x) {   len    <- length(x)  y      <- character(length=len)  patlen <- length(pattern)  replen <- length(replacement)  if(patlen != replen) stop('Error: Pattern and replacement length do not match')  for(i in 1:replen) {    y[which(x==pattern[i])] <- replacement[i]  }  return(y)}sub3(patt, repl, X)
> gives[1] ""   "CD" ""
>
> Granted, whatever it does, it does it faster
> #Old method 1
> system.time(for(i in 1:50000)
> mapply(function(p,r,x) sub(p,r,x, fixed = TRUE),p=patt,r=repl,x=X))
>    user  system elapsed
>    2.53    0.00    2.52
>
> #Old method 2
> system.time(for(i in 1:50000)sub2(patt, repl, X))   user  system elapsed
>    2.32    0.00    2.32
>
> #Your proposed method
> system.time(for(i in 1:50000) sub3(patt, repl, X))
>    user  system elapsed
>    1.02    0.00    1.01
>  but would it still be faster if it actually solved the same problem?
>
> -John Thaden
>
>
>
>
>      On Monday, July 27, 2015 11:40 PM, Adam Erickson <adam.michael.erickson at gmail.com> wrote:
>
> I know this is an old thread, but I wrote a simple FOR loop with vectorized pattern replacement that is much faster than either of those (it can also accept outputs differing in length from the patterns):
>   sub2  <- function(pattern, replacement, x) {     len   <- length(x)    y      <- character(length=len)    patlen <- length(pattern)    replen <- length(replacement)    if(patlen != replen) stop('Error: Pattern and replacement length do not match')    for(i in 1:replen) {      y[which(x==pattern[i])] <- replacement[i]    }    return(y)  }
> system.time(test <- sub2(patt, repl, XX))   user  system elapsed       0       0       0
> Cheers,
> Adam
> On Wednesday, October 8, 2008 at 9:38:01 PM UTC-7, john wrote:
> Hello Christos,
>   To my surprise, vectorization actually hurt processing speed!#Example
> X <- c("ab", "cd", "ef")
> patt <- c("b", "cd", "a")
> repl <- c("B", "CD", "A")sub2 <- function(pattern, replacement, x) {
>     len <- length(x)
>     if (length(pattern) == 1)
>         pattern <- rep(pattern, len)
>     if (length(replacement) == 1)
>         replacement <- rep(replacement, len)
>     FUN <- function(i, ...) {
>         sub(pattern[i], replacement[i], x[i], fixed = TRUE)
>     }
>     idx <- 1:length(x)
>     sapply(idx, FUN)
> }
>
> system.time(  for(i in 1:10000)  sub2(patt, repl, X)  )
>    user  system elapsed
>    1.18    0.07    1.26 system.time(  for(i in 1:10000)  mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)  )
>    user  system elapsed
>    1.42    0.05    1.47
>
> So much for avoiding loops.
> John Thaden======= At 2008-10-07, 14:58:10 Christos wrote: =======>John,
>>Try the following:
>>
>> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
>>   b   cd    a
>>"aB" "CD" "ef"
>>
>>-Christos>> -----My Original Message-----
>>> R pattern-matching and replacement functions are
>>> vectorized: they can operate on vectors of targets.
>>> However, they can only use one pattern and replacement.
>>> Here is code to apply a different pattern and replacement for
>>> every target.  My question: can it be done better?
>>>
>>> sub2 <- function(pattern, replacement, x) {
>>>     len <- length(x)
>>>     if (length(pattern) == 1)
>>>         pattern <- rep(pattern, len)
>>>     if (length(replacement) == 1)
>>>         replacement <- rep(replacement, len)
>>>     FUN <- function(i, ...) {
>>>         sub(pattern[i], replacement[i], x[i], fixed = TRUE)
>>>     }
>>>     idx <- 1:length(x)
>>>     sapply(idx, FUN)
>>> }
>>>
>>> #Example
>>> X <- c("ab", "cd", "ef")
>>> patt <- c("b", "cd", "a")
>>> repl <- c("B", "CD", "A")
>>> sub2(patt, repl, X)
>>>
>>> -John______________________________________________
> R-h... at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list