[Rd] Regular expressions & large strings (PR#6617)

mjw at celos.net mjw at celos.net
Fri Feb 27 12:19:15 MET 2004


A possible regex bug when working with large strings.  The
following code snippet

  t5 <- paste( c( "# === TEST", rep(' ', 2452294) ), collapse='')
  str( sub("^.*TEST", "xyz", t5) )
  str( sub("^.*TEST", "xyz", substr(t5,0,200)) )

does not behave correctly.  On one machine, the second and third
lines print different results [the second line, on the long
string, does not perform the substitution], while on another, the
second line causes a segfault.  Both machines are running R 1.8.1
with PCRE, under NetBSD (1.6.1 and 1.6 respectively).
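
A minimal sketch of how the check could be made self-verifying, so
the outcome does not have to be read off the str() output.  The helper
name check_sub and its n_pad argument are hypothetical, not part of
the original report.  It relies on the header "# === TEST" being 10
characters and the replacement "xyz" being 3, so a successful
substitution shortens the string by 7.  On a build that segfaults,
the long case would of course crash rather than return FALSE.

```r
# Hypothetical helper: build "# === TEST" followed by n_pad spaces,
# apply the substitution from the report, and check the result.
check_sub <- function(n_pad) {
  s <- paste(c("# === TEST", rep(" ", n_pad)), collapse = "")
  res <- sub("^.*TEST", "xyz", s)
  # Success means the result starts with "xyz" and is 7 characters
  # shorter than the input (10-character header replaced by 3).
  substr(res, 1, 3) == "xyz" && nchar(res) == nchar(s) - 7
}

cat("short string substituted:", check_sub(190), "\n")
cat("long string substituted: ", check_sub(2452294), "\n")
```

Running both sizes through the same function makes it easy to try
intermediate values of n_pad to see where the behaviour changes.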

Possibly related (although perhaps not a bug):

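  function(n) {
    # n random integers in 0..99, joined by single spaces
    line <- paste(as.character(trunc(runif(n)*100)), collapse=" ")
    # time the replacement of every whitespace character by "-"
    system.time( res <- gsub("[[:space:]]", "-", line) )
  }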

gives rather long times, rising very sharply for big strings (e.g.
2.2s at n=2e4 and 360s at n=2e5 on a 1.2GHz AMD, roughly 160 times
longer for 10 times the input).  Other languages aren't so slow on
this task (e.g. at n=2e5: 0.4s in ruby 1.8.1 and 5.2s in python 2).
Doubtless my extremely-quick-hack benchmarks aren't fair, but the
difference still seems rather big.
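
A rough sketch of how the timings above could be collected at several
sizes in one go.  The helper names (make_line, time_gsub, time_chartr,
compare_timings) are hypothetical.  For comparison it also times
chartr(), a different technique that maps single characters without
going through the regex engine; it only replaces the plain space,
which is assumed to be the only whitespace in the generated line.
This is not a fair benchmark either, and no particular timings are
implied.

```r
# Build a test line of n random integers in 0..99 separated by spaces.
make_line <- function(n) {
  paste(as.character(trunc(runif(n) * 100)), collapse = " ")
}

# Elapsed seconds for the gsub() call from the snippet above.
time_gsub <- function(line) {
  system.time(gsub("[[:space:]]", "-", line))[["elapsed"]]
}

# Elapsed seconds for chartr(), which translates " " to "-" directly.
time_chartr <- function(line) {
  system.time(chartr(" ", "-", line))[["elapsed"]]
}

# Time both approaches at each size and record whether they agree.
compare_timings <- function(sizes) {
  rows <- lapply(sizes, function(n) {
    line <- make_line(n)
    data.frame(
      n        = n,
      gsub_s   = time_gsub(line),
      chartr_s = time_chartr(line),
      same     = identical(gsub("[[:space:]]", "-", line),
                           chartr(" ", "-", line))
    )
  })
  do.call(rbind, rows)
}

print(compare_timings(c(2e3, 2e4)))
```

Comparing the gsub_s column across a 10-fold step in n gives a quick
feel for how the cost grows with string length.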

Mark <><


