[R] Fast string comparison

Matt Shotwell shotwelm at musc.edu
Tue Jul 13 15:24:53 CEST 2010


On Tue, 2010-07-13 at 01:42 -0400, Hadley Wickham wrote:
> strings <- replicate(1e5, paste(sample(letters, 100, replace = TRUE), collapse = ""))
> system.time(strings[-1] == strings[-1e5])
> #   user  system elapsed
> #  0.016   0.000   0.017
> 
> So it takes ~1/100 of a second to do ~100,000 string comparisons. You
> need to provide a reproducible example that illustrates why you think
> string comparisons are slow.

Here's a vectorized alternative to '==' for strings, with minimal
argument checking and no result conversion: it returns the integer
from C's strcmp (0 when the strings are equal) rather than a logical.
I haven't looked at the corresponding R source code; it may be similar:

library(inline)
code <- "
    SEXP ans;
    int i, len, *cans;
    /* both arguments must be character vectors */
    if(!isString(s1) || !isString(s2))
        error(\"invalid arguments\");
    /* compare only up to the shorter length; no recycling */
    len = length(s1)>length(s2)?length(s2):length(s1);
    PROTECT(ans = allocVector(INTSXP, len));
    cans = INTEGER(ans);
    /* strcmp returns 0 when the two strings are equal */
    for(i = 0; i < len; i++)
        cans[i] = strcmp(CHAR(STRING_ELT(s1,i)),
                         CHAR(STRING_ELT(s2,i)));
    UNPROTECT(1);
    return ans;
"
sig <- signature(s1="character", s2="character")
strcmp <- cfunction(sig, code)
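
A caveat on using this as a drop-in for '==': the loop above stops at
the shorter argument and does no special handling of missing values,
so an NA element would presumably be compared as the literal string
"NA" (an assumption about how CHAR() treats NA_STRING, not something
tested here). For reference, a small base-R sketch of what '==' itself
does in those two cases; the vectors are made up for illustration:

```r
# Made-up inputs, chosen only to show NA handling and recycling in '=='
x <- c("a", "b", NA)
y <- c("a", "c", "NA")

# '==' propagates NA instead of comparing it as text
na_result <- x == y

# '==' recycles the shorter argument to the length of the longer one
recycled <- c("a", "b", "a", "b") == c("a", "b")

print(na_result)
print(recycled)
```

Any wrapper that converts the integer result to a logical (for example
by testing it against 0) would need to decide how to treat both cases.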


> system.time(strings[-1] == strings[-1e5])
   user  system elapsed 
  0.036   0.000   0.035 
> system.time(strcmp(strings[-1], strings[-1e5]))
   user  system elapsed 
  0.032   0.000   0.034 
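
Single timings of a few hundredths of a second are close to the
resolution of system.time(), so differences of this size are mostly
noise. A minimal sketch of averaging over repeated runs, using only
base R; the vector size (1e4) and the number of repetitions (20) are
arbitrary choices for illustration, not values from this thread:

```r
set.seed(1)
# Smaller than the 1e5 strings used above so the example runs quickly
strings <- replicate(1e4, paste(sample(letters, 100, replace = TRUE),
                                collapse = ""))
n <- length(strings)
reps <- 20

# Time the whole loop once, then divide by the number of repetitions
timing <- system.time(
    for (k in seq_len(reps)) res <- strings[-1] == strings[-n]
)
per_run <- timing[["elapsed"]] / reps
per_run
```

The same loop can wrap the strcmp() call to compare the two on more
equal footing.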

That's pretty fast, though I seem to be working with a slower system
than Hadley. Both versions take about 0.035 seconds here, so the
hand-written loop is no faster than '=='. It's hard to see how this
could be improved, except maybe by caching results of string
comparisons.
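
One possible reading of "caching", sketched under assumptions: if the
same strings are compared many times, each distinct string can be
mapped to an integer code once, and later equality tests can be done
on the codes. The data below (a pool of 500 distinct strings sampled
1e5 times) is hypothetical and chosen so that repeats occur; this only
pays off when the one-time cost of match() is spread over many
comparisons.

```r
set.seed(1)
# Hypothetical data: many repeats drawn from a small pool of strings
pool <- replicate(500, paste(sample(letters, 100, replace = TRUE),
                             collapse = ""))
strings <- sample(pool, 1e5, replace = TRUE)
n <- length(strings)

# One-time encoding: equal strings get equal integer codes
codes <- match(strings, unique(strings))

# Later comparisons work on integers instead of characters
eq_chr <- strings[-1] == strings[-n]
eq_int <- codes[-1] == codes[-n]

# Both approaches agree element by element
stopifnot(identical(eq_chr, eq_int))
```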

-Matt

> 
> Hadley
> 
> 
> On Tue, Jul 13, 2010 at 6:52 AM, Ralf B <ralf.bierig at gmail.com> wrote:
> > I am asking this question because String comparison in R seems to be
> > awfully slow (based on profiling results) and I wonder if perhaps '=='
> > alone is not the best one can do. I did not ask for anything
> > particular and I don't think I need to provide a self-contained source
> > example for the question. So, to re-phrase my question, are there more
> > (runtime) effective ways to find out if two strings (about 100-150
> > characters long) are equal?
> >
> > Ralf
> >
> >
> > On Sun, Jul 11, 2010 at 2:37 PM, Sharpie <chuck at sharpsteen.net> wrote:
> >>
> >>
> >> Ralf B wrote:
> >>>
> >>> What is the fastest way to compare two strings in R?
> >>>
> >>> Ralf
> >>>
> >>
> >> Which way is not fast enough?
> >>
> >> In other words, are you asking this question because profiling showed one of
> >> R's string comparison operations is causing a massive bottleneck in your
> >> code? If so, which one and how are you using it?
> >>
> >> -Charlie
> >>
> >> -----
> >> Charlie Sharpsteen
> >> Undergraduate-- Environmental Resources Engineering
> >> Humboldt State University
> >> --
> >> View this message in context: http://r.789695.n4.nabble.com/Fast-string-comparison-tp2285156p2285409.html
> >> Sent from the R help mailing list archive at Nabble.com.
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> 
> 
> 
-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com


