[Rd] grep with fixed=TRUE and ignore.case=TRUE
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu May 17 11:28:55 CEST 2007
On Thu, 17 May 2007, Petr Savicky wrote:
>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable. In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
> I did not find towctrans defined in R sources, but it is in
I don't see it in Rdll.hide. It is a C99 function (see your unix man
> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?
UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in
so far as that makes sense: Chinese chars do not have 'case').
Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is
not currently active.
>> In the case of grep I think all you need is
>> grep(tolower(pattern), tolower(x), fixed = TRUE)
>> and similarly for regexpr.
> Yes. this is correct, but it has disadvantages. It needs more
> space and, if value=TRUE, we would have to do something like
> x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern,...)) is the last command
> and I think this should be preserved.
>>> Ignore case option is not meaningfull in gsub.
>> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
>> is different from 'ignore.case=FALSE', and I see the meaning as clear.
>> So what did you mean? (Unfortunately the tolower trick does not work for
> The meaning of ignore.case in [g]sub is problematic due to the following.
> sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
>  "xyzD" "xyzd"
> but the user may in fact need the following
>  "XYZD" "xyzd"
He may, but that is not what 'ignore case' means, more like 'case
> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
> x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE
> sub("ABC", "XYZ", x) # ignore.case=FALSE
> may be more useful.
> I have another question concerning the speed of grep. I expected that
> fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
> y <- "aaaaaaaaab"
> x <- "aaaaaaaaaaaaaaaaaaab"
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m,n are the lengths of y,x resp. I would expect that
> the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
> > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> > y <- paste(c(rep("a", times = 10000), x2), collapse = "")
> > z <- rep(y, times = 100)
> > system.time(i <- grep(x1, z, fixed = T))
>  1.970 0.000 1.985 0.000 0.000
> > system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*)
>  40.374 0.003 40.381 0.000 0.000
> > system.time(i <- grep(x2, z, fixed = T))
>  0.113 0.000 0.113 0.000 0.000
> > system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one
>  0.019 0.000 0.019 0.000 0.000
> Do you have an explanation of these results, in particular (*)?
Yes, there is a comment on the help page to that effect. But these are
highly atypical uses. Try perl=TRUE, and be aware that the locale matters
a lot in such tests (via the charset).
No one is attempting to make R a fast string-processing language and so
developers resources are spent on performance where it matters to more
typical usage. (E.g. reducing duplication in as.double and friends speeds
up just about every R session, and speeds up some numerical sessions
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-devel