[R] Which function to use: grep, replace, substr etc.?

Mon Oct 17 03:25:04 CEST 2011

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> Sent: Sunday, October 16, 2011 1:59 PM
> To: Jeff Newmiller
> Cc: r-help at r-project.org; syrvn
> Subject: Re: [R] Which function to use: grep, replace, substr etc.?
> 
> 
> On Oct 16, 2011, at 1:32 PM, Jeff Newmiller wrote:
> 
> > Note that "male" comes before "female" in your data frame.
> > ---------------------------------------------------------------------------
> > Jeff Newmiller The ..... ..... Go Live...
> >
> 
> > syrvn <mentor_ at gmx.net> wrote:
> >
> > Hi,
> >
> > thanks for the tip! I do it as follows now but I still have a
> > problem I do
> > not understand:
> >
> >
> > abbrvs <- data.frame(c("peter", "name", "male", "female"),
> > 		 	 c("P", "N", "m", "f"))
> >
> > colnames(abbrvs) <- c("pattern", "replacement")
> >
> > str <- "My name is peter and I am male"
> >
> > for(m in 1:nrow(abbrvs)) {
> > 		str <- sub(abbrvs$pattern[m], abbrvs$replacement[m], str,
> > fixed=TRUE)
> > 		print(str)
> > 	}
> >
> >
> > This works perfectly fine as I get: "My N is P and I am m"
> >
> > However, when I replace male by female then I get the following: "My
> > N is P
> > and I am fem"
> >
> > but I want to have "My N is P and I am f".
> >
> > Even with the parameter fixed=true I get the same result. Why is that?
> 
> Because "male" is in "female"? This reminds me of a comment on a
> posting I made this morning on SO.
> http://stackoverflow.com/questions/7782113/counting-keyword-occurrences-in-r
> 
> The problem was slightly different, but the greppish principle was
> that in order to match only complete words, you need to specify "^",
> "$" or " " at each end of the word:
> 
> dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
> grep("^corn$|^corn | corn$", dataset)
> [1] 1 3
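
The failure in the original loop can be reproduced in isolation. The following is a minimal sketch; the variable names (`sentence`, `partial`, `reordered`) are illustrative and not from the thread. It shows that `fixed=TRUE` only switches off regular-expression syntax and still matches substrings, and that trying the longer pattern first, as the ordering remark above suggests, happens to avoid the problem for this particular input.

```r
sentence <- "My name is peter and I am female"

# fixed = TRUE treats the pattern literally, but "male" is still found
# inside "female", so only the last four letters are replaced.
partial <- sub("male", "m", sentence, fixed = TRUE)

# Trying the longer pattern first avoids the clash for this input.
patterns     <- c("female", "male")
replacements <- c("f", "m")
reordered <- sentence
for (i in seq_along(patterns)) {
  reordered <- sub(patterns[i], replacements[i], reordered, fixed = TRUE)
}

print(partial)
print(reordered)
```

Reordering alone is fragile: any longer word that merely contains a pattern would still be altered, which is why matching whole words is the more general fix.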

You can use the two-character sequences "\\<" and "\\>" to match
the beginning and end of a "word" (where the match takes up zero
characters):
  > dataset <- c("corn", "cornmeal", "corn on the cob", "popcorn", "this corn is sweet")
  > grep("^corn$|^corn | corn$", dataset)
  [1] 1 3
  > grep("\\<corn\\>", dataset)
  [1] 1 3 5
  > gsub("\\<corn\\>", "CORN", dataset)
  [1] "CORN"              
  [2] "cornmeal"          
  [3] "CORN on the cob"   
  [4] "popcorn"           
  [5] "this CORN is sweet"
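
Applied to the abbreviation loop from the original question, the same word-boundary markers might be used as in the sketch below. The data frame mirrors the one in the question; the names `sentence` and `patt` are chosen here for illustration, and `stringsAsFactors = FALSE` is set so the columns are plain character vectors.

```r
abbrvs <- data.frame(pattern     = c("peter", "name", "male", "female"),
                     replacement = c("P", "N", "m", "f"),
                     stringsAsFactors = FALSE)

sentence <- "My name is peter and I am female"

for (m in seq_len(nrow(abbrvs))) {
  # Wrap each pattern in word-boundary markers so that "male"
  # cannot match inside "female".
  patt <- paste0("\\<", abbrvs$pattern[m], "\\>")
  sentence <- gsub(patt, abbrvs$replacement[m], sentence)
}

print(sentence)
```

Because the boundaries match zero characters, the replacements need no padding with spaces and the row order of the data frame no longer matters.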

If your definition of a "word" is more expansive it gets complicated.
E.g., if words might include letters, numbers, and periods but not
underscores or anything else, you could use:
  > gsub("(^|[^.[:alpha:][:digit:]])corn($|[^.[:alpha:][:digit:]])",
      "\\1CORN.BY.ITSELF\\2",
      c("corn.1", "corn_2", " corn", "4corn", "1.corn"))
  [1] "corn.1"          
  [2] "CORN.BY.ITSELF_2"
  [3] " CORN.BY.ITSELF" 
  [4] "4corn"           
  [5] "1.corn"
Moving to perl regular expressions would probably make this simpler.    
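
As a rough sketch of what the perl version could look like (one possible formulation, not tested against every edge case): with perl=TRUE, negative lookbehind and lookahead assertions state directly that "corn" must not be adjacent to a letter, digit, or period. Since the assertions consume no characters, no backreferences are needed in the replacement. The names `x` and `result` are illustrative.

```r
x <- c("corn.1", "corn_2", " corn", "4corn", "1.corn")

# (?<!...) : the preceding character must not be a period, letter, or digit
# (?!...)  : the following character must not be a period, letter, or digit
result <- gsub("(?<![.[:alpha:][:digit:]])corn(?![.[:alpha:][:digit:]])",
               "CORN.BY.ITSELF", x, perl = TRUE)

print(result)
```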

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> In such cases you may want to look at the gsubfn package. It offers
> higher level matching functions and I think strapply might be more
> efficient and expressive here. I can imagine construction in a loop
> such as yours, but you would probably want to build a pattern outside
> the sub() call.
> 
> After struggling to fix your loop (and your data.frame which
> definitely should not be using factor variables), I am even more
> convinced you should be learning "gsubfn" facilities. (Take out the
> debugging print statements.)
> 
>  > abbrvs <- data.frame(c("peter", "name", "male", "female"),
> + 		 	 c(" P ", " N ", " m ", " f "), stringsAsFactors=FALSE)
>  >
>  > colnames(abbrvs) <- c("pattern", "replacement")
> 
> 
>  > for(m in 1:nrow(abbrvs)) { patt <- paste("^",abbrvs$pattern[m], "$| ",
> +                   abbrvs$pattern[m], " | ",
> +                   abbrvs$pattern[m], "$", sep="")
> +              print(c( patt, abbrvs$replacement[m]))
> + 		str <- sub(patt, abbrvs$replacement[m], str)
> + 		print(str)
> + 	}
> [1] "^peter$| peter | peter$" " P "
> [1] "My name is P and I am female"
> [1] "^name$| name | name$" " N "
> [1] "My N is P and I am female"
> [1] "^male$| male | male$" " m "
> [1] "My N is P and I am female"
> [1] "^female$| female | female$" " f "
> [1] "My N is P and I am f "
> 
> --
> 
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
> 