[R] Which function to use: grep, replace, substr etc.?

David Winsemius dwinsemius at comcast.net
Sun Oct 16 22:59:00 CEST 2011


On Oct 16, 2011, at 1:32 PM, Jeff Newmiller wrote:

> Note that "male" comes before "female" in your data frame.
> ---------------------------------------------------------------------------
> Jeff Newmiller The ..... ..... Go Live...
>

> syrvn <mentor_ at gmx.net> wrote:
>
> Hi,
>
> thanks for the tip! I do it as follows now, but I still have a problem
> I do not understand:
>
>
> abbrvs <- data.frame(c("peter", "name", "male", "female"),
> 		 	 c("P", "N", "m", "f"))
> 						
> colnames(abbrvs) <- c("pattern", "replacement")
> 	
> str <- "My name is peter and I am male"
>
> for(m in 1:nrow(abbrvs)) {
> 		str <- sub(abbrvs$pattern[m], abbrvs$replacement[m], str, fixed=TRUE)
> 		print(str)
> 	}
> 	
>
> This works perfectly fine as I get: "My N is P and I am m"
>
> However, when I replace male by female then I get the following:
> "My N is P and I am fem"
>
> but I want to have "My N is P and I am f".
>
> Even with the parameter fixed=TRUE I get the same result. Why is that?

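Because "male" is in "female"? The "male" row is applied first, so it
turns "female" into "fem" before the "female" row is ever tried. This
reminds me of a comment on a posting I made this morning on SO: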
http://stackoverflow.com/questions/7782113/counting-keyword-occurrences-in-r
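
A minimal sketch of the substring clash, using the sentence from the
question. The variable names (s, male_first, female_first) are made up
for illustration. Applying the longer pattern first happens to avoid this
particular clash, but it is only a workaround, not whole-word matching.

```r
# Plain substring replacement: "male" also occurs inside "female".
s <- "My name is peter and I am female"
male_first <- sub("male", "m", s, fixed = TRUE)
print(male_first)    # "My name is peter and I am fem"

# Workaround for this case only: apply the longer pattern first,
# so nothing is left for "male" to match.
female_first <- sub("female", "f", s, fixed = TRUE)
female_first <- sub("male", "m", female_first, fixed = TRUE)
print(female_first)  # "My name is peter and I am f"
```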

The problem was slightly different, but the greppish principle was
that in order to match only complete words, you need to specify "^",
"$" or " " at each end of the word:

dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
grep("^corn$|^corn | corn$", dataset)
[1] 1 3
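
As an alternative sketch (not what was posted on SO), the word-boundary
escape "\\b" expresses the same idea more compactly. Note one difference:
"\\b" also treats punctuation as a boundary, so "corn," would match here
but not with the space-delimited pattern above. The name whole_word is
just for illustration.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# \\b matches the empty string at either edge of a word, so "corn"
# inside "cornmeal" is rejected.
whole_word <- grep("\\bcorn\\b", dataset, perl = TRUE)
print(whole_word)  # 1 3
```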

In such cases you may want to look at the gsubfn package. It offers
higher-level matching functions, and I think strapply might be more
efficient and expressive here. I can imagine a construction in a loop
such as yours, but you would probably want to build the pattern outside
the sub() call.

After struggling to fix your loop (and your data.frame, which
definitely should not be using factor variables), I am even more
convinced you should be learning the "gsubfn" facilities. (Take out the
debugging print statements.)

 > abbrvs <- data.frame(c("peter", "name", "male", "female"),
+ 		 	 c(" P ", " N ", " m ", " f "), stringsAsFactors=FALSE)
 > 						
 > colnames(abbrvs) <- c("pattern", "replacement")


 > str <- "My name is peter and I am female"
 > for(m in 1:nrow(abbrvs)) { patt <- paste("^", abbrvs$pattern[m], "$| ",
+                   abbrvs$pattern[m], " | ",
+                   abbrvs$pattern[m], "$", sep="")
+              print(c( patt, abbrvs$replacement[m]))
+ 		str <- sub(patt, abbrvs$replacement[m], str)
+ 		print(str)
+ 	}
[1] "^peter$| peter | peter$" " P "
[1] "My name is P and I am female"
[1] "^name$| name | name$" " N "
[1] "My N is P and I am female"
[1] "^male$| male | male$" " m "
[1] "My N is P and I am female"
[1] "^female$| female | female$" " f "
[1] "My N is P and I am f "
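
For comparison, here is a base-R sketch of the same loop using "\\b"
word boundaries instead of padded spaces, so the replacements do not
need surrounding blanks and no trailing space is left behind. This is
an assumption about what would suit the original question, not a
gsubfn solution. The variable txt is a hypothetical rename of str (to
avoid masking the str() function), and the patterns are assumed to
contain no regex metacharacters.

```r
abbrvs <- data.frame(pattern     = c("peter", "name", "male", "female"),
                     replacement = c("P", "N", "m", "f"),
                     stringsAsFactors = FALSE)

txt <- "My name is peter and I am female"

for (m in seq_len(nrow(abbrvs))) {
  # Build the whole-word pattern outside the gsub() call.
  patt <- paste0("\\b", abbrvs$pattern[m], "\\b")
  txt  <- gsub(patt, abbrvs$replacement[m], txt, perl = TRUE)
}
print(txt)  # "My N is P and I am f"
```

Because "\\bmale\\b" cannot match inside "female", the order of the rows
no longer matters.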

-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT


