[R] regular expression help to extract specific strings from text

Tony B tony.breyal at googlemail.com
Wed Mar 31 15:20:49 CEST 2010


Dear all,

Let's say I have the following:

> x <- c("Eve: Going to try something new today...", "Adam: Hey @Eve, how are you finding R? #rstats", "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(", "Adam: @Eve I'm sure they'll sort it out :)", "blahblah")
> x
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @Eve, how are you finding R? #rstats"
[3] "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :("
[4] "Adam: @Eve I'm sure they'll sort it out :)"
[5] "blahblah"

I would like to come up with a data frame which looks like this
(pulling out the usernames and #tags):

> data.frame(Msg = x, Source = c("Eve", "Adam", "Eve", "Adam", NA), Mentions = c(NA, "Eve", "Adam, Cain, Able", "Eve", NA), HashTags = c(NA, "rstats", "Excel", NA, NA))

The best I can do so far is:

# Take the text before the first ":" as the source; NA if there is no ":"
source <- lapply(x, function(x) {
  tmp <- strsplit(x, ":", fixed = TRUE)
  if (length(tmp[[1]]) < 2) {
    tmp <- c(NA, tmp)
  }
  return(tmp[[1]][1])
})
source <- unlist(source)

[1] "Eve"  "Adam" "Eve"  "Adam" NA
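
As an aside, the same Source vector could probably be produced more compactly. This is a minimal sketch, assuming that the source is always whatever precedes the first colon and that a message without a colon should give NA. The name src is only illustrative, chosen so that base::source is not masked:

```r
x <- c("Eve: Going to try something new today...",
       "Adam: Hey @Eve, how are you finding R? #rstats",
       "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
       "Adam: @Eve I'm sure they'll sort it out :)",
       "blahblah")

# Strip everything from the first ":" onwards; messages without a ":"
# have no identifiable source, so they become NA.
src <- ifelse(grepl(":", x, fixed = TRUE),
              sub(":.*$", "", x),
              NA_character_)
```

With the five example messages this should give the same vector as the lapply version.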

I can't work out how to extract the usernames starting with '@' or the
#tags. I can identify them using gsub and replace them, but I don't
know how to extract only those terms, i.e. more or less the opposite
of the following:

> gsub("@([A-Za-z0-9_]+)", "@[...]", x)
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @[...], how are you finding R? #rstats"
[3] "Eve: @[...], It's awesome, so much better at statistics that #Excel ever was! @[...] & @[...] disagree though :("
[4] "Adam: @[...] I'm sure they'll sort it out :)"
[5] "blahblah"

and

> gsub("#([A-Za-z0-9_]+)", "#[...]", x)
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @Eve, how are you finding R? #[...]"
[3] "Eve: @Adam, It's awesome, so much better at statistics that #[...] ever was! @Cain & @Able disagree though :("
[4] "Adam: @Eve I'm sure they'll sort it out :)"
[5] "blahblah"
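
One way the extraction itself might be done is to locate every match with gregexpr() and pull the matched text out with regmatches(). This is a rough sketch, assuming an R version in which regmatches() is available. extract_terms is a hypothetical helper name, not an existing function, and the patterns are the same ones used in the gsub calls:

```r
x <- c("Eve: Going to try something new today...",
       "Adam: Hey @Eve, how are you finding R? #rstats",
       "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
       "Adam: @Eve I'm sure they'll sort it out :)",
       "blahblah")

# Hypothetical helper: find every match of `pattern` in each string,
# drop the leading "@" or "#", and collapse the matches into a single
# comma-separated string (NA when a string has no match).
extract_terms <- function(text, pattern) {
  hits <- regmatches(text, gregexpr(pattern, text))
  vapply(hits, function(h) {
    if (length(h) == 0) {
      NA_character_
    } else {
      paste(substring(h, 2), collapse = ", ")
    }
  }, character(1))
}

mentions <- extract_terms(x, "@[A-Za-z0-9_]+")
hashtags <- extract_terms(x, "#[A-Za-z0-9_]+")
```

For the example messages, mentions and hashtags should then line up with the Mentions and HashTags columns of the data frame I am after.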

I hope that makes sense, and thank you kindly in advance for your
time.
Tony Breyal


