[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Christopher W Ryan cryan at binghamton.edu
Thu Jul 9 04:48:54 CEST 2015


Running R 3.1.1 on Windows 7.

I want to identify as a case any record in a dataframe that contains any
of several keywords in any of several variables.

Example:

# create a dataframe with 4 variables and 10 records
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
                 stringsAsFactors=FALSE)
str(zz)
zz

# here are the keywords
alarm.words <- c("red", "green", "turtle", "gandalf")

# For each row/record, I want to test whether the string in v2 or the
# string in v3 contains any of the strings in alarm.words. And then if so,
# set zz$v5=TRUE for that record.

# I'm thinking the str_detect function in the stringr package ought to
# be able to help, perhaps with some use of apply over the rows, but I
# obviously misunderstand something about how str_detect works.

library(stringr)

str_detect(zz[,2:3], alarm.words)    # error: the target of the search
                                     # must be a vector, not multiple
                                     # columns

str_detect(zz[1:4,2:3], alarm.words) # same error

str_detect(zz[,2], alarm.words)      # error, length of alarm.words
                                     # is less than the number of
                                     # rows I am using for the
                                     # comparison

str_detect(zz[1:4,2], alarm.words)   # works as hoped when
length(alarm.words)                  # confining nrows
                                     # to the length of alarm.words

str_detect(zz, alarm.words)          # obviously not right
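
The length complaints above are consistent with str_detect being vectorised over both of its arguments: element i of the strings is paired with element i of the patterns, so it does not test every keyword against every string. A minimal base-R sketch of the difference follows. The vector x is an illustrative mix of strings from v2 and v3, and mapply is used here only to imitate the element-by-element pairing; the names x, pairwise, pattern and any.keyword are not part of the original code.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")
x <- c("green turtle", "white bird", "red sparks", "blue bird",
       "gandalf the grey")

# Pairwise matching: keyword i is tested only against string i.
# "red" vs "green turtle", "green" vs "white bird", and so on.
pairwise <- mapply(grepl, alarm.words, x[1:4], USE.NAMES = FALSE)
pairwise

# "Any keyword" matching: collapse the keywords into one regular
# expression, so every string is tested against all of them at once.
pattern <- paste(alarm.words, collapse = "|")
any.keyword <- grepl(pattern, x)
any.keyword
```

The collapsed pattern is an ordinary regular expression, so it should be usable as the single pattern argument of str_detect as well. Note that it matches substrings, so "red" would also hit a word such as "hundred"; wrapping each keyword in \\b word boundaries is one way to tighten that if needed.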

# maybe I need apply() ?
my.f <- function(x){str_detect(x, alarm.words)}

apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
                           # between alarm.words and that
                           # in which I am searching for
                           # matching strings

apply(zz, 2, my.f)         # now I'm getting somewhere
apply(zz[1:4,], 2, my.f)   # but still only works with 4
                           # rows of the dataframe
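
One possible way to get the v5 flag described above, sketched in base R rather than stringr: build a single pattern from alarm.words, test each character column against it, and combine the per-column results with a logical OR. The names pattern, text.cols and hits are introduced here for illustration, and set.seed is only there to make the numeric columns reproducible.

```r
set.seed(1)  # only so that v1 and v4 are reproducible
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v1 = rnorm(10), v2 = v2, v3 = v3,
                 v4 = rpois(10, lambda = 2), stringsAsFactors = FALSE)

alarm.words <- c("red", "green", "turtle", "gandalf")

# One regular expression that matches if any keyword is present.
pattern <- paste(alarm.words, collapse = "|")

# Columns to search (assumed here to be the two character columns).
text.cols <- c("v2", "v3")

# One logical vector per column: does that column's string match?
hits <- lapply(zz[text.cols], function(col) grepl(pattern, col))

# A record is a case if any of the searched columns matched.
zz$v5 <- Reduce(`|`, hits)

zz[, c("v2", "v3", "v5")]
```

With the example data this flags rows 3, 6, 9 and 10. Reduce with `|` is used instead of apply over rows because it keeps the columns as character vectors and still works when the data frame has a single row.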

Appreciate any advice.

-- Chris Ryan


