[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Christopher W Ryan cryan at binghamton.edu
Thu Jul 9 20:48:46 CEST 2015


Thanks everyone.  John's original solution worked great.  And with
27,000 records, 65 alarm.words, and 6 columns to search, it takes only
about 15 seconds.  That is certainly adequate for my needs.  But I
will try out the other strategies too.
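
For the archive, here is a minimal sketch of how John's approach can be turned into the v5 flag I asked for, using the text columns from my original example. The numeric columns v1 and v4 are left out because they play no part in the search, and the helper name cols is only my choice for this illustration.

```r
# Example data from the original post (text columns only)
zz <- data.frame(
  v2 = c("white bird", "blue bird", "green turtle", "quick brown fox",
         "big black dog", "waffle the hamster", "benny likes food a lot",
         "hello world", "yellow giraffe with a long neck", "black bear"),
  v3 = c("harry potter", "hermione grainger", "ronald weasley",
         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
         "white dress robes", "gandalf the white", "gandalf the grey"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

cols <- c("v2", "v3")  # hypothetical name: the columns to search

# For each row, test every alarm word against every searched column
zz$v5 <- apply(zz[, cols], 1, function(x) {
  any(sapply(alarm.words, grepl, x = x))
})

which(zz$v5)  # rows 3, 6, 9, 10, the same rows John's output shows
```

Note that this matches substrings, not whole words, which is the issue Bert raises further down.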

And thanks also for lots of new R things to learn--grep, grepl,
do.call . . . that's always a bonus!

--Chris Ryan

On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> Yup, that does it. Let grep figure out what's a word rather than doing
> it manually. Forgot about "\b"
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>    -- Clifford Stoll
>
>
> On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
> <jdnewmil at dcn.davis.ca.us> wrote:
>> Just add a word break marker before and after:
>>
>> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
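
A quick check of what the word-break markers buy, as a sketch on made-up strings. The vector txt is hypothetical and not part of my data; loose and strict are just illustrative names for the two patterns discussed in this thread.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical test strings, not from the real data
txt <- c("the gate was barred", "a red\nballoon", "red, white and blue",
         "evergreen forest")

# Plain alternation: matches the alarm words as substrings anywhere
loose <- grepl(paste0(alarm.words, collapse = "|"), txt)

# Alternation wrapped in \\b ... \\b: matches whole words only
strict <- grepl(paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b"), txt)

data.frame(txt, loose, strict)
# loose  is TRUE for all four strings ("barred" and "evergreen" included)
# strict is FALSE TRUE TRUE FALSE
```

The newline and the comma both count as word boundaries, so no manual handling of whitespace or punctuation is needed.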
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>Jeff:
>>>
>>>Well, it would be much better (no loops!) except, I think, for one
>>>issue: "red" would match "barred" and I don't think that this is what
>>>is wanted: the matches should be on whole "words" not just string
>>>patterns.
>>>
>>>So you would need to fix up the matching pattern to make this work,
>>>but it may be a little tricky, as arbitrary whitespace characters,
>>>e.g. " " or "\n" etc. could be in the strings to be matched separating
>>>the words or ending the "sentence."  I'm sure it can be done, but I'll
>>>leave it to you or others to figure it out.
>>>
>>>Of course, if my diagnosis is wrong or silly, please point this out.
>>>
>>>Cheers,
>>>Bert
>>>
>>>
>>>
>>>
>>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>>><jdnewmil at dcn.davis.ca.us> wrote:
>>>> I think grep is better suited to this:
>>>>
>>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>>>>
>>>>
>>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>>>single, not a double, loop. It should be more efficient.
>>>>>
>>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>>>+       function(x)any(x %in% alarm.words))
>>>>>
>>>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>>>
>>>>>The idea is to paste the strings in each row (do.call allows an
>>>>>arbitrary number of columns) into a single string and then use
>>>>>strsplit to break the string into individual "words" on whitespace.
>>>>>Then the matching is vectorized with the any( %in% ... ) call.
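
To make the intermediate steps concrete, here is a small sketch on two made-up rows. The data frame d and the names pasted, words and hit are hypothetical. It also shows one caveat I am fairly sure of: splitting on whitespace leaves punctuation attached, so "red," is not equal to "red".

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical two-row example, not from the real data
d <- data.frame(a = c("green turtle", "big black dog"),
                b = c("ronald weasley", "red, white and blue"),
                stringsAsFactors = FALSE)

# Step 1: paste the columns of each row into one string
pasted <- do.call(paste, d)

# Step 2: split each string into "words" on runs of whitespace
words <- strsplit(pasted, "[[:space:]]+")

# Step 3: flag rows where any word is exactly equal to an alarm word
hit <- sapply(words, function(x) any(x %in% alarm.words))

hit  # TRUE FALSE: row 2 is missed because its word is "red," with a comma
```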
>>>>>
>>>>>Cheers,
>>>>>Bert
>>>>>
>>>>>
>>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>>> Dear Chris,
>>>>>>
>>>>>> If I understand correctly what you want, how about the following?
>>>>>>
>>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>>>> zz[rows, ]
>>>>>>
>>>>>>           v1                              v2                v3 v4
>>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>>>
>>>>>> I hope this helps,
>>>>>>  John
>>>>>>
>>>>>> ------------------------------------------------
>>>>>> John Fox, Professor
>>>>>> McMaster University
>>>>>> Hamilton, Ontario, Canada
>>>>>> http://socserv.mcmaster.ca/jfox/
>>>>>>
>>>>>>
>>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>>>  "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>>>> Running R 3.1.1 on windows 7
>>>>>>>
>>>>>>> I want to identify as a case any record in a dataframe that contains any
>>>>>>> of several keywords in any of several variables.
>>>>>>>
>>>>>>> Example:
>>>>>>>
>>>>>>> # create a dataframe with 4 variables and 10 records
>>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>>>> "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>>>> "hello world", "yellow giraffe with a long neck", "black bear")
>>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>>>> "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>>>> "white dress robes", "gandalf the white", "gandalf the grey")
>>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>>>> stringsAsFactors=FALSE)
>>>>>>> str(zz)
>>>>>>> zz
>>>>>>>
>>>>>>> # here are the keywords
>>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>>>>
>>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>>>> # set zz$v5=TRUE for that record.
>>>>>>>
>>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>>>> # obviously misunderstand something about how str_detect works
>>>>>>>
>>>>>>> library(stringr)
>>>>>>>
>>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>>>>                                      # must be a vector, not multiple
>>>>>>>                                      # columns
>>>>>>>
>>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>>>>
>>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>>>>                                      # is less than the number of
>>>>>>>                                      # rows I am using for the
>>>>>>>                                      # comparison
>>>>>>>
>>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>>>> length(alarm.words)                  # confining nrows
>>>>>>>                                      # to the length of alarm.words
>>>>>>>
>>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>>>>
>>>>>>> # maybe I need apply() ?
>>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>>>>
>>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>>>>                            # between alarm.words and that
>>>>>>>                            # in which I am searching for
>>>>>>>                            # matching strings
>>>>>>>
>>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>>>>                            # rows of the dataframe
>>>>>>>
>>>>>>>
>>>>>>> # perhaps %in% could do the job?
>>>>>>>
>>>>>>> Appreciate any advice.
>>>>>>>
>>>>>>> --Chris Ryan
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>
>>
>


