[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Thu Jul 9 19:30:01 CEST 2015


Just add a word boundary marker (\b) before and after the alternation:

zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
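
As a minimal sketch of the word-boundary idea, the following uses a small made-up data frame (the rows with "barred owl" and "greenish hat" are hypothetical additions, not from the original data) to check that keywords match only as whole words. It assumes the text columns are named v2 and v3 as in the thread.

```r
# Keywords from the thread.
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical test data: rows 2 and 4 contain a keyword only as a substring.
zz <- data.frame(
  v2 = c("green turtle", "barred owl", "black bear", "greenish hat"),
  v3 = c("ronald weasley", "dudley dursley", "gandalf the grey", "hello world"),
  stringsAsFactors = FALSE
)

# Group the alternation so that \\b applies to every keyword, not just
# the first and last one.
pattern <- paste0("\\b(", paste(alarm.words, collapse = "|"), ")\\b")

# Paste the text columns of each row into one string and test it once.
zz$v5 <- grepl(pattern, do.call(paste, zz[, c("v2", "v3")]))
zz$v5
# [1]  TRUE FALSE  TRUE FALSE
```

Without the parentheses, the pattern would read \bred|green|turtle|gandalf\b, and the boundaries would bind only to "red" and "gandalf".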
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>Jeff:
>
>Well, it would be much better (no loops!) except, I think, for one
>issue: "red" would match "barred" and I don't think that this is what
>is wanted: the matches should be on whole "words" not just string
>patterns.
>
>So you would need to fix up the matching pattern to make this work,
>but it may be a little tricky, as arbitrary whitespace characters,
>e.g. " " or "\n" etc. could be in the strings to be matched separating
>the words or ending the "sentence."  I'm sure it can be done, but I'll
>leave it to you or others to figure it out.
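
On the whitespace concern, a quick check with made-up strings (all hypothetical, not from the original data) suggests that \b does not care which separator surrounds a word: spaces, tabs, newlines, and punctuation all count as boundaries. One caveat worth labeling as an assumption about what is wanted: a hyphenated compound such as "infra-red" also matches.

```r
# Hypothetical strings: four with "red" as a separate word using different
# separators, then two where "red" is only part of a longer word.
x <- c("red\nsparks", "dark\tred", "it was red.", "infra-red",
       "barred", "reddish")

pattern <- "\\b(red|green)\\b"

sep.hits <- grepl(pattern, x)
sep.hits
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
```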
>
>Of course, if my diagnosis is wrong or silly, please point this out.
>
>Cheers,
>Bert
>
>
>Bert Gunter
>
>"Data is not information. Information is not knowledge. And knowledge
>is certainly not wisdom."
>   -- Clifford Stoll
>
>
>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
><jdnewmil at dcn.davis.ca.us> wrote:
>> I think grep is better suited to this:
>>
>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>>
>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>single, not a double, loop. It should be more efficient.
>>>
>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>+       function(x)any(x %in% alarm.words))
>>>
>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>
>>>The idea is to paste the strings in each row (do.call allows an
>>>arbitrary number of columns) into a single string and then use
>>>strsplit to break the string into individual "words" on whitespace.
>>>Then the matching is vectorized with the any( %in% ... ) call.
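
The paste / strsplit / %in% steps described above can be sketched as follows. The data frame is a small hypothetical one (the "barred owl" and "it was red." rows are made up for illustration), and vapply is swapped in for sapply only to pin the result type.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical test data with the thread's column names.
zz <- data.frame(
  v2 = c("white bird", "green turtle", "barred owl", "it was red."),
  v3 = c("harry potter", "ronald weasley", "blue sparks", "hello world"),
  stringsAsFactors = FALSE
)

# Step 1: one string per row, built from all text columns.
row.text <- do.call(paste, zz[, c("v2", "v3")])

# Step 2: split each row's string into words on runs of whitespace.
words <- strsplit(row.text, "[[:space:]]+")

# Step 3: a row is a hit if any of its words is exactly a keyword.
hits <- vapply(words, function(w) any(w %in% alarm.words), logical(1))
hits
# [1] FALSE  TRUE FALSE FALSE
```

Because %in% compares whole tokens, "barred" is not a hit, but neither is "red." with its trailing period; splitting on "[[:space:][:punct:]]+" instead would be one way to handle that, if punctuation can occur in the data.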
>>>
>>>Cheers,
>>>Bert
>>>Bert Gunter
>>>
>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>> Dear Chris,
>>>>
>>>> If I understand correctly what you want, how about the following?
>>>>
>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>> zz[rows, ]
>>>>
>>>>           v1                              v2                v3 v4
>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>
>>>> I hope this helps,
>>>>  John
>>>>
>>>> ------------------------------------------------
>>>> John Fox, Professor
>>>> McMaster University
>>>> Hamilton, Ontario, Canada
>>>> http://socserv.mcmaster.ca/jfox/
>>>>
>>>>
>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>  "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>> Running R 3.1.1 on windows 7
>>>>>
>>>>> I want to identify as a case any record in a dataframe that contains
>>>>> any of several keywords in any of several variables.
>>>>>
>>>>> Example:
>>>>>
>>>>> # create a dataframe with 4 variables and 10 records
>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>> "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>> "hello world", "yellow giraffe with a long neck", "black bear")
>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>> "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>> "white dress robes", "gandalf the white", "gandalf the grey")
>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>> stringsAsFactors=FALSE)
>>>>> str(zz)
>>>>> zz
>>>>>
>>>>> # here are the keywords
>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>>
>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>> # string in v3 contains any of the strings in alarm.words. And then if
>>>>> # so, set zz$v5=TRUE for that record.
>>>>>
>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>> # obviously misunderstand something about how str_detect works
>>>>>
>>>>> library(stringr)
>>>>>
>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>>                                      # must be a vector, not multiple
>>>>>                                      # columns
>>>>>
>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>>
>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>>                                      # is less than the number of
>>>>>                                      # rows I am using for the
>>>>>                                      # comparison
>>>>>
>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>> length(alarm.words)                  # confining nrows
>>>>>                                      # to the length of alarm.words
>>>>>
>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>>
>>>>> # maybe I need apply() ?
>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>>
>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>>                            # between alarm.words and that
>>>>>                            # in which I am searching for
>>>>>                            # matching strings
>>>>>
>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>>                            # rows of the dataframe
>>>>>
>>>>>
>>>>> # perhaps %in% could do the job?
>>>>>
>>>>> Appreciate any advice.
>>>>>
>>>>> --Chris Ryan
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.


