[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Bert Gunter bgunter.4567 at gmail.com
Thu Jul 9 19:52:45 CEST 2015


Yup, that does it. Let grep figure out what's a word rather than doing
it manually. Forgot about "\b".
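
For reference, a minimal self-contained sketch of the word-boundary version, run on the example data from the original post. The v1 and v4 columns are left out because they play no part in the match, and the names `pattern` and `combined` are illustrative, not from the thread.

```r
# Example data from the original post (only the two text columns are needed).
zz <- data.frame(
  v2 = c("white bird", "blue bird", "green turtle", "quick brown fox",
         "big black dog", "waffle the hamster", "benny likes food a lot",
         "hello world", "yellow giraffe with a long neck", "black bear"),
  v3 = c("harry potter", "hermione grainger", "ronald weasley",
         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
         "white dress robes", "gandalf the white", "gandalf the grey"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

# Build "\\b(red|green|turtle|gandalf)\\b": an alternation wrapped in word boundaries.
pattern <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")

# Paste the text columns of each row into one string, then test each string once.
combined <- do.call(paste, zz[, c("v2", "v3")])
zz$v5 <- grepl(pattern, combined)

which(zz$v5)
# 3 6 9 10
```

The flagged rows agree with the ones found by the other approaches quoted below.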

Cheers,
Bert


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Just add a word break marker before and after:
>
> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>Jeff:
>>
>>Well, it would be much better (no loops!) except, I think, for one
>>issue: "red" would match "barred" and I don't think that this is what
>>is wanted: the matches should be on whole "words" not just string
>>patterns.
>>
>>So you would need to fix up the matching pattern to make this work,
>>but it may be a little tricky, as arbitrary whitespace characters,
>>e.g. " " or "\n" etc. could be in the strings to be matched separating
>>the words or ending the "sentence."  I'm sure it can be done, but I'll
>>leave it to you or others to figure it out.
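
A small illustration of the concern above, using made-up strings (none of them come from the thread's data). A bare pattern matches "red" inside "barred", while a pattern wrapped in "\\b" word boundaries matches only whole words, whether the neighbouring character is a space, a newline, or punctuation.

```r
# Hypothetical test strings, chosen to contrast substring and whole-word matching.
x <- c("the gate was barred", "a red\nballoon", "painted red.", "redder still")

# Plain pattern: matches "red" anywhere, including inside longer words.
grepl("red", x)
# TRUE TRUE TRUE TRUE

# Word-boundary pattern: "red" must stand alone as a word.
grepl("\\bred\\b", x)
# FALSE TRUE TRUE FALSE
```
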
>>
>>Of course, if my diagnosis is wrong or silly, please point this out.
>>
>>Cheers,
>>Bert
>>
>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>><jdnewmil at dcn.davis.ca.us> wrote:
>>> I think grep is better suited to this:
>>>
>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>>>
>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>>single, not a double, loop. It should be more efficient.
>>>>
>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>>+       function(x)any(x %in% alarm.words))
>>>>
>>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>>
>>>>The idea is to paste the strings in each row (do.call allows an
>>>>arbitrary number of columns) into a single string and then use
>>>>strsplit to break the string into individual "words" on whitespace.
>>>>Then the matching is vectorized with the any( %in% ... ) call.
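
To make the steps above concrete, here is a sketch on two made-up rows (the strings and the names `rows`, `words`, and `hits` are illustrative, not from the thread). It also shows one limitation of splitting on whitespace alone: punctuation stays attached to the token, so "red," is not the same element as "red".

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Two hypothetical pasted rows; the second has a comma attached to "red".
rows <- c("green turtle ronald weasley", "sparks were red, not blue")

# Step 1: split each string into tokens on runs of whitespace.
words <- strsplit(rows, "[[:space:]]+")

# Step 2: for each row, is any token an exact element of alarm.words?
hits <- sapply(words, function(x) any(x %in% alarm.words))
hits
# TRUE FALSE   ("red," is a different token from "red")
```
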
>>>>
>>>>Cheers,
>>>>Bert
>>>>
>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>> Dear Chris,
>>>>>
>>>>> If I understand correctly what you want, how about the following?
>>>>>
>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>>> zz[rows, ]
>>>>>
>>>>>           v1                              v2                v3 v4
>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>>
>>>>> I hope this helps,
>>>>>  John
>>>>>
>>>>> ------------------------------------------------
>>>>> John Fox, Professor
>>>>> McMaster University
>>>>> Hamilton, Ontario, Canada
>>>>> http://socserv.mcmaster.ca/jfox/
>>>>>
>>>>>
>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>>  "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>>> Running R 3.1.1 on windows 7
>>>>>>
>>>>>> I want to identify as a case any record in a dataframe that contains any
>>>>>> of several keywords in any of several variables.
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> # create a dataframe with 4 variables and 10 records
>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>>> "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>>> "hello world", "yellow giraffe with a long neck", "black bear")
>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>>> "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>>> "white dress robes", "gandalf the white", "gandalf the grey")
>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>>> stringsAsFactors=FALSE)
>>>>>> str(zz)
>>>>>> zz
>>>>>>
>>>>>> # here are the keywords
>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>>>
>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>>> # set zz$v5=TRUE for that record.
>>>>>>
>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>>> # obviously misunderstand something about how str_detect works
>>>>>>
>>>>>> library(stringr)
>>>>>>
>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>>>                                      # must be a vector, not multiple
>>>>>>                                      # columns
>>>>>>
>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>>>
>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>>>                                      # is less than the number of
>>>>>>                                      # rows I am using for the
>>>>>>                                      # comparison
>>>>>>
>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>>> length(alarm.words)                  # confining nrows
>>>>>>                                      # to the length of alarm.words
>>>>>>
>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>>>
>>>>>> # maybe I need apply() ?
>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>>>
>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>>>                            # between alarm.words and that
>>>>>>                            # in which I am searching for
>>>>>>                            # matching strings
>>>>>>
>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>>>                            # rows of the dataframe
>>>>>>
>>>>>>
>>>>>> # perhaps %in% could do the job?
>>>>>>
>>>>>> Appreciate any advice.
>>>>>>
>>>>>> --Chris Ryan
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.


