[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

John Fox jfox at mcmaster.ca
Thu Jul 9 21:24:00 CEST 2015


Dear Christopher,

My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time. 

That said, Bert Gunter's solution was about 5 times faster in a simple check that I ran with microbenchmark, and Jeff Newmiller's solution was about 10 times faster. Both Bert's and Jeff's (eventual) solutions protect against partial (rather than full-word) matches, while mine doesn't (though it could easily be modified to do that).
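
To make the partial-match difference concrete, here is a minimal sketch of the three approaches from this thread, run on a deterministic copy of Chris's example data (v1 and v4 are left out, since they play no part in the matching). The extra "barred owl" row is hypothetical and added only to expose substring matches, the object names fox, gunter, newmiller and fox.whole are labels invented for this sketch, and the last variant is just one possible way of adding word boundaries to the apply/sapply approach. Timings are not reproduced here.

```r
# Example data from the original question, without the random columns.
zz <- data.frame(
  v2 = c("white bird", "blue bird", "green turtle", "quick brown fox",
         "big black dog", "waffle the hamster", "benny likes food a lot",
         "hello world", "yellow giraffe with a long neck", "black bear"),
  v3 = c("harry potter", "hermione grainger", "ronald weasley",
         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
         "white dress robes", "gandalf the white", "gandalf the grey"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical extra row: "barred" contains "red" only as a substring.
zz <- rbind(zz, data.frame(v2 = "barred owl", v3 = "albus dumbledore",
                           stringsAsFactors = FALSE))

# Substring matching: one grepl() call per alarm word, per row.
fox <- apply(zz[, c("v2", "v3")], 1,
             function(x) any(sapply(alarm.words, grepl, x = x)))

# Whole-word matching: split on whitespace, then compare tokens with %in%.
gunter <- sapply(strsplit(do.call(paste, zz[, c("v2", "v3")]), "[[:space:]]+"),
                 function(x) any(x %in% alarm.words))

# Whole-word matching: a single regular expression with \\b word boundaries.
pattern <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")
newmiller <- grepl(pattern, do.call(paste, zz[, c("v2", "v3")]))

# The first approach again, with each alarm word wrapped in \\b ... \\b.
fox.whole <- apply(zz[, c("v2", "v3")], 1,
                   function(x) any(sapply(paste0("\\b", alarm.words, "\\b"),
                                          grepl, x = x)))

# Side-by-side comparison, one row per record.
print(data.frame(fox = unname(fox), gunter, newmiller,
                 fox.whole = unname(fox.whole)))
```

On these data the four results agree for the original ten rows; only the added "barred owl" row differs, being flagged by the substring version and by none of the whole-word versions.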

Best,
 John

> -----Original Message-----
> From: Christopher W Ryan [mailto:cryan at binghamton.edu]
> Sent: July-09-15 2:49 PM
> To: Bert Gunter
> Cc: Jeff Newmiller; R Help; John Fox
> Subject: Re: [R] detecting any element in a vector of strings, appearing
> anywhere in any of several character variables in a dataframe
> 
> Thanks everyone.  John's original solution worked great.  And with
> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only
> about 15 seconds.  That is certainly adequate for my needs.  But I
> will try out the other strategies too.
> 
> And thanks also for lots of new R things to learn--grep, grepl,
> do.call . . . that's always a bonus!
> 
> --Chris Ryan
> 
> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com>
> wrote:
> > Yup, that does it. Let grep figure out what's a word rather than doing
> > it manually. Forgot about "\b"
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "Data is not information. Information is not knowledge. And knowledge
> > is certainly not wisdom."
> >    -- Clifford Stoll
> >
> >
> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
> > <jdnewmil at dcn.davis.ca.us> wrote:
> >> Just add a word break marker before and after:
> >>
> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ),
> >>                  do.call( paste, zz[ , 2:3 ] ) )
> >> ---------------------------------------------------------------------------
> >> Jeff Newmiller                        The     .....       .....  Go Live...
> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >>                                       Live:   OO#.. Dead: OO#..  Playing
> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >> ---------------------------------------------------------------------------
> >> Sent from my phone. Please excuse my brevity.
> >>
> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>>Jeff:
> >>>
> >>>Well, it would be much better (no loops!) except, I think, for one
> >>>issue: "red" would match "barred" and I don't think that this is what
> >>>is wanted: the matches should be on whole "words" not just string
> >>>patterns.
> >>>
> >>>So you would need to fix up the matching pattern to make this work,
> >>>but it may be a little tricky, as arbitrary whitespace characters,
> >>>e.g. " " or "\n" etc. could be in the strings to be matched separating
> >>>the words or ending the "sentence."  I'm sure it can be done, but I'll
> >>>leave it to you or others to figure it out.
> >>>
> >>>Of course, if my diagnosis is wrong or silly, please point this out.
> >>>
> >>>Cheers,
> >>>Bert
> >>>
> >>>
> >>>
> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
> >>><jdnewmil at dcn.davis.ca.us> wrote:
> >>>> I think grep is better suited to this:
> >>>>
> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ),
> >>>>                  do.call( paste, zz[ , 2:3 ] ) )
> >>>>
> >>>> Sent from my phone. Please excuse my brevity.
> >>>>
> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> >>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
> >>>>>single, not a double, loop. It should be more efficient.
> >>>>>
> >>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
> >>>>>+       function(x)any(x %in% alarm.words))
> >>>>>
> >>>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
> >>>>>
> >>>>>The idea is to paste the strings in each row (do.call allows an
> >>>>>arbitrary number of columns) into a single string and then use
> >>>>>strsplit to break the string into individual "words" on whitespace.
> >>>>>Then the matching is vectorized with the any( %in% ... ) call.
> >>>>>
> >>>>>Cheers,
> >>>>>Bert
> >>>>>
> >>>>>
> >>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
> >>>>>> Dear Chris,
> >>>>>>
> >>>>>> If I understand correctly what you want, how about the following?
> >>>>>>
> >>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
> >>>>>>> zz[rows, ]
> >>>>>>
> >>>>>>           v1                              v2                v3 v4
> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
> >>>>>>
> >>>>>> I hope this helps,
> >>>>>>  John
> >>>>>>
> >>>>>> ------------------------------------------------
> >>>>>> John Fox, Professor
> >>>>>> McMaster University
> >>>>>> Hamilton, Ontario, Canada
> >>>>>> http://socserv.mcmaster.ca/jfox/
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
> >>>>>>  "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
> >>>>>>> Running R 3.1.1 on windows 7
> >>>>>>>
> >>>>>>> I want to identify as a case any record in a dataframe that contains any
> >>>>>>> of several keywords in any of several variables.
> >>>>>>>
> >>>>>>> Example:
> >>>>>>>
> >>>>>>> # create a dataframe with 4 variables and 10 records
> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
> >>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
> >>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
> >>>>>>>                  stringsAsFactors=FALSE)
> >>>>>>> str(zz)
> >>>>>>> zz
> >>>>>>>
> >>>>>>> # here are the keywords
> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
> >>>>>>>
> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
> >>>>>>> # set zz$v5=TRUE for that record.
> >>>>>>>
> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
> >>>>>>> # obviously misunderstand something about how str_detect works
> >>>>>>> library(stringr)
> >>>>>>>
> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
> >>>>>>>                                      # must be a vector, not multiple
> >>>>>>>                                      # columns
> >>>>>>>
> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
> >>>>>>>
> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
> >>>>>>>                                      # is less than the number of
> >>>>>>>                                      # rows I am using for the
> >>>>>>>                                      # comparison
> >>>>>>>
> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
> >>>>>>> length(alarm.words)                  # confining nrows
> >>>>>>>                                      # to the length of alarm.words
> >>>>>>>
> >>>>>>> str_detect(zz, alarm.words)          # obviously not right
> >>>>>>>
> >>>>>>> # maybe I need apply() ?
> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
> >>>>>>>
> >>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
> >>>>>>>                            # between alarm.words and that
> >>>>>>>                            # in which I am searching for
> >>>>>>>                            # matching strings
> >>>>>>>
> >>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
> >>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
> >>>>>>>                            # rows of the dataframe
> >>>>>>>
> >>>>>>>
> >>>>>>> # perhaps %in% could do the job?
> >>>>>>>
> >>>>>>> Appreciate any advice.
> >>>>>>>
> >>>>>>> --Chris Ryan
> >>>>>>>
> >>>>>>> ______________________________________________
> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>>

