[R] Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter bgunter.4567 at gmail.com
Fri Oct 19 04:34:51 CEST 2018


Sorry. Typo.  The last line should be:

ans$Result <- apply(ans,1,function(r)phrasewords[[r[1]]] %allin%
tweets[[r[2]]])
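
For anyone who wants to run the corrected version end to end, here is a minimal self-contained sketch. The toy `st` and `th` data frames are made-up stand-ins (assumptions, not the thread's real data), and `%allin%` is written here as `all(x %in% table)`, which should give the same answers as the `prod(match(...))` form quoted below.

```r
## Toy stand-ins for the thread's data (hypothetical, not the real tweets)
st <- data.frame(terms = c("i feel worthless", "me hurt depressed"),
                 stringsAsFactors = FALSE)
th <- data.frame(text = c("I wish I could take credit for this",
                          " i xxxx worthless yxxc ght feel"),
                 stringsAsFactors = FALSE)

## trim leading/trailing white space, lower-case, split on runs of spaces
getwords <- function(x)
  strsplit(gsub("^[[:space:]]+|[[:space:]]+$", "", tolower(x)), split = " +")

## TRUE iff every element of x occurs in table
`%allin%` <- function(x, table) all(x %in% table)

phrasewords <- getwords(st$terms)
tweets <- getwords(th$text)

## one row per (phrase, tweet) pair; the indices do the bookkeeping
ans <- expand.grid(phrases = seq_along(phrasewords),
                   tweets = seq_along(tweets), Result = FALSE)
ans$Result <- apply(ans, 1, function(r)
  phrasewords[[r[1]]] %allin% tweets[[r[2]]])

## rows where every word of the phrase occurs in the tweet
ans[ans$Result, ]
```

With this toy input only the pair (phrase 1, tweet 2) should come back as a match.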

-- Bert



On Thu, Oct 18, 2018 at 7:04 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> All (especially Nathan): **Please feel free to ignore this post without
> response.** It just represents a bit of OCD-ness on my part that may or may
> not be of interest to anyone else.
>
> Purpose of this post: To give an alternative considerably simpler and
> considerably faster solution to the problem than those which I offered
> previously. It may or may not be what the OP asked for, but the improvement
> exercise was instructive to me. Notation is as previously in this thread.
>
> New solution:
>
> getwords <- function(x)
>   strsplit(gsub("(^[[:space:]]+)|([[:space:]]+$)", "", tolower(x)),
>            split = " +")
> ## split lower-cased text into a vector of "words"
> ## I made this a bit fancier to handle some "corner" cases, but the
> ## previous simpler version may well suffice.
>
> '%allin%' <- function(x, table)prod(match(x,table, nomatch = 0L)) > 0L
> ## a convenience function/operator that improves efficiency.
>
> ## lists of  search word vectors as before
> phrasewords <- getwords(st$terms)
> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
> ## the tweets + one additional
>
> ## simpler approach just using indexing for the bookkeeping that nested
> ## _apply loops previously were used for
> ans <- expand.grid(phrases = seq_along(phrasewords),tweets =
> seq_along(tweets), Result = FALSE)
> ans$Result <- apply(ind,1,function(r)phrasewords[[r[1]]] %allin%
> tweets[[r[2]]])
>
> ## ans is a data frame in which the first column indexes phrases and the
> ## second tweets.
> ## The ith row of ans$Result == TRUE iff all the words in the phrase
> ## indexed by the ith row of the phrase column are contained in the tweet
> ## indexed by that row's tweet column.
>
> This was way faster than my previous offerings.
>
> Note also that just the matching phrases and tweets can be extracted as
> usual by:
>
> > ans[ans[,3],]
>    phrases tweets Result
> 42       6      7   TRUE
> ## all the words in the 6th search phrase appeared in the 7th tweet.
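
The original question asked for a single flag per tweet rather than one row per (phrase, tweet) pair. A sketch of one way to collapse such a table, assuming `ans` has the `tweets` and `Result` columns described above; the small `ans` built here is a made-up stand-in so that the snippet runs on its own.

```r
## Hypothetical stand-in for 'ans': 2 phrases x 3 tweets, one match
ans <- expand.grid(phrases = 1:2, tweets = 1:3, Result = FALSE)
ans$Result[ans$phrases == 2 & ans$tweets == 3] <- TRUE

## TRUE for a tweet iff at least one phrase matched it
flag <- as.vector(tapply(ans$Result, ans$tweets, any))
flag
```

The resulting vector has one element per tweet index, so it could be attached as a column to a data frame whose rows are in the same order as `tweets`.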
>
> ** I promise to natter on about this no longer! **
>
> Cheers,
> Bert
>
>
> On Wed, Oct 17, 2018 at 7:50 PM Bert Gunter <bgunter.4567 using gmail.com>
> wrote:
>
>>
>> If you wish to use R, you need to at least understand its basic data
>> structures and functionality. Expecting that mimicry of code in special
>> packages will suffice is, I believe, an illusion. If you haven't already
>> done so, you should go through a basic R tutorial or two (there are many on
>> the web; some recommendations, by no means necessarily "the best",  can be
>> found here:
>> https://www.rstudio.com/online-learning/#r-programming).
>>
>> Having said that, I realized that my previous "solution" using regular
>> expressions was more complicated than it needed to be and somewhat foolish
>> (so much for all my "expertise"). A simpler and better approach is simply
>> to break up both the tweet texts and your search phrases into vectors of
>> their "words" (i.e. character strings surrounded by spaces) using
>> strsplit(), and then use R's built-in matching capabilities with %in%.
>> This is quite straightforward, pretty robust (no regex's to wrestle with),
>> and does not require "herculean efforts" to understand. The only wrinkle is
>> some bookkeeping with the "apply" family of functions. These are, as you
>> may know, the functional programming way of handling iteration (loops), but
>> they are what I would consider part of "basic" R functionality and worth
>> spending the time to learn about.
>>
>> Herewith my better, simpler proposal, using your example data as before:
>>
>> getwords <- function(x)strsplit(tolower(x),split = " +")
>> ## split text into a vector of lower-cased "words"
>>
>> phrasewords <- structure(getwords(st$terms), names = st$terms)
>> ## named list of your search word vectors
>>
>> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
>> ## the tweets + one additional that should match the last phrase
>>
>> ans <- lapply(phrasewords, function(x) apply(sapply(tweets,function(y)x
>> %in% y), 2, all))
>> ## a list indexed by the search phrases,
>> ## with each component a vector of logicals with vec[i] == TRUE iff
>> ## the ith tweet contains all the words in the search phrase
>>
>> > ans
>> $`me abused depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`me hurt depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`feel hopeless depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`feel alone depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`i feel helpless`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`i feel worthless`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
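
One caveat with the nested call above: if a search phrase consists of a single word, `sapply` returns a plain vector rather than a matrix, and `apply(..., 2, all)` then fails. A sketch of a variant that sidesteps this with `vapply`; the data are made-up stand-ins and `matchall` is a hypothetical helper name, not something from the thread.

```r
getwords <- function(x) strsplit(tolower(x), split = " +")

## Hypothetical toy data, including a one-word phrase
terms <- c("i feel worthless", "dreams")
phrasewords <- structure(getwords(terms), names = terms)
tweets <- getwords(c("i tend to think about my dreams before i sleep.",
                     "i xxxx worthless yxxc ght feel"))

## for one phrase: logical vector over tweets, TRUE iff all words are present
matchall <- function(x)
  vapply(tweets, function(y) all(x %in% y), logical(1))

ans <- lapply(phrasewords, matchall)
ans
```

Because `all(x %in% y)` always yields a single logical, the shape of the result does not depend on how many words a phrase has.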
>>
>> -- Bert
>>
>> On Wed, Oct 17, 2018 at 9:20 AM Nathan Parsons <
>> nathan.f.parsons using gmail.com> wrote:
>>
>>> I do not have your command of base R, Bert. That is a herculean effort!
>>> Here’s what I spent my night putting together:
>>>
>>> ## Create search terms
>>> ## dput(st)
>>> st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"
>>> ), word2 = c("olympic", "abused", "hurt", "hopeless", "alone"
>>> ), word3 = c("lifts", "depressed", "depressed", "depressed",
>>> "depressed")), class = c("tbl_df", "tbl", "data.frame"), row.names =
>>> c(NA,
>>> -5L))
>>>
>>> ## Create tweets
>>> ## dput(th)
>>> th <- structure(list(status_id = c("x1047841705729306624",
>>> "x1046966595610927105",
>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>> "x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
>>> "x1048269727582355457", "x1048092408544677890"), created_at =
>>> c("2018-10-04T13:31:45Z",
>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z",
>>> "2018-10-04T10:41:11Z", "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"
>>> ), text = c("technique is everything with olympic lifts ! @ body by john
>>> ",
>>> "@subtronics just went back and rewatched ur fblice with ur cdjs and let
>>> me tell you man. you are the fucking messiah",
>>> "@ic4rus1 opportunistic means short-game. as in getting drunk now vs.
>>> not being hung over tomorrow vs. not fucking up your life ten years later.",
>>> "i tend to think about my dreams before i sleep.", "@michaelavenatti
>>> @senatorcollins so if your client was in her 20s attending parties with
>>> teenagers doesnt that make her at the least immature as hell or at the
>>> worst a pedophile and a person contributing to the delinquency of minors?",
>>> "i wish i could take credit for this", "i woulda never imagined.
>>> #lakeshow ",
>>> "@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
>>> "sunday ill have a booth in katy at the real craft wives of katy fest
>>> @nolabelbrewco cmon yall!everything is better when you top it with
>>> tias!order today we ship to all 50 ",
>>> "dolly is so baddd"), lat = c(43.6835853, 40.284123, 37.7706565,
>>> 40.431389, 31.1688935, 33.9376735, 34.0207895, 44.900818, 29.7926,
>>> 32.364145), lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895,
>>> -100.0768885, -118.130426, -118.4119065, -89.5694915, -95.8224,
>>> -86.2447285), county_name = c("Cumberland County", "Delaware County",
>>> "San Francisco County", "Allegheny County", "Concho County",
>>> "Los Angeles County", "Los Angeles County", "Marathon County",
>>> "Harris County", "Montgomery County"), fips = c(23005L, 39041L,
>>> 6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
>>> state_name = c("Maine", "Ohio", "California", "Pennsylvania",
>>> "Texas", "California", "California", "Wisconsin", "Texas",
>>> "Alabama"), state_abb = c("ME", "OH", "CA", "PA", "TX", "CA",
>>> "CA", "WI", "TX", "AL"), urban_level = c("Medium Metro",
>>> "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>> "NonCore (Nonmetro)", "Large Central Metro", "Large Central Metro",
>>> "Small Metro", "Large Central Metro", "Medium Metro"), urban_code = c(3L,
>>> 2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L), population = c(277308L,
>>> 184029L, 830781L, 1160433L, 4160L, 9509611L, 9509611L, 127612L,
>>> 4233913L, 211037L), linenumber = 1:10), row.names = c(NA,
>>> 10L), class = "data.frame")
>>>
>>> ## Clean tweets - basically just remove everything we don’t need from
>>> ## the text including punctuation and urls
>>> th %>%
>>> mutate(linenumber = row_number(),
>>> text = str_remove_all(text, "[^\x01-\x7F]"),
>>> text = str_remove_all(text, "\n"),
>>> text = str_remove_all(text, ","),
>>> text = str_remove_all(text, "'"),
>>> text = str_remove_all(text, "&"),
>>> text = str_remove_all(text, "<"),
>>> text = str_remove_all(text, ">"),
>>> text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
>>> text = tolower(text)) -> th
>>>
>>> ## Create search function that looks for each search term in the
>>> ## provided string, evaluates if all three search terms have been found,
>>> ## and returns a logical
>>> srchr <- function(df) {
>>>   a <- str_detect(df, "olympic")
>>>   b <- str_detect(df, "technique")
>>>   c <- str_detect(df, "lifts")
>>>   a & b & c   # TRUE only when all three terms are present
>>> }
>>>
>>> ## Evaluate tweets for presence of search term
>>> th %>%
>>> mutate(flag = map_lgl(text, srchr)) -> th_flagged
>>>
>>> As far as I can tell, this works. I have to manually enter each set of
>>> search terms into the function, which is not ideal. Also, this only
>>> generates a True/False for each tweet based on one search term - I end up
>>> with an evaluatory column for each search term that I would then have to
>>> collapse together somehow. I’m sure there’s a more elegant solution.
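
One way to avoid typing each set of search terms by hand is to pass the terms to the search function as an argument. A minimal sketch in base R, where `grepl(fixed = TRUE)` stands in for `str_detect()`; `srchr2` and the toy inputs are hypothetical names and data, and like `srchr` it matches substrings rather than whole words.

```r
## TRUE for each element of 'text' that contains every string in 'terms'
srchr2 <- function(text, terms) {
  hits <- vapply(terms, function(tm) grepl(tm, text, fixed = TRUE),
                 logical(length(text)))
  ## force a text-by-term matrix (vapply gives a vector if length(text) == 1)
  hits <- matrix(hits, nrow = length(text))
  apply(hits, 1, all)
}

## Hypothetical toy data
text <- c("technique is everything with olympic lifts !",
          "i wish i could take credit for this")
st <- data.frame(word1 = c("technique", "me"),
                 word2 = c("olympic", "abused"),
                 word3 = c("lifts", "depressed"),
                 stringsAsFactors = FALSE)

## one logical column per search term, then collapse the columns with any()
flags <- sapply(seq_len(nrow(st)), function(i) srchr2(text, unlist(st[i, ])))
flag <- apply(flags, 1, any)
flag
```

The final `apply(flags, 1, any)` is the "collapse together" step: a tweet is flagged if any one of the search terms is fully matched.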
>>>
>>> --
>>>
>>> Nate Parsons
>>> Pronouns: He, Him, His
>>> Graduate Teaching Assistant
>>> Department of Sociology
>>> Portland State University
>>> Portland, Oregon
>>>
>>> 503-725-9025
>>> 503-725-3957 FAX
>>> On Oct 16, 2018, 7:20 PM -0700, Bert Gunter <bgunter.4567 using gmail.com>,
>>> wrote:
>>>
>>> OK, as no one else has offered a solution, I'll take a whack at it.
>>>
>>> Caveats: This is a brute force attempt using R's basic regular
>>> expression engine. It is inelegant and barely tested, so likely to be at
>>> best incomplete and buggy, and at worst, incorrect. But maybe Nathan or
>>> someone else on the list can fix it up. So if (when) it breaks, complain on
>>> the list to give someone (almost certainly not me) the opportunity.
>>>
>>> The basic idea is that the tweets are just character strings and the
>>> search phrases are just character vectors all of whose elements must match
>>> "appropriately" -- i.e. they must match whole words -- in the character
>>> strings. So my desired output from the code is a list indexed by the search
>>> phrases, each of whose components is a logical vector of length the number
>>> of tweets each of whose elements = TRUE iff all the words in the search
>>> phrase match somewhere in the tweet.
>>>
>>> Here's the code(using the data Nathan provided):
>>>
>>> > words <- sapply(st[[1]],strsplit,split = " +" )
>>> ## convert the phrases to a list of character vectors of the words
>>> ## Result:
>>> > words
>>> $`me abused depressed`
>>> [1] "me"        "abused"    "depressed"
>>>
>>> $`me hurt depressed`
>>> [1] "me"        "hurt"      "depressed"
>>>
>>> $`feel hopeless depressed`
>>> [1] "feel"      "hopeless"  "depressed"
>>>
>>> $`feel alone depressed`
>>> [1] "feel"      "alone"     "depressed"
>>>
>>> $`i feel helpless`
>>> [1] "i"        "feel"     "helpless"
>>>
>>> $`i feel worthless`
>>> [1] "i"         "feel"      "worthless"
>>>
>>> > expand.words <- function(z)
>>>     lapply(z, function(x) paste0(c("^ *"," "," "), x, c(" "," "," *$")))
>>> ## function to create regexes for words when they are at the beginning,
>>> ## middle, or end of tweets
>>>
>>> > wordregex <- lapply(words,expand.words)
>>> ##Result
>>> ## too lengthy to include
>>> ##
>>> > tweets <- th$text
>>> ##extract the tweets
>>> > findin <- function(x,y)
>>>    ## x is a vector of regex patterns
>>>    ## y is a character vector
>>>    ## value = vector, vec, with length(vec) == length(y) and
>>>    ## vec[i] == TRUE iff any of x matches y[i]
>>> { apply(sapply(x,function(z)grepl(z,y)), 1,any)
>>> }
>>>
>>> ## add a matching "tweet" to the tweet vector:
>>> > tweets <- c(tweets," i xxxx worthless yxxc ght feel")
>>>
>>> > ans <-
>>> lapply(wordregex,function(z)apply(sapply(z,function(x)findin(x,tweets)), 1,
>>> all))
>>> ## Result:
>>> > ans
>>> $`me abused depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`me hurt depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`feel hopeless depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`feel alone depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`i feel helpless`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`i feel worthless`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>>>
>>> ## None of the tweets match any of the phrases except for the last tweet
>>> ## that I added.
>>>
>>> ## Note: you need to add capabilities to handle upper and lower case.
>>> ## See, e.g. ?casefold
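
As a sketch of the case handling mentioned in the note, one option is to lower-case both sides before matching. The variant below also swaps the space-delimited patterns for `\\b` word boundaries, which is a different technique from the one used above (it treats punctuation as a word delimiter too, and it does not escape regex metacharacters in the search words). `hasword` is a hypothetical helper and the data are made up.

```r
## TRUE iff 'word' occurs as a whole word in each element of 'text',
## ignoring case
hasword <- function(word, text)
  grepl(paste0("\\b", tolower(word), "\\b"), tolower(text))

tweets <- c("I FEEL so Worthless today", "feeling fine")
phrase <- c("i", "feel", "worthless")

## tweet-by-word matrix of matches, then require every word per tweet
m <- vapply(phrase, hasword, logical(length(tweets)), text = tweets)
ans <- apply(m, 1, all)
ans
```

Note that "feeling" in the second toy tweet does not count as a match for "feel", because the pattern requires a word boundary on both sides.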
>>>
>>> Cheers,
>>> Bert
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <bgunter.4567 using gmail.com>
>>> wrote:
>>>
>>>> The problem wasn't the data tibbles. You posted in html -- which you
>>>> were explicitly warned against -- and that corrupted your text (e.g. some
>>>> quotes became "smart quotes", which cannot be properly cut and pasted into
>>>> R).
>>>>
>>>> Bert
>>>>
>>>>
>>>> On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <
>>>> nathan.f.parsons using gmail.com> wrote:
>>>>
>>>>> Argh! Here are those two example datasets as data frames (not tibbles).
>>>>> Sorry again. This apparently is just not my day.
>>>>>
>>>>>
>>>>> th <- structure(list(status_id = c("x1047841705729306624",
>>>>> "x1046966595610927105",
>>>>>
>>>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>>>>
>>>>> "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
>>>>>
>>>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>>>>
>>>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is
>>>>> everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt
>>>>> ",
>>>>>
>>>>> "@Subtronics just went back and rewatched ur FBlice with ur CDJs and
>>>>> let me
>>>>> tell you man. You are the fucking messiah",
>>>>>
>>>>> "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs.
>>>>> not
>>>>> being hung over tomorrow vs. not fucking up your life ten years
>>>>> later.",
>>>>>
>>>>> "I tend to think about my dreams before I sleep.", "@MichaelAvenatti
>>>>> @SenatorCollins So,  if your client was in her 20s, attending parties
>>>>> with
>>>>> teenagers, doesn't that make her at the least immature as hell, or at
>>>>> the
>>>>> worst, a pedophile and a person contributing to the delinquency of
>>>>> minors?",
>>>>>
>>>>>
>>>>> "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
>>>>>
>>>>> 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
>>>>>
>>>>> -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
>>>>>
>>>>> ), county_name = c("Cumberland County", "Delaware County", "San
>>>>> Francisco
>>>>> County",
>>>>>
>>>>> "Allegheny County", "Concho County", "Los Angeles County"), fips =
>>>>> c(23005L,
>>>>>
>>>>>
>>>>> 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
>>>>>
>>>>> "Ohio", "California", "Pennsylvania", "Texas", "California"),
>>>>>
>>>>>     state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level =
>>>>> c("Medium Metro",
>>>>>
>>>>>     "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>>>>
>>>>>     "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
>>>>>
>>>>>     2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
>>>>>
>>>>>     1160433L, 4160L, 9509611L)), class = "data.frame", row.names =
>>>>> c(NA,
>>>>>
>>>>> -6L))
>>>>>
>>>>>
>>>>> st <- structure(list(terms = c("me abused depressed",
>>>>> "me hurt depressed", "feel hopeless depressed", "feel alone depressed",
>>>>> "i feel helpless", "i feel worthless")), row.names = c(NA, -6L),
>>>>> class = c("tbl_df", "tbl", "data.frame"))
>>>>>
>>>>> On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons <
>>>>> nathan.f.parsons using gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Thanks all for your patience. Here’s a second go that is perhaps more
>>>>> > explicative of what it is I am trying to accomplish (and hopefully
>>>>> in plain
>>>>> > text form)...
>>>>> >
>>>>> >
>>>>> > I’m using the following packages: tidyverse, purrr, tidytext
>>>>> >
>>>>> >
>>>>> > I have a number of tweets in the following form:
>>>>> >
>>>>> >
>>>>> > th <- structure(list(status_id = c("x1047841705729306624",
>>>>> > "x1046966595610927105",
>>>>> >
>>>>> > "x1047094786610552832", "x1046988542818308097",
>>>>> "x1046934493553221632",
>>>>> >
>>>>> > "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
>>>>> >
>>>>> > "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z",
>>>>> "2018-10-02T05:01:35Z",
>>>>> >
>>>>> > "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique
>>>>> is
>>>>> > everything with olympic lifts ! @ Body By John
>>>>> https://t.co/UsfR6DafZt",
>>>>> >
>>>>> > "@Subtronics just went back and rewatched ur FBlice with ur CDJs and
>>>>> let
>>>>> > me tell you man. You are the fucking messiah",
>>>>> >
>>>>> > "@ic4rus1 Opportunistic means short-game. As in getting drunk now
>>>>> vs. not
>>>>> > being hung over tomorrow vs. not fucking up your life ten years
>>>>> later.",
>>>>> >
>>>>> > "I tend to think about my dreams before I sleep.", "@MichaelAvenatti
>>>>> > @SenatorCollins So, if your client was in her 20s, attending parties
>>>>> with
>>>>> > teenagers, doesn't that make her at the least immature as hell, or
>>>>> at the
>>>>> > worst, a pedophile and a person contributing to the delinquency of
>>>>> minors?",
>>>>> >
>>>>> > "i wish i could take credit for this"), lat = c(43.6835853,
>>>>> 40.284123,
>>>>> >
>>>>> > 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
>>>>> >
>>>>> > -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
>>>>> >
>>>>> > ), county_name = c("Cumberland County", "Delaware County", "San
>>>>> Francisco
>>>>> > County",
>>>>> >
>>>>> > "Allegheny County", "Concho County", "Los Angeles County"), fips =
>>>>> > c(23005L,
>>>>> >
>>>>> > 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
>>>>> >
>>>>> > "Ohio", "California", "Pennsylvania", "Texas", "California"),
>>>>> >
>>>>> > state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level =
>>>>> c("Medium
>>>>> > Metro",
>>>>> >
>>>>> > "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>>>> >
>>>>> > "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
>>>>> >
>>>>> > 2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
>>>>> >
>>>>> > 1160433L, 4160L, 9509611L)), class = c("data.table", "data.frame"
>>>>> >
>>>>> > ), row.names = c(NA, -6L), .internal.selfref = )
>>>>> >
>>>>> >
>>>>> > I also have a number of search terms in the following form:
>>>>> >
>>>>> >
>>>>> > st <- structure(list(terms = c("me abused depressed", "me hurt
>>>>> depressed",
>>>>> >
>>>>> > "feel hopeless depressed", "feel alone depressed", "i feel helpless",
>>>>> >
>>>>> > "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
>>>>> >
>>>>> > "tbl", "data.frame”))
>>>>> >
>>>>> >
>>>>> > I am trying to isolate the tweets that contain all of the words in
>>>>> each of
>>>>> > the search terms, i.e “me” “abused” and “depressed” from the first
>>>>> example
>>>>> > search term, but they do not have to be in order or even next to one
>>>>> > another.
>>>>> >
>>>>> >
>>>>> > I am familiar with the dplyr suite of tools and have been attempting
>>>>> to
>>>>> > generate some sort of ‘filter()’ to do this. I am not very familiar
>>>>> with
>>>>> > purrr, but there may be a solution using the map function? I have
>>>>> also
>>>>> > explored the tidytext ‘unnest_tokens’ function which transforms the
>>>>> ’th’
>>>>> > data in the following way:
>>>>> >
>>>>> >
>>>>> > > tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
>>>>> >
>>>>> > > head(tt)
>>>>> >
>>>>> > status_id created_at lat lng
>>>>> >
>>>>> > 1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >
>>>>> > 2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >
>>>>> > 3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >
>>>>> > 4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >
>>>>> > 5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >
>>>>> > 6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>>>>> >
>>>>> > county_name fips state_name state_abb urban_level urban_code
>>>>> >
>>>>> > 1: Cumberland County 23005 Maine ME Medium Metro 3
>>>>> >
>>>>> > 2: Cumberland County 23005 Maine ME Medium Metro 3
>>>>> >
>>>>> > 3: Cumberland County 23005 Maine ME Medium Metro 3
>>>>> >
>>>>> > 4: Cumberland County 23005 Maine ME Medium Metro 3
>>>>> >
>>>>> > 5: Cumberland County 23005 Maine ME Medium Metro 3
>>>>> >
>>>>> > 6: Cumberland County 23005 Maine ME Medium Metro 3
>>>>> >
>>>>> > population word
>>>>> >
>>>>> > 1: 277308 technique
>>>>> >
>>>>> > 2: 277308 is
>>>>> >
>>>>> > 3: 277308 everything
>>>>> >
>>>>> > 4: 277308 with
>>>>> >
>>>>> > 5: 277308 olympic
>>>>> >
>>>>> > 6: 277308 lifts
>>>>> >
>>>>> >
>>>>> > but once I have unnested the tokens, I am unable to recombine them
>>>>> back
>>>>> > into tweets.
>>>>> >
>>>>> >
>>>>> > Ideally the end result would append a new column to the ‘th’ data
>>>>> that
>>>>> > would flag a tweet that contained all of the search words for any of
>>>>> the
>>>>> > search terms; so the work flow would look like
>>>>> >
>>>>> > 1) look for all search words for one search term in a tweet
>>>>> >
>>>>> > 2) if all of the search words in the search term are found, create a
>>>>> flag
>>>>> > (mutate(flag = 1) or some such)
>>>>> >
>>>>> > 3) do this for all of the tweets
>>>>> >
>>>>> > 4) move on to the next search term and repeat
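
The four steps above can be sketched as a plain loop. This is only an illustration under assumptions: `th` and `st` here are tiny made-up stand-ins with the same `text` and `terms` columns, and words are taken to be separated by spaces.

```r
## Hypothetical stand-ins for the tweet and search-term data frames
th <- data.frame(text = c("i feel so worthless", "dolly is so baddd"),
                 stringsAsFactors = FALSE)
st <- data.frame(terms = c("me hurt depressed", "i feel worthless"),
                 stringsAsFactors = FALSE)

tweetwords <- strsplit(tolower(th$text), " +")
th$flag <- 0
for (term in st$terms) {                        # 4) each search term in turn
  words <- strsplit(tolower(term), " +")[[1]]
  for (i in seq_along(tweetwords)) {            # 3) every tweet
    if (all(words %in% tweetwords[[i]]))        # 1) all search words found?
      th$flag[i] <- 1                           # 2) set the flag
  }
}
th
```

A flag set by one search term is never reset by a later one, so a tweet ends up flagged if it matches any of the terms.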
>>>>> >
>>>>> >
>>>>> > Again, my thanks for your patience.
>>>>> >
>>>>> >
>>>>> > --
>>>>> >
>>>>> >
>>>>> > Nate Parsons
>>>>> >
>>>>> > Pronouns: He, Him, His
>>>>> >
>>>>> > Graduate Teaching Assistant
>>>>> >
>>>>> > Department of Sociology
>>>>> >
>>>>> > Portland State University
>>>>> >
>>>>> > Portland, Oregon
>>>>> >
>>>>> >
>>>>> > 503-725-9025
>>>>> >
>>>>> > 503-725-3957 FAX
>>>>> >
>>>>>
>>>>>         [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
