[R] Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter bgunter.4567 at gmail.com
Wed Oct 17 00:03:13 CEST 2018


The problem wasn't the data tibbles. You posted in HTML -- which you were
explicitly warned against -- and that corrupted your text (e.g. some quotes
became "smart quotes", which cannot be properly cut and pasted into R).

Bert


On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <nathan.f.parsons using gmail.com>
wrote:

> Argh! Here are those two example datasets as data frames (not tibbles).
> Sorry again. This apparently is just not my day.
>
>
> th <- structure(list(
>   status_id = c("x1047841705729306624", "x1046966595610927105",
>     "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>     "x1047227442899775488"),
>   created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
>     "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z", "2018-10-02T01:26:49Z",
>     "2018-10-02T20:50:53Z"),
>   text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
>     "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
>     "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
>     "I tend to think about my dreams before I sleep.",
>     "@MichaelAvenatti @SenatorCollins So,  if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
>     "i wish i could take credit for this"),
>   lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
>   lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
>   county_name = c("Cumberland County", "Delaware County", "San Francisco County",
>     "Allegheny County", "Concho County", "Los Angeles County"),
>   fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
>   state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
>   state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
>   urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
>     "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
>   urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
>   population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)
> ), class = "data.frame", row.names = c(NA, -6L))
>
>
> st <- structure(list(
>   terms = c("me abused depressed", "me hurt depressed", "feel hopeless depressed",
>     "feel alone depressed", "i feel helpless", "i feel worthless")
> ), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
>
> On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons <nathan.f.parsons using gmail.com>
> wrote:
>
> > Thanks all for your patience. Here's a second go that perhaps better
> > explains what I am trying to accomplish (and hopefully in plain text
> > form)...
> >
> >
> > I’m using the following packages: tidyverse, purrr, tidytext
> >
> >
> > I have a number of tweets in the following form:
> >
> >
> > th <- structure(list(
> >   status_id = c("x1047841705729306624", "x1046966595610927105",
> >     "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
> >     "x1047227442899775488"),
> >   created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
> >     "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z", "2018-10-02T01:26:49Z",
> >     "2018-10-02T20:50:53Z"),
> >   text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
> >     "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
> >     "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
> >     "I tend to think about my dreams before I sleep.",
> >     "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
> >     "i wish i could take credit for this"),
> >   lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
> >   lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
> >   county_name = c("Cumberland County", "Delaware County", "San Francisco County",
> >     "Allegheny County", "Concho County", "Los Angeles County"),
> >   fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
> >   state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
> >   state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
> >   urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
> >     "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
> >   urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
> >   population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)
> > ), class = c("data.table", "data.frame"), row.names = c(NA, -6L))
> >
> >
> > I also have a number of search terms in the following form:
> >
> >
> > st <- structure(list(
> >   terms = c("me abused depressed", "me hurt depressed", "feel hopeless depressed",
> >     "feel alone depressed", "i feel helpless", "i feel worthless")
> > ), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
> >
> >
> > I am trying to isolate the tweets that contain all of the words in each
> > of the search terms, i.e. "me", "abused", and "depressed" from the first
> > example search term, but the words do not have to be in order or even
> > next to one another.
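> >
> > For one tweet and one search term, the test I have in mind is roughly the
> > following (just a sketch with stringr, assuming a term can be split into
> > words on whitespace):
> >
> > ## does tweet 1 contain every word of the first term, in any order?
> > words <- stringr::str_split("me abused depressed", "\\s+")[[1]]
> > all(stringr::str_detect(stringr::str_to_lower(th$text[1]),
> >                         paste0("\\b", words, "\\b")))
> >
> > but I am not sure of the best way to apply this across every tweet and
> > every search term.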
> >
> >
> > I am familiar with the dplyr suite of tools and have been attempting to
> > generate some sort of 'filter()' call to do this. I am not very familiar
> > with purrr, but perhaps there is a solution using its map functions? I
> > have also explored the tidytext 'unnest_tokens' function, which transforms
> > the 'th' data in the following way:
> >
> >
> > > tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
> > > head(tt)
> >               status_id           created_at      lat       lng
> > 1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > 2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > 3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > 4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > 5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > 6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> >          county_name  fips state_name state_abb  urban_level urban_code
> > 1: Cumberland County 23005      Maine        ME Medium Metro          3
> > 2: Cumberland County 23005      Maine        ME Medium Metro          3
> > 3: Cumberland County 23005      Maine        ME Medium Metro          3
> > 4: Cumberland County 23005      Maine        ME Medium Metro          3
> > 5: Cumberland County 23005      Maine        ME Medium Metro          3
> > 6: Cumberland County 23005      Maine        ME Medium Metro          3
> >    population       word
> > 1:     277308  technique
> > 2:     277308         is
> > 3:     277308 everything
> > 4:     277308       with
> > 5:     277308    olympic
> > 6:     277308      lifts
> >
> >
> > but once I have unnested the tokens, I am unable to recombine them back
> > into tweets.
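> >
> > For what it is worth, I wonder whether the recombining could be done by
> > grouping the tokens by status_id and checking one term's words against
> > each group, something like this (untested sketch, for a single term):
> >
> > library(dplyr)
> >
> > ## words of one search term
> > term_words <- c("me", "abused", "depressed")
> >
> > ## one row per tweet: does it contain all of the term's words?
> > tt %>%
> >   group_by(status_id) %>%
> >   summarise(flag = as.integer(all(term_words %in% word))) %>%
> >   left_join(th, ., by = "status_id")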
> >
> >
> > Ideally the end result would append a new column to the 'th' data that
> > would flag a tweet that contained all of the search words for any of the
> > search terms; the workflow would look like this (a rough sketch in code
> > follows the list):
> >
> > 1) look for all search words for one search term in a tweet
> >
> > 2) if all of the search words in the search term are found, create a flag
> > (mutate(flag = 1) or some such)
> >
> > 3) do this for all of the tweets
> >
> > 4) move on to the next search term and repeat
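> >
> > In code, I imagine that might look something like the following (an
> > untested sketch using stringr and purrr; the helper name has_all_words is
> > made up for illustration):
> >
> > library(dplyr)
> > library(stringr)
> > library(purrr)
> >
> > ## does one tweet contain every word of one search term, in any order?
> > has_all_words <- function(tweet, term) {
> >   words <- str_split(term, "\\s+")[[1]]
> >   all(str_detect(str_to_lower(tweet), paste0("\\b", words, "\\b")))
> > }
> >
> > ## flag each tweet that matches at least one of the search terms
> > th_flagged <- th %>%
> >   mutate(flag = as.integer(
> >     map_lgl(text, function(tweet) {
> >       any(map_lgl(st$terms, ~ has_all_words(tweet, .x)))
> >     })
> >   ))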
> >
> >
> > Again, my thanks for your patience.
> >
> >
> > --
> >
> >
> > Nate Parsons
> >
> > Pronouns: He, Him, His
> >
> > Graduate Teaching Assistant
> >
> > Department of Sociology
> >
> > Portland State University
> >
> > Portland, Oregon
> >
> >
> > 503-725-9025
> >
> > 503-725-3957 FAX
> >
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
