[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1

Fri Jun 11 21:07:52 CEST 2021

Yes, wonderful! This seems to work beautifully.  Thank you so much!

________________________________
From: Rui Barradas <ruipbarradas using sapo.pt>
Sent: Friday, June 11, 2021 2:03 PM
To: Debbie Hahs-Vaughn <debbie using ucf.edu>; r-help using R-project.org <r-help using R-project.org>
Subject: Re: [R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1

Hello,

From what I understood of the problem, this might be what you want.

library(dplyr)
library(stringr)

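# Wrap each core word in word boundaries so that only whole words match,
# then join them into a single alternation pattern.
coreWordsPat <- paste0("\\b", coreWords, "\\b")
coreWordsPat <- paste(coreWordsPat, collapse = "|")

left_join(
   # Core: 1 if the utterance contains at least one core word.
   df %>%
     mutate(Core = +str_detect(Utterance, coreWordsPat)) %>%
     select(ID, Utterance, Core),
   # Fringe: 1 if anything is left after deleting all core words.
   df %>%
     mutate(Fringe = str_remove_all(Utterance, coreWordsPat),
            Fringe = +(nchar(trimws(Fringe)) > 0)) %>%
     select(ID, Fringe),
   by = "ID"
)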
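
As a quick check of the idea above, here is a minimal sketch in base R only, so it runs without dplyr or stringr. The five utterances come from the thread, but the shortened `coreWords` list and the names `leftover`, `naivePat` and `naiveCore` are assumptions made purely for illustration. It also shows why a pattern without word boundaries flags "small": the plain pattern finds the core word "a" inside it.

```r
# Illustrative data: a few utterances from the thread and a shortened core list.
df <- data.frame(
  ID = 1:5,
  Utterance = c("a baby", "small", "yes", "where's his bed",
                "what is that on his head"),
  stringsAsFactors = FALSE
)
coreWords <- c("I", "no", "yes", "a", "is", "that", "what", "on", "all done")

# One alternation with a word boundary on each side of every core word.
coreWordsPat <- paste0("\\b", coreWords, "\\b", collapse = "|")

# Core: at least one whole core word occurs in the utterance.
df$Core <- as.integer(grepl(coreWordsPat, df$Utterance, perl = TRUE))

# Fringe: something other than whitespace is left after deleting core words.
leftover <- trimws(gsub(coreWordsPat, "", df$Utterance, perl = TRUE))
df$Fringe <- as.integer(nchar(leftover) > 0)

# For comparison: without boundaries, substrings such as the "a" in "small"
# or the "is" in "his" count as matches.
naivePat <- paste(coreWords, collapse = "|")
naiveCore <- as.integer(grepl(naivePat, df$Utterance))

print(df)
print(naiveCore)
```

With this data, "small" and "where's his bed" get Core = 0 under the bounded pattern, while the naive pattern marks every row as core. One caveat: `\\b` treats an apostrophe as a boundary, so "that" would still match inside "that's"; whether such contractions should count as core is not settled in this thread.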

Hope this helps,

Rui Barradas

At 18:02 on 11/06/21, Debbie Hahs-Vaughn wrote:
> I am working with utterances, statements spoken by children.  From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').
>
> If there are one or more words in the statement that are NOT core words, then I want to input '1' into 'Fringe' (and if there are only core words and nothing extra, then input '0' into 'Fringe').  I will not have a list of Fringe words.
>
> Basically, right now I have a child ID and only the utterances.  Here is a snippet of my data.
>
> ID      Utterance
> 1       a baby
> 2       small
> 3       yes
> 4       where's his bed
> 5       there's his bed
> 6       where's his pillow
> 7       what is that on his head
> 8       hey he has his arm stuck here
> 9       there there's it
> 10      now you're gonna go night-night
> 11      and that's the thing you can turn on
> 12      yeah where's the music box
> 13      what is this
> 14      small
> 15      there you go baby
>
>
> The following code runs but isn't doing exactly what I need--which is:  1) the ability to detect words from the list and define as core; 2) the ability to search the utterance and if there are any words in the utterance that are NOT core, to identify those as '1' as I will not have a list of fringe words.
>
> ```
>
> library(dplyr)
> library(stringr)
> library(tidyr)
>
> coreWords <-c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a", "go", "mine", "you", "what", "on", "in", "here", "more", "out", "off", "some", "help", "all done", "finished")
>
> str_detect(df,)
>
> dfplus <- df %>%
>    mutate(id = row_number()) %>%
>    separate_rows(Utterance, sep = ' ') %>%
>    mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
>           Fringe = + !Core) %>%
>    group_by(id) %>%
>    mutate(Core = + (sum(Core) > 0),
>           Fringe = + (sum(Fringe) > 0)) %>%
>    slice(1) %>%
>    select(-Utterance) %>%
>    left_join(df) %>%
>    ungroup() %>%
>    select(Utterance, Core, Fringe, ID)
>
> ```
>
> The dput() code is:
>
> ```
> structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
> "there's his bed", "where's his pillow", "what is that on his head",
> "hey he has his arm stuck here", "there there's it", "now you're gonna go night-night",
> "and that's the thing you can turn on", "yeah where's the music box",
> "what is this", "small", "there you go baby ", "what is this for ",
> "a ", "and the go goodnight here ", "and what is this ", " what's that sound ",
> "what does she say ", "what she say", "should I turn the on so Laura doesn't cry ",
> "what is this ", "what is that ", "where's clothes ", " where's the baby's bedroom ",
> "that might be in dad's bed+room ", "yes ", "there you go baby ",
> "you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
> 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
> -31L), class = c("tbl_df", "tbl", "data.frame"))
>
> ```
>
> The first 10 rows of output look like this:
>
>     Utterance                        Core  Fringe  ID
> 1   a baby                           1     0       1
> 2   small                            1     0       2
> 3   yes                              1     0       3
> 4   where's his bed                  1     1       4
> 5   there's his bed                  1     1       5
> 6   where's his pillow               1     1       6
> 7   what is that on his head         1     0       7
> 8   hey he has his arm stuck here    1     1       8
> 9   there there's it                 1     0       9
> 10  now you're gonna go night-night  1     1       10
>
> For example, in line 1 of the output, 'a' is a core word so '1' for core is correct.  However, 'baby' should be picked up as fringe so there should be '1', not '0', for fringe. Lines 7 and 9 also have words that should be identified as fringe but are not.
>
> Additionally, it seems like if the utterance has parts of a core word in it, it's being counted. For example, 'small' is identified as a core word even though it's not (but 'all done' is a core word). 'Where's his bed' is identified as core and fringe, although none of the words are core.
>
> Any suggestions on what is happening and how to correct it are greatly appreciated.
>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>