[R] Discovering patterns in textual strings

Bert Gunter bgunter.4567 using gmail.com
Mon May 7 23:33:45 CEST 2018


You seem to be using semantics to make your choices, not merely rules-based
patterns.
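
As a rough illustration, the separator-based regex from the earlier message
quoted below can be applied to a few of the sample strings. This is only a
sketch: the vector name `ads` and the choice of five strings are made up for
the example, and the regex is the one already shown in the quoted message.

```r
# A few of the sample strings from the quoted message (illustrative subset).
ads <- c("7 Minute Workout Plus",
         "Flippy Knife - 654567",
         "Disney Kingdom",
         "Disney Kingdom_Android",
         "com.zone.talking.pet")

# Same rule as in the earlier message. Because the leading .+ is greedy,
# it keeps everything before the LAST run of separators ("." "_" or space).
keys <- sub("^(.+)[. _]+.*", "\\1", ads)

# Side-by-side view of each string and the key the rule extracts.
print(data.frame(ad = ads, key = keys))
```

Under this rule "Disney Kingdom" and "Disney Kingdom_Android" end up with
different keys ("Disney" and "Disney Kingdom"), and "com.zone.talking.pet"
reduces to "com.zone.talking" rather than "com.zone". The desired list quoted
below has "Disney Kingdom" and "com.zone", so deciding where each key ends
seems to need knowledge of what the names mean, not just where the
separators are.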

But in any case, I cannot help. Perhaps someone else with more experience
at this sort of thing or who is smarter can.

-- Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, May 7, 2018 at 2:02 PM, Jeff Reichman <reichmanj using sbcglobal.net>
wrote:

> Bert
>
>
>
> Here are some examples of the type of text strings I’m dealing with:
>
>
>
> ??????.??.???
>
> ??????.??.??????????
>
> ?Torrent? Pro - Torrent App
>
> ?Torrent?-Torrent Downloader
>
> 1 Pic 8 Words - Syllables
>
> 1 Pic 8 Words - Syllables
>
> 27043_Spanish songs for children
>
> 28.android.com.alpha.horoscope
>
> 28.android.com.bravo.horoscope
>
> 28.Card Game - Offline
>
> 28.card Game Multiplayer
>
> 37045_Spanish songs for children
>
> 7 Minute Workout for Weight Loss: Daily Cardio App
>
> 7 Minute Workout Plus
>
> 7 Minute Workout_SMA_IA_$2.25_com.popularapp.sevenmins_CD_Android_MEDIUMRECTANGLE_300x250_IAB7
>
> 7 Nights at Pizza House - 2
>
> 7 Nights at Pizza House 3D
>
> com.zombodroid
>
> com.zombodroid.battle
>
> com.zombodroid.memegenerator
>
> com.zone.talking.pet
>
> com.zone.yinshidaquan
>
> Disney Kingdom
>
> Disney Kingdom_Android
>
> Evite
>
> Evite Invitations
>
> Evite IOS_Evite_IOS_320x50
>
> Excavator Simulator 3D:Sand
>
> Excavator Snow Plow Loader Truck
>
> Flippy Knife
>
> Flippy Knife - 654567
>
> fliptech.iowafmworld
>
> fliptech.serbiafmworld
>
> Floor is lava!
>
> Floor is lava: Escape
>
> Go_Launcher
>
> Go_Launcher_Lite
>
> myyearbook Android
>
> myyearbook.com-MeetMe_Android_300x250_UK
>
>
>
> I am hoping to obtain something like:
>
>
>
> ??????.??
>
> Torrent
>
> 1 Pic 8 Words
>
> 7 Minute Workout
>
> 7 Nights at Pizza House
>
> com.zombodroid
>
> com.zone
>
> Disney Kingdom
>
> Flippy Knife
>
> fliptech
>
> Floor is lava
>
> Go_Launcher
>
> myyearbook
>
>
>
>
>
>
>
> *From:* Bert Gunter <bgunter.4567 using gmail.com>
> *Sent:* Saturday, May 5, 2018 2:14 AM
> *To:* reichmanj using sbcglobal.net
> *Cc:* R-help <r-help using r-project.org>
> *Subject:* Re: [R] Discovering patterns in textual strings
>
>
>
> I am still somewhat confused by your specifications, but others may not
> be. Part of my confusion stems from your failure to provide a reproducible
> example (see e.g. the posting guide linked below).  For example, I cannot
> tell from your text whether the Abc and Bce strings contain one or more
> spaces at the end. I shall assume they may but need not.
>
> Anyway, here is a reproducible example and solution that assumes that the
> substrings/patterns of interest to you occur at the beginning of the
> strings and may or may not be followed by one of "." "_" or " "(space) and
> then possibly further text which should be ignored. Assuming that you are
> familiar with regular expressions, maybe this will help to get you started
> even if I have misunderstood your specifications. If you aren't familiar
> with regexes, the stringr package may provide a gentler interface
> than using R's raw regex functionality. Or maybe someone else can suggest a
> better approach (which is another reason why you should reply to the list,
> not just me).
>
> z <- c("abc",
>        "abc_def",
>        "abc.def",
>        "abc def",
>        "abcd_ef",
>        "abcd",
>        "e","f")
>
> pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))
>
> ## gives:
> > pats
> [1] "abc"  "abcd" "e"    "f"
>
>
>
> This gives you the four separate patterns that you could then use to group
> your records, perhaps by:
>
> > lapply(pats,function(x)grep(paste0("^", x,"([_. ]|$)"), z))
> [[1]]
> [1] 1 2 3 4
>
> [[2]]
> [1] 5 6
>
> [[3]]
> [1] 7
>
> [[4]]
> [1] 8
>
>
>
> That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.
>
>
> Cheers,
> Bert
>
>
>
>
> On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman <reichmanj using sbcglobal.net>
> wrote:
>
> Bert
>
> Thank you for the  link.  Figured there might be something
>
> Regarding your questions
>
> This is from a large data set of 53 billion records.  The column in
> question is AdNames (real-time bidding data).
>
> #1. Generally yes, but not always
>
> #2 Separators could be underscores  (_) or dots (.) as in 1.2.3_ABC .....
>
> #3 Yes. So "Abc 123" could be a matching string.
>
> This would not be considered a match  ...
> abc_something
> this.is_a long stringwithabcinthemiddle
>
> The sequence(s) are always at the beginning (or so it appears).  Out
> of the 54 billion records I am able to pull (SparkR sql) 948,679 unique
> strings.  It is from these unique strings that I (if possible) want to
> identify the "key" strings.
>
> 1.  Abc_1232.niok7j9hd
> 2.  Abc
> 3.  Abc.2#348hfk2.njilo
> 4.  Abc.2
> 5.  Abc.7
> 6.  BAdfr_kajdhf98#kjsdh
> 7.  BAdrf_gofer
> 948679 ....
>
>
> So I may have a thousand individual strings, all of which have Abc as a
> common string, or BAdrf.  So I am looking to pull "Abc", "BAdrf", etc.
> Then I can go back and restructure the data to show that any record with
> Abc_1232.niok7j9hd is part of the Abc "group" or family.
>
> Does that help?
>
> Jeff
>
> -----Original Message-----
> From: Bert Gunter <bgunter.4567 using gmail.com>
> Sent: Friday, May 4, 2018 5:41 PM
> To: reichmanj using sbcglobal.net
> Cc: R-help <R-help using r-project.org>
> Subject: Re: [R] Discovering patterns in textual strings
>
> The answer is, of course, using regular expressions and/or libraries
> therefor. However, I do not think you have defined your problem
> sufficiently. Some questions I have:
>
> 1. Do possible patterns to be matched always appear at the beginning of
> your strings?
>
> 2. Always together between specified separators ("_"  in your example); or
> one of several specified separators; or otherwise?
>
> 3. Do spaces or other nonprinting characters occur in your strings?
>
> e.g. would
>
> abc_something
> this.is_a long stringwithabcinthemiddle
>
> be considered matching?
> There are undoubtedly other possibilities that I've missed.
>
>
>
> You may also find it useful to check this "task view" out for
> possibilities:
> https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
>
> Cheers,
> Bert
>
>
>
> On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman <reichmanj using sbcglobal.net>
> wrote:
> > R Help Forum
> >
> >
> >
> > Is there an R library (or a way) to extract unique character
> > strings, or repeating patterns, in textual strings?  Say for example I
> > have the following records:
> >
> >
> >
> > Abc_1234_kjhksh_276
> >
> > Abc
> >
> > Abc_1234_lakdofyo_324
> >
> > Bce_876_skdhk_*&^%*&
> >
> > Bce
> >
> > Bce_454
> >
> >
> >
> > And I would like to see the following results
> >
> > Abc
> >
> > Abc_1234
> >
> > Bce
> >
> >
> >
> >
> >
> > Jeff Reichman
> >
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
