[R] speed issue: gsub on large data frame

Tue Nov 5 10:06:26 CET 2013

What is missing is any idea of what the 'patterns' are that you are searching for. Regular expressions are very sensitive to how you specify the pattern. You indicated that you have up to 500 elements in the pattern, so what does it look like? Alternation and backtracking can be very expensive, so a lot more specificity is required. There are whole books written on how pattern matching works and on what is hard and what is easy. This is true wherever regular expressions are used, not just in R. Some idea of the timing would also help: are you talking about seconds, minutes, or hours, and is it on the order of 1, 10, or 100 of them?
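
To make the question concrete, here is a minimal sketch of the kind of timing comparison that would help. Everything in it is an assumption, since the real data and patterns were not posted: the sample `text_column`, the three-word `patternvector`, and the helper names `loop_gsub` and `combined_gsub` are all hypothetical. It compares one gsub() call per pattern against a single call using an alternation, and reports the timing of each with system.time(). Which one wins depends heavily on the actual patterns and data.

```r
# Hypothetical stand-in data: the real text column and patterns were not posted.
set.seed(1)
words <- c("apple", "banana", "cherry", "grape", "melon", "peach")
text_column <- replicate(
  10000,  # far smaller than the 4 million rows mentioned in the thread
  paste(sample(words, 8, replace = TRUE), collapse = " ")
)
patternvector <- c("apple", "cherry", "peach")  # assumed to be literal words

# Approach 1: one gsub() call per pattern, so one full pass over the data each.
# fixed = TRUE treats the pattern as a literal string, not a regular expression.
loop_gsub <- function(x, patterns, token = "[token]") {
  for (p in patterns) {
    x <- gsub(p, token, x, fixed = TRUE)
  }
  x
}

# Approach 2: a single alternation, so gsub() is called only once.
# Assumes the patterns contain no regex metacharacters; perl = TRUE uses PCRE.
combined_gsub <- function(x, patterns, token = "[token]") {
  pattern <- paste0("\\b(?:", paste(patterns, collapse = "|"), ")\\b")
  gsub(pattern, token, x, perl = TRUE)
}

# Time both approaches on the same input.
t_loop <- system.time(res_loop <- loop_gsub(text_column, patternvector))
t_comb <- system.time(res_comb <- combined_gsub(text_column, patternvector))
print(t_loop)
print(t_comb)
```

Posting something like this, with a representative sample of the real patterns and the measured timings, would let others see where the time is going.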


On Nov 5, 2013, at 3:13, Simon Pickert <simon.pickert at t-online.de> wrote:

> How’s that not reproducible?
> 
> 1. Data frame, one column with text strings
> 2. Size of data frame = 4 million observations
> 3. A bunch of gsubs in a row: gsub(patternvector, "[token]", dataframe$text_column)
> 4. General question: How to speed up string operations on 'large' data sets?
> 
> 
> Please let me know what more information you need in order to reproduce this example.
> It’s more of a general question, but I think the description above gives you a specific picture of what I’m doing right now.
> 
> On 05.11.2013 at 06:59, Jeff Newmiller <jdnewmil at dcn.davis.CA.us> wrote:
> 
>> Example not reproducible. Communication fail. Please refer to Posting Guide.
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                     Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> --------------------------------------------------------------------------- 
>> Sent from my phone. Please excuse my brevity.
>> 
>> Simon Pickert <simon.pickert at t-online.de> wrote:
>>> Hi R’lers,
>>> 
>>> I’m running into speed issues when performing a bunch of
>>> 
>>> gsub(patternvector, "[token]", dataframe$text_column)
>>> 
>>> on a data frame containing more than 4 million entries.
>>> 
>>> (The "patternvectors" contain up to 500 elements.)
>>> 
>>> Is there any better/faster way than performing like 20 gsub commands in
>>> a row?
>>> 
>>> 
>>> Thanks!
>>> Simon
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.