[R] regexpr: R takes very long with non-existent pattern

Leonard Mada
Thu May 19 02:35:16 CEST 2022


Dear Andrew,


I made a mistake on my side. The object was not a character vector, but an 
xml object (the original XML document containing the abstracts).

str(x)
List of 2
  $ node:<externalptr>
  $ doc :<externalptr>
  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"


I pasted the R code for a function, but it contained an error that stopped 
the function from being parsed. The lines after it were still executed:

npos = regexpr(patt, x, perl=TRUE);
# Error in regexpr(patt, x, perl = TRUE) : object 'patt' not found


The variable x was actually the xml object, which was my mistake. It still 
takes 1-2 minutes to produce the final error.

Is regexpr converting the xml object with as.character first (I have not 
checked this)?

It would make more sense to evaluate the pattern argument first.
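
A minimal sketch of why the error could arrive late, under the assumption 
(not verified against the R sources here) that regexpr coerces a 
non-character text argument before the pattern argument is ever evaluated. 
R evaluates function arguments lazily, so a missing object is only reported 
when the argument is first used. The names toy_regexpr, checked_regexpr and 
patt_undefined are hypothetical and exist only for this illustration.

```r
# Records the order in which the toy function touches its arguments.
steps <- character(0)

# Hypothetical stand-in for the assumed behaviour of regexpr():
# coerce `text` first, and only afterwards use `pattern`.
toy_regexpr <- function(pattern, text) {
    if (!is.character(text)) {
        steps <<- c(steps, "coerce text")   # a slow conversion would run here
        text <- as.character(text)
    }
    steps <<- c(steps, "force pattern")
    pattern <- as.character(pattern)        # the promise is forced only now
    regexpr(pattern, text, perl = TRUE)
}

# `patt_undefined` is deliberately not defined anywhere.
res <- tryCatch(
    toy_regexpr(patt_undefined, 1:3),
    error = function(e) conditionMessage(e)
)
print(steps)   # the coercion step is recorded before the error occurs
print(res)

# Hypothetical wrapper that fails fast: evaluate and check both arguments
# before any potentially expensive conversion can happen.
checked_regexpr <- function(pattern, text, ...) {
    force(pattern)   # errors immediately if `pattern` cannot be evaluated
    stopifnot(is.character(pattern), is.character(text))
    regexpr(pattern, text, ...)
}

early <- tryCatch(
    checked_regexpr(patt_undefined, 1:3),
    error = function(e) conditionMessage(e)
)
print(early)
```

With a wrapper like this, passing an xml object as text would also be 
rejected by the is.character check instead of being converted silently.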


Sincerely,


Leonard

On 5/19/2022 3:26 AM, Andrew Simmons wrote:
> Hello,
>
>
> I tried this myself, something like:
>
>
> dat <- utils::read.csv(
>      "https://raw.githubusercontent.com/discoleo/R/master/TextMining/Pubmed/Example_Abstracts_Title_Pubmed.csv",
>      check.names = FALSE
> )
>
>
> regexpr(patt, dat$Abstract, perl = TRUE)
> regexpr(patt, dat$Title, perl = TRUE)
>
>
> and I can't reproduce your issue. Mine raises the error that object
> 'patt' does not exist within a second or less. I'm using R 4.2.0
> and Windows 11, though that shouldn't make a difference: if you
> look at Sys.info(), it still reports Windows 10 with a build version of
> 22000. I don't really know what else to say. Have you tried it again
> since?
>
>
> Regards,
>      Andrew Simmons
>
> On Wed, May 18, 2022 at 5:09 PM Leonard Mada via R-help
> <r-help using r-project.org> wrote:
>> Dear R Users,
>>
>>
>> I have run the following command in R:
>>
>> # x = larger vector of strings (1200 Pubmed abstracts);
>> # patt = not defined;
>> npos = regexpr(patt, x, perl=TRUE);
>> # Error in regexpr(patt, x, perl = TRUE) : object 'patt' not found
>>
>>
>> The problem:
>>
>> R becomes unresponsive and it takes 1-2 minutes to return the error. The
>> operation completes almost instantaneously with a valid pattern.
>>
>> Is there a reason for this behavior?
>>
>> Tested with R 4.2.0 on MS Windows 10.
>>
>>
>> I have uploaded a set with 1200 Pubmed abstracts on Github, if anyone
>> wants to check:
>>
>> - see file: Example_Abstracts_Title_Pubmed.csv;
>>
>> https://github.com/discoleo/R/tree/master/TextMining/Pubmed
>>
>> The variable patt was not defined because of an earlier error, but it
>> took very long to exit the operation and report the error.
>>
>>
>> Many thanks,
>>
>>
>> Leonard


