[R] Matching long strings ... was Re: Memory management in R

David Winsemius dwinsemius at comcast.net
Sun Oct 10 16:11:28 CEST 2010


On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:

>
>> I already offered the Biostrings package. It provides more robust
>> methods for string matching than does grepl. Is there a reason that you
>> choose not to?
>>
>
> Indeed that is the way I should go for and I have installed the  
> package after some struggling.

For me it was a matter of waiting. The only struggle was coming from  
my inner timer saying it was taking too long.

> Since biostring is a fairly complex package and I need only a way to  
> check if a certain string A is a subset of string B, do you know the  
> biostring functions to achieve this?
> I see a lot of methods for biological (DNA, RNA) sequences, and they  
> may not apply to my series (which are definitely not from biology).
> Cheers

It appeared to me that the function matchPattern should replace your
grepl invocation that was failing. It returns a more complex
structure, so you would need to determine what would be an exact
replacement for grepl(...) != 1. It looks like a no-match event results
in the start and width slots being of length 0.

 > str(  matchPattern("A", BString("BBB")) )
Formal class 'XStringViews' [package "Biostrings"] with 7 slots
   ..@ subject        :Formal class 'BString' [package "Biostrings"] with 6 slots
   .. .. ..@ shared         :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
   .. .. .. .. ..@ xp                    :<externalptr>
   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
   .. .. ..@ offset         : int 0
   .. .. ..@ length         : int 3
   .. .. ..@ elementMetadata: NULL
   .. .. ..@ elementType    : chr "ANY"
   .. .. ..@ metadata       : list()
   ..@ start          : int(0)
   ..@ width          : int(0)
   ..@ NAMES          : NULL
   ..@ elementMetadata: NULL
   ..@ elementType    : chr "integer"
   ..@ metadata       : list()

Perhaps:

length(matchPattern(fut_string, past_string)@start ) == 0

You do need to use BString() on at least the past_string argument and
maybe the fut_string as well. The Bioconductor mailing list would have
a larger audience with experience using this package, so it should
probably be your next avenue for advice. I am just reading the help
pages, as you should be able to do. The help page
help("lowlevel-matching") should probably be reviewed, since there may
be efficiency issues to consider, as mentioned below.
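
A minimal sketch of how that test could be wrapped up, assuming the
Biostrings package is installed. The helper name contains_pattern is
made up for illustration and is not part of Biostrings. It uses length()
on the returned views rather than reaching into the @start slot, on the
assumption that this is less tied to the internal layout of the object:

```r
# Sketch only: contains_pattern() is a hypothetical helper, not a
# Biostrings function.
library(Biostrings)

contains_pattern <- function(pattern, subject) {
  # BString() holds arbitrary (non-biological) character data, so both
  # arguments are coerced before matching.
  hits <- matchPattern(BString(pattern), BString(subject))
  # One view per match; zero views means the pattern was not found.
  length(hits) > 0
}

contains_pattern("A", "BBB")    # no "A" anywhere in the subject
contains_pattern("BB", "BBB")   # "BB" occurs in the subject
```

The negation, !contains_pattern(fut_string, past_string), would then
stand in for the grepl(...) != 1 test.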

When dropped into your function with the BString coercion, it
replicated your small example results and ran for a long period on
your larger example without crashing, so I then terminated it and
inserted a "reporter" line to monitor progress. With that reporter I got up into
the 200's for count_len without error. My laptop CPU was warming up  
the case and I was getting sleepy so I terminated the process. (I had  
no way of checking for accuracy, even if I had let it proceed, since  
you did not offer a "correct" answer.)
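
A "reporter" line of that kind might look like the sketch below. The
loop is only a stand-in, since the body of the original function is not
shown here; scan_lengths and report_every are hypothetical names, and
only count_len comes from the function under discussion:

```r
# Sketch only: a placeholder loop with a progress reporter.
scan_lengths <- function(max_len, report_every = 50) {
  done <- 0L
  for (count_len in seq_len(max_len)) {
    # ... the real per-length matching work would go here ...
    done <- done + 1L
    if (count_len %% report_every == 0) {
      # Print progress every report_every iterations.
      message("reached count_len = ", count_len)
      flush.console()  # push the message out promptly on GUI consoles
    }
  }
  done  # number of iterations completed
}

scan_lengths(200)
```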

By the way, the construct ... grepl(. , .) != 1 ... is perhaps
inefficient. It could more compactly be expressed as ... !grepl(. , .)
which would not be doing coercion of logicals to integers.
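
A small base-R illustration of the two forms, using made-up strings in
place of your data. The fixed = TRUE lines are an extra suggestion of
mine rather than something tested on your example: that argument makes
grepl treat the pattern as a literal string instead of a regular
expression, which seems closer to a plain "is A contained in B" check.

```r
# Made-up example data; the variable names follow the thread.
past_string <- "abcabc"
fut_string  <- "bca"

no_match_old <- grepl(fut_string, past_string) != 1  # compares a logical with 1
no_match_new <- !grepl(fut_string, past_string)      # plain logical negation

identical(no_match_old, no_match_new)  # both forms give the same answer

# With fixed = TRUE the "." is an ordinary character, not a wildcard.
grepl("a.c", "abc")                # regex: "." matches "b"
grepl("a.c", "abc", fixed = TRUE)  # literal: "abc" does not contain "a.c"
```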

-- 
David.

>
> Lorenzo


