[R] Matching long strings ... was Re: Memory management in R

Martin Morgan mtmorgan at fhcrc.org
Sun Oct 10 20:22:43 CEST 2010


On 10/10/2010 11:00 AM, David Winsemius wrote:
> 
> On Oct 10, 2010, at 11:35 AM, Martin Morgan wrote:
> 
>> On 10/10/2010 07:11 AM, David Winsemius wrote:
>>>
>>> On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:
>>>
>>>>
>>>>> I already offered the Biostrings package. It provides more robust
>>>>> methods for string matching than does grepl. Is there a reason that
>>>>> you
>>>>> choose not to?
>>>>>
>>>>
>>>> Indeed that is the way I should go, and I have installed the
>>>> package after some struggling.
>>>
>>> For me it was a matter of waiting. The only struggle was coming from my
>>> inner timer saying it was taking too long.
>>>
>>>> Since Biostrings is a fairly complex package and I need only a way to
>>>> check if a certain string A is a substring of string B, do you know the
>>>> Biostrings functions to achieve this?
>>>> I see a lot of methods for biological (DNA, RNA) sequences, and they
>>>> may not apply to my series (which are definitely not from biology).
>>>> Cheers
>>>
>>> It appeared to me that the function matchPattern should replace your
>>> grepl invocation that was failing. It returns a more complex structure,
>>> so you would need to determine what would be an exact replacement for
>>> grepl(...) != 1. Looks like a no-match event results in the start and
>>> end items being of length 0.
>>>
>>>> str(  matchPattern("A", BString("BBB")) )
>>
>> A couple of things from this thread.
>>
>> To install a Bioconductor package follow directions here
>>
>>  http://bioconductor.org/install/index.html#install-bioconductor-packages
>>
>> which leads to
>>
>>   source("http://bioconductor.org/biocLite.R")
>>   biocLite("Biostrings")
>>
>> biocLite is just a wrapper around install.packages with appropriate
>> repositories defined.
>>
>> Some Bioconductor packages are relatively mature and make relatively
>> advanced use of S4 classes, so looking at str() is not that helpful --
>> the way the user is meant to interact with the object is different from
>> the way the object is implemented. So the best bet is to look at the
>> relevant help pages
>>
>>  result = matchPattern("A", BString("BBB"))
>>  class(result)
>>  class?XStringViews
> 
> The above was the most surprising example for me (not being particularly
> S4-savvy). Looks like it parses as:
> `?`(class, XStringViews)

similarly ?"XStringViews-class"

> Is that an S4 sort of extension for accessing documentation or have I
> just missed a more general method? I tried looking at the help Index for
> the "methods" package.

The help page ?"?" documents the type?topic form. It is more general than
S4: package?stats, for example, takes one to the 'stats' topic amongst the
'package' doc-type help pages. It relies on package authors choosing
appropriate docTypes for their man pages.
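
As a small sketch of the lookup (this is my reading of ?"?", so treat the
alias naming as an assumption to check there): type?topic searches for the
help alias "<topic>-<type>", so the pairs below should reach the same page.

```r
# type?topic is looked up under the alias "<topic>-<type>"
pkg_help <- help("stats-package")   # same page as package?stats
print(length(pkg_help))             # number of matching help files found

# The general form, written without the operator syntax:
#   `?`(package, stats)       is  package?stats
#   `?`(class, XStringViews)  is  class?XStringViews (needs Biostrings attached)
```

The object returned by help() is only displayed when it is printed, so
assigning it, as here, does not open the help viewer.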

One S4 paradigm that can be useful is the analog of methods(class="lm"),
which is showMethods(classes="XStringViews", where="package:Biostrings").
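
A self-contained sketch of that, using only the methods package. The class
"Track" and the generic "span" are made-up names for illustration, not
anything from Biostrings:

```r
library(methods)

# Hypothetical S4 class and generic, defined only for this example
setClass("Track", representation(pos = "numeric"))
setGeneric("span", function(x) standardGeneric("span"))
setMethod("span", "Track", function(x) diff(range(x@pos)))

trk <- new("Track", pos = c(2, 5, 9))
span_value <- span(trk)

# S4 analog of methods(class = "lm"): list the methods defined on the class
showMethods(classes = "Track")
```

With Biostrings attached, adding where="package:Biostrings" should limit
the search to generics found in that package.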

Martin

> 
>>
>> and the help pages referenced there, or from which XStringViews inherits
>>
>>   getClass("XStringViews")
>>
>> and in particular
>>
>>   class?Ranges
>>
>> Rather than accessing the 'start' slot, use start(result). Vignettes are
>> used heavily in Bioconductor packages, and in particular
>>
>>   browseVignettes("Biostrings")
>>
>> pops up a page with several relevant vignettes, e.g., 'A short
>> presentation of the basic classes...' and perhaps 'Pairwise Sequence
>> Alignment'. These are also accessible on the Bioconductor web site,
>> e.g., on the pages linked from
>>
>>  http://bioconductor.org/help/bioc-views/release/bioc/
>>
>> The rule of thumb hinted at below -- that an operation seems to be
>> taking longer than it should -- probably indicates that the function is
>> being invoked in an inefficient way. If the documentation is opaque then
>> definitely the place to seek additional help is on the Bioconductor
>> mailing list
>>
>>  http://bioconductor.org/help/mailing-list/
>>
>> Hope this helps.
>>
>> Martin
>>
>>
>>> Formal class 'XStringViews' [package "Biostrings"] with 7 slots
>>>  ..@ subject        :Formal class 'BString' [package "Biostrings"] with
>>> 6 slots
>>>  .. .. ..@ shared         :Formal class 'SharedRaw' [package "IRanges"]
>>> with 2 slots
>>>  .. .. .. .. ..@ xp                    :<externalptr>
>>>  .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
>>>  .. .. ..@ offset         : int 0
>>>  .. .. ..@ length         : int 3
>>>  .. .. ..@ elementMetadata: NULL
>>>  .. .. ..@ elementType    : chr "ANY"
>>>  .. .. ..@ metadata       : list()
>>>  ..@ start          : int(0)
>>>  ..@ width          : int(0)
>>>  ..@ NAMES          : NULL
>>>  ..@ elementMetadata: NULL
>>>  ..@ elementType    : chr "integer"
>>>  ..@ metadata       : list()
>>>
>>> Perhaps:
>>>
>>> length(matchPattern(fut_string, past_string)@start ) == 0
>>>
>>> You do need to use BString() on at least the past_string argument and
>>> maybe the fut_string as well. The Bioconductor mailing list would have a
>>> larger audience with experience using this package, so they should
>>> probably be your next avenue for advice. I am just reading the help
>>> pages as you should be able to do. The help page
>>> help("lowlevel-matching") should probably be reviewed since there may be
>>> efficiency issues to consider as mentioned below.
>>>
>>> When dropped into your function with the BString coercion, it replicated
>>> your small example results and did not crash after a long period with
>>> your larger example, so I then terminated it and inserted a "reporter"
>>> line to monitor progress. With that reporter I got up into the 200's for
>>> count_len without error. My laptop CPU was warming up the case and I was
>>> getting sleepy so I terminated the process. (I had no way of checking
>>> for accuracy, even if I had let it proceed, since you did not offer a
>>> "correct" answer.)
>>>
>>> By the way, the construct ... grepl(. , .) != 1 ... is perhaps
>>> inefficient. It could more compactly be expressed as ...   !grepl(. ,
>>> .)  which would not be doing coercion of logicals to integers.
>>>
>>
>>
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


