[BioC] HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"

Hervé Pagès hpages at fhcrc.org
Thu Sep 22 20:47:42 CEST 2011


Hi Malcolm,

On 11-09-22 07:41 AM, Cook, Malcolm wrote:
> Herve,
>
> I agree that loading one chromosome at a time is preferable.  I am working in fly genome now, so the memory footprint of loading them all at once is bearable, but this will be burdensome in larger genomes.
>
> I tried using bsapply to this end, but found that
>    1) the applied FUN did not 'know' the name of the current chromosome to which it was being applied, therefor it could not known which of the GenomicRange's it should extract
>    2) the naive approach would not preserve the order of the input GRanges in the result, which is depended upon in my use case as it requires being able to recover the GRange coordinates from an exported fasta file.
>
> I like your analysis of the problem, however, if I understand your proposal, I fear the burden imposed by injecting the mask(s) onto the entire chromosome, prior to extracting the subseq, may itself prove onerous since this requires copying it in entirety (as opposed to hard masking just the individual subseqs which in typical(?) applications will be a small fraction of the genome).

It's not that bad: in each step of the loop, (a) we'll load
one chromosome, (b) generate a copy of it, (c) remove the
original chromosome, (d) extract stuff from the copy, (e) then
remove the copy itself.

So yes when we'll load Human chr1 and generate a copy of it,
we'll need 500MB of RAM, but that's still better than loading
all the chromosomes simultaneously.

As for the speed, it takes less than 1 sec. to inject the hard
masks in Human chr1, and less than 0.05 sec. to inject them
in Fly chr3R. And we'll call injectHardMask once per chromosome,
whereas with VinjectHardMask (the vectorized version of
injectHardMask, which uses lapply behind the scene), you
call injectHardMask N times, where N is the nb of sequences
to extract. And even if each of your injectHardMask call is
cheaper than calling injectHardMask on an entire chromosome,
the overhead of looping in R is such that when N is big
(e.g. > 100000), things just take too long regardless of what
you do inside the loop.

So overall, with this callback approach, I expect a significant
improvement in both memory usage and speed.

>
> Assuming the performance issue is workaroundable, whether or not to provide a convenience wrapper to getSeq is a matter of packaging.  Depending on how simple invoking getSeq with a callback to perform hardMasking, you might just document its usage in getSeq's own documentation and be done with it....

Yeah, I was planning to put this in getSeq's man page anyway, as a
good illustration of how to use a callback function.

Thanks for your feedback,
H.

>
> Cheers,
>
> ~Malcolm
>
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, September 20, 2011 12:53 PM
> To: Cook, Malcolm
> Cc: 'bioconductor at r-project.org'
> Subject: Re: HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"
>
> Hi Malcom,
>
> This is indeed covering a typical/important use case. Thanks a lot for
> sharing the code!
>
> I think it's possible and we should avoid loading all the chromosomes
> simultaneously in memory. The top-level loop in getSeq() loads one
> chromosome at a time, extracts what needs to be extracted from it, and
> then "releases" it.
> I was thinking that it would be easy to extend the capabilities of
> getSeq() by adding an argument to it. This argument would be a hook
> function that takes at least 2 args: the BSgenome object and the name
> of a sequence. It would be called in the top-level loop right after the
> chromosome has been loaded and just before things are extracted from
> it, and would need to return a DNAString object of the same length as
> the chromosome.
>
> In this context the current behaviour of getSeq() corresponds
> to a hook that just drops the masks and your getSeqHardMasked()
> function corresponds to a hook that injects the specified masks.
>
> Because getSeqHardMasked() is probably the most typical use case
> for using a user-supplied hook, once the hook feature is implemented
> we should probably add the function anyway as a convenience wrapper
> to getSeq(), called with the appropriate hook. Hope that makes sense.
>
> I'll try to come up with something like this and will let you know.
>
> Cheers,
> H.
>
>
> On 11-09-16 01:36 PM, Cook, Malcolm wrote:
>> Hi, fellow bioconductors,
>>
>> I offer the following function as an implementation of a solution to the problem presented in
>>
>> 	[BioC] insert Ns for repeat masked regions - https://stat.ethz.ch/pipermail/bioconductor/2011-March/038134.html
>> 	[Bioc-sig-seq] Extract masked sequences - https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-January/000786.html
>>
>> I have cc:ed the original discussants on those two threads.
>>
>> I embrace any criticisms or suggestions for improvement, and I hope it helps any colleague faced with the same issue.
>>
>> Cheers,
>>
>> Malcolm Cook
>>
>>
>> getSeqHardMasked<-
>>     function(BSg,GR,maskList,letter) {
>> ### PURPOSE: return list of DNAString sequences extracted from the
>> ### BSgenome<BSg>   corresponding to each location in GenomicRange
>> ###<GR>, and masked with<letter>   according to the masks named in
>> ###<maskList>   (which are encoded following BSParams convention).
>> ###
>> ### USE CASE - write fasta file of hard masked regions, using
>> ###            RepeatMasker (RM) and Tandem Repeat Finder (TRF):
>> ###
>> ### GR<- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
>> ### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
>> ###            ,"myExtractForGR.fa"
>> ###            ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
>> ###            )
>> ###
>> ### NB: The implementation was coded 'pay(ing) attention to memory
>> ### management' following suggestions from Herve in:
>> ### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
>> ### In particular, the inidividual chromosomes and their
>> ### subseq(uences) should be garbage collectable after the function
>> ### exits and they go out of scope, (although the chromosomes _are_
>> ### all simultaneously loaded which I think is unavoidable if the
>> ### results are to preserve the arbitrary order of GR).
>> ###
>> ### NB: My initial implementation FAILed as it used bsapply&   BSParams
>> ### whose FUN can not 'know' the name of the sequence (which was
>> ### needed to know which subseqs to extract).
>>       ']]'<-
>>         ## utility to subscript b by a.
>>         function(a,b) b[[a]]
>>       Vsubseq<-
>>         ## vectorized version of subseq.
>>         Vectorize(subseq)
>>       VinjectHardMask<-
>>         ## vectorized version of injectHardMask.
>>         Vectorize(injectHardMask)
>>       activeMask<-
>>         ## A logical vector indicating whether each mask should be ON or
>>         ## OFF to be applied to each chromosome in BSg.
>>         masknames(BSg) %in% names(maskList[which(maskList)])
>>       BSg_seqList<-
>>         ## BSg as a list of named MaskedDNAString (one per chromosome)...
>>         sapply(seqnames(BSg),']]',BSg)
>>       BSg_seqList<-
>>         ## ... with the masks for each chromosome activated.
>>         sapply(BSg_seqList,function(x) {active(masks(x))<- activeMask;x})
>>       GR_seq<-
>>         ## the MaskedDNAString corresponding to GR
>>         sapply(as.character(seqnames(GR)),']]',BSg_seqList)
>>       VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
>> }
>>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list