[BioC] HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"

Cook, Malcolm MEC at stowers.org
Thu Sep 22 16:41:38 CEST 2011


Herve,

I agree that loading one chromosome at a time is preferable.  I am working in fly genome now, so the memory footprint of loading them all at once is bearable, but this will be burdensome in larger genomes.

I tried using bsapply to this end, but found that 
  1) the applied FUN did not 'know' the name of the current chromosome to which it was being applied, therefor it could not known which of the GenomicRange's it should extract
  2) the naive approach would not preserve the order of the input GRanges in the result, which is depended upon in my use case as it requires being able to recover the GRange coordinates from an exported fasta file.

I like your analysis of the problem, however, if I understand your proposal, I fear the burden imposed by injecting the mask(s) onto the entire chromosome, prior to extracting the subseq, may itself prove onerous since this requires copying it in entirety (as opposed to hard masking just the individual subseqs which in typical(?) applications will be a small fraction of the genome).

Assuming the performance issue is workaroundable, whether or not to provide a convenience wrapper to getSeq is a matter of packaging.  Depending on how simple invoking getSeq with a callback to perform hardMasking, you might just document its usage in getSeq's own documentation and be done with it....

Cheers,

~Malcolm


-----Original Message-----
From: Hervé Pagès [mailto:hpages at fhcrc.org] 
Sent: Tuesday, September 20, 2011 12:53 PM
To: Cook, Malcolm
Cc: 'bioconductor at r-project.org'
Subject: Re: HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"

Hi Malcom,

This is indeed covering a typical/important use case. Thanks a lot for
sharing the code!

I think it's possible and we should avoid loading all the chromosomes
simultaneously in memory. The top-level loop in getSeq() loads one
chromosome at a time, extracts what needs to be extracted from it, and
then "releases" it.
I was thinking that it would be easy to extend the capabilities of
getSeq() by adding an argument to it. This argument would be a hook
function that takes at least 2 args: the BSgenome object and the name
of a sequence. It would be called in the top-level loop right after the
chromosome has been loaded and just before things are extracted from
it, and would need to return a DNAString object of the same length as
the chromosome.

In this context the current behaviour of getSeq() corresponds
to a hook that just drops the masks and your getSeqHardMasked()
function corresponds to a hook that injects the specified masks.

Because getSeqHardMasked() is probably the most typical use case
for using a user-supplied hook, once the hook feature is implemented
we should probably add the function anyway as a convenience wrapper
to getSeq(), called with the appropriate hook. Hope that makes sense.

I'll try to come up with something like this and will let you know.

Cheers,
H.


On 11-09-16 01:36 PM, Cook, Malcolm wrote:
> Hi, fellow bioconductors,
>
> I offer the following function as an implementation of a solution to the problem presented in
>
> 	[BioC] insert Ns for repeat masked regions - https://stat.ethz.ch/pipermail/bioconductor/2011-March/038134.html
> 	[Bioc-sig-seq] Extract masked sequences - https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-January/000786.html
>
> I have cc:ed the original discussants on those two threads.
>
> I embrace any criticisms or suggestions for improvement, and I hope it helps any colleague faced with the same issue.
>
> Cheers,
>
> Malcolm Cook
>
>
> getSeqHardMasked<-
>    function(BSg,GR,maskList,letter) {
> ### PURPOSE: return list of DNAString sequences extracted from the
> ### BSgenome<BSg>  corresponding to each location in GenomicRange
> ###<GR>, and masked with<letter>  according to the masks named in
> ###<maskList>  (which are encoded following BSParams convention).
> ###
> ### USE CASE - write fasta file of hard masked regions, using
> ###            RepeatMasker (RM) and Tandem Repeat Finder (TRF):
> ###
> ### GR<- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
> ### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
> ###            ,"myExtractForGR.fa"
> ###            ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
> ###            )
> ###
> ### NB: The implementation was coded 'pay(ing) attention to memory
> ### management' following suggestions from Herve in:
> ### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
> ### In particular, the inidividual chromosomes and their
> ### subseq(uences) should be garbage collectable after the function
> ### exits and they go out of scope, (although the chromosomes _are_
> ### all simultaneously loaded which I think is unavoidable if the
> ### results are to preserve the arbitrary order of GR).
> ###
> ### NB: My initial implementation FAILed as it used bsapply&  BSParams
> ### whose FUN can not 'know' the name of the sequence (which was
> ### needed to know which subseqs to extract).
>      ']]'<-
>        ## utility to subscript b by a.
>        function(a,b) b[[a]]
>      Vsubseq<-
>        ## vectorized version of subseq.
>        Vectorize(subseq)
>      VinjectHardMask<-
>        ## vectorized version of injectHardMask.
>        Vectorize(injectHardMask)
>      activeMask<-
>        ## A logical vector indicating whether each mask should be ON or
>        ## OFF to be applied to each chromosome in BSg.
>        masknames(BSg) %in% names(maskList[which(maskList)])
>      BSg_seqList<-
>        ## BSg as a list of named MaskedDNAString (one per chromosome)...
>        sapply(seqnames(BSg),']]',BSg)
>      BSg_seqList<-
>        ## ... with the masks for each chromosome activated.
>        sapply(BSg_seqList,function(x) {active(masks(x))<- activeMask;x})
>      GR_seq<-
>        ## the MaskedDNAString corresponding to GR
>        sapply(as.character(seqnames(GR)),']]',BSg_seqList)
>      VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
> }
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list