[BioC] HowTo: "Extract masked sequences" / "insert Ns for repeat masked regions"

Hervé Pagès hpages at fhcrc.org
Tue Sep 20 19:53:11 CEST 2011


Hi Malcom,

This is indeed covering a typical/important use case. Thanks a lot for
sharing the code!

I think it's possible and we should avoid loading all the chromosomes
simultaneously in memory. The top-level loop in getSeq() loads one
chromosome at a time, extracts what needs to be extracted from it, and
then "releases" it.
I was thinking that it would be easy to extend the capabilities of
getSeq() by adding an argument to it. This argument would be a hook
function that takes at least 2 args: the BSgenome object and the name
of a sequence. It would be called in the top-level loop right after the
chromosome has been loaded and just before things are extracted from
it, and would need to return a DNAString object of the same length as
the chromosome.

In this context the current behaviour of getSeq() corresponds
to a hook that just drops the masks and your getSeqHardMasked()
function corresponds to a hook that injects the specified masks.

Because getSeqHardMasked() is probably the most typical use case
for using a user-supplied hook, once the hook feature is implemented
we should probably add the function anyway as a convenience wrapper
to getSeq(), called with the appropriate hook. Hope that makes sense.

I'll try to come up with something like this and will let you know.

Cheers,
H.


On 11-09-16 01:36 PM, Cook, Malcolm wrote:
> Hi, fellow bioconductors,
>
> I offer the following function as an implementation of a solution to the problem presented in
>
> 	[BioC] insert Ns for repeat masked regions - https://stat.ethz.ch/pipermail/bioconductor/2011-March/038134.html
> 	[Bioc-sig-seq] Extract masked sequences - https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-January/000786.html
>
> I have cc:ed the original discussants on those two threads.
>
> I embrace any criticisms or suggestions for improvement, and I hope it helps any colleague faced with the same issue.
>
> Cheers,
>
> Malcolm Cook
>
>
> getSeqHardMasked<-
>    function(BSg,GR,maskList,letter) {
> ### PURPOSE: return list of DNAString sequences extracted from the
> ### BSgenome<BSg>  corresponding to each location in GenomicRange
> ###<GR>, and masked with<letter>  according to the masks named in
> ###<maskList>  (which are encoded following BSParams convention).
> ###
> ### USE CASE - write fasta file of hard masked regions, using
> ###            RepeatMasker (RM) and Tandem Repeat Finder (TRF):
> ###
> ### GR<- GRanges('chr2L',IRanges(c(1,100),c(15,125)))
> ### writeFASTA(getSeqHardMasked(BSgenome, GR, c(RM=TRUE,TRF=TRUE), "N")
> ###            ,"myExtractForGR.fa"
> ###            ,paste(seqnames(GR),start(GR),end(GR),strand(GR),sep=':')
> ###            )
> ###
> ### NB: The implementation was coded 'pay(ing) attention to memory
> ### management' following suggestions from Herve in:
> ### https://stat.ethz.ch/pipermail/bioconductor/2011-March/038143.html.
> ### In particular, the inidividual chromosomes and their
> ### subseq(uences) should be garbage collectable after the function
> ### exits and they go out of scope, (although the chromosomes _are_
> ### all simultaneously loaded which I think is unavoidable if the
> ### results are to preserve the arbitrary order of GR).
> ###
> ### NB: My initial implementation FAILed as it used bsapply&  BSParams
> ### whose FUN can not 'know' the name of the sequence (which was
> ### needed to know which subseqs to extract).
>      ']]'<-
>        ## utility to subscript b by a.
>        function(a,b) b[[a]]
>      Vsubseq<-
>        ## vectorized version of subseq.
>        Vectorize(subseq)
>      VinjectHardMask<-
>        ## vectorized version of injectHardMask.
>        Vectorize(injectHardMask)
>      activeMask<-
>        ## A logical vector indicating whether each mask should be ON or
>        ## OFF to be applied to each chromosome in BSg.
>        masknames(BSg) %in% names(maskList[which(maskList)])
>      BSg_seqList<-
>        ## BSg as a list of named MaskedDNAString (one per chromosome)...
>        sapply(seqnames(BSg),']]',BSg)
>      BSg_seqList<-
>        ## ... with the masks for each chromosome activated.
>        sapply(BSg_seqList,function(x) {active(masks(x))<- activeMask;x})
>      GR_seq<-
>        ## the MaskedDNAString corresponding to GR
>        sapply(as.character(seqnames(GR)),']]',BSg_seqList)
>      VinjectHardMask(Vsubseq(GR_seq,start(GR),end(GR)),letter=letter)
> }
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list