[BioC] Subsampling of BAM/SAM alignments

Martin Morgan mtmorgan at fhcrc.org
Wed Mar 13 03:45:01 CET 2013


On 03/12/2013 01:32 PM, Julian Gehring wrote:
> Hi,
>
> Is there an efficient function to randomly subsample alignments from a BAM/SAM
> within R?  In the best case, an equivalent to 'FastqSampler' in 'ShortRead'.  If
> not, what would be a clever way of doing this with the bioc framework, without
> having to read the whole file first?

Hi Julian -- there isn't anything at the moment; below is a function that will 
have at most yieldSize * 2 elements in memory at one time. All the elements in 
the bam file (satisfying ScanBamParam) will eventually pass through memory, so 
not super efficient.

sampleBam <-
     function(file, index=file, ..., yieldSize, verbose=FALSE,
              param=ScanBamParam(what=scanBamWhat()))
{
     qunlist <- Rsamtools:::.quickUnlist
     tot <- sampleSize <- yieldSize
     bf <- open(BamFile(file, index, yieldSize=yieldSize))
     smpl <- qunlist(scanBam(bf, param=param))

     repeat {
         yld <- qunlist(scanBam(bf, param=param))
         yld_n <- length(yld[[1]])
         if (length(yld[[1]]) == 0L)
             break
         tot <- tot + yld_n
         keep <- rbinom(1L, yld_n, yld_n/ tot)
         if (verbose)
             message(sprintf("sampling %d of %d", keep, tot))
         if (keep == 0L)
             next

         i <- sample(sampleSize, keep)
         j <- sample(yld_n, keep)
         smpl <- Map(function(x, y, i, j) {
             x[i] <- y[j]
             x
         }, smpl, yld, MoreArgs=list(i=i, j=j))
     }

     close(bf)
     smpl
}


with

   fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
   param <- ScanBamParam(
       flag=scanBamFlag(isUnmappedQuery=FALSE),
       what=c("rname", "pos", "cigar", "strand"))

   lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)

The end result could be fed to a more friendly structure

   names(lst)[names(lst) == "rname"] = "seqname"
   do.call(GappedAlignments, lst)

Lightly tested, especially when ScanBamParam() is non-trivial.

Martin

>
> Best wishes
> Julian
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list