[BioC] Filtering BAM files by start position for VariantTools

Valerie Obenchain vobencha at fhcrc.org
Thu Jul 11 22:24:45 CEST 2013


On 07/11/2013 01:11 PM, Valerie Obenchain wrote:
> As an fyi, Martin just checked in a change to Rsamtools (devel) that
> allows filtering by range. This should speed up the filtering
> substantially. I'll use the new syntax in the filter example below.
>
> I'm not sure of the best way to tallyVariants for all groups of reads
> for each unique start position. Here's an approach I might take - maybe
> others will chime in with their ideas. Start by reading in the bam and
> identify the unique starts.
>
>    library(Rsamtools)
>    fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>    gal <- readGAlignments(fl)
>    starts <- unique(start(gal))
>
> mclapply() over the seqlevels containing unique starts.
> (code not tested)

Nope. This won't work. Trying to be too clever with the mclapply().

You could create two vectors, one of seqlevels and one of starts (unique 
combinations). Pass these to Map() and use the same logic as below. This 
would allow you to create the correct param, filter and get the tally 
results for each seqlevel-start combo.

Valerie
>
>    lst <- split(start(gal), as.factor(seqnames(gal)))
>    starts <- lapply(lst, unique)
>    vtparam <- VariantTallyParam(...)
>    mclapply(starts,
>             function(start, rname, fl, vtparam) {
>               ## 'what' imports only the fields used in the filter
>               ## 'which' is the position(s) to overlap
>               param <- ScanBamParam(
>                                 what=c("rname", "pos"),
>                                 which=GRanges(rname,
>                                   IRanges(start, width=1)))
>               filter <- FilterRules(list(atPos = function(x) {
>                           (x$rname %in% rname) &
>                           (x$pos %in% start)}))
>               dest0 <- filterBam(fl, tempfile(),
>                                  filter=filter, param=param)
>               tallyVariants(dest0, vtparam)},
>               rname=names(starts), fl=fl, vtparam=vtparam)
>
>
> Rsamtools in devel is broken today but should be ok tomorrow.
>
> Valerie
>
> On 07/11/2013 10:27 AM, Taylor, Sean D wrote:
>> Thanks Valerie, I'll give that a try. And I see that you just
>> clarified my question about the input file for tallyVariants.
>>
>> I had thought about doing this before but hadn't been able to figure
>> out the proper filter syntax. Thanks!  The only downside to this
>> approach is that I ultimately want to do this for all unique start
>> sites that align to my reference. That could entail generating
>> thousands of tempfiles that all have to be read back in. It could be
>> done with lapply, but it sounds rather memory and time intensive.
>> Another approach that I tried was reading the bam file as a
>> GappedAlignments object and then splitting it by start position into a
>> GAlignmentsList. That was really easy, but so far as I can tell
>> tallyVariants will not accept GappedAlignments objects as an input. I
>> wonder if there is another way to vectorize this approach?
>>
>> Thanks,
>> Sean
>>
>>
>> -----Original Message-----
>> From: Valerie Obenchain [mailto:vobencha at fhcrc.org]
>> Sent: Thursday, July 11, 2013 10:02 AM
>> To: Taylor, Sean D
>> Cc: bioconductor at r-project.org; Michael Lawrence
>> Subject: Re: [BioC] Filtering BAM files by start position for
>> VariantTools
>>
>> Hi Sean,
>>
>> As you've discovered, the 'which' in the 'param' (for reading bam
>> files) specifies positions to overlap, not start or end. One approach
>> to isolating reads with a specific starting position would be to
>> filter the bam by 'pos'.
>>
>>       library(VariantTools)
>>       fl <- LungCancerLines::LungCancerBamFiles()$H1993
>>
>>       mystart <- 1110426
>>       filt <- list(setStart=function(x) x$pos %in% mystart)
>>       dest <- tempfile()
>>       filterBam(fl, dest, filter=FilterRules(filt))
>>       scn <- scanBam(dest)
>>
>> Confirm all reads start with 'mystart':
>>
>>   > table(scn[[1]]$pos)
>>
>> 1110426
>>      2388
>>
>> If you want a tally of all nucleotides for all sequences starting with
>> 'mystart' then no need to supply a 'which':
>>       param <- VariantTallyParam(gmapR::TP53Genome(),
>>                                  readlen=100L,
>>                                  high_base_quality=23L)
>>       tally <- tallyVariants(fl, param)
>>
>>
>> Valerie
>>
>>
>> On 07/09/2013 02:06 PM, Taylor, Sean D wrote:
>>> I am trying to read a specific set of records from a bam file for use
>>> in the VariantTools package. I'm trying to construct a which argument
>>> (a GRanges object) that will pull in a set of records from all reads
>>> that only start at a specified position. (i.e. all reads that start
>>> at position 100). So far I have only been able to specify reads that
>>> overlap position 100, but have not been able to find a way to define
>>> the start site.
>>>
>>> #Example code:
>>>> bams <- LungCancerLines::LungCancerBamFiles()
>>>> bam <- bams$H1993
>>>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+')
>>>> tally.param <- VariantTallyParam(gmapR::TP53Genome(),
>>> + readlen = 100L,
>>> + high_base_quality = 23L,
>>> + which = which)
>>>> raw.variants <- tallyVariants(bam, tally.param)
>>>
>>> This code shows all the variants at position 1110426, but not all the
>>> variants from the reads that start at position 1110426.
>>>
>>> Ultimately, I am trying to do this for all start positions in my data
>>> set, so I would want something that looks like this pseudocode:
>>>> raw.variants<-lapply (start(bam), function (x){
>>>     which<-GRanges(seqnames=c('chrM'), '+', start=x)
>>>     tally.param<-VariantTallyParam(gmap, readlen=100L, which=which)
>>>     tallyVariants(bamfile, tally.param)
>>> })
>>>
>>> Thanks,
>>> Sean
>>>
>>>> sessionInfo()
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> LC_TIME=en_US.UTF-8
>>>    [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
>>> LC_MESSAGES=en_US.UTF-8
>>>    [7] LC_PAPER=C                 LC_NAME=C
>>> LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>>> LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] grid      parallel  stats     graphics  grDevices utils
>>> datasets  methods   base
>>>
>>> other attached packages:
>>> [1] RSQLite_0.11.4          DBI_0.2-7               BiocInstaller_1.10.2
>>>    [4] LungCancerLines_0.0.8   GenomicFeatures_1.12.1
>>> AnnotationDbi_1.22.5
>>>    [7] Biobase_2.20.0          gmapR_1.2.0
>>> latticeExtra_0.6-24
>>> [10] lattice_0.20-15         RColorBrewer_1.0-5      genoPlotR_0.8
>>> [13] ade4_1.5-2              VariantTools_1.2.2
>>> VariantAnnotation_1.6.6
>>> [16] Rsamtools_1.12.3        Biostrings_2.28.0
>>> GenomicRanges_1.12.4
>>> [19] IRanges_1.18.1          BiocGenerics_0.6.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] biomaRt_2.16.0                          bitops_1.0-5
>>>    [3] BSgenome_1.28.0
>>> BSgenome.Hsapiens.UCSC.hg19_1.3.19
>>>    [5] graph_1.38.2                            Matrix_1.0-12
>>>    [7] org.Hs.eg.db_2.9.0                      RBGL_1.36.2
>>>    [9] RCurl_1.95-4.1                          rtracklayer_1.20.2
>>> [11] stats4_3.0.1                            tools_3.0.1
>>> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1
>>> [15] zlibbioc_1.6.0
>>>
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list