[BioC] Filtering BAM files by start position for VariantTools

Valerie Obenchain vobencha at fhcrc.org
Thu Jul 11 19:02:18 CEST 2013


Hi Sean,

As you've discovered, the 'which' in the 'param' (for reading bam files) 
specifies positions to overlap, not start or end. One approach to 
isolating reads with a specific starting position would be to filter the 
bam by 'pos'.

     library(VariantTools)
     fl <- LungCancerLines::LungCancerBamFiles()$H1993

     mystart <- 1110426
     filt <- list(setStart=function(x) x$pos %in% mystart)
     dest <- tempfile()
     filterBam(fl, dest, filter=FilterRules(filt))
     scn <- scanBam(dest)

Confirm all reads start with 'mystart':

 > table(scn[[1]]$pos)

1110426
    2388

If you want a tally of all nucleotides for all sequences starting with 
'mystart' then no need to supply a 'which':
     param <- VariantTallyParam(gmapR::TP53Genome(),
                                readlen=100L,
                                high_base_quality=23L)
     tally <- tallyVariants(fl, param)


Valerie


On 07/09/2013 02:06 PM, Taylor, Sean D wrote:
> I am trying to read a specific set of records from a bam file for use in the VariantTools package. I'm trying to construct a which argument (a GRanges object) that will pull in a set of records from all reads that only start at a specified position. (i.e. all reads that start at position 100). So far I have only been able to specify reads that overlap position 100, but have not been able to find a way to define the start site.
>
> #Example code:
>> bams <- LungCancerLines::LungCancerBamFiles()
>> bam <- bams$H1993
>> which<-GRanges(seqnames=c('TP53'), IRanges(1110426, width=1), '+')
>> tally.param <- VariantTallyParam(gmapR::TP53Genome(),
> + readlen = 100L,
> + high_base_quality = 23L,
> + which = which)
>> raw.variants <- tallyVariants(bam, tally.param)
>
> This code shows all the variants at position 1110426, but not all the variants from the reads that start at position 1110426.
>
> Ultimately, I am trying to do this for all start positions in my data set, so I would want something that looks like this pseudocode:
>> raw.variants<-lapply (start(bam), function (x){
>    which<-GRanges(seqnames=c('chrM'), '+', start=x)
>    tally.param<-VariantTallyParam(gmap, readlen=100L, which=which)
>    tallyVariants(bamfile, tally.param)
> })
>
> Thanks,
> Sean
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>   [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] RSQLite_0.11.4          DBI_0.2-7               BiocInstaller_1.10.2
>   [4] LungCancerLines_0.0.8   GenomicFeatures_1.12.1  AnnotationDbi_1.22.5
>   [7] Biobase_2.20.0          gmapR_1.2.0             latticeExtra_0.6-24
> [10] lattice_0.20-15         RColorBrewer_1.0-5      genoPlotR_0.8
> [13] ade4_1.5-2              VariantTools_1.2.2      VariantAnnotation_1.6.6
> [16] Rsamtools_1.12.3        Biostrings_2.28.0       GenomicRanges_1.12.4
> [19] IRanges_1.18.1          BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.16.0                          bitops_1.0-5
>   [3] BSgenome_1.28.0                         BSgenome.Hsapiens.UCSC.hg19_1.3.19
>   [5] graph_1.38.2                            Matrix_1.0-12
>   [7] org.Hs.eg.db_2.9.0                      RBGL_1.36.2
>   [9] RCurl_1.95-4.1                          rtracklayer_1.20.2
> [11] stats4_3.0.1                            tools_3.0.1
> [13] TxDb.Hsapiens.UCSC.hg19.knownGene_2.9.2 XML_3.96-1.1
> [15] zlibbioc_1.6.0
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list