[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges

Thomas Girke thomas.girke at ucr.edu
Mon Feb 13 01:12:47 CET 2012


Thanks Martin for your suggestion!

Thomas

On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
> On 02/10/2012 07:44 PM, Thomas Girke wrote:
> > With some Illumina libraries I have been running into problems importing
> > the read mappings from Rsubread into Rsamtools. After some testing I
> > found out that some reads reported in Rsubread's SAM output have their
> > end positions outside of the chromosome ranges. If those reads are
> > removed from the SAM file then the import into Rsamtools works just
> > fine. Below is a reproducible example of this problem.
> >
> > I could think of several solutions to fix this, e.g. not reporting reads
> > outside of chromosome ranges or updating their length in the CIAGR string.
> > An additional option could be to remove invalid mappings during the import
> > into Rsamtools.
> 
> Hi Thomas --
> 
> for the latter and for the case where those reads are intrinsically not 
> interesting, it might be reasonable to specify a range on 
> readGappedAlignments that precludes the possibility of extension beyond 
> sequence ends.
> 
>  > si <- seqinfo(BamFile(fl))
>  > gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34))
>  > bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr))
> 
> One would like a better solution, either doing this automatically with a 
> warning, or warning rather than stopping on reads overlapping ends.
> 
> Martin
> 
> >
> > Thanks,
> >
> > Thomas
> >
> > ################################
> > ## Read mapping with Rsubread ##
> > ################################
> > library(Rsubread)
> > ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
> > ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
> > buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta")
> > align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2)
> >
> > #####################################
> > ## Process SAM file with Rsamtools ##
> > #####################################
> > library(Rsamtools)
> > asBam(file="SRR390302_subread.sam", destination="SRR390302_subread")
> > [1] "./data/SRR390302_subread.bam"
> > reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE)
> > Error in validObject(.Object) :
> >    invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds
> >
> > ## The error disappears if the following two reads are removed from the SAM file:
> > ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside
> > ## the range of Chr5 as illustrated here:
> >
> > ## Relevant chunks of SAM file generated by Rsubread
> > @SQ     SN:Chr1 LN:30427671
> > @SQ     SN:Chr2 LN:19698289
> > @SQ     SN:Chr3 LN:23459830
> > @SQ     SN:Chr4 LN:18585056
> > @SQ     SN:Chr5 LN:26975502
> > @SQ     SN:ChrC LN:154478
> > @SQ     SN:ChrM LN:366924
> > SRR390302.1     0       Chr1    27018257        2.0     35M     *       0       0       TGG...ACC     III...GD)0
> > SRR390302.2     4       *       0       0       *       *       0       0       TCC...AAA     III...+I@
> > ...
> > SRR390302.434558        0       Chr5    26975485        5.0     35M     *       0       0       ATG...TGG     III...&4(
> > SRR390302.2716043       0       Chr5    26975486        3.0     35M     *       0       0       CAT...GGT     III...+2&
> >
> >
> >> sessionInfo()
> > R version 2.14.0 (2011-10-31)
> > Platform: x86_64-unknown-linux-gnu (64-bit)
> >
> > locale:
> > [1] C
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > other attached packages:
> > [1] Rsamtools_1.6.3     Biostrings_2.22.0   GenomicRanges_1.6.4 IRanges_1.12.5      Rsubread_1.4.2
> >
> > loaded via a namespace (and not attached):
> > [1] BSgenome_1.22.0    RCurl_1.8-0        XML_3.6-2          bitops_1.0-4.1     rtracklayer_1.14.4 tools_2.14.0       zlibbioc_1.0.0
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> 
> Location: M1-B861
> Telephone: 206 667-2793



More information about the Bioconductor mailing list