[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges

Wei Shi shi at wehi.EDU.AU
Tue Feb 14 23:42:47 CET 2012


That's great! Thanks Martin.

Cheers,
Wei


On Feb 15, 2012, at 9:40 AM, Martin Morgan wrote:

> On 02/14/2012 02:36 PM, Wei Shi wrote:
>> Hi Thomas,
>> 
>> After re-checking the changes we have made before for addressing the
>> issue of reads mapped out of the chromosome ranges, I found we
>> actually made the decision to allow reads to be mapped out of the
>> chromosome ranges, due to the fact that the reference sequences might
>> not be complete and reads could originate from outside of the
>> reference sequences used for mapping.
>> 
>> So you may take Martin's suggestion to add extra parameters when you
> 
> Actually, Herve has patched GenomicRanges (1.7.25) in the development branch so that ranges outside sequence bounds are now accepted (with a warning).
> 
> Martin
> 
>> call 'readGappedAlignments' function to make it work for you. Or
>> alternatively you may try the 'featureCounts' function in Rsubread
>> package to count the number of reads mapped to each gene, if the
>> purpose of your analysis is perform RNA-seq analysis. This function
>> works on SAM file, so you do not need to convert it to a BAM file.
>> But you will have to provide an Arabidopsis anntation file it as it
>> has built-in annotation for mouse and human only at present.
>> 
>> I think it might be better if the 'readGappedAlignments'  function
>> could allow reads being mapped outside of chromosomal ranges. We have
>> been successfully running through a SNP discovery pipeline using
>> subread/subjunc, samtools and vcftools on unix command lines, which
>> does not complain about the reads mapped outside of the chromosomal
>> ranges. I guess many other downstream analysis tools do not require
>> mapping locations of reads to be within the chromosomal ranges as
>> well.
>> 
>> Hope this helps.
>> 
>> Cheers, Wei
>> 
>> 
>> 
>> On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote:
>> 
>>> That's great, thanks a lot!
>>> 
>>> BTW: are there any published/shareable performance test results
>>> available comparing subread against other RNA-read to genome
>>> aligners such as Tophat with respect to the accuracy of aligning
>>> reads across exon/intron junctions?
>>> 
>>> Thanks,
>>> 
>>> Thomas
>>> 
>>> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote:
>>>> Hi Thomas,
>>>> 
>>>> I have just committed changes to both the release version and the
>>>> devel version of Rsubread package to fix the bug of mapping reads
>>>> out of chromosomal boundaries. They should be available in 24-36
>>>> hours. Please rebuild your index when you rerun your alignment.
>>>> 
>>>> Let me know if the problem persists or you encounter further
>>>> problems.
>>>> 
>>>> Cheers, Wei
>>>> 
>>>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote:
>>>> 
>>>>> Thanks Martin for your suggestion!
>>>>> 
>>>>> Thomas
>>>>> 
>>>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
>>>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote:
>>>>>>> With some Illumina libraries I have been running into
>>>>>>> problems importing the read mappings from Rsubread into
>>>>>>> Rsamtools. After some testing I found out that some reads
>>>>>>> reported in Rsubread's SAM output have their end positions
>>>>>>> outside of the chromosome ranges. If those reads are
>>>>>>> removed from the SAM file then the import into Rsamtools
>>>>>>> works just fine. Below is a reproducible example of this
>>>>>>> problem.
>>>>>>> 
>>>>>>> I could think of several solutions to fix this, e.g. not
>>>>>>> reporting reads outside of chromosome ranges or updating
>>>>>>> their length in the CIAGR string. An additional option
>>>>>>> could be to remove invalid mappings during the import into
>>>>>>> Rsamtools.
>>>>>> 
>>>>>> Hi Thomas --
>>>>>> 
>>>>>> for the latter and for the case where those reads are
>>>>>> intrinsically not interesting, it might be reasonable to
>>>>>> specify a range on readGappedAlignments that precludes the
>>>>>> possibility of extension beyond sequence ends.
>>>>>> 
>>>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si),
>>>>>>> IRanges(34, seqlengths(si)-34)) bam<-
>>>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr))
>>>>>> 
>>>>>> One would like a better solution, either doing this
>>>>>> automatically with a warning, or warning rather than stopping
>>>>>> on reads overlapping ends.
>>>>>> 
>>>>>> Martin
>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Thomas
>>>>>>> 
>>>>>>> ################################ ## Read mapping with
>>>>>>> Rsubread ## ################################
>>>>>>> library(Rsubread) ## Reference source:
>>>>>>> ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
>>>>>>> 
>>>>>>> 
> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
>>>>>>> buildindex(basename="tair10chr.fasta",
>>>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta",
>>>>>>> readfile1="SRR390302.fastq",
>>>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1,
>>>>>>> TH1=2)
>>>>>>> 
>>>>>>> ##################################### ## Process SAM file
>>>>>>> with Rsamtools ## #####################################
>>>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam",
>>>>>>> destination="SRR390302_subread") [1]
>>>>>>> "./data/SRR390302_subread.bam" reads<-
>>>>>>> readBamGappedAlignments("SRR390302_subread.bam",
>>>>>>> use.names=FALSE) Error in validObject(.Object) : invalid
>>>>>>> class "GappedAlignments" object: 'ranges' contains values
>>>>>>> outside of sequence bounds
>>>>>>> 
>>>>>>> ## The error disappears if the following two reads are
>>>>>>> removed from the SAM file: ## SRR390302.434558 and
>>>>>>> SRR390302.2716043. Both reads have their end positions
>>>>>>> outside ## the range of Chr5 as illustrated here:
>>>>>>> 
>>>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ
>>>>>>> SN:Chr1 LN:30427671 @SQ     SN:Chr2 LN:19698289 @SQ
>>>>>>> SN:Chr3 LN:23459830 @SQ     SN:Chr4 LN:18585056 @SQ
>>>>>>> SN:Chr5 LN:26975502 @SQ     SN:ChrC LN:154478 @SQ
>>>>>>> SN:ChrM LN:366924 SRR390302.1     0       Chr1    27018257
>>>>>>> 2.0     35M     *       0       0       TGG...ACC
>>>>>>> III...GD)0 SRR390302.2     4       *       0       0
>>>>>>> *       *       0       0       TCC...AAA     III...+I@
>>>>>>> ... SRR390302.434558        0       Chr5    26975485
>>>>>>> 5.0     35M     *       0       0       ATG...TGG
>>>>>>> III...&4( SRR390302.2716043       0       Chr5    26975486
>>>>>>> 3.0     35M     *       0       0       CAT...GGT
>>>>>>> III...+2&
>>>>>>> 
>>>>>>> 
>>>>>>>> sessionInfo()
>>>>>>> R version 2.14.0 (2011-10-31) Platform:
>>>>>>> x86_64-unknown-linux-gnu (64-bit)
>>>>>>> 
>>>>>>> locale: [1] C
>>>>>>> 
>>>>>>> attached base packages: [1] stats     graphics  grDevices
>>>>>>> utils     datasets  methods   base
>>>>>>> 
>>>>>>> other attached packages: [1] Rsamtools_1.6.3
>>>>>>> Biostrings_2.22.0   GenomicRanges_1.6.4 IRanges_1.12.5
>>>>>>> Rsubread_1.4.2
>>>>>>> 
>>>>>>> loaded via a namespace (and not attached): [1]
>>>>>>> BSgenome_1.22.0    RCurl_1.8-0        XML_3.6-2
>>>>>>> bitops_1.0-4.1     rtracklayer_1.14.4 tools_2.14.0
>>>>>>> zlibbioc_1.0.0
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search
>>>>>>> the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> 
> -- 
>>>>>> Computational Biology Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>>> 
>>>>>> Location: M1-B861 Telephone: 206 667-2793
>>>>> 
>>>>> _______________________________________________ Bioconductor
>>>>> mailing list Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>>>> archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>>> 
>>>> 
>>>>> 
> ______________________________________________________________________
>>>> The information in this email is confidential and intended solely
>>>> for the addressee. You must not disclose, forward, print or use
>>>> it without the permission of the sender.
>>>> ______________________________________________________________________
>> 
>>>> 
>> 
>> ______________________________________________________________________
>> 
>> 
> The information in this email is confidential and inte...{{dropped:24}}



More information about the Bioconductor mailing list