[BioC] Rsamtools: sortBam() and 'samtools sort' does not give the same results

Martin Morgan mtmorgan at fhcrc.org
Fri Aug 23 01:13:35 CEST 2013


On 08/22/2013 02:13 PM, Henrik Bengtsson wrote:
> Hi,
>
> I've got a BAM file 'foo.bam' [425,850,792 bytes] that I'm trying to
> sort.  It was generated using BWA.  I fail to sort it with
> Rsamtools::sortBam() [Rsamtools v1.12.4] whereas with 'samtools sort'
> [samtools v0.1.18 (r982:295)] it works.
>
> BAM FILE:
> % samtools view -H foo0.bam | head
> @HD     VN:1.3  SO:coordinate
> @SQ     SN:1    LN:249250621
> @SQ     SN:2    LN:243199373
> @SQ     SN:3    LN:198022430
> @SQ     SN:4    LN:191154276
> @SQ     SN:5    LN:180915260
> @SQ     SN:6    LN:171115067
> @SQ     SN:7    LN:159138663
> @SQ     SN:8    LN:146364022
> @SQ     SN:9    LN:141213431
>
> The following:
>> Rsamtools::sortBam("foo.bam", "foo.sort.byR")
>
> outputs file 'foo.sort.byR.bam' [425,293,387 bytes], but when trying
> to generate an index, both indexBam() and 'samtools index' throws an
> error:
>
>> Rsamtools::indexBam("foo.sort.byR.bam")
> Error in FUN("foo.sort.byR.bam"[[1L]], ...) : failed to build index
>    file: foo.sort.byR.bam
> In addition: Warning messages:
> 1: In FUN("foo.sort.byR.bam"[[1L]], ...) :
>    [bam_index_core] the alignment is not sorted: reads without
> coordinates prior to reads with coordinates.
> 2: In FUN("foo.sort.byR.bam"[[1L]], ...) :
>    [bam_index_build2] fail to index the BAM file.
>
> and
>
> % samtools index foo.sort.byR.bam
> [bam_index_core] the alignment is not sorted: reads without
> coordinates prior to reads with coordinates.
> [bam_index_build2] fail to index the BAM file.
>
> Trying to call sortBam() multiple times changes nothing.  However, if
> I use 'samtools sort' to sort the BAM file, it all works:
>
> % samtools sort foo.bam foo.sort
>    [bam_sort_core] merging from 3 files...
>
> outputs file 'foo.sort.bam' [425,254,135 bytes]. With this file both
>
> % samtools index foo.sort.bam
>
> and
>
>> Rsamtools::indexBam("foo.sort.bam")
>
> work.  They both generate identical file 'foo.sort.bam.bai' [8,274,272 bytes].
>
> I've tried also with a toy sample BAM file 'longreads.bam' [1,923,379
> bytes] and there I don't get any errors (although Rsamtools and
> 'samtools sort' do output differently sized *.sort.bam files).
>
> I've never had this issue before. Comments?

I think samtools-0.1.18 is from Sept 2, 2011 (from git)

Author: Heng Li <lh3 at live.co.uk>
Date:   Fri Sep 2 16:57:34 2011 +0000

      * Updated samtools to the latest
      * Added seqtk.c
      * Updated wgsim.c
      * release samtools-0.1.18

versus (not sure why this didn't get into the NEWS file) in Rsamtools in release

r74547 | mtmorgan at fhcrc.org | 2013-03-18 22:45:53 -0700 (Mon, 18 Mar 2013) | 4 l
ines

update samtools to 0.1.19 beta

so one possibility is a regression in samtools. Happy to investigate further if 
you've got a reproducible example.

Martin

>
> /Henrik
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.12.4     Biostrings_2.28.0    GenomicRanges_1.12.4
> [4] IRanges_1.18.2       BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6   stats4_3.0.1   tools_3.0.1    zlibbioc_1.6.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list