[BioC] Rsamtools: sortBam() and 'samtools sort' does not give the same results

Henrik Bengtsson hb at biostat.ucsf.edu
Fri Aug 23 02:23:18 CEST 2013


On Thu, Aug 22, 2013 at 4:13 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 08/22/2013 02:13 PM, Henrik Bengtsson wrote:
>>
>> Hi,
>>
>> I've got a BAM file 'foo.bam' [425,850,792 bytes] that I'm trying to
>> sort.  It was generated using BWA.  I fail to sort it with
>> Rsamtools::sortBam() [Rsamtools v1.12.4] whereas with 'samtools sort'
>> [samtools v0.1.18 (r982:295)] it works.
>>
>> BAM FILE:
>> % samtools view -H foo0.bam | head
>> @HD     VN:1.3  SO:coordinate
>> @SQ     SN:1    LN:249250621
>> @SQ     SN:2    LN:243199373
>> @SQ     SN:3    LN:198022430
>> @SQ     SN:4    LN:191154276
>> @SQ     SN:5    LN:180915260
>> @SQ     SN:6    LN:171115067
>> @SQ     SN:7    LN:159138663
>> @SQ     SN:8    LN:146364022
>> @SQ     SN:9    LN:141213431
>>
>> The following:
>>>
>>> Rsamtools::sortBam("foo.bam", "foo.sort.byR")
>>
>>
>> outputs file 'foo.sort.byR.bam' [425,293,387 bytes], but when trying
>> to generate an index, both indexBam() and 'samtools index' throws an
>> error:
>>
>>> Rsamtools::indexBam("foo.sort.byR.bam")
>>
>> Error in FUN("foo.sort.byR.bam"[[1L]], ...) : failed to build index
>>    file: foo.sort.byR.bam
>> In addition: Warning messages:
>> 1: In FUN("foo.sort.byR.bam"[[1L]], ...) :
>>    [bam_index_core] the alignment is not sorted: reads without
>> coordinates prior to reads with coordinates.
>> 2: In FUN("foo.sort.byR.bam"[[1L]], ...) :
>>    [bam_index_build2] fail to index the BAM file.
>>
>> and
>>
>> % samtools index foo.sort.byR.bam
>> [bam_index_core] the alignment is not sorted: reads without
>> coordinates prior to reads with coordinates.
>> [bam_index_build2] fail to index the BAM file.
>>
>> Trying to call sortBam() multiple times changes nothing.  However, if
>> I use 'samtools sort' to sort the BAM file, it all works:
>>
>> % samtools sort foo.bam foo.sort
>>    [bam_sort_core] merging from 3 files...
>>
>> outputs file 'foo.sort.bam' [425,254,135 bytes]. With this file both
>>
>> % samtools index foo.sort.bam
>>
>> and
>>
>>> Rsamtools::indexBam("foo.sort.bam")
>>
>>
>> work.  They both generate identical file 'foo.sort.bam.bai' [8,274,272
>> bytes].
>>
>> I've tried also with a toy sample BAM file 'longreads.bam' [1,923,379
>> bytes] and there I don't get any errors (although Rsamtools and
>> 'samtools sort' do output differently sized *.sort.bam files).
>>
>> I've never had this issue before. Comments?
>
>
> I think samtools-0.1.18 is from Sept 2, 2011 (from git)
>
> Author: Heng Li <lh3 at live.co.uk>
> Date:   Fri Sep 2 16:57:34 2011 +0000
>
>      * Updated samtools to the latest
>      * Added seqtk.c
>      * Updated wgsim.c
>      * release samtools-0.1.18
>
> versus (not sure why this didn't get into the NEWS file) in Rsamtools in
> release
>
> r74547 | mtmorgan at fhcrc.org | 2013-03-18 22:45:53 -0700 (Mon, 18 Mar 2013) |
> 4 l
> ines
>
> update samtools to 0.1.19 beta
>
> so one possibility is a regression in samtools.

Thanks Martin, I should have known better.  After using samtools
0.1.19 (stable), sortBam() of Rsamtools 1.12.4 and 'samtools sort' of
samtools 0.1.19 gives identical BAM files;

425293387 Aug 22 13:42 foo.sort.Rsamtools1_12_4.bam
425293387 Aug 22 17:14 foo.sort.samtools0_1_19.bam

% diff foo.sort.Rsamtools1_12_4.bam foo.sort.samtools0_1_19.bam
% md5sum foo.sort.Rsamtools1_12_4.bam foo.sort.samtools0_1_19.bam
0dcd0a5b867d3e049d1edd35fe09445a  foo.sort.Rsamtools1_12_4.bam
0dcd0a5b867d3e049d1edd35fe09445a  foo.sort.samtools0_1_19.bam

So no longer a case of BioC but samtools developers.

/Henrik

> Happy to investigate further if you've got a reproducible example.
>
> Martin
>
>>
>> /Henrik
>>
>>> sessionInfo()
>>
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.12.4     Biostrings_2.28.0    GenomicRanges_1.12.4
>> [4] IRanges_1.18.2       BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6   stats4_3.0.1   tools_3.0.1    zlibbioc_1.6.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793



More information about the Bioconductor mailing list