[BioC] Sorting a GAlignments object by QNAME

Hervé Pagès hpages at fhcrc.org
Mon Sep 30 21:17:09 CEST 2013


Hi Rubi,

On 09/30/2013 07:33 AM, nimrod rubinstein wrote:
> Hi Hervé and thanks for the response.
>
> I assume you meant 'gal[order(names(gal))]' rather than
> 'gal[order(seqnames(gal))]'
> in order to sort by QNAME (i.e. read ID)?

My bad. You're right: seqnames is mapped to the RNAME field, not to the
QNAME field.

>
> Using this, however, doesn't seem to sort gal by the same sorting rules
> that sortBam or samtools sort -n does. Am I missing something?

2 reasons for that:

   (1) The result of 'order(names(gal))' depends on the
       collating sequence of the locale in use (see ?order
       and ?Comparison). So different users can get different
       results when doing 'gal[order(names(gal))]'. I don't
       know what collating sequence 'samtools sort -n' is using
       but even users using the same collating sequence might
       get different results, because of (2).

   (2) We don't know whether order() uses a "stable" sorting algorithm
       or not (AFAIK ?order doesn't tell). When a tie occurs (and when
       sorting by QNAME you can bet many ties will occur), a non-stable
       algo is free to break the tie in any way, whereas a stable algo
       is guaranteed to order the elements involved in the tie the
       same way they show up in the input.

To summarize, 'gal[order(names(gal))]' and 'samtools sort -n' would
be guaranteed to give the same result if both were using the same
collating sequence and a stable sorting algo. I guess it's not the case.

HTH,
H.

>
> Thanks,
> rubi
>
>
>
> On Mon, Sep 30, 2013 at 12:51 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi Rubi,
>>
>> 'gal[order(seqnames(gal))]' will sort GAlignments object 'gal' by QNAME.
>>
>> H.
>>
>>
>>
>> On 09/29/2013 12:32 PM, rubi [guest] wrote:
>>
>>>
>>> Is there a way to sort the records in a GAlignments object by the QNAME,
>>> as this object is created with the readGAlignmentsFromBam function where
>>> the bam file and its corresponding index file must be sorted by RNAME and
>>> POS.
>>>
>>> Unless I'm missing something the only way I see how can this be done is
>>> read the bam into a data.table and sort that.
>>>
>>>    -- output of sessionInfo():
>>>
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>>> States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>> [5] LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> base
>>>
>>> other attached packages:
>>>    [1] doParallel_1.0.3      iterators_1.0.6       foreach_1.4.1
>>> data.table_1.8.10     Rsamtools_1.13.44     Biostrings_2.29.19
>>>   GenomicRanges_1.13.45 XVector_0.1.4
>>>    [9] IRanges_1.19.38       BiocGenerics_0.7.5
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6    codetools_0.2-8 stats4_3.0.2    tools_3.0.2
>>> zlibbioc_1.7.0
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>>
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list