[BioC] readGAlignmentPairs error

Valerie Obenchain vobencha at fhcrc.org
Tue Sep 17 23:46:21 CEST 2013


The new pairing algorithm used by readGAlignmentsList() does find mates 
outside the specified range. When only one of the pairs overlaps the 
'which', it looks for a potential mate in the rest of the file.

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="flag",
                       which=GRanges("seq1", IRanges(1, width=40)))

readGAlignmentPairs() finds no mates in this region.

gapairs <- readGAlignmentPairs(BamFile(fl), param=param)
 > gapairs
GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
  seqnames strand :    ranges --    ranges
     <Rle>  <Rle> : <IRanges> -- <IRanges>

readGAlignmentsList() finds two mate pairs and several non-mates in this 
region. The 'mates' metadata column indicates which are legitimate mates 
as per the algorithem described at ?readGAlignmentsListFromBam. More 
control over what records are returned can be done by specifying flags 
in ScanBamParam.

galist <- readGAlignmentsList(BamFile(fl, asMates=TRUE), param=param)
 > galist
GAlignmentsList of length 17:
[[1]]
GAlignments with 2 alignments and 2 metadata columns:
       seqnames strand cigar qwidth start end width ngap | mates flag
   [1]     seq1      -   35M     35   229 263    35    0 |     1   83
   [2]     seq1      +   35M     35    37  71    35    0 |     1  163

[[2]]
GAlignments with 2 alignments and 2 metadata columns:
       seqnames strand cigar qwidth start end width ngap | mates flag
   [1]     seq1      -   35M     35   185 219    35    0 |     1  147
   [2]     seq1      +   35M     35    36  70    35    0 |     1   99

[[3]]
GAlignments with 1 alignment and 2 metadata columns:
       seqnames strand cigar qwidth start end width ngap | mates flag
   [1]     seq1      +   36M     36     6  41    36    0 |     0  137


 > elementLengths(galist)
  [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The first two list elements contain mates.
> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE
  [13] FALSE FALSE FALSE FALSE FALSE


Here are descriptions of the first few flags as per this site:
http://picard.sourceforge.net/explain-flags.html
Flag 137 indicates an unmapped mate which is why this record was not mated.


83:
Summary:
     read paired
     read mapped in proper pair
     read reverse strand
     first in pair

163:
Summary:
     read paired
     read mapped in proper pair
     mate reverse strand
     second in pair

147:
Summary:
     read paired
     read mapped in proper pair
     read reverse strand
     second in pair

99:
Summary:
     read paired
     read mapped in proper pair
     mate reverse strand
     first in pair

137:
Summary:
     read paired
     mate unmapped
     second in pair


Valerie


On 09/17/2013 01:55 PM, Michael Lawrence wrote:
> Yes, I knew there weren't any easy solutions. I just think that somehow we
> need to point these out to users, because it's not exactly obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi guys,
>>
>> Sorry for the late answer.
>>
>>
>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>
>>> There seem to be at least two "gotchas" with the 'which' argument and
>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle
>>> cases where one record overlaps multiple which arguments
>>>
>>
>> This is not just with readGAlignmentPairs, it's also with
>> readGAlignments and, at a lower level, with scanBam(), which
>> the 2 formers are based on. If a record overlaps 2 ranges in the
>> 'which' arg, scanBam() loads it twice.
>>
>> I can't think of an easy way to address this unless we change the
>> semantic of the 'which' argument. For example instead of loading
>> records that overlap we could make the choice to only load records
>> that have their POS (i.e. start) within the ranges specified thru
>> 'which'. It introduces a dissymmetry (because it looks at the start
>> of the record and not at its end) but it has the advantage of making
>> sure records are only loaded once (unless the user passes a 'which'
>> argument that contains overlapping ranges but then s/he's really
>> looking for it ;-) ).
>>
>>
>>   and (2) any read
>>> pairs with one end inside which and the other out will be discarded.
>>>
>>
>> No easy work around for that one with the current implementation of
>> readGAlignmentPairs which just passes its 'which' argument down to
>> scanBam().
>>
>>
>>   Excluding per-chromosome iteration, 'which' is a bit dangerous when
>>> dealing
>>> with read pairs.
>>>
>>
>> It's also dangerous when dealing with single end reads, and it has
>> been for a while.
>>
>> Cheers,
>> H.
>>
>>
>>   Somehow we should make this obvious to the user.
>>>
>>>
>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>>> goldstein.leonard at gene.com> wrote:
>>>
>>>   Dear Herve and list,
>>>>
>>>>
>>>> I realized the error might have been caused due to the same bam record
>>>> being read in multiple times when passing multiple ranges in the which
>>>> argument.
>>>>
>>>>
>>>> I don't think it is necessarily obvious to users that
>>>> readGAlignments/Pairs
>>>> returns a concatenation of records that were read in for each range in
>>>> the
>>>> which argument. Instead one might expect to obtain the union of records
>>>> that overlap any of the ranges included in which. Do you think it would
>>>> be
>>>> helpful to issue a warning in case records are read in multiple times?
>>>>
>>>>
>>>> Best wishes,
>>>>
>>>>
>>>> Leonard
>>>>
>>>>
>>>>
>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at gene.com
>>>>
>>>>> wrote:
>>>>>
>>>>
>>>>   Hi Herve,
>>>>>
>>>>>
>>>>> I ran into some issues when using readGAlignmentPairs with the
>>>>> ScanBamParam 'which' argument. I included an example below. I can see
>>>>> how
>>>>> changing 'which' can result in different pairings but was wondering
>>>>>
>>>> whether
>>>>
>>>>> the error can be avoided and problematic pairings dropped instead.
>>>>>
>>>>>
>>>>> It looks like the issue is introduced in revision 77808 (no error in
>>>>> 77791). I was hoping you have an idea what causes the problem, otherwise
>>>>>
>>>> I
>>>>
>>>>> can try to forward a test case.
>>>>>
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>>   gr
>>>>>>
>>>>> GRanges with 2 ranges and 0 metadata columns:
>>>>>         seqnames               ranges strand
>>>>>            <Rle>            <IRanges>  <Rle>
>>>>>     [1]       10 [75758072, 75758133]      *
>>>>>     [2]       10 [75802841, 75802911]      *
>>>>>     ---
>>>>>     seqlengths:
>>>>>      10
>>>>>      NA
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>>>>>
>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>>    seqnames strand :    ranges --    ranges
>>>>>       <Rle>  <Rle> : <IRanges> -- <IRanges>
>>>>> ---
>>>>> seqlengths:
>>>>>             1          2          3 ... GL000247.1 GL000248.1 GL000249.1
>>>>>     249250621  243199373  198022430 ...      36422      39786      38502
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>>>>>
>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
>>>>>       seqnames strand :               ranges --               ranges
>>>>>          <Rle>  <Rle> :            <IRanges> --            <IRanges>
>>>>> [1]       10      + : [75758115, 75802896] -- [75802892, 75830482]
>>>>> [2]       10      - : [75802908, 75830498] -- [75758111, 75802892]
>>>>> [3]       10      - : [75802908, 75830498] -- [75802878, 75830468]
>>>>> ---
>>>>> seqlengths:
>>>>>             1          2          3 ... GL000247.1 GL000248.1 GL000249.1
>>>>>     249250621  243199373  198022430 ...      36422      39786      38502
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>>>>>
>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols =
>>>>> use.mcols) :
>>>>>     findMateAlignment() returned an invalid 'mate' vector
>>>>> In addition: Warning message:
>>>>> In findMateAlignment(x) :
>>>>>       4 alignments with ambiguous pairing were dumped.
>>>>>       Use 'getDumpedAlignments()' to retrieve them from the dump
>>>>>
>>>> environment.
>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>>
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.13.39     Biostrings_2.29.16    GenomicRanges_1.13.41
>>>>> [4] XVector_0.1.1         IRanges_1.19.31       BiocGenerics_0.7.4
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6   stats4_3.0.0   zlibbioc_1.7.0
>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>           [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list