[BioC] A question about the function readGAlignmentPairs in GenomicRnages package

Valerie Obenchain vobencha at fhcrc.org
Wed Apr 2 17:22:33 CEST 2014


Hi,

On 04/01/14 22:27, Hervé Pagès wrote:
> Hi Liang,
>
> On 04/01/2014 12:12 PM, Niu, Liang (NIH/NIEHS) [E] wrote:
>> Dear Herve,
>>
>> I found the reason for the mate_status error: it is because that  the
>> read 1 in each pair were reverse complemented before the alignment.
>>
>> Now I have a question: when I use readGAlignmentsList() to read in the
>> pairs, how can I preserve the order of the two reads, i.e., keep read
>> 1 as the first alignment and  read 2 as the second alignment in a
>> element (with two alignments) of the alignments list? is it the
>> default order of each element of the list?
>
> Yes. Even though I cannot find this information in the man page
> for readGAlignmentsList(), my understanding is that the 1st mate
> will always be 1st in the GAlignmentsList list elements that have
> mate_status set to "mated", and that the 2nd mate will always be
> 2nd.
>
> Val please correct me if I'm wrong.

You are correct about the ordering of the pairs. (fyi, this occurs in 
add_to_complete() in Template.h)

I've updated the man page in GenomicAlignments 0.99.37.

Val

>
>>
>> Also, if I use grglist() to convert the above GAlignmentsList into a
>> GRangesList, do I need to set the order.as.in.query=TRUE for the above
>> purpose?
>
> I didn't know the "grglist" method for GAlignmentsList objects supported
> this argument (and it doesn't seem to be documented). But yes, if you
> care about the order of the ranges within each list element of the
> returned GRangesList, you should use 'order.as.in.query=TRUE'. At least
> that's what you should do on a GAlignmentPairs object. I'm just assuming
> it's the same for a GAlignmentsList object but I don't know how this
> translates for list elements with mate_status other than "mated" (you
> will need to check if that matters for your case).
> However, note that if you're going to find or count overlaps (with
> findOverlaps(), countOverlaps(), or summarizeOverlaps()), how you set
> 'order.as.in.query' is irrelevant. The only situation I know where this
> matters is when you want to compute the "overlap encodings" (with
> encodeOverlaps()), which is actually what motivated the addition of the
> 'order.as.in.query' arg.
>
> HTH,
> H.
>
>>
>> Best,
>> Liang
>>
>>
>> ________________________________________
>> From: Hervé Pagès [hpages at fhcrc.org]
>> Sent: Sunday, March 30, 2014 4:45 PM
>> To: Niu, Liang (NIH/NIEHS) [E]; bioconductor at r-project.org
>> Subject: Re: A question about the function readGAlignmentPairs in
>> GenomicRnages package
>>
>> On 03/30/2014 01:31 PM, Niu, Liang (NIH/NIEHS) [E] wrote:
>>> Dear Herve,
>>>
>>> Here is the code and output:
>>>
>>>> bam.path<-"test.bam"
>>>> bam.file<-BamFile(bam.path,asMates=TRUE)
>>>>
>>>> param<-ScanBamParam(flag=scanBamFlag(isPaired=TRUE,hasUnmappedMate=FALSE,isUnmappedQuery=FALSE,isNotPassingQualityControls=FALSE,isDuplicate=FALSE,isNotPrimaryRead=FALSE))
>>>>
>>>> alignment<-readGAlignmentsList(bam.file,param=param)
>>>> length(alignment)
>>> [1] 18041910
>>>> table(mcols(unlist(alignment,
>>>> use.names=FALSE))$mates[end(PartitioningByEnd(alignment))])
>>>
>>>      FALSE
>>> 18041910
>>>
>>> What is the problem?
>>
>> Whaoo. No idea :-/ I'm on week-end right now and have to turn off
>> the computer due to other obligations, sorry. I'll try to get to this
>> later. In the mean time it would be great if you could install R-3.1.0
>> + BioC 2.14 (you will need the GenomicAlignments package) and try
>> this again.
>>
>> Thanks,
>> H.
>>
>>>
>>> Thanks!
>>>
>>> Liang
>>>
>>> ________________________________________
>>> From: Hervé Pagès [hpages at fhcrc.org]
>>> Sent: Sunday, March 30, 2014 4:13 PM
>>> To: Niu, Liang (NIH/NIEHS) [E]; bioconductor at r-project.org
>>> Subject: Re: A question about the function readGAlignmentPairs in
>>> GenomicRnages package
>>>
>>> Dear Liang,
>>>
>>> Please keep the communication on the mailing list (by using
>>> the "Reply All" button).
>>>
>>> On 03/30/2014 12:13 PM, Niu, Liang (NIH/NIEHS) [E] wrote:
>>>> Dear Herve,
>>>>
>>>> I use the following command to read in the alignments:
>>>>
>>>> bam.path<-"test.bam"
>>>> bam.file<-BamFile(bam.path, yieldSize=1E6, asMates=TRUE)
>>>>
>>>> param<-ScanBamParam(flag=scanBamFlag(isPaired=TRUE,hasUnmappedMate=FALSE,isUnmappedQuery=FALSE,isNotPassingQualityControls=FALSE,isDuplicate=FALSE,isNotPrimaryRead=FALSE))
>>>>
>>>> alignment<-readGAlignmentsList(bam.file,param=param)
>>>>
>>>> However, each element ((a pair of alignments)) of the
>>>> GAlignmentsList "alignment"  has meta field Mates as "FALSE" for
>>>> both reads. Is it a problem?
>>>
>>> I'm surprised that *each* list element in your GAlignmentsList object
>>> 'alignment' looks like this. How do you know? Why not show us the
>>> object?
>>>
>>> My understanding is that even though your BAM file contains paired-end
>>> reads, when you read it with readGAlignmentsList(), you end up with a
>>> GAlignmentsList object where some fraction of the list elements have
>>> the "mates" field set to FALSE. The reads in such list element have the
>>> same QNAME but are not mates. Note that all list elements with the
>>> "mates" field set to TRUE are guaranteed to contain exactly 2 reads.
>>> But there is no such expectation for list elements with the "mates"
>>> field set to FALSE: they can contain 1, 2, 3, or more reads.
>>>
>>> So if *each* list element in your GAlignmentsList object
>>> 'alignment' really has the "mates" field set to FALSE, that
>>> means none of the reads in your file could be mated. That could
>>> actually happen if your data was not paired-end and if you didn't
>>> use isPaired=TRUE but that doesn't seem to be the case here.
>>>
>>> Here is how you can summarize how many list elements have the
>>> "mates" field set to FALSE and how many have it set to TRUE:
>>>
>>>      table(mcols(unlist(alignment,
>>> use.names=FALSE))$mates[end(PartitioningByEnd(alignment))])
>>>
>>> Finally note that in BioC devel, the "mates" field has been replaced
>>> by the "mate_status" field and that this field is now an attribute of
>>> the list elements rather than of the individual reads. This means that
>>> it's a top-level metadata column (i.e. directly accessible with
>>> mcols(alignment)$mate_status) instead of an inner metadata column
>>> (which are more complicated to access). So it's much easier now to get
>>> the above count:
>>>
>>>      table(mcols(alignment)$mate_status)
>>>
>>> This will give you something like:
>>>
>>>      > table(mcols(galist1)$mate_status)
>>>
>>>          mated ambiguous   unmated
>>>          75346         0     21286
>>>
>>> As you can see, another change is that this field is now a 3-level
>>> factor (levels are explained in ?readGAlignmentsList).
>>>
>>> Hope this helps,
>>> H.
>>>
>>>
>>>>
>>>> Liang
>>>>
>>>>
>>>> ________________________________________
>>>> From: Hervé Pagès [hpages at fhcrc.org]
>>>> Sent: Sunday, March 30, 2014 2:54 PM
>>>> To: Niu, Liang (NIH/NIEHS) [E]; bioconductor at r-project.org
>>>> Subject: Re: A question about the function readGAlignmentPairs in
>>>> GenomicRnages package
>>>>
>>>> On 03/30/2014 11:52 AM, Niu, Liang (NIH/NIEHS) [E] wrote:
>>>>> Sorry for the typo in the previous email. What I mean is "I do NEED
>>>>> (not NOT) the pairing information".
>>>>
>>>> I see. So yes, readGAlignmentsList() should do it.
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>>
>>>>> Liang
>>>>> ________________________________________
>>>>> From: Hervé Pagès [hpages at fhcrc.org]
>>>>> Sent: Sunday, March 30, 2014 2:51 PM
>>>>> To: Niu, Liang (NIH/NIEHS) [E]; bioconductor at r-project.org
>>>>> Subject: Re: A question about the function readGAlignmentPairs in
>>>>> GenomicRnages package
>>>>>
>>>>> On 03/30/2014 11:31 AM, Niu, Liang (NIH/NIEHS) [E] wrote:
>>>>>> Dear Herve,
>>>>>>
>>>>>> Thanks for your suggestion. I do not the paring information in the
>>>>>> downstream analysis, therefore, readGAlignmentsList() is good.
>>>>>
>>>>> Maybe I was not clear enough but if you don't need the pairing, then
>>>>> you can just use readGAlignments(). readGAlignmentsList() will pair
>>>>> everything that can be paired and this has a cost (in terms of
>>>>> performance and memory usage). By using readGAlignments() you avoid
>>>>> that cost and you also end up with an object that is a little bit
>>>>> simpler to manipulate (i.e. a GAlignments instead of GAlignmentsList
>>>>> object).
>>>>>
>>>>> Hope that makes sense,
>>>>> H.
>>>>>
>>>>>>
>>>>>> Liang
>>>>>> ________________________________________
>>>>>> From: Hervé Pagès [hpages at fhcrc.org]
>>>>>> Sent: Sunday, March 30, 2014 2:12 PM
>>>>>> To: Niu, Liang (NIH/NIEHS) [E]
>>>>>> Cc: bioconductor at r-project.org
>>>>>> Subject: Re: A question about the function readGAlignmentPairs in
>>>>>> GenomicRnages package
>>>>>>
>>>>>> Hi Liang,
>>>>>>
>>>>>> Hope you don't mind that I'm cc'ing the Bioconductor mailing list, so
>>>>>> other can give suggestions.
>>>>>>
>>>>>> On 03/30/2014 09:32 AM, Niu, Liang (NIH/NIEHS) [E] wrote:
>>>>>>> Dear Mr. Pages,
>>>>>>>
>>>>>>> This is Liang Niu, a research fellow at the National Institute of
>>>>>>> Environmental Health Sciences.
>>>>>>>
>>>>>>> I am using R to read .bam files for chromatin interaction data
>>>>>>> sets. Such data sets contains alignments for  paired-end reads
>>>>>>> from ChIA-PET experiment, thus it has pairs in which two reads on
>>>>>>> different chromosomes and/or on same strand, and those pairs are
>>>>>>> valid pairs. I want to use your function readGAlignmentPairs in
>>>>>>> GenomicRnages package to read  the pairs, but the manual says
>>>>>>> that it will discard those pairs. Do you have any suggestion?
>>>>>>
>>>>>> You could use readGAlignmentsList() instead of readGAlignmentPairs().
>>>>>> readGAlignmentsList() will keep these "discordant pairs". It will
>>>>>> also
>>>>>> keep the reads that cannot be paired.
>>>>>> Note that, depending on what you will do downstream, you don't
>>>>>> necessarily need to pair the reads (e.g. if you're going to compute
>>>>>> the read coverage). In that case you can just load the reads with
>>>>>> readGAlignments().
>>>>>>
>>>>>> Cheers,
>>>>>> H.
>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Liang
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fhcrc.org
>>>>>> Phone:  (206) 667-5791
>>>>>> Fax:    (206) 667-1319
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list