[BioC] GenomicAlignments and QNAME collision

Tue Jun 3 17:47:58 CEST 2014

I can't reproduce this problem.

When a prefix/suffix is not found in the qname no error is thrown - it 
just doesn't do any trimming. I've clarified this in the documentation 
in 1.17.22.

With Rsamtools 1.17.21:

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

## original qnames:
bfm <- BamFile(fl, asMates=TRUE)
scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"

## bogus prefix:
bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*")
scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"

## bogus suffix:
bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*")
scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"

Can you provide a reproducible example? Maybe you were calling scanBam() 
differently?

Valerie

On 06/03/2014 03:43 AM, Stefano Calza wrote:
> Valerie,
>
> just a little note. In case the character in qnameSuffixEnd does not
> exist in QNAME (i.e. you specify the wrong character...) scanBam (I
> tested that) hangs forever.
>
> I guess there should be a test on the very first read: if the character
> is not in the string then report error...
>
>
> Stefano
>
> On 06/02/2014 05:34 PM, Valerie Obenchain wrote:
>> You're welcome. Let us know if you run into other problems.
>>
>> Valerie
>>
>>
>> On 06/01/2014 10:01 AM, Stefano Calza wrote:
>> > Great!
>> >
>> > I would like to thank Valerie and all Bioconductor developers.
>> >
>> > Stefano
>> >
>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto:
>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be
>>> available via biocLite() ~ noon tomorrow.
>>>
>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the
>>> BamFile class. Currently the trimming is implemented for mate pairing
>>> only, not just reading records from a bam.
>>>
>>> Unique qnames aren't a problem in this sample file - just using it to
>>> demonstrate the output.
>>>
>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>> param <- ScanBamParam(what="qname")
>>>
>>> ## no trimming
>>> bf <- BamFile(fl, asMates=TRUE)
>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>> [1] "EAS54_61:4:143:69:578"         "EAS54_61:4:143:69:578"
>>>> [3] "EAS219_FC30151:7:51:1429:1043"
>>>
>>> ## trim prefix
>>> qnamePrefixEnd(bf) <- "_"
>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>> [1] "61:4:143:69:578"        "61:4:143:69:578"
>>>> "FC30151:7:51:1429:1043"
>>>
>>> ## trim prefix and suffix
>>> qnameSuffixStart(bf) <- ":"
>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>> [1] "61:4:143:69"       "61:4:143:69" "FC30151:7:51:1429"
>>>
>>>
>>> Valerie
>>>
>>>
>>> On 05/08/14 12:17, Stefano Calza wrote:
>>>> Yes, I say that would be easier to use than regexp
>>>>
>>>> Stefano
>>>>
>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote:
>>>>> Thanks for the SRA tips.
>>>>>
>>>>> It's starting to look like the modifications are an additional prefix
>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe
>>>>> simply adding an option to trim the QNAME by the pre/post term
>>>>> separated by a given character would be sufficient. This allows
>>>>> flexibilty but prevents unwarranted QNAME mangling.
>>>>>
>>>>> Valerie
>>>>>
>>>>>
>>>>> On 05/08/14 10:56, Stefano Calza wrote:
>>>>>> Right this is how I got some other example. I think it would add the
>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are
>>>>>> named
>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by
>>>>>> actual
>>>>>> read QNAME.
>>>>>>
>>>>>> Stefano
>>>>>>
>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>>>>>>> Hi Valerie,
>>>>>>>
>>>>>>> You get something similar from the .sra files that you can download
>>>>>>> from the SRA, if they are paired data. If you use the SRA toolkit to
>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and
>>>>>>> the QNAME in each of the fastq files will be appended with a .1 for
>>>>>>> the first pairs and a .2 for the second pairs.
>>>>>>>
>>>>>>> As an example:
>>>>>>>
>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1
>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1
>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>>>>>>> Thanks Valerie
>>>>>>>>
>>>>>>>> I have got this BAM files from different sources but they cannot be
>>>>>>>> distributed.
>>>>>>>>
>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One
>>>>>>>> is the
>>>>>>>> trailing value as we said (/1, /2). Another one is a leading
>>>>>>>> string.
>>>>>>>> Eg. (made up QNAME)
>>>>>>>>
>>>>>>>> SRR1122.12345HTR
>>>>>>>> SRR1123.12345HTR
>>>>>>>>
>>>>>>>> So there must be removed SRR1122 and SRR1123)
>>>>>>>>
>>>>>>>> My little program actually uses a regex substitution, so the
>>>>>>>> user can
>>>>>>>> decide what pattern to edit. This second one though it seems quit
>>>>>>>> unusual.
>>>>>>>>
>>>>>>>> Those with  trailing values were downloaded by TCGA (if I recall
>>>>>>>> correctly the use a pipeline called MapSplice)
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>>>>>>> Hi Stefano,
>>>>>>>>>
>>>>>>>>> No, the current mate-pairing doesn't handle the trailing values. I
>>>>>>>>> will implement this and post back when it's done.
>>>>>>>>>
>>>>>>>>> For reference, where did you download your bam files or what
>>>>>>>>> application outputs QNAMEs in this format? I'd like to have
>>>>>>>>> some for
>>>>>>>>> test data.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> Valerie
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>>>>>>> Hi everybody
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end
>>>>>>>>>> data. The
>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting
>>>>>>>>>> asMates=TRUE
>>>>>>>>>> in BamFile, returns 0 mates.
>>>>>>>>>>
>>>>>>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>>>>>>
>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>>>>>>
>>>>>>>>>> that is the two mates have /1 or /2 at the end.
>>>>>>>>>>
>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes
>>>>>>>>>> still
>>>>>>>>>> quite a substantial amount of time on big files.
>>>>>>>>>>
>>>>>>>>>> Does the mating algorithm allow for this? If so how?
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>> Search the archives:
>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>> --
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> University of Washington
>>>>>>> Environmental and Occupational Health Sciences
>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>> Seattle WA 98105-6099
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>

-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158