[BioC] GenomicAlignments and QNAME collision

Valerie Obenchain vobencha at fhcrc.org
Fri Jun 6 04:10:32 CEST 2014


When I use double escape the code doesn't hang it just doesn't trim.

 > fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
 > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="\\")
 > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
[1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"


Using escape with any (?) other character gives an error:

 > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="\_")
Error: '\_' is an unrecognized escape in character string starting ""\_"


I'm not sure why the code is hanging for you with the double escape. If 
you have a reproducible example you can share I can take a look.

Valerie


On 06/04/2014 12:11 AM, Stefano Calza wrote:
> Dear Valerie
>
> it might be the string I used: qnameSuffixStart="\\" (while it should
> have been "/")
>
> Being the escape character I guess it has been messing up the code.
>
> Stefano
>
> On 06/03/2014 05:47 PM, Valerie Obenchain wrote:
>> I can't reproduce this problem.
>>
>> When a prefix/suffix is not found in the qname no error is thrown - it
>> just doesn't do any trimming. I've clarified this in the documentation
>> in 1.17.22.
>>
>> With Rsamtools 1.17.21:
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>
>> ## original qnames:
>> bfm <- BamFile(fl, asMates=TRUE)
>> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>>
>> ## bogus prefix:
>> bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*")
>> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>>
>> ## bogus suffix:
>> bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*")
>> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>>
>> Can you provide a reproducible example? Maybe you were calling
>> scanBam() differently?
>>
>> Valerie
>>
>> On 06/03/2014 03:43 AM, Stefano Calza wrote:
>>> Valerie,
>>>
>>> just a little note. In case the character in qnameSuffixEnd does not
>>> exist in QNAME (i.e. you specify the wrong character...) scanBam (I
>>> tested that) hangs forever.
>>>
>>> I guess there should be a test on the very first read: if the character
>>> is not in the string then report error...
>>>
>>>
>>> Stefano
>>>
>>> On 06/02/2014 05:34 PM, Valerie Obenchain wrote:
>>>> You're welcome. Let us know if you run into other problems.
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 06/01/2014 10:01 AM, Stefano Calza wrote:
>>>> > Great!
>>>> >
>>>> > I would like to thank Valerie and all Bioconductor developers.
>>>> >
>>>> > Stefano
>>>> >
>>>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto:
>>>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be
>>>>> available via biocLite() ~ noon tomorrow.
>>>>>
>>>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the
>>>>> BamFile class. Currently the trimming is implemented for mate pairing
>>>>> only, not just reading records from a bam.
>>>>>
>>>>> Unique qnames aren't a problem in this sample file - just using it to
>>>>> demonstrate the output.
>>>>>
>>>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>>>> param <- ScanBamParam(what="qname")
>>>>>
>>>>> ## no trimming
>>>>> bf <- BamFile(fl, asMates=TRUE)
>>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>>>>>> [3] "EAS219_FC30151:7:51:1429:1043"
>>>>>
>>>>> ## trim prefix
>>>>> qnamePrefixEnd(bf) <- "_"
>>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>>>> [1] "61:4:143:69:578"        "61:4:143:69:578"
>>>>>> "FC30151:7:51:1429:1043"
>>>>>
>>>>> ## trim prefix and suffix
>>>>> qnameSuffixStart(bf) <- ":"
>>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>>>> [1] "61:4:143:69"       "61:4:143:69" "FC30151:7:51:1429"
>>>>>
>>>>>
>>>>> Valerie
>>>>>
>>>>>
>>>>> On 05/08/14 12:17, Stefano Calza wrote:
>>>>>> Yes, I say that would be easier to use than regexp
>>>>>>
>>>>>> Stefano
>>>>>>
>>>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote:
>>>>>>> Thanks for the SRA tips.
>>>>>>>
>>>>>>> It's starting to look like the modifications are an additional
>>>>>>> prefix
>>>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe
>>>>>>> simply adding an option to trim the QNAME by the pre/post term
>>>>>>> separated by a given character would be sufficient. This allows
>>>>>>> flexibilty but prevents unwarranted QNAME mangling.
>>>>>>>
>>>>>>> Valerie
>>>>>>>
>>>>>>>
>>>>>>> On 05/08/14 10:56, Stefano Calza wrote:
>>>>>>>> Right this is how I got some other example. I think it would add
>>>>>>>> the
>>>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are
>>>>>>>> named
>>>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by
>>>>>>>> actual
>>>>>>>> read QNAME.
>>>>>>>>
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>>>>>>>>> Hi Valerie,
>>>>>>>>>
>>>>>>>>> You get something similar from the .sra files that you can
>>>>>>>>> download
>>>>>>>>> from the SRA, if they are paired data. If you use the SRA
>>>>>>>>> toolkit to
>>>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq
>>>>>>>>> files, and
>>>>>>>>> the QNAME in each of the fastq files will be appended with a .1
>>>>>>>>> for
>>>>>>>>> the first pairs and a .2 for the second pairs.
>>>>>>>>>
>>>>>>>>> As an example:
>>>>>>>>>
>>>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1
>>>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1
>>>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jim
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>>>>>>>>> Thanks Valerie
>>>>>>>>>>
>>>>>>>>>> I have got this BAM files from different sources but they
>>>>>>>>>> cannot be
>>>>>>>>>> distributed.
>>>>>>>>>>
>>>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One
>>>>>>>>>> is the
>>>>>>>>>> trailing value as we said (/1, /2). Another one is a leading
>>>>>>>>>> string.
>>>>>>>>>> Eg. (made up QNAME)
>>>>>>>>>>
>>>>>>>>>> SRR1122.12345HTR
>>>>>>>>>> SRR1123.12345HTR
>>>>>>>>>>
>>>>>>>>>> So there must be removed SRR1122 and SRR1123)
>>>>>>>>>>
>>>>>>>>>> My little program actually uses a regex substitution, so the
>>>>>>>>>> user can
>>>>>>>>>> decide what pattern to edit. This second one though it seems quit
>>>>>>>>>> unusual.
>>>>>>>>>>
>>>>>>>>>> Those with  trailing values were downloaded by TCGA (if I recall
>>>>>>>>>> correctly the use a pipeline called MapSplice)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>
>>>>>>>>>>> No, the current mate-pairing doesn't handle the trailing
>>>>>>>>>>> values. I
>>>>>>>>>>> will implement this and post back when it's done.
>>>>>>>>>>>
>>>>>>>>>>> For reference, where did you download your bam files or what
>>>>>>>>>>> application outputs QNAMEs in this format? I'd like to have
>>>>>>>>>>> some for
>>>>>>>>>>> test data.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Valerie
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>>>>>>>>> Hi everybody
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end
>>>>>>>>>>>> data. The
>>>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting
>>>>>>>>>>>> asMates=TRUE
>>>>>>>>>>>> in BamFile, returns 0 mates.
>>>>>>>>>>>>
>>>>>>>>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>>>>>>>>
>>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>>>>>>>>
>>>>>>>>>>>> that is the two mates have /1 or /2 at the end.
>>>>>>>>>>>>
>>>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes
>>>>>>>>>>>> still
>>>>>>>>>>>> quite a substantial amount of time on big files.
>>>>>>>>>>>>
>>>>>>>>>>>> Does the mating algorithm allow for this? If so how?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>>
>>>>>>>>>>>> Stefano
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>>> Search the archives:
>>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>> Search the archives:
>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>> Biostatistician
>>>>>>>>> University of Washington
>>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>>>> Seattle WA 98105-6099
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>
>>
>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list