[BioC] GenomicAlignments and QNAME collision

Sat May 31 21:39:23 CEST 2014

Trimming the qname is implemented in Rsamtools 1.17.18 and should be 
available via biocLite() ~ noon tomorrow.

'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the 
BamFile class. Currently the trimming is implemented for mate pairing 
only, not just reading records from a bam.

Unique qnames aren't a problem in this sample file - just using it to 
demonstrate the output.

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="qname")

## no trimming
bf <- BamFile(fl, asMates=TRUE)
>> scanBam(bf, param=param)[[1]]$qname[1:3]
> [1] "EAS54_61:4:143:69:578"         "EAS54_61:4:143:69:578"
> [3] "EAS219_FC30151:7:51:1429:1043"

## trim prefix
qnamePrefixEnd(bf) <- "_"
>> scanBam(bf, param=param)[[1]]$qname[1:3]
> [1] "61:4:143:69:578"        "61:4:143:69:578"        "FC30151:7:51:1429:1043"

## trim prefix and suffix
qnameSuffixStart(bf) <- ":"
>> scanBam(bf, param=param)[[1]]$qname[1:3]
> [1] "61:4:143:69"       "61:4:143:69"       "FC30151:7:51:1429"

Valerie

On 05/08/14 12:17, Stefano Calza wrote:
> Yes, I say that would be easier to use than regexp
>
> Stefano
>
> On 05/08/2014 08:59 PM, Valerie Obenchain wrote:
>> Thanks for the SRA tips.
>>
>> It's starting to look like the modifications are an additional prefix
>> or suffix separated by dot or slash (possibly underscore?). Maybe
>> simply adding an option to trim the QNAME by the pre/post term
>> separated by a given character would be sufficient. This allows
>> flexibilty but prevents unwarranted QNAME mangling.
>>
>> Valerie
>>
>>
>> On 05/08/14 10:56, Stefano Calza wrote:
>>> Right this is how I got some other example. I think it would add the
>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named
>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by actual
>>> read QNAME.
>>>
>>> Stefano
>>>
>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>>>> Hi Valerie,
>>>>
>>>> You get something similar from the .sra files that you can download
>>>> from the SRA, if they are paired data. If you use the SRA toolkit to
>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and
>>>> the QNAME in each of the fastq files will be appended with a .1 for
>>>> the first pairs and a .2 for the second pairs.
>>>>
>>>> As an example:
>>>>
>>>> zcat SRR833731_1.fastq.gz | head -n 1
>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>> zcat SRR833731_2.fastq.gz | head -n 1
>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>>>> Thanks Valerie
>>>>>
>>>>> I have got this BAM files from different sources but they cannot be
>>>>> distributed.
>>>>>
>>>>> Up to now I experienced twp different 'patterns' in QNAME. One is the
>>>>> trailing value as we said (/1, /2). Another one is a leading string.
>>>>> Eg. (made up QNAME)
>>>>>
>>>>> SRR1122.12345HTR
>>>>> SRR1123.12345HTR
>>>>>
>>>>> So there must be removed SRR1122 and SRR1123)
>>>>>
>>>>> My little program actually uses a regex substitution, so the user can
>>>>> decide what pattern to edit. This second one though it seems quit
>>>>> unusual.
>>>>>
>>>>> Those with  trailing values were downloaded by TCGA (if I recall
>>>>> correctly the use a pipeline called MapSplice)
>>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Stefano
>>>>>
>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>>>> Hi Stefano,
>>>>>>
>>>>>> No, the current mate-pairing doesn't handle the trailing values. I
>>>>>> will implement this and post back when it's done.
>>>>>>
>>>>>> For reference, where did you download your bam files or what
>>>>>> application outputs QNAMEs in this format? I'd like to have some for
>>>>>> test data.
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>> Valerie
>>>>>>
>>>>>>
>>>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>>>> Hi everybody
>>>>>>>
>>>>>>>
>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end
>>>>>>> data. The
>>>>>>> problem is that readGAlignmentPairsFromBam, after setting
>>>>>>> asMates=TRUE
>>>>>>> in BamFile, returns 0 mates.
>>>>>>>
>>>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>>>
>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>>>
>>>>>>> that is the two mates have /1 or /2 at the end.
>>>>>>>
>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes
>>>>>>> still
>>>>>>> quite a substantial amount of time on big files.
>>>>>>>
>>>>>>> Does the mating algorithm allow for this? If so how?
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Stefano
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>