[BioC] GenomicAlignments and QNAME collision

Valerie Obenchain vobencha at fhcrc.org
Thu May 8 20:59:49 CEST 2014


Thanks for the SRA tips.

It's starting to look like the modifications are an additional prefix or 
suffix separated by dot or slash (possibly underscore?). Maybe simply 
adding an option to trim the QNAME by the pre/post term separated by a 
given character would be sufficient. This allows flexibilty but prevents 
unwarranted QNAME mangling.

Valerie


On 05/08/14 10:56, Stefano Calza wrote:
> Right this is how I got some other example. I think it would add the
> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named
> STARTING with SRR1234 or SRR1235 for the two mates, followed by actual
> read QNAME.
>
> Stefano
>
> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>> Hi Valerie,
>>
>> You get something similar from the .sra files that you can download
>> from the SRA, if they are paired data. If you use the SRA toolkit to
>> convert to fastq (fastq-dump), it will spit out two fastq files, and
>> the QNAME in each of the fastq files will be appended with a .1 for
>> the first pairs and a .2 for the second pairs.
>>
>> As an example:
>>
>> zcat SRR833731_1.fastq.gz | head -n 1
>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>> zcat SRR833731_2.fastq.gz | head -n 1
>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>> Thanks Valerie
>>>
>>> I have got this BAM files from different sources but they cannot be
>>> distributed.
>>>
>>> Up to now I experienced twp different 'patterns' in QNAME. One is the
>>> trailing value as we said (/1, /2). Another one is a leading string.
>>> Eg. (made up QNAME)
>>>
>>> SRR1122.12345HTR
>>> SRR1123.12345HTR
>>>
>>> So there must be removed SRR1122 and SRR1123)
>>>
>>> My little program actually uses a regex substitution, so the user can
>>> decide what pattern to edit. This second one though it seems quit
>>> unusual.
>>>
>>> Those with  trailing values were downloaded by TCGA (if I recall
>>> correctly the use a pipeline called MapSplice)
>>>
>>>
>>> Regards
>>>
>>> Stefano
>>>
>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>> Hi Stefano,
>>>>
>>>> No, the current mate-pairing doesn't handle the trailing values. I
>>>> will implement this and post back when it's done.
>>>>
>>>> For reference, where did you download your bam files or what
>>>> application outputs QNAMEs in this format? I'd like to have some for
>>>> test data.
>>>>
>>>>
>>>> Thanks.
>>>> Valerie
>>>>
>>>>
>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>> Hi everybody
>>>>>
>>>>>
>>>>> I am using GenomicAlignments package to read RNAseq pair-end data. The
>>>>> problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE
>>>>> in BamFile, returns 0 mates.
>>>>>
>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>
>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>
>>>>> that is the two mates have /1 or /2 at the end.
>>>>>
>>>>> I wrote a Python (and a cpp) program to fix it...but this takes still
>>>>> quite a substantial amount of time on big files.
>>>>>
>>>>> Does the mating algorithm allow for this? If so how?
>>>>>
>>>>> Regards
>>>>>
>>>>> Stefano
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list