[BioC] GenomicAlignments and QNAME collision

Thu May 8 21:17:14 CEST 2014

Yes, I'd say that would be easier to use than a regexp.

Stefano

On 05/08/2014 08:59 PM, Valerie Obenchain wrote:
> Thanks for the SRA tips.
>
> It's starting to look like the modifications are an additional prefix 
> or suffix separated by dot or slash (possibly underscore?). Maybe 
> simply adding an option to trim the QNAME by the pre/post term 
> separated by a given character would be sufficient. This allows 
> flexibility but prevents unwarranted QNAME mangling.
>
> Valerie
>
>
> On 05/08/14 10:56, Stefano Calza wrote:
>> Right, this is how I got my other example. I think it adds the file
>> names: from 2 SRA files (SRR1234 & SRR1235) my reads are named
>> STARTING with SRR1234 or SRR1235 for the two mates, followed by the
>> actual read QNAME.
>>
>> Stefano
>>
>> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>>> Hi Valerie,
>>>
>>> You get something similar from the .sra files that you can download
>>> from the SRA, if they are paired data. If you use the SRA toolkit to
>>> convert to fastq (fastq-dump), it will spit out two fastq files, and
>>> the QNAME in each of the fastq files will have a .1 appended for
>>> the first mates and a .2 for the second mates.
>>>
>>> As an example:
>>>
>>> zcat SRR833731_1.fastq.gz | head -n 1
>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>> zcat SRR833731_2.fastq.gz | head -n 1
>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>>> Thanks Valerie
>>>>
>>>> I have got these BAM files from different sources but they cannot be
>>>> distributed.
>>>>
>>>> Up to now I have seen two different 'patterns' in QNAME. One is the
>>>> trailing value as we said (/1, /2). Another one is a leading string.
>>>> Eg. (made up QNAME)
>>>>
>>>> SRR1122.12345HTR
>>>> SRR1123.12345HTR
>>>>
>>>> (So SRR1122 and SRR1123 must be removed.)
>>>>
>>>> My little program actually uses a regex substitution, so the user can
>>>> decide what pattern to edit. This second one, though, seems quite
>>>> unusual.
>>>>
>>>> Those with trailing values were downloaded from TCGA (if I recall
>>>> correctly they use a pipeline called MapSplice).
>>>>
>>>>
>>>> Regards
>>>>
>>>> Stefano
>>>>
>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>>> Hi Stefano,
>>>>>
>>>>> No, the current mate-pairing doesn't handle the trailing values. I
>>>>> will implement this and post back when it's done.
>>>>>
>>>>> For reference, where did you download your bam files or what
>>>>> application outputs QNAMEs in this format? I'd like to have some for
>>>>> test data.
>>>>>
>>>>>
>>>>> Thanks.
>>>>> Valerie
>>>>>
>>>>>
>>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>>> Hi everybody
>>>>>>
>>>>>>
>>>>>> I am using the GenomicAlignments package to read RNA-seq paired-end
>>>>>> data. The problem is that readGAlignmentPairsFromBam, after setting
>>>>>> asMates=TRUE in BamFile, returns 0 mates.
>>>>>>
>>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>>
>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>>
>>>>>> that is the two mates have /1 or /2 at the end.
>>>>>>
>>>>>> I wrote a Python (and a cpp) program to fix it, but this still takes
>>>>>> quite a substantial amount of time on big files.
>>>>>>
>>>>>> Does the mating algorithm allow for this? If so how?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Stefano
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>
>>>
>>> -- 
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>
>
>
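
To illustrate the trimming option discussed in this thread (drop the term
before or after a given separator character, rather than applying a general
regexp), here is a minimal Python sketch. It is not the GenomicAlignments
implementation and not Stefano's program; the function names trim_qname and
group_mates, and the parameters sep and side, are hypothetical and chosen
only for this example.

```python
def trim_qname(qname, sep="/", side="suffix"):
    """Drop the term after (suffix) or before (prefix) a separator.

    Hypothetical helper for illustration only; it is not part of
    GenomicAlignments or Rsamtools. A QNAME that does not contain
    the separator is returned unchanged.
    """
    if side == "suffix":
        # Split at the LAST separator and keep what precedes it.
        head, found, _tail = qname.rpartition(sep)
        return head if found else qname
    if side == "prefix":
        # Split at the FIRST separator and keep what follows it.
        _head, found, tail = qname.partition(sep)
        return tail if found else qname
    raise ValueError("side must be 'prefix' or 'suffix'")


def group_mates(qnames, sep="/", side="suffix"):
    """Group raw QNAMEs by their trimmed name, preserving input order."""
    groups = {}
    for qname in qnames:
        groups.setdefault(trim_qname(qname, sep, side), []).append(qname)
    return groups
```

With the QNAMEs quoted in the thread, trim_qname(q, sep="/") maps both
UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 and .../2 to the same name,
and trim_qname(q, sep=".", side="prefix") maps SRR1122.12345HTR and
SRR1123.12345HTR to 12345HTR. Because only one term is cut at one chosen
separator, the rest of the QNAME is left untouched.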