[BioC] GenomicAlignments and QNAME collision

Stefano Calza stecalza at gmail.com
Thu May 8 19:56:21 CEST 2014


Right this is how I got some other example. I think it would add the 
files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named 
STARTING with SRR1234 or SRR1235 for the two mates, followed by actual 
read QNAME.

Stefano

On 05/08/2014 07:05 PM, James W. MacDonald wrote:
> Hi Valerie,
>
> You get something similar from the .sra files that you can download 
> from the SRA, if they are paired data. If you use the SRA toolkit to 
> convert to fastq (fastq-dump), it will spit out two fastq files, and 
> the QNAME in each of the fastq files will be appended with a .1 for 
> the first pairs and a .2 for the second pairs.
>
> As an example:
>
> zcat SRR833731_1.fastq.gz | head -n 1
> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
> zcat SRR833731_2.fastq.gz | head -n 1
> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>
>
> Best,
>
> Jim
>
>
>
> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>> Thanks Valerie
>>
>> I have got this BAM files from different sources but they cannot be
>> distributed.
>>
>> Up to now I experienced twp different 'patterns' in QNAME. One is the
>> trailing value as we said (/1, /2). Another one is a leading string.
>> Eg. (made up QNAME)
>>
>> SRR1122.12345HTR
>> SRR1123.12345HTR
>>
>> So there must be removed SRR1122 and SRR1123)
>>
>> My little program actually uses a regex substitution, so the user can
>> decide what pattern to edit. This second one though it seems quit
>> unusual.
>>
>> Those with  trailing values were downloaded by TCGA (if I recall
>> correctly the use a pipeline called MapSplice)
>>
>>
>> Regards
>>
>> Stefano
>>
>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>> Hi Stefano,
>>>
>>> No, the current mate-pairing doesn't handle the trailing values. I
>>> will implement this and post back when it's done.
>>>
>>> For reference, where did you download your bam files or what
>>> application outputs QNAMEs in this format? I'd like to have some for
>>> test data.
>>>
>>>
>>> Thanks.
>>> Valerie
>>>
>>>
>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>> Hi everybody
>>>>
>>>>
>>>> I am using GenomicAlignments package to read RNAseq pair-end data. The
>>>> problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE
>>>> in BamFile, returns 0 mates.
>>>>
>>>> The reason is that mates have different QNAMEs. Eg:
>>>>
>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>
>>>> that is the two mates have /1 or /2 at the end.
>>>>
>>>> I wrote a Python (and a cpp) program to fix it...but this takes still
>>>> quite a substantial amount of time on big files.
>>>>
>>>> Does the mating algorithm allow for this? If so how?
>>>>
>>>> Regards
>>>>
>>>> Stefano
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list