[BioC] GenomicAlignments and QNAME collision

Tue Jun 3 12:43:55 CEST 2014

Valerie,

just a little note. In case the character in qnameSuffixEnd does not 
exist in QNAME (i.e. you specify the wrong character...) scanBam (I 
tested that) hangs forever.

I guess there should be a test on the very first read: if the character 
is not in the string then report error...

Stefano

On 06/02/2014 05:34 PM, Valerie Obenchain wrote:
> You're welcome. Let us know if you run into other problems.
>
> Valerie
>
>
> On 06/01/2014 10:01 AM, Stefano Calza wrote:
> > Great!
> >
> > I would like to thank Valerie and all Bioconductor developers.
> >
> > Stefano
> >
> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto:
>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be
>> available via biocLite() ~ noon tomorrow.
>>
>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the
>> BamFile class. Currently the trimming is implemented for mate pairing
>> only, not just reading records from a bam.
>>
>> Unique qnames aren't a problem in this sample file - just using it to
>> demonstrate the output.
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>> param <- ScanBamParam(what="qname")
>>
>> ## no trimming
>> bf <- BamFile(fl, asMates=TRUE)
>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>> [1] "EAS54_61:4:143:69:578"         "EAS54_61:4:143:69:578"
>>> [3] "EAS219_FC30151:7:51:1429:1043"
>>
>> ## trim prefix
>> qnamePrefixEnd(bf) <- "_"
>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>> [1] "61:4:143:69:578"        "61:4:143:69:578"
>>> "FC30151:7:51:1429:1043"
>>
>> ## trim prefix and suffix
>> qnameSuffixStart(bf) <- ":"
>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>> [1] "61:4:143:69"       "61:4:143:69" "FC30151:7:51:1429"
>>
>>
>> Valerie
>>
>>
>> On 05/08/14 12:17, Stefano Calza wrote:
>>> Yes, I say that would be easier to use than regexp
>>>
>>> Stefano
>>>
>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote:
>>>> Thanks for the SRA tips.
>>>>
>>>> It's starting to look like the modifications are an additional prefix
>>>> or suffix separated by dot or slash (possibly underscore?). Maybe
>>>> simply adding an option to trim the QNAME by the pre/post term
>>>> separated by a given character would be sufficient. This allows
>>>> flexibilty but prevents unwarranted QNAME mangling.
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 05/08/14 10:56, Stefano Calza wrote:
>>>>> Right this is how I got some other example. I think it would add the
>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are 
>>>>> named
>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by 
>>>>> actual
>>>>> read QNAME.
>>>>>
>>>>> Stefano
>>>>>
>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>>>>>> Hi Valerie,
>>>>>>
>>>>>> You get something similar from the .sra files that you can download
>>>>>> from the SRA, if they are paired data. If you use the SRA toolkit to
>>>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and
>>>>>> the QNAME in each of the fastq files will be appended with a .1 for
>>>>>> the first pairs and a .2 for the second pairs.
>>>>>>
>>>>>> As an example:
>>>>>>
>>>>>> zcat SRR833731_1.fastq.gz | head -n 1
>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>> zcat SRR833731_2.fastq.gz | head -n 1
>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>>>>>> Thanks Valerie
>>>>>>>
>>>>>>> I have got this BAM files from different sources but they cannot be
>>>>>>> distributed.
>>>>>>>
>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One 
>>>>>>> is the
>>>>>>> trailing value as we said (/1, /2). Another one is a leading 
>>>>>>> string.
>>>>>>> Eg. (made up QNAME)
>>>>>>>
>>>>>>> SRR1122.12345HTR
>>>>>>> SRR1123.12345HTR
>>>>>>>
>>>>>>> So there must be removed SRR1122 and SRR1123)
>>>>>>>
>>>>>>> My little program actually uses a regex substitution, so the 
>>>>>>> user can
>>>>>>> decide what pattern to edit. This second one though it seems quit
>>>>>>> unusual.
>>>>>>>
>>>>>>> Those with  trailing values were downloaded by TCGA (if I recall
>>>>>>> correctly the use a pipeline called MapSplice)
>>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Stefano
>>>>>>>
>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>>>>>> Hi Stefano,
>>>>>>>>
>>>>>>>> No, the current mate-pairing doesn't handle the trailing values. I
>>>>>>>> will implement this and post back when it's done.
>>>>>>>>
>>>>>>>> For reference, where did you download your bam files or what
>>>>>>>> application outputs QNAMEs in this format? I'd like to have 
>>>>>>>> some for
>>>>>>>> test data.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Valerie
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>>>>>> Hi everybody
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end
>>>>>>>>> data. The
>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting
>>>>>>>>> asMates=TRUE
>>>>>>>>> in BamFile, returns 0 mates.
>>>>>>>>>
>>>>>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>>>>>
>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>>>>>
>>>>>>>>> that is the two mates have /1 or /2 at the end.
>>>>>>>>>
>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes
>>>>>>>>> still
>>>>>>>>> quite a substantial amount of time on big files.
>>>>>>>>>
>>>>>>>>> Does the mating algorithm allow for this? If so how?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>> -- 
>>>>>> James W. MacDonald, M.S.
>>>>>> Biostatistician
>>>>>> University of Washington
>>>>>> Environmental and Occupational Health Sciences
>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>> Seattle WA 98105-6099
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>