[BioC] writeFastq writing dashes instead of dots

Martin Morgan mtmorgan at fhcrc.org
Wed Feb 27 19:44:17 CET 2013


On 02/27/2013 03:11 AM, Thomas Rensch wrote:
> Hi Martin,
>
> I am not sure what is the best way of solving this, however it seems that dashes
> are not part of the FASTQ format (http://maq.sourceforge.net/fastq.shtml) which
> implies the writeFastq function is "wrong". If such as method was written
> ( writeFastq(..., dashesASdots=FALSE) ) I think it the default value should be
> TRUE.
>
> Again in my opinion it is a standard assumption to assume that a writeFastq
> function should always write a valid FASTQ file no?

Yes, it should read and write valid fastq, and the Bioc representation has 
obviously diverged from (this) standard.

It seems like the class for representing fastq sequences needs to be 
generalized, probably to BStringSet from DNAStringSet, probably both in 
ShortRead and Rsamtools::scanBam. This would mean loss of some functionality 
(e.g., complement(), translate()) which don't make sense for a more general 
alphabet, and some unnecessary complexity (e.g., alphabetFrequency() would 
return frequencies of all 256 letters). It seems a shame to lose this, when it 
is clear that many fastq and bam files contain IUPAC-consistent sequences.

My temptation is to make the input type user-selectable, c("DNAStringSet", 
"BStringSet") with the former the default. Your '.' would then be handled 
correctly in a round-trip.

What does the broader community use non-IUPAC symbols for, including '.' and '~' 
but (paradoxically?) not '-'?

Martin

>
> Best,
> Thomas
>
> --
> Thomas Rensch
> PhD Student - Paul Flicek Group
> EMBL - European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridge
> CB10 1SD
> United Kingdom
>
> On 26 Feb 2013, at 18:38, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>> Hi Thomas --
>>
>> On 02/25/2013 09:08 AM, Thomas Rensch wrote:
>>> Hi everyone,
>>>
>>> I am reading and writing fastq files and writeFastq just swaps dots ('.') for
>>> dashes ('-').
>>>
>>> Is this the desired behaviour of writeFastq and if so why? Otherwise could
>>> someone better at R development than I modify this?
>>>
>>> Example fastq:
>>>
>>>
>>> @HWI-EAS149_3:1:1:0:1956:0:1:0
>>> .A..................................
>>> +
>>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>> @HWI-EAS149_3:1:1:0:173:0:1:0
>>> .T..................................
>>> +
>>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>> @HWI-EAS149_3:1:1:0:47:0:1:0
>>> .T..................................
>>> +
>>> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>>
>> It's actually on input
>>
>> > sread(readFastq("tmp.fastq"))
>>  A DNAStringSet instance of length 3
>>    width seq
>> [1]    36 -A----------------------------------
>> [2]    36 -T----------------------------------
>> [3]    36 -T----------------------------------
>>
>> because '.' is not an a valid letter for DNAStringSet (or from the
>> international standard, if I understand correctly...)
>>
>> > DNAStringSet(".")
>> Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  :
>>  key 46 (char '.') not in lookup table
>> > alphabet(DNAStringSet())
>> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
>> > DNA_ALPHABET
>> [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
>>
>> and from ?DNA_ALPHABET
>>
>>     This alphabet contains all letters from the IUPAC Extended Genetic
>>     Alphabet (see '?IUPAC_CODE_MAP') + the gap ('"-"') and the hard
>>     masking ('"+"') letters.
>>
>> One possibility would be to add an option writeFastq(..., dashesASdots=FALSE),
>> is that really a good idea?
>>
>> Martin
>>
>>>
>>>
>>> Thanks a lot,
>>> Thomas
>>>
>>> --
>>> Thomas Rensch
>>> PhD Student - Paul Flicek Group
>>> EMBL-EBI
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list