[BioC] writeFastq to gzip compressed files

Martin Morgan mtmorgan at fhcrc.org
Wed Sep 25 23:18:06 CEST 2013


On 09/25/2013 01:23 PM, Harris A. Jaffee wrote:
> I believe gzfile only applies when reading.
>
> With gzcon, you may have to flush or close the connection (not sure which
> one - I think your fout2), or your whole session, to get the file written.

writeFastq() won't currently work for anything other than a character string 
pointing to a file.

The code included earlier in this email thread writeFastq_con() works with any 
connection that supports writing.

Both gzfile() and gzcon() connections can be written to; usually we'd use 
gzfile() to write to a file. I don't know of a real use case for gzcon.

In general connection output is buffered, so if the connection is opened outside 
the function, then it needs to be flush()'ed or close()'ed. The code and 
examples I provided earlier open and then close the connection inside the 
function, and output is of course written to the file.

Martin


>
>
> On Sep 25, 2013, at 4:07 PM, Marcus Davy wrote:
>
>> Thanks for the responses, Martin I will give your work-around a try.
>>
>>
>> I did also try out the following but in both cases they failed, and only
>> produced a zero byte sized file (equivalent to bash 'touch' utility on the
>> command line) using OS X.
>>
>> fout <- gzfile("test.fq.gz", "wb")
>> print(fout)
>> writeFastq(rfq, fout)
>>
>> fout2 <- gzcon(file("test.fq.gz", "wb"))
>> print(fout2)
>> writeFastq(rfq, fout)
>>
>>
>> cheers,
>>
>>
>> Marcus
>>
>>
>> On Thu, Sep 26, 2013 at 6:01 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>
>>> On 09/25/2013 09:09 AM, Michael Lawrence wrote:
>>>
>>>> Ideally, it would be:
>>>> writeFastq(rfq, gzfile(fout))
>>>>
>>>> But I don't think that works. In theory, it would be feasible to support
>>>> this now that R has exposed its connection interface...
>>>>
>>>
>>> That'll be an interesting learning experience, and I'll come up with a
>>> proper implementation. Here's a work-around, With
>>>
>>>   writeFastq_con(rfq, stdout())
>>>   yy = writeFastq_con(rfq, gzfile(tempfile()))
>>>   readFastq(yy)
>>>
>>> Martin
>>>
>>> .write1fastq <-
>>>     function(x, con, ..., full=FALSE)
>>> {
>>>     i <- paste0("@", as.character(id(x)), "\n")
>>>     fill <- if (full)
>>>         paste0("\n+", as.character(id(x)), "\n")
>>>     else "\n+\n"
>>>     s <- as.character(sread(x))
>>>     q <- as.character(quality(quality(**x)))
>>>     cat(paste(i, s, fill, q, sep=""), file=con, sep="\n")
>>>     length(i)
>>> }
>>>
>>> writeFastq_con <-
>>>     function(object, file, ..., chunkSize=1000000L)
>>>     ## object: ShortReadQ instance
>>>     ## file: connection,  e.g., stdout(), gzfile(<...>, "wb")
>>>     ## ...: additional arguments, in particular
>>>     ##   full=TRUE: logical(1), TRUE: replicate id tag on '+' line
>>>     ## chunkSize: size of each chunk processed for output, smaller
>>>     ##     values use less memory but are slower
>>>
>>> {
>>>     if (!isOpen(file)) {
>>>         open(file, "wb")
>>>         on.exit(close(file))
>>>     }
>>>
>>>     len <- length(object)
>>>     start <- 1L
>>>     tot <- 0L
>>>     while (start < len) {
>>>         end <- min(start + chunkSize - 1L, len)
>>>         tot <- tot + .write1fastq(object[seq(start, end)], file, ...)
>>>         start <- end + 1L
>>>     }
>>>     if (tot != len)
>>>         warning("output length does not equal input length")
>>>
>>>     summary(file)$description
>>>
>>> }
>>>
>>>
>>>> Michael
>>>>
>>>>
>>>> On Wed, Sep 25, 2013 at 7:39 AM, Marcus Davy <mdavy86 at gmail.com> wrote:
>>>>
>>>> Hi,
>>>>> as fastq files are often compressed now, I would like to be able to write
>>>>> out to compressed 'fq.gz' files, rather than raw text.
>>>>>
>>>>> I have searched the BioC threads and found a similar topic at;
>>>>>
>>>>>
>>>>> http://article.gmane.org/**gmane.science.biology.**
>>>>> informatics.conductor/45306/**match=writeFastq<http://article.gmane.org/gmane.science.biology.informatics.conductor/45306/match=writeFastq>
>>>>>
>>>>> A poor example of what I would like to be able to achieve follows;
>>>>>
>>>>> example(readFastq)
>>>>> print(rfq)
>>>>> fout <- "test.fq"
>>>>> writeFastq(rfq, fout)
>>>>> system(paste("gzip", fout))
>>>>>
>>>>> but this does not compress the data as it is streaming out to file. I
>>>>> would
>>>>> like to be able to do something like below, so this is a potential
>>>>> feature
>>>>> request;
>>>>>
>>>>> fout <- "test.fq.gz"
>>>>> writeFastq(rfq, fout, compress=TRUE)
>>>>>
>>>>> Is there an obvious way to compress using writeFastq that I am missing?
>>>>>
>>>>>
>>>>> cheers,
>>>>>
>>>>> Marcus
>>>>>
>>>>>
>>>>>   sessionInfo()
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel  splines   stats     graphics  grDevices utils     datasets
>>>>> [8] methods   base
>>>>>
>>>>> other attached packages:
>>>>>   [1] ShortRead_1.18.0     latticeExtra_0.6-24  RColorBrewer_1.0-5
>>>>>   [4] Rsamtools_1.12.3     lattice_0.20-15      Biostrings_2.28.0
>>>>>   [7] GenomicRanges_1.12.2 IRanges_1.18.0       BiocGenerics_0.6.0
>>>>> [10] BiocInstaller_1.10.3 limma_3.16.7         Hmisc_3.12-2
>>>>> [13] Formula_1.1-1        survival_2.37-4
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>>
>>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list