[BioC] writeFastq to gzip compressed files

Martin Morgan mtmorgan at fhcrc.org
Wed Sep 25 20:01:04 CEST 2013


On 09/25/2013 09:09 AM, Michael Lawrence wrote:
> Ideally, it would be:
> writeFastq(rfq, gzfile(fout))
>
> But I don't think that works. In theory, it would be feasible to support
> this now that R has exposed its connection interface...

That'll be an interesting learning experience, and I'll come up with a proper 
implementation. Here's a work-around, With

   writeFastq_con(rfq, stdout())
   yy = writeFastq_con(rfq, gzfile(tempfile()))
   readFastq(yy)

Martin

.write1fastq <-
     function(x, con, ..., full=FALSE)
{
     i <- paste0("@", as.character(id(x)), "\n")
     fill <- if (full)
         paste0("\n+", as.character(id(x)), "\n")
     else "\n+\n"
     s <- as.character(sread(x))
     q <- as.character(quality(quality(x)))
     cat(paste(i, s, fill, q, sep=""), file=con, sep="\n")
     length(i)
}

writeFastq_con <-
     function(object, file, ..., chunkSize=1000000L)
     ## object: ShortReadQ instance
     ## file: connection,  e.g., stdout(), gzfile(<...>, "wb")
     ## ...: additional arguments, in particular
     ##   full=TRUE: logical(1), TRUE: replicate id tag on '+' line
     ## chunkSize: size of each chunk processed for output, smaller
     ##     values use less memory but are slower

{
     if (!isOpen(file)) {
         open(file, "wb")
         on.exit(close(file))
     }

     len <- length(object)
     start <- 1L
     tot <- 0L
     while (start < len) {
         end <- min(start + chunkSize - 1L, len)
         tot <- tot + .write1fastq(object[seq(start, end)], file, ...)
         start <- end + 1L
     }
     if (tot != len)
         warning("output length does not equal input length")

     summary(file)$description
}

>
> Michael
>
>
> On Wed, Sep 25, 2013 at 7:39 AM, Marcus Davy <mdavy86 at gmail.com> wrote:
>
>> Hi,
>> as fastq files are often compressed now, I would like to be able to write
>> out to compressed 'fq.gz' files, rather than raw text.
>>
>> I have searched the BioC threads and found a similar topic at;
>>
>>
>> http://article.gmane.org/gmane.science.biology.informatics.conductor/45306/match=writeFastq
>>
>> A poor example of what I would like to be able to achieve follows;
>>
>> example(readFastq)
>> print(rfq)
>> fout <- "test.fq"
>> writeFastq(rfq, fout)
>> system(paste("gzip", fout))
>>
>> but this does not compress the data as it is streaming out to file. I would
>> like to be able to do something like below, so this is a potential feature
>> request;
>>
>> fout <- "test.fq.gz"
>> writeFastq(rfq, fout, compress=TRUE)
>>
>> Is there an obvious way to compress using writeFastq that I am missing?
>>
>>
>> cheers,
>>
>> Marcus
>>
>>
>>   sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] parallel  splines   stats     graphics  grDevices utils     datasets
>> [8] methods   base
>>
>> other attached packages:
>>   [1] ShortRead_1.18.0     latticeExtra_0.6-24  RColorBrewer_1.0-5
>>   [4] Rsamtools_1.12.3     lattice_0.20-15      Biostrings_2.28.0
>>   [7] GenomicRanges_1.12.2 IRanges_1.18.0       BiocGenerics_0.6.0
>> [10] BiocInstaller_1.10.3 limma_3.16.7         Hmisc_3.12-2
>> [13] Formula_1.1-1        survival_2.37-4
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list