[BioC] readFastq, writeFastq, compressed files and unix pipes.

Martin Morgan mtmorgan at fhcrc.org
Sat Dec 15 07:03:26 CET 2012


Sorry, I sent a previous reply a little too quickly, meant to add...

On 12/13/2012 06:40 AM, Martin Morgan wrote:
> On 12/13/2012 01:22 AM, Ivan Gregoretti wrote:
>> Hello ShortRead developers,
>>
>> I recently tried to create an R script that could run as an ordinary
>> programme at the command line. It was something like
>>
>> $ R --vanilla --slave -f ./myscript.R --input=input.fastq.gz
>> --output=output.fastq.gz
>>
>> I tried and failed but I learned something in the process: readFastq()
>> and writeFastq() are not symmetrical.
>>
>> Specifically:
>>
>> 1) readFastq() can tell the difference between a plain text FASTQ file
>> and a gzipped FASTQ file (.gz). Unlike that, writeFastq() always
>> outputs in plain text regardless of the suffix passed.
>
> FastqStreamer is intended to work on R connections, and from the note in ?stdin
> + a little googling + tolerating a warning about an already open connection one
> can read a compressed file from stdin using
>
>    fin <- gzcon(file("stdin", "rb"))
>
> (this seems to be the R paradigm, nothing special to fastq files) followed by
>
>    strm <- FastqStreamer(fin)
>    object <- yield(strm)
>
> see ?FastqStreamer for how many records are read at a time.
>
>> 2) writeFastq() understands unix pipes because it conveniently accepts
>> the file argument "/dev/stdout". Unlike that, readFastq() does not
>> accept the file argument "/dev/stdin", so, no pipes are possible.
>
> For me, writeFastq(object, "/dev/stdout") fails with
>
>  > writeFastq(object, "/dev/stdout")
> Error: UserArgumentMismatch
>    file '/dev/stdout' exists, but mode is not 'a'
>
> an easy way to get the full support of R's connections is
>
> setMethod(writeFastq, c("ShortReadQ", "connection"),
>      function(object, file, mode="w", full = FALSE, ...)
> {
>      outp <- character(length(object) * 4L)
>      outp[c(TRUE, FALSE, full, FALSE)] <- as.character(id(object))
>      outp[c(FALSE, TRUE, FALSE, FALSE)] <- as.character(sread(object))
>      outp[c(FALSE, FALSE, TRUE, FALSE)] <- "+"
>      outp[c(FALSE, FALSE, FALSE, TRUE)] <- as.character(quality(quality(object)))
>      writeLines(outp, file)
> })
>
> which has obvious limitations. Writing a compressed stream to /dev/stdout could
> be arranged with
>
>    fout <- gzcon(file("/dev/stdout", "wb"))
>
> (which seems like a hack; maybe there's a better R way?) and then
>
>    writeFastq(object, fout)
>    close(fout)
>
> and processing a file would be
>
>    fin <- gzcon(file("stdin", "rb"))
>    fout <- gzcon(file("/dev/stdout", "wb"))
>
>    strm <- FastqStreamer(fin)
>    while(length(object <- yield(strm))) {
>        writeFastq(object, fout)
>    }
>    close(fin);  close(fout)
>
> piping to stdout seems like it will be problematic, e.g., if some R command
> writes to stdout then it will be inserted in the output stream. Maybe better to
> use a named pipe (fifo)

Picking up here, in R I did

     library(ShortRead)
     fin <- gzcon(file("stdin", "rb"))
     fout = commandArgs(trailingOnly=TRUE)[[1]]

     strm <- FastqStreamer(fin)
     while(length(object <- yield(strm))) {
         writeFastq(object, fout, "a")
     }
     close(fin)

and used this from the shell as

   $ mkfifo pipe
   $ cat fastq.gz | Rscript --slave myfifo.R pipe & cat pipe | gzip -f | ...

and yes, I'll try to make read/writeFastq more symmetrical in how they behave.

Martin

>
>> It would be great if readFastq() and writeFastq() were equally smart.
>> Please consider adding these functionalities.
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>>
>>
>> Ivan Gregoretti, PhD
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list