[BioC] readFasta and gzipped FASTA files

Fri Feb 15 03:24:28 CET 2013

On 2/14/2013 12:32 PM, Ivan Gregoretti wrote:
> Hello everybody,
>
> The library ShortRead includes two very useful functions: readFastq()
> and readFasta()
>
> While readFastq() can open FASTQ files as either plain text or gzipped
> files, readFasta() can only open files in plain text.
>
> For example:
>
> # FASTQ: success
>> readFastq("t01213R0QU.fq.gz")
> class: ShortReadQ
> length: 43608 reads; width: 178..486 cycles
>
> # FASTA: failure
>> readFasta("t01213R0QU.P.fa.gz")
> Error in .normargInputFilepath(filepath) :
>    file "t01213R0QU.P.fa.gz" has unsupported type: gzfile
>
>
> Is this the current status or it's time for me to update my BioC?
>
> Can someone offer a work around that does not involve decompressing
> the FASTA file to disc? I tried, yet unsuccessfully:

readFasta is just an alias for readDNAStringSet, which does not support 
compressed files.

Not sure exactly what is in your fasta file, but Rsamtools::FaFile might be what 
you are looking for (typically, relatively few long sequences; you need to 
create an index (using indexFa) if one does not already exist then something 
along the lines of

   fa = FaFile("some.fa.gz")  ## index is some.fa.gz.
   scanFa(fa, param=scanFaIndex(fa))

would read the whole file, or a more selective GRanges to go after particular 
sub-sequences.

A workaround might also involve

http://comments.gmane.org/gmane.comp.lang.r.sequencing/1981

   all <- readLines("s_1.fa.gz")
   sread <- DNAStringSet(all[c(FALSE, TRUE)])
   id <- BStringSet(all[c(TRUE, FALSE)])
   fas <- ShortRead(sread=sread, id=id)

>
> readFasta(gzfile("t01213R0QU.P.fa.gz","r"))
> Error in function (classes, fdef, mtable)  :
>    unable to find an inherited method for function ‘readFasta’ for
> signature ‘"gzfile"’
>
>
> Thank you,
>
> Ivan
>
>
>> sessionInfo()
> R Under development (unstable) (2012-11-30 r61184)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.11.16     Biostrings_2.27.11    GenomicRanges_1.11.29
> [4] IRanges_1.17.32       BiocGenerics_0.5.6
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5   stats4_2.16.0  tools_2.16.0   zlibbioc_1.5.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109