[BioC] finding end of file in FASTA file

Martin Morgan mtmorgan at fhcrc.org
Thu Sep 13 14:58:35 CEST 2012


On 09/13/2012 01:42 AM, Jack [guest] wrote:
>
> library(ShortRead)
> fastadata <- readFasta("fastafolder", "fa$")
> file <- tempfile()
> writeFasta(fastadata, file)
> var1 <- readLines(file)
> while(countlength(tmp <- readLines(file, n = -1)) > 0)  {
> #do something
> }
>
> I want the while loop to run until the end of the file is reached, but the while statement doesn't work. Thanks for your help.

Hi Jack -- if the goal is to read the fasta file in chunks, use a 
'connection' that can remember the current location. After running the 
following to get a reproducible example fasta file

   library(ShortRead)
   example(readFasta)
   fl <- dir(analysisPath(sp), "s_1_sequence.txt", full.names=TRUE)

we can create a connection and open it, and then do our loop, reading 500 
lines at a time

   con <- file(fl); open(con)   # an open connection remembers its position
   while(length(res <- readLines(con, n=500)))   # 0 lines read: end of file
       cat(length(res), "\n")
   close(con)

which prints out

500
500
24
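
The loop in the original post does not terminate for a different reason 
than one might expect: readLines() given a file *name* opens and closes 
the file on every call, so each call returns the whole file again and the 
length never drops to zero. (countlength() is also not a function I know 
of in base R or ShortRead; length() is presumably what was meant.) A 
minimal, self-contained sketch of the connection approach, using a small 
temporary file and a chunk size of 5 that are made up purely for 
illustration:

```r
# Build a small stand-in file: 6 records of 2 lines each, 12 lines in total.
tmp <- tempfile(fileext = ".fa")
writeLines(rep(c(">seq", "ACGT"), times = 6), tmp)

# Open the connection once; subsequent reads continue where the last stopped.
con <- file(tmp, open = "r")
chunk_sizes <- integer()
while (length(res <- readLines(con, n = 5)) > 0) {
    # "do something" with the lines in 'res' here
    chunk_sizes <- c(chunk_sizes, length(res))
}
close(con)
unlink(tmp)

print(chunk_sizes)   # 5 5 2
```

Note that reading a fixed number of lines can split a multi-line fasta 
record across two chunks, so the "do something" step may need to carry 
over an incomplete record from one iteration to the next.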

Unfortunately, readFasta doesn't work on connections (that would be a 
worthwhile feature request). There is also FaFile in Rsamtools; try

   example(FaFile)

FaFile is most useful when the fasta file would benefit from being 
indexed, e.g., hundreds of contigs, but might also be useful for your 
purposes.
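
As a rough sketch of what working with FaFile looks like, the following 
is adapted from what I recall of the FaFile help page. It assumes the 
small indexed example file ce2dict1.fa that ships with Rsamtools; for 
your own file you would first create an index with indexFa().

```r
library(Rsamtools)

# Example fasta file (with its .fai index) shipped with Rsamtools.
fl <- system.file("extdata", "ce2dict1.fa", package = "Rsamtools",
                  mustWork = TRUE)

fa <- open(FaFile(fl))        # open the indexed fasta file
n_records <- countFa(fa)      # number of records in the file
idx <- scanFaIndex(fa)        # GRanges with one range per record

# Read only the first two records rather than the whole file.
dna <- scanFa(fa, param = idx[1:2])
close(fa)

print(n_records)
print(dna)
```

Because scanFa() takes the ranges to read as an argument, one can step 
through 'idx' a few records at a time, which gives chunk-wise reading by 
record rather than by line.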

Martin

> Regards
> Jack
>
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.14.4     latticeExtra_0.6-24  RColorBrewer_1.0-5   Rsamtools_1.8.6      lattice_0.20-10      Biostrings_2.24.1    GenomicRanges_1.8.13
> [8] IRanges_1.14.4       BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.1    hwriter_1.3    stats4_2.15.1  tools_2.15.1   zlibbioc_1.2.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


