[BioC] read sequences from fasta file starting with > sign and untill next > sign

Fri Sep 14 15:52:14 CEST 2012

On 09/14/2012 06:11 AM, Jack [guest] wrote:
>
> Hi:
>
> I am trying to read sequences from a fasta file starting with > till the next > sign:
>
> library(ShortRead)
> setwd("fastafolder");
> con <- file("somefastafile.fa");
> open(con)
> pattern <- as.character("TACC")
> while(length(res <- readLines(con, n=1)))
> {
> #do something
> }
> close(con)
>
> With this while statement I am able to read a single line from the fasta file each time. But I want to read a chunk of links each time from the fasta file starting with  > sign and till the next  >  sign.
>
> Example
>> AAATTT
> TAGGCT
> ATTTGC
>> CGATTT
>
>   And I want to read the following in the first run of while loop
>> AAATTT
> TAGGCT
> ATTTGC
>
> Thanks for your help.

Hi Jack -- in R it's not generally useful to operate one record at a 
time, so think about reading many records (a million?) in and then 
processing as a vector. You could use FaFile in Rsamtools

fl <- system.file(package="BSgenome", "extdata", "ce2dict0.fa")
## create an index, if neceessary
indexFa(fl)
fa <- FaFile(fl)
hdr <- scanFaIndex(fa)

## record-at-a-time
open(fa)
for (i in seq_along(hdr))
     scanFa(fa, hdr[i])
close(fa)

## groups, e.g., 10
idx <- seq_along(hdr)
grps <- split(idx, cut(idx, 10, labels=FALSE))
open(fa)
for (grp in grps)
     scanFa(fa, hdr[grp])
close(fa)

or role your own

fl <- system.file(package="BSgenome", "extdata", "ce2dict0.fa")
con <- file(fl)
open(con)
buf <- character()
while (length(inpt <- readLines(con, 10))) {
     lastidx <- tail(grep("^>", inpt), 1)
     complete <- c(buf, inpt[seq(1, lastidx - 1L)])
     buf <- inpt[seq(lastidx, length(inpt))]
     if (length(complete)) {
         ## 'complete' contains complete record(s)
     }
}
## 'buf' might contain some final, complete, records
close(con)

Martin

>
> Regards:
> Jack
>
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] CHNOSZ_0.9-7         ShortRead_1.14.4     latticeExtra_0.6-24  RColorBrewer_1.0-5   Rsamtools_1.8.6      lattice_0.20-10      Biostrings_2.24.1
>   [8] GenomicRanges_1.8.13 IRanges_1.14.4       BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.1    hwriter_1.3    stats4_2.15.1  tools_2.15.1   zlibbioc_1.2.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793