[BioC] how to deal with a 30G fastq file

Martin Morgan mtmorgan at fhcrc.org
Thu Oct 6 21:29:01 CEST 2011


Hi Steve --

On 10/06/2011 06:48 AM, Steve Lianoglou wrote:
> Hi Martin,
>
> Just wanted to say:
>
> On Wed, Oct 5, 2011 at 11:39 PM, Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>
>>   fq = FastqStreamer(<...>)
>>   while (length(res <- yield(fq)))
>>       # work, e.g., filter
>
> That's really cool!

Anita Lerch suggested and helped to implement this.

> Then some navel gazing:
>
> Have you thought about "inverting" this flow? Like, run the while loop
> in "C-land" but pass an R expression/block/something in and have it be
> evaluated within each iteration of the C/while loop?
>
> I'm guessing calling an R function from within C code is costly, but
> "while" loops in R are also slow (compared to while loops in C), so I
> wonder which would win in the long run.

Rsamtools::applyPileups does this. In some ways it's like lapply(<obj>, 
FUN), where the user provides FUN and applyPileups does work at the C 
level to prepare data for FUN.

FUN is like # work -- both expect to operate on R objects using R 
code. For this reason both are going to be efficient if they operate 
on vectors, hence on chunks (e.g., millions of records) of the fastq 
or bam file. So yield() and applyPileups() have a similar task -- 
efficiently create a chunk of data to be processed, then pass that to 
the user. Since they're both function calls, they are both free to 
create those objects in R or C as appropriate.
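
A minimal sketch of the chunking idea, written in base R only so that it 
runs without ShortRead. make_streamer() and its n argument are 
hypothetical stand-ins for FastqStreamer() / yield(), not the ShortRead 
API, and the five-record fastq file is made up for the illustration. The 
point is just that each iteration hands a whole chunk to vectorized R 
code.

```r
# Hypothetical stand-in for FastqStreamer()/yield(); NOT the ShortRead API.
make_streamer <- function(path, n) {
    con <- file(path, open = "r")
    list(
        yield = function() {
            # A fastq record is 4 lines; read up to n records per call.
            lines <- readLines(con, n = 4L * n)
            # Keep the sequence line (2nd of every 4); empty at end of file.
            lines[seq_along(lines) %% 4L == 2L]
        },
        close = function() close(con)
    )
}

# Made-up example data: five short reads written as a fastq file.
reads <- c("ACGT", "GGCCA", "TTTAAACC", "G", "CCGGTT")
fq_path <- tempfile(fileext = ".fastq")
writeLines(as.vector(rbind(paste0("@read", seq_along(reads)), reads, "+",
                           strrep("I", nchar(reads)))), fq_path)

fq <- make_streamer(fq_path, n = 2)
total_bases <- 0
n_chunks <- 0
while (length(chunk <- fq$yield())) {
    # 'work': one vectorized call per chunk, not one call per record
    total_bases <- total_bases + sum(nchar(chunk))
    n_chunks <- n_chunks + 1
}
fq$close()
```

With real data n would be in the millions, so the cost of the R-level 
while loop is spread over very few iterations.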

The big difference is really in how the results of the iteration or the 
apply are aggregated. yield() relies on the user to do something 
('aggregate by writing to a file', or 'pre-allocate a result vector and 
fill in with each iteration') whereas applyPileups returns a list, with 
each element the result of FUN. If there were clear aggregation 
strategies then the apply-style approach might have additional advantages.
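
The two user-side aggregation strategies mentioned for yield() might 
look roughly like this. It is a sketch in base R; make_chunker() is a 
hypothetical in-memory stand-in for a streamer, and the pre-allocation 
assumes the total number of records is known up front, which it often 
is not for a real fastq file.

```r
# Hypothetical in-memory chunk source; returns a zero-length vector when done.
make_chunker <- function(x, n) {
    pos <- 0L
    function() {
        if (pos >= length(x)) return(x[0])
        idx <- seq(pos + 1L, min(pos + n, length(x)))
        pos <<- pos + n
        x[idx]
    }
}

reads <- c("ACGT", "GGCCA", "TTTAAACC", "G", "CCGGTT")

# Strategy 1: pre-allocate a result vector and fill in with each iteration.
next_chunk <- make_chunker(reads, n = 2)
widths <- integer(length(reads))   # assumes the total is known in advance
filled <- 0L
while (length(chunk <- next_chunk())) {
    widths[filled + seq_along(chunk)] <- nchar(chunk)
    filled <- filled + length(chunk)
}

# Strategy 2: aggregate by writing to a file (here, reads of width >= 5).
next_chunk <- make_chunker(reads, n = 2)
out <- tempfile()
con <- file(out, open = "w")
while (length(chunk <- next_chunk())) {
    writeLines(chunk[nchar(chunk) >= 5], con)
}
close(con)
kept <- readLines(out)
```

Either way the memory in use at any time is bounded by the chunk size 
plus whatever the user chooses to keep.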

This is still a bit of a work in progress, so ideas welcome; one might 
easily imagine that lapply(FastqStreamer(<...>), FUN, ...) could be 
implemented in a straightforward way, for instance.
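
One way such an apply-style wrapper could look, as a sketch only: 
stream_lapply() and make_chunker() are hypothetical names, nothing like 
them exists in ShortRead as far as this message is concerned, and the 
chunk source is an in-memory stand-in rather than a FastqStreamer.

```r
# Hypothetical in-memory chunk source; returns a zero-length vector when done.
make_chunker <- function(x, n) {
    pos <- 0L
    function() {
        if (pos >= length(x)) return(x[0])
        idx <- seq(pos + 1L, min(pos + n, length(x)))
        pos <<- pos + n
        x[idx]
    }
}

# Hypothetical lapply-style driver: the loop lives here, the user supplies FUN.
# Returns a list with one element per chunk, each the result of FUN.
stream_lapply <- function(next_chunk, FUN, ...) {
    results <- list()
    while (length(chunk <- next_chunk())) {
        results[[length(results) + 1L]] <- FUN(chunk, ...)
    }
    results
}

reads <- c("ACGT", "GGCCA", "TTTAAACC", "G", "CCGGTT")
res <- stream_lapply(make_chunker(reads, n = 2),
                     function(chunk) sum(nchar(chunk)))

# The user still decides how to combine the per-chunk results.
total_bases <- Reduce(`+`, res)
```

The list grows by one element per chunk, not per record, so its own 
overhead stays small as long as FUN returns something compact.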

Martin

> Just curious -- sorry if I missed some previous discussion on this topic.
>
> Anyway, like I said -- this is really cool already.
>
> Thanks,
>
> -steve
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


