[BioC] Using summarizeOverlaps with multiple samples/readgroups in a single bam file?

Thu Jan 24 01:00:23 CET 2013

I've been thinking about this some more, and I don't see any inherent
reason why access to multiple read groups in a single bam file couldn't
be parallelized: I have previously sped up bam file reading by
parallelizing across chromosomes. It would also be convenient to have
the data for all the samples in an experiment in a single file. If
Rsamtools supported filtering by read group through some option to
ScanBamParam (does it?), I think it would be sufficient for
summarizeOverlaps to accept a vectorized param argument. Then one could
pass a list with one ScanBamParam per read group and get parallel
counting of multiple read groups from a single bam file.
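
To make the proposal concrete, here is a rough sketch of the control
flow in plain Python. It is not Rsamtools or summarizeOverlaps code: the
in-memory READS list stands in for a bam file, and every name in it
(READS, FEATURES, count_read_group, summarize_by_read_group) is made up
for illustration. It only shows the shape of "one filter per read
group, one counting job per filter, one column per read group".

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a bam file: (read_group, chromosome, position) per read.
READS = [
    ("sampleA", "chr1", 150),
    ("sampleA", "chr1", 900),
    ("sampleB", "chr1", 120),
    ("sampleB", "chr2", 50),
    ("sampleA", "chr2", 75),
    ("sampleB", "chr1", 5000),
]

# Toy features: name -> (chromosome, start, end), inclusive coordinates.
FEATURES = {
    "gene1": ("chr1", 100, 200),
    "gene2": ("chr1", 800, 1000),
    "gene3": ("chr2", 1, 100),
}


def count_read_group(read_group):
    """Count reads from one read group that fall inside each feature."""
    counts = Counter({name: 0 for name in FEATURES})
    for rg, chrom, pos in READS:
        if rg != read_group:  # the hypothetical per-read-group filter
            continue
        for name, (f_chrom, start, end) in FEATURES.items():
            if chrom == f_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)


def summarize_by_read_group(read_groups, workers=2):
    """Run one counting job per read group; return {read_group: {feature: n}}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(count_read_group, read_groups))
    return dict(zip(read_groups, results))


table = summarize_by_read_group(["sampleA", "sampleB"])
```

Each entry of table plays the role of one column of the
SummarizedExperiment. Threads are used here only to show the structure;
a real implementation would want separate processes, each making its
own pass over the file.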

What do you think?

On Sat 12 Jan 2013 12:53:36 PM PST, Martin Morgan wrote:
> On 1/12/2013 12:29 PM, Ryan C. Thompson wrote:
>> Hi all,
>>
>> I'm looking at simplifying my differential expression pipeline a
>> little bit by merging all my input bam files into one bam file with
>> multiple samples/read groups and then using that bam file as input to
>> summarizeOverlaps. Is this supported in any way? I've never worked
>> with sam read groups before (I always just did one sample per file),
>> so I don't really know anything about them.
>>
>> So is it supported to take a single bam file and use
>> summarizeOverlaps or some other mechanism to get a
>> SummarizedExperiment object with one column for each sample in the
>> bam file, rather than one column per file?
>
> Rsamtools doesn't do anything special with read groups (e.g., no
> pre-filtering) and summarizeOverlaps doesn't do per-read-group
> counting (one can provide one's own counting function to
> summarizeOverlaps, though...) Also, parallelizing over bam files is a
> simple way to get better throughput (providing a BamFileList as the
> second argument to summarizeOverlaps, and with 'parallel' on the
> search path, currently uses mclapply and memory-efficient iteration to
> populate the SummarizedExperiment), so in some ways one large bam file
> is a step in a counter-productive direction.
>
> Martin
>
>>
>> -Ryan Thompson