[BioC] Advice on reading big BED/BAM and ChIP-seq quality control

Steve Lianoglou lianoglou.steve at gene.com
Tue May 14 09:16:56 CEST 2013


Hi Michael,

That's the first time I saw QuasR -- looks like a very nicely done package.

Good stuff!
-steve


On Tue, May 14, 2013 at 12:06 AM, Michael Stadler
<michael.stadler at fmi.ch> wrote:
> Dear Hari,
>
> Reading all alignments of a sample (either from a bed or a bam file)
> into memory may not be sustainable.
>
> One way to get around that would be to work on streams, i.e. only
> loading a chunk of the alignments at a time. The Rsamtools package
> provides such functionality for bam files (see ?BamFile and "yield"
> therein).
>
> Alternatively, I would like to point out the QuasR package to you
> (apologies for the advertisement). We have avoided the memory issue by
> traversing the bam files at the C level, and only extracting the
> information that is required. The function qQCrepoprt() may fulfill your
> requirement 3), and the functions qCount() and qProfile() may be what
> you can use to do 4).
>
> QuasR was designed to begin the analysis with reads and also create the
> bam files for you; you can however also start with pre-existing bam
> files and still use a good part of its functionality (see vignette
> section 5.1, "BAM" files).
>
> I hope this helps,
> Michael
>
>
>
> Reading alignments
>
> On 13.05.2013 15:43, Hari Easwaran wrote:
>> Dear Bioc gurus,
>>
>> I am a newbie with using R tools for ChIP-seq analyses and seek advice on
>> the best way to go about a data set I have.  Following are the file formats
>> I have and what I would like to do with them:
>>
>> 1) Using Samtools, I created BED files (about 5 Gb) from the BAM files (3-4
>> Gb)
>>
>> 2) Want to read the BED files (or BAM files) into R.
>>
>> 3) Perform quality control plots (like the number of duplicated reads
>> across the samples because the nature of ChIP-seq processing is different
>> in some of the samples, and so I want to know what bias it introduces).
>>
>> 4) Be able to retrieve specific genomic regions for exploration and
>> visualization of reads/peaks in the context of genomic annotations (I guess
>> to have in a format so that I can play with GenomicRanges).
>>
>>
>> I am doing all this in a cluster with fairly good memory capacities (about
>> 18G; Or perhaps I think it is 'good' memory). I went through the mailing
>> list and found some very useful discussion on reading BED/BAM files:
>>
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-March/001900.html
>>
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-September/002242.html
>>
>> I thought BED files will be easy to work with because it already has data
>> in a format that I understand (chromosome, start, end, tags). I tried the
>> 'import' function from rtracklayer, as suggested in the above link, to read
>> the BED file. However, it didn't work as I run out of memory.
>>
>>>From the discussions, it seems an alternative is Rsamtools to read BAM
>> files.  Before I go about with trying Rsamtools, I would be happy to get
>> some advice on whether I am on the right track by using Rsamtools, and if
>> any other packages/tools might have in-built functions to achieve what I
>> want with the data.
>>
>> Thanks for your time.
>>
>>
>> Sincerely
>>
>> Hari Easwaran
>>
>>       [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> --------------------------------------------
> Michael Stadler, PhD
> Head of Computational Biology
> Friedrich Miescher Institute
> Basel (Switzerland)
> Phone : +41 61 697 6492
> Fax   : +41 61 697 3976
> Mail  : michael.stadler at fmi.ch
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



-- 
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech



More information about the Bioconductor mailing list