[BioC] Advice on reading big BED/BAM and ChIP-seq quality control

Ugo Borello ugo.borello at inserm.fr
Tue May 14 21:23:32 CEST 2013


I agree with Steve; QuasaR is a very well done package.
Thank you Michael!
Ugo


Quoting Steve Lianoglou <lianoglou.steve at gene.com>:

> Hi Michael,
>
> That's the first time I saw QuasR -- looks like a very nicely done package.
>
> Good stuff!
> -steve
>
>
> On Tue, May 14, 2013 at 12:06 AM, Michael Stadler
> <michael.stadler at fmi.ch> wrote:
>> Dear Hari,
>>
>> Reading all alignments of a sample (either from a bed or a bam file)
>> into memory may not be sustainable.
>>
>> One way to get around that would be to work on streams, i.e. only
>> loading a chunk of the alignments at a time. The Rsamtools package
>> provides such functionality for bam files (see ?BamFile and "yield"
>> therein).
>>
>> Alternatively, I would like to point out the QuasR package to you
>> (apologies for the advertisement). We have avoided the memory issue by
>> traversing the bam files at the C level, and only extracting the
>> information that is required. The function qQCrepoprt() may fulfill your
>> requirement 3), and the functions qCount() and qProfile() may be what
>> you can use to do 4).
>>
>> QuasR was designed to begin the analysis with reads and also create the
>> bam files for you; you can however also start with pre-existing bam
>> files and still use a good part of its functionality (see vignette
>> section 5.1, "BAM" files).
>>
>> I hope this helps,
>> Michael
>>
>>
>>
>> Reading alignments
>>
>> On 13.05.2013 15:43, Hari Easwaran wrote:
>>> Dear Bioc gurus,
>>>
>>> I am a newbie with using R tools for ChIP-seq analyses and seek advice on
>>> the best way to go about a data set I have.  Following are the file formats
>>> I have and what I would like to do with them:
>>>
>>> 1) Using Samtools, I created BED files (about 5 Gb) from the BAM files (3-4
>>> Gb)
>>>
>>> 2) Want to read the BED files (or BAM files) into R.
>>>
>>> 3) Perform quality control plots (like the number of duplicated reads
>>> across the samples because the nature of ChIP-seq processing is different
>>> in some of the samples, and so I want to know what bias it introduces).
>>>
>>> 4) Be able to retrieve specific genomic regions for exploration and
>>> visualization of reads/peaks in the context of genomic annotations (I guess
>>> to have in a format so that I can play with GenomicRanges).
>>>
>>>
>>> I am doing all this in a cluster with fairly good memory capacities (about
>>> 18G; Or perhaps I think it is 'good' memory). I went through the mailing
>>> list and found some very useful discussion on reading BED/BAM files:
>>>
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-March/001900.html
>>>
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-September/002242.html
>>>
>>> I thought BED files will be easy to work with because it already has data
>>> in a format that I understand (chromosome, start, end, tags). I tried the
>>> 'import' function from rtracklayer, as suggested in the above link, to read
>>> the BED file. However, it didn't work as I run out of memory.
>>>
>>>> From the discussions, it seems an alternative is Rsamtools to read BAM
>>> files.  Before I go about with trying Rsamtools, I would be happy to get
>>> some advice on whether I am on the right track by using Rsamtools, and if
>>> any other packages/tools might have in-built functions to achieve what I
>>> want with the data.
>>>
>>> Thanks for your time.
>>>
>>>
>>> Sincerely
>>>
>>> Hari Easwaran
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:   
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> --------------------------------------------
>> Michael Stadler, PhD
>> Head of Computational Biology
>> Friedrich Miescher Institute
>> Basel (Switzerland)
>> Phone : +41 61 697 6492
>> Fax   : +41 61 697 3976
>> Mail  : michael.stadler at fmi.ch
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:   
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Department of Bioinformatics and Computational Biology
> Genentech
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:   
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list