[BioC] Advice on reading big BED/BAM and ChIP-seq quality control

Mon May 13 16:01:00 CEST 2013

On 05/13/2013 06:43 AM, Hari Easwaran wrote:
> Dear Bioc gurus,
>
> I am a newbie with using R tools for ChIP-seq analyses and seek advice on
> the best way to go about a data set I have.  Following are the file formats
> I have and what I would like to do with them:
>
> 1) Using Samtools, I created BED files (about 5 Gb) from the BAM files (3-4
> Gb)
>
> 2) Want to read the BED files (or BAM files) into R.
>
> 3) Perform quality control plots (like the number of duplicated reads
> across the samples because the nature of ChIP-seq processing is different
> in some of the samples, and so I want to know what bias it introduces).
>
> 4) Be able to retrieve specific genomic regions for exploration and
> visualization of reads/peaks in the context of genomic annotations (I guess
> to have in a format so that I can play with GenomicRanges).
>
>
> I am doing all this in a cluster with fairly good memory capacities (about
> 18G; Or perhaps I think it is 'good' memory). I went through the mailing
> list and found some very useful discussion on reading BED/BAM files:
>
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-March/001900.html
>
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-September/002242.html
>
> I thought BED files will be easy to work with because it already has data
> in a format that I understand (chromosome, start, end, tags). I tried the
> 'import' function from rtracklayer, as suggested in the above link, to read
> the BED file. However, it didn't work as I run out of memory.
>
>>From the discussions, it seems an alternative is Rsamtools to read BAM
> files.  Before I go about with trying Rsamtools, I would be happy to get
> some advice on whether I am on the right track by using Rsamtools, and if
> any other packages/tools might have in-built functions to achieve what I
> want with the data.

As the author of Rsamtools, I'd say yes, you are on the right track using this 
package and (indexed) bam files.

A good place to start is the readBamGappedAlignments function (this is renamed 
to readGAlignmentsFromBam in the 'devel' version) which inputs essential 
alignment data (seqname, start, cigar) of aligned reads; additional information 
can also be input. You'll want to understand how to use ScanBamParam.

Strategies for dealing with large data are to iterate through the bam file 
(using the 'BamFile' function with yieldSize=1000000 or similarly large but 
manageable number; see ?BamFile for examples) or indexing into the bam file to 
retrieve specific regions of interest (using the 'which' argument to 
ScanBamParam(), and using ScanBamParam as the 'param' argument  to most 
functions that read BAM files).

There are packages that provide QA, including the 'qrqc', 'QuasR', and 
'ShortRead' packages among others. Be sure to read package vignettes from 
bioconductor.org or via, e.g., vignette(package="Rsamtools").

Don't get too bogged down on a cluster right out of the gate; instead get small 
data working, move to parallel::mclapply (typically processing separate bam 
files in separate processes) and then if needs be jobs submitted to clusters.

Hope that helps, feel free to come back to the list with more specific questions 
as you encounter problems.

Martin

>
> Thanks for your time.
>
>
> Sincerely
>
> Hari Easwaran
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793