[BioC] Coverage vs GC bias - how to?

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue Sep 7 23:22:58 CEST 2010


Hi,

Just a few comments:

On Tue, Sep 7, 2010 at 4:59 AM, Marc Noguera <mnoguera at imppc.org> wrote:
> Hi all,
>
> I am trying to get some insight knowledge on some NGS data. I have an
> alignment file and an AlignedRead object from it, from which I can
> compute coverage.
> What I would like to know is if there is any bias on the coverage vs GC
> content.
> My idea is to run a sliding window of width N through the reference
> sequence I have the reads on and to compute, somehow, a kind of coverage
> coefficient from the coverage i get from IRanges and the AlignedRead object.

For some ideas on performing sliding-window stuff over Rle's, read
through this thread:
http://thread.gmane.org/gmane.science.biology.informatics.conductor/28491

> As I know the coordinate of the sliding window the next step is to
> obtain, somehow, the sequence of the window and calculate the GC content.
>
> How could I extract the sequence from BSgenome within this sliding
> window and how can I create the sliding window. I see that I can use
> getSeq to obtain the sequence and the use alphabetByFrequency on it.

While getSeq might work (I really never use that function), I think
I'd rather use "Views", like so:

R> library(BSgenome.Hsapiens.UCSC.hg18)
R> chr <- unmasked(Hsapiens$chr10)
R> ranges <- IRanges(start=c(100000, 101000, 102000), width=100)
R> seqs <- Views(chr, ranges)
R> alphabetFrequency(seqs)
      A  C  G  T M R W S Y K V H D B N - +
[1,] 18 26 23 33 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 30 26 24 20 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 38 23 18 21 0 0 0 0 0 0 0 0 0 0 0 0 0

> However, I don't know how to create this sliding window.
> Which is the best way to proceed? am I reinventing the wheel?

Maybe, but it's good practice to learn how to do anyway ;-)

There have been several papers published about biases found in
sequencing data, which you should probably read through. I think it's
good to know how to do these things so that you can actually see which
of the reported phenomena are happening in the dataset you're working
on ..

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioconductor mailing list