[BioC] Coverage vs GC bias - how to?

Sun Sep 12 20:32:50 CEST 2010

I wonder if you are blurring a couple of concepts, and I'm not sure if
bias is the right word, rather than an actual facet of the situation.

You could get the GC content of your suite of reads fairly directly:

	a = alphabetFrequency(reads, baseOnly=TRUE)
	reads_GC_content = a[,"C"] + a[,"G"]

But for the GC content of the alignments, you might do this (also from
Biostrings):

  U: reference (must be unmasked, if masked)
  N: read/alignment width
  L: alignment starting locations

	lf = letterFrequencyInSlidingView(U, view.width=N, letters="CG")
	alignments_GC_content = lf[L,]

Note that 'lf' holds the GC content of all subsequences of length N  
along
your reference sequence.  This may seem wasteful, but it can be  
faster to
do all that, then subset it for the relevant subsequences, than to  
extract
those subsequences, get their alphabetFrequency, etc.

Let us know if we aren't getting to what you need.  -Harris

On Sep 7, 2010, at 4:59 AM, Marc Noguera wrote:

> Hi all,
>
> I am trying to get some insight knowledge on some NGS data. I have an
> alignment file and an AlignedRead object from it, from which I can
> compute coverage.
> What I would like to know is if there is any bias on the coverage  
> vs GC
> content.
> My idea is to run a sliding window of width N through the reference
> sequence I have the reads on and to compute, somehow, a kind of  
> coverage
> coefficient from the coverage i get from IRanges and the  
> AlignedRead object.
>
> As I know the coordinate of the sliding window the next step is to
> obtain, somehow, the sequence of the window and calculate the GC  
> content.
>
> How could I extract the sequence from BSgenome within this sliding
> window and how can I create the sliding window. I see that I can use
> getSeq to obtain the sequence and the use alphabetByFrequency on it.
> However, I don't know how to create this sliding window.
> Which is the best way to proceed? am I reinventing the wheel?
>
> Thanks in advance
>
> Marc
>
> -- 
> -----------------------------------------------------
> Marc Noguera i Julian, PhD
> Genomics unit / Bioinformatics
> Institut de Medicina Predictiva i Personalitzada
> del Càncer (IMPPC)
> B-10 Office
> Carretera de Can Ruti
> Camí de les Escoles s/n
> 08916 Badalona, Barcelona
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/ 
> gmane.science.biology.informatics.conductor