[BioC] Normalization of polyA RNA-seq?

Sat Aug 14 11:31:44 CEST 2010

Hi Xiaohui

> I have two libraries of RNA-seq only with polyA of same tissue (leaf and
> leaf), and have mapped them to the genome. Most of these reads are in
> 3'UTR but not spread over the whole gene body. And the size of these two
> libraries are in great difference, like 25,000 reads versus 1,250,000
> reads. About 40% and 60% of genes only have 1 read in small lib and
> bigger one, respectively. Most of the tags are dominated by only a few
> genes. I want to combine these two libs for larger one, but I think I
> should normalize the read count before pooling them together.
>
> If use TPM normalization, read count in smaller library will be
> multiplied by 50 times, that means the 1-tag gene will become 50-tag
> gene, in the small lib, while maybe that gene is also 1-tag gene in
> bigger lib, I feel not comfortable that TPM may make skew the read
> distribution. Do you have any idea on normalizing the data instead of
> TPM?

So, you want to ensure that both libraries get the same weight in your
downstream analysis. But why would you want that? The smaller library
contains less information, so it should not get the same weight.
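
To put numbers on this, here is a minimal Python sketch. It uses the two
library sizes from your mail and assumes "TPM" means plain tags per million;
the function name and the single-gene example are made up for illustration.

```python
def tags_per_million(count, library_size):
    """Scale a raw read count to tags per million for one library."""
    return count * 1_000_000 / library_size

# Library sizes quoted in the question.
small_lib, big_lib = 25_000, 1_250_000

# A gene with exactly one read in each library.
tpm_small = tags_per_million(1, small_lib)
tpm_big = tags_per_million(1, big_lib)

# After scaling, the single read in the small library counts 50 times as
# much as the single read in the big one.
weight_ratio = tpm_small / tpm_big

# Pooling the raw counts instead weights every read equally.
pooled_tpm = tags_per_million(1 + 1, small_lib + big_lib)

print(tpm_small, tpm_big, weight_ratio, pooled_tpm)
```

Scaling first and pooling afterwards lets 25,000 reads carry as much weight
as 1,250,000 reads, although they are much noisier.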

Actually, your description is not too clear. You want to combine the two
libraries into a single one, i.e., give up the information about which
sample each read came from. This would make sense only if these are
replicates. If so, it seems very suspicious that a gene that has one count
in the small library gets only one count in the bigger one, given that the
bigger library has 50 times as many reads. This might occur occasionally,
but should not happen for many genes. You should really double-check
whether you did the counting correctly. (Try, for example, my htseq-count
script [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html] to see
whether its results are similar to yours.)

Apart from this issue: if you really just want to combine the reads into
one large sample, just add up the counts, without normalization. If,
however, you want to compare the samples against each other and normalize
them to make them comparable, you may want to look at the normalization
functions of DESeq (function 'estimateSizeFactors') or edgeR (function
'calcNormFactors').
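
To give an idea of what such a normalization does, here is a rough Python
sketch of a median-of-ratios size factor estimate. It is only an
illustration of the idea, not the code of either package; the function name
and the toy counts are invented, and genes with a zero count in any sample
are simply skipped.

```python
import math
from statistics import median

def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios idea.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # The geometric mean is zero if any count is zero, so skip the gene.
        if min(gene_counts) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in all samples")
    # The median makes the estimate robust against a few dominant genes.
    return [math.exp(median(r)) for r in log_ratios]

# Toy data (invented): sample 2 is sequenced four times as deeply as sample 1.
toy_counts = {
    "geneA": [10, 40],
    "geneB": [5, 20],
    "geneC": [100, 400],
    "geneD": [0, 3],  # skipped: zero count in sample 1
}
sf = size_factors(toy_counts)

# Dividing raw counts by the size factors puts the samples on a common scale.
normalized_geneA = [c / s for c, s in zip(toy_counts["geneA"], sf)]
print(sf, normalized_geneA)
```

Note that counts are divided by the size factors only to compare samples;
for pooling replicates you would still add up the raw counts.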

  Simon