[BioC] Normalization of polyA RNA-seq?

Ann Loraine aloraine at gmail.com
Sat Aug 14 12:23:05 CEST 2010


I just looked at the "counting reads with HTSeq" page.

Is it possible to use a 'bed' file (instead of GFF) to provide the
gene models for counting?
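
In case BED is not accepted directly, one possible workaround is to convert the BED lines to GFF-style exon lines first. The sketch below assumes 12-column BED (with blockSizes and blockStarts), and assumes the counting tool looks for "exon" features carrying a gene_id attribute. The function name, the "bed2gff" source label and the example record are made up for illustration. Note also that the BED name column often holds a transcript ID rather than a gene ID, so it may need mapping.

```python
def bed12_to_gff_exons(bed_line, source="bed2gff"):
    """Convert one 12-column BED line into a list of GFF exon lines.

    The 'source' label and the use of the BED name column as gene_id
    are assumptions made for this sketch.
    """
    f = bed_line.rstrip("\n").split("\t")
    chrom, chrom_start, name, strand = f[0], int(f[1]), f[3], f[5]
    # blockSizes and blockStarts are comma-separated, often with a trailing comma
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]

    gff_lines = []
    for size, offset in zip(sizes, offsets):
        # BED starts are 0-based and ends are exclusive;
        # GFF starts are 1-based and ends are inclusive.
        exon_start = chrom_start + offset + 1
        exon_end = chrom_start + offset + size
        attrs = 'gene_id "%s"' % name
        gff_lines.append("\t".join([
            chrom, source, "exon",
            str(exon_start), str(exon_end),
            ".", strand, ".", attrs,
        ]))
    return gff_lines


# Hypothetical two-exon gene model, not taken from any real annotation
example = "Chr1\t3630\t5899\tgeneA.1\t0\t+\t3759\t5630\t0\t2\t283,281,\t0,1988,"
exons = bed12_to_gff_exons(example)
for line in exons:
    print(line)
```

The off-by-one handling is the part most worth double-checking against a few known gene models before trusting any counts.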

Probably you already have access to plenty of 'bed' format files to
try, but just in case not, here is the list of all gene models from
the Arabidopsis thaliana genome:



On Sat, Aug 14, 2010 at 5:31 AM, Simon Anders <anders at embl.de> wrote:
> Hi Xiaohui
>> I have two libraries of polyA-only RNA-seq from the same tissue (leaf
>> and leaf), and have mapped them to the genome. Most of these reads are
>> in the 3'UTR rather than spread over the whole gene body. The sizes of
>> the two libraries differ greatly: 25,000 reads versus 1,250,000 reads.
>> About 40% and 60% of genes have only 1 read in the small library and
>> the bigger one, respectively. Most of the tags are dominated by only a
>> few genes. I want to combine these two libraries into a larger one, but
>> I think I should normalize the read counts before pooling them.
>> If I use TPM normalization, read counts in the smaller library will be
>> multiplied by 50, which means a 1-tag gene in the small library becomes
>> a 50-tag gene, while that gene may also be a 1-tag gene in the bigger
>> library. I am not comfortable with this, since TPM may skew the read
>> distribution. Do you have any idea how to normalize the data other
>> than with TPM?
> So, you want to ensure that both libraries get the same weight in your
> downstream analysis. But why would you want that? The smaller library
> contains less information, so it should not get the same weight.
> Actually, your description is not too clear. You want to combine the two
> libraries to a single one, i.e., give up the information which sample each
> read came from. This would make sense only if these are replicates. If so,
> it seems very suspicious that a gene that has one count in the small
> library only gets one count in the bigger one. This might occur
> occasionally, but should not happen for many genes. You should really
> double-check whether you did the counting correctly. (Try, for example, my
> htseq-count script
> [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html] to see whether
> its results are similar to yours.)
> Apart from this issue: If you really just want to combine the reads to one
> large sample, just add up the counts, without normalization. If, however,
> you want to compare the samples against each other, and normalize to make
> them comparable, you may want to look at the normalization functions of
> DESeq (function 'estimateSizeFactors') or edgeR (function
> 'calcNormFactors').
>  Simon
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
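
To make the three options in the quoted thread concrete, here is a minimal sketch in plain Python: per-million scaling (the "TPM" in the question), pooling by adding raw counts, and size factors computed by a median-of-ratios approach. The last function only illustrates the idea behind that kind of normalization. It is not the DESeq or edgeR code, and all function names and the toy counts are made up for this example.

```python
from math import exp, log
from statistics import median


def per_million(count, library_size):
    """Scale a raw count to counts per million reads in its library."""
    return count * 1e6 / library_size


def pool(counts_a, counts_b):
    """Combine two libraries by adding raw counts gene by gene."""
    genes = set(counts_a) | set(counts_b)
    return {g: counts_a.get(g, 0) + counts_b.get(g, 0) for g in genes}


def size_factors(samples):
    """Median-of-ratios size factors, one per sample.

    For each gene with a nonzero count in every sample, take the ratio
    of its count to the geometric mean of its counts across samples;
    the size factor of a sample is the median of those ratios.
    """
    genes = [g for g in samples[0]
             if all(s.get(g, 0) > 0 for s in samples)]
    log_geo_mean = {g: sum(log(s[g]) for s in samples) / len(samples)
                    for g in genes}
    return [exp(median(log(s[g]) - log_geo_mean[g] for g in genes))
            for s in samples]


# A single read weighs 50 times more in the small library after scaling
small_scaled = per_million(1, 25_000)
big_scaled = per_million(1, 1_250_000)
print(small_scaled / big_scaled)

# Toy counts (made up): the big library is a 50x deeper copy of the small one
small = {"g1": 1, "g2": 2, "g3": 10}
big = {"g1": 50, "g2": 100, "g3": 500}
pooled = pool(small, big)
sf = size_factors([small, big])
print(pooled["g1"], sf[1] / sf[0])
```

For pooling replicates, the raw sum keeps each read's weight equal. The size factors only matter when the two libraries are compared against each other.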
