[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Mon Sep 27 18:31:48 CEST 2010

Hi

I was never too happy about what we wrote there in Lab 8. Paul has 
pointed out one major issue. The other is overlapping genes: standard 
RNA-Seq does not recover information about which strand a transcript is 
from, and in more crowded genomes it is not that rare for exons of two 
different genes on opposite strands to overlap.

The code in the lab does not address this. If genes A and B overlap, 
then every read that maps onto this overlap will be counted for both 
genes. If gene A is differentially expressed and gene B is not, then 
the extra counts from gene A that also get counted for gene B might 
cause gene B to be called differentially expressed, too.
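
To make the double counting concrete, here is a minimal toy sketch in plain 
Python. It is not the code from the lab and not the HTSeq script; the gene 
names, coordinates, and function names are all made up for illustration, and 
it assumes half-open intervals on a single chromosome with strand ignored. 
It compares naive counting, where a read is credited to every gene it 
overlaps, with one possible alternative that sets ambiguous reads aside.

```python
from collections import Counter

# Hypothetical toy annotation: gene -> list of (start, end) exon intervals.
# Intervals are half-open and on one chromosome. Strand is ignored because
# the reads carry no strand information.
EXONS = {
    "geneA": [(100, 200)],
    "geneB": [(150, 250)],  # overlaps geneA in [150, 200)
}

# Hypothetical reads as (start, end) intervals.
READS = [(110, 140), (160, 190), (170, 200), (210, 240)]


def genes_hit(read_start, read_end, exons=EXONS):
    """Return the set of genes with an exon overlapping the read."""
    return {
        gene
        for gene, intervals in exons.items()
        for start, end in intervals
        if read_start < end and start < read_end
    }


def count_naive(reads, exons=EXONS):
    """Credit each read to every gene it overlaps (double counting)."""
    counts = Counter()
    for start, end in reads:
        for gene in genes_hit(start, end, exons):
            counts[gene] += 1
    return counts


def count_unambiguous(reads, exons=EXONS):
    """Credit a read only if it overlaps exactly one gene."""
    counts = Counter()
    for start, end in reads:
        hits = genes_hit(start, end, exons)
        if len(hits) == 1:
            counts[next(iter(hits))] += 1
        elif len(hits) > 1:
            counts["_ambiguous"] += 1  # set aside, not assigned to a gene
    return counts


naive = count_naive(READS)
strict = count_unambiguous(READS)
print(dict(naive))
print(dict(strict))
```

With these toy reads, the two reads falling in the shared region are counted 
for both genes in the naive version, whereas the second version reports them 
separately as ambiguous. The linear scan over all exons is only meant to show 
the logic; it would be far too slow for a real annotation.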

All this is not likely to have large effects on results, but it would be 
nicer to do it properly. As you have already noticed, it is not exactly 
trivial to code something like this in a correct and efficient manner in 
R. At least I think so. I'm sure the IRanges gurus on the list will now 
jump on me with examples of how easy it would have been, but I found it 
much easier to code this in Python. The script I made for this purpose 
is available at
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

It is actually part of a larger framework to make coding such stuff in 
Python easy. Have a look: http://www-huber.embl.de/users/anders/HTSeq

   Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de