[BioC] What populates makeTranscriptDbFromBiomart?

Mon Apr 16 19:50:01 CEST 2012

Hi Ravi,

I think part of your question is whether we can trust Ensembl to be 
internally consistent between what they put into their GTF files and 
what they expose through BioMart.  That's not really a Bioconductor 
question, since we only present what is available at the resource in 
question, but we can still use Bioconductor to investigate it.  For 
example, you can use the import() method from rtracklayer to read the 
GTF file and compare the result with what makeTranscriptDbFromBiomart() 
assembles through biomaRt.  I would encourage you to make that 
comparison if you feel motivated (but bear in mind that some kinds of 
data may not be present in the GTF file).  If you find any legitimate 
discrepancies, the people at Ensembl are usually quite responsive about 
explaining or correcting them, whichever is appropriate.  In practice, 
though, problems with this resource are rare; the folks at Ensembl are 
highly reliable.
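
To make the idea of such a comparison concrete, here is a minimal, 
language-neutral sketch in Python (not the rtracklayer/GenomicFeatures 
code itself).  It assumes you have the GTF text on one side and a set of 
transcript IDs taken from the BioMart-derived annotation on the other.  
The names GTF_TEXT, BIOMART_IDS, gtf_transcript_ids() and compare_ids() 
are hypothetical, and the toy records and IDs are made up for 
illustration.

```python
import re

# Toy GTF lines (made-up coordinates and IDs, for illustration only).
GTF_TEXT = """\
1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1";
1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "2";
2\tprotein_coding\texon\t50\t150\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; exon_number "1";
"""

# Hypothetical stand-in for the transcript IDs from the BioMart-derived annotation.
BIOMART_IDS = {"T1", "T3"}

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(text):
    """Collect the transcript_id attribute from every non-comment GTF line."""
    ids = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_ids(gtf_ids, biomart_ids):
    """Split the IDs into shared, GTF-only and BioMart-only sets."""
    return {
        "shared": gtf_ids & biomart_ids,
        "gtf_only": gtf_ids - biomart_ids,
        "biomart_only": biomart_ids - gtf_ids,
    }


result = compare_ids(gtf_transcript_ids(GTF_TEXT), BIOMART_IDS)
for key in ("shared", "gtf_only", "biomart_only"):
    print(key, sorted(result[key]))
```

Comparing IDs is only a first pass; two sources can agree on the IDs and 
still differ in coordinates, so the same set logic could be applied to 
(chromosome, start, end, strand) tuples as a follow-up.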

But as Steve pointed out, even if the two annotations are identical you 
will have reads that are simply not part of the known transcriptome.  
So some proportion of your reads will not match anything that is well 
characterized.  Of the reads that don't match, some are likely to come 
from unannotated transcripts and some will be noise; both are to be 
expected.
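
As a rough illustration of where the "lost" reads go, here is a small 
Python sketch (not the Bioconductor counting code) that counts how many 
reads overlap any annotated transcript.  The intervals, the overlaps() 
and count_assigned() helpers, and the toy reads are all hypothetical; 
only the two totals at the end come from Ravi's message.

```python
# Toy annotation and reads as (chromosome, start, end), 1-based inclusive.
# All coordinates are made up for illustration.
TRANSCRIPTS = [("chr1", 100, 500), ("chr1", 900, 1200), ("chr2", 50, 300)]
READS = [
    ("chr1", 120, 170),   # inside an annotated transcript
    ("chr1", 480, 530),   # partially overlaps an annotated transcript
    ("chr1", 600, 650),   # between annotated transcripts
    ("chr2", 10, 40),     # upstream of the only chr2 transcript
    ("chr3", 5, 55),      # chromosome with no annotation at all
]


def overlaps(read, transcript):
    """True if both intervals are on the same chromosome and share a base."""
    return (
        read[0] == transcript[0]
        and read[1] <= transcript[2]
        and transcript[1] <= read[2]
    )


def count_assigned(reads, transcripts):
    """Count reads that overlap at least one transcript."""
    return sum(any(overlaps(r, t) for t in transcripts) for r in reads)


assigned = count_assigned(READS, TRANSCRIPTS)
print("toy reads assigned:", assigned, "of", len(READS))

# The totals reported in the original question.
mapped = 8835090
overlapping = 5380643
unassigned = mapped - overlapping
fraction = overlapping / mapped
print("unassigned reads:", unassigned)
print("fraction overlapping known transcripts: %.3f" % fraction)
```

With the reported totals, roughly 61% of the mapped reads overlap a known 
transcript, and the remainder fall in regions with no annotation.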

   Marc

On 04/16/2012 06:41 AM, Steve Lianoglou wrote:
> Hi,
>
> On Sat, Apr 14, 2012 at 4:40 PM, Ravi Karra<ravi.karra at gmail.com>  wrote:
>> Hi,
>>
>> Just starting to learn how to look at RNA-Seq data, so apologies in advance.  I ran my RNA-Seq experiment on a GAII and aligned to the zebrafish genome using Bowtie2/TopHat2.  I downloaded the current zebrafish genome (Zv9) and transcript GTF file from Ensembl for the reference indices.  I am trying to use edgeR to look at differential expression, but am a little hung up on getting the count data.
>>
>> As you can see from the code below, I input 8835090 mapped reads, but only 5380643 are overlapped with known transcripts.  It seems that I am losing reads in summarizing the count data and I can't really figure out why.   Is the transcript information that results from makeTranscriptDbFromBiomart identical to the transcript information in the gtf files that can be downloaded via Ensembl?
> Assume for the moment that it is identical -- you will (for sure)
> still have reads mapping to regions where no transcripts are
> annotated. This still happens in organisms with "better" annotations
> than zebrafish, such as fruit fly, mouse, and human.
>
> The limit of our knowledge about what, where, and why regions of the
> genome are transcribed can be as exciting as it is frustrating,
> depending on which side of the fence you happen to be standing on a
> particular day.
>
> -steve
>