[BioC] Packages for GO and KEGG analysis on RNAseq data

Fri Sep 12 21:40:24 CEST 2014

Hi Nicolas,

On Fri, Sep 12, 2014 at 10:54 AM, Merienne Nicolas
<Nicolas.Merienne at chuv.ch> wrote:
> Dear all,
>
> We are two biologists (so very new to the bioinformatics field...) working with RNAseq data and having a little "trouble" with pathway analysis. We performed mRNA sequencing on 4 distinct cell populations to compare their transcriptional profiles (platform Illumina HiSeq 2000). Raw reads were mapped using TopHat and differential analysis was performed with the edgeR+voom+limma packages. Our final output is a table (.txt file) for each contrast containing our 16058 expressed genes with their respective log fold changes, expression values (normalized) and adjusted p-values.
>
> We wish to perform pathway enrichment analysis (first GO for a global level and KEGG for a more precise analysis) to determine which pathways are enriched/depleted in a specific cell population compared to the others, in order to infer cell-type specific functional signatures. However, we are having difficulty finding an optimal method to do this. We tried several packages (e.g. gage, goseq) and web-based tools (e.g. GeneGO, AmiGO) and found different outputs (sometimes opposite results). What would be the most "validated" method/package for this kind of analysis?
>
> In addition, we found differences depending on the input data (raw reads, log FC, a list of differentially expressed genes...) and in the end we don't understand what the input data for the analysis should be (we think it depends on the package/method used...). So, does any of you have experience with GO/KEGG for RNAseq, and could you maybe help us choose a good quantification method please?

Instead of a "straight up" GO analysis, I'd try doing GSEA using
camera or roast (from edgeR, or from limma if you are using
limma::voom), with the GO and KEGG categories as gene sets.

You can get the GO gene sets (among others) in an R-consumable format from here:

http://bioinf.wehi.edu.au/software/MSigDB/index.html

You presumably already have the KEGG annotations in a consumable
format (perhaps you are using the KEGG.db Bioconductor package).

The nice thing about using camera and (m)roast is that you can simply
reuse your existing "setup", i.e. design matrix + count/expression
data, to test for enrichment the same way you are already testing for
differential expression. There will be some work to parse the gene
sets into index vectors (you'll know what I mean when you start
reading through the documentation and trying the examples). It's not
conceptually difficult to do, you just need to be careful (in the
bookkeeping sense), but it'll be worth it.
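
To make the "bookkeeping" concrete, here is a minimal sketch of the
index-vector step, written in Python purely as an illustration. The
function name gene_sets_to_indices, the min_size parameter and the toy
IDs are all hypothetical and are not part of limma or edgeR. It assumes
your gene sets and the rows of your expression matrix use the same
identifier type, and it returns 0-based positions (R would use 1-based
ones).

```python
def gene_sets_to_indices(row_ids, gene_sets, min_size=1):
    """Map each gene set to the row positions of its members in row_ids.

    row_ids   -- identifiers of the expression matrix rows, in row order
    gene_sets -- dict of set name -> iterable of gene identifiers
    min_size  -- drop sets with fewer than this many matched rows
    """
    # Position of each identifier in the matrix (first occurrence wins).
    position = {}
    for i, gene_id in enumerate(row_ids):
        position.setdefault(str(gene_id), i)

    index = {}
    for name, members in gene_sets.items():
        # Keep only members that are rows of the matrix; de-duplicate and sort.
        rows = sorted({position[str(g)] for g in members if str(g) in position})
        if len(rows) >= min_size:
            index[name] = rows
    return index


# Hypothetical toy data: IDs in row order, plus three made-up gene sets.
row_ids = ["1017", "7157", "672", "5290", "2064"]
gene_sets = {
    "SET_A": ["7157", "672", "999999"],  # 999999 is not a row, so it is skipped
    "SET_B": ["5290", 1017],             # int and str IDs are both compared as str
    "SET_C": ["888888"],                 # no matched rows, so the set is dropped
}
index = gene_sets_to_indices(row_ids, gene_sets)
print(index)  # {'SET_A': [1, 2], 'SET_B': [0, 3]}
```

The two things that usually go wrong are mismatched identifier types
(e.g. symbols in the gene sets but Entrez IDs on the matrix rows) and
indices computed against a matrix that was filtered or reordered
afterwards, so build the index from the exact rows you pass to the test.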

One thing to keep in mind, though: if you find no "signal" in the
expression data, sometimes that means there really is none to be found
by looking at expression data alone, so at some point you might have
to resign yourself to stop looking for the next method in hopes of
finding something that pops.

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech