[BioC] Packages for GO and KEGG analysis on RNAseq data

Gordon K Smyth smyth at wehi.EDU.AU
Sun Sep 14 06:54:39 CEST 2014

Dear Nicolas,

I think you are suffering from the "paradox of choice", which says that 
too much choice leads to more unhappiness and less action even when all 
the choices are potentially good ones.

Pathway analysis is a very large and very active research area, and there 
are a large number of alternative methods.  This is because there is no 
unique definition of what a pathway is, and because there are a large number 
of different ways to rank pathways, and because there is no consensus as 
to which ranking will best match biological significance.  Perhaps no 
consensus is even possible, because different methods may give better 
results in different circumstances.

Some methods simply count the number of genes associated with each GO term 
in your DE list (overlap methods).  Other methods work with the test 
statistics or fold changes for each gene (gene set tests).  Some methods 
adjust for inter-correlations between genes and some don't.  Some adjust 
for RNA-seq specific biases like gene length.  Some methods test each GO term 
in isolation whereas others test them relative to each other.  It is not 
surprising that methods that use different information will potentially 
give different results.
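
To make the "overlap" idea concrete, here is a minimal sketch of the kind of calculation such methods rest on: a one-sided hypergeometric test of whether a GO term contains more DE genes than expected by chance. It is an illustration only, not the code of any package named in this thread; the function name, argument names and toy numbers are all invented, and it makes no adjustment for gene length or for correlation between genes.

```python
from math import comb


def overlap_pvalue(n_universe, n_in_term, n_de, n_overlap):
    """One-sided hypergeometric P(X >= n_overlap).

    n_universe: number of genes tested (the background)
    n_in_term:  background genes annotated to the GO term
    n_de:       genes in the DE list
    n_overlap:  DE genes that are annotated to the GO term
    """
    upper = min(n_in_term, n_de)
    total = comb(n_universe, n_de)
    # Sum the probabilities of seeing n_overlap or more annotated DE genes.
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers, invented for illustration: 10 genes tested, 4 in the term,
# 3 called DE, 2 of those DE genes in the term.
p = overlap_pvalue(10, 4, 3, 2)
print(p)
```

Note that the only inputs are counts of genes. A gene set test would instead start from the per-gene statistics, which is one reason the two families of methods can disagree.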

You are asking which is the most validated method, but in truth there are 
many validated methods. I would however avoid methods which work purely on 
logFC, because low expression genes may have large fold changes from 
RNA-seq just by chance.  Steve suggested roast and camera, which are 
implemented in both limma and edgeR.  Roast and camera can be viewed as 
improvements on logFC methods in that they use logFC but also take into 
account the mean-variance trend as well as correlations between genes.
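
The point about low expression genes can be illustrated with a small sketch. It assumes, purely for simplicity, that the two counts being compared are independent Poisson counts, and uses the standard delta-method approximation for the standard error of a log ratio. Real RNA-seq data have extra biological variation on top of this, and this is not how roast, camera or voom compute anything; the function name and the counts are invented.

```python
from math import log, log2, sqrt


def log_fc_with_se(count_a, count_b):
    """Return (log2 fold change, approximate standard error).

    Assumes independent Poisson counts, so that
    var(log(count)) is roughly 1 / count (delta method).
    """
    log_fc = log2(count_b / count_a)
    se = sqrt(1 / count_a + 1 / count_b) / log(2)
    return log_fc, se


# Same fold change, very different precision (toy counts).
low_fc, low_se = log_fc_with_se(2, 8)
high_fc, high_se = log_fc_with_se(2000, 8000)
print(low_fc, low_se)
print(high_fc, high_se)
```

Both genes show the same four-fold change, but the standard error for the low-count gene is many times larger, so ranking on logFC alone would treat a noisy estimate and a precise one as equally convincing.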

In the end though, it might be best for you to stick to a simple 
statistical approach that you can understand.  This is better than trying 
out lots of different methods without fully understanding how they differ. 
Do you simply want to test for enrichment of publicly available GO and 
KEGG terms in your list of top DE genes?  This is the oldest and simplest 
type of pathway analysis you can do, and it is still very popular.  In 
that case, goseq is an obvious choice -- it does exactly this, but also 
takes into account some of the special features of RNA-seq data.  Any 
other GO overlap method (like AmiGO, DAVID, GOstat or GOstats) will give 
similar results, but without adjusting for gene length bias.  Beware 
however that commercial products like GeneGO and IPA use their own 
proprietary databases, rather than the usual public GO databases, and hence 
will give different results.

Best wishes

> Date: Fri, 12 Sep 2014 17:54:07 +0000
> From: Merienne Nicolas <Nicolas.Merienne at chuv.ch>
> To: Bioconductor <bioconductor at r-project.org>
> Subject: [BioC] Packages for GO and KEGG analysis on RNAseq data
> Dear all,
> We are two biologists (so very new to the bioinformatics field...) working 
> with RNAseq data and having a little "trouble" with pathway analysis. We 
> performed mRNA sequencing on 4 distinct cell populations to compare 
> their transcriptional profile (platform Illumina HiSeq 2000). Raw reads 
> were mapped using TopHat and differential analysis was performed with 
> edgeR+voom+limma packages. Our final output is a table (.txt file) for 
> each contrast containing our 16058 expressed genes with respective log 
> fold change, expression values (normalized) and adjusted p-values. We 
> wish to perform pathway enrichment analysis (first GO for a global level 
> and KEGG for a more precise analysis) to determine which pathways are 
> enriched/depleted in a specific cell population compared to the others in 
> order to infer cell-type specific functional signatures. However, we 
> have difficulty finding an optimal method to do this. We tried several 
> packages (e.g. gage, goseq) and web-based software (e.g. GeneGO, AmiGO) 
> and found different outputs (sometimes opposite results). What would be 
> the most "validated" method/package for these analyses? In addition, we 
> found differences in the input data required (raw reads, log FC, a list 
> of differentially expressed genes?) and in the end we don't understand 
> what the input data for the analysis should be (we think that it depends 
> on the package/method used?). So, does anyone have experience with 
> GO/KEGG for RNAseq who could help us choose a good quantification method, 
> please?
> Thank you very much for your help.
> Best,
> Nicolas
