[BioC] Packages for GO and KEGG analysis on RNAseq data
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Sep 14 06:54:39 CEST 2014
Dear Nicolas,
I think you are suffering from the "paradox of choice"
http://en.wikipedia.org/wiki/The_Paradox_of_Choice
which says that too much choice leads to more unhappiness and less action
even when all the choices are potentially good ones.
Pathway analysis is a very large and very active research area, and there
are a large number of alternative methods. This is because there is no
unique definition of what a pathway is, and because there are a large number
of different ways to rank pathways, and because there is no consensus as
to which ranking will best match biological significance. Perhaps no
consensus is even possible, because different methods may give better
results in different circumstances.
Some methods simply count the number of genes associated with each GO term
in your DE list (overlap methods). Other methods work with the test
statistics or fold changes for each gene (gene set tests). Some methods
adjust for inter-correlations between genes and some don't. Some adjust
for RNA-seq specific biases like gene length. Some methods test each GO term
in isolation whereas others test them relative to each other. It is not
surprising that methods that use different information will potentially
give different results.
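
To make the overlap idea concrete, here is a minimal sketch of the classical
hypergeometric (one-sided Fisher) test that overlap methods are built on. It
is written in plain Python only to keep it self-contained; the Bioconductor
packages discussed here are R packages, and the function name and the example
numbers below are illustrative assumptions, not taken from any of them.

```python
from math import comb

def overlap_pvalue(n_universe, n_in_set, n_de, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe : number of genes tested (the background)
    n_in_set   : number of background genes annotated to the GO term
    n_de       : number of genes in the DE list
    n_overlap  : number of DE genes annotated to the GO term
    """
    max_k = min(n_in_set, n_de)
    total = comb(n_universe, n_de)
    # Sum the probabilities of seeing n_overlap or more annotated DE genes.
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total

# Hypothetical example: 16058 expressed genes, a GO term with 100 of them,
# a DE list of 500 genes, 10 of which carry the term.
p = overlap_pvalue(16058, 100, 500, 10)
print(p)
```

Note that this test uses only membership of the DE list. It ignores the test
statistics, the correlations between genes and gene length, which is exactly
where the other classes of method differ from it.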
You are asking which is the most validated method, but in truth there are
many validated methods. I would however avoid methods which work purely on
logFC, because low expression genes may have large fold changes from
RNA-seq just by chance. Steve suggested roast and camera, which are
implemented in both limma and edgeR. Roast and camera can be viewed as
improvements on logFC methods in that they use logFC but also take into
account the mean-variance trend as well as correlations between genes.
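
A rough way to see why raw logFC is unreliable for low expression genes is to
look at how noisy it is as a function of count size. The sketch below assumes
pure Poisson counting noise with the same mean in both groups and uses a
delta-method approximation, which is itself crude at very small counts. It
ignores biological variation, so real data will be noisier still. The function
names and the prior count of 0.5 are assumptions made for illustration.

```python
from math import log, log2, sqrt

def log2_fc(count_a, count_b, prior=0.5):
    """log2 fold change of b over a, with a small prior count to avoid log(0)."""
    return log2((count_b + prior) / (count_a + prior))

def approx_se_log2fc(mean_count):
    """Approximate standard error of log2 FC between two Poisson counts.

    Delta method: var(ln X) is roughly 1/mean for a Poisson count, so the
    difference of two independent log counts has variance about 2/mean.
    Dividing by ln(2) converts from natural log to log2 units.
    """
    return sqrt(2.0 / mean_count) / log(2)

# Compare a gene with about 5 reads per sample to one with about 5000.
for mean_count in (5, 50, 500, 5000):
    print(mean_count, round(approx_se_log2fc(mean_count), 3))

# A change from 1 read to 4 reads is a 4-fold change on its own.
print(log2_fc(1, 4, prior=0))
```

Under these assumptions a gene with a handful of reads can show a two-fold
change by chance alone, whereas the same logFC at thousands of reads is very
unlikely to be noise.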
In the end though, it might be best for you to stick to a simple
statistical approach that you can understand. This is better than trying
out lots of different methods without fully understanding how they differ.
Do you simply want to test for enrichment of publicly available GO and
KEGG terms in your list of top DE genes? This is the oldest and simplest
type of pathway analysis you can do, and it is still very popular. In
that case, goseq is an obvious choice -- it does exactly this, but also
takes into account some of the special features of RNA-seq data. Any
other GO overlap method (like AmiGO, DAVID, GOstat or GOstats) will give
similar results, but without adjusting for gene length bias. Beware
however that commercial products like GeneGO and IPA use their own
proprietary databases, rather than the usual public GO databases, and hence
will give different results.
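
If you want to see whether gene length bias is present in your own DE list, a
simple diagnostic is to bin the genes by length and compare the proportion
called DE in each bin. The sketch below does only that. It is not the goseq
algorithm, which fits a smooth weighting function of length and adjusts the
enrichment test accordingly. The function name and the toy data are made up
for illustration.

```python
def de_fraction_by_length_bin(lengths, is_de, n_bins=4):
    """Fraction of DE genes in each of n_bins equal-sized gene-length bins.

    lengths : gene lengths, one per gene
    is_de   : 1 if the gene is in the DE list, else 0, in the same order
    Assumes there are at least n_bins genes, so that no bin is empty.
    """
    # Indices of the genes ordered from shortest to longest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    n = len(order)
    fractions = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        fractions.append(sum(is_de[i] for i in members) / len(members))
    return fractions

# Toy data (hypothetical): longer genes are called DE more often.
lengths = [500, 800, 1200, 1500, 2500, 3000, 5000, 8000]
is_de = [0, 0, 0, 1, 0, 1, 1, 1]
print(de_fraction_by_length_bin(lengths, is_de, n_bins=2))
```

A DE fraction that rises steadily with length suggests that GO terms made up
of long genes will look enriched under an unadjusted overlap test.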
Best wishes
Gordon
> Date: Fri, 12 Sep 2014 17:54:07 +0000
> From: Merienne Nicolas <Nicolas.Merienne at chuv.ch>
> To: Bioconductor <bioconductor at r-project.org>
> Subject: [BioC] Packages for GO and KEGG analysis on RNAseq data
>
> Dear all,
>
> We are two biologists (so very new to the bioinformatics field...) working
> with RNAseq data and having some "trouble" with pathway analysis. We
> performed mRNA sequencing on 4 distinct cell populations to compare
> their transcriptional profile (platform Illumina HiSeq 2000). Raw reads
> were mapped using TopHat and differential analysis was performed with
> edgeR+voom+limma packages. Our final output is a table (.txt file) for
> each contrast containing our 16058 expressed genes with respective log
> fold change, expression values (normalized) and adjusted p-values. We
> wish to perform pathway enrichment analysis (first GO for a global level
> and KEGG for a more precise analysis) to determine which pathways are
> enriched/depleted in specific cell population compared to the others in
> order to infer cell-type specific functional signatures. However, we
> have difficulty finding an optimal method to do this. We tried several
> packages (e.g. gage, goseq) and web-based software (e.g. GeneGO, AmiGO)
> and found different outputs (sometimes opposite results). What would be
> the most "validated" method/package for these analyses? In addition, we
> found differences considering the input data (raw reads, log FC, a list
> of differentially expressed genes?) and finally we don't understand what
> should be the input data for the analysis (we think that it depends on
> the package/method used?). So, does any of you have experience with
> GO/KEGG for RNAseq and could maybe help us to use a good quantification
> method please?
>
> Thank you very much for your help.
>
> Best,
>
> Nicolas