[BioC] topGO analysis returns different gene counts for category than given for non-model organism

Fri Aug 30 18:18:01 CEST 2013

Hi Diana,

I'm not an expert at topGO but maybe I can help track this down.

It sounds like for the same input, GenTable is returning many more genes 
than annFUN.gene2GO.

In the annFUN.gene2GO function, several 'filters' are applied. Genes 
with no annotation are dropped, and only GO terms that belong to the 
chosen ontology are kept (subsetting on 'goodGO' below). I've taken the 
annFUN.gene2GO code from the package so we can see the comments.

annFUN.gene2GO <- function(whichOnto, feasibleGenes = NULL, gene2GO) {

   ## GO terms annotated to the specified ontology
   ontoGO <- get(paste("GO", whichOnto, "Term", sep = ""))

   ## Restrict the mappings to the feasibleGenes set
   if(!is.null(feasibleGenes))
     gene2GO <- gene2GO[intersect(names(gene2GO), feasibleGenes)]

   ## Throw-up the genes which have no annotation
   if(any(is.na(gene2GO)))
     gene2GO <- gene2GO[!is.na(gene2GO)]

   gene2GO <- gene2GO[sapply(gene2GO, length) > 0]

   ## Get all the GO and keep a one-to-one mapping with the genes
   allGO <- unlist(gene2GO, use.names = FALSE)
   geneID <- rep(names(gene2GO), sapply(gene2GO, length))

   goodGO <- allGO %in% ls(ontoGO)
   return(split(geneID[goodGO], allGO[goodGO]))
}
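
To make the filtering steps concrete, here is a minimal base R sketch of 
the same logic on toy data. It is only an illustration under stated 
assumptions, not the package code: the function name invertGene2GO, the 
validGO argument (standing in for the lookup in the GO<onto>Term 
environment), and all gene names are made up for this example.

```r
## Simplified stand-in for annFUN.gene2GO. 'validGO' is a hypothetical
## argument replacing the GO.db environment lookup, so this runs in base R.
invertGene2GO <- function(gene2GO, validGO, feasibleGenes = NULL) {

  ## Restrict the mappings to the feasibleGenes set, if one is given
  if (!is.null(feasibleGenes))
    gene2GO <- gene2GO[intersect(names(gene2GO), feasibleGenes)]

  ## Drop genes with no annotation (zero-length or all-NA entries)
  unannotated <- vapply(gene2GO,
                        function(x) length(x) == 0L || all(is.na(x)),
                        logical(1))
  gene2GO <- gene2GO[!unannotated]

  ## Flatten to parallel vectors of GO IDs and gene IDs
  allGO  <- unlist(gene2GO, use.names = FALSE)
  geneID <- rep(names(gene2GO), lengths(gene2GO))

  ## Keep only GO IDs that are in the chosen set, then group genes by GO ID
  goodGO <- allGO %in% validGO
  split(geneID[goodGO], allGO[goodGO])
}

## Toy input: gene names and the set of valid terms are invented
toyGene2GO <- list(
  geneA = c("GO:0006970", "GO:0003674"),
  geneB = "GO:0006970",
  geneC = character(0),
  geneD = NA_character_
)
toyValid <- c("GO:0006970", "GO:0008150")

go2genes     <- invertGene2GO(toyGene2GO, toyValid)
directCounts <- lengths(go2genes)  # genes directly annotated to each term
```

Running the same kind of count on your own geneID2GO list gives the 
number of genes directly mapped to each term, which is the figure to 
compare against what GenTable reports for that term.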

The GenTable code has no such filtering. The function is long so I 
won't post the full code here. You can see it with:

 > selectMethod("GenTable", "topGOdata")
Method Definition:

function (object, ...)
{
     .local <- function (object, ..., orderBy = 1, ranksOf = 2,
         topNodes = 10, numChar = 40, format.FUN = format.pval,
         decreasing = FALSE, useLevels = FALSE)
     {
...
...
}

If this is not helpful, please provide a reproducible example and I will 
look into it further.

Valerie

On 08/29/2013 12:27 PM, Diana Toups Dugas wrote:
> Hi-
> I'm trying to analyze some Sorghum NGS transcriptome data using topGO.  The
> analysis runs using my custom geneID2GO object and the annFUN.gene2GO
> function.  When I run GenTable, however, the results are strange.  For
> example, for the GO:0006970 category, the geneID2GO mappings I used list
> only 48 genes annotated to this category, but the GenTable function claims
> that 335 genes are annotated and 270 of them are significant.  If I dig
> deeper and see what those 335 genes are they do indeed appear to be sorghum
> gene IDs; they are all unique and contain my original 48.
>
> I assume that the topGO package is pulling GO-to-gene mappings from
> somewhere to find these additional genes, but I cannot find documentation
> for it anywhere.  Does anyone know why/how topGO is locating more genes
> than I give per GO category and where they are being pulled from?
>   If my GO annotations are not complete I would like to know, as I thought
> they were.
>
> Thanks for the help!
> Sincerely,
> -Diana
> Former USDA researcher
> Current USDA consultant
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>