[BioC] GOStats: can't find probesets, which are in overrepresented GO term

Fri Oct 1 16:59:00 CEST 2010

Hi Jim,

Thanks for your answers. This makes everything clear now. And best of all, my GO enrichment is not wrong, which was my first fear.

Kind regards,

Mike

-----Ursprüngliche Nachricht-----
Von: "James W. MacDonald" <jmacdon at med.umich.edu>
Gesendet: 01.10.2010 15:43:14
An: "Mike Walter" <michael_walter at email.de>
Betreff: Re: [BioC] GOStats: can't find probesets,	which are in overrepresented GO term

>Hi Mike,
>
>On 10/1/2010 8:07 AM, Mike Walter wrote:
>> Dear BioC List,
>>
>> I have difficulties in understanding and retracing the results from GO overrepresentation analysis.
>> I used GOTerms to see if there are any GO Terms overrepresented in a list of DE genes with following code:
>>
>>
>> #define gene universe, select only probes with GO annotation and
>> #transform probesetIDs in unique Entrez Ids
>>>
>>> input<- featureNames(d.norm.filt)
>>> univ<- mget(input, mouse4302GO, ifnotfound=NA)
>>> univ<- input[!is.na(univ)]
>>> univ<- mget(univ, mouse4302ENTREZID, ifnotfound=NA)
>>> univ<- unique(as.character(univ[!is.na(univ)]))
>>
>> #do the same for my list of genes (DE)
>>>
>>> ID<- mget(DE, mouse4302GO, ifnotfound=NA)
>>> ID<- DE[!is.na(ID)]
>>> ID<- mget(as.character(ID), mouse4302ENTREZID, ifnotfound=NA)
>>> ID<- unique(as.character(ID[!is.na(ID)]))
>>
>> #run HyperGTest
>>>
>>> BP = new("GOHyperGParams",
>> +    geneIds=ID, universeGeneIds=univ,
>> +    annotation="mouse4302", ontology="BP",
>> +    pvalueCutoff=0.01, conditional=TRUE,
>> +    testDirection="over")
>>> hGt.BP = hyperGTest(BP)
>>> head(summary(hGt.BP))
>>
>>   GOBPID       Pvalue OddsRatio     ExpCount Count Size
>> 1 GO:0006944 0.0001406440  146.7895 0.0180309101     2   21
>> 2 GO:0014049 0.0008586148       Inf 0.0008586148     1    1
>> 3 GO:0032303 0.0008586148       Inf 0.0008586148     1    1
>> 4 GO:0032308 0.0008586148       Inf 0.0008586148     1    1
>> 5 GO:0034405 0.0008586148       Inf 0.0008586148     1    1
>> 6 GO:0043132 0.0008586148       Inf 0.0008586148     1    1
>>   Term
>> 1                       cellular membrane fusion
>> 2     positive regulation of glutamate secretion
>> 3              regulation of icosanoid secretion
>> 4 positive regulation of prostaglandin secretion
>> 5                 response to fluid shear stress
>> 6                                  NAD transport
>>
>> When I now want to know, which probesets of my DE list are in the first GO, I tried following chunk.
>>
>>
>>> go = as.matrix(unlist(mget("GO:0006944", mouse4302GO2PROBE, ifnotfound=NA)))
>>> DEinGO = DE[is.element(DE,go[,1])]
>>> DEinGO
>> character(0)
>
>The problem here is you are using the GO2PROBE mapping, whereas you 
>should be using the GO2ALLPROBES mapping. The difference being that you 
>are looking only at probesets that map to a particular node in the GO 
>DAG, whereas the test you are using considers all probesets that map to 
>that node and all descendant nodes (child, grandchild, etc).
>
>This is because a child node by definition is also a member of its 
>parent node. An example using a fake vocabulary:
>
>Say you have a probeset that maps directly to 'MAP kinases'. This 
>probeset also maps indirectly to an ancestor term, 'Protein 
>Phosphorylation', as that is what a MAP kinase does. So you might have 5 
>probesets that map to 'MAP kinases', and 3 probesets that map to 'MAP 
>kinase kinases', but neither GO term is significant. However, 'Protein 
>phosphorylation' might be (due to these 8 probesets), but there are no 
>probesets in your significant set that map directly to protein 
>phosphorylation.
>
>Anyway, you are doing a lot of work here that isn't really necessary 
>unless you really want a detailed understanding of what is going on. 
>This isn't a bad idea IMO, and I applaud anybody who wants to know more 
>about what they are doing than how to press the buttons, so if that's 
>the case, then bravo!
>
>If you just want to press buttons, however, the correct button to press 
>would be probeSetSummary(), which will output all the probesets that are 
>contributing to each significant GO term. Perusing the function will 
>also help if you are trying to get a better understanding.
>
>Best,
>
>Jim
>
>
>>
>>
>> However, this results in an empty vector. I then picked all probesets, which fall into this category,
>> which results in only 5 probesets/2 genes.
>>
>>> revgo = mouse4302GO2PROBE
>>> revgo = as.list(revgo[mappedkeys(revgo)])
>>> revgo = lapply(revgo,as.vector)
>>> probesInGO = as.vector(unlist(revgo["GO:0006944"))
>>> mget(probesInGO, mouse4302ENTREZID)
>> $`1417349_at`
>> [1] "18457"
>>
>> $`1417350_at`
>> [1] "18457"
>>
>> $`1457088_at`
>> [1] "18457"
>>
>> $`1420833_at`
>> [1] "22318"
>>
>> $`1420834_at`
>> [1] "22318"
>>
>> None of the 5 probesets are in my DE list. So I have basically 2 questions:
>>
>> 1) Why is this GO overrepresented in my list? Or, where is the large error in my script, that results in this mistake?
>>
>> 2) Why is the size of the GO category in the summary 21, when there are only 5 probesets with this GO?
>>
>> Thank you very much for any ideas,
>>
>> Mike
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>-- 
>James W. MacDonald, M.S.
>Biostatistician
>Douglas Lab
>University of Michigan
>Department of Human Genetics
>5912 Buhl
>1241 E. Catherine St.
>Ann Arbor MI 48109-5618
>734-615-7826
>**********************************************************
>Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues 
>