[BioC] Retrieve Entrez IDs for enriched GO terms

Fri Sep 25 01:15:28 CEST 2009

Hi,

Thank you all very much for your replies on this topic. After further
investigation of my data, I found the following.

I downloaded gene2go.gz from NCBI. The Entrez Gene IDs it gives for the
GO term agree with those returned by biomaRt (i.e. by querying Ensembl).
However, most of these Entrez Gene IDs are not in my list of
differentially expressed (DE) genes.
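
For reference, a query of this kind might look like the sketch below. It assumes the usual eight-column, tab-separated gene2go layout. The rows are made-up placeholders in that layout (not real annotations), and `de_ids` is a hypothetical stand-in for a DE gene list. With the real data, the downloaded file would be read in place of the inline text.

```r
# Column names of the gene2go layout (assumed: the standard eight columns).
gene2go_cols <- c("tax_id", "GeneID", "GO_ID", "Evidence",
                  "Qualifier", "GO_term", "PubMed", "Category")

# Placeholder rows only: gene IDs and the second GO ID are dummies,
# and the evidence/qualifier/term/PubMed fields are left as "-".
gene2go_text <- paste(
  paste0("#", paste(gene2go_cols, collapse = "\t")),
  "9606\t1001\tGO:0007498\t-\t-\t-\t-\tProcess",
  "9606\t1002\tGO:0007498\t-\t-\t-\t-\tProcess",
  "9606\t1002\tGO:0007498\t-\t-\t-\t-\tProcess",
  "9606\t1003\tGO:0000001\t-\t-\t-\t-\tProcess",
  "10090\t2001\tGO:0007498\t-\t-\t-\t-\tProcess",
  sep = "\n"
)

# Skip the "#"-prefixed header line and supply the column names directly.
gene2go <- read.delim(text = gene2go_text, header = FALSE, skip = 1,
                      col.names = gene2go_cols, colClasses = "character",
                      quote = "")

# Human (tax_id 9606) genes annotated to the GO term of interest.
hits <- gene2go[gene2go$tax_id == "9606" & gene2go$GO_ID == "GO:0007498", ]
go_ids <- unique(hits$GeneID)

# Hypothetical DE gene list; check which database hits are in it.
de_ids <- c("1002", "1003")
in_de <- go_ids %in% de_ids
table(in_de)
```

The point of the last step is that a database query returns every gene annotated to the term, so only a subset is expected to be in any particular DE list.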

In contrast, all of the Entrez Gene IDs retrieved with probeSetSummary()
or geneIdsByCategory() are in my DE gene list.

So I conclude that probeSetSummary()/geneIdsByCategory() take into
account the genes selected for the hypergeometric test when retrieving
Entrez Gene IDs, whereas biomaRt does not use this information and simply
queries the database. In the context of the hypergeometric test here, I
would say the former methods are more appropriate.
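
The two ID vectors posted at the start of this thread for GO:0007498 make the difference concrete. The sketch below uses only base R set operations on those posted values. The helper name `restrict_to_selected` is made up for illustration and is not a GOstats function; as far as I understand the Category package documentation, geneIdsByCategory() does something similar by intersecting each category's gene IDs with the selected gene list.

```r
# Entrez IDs for GO:0007498 as posted earlier in this thread.
gostats_ids <- c("2131", "2139", "2296", "3717", "4088", "4771", "6398",
                 "655", "695", "8013", "8320", "83439", "9314")
biomart_ids <- c(5515, NA, 90, 6398, 2131, 3717, 660, 4145, 84667, 3055,
                 6911, 10320, 10220, 22806, 695, 5017, 23184, 9355, 2303,
                 7075, 4232, 92, 6943, 6862)

# Drop the NA and compare as character strings.
biomart_ids <- as.character(biomart_ids[!is.na(biomart_ids)])

shared       <- intersect(gostats_ids, biomart_ids)  # 4 IDs in both lists
only_gostats <- setdiff(gostats_ids, biomart_ids)    # 9 IDs
only_biomart <- setdiff(biomart_ids, gostats_ids)    # 19 IDs

# Hypothetical helper: keep only the category members that were selected.
restrict_to_selected <- function(category_ids, selected_ids) {
  intersect(category_ids, selected_ids)
}

# The GOstats IDs are all in the DE list, so they serve here as a
# partial stand-in for the selected genes.
restrict_to_selected(biomart_ids, gostats_ids)
```

Applying the restriction to the biomaRt result keeps only the four shared IDs, which is consistent with most of the database hits not being in the DE list.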

Kind regards,
Yuan

On 24 Sep 2009, at 19:25, Marc Carlson wrote:

> Hi everyone,
>
> Even just downloading the file from NCBI might result in some
> disagreements.  GO changes all the time, and the file from NCBI is
> downloaded and parsed into the GO annotations for GO.db and the organism
> packages once every 6 months.  So particularly right now when the
> packages are just about to be updated again, you might find some
> disagreements with the sources that are presently at NCBI or biomaRt.
> With the GO.db and organism packages we are attempting to balance
> keeping things "current" with keeping things "reproducible".  So we
> update everything every 6 months, but then we "freeze" those
> annotation packages with a release number so that if in the future you
> need to reproduce a result, you can come back and grab that particular
> release again along with the software that goes with it.
>
> In a few weeks there will be an entirely new set of annotation packages
> when we make the new release.   Hope this helps.
>
>
>  Marc
>
>
>
> michael watson (IAH-C) wrote:
>> Hi Sean
>>
>> I realise the GO packages are built using the NCBI data, but when there
>> is a disagreement between derived data packages, it is good practice to
>> go to the source database.
>>
>> Cheers
>> Mick
>> ________________________________________
>> From: Sean Davis [seandavi at gmail.com]
>> Sent: 24 September 2009 12:08
>> To: michael watson (IAH-C)
>> Cc: Yuan Hao; Heidi Dvinge; bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Retrieve Entrez IDs for enriched GO terms
>>
>> On Thu, Sep 24, 2009 at 6:51 AM, michael watson (IAH-C)
>> <michael.watson at bbsrc.ac.uk> wrote:
>>
>>> These databases are built using different methods, so you get
>>> different results. A common problem in bioinformatics! Solution: go to
>>> the NCBI FTP site for the Entrez Gene database and download the
>>> gene2go.gz file. Unzip and query.
>>> ________________________________________
>>>
>>
>> As usual, there are many ways to solve the problem and downloading
>> tab-delimited text files is one.  However, the GO package in
>> bioconductor is built using the NCBI data (actually, the file noted
>> above), so there really isn't a need to download files.  The data are
>> already present in the GO.db package and in all chip annotation
>> packages built by Bioconductor.
>>
>> Sean
>>
>>
>>
>>> From: bioconductor-bounces at stat.math.ethz.ch [bioconductor-bounces at stat.math.ethz.ch]
>>> On Behalf Of Yuan Hao [yuan.hao at ucd.ie]
>>> Sent: 24 September 2009 11:46
>>> To: Heidi Dvinge
>>> Cc: bioconductor at stat.math.ethz.ch
>>> Subject: Re: [BioC] Retrieve Entrez IDs for enriched GO terms
>>>
>>> Hi Heidi,
>>>
>>> Thank you very much for your reply. The method you provided gives the
>>> same result as probeSetSummary(), but I don't understand why this
>>> result differs from the one obtained with biomaRt. Do you have any
>>> insight into this?
>>>
>>> Kind regards,
>>> Yuan
>>>
>>> On 24 Sep 2009, at 11:38, Heidi Dvinge wrote:
>>>
>>>
>>>> Hello Yuan,
>>>>
>>>> have you tried using the accessor functions for your test object
>>>> directly? For example:
>>>>
>>>>
>>>> > geneIdsByCategory(hyp, catids="GO:0007498")
>>>>
>>>> Does this give you what you want?
>>>>
>>>> Cheers
>>>> \Heidi
>>>>
>>>>
>>>> On 24 Sep 2009, at 11:31, Yuan Hao wrote:
>>>>
>>>>
>>>>> Dear list,
>>>>>
>>>>> I have spent a long time trying to figure out this problem, but without
>>>>> progress. I would appreciate it very much if you could give me some
>>>>> help.
>>>>>
>>>>> I obtained a list of differentially expressed genes from a microarray
>>>>> analysis using limma, and then ran a GO enrichment analysis on these
>>>>> genes with hyperGTest() from the GOstats package. Now I want to
>>>>> retrieve the Entrez Gene IDs in my gene list that correspond to each
>>>>> enriched GO term. I found two ways to get them: probeSetSummary() from
>>>>> GOstats, or getBM() from biomaRt. Both methods worked, but for a single
>>>>> GO term they returned two different lists of Entrez Gene IDs (lengths
>>>>> 13 vs 24), and most of the IDs do not overlap. I am not very familiar
>>>>> with annotation and/or genome assemblies, so I am not sure whether the
>>>>> difference arises because the two methods use different
>>>>> annotations/assemblies.
>>>>>
>>>>> # get geneIds for hyperGTest
>>>>>
>>>>> > topA <- topTable(fit2, coef=1, p.value=0.01, n=nrow(fit2))
>>>>> > prbs <- topA[,1]
>>>>> > # keep probes with a GO annotation (unmapped probes give a single NA)
>>>>> > hasGO <- sapply(mget(prbs, hgu133plus2GO), function(ids)
>>>>> +   !(length(ids) == 1 && is.na(ids)))
>>>>> > prbs <- prbs[hasGO]
>>>>> > prbs <- getEG(prbs, "hgu133plus2")
>>>>> > prbs <- prbs[!duplicated(prbs)]
>>>>>
>>>>> # get universeGeneIds for hyperGTest
>>>>>
>>>>> > univ <- featureNames(eset)
>>>>> > # same GO filter as above, applied to every probe on the array
>>>>> > hasUnivGO <- sapply(mget(univ, hgu133plus2GO), function(ids)
>>>>> +   !(length(ids) == 1 && is.na(ids)))
>>>>> > univ <- univ[hasUnivGO]
>>>>> > univ <- unique(getEG(univ, "hgu133plus2"))
>>>>>
>>>>> # compose params and carry out hyperGTest
>>>>>
>>>>> > p <- new("GOHyperGParams", geneIds=prbs, universeGeneIds=univ,
>>>>> +   ontology="BP", annotation="hgu133plus2", conditional=TRUE)
>>>>> > if (interactive()) {
>>>>> +   hyp <- hyperGTest(p)
>>>>> +   ps <- probeSetSummary(hyp)
>>>>> + }
>>>>>
>>>>> # retrieve entrez IDs for one enriched GO term GO:0007498
>>>>>
>>>>> > unique(ps$"GO:0007498"$EntrezID)
>>>>>  [1] "2131"  "2139"  "2296"  "3717"  "4088"  "4771"  "6398"  "655"   "695"
>>>>> [10] "8013"  "8320"  "83439" "9314"
>>>>>
>>>>> # using biomaRt package
>>>>>
>>>>> > ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>>>> > summary <- summary(hyp)
>>>>> > goID <- summary$GOBPID
>>>>> > E <- getBM(attributes=c("go_biological_process_id", "entrezgene"),
>>>>> +   filters="go", values=goID, mart=ensembl)
>>>>> > # keep only the rows for the GO term of interest
>>>>> > EE <- E[E$go_biological_process_id %in% "GO:0007498", ]
>>>>>
>>>>> # retrieve entrez IDs for the same GO term, GO:0007498
>>>>>
>>>>> > unique(EE$entrezgene)
>>>>>  [1]  5515    NA    90  6398  2131  3717   660  4145 84667  3055  6911 10320
>>>>> [13] 10220 22806   695  5017 23184  9355  2303  7075  4232    92  6943  6862
>>>>>
>>>>> Thank you very much in advance!
>>>>>
>>>>> Kind regards,
>>>>> Yuan
>>>>>
>>>>>     [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>

--------------------------------
Yuan Hao
PhD student
Conway Institute
University College Dublin
Belfield, Dublin 4, Ireland
E-mail: yuan.hao at ucd.ie