[BioC] Retrieve Entrez IDs for enriched GO terms

Thu Sep 24 20:25:40 CEST 2009

Hi everyone,

Even just downloading the file from NCBI might result in some
disagreements.  GO changes all the time, and the file from NCBI is
downloaded and parsed into the GO annotations for GO.db and the organism
packages once every 6 months.  So particularly right now when the
packages are just about to be updated again, you might find some
disagreements with the sources that are presently at NCBI or biomaRt. 
With the GO.db and organism packages we are attempting to balance
keeping things "current" with keeping things "reproducible".  So we
update things everything every 6 months, but then we "freeze" those
annotation packages with a release number so that if in the future you
need to reproduce a result, you can come back and grab that particular
release again along with the software that goes with it. 

In a few weeks there will be an entirely new set of annotation packages
when we make the new release.   Hope this helps.

  Marc

michael watson (IAH-C) wrote:
> Hi Sean
>
> I realise the GO packages are built using the NCBI data, but when there is a disagreement between derived data packages, it is good practice to go to the source database.
>
> Cheers
> Mick
> ________________________________________
> From: Sean Davis [seandavi at gmail.com]
> Sent: 24 September 2009 12:08
> To: michael watson (IAH-C)
> Cc: Yuan Hao; Heidi Dvinge; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Retrieve Entrez IDs for enriched GO terms
>
> On Thu, Sep 24, 2009 at 6:51 AM, michael watson (IAH-C)
> <michael.watson at bbsrc.ac.uk> wrote:
>   
>> these databases are built using different methods so you get different results. A common problem in bioinformatics! Solution. Go to the ncbi ftp site, for the entrez gene database and download the gene2go.gz file. Unzip and query.
>> ________________________________________
>>     
>
> As usual, there are many ways to solve the problem and downloading
> tab-delimited text files is one.  However, the GO package in
> bioconductor is built using the NCBI data (actually, the file noted
> above), so there really isn't a need to download files.  The data are
> already present in the GO.db package and in all chip annotation
> packages built by Bioconductor.
>
> Sean
>
>
>   
>> From: bioconductor-bounces at stat.math.ethz.ch [bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Yuan Hao [yuan.hao at ucd.ie]
>> Sent: 24 September 2009 11:46
>> To: Heidi Dvinge
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Retrieve Entrez IDs for enriched GO terms
>>
>> Hi Heidi,
>>
>> Thank you very much for your reply. The method you provided gives the
>> same result as using probeSetSummary, but I don't understand why this
>> result is different from the one got from biomaRt? Do you have some
>> insight about it?
>>
>> Kind regards,
>> Yuan
>>
>> On 24 Sep 2009, at 11:38, Heidi Dvinge wrote:
>>
>>     
>>> Hello Yuan,
>>>
>>> have you tried using the accessor functions for your test object
>>> directly? For example:
>>>
>>>       
>>>> geneIdsByCategory(hyp, catids="GO:0007498")
>>>>         
>>> Does this give you what you want?
>>>
>>> Cheers
>>> \Heidi
>>>
>>>
>>> On 24 Sep 2009, at 11:31, Yuan Hao wrote:
>>>
>>>       
>>>> Dear list,
>>>>
>>>> I spent a long time trying to figure out this problem, but without
>>>> progress. I would appreciate it very much if you could give me some
>>>> help.
>>>>
>>>> I got a list of differential expressed genes from microarray analysis
>>>> by using limma. Then I did GO enrichment analysis on these genes by
>>>> hypeGTest() method available in GOstats package. Now I want to
>>>> retrieve entrez gene IDs in my gene list that correspond to each
>>>> enriched GO terms. I found there are two ways to get the entrez gene
>>>> IDs: using probeSetSummary() from GOstats, or using getBM() from
>>>> biomaRt. I tried both method, and they all worked, but I got two
>>>> different lists (lengths 13 vs 24) of entrez gene IDs corresponding
>>>> to
>>>> a single GO term, and most of them are not overlapped. I am not very
>>>> familiar with the annotation and/or genome assembly, so I am not sure
>>>> whether it is because the two methods using different annotation/
>>>> assembly that caused this problem.
>>>>
>>>> # get geneIds for hyperGTest
>>>>
>>>>         
>>>>> topA<-topTable(fit2,coef=1,p.value=0.01,n=nrow(fit2))
>>>>>           
>>>>> prbs<-topA[,1]
>>>>>           
>>>>> hasGO<-sapply(mget(prbs,hgu133plus2GO),function(ids)
>>>>>           
>>>> + if(!is.na(ids) && length(ids) > 1) TRUE else FALSE)
>>>>
>>>>         
>>>>> prbs<-prbs[hasGO]
>>>>>           
>>>>> prbs<-getEG(prbs,"hgu133plus2")
>>>>>           
>>>>> prbs<-prbs[!duplicated(prbs)]
>>>>>           
>>>> # get universeGeneIds for hyperGTest
>>>>
>>>>         
>>>>> univ<-featureNames(eset)
>>>>>           
>>>>> hasUnivGO<-sapply(mget(univ,hgu133plus2GO),function(ids)
>>>>>           
>>>> + if (!is.na(ids) && length(ids) > 1) TRUE else FALSE)
>>>>
>>>>         
>>>>> univ<-univ[hasUnivGO]
>>>>>           
>>>>> univ<-unique(getEG(univ,"hgu133plus2"))
>>>>>           
>>>> # compose params and carry out hyperGTest
>>>>
>>>>         
>>>>> p<-new("GOHyperGParams", geneIds=prbs, universeGeneIds=univ,
>>>>>           
>>>> ontology="BP", annotation="hgu133plus2", conditional=TRUE)
>>>>
>>>>         
>>>>> if(interactive()){
>>>>>           
>>>> + hyp<-hyperGTest(p)
>>>>
>>>> + ps<-probeSetSummary(hyp)
>>>>
>>>> }
>>>>
>>>> # retrieve entrez IDs for one enriched GO term GO:0007498
>>>>
>>>>         
>>>>> unique(ps$"GO:0007498"$EntrezID)
>>>>>           
>>>>   [1] "2131"  "2139"  "2296"  "3717"  "4088"  "4771"  "6398"  "655"
>>>> "695"
>>>>
>>>> [10] "8013"  "8320"  "83439" "9314"
>>>>
>>>>
>>>>
>>>> # using biomaRt package
>>>>
>>>>         
>>>>> ensembl=useMart("ensembl",dataset="hsapiens_gene_ensembl")
>>>>>           
>>>>> summary <- summary(hyp)
>>>>>           
>>>>> goID<-summary$GOBPID
>>>>>           
>>>>> E <- getBM(attributes=c("go_biological_process_id", "entrezgene"),
>>>>>           
>>>> filters="go", values=goID, mart=ensembl)
>>>>
>>>>         
>>>>> oneGO<-sapply(E$"go_biological_process_id",function(i)
>>>>>           
>>>> + if (i=="GO:0007498") TRUE else FALSE)
>>>>
>>>>         
>>>>> EE<-E[oneGO,]
>>>>>           
>>>> # retrieve entrez IDs for the same GO term, GO:0007498
>>>>
>>>>         
>>>>> unique(EE$entrezgene)
>>>>>           
>>>>   [1]  5515    NA    90  6398  2131  3717   660  4145 84667  3055
>>>> 6911 10320
>>>>
>>>> [13] 10220 22806   695  5017 23184  9355  2303  7075  4232    92
>>>> 6943  6862
>>>>
>>>>
>>>> Thank you very much in advance!
>>>>
>>>> Kind regards,
>>>> Yuan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>      [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>         
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>     
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>