[BioC] GOstats question

Wed Mar 30 15:58:51 CEST 2005

Hi Sean,

The hgu133aACCNUM help page of the hgu133a meta-data package states: "For all the Affymetrix chips, the manufacturer/user provided ids are GenBank accession numbers." So the starting material for the pipeline here is the GenBank accession number. With this starting material it seems possible to reduce the level of ambiguity.

As an example, take the Affymetrix ids 207039_at and 211156_at (GenBank accessions NM_000077 and AF115544, respectively).  Both correspond to LocusLink number 1029, which covers 3 transcripts encoding 3 proteins (p12, p14 and p16).  GOA attributes the same GO id 0016301 (kinase activity) to both p12 (NP_478104) and p14 (NP_478102), while attributing 8 GO ids to p16 (NP_000068), none of which is 0016301.  Entrez Gene lists AF115544 as the source sequence for NM_058197 (NP_478104), and NM_000077 corresponds to the variant NP_000068.  The mapping by Dr. Gentleman et al. yields the same 2 GO terms for both probe sets (see example below).  The locuslink (GeneID) # 1029 should yield 

Of course, using the actual target sequence (which is given by Affymetrix) as the starting material would do more to resolve variants, would permit proper flagging of problem probe sets (see Mecham et al. Physiol. Genomics 2004 and Harbig et al. NAR 2005), and would ultimately map probe sets to GOA.  But as you indicated, maybe Dr. Gentleman (or maybe Chenwei Lin) could shed some light on why it is better to go from the probe set/accession number provided by Affymetrix to LocusLink to GO id when studying the potential enrichment of GO ids in an Affymetrix microarray experiment.

###### EXAMPLE QUERY ####################
> affyGO = eapply(hgu133aGO, getOntology)
> affyGO$"211156_at"
[1] "GO:0004861" "GO:0016301"
> affyGO$"207039_at"
[1] "GO:0004861" "GO:0016301"
>

Here we see that both probe sets get kinase activity (GO:0016301) and cyclin-dependent protein kinase inhibitor activity (GO:0004861), and not, for example, cell cycle arrest (GO:0007050) or cell cycle checkpoint (GO:0000075), two of the TAS GO ids among the 8 GO ids attributed by GOA to NP_000068.

A sampling from EBI_GOA_assoc_xrefs for LL 1029:

Supp RefSeq NP	LocusLink_GeneSymbol	GOid	DB:reference	evidence
		1029_CDKN2A;	GO:0007049	PMID:7606716	NAS
		1029_CDKN2A;	GO:0008372	UniProt:Q16360	ND
NP_478102;	1029_CDKN2A;	GO:0016301	GOA:spkw	IEA
NP_478104;	1029_CDKN2A;	GO:0016301	GOA:spkw	IEA
NP_000068;	1029_CDKN2A;	GO:0007049	GOA:spkw	IEA
NP_000068;	1029_CDKN2A;	GO:0000075	PMID:7972006	TAS
NP_000068;	1029_CDKN2A;	GO:0045786	GOA:spkw	IEA
NP_000068;	1029_CDKN2A;	GO:0004861	PMID:7972006	TAS
NP_000068;	1029_CDKN2A;	GO:0007050	PMID:7972006	TAS
NP_000068;	1029_CDKN2A;	GO:0005634	UniProt:P42771	NR
NP_000068;	1029_CDKN2A;	GO:0000079	PMID:7972006	TAS
NP_000068;	1029_CDKN2A;	GO:0008285	PMID:7972006	TAS
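
The contrast between the two routes can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the rows are retyped by hand from the RefSeq-annotated lines of the sample above, the probe-to-protein pairs are the ones described earlier in this message, and the object names (goa, probe_to_protein, by_protein, by_locus) are made up for the example and are not part of GOstats or any meta-data package.

```r
# RefSeq-annotated rows retyped from the EBI_GOA_assoc_xrefs sample above.
goa <- data.frame(
  refseq = c("NP_478102", "NP_478104", rep("NP_000068", 8)),
  go     = c("GO:0016301", "GO:0016301", "GO:0007049", "GO:0000075",
             "GO:0045786", "GO:0004861", "GO:0007050", "GO:0005634",
             "GO:0000079", "GO:0008285"),
  stringsAsFactors = FALSE
)

# Probe set -> protein, as described in this message (illustrative only).
probe_to_protein <- c("207039_at" = "NP_000068",   # via NM_000077
                      "211156_at" = "NP_478104")   # via AF115544 / NM_058197

# Transcript-level view: one GO id set per protein isoform.
by_protein <- split(goa$go, goa$refseq)
transcript_level <- lapply(probe_to_protein, function(p) by_protein[[p]])

# Locus-level view: all isoforms collapse onto LocusLink 1029,
# so every probe set of the locus would receive the pooled set.
by_locus <- unique(goa$go)

# Kinase activity is annotated to p12 and p14 but not to p16.
"GO:0016301" %in% transcript_level[["207039_at"]]   # FALSE
"GO:0016301" %in% by_locus                          # TRUE
```

With the transcript-level grouping, 211156_at would carry only GO:0016301 and 207039_at would carry the 8 ids of NP_000068, whereas pooling by locus cannot tell the two probe sets apart.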

David

################################

-----Original Message-----
From: Sean Davis [mailto:sdavis2 at mail.nih.gov] 
Sent: Wednesday, March 30, 2005 1:19 PM
To: Rickman David
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] GOstats question

On Mar 30, 2005, at 4:03 AM, Rickman David wrote:

>
>
>  Hello,
>
> A naive question (I am by no means an ace R user) concerning GOstats 
> and splice variants:
>
> why do you rely on LocusLink to map GO terms when GOA also takes 
> splice variants into account via, for example, RefSeq?  Using the 
> GOstats tool to study Affymetrix u133a data, I noticed that the 
> hgu133aACCNUM mapping offers RefSeq mapping, if I understand 
> correctly, although you are limited to the GenBank accession number 
> that Affymetrix assigns to a probe set.
>
> Thanks for any help/comments
>

David,

I'm perhaps not the best person to answer this (Robert Gentleman and 
his team are), but I think the annotation pipeline that is used for the 
bioconductor packages goes through LocusLink (Entrez Gene) in all 
cases.  Since the mapping is through LocusLink, there isn't a way to 
get back to "transcript-level" detail.

Sean