[BioC] GOstats question
RickmanD at ligue-cancer.net
Wed Mar 30 15:58:51 CEST 2005
What is indicated in the hgu133aACCNUM html for the hgu133a meta-data package is: "For all the Affymetrix chips, the manufacturer/user provided ids are GenBank accession numbers." So the starting material for the pipeline here is GenBank acc #. It seems possible that with this starting material one could potentially reduce the level of ambiguity.
As an example, take the Affy ids 207039_at and 211156_at (GenBank ids NM_000077 and AF115544, respectively). Both correspond to LocusLink number 1029, which in turn corresponds to 3 transcripts encoding 3 proteins (p12, p14 and p16). GOA attributes the same GO id, 0016301 (kinase activity), to both p12 (NP_478104) and p14 (NP_478102), while attributing 8 GO ids to p16 (NP_000068), none of which is 0016301. Entrez Gene lists AF115544 as the source sequence for NM_058197 (NP_478104), and NM_000077 corresponds to the variant NP_000068. The mapping by Dr. Gentleman et al. yields the same 2 GO terms for both probe sets (see example below). The LocusLink (GeneID) # 1029 should yield
Of course, using the actual target sequence (which is given by Affy) as the starting material would resolve variants better, and would also permit proper flagging of problem probe sets (see Mecham et al., Physiol. Genomics 2004, and Harbig et al., NAR 2005) and, ultimately, mapping of probe sets to GOA. But as you indicated, maybe Dr. Gentleman (or maybe Chenwei Lin) could shed some light on why it is better to go from the probe set/accession number provided by Affy to LocusLink to GO id when studying the potential enrichment of GO ids in an Affy microarray experiment.
###### EXAMPLE QUERY ####################
> affyGO = eapply(hgu133aGO, getOntology)
> affyGO[["207039_at"]]
[1] "GO:0004861" "GO:0016301"
> affyGO[["211156_at"]]
[1] "GO:0004861" "GO:0016301"
Here we see that both probe sets get kinase activity (GO:0016301) and cyclin-dependent protein kinase inhibitor activity (GO:0004861), but not, for example, cell cycle arrest (GO:0007050) or cell cycle checkpoint (GO:0000075), which are 2 of the TAS GO ids among the 8 GO ids attributed by GOA to NP_000068.
A sampling from EBI_GOA_assoc_xrefs for LL 1029:
Supp RefSeq NP   LocusLink_GeneSymbol   GO id        DB:reference     Evidence
-                1029_CDKN2A            GO:0007049   PMID:7606716     NAS
-                1029_CDKN2A            GO:0008372   UniProt:Q16360   ND
NP_478102        1029_CDKN2A            GO:0016301   GOA:spkw         IEA
NP_478104        1029_CDKN2A            GO:0016301   GOA:spkw         IEA
NP_000068        1029_CDKN2A            GO:0007049   GOA:spkw         IEA
NP_000068        1029_CDKN2A            GO:0000075   PMID:7972006     TAS
NP_000068        1029_CDKN2A            GO:0045786   GOA:spkw         IEA
NP_000068        1029_CDKN2A            GO:0004861   PMID:7972006     TAS
NP_000068        1029_CDKN2A            GO:0007050   PMID:7972006     TAS
NP_000068        1029_CDKN2A            GO:0005634   UniProt:P42771   NR
NP_000068        1029_CDKN2A            GO:0000079   PMID:7972006     TAS
NP_000068        1029_CDKN2A            GO:0008285   PMID:7972006     TAS
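
To make the contrast concrete, here is a minimal base-R sketch built only from the rows in the GOA sampling for LL 1029 that carry a RefSeq protein id. The probe-set-to-protein links (207039_at to NP_000068, 211156_at to NP_478104) follow the Entrez Gene associations described in this message; the object names (goa, probe2np, transcript_level, gene_level) are made up for illustration. It does not reproduce what the hgu133a package actually returns; it only shows how collapsing to the LocusLink id merges the isoform-specific annotations.

```r
# GOA rows for LocusLink 1029 that name a RefSeq protein (copied from the table).
goa <- data.frame(
  refseq = c("NP_478102", "NP_478104", rep("NP_000068", 8)),
  go = c("GO:0016301", "GO:0016301", "GO:0007049", "GO:0000075",
         "GO:0045786", "GO:0004861", "GO:0007050", "GO:0005634",
         "GO:0000079", "GO:0008285"),
  evidence = c("IEA", "IEA", "IEA", "TAS", "IEA",
               "TAS", "TAS", "NR", "TAS", "TAS"),
  stringsAsFactors = FALSE
)

# Assumed probe set -> protein links, taken from the description in this message.
probe2np <- c("207039_at" = "NP_000068", "211156_at" = "NP_478104")

# Transcript-level view: each probe set keeps only its own protein's GO ids.
by_protein <- split(goa$go, goa$refseq)
transcript_level <- lapply(probe2np, function(np) by_protein[[np]])

# Gene-level view: both probe sets map to LocusLink 1029, so each one
# receives the union of GO ids over all proteins of that locus.
gene_level <- lapply(probe2np, function(np) unique(goa$go))

print(transcript_level)
print(gene_level)
```

In the transcript-level list the two probe sets share no GO id, whereas in the gene-level list they become indistinguishable, which is the loss of resolution raised above.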
From: Sean Davis [mailto:sdavis2 at mail.nih.gov]
Sent: Wednesday, March 30, 2005 1:19 PM
To: Rickman David
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] GOstats question
On Mar 30, 2005, at 4:03 AM, Rickman David wrote:
> A naive question (I am by no means an ace R user) concerning GOstats
> and splice variants:
> why do you rely on LocusLink to map GO terms when GOA takes splice
> variants into account as well via, for example, RefSeq? Using the
> GOstats tool to study Affymetrix u133a data, I noticed that the
> hgu133aACCNUM mapping offers RefSeq mapping, if I understand correctly,
> knowing that you are limited to the GenBank accession number attributed
> to a probe set by Affymetrix.
> Thanks for any help/comments
I'm perhaps not the best person to answer this (Robert Gentleman and
his team are), but I think the annotation pipeline that is used for the
bioconductor packages goes through LocusLink (Entrez Gene) in all
cases. Since the mapping is through LocusLink, there isn't a way to
get back to "transcript-level" detail.