[BioC] Changes in annotations?

James W. MacDonald jmacdon at med.umich.edu
Mon Jul 6 15:58:05 CEST 2009


Hi Alex,

This is a question that comes up on the Bioc list fairly regularly, and 
the answer is in two parts:

First, the annotations in the various metadata packages 
supplied by BioC are *not* our annotations, but are simply a 
re-packaging of data we collect from various sources. As an example, we 
use the mappings of Affymetrix Probe ID to Entrez Gene ID from the 
annotation csv files you can download from the Affy website. We then map 
the Entrez Gene IDs to other annotations using primarily NCBI data. So if 
you go to Affy's netaffx site (free registration required) and query on, 
say, 238900_at, you get this:

https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLUS_2%3A238900_AT

And you will note that the first Entrez Gene ID listed there is 
100133484, which happens to be a defunct ID. However, this is the first 
of many listed there (and we need a one-to-one mapping), so we chose 
that one. A more likely Entrez Gene ID can be found further down the 
list, but we simply don't have the resources to figure out if there is a 
better choice in that list (for every reporter on every Affy chip we 
annotate). Nor do we have the resources to ensure that any of the 
mappings that Affy make are reasonable to begin with. We have to trust 
that they (with *way* more resources than us) are doing a reasonable job.
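
To make the 'first of many' point concrete, here is a minimal sketch in 
base R. It is not the code used to build the BioC annotation packages. 
The first_id() helper, the toy table, and the second ID in its first row 
are made up for illustration, and the " /// " separator and "---" 
missing-value marker are assumptions about the layout of the Affy csv 
files.

```r
# Toy stand-in for a few rows of an Affy annotation csv file.
# ASSUMPTION: multiple Entrez Gene IDs are separated by " /// " and a
# missing value is written as "---". The second ID for 238900_at is
# illustrative only, not copied from the real file.
affy <- data.frame(
  probeset = c("238900_at", "223620_at", "236685_at"),
  entrez   = c("100133484 /// 3123", "2857", "---"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the first ID listed for each reporter,
# which forces a one-to-one mapping without judging which ID is best.
first_id <- function(x, sep = " /// ") {
  parts <- strsplit(x, sep, fixed = TRUE)
  ids <- vapply(parts, function(p) p[1], character(1))
  ids[ids %in% "---"] <- NA_character_
  ids
}

affy$entrez_first <- first_id(affy$entrez)
print(affy)
```

With this rule the first row ends up mapped to the defunct 100133484 even 
though another ID follows it in the list, which is the situation 
described above.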

The second part of the answer has to do with the 'moving target' aspect 
of biological annotations. These data change all the time, and there is 
the recurring question of whether one should do an analysis and 'freeze' 
the annotations at that point in time, or update them on a 
regular basis, with the realization that things can and will change.

Without looking at each reporter ID you list, I can't say if the changes 
are due to Affy changing their annotation csv files, or to changing 
knowledge of the genome, but I suspect it is a combination of the two.
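
One way to see which kind of change you are dealing with is to line up 
the two sets of results and label each reporter. The following is a 
minimal sketch in base R, using four of the reporters from the output 
quoted below. compare_ids() is a hypothetical helper, not a function 
from annotate or AnnotationDbi.

```r
# Entrez Gene IDs for four reporters, taken from the two runs quoted below
# (old = R 2.4/BioC 1.9, new = R 2.9/BioC 2.4).
old <- c("238900_at" = "3123", "232583_at" = "8440",
         "223620_at" = "2857", "236685_at" = NA)
new <- c("238900_at" = "100133484", "232583_at" = NA,
         "223620_at" = "2857", "236685_at" = NA)

# Hypothetical helper: label what happened to each reporter's ID.
compare_ids <- function(old, new) {
  new <- new[names(old)]  # align the new IDs to the old reporter order
  status <- ifelse(is.na(old) & is.na(new), "still NA",
            ifelse(!is.na(old) & is.na(new), "lost",
            ifelse(is.na(old) & !is.na(new), "gained",
            ifelse(old == new, "unchanged", "changed"))))
  data.frame(probeset = names(old),
             old = unname(old),
             new = unname(new),
             status = unname(status),
             stringsAsFactors = FALSE)
}

res <- compare_ids(old, new)
print(res)
table(res$status)
```

Keeping the sessionInfo() output with the results, as in the quoted 
message, records which annotation package versions a 'frozen' analysis 
used.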

Best,

Jim






Alex Sanchez wrote:
> Hello
> 
> I recently had to review an analysis I did some time ago. This was done on Affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9. I have re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
> I have been surprised by the changes in the annotations: many probesets that had an annotation have become NAs, whereas some have changed their symbol and their Entrez Gene ID.
> 
> To be specific I summarize my question with the top genes of my list
> 
> The list I obtained 2 years ago is:
> 
> probeset      locuslink  symbol
> 238900_at     3123       HLA-DRB1
> 232583_at     8440       NCK2
> 236307_at     60468      BACH2
> 223620_at     2857       GPR34
> 219759_at     64167      LRAP
> 201702_s_at   5514       PPP1R10
> 232882_at     2308       FOXO1A
> 213446_s_at   8826       IQGAP1
> 234033_at     9693       RAPGEF2
> 243006_at     2534       FYN
> 244648_at     54520      CCDC93
> 243691_at     23142      DCUN1D4
> 239264_at     60412      EXOC4
> 243546_at     143686     SESN3
> 205239_at     374        AREG
> 1565703_at    55520      ELAC1
> 244061_at     55843      ARHGAP15
> 230505_at     26037      SIPA1L1
> 242688_at     9320       TRIP12
> 1556474_a_at  285097     FLJ38379
> 232614_at     596        BCL2
> 1565689_at    3839       KPNA3
> 236685_at     NA         NA
> 225173_at     93663      ARHGAP18
> 241893_at     4249       MGAT5
> 
> 
> I used the following code to reproduce the issue with the annotations:
> 
> 
> #####################################################################
> ## Verification using R 2.9 & BioC 2.4
> #####################################################################
> 
>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" , "219759_at" ,
> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",    "243006_at" ,  
> +  "244648_at" ,   "243691_at" ,   "239264_at" ,   "243546_at" ,   "205239_at" ,
> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,   "1556474_a_at",
> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,   "241893_at")
>> library(hgu133plus2.db)
>> library(annotate)
>>
>> entrezs<- getEG(probes, "hgu133plus2")
>> symbols<- getSYMBOL(probes, "hgu133plus2")
>> sel2<- cbind(probes, entrezs, symbols)
>> sel2
>              probes         entrezs     symbols       
> 238900_at    "238900_at"    "100133484" "LOC100133484"
> 232583_at    "232583_at"    NA          NA           
> 236307_at    "236307_at"    NA          NA           
> 223620_at    "223620_at"    "2857"      "GPR34"       
> 219759_at    "219759_at"    "64167"     "ERAP2"       
> 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"     
> 232882_at    "232882_at"    NA          NA           
> 213446_s_at  "213446_s_at"  "8826"      "IQGAP1"      
> 234033_at    "234033_at"    NA          NA           
> 243006_at    "243006_at"    NA          NA           
> 244648_at    "244648_at"    NA          NA           
> 243691_at    "243691_at"    NA          NA           
> 239264_at    "239264_at"    NA          NA           
> 243546_at    "243546_at"    NA          NA           
> 205239_at    "205239_at"    "374"       "AREG"        
> 1565703_at   "1565703_at"   "4089"      "SMAD4"       
> 244061_at    "244061_at"    NA          NA           
> 230505_at    "230505_at"    "145474"    "LOC145474"   
> 242688_at    "242688_at"    NA          NA           
> 1556474_a_at "1556474_a_at" "285097"    "FLJ38379"    
> 232614_at    "232614_at"    NA          NA           
> 1565689_at   "1565689_at"   NA          NA           
> 236685_at    "236685_at"    NA          NA           
> 225173_at    "225173_at"    "93663"     "ARHGAP18"    
> 241893_at    "241893_at"    NA          NA           
>> sessionInfo()
> R version 2.9.0 (2009-04-17) 
> i386-pc-mingw32 
> 
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1         DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1       
> 
> loaded via a namespace (and not attached):
> [1] xtable_1.5-5
> #############################################
> 
> Many probesets seem to have changed.
> Can someone explain to me what is happening (or what I may be doing wrong)?
> 
> The same code does not work with R 2.4, but if I replace hgu133plus2.db with hgu133plus2 and getEG with getLL I obtain the original results:
> 
> ###############################################
> ### Review of annotations with R 2.4 and BioC 1.9
> ###############################################
> 
> ### This code is executed on a clean new session with R 2. and BioC 1.9
> 
>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" , "219759_at" ,
> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",    "243006_at" ,  
> +  "244648_at" ,   "243691_at" ,   "239264_at" ,   "243546_at" ,   "205239_at" ,
> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,   "1556474_a_at",
> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,   "241893_at")
>> library(hgu133plus2)
>> library(annotate)
>> LLs<- getLL(probes, "hgu133plus2")
>> symbols<- getSYMBOL(probes, "hgu133plus2")
>> sel1<- cbind(probes, LLs, symbols)
>> sel1
>              probes         LLs      symbols    
> 238900_at    "238900_at"    "3123"   "HLA-DRB1" 
> 232583_at    "232583_at"    "8440"   "NCK2"     
> 236307_at    "236307_at"    "60468"  "BACH2"    
> 223620_at    "223620_at"    "2857"   "GPR34"    
> 219759_at    "219759_at"    "64167"  "ERAP2"    
> 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"  
> 232882_at    "232882_at"    "2308"   "FOXO1"    
> 213446_s_at  "213446_s_at"  "8826"   "IQGAP1"   
> 234033_at    "234033_at"    "9693"   "RAPGEF2"  
> 243006_at    "243006_at"    "2534"   "FYN"      
> 244648_at    "244648_at"    "54520"  "CCDC93"   
> 243691_at    "243691_at"    "23142"  "DCUN1D4"  
> 239264_at    "239264_at"    "60412"  "EXOC4"    
> 243546_at    "243546_at"    "143686" "SESN3"    
> 205239_at    "205239_at"    "374"    "AREG"     
> 1565703_at   "1565703_at"   "4089"   "SMAD4"    
> 244061_at    "244061_at"    "55843"  "ARHGAP15" 
> 230505_at    "230505_at"    "145474" "LOC145474"
> 242688_at    "242688_at"    "9320"   "TRIP12"   
> 1556474_a_at "1556474_a_at" "285097" "FLJ38379" 
> 232614_at    "232614_at"    "596"    "BCL2"     
> 1565689_at   "1565689_at"   "3839"   "KPNA3"    
> 236685_at    "236685_at"    NA       NA         
> 225173_at    "225173_at"    "93663"  "ARHGAP18" 
> 241893_at    "241893_at"    "4249"   "MGAT5"    
> 
>> sessionInfo()
> R version 2.4.1 (2006-12-18) 
> i386-pc-mingw32 
> 
> locale:
> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
> 
> attached base packages:
> [1] "tools"     "stats"     "graphics"  "grDevices"
> [5] "utils"     "datasets"  "methods"   "base"     
> 
> other attached packages:
>    annotate     Biobase hgu133plus2 
>    "1.12.1"    "1.12.2"    "1.14.0" 
> 
> ########################################################
> 
> In summary: if I use R 2.4/BioC 1.9 I obtain the same results I obtained 2 years ago, but if I do the same steps using R 2.9/BioC 2.4 the results change dramatically.
> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in R 2.8 (results not shown here). BioC 2.0 yields the same as 1.9 and BioC 2.2 the same as 2.4.
> 
> Any help to understand what's happening would be appreciated
> 
> Alex Sanchez
> 
> -----------------------------------------------------------------------------------------------------
> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
> asanchez_at_ub.edu
> Statistics and Bioinformatics Unit
> Institut de Recerca. Hospital Universitari Vall d'Hebron
> Passeig Vall d'Hebron 112-119. 08034 Barcelona
> asanchez_at_ir.vhebron.net
> ----------------------------------------------------------------------------------------------------

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826


