[BioC] Changes in annotations?

Alex Sanchez asanchez at ub.edu
Thu Jul 9 14:16:58 CEST 2009


Hi James and thanks for the help.

I know, of course, that Bioconductor takes its annotations from public
data sources.
I will now turn to Affymetrix to see if someone there, or in their forums,
can explain why so many annotations have turned into NAs.

There seem to be two problems here:
- The first one, pointed out by Loren in his message today, is synonyms. For
instance, the first probeset in my list was 238900_at.
   In the first version I used (BioC 1.9) the gene symbol provided by
getSYMBOL was HLA-DRB1, whereas now (BioC 2.4) it is LOC100133484.
   However, the original symbol is still in the Affymetrix annotation file
"HG-U133_Plus_2.na28.annot.csv". It seems as if the new symbol is intended
   to show the relation with the new Entrez ID (100133484).
In any case, it is annoying but it can be lived with.
- What seems worse is what happens to some annotations. If, for instance, I
take the second probeset in my list, 232583_at, and look for it in the
Affy annotation file, I find that, apart from some links to databases such
as GenBank, it no longer has a gene symbol, an Entrez ID, or most other
annotations.
In the previous version of the annotations this probeset was one of two
associated with gene NCK2 (232583_at; 203315_at).
The problem is that the second probeset (203315_at) was not selected by the
analysis. That is, the only evidence suggesting that gene NCK2 was
differentially expressed (232583_at) no longer points to it.
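
To make the NCK2 case concrete, here is a minimal base-R sketch of how one
might check which genes lose all of their supporting probesets between two
annotation versions. The two mapping vectors are typed in by hand from the
example above; the names old_map, new_map, selected and gene_support are
purely illustrative and not part of any package. It also assumes that
203315_at is still annotated to NCK2, which I have not verified.

```r
# Probeset -> gene symbol under the old annotation (the NCK2 example above).
old_map <- c("232583_at" = "NCK2", "203315_at" = "NCK2")

# Same probesets under the new annotation: 232583_at has lost its symbol.
# ASSUMPTION: 203315_at still maps to NCK2 (not verified here).
new_map <- c("232583_at" = NA_character_, "203315_at" = "NCK2")

# Probesets picked up by the analysis (203315_at was not selected).
selected <- c("232583_at")

# Hypothetical helper: genes supported by at least one selected probeset.
gene_support <- function(map, selected) {
  symbols <- unname(map[selected])
  unique(symbols[!is.na(symbols)])
}

old_genes <- gene_support(old_map, selected)
new_genes <- gene_support(new_map, selected)

# Genes that were supported before but have no selected probeset left.
lost_genes <- setdiff(old_genes, new_genes)
print(lost_genes)
```

With a full probeset-to-symbol table in place of the two hand-typed vectors,
the same comparison would list every gene whose only selected probeset has
become NA.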

Last, but by no means least, this is not happening to just a few probesets.
My original list had 417 probesets. 210 have changed/synonymized their gene
symbols, and of these 210, 160 have become NAs.

Seems too many to feel comfortable :-(
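
A tally like the one above can be made along these lines. This is only a
sketch in base R, using the Entrez IDs of the 25 probesets from my original
message (quoted below), copied by hand from the two outputs shown there. The
function name classify_change and the category labels are my own choice, not
something provided by annotate or AnnotationDbi.

```r
probes <- c("238900_at", "232583_at", "236307_at", "223620_at", "219759_at",
            "201702_s_at", "232882_at", "213446_s_at", "234033_at", "243006_at",
            "244648_at", "243691_at", "239264_at", "243546_at", "205239_at",
            "1565703_at", "244061_at", "230505_at", "242688_at", "1556474_a_at",
            "232614_at", "1565689_at", "236685_at", "225173_at", "241893_at")

# Entrez IDs obtained with R 2.4 / BioC 1.9 (copied from the quoted output).
old_ids <- c("3123", "8440", "60468", "2857", "64167",
             "5514", "2308", "8826", "9693", "2534",
             "54520", "23142", "60412", "143686", "374",
             "4089", "55843", "145474", "9320", "285097",
             "596", "3839", NA, "93663", "4249")

# Entrez IDs obtained with R 2.9 / BioC 2.4 (copied from the quoted output).
new_ids <- c("100133484", NA, NA, "2857", "64167",
             "5514", NA, "8826", NA, NA,
             NA, NA, NA, NA, "374",
             "4089", NA, "145474", NA, "285097",
             NA, NA, NA, "93663", NA)

# Hypothetical helper: label each probeset by what happened to its ID.
classify_change <- function(old, new) {
  stopifnot(length(old) == length(new))
  status <- rep("unchanged", length(old))
  both <- !is.na(old) & !is.na(new)
  status[both & old != new] <- "changed"          # annotated in both, ID differs
  status[!is.na(old) & is.na(new)] <- "lost"      # annotated before, NA now
  status[is.na(old) & !is.na(new)] <- "gained"    # NA before, annotated now
  status[is.na(old) & is.na(new)] <- "NA in both"
  status
}

status <- classify_change(old_ids, new_ids)
names(status) <- probes

levels_wanted <- c("unchanged", "changed", "lost", "gained", "NA in both")
tally <- table(factor(status, levels = levels_wanted))
print(tally)
```

For these 25 probesets it gives 9 unchanged, 1 changed, 14 lost, 0 gained
and 1 that was NA in both versions. The figures for the full list of 417
probesets are the ones quoted above.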

Thanks for the help

Alex

...............................................................................................................
Dr. Alex  Sánchez.
Associate Professor. Statistics Department. University of Barcelona.
Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
asanchez_at_ub.edu
Statistics and Bioinformatics Unit
Institut de Recerca. Hospital Universitari Vall 'Hebron
Passeig Vall d'Hebron 112-119. 08034 Barcelona
asanchez_at_ir.vhebron.net
...............................................................................................................




----- Original Message ----- 
From: "James W. MacDonald" <jmacdon at med.umich.edu>
To: "Alex Sanchez" <asanchez at ub.edu>
Cc: <bioconductor at stat.math.ethz.ch>
Sent: Monday, July 06, 2009 3:58 PM
Subject: Re: [BioC] Changes in annotations?


> Hi Alex,
>
> This is a question that comes up on the Bioc list fairly regularly, and 
> the answer is in two parts:
>
> First, the annotations supplied in the various metadata packages supplied 
> by BioC are *not* our annotations, but are simply a re-packaging of data 
> we collect from various sources. As an example, we use the mappings of 
> Affymetrix Probe ID to Entrez Gene ID from the annotation csv files you 
> can download from the Affy website. We then map the Entrez Gene IDs to 
> other annotation using primarily NCBI data. So if you go to Affy's netaffx 
> site (free registration required) and query on say, 238900_at, you get 
> this:
>
> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLUS_2%3A238900_AT
>
> And you will note that the first Entrez Gene ID listed there is 100133484, 
> which happens to be a defunct ID. However, this is the first of many 
> listed there (and we need a one-to-one mapping), so we chose that one. A 
> more likely Entrez Gene ID can be found further down the list, but we 
> simply don't have the resources to figure out if there is a better choice 
> in that list (for every reporter on every Affy chip we annotate). Nor do 
> we have the resources to ensure that any of the mappings that Affy make 
> are reasonable to begin with. We have to trust that they (with *way* more 
> resources than us) are doing a reasonable job.
>
> The second part of the answer has to do with the 'moving target' aspect of 
> Biological annotations. These data change all the time, and there is the 
> recurring question of whether one should do an analysis and 'freeze' it to 
> that point in time, or should the annotations be updated on a regular 
> basis, with the realization that things can and will change?
>
> Without looking at each reporter ID you list, I can't say if the changes 
> are due to Affy changing their annotation csv files, or to changing 
> knowledge of the genome, but I suspect it is a combination of the two.
>
> Best,
>
> Jim
>
>
>
>
>
>
> Alex Sanchez wrote:
>> Hello
>>
>> I recently had to review an analysis I did some time ago. It was 
>> done on Affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9. I have 
>> re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>> I have been surprised by the changes in the annotations: many probesets 
>> that had an annotation have become NAs, whereas some have changed 
>> their symbol and their Entrez Gene ID.
>>
>> To be specific I summarize my question with the top genes of my list
>>
>> The list I obtained 2 years ago is:
>>
>> probeset    locuslink    symbol
>> 238900_at 3123 HLA-DRB1
>> 232583_at 8440 NCK2
>> 236307_at 60468 BACH2
>> 223620_at 2857 GPR34
>> 219759_at 64167 LRAP
>> 201702_s_at 5514 PPP1R10
>> 232882_at 2308 FOXO1A
>> 213446_s_at 8826 IQGAP1
>> 234033_at 9693 RAPGEF2
>> 243006_at 2534 FYN
>> 244648_at 54520 CCDC93
>> 243691_at 23142 DCUN1D4
>> 239264_at 60412 EXOC4
>> 243546_at 143686 SESN3
>> 205239_at 374 AREG
>> 1565703_at 55520 ELAC1
>> 244061_at 55843 ARHGAP15
>> 230505_at 26037 SIPA1L1
>> 242688_at 9320 TRIP12
>> 1556474_a_at 285097 FLJ38379
>> 232614_at 596 BCL2
>> 1565689_at 3839 KPNA3
>> 236685_at NA NA
>> 225173_at 93663 ARHGAP18
>> 241893_at 4249 MGAT5
>>
>> I used the following code to reproduce the issue with the annotations:
>>
>>
>> #####################################################################
>> ## Verification using R 2.9 & BioC 2.4
>> #####################################################################
>>
>>> probes <- c("238900_at", "232583_at", "236307_at", "223620_at", "219759_at",
>> +   "201702_s_at", "232882_at", "213446_s_at", "234033_at", "243006_at",
>> +   "244648_at", "243691_at", "239264_at", "243546_at", "205239_at",
>> +   "1565703_at", "244061_at", "230505_at", "242688_at", "1556474_a_at",
>> +   "232614_at", "1565689_at", "236685_at", "225173_at", "241893_at")
>>> library(hgu133plus2.db)
>>> library(annotate)
>>>
>>> entrezs<- getEG(probes, "hgu133plus2")
>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>> sel2<- cbind(probes, entrezs, symbols)
>>> sel2
>>              probes         entrezs     symbols
>> 238900_at    "238900_at"    "100133484" "LOC100133484"
>> 232583_at    "232583_at"    NA          NA
>> 236307_at    "236307_at"    NA          NA
>> 223620_at    "223620_at"    "2857"      "GPR34"
>> 219759_at    "219759_at"    "64167"     "ERAP2"
>> 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"
>> 232882_at    "232882_at"    NA          NA
>> 213446_s_at  "213446_s_at"  "8826"      "IQGAP1"
>> 234033_at    "234033_at"    NA          NA
>> 243006_at    "243006_at"    NA          NA
>> 244648_at    "244648_at"    NA          NA
>> 243691_at    "243691_at"    NA          NA
>> 239264_at    "239264_at"    NA          NA
>> 243546_at    "243546_at"    NA          NA
>> 205239_at    "205239_at"    "374"       "AREG"
>> 1565703_at   "1565703_at"   "4089"      "SMAD4"
>> 244061_at    "244061_at"    NA          NA
>> 230505_at    "230505_at"    "145474"    "LOC145474"
>> 242688_at    "242688_at"    NA          NA
>> 1556474_a_at "1556474_a_at" "285097"    "FLJ38379"
>> 232614_at    "232614_at"    NA          NA
>> 1565689_at   "1565689_at"   NA          NA
>> 236685_at    "236685_at"    NA          NA
>> 225173_at    "225173_at"    "93663"     "ARHGAP18"
>> 241893_at    "241893_at"    NA          NA
>>> sessionInfo()
>> R version 2.9.0 (2009-04-17)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1
>> [4] DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1
>>
>> loaded via a namespace (and not attached):
>> [1] xtable_1.5-5
>> #############################################
>>
>> Many probesets seem to have changed.
>> Can someone explain to me what is happening (or what I may be doing 
>> wrong)?
>>
>> The same code does not work with R 2.4, but if I replace hgu133plus2.db 
>> with hgu133plus2 and getEG with getLL I obtain the original results:
>>
>> ###############################################
>> ### Review of annotations with R 2.4 and BioC 1.9
>> ###############################################
>>
>> ### This code is executed in a clean new session with R 2.4 and BioC 1.9
>>
>>> probes <- c("238900_at", "232583_at", "236307_at", "223620_at", "219759_at",
>> +   "201702_s_at", "232882_at", "213446_s_at", "234033_at", "243006_at",
>> +   "244648_at", "243691_at", "239264_at", "243546_at", "205239_at",
>> +   "1565703_at", "244061_at", "230505_at", "242688_at", "1556474_a_at",
>> +   "232614_at", "1565689_at", "236685_at", "225173_at", "241893_at")
>>> LLs<- getLL(probes, "hgu133plus2")
>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>> sel1<- cbind(probes, LLs, symbols)
>>> sel1
>>              probes         LLs      symbols
>> 238900_at    "238900_at"    "3123"   "HLA-DRB1"
>> 232583_at    "232583_at"    "8440"   "NCK2"
>> 236307_at    "236307_at"    "60468"  "BACH2"
>> 223620_at    "223620_at"    "2857"   "GPR34"
>> 219759_at    "219759_at"    "64167"  "ERAP2"
>> 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"
>> 232882_at    "232882_at"    "2308"   "FOXO1"
>> 213446_s_at  "213446_s_at"  "8826"   "IQGAP1"
>> 234033_at    "234033_at"    "9693"   "RAPGEF2"
>> 243006_at    "243006_at"    "2534"   "FYN"
>> 244648_at    "244648_at"    "54520"  "CCDC93"
>> 243691_at    "243691_at"    "23142"  "DCUN1D4"
>> 239264_at    "239264_at"    "60412"  "EXOC4"
>> 243546_at    "243546_at"    "143686" "SESN3"
>> 205239_at    "205239_at"    "374"    "AREG"
>> 1565703_at   "1565703_at"   "4089"   "SMAD4"
>> 244061_at    "244061_at"    "55843"  "ARHGAP15"
>> 230505_at    "230505_at"    "145474" "LOC145474"
>> 242688_at    "242688_at"    "9320"   "TRIP12"
>> 1556474_a_at "1556474_a_at" "285097" "FLJ38379"
>> 232614_at    "232614_at"    "596"    "BCL2"
>> 1565689_at   "1565689_at"   "3839"   "KPNA3"
>> 236685_at    "236685_at"    NA       NA
>> 225173_at    "225173_at"    "93663"  "ARHGAP18"
>> 241893_at    "241893_at"    "4249"   "MGAT5"
>>> sessionInfo()
>> R version 2.4.1 (2006-12-18)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
>>
>> attached base packages:
>> [1] "tools"     "stats"     "graphics"  "grDevices"
>> [5] "utils"     "datasets"  "methods"   "base"
>>
>> other attached packages:
>>    annotate     Biobase hgu133plus2
>>    "1.12.1"    "1.12.2"    "1.14.0"
>> ########################################################
>>
>> In summary: if I use R 2.4/BioC 1.9 I obtain the same results I obtained 
>> 2 years ago, but if I follow the same steps using R 2.9/BioC 2.4 the 
>> results change dramatically.
>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in R 
>> 2.8 (results not shown here). BioC 2.0 yields the same as 1.9, and BioC 
>> 2.2 the same as 2.4.
>>
>> Any help to understand what's happening would be appreciated
>>
>> Alex Sanchez
>>
>> -----------------------------------------------------------------------------------------------------
>> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>> asanchez_at_ub.edu
>> Statistics and Bioinformatics Unit
>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>> asanchez_at_ir.vhebron.net
>> ----------------------------------------------------------------------------------------------------
>>
>
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
>


