[BioC] Changes in annotations?

Fri Jul 10 17:46:36 CEST 2009

Hi Loren,

I am working on this right now.  But it is really not as simple as it
sounds.  I hope to have more to say about this very soon.

  Marc

Loren Engrav wrote:
> Thank you
>
> Why is it not "best" to include everything Affy includes? All the synonyms
> and IDs
>
> Or is this not software possible?
>
>
>
>   
>> From: "James W. MacDonald" <jmacdon at med.umich.edu>
>> Date: Thu, 09 Jul 2009 09:36:46 -0400
>> To: Alex Sanchez <asanchez at ub.edu>
>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] Changes in annotations?
>>
>> Hi Alex,
>>
>> We have taken this up as an internal topic of discussion and hopefully
>> will be able to come up with a solution that is a bit better than what
>> we are doing right now.
>>
>> However, some of this is going to be unavoidable. The chips were based
>> on UniGene build 133 (and we are now on build 219), so things are going
>> to tend to move around. But I don't think this is the main reason for
>> what you are seeing, as the latest Affy annotation file is still based
>> on UCSC build hg18 (which is based on NCBI build 36.1), so these data
>> are over three years old.
>>
>> Instead, it appears that Affy have changed something about how they are
>> annotating things. Some of this is an improvement, and some not so much.
>>
>> For instance, 203315_at does interrogate NCK2:
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136510
>> 693&hgt.out3=10x&position=chr2%3A105876577-105876826&hgtgroup_map_close=0&hgtg
>> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expr
>> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_
>> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hg
>> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndV
>> ar_close=1
>>
>> whereas 232583_at does not:
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136510
>> 693&hgt.out3=10x&position=chr2%3A105752637-105752886&hgtgroup_map_close=0&hgtg
>> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expr
>> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_
>> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hg
>> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndV
>> ar_close=1
>>
>> so in this case they have improved their annotations.
>>
>> In the case of 238900_at, they seem to have added in a bunch of EG IDs
>> that are based on a model rather than being actual reviewed genes. In
>> addition, that probeset seems to hit an intron, so adding in all this
>> other annotation appears pointless:
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136511
>> 081&hgt.out3=10x&position=chr6%3A32653723-32653972&hgtgroup_map_close=0&hgtgro
>> up_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expres
>> sion_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_va
>> rRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hgtg
>> roup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndVar
>> _close=1
>>
>> What this really comes down to is the fact that you can't take anything
>> for granted. Not only do you have to validate a set of significant
>> results using some other technology, you also need to ensure that the
>> chip you are using is actually measuring what you think prior to doing
>> the validation.
>>
>> It should be relatively painless to set up a pipeline where you could
>> use the BSgenome.Hsapiens.UCSC.hg18 package along with functions from
>> BSgenome/Biostrings to map probesets to the genome and then use the
>> org.Hs.eg.db package to see if a given probeset is even interrogating
>> the gene it is supposed to. One could then use rtracklayer to visualize
>> things in the UCSC genome browser to make sure you weren't interrogating
>> an intron.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>> Alex Sanchez wrote:
>>     
>>> Hi James and thanks for the help.
>>>
>>> I know, of course, that what Bioconductor does is to take the
>>> annotations from public data sources.
>>> I will now turn to affymetrix to see if someone there, or in their
>>> forums, can explain why so many annotations have been turned into "NA's"
>>>
>>> There seem to be two problems here
>>> - The first one, pointed by Loren in his message today is synonims. For
>>> instance the first probeset in my list was 238900_at,
>>>   In the first version I used (BioC 1.9) the Gene Symbol provide by
>>> getSymbol was HLA-DRB1 whereas now (BioC 2.4) it is LOC100133484
>>>   However the original symbol is still in the Affymetrix annotation file
>>> "HG-U133_Plus_2.na28.annot.csv". It seems as if the new symbol is intended
>>>   to show the relation with the new Entrez ID (100133484).
>>> In any case it is ennoying but it can be assumed.
>>> - What seems worse is what happens to some annotations: If, for
>>> instance,  I take the second probeset in my list, 232583_at, and I look
>>> for it in the affy annotations file I find that apart some links to
>>> databases as genebank it does not have a gene symbol or an entrez (or
>>> most annotations anymore).
>>> In the previous version of the annotations this probeset was one of two
>>> associated with gene NCK2 (232583_at;203315_at) .
>>> The problem is that the second probeset (203315_at) was not selected by
>>> the analysis. That is the only evidence suggesting that gene NCK2 was
>>> differentially expressed (232583_at) does not suggest it anymore.
>>>
>>> Last, but in any case least, this is not happening to a few probesets.
>>> My original list had 417 probesets. 210 have changed/synonimized their
>>> gene symbols, but from these 210, 160 have become NA's
>>>
>>> Seems too many to feel comfortable :-(
>>>
>>> Thanks for the help
>>>
>>> Alex
>>>
>>> .............................................................................
>>> ..................................
>>>
>>> Dr. Alex  Sánchez.
>>> Associate Professor. Statistics Department. University of Barcelona.
>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>> asanchez_at_ub.edu
>>> Statistics and Bioinformatics Unit
>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>> asanchez_at_ir.vhebron.net
>>> .............................................................................
>>> ..................................
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message ----- From: "James W. MacDonald"
>>> <jmacdon at med.umich.edu>
>>> To: "Alex Sanchez" <asanchez at ub.edu>
>>> Cc: <bioconductor at stat.math.ethz.ch>
>>> Sent: Monday, July 06, 2009 3:58 PM
>>> Subject: Re: [BioC] Changes in annotations?
>>>
>>>
>>>       
>>>> Hi Alex,
>>>>
>>>> This is a question that comes up on the Bioc list fairly regularly,
>>>> and the answer is in two parts:
>>>>
>>>> First, the annotations supplied in the various metadata packages
>>>> supplied by BioC are *not* our annotations, but are simply a
>>>> re-packaging of data we collect from various sources. As an example,
>>>> we use the mappings of Affymetrix Probe ID to Entrez Gene ID from the
>>>> annotation csv files you can download from the Affy website. We then
>>>> map the Entrez Gene IDs to other annotation using primarily NCBI data.
>>>> So if you go to Affy's netaffx site (free registration required) and
>>>> query on say, 238900_at, you get this:
>>>>
>>>> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLUS_
>>>> 2%3A238900_AT 
>>>>
>>>>
>>>> And you will note that the first Entrez Gene ID listed there is
>>>> 100133484, which happens to be a defunct ID. However, this is the
>>>> first of many listed there (and we need a one-to-one mapping), so we
>>>> chose that one. A more likely Entrez Gene ID can be found further down
>>>> the list, but we simply don't have the resources to figure out if
>>>> there is a better choice in that list (for every reporter on every
>>>> Affy chip we annotate). Nor do we have the resources to ensure that
>>>> any of the mappings that Affy make are reasonable to begin with. We
>>>> have to trust that they (with *way* more resources that us) are doing
>>>> a reasonable job.
>>>>
>>>> The second part of the answer has to do with the 'moving target'
>>>> aspect of Biological annotations. These data change all the time, and
>>>> there is the recurring question of whether one should do an analysis
>>>> and 'freeze' it to that point in time, or should the annotations be
>>>> updated on a regular basis, with the realization that things can and
>>>> will change?
>>>>
>>>> Without looking at each reporter ID you list, I can't say if the
>>>> changes are due to Affy changing their annotation csv files, or to
>>>> changing knowledge of the genome, but I suspect it is a combination of
>>>> the two.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Alex Sanchez wrote:
>>>>         
>>>>> Hello
>>>>>
>>>>> I have had to review recently an analysis I did some time ago. This
>>>>> was done on affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9 I
>>>>> have re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>>>>> I have been surprised by the changes in the annotations: Many
>>>>> probesets that had had an annotation have become NA's whereas some
>>>>> have changed their symbol and their Entrez gene.
>>>>>
>>>>> To be specific I summarize my question with the top genes of my list
>>>>>
>>>>> The list I obtained 2 years ago is:
>>>>>
>>>>> probeset    locuslink    symbol
>>>>> 238900_at 3123 HLA-DRB1
>>>>> 232583_at 8440 NCK2
>>>>> 236307_at 60468 BACH2
>>>>> 223620_at 2857 GPR34
>>>>> 219759_at 64167 LRAP
>>>>> 201702_s_at 5514 PPP1R10
>>>>> 232882_at 2308 FOXO1A
>>>>> 213446_s_at 8826 IQGAP1
>>>>> 234033_at 9693 RAPGEF2
>>>>> 243006_at 2534 FYN
>>>>> 244648_at 54520 CCDC93
>>>>> 243691_at 23142 DCUN1D4
>>>>> 239264_at 60412 EXOC4
>>>>> 243546_at 143686 SESN3
>>>>> 205239_at 374 AREG
>>>>> 1565703_at 55520 ELAC1
>>>>> 244061_at 55843 ARHGAP15
>>>>> 230505_at 26037 SIPA1L1
>>>>> 242688_at 9320 TRIP12
>>>>> 1556474_a_at 285097 FLJ38379
>>>>> 232614_at 596 BCL2
>>>>> 1565689_at 3839 KPNA3
>>>>> 236685_at NA NA
>>>>> 225173_at 93663 ARHGAP18
>>>>> 241893_at 4249 MGAT5
>>>>>
>>>>> I used the following code to reproduce the issue with the annotations:
>>>>>
>>>>>
>>>>> #####################################################################
>>>>> ## Verification using R 2.9 & BioC 2.4
>>>>> #####################################################################
>>>>>
>>>>>           
>>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>>> "219759_at" ,
>>>>>>             
>>>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",
>>>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" ,
>>>>> "243546_at" ,   "205239_at" ,
>>>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,
>>>>> "1556474_a_at",
>>>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,
>>>>> "241893_at")
>>>>>           
>>>>>> library(hgu133plus2.db)
>>>>>> library(annotate)
>>>>>>
>>>>>> entrezs<- getEG(probes, "hgu133plus2")
>>>>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>>>>> sel2<- cbind(probes, entrezs, symbols)
>>>>>> sel2
>>>>>>             
>>>>>              probes         entrezs     symbols       238900_at
>>>>> "238900_at"    "100133484" "LOC100133484"
>>>>> 232583_at    "232583_at"    NA          NA           236307_at
>>>>> "236307_at"    NA          NA           223620_at    "223620_at"
>>>>> "2857"      "GPR34"       219759_at    "219759_at"    "64167"
>>>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"
>>>>> 232882_at "232882_at"    NA          NA           213446_s_at
>>>>> "213446_s_at" "8826"      "IQGAP1"      234033_at    "234033_at"
>>>>> NA          NA 243006_at    "243006_at"    NA          NA
>>>>> 244648_at "244648_at"    NA          NA           243691_at
>>>>> "243691_at"    NA NA           239264_at    "239264_at"
>>>>> NA          NA 243546_at    "243546_at"    NA          NA
>>>>> 205239_at "205239_at"    "374"       "AREG"        1565703_at
>>>>> "1565703_at" "4089"      "SMAD4"       244061_at    "244061_at"
>>>>> NA          NA 230505_at    "230505_at"    "145474"    "LOC145474"
>>>>> 242688_at "242688_at"    NA          NA           1556474_a_at
>>>>> "1556474_a_at" "285097"    "FLJ38379"    232614_at    "232614_at"
>>>>> NA          NA 1565689_at   "1565689_at"   NA          NA
>>>>> 236685_at "236685_at"    NA          NA           225173_at
>>>>> "225173_at" "93663"     "ARHGAP18"    241893_at    "241893_at"
>>>>> NA          NA
>>>>>           
>>>>>> sessionInfo()
>>>>>>             
>>>>> R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale:
>>>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>>>> States.1252;LC_MONETARY=English_United
>>>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>> other attached packages:
>>>>> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1
>>>>> DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1
>>>>> loaded via a namespace (and not attached):
>>>>> [1] xtable_1.5-5
>>>>> #############################################
>>>>>
>>>>> Many probesets seem to have changed.
>>>>> Can someone explain to me what is happening (or what may I be doing
>>>>> wrong)?
>>>>>
>>>>> The same code does not work with R 2.4 but if I change hgu133plus2.db
>>>>> by hgu133plus2 and getEG by getLL I obtain the original results:
>>>>>
>>>>> ###############################################
>>>>> ### Review of annotatons with R 2.4 and BioC 1.9
>>>>> ###############################################
>>>>>
>>>>> ### This code is executed on a clean new session with R 2. and BioC 1.9
>>>>>
>>>>>           
>>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>>> "219759_at" ,
>>>>>>             
>>>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",
>>>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" ,
>>>>> "243546_at" ,   "205239_at" ,
>>>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,
>>>>> "1556474_a_at",
>>>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,
>>>>> "241893_at")
>>>>>           
>>>>>> LLs<- getLL(rownames(sel), "hgu133plus2")
>>>>>> symbols<- getSYMBOL(rownames(sel), "hgu133plus2")
>>>>>> sel1<- cbind(probes, LLs, symbols)
>>>>>> sel1
>>>>>>             
>>>>>              probes         LLs      symbols    238900_at
>>>>> "238900_at" "3123"   "HLA-DRB1" 232583_at    "232583_at"    "8440"
>>>>> "NCK2" 236307_at    "236307_at"    "60468"  "BACH2"    223620_at
>>>>> "223620_at" "2857"   "GPR34"    219759_at    "219759_at"    "64167"
>>>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"  232882_at
>>>>> "232882_at" "2308"   "FOXO1"    213446_s_at  "213446_s_at"  "8826"
>>>>> "IQGAP1" 234033_at    "234033_at"    "9693"   "RAPGEF2"  243006_at
>>>>> "243006_at" "2534"   "FYN"      244648_at    "244648_at"    "54520"
>>>>> "CCDC93" 243691_at    "243691_at"    "23142"  "DCUN1D4"  239264_at
>>>>> "239264_at" "60412"  "EXOC4"    243546_at    "243546_at"    "143686"
>>>>> "SESN3" 205239_at    "205239_at"    "374"    "AREG"     1565703_at
>>>>> "1565703_at" "4089"   "SMAD4"    244061_at    "244061_at"    "55843"
>>>>> "ARHGAP15" 230505_at    "230505_at"    "145474" "LOC145474"
>>>>> 242688_at    "242688_at"    "9320"   "TRIP12"   1556474_a_at
>>>>> "1556474_a_at" "285097" "FLJ38379" 232614_at    "232614_at"    "596"
>>>>> "BCL2"     1565689_at   "1565689_at"   "3839"   "KPNA3"    236685_at
>>>>> "236685_at"    NA       NA         225173_at    "225173_at"
>>>>> "93663" "ARHGAP18" 241893_at    "241893_at"    "4249"   "MGAT5"
>>>>>           
>>>>>> sessionInfo()
>>>>>>             
>>>>> R version 2.4.1 (2006-12-18) i386-pc-mingw32 locale:
>>>>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spani
>>>>> sh_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
>>>>>
>>>>>
>>>>> attached base packages:
>>>>> [1] "tools"     "stats"     "graphics"  "grDevices"
>>>>> [5] "utils"     "datasets"  "methods"   "base"     other attached
>>>>> packages:
>>>>>    annotate     Biobase hgu133plus2 "1.12.1"    "1.12.2"    "1.14.0"
>>>>> ########################################################
>>>>>
>>>>> In summary. If I use R 2.4/BioC 1.9 I obtain the same results I
>>>>> ibtained 2 years ago, but If I do the same steps using R2.9/BioC2.4
>>>>> the results change dramatically.
>>>>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in
>>>>> R 2.8 (results not shown here). BioC 2.0 yield the same as 1.9 and
>>>>> BioC 2.2 the same as 2.4,
>>>>>
>>>>> Any help to understand what's happening would be appreciated
>>>>>
>>>>> Alex Sanchez
>>>>>
>>>>> ---------------------------------------------------------------------------
>>>>> --------------------------
>>>>>
>>>>> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
>>>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>>>> asanchez_at_ub.edu
>>>>> Statistics and Bioinformatics Unit
>>>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>>>> asanchez_at_ir.vhebron.net
>>>>> ---------------------------------------------------------------------------
>>>>> -------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>           
>>>> -- 
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> Douglas Lab
>>>> University of Michigan
>>>> Department of Human Genetics
>>>> 5912 Buhl
>>>> 1241 E. Catherine St.
>>>> Ann Arbor MI 48109-5618
>>>> 734-615-7826
>>>>
>>>>         
>>>       
>> -- 
>> James W. MacDonald, M.S.
>> Biostatistician
>> Douglas Lab
>> University of Michigan
>> Department of Human Genetics
>> 5912 Buhl
>> 1241 E. Catherine St.
>> Ann Arbor MI 48109-5618
>> 734-615-7826
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>     
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>