[BioC] Changes in annotations?

Sat Jul 11 05:54:10 CEST 2009

Thank you


> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Fri, 10 Jul 2009 08:46:36 -0700
> To: Loren Engrav <engrav at u.washington.edu>
> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] Changes in annotations?
> 
> Hi Loren,
> 
> I am working on this right now.  But it is really not as simple as it
> sounds.  I hope to have more to say about this very soon.
> 
> 
>   Marc
> 
> 
> 
> Loren Engrav wrote:
>> Thank you
>> 
>> Why is it not "best" to include everything Affy includes? All the synonyms
>> and IDs
>> 
>> Or is this not software possible?
>> 
>> 
>> 
>>   
>>> From: "James W. MacDonald" <jmacdon at med.umich.edu>
>>> Date: Thu, 09 Jul 2009 09:36:46 -0400
>>> To: Alex Sanchez <asanchez at ub.edu>
>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Subject: Re: [BioC] Changes in annotations?
>>> 
>>> Hi Alex,
>>> 
>>> We have taken this up as an internal topic of discussion and hopefully
>>> will be able to come up with a solution that is a bit better than what
>>> we are doing right now.
>>> 
>>> However, some of this is going to be unavoidable. The chips were based
>>> on UniGene build 133 (and we are now on build 219), so things are going
>>> to tend to move around. But I don't think this is the main reason for
>>> what you are seeing, as the latest Affy annotation file is still based
>>> on UCSC build hg18 (which is based on NCBI build 36.1), so these data
>>> are over three years old.
>>> 
>>> Instead, it appears that Affy have changed something about how they are
>>> annotating things. Some of this is an improvement, and some not so much.
>>> 
>>> For instance, 203315_at does interrogate NCK2:
>>> 
>>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=1365
>>> 10
>>> 693&hgt.out3=10x&position=chr2%3A105876577-105876826&hgtgroup_map_close=0&hg
>>> tg
>>> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_ex
>>> pr
>>> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgrou
>>> p_
>>> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&
>>> hg
>>> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAn
>>> dV
>>> ar_close=1
>>> 
>>> whereas 232583_at does not:
>>> 
>>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=1365
>>> 10
>>> 693&hgt.out3=10x&position=chr2%3A105752637-105752886&hgtgroup_map_close=0&hg
>>> tg
>>> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_ex
>>> pr
>>> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgrou
>>> p_
>>> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&
>>> hg
>>> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAn
>>> dV
>>> ar_close=1
>>> 
>>> so in this case they have improved their annotations.
>>> 
>>> In the case of 238900_at, they seem to have added in a bunch of EG IDs
>>> that are based on a model rather than being actual reviewed genes. In
>>> addition, that probeset seems to hit an intron, so adding in all this
>>> other annotation appears pointless:
>>> 
>>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=1365
>>> 11
>>> 081&hgt.out3=10x&position=chr6%3A32653723-32653972&hgtgroup_map_close=0&hgtg
>>> ro
>>> up_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expr
>>> es
>>> sion_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_
>>> va
>>> rRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hg
>>> tg
>>> roup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndV
>>> ar
>>> _close=1
>>> 
>>> What this really comes down to is the fact that you can't take anything
>>> for granted. Not only do you have to validate a set of significant
>>> results using some other technology, you also need to ensure that the
>>> chip you are using is actually measuring what you think prior to doing
>>> the validation.
>>> 
>>> It should be relatively painless to set up a pipeline where you could
>>> use the BSgenome.Hsapiens.UCSC.hg18 package along with functions from
>>> BSgenome/Biostrings to map probesets to the genome and then use the
>>> org.Hs.eg.db package to see if a given probeset is even interrogating
>>> the gene it is supposed to. One could then use rtracklayer to visualize
>>> things in the UCSC genome browser to make sure you weren't interrogating
>>> an intron.
>>> 
>>> Best,
>>> 
>>> Jim
>>> 
>>> 
>>> 
>>> 
>>> Alex Sanchez wrote:
>>>     
>>>> Hi James and thanks for the help.
>>>> 
>>>> I know, of course, that what Bioconductor does is to take the
>>>> annotations from public data sources.
>>>> I will now turn to affymetrix to see if someone there, or in their
>>>> forums, can explain why so many annotations have been turned into "NA's"
>>>> 
>>>> There seem to be two problems here
>>>> - The first one, pointed by Loren in his message today is synonims. For
>>>> instance the first probeset in my list was 238900_at,
>>>>   In the first version I used (BioC 1.9) the Gene Symbol provide by
>>>> getSymbol was HLA-DRB1 whereas now (BioC 2.4) it is LOC100133484
>>>>   However the original symbol is still in the Affymetrix annotation file
>>>> "HG-U133_Plus_2.na28.annot.csv". It seems as if the new symbol is intended
>>>>   to show the relation with the new Entrez ID (100133484).
>>>> In any case it is ennoying but it can be assumed.
>>>> - What seems worse is what happens to some annotations: If, for
>>>> instance,  I take the second probeset in my list, 232583_at, and I look
>>>> for it in the affy annotations file I find that apart some links to
>>>> databases as genebank it does not have a gene symbol or an entrez (or
>>>> most annotations anymore).
>>>> In the previous version of the annotations this probeset was one of two
>>>> associated with gene NCK2 (232583_at;203315_at) .
>>>> The problem is that the second probeset (203315_at) was not selected by
>>>> the analysis. That is the only evidence suggesting that gene NCK2 was
>>>> differentially expressed (232583_at) does not suggest it anymore.
>>>> 
>>>> Last, but in any case least, this is not happening to a few probesets.
>>>> My original list had 417 probesets. 210 have changed/synonimized their
>>>> gene symbols, but from these 210, 160 have become NA's
>>>> 
>>>> Seems too many to feel comfortable :-(
>>>> 
>>>> Thanks for the help
>>>> 
>>>> Alex
>>>> 
>>>> ...........................................................................
>>>> ..
>>>> ..................................
>>>> 
>>>> Dr. Alex  Sánchez.
>>>> Associate Professor. Statistics Department. University of Barcelona.
>>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>>> asanchez_at_ub.edu
>>>> Statistics and Bioinformatics Unit
>>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>>> asanchez_at_ir.vhebron.net
>>>> ...........................................................................
>>>> ..
>>>> ..................................
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message ----- From: "James W. MacDonald"
>>>> <jmacdon at med.umich.edu>
>>>> To: "Alex Sanchez" <asanchez at ub.edu>
>>>> Cc: <bioconductor at stat.math.ethz.ch>
>>>> Sent: Monday, July 06, 2009 3:58 PM
>>>> Subject: Re: [BioC] Changes in annotations?
>>>> 
>>>> 
>>>>       
>>>>> Hi Alex,
>>>>> 
>>>>> This is a question that comes up on the Bioc list fairly regularly,
>>>>> and the answer is in two parts:
>>>>> 
>>>>> First, the annotations supplied in the various metadata packages
>>>>> supplied by BioC are *not* our annotations, but are simply a
>>>>> re-packaging of data we collect from various sources. As an example,
>>>>> we use the mappings of Affymetrix Probe ID to Entrez Gene ID from the
>>>>> annotation csv files you can download from the Affy website. We then
>>>>> map the Entrez Gene IDs to other annotation using primarily NCBI data.
>>>>> So if you go to Affy's netaffx site (free registration required) and
>>>>> query on say, 238900_at, you get this:
>>>>> 
>>>>> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLU
>>>>> S_
>>>>> 2%3A238900_AT
>>>>> 
>>>>> 
>>>>> And you will note that the first Entrez Gene ID listed there is
>>>>> 100133484, which happens to be a defunct ID. However, this is the
>>>>> first of many listed there (and we need a one-to-one mapping), so we
>>>>> chose that one. A more likely Entrez Gene ID can be found further down
>>>>> the list, but we simply don't have the resources to figure out if
>>>>> there is a better choice in that list (for every reporter on every
>>>>> Affy chip we annotate). Nor do we have the resources to ensure that
>>>>> any of the mappings that Affy make are reasonable to begin with. We
>>>>> have to trust that they (with *way* more resources that us) are doing
>>>>> a reasonable job.
>>>>> 
>>>>> The second part of the answer has to do with the 'moving target'
>>>>> aspect of Biological annotations. These data change all the time, and
>>>>> there is the recurring question of whether one should do an analysis
>>>>> and 'freeze' it to that point in time, or should the annotations be
>>>>> updated on a regular basis, with the realization that things can and
>>>>> will change?
>>>>> 
>>>>> Without looking at each reporter ID you list, I can't say if the
>>>>> changes are due to Affy changing their annotation csv files, or to
>>>>> changing knowledge of the genome, but I suspect it is a combination of
>>>>> the two.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Jim
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Alex Sanchez wrote:
>>>>>         
>>>>>> Hello
>>>>>> 
>>>>>> I have had to review recently an analysis I did some time ago. This
>>>>>> was done on affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9 I
>>>>>> have re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>>>>>> I have been surprised by the changes in the annotations: Many
>>>>>> probesets that had had an annotation have become NA's whereas some
>>>>>> have changed their symbol and their Entrez gene.
>>>>>> 
>>>>>> To be specific I summarize my question with the top genes of my list
>>>>>> 
>>>>>> The list I obtained 2 years ago is:
>>>>>> 
>>>>>> probeset    locuslink    symbol
>>>>>> 238900_at 3123 HLA-DRB1
>>>>>> 232583_at 8440 NCK2
>>>>>> 236307_at 60468 BACH2
>>>>>> 223620_at 2857 GPR34
>>>>>> 219759_at 64167 LRAP
>>>>>> 201702_s_at 5514 PPP1R10
>>>>>> 232882_at 2308 FOXO1A
>>>>>> 213446_s_at 8826 IQGAP1
>>>>>> 234033_at 9693 RAPGEF2
>>>>>> 243006_at 2534 FYN
>>>>>> 244648_at 54520 CCDC93
>>>>>> 243691_at 23142 DCUN1D4
>>>>>> 239264_at 60412 EXOC4
>>>>>> 243546_at 143686 SESN3
>>>>>> 205239_at 374 AREG
>>>>>> 1565703_at 55520 ELAC1
>>>>>> 244061_at 55843 ARHGAP15
>>>>>> 230505_at 26037 SIPA1L1
>>>>>> 242688_at 9320 TRIP12
>>>>>> 1556474_a_at 285097 FLJ38379
>>>>>> 232614_at 596 BCL2
>>>>>> 1565689_at 3839 KPNA3
>>>>>> 236685_at NA NA
>>>>>> 225173_at 93663 ARHGAP18
>>>>>> 241893_at 4249 MGAT5
>>>>>> 
>>>>>> I used the following code to reproduce the issue with the annotations:
>>>>>> 
>>>>>> 
>>>>>> #####################################################################
>>>>>> ## Verification using R 2.9 & BioC 2.4
>>>>>> #####################################################################
>>>>>> 
>>>>>>           
>>>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>>>> "219759_at" ,
>>>>>>>            
>>>>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",
>>>>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" ,
>>>>>> "243546_at" ,   "205239_at" ,
>>>>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,
>>>>>> "1556474_a_at",
>>>>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,
>>>>>> "241893_at")
>>>>>>           
>>>>>>> library(hgu133plus2.db)
>>>>>>> library(annotate)
>>>>>>> 
>>>>>>> entrezs<- getEG(probes, "hgu133plus2")
>>>>>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>>>>>> sel2<- cbind(probes, entrezs, symbols)
>>>>>>> sel2
>>>>>>>            
>>>>>>              probes         entrezs     symbols       238900_at
>>>>>> "238900_at"    "100133484" "LOC100133484"
>>>>>> 232583_at    "232583_at"    NA          NA           236307_at
>>>>>> "236307_at"    NA          NA           223620_at    "223620_at"
>>>>>> "2857"      "GPR34"       219759_at    "219759_at"    "64167"
>>>>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"
>>>>>> 232882_at "232882_at"    NA          NA           213446_s_at
>>>>>> "213446_s_at" "8826"      "IQGAP1"      234033_at    "234033_at"
>>>>>> NA          NA 243006_at    "243006_at"    NA          NA
>>>>>> 244648_at "244648_at"    NA          NA           243691_at
>>>>>> "243691_at"    NA NA           239264_at    "239264_at"
>>>>>> NA          NA 243546_at    "243546_at"    NA          NA
>>>>>> 205239_at "205239_at"    "374"       "AREG"        1565703_at
>>>>>> "1565703_at" "4089"      "SMAD4"       244061_at    "244061_at"
>>>>>> NA          NA 230505_at    "230505_at"    "145474"    "LOC145474"
>>>>>> 242688_at "242688_at"    NA          NA           1556474_a_at
>>>>>> "1556474_a_at" "285097"    "FLJ38379"    232614_at    "232614_at"
>>>>>> NA          NA 1565689_at   "1565689_at"   NA          NA
>>>>>> 236685_at "236685_at"    NA          NA           225173_at
>>>>>> "225173_at" "93663"     "ARHGAP18"    241893_at    "241893_at"
>>>>>> NA          NA
>>>>>>           
>>>>>>> sessionInfo()
>>>>>>>            
>>>>>> R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale:
>>>>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>>>>> States.1252;LC_MONETARY=English_United
>>>>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>>>> 
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>> other attached packages:
>>>>>> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1
>>>>>> DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] xtable_1.5-5
>>>>>> #############################################
>>>>>> 
>>>>>> Many probesets seem to have changed.
>>>>>> Can someone explain to me what is happening (or what may I be doing
>>>>>> wrong)?
>>>>>> 
>>>>>> The same code does not work with R 2.4 but if I change hgu133plus2.db
>>>>>> by hgu133plus2 and getEG by getLL I obtain the original results:
>>>>>> 
>>>>>> ###############################################
>>>>>> ### Review of annotatons with R 2.4 and BioC 1.9
>>>>>> ###############################################
>>>>>> 
>>>>>> ### This code is executed on a clean new session with R 2. and BioC 1.9
>>>>>> 
>>>>>>           
>>>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>>>> "219759_at" ,
>>>>>>>            
>>>>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",
>>>>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" ,
>>>>>> "243546_at" ,   "205239_at" ,
>>>>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,
>>>>>> "1556474_a_at",
>>>>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,
>>>>>> "241893_at")
>>>>>>           
>>>>>>> LLs<- getLL(rownames(sel), "hgu133plus2")
>>>>>>> symbols<- getSYMBOL(rownames(sel), "hgu133plus2")
>>>>>>> sel1<- cbind(probes, LLs, symbols)
>>>>>>> sel1
>>>>>>>            
>>>>>>              probes         LLs      symbols    238900_at
>>>>>> "238900_at" "3123"   "HLA-DRB1" 232583_at    "232583_at"    "8440"
>>>>>> "NCK2" 236307_at    "236307_at"    "60468"  "BACH2"    223620_at
>>>>>> "223620_at" "2857"   "GPR34"    219759_at    "219759_at"    "64167"
>>>>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"  232882_at
>>>>>> "232882_at" "2308"   "FOXO1"    213446_s_at  "213446_s_at"  "8826"
>>>>>> "IQGAP1" 234033_at    "234033_at"    "9693"   "RAPGEF2"  243006_at
>>>>>> "243006_at" "2534"   "FYN"      244648_at    "244648_at"    "54520"
>>>>>> "CCDC93" 243691_at    "243691_at"    "23142"  "DCUN1D4"  239264_at
>>>>>> "239264_at" "60412"  "EXOC4"    243546_at    "243546_at"    "143686"
>>>>>> "SESN3" 205239_at    "205239_at"    "374"    "AREG"     1565703_at
>>>>>> "1565703_at" "4089"   "SMAD4"    244061_at    "244061_at"    "55843"
>>>>>> "ARHGAP15" 230505_at    "230505_at"    "145474" "LOC145474"
>>>>>> 242688_at    "242688_at"    "9320"   "TRIP12"   1556474_a_at
>>>>>> "1556474_a_at" "285097" "FLJ38379" 232614_at    "232614_at"    "596"
>>>>>> "BCL2"     1565689_at   "1565689_at"   "3839"   "KPNA3"    236685_at
>>>>>> "236685_at"    NA       NA         225173_at    "225173_at"
>>>>>> "93663" "ARHGAP18" 241893_at    "241893_at"    "4249"   "MGAT5"
>>>>>>           
>>>>>>> sessionInfo()
>>>>>>>            
>>>>>> R version 2.4.1 (2006-12-18) i386-pc-mingw32 locale:
>>>>>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spa
>>>>>> ni
>>>>>> sh_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
>>>>>> 
>>>>>> 
>>>>>> attached base packages:
>>>>>> [1] "tools"     "stats"     "graphics"  "grDevices"
>>>>>> [5] "utils"     "datasets"  "methods"   "base"     other attached
>>>>>> packages:
>>>>>>    annotate     Biobase hgu133plus2 "1.12.1"    "1.12.2"    "1.14.0"
>>>>>> ########################################################
>>>>>> 
>>>>>> In summary. If I use R 2.4/BioC 1.9 I obtain the same results I
>>>>>> ibtained 2 years ago, but If I do the same steps using R2.9/BioC2.4
>>>>>> the results change dramatically.
>>>>>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in
>>>>>> R 2.8 (results not shown here). BioC 2.0 yield the same as 1.9 and
>>>>>> BioC 2.2 the same as 2.4,
>>>>>> 
>>>>>> Any help to understand what's happening would be appreciated
>>>>>> 
>>>>>> Alex Sanchez
>>>>>> 
>>>>>> -------------------------------------------------------------------------
>>>>>> --
>>>>>> --------------------------
>>>>>> 
>>>>>> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
>>>>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>>>>> asanchez_at_ub.edu
>>>>>> Statistics and Bioinformatics Unit
>>>>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>>>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>>>>> asanchez_at_ir.vhebron.net
>>>>>> -------------------------------------------------------------------------
>>>>>> --
>>>>>> -------------------------
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> [[alternative HTML version deleted]]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ------------------------------------------------------------------------
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>           
>>>>> -- 
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Douglas Lab
>>>>> University of Michigan
>>>>> Department of Human Genetics
>>>>> 5912 Buhl
>>>>> 1241 E. Catherine St.
>>>>> Ann Arbor MI 48109-5618
>>>>> 734-615-7826
>>>>> 
>>>>>         
>>>>       
>>> -- 
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> Douglas Lab
>>> University of Michigan
>>> Department of Human Genetics
>>> 5912 Buhl
>>> 1241 E. Catherine St.
>>> Ann Arbor MI 48109-5618
>>> 734-615-7826
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>     
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>>   
>