[BioC] confusion about nomenclature of illumina annotationpackages

Tue Jul 22 01:28:41 CEST 2008

Hi Paul,
I used BLAST results for the 2006 Mouse assembly (mm8) from a group at 
Cambridge. 
http://www.compbio.group.cam.ac.uk/Resources/Annotation/
The ACCNUM part of the annotation package is what I used to search 
NCBI.  I only searched probes which had 100% similarity with RefSeq 
nucleotides.  For probes with 100% similarity to other nucleotides, the 
ACCNUM gives that information but none of the other NCBI info.  

If you look at ACCNUM instead of ENTREZID, you will see the following 
for the probe IDs you list below:
 > accnum <- as.list(illuminaMousev1p1ProbeIDACCNUM)
 > accnum[c("106220736","130309","4730088")]
$`106220736`
[1] "XM_130523"

$`130309`
[1] "XM_139262"

$`4730088`
[1] "XM_132882"

All of these IDs have either been removed from NCBI or changed to 
NM_xxxxx.  I don't know when the change occurred but certainly if you 
BLAST them on NCBI today, you will see the more recent annotation.  This 
is a weakness of my strategy.
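
One way to spot probes affected by this is to look for an XM_ accession
(a predicted, model RefSeq mRNA record) paired with a missing Entrez ID.
The following is only a minimal sketch in base R. The vectors are typed in
by hand from the values quoted in this thread, and the names `entrez`,
`is_model` and `stale` are made up for illustration; in practice the values
would come from the ACCNUM and ENTREZID maps of the annotation package.

```r
# Accessions and Entrez IDs copied from this thread (not queried live).
accnum <- c("106220736" = "XM_130523",
            "130309"    = "XM_139262",
            "4730088"   = "XM_132882")
entrez <- c("106220736" = NA_character_,
            "130309"    = NA_character_,
            "4730088"   = NA_character_)

# XM_ marks a predicted (model) RefSeq mRNA record, the kind most likely
# to have been withdrawn or replaced by an NM_ record since the BLAST run.
is_model <- grepl("^XM_", accnum)

# Probes worth re-checking: model accession and no Entrez ID assigned.
stale <- names(accnum)[is_model & is.na(entrez[names(accnum)])]
print(stale)
```

Probes flagged this way are candidates for a fresh BLAST or accession
lookup at NCBI; the check itself says nothing about what the current
record is.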

The good news is that Marc Carlson has made AnnotationDbi much easier to 
use so you can make your own annotation package with accession IDs that 
you think are the most current and useful.

Hope this is helpful,
Lynn

Paul Leo wrote:
> Hi Lynn,
> This was just a quick off-line check. I have noticed that the R3
> Illumina-derived annotations appear to assign many more probes than the
> probe ID package. The ones I have checked are in the 3'UTR and are
> perfect, unique matches, so I am not sure why they were not
> assigned. For example:
>
> LL<-mget(rownames(all.num), env=illuminaMousev1p1ProbeIDENTREZID,
> ifnotfound=NA)
>
> 106220736 101660139    130309 103710092   4730088 
>        NA  "110380"        NA   "66425"        NA
>
>
> Using the Illumina annotations I get:
>
>
> 106220736 101660139    130309 103710092   4730088
>   "52837"  "110380"  "239273"   "66425"   "12515"
>
> Analysis of first one 106220736:
>
> >ref|NT_039207.7|Mm2_39247_37 Mus musculus chromosome 2 genomic contig, strain C57BL/6J
> Length=116366104
>
>  Features flanking this part of subject sequence:
>    40283 bp at 5' side: hydroxyacid oxidase 1, liver
>    3861 bp at 3' side: thioredoxin domain containing 13
>
>
>  Score = 99.6 bits (50),  Expect = 9e-20
>  Identities = 50/50 (100%), Gaps = 0/50 (0%)
>  Strand=Plus/Minus
>
> Query  1         CCCTGAGCCTATGTACTGCCATTTGATTGTCAGGTTCCATTTGTGTCAAA  50
>                  ||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct  75461702  CCCTGAGCCTATGTACTGCCATTTGATTGTCAGGTTCCATTTGTGTCAAA  75461653
>
>
>
> Just a heads up, no need to respond. Also, I have noticed that in
> HT-sequencing analyses people are getting many more hits to the genome by
> using multiple reference genomes (the reference, Celera, and the other 3-4
> now available), especially for human, where fragments are about 35-50 bp.
> My apologies if this is something you have already considered.
>
> Cheers
> Paul
>
> > sessionInfo()
> R version 2.7.1 RC (2008-06-16 r45949) 
> i386-pc-mingw32 
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods
> [8] base     
>
> other attached packages:
> [1] illuminaMousev1p1ProbeID.db_1.1.1 AnnotationDbi_1.2.2              
> [3] RSQLite_0.6-9                     DBI_0.2-4                        
> [5] Biobase_2.0.1                     limma_2.14.5                     
>   
>
>
>
>
> Dr Paul Leo
> Bioinformatician
> Diamantina Institute for Cancer, Immunology and Metabolic Medicine
> Tel: +61 7 3240 7740  Mob: 041 303 8691 Fax: +61 7 3240 5946
>
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Lynn Amon
> Sent: Saturday, July 19, 2008 3:45 AM
> To: Christian.Kohler at klinik.uni-regensburg.de
> Cc: BioC
> Subject: Re: [BioC] confusion about nomenclature of illumina
> annotationpackages
>
> Christian,
>
> The v in the Illumina annotation package names refers to the Illumina 
> version.  I apologize for not describing this better on the Bioconductor
> site.  All versions include the probes for the WG6 formats but the 
> HumanRefSeq8 probes are a subset of WG6 so you can use these annotation 
> packages for either format.  Rat does not have a WG6 counterpart.
> For human, there is v1, v2, and v3. 
> For mouse, there is v1, v1.1, and v2.  Because I can't use an extra "." 
> in the name, v1.1 is called illuminaMousev1p1ProbeID.db
> For rat, there is just v1 so far.
> You must match the version of your Illumina array with the version of the 
> annotation package because many if not all the probe IDs will be 
> different.  If you have an Illumina Human v2 array, you must use 
> illuminaHumanv2ProbeID.db. 
> I have not seen any Illumina Human chip with a V1.x type of name so I'm 
> not sure what you are referring to.  Maybe you are thinking of the mouse 
> v1.1 chip?
> The illuminaHumanv1.db and illuminaHumanv2.db packages were made 
> independently by the Biocore Data team for Illumina Human v1 and v2 
> chips.  They will be a little different than illuminaHumanv1ProbeID.db 
> and illuminaHumanv2ProbeID.db because they used illumina annotation 
> whereas I used results from BLASTing the probe sequences against the 
> current NCBI assembly.  They might also have used different probe 
> identifiers than I did.
>
> I hope this helps clear up the confusion,
> Lynn
>
>
>
> Christian Kohler wrote:
>   
>> Dear list,
>>
>> I am a bit confused about the nomenclature of the BioC annotation
>> packages of Illumina BeadChips;
>> I'm not sure if I am in the know.
>>
>> Let me take the 'illuminaHumanv2.db' package as an example (mainly the
>> 'v2'-part of the nomenclature), which annotates the Illumina HumanRef-8
>> BeadChip. Does 'v2' in general indicate the "chip type" (here RefSeq
>> Bead Chip [if so, then does "v1" stand for WG6 chips?]) or the "chip
>> version" (HumanRef-8 V2)?
>> I mean, if it indicates the version, then the packages
>> 'illuminaHumanv1.db' annotates e.g. HumanRef-8 V1.x Bead Chips, right?
>> But why is the description in BioCViews of 'illuminaHumanv1.db'
>> "...Illumina HumanRef 6..."?
>> Another point of confusion is that if I have a dataset where some
>> samples originate from a Ref8 V1.1 platform and some samples originate
>> from a Ref8 V2 platform, would it be advisable to annotate the first
>> samples with the *.v1.db and the latter with the *.v2.db package or can
>> I use *.v2.db for all samples?
>>  
>> I am lost upon these questions.
>>
>> I am thankful for any help.
>>
>> Christian
>>
>>
>>
>>     
>