[BioC] unmapped keys in hugene10stprobeset.db
pshannon at systemsbiology.org
Mon Aug 16 23:17:54 CEST 2010
Here's an annotation question someone might be able to help me out with. I'll be grateful.
Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content.
This sounds to me as though Affymetrix started with exon sequence from ~29k genes and designed probes against it, so I would expect nearly every probe to map to a gene.
But when I look at the Bioconductor annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probeset keys are NOT annotated to Entrez Gene IDs. The sibling map, hugene10sttranscriptclusterENTREZID, though smaller, has a much higher proportion of unmapped keys (about 34%).
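library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)
bm <- hugene10stprobesetENTREZID
length(keys(bm))        # 257022 keys in total
count.mappedkeys(bm)    # 238141 mapped
# unmapped: 18881 (257022 - 238141)
cm <- hugene10sttranscriptclusterENTREZID
length(keys(cm))        # 33257 keys in total
count.mappedkeys(cm)    # 21787 mapped
# unmapped: 11470 (33257 - 21787)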
The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.
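For reference, the proportions can be checked directly from the counts quoted above. This is only a minimal sketch in base R: the data frame `counts` and its column names are made up for illustration, and the numbers are the ones reported in this post, so they may differ in other versions of the annotation packages.

```r
# Key counts as reported in this post (not re-queried from the packages).
counts <- data.frame(
  map    = c("hugene10stprobesetENTREZID",
             "hugene10sttranscriptclusterENTREZID"),
  total  = c(257022, 33257),   # length(keys(x))
  mapped = c(238141, 21787)    # count.mappedkeys(x)
)

# Unmapped keys are simply total minus mapped.
counts$unmapped <- counts$total - counts$mapped

# Percentage of all keys that have no Entrez Gene ID.
counts$pct_unmapped <- round(100 * counts$unmapped / counts$total, 1)

print(counts)

# With the annotation packages loaded, the unmapped IDs themselves could
# be listed with AnnotationDbi's keys() and mappedkeys(), for example:
#   unmapped_ids <- setdiff(keys(bm), mappedkeys(bm))
```

Working from the unmapped IDs rather than the counts would make it possible to look those probesets up elsewhere, which is what the question below is really after.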
Can anyone suggest where I can get entrez geneID annotations for these unmapped probes? Or otherwise clear up my confusion?