[BioC] unmapped keys in hugene10stprobeset.db

Mon Aug 16 23:17:54 CEST 2010

Here's an annotation question someone might be able to help me out with.  I'll be grateful.

Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':

Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content. 

This sounds to me like affy started with sequence from exons of ~29k genes and created probes.  
But when I look at the bioc annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probeset keys are NOT annotated to Entrez Gene IDs.  The sibling map, hugene10sttranscriptclusterENTREZID, though smaller, has a much higher proportion of unmapped keys (roughly 34%).

    library(hugene10stprobeset.db)
    library(hugene10sttranscriptcluster.db)
    bm <- hugene10stprobesetENTREZID
    length(keys(bm))       # 257022
    count.mappedkeys(bm)   # 238141
                           # unmapped: 18881
    cm <- hugene10sttranscriptclusterENTREZID
    length(keys(cm))       # 33257
    count.mappedkeys(cm)   # 21787
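
To make the proportions explicit, here is a small sketch in plain R that only redoes the arithmetic on the four counts reported above. The variable names are my own and purely illustrative; nothing here queries the annotation packages.

```r
# Counts copied from the session above; variable names are illustrative only.
probeset_total  <- 257022
probeset_mapped <- 238141
cluster_total   <- 33257
cluster_mapped  <- 21787

# Keys with no Entrez Gene ID in each map.
probeset_unmapped <- probeset_total - probeset_mapped
cluster_unmapped  <- cluster_total - cluster_mapped

# Unmapped keys as a fraction of all keys in each map.
probeset_frac <- probeset_unmapped / probeset_total
cluster_frac  <- cluster_unmapped / cluster_total

# Report both as percentages, rounded to one decimal place.
print(round(100 * c(probeset = probeset_frac,
                    transcriptcluster = cluster_frac), 1))
```

On these numbers the transcript-cluster map leaves a far larger share of its keys unmapped than the probeset map does.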

The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.

Can anyone suggest where I can get Entrez Gene ID annotations for these unmapped probes?  Or otherwise clear up my confusion?

Thanks!

  - Paul