[BioC] unmapped keys in hugene10stprobeset.db

Mark Cowley m.cowley at garvan.org.au
Tue Aug 17 02:05:06 CEST 2010

hi Paul & Marc,
in addition to the thousands of control probes, there are non-protein-coding genes on these arrays - things like snoRNAs and precursor microRNAs which might not have a classical gene symbol.
I find that the mrna_assignment column from the Affy csv has a lot more information for these genes than the BioC annotation packages, so I'll do what you suggest Marc & try to 'improve' the mapping via the SQLForge code. I've done a fair amount of the groundwork on this already, so who could I communicate these changes to?
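
A minimal sketch of how one cell of the mrna_assignment column could be split up, in base R only. It assumes the usual Affy csv layout, where " /// " separates assignments, " // " separates the fields within an assignment (accession first, then source, then description), and "---" marks an empty cell. The function name parse_mrna_assignment and the sample string are hypothetical, made up for illustration and not taken from the real file.

```r
# Hypothetical example cell in the style of the 'mrna_assignment' column.
# The accessions and descriptions are invented for illustration only.
mrna_assignment <- paste(
  "NR_000001 // RefSeq // example small nucleolar RNA // chr1 // 100 // 100 // 25 // 25 // 0",
  "ENST00000000001 // ENSEMBL // example precursor microRNA // chr1 // 100 // 100 // 25 // 25 // 0",
  sep = " /// "
)

# Split one cell into a data frame with one row per assignment.
# Assumption: " /// " separates assignments, " // " separates fields,
# and "---" (or NA) means there is no assignment at all.
parse_mrna_assignment <- function(x) {
  if (is.na(x) || x == "---") {
    return(data.frame(accession = character(0),
                      source = character(0),
                      description = character(0),
                      stringsAsFactors = FALSE))
  }
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(entries, " // ", fixed = TRUE)
  data.frame(
    accession   = vapply(fields, function(f) f[1], character(1)),
    source      = vapply(fields, function(f) f[2], character(1)),
    description = vapply(fields, function(f) f[3], character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_mrna_assignment(mrna_assignment)
print(parsed[, c("accession", "source")])
```

The accessions recovered this way would still need to be mapped onto Entrez Gene IDs before they could go into an SQLForge input file; that step is not shown here.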

On 17/08/2010, at 9:26 AM, Marc Carlson wrote:

> Hi Paul,
> I looked into this for you.  Often there will be discrepancies like this
> for purely historical reasons.  For example, Affy may have made the
> probes based on one idea about what the transcriptome looked like and
> then this could have changed by the time they shipped their product. 
> That kind of discrepancy happens all the time and especially with older
> chips.  But in your case, you also seem to have a lot of control probes
> on this platform. 
> You can extract the unmatched probes like this:
> library (hugene10stprobeset.db)
> a = hugene10stprobesetENTREZID
> # probeset IDs that have no Entrez Gene mapping
> oddProbes = keys(a)[! (keys(a) %in% mappedkeys(a))]
> I actually pulled down the .csv mapping from Affymetrix that Arthur Li
> would have used to generate this database.  And I noticed that all the
> oddProbes I was looking at were control probes.  In fact, more than 4
> thousand of these probes are control probes.  Looking more closely at
> this file, you will see that many, many other probes have no gene
> mapping to them even though they are not listed as control probes.  What
> is going on with some of those probesets?  Why has Affy refused to
> assign an identity to those ones?  That is really more of a question for
> Affymetrix than for us.
> When we map these IDs to make annotation packages, we look for known
> gene IDs from the manufacturer (unigene, refseq etc.), and we then map
> those onto entrez gene IDs from NCBI and from there onto other
> annotations.  But if the people who make the array are not willing to
> tell us what these things map to then we could really only speculate
> about what they are.
> But, if you have some external information that helps you to decide what
> these probes really map to, (maybe you have mapped the probesets onto
> the genome yourself or else maybe you feel that you can extract a little
> more data out of Affy's .csv file than this author did), then in that
> case you can always feed that "improved" mapping into the SQLForge code
> in the AnnotationDbi package and generate your very own version of this
> annotation package.  It is pretty straightforward to do so and is
> described in the SQLForge vignette here:
> http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
> I hope this helps explain things,
>  Marc
> On 08/16/2010 02:17 PM, Paul Shannon wrote:
>> Here's an annotation question someone might be able to help me out with.  I'll be grateful.
>> Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
>> Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content. 
>> This sounds to me like Affy started with sequence from exons of ~29k genes and created probes.
>> But when I look at the bioc annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probes are NOT annotated to geneIDs.  The sibling array, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped probes.
>>    library (hugene10stprobeset.db)
>>    library (hugene10sttranscriptcluster.db)
>>    bm = hugene10stprobesetENTREZID
>>    length (keys (bm))    #  257022
>>    count.mappedkeys (bm) #  238141
>>                 # unmapped:  18881
>>    cm = hugene10sttranscriptclusterENTREZID
>>    length (keys (cm))    #  33257
>>    count.mappedkeys (cm) #  21787
>> The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.
>> Can anyone suggest where I can get entrez geneID annotations for these unmapped probes?   Or otherwise clear up my confusion? 
>> Thanks!
>>  - Paul
