[BioC] unmapped keys in hugene10stprobeset.db
mcarlson at fhcrc.org
Tue Aug 17 01:26:20 CEST 2010
I looked into this for you. Often there will be discrepancies like this
for purely historical reasons. For example, Affy may have made the
probes based on one idea about what the transcriptome looked like and
then this could have changed by the time they shipped their product.
That kind of discrepancy happens all the time and especially with older
chips. But in your case, you also seem to have a lot of control probes
on this platform.
You can extract the unmatched probes like this:
library(hugene10stprobeset.db)
a = hugene10stprobesetENTREZID
# keep only the keys that have no Entrez Gene ID mapped to them
oddProbes = keys(a)[! (keys(a) %in% mappedkeys(a))]
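The same selection can be written with setdiff(), and it is often handy to see what fraction of the keys is unmapped. Here is a minimal sketch in base R; the helper name unmapped_summary is hypothetical (it is not part of AnnotationDbi), and the toy vectors only stand in for real probeset IDs:

```r
# Hypothetical helper (not part of AnnotationDbi): summarise unmapped keys.
# all_keys    : character vector, e.g. keys(a)
# mapped_keys : character vector, e.g. mappedkeys(a)
unmapped_summary <- function(all_keys, mapped_keys) {
  odd <- setdiff(all_keys, mapped_keys)  # keys that have no mapping
  list(
    unmapped   = odd,
    n_total    = length(all_keys),
    n_unmapped = length(odd),
    fraction   = length(odd) / length(all_keys)
  )
}

# Toy data standing in for real probeset IDs
res <- unmapped_summary(c("p1", "p2", "p3", "p4"), c("p1", "p3"))
print(res$unmapped)
print(res$fraction)
```

With the annotation package loaded, this would be called as unmapped_summary(keys(a), mappedkeys(a)). For the counts quoted below (257022 keys, 238141 of them mapped), the unmapped fraction works out to roughly 7%.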
I actually pulled down the .csv mapping from Affymetrix that Arthur Li
would have used to generate this database. And I noticed that all the
oddProbes I was looking at were control probes. In fact, more than 4
thousand of these probes are control probes. Looking more closely at
this file, you will see that many, many other probes have no gene
mapping to them even though they are not listed as control probes. What
is going on with some of those probesets? Why has Affy refused to
assign an identity to those? That is really more of a question for
Affymetrix than for us.
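If you want to repeat that check yourself, something along these lines should work. It is only a sketch: the helper tabulate_unmapped is hypothetical, and the column names "probeset_id" and "category" as well as the category values in the toy table are assumptions, so check them against the header of the .csv you actually download.

```r
# Hypothetical helper: tabulate the annotation category of unmapped probesets.
# annot      : data.frame read from the manufacturer's annotation .csv
# odd_probes : character vector of unmapped probeset IDs (e.g. oddProbes)
# id_col and category_col are ASSUMED column names; adjust to your file.
tabulate_unmapped <- function(annot, odd_probes,
                              id_col = "probeset_id",
                              category_col = "category") {
  hit <- annot[[id_col]] %in% odd_probes
  table(annot[[category_col]][hit], useNA = "ifany")
}

# Toy annotation table standing in for the real .csv (values are made up)
annot <- data.frame(
  probeset_id = c("ps1", "ps2", "ps3", "ps4"),
  category    = c("control", "control", "main", "main"),
  stringsAsFactors = FALSE
)
tab <- tabulate_unmapped(annot, c("ps1", "ps2", "ps4"))
print(tab)
```

For the real file you would first build annot with read.csv(), probably with comment.char = "#" if the file starts with "#" header lines, and then pass oddProbes as the second argument.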
When we map these IDs to make annotation packages, we look for known
gene IDs from the manufacturer (unigene, refseq etc.), and we then map
those onto entrez gene IDs from NCBI and from there onto other
annotations. But if the people who make the array are not willing to
tell us what these things map to then we could really only speculate
about what they are.
But if you have some external information that helps you decide what
these probes really map to (maybe you have mapped the probesets onto
the genome yourself, or maybe you feel that you can extract a little
more data out of Affy's .csv file than this author did), then you can
always feed that "improved" mapping into the SQLForge code
in the AnnotationDbi package and generate your very own version of this
annotation package. It is pretty straightforward to do so and is
described in the SQLForge vignette.
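As a rough sketch of that workflow: SQLForge takes a tab-delimited, two-column file without a header (probe ID first, then the gene ID). The helper write_sqlforge_input below is hypothetical and only writes such a file from a data frame; the column names probe_id and entrez_id and the toy mapping are assumptions for illustration. The commented makeDBPackage() call follows the pattern used in the SQLForge vignette, and every argument value in it is a placeholder.

```r
# Hypothetical helper: write a probe-to-gene mapping as a two-column,
# tab-delimited file with no header (probe ID, then Entrez Gene ID).
write_sqlforge_input <- function(mapping, path) {
  stopifnot(all(c("probe_id", "entrez_id") %in% names(mapping)))
  write.table(mapping[, c("probe_id", "entrez_id")], file = path,
              sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(path)
}

# Toy mapping standing in for your own "improved" probe-to-gene table
mapping <- data.frame(probe_id  = c("ps1", "ps2"),
                      entrez_id = c("1017", "7157"),
                      stringsAsFactors = FALSE)
out_file <- write_sqlforge_input(mapping, tempfile(fileext = ".txt"))
lines <- readLines(out_file)
print(lines)

# With AnnotationDbi and the human.db0 package installed, a call along
# these lines would then build the package (all values are placeholders):
# makeDBPackage("HUMANCHIP_DB", affy = FALSE,
#               prefix = "myhugene10st",
#               fileName = out_file,
#               baseMapType = "eg",   # the file holds Entrez Gene IDs
#               outputDir = tempdir(),
#               version = "0.0.1",
#               manufacturer = "Affymetrix",
#               chipName = "Human Gene 1.0 ST (custom mapping)",
#               manufacturerUrl = "http://www.affymetrix.com")
```

The vignette remains the authoritative description of the arguments; treat the call above as a starting point rather than a tested recipe.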
I hope this helps explain things,
On 08/16/2010 02:17 PM, Paul Shannon wrote:
> Here's an annotation question someone might be able to help me out with. I'll be grateful.
> Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
> Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content.
> This sounds to me like affy started with sequence from exons of ~29k genes and created probes.
> But when I look at the bioc annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probes are NOT annotated to geneIDs. The sibling array, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped probes.
> library (hugene10stprobeset.db)
> library (hugene10sttranscriptcluster.db)
> bm = hugene10stprobesetENTREZID
> length (keys (bm)) # 257022
> count.mappedkeys (bm) # 238141
> # unmapped: 18881
> cm = hugene10sttranscriptclusterENTREZID
> length (keys (cm)); # 33257
> count.mappedkeys (cm) # 21787
> The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.
> Can anyone suggest where I can get entrez geneID annotations for these unmapped probes? Or otherwise clear up my confusion?
> - Paul