[BioC] Unambiguously mapping of affy IDs to gene symbols using hgu133plus2.db

Fri Oct 1 18:07:07 CEST 2010

Hi Christian,

What appears to be a simple mapping from probesets to gene symbols is
actually slightly more complex.  Behind the scenes, the annotation
package stores two relationships: one from probesets to gene IDs, and
one from gene IDs to gene symbols.  This matters because many probesets
can map to a single gene, a single probeset can map to many genes, and
many gene symbols can map to a single gene.  So the first relationship
is potentially many (probesets) to many (genes), and the second is many
(symbols) to one (gene).

Why then does it look simpler than that?

In the annotation packages we hide, by default, probesets that map to
more than one gene.  This is because most of the time you probably
don't want anything to do with probesets that are not specific.  But on
the off chance that you really want to see those, you can expose them
using the toggleProbes() method.  So usually the first relationship is
actually many (probesets) to one (gene).
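
As a rough sketch of what exposing the hidden probesets looks like,
assuming hgu133plus2.db is installed (the requireNamespace() guard is
only there so the snippet does nothing when it is not, and the variable
names x and x_all are just for illustration):

```r
# Sketch only: assumes hgu133plus2.db is installed; skipped otherwise.
if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  library(hgu133plus2.db)

  x     <- hgu133plus2SYMBOL        # default view: multi-gene probesets hidden
  x_all <- toggleProbes(x, "all")   # expose probesets that hit several genes

  # Number of symbols per probeset, before and after exposing them
  print(table(nhit(x)))
  print(table(nhit(x_all)))

  # The probeset from the question, looked up in the unfiltered map
  print(mget("203074_at", x_all, ifnotfound = NA))
}
```

Comparing the two tables should show how many probesets the default
filter was hiding; the exact counts depend on the package version.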

And in the SYMBOL mappings, the only gene symbol we expose is the most
standard one.  If you want the other gene symbols that are associated
with a particular Entrez Gene ID, then you would have to use the
ALIAS2PROBE mapping.  So this second relationship is also normally
simplified somewhat for you, from "many to one" down to "one to one".

Because gene symbols are not guaranteed to be unique (sometimes the
same symbol is used as an alias for multiple different genes), I would
strongly urge you to NEVER use them as actual IDs.  Instead, if you have
to use them, they should always be the last piece of data attached in a
workflow.  So whether you decide to use the annotation packages or
biomaRt, you will need a strategy for matching up IDs other than
gene symbols.

In short, any sort of "joining" operation that uses gene symbols as keys
is unsafe and should never be done.
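
To make the hazard concrete, here is a toy sketch in base R.  Every ID
and symbol in it is invented for illustration (none of this comes from
a real annotation package); the point is only that a join on symbols
can silently pair two different genes, while a join on gene IDs cannot:

```r
# Toy tables; every ID and symbol below is invented for illustration.
expr <- data.frame(
  entrez = c("1001", "1002", "1003"),
  symbol = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)
# Second platform, annotated with whatever symbol its vendor chose.
# "GENEB" here is an old alias of gene 1003, not the symbol of gene 1002.
prot <- data.frame(
  entrez = c("1001", "1003"),
  symbol = c("GENEA", "GENEB"),
  stringsAsFactors = FALSE
)

# Unsafe: join on symbols
by_symbol <- merge(expr, prot, by = "symbol", suffixes = c(".expr", ".prot"))
# Safer: join on the stable gene ID
by_entrez <- merge(expr, prot, by = "entrez", suffixes = c(".expr", ".prot"))

# Rows where the symbol join paired two different genes
mismatched <- by_symbol[by_symbol$entrez.expr != by_symbol$entrez.prot, ]
print(mismatched)

# Attach a display symbol last, after joining on the ID
by_entrez$display_symbol <- expr$symbol[match(by_entrez$entrez, expr$entrez)]
print(by_entrez)
```

The symbol join happily matches gene 1002 to gene 1003 because both
carry the label "GENEB"; the ID join never sees that pair at all.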

  Marc

On 10/01/2010 03:10 AM, Christian Ruckert wrote:
> Hi,
> I am doing some mapping of affymetrix probeset IDs to gene symbols
> using package hgu133plus2.db.
>
> As the following example illustrates, each of the 40686 mapped
> probesets maps to exactly one gene symbol.
>
> > library("hgu133plus2.db")
> > x <- hgu133plus2SYMBOL
> > Llength(x)
> [1] 54675
> > count.mappedkeys(x)
> [1] 40686
>
> > head(nhit(x))
> 1007_s_at   1053_at    117_at    121_at 1255_g_at   1294_at
>         1         1         1         1         1         1
>
> > table(nhit(x))
>
>     0     1
> 13989 40686
>
>
> Am I correct that annotation with a gene symbol is only included in the
> package if it is unambiguous?
>
> For example
> > x[["203074_at"]]
> [1] NA
>
> But netaffx and biomart return:
> ANXA8, ANXA8L1, ANXA8L2
>
> If doing a mapping between protein and gene expression arrays based on
> gene symbols, can results be improved using biomart instead of the
> annotation packages?
>
> Christian
>
>
> > sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-pc-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hgu133plus2.db_2.4.1 org.Hs.eg.db_2.4.1   RSQLite_0.9-1
> [4] DBI_0.2-5            AnnotationDbi_1.10.1 Biobase_2.8.0
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.0