[BioC] Questions about gene identifiers and probesets regulation

Thu Aug 16 17:02:47 CEST 2007

Hi Chunyan,

Chunyan Liu wrote:
> Dear all,
> 
> I'm doing gene expression comparisons between two groups of subjects
> using affymetrix single-channel hgu133plus2 microarray chips and I have
> two questions. 
> 
> 1) Relationship among manufacturer ID, EntrezID, GenBank ID and gene
> SYMBOL: Is there any one-to-one mapping?
> 
> I noticed that the hgu133plus2 environment gives annotations through
> Entrez ID. Is this always the case? It seems to me that one EntrezID
> corresponds to multiple manufacturer IDs (probe name), but is this the
> case between manufacturer ID and GenBank ID? Is it true that one
> EntrezID maps to one gene symbol? 

I'm not sure if there is a one-to-one mapping from probeset ID to 
GenBank ID, but there certainly isn't a one-to-one mapping of GenBank ID 
to gene symbol (as GenBank IDs map things at the transcript level), so I 
am not sure that would help.

I think there is a one-to-one mapping from Entrez Gene to symbol, but I 
am not 100% sure about that.
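
One way to settle the one-to-one question for any pair of identifier
types is to count it directly from the annotation table. The following
is a minimal Python sketch, assuming you have exported the annotation as
(key, value) pairs; the function name and the IDs are hypothetical and
are not taken from the hgu133plus2 package.

```python
from collections import defaultdict


def mapping_cardinality(pairs):
    """Classify (key, value) pairs as one-to-one, one-to-many, etc."""
    forward = defaultdict(set)   # key -> all values seen for it
    backward = defaultdict(set)  # value -> all keys seen for it
    for key, value in pairs:
        forward[key].add(value)
        backward[value].add(key)

    key_has_many_values = any(len(v) > 1 for v in forward.values())
    value_has_many_keys = any(len(k) > 1 for k in backward.values())

    if key_has_many_values and value_has_many_keys:
        return "many-to-many"
    if key_has_many_values:
        return "one-to-many"
    if value_has_many_keys:
        return "many-to-one"
    return "one-to-one"


# Hypothetical rows: (probeset ID, Entrez Gene ID), for illustration only.
probeset_to_entrez = [
    ("ps1_at", "1001"),
    ("ps2_s_at", "1001"),
    ("ps3_at", "1002"),
]
print(mapping_cardinality(probeset_to_entrez))
```

With these made-up rows two probesets share one Entrez Gene ID, so the
mapping is reported as many-to-one.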

> 
> 2) Probesets: Another question is after using limma, I get a list of
> up- and down-regulated probeset when comparing two groups (1,000 up and
> 2,000 down regulated probesets). When I translate these into unique gene
> symbols, I find 200 gene symbols that appear in both lists. Is this
> plausible? Interpretable?

Ah, now that is the problem, isn't it? Another problem is the case where 
10 probesets are supposed to interrogate a particular gene and one is 
significant, but the other nine are not. In that case is the gene 
differentially expressed or not?
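
To see exactly which genes are affected, you can intersect the symbols
behind the two probeset lists. This is only a sketch in Python, assuming
a probeset-to-symbol lookup is available as a dictionary; the function
name, probeset IDs and gene symbols are all hypothetical.

```python
def symbols_in_both(up_probesets, down_probesets, probeset_to_symbol):
    """Return gene symbols that have probesets in both the up and down lists."""
    # Probesets without an annotated symbol are skipped.
    up_symbols = {
        probeset_to_symbol[p] for p in up_probesets if p in probeset_to_symbol
    }
    down_symbols = {
        probeset_to_symbol[p] for p in down_probesets if p in probeset_to_symbol
    }
    return sorted(up_symbols & down_symbols)


# Hypothetical annotation and results, for illustration only.
annotation = {
    "ps1_at": "GENE_A",
    "ps2_x_at": "GENE_A",
    "ps3_at": "GENE_B",
    "ps4_s_at": "GENE_C",
}
up = ["ps1_at", "ps3_at"]
down = ["ps2_x_at", "ps4_s_at"]

print(symbols_in_both(up, down, annotation))
```

Here GENE_A is the only symbol with one probeset called up and another
called down, so it is the only one returned.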

What you have to understand is that Affy designed the probesets for this 
chip based on the UniGene build 133, which was the best information at 
the time, but which is really outdated now (we are on build 203 currently).

Even when they designed the chip, there were three classes of probesets: 
those with an _at suffix, which indicates that the probes all blast 
exclusively to the transcript in question; those with an _s_at suffix (or 
_a_at, I forget which they used for the 133), which indicates that some of 
the probes bind to related transcripts (whatever 'related' means); and 
those with an _x_at suffix, which indicates that some probes bind to 
completely unrelated transcripts.
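
The suffix classes above can be read off the probeset ID itself. A
minimal Python sketch, assuming the suffix meanings are as described in
this message (the function name and the example IDs are hypothetical):

```python
def probeset_class(probeset_id):
    """Label a probeset by its ID suffix, following the classes described above."""
    # Check the longer suffixes first, because all of them also end in "_at".
    if probeset_id.endswith("_x_at"):
        return "some probes hit unrelated transcripts"
    if probeset_id.endswith(("_s_at", "_a_at")):
        return "some probes hit related transcripts"
    if probeset_id.endswith("_at"):
        return "all probes specific to one transcript"
    return "unknown"


# Hypothetical probeset IDs, for illustration only.
for probeset_id in ("ps1_at", "ps2_s_at", "ps3_x_at"):
    print(probeset_id, "->", probeset_class(probeset_id))
```

The order of the checks matters: testing for plain "_at" first would
put every probeset in the most specific class.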

So even when the chip was designed, some of the probesets were not 
nearly as reliable as others. If you take the probe sequences and blast 
them today, you can find _at probesets with probes that bind to 
unrelated sequences, so time has not always been kind to the probe mappings.

What can you do about this problem? There are a couple of things you can 
do, but any 'fix' has its own problems.

First, you can use the remapped cdfs that are made available by the MBNI 
at the University of Michigan (via BioC). These remapped cdfs discard 
the original probesets and only use those probes that are known to map 
to unique sequences in the genome (based on the current UniGene build), 
and then map to transcripts or genes based on Entrez Gene, GenBank, 
UniGene, Ensembl, etc.

The upside to these cdfs is that you will have only one probeset per 
transcript/gene, so it will be impossible to have a gene symbol 
appearing in both the up and down regulated groups. In addition, the 
assumptions of, say, RMA or GCRMA (or any probe-level models in affyPLM) 
will again hold true; in other words, the intensity of a given probe 
will be due only to the level of the transcript it is supposed to 
measure plus the probe-specific binding.

The downside of these cdfs is that the number of probes per probeset 
will vary from something like 3 to 150, so the standard error of your 
expression estimate will also vary widely. If you simply take the 
expression values for these probesets and analyze them using limma, you 
will be ignoring this extra level of error (which you can safely ignore 
using the 'stock' affy cdfs, since most of those probesets have 11 
probes each).
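
To get a rough feel for how much the standard error could vary, here is
a small Python sketch. It rests on a deliberate simplification that is
not how RMA actually summarizes probes: it treats the probeset value as
a plain mean of independent probes with equal variance, so the standard
error scales as 1/sqrt(n). The function name is hypothetical.

```python
import math


def relative_standard_error(n_probes, reference_probes=11):
    """Standard error of an n-probe mean relative to a reference probeset size.

    Assumes independent probes with equal variance (a simplification),
    so the ratio is sqrt(reference_probes / n_probes).
    """
    return math.sqrt(reference_probes / n_probes)


# Compare small, stock-sized and large probesets against the 11-probe case.
for n in (3, 11, 150):
    print(n, "probes:", round(relative_standard_error(n), 2))
```

Under that assumption a 3-probe probeset has nearly twice the standard
error of an 11-probe one, and a 150-probe probeset has roughly a
quarter of it.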

Second, you can just use the 'stock' affy cdfs, and use some ad hoc 
method to decide which of the probesets to believe. You can simply 
choose to believe only the _at probesets. Or you can decide to blast (or 
blat, which is much faster and AFAICT nearly as accurate) each of the 
disagreeing probesets to see which one appears to actually measure the 
gene transcript in question. The upside here is you don't have the extra 
level of variability introduced by the MBNI cdfs, but the downside is 
the amount of extra work it will entail.

HTH,

Jim

> 
> Thank you very much for any input. 
> 
> Chunyan Liu
> Cincinnati Children's Hospital

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623