[BioC] Dealing with multiple probes per gene and multiple locations per probe.

Tue Jul 15 14:35:21 CEST 2008

On Tue, Jul 15, 2008 at 4:21 AM, Nathan Harmston
<iwanttobeabadger at googlemail.com> wrote:
> Hi everyone,
>
> Currently the aim of a project I'm working on is to discover pathway
> signatures (and I am thinking about using an approach like GSEA using KEGG
> or GO or something more modular). I have seen in some vignettes/tutorials
> that they recommend reducing the number of probes per gene to one by
> retaining the probe with the most variation since it will be the most
> informative. However, would it not be best to take the probe which is
> closest to the polyA tail of the gene, which according to some sources (in
> the lab I'm working at) is the most reliable probe in the gene? Is there a
> good reason for choosing variability over reliability? I have had a quick
> look through some papers and been unable to find any information which would
> point me towards one or the other (apart from the BioC vignettes).

Just to clarify, are you talking about probes or probesets?

If you know the answer to which probesets are the most reliable, you
could certainly use those.  However, in the absence of such
information, variability across an experiment that has biological
variability is thought to be a surrogate for measuring something
important.
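
As a rough illustration of the "most variable probeset per gene" idea,
here is a minimal Python sketch. The function name, the dictionary
layout and the toy values are assumptions made up for this example and
do not come from any Bioconductor package; variance is used as the
variability measure, though IQR or another spread statistic would work
the same way.

```python
from statistics import pvariance


def most_variable_probeset_per_gene(expr, probeset_to_gene):
    """Pick, for each gene, the probeset with the largest variance.

    expr: dict mapping probeset id -> list of expression values (one per sample)
    probeset_to_gene: dict mapping probeset id -> gene id
    Probesets without a gene annotation are skipped.
    """
    best = {}  # gene -> (probeset id, variance)
    for probeset, values in expr.items():
        gene = probeset_to_gene.get(probeset)
        if gene is None:
            continue  # unannotated probeset: cannot be assigned to a gene
        var = pvariance(values)
        if gene not in best or var > best[gene][1]:
            best[gene] = (probeset, var)
    return {gene: probeset for gene, (probeset, _) in best.items()}


# Hypothetical toy data: four probesets, four samples.
expr = {
    "ps1": [5.0, 5.1, 4.9, 5.0],  # nearly flat
    "ps2": [3.0, 6.0, 2.5, 7.0],  # varies across samples
    "ps3": [8.0, 8.2, 7.9, 8.1],
    "ps4": [1.0, 1.0, 1.0, 1.0],  # no gene annotation below
}
probeset_to_gene = {"ps1": "GENE_A", "ps2": "GENE_A", "ps3": "GENE_B"}

selected = most_variable_probeset_per_gene(expr, probeset_to_gene)
print(selected)
```

If you do have a per-probeset reliability score, the same loop applies
with that score in place of the variance.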

> Another problem I was wondering about is how to deal with probes that map to
> multiple locations. Is a Bioconductor package available for this, since it
> seems like a frequent issue in microarray analysis? How would you actually
> deal with this problem? My current approach is to remove probes which hit
> multiple locations on the genome (I have a list from
> http://microarray.csc.mrc.ac.uk/scampa/section.html?id=5 and was going to use
> nsFilter (if I get it working correctly)). But this seems to throw away a
> lot of information. Is there a good way of dealing with these probes which
> doesn't discard information?

Requiring probes to map only once to the genome is not really the
right criterion.  Instead, one wants to use probes that map to only
one gene.  Whether or not a probe also hits elsewhere in the genome
(outside any other gene) is irrelevant for mRNA expression.  The
limitation of mapping to transcripts and then to genes is that the
transcriptome of any given organism is not entirely known.
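
To make the distinction concrete, here is a small Python sketch of
filtering on "hits exactly one gene" rather than "hits exactly one
genomic location". The data structure, identifiers and function name
are hypothetical and only for illustration; in practice the hits would
come from aligning probe sequences against a transcript or gene
annotation, which is itself incomplete.

```python
def probes_hitting_one_gene(probe_hits):
    """Keep probes whose alignments fall in exactly one distinct gene.

    probe_hits: dict mapping probe id -> list of (location, gene) tuples,
                where gene is None for a hit outside any annotated gene.
    Returns a dict mapping each retained probe id -> its single gene.
    """
    kept = {}
    for probe, hits in probe_hits.items():
        # Intergenic hits (gene is None) are ignored: they do not
        # contribute mRNA signal from another gene.
        genes = {gene for _, gene in hits if gene is not None}
        if len(genes) == 1:
            kept[probe] = next(iter(genes))
    return kept


# Hypothetical alignments for four probes.
probe_hits = {
    "p1": [("chr1:100", "GENE_A")],                          # one location, one gene
    "p2": [("chr1:200", "GENE_A"), ("chr7:900", None)],      # two locations, one gene
    "p3": [("chr2:50", "GENE_B"), ("chr3:75", "GENE_C")],    # two genes: ambiguous
    "p4": [("chr5:10", None)],                               # no annotated gene
}

kept = probes_hitting_one_gene(probe_hits)
print(kept)
```

Note that p2 would be discarded by a "unique genomic location" filter
but is retained here, which is where some of the otherwise lost
information comes back.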

> Out of interest, how reliable is the annotation provided? Is it completely
> derived from the Affy annotations? The number of probes where the Affy Entrez
> ID and the Ensembl ID match is approx 30000, which isn't that great a
> statistic. How do people tend to deal with problems like this?

The annotations are derived by remapping the accessions supplied by
Affy to current annotation sources.  However, there have also been
several reannotations based on alignment approaches.  A search of the
archives might be helpful here.

Sean