[BioC] select one Affy probeset for one gene

Mon Mar 13 22:55:49 CET 2006

Hi,

Sean Davis wrote:
> 
> 
> On 3/13/06 3:38 PM, "Glazko, Galina" <Galina_Glazko at urmc.rochester.edu>
> wrote:
> 
> 
>>Dear list,
>>
>> 
>>
>>Is there a way to automatically select one probeset for one gene in Affy
>>arrays? 
>>
>>Say, if we have several probesets for a given gene, we select the one
>>with the highest level of expression, or based on any other reasonable
>>criteria...?
>>
>>I am sorry if this question was answered before, it seems to be very
>>basic question and I hope there is the solution...
> 
> 
> Galina,
> 
> You can contrive a solution, I suppose.  However, I'm not sure this is a
> good idea.  Whatever "reasonable criteria" you use are likely to lead to
> bias.  Filtering on unmeasured probesets or other quality measures applied
> equally to all probesets is probably reasonable, but not applying on a
> per-gene basis.  There have been related discussions in the past, often
> centering around "averaging" expression values.
> 
> The more accepted way of dealing with multiple probesets is to do your
> analysis based on the probeset; only after that is done do you then connect
> your gene labels back to the probesets.

  Unfortunately that approach does not always work. Something needs to
be done earlier in the process if a user wants to make use of data such
as GO or chromosomal location, where the mapping is based on Entrez
Gene ID (other identifiers have very similar issues). Not removing the
duplicates often leads to quite different results (in essence there is
over counting if all probes are accurate). As users of GOstats know,
you have to choose one candidate for each Entrez Gene ID (and what I
have been doing there is probably not ideal; the suggestion below, due
to Seth Falcon, is, I think, better). But I would be interested to hear
other points of view.
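
  As a rough illustration of the over counting, here is a small Python
sketch. The probeset names, gene IDs and category membership are all
made up for the example; it only shows how counting probesets rather
than genes inflates the number of hits in a category.

```python
# Hypothetical annotation: probeset ID -> Entrez Gene ID (made-up values).
probe_to_gene = {
    "ps_a": "101", "ps_b": "101", "ps_c": "101",  # three probesets, one gene
    "ps_d": "202",
    "ps_e": "303",
}

# Hypothetical GO category, defined in terms of Entrez Gene IDs.
category_genes = {"101", "202"}

# Hypothetical list of "interesting" probesets from some analysis.
selected_probes = ["ps_a", "ps_b", "ps_c", "ps_e"]

# Counting per probeset: gene 101 is counted once for each of its probesets.
probe_hits = sum(probe_to_gene[p] in category_genes for p in selected_probes)

# Counting per gene: each Entrez Gene ID is counted at most once.
gene_hits = len({probe_to_gene[p] for p in selected_probes} & category_genes)

print(probe_hits, gene_hits)
```

Here the category gets three hits when counted by probeset but only one
when counted by gene.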

  I also do not like averaging, for several reasons. First, I would
then have two kinds of measurements (averages and ordinary old probes),
and that is problematic for some uses. Second, if not all of the probes
work (which might be why there are several variants), then I am
averaging the good with the bad, which also seems like a less than
ideal way to go.

  One suggestion is to do non-specific filtering (say on variation, or
for expressed versus not, or something of that ilk) and then, for each
gene, to select the probe set that has the highest value of the
filtering statistic. Thus, you are selecting the probe set with the
most information (but do be careful not to use any phenotypic
information, as this could cause problems). Your (Galina's) suggestion
was to use level of expression, but that is generally a bad idea
because it would involve a between-probe, within-array comparison, and
these are not ideal; just because one spot is brighter does not mean
that it works better, or that there is more mRNA than for a less bright
spot.
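
  A minimal sketch of that selection, in Python rather than R. It
assumes the interquartile range across arrays as the measure of
variation (one possible choice, not the only one) and uses made-up
probeset names, gene IDs and values. The function names are
hypothetical and not part of any package.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range of one probeset's values across arrays."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def select_one_probeset_per_gene(expr, probe_to_gene):
    """Keep, for each gene, the probeset with the largest IQR.

    expr          : dict, probeset ID -> list of expression values
    probe_to_gene : dict, probeset ID -> gene ID (None if unmapped)
    No phenotype labels are used anywhere in the selection.
    """
    best = {}  # gene ID -> (IQR, probeset ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # drop probesets with no gene mapping
        score = iqr(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe)
    return {gene: probe for gene, (_, probe) in best.items()}


# Made-up values, for illustration only.
expr = {
    "ps_a": [7.0, 7.1, 7.0, 7.2],  # bright but nearly constant
    "ps_b": [4.0, 6.0, 3.5, 6.5],  # dimmer but variable
    "ps_c": [5.0, 5.5, 6.0, 6.5],
    "ps_d": [8.0, 8.0, 8.0, 8.0],  # no gene mapping
}
probe_to_gene = {"ps_a": "gene1", "ps_b": "gene1",
                 "ps_c": "gene2", "ps_d": None}

selected = select_one_probeset_per_gene(expr, probe_to_gene)
print(selected)
```

Note that for gene1 the dimmer but more variable ps_b is kept over the
brighter ps_a, which is the point of not selecting on expression level.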

  HTH
   Robert

> 
> Sean
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org