[BioC] select one Affy probeset for one gene

Tue Mar 14 06:45:56 CET 2006

James W. MacDonald wrote:
> Robert Gentleman wrote:
> 
>>Hi,
>>
>>Sean Davis wrote:
>>
>>
>>>On 3/13/06 3:38 PM, "Glazko, Galina" <Galina_Glazko at urmc.rochester.edu>
>>>wrote:
>>>
>>>
>>>
>>>
>>>>Dear list,
>>>>
>>>>
>>>>
>>>>Is there a way to automatically select one probeset for one gene in Affy
>>>>arrays? 
>>>>
>>>>Say, if we have several probesets for a given gene, we select the one
>>>>with the highest level of expression, or based on any other reasonable
>>>>criteria...?
>>>>
>>>>I am sorry if this question was answered before; it seems to be a very
>>>>basic question and I hope there is a solution...
>>>
>>>
>>>Galina,
>>>
>>>You can contrive a solution, I suppose.  However, I'm not sure this is a
>>>good idea.  Whatever "reasonable criteria" you use are likely to lead to
>>>bias.  Filtering on unmeasured probesets or other quality measures applied
>>>equally to all probesets is probably reasonable, but filtering on a
>>>per-gene basis is not.  There have been related discussions in the past, often
>>>centering around "averaging" expression values.
>>>
>>>The more accepted way of dealing with multiple probesets is to do your
>>>analysis based on the probeset; only after that is done do you then connect
>>>your gene labels back to the probesets.
>>
>>
>>
>>  Unfortunately that approach does not always work and something needs 
>>to be done a bit earlier in the process if a user wants to make use of 
>>data such as GO, chromosomal location etc where the mapping is based on 
>>Entrez Gene ID (for example, but other identifiers have very similar 
>>issues). Not removing the duplicates often leads to quite different 
>>results (in essence there is over-counting if all probes are accurate). 
>>As users of GOstats know, you have to choose one candidate for each 
>>Entrez gene id (and probably what I have been doing there is not ideal - 
>>the suggestion below, due to Seth Falcon is, I think, better). But I 
>>would be interested to hear other points of view.
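
A minimal sketch of the over-counting issue described above, written in plain Python rather than R purely for illustration. The probeset IDs, Entrez Gene IDs, and category labels are all made up, and the helper names (count_by_probeset, count_by_gene) are hypothetical; this is not how GOstats does it, only a toy comparison of counting per probeset versus per unique gene.

```python
from collections import Counter

# Hypothetical probeset -> Entrez Gene ID mapping (made-up IDs).
# Gene "1001" is represented by three probesets.
probeset_to_gene = {
    "ps_a": "1001",
    "ps_b": "1001",
    "ps_c": "1001",
    "ps_d": "2002",
    "ps_e": "3003",
}

# Hypothetical annotation keyed by Entrez Gene ID (e.g. a GO category).
gene_to_category = {"1001": "cat_X", "2002": "cat_X", "3003": "cat_Y"}


def count_by_probeset(selected):
    """Count category membership once per probeset (duplicates included)."""
    return Counter(gene_to_category[probeset_to_gene[p]] for p in selected)


def count_by_gene(selected):
    """Count category membership once per distinct Entrez Gene ID."""
    genes = {probeset_to_gene[p] for p in selected}
    return Counter(gene_to_category[g] for g in genes)


selected = ["ps_a", "ps_b", "ps_c", "ps_d", "ps_e"]
per_probeset = count_by_probeset(selected)
per_gene = count_by_gene(selected)
print(per_probeset)
print(per_gene)
```

In this toy case cat_X is counted four times per probeset but only twice per gene, which is the kind of inflation that can change a category-based result.
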
>>
>>  I also do not like averaging for several reasons. First, I now have two 
>>kinds of measurements (averages and ordinary old probes) and that is 
>>problematic for some uses. Second, if not all of the probes work (which 
>>might be why there are several variants) then I am averaging the good 
>>with the bad, which also seems like a less than ideal way to go.
> 
> 
> One inherent problem with using the Affy probesets is that there are 
> known issues with many of the probes; some measure related transcripts 
> and others measure unrelated transcripts, so what you are measuring is 
> not always clear. The MBNI cdfs which have been re-mapped may help with 
> at least two of these problems. First, all probes that no longer blast 
> to the transcript of interest are removed from consideration. Second, 
> all probes that do blast to the transcript of interest are piled 
> together into one probeset (I guess you could argue this is bad since 
> the expression measures are now based on variable numbers of probes, but 
> that is already true anyway...). Note that these cdfs are planned to be 
> part of the new release of BioC, but currently are only available from 
> the MBNI website
> 
> http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
> 
> Since you now have only one probeset per gene (based on Entrez Gene, 
> UniGene, RefSeq, or Ensembl) you no longer have to decide which one to 
> use. The biggest downside to using these cdfs is the lack of 
> infrastructure in BioC that is tailored to their use, which requires a 
> higher level of understanding of R than one would need to use a 'stock' 
> cdf (which reminds me - I should be doing something about that ;-D).

  Hi,
   These are good points, but I think that they are complementary rather 
than a strict replacement. First, I might just have expression data, not 
CEL files, so this approach would not be an option. Second, I might 
decide to map to UniGene or RefSeq, and then would still have the same 
problem, since these do not necessarily have a 1-1 correspondence with Entrez 
Gene. And finally, I might be working with cDNA arrays where there is no 
clear way to take this same approach. That is not to say that this is 
not a viable approach; it certainly does solve some problems.

  best wishes
   Robert

> 
> HTH,
> 
> Jim
> 
> 
> 
>>   One suggestion is to do non-specific filtering (say on variation, or 
>>for expressed versus not, or something of that ilk) and to then select, 
>>for each gene, the probe set that has the highest value of that 
>>statistic. Thus, you are selecting the probe set with the most 
>>information (but do be careful not to use any phenotypic information, 
>>as this could cause problems). Your (Galina's) suggestion was to use 
>>level of expression, but that is generally a bad idea because that 
>>would involve a between-probe, within-array comparison, and these are 
>>not ideal; just because one spot is brighter does not mean it works 
>>better, or that there is more mRNA than for a less bright spot.
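
The selection step suggested above could be sketched roughly as follows. This is an illustration in plain Python, not Bioconductor code: the function name select_one_probeset_per_gene, the toy expression values, and the probeset-to-gene mapping are all hypothetical, and sample variance stands in for whichever non-specific statistic is actually used. No phenotype labels enter the calculation.

```python
from statistics import variance


def select_one_probeset_per_gene(expr, probeset_to_gene, stat=variance):
    """Return {gene_id: probeset_id}, keeping for each gene the probeset
    with the largest value of a non-specific statistic across arrays.

    expr maps probeset ID -> list of expression values (one per array).
    Probesets without a gene mapping are skipped; on ties the first
    probeset encountered is kept.
    """
    best = {}  # gene_id -> (probeset_id, score)
    for probeset, values in expr.items():
        gene = probeset_to_gene.get(probeset)
        if gene is None:
            continue
        score = stat(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probeset, score)
    return {gene: probeset for gene, (probeset, _) in best.items()}


# Toy data (made up): three probesets for gene "1001", one for gene "2002".
expr = {
    "ps_a": [7.0, 7.1, 6.9, 7.0],
    "ps_b": [5.0, 8.0, 4.0, 9.0],
    "ps_c": [10.0, 10.2, 9.9, 10.1],
    "ps_d": [6.0, 6.5, 5.5, 6.0],
}
probeset_to_gene = {"ps_a": "1001", "ps_b": "1001", "ps_c": "1001", "ps_d": "2002"}

chosen = select_one_probeset_per_gene(expr, probeset_to_gene)
print(chosen)
```

In the toy data, ps_c is the brightest probeset for gene "1001" but barely varies across arrays, so the most variable one (ps_b) is kept instead. Passing a different function as stat (for example an interquartile range) changes the criterion without changing the selection logic.
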
>>
>>  HTH
>>   Robert
>>
>>
>>
>>
>>>Sean
>>>
>>>_______________________________________________
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>
>>
>>
> 
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org