[BioC] Question about the appropriate universe for GOHyperG or hyperGtable

Wed Jun 14 19:41:24 CEST 2006

Hi,

James W. MacDonald wrote:
> Hi Scott,
> 
> Ochsner, Scott A wrote:
>> Hello BioC,
>>
>> I have a list of 1164 differentially expressed probe sets extracted
>> from an experiment done with mouse4302 chips. To arrive at the list I
>> first filtered log2 expression values by removing those below a log2
>> of 6.  This left me with 24654 probe sets from 45101.  Next I used
>> limma to model treatment effects and used an fdr adjusted
>> fit2$F.p.value to extract those genes displaying differential
>> expression in at least one of two contrasts (p<0.001).  This left me
>> with 2612 probe sets.  My list of 1164 probe sets are those probe
>> sets which are upregulated in both contrasts within the significant 
>> set of 2612.  I would like to use GOHyperG or hyperGtable to evaluate
>> overrepresented BP GO terms within the 1164 list.  Should my BP GO
>> universe come from the 45101, 24654, or 2612 probe set groups?
> 
> I think you could make a convincing argument for either the 45101 or 
> 24654 probe set groups, but personally I would go with the 45101. My 
> rationale would be that all 45101 probe sets were measured, so any of 
> them could theoretically have been significant. It doesn't matter IMO 
> that some were removed because of low expression rather than a 
> statistical test.
> 

  First, let me suggest that you filter not on value (a log2 cutoff of six
is generally not optimal) but on variability. Genes that show little
variability across your conditions carry no information (for any
phenotype). Second, you need to be careful about duplicates: at some
point you must reduce all probes to a single representative for each
Entrez Gene ID (see the GOstats vignette for more details); you end up
doing something pretty odd if you do not. My preference here is to take
the probe with the most extreme test statistic (since I don't think all
probes are reliable and I am looking for evidence in favor of joint
behavior; other approaches are also valid, but you need to pick
something). You also need to remove those probes that have no mappings
for the ontology you are going to use (MF, BP or CC). That should happen
either for all genes on the chip or for those that survive your
non-specific filtering.
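
  As a rough illustration of the duplicate handling described above, here
is a minimal sketch in plain Python (not GOstats code). The probe IDs,
Entrez IDs, statistics and annotation flags are all made up, and
collapse_probes is a hypothetical helper name. It keeps, for each Entrez
Gene ID, the probe with the largest absolute test statistic, and drops
probes with no Entrez ID or no annotation in the chosen ontology.

```python
# Toy rows: (probe_id, entrez_id, t_statistic, has_bp_annotation).
# All values are invented for illustration only.
probes = [
    ("p1", "100", 2.1, True),
    ("p2", "100", -3.4, True),   # same gene as p1, more extreme statistic
    ("p3", "200", 0.5, True),
    ("p4", "300", 4.0, False),   # no BP annotation: dropped
    ("p5", None, 1.2, False),    # no Entrez Gene ID: dropped
]


def collapse_probes(rows):
    """Return {entrez_id: (probe_id, t_statistic)} with one probe per gene.

    The representative is the probe with the largest |t_statistic|.
    Rows lacking an Entrez ID or an annotation are skipped.
    """
    best = {}
    for probe_id, entrez_id, t_stat, has_bp in rows:
        if entrez_id is None or not has_bp:
            continue
        current = best.get(entrez_id)
        if current is None or abs(t_stat) > abs(current[1]):
            best[entrez_id] = (probe_id, t_stat)
    return best


universe = collapse_probes(probes)
print(sorted(universe))
```

The same reduction would be applied to whichever probe set is chosen as
the universe, and the selected list is then a subset of those genes.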

  Given that, Jim is correct, and I know of no sound statistical
argument for preferring one over the other. My own approach is to take
the 24654 (of course I would have a different number if I used the
approach described above), mostly because I think a lot of the probes on
the chip are not measuring anything in my cells, and if I were richer I
could have designed a purpose-built chip. But Jim's argument is also
valid. In most of these cases there is no right answer, but you need to
choose something that you agree with philosophically.
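
  To see why the choice of universe matters numerically, here is a small
sketch of the upper-tail hypergeometric probability that underlies this
kind of over-representation test. It is plain Python, not the GOHyperG
implementation, and hyper_pvalue is a hypothetical helper. The universe
and list sizes are taken from the question; the annotation counts (900,
600, 40) are invented, and real counts would be gene-level numbers after
the duplicate removal discussed above.

```python
from math import comb


def hyper_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the universe
    K: universe genes annotated to the GO term
    n: genes in the selected list
    k: selected genes annotated to the GO term
    """
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Same selected list (n = 1164, k = 40 annotated), two candidate universes.
# The annotated counts 900 and 600 are assumptions for illustration.
p_full_chip = hyper_pvalue(40, 1164, 900, 45101)
p_filtered = hyper_pvalue(40, 1164, 600, 24654)
print(p_full_chip, p_filtered)
```

The two calls differ only in N and K, which is exactly what changes when
the universe changes; the selected list itself stays the same.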

  best wishes
   Robert

> HTH,
> 
> Jim
> 
> 
>> I hope this is clear,
>>
>> Thanks for any help,
>>
>> Scott A. Ochsner, Ph.D. Baylor College of Medicine One Baylor Plaza,
>> N810 Houston, TX. 77030 lab phone:  713-798-1620 office phone:
>> 713-798-1585 fax:  713-798-4161
>>
>> _______________________________________________ Bioconductor mailing
>> list Bioconductor at stat.math.ethz.ch 
>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>> archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org