[BioC] Filtering gene list prior to statistical testing

Thomas Hampton Thomas.H.Hampton at Dartmouth.EDU
Thu Jun 26 03:33:17 CEST 2008


I think Siobhan's approach is pretty sensible.

One thing to bear in mind is that we may be agonizing over the
multiple testing issue with counterproductive results. Yes, it is common
to see p values adjusted for family-wise error rates or false discovery
rates. However, it is a well-known effect that paying too much attention
to statistical significance measures can give lower concordance between
replicates than you would see if you simply took the most highly
regulated genes.

As a result, the FDA and its MAQC group recommend ranking genes by
fold change, keeping those above some biologically sensible level (say
about 2), then removing from the remaining group those that fail to
meet some common significance threshold (say .05).
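
A minimal sketch of that two-step selection, in Python. The function
name select_genes, the (name, fold_change, p_value) input layout and
the example values are assumptions made for illustration only; they do
not come from MAQC or from any Bioconductor package. Fold change is
taken as a treatment/control ratio, so down-regulation is handled on
the log2 scale.

```python
import math


def select_genes(genes, fc_cutoff=2.0, p_cutoff=0.05):
    """Rank genes by fold change, then drop those failing a p-value cutoff.

    genes: iterable of (name, fold_change, p_value), where fold_change is
    a positive ratio (hypothetical layout, for illustration only).
    """
    min_abs_log2_fc = math.log2(fc_cutoff)
    kept = []
    for name, fold_change, p_value in genes:
        abs_log2_fc = abs(math.log2(fold_change))
        # Step 1: fold change at a biologically sensible level.
        # Step 2: a common, non-stringent significance threshold.
        if abs_log2_fc >= min_abs_log2_fc and p_value < p_cutoff:
            kept.append((name, abs_log2_fc))
    # The final ordering is by size of regulation, not by p value.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Made-up example values.
example = [
    ("geneA", 4.0, 0.01),     # large change, passes the p cutoff
    ("geneB", 1.1, 0.0001),   # tiny change, however "significant"
    ("geneC", 0.25, 0.2),     # large change, fails the p cutoff
    ("geneD", 0.3, 0.03),     # down-regulated, passes both
    ("geneE", 2.5, 0.04),     # passes both
]
selected = select_genes(example)
# selected is ["geneA", "geneD", "geneE"]
```

Note that geneB, which has by far the smallest p value, never makes
the list: the p value acts only as a filter, and the ranking itself
comes from the fold change.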

They think altogether too much has been made of p values in general,
and I agree. I do not think the goal here is to quibble over which
genes are statistically "most significant" in a single experiment when
those same genes fail to reproduce in identical repeat experiments.

We actually did 3 sets (N = 5) of the same experiment over several
years and found that statistical measures (e.g. Bioc RankProd) that
were closer to straight fold change yielded far more concordance
with our data than a variety of other tests which were essentially
penalized t tests.

I also did a bunch of simulations and found similar results with
theoretical data, so I can see how this works. Basically, it happens
when you have non-standard distributions where the effect size and
variance are of similar magnitude....
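
For anyone who wants to try this kind of comparison, here is one way
it could be set up in Python. This is not the simulation mentioned
above, whose details are not given here. Everything in it is an
assumption chosen for illustration: Gaussian noise, 500 genes, 5
samples per group, effect sizes drawn with the same spread as the
noise, a plain equal-variance t statistic, and overlap of the top 50
genes as the concordance measure. It makes no claim about which
ranking comes out ahead.

```python
import random
import statistics


def simulate_experiment(true_effects, noise_sd, n_per_group, rng):
    """Simulate one two-class experiment.

    Returns one (mean difference, t statistic) pair per gene.
    """
    results = []
    for effect in true_effects:
        control = [rng.gauss(0.0, noise_sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, noise_sd) for _ in range(n_per_group)]
        diff = statistics.mean(treated) - statistics.mean(control)
        # Equal-variance two-sample t statistic with equal group sizes.
        pooled_var = (statistics.variance(control)
                      + statistics.variance(treated)) / 2.0
        std_err = (2.0 * pooled_var / n_per_group) ** 0.5
        results.append((diff, diff / std_err))
    return results


def top_k(scores, k):
    """Indices of the k genes with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]),
                   reverse=True)
    return set(order[:k])


def concordance(scores_a, scores_b, k):
    """Fraction of the top-k genes shared by two repeat experiments."""
    return len(top_k(scores_a, k) & top_k(scores_b, k)) / k


rng = random.Random(0)  # fixed seed so the run is repeatable
# Effect sizes with the same spread as the noise (an assumption).
true_effects = [rng.gauss(0.0, 1.0) for _ in range(500)]
run1 = simulate_experiment(true_effects, 1.0, 5, rng)
run2 = simulate_experiment(true_effects, 1.0, 5, rng)

# Concordance between the two repeats under each ranking.
fc_overlap = concordance([r[0] for r in run1], [r[0] for r in run2], 50)
t_overlap = concordance([r[1] for r in run1], [r[1] for r in run2], 50)
print("fold change ranking:", fc_overlap)
print("t statistic ranking:", t_overlap)
```

Swapping in other noise distributions, group sizes, or test
statistics only requires changing simulate_experiment.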

Anyway, if you are feeling distraught over seemingly good statistical
practices (like p value correction) that lead to short, weird lists,
you are certainly not alone.
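
To see why correction shortens lists, and why removing genes before
testing changes the adjusted values discussed further down this
thread, here is a small Python sketch of the Benjamini-Hochberg
adjustment. The function name bh_adjust and the p values are made up
for illustration. It is an arithmetic illustration only and takes no
position on whether a given pre-filter is statistically valid.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# The same three top genes (made-up raw p values) ...
top_hits = [0.0001, 0.0002, 0.0003]
# ... tested alongside 997 uninteresting genes, or alongside only 97.
full = top_hits + [0.5] * 997
filtered = top_hits + [0.5] * 97

adj_full = bh_adjust(full)[:3]          # each approximately 0.1
adj_filtered = bh_adjust(filtered)[:3]  # each approximately 0.01
```

The raw p values of the three top genes are identical in both runs.
Only the number of tests m changes, and the adjusted values scale
with it.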

Google MAQC and check out their Sept 2006 papers in Nature Biotechnology.

Tom

On Jun 25, 2008, at 1:29 PM, Siobhan A. Braybrook wrote:

> Hi Johan,
>
> I tend to filter out genes with low Stdev across all arrays in the
> experiment before doing any tests.  This decreases your possibility
> for multiple testing errors, although I still apply an MTP (multiple
> testing procedure).  Since your total number of tests is less, you
> see better adj.Pvalues.  To understand this better, look at the
> calculation for whatever MTP you use - see how N factors into the
> calculation.
> Also, low stdev usually equals no change - like you say.
> Since you will likely go on to validate anything with qPCR if it is
> important, I say cut them out before testing.  If for some reason it
> gives you false positives (unlikely given the MTP and stringency) you
> can weed them out later.
>
> I filter these things out using the shorth.  In my opinion, it gets
> rid of most 'absent' things as well as genes with low stDev but true
> signal.
>
> Hope that helps.
>
> Siobhan
>
>
>
> At 12:36 AM 25/06/2008, you wrote:
>> Dear Jim,
>>
>> Thanks for the feedback. Your point about the implications for
>> Normalization is noted, but my question relates to filtering post
>> Normalization. I.e. the scenario is as follows: Data is normalized
>> (using all "good" data points - "good" being based on Genepix quality
>> criteria) and efficiency of Normalization is assessed (using
>> visualizations and stats) based on the assumptions that you stated.
>> This data is then pre-processed (missing values are imputed,
>> replicates are averaged etc.). Once a "clean" data set is obtained
>> (where we are fairly certain that most technical noise has been
>> removed), we filter out all genes that don't show at least a 1.2-fold
>> change between classes (i.e. not very stringent, the aim is to remove
>> within class "flat" patterns), which reduces our sample space by
>> about 1/3 (highlighting again that most genes show no or very little
>> change). Further analysis is then done on this data set (i.e.
>> Functional enrichment of gene groups as well as "straight"
>> differential testing).
>>
>> What I would like to know is if there is any good reason why we
>> should keep the genes that show "no" class-based behaviour (in terms
>> of fold-changes) - as these contain very little or no information.
>> Including them does not change the types of functional classes found
>> to be enriched, but it pushes all adjusted p-values (doing
>> differential testing) into very non-significant ranges.  Eliminating
>> them improves the adjusted p-values greatly.  The "top" x genes using
>> the "filtered" data set are exactly the same as in the unfiltered
>> one, except they have significant adjusted p-values. I am aware that
>> a significant p-value is not the be-all-and-end-all of microarray
>> research, but it is an unfortunate reality that most people do attach
>> great importance to this and "significant" values are required to
>> make any substantiated assertions (prior to downstream validation of
>> course!).
>>
>> Can any serious criticisms be raised against this type of
>> "post-normalization" sample space reduction?
>>
>> Thanks,
>> Johan van Heerden
>>
>>
>>
>>
>> --- On Tue, 6/24/08, James W. MacDonald <jmacdon at med.umich.edu> wrote:
>>
>>> From: James W. MacDonald <jmacdon at med.umich.edu>
>>> Subject: Re: [BioC] Filtering gene list prior to statistical testing
>>> To: jvhn1 at yahoo.com
>>> Cc: bioconductor at stat.math.ethz.ch
>>> Date: Tuesday, June 24, 2008, 5:30 PM
>>> Hi Johan,
>>>
>>> Johan van Heerden wrote:
>>>> Dear All,
>>>>
>>>> I have scoured the BioC mailing list in search of a clear answer
>>>> regarding the filtering of data sets prior to differential testing,
>>>> in an attempt to circumvent the multiple testing problem. Although
>>>> several opinions have been expressed over the last couple of years I
>>>> have not yet found a convincing argument for or against this
>>>> practice.  I would like to make a comment and would appreciate any
>>>> constructive feedback, as I am not a Statistician but a Biologist.
>>>>
>>>> As far as I can see the problem has been divided into 2 categories:
>>>> (1) "Supervised" and (2) "Unsupervised" filtering, where (1) is based
>>>> on some knowledge regarding the functional classes present in the
>>>> data, as opposed to (2) which does not consider any such information.
>>>> Several criticisms have been raised against the "Supervised" approach,
>>>> with many people calling it flawed logic. My first comments are
>>>> regarding the logic of "Supervised" filtering.
>>>>
>>>> As an example: A data set consisting of two classes (Treatment 1 and
>>>> Treatment 2) has been generated.  A fold-change filter is then used to
>>>> enrich the data set for genes that show within class activity (i.e.
>>>> select only genes that show a mean x-fold change between classes).
>>>> This filtered data set is then used for differential testing.
>>>>
>>>> My first question is: How is this different (especially when working
>>>> with "whole-genome" arrays) from having custom arrays constructed
>>>> from genes known to show a response to some treatment? I.e. arrays
>>>> will then be selectively printed with genes that are known to or
>>>> expected to show a response. This is a type of "filtering" step that
>>>> will yield arrays with highly reduced gene sets. This scenario can
>>>> result from existing knowledge about pathways or can arise from a
>>>> discovery based microarray experiment, where a researcher produces
>>>> whole genome arrays and from there selects "responsive" genes for the
>>>> creation of targeted (or custom) arrays. Surely this step-wise sample
>>>> space reduction should be subject to the same criticism?
>>>
>>> It is. When normalizing microarray data, the assumption being made is
>>> that many (most) of the genes being measured are not actually changing
>>> expression. What most normalization schemes do is line up the bulk of
>>> the data so on average the log fold change is zero. If we can't make
>>> this assumption (e.g., it is possible that _all_ genes are up-regulated
>>> in one sample), then without having some housekeeping genes to use for
>>> the normalization, there is no way to normalize the data without making
>>> some strong and possibly unwarranted assumptions.
>>>
>>> So the main argument as I see it against doing supervised sample space
>>> reduction is that you may be removing the main assumption of most
>>> normalization schemes. The normalization is really the important thing
>>> here, as you are trying to remove unwanted technical variation that will
>>> have a much larger effect on your statistics than the multiple testing
>>> issue.
>>>
>>>>
>>>> Secondly, the supervised fold-change filter should not affect the
>>>> statistic of each individual gene, but will have profound effects on
>>>> the adjusted p-values. I have checked this only for t-tests and am
>>>> not sure what the effect on more complex statistical differential
>>>> testing methods would be. If the only effect of the "supervised"
>>>> filtering step is the enrichment of class-specific responsive genes
>>>> and a reduction in the severity of the p-value ADJUSTMENT (without
>>>> affecting the actual statistic), this could surely be a very useful
>>>> way of filtering data?
>>>>
>>>> Wrt the "unsupervised" approaches: These approaches define some
>>>> overall variability threshold which can be used to filter out genes
>>>> that don't show a minimum degree of variability regardless of class.
>>>> As far as I can tell there are several issues wrt this approach. (1)
>>>> Some genes will be naturally "noisy", i.e. will show high levels of
>>>> fluctuation regardless of class. These genes are likely to be
>>>> retained by a filter based on degree of variability. (2) Some genes
>>>> might show low levels of variability (with small changes between
>>>> classes) and could be important, but will be excluded if a filter is
>>>> based on degree of variability.
>>>
>>> This is all true, but again I think the normalization issue is much more
>>> important, and that is where we really want to make sure we are doing a
>>> good job.
>>>
>>> These days people are getting much less interested in a list of
>>> differentially expressed genes, as these are often too large to be
>>> useful anyway. The real underlying goal of most experiments IMO, is to
>>> find pathways that are perturbed by some treatment/condition/whatever.
>>> In this case one really doesn't care about multiple testing, and instead
>>> is just using the t-stats (or whatever) in a GSEA type statistic to
>>> measure the difference in sets of genes.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>>
>>>> I would greatly appreciate some feedback on these comments,
>>>> specifically some statistical substantiation as to why a "supervised"
>>>> approach is "flawed", given the similar experimental strategies
>>>> included in the paragraph on this approach.
>>>>
>>>> Many Thanks!! Johan van Heerden
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> Affymetrix and cDNA Microarray Core
>>> University of Michigan Cancer Center
>>> 1500 E. Medical Center Drive
>>> 7410 CCGC
>>> Ann Arbor MI 48109
>>> 734-647-5623
>>
>
> S. A. Braybrook
> Graduate Student, Harada Lab
> Section of Plant Biology
> University of California,  Davis
> Davis, CA 95616
> Ph 530.752.6980
>
> The time is always right, to do what is right.
> - Martin Luther King, Jr.


