[BioC] Invalid fold-filter

Mon Feb 20 16:30:22 CET 2006

Yes, removing genes like that is wrong. The second method you mention
splits the data and will probably also let bias slip in by the back door.

People often filter on the coefficient of variation (sd/mean), since the
sd of data should increase linearly with the mean. It doesn't with array
data, so I sometimes take the moderated sd (sigma, if I remember rightly)
from limma eBayes(fit) and use that to calculate a CV filter.
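
A minimal sketch of a CV filter, written in Python purely for illustration. It uses the ordinary sample sd, not limma's moderated sd, and the function name `cv_filter`, the `min_cv` cutoff and the toy intensities are all assumptions made up for this example. The point it shows is that the filter never looks at the group labels.

```python
from statistics import mean, stdev

def cv_filter(expr, min_cv):
    """Return the probe ids whose coefficient of variation (sd/mean) is >= min_cv.

    expr maps a probe id to its expression values across all arrays.
    Group labels are never consulted, so the filter is non-specific.
    """
    kept = []
    for probe, values in expr.items():
        m = mean(values)
        if m <= 0:
            continue  # CV is not meaningful for a non-positive mean
        if stdev(values) / m >= min_cv:
            kept.append(probe)
    return kept

# Toy, made-up intensities on the raw (unlogged) scale.
expr = {
    "probe_flat": [100.0, 102.0, 98.0, 101.0],
    "probe_variable": [100.0, 300.0, 90.0, 310.0],
}
print(cv_filter(expr, min_cv=0.2))
```

Swapping in a moderated sd would only change the numerator; the cutoff itself remains an arbitrary choice.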

Stephen Henderson
Wolfson Inst. for Biomedical Research
Cruciform Bldg., Gower Street
University College London
United Kingdom, WC1E 6BT
+44 (0)207 679 6827

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Bornman,
Daniel M
Sent: 20 February 2006 13:34
To: Robert Gentleman
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Invalid fold-filter

Robert,

After reading your response to my initial question, I do not believe you
addressed exactly what I attempted to describe.  Please pardon me for
not being clear.  I think your response assumed I was filtering on
unadjusted p-values then applying a correction such as Benjamini &
Hochberg to a reduced set.

My question was rather on the validity of first filtering each gene
based on fold-change between two sample groups (i.e. controls vs
treated) then calculating a test-statistic, raw p-value and corrected
p-value on each gene that passed the fold-change filter.  I am worried
that using the group phenotype description to filter followed by
applying a p-value correction is unfairly reducing my multiple
comparison penalty.
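
To make the size of that penalty concrete, here is a small sketch of the Benjamini & Hochberg adjustment in Python. The helper name `bh_adjust` and the p-values are made up for illustration. It only shows the arithmetic: the same raw p-values get smaller adjusted values when the pool of tests shrinks. It does not by itself demonstrate the bias.

```python
def bh_adjust(pvalues):
    """Benjamini & Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values: 3 small ones plus 97 spread over 0.03 ... 0.99.
small = [0.001, 0.002, 0.003]
rest = [0.03 + 0.01 * k for k in range(97)]

# Adjusted values for the 3 small p-values among all 100 tests.
full = bh_adjust(small + rest)[:3]

# Pretend a phenotype-based filter kept only 10 probes, the 3 small ones included.
reduced = bh_adjust(small + rest[:7])[:3]

print(full)
print(reduced)
```

With these numbers the three small p-values adjust to about 0.1 among 100 tests and to about 0.01 among 10, a tenfold difference that comes entirely from the size of the filtered set.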

I propose that a less biased approach to fold-filtering would be to
filter probes based on the mean of the lower half versus the mean of the
upper half of expression values at each probe regardless of the
phenotype (non-specific).  This would surely (except in some instances
where a phenotype causes drastic expression changes) cause the
fold-filtered set to be larger and thus not unfairly decrease the
multiple comparison penalty when computing adjusted p-values.   
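
The proposal above could be sketched as follows, in Python for illustration. The names `half_split_fold` and `nonspecific_fold_filter`, the log2 scale, the cutoff of 1.0 and the toy values are assumptions of this sketch, as is the choice to leave out the middle value when the number of arrays is odd.

```python
from statistics import mean

def half_split_fold(values):
    """Mean of the upper half minus mean of the lower half of the sorted values.

    Values are assumed to be log2 expression, so the difference is a
    log2 fold change. With an odd count the middle value is left out.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    if half == 0:
        raise ValueError("need at least two values per probe")
    return mean(ordered[-half:]) - mean(ordered[:half])

def nonspecific_fold_filter(expr, min_log2_fold=1.0):
    """Keep probes whose half-split log2 fold change reaches the cutoff."""
    return [p for p, v in expr.items() if half_split_fold(v) >= min_log2_fold]

# Toy, made-up log2 expression values; no group labels are supplied anywhere.
expr = {
    "probe_a": [7.0, 7.1, 6.9, 7.0, 7.2, 6.8],   # essentially flat
    "probe_b": [6.0, 8.5, 6.2, 8.4, 6.1, 8.6],   # two distinct levels
}
print(nonspecific_fold_filter(expr))
```

Because the statistic depends only on the sorted values, shuffling the sample labels cannot change which probes pass.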

Thank You,
Daniel  

-----Original Message-----
From: Robert Gentleman [mailto:rgentlem at fhcrc.org] 
Sent: Saturday, February 18, 2006 12:57 PM
To: Bornman, Daniel M
Subject: Re: [BioC] Invalid fold-filter

Hi Daniel,
  I hope not; it is, as you have noted, a flawed approach.

best wishes
   Robert

Bornman, Daniel M wrote:
> I of course agree that filtering on a variable (phenotype) that will 
> be used later to calculate adjusted p-values is flawed and therefore 
> it is not a method I would implement; however, it seems that many who
> describe fold-filtering are doing just that.
> Thank you for your response.
> 
> -----Original Message-----
> From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
> Sent: Friday, February 17, 2006 2:15 PM
> To: Bornman, Daniel M
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Invalid fold-filter
> 
> 
> 
> Bornman, Daniel M wrote:
> 
>>Dear BioC Folks,
>>
>>As a bioinformatician within a Statistics department I often consult 
>>with real statisticians about the most appropriate test to apply to 
>>our microarray experiments.  One issue that is being debated among our
>>statisticians is whether some types of fold-filtering may be invalid 
>>or biased in nature.  The types of fold-filtering in question are 
>>those that tend to NOT be non-specific.
>>Some filtering of a 54K probe affy chip is useful prior to making 
>>decisions on differential expression and there are many examples in 
>>the Bioconductor documentation (particularly in the {genefilter}
>>package) on how to do so.  A popular method of non-specific filtering 
>>for reducing your probeset prior to applying statistics is to filter 
>>out low expressed probes followed by filtering out probes that do not 
>>show a minimum difference between quartiles.  These two steps are 
>>non-specific in that they do not take into consideration the actual
>>samples/arrays.
> 
>>On the other hand, if we had two groups of samples, say control versus
>>treated, and we filtered out those probes that do not have a mean 
>>difference in expression of 2-fold between the control and treated 
>>groups, this filtering was based on the actual samples.  This is NOT a
>>non-specific filter.  The problem then comes (or rather the debate here
>>arises) when a t-test is calculated for each probe that passed the 
>>sample-specific fold-filtering and the p-values are adjusted for 
>>multiple comparisons by, for example the Benjamini & Hochberg method.
>>Is it valid to fold-filter using the sample identity as a criterion 
>>followed by correcting for multiple comparisons using just those 
>>probes that made it through the fold-filter?  When correcting for 
>>multiple comparisons you take a penalty for the number of comparisons 
>>you are correcting for.  The larger the pool of comparisons, the larger 
>>the penalty, thus the larger the adjusted p-value.  Or more 
>>importantly, the smaller the set, the less your adjusted p-value is 
>>adjusted (increased) relative to your raw p-value.  The argument is 
>>that you used the very samples you are comparing to unfairly reduce 
>>the adjusted p-value penalty.
> 
> 
>   It is not valid to use phenotype to compute t-statistics for a 
> particular phenotype and filter based on those p-values and to then 
> use p-value correction methods on the result. I don't think we need 
> research; it seems pretty obvious that this is not a valid approach.
> 
>    You can do non-specific filtering, but all you are really doing 
> there is to remove genes that are inherently uninteresting no matter 
> what the phenotype of the corresponding sample (if there is no 
> variation in expression for a particular gene across samples then it 
> has no information about the phenotype of the sample). Filtering on 
> low values is probably a bad idea although many do it (and I used to, 
> and still do sometimes depending on the task at hand).
> 
> 
>   Best wishes
>     Robert
> 
> 
>>Has anyone considered this issue or heard of problems of using a 
>>specific type of filtering rather than a non-specific one?
>>Thank You for any responses.
>>
>>Daniel Bornman
>>Research Scientist
>>Battelle Memorial Institute
>>505 King Ave
>>Columbus, OH 43201
>>
>>_______________________________________________
>>Bioconductor mailing list
>>Bioconductor at stat.math.ethz.ch
>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>
> 
> 
> --
> Robert Gentleman, PhD
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> PO Box 19024
> Seattle, Washington 98109-1024
> 206-667-7700
> rgentlem at fhcrc.org
> 
> 

