[BioC] Limma : post statistical gene filtering

Thu Jun 16 17:23:34 CEST 2011

If you filter the genes after performing the t-test, then I will not 
believe the results.  Filtering based on any criterion that uses 
information about how much the genes differ between the two groups 
being contrasted (fold change, p-value, etc.) is statistically and 
scientifically invalid.
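
To illustrate why, here is a minimal simulation sketch in Python (not 
limma; the function names, the known-variance z-test standing in for a 
moderated t-test, and the cutoff of 2 are all assumptions made for 
illustration).  Every simulated gene is pure noise, yet filtering on the 
observed difference before the Benjamini-Hochberg adjustment makes every 
surviving gene look significant.

```python
import math
import random


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def simulate_null(n_genes=10000, n_per_group=2, seed=1):
    """Simulate genes with NO true difference between two groups.

    Assumption: unit variance is known, so a z-test replaces the t-test.
    Returns (observed differences, two-sided p-values).
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # SE of a difference of two means
    diffs, pvals = [], []
    for _ in range(n_genes):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        d = sum(b) / n_per_group - sum(a) / n_per_group
        diffs.append(d)
        pvals.append(math.erfc(abs(d / se) / math.sqrt(2.0)))
    return diffs, pvals


diffs, pvals = simulate_null()

# Valid: adjust over all genes that were tested.
n_sig_all = sum(q < 0.05 for q in bh_adjust(pvals))

# Invalid: keep only genes with a large observed difference, then adjust.
kept = [p for d, p in zip(diffs, pvals) if abs(d) > 2.0]
n_sig_filtered = sum(q < 0.05 for q in bh_adjust(kept))

print("significant without filtering:", n_sig_all)
print("kept by fold-change filter:   ", len(kept))
print("'significant' after filtering:", n_sig_filtered)
```

In this simplified setting, keeping only |difference| > 2 is the same as 
keeping only p-values below about 0.046, so even the largest adjusted 
p-value in the filtered set is below 0.05.  With real data and an 
estimated variance the effect would typically be less extreme, but the 
bias is in the same direction.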

People have made (and continue to make) the argument that it is 
safe/sound/reasonable to filter on criteria that do not rely on the 
results of the statistical test.  Examples of these kinds of filters are 
ones that look at the mean (or the max, or some percentile) of the gene 
expression across the entire data set, or at the variance or range 
across the entire data set.
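
As a sketch of such a nonspecific filter (Python for illustration only; 
the function name, the dictionary layout, and the keep_fraction 
parameter are hypothetical), the key property is that the group labels 
are never consulted:

```python
import statistics


def nonspecific_filter(expr, keep_fraction=0.5, score=statistics.variance):
    """Keep the genes with the highest overall score.

    expr maps gene name -> expression values across ALL arrays.
    score sees the pooled values only, never the group labels.
    """
    scores = {gene: score(values) for gene, values in expr.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data: four genes measured on four arrays.
expr = {
    "geneA": [8.0, 8.0, 8.0, 8.0],
    "geneB": [5.0, 9.0, 5.0, 9.0],
    "geneC": [6.0, 6.5, 6.0, 6.5],
    "geneD": [2.0, 11.0, 2.0, 11.0],
}

by_variance = nonspecific_filter(expr)                     # overall variance
by_mean = nonspecific_filter(expr, score=statistics.mean)  # overall mean
print(by_variance, by_mean)
```

The multiplicity adjustment would then be applied only to the genes that 
pass the filter.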

If you have no differential expression, then filtering is not going to 
magically create it for you.  I would advise one of the following options:
[1] Rank the genes by p-value or t-statistic (possibly filtered by 
fold change) and perform PCR on the top ten to see if any of them can 
actually be confirmed.
[2] Run more arrays so you have enough replicates to provide adequate 
power to discover smaller differences in expression than you can expect 
to find with only two replicates per group.
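
For option [2], here is a rough normal-approximation sketch of how the 
number of replicates per group scales with the difference you hope to 
detect (Python for illustration; the function name and the example 
values for sigma, delta, and alpha are assumptions, and a real design 
should use a power calculation suited to the test actually applied):

```python
import math
from statistics import NormalDist


def replicates_per_group(delta, sigma, alpha=0.001, power=0.8):
    """Approximate n per group for a two-sided, two-sample z-test.

    delta: true difference in mean log2 expression to detect
    sigma: per-array standard deviation of log2 expression
    alpha: per-gene significance level (small, to allow for multiple testing)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    n = 2.0 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)


# Hypothetical values: detect a 2-fold change (delta = 1 on the log2 scale).
for sigma in (0.25, 0.5, 1.0):
    print(sigma, replicates_per_group(delta=1.0, sigma=sigma))
```

The normal approximation is optimistic for very small n, so treat the 
numbers as lower bounds.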

     Kevin

On 6/16/2011 9:17 AM, Stephanie PIERSON wrote:
> Dear bioconductor listers,
>
> I am analyzing Agilent two-color microarray data, and I chose the limma 
> package for normalization and statistical analysis because I have only 
> 2 replicates per condition and I read in a paper that a moderated 
> t-test performs better when there are few replicates.
>
> The problem is that when I performed the statistical test on the whole 
> data set (35000 probes), I found no differential expression, i.e., all 
> the adjusted p-values are between 0.5 and 0.9. I have seen on the list 
> that the question of prefiltering genes has already been asked: some 
> people on the list recommend doing the normalization, model fitting, 
> etc., and then filtering before doing the multiplicity adjustment.
> So, after the statistical analysis, I removed genes with log2FC<2 
> (ebayes$coefficients) and then performed the FDR adjustment. But once 
> again, I have no adjusted p-value < 0.05.
>
> So I was wondering which criteria I could use to filter out genes 
> before the multiple testing correction: p-value? log2FC? Other criteria?
>
> I have a lot of variability between replicates, i.e., for many genes, I 
> have a fold change <0 in one replicate (for example, -5) and >0 in the 
> other replicate (for example, 3). Do you think I should remove those 
> genes before the statistical analysis, or can I keep them?
>
>
> Thank you,
> Best wishes
> Stéphanie
>
>
>