[BioC] Limma : post statistical gene filtering

Sean Davis sdavis2 at mail.nih.gov
Thu Jun 16 16:41:22 CEST 2011

Hi, Stephanie.

On Thu, Jun 16, 2011 at 10:17 AM, Stephanie PIERSON
<stephanie.pierson at etumel.univmed.fr> wrote:
> Dear Bioconductor listers,
> I am analyzing Agilent two-color microarray data, and I chose the limma
> package for normalization and statistical analysis because I only have 2
> replicates per condition, and I have read in some papers that a moderated
> t-test performs better when there are few replicates.
> The problem is that when I perform the statistical test on the whole data
> set (35000 probes), I find no differential expression, i.e., all the
> adjusted p-values lie between 0.5 and 0.9. I have seen on the list that
> the question of prefiltering genes has already been asked: some people on
> the list recommend doing the normalization, model fitting, etc., and then
> filtering before doing the multiplicity adjustment.
> So, after the statistical analysis, I removed genes with log2FC<2
> (ebayes$coefficients) and applied the FDR correction. But once again, I
> have no adjusted p-value < 0.05.

Just a note on filtering.  You should not filter on any measure that
is derived from knowledge of the groupings.  In this case, filtering
based on ebayes$coefficients is not valid and will result in p-values
being incorrect (and falsely significant).
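
As a rough illustration of why this matters, here is a small simulation
sketch in base R. It is not limma's moderated t-test: it uses an ordinary
equal-variance t-test on simulated null data, and the 2-vs-2 layout, the
20000 probes, and the 10% cutoff are all assumptions chosen for the example.
Even though no probe is truly changed, the probes that survive a
fold-change filter show far more small p-values than the null predicts.

```r
set.seed(42)
n_probes <- 20000

# Simulated null data: no probe is truly differentially expressed.
x <- matrix(rnorm(n_probes * 4), nrow = n_probes)
group <- c(1, 1, 2, 2)  # hypothetical layout: 2 replicates per condition

# Ordinary equal-variance two-sample t-test per probe
# (a stand-in for illustration, not limma's moderated t).
m1 <- rowMeans(x[, group == 1])
m2 <- rowMeans(x[, group == 2])
log_fc <- m2 - m1
ss <- rowSums((x[, group == 1] - m1)^2) + rowSums((x[, group == 2] - m2)^2)
pooled_var <- ss / 2                              # df = 2 + 2 - 2
t_stat <- log_fc / sqrt(pooled_var * (1/2 + 1/2))
p <- 2 * pt(-abs(t_stat), df = 2)

# Group-aware filter: keep the 10% of probes with the largest |log fold change|.
keep <- abs(log_fc) >= quantile(abs(log_fc), 0.9)

frac_all  <- mean(p < 0.05)        # close to the nominal 0.05 under the null
frac_kept <- mean(p[keep] < 0.05)  # inflated, although nothing is truly changed
c(all = frac_all, filtered = frac_kept)
```

Any multiplicity adjustment applied only to the filtered probes inherits
this distortion, which is why the resulting adjusted p-values cannot be
trusted.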

As for your data, assuming that your limma analysis is correct, it
sounds as if you have no evidence of differential expression.  Perhaps
your study would benefit from a larger number of samples to improve power.

> So, I was wondering on which criteria I could filter out genes before the
> multiple testing correction: p-value? log2FC? Other criteria?

There are papers on the subject and several email exchanges on this
list (which is searchable), but filtering based on variance across ALL
samples (not within groups) is a common technique.  The goal is not to
pick the lowest variance genes but to pick the top X% of the genes
with the highest variance (where X could be about 40-60%, roughly).
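
A minimal sketch of such a filter in base R might look like the following.
The matrix M is a simulated stand-in for a normalized log-ratio matrix
(rows are probes, columns are arrays), and the names M, probe_var, and
top_fraction are made up for this example. The 50% cutoff is just one
arbitrary choice within the rough range given above.

```r
# Hypothetical stand-in for a normalized log-ratio matrix
# (rows = probes, columns = arrays); real data would come from
# the normalization step of the analysis.
set.seed(1)
M <- matrix(rnorm(1000 * 4), nrow = 1000,
            dimnames = list(paste0("probe", 1:1000), paste0("array", 1:4)))

# Variance of each probe across ALL arrays, ignoring group labels.
probe_var <- apply(M, 1, var, na.rm = TRUE)

# Keep the most variable probes; 0.5 is an arbitrary choice of X = 50%.
top_fraction <- 0.5
cutoff <- quantile(probe_var, probs = 1 - top_fraction, na.rm = TRUE)
keep <- !is.na(probe_var) & probe_var >= cutoff
M_filtered <- M[keep, , drop = FALSE]

dim(M_filtered)
```

The filtered matrix would then go through the model fit and the
multiplicity adjustment as usual. Because the filter never looks at which
array belongs to which condition, it does not bias the test statistics the
way a fold-change filter does.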

> I have a lot of variability between replicates, i.e., for many genes I have a
> fold change <0 in one replicate (for example, -5) and >0 in the other
> replicate (for example, 3)... Do you think I should remove those genes
> before the statistical analysis, or can I keep them?

Again, removing genes based on their variance within groups is not
valid and will result in biased p-values (which are, therefore, not to
be trusted).  If you have high biological variation, you could
certainly benefit from a larger sample size.

