[BioC] Limma : post statistical gene filtering

Thu Jun 16 17:31:52 CEST 2011

On Thu, Jun 16, 2011 at 11:23 AM, Kevin R. Coombes
<kevin.r.coombes at gmail.com> wrote:
> If you filter the genes after performing the t-test, then I will not believe
> the results.  Filtering based on any criterion that knows how much the genes
> differ between the two groups being contrasted (fold change, p-value, etc.)
> is statistically and scientifically invalid.
>
> People have made (and continue to make) the argument that it is
> safe/sound/reasonable to filter on criteria that do not rely on the results
> of the statistical test. Examples of these kinds of filters are ones that
> look at the mean (max or some percentile) of the gene expression across the
> entire data set, or at the variance or range across the entire data set.
>
> If you have no differential expression, then filtering is not going to
> magically create it for you.  I would advise one of the following options:
> [1] Rank the genes by the p-value or t-statistic (possibly filtered by fold
> change) and perform PCR on the top ten to see if any of them can actually be
> confirmed.
> [2] Run more arrays so you have enough replicates to provide adequate power
> to discover smaller differences in expression than you can expect to find
> with only two replicates per group.
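
The kind of filter Kevin describes in his second paragraph (one that
looks only at variation across the entire data set) can be sketched as
follows. This is a minimal illustration in plain Python, not limma code:
the names nonspecific_filter, bh_adjust and keep_fraction are hypothetical,
and the 50% cut-off is an arbitrary assumption. The key property is that
the filter never receives the group labels.

```python
# Hypothetical sketch: a filter that never sees group labels, followed by
# Benjamini-Hochberg adjustment applied only to the genes that survive.
from statistics import pvariance


def nonspecific_filter(expr, keep_fraction=0.5):
    """Return sorted indices of the most variable genes across ALL arrays.

    expr: list of per-gene lists of log-expression values, all arrays pooled.
    Group membership is deliberately not an argument.
    """
    variances = [pvariance(row) for row in expr]
    n_keep = max(1, int(len(expr) * keep_fraction))
    ranked = sorted(range(len(expr)), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:n_keep])


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The intended use would be to call nonspecific_filter on the normalized
data first, then run the test and bh_adjust on the retained genes only.
Filtering on the fold change or p-value of the contrast itself is the
case Kevin warns against and is not what this sketch does.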

Kevin and I agree on these points.  I would add a third option which
is to use GSEA-like ideas or gene set tests to look for a signal of
differential expression in larger gene sets.
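
As a rough illustration of the gene set idea, the sketch below asks
whether the genes in a set have larger absolute t-statistics, on average,
than randomly drawn sets of the same size. It is a simplified
gene-permutation test in plain Python, not GSEA itself and not a limma
function; the name gene_set_pvalue and its parameters are hypothetical.
Permuting genes ignores correlation between genes, so the resulting
p-values should be read as indicative only.

```python
# Hypothetical sketch of a simple competitive gene set test.
import random


def gene_set_pvalue(t_stats, gene_set, n_perm=2000, seed=0):
    """P-value that the set's mean |t| is at least as large as that of
    random gene sets of equal size.

    t_stats: dict mapping gene identifier -> t-statistic (all genes tested).
    gene_set: iterable of gene identifiers forming the set of interest.
    """
    rng = random.Random(seed)
    genes = list(t_stats)
    members = [g for g in gene_set if g in t_stats]
    if not members:
        raise ValueError("gene set has no genes with a statistic")
    k = len(members)
    observed = sum(abs(t_stats[g]) for g in members) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)
        score = sum(abs(t_stats[g]) for g in sample) / k
        if score >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)
```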

Sean

>    Kevin
>
> On 6/16/2011 9:17 AM, Stephanie PIERSON wrote:
>>
>> Dear bioconductor listers,
>>
>> I am analyzing Agilent two-color microarray data, and I chose the limma
>> package for normalization and statistical analysis because I only have 2
>> replicates per condition and I read in some papers that a moderated t-test
>> performs better when there are few replicates.
>>
>> The problem is that when I performed the statistical test on the whole
>> data set (35000 probes), I found no differential expression, i.e., all the
>> adjusted p-values lie between 0.5 and 0.9. I have seen on the list that
>> the question of prefiltering genes has already been asked: some
>> people on the list recommend doing the normalization, model fitting, etc.,
>> and then filtering before doing the multiplicity adjustment.
>> So, after the statistical analysis, I removed genes with log2FC<2
>> (ebayes$coefficients) and applied the FDR adjustment. But once again, I
>> have no adjusted p-value < 0.05.
>>
>> So, I was wondering which criteria I could use to filter out genes before
>> the multiple testing correction: p-value? log2FC? Other criteria?
>>
>> I have a lot of variability between replicates, i.e., for many genes, I
>> have a fold change <0 in one replicate (for example, -5) and >0 in the
>> other replicate (for example, 3) ... do you think I should remove those
>> genes before the statistical analysis, or can I keep them?
>>
>>
>> Thank you,
>> Best wishes
>> Stéphanie
>>
>>
>>
>