[BioC] PreFiltering probe in microarray analysis

Fri Jun 3 16:10:57 CEST 2011

I have seen many questions on this mailing list like "after I applied the
multiple testing correction, I got 0 differentially expressed genes". The
suggested solution is always "gene prefiltering".

I disagree with this old idea and have proposed a new one, EDR, which does
not need gene prefiltering: http://www.ncbi.nlm.nih.gov/pubmed/20846437
However, the old idea is hard to shake because it has long been accepted
in microarray analysis, and now in RNA-seq as well, and the new idea needs
time to be recognized.

Here is an intuitive scenario. Assume that the raw p-values and the genes
with the lowest p-values are the same before gene filtering (35k genes)
and after it (5k genes). Which selection is more sound: gene x selected
from 35k genes, or the same gene selected from 5k? In other words, which
is more sound: the best student selected from 1000 students, or the best
student selected from 100?
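
The arithmetic behind this scenario can be sketched in R. The numbers
below are made up for illustration and do not come from any real
experiment. Passing n to p.adjust() is a shortcut that treats the listed
p-values as the smallest of n tests, so the results are upper bounds on
the Benjamini-Hochberg adjusted values one would get from the full
p-value vector:

```r
# Toy illustration (assumed numbers, not real data): the three smallest
# raw p-values are identical before and after filtering; only the total
# number of tests differs.
top_p <- c(1e-5, 2e-5, 5e-5)

# 'n' is the total number of tests. Only the top p-values are listed, so
# these are upper bounds on the adjusted values from the full vector.
bh_unfiltered <- p.adjust(top_p, method = "BH", n = 35000)
bh_filtered   <- p.adjust(top_p, method = "BH", n = 5000)

print(rbind(unfiltered_35k = bh_unfiltered, filtered_5k = bh_filtered))
```

With these toy numbers, the smallest raw p-value (1e-5) adjusts to about
0.35 among 35,000 tests and to about 0.05 among 5,000, although the raw
value itself has not changed.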

Wayne
--

> Hi Stephanie,
>
> You can have a look at the 'genefilter' package in R/Bioconductor.
> Basically, it's easy to set up an overall variance filter. For example,
> if you have a data set normalized by gcrma and you require all
> probesets to have an IQR bigger than 0.5, you can do:
>
>  > library(affy)
>  > library(genefilter)
>  > library(gcrma)
>  > eset <- gcrma(data)  # 'data' must be an AffyBatch
>  > f <- function(x) IQR(x) > 0.5  # TRUE if probeset IQR > 0.5
>  > selected <- genefilter(eset, f)  # one logical per probeset
>  > eset.filtered <- eset[selected, ]
>
> You may have to be careful about filtering your data. It depends quite
> a lot on the characteristics of your data. There is a paper [1] with
> a very good review of this, which doesn't really recommend
> overall variance filtering combined with limma.
>
> Cheers,
> Yuan
>
> [1] R. Bourgon, R. Gentleman and W. Huber. PNAS 2010. p9546-9551
>
> On 1 Jun 2011, at 13:58, Stephanie PIERSON wrote:
>
>> Hello everybody,
>>
>> I am a French student in bioinformatics. I have to analyze microarray
>> data and I have some questions about prefiltering genes.
>> The dataset that I have to analyze consists of 8 microarrays: I have 4
>> time points and 2 replicates for each time point. Agilent two-color
>> microarrays (Whole Mouse Genome (4x44K) Oligo Microarrays)
>> were used for the analysis. We are searching for genes that are
>> differentially expressed between two conditions (for example C1 and
>> C2) at the different time points, and genes that are differentially
>> expressed in one condition (C1 or C2) over time.
>> I have chosen LIMMA to perform the statistical analysis because I
>> read in papers (Jeanmougin et al. PLoS ONE; Jeffery et al. BMC
>> Bioinformatics 2006, 7:359) that it works better in experiments with
>> few replicates per condition.
>> I performed the statistical analysis on the whole data set (more than
>> 37,000 genes), but I get high adjusted p-values after multiple
>> testing correction (Benjamini-Hochberg). I would like to prefilter
>> genes before the statistical analysis, but I don't know how to do this.
>> I read in Bourgon's paper that we can filter on the overall variance
>> or on the overall mean, but in my case, with few replicates, how can
>> I do this? Moreover, that paper does not recommend combining
>> limma with a filtering procedure...
>> Can someone help me, please?
>>
>> Thank you,
>> Best wishes
>> Stéphanie
>>
>>
>>
>> --
>> Stéphanie PIERSON
>> Universite de la Mediterranee (Aix-Marseille II)
>> Master 2 Pro Bioinformatique et Génomique
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor