[BioC] PreFiltering probe in microarray analysis

Thu Jun 2 19:58:02 CEST 2011

Hi Stephanie,
There is another recent paper that you might consider which also
cautions about filtering
Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR
and power. BMC Bioinformatics, 11(1), 450.
They also recommend their own statistical test to see if one's filter
biases FDR.
currently I am trying variance filter and feature filter from
genefilter package: try ?nsFilter for help on these functions.
However, I dont use filtering routinely since choosing the right
filter , parameters and testing the effects of any bias are things I
have not worked out in addition to having read Bourgon et al and
Iterson et al and others that discuss this issue.
About your limma results, while conventional filtering may be expected
to increase the number of significant genes, as the papers suggest
likelihood of false positives also increases.In your current results,
do you have high fold changes above 2 (log2>1)?  You may want to
explore the biological relevance of those genes with high FC and
significant unadjusted p value.
Best,
Swapna
>
> Message: 2
> Date: Wed, 01 Jun 2011 14:58:47 +0200
> From: Stephanie PIERSON <stephanie.pierson at etumel.univmed.fr>
> To: bioconductor at r-project.org
> Subject: [BioC] PreFiltering probe in microarray analysis
> Message-ID: <20110601145847.72402aos7e2ne074 at wmeletud2.univmed.fr>
> Content-Type: text/plain; charset=ISO-8859-15; DelSp="Yes";
>         format="flowed"
>
> Hello everybody,
>
> I am a french student in bioinformatic. I have to analyze microarray
> data and I have some questions about prefiltering genes.
> The dataset that I have to analyze consist in 8 microarray, i have 4
> times points and 2 replicats for each time point. Agilent's two color
> microarray  (Whole Mouse Genome (4x44K) Oligo Microarrays) were used
> for the analysis. We are searching for genes that are differentially
> expressed between two conditions (for example C1 and C2) at the
> different time points and genes that are differentially expressed in
> one condition (C1 or C2) over time .
> I have chosen LIMMA to perform the statistical analysis because I read
> in papers (Jeanmougin et al. PLoS ONE, Jefferey and al. BMC
> bioinformatic 2006,7/359  ) that it work better in experiment with few
> replicate per conditions.
> I perfom the statistical analysis on the whole data set ( more than 37
> 000 genes ), but I have high corrected p value after multiple testing
> correction (benjamini hochberg ). I would like to prefilter genes
> before statistical analysis, but I don't know how to do this. I read
> in Bourgon's paper that we can filter on the overall variance or on
> the overall mean, but in my case, with few replicates, how can I do ?
> In more, in this paper, it is not recommended to combine limma with a
> filtering procedure ...
> Someone can help me please ?
>
> Thank you,
> Best wishes
> St?phanie
>
>
>
> --
> St?phanie PIERSON
> Universite de la Mediterranee (Aix-Marseille II)
> Master 2 Pro Bioinformatique et G?nomique
>
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 1 Jun 2011 14:31:48 +0100
> From: Yuan Hao <yuan.x.hao at gmail.com>
> To: Stephanie PIERSON <stephanie.pierson at etumel.univmed.fr>
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] PreFiltering probe in microarray analysis
> Message-ID: <E177DA51-396F-4BD3-A2C9-818C6FC2C76F at gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes
>
> Hi Stephanie,
>
> You can have a look the 'genefilter' package in R/bioconductor.
> Basically, it's easy to set up a overall variance filter, for example
> if you have a data set normalized by gcrma and you require all
> probesets having an IQR bigger than 0.5, you can do:
>
>  > library(affy)
>  > library(genefilter)
>  > library(gcrma)
>  > eset <- gcrma(data)
>  > f <- function(x)(IQR(x)>0.5)
>  > selected <- genefilter (eset, f)
>  > eset.filtered <- eset[selected, ]
>
> You may have to be careful about the filtering on your data. It quiet
> depends on the characters of your data. There is a paper[1] having had
> a very good review about this, which doesn't really recommend an
> overall variance filtering combined with Limma.
>
> Cheers,
> Yuan
>
> [1] R. Bourgon, R. Gentleman and W. Huber. PNAS 2010. p9546-9551
>
> On 1 Jun 2011, at 13:58, Stephanie PIERSON wrote:
>
>> Hello everybody,
>>
>> I am a french student in bioinformatic. I have to analyze microarray
>> data and I have some questions about prefiltering genes.
>> The dataset that I have to analyze consist in 8 microarray, i have 4
>> times points and 2 replicats for each time point. Agilent's two
>> color microarray  (Whole Mouse Genome (4x44K) Oligo Microarrays)
>> were used for the analysis. We are searching for genes that are
>> differentially expressed between two conditions (for example C1 and
>> C2) at the different time points and genes that are differentially
>> expressed in one condition (C1 or C2) over time .
>> I have chosen LIMMA to perform the statistical analysis because I
>> read in papers (Jeanmougin et al. PLoS ONE, Jefferey and al. BMC
>> bioinformatic 2006,7/359  ) that it work better in experiment with
>> few replicate per conditions.
>> I perfom the statistical analysis on the whole data set ( more than
>> 37 000 genes ), but I have high corrected p value after multiple
>> testing correction (benjamini hochberg ). I would like to prefilter
>> genes before statistical analysis, but I don't know how to do this.
>> I read in Bourgon's paper that we can filter on the overall variance
>> or on the overall mean, but in my case, with few replicates, how can
>> I do ? In more, in this paper, it is not recommended to combine
>> limma with a filtering procedure ...
>> Someone can help me please ?
>>
>> Thank you,
>> Best wishes
>> St?phanie
>>
>>
>>
>> --
>> St?phanie PIERSON
>> Universite de la Mediterranee (Aix-Marseille II)
>> Master 2 Pro Bioinformatique et G?nomique
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 27 May 2011 21:59:41 +0000
> From: Ilya Lipkovich <ilya.lipkovich at gmail.com>
> To: <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] R package modreg
> Message-ID: <loom.20110527T235533-145 at post.gmane.org>
> Content-Type: text/plain; charset="us-ascii"
>
> seems modreg has been merged into stats which is included in R
>
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 30 May 2011 10:17:54 -0700 (PDT)
> From: Les Dethlefsen <dethlefs at stanford.edu>
> To: bioconductor at r-project.org
> Subject: [BioC] question/issue with multtest mt.maxT
> Message-ID:
>         <570911820.845739.1306775874471.JavaMail.root at zm08.stanford.edu>
> Content-Type: text/plain; charset=utf-8
>
> Hi All,
>
> I'm a biologist, not a statistician, and a bit of a newbie with R, so please forgive me if this is a stupid question, and thanks for your help.
>
> My data are from shotgun metagenomic sequencing of multiple samples, but it's essentially similar to microarray data.  I have a matrix of 3312 functional gene categories by 24 samples, with real-valued entries indicating the proportion of interpretable gene sequence data for each sample that is assigned to each functional category.  There are a number of ways to bin the samples into categories, based on some other analysis I'm most interested in a breakdown into two subjects and 5 temporal categories within each subject, i.e. a total of 10 categories.  Two of the categories have 4 samples each, the remaining 8 categories have 2 samples each.
>
> When I ask mt.maxT to carry out a 2-sided F test on the 10 categories, I see something I don't understand: the adjusted p values are not a monotonic transformation of the raw p values.  They are a monotonic with respect to the test statistic...as the test statistic falls, the adjusted p values rise, as expected.  But the raw p values are bouncing around quite a bit, with the lowest raw p values of 0.0001 having a range of adjusted p values, all much larger than the minimum adjusted p value.  There are 244 raw p values lower than that which corresponds to the minimum adjusted p value.
>
> My (perhaps mistaken) understanding was that permutations were used to generate the raw p values (which I definitely want, since I don't think the distributions of values for the different functional gene categories will follow any regular distribution or be similar to each other), and then some sort of adjustment was made to the raw values to adjust for multiple testing considerations, but the adjustment itself was not based on permutations.  I would expect the permutations to be fairly noisy in my case with only 2 samples in most of my sample categories, but wouldn't that only affect the reliability of the raw p values?  How can the adjusted p values have a different rank order than the raw p values?
>
> I did run the same command with the samples binned only to subject, i.e. only 2 categories of 12 subject each. While the range of p values is much lower (not surprisingly) there is still the same pattern with a monotonic relationship between the test statistic and the adjusted p values, but not between the raw p values and adjusted p values.  Hence, it doesn't seem that this behavior is due to the messiness of 10 unequally sized categories for only 24 samples.
>
> I'm including some snips of my R session below; if anyone is interested in playing with the original data there's a compressed file at
>
> ftp://asiago.stanford.edu/multtest_issue.tgz
>
> that has the data matrix and the class labels.
>
> I'm running R 2.13.0 (on Mac OSX with the R.app GUI) with newly-installed BioConductor, so everything should be up to date.
>
> Thanks for any help or insight!
> Les
>
>
> Les Dethlefsen
> Relman Lab
> Microbiology & Immunology
> Stanford University
>
> R version 2.13.0 (2011-04-13)
> Copyright (C) 2011 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> [R.app GUI 1.40 (5751) x86_64-apple-darwin9.8.0]
>
>> ko.data.df=as.data.frame(read.table("K57.fvn.ko.txt",header=TRUE,sep="",row.names=1))
>> class.label.df=as.data.frame(t(read.table("ClassLabel2.txt",header=FALSE,sep="",row.names=1)))
>> class.label.df
>     Sub CpNon Cp12Non CpPIP Cp12PIP Cp12PELP SubCp12PIP
> V2    0     0       0     0       0        0          0
> V3    0     0       0     0       0        0          0
> V4    0     1       1     1       1        1          1
> V5    0     1       1     1       1        1          1
> V6    0     0       0     2       2        2          2
> V7    0     0       0     2       2        2          2
> V8    0     0       0     2       2        3          2
> V9    0     0       0     2       2        3          2
> V10   0     1       2     1       3        4          3
> V11   0     1       2     1       3        4          3
> V12   0     0       0     3       4        5          4
> V13