[BioC] PreFiltering probe in microarray analysis

Tue Jun 14 15:57:24 CEST 2011

  Dear Matt,

I read your email again. Since you have lots of thoughts about this 
issue, I guess you have probably also thought a lot about the solutions. 
I hope my continued follow-up is not boring. Please point it out if I 
have said anything wrong.

There is no question (or at least there are fewer questions) about 
low-throughput experimental results, such as RT-PCR validation of a 
differentially expressed gene.

However, when we test many genes in a microarray or RNA-seq experiment, 
we do need something like the FDR to control how many genes we are going 
to report. Even though this FDR is not the "absolutely true false 
discovery rate", it can work as a relative control. The point is that 
when different people use the same FDR method, the reported FDRs should 
be comparable.

Usually people do not prefilter genes first; they do it only when they 
find the FDR is too high. If you report a gene list with a very high 
FDR, the reviewers will reject the paper. Therefore people try to get 
an amazingly good FDR by gene prefiltering. The same gene list that had 
a high FDR before prefiltering now has a lower FDR, and the reviewers 
are happy with the good FDR.
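
To make this concrete, here is a minimal sketch in Python. It assumes the 
FDR method is the Benjamini-Hochberg adjustment (the thread does not say 
which method is used), and all p-values are made up for illustration. The 
helper names bh_adjust and null_pvalues are hypothetical, not from any 
package. The same three top genes keep the same raw p-values; only the 
number of tests changes from 35,000 to 5,000.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def null_pvalues(n):
    # Assumption: evenly spread, unremarkable p-values in (0.05, 1.0]
    # standing in for all the other genes on the array.
    return [0.05 + 0.95 * (k + 1) / n for k in range(n)]


# Hypothetical raw p-values of the same three top genes in both analyses.
top = [1e-6, 4e-6, 9e-6]

unfiltered = bh_adjust(top + null_pvalues(35000 - len(top)))  # 35,000 tests
filtered = bh_adjust(top + null_pvalues(5000 - len(top)))     # 5,000 tests

for g in range(len(top)):
    print(f"gene {g + 1}: raw p = {top[g]:.0e}, "
          f"adjusted (35k) = {unfiltered[g]:.3f}, "
          f"adjusted (5k) = {filtered[g]:.3f}")
```

With these made-up numbers, the adjusted p-value of the top gene is about 
0.035 with all 35,000 genes and about 0.005 after cutting to 5,000 genes. 
That is a factor of 7, the ratio of the two numbers of tests, although the 
raw p-values and the ranking of the top genes never changed.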

It seems that, in some cases, "with this FDR method, we have to do gene 
prefiltering in order to get a good FDR". We can see here that there are 
two problems: one is the FDR method itself, and the other is the gene 
prefiltering approach.

Having thought a lot about these problems, I came up with a solution 
called EDR, in which I have addressed them:
http://www.ncbi.nlm.nih.gov/pubmed/20846437

Have you read this paper? Do you think it could be one of the 
standardized solutions? Any comments would be appreciated.

Best wishes,

Wayne

-- 
-----------------------------------------------------------------------
Wayne Xu, Ph.D
Computational Genomics Specialist

Supercomputing Institute for Advanced Computational Research
550 Walter Library
117 Pleasant Street SE
University of Minnesota
Minneapolis, Minnesota 55455
email: wxu at msi.umn.edu        help email:  help at msi.umn.edu
phone: 612-624-1447           help phone:  612-626-0802
fax:   612-624-8861
-----------------------------------------------------------------------

--On 6/13/2011 9:01 AM, Arno, Matthew wrote:
> Wayne - I *definitely* mean cheating! It depends on whether the FDR is reported I suppose. Let's say you do a microarray screen and the 'most changed' gene that comes up (either by largest fold change or smallest t-test/ANOVA p-value) is 'interesting' biologically speaking. You go on to validate the change (on the same samples and further test sets) using qPCR and/or western blots etc., if you go as far as protein analysis. Therefore you can analyse the importance of that single gene in a real biological context. No one could argue that the gene is not changed in the study and other samples, because of the low-throughput validation, and it makes a nice biological story for a paper. This is regardless of the arrays used, the test used, the FDR or actual p-value even. You could have picked the gene by sticking a pin in a list; you just used an array to make that pin stick more likely to give a real change.
>
> However, the statistical factors do definitely matter when you are trying to report an overall analysis with lots of genes/patterns/pathways/functions etc, with a wide range of conclusions, perhaps in the absence of being able to perform a high-throughput validation of every gene (or a proportion of) in the final 'significant' list. I can see it from both sides...however, sometimes it's easy to lose sight that an array hybridisation is just a hypothesis generator, not a hypothesis solver. That said any attempt to standardise this sort of reporting must have parity and (importantly) transparency with all these factors to have any success.
>
> I don't actually think there is a single valid answer to this issue, as there are so many interpretations/angles; it's just interesting to see how the debate changes over time. And essential to keep having it too!
>
> Thanks for reading - I have lots of thoughts about this!
> Matt
> ----------------------
> Matthew Arno, Ph.D.
> Genomics Centre Manager
> King's College London
>   
> The contents of this email are strictly confidential. It may not be transmitted in part or in whole to any other individual or groups of individuals.
> This email is intended solely for the use of the individual(s) to whom they are addressed and should not be released to any third party without the consent of the sender.
>
>
>
>> -----Original Message-----
>> From: wxu at msi.umn.edu [mailto:wxu at msi.umn.edu]
>> Sent: 13 June 2011 14:14
>> To: Arno, Matthew
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>
>> Thanks, Matt, for joining this discussion,
>>
>> That is true from a biologist's point of view. You always get the same top
>> 10 genes whether you filter or not. But this shifts to another question, the
>> 'amazingly good FDR'. For the same top ten genes, people can report
>> different FDRs by filtering or not filtering, or by filtering a different
>> number of genes. These FDRs in different reports are not comparable at
>> all. Does this FDR make sense? People can try to make it amazingly good.
>> Does that sound a little like 'cheating'? Sorry, I do not mean real cheating
>> here.
>>
>> Do you have any thoughts about this?
>>
>> Best wishes,
>>
>> Wayne
>> --
>>
>>
>>
>>> Speaking as a pure 'biologist', I think it's OK to pre-filter genes as
>>> long as you know the pitfalls, in terms of the potential bias and effect
>>> on FDRs. I am personally aware of people pre-filtering not only to
>>> enhance the FDR, but to use the results of a t-test as a starting point
>>> for a second sequential t-test because the FDRs from this test are
>>> 'amazingly good'.
>>>
>>> However statistically sacrilegious this is, the top 10 genes are always
>>> going to be the same top 10 genes, so if you are just looking for the top
>>> 10 genes, this is essentially OK.
>>>
>>> How does that hang with you guys?
>>>
>>> Matt
>>>
>>> ----------------------
>>> Matthew Arno, Ph.D.
>>> Genomics Centre Manager
>>> King's College London
>>>
>>>
>>>> -----Original Message-----
>>>> From: bioconductor-bounces at r-project.org
>>>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of wxu at msi.umn.edu
>>>> Sent: 12 June 2011 16:41
>>>> To: Wolfgang Huber
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>>>
>>>> Dear Wolfgang,
>>>>
>>>> I think it would be nice to start a discussion here about the gene
>>>> prefiltering issue. Please let me know if this suggestion is
>>>> inappropriate.
>>>>
>>>> There are two questions about gene filtering to which I could not find
>>>> answers:
>>>> 1). In traditional multiple testing, where the p-values of many test
>>>> groups are corrected, for example in a new drug effect experiment, is
>>>> it appropriate to remove some group tests from the whole experiment?
>>>> If not, why can we prefilter the genes?
>>>> 2). As I stated in the previous email, assume that the raw p-values
>>>> and the top lowest-p-value genes are the same before (35k genes) and
>>>> after gene filtering (5k genes). Gene x selected from 35k versus the
>>>> one selected from 5k: which is more sound? In other words, the best
>>>> student selected from 1000 students versus the best student selected
>>>> from 100: which is more sound?
>>>>
>>>> So this is a question about the whole point of the gene prefiltering
>>>> approach.
>>>>
>>>> Best wishes,
>>>>
>>>> Wayne
>>>> --
>>>>> Hi Swapna
>>>>>
>>>>> On Jun/2/11 7:58 PM, Swapna Menon wrote:
>>>>>> Hi Stephanie,
>>>>>> There is another recent paper that you might consider which also
>>>>>> cautions about filtering:
>>>>>> Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR
>>>>>> and power. BMC Bioinformatics, 11(1), 450.
>>>>>> They also recommend their own statistical test to see if one's filter
>>>>>> biases FDR.
>>>>>> Currently I am trying the variance filter and feature filter from the
>>>>>> genefilter package: try ?nsFilter for help on these functions.
>>>>>> However, I don't use filtering routinely, since choosing the right
>>>>>> filter and parameters and testing the effects of any bias are things I
>>>>>> have not worked out, in addition to having read Bourgon et al. and
>>>>>> van Iterson et al. and others that discuss this issue.
>>>>>> About your limma results, while conventional filtering may be expected
>>>>>> to increase the number of significant genes, as the papers suggest, the
>>>>>> likelihood of false positives also increases.
>>>>> No. Properly applied filtering does not affect the false positive rates
>>>>> (FWER or FDR). That's the whole point of it. [1]
>>>>>
>>>>> If one is willing to put up with a higher rate or probability of false
>>>>> discoveries, then don't do filtering - just increase the p-value cutoff.
>>>>>
>>>>> [1] Bourgon et al., PNAS 2010.
>>>>>
>>>>>> In your current results,
>>>>>> do you have high fold changes above 2 (log2>1)?  You may want to
>>>>>> explore the biological relevance of those genes with high FC and
>>>>>> significant unadjusted p value.
>>>>>> Best,
>>>>>> Swapna
>>>>> Best wishes
>>>>> Wolfgang Huber
>>>>> EMBL
>>>>> http://www.embl.de/research/units/genome_biology/huber
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>