[BioC] PreFiltering probe in microarray analysis

Fri Jun 17 05:18:49 CEST 2011

Hi Matt,

Let me note that PCR (or even protein analysis) performed on the SAME samples
does not solve the FDR problem. It only confirms that the microarrays
reported correct expression levels (or fold changes). So now we are sure
that in the 3 samples under condition A the level of some gene is indeed
higher than in the 3 samples under condition B, but we still do not know
whether this is a true phenomenon distinguishing conditions A and B or
whether it just happened by chance, since we are testing thousands (or tens
of thousands) of genes.
You will need additional (independent) samples to confirm that this is a
true phenomenon.
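
As a rough illustration of the "by chance" point, here is a minimal Python
sketch (standard library only). Everything in it is an assumption made for
illustration and not part of any analysis in this thread: 10000 genes with
no real difference between conditions, 3 vs 3 samples of standard normal
noise, and an equal-variance two-sample t-test at a two-sided level of 0.01
(critical value 4.604 for 4 degrees of freedom). The function names are
hypothetical.

```python
import random
import statistics


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal group sizes assumed)."""
    n = len(a)
    # With equal group sizes the pooled variance is the mean of the two variances.
    sp2 = (statistics.variance(a) + statistics.variance(b)) / 2.0
    return (statistics.mean(a) - statistics.mean(b)) / (sp2 * 2.0 / n) ** 0.5


def count_chance_hits(n_genes=10000, n_per_group=3, t_crit=4.604, seed=1):
    """Count null genes whose |t| exceeds t_crit purely by chance.

    t_crit=4.604 is the two-sided 0.01 cutoff for 4 degrees of freedom,
    so it is only valid for the default n_per_group=3.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Both groups come from the same distribution: no true difference.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(pooled_t(a, b)) > t_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_chance_hits())
```

Under these assumptions about 1% of the genes (on the order of 100) pass the
cutoff although none of them is truly different. Re-measuring the same 6
samples by PCR would confirm those differences; only new, independent
samples would show that they do not replicate.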

Moshe.

>   Dear Matt,
>
> I read your email again. Since you have lots of thoughts about this
> issue, I guess you probably have also thought a lot about the solutions.
> I hope my continued follow-up is not boring. Please point out anything I
> have got wrong.
>
> There is no question (or at least there are fewer questions) about
> experimental results such as RT-PCR confirmation of a differentially
> expressed gene.
>
> However, when we test many genes in microarray or RNAseq, we do need
> something like FDR to control how many genes we are going to report.
> Even though this FDR is not the "absolutely true" false discovery rate, it
> can work as a relative control. The point is that when different people
> use the same FDR method, the reported FDRs should be comparable.
>
> Usually people do not prefilter genes first; they do it only when
> they find that the FDR is too high. If you report a gene list with a very
> high FDR, the reviewers will reject the paper. Therefore people try to get
> an amazingly good FDR by gene prefiltering. The same gene list that had a
> high FDR before the gene prefiltering now has a lower FDR, and the
> reviewers are happy with the good FDR.
>
> It seems that, in some cases, "with this FDR method, we have to do gene
> prefiltering in order to get a good FDR". We can see here that there are
> two problems. One is the FDR method itself, and the other is the gene
> prefiltering approach.
>
> Having thought a lot about these problems, I came up with a solution called
> EDR, in which I have addressed them:
> http://www.ncbi.nlm.nih.gov/pubmed/20846437
>
> Have you read this paper? Do you think it could be one of the
> standardized solutions? Any comments would be appreciated.
>
> Best wishes,
>
> Wayne
>
> --
> -----------------------------------------------------------------------
> Wayne Xu, Ph.D
> Computational Genomics Specialist
>
> Supercomputing Institute for Advanced Computational Research
> 550 Walter Library
> 117 Pleasant Street SE
> University of Minnesota
> Minneapolis, Minnesota 55455
> email: wxu at msi.umn.edu        help email:  help at msi.umn.edu
> phone: 612-624-1447           help phone:  612-626-0802
> fax:   612-624-8861
> -----------------------------------------------------------------------
>
>
>
> --On 6/13/2011 9:01 AM, Arno, Matthew wrote:
>> Wayne - I *definitely* mean cheating! It depends on whether the FDR is
>> reported I suppose. Let's say you do a microarray screen and the 'most
>> changed' gene that comes up (either by largest fold change or smallest
>> t-test/ANOVA p-value) is 'interesting' biologically speaking. You go on
>> to validate the change (on the same samples and further test sets) using
>> qPCR and/or western blots etc., if you go as far as protein analysis.
>> Therefore you can analyse the importance of that single gene in a real
>> biological context. No one could argue that the gene is not changed in
>> the study and other samples, because of the low-throughput validation,
>> and it makes a nice biological story for a paper. This is regardless of
>> the arrays used, the test used, the FDR or actual p-value even. You
>> could have picked the gene by sticking a pin in a list; you just used an
>> array to make that pin stick more likely to give a real change.
>>
>> However, the statistical factors do definitely matter when you are
>> trying to report an overall analysis with lots of
>> genes/patterns/pathways/functions etc, with a wide range of conclusions,
>> perhaps in the absence of being able to perform a high-throughput
>> validation of every gene (or a proportion of) in the final 'significant'
>> list. I can see it from both sides...however, sometimes it's easy to
>> lose sight that an array hybridisation is just a hypothesis generator,
>> not a hypothesis solver. That said, any attempt to standardise this sort
>> of reporting must have parity and (importantly) transparency with all
>> these factors to have any success.
>>
>> I don't actually think there is a single valid answer to this issue, as
>> there are so many interpretations/angles; it's just interesting to see
>> how the debate changes over time. And essential to keep having it too!
>>
>> Thanks for reading - I have lots of thoughts about this!
>> Matt
>> ----------------------
>> Matthew Arno, Ph.D.
>> Genomics Centre Manager
>> King's College London
>>
>> The contents of this email are strictly confidential. It may not be
>> transmitted in part or in whole to any other individual or groups of
>> individuals.
>> This email is intended solely for the use of the individual(s) to whom
>> they are addressed and should not be released to any third party without
>> the consent of the sender.
>>
>>
>>
>>> -----Original Message-----
>>> From: wxu at msi.umn.edu [mailto:wxu at msi.umn.edu]
>>> Sent: 13 June 2011 14:14
>>> To: Arno, Matthew
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>>
>>> Thanks, Matt, for joining this discussion,
>>>
>>> That is true from a biologist's point of view. You always get the same
>>> top 10 genes whether you filter or not. But this shifts to another
>>> question, the 'amazingly good FDR'. For the same top ten genes, people
>>> can report different FDRs by filtering or not filtering, or by filtering
>>> out a different number of genes. The FDRs in these different reports are
>>> not comparable at all. Does such an FDR make sense? People can try to
>>> make it amazingly good. Does that sound a little like 'cheating'? Sorry,
>>> I do not mean real cheating here.
>>>
>>> Do you have any thoughts about this?
>>>
>>> Best wishes,
>>>
>>> Wayne
>>> --
>>>
>>>
>>>
>>>> Speaking as a pure 'biologist', I think it's OK to pre-filter genes as
>>>> long as you know the pitfalls, in terms of the potential bias and effect
>>>> on FDRs. I am personally aware of people pre-filtering not only to
>>>> enhance the FDR, but to use the results of a t-test as a starting point
>>>> for a second sequential t-test because the FDRs from this test are
>>>> 'amazingly good'.
>>>>
>>>> However statistically sacrilegious this is, the top 10 genes are always
>>>> going to be the same top 10 genes, so if you are just looking for the
>>>> top 10 genes, this is essentially OK.
>>>>
>>>> How does that hang with you guys?
>>>>
>>>> Matt
>>>>
>>>> ----------------------
>>>> Matthew Arno, Ph.D.
>>>> Genomics Centre Manager
>>>> King's College London
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioconductor-bounces at r-project.org
>>>>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of wxu at msi.umn.edu
>>>>> Sent: 12 June 2011 16:41
>>>>> To: Wolfgang Huber
>>>>> Cc: bioconductor at r-project.org
>>>>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>>>>
>>>>> Hi, Dear Wolfgang,
>>>>>
>>>>> I think it would be nice to open a discussion here about the gene
>>>>> prefiltering issue. Please tell me if this suggestion is
>>>>> inappropriate.
>>>>>
>>>>> There are two questions about gene filtering to which I could not find
>>>>> answers:
>>>>> 1). In traditional multiple testing, where the p-values of many test
>>>>> groups are corrected (for example, in a new drug effect experiment), is
>>>>> it appropriate to remove some group tests from the whole experiment? If
>>>>> not, why can we prefilter the genes?
>>>>> 2). As I stated in the previous email, assume that the raw p-values and
>>>>> the top lowest-p-value genes are the same before (35k genes) and after
>>>>> gene filtering (5k genes). Is gene x selected from 35k more sound, or
>>>>> the same gene selected from 5k? In other words, the best student
>>>>> selected from 1000 students versus the best student selected from 100:
>>>>> which is more sound?
>>>>>
>>>>> So this is a question about the whole point of the gene prefiltering
>>>>> approach.
>>>>> Best wishes,
>>>>>
>>>>> Wayne
>>>>> --
>>>>>> Hi Swapna
>>>>>>
>>>>>> On Jun/2/11 7:58 PM, Swapna Menon wrote:
>>>>>>> Hi Stephanie,
>>>>>>> There is another recent paper that you might consider, which also
>>>>>>> cautions about filtering:
>>>>>>> Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering, FDR
>>>>>>> and power. BMC Bioinformatics, 11(1), 450.
>>>>>>> They also recommend their own statistical test to see whether one's
>>>>>>> filter biases the FDR.
>>>>>>> Currently I am trying the variance filter and feature filter from the
>>>>>>> genefilter package: try ?nsFilter for help on these functions.
>>>>>>> However, I don't use filtering routinely, since choosing the right
>>>>>>> filter and parameters and testing the effects of any bias are things I
>>>>>>> have not worked out beyond having read Bourgon et al., van
>>>>>>> Iterson et al. and others that discuss this issue.
>>>>>>> About your limma results: while conventional filtering may be expected
>>>>>>> to increase the number of significant genes, as the papers suggest, the
>>>>>>> likelihood of false positives also increases.
>>>>>> No. Properly applied filtering does not affect the false positive rates
>>>>>> (FWER or FDR). That's the whole point of it. [1]
>>>>>>
>>>>>> If one is willing to put up with a higher rate or probability of false
>>>>>> discoveries, then don't do filtering - just increase the p-value cutoff.
>>>>>>
>>>>>> [1] Bourgon et al., PNAS 2010.
>>>>>>
>>>>>>> In your current results,
>>>>>>> do you have high fold changes above 2 (log2>1)?  You may want to
>>>>>>> explore the biological relevance of those genes with high FC and
>>>>>>> significant unadjusted p value.
>>>>>>> Best,
>>>>>>> Swapna
>>>>>> Best wishes
>>>>>> Wolfgang Huber
>>>>>> EMBL
>>>>>> http://www.embl.de/research/units/genome_biology/huber
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>

-- 
Moshe Olshansky
Division of Bioinformatics
The Walter & Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052
e-mail: olshansky at wehi.edu.au
tel: (03) 9345 2631
