[BioC] Advice on setting up a non-specific filter for differential expression

Tue Aug 17 11:31:28 CEST 2010

Hi Wolfgang,

Just an experience: in some of my analyses, applying variance filtering caused problems when fitting N(0,1) to the limma t statistic. Now that I have had a quick look at your paper, I get the idea that combining limma with the variance filter is not a good idea anyway.

The performance of MAS call-based filtering with the limma t is, however, (slightly) better than that of variance filtering with the standard t, as estimated by ROC curve analysis on my prior-knowledge data (3 arrays/condition). This is probably not unexpected?
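
For readers unfamiliar with call-based filtering, here is a minimal Python sketch of the idea, not anyone's actual pipeline. It assumes the detection calls are already available as "P" (present), "M" (marginal) or "A" (absent) per sample. The function name, the toy data and the threshold are hypothetical; the rule "present in at least 2 samples" is only one possible reading of the "2/17" criterion quoted below.

```python
def present_call_filter(calls, min_present=2):
    """Return IDs of probe sets called present ("P") in at least min_present samples.

    calls: dict mapping a probe set ID to a list of detection calls, one per array.
    """
    kept = []
    for probe_id, sample_calls in calls.items():
        n_present = sum(1 for call in sample_calls if call == "P")
        if n_present >= min_present:
            kept.append(probe_id)
    return kept


# Toy call matrix (hypothetical): three probe sets on four arrays.
toy_calls = {
    "probe_a": ["P", "P", "A", "M"],
    "probe_b": ["A", "A", "A", "P"],
    "probe_c": ["M", "A", "A", "A"],
}

kept_probes = present_call_filter(toy_calls, min_present=2)
print(kept_probes)
```

Note that the criterion does not look at which condition an array belongs to; it only counts calls across all arrays.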

Anyway, thanks for pointing to the paper, apparently a must-read before applying the nsFilter function.
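
To make the variance-type filter concrete, here is a rough Python sketch of an IQR filter followed by an ordinary (Welch) t statistic. It is an illustration under stated assumptions and not the nsFilter implementation: the function names, the toy data and the 50% cutoff are hypothetical. The one point it tries to show is that the filter statistic is computed over all arrays without using the condition labels, and only the surviving genes are then tested.

```python
import math
import statistics


def iqr(values):
    """Interquartile range of one gene's values across all arrays."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_filter(expr, keep_fraction=0.5):
    """Keep the keep_fraction of genes with the largest overall IQR.

    The condition labels are deliberately not used here.
    """
    ranked = sorted(expr, key=lambda gene: iqr(expr[gene]), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def welch_t(x, y):
    """Ordinary two-sample t statistic with unequal variances (not moderated)."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


# Toy log-scale data (hypothetical): first 3 values condition A, last 3 condition B.
expr = {
    "gene_1": [5.0, 5.1, 4.9, 7.0, 7.2, 6.8],
    "gene_2": [6.0, 6.1, 5.9, 6.0, 6.1, 5.9],
    "gene_3": [3.0, 3.0, 3.1, 3.1, 3.0, 3.0],
    "gene_4": [8.0, 9.0, 7.0, 8.5, 7.5, 9.5],
}

kept = iqr_filter(expr, keep_fraction=0.5)
t_stats = {gene: welch_t(expr[gene][:3], expr[gene][3:]) for gene in kept}
print(kept)
print(t_stats)
```

In this toy example gene_4 passes the filter because it is variable overall, yet its t statistic is small: passing the filter says nothing by itself about differential expression.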

best regards
Tobias

On Aug 17, 2010, at 9:36 AM, Wolfgang Huber wrote:

> Hi Tobias,
> you said you were worried about "filtering based on variance or IQR - as it jeopardizes ... applying a threshold on the local false discovery rate." I am not sure I understand what you mean, but the effect (or, if properly applied, non-effect) of filtering on type-I error is also discussed in [1] in some detail.
> 
> 
> 
> [1] Richard Bourgon et al. Independent filtering increases detection power for high-throughput experiments. PNAS, 107(21):9546-9551, 2010.
> [2] Talloen et al. I/NI-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data.
> Bioinformatics, doi:10.1093/bioinformatics/btm478
> 
> 	Best wishes
> 	Wolfgang
> 
> 
> On 16/08/10 16:50, Lucia Peixoto wrote:
>> Thanks Tobias for your response
>> 
>> I am processing data obtained with Affymetrix mouse chips (430_2, previous
>> version)
>> The first filtering was done based on presence/absence calls, so only genes
>> present in 2/17 samples were used. It is a two-condition setup, with 8 and 9
>> replicates for each condition. My definition of FDR in my previous question
>> was strictly limited to validation in 8+ independent qPCRs of 40+ randomly
>> selected genes obtained using a SAM cutoff of 5% FDR. So I am talking about
>> independently re-testing the reproducibility of gene expression, which is
>> the only way to really know your FDR. Using the MAS5 presence/absence call
>> filter leads to about 50% of the tested genes not being reproducible.
>> 
>> If I remove the filtering and redo the analysis at 5% FDR, all the
>> previous "false positives" become true positives. This was not a
>> surprise to me, since about 1/3 of MM probes are known to hybridize better
>> than PM probes, so I don't know what MAS5 presence/absence really means, but
>> it definitely cannot accurately reflect the presence of a transcript if the MM
>> probe hybridizes better.
>> 
>> The problem is that I have a great loss of sensitivity (I have a lot of
>> positive controls so I know that), and I would like to increase that using a
>> filter that can come closer to really defining "present", because MM/PM does
>> not.
>> Any ideas?
>> Thanks
>> 
>> Lucia
>> 
>> 
>> On Mon, Aug 16, 2010 at 8:34 AM, Tobias Straub
>> <tstraub at med.uni-muenchen.de> wrote:
>> 
>>> Hi Lucia
>>> 
>>> I am not sure if I completely understand your problem, just want to mention
>>> that I routinely apply non-specific filtering based on MAS5 calls with a
>>> very good outcome (based on a prior-knowledge training set). I do not like
>>> so much the alternative approach - filtering based on variance or IQR -  as
>>> it jeopardizes my preferred way of defining responders by applying a
>>> threshold on the local false discovery rate.
>>> 
>>> Could you expand a bit on exactly how you filter based on MAS5 calls, how
>>> you define responders and non-responders in qPCR, and what your "FDR
>>> disaster" looks like exactly?
>>> 
>>> What is your model system, by the way, and which arrays do you use?
>>> 
>>> best regards
>>> T.
>>> 
>>> 
>>> On Aug 13, 2010, at 7:11 PM, Lucia Peixoto wrote:
>>> 
>>>> Dear All,
>>>> I want to set up a non-specific filter to eliminate genes that are just not
>>>> expressed from further statistical analysis. I've previously tried a filter
>>>> based on MAS5 presence/absence calls, which turned out to be a disaster for
>>>> the FDR (as measured by lots of qPCRs), probably because 1/3 of the MM
>>>> probes actually hybridize better than PM, who knows.
>>>> 
>>>> In any case, my plan is to set up a filter based both on raw fluorescent
>>>> intensity and IQR. I am trying to get as much sensitivity as possible
>>>> without increasing my FDR too much.
>>>> I was thinking that using the intensity distributions and box plots of the
>>>> raw data may be useful to determine the cutoffs that give the best
>>>> sensitivity.
>>>> Any advice on how to select appropriate cutoffs?
>>>> 
>>>> Thank you very much in advance
>>>> Lucia
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>>> ----------------------------------------------------------------------
>>> Dr. Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D
>>> 
>>> 
>> 
> 
> -- 
> 
> 
> Wolfgang Huber
> EMBL
> http://www.embl.de/research/units/genome_biology/huber
> 

----------------------------------------------------------------------
Dr. Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D