[BioC] Advise on setting up a non-specific filter for differential expression

Wolfgang Huber whuber at embl.de
Tue Aug 17 09:36:29 CEST 2010


Hi Lucia,
the diagnostic plots in Fig. 1 of [1] might be useful for choosing filter 
criteria. We found that for Affymetrix GeneChips, overall variance 
(across all samples) is a decent correlate of "presence". Other people 
have also proposed more specialised criteria [2], which you could try.
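
As a rough illustration of an overall-variance filter, here is a minimal Python sketch. It is not the code from [1]. The function name variance_filter, the keep_fraction cutoff of 0.5 and the toy intensities are all assumptions made up for this example. The point it shows is that the variance is computed across all samples without looking at the condition labels.

```python
from statistics import variance


def variance_filter(expr, keep_fraction=0.5):
    """Return the gene IDs whose overall variance is in the top keep_fraction.

    expr maps gene ID -> list of expression values across ALL samples.
    Condition labels are never used, so the filter is non-specific.
    keep_fraction is an arbitrary illustrative cutoff, not a recommendation.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sample variance per gene, pooled over every sample.
    overall = {gene: variance(values) for gene, values in expr.items()}
    # Rank genes from most to least variable.
    ranked = sorted(overall, key=overall.get, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:n_keep]


# Toy log2-scale intensities (made-up numbers), 4 samples per gene.
toy = {
    "geneA": [7.1, 7.0, 9.2, 9.4],  # varies across samples
    "geneB": [3.0, 3.1, 3.0, 2.9],  # flat, near background
    "geneC": [8.0, 8.5, 6.1, 6.0],  # varies across samples
    "geneD": [4.2, 4.1, 4.3, 4.2],  # flat
}
kept = variance_filter(toy, keep_fraction=0.5)
```

On the toy data the two flat genes are dropped and the two variable ones are kept. In practice the cutoff would be chosen with the help of diagnostic plots rather than fixed in advance.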

Hi Tobias,
you said you were worried about "filtering based on variance or IQR - 
as it jeopardizes ... applying a threshold on the local false discovery 
rate." I am not sure I understand what you mean, but the effect (or, if 
properly applied, non-effect) of filtering on type-I error is also 
discussed in [1] in some detail.
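
To make the "filter first, then adjust" idea concrete, here is a small Python sketch. It uses the Benjamini-Hochberg adjustment as a stand-in for the multiple-testing step; SAM and local FDR, which were mentioned in this thread, work differently. The names bh_adjust and filter_then_adjust and all p-values are hypothetical and only serve the illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_then_adjust(pvalues, passed_filter):
    """Adjust only the p-values of genes that passed the filter.

    pvalues: gene ID -> raw p-value from the per-gene test.
    passed_filter: gene IDs retained by a filter that ignored the labels.
    """
    genes = [g for g in pvalues if g in passed_filter]
    adj = bh_adjust([pvalues[g] for g in genes])
    return dict(zip(genes, adj))


# Made-up p-values for illustration only.
raw_p = {"g1": 0.001, "g2": 0.02, "g3": 0.04,
         "g4": 0.30, "g5": 0.80, "g6": 0.90}
unfiltered = filter_then_adjust(raw_p, set(raw_p))
filtered = filter_then_adjust(raw_p, {"g1", "g2", "g3", "g4"})
```

With fewer hypotheses left after filtering, the adjusted p-values of the retained genes get smaller. That gain is only legitimate if the filter criterion does not use the condition labels, which is the situation discussed in [1].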



[1] Richard Bourgon et al. Independent filtering increases detection 
power for high-throughput experiments. PNAS, 107(21):9546-9551, 2010.
[2] Talloen et al. I/NI-calls for the exclusion of non-informative 
genes: a highly effective filtering tool for microarray data.
Bioinformatics, doi:10.1093/bioinformatics/btm478

	Best wishes
	Wolfgang


On 16/08/10 16:50, Lucia Peixoto wrote:
> Thanks Tobias for your response
>
> I am processing data obtained with Affymetrix mouse chips (430_2, previous
> version)
> The first filtering was done based on presence/absence calls, so only genes
> present in 2/17 samples were used. It is a 2 condition set up, with 8 and 9
> replicates for each condition. My definition of FDR in my previous question
> was strictly limited to validation in 8+ independent qPCRs of 40+ randomly
> selected genes obtained using a SAM cutoff of 5% FDR. So I am talking about
> independently re-testing the reproducibility of gene expression, which is
> the only way to really know your FDR. With the Mas5 presence/absence calls
> filter, about 50% of the tested genes were not reproducible.
>
> If I remove the filtering and redo the analysis at 5% FDR, all the previous
> "false positives" become true positives. That was not a surprise to me, since
> about 1/3 of MM probes are known to hybridize better than PM probes. I don't
> know what Mas5 presence/absence really means, but it definitely cannot
> accurately reflect the presence of a transcript if the MM probe hybridizes
> better.
>
> The problem is that I have a great loss of sensitivity (I have a lot of
> positive controls so I know that), and I would like to increase that using a
> filter that can come closer to really defining "present", because MM/PM does
> not.
> Any ideas?
> Thanks
>
> Lucia
>
>
> On Mon, Aug 16, 2010 at 8:34 AM, Tobias Straub
> <tstraub at med.uni-muenchen.de> wrote:
>
>> Hi Lucia
>>
>> I am not sure if I completely understand your problem, just want to mention
>> that I routinely apply non-specific filtering based on MAS5 calls with a
>> very good outcome (based on a prior-knowledge training set). I do not like
>> so much the alternative approach - filtering based on variance or IQR -  as
>> it jeopardizes my preferred way of defining responders by applying a
>> threshold on the local false discovery rate.
>>
>> Could you expand a bit on exactly how you filter based on MAS5 calls, how
>> you define responders and non-responders in qPCR, and what your "FDR
>> disaster" looks like exactly?
>>
>> What is your model system, by the way, and which arrays do you use?
>>
>> best regards
>> T.
>>
>>
>> On Aug 13, 2010, at 7:11 PM, Lucia Peixoto wrote:
>>
>>> Dear All,
>>> I want to set up a non-specific filter to eliminate genes that are just not
>>> expressed from further statistical analysis. I've previously tried a filter
>>> based on Mas5 presence/absence calls, which turned out to be a disaster for
>>> the FDR (as measured by lots of qPCRs), probably because 1/3 of the MM
>>> probes actually hybridize better than PM, who knows.
>>>
>>> In any case, my plan is to set up a filter based both on raw fluorescent
>>> intensity and IQR. I am trying to get as much sensitivity as possible
>>> without increasing my FDR too much.
>>> I was thinking that using the intensity distributions and box plots of the
>>> raw data may be useful to determine what the best cutoffs to obtain the best
>>> sensitivity will be.
>>> Any advice on how to select appropriate cutoffs?
>>>
>>> Thank you very much in advance
>>> Lucia
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> ----------------------------------------------------------------------
>> Dr. Tobias Straub ++4989218075439 Adolf-Butenandt-Institute, München D
>>
>>

-- 


Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber


