[BioC] Genefilter parameters for mouse 430 2

Wed Mar 19 21:52:37 CET 2008

Hi Richard,

Richard Friedman wrote:
> Dear Bioconductor Users,
> 
> 	I am using genefilter to filter an ExpressionSet of 4 Mouse 430 2 chips
> preprocessed with gcrma  prior to  analysis with limma.
> 
> Here is a description of the expressionset.
> 
>  > xen2dataeset
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 45101 features, 4 samples
>    element names: exprs
> phenoData
>    sampleNames: A_xen_1_21.cel, A_xen_2_22.cel, D_nodal_1_27.cel, D_nodal_2_28.cel
>    varLabels and varMetadata description:
>      sample: arbitrary numbering
> featureData
>    featureNames: 1415670_at, 1415671_at, ..., AFFX-r2-P1-cre-5_at (45101 total)
>    fvarLabels and fvarMetadata description: none
> experimentData: use 'experimentData(object)'
> Annotation: mouse4302
>  >
> 
> Here is my session information.
> 
>  > sessionInfo()
> R version 2.6.1 (2007-11-26)
> i386-apple-darwin8.10.1
> 
> locale:
> en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
> 
> other attached packages:
>   [1] mouse4302probe_2.0.0 mouse4302cdf_2.0.0   mouse4302.db_2.0.2
>   [4] limma_2.12.0         geneplotter_1.16.0   lattice_0.17-2
>   [7] annotate_1.16.1      AnnotationDbi_1.0.6  RSQLite_0.6-3
> [10] DBI_0.2-3            RColorBrewer_1.0-1   affyPLM_1.14.0
> [13] xtable_1.5-2         simpleaffy_2.14.05   gcrma_2.10.0
> [16] matchprobes_1.10.0   genefilter_1.16.0    survival_2.34
> [19] annaffy_1.10.1       KEGG_2.0.1           GO_2.0.1
> [22] affy_1.16.0          preprocessCore_1.0.0 affyio_1.6.1
> [25] Biobase_1.16.3
> 
> loaded via a namespace (and not attached):
> [1] KernSmooth_2.22-21 grid_2.6.1         tools_2.6.1
>  >
> 
> 
> I have tried the filtering parameters in the article by Scholtens and  
> Heydebreck on
> p 233 of the book by Gentleman et al.:
> 
>  > f1<-pOverA(0.25,log2(100))
>  > f2<-function(x)(IQR(x)>0.5)
>  > ff<-filterfun(f1,f2)
>  > selected <-genefilter(xen2dataeset,ff)
>  > sum(selected)
> [1] 289
> 
> This seemed a bit small, so I tried the effect of each of the
> parameters individually:
> 
>  > selectedp025A <-genefilter(xen2dataeset,f1)
>  > sum(selectedp025A)
> [1] 9681
>  > selectedIQRgtp5 <-genefilter(xen2dataeset,f2)
>  > sum(selectedIQRgtp5)
> [1] 731
> 
> My questions:
> 
> 1. Is the log2(100) intensity cutoff good for all chips?
>    If not, can someone recommend a good intensity cutoff for mouse 4302?

That depends. If you are using rma(), then no ;-P

Seriously, this depends on the data in hand. If you have some really dim 
chips then maybe it is too high. The problem with filtering is that it 
can be pretty ad hoc, so it's difficult to come up with a hard and fast 
rule.

You might try something like

eset2 <- nsFilter(eset)$eset

and see how many probesets you end up with.
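
To get a feel for how sensitive that count is to the cutoff, one could tabulate the survivors over a range of thresholds. A minimal base R sketch on simulated log2-scale data (the matrix x, the cutoff grid and the helper p_over_a are made up for illustration; p_over_a applies the same per-row test as pOverA(p, A)):

```r
set.seed(1)
# Simulated stand-in for exprs(eset): 1000 probesets x 4 chips, log2 scale
x <- matrix(rnorm(4000, mean = 6, sd = 2), nrow = 1000, ncol = 4)

# Hypothetical helper: TRUE for rows where at least a proportion p of
# the samples exceed A (the same test pOverA(p, A) applies per row)
p_over_a <- function(x, p, A) rowMeans(x > A) >= p

cutoffs <- log2(c(25, 50, 100, 200, 400))
kept <- sapply(cutoffs, function(A) sum(p_over_a(x, 0.25, A)))
names(kept) <- round(2^cutoffs)
kept  # number of probesets surviving each intensity cutoff
```

If the count drops sharply between neighbouring cutoffs, the final gene list is probably sensitive to that choice.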

> 2. Is the only effect of filtering to reduce the multiplier in the false
> discovery analysis, OR does it reduce false positives in other ways:
> 	A. In the case of intensity filters, by reducing the number of large
> 	   fold changes resulting from the ratios of small numbers?
> 	B. In the case of IQR filters, by eliminating large t-statistics
> 	   arising for genes with small variation across samples but
> 	   fortuitously low standard deviations?

Yes and yes, to a certain extent. If you are just doing fold changes, 
you might consider filtering on each fold change rather than overall. 
For instance you could create a filter

filt <- filterfun(kOverA(1, log2(100)))

that you would then use for each fold change comparison to ensure that 
at least one of the samples had an expression > 100 (that is, log2(100) 
on the log2 scale that gcrma() returns). Shameless plug - 
see foldFilt() in affycoretools.
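
As a rough illustration of the per-comparison idea, here is a base R sketch on simulated data. The sample names and the helper k_over_a are hypothetical; the helper applies the same per-row test as kOverA(k, A), restricted to the columns in one comparison. It is not how foldFilt() is implemented.

```r
set.seed(2)
# Simulated log2-scale expression: 1000 probesets, samples A1, A2, D1, D2
x <- matrix(rnorm(4000, mean = 6, sd = 2), nrow = 1000,
            dimnames = list(NULL, c("A1", "A2", "D1", "D2")))

# Hypothetical helper: keep probesets where at least k of the samples in
# this comparison exceed A (the per-row test that kOverA(k, A) applies)
k_over_a <- function(x, cols, k, A) rowSums(x[, cols, drop = FALSE] > A) >= k

# Data are on the log2 scale, so "expression > 100" becomes > log2(100)
keep <- k_over_a(x, c("A1", "D1"), k = 1, A = log2(100))
fc <- x[keep, "D1"] - x[keep, "A1"]   # log2 fold change, filtered rows only
sum(keep)
```

Each comparison gets its own keep vector, so a probeset that is dim in one pair of samples can still be retained for another pair.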

If you are doing t-stats with a very small number of replicates (like 2 
vs 2), then you should be using limma, in which case over-filtering 
the data can be detrimental as well. The reason is that the prior 
variance is estimated from all the probesets that remain, and if all you 
have are highly variable probesets then the prior will be larger than you 
might want. I have seen cases with very small numbers of replicates 
where using all the data on the chip resulted in many more significant 
probesets than if I did what I thought was a reasonable filter.
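
The effect can be illustrated with the moderated variance limma uses, s2.post = (d0*s0^2 + d*s^2)/(d0 + d), where s0^2 is the prior variance and d0 its degrees of freedom. A rough sketch with simulated data follows. The mean of the remaining variances is used here as a crude stand-in for s0^2 (limma's actual estimate comes from fitting a scaled F distribution, see squeezeVar()), and d0 = 4 is an arbitrary choice.

```r
set.seed(3)
d  <- 2                                # residual df for a 2 vs 2 design
s2 <- 0.05 * rchisq(5000, df = d) / d  # simulated per-probeset variances
d0 <- 4                                # assumed prior df (arbitrary)

# Shrink every variance toward the prior value s0sq
moderate <- function(s2, s0sq) (d0 * s0sq + d * s2) / (d0 + d)

s0_all  <- mean(s2)                    # prior from all probesets
top     <- s2 > quantile(s2, 0.75)     # variance filter: keep top 25%
s0_filt <- mean(s2[top])               # prior from survivors only

# Moderated variances for the same surviving probesets, under each prior
post_all  <- moderate(s2[top], s0_all)
post_filt <- moderate(s2[top], s0_filt)
summary(post_filt / post_all)  # ratios > 1: larger denominators, smaller t
```

The same probesets get larger moderated variances, and hence smaller moderated t-statistics, when the prior is estimated only from the variable survivors.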

Of course the question remains: is more better? And if more is better, 
does that mean the ideal would be to find all probesets differentially 
expressed? Probably not, so we are back to the usual prescriptions: 
check your data carefully. Make sure your results are sensible. Do EDA 
to ensure that you don't have some wacky chip messing things up. Check 
your code to be sure that you haven't made the kind of errors that I 
like to make. Consult with the experimenter to see if very few genes 
should be changing (or be expressed at all).

Best,

Jim

> 
> 	Up until this time I have not filtered because the filtering
> parameters looked arbitrary and I thought that it was cheating to reduce
> the # of tests used to compute the FDR. From reading and further
> reflection I now believe otherwise. But whereas I now believe I should
> filter, I am not at all sure what parameters to use, or how sensitive my
> final list of differentially expressed genes will be to the choice of
> those parameters. In particular, I wonder if the intensity filter cutoff
> should vary with chip type and preprocessing method (e.g. GCRMA).
> 
> 	Any thoughts and guidance would be appreciated.
> 
> Thanks as always,
> Rich
> ------------------------------------------------------------
> Richard A. Friedman, PhD
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Lecturer
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator
> Center for Computational Biology and Bioinformatics (C2B2)
> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
> Box 95, Room 130BB or P&S 1-420C
> Columbia University Medical Center
> 630 W. 168th St.
> New York, NY 10032
> (212)305-6901 (5-6901) (voice)
> friedman at cancercenter.columbia.edu
> http://cancercenter.columbia.edu/~friedman/
> 
> "Sure I am willing to stop watching television
> to get a better education."
> -Rose Friedman, age 11
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623