[BioC] Genefilter parameters for mouse 430 2
James W. MacDonald
jmacdon at med.umich.edu
Wed Mar 19 21:52:37 CET 2008
Richard Friedman wrote:
> Dear Bioconductor Users,
> I am using genefilter to filter an ExpressionSet of 4 Mouse 430 2 chips
> preprocessed with gcrma prior to analysis with limma.
> Here is a description of the expressionset.
> > xen2dataeset
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 45101 features, 4 samples
> element names: exprs
> sampleNames: A_xen_1_21.cel, A_xen_2_22.cel, D_nodal_1_27.cel,
> varLabels and varMetadata description:
> sample: arbitrary numbering
> featureNames: 1415670_at, 1415671_at, ..., AFFX-r2-P1-cre-5_at
> (45101 total)
> fvarLabels and fvarMetadata description: none
> experimentData: use 'experimentData(object)'
> Annotation: mouse4302
> Here is my session information.
> > sessionInfo()
> R version 2.6.1 (2007-11-26)
> attached base packages:
>  splines stats graphics grDevices utils datasets methods
>  base
> other attached packages:
>  mouse4302probe_2.0.0 mouse4302cdf_2.0.0 mouse4302.db_2.0.2
>  limma_2.12.0 geneplotter_1.16.0 lattice_0.17-2
>  annotate_1.16.1 AnnotationDbi_1.0.6 RSQLite_0.6-3
>  DBI_0.2-3 RColorBrewer_1.0-1 affyPLM_1.14.0
>  xtable_1.5-2 simpleaffy_2.14.05 gcrma_2.10.0
>  matchprobes_1.10.0 genefilter_1.16.0 survival_2.34
>  annaffy_1.10.1 KEGG_2.0.1 GO_2.0.1
>  affy_1.16.0 preprocessCore_1.0.0 affyio_1.6.1
>  Biobase_1.16.3
> loaded via a namespace (and not attached):
>  KernSmooth_2.22-21 grid_2.6.1 tools_2.6.1
> I have tried the filtering parameters in the article by Scholtens and
> Heydebreck on
> p 233 of the book by Gentleman et al.:
> > f1<-pOverA(0.25,log2(100))
> > f2<-function(x)(IQR(x)>0.5)
> > ff<-filterfun(f1,f2)
> > selected <-genefilter(xen2dataeset,ff)
> > sum(selected)
> [1] 289
> This seemed a bit small so that I tried the effect of each of the
> parameters individually:
> > selectedp025A <-genefilter(xen2dataeset,f1)
> > sum(selectedp025A)
> [1] 9681
> > selectedIQRgtp5 <-genefilter(xen2dataeset,f2)
> > sum(selectedIQRgtp5)
> [1] 731
> My questions:
> 1. Is the log2(100) intensity cutoff good for all chips?
> If not, can someone recommend a good intensity cutoff for mouse 4302?
That depends. If you are using rma(), then no ;-P
Seriously, this depends on the data in hand. If you have some really dim
chips then maybe it is too high. The problem with filtering is that it
can be pretty ad hoc, so it's difficult to come up with a hard and fast
rule.
You might try something like
eset2 <- nsFilter(eset)$eset
and see how many probesets you end up with.
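
A rough way to see how sensitive the probeset count is to the cutoffs is to tabulate the survivors over a grid of thresholds. The sketch below is base R on a simulated log2-scale matrix, so it runs without any Bioconductor packages. The matrix x, the helper count_kept() and the cutoff grid are all made up for illustration; with real data you would start from exprs() of your own ExpressionSet.

```r
# Simulated stand-in for exprs(eset): 1000 probesets x 4 samples, log2 scale.
set.seed(1)
x <- matrix(rnorm(1000 * 4, mean = 7, sd = 2), nrow = 1000)

# Hypothetical helper: count probesets where at least a proportion p of the
# samples exceed A (the same idea as pOverA) and the IQR exceeds iqr_min.
count_kept <- function(x, p, A, iqr_min) {
  bright   <- rowMeans(x > A) >= p
  variable <- apply(x, 1, IQR) > iqr_min
  sum(bright & variable)
}

# Intensity cutoffs on the natural scale, converted to log2.
natural_cutoffs <- c(25, 50, 100, 200)
kept <- sapply(log2(natural_cutoffs),
               function(A) count_kept(x, p = 0.25, A = A, iqr_min = 0.5))
names(kept) <- natural_cutoffs
print(kept)
```

If the count drops sharply between neighbouring cutoffs, the final gene list is probably sensitive to that choice.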
> 2. Is the only effect of filtering to reduce the multiplier in the
> false discovery
> analysis, OR does it reduce false positives in other ways:
> A. In the case of intensity filters, by reducing the number of large
> fold changes resulting
> from the ratios of small numbers?
> B. In the case of IQR filters, by eliminating large t-statistics
> resulting from genes with small variation
> across samples but fortuitously low standard deviations?
Yes and yes, to a certain extent. If you are just doing fold changes,
you might consider filtering on each fold change rather than overall.
For instance you could create a filter
filt <- filterfun(kOverA(1, 100))
that you would then use for each fold change comparison to ensure that
at least one of the samples had an expression > 100. Shameless plug -
see foldFilt() in affycoretools.
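
To make the per-comparison idea concrete, here is a base-R sketch on simulated data. It assumes log2-scale values, so the cutoff is written as log2(100) rather than 100. The three-group layout, the fold-change threshold and the helper names are invented for illustration, and this is not a description of how foldFilt() is implemented.

```r
# Simulated log2 expression: 500 probesets, three groups of two samples each.
set.seed(2)
groups <- rep(c("A", "B", "C"), each = 2)
x <- matrix(rnorm(500 * 6, mean = 7, sd = 2), nrow = 500)
colnames(x) <- paste0(groups, 1:2)

# The test kOverA(1, A) applies to one row: at least one value above A.
at_least_one_over <- function(v, A) sum(v > A) >= 1

# Hypothetical helper: filter one comparison (g2 vs g1) using only the
# samples involved in that comparison.
filter_comparison <- function(x, groups, g1, g2, A = log2(100), min_lfc = 1) {
  cols  <- groups %in% c(g1, g2)
  keep  <- apply(x[, cols, drop = FALSE], 1, at_least_one_over, A = A)
  logfc <- rowMeans(x[, groups == g2, drop = FALSE]) -
           rowMeans(x[, groups == g1, drop = FALSE])
  keep & abs(logfc) >= min_lfc
}

b_vs_a <- filter_comparison(x, groups, "A", "B")
c_vs_a <- filter_comparison(x, groups, "A", "C")
print(c(B_vs_A = sum(b_vs_a), C_vs_A = sum(c_vs_a)))
```

Each comparison gets its own logical vector, so a probeset that is dim in groups A and B can still be kept for the C vs A comparison.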
If you are doing t-stats with a very small number of replicates (like 2
vs 2), then you should be using limma, in which case over-filtering
the data can be detrimental as well. The reason for that is the prior
variance will be estimated from all the probesets that remain, and if all
you have are highly variable probesets then the prior will be larger than
you might want. I have seen cases with very small numbers of replicates
where using all the data on the chip resulted in many more significant
probesets than if I did what I thought was a reasonable filter.
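
The effect on the prior can be seen without limma at all. The sketch below is a crude base-R proxy, not limma's actual empirical Bayes estimator: it simulates probesets whose true standard deviations differ, then compares the typical per-probeset variance before and after an IQR filter. The simulation settings and variable names are assumptions chosen for illustration.

```r
# Simulate 5000 probesets x 4 samples; each probeset has its own true SD.
set.seed(3)
n_probes <- 5000
true_sd  <- runif(n_probes, min = 0.1, max = 1)
# matrix() fills column by column, so row i always uses true_sd[i].
x <- matrix(rnorm(n_probes * 4, mean = 7, sd = rep(true_sd, 4)),
            nrow = n_probes)

# Per-probeset sample variance, and the IQR filter from the original question.
s2   <- apply(x, 1, var)
keep <- apply(x, 1, IQR) > 0.5

# Typical variance among all probesets versus among those that pass the filter.
typical <- c(all = median(s2), filtered = median(s2[keep]))
print(typical)
print(mean(keep))  # proportion of probesets retained
```

The filtered set has a larger typical variance, so any prior estimated from it will pull the individual variances towards a larger value than one estimated from the whole chip.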
Of course the question remains: is more better? And if more is better,
does that mean the ideal would be to find all probesets differentially
expressed? Probably not, so we are back to the usual prescriptions;
check your data carefully. Make sure your results are sensible. Do EDA
to ensure that you don't have some wacky chip messing things up. Check
your code to be sure that you haven't made the kind of errors that I
like to make. Consult with the experimenter to see if very few genes
should be changing (or be expressed at all).
> Up until this time I have not filtered because the filtering
> parameters looked arbitrary and I
> thought that it was cheating to reduce the # of tests used to compute
> the FDR. From reading and
> further reflection I now believe otherwise. But whereas I now believe
> I should filter I am
> not at all sure what parameters to use, and how much my final list of
> differentially expressed genes
> will be sensitive to a choice of those parameters. In particular, I
> wonder if the
> intensity filter cutoff should vary with chip-type and preprocessing
> method (e.g., GCRMA).
> Any thoughts and guidance would be appreciated.
> Thanks as always,
> Richard A. Friedman, PhD
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator
> Center for Computational Biology and Bioinformatics (C2B2)
> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
> Box 95, Room 130BB or P&S 1-420C
> Columbia University Medical Center
> 630 W. 168th St.
> New York, NY 10032
> (212)305-6901 (5-6901) (voice)
> friedman at cancercenter.columbia.edu
> "Sure I am willing to stop watching television
> to get a better education."
> -Rose Friedman, age 11
James W. MacDonald, M.S.
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
Ann Arbor MI 48109