[BioC] Genefilter parameters for mouse 430 2 #2
James W. MacDonald
jmacdon at med.umich.edu
Thu Mar 20 04:27:08 CET 2008
Richard Friedman wrote:
> Thank you for your detailed and helpful reply.
> On Mar 19, 2008, at 4:52 PM, James W. MacDonald wrote:
>> That depends. If you are using rma(), then no ;-P
> what about gcrma.
Same diff. The maximum with either will be ~14, so filtering on 100 will
>> You might try something like
>> eset2 <- nsFilter(eset)$eset
>> and see how many probesets you end up with.
> I have tried
> > xen2nsSUB<-nsFilter(xen2dataeset)$xen2dataeset
> > sum(xen2nsSUB)
>  0
> > xen2nsSUB
Yup. That should be
xen2nsSUB <- nsFilter(xen2dataeset)$eset
if you just want the resulting ExpressionSet.
> I then tried (which I think is the correct way to do it)/
> > xen2nsSUB2<-nsFilter(xen2dataeset)
> > xen2nsSUB2
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 623 features, 4 samples
> element names: exprs
> sampleNames: A_xen_1_21.cel, A_xen_2_22.cel, D_nodal_1_27.cel,
> varLabels and varMetadata description:
> sample: arbitrary numbering
> featureNames: 1428670_at, 1457283_at, ..., 1426849_at (623 total)
> fvarLabels and fvarMetadata description: none
> experimentData: use 'experimentData(object)'
> Annotation: mouse4302
>  77
>  41507
>  13
>  2881
>>> 2, Is the only effect of filtering to reduce the multiplier in the
>>> false discovery
>>> analysis OR does it reduce false positives in other ways by
>>> A. In the case of intensity filters by reducing the number of
>>> large fold changes resulting
>>> from the ratios of small numbers.
>>> B. In the case of IQR filters eliminating large t-statistics
>>> resulting for genes with small variation
>>> across samples but fortuitously low standard deviations,
>> Yes and yes, to a certain extent. If you are just doing fold changes,
>> you might consider filtering on each fold change rather than overall.
>> For instance you could create a filter
>> filt <- filterfun(kOverA(1, 100))
>> that you would then use for each fold change comparison to ensure that
>> at least one of the samples had an expression > 100. Shameless plug -
>> see foldFilt() in affycoretools.
> I think that that is basically what I did with genefilter
> described in my first note (.25 of 4 =1). Or am I getting somehing wrong.
Well, that isn't what you did (or maybe it is what you did, but you
didn't do what I am suggesting). If you are doing fold change
calculations then you (IMO) only care about the two things under
consideration. So if you have something like this:
Samples 1 2 3 4
expval 30 85 1500 2500
Then what you did will nuke that probeset. However, the comparisons for
1v3, 1v4, 2v3, 2v4 and 3v4 are probably quite useful. The only one you
don't care about is 1v2, which will give a high fold change but it is
probably not meaningful.
>> If you are doing t-stats with a very small number of replicates (like
>> 2 vs 2), then you should be using limma, and in which case
>> over-filtering the data can be detrimental as well. The reason for
>> that is the prior will be estimated on all the probesets that remain,
>> and if all you have are highly variable probesets then the prior will
>> be larger than you might want. I have seen cases with very small
>> numbers of replicates where using all the data on the chip resulted in
>> many more significant probesets than if I did what I thought was a
>> reasonable filter.
> I am using Limma. I asked for at least 3 sreplicates from the
> experimentalist but she only gave me 2 (story of my life).
LOL. Sounds like the story of my life as well. Unfortunately, I am not
the arbiter of scientific integrity, so I just analyze what I get...
> I got 731 with just the variance filter, 289 with the log2 filter, and
> 619 with nsFilter.
> How many probesets do I need for limma to function properly? Do the
> above numbers seem to small.
I think this is one of those double-edged sword type things. Fewer
probesets are better for multiplicity corrections, but if you just have
the variable ones left maybe the prior will be bigger than you want. I
don't think I can give you a hard number -- I'm just pointing out that
there are two considerations to be made and that it might not be ideal
to go to one or the other extreme.
>> Of course the question remains; is more better?
>> And if more is better, does that mean the ideal would be to find all
>> probesets differentially expressed? Probably not, so we are back to
>> the usual prescriptions; check your data carefully. Make sure your
>> results are sensible. Do EDA to ensure that you don't have some wacky
>> chip messing things up.
> What is EDA? I did all of the quality measures in simpleaffy and in
> AffyPLM and the chips look fine.
Exploratory Data Analysis. Stats 101 -- look at the raw data and see if
there is some kind of wackyness going on. The above is usually what I do
as well, plus a PCA shot to make sure my replicates are all
friendly-like. The limma package now allows you to weight chips
(?arrayWeights), so that is really nice if you have some sketchy chips
in there that you can't really afford to nuke.
>> Check your code to be sure that you haven't made the kind of errors
>> that I like to make. Consult with the experimenter to see if very few
>> genes should be changing (or be expressed at all).
> I have done both of these things.
> Here is my present understanding of the situation following your note:
> 1. Filter by variance and other nsFilter parmaeters is good unless it
> leads to too few probestes.
> Q1. How many are too small?
> Q2. It is advisable to use an intensity filter but log2(100) on at least
> one chip is too high on dim chips?
> Q2B. Is there a way to quantify dimness so I knwo how to adjust the
> Q2C. Or am I better off filtering only on variance, or variance plus
> the nsFilter defaults.
>>> Up until this time I have not filtered because the filtering
>>> parameters looked arbitrary and I
>>> thought that it was cheating to reduce the # of tests used to
>>> compute the FDR. From reading and
>>> further reflection I now believe otherwise. But whereas I now
>>> believe I should filter I am
>>> not at all sure what parameters to use, and how much my final list
>>> of differentially expressed genes
>>> will be sensitive to a choice of those parameters. In particular, i
>>> wonder if the
>>> intensity filter cutoff should vary with chip-type and preprocessing
>>> method (eg GCRMA).
>>> Any thoughts and guidance would be appreciated.
>>> Thanks as always,
>>> Richard A. Friedman, PhD
>>> Biomedical Informatics Shared Resource
>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>> Department of Biomedical Informatics (DBMI)
>>> Educational Coordinator
>>> Center for Computational Biology and Bioinformatics (C2B2)
>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
>>> Box 95, Room 130BB or P&S 1-420C
>>> Columbia University Medical Center
>>> 630 W. 168th St.
>>> New York, NY 10032
>>> (212)305-6901 (5-6901) (voice)
>>> friedman at cancercenter.columbia.edu
>>> "Sure I am willing to stop watching television
>>> to get a better education."
>>> -Rose Friedman, age 11
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> Search the archives:
>> James W. MacDonald, M.S.
>> Affymetrix and cDNA Microarray Core
>> University of Michigan Cancer Center
>> 1500 E. Medical Center Drive
>> 7410 CCGC
>> Ann Arbor MI 48109
More information about the Bioconductor