[BioC] ttest or fold change

Garrett Frampton fcalive at hotmail.com
Wed Dec 17 01:35:34 MET 2003

Dr. Baker,

Thank you very much for the reply.  It was quite enlightening and I agreed
with almost everything, particularly the idea that there is no substitute for
collecting enough data to have the power to see the changes that you are
looking for.  Nevertheless, it will be a long time before we can get away
from analyzing small datasets (3 vs 3, for example).  It is often important
to perform a small study in order to get preliminary data for a larger one.
In fact, in most cases this would be advisable in order to get an idea of
technical and biological variability prior to designing the larger study.
Consequently, it is important to be able to analyze small datasets.

Suppose that we have a large dataset from a study with two experimental
conditions (100 vs 100).  Assume that there are large, reproducible
differences (many fold, many standard deviations) between the conditions for
a number of genes (1-5% of the data).  A T-test on this full dataset can be
used to define a reference group of differentially expressed genes.  Now
select 3 samples at random from each group and apply two statistical tests
to that subsample: a regular T-test and the Bayesian T-test implemented in
Cyber-T.  At any significance cut-off, the genes found to be differentially
expressed by the Bayesian T-test will be in much better agreement with the
genes found by the T-test on all 100 samples than the genes found by the
regular T-test will be.
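
The thought experiment above can be sketched as a small simulation. This is
only an illustration under stated assumptions, not Cyber-T itself: the data
are synthetic normal draws, the "background" variance is simply the average
pooled variance over all genes (Cyber-T uses genes with similar expression
levels), and the shrinkage formula, the function names, and the parameters
(prior_df, effect_sd, frac_de) are hypothetical choices made for the example.

```python
import math
import random
import statistics


def pooled_var(a, b):
    """Pooled within-group variance of two samples."""
    na, nb = len(a), len(b)
    return ((na - 1) * statistics.variance(a)
            + (nb - 1) * statistics.variance(b)) / (na + nb - 2)


def t_stat(a, b, prior_var=0.0, prior_df=0.0):
    """Equal-variance two-sample t statistic with optional shrinkage.

    The gene's own variance is averaged with prior_var, weighted by
    prior_df (an assumed, simplified shrinkage rule). With prior_df == 0
    this reduces to the ordinary t statistic.
    """
    df = len(a) + len(b) - 2
    s2 = (prior_df * prior_var + df * pooled_var(a, b)) / (prior_df + df)
    se = math.sqrt(s2 * (1.0 / len(a) + 1.0 / len(b)))
    return (statistics.mean(a) - statistics.mean(b)) / se


def simulate(n_genes=1000, n_full=100, n_small=3, frac_de=0.03,
             effect_sd=4.0, prior_df=10.0, seed=1):
    """Return (agreement of plain t, agreement of shrunken t) with the
    gene list obtained from the full n_full vs n_full dataset."""
    rng = random.Random(seed)
    k = max(1, round(frac_de * n_genes))  # number of truly changed genes
    full, small = [], []
    for g in range(n_genes):
        sd = rng.uniform(0.5, 1.5)             # gene-specific true SD
        shift = effect_sd * sd if g < k else 0.0
        a = [rng.gauss(0.0, sd) for _ in range(n_full)]
        b = [rng.gauss(shift, sd) for _ in range(n_full)]
        full.append((a, b))
        # Random 3 vs 3 subsample of the same study.
        small.append((rng.sample(a, n_small), rng.sample(b, n_small)))

    # Assumed background variance: mean pooled variance over all genes.
    background = statistics.mean(pooled_var(a, b) for a, b in small)

    def top(stats):
        order = sorted(range(n_genes), key=lambda g: abs(stats[g]),
                       reverse=True)
        return set(order[:k])

    reference = top([t_stat(a, b) for a, b in full])
    plain = top([t_stat(a, b) for a, b in small])
    shrunk = top([t_stat(a, b, background, prior_df) for a, b in small])
    return len(plain & reference) / k, len(shrunk & reference) / k


if __name__ == "__main__":
    plain, shrunk = simulate()
    print("agreement with full-data list, plain t:   ", plain)
    print("agreement with full-data list, shrunken t:", shrunk)
```

Each list here is the top k genes by |t| rather than a fixed p-value
cut-off, so both tests are compared at the same list size.  Setting
prior_df to 0 recovers the ordinary T-test, which gives a quick check that
any difference between the two lists comes from the shrinkage alone.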

I think that this is at odds with your conclusion.


----- Original Message ----- 
From: "Baker, Stephen" <Stephen.Baker at umassmed.edu>
To: <bioconductor at stat.math.ethz.ch>
Sent: Tuesday, December 16, 2003 5:45 PM
Subject: RE: [BioC] ttest or fold change

> Garrett et al,
> The t-test (or ANOVA) does not have a problem with "accidentally too
> small" variances, either with one or more than one outcome of interest.
> The estimate of the error variance by t-tests and ANOVA is a Least
> Squares estimate and is the UNBIASED ESTIMATOR that is also the lower
> bound on the variance for the "best" (minimum variance) linear unbiased
> estimator (BLUE) of the effects being tested (see Graybill 1976).
> Some Bayesian methods can generate smaller estimates of variances by
> biasing the estimate toward some overall measure such as the average of
> variances for nearby genes.  These are BIASED estimates based on an
> assumption that a particular gene should really be like genes that are
> "nearby" in some sense, such as they have similar expression levels.
> You would have to present a lot of data to me to convince me that any
> randomly selected gene should have a variance like some other set of
> genes, especially when I have an unbiased estimate at hand that is
> non-controversial, requires no defense, and uses methods that have
> withstood 100 years of review and scrutiny. I'm familiar with shrunken
> estimates of effects that can have a smaller "mean squared error", but
> these are random effects, not variances which control the power and type
> I error rate.
> These approaches, in addition to producing biased estimates, sometimes
> require the analyst to impose his or her own particular biases, called
> "prior beliefs" or "priors", as to how much these estimates should be
> biased, by requiring that the analyst input how much weight is given to
> the data from that gene and how much weight is given to the other set
> that the gene is supposed to "be more like".  Again, it would take some
> pretty strong arguments to convince me to accept any particular analyst's
> prior beliefs about how much the data for a gene or data from other
> genes should or should not be weighted.  I would be concerned about how
> much convincing a readership, reviewer, or study group would need if
> they ever decide to "open the black box" and ask me to explain why such
> an approach is reasonable/justifiable.
> The program Garrett mentioned, Cyber-T, uses such an approach.  To quote
> the Cyber-T manual "...This weighting factor IS CONTROLLED BY THE
> the background variance of a closely related set of genes approximates
> the variance of the gene under consideration".  Now if one was looking
> at just ONE  gene, it makes sense that someone might put a lot of
> thought into it, have looked at a lot of similar genes or other data and
> come to the conclusion that a gene should be like some other genes and
> THEN use this approach.  But this is not the case when you have 10,000
> or 22,000 genes, at least not in the world I'm familiar with.
> I use empirical Bayes methods for fitting general linear mixed models,
> where the priors are objective, not my own opinion.  Cyber-T does offer
> the option of setting low confidence in the prior which is an objective
> prior, but the manual points out that this results in the standard
> Student t-test!  Another feature of Cyber-T is that when you have
> "enough" data, the weighted approach converges into the standard t-test
> as well.
> The real problem that researchers face with microarrays is NOT that
> their t-test variances are too small, but that they often have
> insufficient sample to detect the differences they need to detect. The
> ready solution is to get enough data.
> -.- -.. .---- .--. ..-.
> Stephen P. Baker, MScPH, PhD (ABD)            (508) 856-2625
> Sr. Biostatistician- Information Services
> Lecturer in Biostatistics                     (775) 254-4885 fax
> Graduate School of Biomedical Sciences
> University of Massachusetts Medical School, Worcester
> 55 Lake Avenue North                          stephen.baker at umassmed.edu
> Worcester, MA 01655  USA
> ------------------------------
> Date: Tue, 16 Dec 2003 10:24:31 -0500
> From: "Garrett Frampton" <gmframpt at bu.edu>
> Subject: RE: [BioC] ttest or fold change
> To: <bioconductor at stat.math.ethz.ch>
> Dr. Baker,
> You wrote about "the problem" that the t-test denominator may be
> accidentally "too small".  You say that this issue has been solved
> within the T-test.  It is my belief that this problem has only been
> partially solved.  It is true that this "problem" has been solved for a
> single hypothesis test within the T-test, but it has not been solved for
> microarray data analysis as a whole.
> It is possible to gain power by using local estimates of variance based
> upon more than one gene.  This sort of approach is extremely useful for
> experiments with only a few replicates because it deals with the
> situation where the within group variance for a single gene happens to
> be very small. This is the approach implemented in Cyber-T;
> http://visitor.ics.uci.edu/genex/cybert/.  By looking at the dataset as
> a whole, rather than 1 gene at a time, it is possible to eliminate
> false-positives that arise as a result of coincidentally low within
> group variance.
> Do you agree?
> Other than this minor point I think you did a wonderful job putting the
> statistical concepts that so many struggle with into words.
> Garrett Frampton
> Research Associate
> Boston University School of Medicine - Microarray Resource