[BioC] ttest or fold change

Garrett Frampton gmframpt at bu.edu
Tue Dec 16 16:24:31 MET 2003

Dr. Baker,

You wrote about "the problem" that the t-test denominator may be
accidentally "too small".  You say that this issue has been solved within
the t-test.  It is my belief that this problem has only been partially
solved.  It is true that this "problem" has been solved for a single
hypothesis test within the t-test, but it has not been solved for microarray
data analysis as a whole.

It is possible to gain power by using local estimates of variance based upon
more than one gene.  This sort of approach is extremely useful for
experiments with only a few replicates because it deals with the situation
where the within-group variance for a single gene happens to be very small.
This is the approach implemented in Cyber-T
(http://visitor.ics.uci.edu/genex/cybert/).  By looking at the dataset as a
whole, rather than one gene at a time, it is possible to eliminate
false positives that arise as a result of coincidentally low within-group
variance.
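
A minimal sketch of the idea, assuming the per-gene variance is shrunk toward a
background variance pooled from other genes.  The weighted-average formula, the
function names, the data, and the prior weight of 10 are illustrative
assumptions, not Cyber-T's actual implementation.

```python
import math
import statistics


def shrunk_variance(values, background_var, prior_weight):
    """Weighted average of a gene's own sample variance and a background variance."""
    n = len(values)
    own_var = statistics.variance(values)  # sample variance, n - 1 denominator
    return (prior_weight * background_var + (n - 1) * own_var) / (prior_weight + n - 1)


def t_statistic(x, y, background_var=0.0, prior_weight=0.0):
    """Unequal-variance t statistic; prior_weight=0 gives the ordinary per-gene version."""
    var_x = shrunk_variance(x, background_var, prior_weight)
    var_y = shrunk_variance(y, background_var, prior_weight)
    standard_error = math.sqrt(var_x / len(x) + var_y / len(y))
    return (statistics.mean(y) - statistics.mean(x)) / standard_error


# Hypothetical log-intensities: three replicates per group that happen to agree closely.
control = [10.00, 10.01, 10.02]
treated = [10.20, 10.21, 10.22]

# Hypothetical variances of other genes at similar intensity, pooled into a background value.
neighbour_vars = [0.03, 0.05, 0.04, 0.04]
background = statistics.mean(neighbour_vars)

plain_t = t_statistic(control, treated)
regularized_t = t_statistic(control, treated, background_var=background, prior_weight=10)

print(f"ordinary t:    {plain_t:.2f}")
print(f"regularized t: {regularized_t:.2f}")
```

With these made-up numbers the ordinary statistic is very large only because the
two within-group variances are tiny; once the variance is pulled toward the
background value, the same mean difference no longer stands out.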

Do you agree?
Other than this minor point I think you did a wonderful job putting the
statistical concepts that so many struggle with into words.

Garrett Frampton
Research Associate
Boston University School of Medicine - Microarray Resource

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Baker, Stephen
Sent: Monday, December 15, 2003 2:15 PM
To: bioconductor at stat.math.ethz.ch
Subject: RE: [BioC] ttest or fold change


With respect to t-Tests a couple of people have mentioned "the problem"
that the t-test denominator may be accidentally "too small".  This is
because the t-test uses an ESTIMATE of the variance from the sample
itself.  This is what William Sealy Gosset, otherwise known as
"Student", discovered, and it prompted him to develop the t-distribution
and t-test.  Gosset, or Student, was a brewmaster for Guinness breweries in
Dublin and was doing experiments with hops and things and discovered
that the well known "normal distribution" was inaccurate when you
estimated the variance from a sample.  He empirically developed the
t-distribution, which takes the variability in the variance estimate into
account, so that the t-test is ALREADY ADJUSTED to compensate for weird
values in the denominator due to random sampling.
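
A small simulation can make this concrete.  It is a sketch under stated
assumptions: one-sample tests on n = 3 normal observations with a true mean of
zero, so every rejection is a false positive.  The function name, the trial
count, and the seed are illustrative choices.  The cutoff 1.96 is the two-sided
5% point of the normal distribution; 4.303 is the corresponding point of the
t-distribution with 2 degrees of freedom.

```python
import math
import random
import statistics


def false_positive_rate(n, cutoff, trials=10000, seed=1):
    """Fraction of null samples whose |t| exceeds the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # The denominator uses the standard deviation ESTIMATED from the sample.
        t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
        if abs(t) > cutoff:
            hits += 1
    return hits / trials


rate_normal_cutoff = false_positive_rate(n=3, cutoff=1.96)
rate_t_cutoff = false_positive_rate(n=3, cutoff=4.303)

print(f"normal cutoff: {rate_normal_cutoff:.3f}")
print(f"t cutoff:      {rate_t_cutoff:.3f}")
```

In theory the first rate is about 0.19 and the second about 0.05, so the normal
cutoff rejects far too often while the t cutoff holds the nominal level.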

One thing that I think is too often ignored is that different genes have
different variances.  The fact that one gene appears to have a smaller
variance than its neighbors (or a larger one) could mean that it ACTUALLY
DOES have a smaller or larger variance, OR it may be due to sampling
variability.  The t-test assumes the former but adjusts for the latter
possibility.  It worked then and it works now; it is NOT a problem.
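
To give a feel for the size of that sampling variability, here is a hedged
sketch that assumes 1000 hypothetical genes, all with the same true variance of
1, each measured with 3 replicates.  The gene count and the seed are arbitrary.

```python
import random
import statistics

rng = random.Random(0)

# Every simulated gene has the same true variance (1.0); only the estimates differ.
sample_vars = sorted(
    statistics.variance([rng.gauss(0.0, 1.0) for _ in range(3)])
    for _ in range(1000)
)

low_decile = sample_vars[99]    # roughly the 10th percentile
high_decile = sample_vars[899]  # roughly the 90th percentile

print(f"10th percentile of estimated variance: {low_decile:.2f}")
print(f"90th percentile of estimated variance: {high_decile:.2f}")
```

With only 2 degrees of freedom the estimates spread widely around the true
value, so an apparently small variance for one gene is, on its own, weak
evidence either way.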

Student's friend, the genius R. A. Fisher, took Student's empirical result
and worked out the theory on which analysis of variance is based.
This theory has withstood the test of time: it is about 100 years old
and still holds.  Given that the assumptions are correct, t-tests and ANOVA
are still "uniformly most powerful tests".
-.- -.. .---- .--. ..-.
Stephen P. Baker, MScPH, PhD (ABD)            (508) 856-2625
Sr. Biostatistician- Information Services
Lecturer in Biostatistics                     (775) 254-4885 fax
Graduate School of Biomedical Sciences
University of Massachusetts Medical School, Worcester
55 Lake Avenue North                          stephen.baker at umassmed.edu
Worcester, MA 01655  USA



.Message: 3
.Date: Mon, 15 Dec 2003 12:11:43 -0000
.From: "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk>
.Subject: RE: [BioC] ttest or fold change
.To: bioconductor at stat.math.ethz.ch
.Why not try the non-parametric t-tests available?
.I know all the arguments about a "loss of power" etc., but at the end of
the day, as statisticians and bioinformaticians, sometimes biologists come
to us with small numbers of replicates (for very understandable reasons)
and it is our job to get some meaning out of that data.  Trying to fit
any kind of statistic involving a p-value to such data is a difficult
and risky task, and trying to explain those results to the biologist is
often very difficult.
.So here's what happens with the non-parametric tests based on ranking.
Those genes with the highest |t| are those where all the replicates of
one condition are greater than all the replicates of the other
condition.  The next highest |t| is where all but one of the replicates
of one condition are greater than all the replicates of the other
condition, etc.
.OK, so some of these differences could occur by chance, but we're
often dealing with millions of data points and I really don't think
it's possible to make no mistakes.  And curse me if you like, but if I
have a gene expression measurement, replicated 5 times in two
conditions, and in one condition all five replicates are higher than the
five replicates of the other condition, then I believe that that gene
is differentially expressed.  And that's easy to find with non-parametric
t, and it is easy to explain to a biologist, and at the end of the day,
is it really wrong to do that?
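
The five-against-five case can be put in numbers with an exact rank-sum
(Wilcoxon/Mann-Whitney style) calculation.  This is a sketch that swaps in that
specific test for the "non-parametric t" mentioned above; it assumes no tied
values, and the function name and data are hypothetical.

```python
from itertools import combinations


def exact_rank_sum_p(group_a, group_b):
    """Two-sided exact p-value for the rank sum of group_a, assuming no ties."""
    pooled = sorted(group_a + group_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a = len(group_a)
    n_total = len(pooled)
    expected = n_a * (n_total + 1) / 2  # mean rank sum under the null
    observed_dev = abs(sum(rank[v] for v in group_a) - expected)

    # Enumerate every way the ranks could have been split between the groups.
    extreme = 0
    total = 0
    for ranks in combinations(range(1, n_total + 1), n_a):
        total += 1
        if abs(sum(ranks) - expected) >= observed_dev:
            extreme += 1
    return extreme / total


# Hypothetical expression values: all five replicates of one condition are lower.
condition_a = [5.1, 5.3, 5.0, 5.2, 5.4]
condition_b = [6.1, 6.4, 6.0, 6.2, 6.3]
p_separated = exact_rank_sum_p(condition_a, condition_b)

# Same data, but one replicate of condition_a overlaps the other group.
p_one_overlap = exact_rank_sum_p([5.1, 5.3, 5.0, 5.2, 6.05], condition_b)

print(f"complete separation: p = {p_separated:.4f}")
print(f"one overlap:         p = {p_one_overlap:.4f}")
```

Complete separation of 5 versus 5 gives p = 2/252, about 0.008, for a single
gene.  That is small, but across many thousands of genes some such patterns
would still be expected by chance.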

Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
