[BioC] ttest or fold change

Ramon Diaz-Uriarte rdiaz at cnio.es
Wed Dec 17 11:52:18 MET 2003


Dear Stephen,

Thank you for your detailed comments. Two points:

1. It is my understanding that there are other issues at stake besides the 
unbiasedness of the error variance estimate, but I'll leave the technical 
discussion to others who are much more capable. However, the results in, for 
example, Lönnstedt & Speed (2002, Statistica Sinica, 12:31--46), or Smyth 
(2003, http://www.statsci.org/smyth/pubs/ebayes.pdf) or in Qin & Kerr at the 
IMA Workshop (http://www.ima.umn.edu/talks/workshops/9-29-10-3.2003/kerr/
KerrIMA.pdf), seem to indicate, with both simulated and "wet lab data", that 
we can do much better (in terms of false positives and false negatives) using 
t-like tests that combine information across genes than with the standard 
t-test.
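
As a rough illustration of what "combining information across genes" can 
mean, here is a minimal Python sketch of a t-like statistic whose variance is 
a weighted average of the gene's own pooled variance and a prior variance. 
It is not the exact estimator of any of the references above. The function 
names, the toy data, and the choice of the across-gene mean variance as the 
prior value are all assumptions made for illustration, and the prior weight 
d_prior is set by hand here rather than estimated from the data as the 
empirical Bayes methods do.

```python
import math
import statistics

def pooled_variance(x, y):
    """Pooled within-group variance for two groups (equal-variance t-test)."""
    ssx = statistics.variance(x) * (len(x) - 1)
    ssy = statistics.variance(y) * (len(y) - 1)
    return (ssx + ssy) / (len(x) + len(y) - 2)

def t_statistic(x, y, var):
    """Two-sample t-like statistic using the supplied variance estimate."""
    se = math.sqrt(var * (1.0 / len(x) + 1.0 / len(y)))
    return (statistics.mean(x) - statistics.mean(y)) / se

def shrunken_variance(s2, d, s2_prior, d_prior):
    """Weighted average of a gene's own variance (weight d) and a prior
    variance (weight d_prior)."""
    return (d_prior * s2_prior + d * s2) / (d_prior + d)

def moderated_t(genes, d_prior):
    """genes: list of (x, y) replicate pairs, one pair per gene.
    Returns one (ordinary_t, moderated_t) tuple per gene."""
    s2 = [pooled_variance(x, y) for x, y in genes]
    # Assumption: the prior variance is the plain average across all genes.
    s2_prior = statistics.mean(s2)
    results = []
    for (x, y), v in zip(genes, s2):
        d = len(x) + len(y) - 2  # residual degrees of freedom for this gene
        v_tilde = shrunken_variance(v, d, s2_prior, d_prior)
        results.append((t_statistic(x, y, v), t_statistic(x, y, v_tilde)))
    return results

# Toy data (made up): gene 0 has a tiny difference and, by chance, a tiny variance.
example_genes = [
    ([1.00, 1.01, 0.99], [1.10, 1.11, 1.09]),
    ([1.0, 2.0, 3.0], [4.0, 5.5, 6.0]),
    ([0.5, 1.5, 1.0], [1.2, 0.4, 0.9]),
]

for ordinary, moderated in moderated_t(example_genes, d_prior=4):
    print(f"ordinary t = {ordinary:8.2f}   moderated t = {moderated:8.2f}")
```

With d_prior = 0 the moderated statistic reduces to the ordinary t; with 
d_prior > 0 the statistic for gene 0, whose own variance is far below the 
average, is pulled towards zero.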

2. > The real problem that researchers face with microarrays is NOT that
> their t-test variances are too small, but that they often have
> insufficient sample to detect the differences they need to detect. The
> ready solution is to get enough data.

I do agree with the general point. In a previous incarnation I used to do 
behavioral ecology and help field biologists with their data. It was not 
unheard of (in areas with a lot less funding than molecular biology) to spend 
two years in the field following some creatures to try to get decent sample 
sizes (and maybe one or two papers out). The answer to small sample sizes was 
often more field seasons, not shortcuts in the data analysis. 
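
To give a feel for what "enough data" means, here is a minimal Python sketch 
of the usual normal-approximation sample size formula for a two-sided, 
two-group comparison of means, n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 
per group. The function name and the default alpha and power are assumptions 
for illustration; an exact calculation based on the t distribution gives 
somewhat larger numbers when n is small.

```python
import math
from statistics import NormalDist

def samples_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size needed to detect a mean difference
    `delta` given a within-group standard deviation `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# A difference of one standard deviation, single test at alpha = 0.05:
print(samples_per_group(delta=1.0, sigma=1.0))

# The same difference with a Bonferroni-style alpha for 10,000 genes:
print(samples_per_group(delta=1.0, sigma=1.0, alpha=0.05 / 10000))
```

The first call gives 16 per group; the second, with the much stricter 
per-gene alpha, asks for considerably more, which is a long way from the 
two or three replicates often seen in practice.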

I often interact with many molecular biologists and MDs who persevere in using 
tiny sample sizes for "serious stuff". This concerns me a lot (as both a 
statistician and a potential patient who might one day seek treatment!).

Best,

R.



On Tuesday 16 December 2003 23:45, Baker, Stephen wrote:
> Garrett et al,
>
> The t-test (or ANOVA) does not have a problem with "accidentally too
> small" variances, either with one or more than one outcome of interest.
> The estimate of the error variance by t-tests and ANOVA is a Least
> Squares estimate and is the UNBIASED ESTIMATOR that is also the lower
> bound on the variance for the "best" (minimum variance) linear unbiased
> estimator (BLUE) of the effects being tested (see Graybill 1976).
>
> Some bayesian methods can generate smaller estimates of variances by
> biasing the estimate toward some overall measure such as the average of
> variances for nearby genes.  These are BIASED estimates based on an
> assumption that a particular gene should really be like genes that are
> "nearby" in some sense, such as they have similar expression levels.
> You would have to present a lot of data to me to convince me that any
> randomly selected gene should have a variance like some other set of
> genes, especially when I have an unbiased estimate at hand that is
> non-controversial, requires no defense, and uses methods that have
> withstood 100 years of review and scrutiny. I'm familiar with shrunken
> estimates of effects that can have a smaller "mean squared error", but
> these are random effects, not variances which control the power and type
> I error rate.
>
> These approaches, in addition to producing biased estimates, sometimes
> require the analyst to impose his or her own particular biases, called
> "prior beliefs" or "priors", as to how much these estimates should be
> biased by requiring that the analyst input how much weight is given to
> the data from that gene and how much weight is given to the other set
> that the gene is supposed to "be more like".  Again, it would take some
> pretty strong arguments to convince me of any particular analyst's
> prior beliefs about how much the data for a gene or data from other
> genes should or should not be weighted.  I would be concerned about how
> much convincing a readership, reviewer, or study group would need if
> they ever decide to "open the black box" and ask me to explain why such
> an approach is reasonable/justifiable.
>
> The program Garrett mentioned, Cyber-T, uses such an approach.  To quote
> the Cyber-T manual "...This weighting factor IS CONTROLLED BY THE
> EXPERIMENTER AND WILL DEPEND ON HOW CONFIDENT THE EXPERIMENTER IS that
> the background variance of a closely related set of genes approximates
> the variance of the gene under consideration".  Now if one was looking
> at just ONE  gene, it makes sense that someone might put a lot of
> thought into it, have looked at a lot of similar genes or other data and
> come to the conclusion that a gene should be like some other genes and
> THEN use this approach.  But this is not the case when you have 10,000
> or 22,000 genes, at least not in the world I'm familiar with.
>
> I use empirical bayes methods for fitting general linear mixed models,
> where the priors are objective, not my own opinion.  Cyber-T does offer
> the option of setting low confidence in the prior which is an objective
> prior, but the manual points out that this results in the standard
> Student t-test!  Another feature of Cyber-T is that when you have
> "enough" data, the weighted approach converges into the standard t-test
> as well.
>
> The real problem that researchers face with microarrays is NOT that
> their t-test variances are too small, but that they often have
> insufficient sample to detect the differences they need to detect. The
> ready solution is to get enough data.
>
> -.- -.. .---- .--. ..-.
> Stephen P. Baker, MScPH, PhD (ABD)            (508) 856-2625
> Sr. Biostatistician- Information Services
> Lecturer in Biostatistics                     (775) 254-4885 fax
> Graduate School of Biomedical Sciences
> University of Massachusetts Medical School, Worcester
> 55 Lake Avenue North                          stephen.baker at umassmed.edu
> Worcester, MA 01655  USA
>
> ------------------------------
>
> Message: 6
> Date: Tue, 16 Dec 2003 10:24:31 -0500
> From: "Garrett Frampton" <gmframpt at bu.edu>
> Subject: RE: [BioC] ttest or fold change
> To: <bioconductor at stat.math.ethz.ch>
> Message-ID: <00b801c3c3e8$b3ed2cc0$e1be299b at GARRETT>
> Content-Type: text/plain;	charset="US-ASCII"
>
> Dr. Baker,
>
> You wrote about "the problem" that the t-test denominator may be
> accidentally "too small".  You say that this issue has been solved
> within the T-test.  It is my belief that this problem has only been
> partially solved.  It is true that this "problem" has been solved for a
> single hypothesis test within the T-test, but it has not been solved for
> microarray data analysis as a whole.
>
> It is possible to gain power by using local estimates of variance based
> upon more than one gene.  This sort of approach is extremely useful for
> experiments with only a few replicates because it deals with the
> situation where the within group variance for a single gene happens to
> be very small. This is the approach implemented in Cyber-T;
> http://visitor.ics.uci.edu/genex/cybert/.  By looking at the dataset as
> a whole, rather than 1 gene at a time, it is possible to eliminate
> false-positives that arise as a result of coincidentally low within
> group variance.
>
> Do you agree?
> Other than this minor point I think you did a wonderful job putting the
> statistical concepts that so many struggle with into words.
>
>
> Garrett Frampton
> Research Associate
> Boston University School of Medicine - Microarray Resource
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)


