[BioC] Wilcoxon test [was loged data or not loged previous to use normalize.quantile]

Gordon Smyth smyth at wehi.edu.au
Wed Apr 6 03:06:09 CEST 2005


There are many different permutation tests and a properly designed 
permutation test can be very general indeed. To be specific, I'm 
referring to the Wilcoxon two-sample rank test (aka Mann-Whitney test), 
which is equivalent to a particular permutation test.
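
The equivalence can be checked numerically. What follows is only an 
illustrative sketch: the sample sizes, the seed and the variable names are 
arbitrary assumptions. It enumerates every way of assigning the pooled 
ranks to sample 1 and compares the resulting permutation p-value with the 
exact one-sided p-value from wilcox.test().

```r
# Assumed toy data: 4 and 6 continuous observations, so there are no ties.
set.seed(42)
x <- rnorm(4, mean = 1)
y <- rnorm(6)

# Ranks of the pooled data; the first length(x) entries belong to x.
r <- rank(c(x, y))
obs <- sum(r[seq_along(x)])

# Rank sum of "sample 1" for every possible grouping of the pooled ranks.
all_sums <- combn(length(r), length(x), function(idx) sum(r[idx]))

# One-sided permutation p-value: fraction of groupings at least as extreme.
p_perm <- mean(all_sums >= obs)

# Exact one-sided Wilcoxon p-value for comparison.
p_wilcox <- wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value

isTRUE(all.equal(p_perm, p_wilcox))
```

The two p-values should agree because the exact Wilcoxon null distribution 
treats all choose(n1+n2,n1) groupings of the ranks as equally likely.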

Over many years as a statistician, I've heard it said so many times "the 
variances were not equal so I used a Wilcoxon two-sample test instead of a 
t-test" or "I used a rank test which is assumption free". Like Naomi, I 
find it frustrating that this misunderstanding is so common. The fact is 
that all tests make some assumptions, and inequality of population 
variances under the null hypothesis breaks the Wilcoxon test just as it 
does the pooled t-test. I don't know which test breaks down more quickly -- 
I certainly haven't seen any evidence that the Wilcoxon test is more robust 
than the t-test to inequality of variances.

It is easy to confirm that the Wilcoxon test breaks down under inequality 
of variances, either by a simulation or with a back-of-the-envelope 
calculation. Suppose for example that you are testing equality of means (or 
medians) of two populations with sample sizes n1 and n2. Suppose that the 
two populations have equal medians but that population 1 has a very much 
larger variance than population 2. Each observation in sample 1 then falls 
above essentially all of sample 2 with probability about 1/2, so the two 
samples will separate, with all of sample 1 larger than sample 2, with 
probability about 1/2^n1. However, the one-sided Wilcoxon test p-value in 
such a case will be 1/choose(n1+n2,n1), a very much smaller quantity. 
Suppose for example that n1=5, n2=10. Then the p-value will be evaluated by 
the Wilcoxon test as

 > 1/choose(5+10,5)
[1] 0.0003330003

but the actual size of the test (the probability of seeing a p-value this 
small when the medians are equal) is about

 > 1/2^5
[1] 0.03125

which is nearly 100 times (about 94 times) the nominal p-value. This shows 
that the Wilcoxon test does not hold its size under inequality of variances.
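
The same conclusion can be reached by simulation. The sketch below is an 
illustration under assumed settings, not a definitive study: it uses normal 
populations with common median zero, a standard deviation ratio of 1000 to 
stand in for "very much larger variance", and a hypothetical helper called 
wilcoxon_size(). It estimates how often the exact one-sided test returns its 
smallest possible p-value, 1/choose(n1+n2,n1).

```r
# Estimate how often the exact one-sided Wilcoxon test reports its smallest
# possible p-value when both populations have median zero.
# wilcoxon_size() and all of its defaults are illustrative assumptions.
wilcoxon_size <- function(n1 = 5, n2 = 10, sd1 = 1000, sd2 = 1, nsim = 4000) {
  p_min <- 1 / choose(n1 + n2, n1)
  hits <- replicate(nsim, {
    x <- rnorm(n1, mean = 0, sd = sd1)
    y <- rnorm(n2, mean = 0, sd = sd2)
    p <- wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
    # Small tolerance guards against floating point rounding in the p-value.
    p <= p_min * (1 + 1e-8)
  })
  mean(hits)
}

set.seed(1)
unequal <- wilcoxon_size(sd1 = 1000)  # population 1 far more variable
equal   <- wilcoxon_size(sd1 = 1)     # equal variances, for reference

c(unequal = unequal, equal = equal, nominal = 1 / choose(15, 5))
```

With unequal variances the estimated rate should come out roughly at 
1/2^5, while with equal variances it should stay near the nominal 
1/choose(15,5), up to simulation noise.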

>[BioC] loged data or not loged previous to use normalize.quantile
>Rhonda DeCook rdecook at iastate.edu
>Tue Apr 5 17:51:09 CEST 2005
>
>With respect to permutation tests...
>
>I'm under the impression that you only need independence, not the
>assumption of constant variance.

No, independence is not enough, as you say yourself in the next sentence.

>The permutation test provides us with a distribution of the test statistic
>under the null hypothesis (equal means in the 2-sample scenario, i.e. all
>data was generated from one distribution, even though it may be an ugly
>looking single distribution).

You are saying that the observations must be independent and "from one 
distribution", i.e., iid, exactly as Naomi said.

The whole point is that, if the population variances are not equal, then 
the two samples cannot be from the same distribution.

>  As long as all 'groupings' of the data into 2 groups are
>equally likely (which is provided by the independence assumption)

For all groupings to be equally likely, you need the two populations to 
have the same shape, and this includes equality of variances.
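
The point also applies to a permutation test on the difference in means. 
The following is a rough sketch under assumed settings (normal data, n1=5, 
n2=20, the small sample having ten times the standard deviation, and 
hypothetical helpers perm_pvalue() and perm_size()). It estimates how often 
a two-sided permutation test rejects at the 5% level when the two population 
means are in fact equal.

```r
# Monte Carlo two-sided permutation p-value for a difference in means.
# perm_pvalue() and perm_size() are illustrative helpers, not library calls.
perm_pvalue <- function(x, y, nperm = 199) {
  pooled <- c(x, y)
  n1 <- length(x)
  observed <- abs(mean(x) - mean(y))
  permuted <- replicate(nperm, {
    idx <- sample(length(pooled), n1)  # random regrouping of the pooled data
    abs(mean(pooled[idx]) - mean(pooled[-idx]))
  })
  (1 + sum(permuted >= observed)) / (nperm + 1)
}

# Fraction of simulated data sets rejected at level alpha, with both
# population means equal to zero.
perm_size <- function(sd1, sd2, n1 = 5, n2 = 20, nsim = 500, alpha = 0.05) {
  pvals <- replicate(nsim,
    perm_pvalue(rnorm(n1, sd = sd1), rnorm(n2, sd = sd2)))
  mean(pvals <= alpha)
}

set.seed(2005)
size_equal   <- perm_size(sd1 = 1,  sd2 = 1)  # same distribution
size_unequal <- perm_size(sd1 = 10, sd2 = 1)  # same mean, unequal variances

c(equal = size_equal, unequal = size_unequal)
```

When both samples come from one distribution the rejection rate should sit 
near 0.05, but when only the variances differ it should be noticeably 
higher, because the groupings are then no longer equally likely.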

Gordon

>  this
>permutation distribution of the test statistic (e.g. a t-statistic here) gives
>us an idea of the test statistic's distribution under the null without the
>assumption of normality or constant variance.  Computing a permutation
>p-value from this null distribution provides a p-value that has the usual
>behavior under the null, or Uniform(0,1) though in a discrete manner.  When
>the alternative is true, the distribution of the p-value will have more mass
>near zero than the Uniform(0,1).
>
>If this logic doesn't apply to the microarray setting, please let me know.
>
>Rhonda
>
>
> > I just want to remind people that permutation tests, rank tests, etc still
> > require i.i.d. errors.  So the variance needs to be stabilized even for
> > nonparametric tests.
> >
> > --Naomi
> >
> > At 01:32 PM 4/4/2005, Fangxin Hong wrote:
> > >Hi Marcelo;
> > >As Wolfgang mentioned, a non-parametric permutation test is an option
> > >when the t-distribution assumption is not valid.  But if you have few
> > >replicates (2-3), most permutation tests don't have power either. I
> > >would suggest you try the RankProd package, which would be powerful enough
> > >to detect differentially expressed genes with 2 replicates.
> > >
> > >Bests;
> > >Fangxin
> > >
> > >
> > >
> > > > Hi Marcelo,
> > > >
> > > > the difference is that the power of the test you are doing can be
> > > > different when you consider the data on the "raw" or on the
> > > > log-transformed scale.
> > > >
> > > > > Also, the p-value calculated by limma is based on the assumption that
> > > > > the null-distribution of the test statistic is given by a
> > > > > t-distribution; this assumption might be more or less true in both
> > > > > cases.
> > > >
> > > > You are really doing two different tests: test A, say, consists of
> > > > applying the t-statistic to the untransformed intensities, test B, say,
> > > > applying the t-statistic to the transformed intensities.
> > > >
> > > > Then, if you want to use the t-distribution for getting p-values, you
> > > > need to make sure that the null distribution of your test statistic
> > > > is indeed (to good enough approximation) t-distributed. You can do this
> > > > e.g. by permutations. For that you need either a large number of
> > > > replicates, or to pool variance estimators across genes.
> > > >
> > > > If you don't want to make a parametric assumption for getting p-values,
> > > > you need a larger number of replicates; if you have these, you can for
> > > > example calculate a permutation p-value.
> > > >
> > > > So, there is really no "right" or "wrong" about transforming, or which
> > > > transformation -- as long as you don't violate the assumptions of the
> > > > > subsequent tests. If the assumptions are met, then the procedure with
> > > > > the highest power is preferable. And that depends very much on your
> > > > > data (about which you have not told us much.)
> > > >
> > > > Hope that helps.
> > > >
> > > > And here is another shameless plug: have a look at this paper:
> > > > Differential Expression with the Bioconductor Project
> > > > http://www.bepress.com/bioconductor/paper7
> > > >
> > > >    Best wishes
> > > >     Wolfgang
> > > >
> > > > Marcelo Luiz de Laia wrote:
> > > >> Dear Bioconductors Friends,
> > > >>
> > > > >> I have a question that I haven't found an answer for. If you have a
> > > > >> paper/article that explains it, please tell me.
> > > > >>
> > > > >> I normalize our data using the normalize.quantile function.
> > > > >>
> > > > >> If I first transform our intensities (single channel) to log2, I don't
> > > > >> get differentially expressed genes in limma.
> > > > >>
> > > > >> But if I don't transform our data, I get some genes with p.value around
> > > > >> 0.0001, that's great!
> > > > >>
> > > > >> Of course, when I transform the intensity data to log2, I get some NA.
> > > > >>
> > > > >> Why is there this difference? Am I wrong to do an analysis with
> > > > >> non-logged data?
> > > >>
> > > >> Thanks a lot
> > > >>
> > > >> Marcelo
> > > >>
> > > >> _______________________________________________
> > > >> Bioconductor mailing list
> > > >> Bioconductor at stat.math.ethz.ch
> > > >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > >
> > > >
> > > > --
> > > > Best regards
> > > >    Wolfgang
> > > >
> > > > -------------------------------------
> > > > Wolfgang Huber
> > > > European Bioinformatics Institute
> > > > European Molecular Biology Laboratory
> > > > Cambridge CB10 1SD
> > > > England
> > > > Phone: +44 1223 494642
> > > > Fax:   +44 1223 494486
> > > > Http:  www.ebi.ac.uk/huber
> > > >
> > >
> > >
> > >--
> > >Fangxin Hong, Ph.D.
> > >Plant Biology Laboratory
> > >The Salk Institute
> > >10010 N. Torrey Pines Rd.
> > >La Jolla, CA 92037
> > >E-mail: fhong at salk.edu
> > >
> >
> > Naomi S. Altman                                814-865-3791 (voice)
> > Associate Professor
> > Bioinformatics Consulting Center
> > Dept. of Statistics                              814-863-7114 (fax)
> > Penn State University                         814-865-1348 (Statistics)
> > University Park, PA 16802-2111


