Bogdan Tanasa
Fri Mar 19 18:22:09 CET 2021
Dear all, thank you all for comments and help.
As far as I can see, if we have samples of 1000 records, only
"exact=FALSE" allows the code to run:
wilcox.test(rnorm(1000), rnorm(1000, 2), exact=FALSE)$p.value
[1] 7.304863e-231
If I use "exact=TRUE", it runs out of memory on my 64 GB RAM PC:
wilcox.test(rnorm(1000), rnorm(1000, 2), exact=TRUE)$p.value
(the job is terminated by OS)
If you have any other suggestions, please let me know. Thanks a lot!
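
For samples of this size, one possible workaround is to reproduce the normal approximation by hand and keep the p-value on the log scale, which also gives a value that can be reported even when it is extremely small. The following is only a sketch: it assumes there are no ties (as with continuous draws from rnorm), uses the default continuity correction, and relies on the mean m * n / 2 and variance m * n * (m + n + 1) / 12 quoted further down from the `Wilcoxon` help page. The variable names are illustrative, and the seed is fixed only so that both computations see the same data.

```r
# Sketch: recompute the exact = FALSE p-value by hand, on the log scale.
# Assumptions: no ties, two-sided test, default continuity correction.
set.seed(1)                       # same data for both computations
x <- rnorm(1000)
y <- rnorm(1000, 2)
m <- length(x)
n <- length(y)

# W as reported by wilcox.test: rank sum of x minus its smallest possible value
W <- sum(rank(c(x, y))[seq_len(m)]) - m * (m + 1) / 2

# Null mean and standard deviation of W when there are no ties
mu    <- m * n / 2
sigma <- sqrt(m * n * (m + n + 1) / 12)

# Continuity-corrected z score
d <- W - mu
z <- (d - sign(d) * 0.5) / sigma

# Two-sided p-value computed on the log scale, then converted
log_p   <- log(2) + pnorm(-abs(z), log.p = TRUE)
p_hand  <- exp(log_p)
log10_p <- log_p / log(10)        # usable even if p_hand underflows to 0

p_r <- wilcox.test(x, y, exact = FALSE)$p.value
print(c(by_hand = p_hand, wilcox_test = p_r))
print(log10_p)
```

If the two printed p-values agree, the log10 value can be quoted directly (for example as "p = 10^k") without needing exact=TRUE at all.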
On Fri, Mar 19, 2021 at 9:05 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> I **believe** -- if my old memory still serves-- that the "exact"
> specification uses a home grown version of the algorithm to calculate
> exact, or close approximations to the exact, permutation distribution
> originally developed by Cyrus Mehta, founder of StatXact software. Of
> course, examining the C code source would determine this, but I don't care
> to attempt this.
>
> If this is (no longer?) correct, please point this out.
>
> Best,
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Mar 19, 2021 at 8:42 AM Jiefei Wang <szwjf08 using gmail.com> wrote:
>
>> Hi Spencer,
>>
>> Thanks for your test results. I do not know the answer, as I haven't
>> used wilcox.test for many years. I do not know if it is possible to compute
>> the exact distribution of the Wilcoxon rank sum statistic, but I think it
>> is very likely, as the documentation of `Wilcoxon` says:
>>
>> This distribution is obtained as follows. Let x and y be two random,
>> independent samples of size m and n. Then the Wilcoxon rank sum statistic
>> is the number of all pairs (x[i], y[j]) for which y[j] is not greater than
>> x[i]. This statistic takes values between 0 and m * n, and its mean and
>> variance are m * n / 2 and m * n * (m + n + 1) / 12, respectively.
>>
>> As a nice feature of the non-parametric statistic, it is usually
>> distribution-free so you can pick any distribution you like to compute the
>> same statistic. I wonder if this is the case, but I might be wrong.
>>
>> Cheers,
>> Jiefei
>>
>>
>> On Fri, Mar 19, 2021 at 10:57 PM Spencer Graves <
>> spencer.graves using effectivedefense.org> wrote:
>>
>> >
>> >
>> > On 2021-3-19 9:52 AM, Jiefei Wang wrote:
>> > > After digging into the R source, it turns out that the argument `exact`
>> > > has nothing to do with numeric precision. It only affects the
>> > > statistical model used to compute the p-value. When `exact=TRUE`, the
>> > > true distribution of the statistic is used. Otherwise, a normal
>> > > approximation is used.
>> > >
>> > > I think the documentation needs to be improved here: you can compute the
>> > > exact p-value *only* when you do not have any ties in your data. If you
>> > > have ties in your data, you will get the p-value from the normal
>> > > approximation no matter what value you give `exact`. This behavior
>> > > should be documented, or a warning should be given when `exact=TRUE` and
>> > > ties are present.
>> > >
>> > > FYI, if the exact p-value is required, the `pwilcox` function is used to
>> > > compute the p-value. There are no details on how it computes the p-value,
>> > > but its C code seems to compute the probability table, so I assume it
>> > > computes the exact p-value from the true distribution of the statistic,
>> > > not a permutation or MC p-value.
>> >
>> >
>> >        My example shows that it does NOT use Monte Carlo: each call,
>> > repeated twice, returns the identical p-value, so it must use some fixed
>> > distribution.  I believe the term "exact" means that it uses the
>> > permutation distribution, though I could be mistaken.  If it's NOT a
>> > permutation distribution, I don't know what it is.
>> >
>> >
>> > Spencer
>> > >
>> > > Best,
>> > > Jiefei
>> > >
>> > >
>> > >
>> > > On Fri, Mar 19, 2021 at 10:01 PM Jiefei Wang <szwjf08 using gmail.com> wrote:
>> > >
>> > >> Hey,
>> > >>
>> > >> I just want to point out that the word "exact" has two meanings. It can
>> > >> mean the numerically accurate p-value, as Bogdan asked in his first
>> > >> email, or it can mean the p-value calculated from the exact distribution
>> > >> of the statistic (in this case, the U statistic). These two are actually
>> > >> not related, even though they are both called "exact".
>> > >>
>> > >> Best,
>> > >> Jiefei
>> > >>
>> > >> On Fri, Mar 19, 2021 at 9:31 PM Spencer Graves <
>> > >> spencer.graves using effectivedefense.org> wrote:
>> > >>
>> > >>>
>> > >>> On 2021-3-19 12:54 AM, Bogdan Tanasa wrote:
>> > >>>> Thanks a lot, Vivek! In other words, assuming that we work with 1000
>> > >>>> data points,
>> > >>>>
>> > >>>> if we use EXACT = TRUE, it uses the normal approximation,
>> > >>>>
>> > >>>> while if EXACT=FALSE (for these large samples), it does not?
>> > >>>
>> > >>> As David Winsemius noted, the documentation is not clear.
>> > >>> Consider the following:
>> > >>>
>> > >>>   > set.seed(1)
>> > >>>   > x <- rnorm(100)
>> > >>>   > y <- rnorm(100, 2)
>> > >>>
>> > >>>   > wilcox.test(x, y)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>   > wilcox.test(x, y)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>
>> > >>>   > wilcox.test(x, y, EXACT=TRUE)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>   > wilcox.test(x, y, EXACT=TRUE)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>   > wilcox.test(x, y, exact=TRUE)$p.value
>> > >>>   [1] 4.123875e-32
>> > >>>   > wilcox.test(x, y, exact=TRUE)$p.value
>> > >>>   [1] 4.123875e-32
>> > >>>
>> > >>>   > wilcox.test(x, y, EXACT=FALSE)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>   > wilcox.test(x, y, EXACT=FALSE)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>   > wilcox.test(x, y, exact=FALSE)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>   > wilcox.test(x, y, exact=FALSE)$p.value
>> > >>>   [1] 1.172189e-25
>> > >>>
>> > >>> We get two values here: 1.172189e-25 and 4.123875e-32. The first one,
>> > >>> I think, is the normal approximation, which is the same as
>> > >>> exact=FALSE. I think that with exact=TRUE, you get a permutation
>> > >>> distribution, though I'm not sure.
>> > >>>
>> > >>> You might try looking at "wilcox_test in package coin for exact,
>> > >>> asymptotic and Monte Carlo conditional p-values, including in the
>> > >>> presence of ties" to see if it is clearer.
>> > >>>
>> > >>> NOTE: R is case sensitive, so "EXACT" is a different name from
>> > >>> "exact". It is interpreted as an optional argument, which is not
>> > >>> recognized and therefore ignored in this context.
>> > >>> Hope this helps.
>> > >>> Spencer
>> > >>>
>> > >>>
>> > >>>> On Thu, Mar 18, 2021 at 10:47 PM Vivek Das <vd4mmind using gmail.com> wrote:
>> > >>>>
>> > >>>>> Hi Bogdan,
>> > >>>>>
>> > >>>>> You can also get the information from the wilcox.test help page
>> > >>>>> (link below).
>> > >>>>>
>> > >>>>> “By default (if exact is not specified), an exact p-value is computed
>> > >>>>> if the samples contain less than 50 finite values and there are no
>> > >>>>> ties. Otherwise, a normal approximation is used.”
>> > >>>>>
>> > >>>>> For more:
>> > >>>>>
>> > >>>>> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/wilcox.test.html
>> > >>>>>
>> > >>>>> Hope this helps!
>> > >>>>>
>> > >>>>> Best,
>> > >>>>>
>> > >>>>> VD
>> > >>>>>
>> > >>>>>
>> > >>>>>> On Thu, Mar 18, 2021 at 10:36 PM Bogdan Tanasa <tanasa using gmail.com> wrote:
>> > >>>>>> Dear Peter, thanks a lot. Yes, we can see a very precise p-value, and
>> > >>>>>> that was the request from the journal.
>> > >>>>>>
>> > >>>>>> If I may ask another question, please: what is the meaning of
>> > >>>>>> "exact=TRUE" or "exact=FALSE" in wilcox.test?
>> > >>>>>>
>> > >>>>>> I can see that the "numerically precise" p-values are different.
>> > >>>>>> Thanks a lot!
>> > >>>>>>
>> > >>>>>> tst = wilcox.test(rnorm(100), rnorm(100, 2), exact=TRUE)
>> > >>>>>> tst$p.value
>> > >>>>>> [1] 8.535524e-25
>> > >>>>>>
>> > >>>>>> tst = wilcox.test(rnorm(100), rnorm(100, 2), exact=FALSE)
>> > >>>>>> tst$p.value
>> > >>>>>> [1] 3.448211e-25
>> > >>>>>>
>> > >>>>>> On Thu, Mar 18, 2021 at 10:15 PM Peter Langfelder <
>> > >>>>>> peter.langfelder using gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> I think the answer is much simpler. The print method for hypothesis
>> > >>>>>>> tests (class htest) truncates the p-values. In the above example,
>> > >>>>>>> instead of using
>> > >>>>>>>
>> > >>>>>>> wilcox.test(rnorm(100), rnorm(100, 2), exact=TRUE)
>> > >>>>>>>
>> > >>>>>>> and copying the output, just print the p-value:
>> > >>>>>>>
>> > >>>>>>> tst = wilcox.test(rnorm(100), rnorm(100, 2), exact=TRUE)
>> > >>>>>>> tst$p.value
>> > >>>>>>>
>> > >>>>>>> [1] 2.988368e-32
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> I think this value is what the journal asks for.
>> > >>>>>>>
>> > >>>>>>> HTH,
>> > >>>>>>>
>> > >>>>>>> Peter
>> > >>>>>>>
>> > >>>>>>> On Thu, Mar 18, 2021 at 10:05 PM Spencer Graves
>> > >>>>>>> <spencer.graves using effectivedefense.org> wrote:
>> > >>>>>>>> I would push back on that from two perspectives:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>        1.  I would study exactly what the journal said very
>> > >>>>>>>> carefully.  If they mandated "wilcox.test", that function has an
>> > >>>>>>>> argument called "exact".  If that's what they are asking, then using
>> > >>>>>>>> that argument gives the exact p-value, e.g.:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> > wilcox.test(rnorm(100), rnorm(100, 2), exact=TRUE)
>> > >>>>>>>>
>> > >>>>>>>> Wilcoxon rank sum exact test
>> > >>>>>>>>
>> > >>>>>>>> data: rnorm(100) and rnorm(100, 2)
>> > >>>>>>>> W = 691, p-value < 2.2e-16
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>        2.  If that's NOT what they are asking, then I'm not
>> > >>>>>>>> convinced what they are asking makes sense:  There is no such thing
>> > >>>>>>>> as an "exact p value" except to the extent that certain assumptions
>> > >>>>>>>> hold, and all models are wrong (but some are useful), as George Box
>> > >>>>>>>> famously said years ago.[1]  Truth only exists in mathematics, and
>> > >>>>>>>> that's because it's a fiction to start with ;-)
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hope this helps.
>> > >>>>>>>> Spencer Graves
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> [1]
>> > >>>>>>>> https://en.wikipedia.org/wiki/All_models_are_wrong
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On 2021-3-18 11:12 PM, Bogdan Tanasa wrote:
>> > >>>>>>>>> <https://meta.stackexchange.com/questions/362285/about-a-p-value-2-2e-16>
>> > >>>>>>>>> Dear all,
>> > >>>>>>>>>
>> > >>>>>>>>> I would appreciate having your advice on the following, please:
>> > >>>>>>>>>
>> > >>>>>>>>> in R, wilcox.test() reports "p-value < 2.2e-16" when we compare
>> > >>>>>>>>> expression values for sets of 1000 genes (in the genomics field).
>> > >>>>>>>>>
>> > >>>>>>>>> However, the journal asks us to provide the exact p value ...
>> > >>>>>>>>>
>> > >>>>>>>>> Would it be legitimate to write: "p-value = 0"? Thanks a lot,
>> > >>>>>>>>>
>> > >>>>>>>>> -- bogdan
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>> --
>> > >>>>> ----------------------------------------------------------
>> > >>>>>
>> > >>>>> Vivek Das, PhD
>> > >>>>>
>> > >>>>
>> > >>>
>> > >>>
>> > >>>
>> >
>> >
>>
>>
>>
>
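
On the point made in the quoted thread that the exact p-value comes from `pwilcox`, the following sketch checks that claim on a small sample where the exact computation is cheap. It assumes no ties and a two-sided test; the sample sizes and variable names are illustrative only, and the tail rule used here is my reading of how the two-sided exact p-value is formed rather than a statement from the documentation.

```r
# Sketch: for small samples without ties, recompute the exact = TRUE
# p-value from pwilcox() and compare it with wilcox.test().
set.seed(1)
x <- rnorm(20)
y <- rnorm(20, 1)
m <- length(x)
n <- length(y)

# W as reported by wilcox.test
W <- sum(rank(c(x, y))[seq_len(m)]) - m * (m + 1) / 2

# One tail of the exact null distribution, chosen by the side W falls on
p_tail <- if (W > m * n / 2) {
  pwilcox(W - 1, m, n, lower.tail = FALSE)   # P(W' >= W)
} else {
  pwilcox(W, m, n)                           # P(W' <= W)
}
p_exact_hand <- min(2 * p_tail, 1)           # two-sided, capped at 1

p_exact_r  <- wilcox.test(x, y, exact = TRUE)$p.value
p_normal_r <- wilcox.test(x, y, exact = FALSE)$p.value
print(c(exact_by_hand = p_exact_hand, exact = p_exact_r, normal = p_normal_r))
```

If the first two printed values agree and the third differs slightly, that is consistent with "exact" referring to the distribution used for the statistic rather than to numerical precision.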
