[R] Computing P-Value

Wed May 28 17:20:54 CEST 2008

Gundala Viswanath wrote:
| Dear Ben,
|
| Given a set of words
| ('foo', 'bar', 'bar', 'bar', "quux" ..... "foo") this can be in 10.000
| items.
| I would like to compute the significance of the word occurrence with
| P-Value.
|
| Is there a simple way to do it?
|
| - GV
|

Closer, but still not enough information.  What is your null
hypothesis?  Equidistribution?  If so, ...

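# simulate 10000 words drawn with equal probability
dat <- sample(c("foo","bar","quux","pridznyskie"),
              replace=TRUE, size=10000)
tab <- table(dat)   # counts per word
chisq.test(tab)     # goodness-of-fit test against equal probabilities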

From ?chisq.test:

     If 'x' is a matrix with one row or column, or if 'x' is a vector
     and 'y' is not given, then a _goodness-of-fit test_ is performed
     ('x' is treated as a one-dimensional contingency table).  The
     entries of 'x' must be non-negative integers.  In this case, the
     hypothesis tested is whether the population probabilities equal
     those in 'p', or are all equal if 'p' is not given.
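
If your null hypothesis is not equidistribution, the 'p' argument
described above is where the expected probabilities go.  A minimal
sketch, in which the three words and the reference probabilities in p0
are made up purely for illustration:

```r
# Hypothetical example: test word counts against assumed (made-up)
# reference probabilities rather than equal ones.
set.seed(1)
words <- c("foo", "bar", "quux")
dat <- sample(words, size = 1000, replace = TRUE,
              prob = c(0.5, 0.3, 0.2))

# Use a factor with explicit levels so the counts come out in a known order
tab <- table(factor(dat, levels = words))

p0 <- c(0.5, 0.3, 0.2)          # null probabilities, same order as 'words'
res <- chisq.test(tab, p = p0)  # goodness-of-fit test against p0
res$p.value
```

The probabilities in 'p' must be in the same order as the counts and
sum to 1, which is why the factor levels are fixed explicitly here.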

Note that this won't test the significance of *individual* deviations
from equiprobability, only the overall pattern.  If you wanted to test
individual words you could use binom.test, but if you tested more
than one word, or chose which words to test because they appeared to
have extreme frequencies, you would run into multiple-comparison and
post hoc testing issues.
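
As a rough sketch of the per-word approach: run binom.test on each
word's count against the equiprobable null, then adjust the p-values
with p.adjust.  The simulated data and the choice of the Holm method
are my assumptions here, not a recommendation for your particular
problem:

```r
# Simulated data, as in the chisq.test example (illustrative only)
set.seed(1)
words <- c("foo", "bar", "quux", "pridznyskie")
dat <- sample(words, size = 10000, replace = TRUE)
tab <- table(factor(dat, levels = words))

counts <- setNames(as.vector(tab), names(tab))
n  <- sum(counts)
p0 <- 1 / length(counts)   # null: every word equally likely

# One exact binomial test per word (that word vs. all the others)
p_raw <- sapply(counts, function(k) binom.test(k, n, p = p0)$p.value)

# Holm adjustment for having tested several words at once
p_adj <- p.adjust(p_raw, method = "holm")
cbind(count = counts, p_raw, p_adj)
```

The adjustment only addresses testing several words at once.  It does
not fix the post hoc problem of picking words to test after looking at
their frequencies.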

Do you know anything about the methods that people usually use
in this area?

Ben Bolker
