[R] A goodness of fit test for two discrete distributions with unequal variance?

David Winsemius dw|n@em|u@ @end|ng |rom comc@@t@net
Sat Aug 24 00:03:11 CEST 2019

On 8/23/19 2:52 PM, Serena De Stefani wrote:
> I have a computer simulation in which a virtual agent ends up in different
> areas of a layout based on several factors. There are 18 conditions in
> total.
> If I collapse the data points into bins, where each bin is one of the areas,
> the data would look like this:
>      x0 <- c(3,3,5,5,2) # computer simulation
> Now I would like to validate this model by having human subjects go through
> the same conditions, but I run into two sets of issues:
>   1. the first issue is due to the fact that the dataset is discrete and
> small (there may be fewer than 5 counts in a bin, and that's a problem for a
> Chi-Square Goodness of Fit test), also there may be ties. After some online
> digging I found two options:
> - a permutation test
> - a Cramer-von Mises test of goodness-of-fit (see this paper:
>   https://journal.r-project.org/archive/2011/RJ-2011-016/RJ-2011-016.pdf)
> I thought the Cramer-von Mises test of goodness-of-fit test could work, so
> I ran it with made-up data for *one human subject* and I get the following
> result:
>      x0 <- c(3,3,5,5,2) # computer simulation
>      x1 <- c(4,2,5,4,3) # subject 1
>      library(goftest)
>      cvm.test(x0, ecdf(x1))
>      > Cramer-von Mises test of goodness-of-fit
>      > Null hypothesis: distribution ‘ecdf(x1)’
>      > data:  x0
>      > omega2 = 0.14667, p-value = 0.4106
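
One caveat on the call above: cvm.test() in goftest takes its first argument as a vector of observations, so the five bin counts in x0 are treated as five data values rather than as a table of 18 outcomes. A minimal sketch of one alternative in base R, assuming the simulation's bin proportions are taken as the null distribution and the subject's counts are tested against them with a Monte Carlo p-value (so the large-sample chi-square approximation is not needed when expected counts are small). The names p0 and fit1 and the seed are illustrative choices, not part of the original post:

```r
x0 <- c(3, 3, 5, 5, 2)  # computer simulation: counts per area (18 conditions)
x1 <- c(4, 2, 5, 4, 3)  # subject 1: counts per area (18 conditions)

# Assumption: the simulation's observed proportions serve as the null
# probabilities for the five areas.
p0 <- x0 / sum(x0)

set.seed(1)  # arbitrary seed, only so the simulated p-value is reproducible

# Goodness-of-fit test of the subject's counts against p0; the p-value is
# obtained by sampling tables from the null instead of the chi-square table.
fit1 <- chisq.test(x1, p = p0, simulate.p.value = TRUE, B = 9999)

fit1$statistic  # Pearson chi-square statistic
fit1$p.value    # Monte Carlo p-value
```

This treats the 18 conditions as independent draws from a single distribution over areas, which is itself an assumption that the design may or may not support.
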
> So far so good. But now let’s say I would like to have more than one human
> subject, let’s say four of them. These are the results from the additional
> subjects:
>      x2 <- c(3,3,5,2,5) # subject 2
>      x3 <- c(2,2,5,6,3) # subject 3
>      x4 <- c(3,2,5,6,2) # subject 4
> Now I run into the second set of issues:
>   2. on the one side I have a single computer simulation, on the other side I
> have data from four subjects. Should I take the mean of the results for the
> human subjects? Then would my data still be “discrete”? Or should I run my
> simulation four times? But I would always get the same results, so the
> variance between the two datasets would be different.
> Any ideas? Maybe I should change the design and have more levels for my
> factors, so that I have more trials and the bins get bigger?
> 	[[alternative HTML version deleted]]
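
On the question of averaging: one option that keeps the data discrete is to sum the counts over subjects instead of taking their mean. The sketch below is only an illustration under stated assumptions: the subjects are treated as exchangeable, the simulation's proportions are used as the null probabilities, and the object names (subjects, homog, pooled, gof) are made up for this example. It first checks whether the four subjects look alike and then compares the pooled counts with the simulation, using Monte Carlo p-values in both cases because the counts are small:

```r
x0 <- c(3, 3, 5, 5, 2)  # computer simulation: counts per area

# One row per subject, one column per area (values from the post above).
subjects <- rbind(
  s1 = c(4, 2, 5, 4, 3),
  s2 = c(3, 3, 5, 2, 5),
  s3 = c(2, 2, 5, 6, 3),
  s4 = c(3, 2, 5, 6, 2)
)

set.seed(1)  # arbitrary seed, only for reproducible simulated p-values

# Homogeneity check: do the subjects share one distribution over areas?
homog <- chisq.test(subjects, simulate.p.value = TRUE, B = 9999)

# Pooled counts are column sums, so they are still integer counts per area.
pooled <- colSums(subjects)

# Goodness of fit of the pooled counts to the simulation's proportions.
gof <- chisq.test(pooled, p = x0 / sum(x0),
                  simulate.p.value = TRUE, B = 9999)

homog$p.value
gof$p.value
```

Pooling only makes sense if the homogeneity check gives no strong sign that subjects differ; otherwise a per-subject comparison is the safer reading.
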

Statistics questions, especially those from people who have failed to 
heed the advice of the Posting Guide to post in plain text, are 
off-topic on R-help and should be posted to a forum where statistics 
questions are welcomed. (My suspicion is that this question will be 
greeted with further requests for clarification of goals, since asking 
what you "should" do requires a careful explanation of what your 
standards of evidence are and what you are attempting to demonstrate.)


