# [R] A goodness of fit test for two discrete distributions with unequal variance?

Serena De Stefani @eren@de@te|@n| @end|ng |rom gm@||@com
Fri Aug 23 23:52:55 CEST 2019

```I have a computer simulation in which a virtual agent ends up in different
areas of a layout depending on several factors. There are 18 conditions in
total.
If I collapse the data points into bins, where each bin is one of the areas,
the data look like this:

x0 <- c(3,3,5,5,2) # computer simulation

Now I would like to validate this model by having human subjects go through
the same conditions, but I run into two sets of issues:

1. The first issue is that the dataset is discrete and small (there may be
fewer than 5 counts in a bin, which is a problem for a Chi-Square Goodness
of Fit test), and there may also be ties. After some online digging I found
two options:
- a permutation test
- a Cramer-von Mises goodness-of-fit test (see this paper:
https://journal.r-project.org/archive/2011/RJ-2011-016/RJ-2011-016.pdf)

I thought the Cramer-von Mises goodness-of-fit test could work, so
I ran it with made-up data for *one human subject* and got the following
result:

x0 <- c(3,3,5,5,2) # computer simulation
x1 <- c(4,2,5,4,3) # subject 1

library(goftest)

cvm.test(x0, ecdf(x1))

>Cramer-von Mises test of goodness-of-fit
>Null hypothesis: distribution ‘ecdf(x1)’

>data:  x0
>omega2 = 0.14667, p-value = 0.4106

So far so good. But now suppose I would like to have more than one human
subject, say four of them. These are the results from the additional
subjects:

x2 <- c(3,3,5,2,5) # subject 2
x3 <- c(2,2,5,6,3) # subject 3
x4 <- c(3,2,5,6,2) # subject 4

Now I run into the second set of issues:

2. On the one hand I have a single computer simulation; on the other I
have data from four subjects. Should I take the mean of the results for the
human subjects? Would my data then still be “discrete”? Or should I run my
simulation four times? But I would always get the same results, so the
variance of the two datasets would differ.

Any ideas? Maybe I should change the design and have more levels for my
factors, so that I have more trials and the bins get bigger?

```
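One way to make the two issues above concrete is the following minimal sketch. It is not a method from the post, and it rests on two assumptions: that the four subjects can be pooled by summing their counts per bin (which keeps the data discrete, unlike averaging), and that the single deterministic simulation run can be treated as the expected bin proportions. It uses base R's `chisq.test` with `simulate.p.value = TRUE`, which obtains the p-value by Monte Carlo simulation rather than from the chi-square approximation that is unreliable for small bin counts. The names `subjects`, `pooled`, and `p_sim` are illustrative only.

```r
# Bin counts copied from the post (5 areas, 18 conditions per run)
x0 <- c(3, 3, 5, 5, 2)  # computer simulation

subjects <- rbind(
  s1 = c(4, 2, 5, 4, 3),
  s2 = c(3, 3, 5, 2, 5),
  s3 = c(2, 2, 5, 6, 3),
  s4 = c(3, 2, 5, 6, 2)
)

# Assumption: pool the subjects by summing counts in each bin.
# The pooled data are still integer counts.
pooled <- colSums(subjects)

# Assumption: the simulation's counts define the expected proportions.
p_sim <- x0 / sum(x0)

set.seed(1)  # make the Monte Carlo p-value reproducible

# Goodness-of-fit test with a simulated (Monte Carlo) p-value,
# which does not rely on the large-sample chi-square approximation.
fit <- chisq.test(pooled, p = p_sim, simulate.p.value = TRUE, B = 10000)
print(fit)
```

Pooling treats all 72 human trials as independent, which ignores any within-subject dependence, and treating `p_sim` as fixed ignores uncertainty in the simulation itself. Whether either simplification is acceptable depends on the design.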