[R] two-sample KS test: data becomes significantly different after normalization

Monnand monnand at gmail.com
Mon Jan 12 04:12:40 CET 2015


Hi all,

This question is sort of related to R (I'm not sure if I used an R function
correctly), but also related to stats in general. I'm sorry if this is
considered off-topic.

I'm currently working on a data set with two sets of samples. The csv file
of the data can be found here: http://pastebin.com/200v10py

I would like to use the KS test to see whether these two sets of samples
come from different distributions.

I ran the following R script:

# read data from the file
> data = read.csv('data.csv')
> ks.test(data[[1]], data[[2]])
    Two-sample Kolmogorov-Smirnov test

data:  data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided
The KS test shows that these two samples are very similar. (In fact, they
should come from the same distribution.)

However, for practical reasons, the actual data I will receive is
normalized (zero mean, unit variance) rather than the raw values. So I
normalized the raw data I have and ran the KS test again:

> ks.test(scale(data[[1]]), scale(data[[2]]))
    Two-sample Kolmogorov-Smirnov test

data:  scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided
The p-value becomes almost zero after normalization, indicating that these
two samples are significantly different (from different distributions).

My question is: how can normalization make two similar samples different
from each other? I can see that if two samples are different, normalization
could make them similar. However, if two sets of data are similar, then
intuitively, applying the same operation to both should leave them similar,
or at least not very different from each other.

I did some further analysis of the data. I also tried to normalize the data
into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but
the same thing happened. At first, I thought outliers might be causing this
problem (I can see how an outlier could do so if I normalize the data into
the [0,1] range). So I deleted all data points more than 4 standard
deviations from the mean, but it still didn't help.
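For reference, here is roughly what those two extra checks looked like
(`minmax` and `trim` are helper names I made up; the 4-sd threshold is the
one described above):

```r
data <- read.csv('data.csv')
x <- data[[1]]
y <- data[[2]]

# (1) min-max normalization into [0,1], then KS test
minmax <- function(v) (v - min(v)) / (max(v) - min(v))
ks.test(minmax(x), minmax(y))

# (2) drop points more than 4 standard deviations from the mean,
#     then standardize and re-test
trim <- function(v) v[abs(v - mean(v)) <= 4 * sd(v)]
ks.test(scale(trim(x)), scale(trim(y)))
```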

Plus, I even plotted the eCDFs, and they *really* look the same to me, even
after normalization. Is there anything wrong with my usage of the R
function?
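This is roughly how I overlaid the eCDFs (colors chosen arbitrarily):

```r
# overlay the empirical CDFs of the two scaled samples
plot(ecdf(scale(data[[1]])), col = "blue",
     main = "eCDFs after standardization")
lines(ecdf(scale(data[[2]])), col = "red")
```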

Since the data contains ties, I also tried ks.boot (
http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
result.
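The ks.boot call was essentially this (it comes from the Matching package;
the nboots value is arbitrary):

```r
# bootstrap KS test, intended to be robust to ties
library(Matching)
ks.boot(scale(data[[1]]), scale(data[[2]]), nboots = 1000)
```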

Could anyone help me understand why this happens? Also, do you have any
suggestions for hypothesis testing on normalized data? (The data I have
right now is simulated. In the real world, I cannot get the raw data, only
the normalized version.)

Regards,
-Monnand



