[R] How to calculate area between ECDF and CDF?

Mike Lawrence m4lawren at artsmail.uwaterloo.ca
Mon Dec 4 19:09:01 CET 2006


Hi all,

I'm working with data to which I'm fitting three-parameter weibull 
distributions (shape, scale & shift). The data are of low sample sizes 
(between 10 and 80 observations), so I'm reluctant to check my fits 
using chi-square (also, I'd like to avoid bin choice issues). I'd use 
the Kolmogorov-Smirnov test, but of course this is invalid when the 
distribution parameters are estimated from the data.

So I'm tinkering with an alternative method (excuse my naiveté if this 
is a bad idea, I'm a relative statistical novice) that calculates the 
area of the difference between the ECDF of the data and the CDF of the 
estimated function (somewhat like KS, which looks at the greatest 
distance between these). My thought is to compare this observed area to 
a distribution of simulated areas derived by monte carlo simulation 
(draw N random samples from the estimated function, calculating area, 
and repeat 1e5 times). If the observed area is greater than say 95% of 
the simulated areas, then I'd reject the fit.

My problem is that I can't figure out how to efficiently calculate the 
area between the ECDF and CDF functions. I can of course calculate the 
integral of each easily, and if one were consistently larger than the 
other simple subtraction of the integrals would yield the area between. 
However, when the functions cross, as frequently occurs, the solution 
seems much more complex. Any suggestions? Since as noted above I'll be 
doing the area calculation 1e5 times or so per test, a computationally 
frugal solution would be much appreciated!

Here's some code that I've been toying with:

#set up some true parameters
shape=2
scale=.5
shift=.3
n=10

#generate some observed data
obs=obs=rweibull(10,shape,scale)+shift

#lets say that the following are the estimated parameters from whatever 
estimation process I'm using
est.shape=1.9
est.scale=.6
est.shift=.35

#Calculate area between ECDF and CDF of the function defined by the
#estimated parameters
# ???
#The following would work if the ECDF were consistently higher or lower
#than the CDF

#Get the CDF area between 0 and some large number (here, 10 is pretty
#large)
cdf.area=integrate(pweibull,0,10,shape=est.shape,scale=est.scale)

#Get the ECDF area.
#first get rid of the shift in obs
obs=obs-est.shift
#calculate area by multiplying cumulative proportions by distance
#between knots, then summing
#add knot at 10 to match cdf
k=c(knots(ecdf(obs)),10)
ecdf.area=vector("numeric",(n-1))
for(i in 1:n){
	ecdf.area[i]=(k[i+1]-k[i])*(sum(obs<=k[i])/n)
}
ecdf.area=sum(ecdf.area)

#again, subtraction of the areas works if the ecdf is consistently lower
#than the cdf
diff=cdf.area-ecdf.area
#or consistently higher than the cdf
diff=ecdf.area-cdf.area
#but how to calculate when the functions cross?
	

Cheers,

Mike

-- 
Mike Lawrence
http://artsweb.uwaterloo.ca/~m4lawren

"The road to wisdom? Well, it's plain and simple to express:
Err and err and err again, but less and less and less."
- Piet Hein




More information about the R-help mailing list