[R] Creating a vector of variable bin widths

d.s.robinson at dur.ac.uk
Thu Mar 1 17:16:58 CET 2007


Dear R users,
              I am having a little trouble with grouping data.

-----------Detailed explanation (summary below)------------

A small sample of my data is below (which has already been rounded and 
grouped a little from the raw data for clarity).

I am sampling data from an unknown game which, according to my null 
hypothesis, follows a binomial distribution. The game can supposedly be 
played with a range of probabilities of success (the independent 
variable); 0.0-0.3 are shown below, although my full data set goes all 
the way up to 0.99. The number of observations for each probability of 
success, and the actual proportion of wins in the sample (the dependent 
variable), are also shown.

Under the null hypothesis, the sample winning proportion (the dependent 
variable) should be an unbiased estimator of the population proportion 
(the independent variable), and by the CLT it should be approximately 
normal for large samples. I want to perform a significance test at each 
probability level to see if the null hypothesis can be rejected.
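
One way such a per-level test might be run is with binom.test(), which 
performs an exact binomial test. This is only a minimal sketch on three 
rows of the sample table; reconstructing the win counts as 
round(n * proportion) is an assumption, since only rounded proportions 
are listed.

```r
# Three rows copied from the sample table; the win counts are reconstructed
# from the rounded proportions, which is an assumption.
p0   <- c(0.09, 0.17, 0.21)     # hypothesised probability of success
n    <- c(10, 35, 26)           # number of observations
prop <- c(0.200, 0.314, 0.462)  # observed proportion of wins
wins <- round(n * prop)

# Exact two-sided binomial test at each probability level
pvals <- mapply(function(x, size, p) binom.test(x, size, p)$p.value,
                wins, n, p0)
data.frame(p0, n, wins, p.value = signif(pvals, 3))
```

With many levels tested at once, some adjustment for multiple 
comparisons (for example p.adjust()) would probably be needed as well.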

But the problem is in defining those probability levels. At the moment, 
some probabilities of success have a very low number of observations, 
whilst others have very many. Leaving the data as it is gives 
statistically meaningless results at the low and high levels of success. 
Further grouping the data using fixed group widths results in very few 
observations per group at high and low probabilities, and a few groups 
in the middle with a very high number of observations.

The way around this (I think) is to use variable bin widths. Each bin 
should be wide enough that (again, I think this is a reasonable idea) 
the variance of the sample estimate (using the normal approximation to 
the binomial), [p(1-p)]/n, is less than a certain value, say 2% squared, 
i.e. 0.0004. I presume I also need to make sure that for each group 
np>=5 and n(1-p)>=5, or can this simply replace the variance test?
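
To see how the two criteria relate, the smallest n satisfying each can 
be computed for a given p. This is a rough sketch; the function names 
min_n_variance and min_n_counts are made up for illustration, and the 
thresholds are the 2% and 5 mentioned above.

```r
# Smallest n with p(1-p)/n strictly below se_max^2
min_n_variance <- function(p, se_max = 0.02) floor(p * (1 - p) / se_max^2) + 1

# Smallest n with both n*p and n*(1-p) at least k
min_n_counts <- function(p, k = 5) ceiling(k / pmin(p, 1 - p))

p <- c(0.01, 0.05, 0.25)
res <- data.frame(p,
                  variance_rule = min_n_variance(p),
                  count_rule    = min_n_counts(p))
res
```

At p = 0.01 the variance rule is met with about 25 observations, while 
the count rule needs about 500, so near 0 (and near 1) the count rule 
is the stricter one and the variance test alone would not cover it.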

IndependentVar	Observations	DependentVar	
--------------------------------------------
0.01		1		0.000	
0.03		5		0.000	
0.04		11		0.000	
0.05		9		0.000	
0.06		19		0.000	
0.07		12		0.000	
0.08		18		0.056	
0.09		10		0.200	
0.10		13		0.077	
0.11		17		0.118	
0.12		17		0.059	
0.13		18		0.056	
0.14		21		0.000	
0.15		25		0.160	
0.16		23		0.000	
0.17		35		0.314	
0.18		26		0.231	
0.19		31		0.226	
0.20		27		0.148	
0.21		26		0.462	
0.22		21		0.286	
0.23		29		0.207	
0.24		38		0.289	
0.25		38		0.132	
0.26		27		0.259	
0.27		52		0.308	
0.28		62		0.194	
0.29		82		0.232	
0.30		97		0.278	
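
The imbalance from fixed group widths can be seen by entering the 
sample above by hand and grouping it with cut(). The bin width of 0.05 
is an arbitrary choice made only for this illustration.

```r
# Probability levels and observation counts from the sample table
p <- c(1, 3:30) / 100
n <- c(1, 5, 11, 9, 19, 12, 18, 10, 13, 17, 17, 18, 21, 25, 23, 35,
       26, 31, 27, 26, 21, 29, 38, 38, 27, 52, 62, 82, 97)

# Fixed bins of width 0.05: (0,0.05], (0.05,0.10], ..., (0.25,0.30]
fixed_bins <- cut(p, breaks = (0:6) * 5 / 100)

# Total observations falling in each fixed-width bin
n_per_bin <- tapply(n, fixed_bins, sum)
n_per_bin
```

On this sample the lowest bin holds 26 observations and the highest 
320, which is the unevenness described above.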

------------------Summary---------------------------

So, how can I write a function that creates a vector of variable break 
values for, say, cut()? It should iteratively make each bin wider until 
a condition based on the value to be binned (the probability of 
success) and a second value (the number of observations) is met 
(assuming you agree with my method of restricting the variance, the 
rationale for which is outlined above).
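
One possible shape for such a function is a greedy pass over the sorted 
probability levels, closing a bin as soon as the pooled condition 
holds. This is only a sketch under assumptions: make_breaks, se_max and 
min_count are made-up names, p is pooled as an observation-weighted 
mean, and a trailing group that never meets the condition is merged 
into the bin before it.

```r
# Greedy sketch: widen each bin until the pooled criteria are met.
make_breaks <- function(p, n, se_max = 0.02, min_count = 5) {
  ord <- order(p); p <- p[ord]; n <- n[ord]
  breaks <- p[1] - 1e-8            # first break sits just below the smallest p
  n_acc <- 0; np_acc <- 0; ok <- FALSE
  for (i in seq_along(p)) {
    n_acc  <- n_acc + n[i]
    np_acc <- np_acc + n[i] * p[i]
    p_bar  <- np_acc / n_acc       # observation-weighted mean probability
    ok <- p_bar * (1 - p_bar) / n_acc < se_max^2 &&
          n_acc * p_bar >= min_count &&
          n_acc * (1 - p_bar) >= min_count
    if (ok && i < length(p)) {
      # close the bin midway between this level and the next one
      breaks <- c(breaks, (p[i] + p[i + 1]) / 2)
      n_acc <- 0; np_acc <- 0
    }
  }
  # If the last group fails the criteria, merge it into the previous bin
  if (!ok && length(breaks) > 1) breaks <- breaks[-length(breaks)]
  c(breaks, p[length(p)] + 1e-8)   # last break sits just above the largest p
}

# Usage on the sample table
p <- c(1, 3:30) / 100
n <- c(1, 5, 11, 9, 19, 12, 18, 10, 13, 17, 17, 18, 21, 25, 23, 35,
       26, 31, 27, 26, 21, 29, 38, 38, 27, 52, 62, 82, 97)
breaks <- make_breaks(p, n)
bins   <- cut(p, breaks)
tapply(n, bins, sum)               # observations per variable-width bin
```

A greedy pass does not give the most even split possible, and with a 
2% target this small sample collapses into very few bins; a looser 
se_max would give more of them.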

I would appreciate any comments on either the reasoning (I am fairly new 
to this sort of statistics) or how I can write the R code to achieve the 
proposed goal. I hope I have explained this clearly enough to merit a 
response.

Regards,
	DR


