[R] non-linear binning? power-law in R

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Wed Jun 16 17:24:12 CEST 2004


On Wed, 16 Jun 2004, Roger D. Peng wrote:

>When you specify a single number to the `br' argument of `hist' it is 
>only a suggestion.  See the help page, ?hist.  If you want to use the 
>same breakpoints everytime, you need to specify the locations explicity.

OK, I can see why this would be so, but none the less 2 and only 2 very
distinct trends are coming out of the simulation.

By hand I get...

abline(11.1,-0.5)

abline(12.1,-1.2)

On the plot (now below).  This works every time, nothing in between, just
these two trends. Am I being stupid / is this a local phenomenon only?


>
>-roger
>
>Dan Bolser wrote:
>> On Wed, 16 Jun 2004, Roger D. Peng wrote:
>> 
>> 
>>>You can use hist(x, br, plot = FALSE)$counts.
>>>
>>>-roger
>> 
>> 
>> OK, I made this...
>> 
>> 
>> y4 <- runif(100000)**4
>> y4log <- -log(y4)
>> y4bin20 <- hist(y4log,20,plot=FALSE)$counts
>> y4bin20log <- log(y4bin20)
>> 
>> plot(y4bin20log)
>> 
>> for(i in 2:20){
>>  y4 <- runif(100000)**4
>>  y4log <- -log(y4)
>>  y4bin20 <- hist(y4log,20,plot=FALSE)$counts
>>  y4bin20log <- log(y4bin20)
>>  points(y4bin20log,col=i)
>> }
>> 
>> 
>> I am seeing something very strange about 1 loop in every five or ten of
>> the above...
>> 
>> Can someone try this and let me know if every fifth run or so they see a
>> totally different trend? Basically looks like only 10 bins are being used
>> at random during the run!?
>> 
>> Hopefully you will also see what I am trying to do, which is to highlight
>> the increased variance in the tail...
>> 
>> How would I estimate the slope of y4bin20log? (sorry for the names).
>> 
>> Thanks for the help,
>> Dan.
>> 
>> 
>>>Dan Bolser wrote:
>>>
>>>>First, thanks to everyone who helped me get to grips with R in (x)emacs
>>>>(I get confused easily). Special thanks to Stephen Eglen for continued
>>>>support.
>>>>
>>>>My question is about non-linear binning, or density functions over
>>>>distributions governed by a power law ...
>>>>
>>>>y ~ mu*x**lambda	# In one of its forms 
>>>>                        # (can't find Pareto in the online help)
>>>>
>>>>Looking at the following should show my problem....
>>>>
>>>>x3 <- runif(10000)**3	# Probably a better (correct) way to do this
>>>>
>>>>plot( density(x3,cut=0,bw=0.1))
>>>>plot( density(x3,cut=0,bw=0.01))
>>>>plot( density(x3,cut=0,bw=0.001))
>>>>
>>>>plot(density(x3,cut=0,bw=0.1),  log='xy')
>>>>plot(density(x3,cut=0,bw=0.01), log='xy')
>>>>plot(density(x3,cut=0,bw=0.001),log='xy')
>>>>
>>>>The upper three plots show that the bw has a big effect on the appearance
>>>>of the graph by rescaling based on the initial density at low values of x,
>>>>which is very high.
>>>>
>>>>The lower plots show (I think) an error in the use of linear bins to view
>>>>a non linear trend. I would expect this curve to be linear on log-log
>>>>scales (from experience), and you can see the expected behavior in the
>>>>tails of these plots.
>>>>
>>>>If you play with drawing these curves on top of each other they look OK
>>>>apart from at the beginning. However, changing the band width to 0.0001 has
>>>>a radical effect on these plots, and they begin to show a different trend
>>>>(look like they are being governed by a different power).
>>>>
>>>>Hmmm....
>>>>
>>>>x3log <- -log(x3)
>>>>
>>>>plot( density(x3log,cut=0,bw=0.5),  log='y',col=1)
>>>>
>>>>lines(density(x3log,cut=0,bw=0.2),  log='y',col=2)
>>>>lines(density(x3log,cut=0,bw=0.1),  log='y',col=3)
>>>>lines(density(x3log,cut=0,bw=0.01), log='y',col=4)
>>>>
>>>>Sorry...
>>>>
>>>>
>>>>'Real' data of this form is usually discrete, with the value of 1 being
>>>>the most frequent (minimum) event, and higher values occurring less
>>>>frequently according to a power (power-law). This data can be easily
>>>>grouped into discrete bins, and frequency plotted on log scales. The
>>>>continuous data generated above requires some form of density estimation
>>>>or rescaling into discreet values (make the smallest value equal to 1 and
>>>>round everything else into an integer).
>>>>
>>>>I see the aggregate function, but which function lets me simply count the
>>>>number of values in a class (integer bin)?
>>>>
>>>>The analysis of even the discretized data is made more accurate by the use
>>>>of exponentially growing bins. This way you don't need to plot the data on
>>>>log scales, and the increasing variance associated with lower probability
>>>>events is handled by the increasing bin size (giving good accuracy of
>>>>power fitting). How can I easily (ignorantly) implement exponentially
>>>>increasing bin sizes?
>>>>
>>>>Thanks for any feedback,
>>>>
>>>>Dan.
>>>>
>>>>______________________________________________
>>>>R-help at stat.math.ethz.ch mailing list
>>>>https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>>>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>>>
>>>
>>>
>> 
>> 
>
>




More information about the R-help mailing list