[Rd] Binning of integers with hist() function odd results (PR#14048)

gug at fnal.gov gug at fnal.gov
Sat Nov 7 16:05:09 CET 2009


Hi,
    Thank you for responding quickly and explaining the behavior.
Adding "include.lowest=TRUE,right=FALSE" and specifying the breaks
manually resolved the simple test case. Next I made the same change for
my more complex data set, which already had manually defined breaks, and
that resolved my issues there too. I have now updated all my functions
that use hist(), so I hopefully won't forget this in the future.
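
A minimal sketch of that change on the simple test case from the report.
The breaks 1:6 are an assumption made here for illustration (one break per
integer plus an upper edge), and plot = FALSE is added only to suppress
drawing:

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Default: right-closed bins (a, b] with the lowest break included,
# so 1 and both 2s share the first bin [1, 2].
h_default <- hist(x, plot = FALSE)

# Left-closed bins [a, b) with explicit breaks (1:6 is assumed here);
# include.lowest = TRUE then closes the last bin, giving [5, 6].
h_fixed <- hist(x, breaks = 1:6, include.lowest = TRUE, right = FALSE,
                plot = FALSE)

h_default$counts  # 3 3 4 5, as in the original report
h_fixed$counts    # 1 2 3 4 5
```

With these options each integer k falls in the bin [k, k+1), so the counts
match those of the 0.1-shifted data.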

On Nov 7, 2009, at 7:57 AM, Ted Harding wrote:

> On 06-Nov-09 23:30:12, gug at fnal.gov wrote:
>> Full_Name: Gerald Guglielmo
>> Version: 2.8.1 (2008-12-22)
>> OS: OSX Leopard
>> Submission from: (NULL) (131.225.103.35)
>>
>> When I attempt to use the hist() function to bin integers the behavior
>> seems very odd as the bin boundary seems inconsistent across the various
>> bins. For some bins the upper boundary includes the next integer value,
>> while in others it does not. If I add 0.1 to every value, then the hist()
>> binning behavior is what I would normally expect.
>>
>>> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
>>> h1$mids
>> [1] 1.5 2.5 3.5 4.5
>>> h1$counts
>> [1] 3 3 4 5
>>> h2<-hist(c(1.1,2.1,2.1,3.1,3.1,3.1,4.1,4.1,4.1,4.1,5.1,5.1,5.1,5.1,5.1))
>>> h2$mids
>> [1] 1.5 2.5 3.5 4.5 5.5
>>> h2$counts
>> [1] 1 2 3 4 5
>>
>> Naively I would have expected the same distribution of counts in the
>> two cases, but clearly that is not happening. This is a simple example
>> to illustrate the behavior, originally I noticed this while binning a
>> large data sample where I had set the breaks=c(0,24,1).
>
> This is the correct intended behaviour. By default, values which are
> exactly on the boundary between two bins are counted in the bin which
> is just below the boundary value. Except that the bottom-most break
> will count values on it into the bin just above it.
>
> Hence 1,2,2 all go into the [1,2] bin; 3,3,3 into (2,3];
> 4,4,4,4 into (3,4]; and 5,5,5,5,5 into (4,5]. Hence the counts 3,3,4,5.
>
> Since you did not set breaks in
>  h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)),
> they were set using the default method, and you can see what they are
> with
>
>  h1$breaks
>  [1] 1 2 3 4 5
>
> When you add 0.1 to each value, you push the values on the boundaries
> up into the next bin. Now each value is inside its bin, and not on
> any boundary. Hence 1.1 is in (1,2]; 2.1,2.1 in (2,3];
> 3.1,3.1,3.1 in (3,4]; 4.1,4.1,4.1,4.1 in (4,5]; and
> 5.1,5.1,5.1,5.1,5.1 in (5,6], giving counts 1,2,3,4,5 as you observe.
>
> The default behaviour described above is defined by the default options
>
>  include.lowest = TRUE, right = TRUE
>
> where:
>
> include.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'
>          value will be included in the first (or last, for 'right =
>          FALSE') bar.  This will be ignored (with a warning) unless
>          'breaks' is a vector.
>
>   right: logical; if 'TRUE', the histogram cells are right-closed
>          (left open) intervals.
>
> See '?hist'. You can change this behaviour by changing the options.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 07-Nov-09                                       Time: 13:57:07
> ------------------------------ XFMail ------------------------------

-- 
-Jerry->
gug at fnal.gov
Pepe's Theory of everything: "Under the right circumstances, things  
happen."

