[R] Histogram omitting/collapsing groups

Sun Jan 1 12:29:17 CET 2012

On Jan 1, 2012, at 07:40 , Joshua Wiley wrote:

> If you just want a plot of the frequencies at each hour why not just call barplot on the output of table?  Histograms create bins and count in those, which doesn't sound like what you're after.
> 

Exactly. If what you want is a barplot, make a barplot; histograms are for continuous data. Just remember that you may need to set the levels explicitly in case of empty groups: barplot(table(factor(x, levels=0:23))). (This is irrelevant with 100K data samples, but not with 100 of them.)
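
A quick sketch of the empty-group point, on made-up data (the vector x below is simulated and purely illustrative, not the poster's crashes$hour; hour 3 is left out on purpose so that one group is empty):

```r
# Simulated hours of the day; hour 3 never occurs in this sample.
set.seed(1)
x <- sample(c(0:2, 4:23), 100, replace = TRUE)

# table() only reports values that actually occur, so hour 3 is dropped
# and the bars would silently shift.
tab_plain <- table(x)

# Declaring all 24 levels keeps empty hours as zero-height bars.
tab_full <- table(factor(x, levels = 0:23))

length(tab_full)   # 24
tab_full["3"]      # count of 0 for the empty hour

barplot(tab_full, xlab = "Hour", ylab = "Frequency")
```

With the levels fixed, the bar positions always correspond to hours 0 to 23, whatever the sample happens to contain.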

That being said, the fact that hist() tends to place breakpoints exactly on the data values when the data are discretized is arguably a bit of a design error, but it is age-old and hard to change now. One way out is to use truehist() from MASS; another is to set the breaks explicitly to intermediate values, as in hist(x, breaks=seq(-.5, 23.5, 1)).
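
To make the breakpoint issue concrete, here is a small sketch on simulated data (again, x is made up for illustration). With breaks at the integers 0..23 there are only 23 bins for 24 distinct values, so two neighbouring hours have to share a bin; with breaks halfway between the integers each hour gets its own bin and the counts agree with table():

```r
set.seed(42)
x <- sample(0:23, 1000, replace = TRUE)

# Breaks on the data values: 24 breakpoints give only 23 bins.
# With right = FALSE the bins are [0,1), [1,2), ..., [22,23],
# so hours 22 and 23 both fall into the last bin.
h_int <- hist(x, breaks = 0:23, right = FALSE, plot = FALSE)

# Breaks halfway between the data values: 24 bins, one per hour.
h_mid <- hist(x, breaks = seq(-0.5, 23.5, by = 1), plot = FALSE)

tab <- table(factor(x, levels = 0:23))

length(h_int$counts)                          # 23
length(h_mid$counts)                          # 24
all(h_mid$counts == as.vector(tab))           # TRUE
h_int$counts[23] == sum(tab[c("22", "23")])   # TRUE
```

With the default right = TRUE the same merging happens at the other end, i.e. hours 0 and 1 share the first bin.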

> Cheers,
> 
> Josh
> 
> 
> On Dec 31, 2011, at 21:37, jim holtman <jholtman at gmail.com> wrote:
> 
>> Fast fingers; notice that there is still a problem in the counts;  I
>> was only looking at the last.
>> 
>> Happy New Year -- up too late.
>> 
>> On Sun, Jan 1, 2012 at 12:33 AM, jim holtman <jholtman at gmail.com> wrote:
>>> Here is a test I ran and looks fine, but then I created the data, so
>>> it might have something to do with your data:
>>> 
>>>> x <- sample(0:23, 100000, TRUE)
>>>> a <- hist(x, breaks = 24)
>>>> a[1:5]
>>> $breaks
>>> [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>>> 
>>> $counts
>>>  [1] 8262 4114 4186 4106 4153 4234 4206 4155 4157 4203 4186 4158 4132 4139 4231 4216 4158 4054 4185 4153
>>> [21] 4281 4110 4221
>>> 
>>> $intensities
>>>  [1] 0.08262 0.04114 0.04186 0.04106 0.04153 0.04234 0.04206 0.04155 0.04157 0.04203 0.04186 0.04158
>>> [13] 0.04132 0.04139 0.04231 0.04216 0.04158 0.04054 0.04185 0.04153 0.04281 0.04110 0.04221
>>> 
>>> $density
>>>  [1] 0.08262 0.04114 0.04186 0.04106 0.04153 0.04234 0.04206 0.04155 0.04157 0.04203 0.04186 0.04158
>>> [13] 0.04132 0.04139 0.04231 0.04216 0.04158 0.04054 0.04185 0.04153 0.04281 0.04110 0.04221
>>> 
>>> $mids
>>>  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5
>>> [21] 20.5 21.5 22.5
>>> 
>>>> table(x)
>>> x
>>>    0    1    2    3    4    5    6    7    8    9   10   11
>>> 4168 4094 4114 4186 4106 4153 4234 4206 4155 4157 4203 4186
>>>   12   13   14   15   16   17   18   19   20   21   22   23
>>> 4158 4132 4139 4231 4216 4158 4054 4185 4153 4281 4110 4221
>>>> 
>>> 
>>> 
>>> On Sat, Dec 31, 2011 at 11:20 AM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I think you're not understanding quite what's going on with hist. Reread the
>>>> help, and take a look at this small example. The solution I'd use is the last
>>>> item.
>>>> 
>>>>> x <- rep(1:10, times=1:10)
>>>>> table(x)
>>>> x
>>>> 1 2 3 4 5 6 7 8 9 10
>>>> 1 2 3 4 5 6 7 8 9 10
>>>>> 
>>>>> 
>>>>> hist(x, plot=FALSE, right=TRUE)$counts
>>>> [1] 3 3 4 5 6 7 8 9 10
>>>>> hist(x, plot=FALSE, right=TRUE)$breaks
>>>> [1] 1 2 3 4 5 6 7 8 9 10
>>>>> hist(x, plot=FALSE, right=TRUE)$mids
>>>> [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
>>>>> 
>>>>> 
>>>>> hist(x, plot=FALSE, right=FALSE)$counts
>>>> [1]  1  2  3  4  5  6  7  8 19
>>>>> hist(x, plot=FALSE, right=FALSE)$breaks
>>>> [1] 1 2 3 4 5 6 7 8 9 10
>>>>> hist(x, plot=FALSE, right=FALSE)$mids
>>>> [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
>>>>> 
>>>>> 
>>>>> hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$counts
>>>> [1] 1 2 3 4 5 6 7 8 9 10
>>>>> hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$breaks
>>>> [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
>>>>> hist(x, plot=FALSE, breaks=seq(.5, 10.5, by=1))$mids
>>>> [1] 1 2 3 4 5 6 7 8 9 10
>>>> 
>>>> 
>>>> Sarah
>>>> 
>>>> On Sat, Dec 31, 2011 at 10:25 AM, Aren Cambre <aren at arencambre.com> wrote:
>>>>> I have two large datasets (156K and 2.06M records). Each row has the
>>>>> hour that an event happened, represented by an integer from 0 to 23.
>>>>> 
>>>>> R's histogram is combining some data.
>>>>> 
>>>>> Here's the command I ran to get the histogram:
>>>>>> histinfo <- hist(crashes$hour, right=FALSE)
>>>>> 
>>>>> Here's histinfo:
>>>>>> histinfo
>>>>> $breaks
>>>>> [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>>>>> 
>>>>> $counts
>>>>>  [1]  4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
>>>>> [16]  9758 11301 11745  9943  7494  6272  6220 11669
>>>>> 
>>>>> $intensities
>>>>>  [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
>>>>>  [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
>>>>> [17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911
>>>>> 
>>>>> $density
>>>>>  [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
>>>>>  [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
>>>>> [17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911
>>>>> 
>>>>> $mids
>>>>>  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5
>>>>> [19] 18.5 19.5 20.5 21.5 22.5
>>>>> 
>>>>> $xname
>>>>> [1] "crashes$hour"
>>>>> 
>>>>> $equidist
>>>>> [1] TRUE
>>>>> 
>>>>> attr(,"class")
>>>>> [1] "histogram"
>>>>> 
>>>>> Note how the last value in counts is 11669. It's relevant to the
>>>>> output of table(crashes$hour):
>>>>>     0     1     2     3     4     5     6     7     8     9    10    11
>>>>>  4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596
>>>>>    12    13    14    15    16    17    18    19    20    21    22    23
>>>>>  7152  7490  8166  9758 11301 11745  9943  7494  6272  6220  6000  5669
>>>>> 
>>>>> Notice that the counts for hours 22 and 23 from table(crashes$hour)
>>>>> sum to 11669. Is it correct for the histogram to combine hours 22
>>>>> and 23? Since I specified right = FALSE, I figured there was no way
>>>>> 23 would be combined with 22.
>>>>> 
>>>>> Adding breaks=24 to the hist makes no difference; it's still stuck at
>>>>> 23 breaks. I also tried breaks=25 and 23 and several other values, in
>>>>> case I am misinterpreting breaks's meaning, but none of them make a
>>>>> difference.
>>>>> 
>>>>> I imagine this is a n00b question, so my apologies if this is obvious.
>>>>> 
>>>>> Aren
>>>>> 
>>>> 
>>>> --
>>>> Sarah Goslee
>>>> http://www.functionaldiversity.org
>>>> 
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>>> 
>>> 
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>> 
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>> 
>> 
>> 
>> 
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com