[R] Weighted Histogram

Peter Dalgaard BSA p.dalgaard at biostat.ku.dk
Mon Sep 18 00:35:45 CEST 2000


"Neil E. Klepeis" <nklepeis at uclink4.berkeley.edu> writes:

> Greetings,
> 
> I'm having trouble finding a simple way to calculate a weighted
> histogram where there may be zero raw counts in a given interval.
> 
> Given equal-length vectors of data 'data' and weights 'w', and breaks
> (intervals) for the histogram, I calculate a weighted histogram as
> follows (see MASS's 'truehist' for an unweighted histogram):
> 
> bin <- cut(data, breaks, include.lowest = TRUE)
> wsums <- sapply(split( w, f = bin), sum)
> prob <- wsums/(diff(breaks) * sum(w))     
> 
> where 'bin' is the recoded data vector, 'wsums' is the sum of weights
> across each factor in 'bin', and 'prob' is the vector of probabilities
> used to plot the true histogram.
> 
> But if there are no data points in a certain interval (i.e., zero
> frequency in an interval), the 'wsums' vector does not have the same
> length as 'diff(breaks)' and we are in trouble.  Doing an unweighted
> histogram is no trouble as we use 'tabulate' with a specified number of
> integers equal to the number of total intervals, which results in a
> vector that includes the zero frequency intervals: 
> 
> counts <- tabulate(bin, length(levels(bin)))
> 
> Does anyone know of a simple (vectorized) way in R to calculate a
> weighted histogram where there may be zero counts in any number of
> intervals?  [If not, I suppose I could loop through each interval...]

Hm. I wouldn't quite rule out that there are bugs in the existing code
when empty factor levels get involved (both split() and tapply() seem
to drop them on the floor)

If you start with data, w, and breaks all known to have no missing
values, then you should be able to do
 
 bin <- cut(data, breaks, include.lowest = TRUE)
 wsums <- tapply(w, bin, sum)
 wsums[is.na(wsums)] <- 0
 prob <- wsums/(diff(breaks) * sum(w))     
 
Checking with Splus..: Our split.default appears to be incompatible,
but tapply is the same:

> f<-factor(1,levels=1:2)
> f
[1] 1
Levels:
[1] "1" "2"
> x<-2
> split(x,f)             
$"1":
[1] 2

$"2":
numeric(0)

> tapply(x,f,length)
 1  2 
 1 NA
> sapply(split(x,f),length)         
 1 2 
 1 0

where R (1.1.1) gets

> tapply(x,f,length)
 1  2 
 1 NA 
> sapply(split(x,f),length)
1 
1 
> split(x,f)        
$"1"
[1] 2

So your suggestion works in Splus but not in R. The fix in R would
seem to be to change "factor" to "as.factor" at the start of
split.default.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._



More information about the R-help mailing list