[R] Calculated mean value based on another column bin from dataframe.

David Winsemius dwinsemius at comcast.net
Wed Apr 6 23:20:52 CEST 2011


On Apr 6, 2011, at 9:46 AM, Fabrice Tourre wrote:

> Dear Henrique Dallazuanna,
>
> Thank you very much for your suggestion.
>
> It is obvious that your method is better than me.
>
> Is it possible to use cut, table,by etc? Whether there is some
> aggregate function in R can do this?
>
> Thanks.
>
> On Wed, Apr 6, 2011 at 2:16 PM, Henrique Dallazuanna  
> <wwwhsd at gmail.com> wrote:
>> Try this:
>>
>> fil <- sapply(ran, '<', e1 = dat[,1]) & sapply(ran[2:(length(ran) +
>> 1)], '>=', e1 = dat[,1])
>> mm <- apply(fil, 2, function(idx)mean(dat[idx, 2]))
>>
>> On Wed, Apr 6, 2011 at 5:48 AM, Fabrice Tourre <fabrice.ciup at gmail.com 
>> > wrote:
>>> Dear list,
>>>
>>> I have a dataframe with two column as fellow.
>>>
>>>> head(dat)
>>>       V1      V2
>>>  0.15624 0.94567
>>>  0.26039 0.66442
>>>  0.16629 0.97822
>>>  0.23474 0.72079
>>>  0.11037 0.83760
>>>  0.14969 0.91312
>>>
>>> I want to get the column V2 mean value based on the bin of column of
>>> V1. I write the code as fellow. It works, but I think this is not  
>>> the
>>> elegant way. Any suggestions?
>>>
>>> dat<-read.table("dat.txt",head=F)
>>> ran<-seq(0,0.5,0.05)
>>> mm<-NULL
>>> for (i in c(1:(length(ran)-1)))
>>> {
>>>    fil<- dat[,1] > ran[i] & dat[,1]<=ran[i+1]
>>>    m<-mean(dat[fil,2])
>>>    mm<-c(mm,m)
>>> }
>>> mm
>>>
>>> Here is the first 20 lines of my data.
>>>
>>>> dput(head(dat,20))
>>> structure(list(V1 = c(0.15624, 0.26039, 0.16629, 0.23474, 0.11037,
>>> 0.14969, 0.16166, 0.09785, 0.36417, 0.08005, 0.29597, 0.14856,
>>> 0.17307, 0.36718, 0.11621, 0.23281, 0.10415, 0.1025, 0.04238,
>>> 0.13525), V2 = c(0.94567, 0.66442, 0.97822, 0.72079, 0.8376,
>>> 0.91312, 0.88463, 0.82432, 0.55582, 0.9429, 0.78956, 0.93424,
>>> 0.87692, 0.83996, 0.74552, 0.9779, 0.9958, 0.9783, 0.92523, 0.99022
>>> )), .Names = c("V1", "V2"), row.names = c(NA, 20L), class =  
>>> "data.frame")
>>>
>>> ______________________________________________

Here is how I would have done it with findInterval and tapply which is  
very similar to using a `cut` and `table` approach:

 > dat$grp <- findInterval(dat$V1, seq(0,0.5,0.05) )
 > tapply(dat$V2, dat$grp, mean)
         1         2         3         4         5         6         8
0.9252300 0.8836100 0.9135429 0.9213600 0.8493450 0.7269900 0.6978900
#####---------------

You do not get exactly the same form of the result as with Henrique's  
method. His yields:
 > mm
  [1] 0.9252300 0.8836100 0.9135429 0.9213600 0.8493450  
0.7269900       NaN
  [8] 0.6978900       NaN       NaN       NaN

####----------------

The cut approach would yield this, which is more informatively  
labeled. (I'm wasn't completely sure the second to last word in the  
prior sentence was a real word, but several dictionaries seem to think  
so.):

 > dat$grp2 <- cut(dat$V1 , breaks=ran)
 > tapply(dat$V2, dat$grp2, mean)
   (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3]
  0.9252300  0.8836100  0.9135429  0.9213600  0.8493450  0.7269900
(0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5]
         NA  0.6978900         NA         NA


>

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list