[R] aggregating along bins and bin-quantiles

Wed Oct 22 10:52:17 CEST 2008

Dear Mark and all interested,

Unfortunately the code provided by Mark does not work: there is a
syntax error when it is run as provided. I looked at solving the
problem myself, but without much knowledge of the output of "split" (it
looks like a list of lists, not a list of data frames) it is difficult
to identify where in the call to lapply the problem arises. The
problem in both Mark's code and my original (with tapply) lies in the
format of the output of the call to an implicit loop. In fact I find
this area of R one of the most obscure to my simplistic way of
thinking: I would expect the output to have the same format as the
input (data.frame to data.frame), but I am certain there must be good
reasons for the way implicit loop functions return what they do.
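
For reference, a small sketch of what these functions return. The toy data frame is an assumption standing in for example.csv (same two columns, Date and value, but made-up numbers). It suggests that split on a data frame gives a list of data frames, that tapply gives one element per group, and that ave gives one value per row of the input.

```r
# Toy data standing in for the CSV (hypothetical values, not the real file)
a <- data.frame(
  Date  = as.Date(rep(c("2008-01-31", "2008-02-29"), each = 4)),
  value = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# split() on a data frame returns a named list of data frames, one per date
pieces <- split(a, a$Date)
class(pieces)        # "list"
class(pieces[[1]])   # "data.frame"

# tapply() returns one element per group, not one per row
per_group <- tapply(a$value, a$Date, function(v) v / sum(v))
length(per_group)    # 2, the number of dates

# ave() applies the function within each group but returns a vector
# aligned with the original rows, so it can be stored as a new column
per_row <- ave(a$value, a$Date, FUN = function(v) v / sum(v))
length(per_row)      # 8, the number of rows
```

Because tapply returns a list-like object with one element per group when the function returns more than one value, it cannot be passed directly as a grouping variable to aggregate, which expects one entry per row.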

Any further help would be appreciated, as I may have to resort to some  
(less elegant) loop...
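
One possible way to get deciles within each date without an explicit loop, sketched under assumptions: the data are simulated (not the real example.csv), decile_within is a hypothetical helper name, and base cut with include.lowest = TRUE is used in place of Hmisc's cut2. Without include.lowest = TRUE, cut assigns NA to the smallest value of each group. If a group has many tied values, quantile may return duplicated breaks and cut will stop with an error, so this is not a drop-in replacement for cut2.

```r
# Simulated data: three dates with 50 values each (assumption, for illustration)
set.seed(1)
a <- data.frame(
  Date  = as.Date(rep(c("2008-01-31", "2008-02-29", "2008-03-31"), each = 50)),
  value = c(rnorm(50, 0), rnorm(50, 100), rnorm(50, 1000))
)

# Hypothetical helper: decile index (1..10) of each value within its own group
decile_within <- function(v) {
  brks <- quantile(v, probs = seq(0, 1, 0.1), na.rm = TRUE)
  as.integer(cut(v, breaks = brks, include.lowest = TRUE))
}

# ave() runs the helper separately for each date and returns one value per row
a$decile <- ave(a$value, a$Date, FUN = decile_within)

# Sum the values by date and by decile within that date
result <- aggregate(a$value, list(Date = a$Date, Decile = a$decile), sum)
nrow(result)   # one row per date/decile combination
```

With 12 dates this approach would give 12 x 10 rows, since the breaks are recomputed for every date rather than once for the whole column.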

Kind regards,
Ivan

On 22 Oct 2008, at 00:22, markleeds at verizon.net wrote:
> Hi Ivan: I think I understand better now, so below is some new code, but
> I'm still not totally sure that it's what you want. If not, then I
> think it brings you closer anyway. The split function is very
> useful and I think that's what you need. Let me know if the code below
> is what you needed.
> If it's close but not quite right, I can look at it again; it's not
> a problem. If I'm totally off, maybe you should resend to the list,
> because that means I probably can't fix it.
>
> #======================================================================
> a <- read.csv ( file = "/opt/mark/research/equity/projects/R_mails/example.csv" , colClasses = c ( "Date" , "numeric" ) ) # beware of the path
>
> # SPLIT BY DATE
> # TO CREATE A LIST OF
> # DATAFRAMES
> DFlist <- split(a,a$Date)
> print(str(DFlist))
>
> # USE LAPPLY TO CALL cut AND
> # THEN aggregate ON EACH COMPONENT
> # DATAFRAME IN THE LIST
> tempresult <- lapply(DFlist,function(.df) {
>   .df$quantile <- cut(.df$value,breaks=quantile(.df$value,probs=seq(0,1,0.1),na.rm=TRUE))
>   aggregate(.df$value,list(DATE=.df$Date,QUANTILE=.df$quantile),sum)
> })
>
> # CHECK IF IT WORKED
> print(tempresult)
>
> # RBIND EVERYTHING BACK TOGETHER
> # SO THAT IT'S ONE DATAFRAME
> finalresult <- do.call(rbind,tempresult)
> print(finalresult)
>
>
>
>
> On Tue, Oct 21, 2008 at  5:47 PM, Ivan Alves wrote:
>
>> Hello Mark,
>> Many thanks for the reply.  Your suggestion is essentially  
>> equivalent to my first attempt: the quantiles are estimated for the  
>> WHOLE of the a.value column.  Essentially what I would need is to  
>> first break down the value column by "bins" determined by the  
>> a.date column and THEN estimate the quantile for each "bin".  you  
>> see, I would need the quantiles for each data entry, not for all  
>> the entries, thus if there are 12 dates (or "bins"), then I would
>> need 12 x 10 deciles, not just 10.
>> Kind regards,
>> Ivan
>>
>> On 21 Oct 2008, at 22:20, markleeds at verizon.net wrote:
>>
>>> Hi: I still wasn't very clear on what you wanted, but that might be
>>> because I didn't save your original email. I doubt that the code below
>>> helps. I used cut instead of cut2 because I didn't have Hmisc
>>> loaded, and I think cut does what you want. Jim will probably
>>> reply later with a better answer.
>>> He's the real expert with this type of thing. I just like to
>>> practice.
>>>
>>> a <- read.csv ( file = "/opt/mark/research/equity/projects/R_mails/example.csv" , colClasses = c ( "Date" , "numeric" ) )
>>> a$quantile <- cut(a$value,breaks=quantile(a$value,probs=seq(0,1,0.1),na.rm=TRUE))
>>> aggregate(a$value,list(DATE=a$Date,QUANTILE=a$quantile),sum)
>>>
On 21 Oct 2008, at 09:25, Ivan Alves wrote:

> Dear all,
>
> Thanks to Jim and Mark for suggesting including the reproducible
> code.  Please note that the enclosed file would need to go into
> the home folder, or the path for reading the CSV file would need to be
> changed.  I hope no encoding issues emerge when reading it.
>
> And the code
>
> library(Hmisc) # need the cut2 function to mark the quantile a given line belongs to
> a <- read.csv(file = "~/example.csv", colClasses=c("Date","numeric")) # beware of the path
> dim(a) # should give "[1] 5076    2"
> aggregate(a$value, list(Date = a[,"Date"],Quantile=cut2(a$value,g=10)),sum) # should give the sum by year but on the quantiles for the whole population
> aggregate(a$value, list(Date = a[,"Date"],Quantile=tapply(a$value,a$Date,cut2,g=10)),sum) # gives error mentioned below
>
> Once again, many thanks for any help
> Ivan
>
> On 21 Oct 2008, at 02:40, jim holtman wrote:
>
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>> You need to at least post a subset of your data so that we can
>> understand the data structures that you are using.  'dput' will create
>> an easily readable format for posting your data (much easier than if
>> you post the listing of a table).  Usually it is some 'type mismatch'
>> which says you really have to have the data to run the script against.
>>
>> On Mon, Oct 20, 2008 at 6:38 PM, Ivan Alves <papucho at mac.com> wrote:
>>> Dear all,
>>>
>>> I would like to aggregate a data frame (consisting of 2 columns: one
>>> for the bins, say factors, and one for the values) along bins and
>>> quantiles within the bins.
>>>
>>> I have tried
>>>
>>> aggregate(data.frame$values, list(bin = data.frame$bin,Quantile=cut2(data.frame$values,g=10)),sum)
>>>
>>> but then the quantiles apply to the population as a whole and not the
>>> individual bins. Upon this realisation I have tried
>>>
>>> aggregate(data.frame$values, list(bin = data.frame$bin,Quantile=tapply(data.frame$values,data.frame$bin,cut2,g=10)),sum)
>>>
>>> which gives the following error:
>>>
>>> Error in sort.list(unique.default(x), na.last = TRUE) :
>>> 'x' must be atomic for 'sort.list'
>>> Have you called 'sort' on a list?
>>>
>>> Clearly I am doing something wrong, but I cannot figure out what.  I
>>> believe the error stems either from (a) the output of tapply being a
>>> list with as many elements as there are bins, and not a list of
>>> the same length as the values, or (b) aggregate somehow not
>>> liking that the second list (the quantiles within the bins) is not
>>> sorted nicely.
>>>
>>> 1. Do you have a reference for doing the summation on both bins and
>>> quantiles within the bins?
>>> 2. If not, can you give me some guidance as to what I am doing wrong
>>> and how I can solve the sort/list issue?
>>>
>>> Any help would be greatly appreciated
>>>
>>> Kind regards,
>>>
>>> Ivan Alves
>>>
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> -- 
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?