[R] by function ??

Matthew Dowle mdowle at mdowle.plus.com
Tue Jan 5 11:53:26 CET 2010


I wrote :
> (some may return vectors, others may return vectors)
It's been pointed out that there was a typo, and it wasn't very clear
anyway. It should read '(some may return vectors, others may return
scalars)'. I've been asked for further explanation, so here goes ...

The point I was trying to make is that the following expression is very 
natural to write, although it takes a bit of getting used to. Here is a 
reminder of the 2-column Dataset (one group of 4 rows and one group of 
3 rows), then the R expression, and then the output :

    LEAID  ratio
    6307   0.7200000
    6307   0.7623810
    6307   0.8600000
    6307   0.9200000
    8300   0.5678462
    8300   0.7700000
    8300   0.8300000

the syntax :
    Dataset = data.table(Dataset)
    Dataset[,DT(ratio,scaled=abs(ratio-median(ratio)),sum=sum(ratio)),by="LEAID"]

and the 4 column output :

    LEAID  ratio      scaled     sum
    6307   0.7200000  0.0911905  3.262381
    6307   0.7623810  0.0488095  3.262381
    6307   0.8600000  0.0488095  3.262381
    6307   0.9200000  0.1088095  3.262381
    8300   0.5678462  0.2021538  2.167846
    8300   0.7700000  0.0000000  2.167846
    8300   0.8300000  0.0600000  2.167846
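
A note for anyone trying this on a recent data.table: as far as I'm aware, 
DT() was dropped in later releases and list() (or its alias .()) is written 
in its place. The following is only a sketch under that assumption, with the 
data retyped from the table above and the name res made up for illustration. 
It is guarded so that it does nothing if the package isn't installed.

```r
# Sketch only: assumes a recent data.table in which list() replaces DT().
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  Dataset <- data.table(
    LEAID = c(6307L, 6307L, 6307L, 6307L, 8300L, 8300L, 8300L),
    ratio = c(0.7200000, 0.7623810, 0.8600000, 0.9200000,
              0.5678462, 0.7700000, 0.8300000)
  )

  # The same three expressions, evaluated once per LEAID group.
  res <- Dataset[, list(ratio,
                        scaled = abs(ratio - median(ratio)),
                        sum    = sum(ratio)),
                 by = "LEAID"]
  print(res)
}
```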

The 2nd argument (the call to DT()) contains 3 expressions, which are 
evaluated once for each subset of the Dataset, grouped by LEAID.  The row 
order is maintained within each subset, and the expressions operate on 
ordered vectors as usual in R. We can use column names directly as variable 
names (like an implicit ?with).  Note that Dataset doesn't have to be 
ordered by LEAID; it just happens to be in this example.
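
The "implicit with" remark can be tried out in base R, without data.table. 
A small sketch follows (the data frame is retyped from the example above, 
and the names overall and grp_sum are just for illustration):

```r
Dataset <- data.frame(
  LEAID = c(6307L, 6307L, 6307L, 6307L, 8300L, 8300L, 8300L),
  ratio = c(0.7200000, 0.7623810, 0.8600000, 0.9200000,
            0.5678462, 0.7700000, 0.8300000)
)

# with() evaluates the expression with the columns visible as variables.
# There is no grouping here: the median is taken over all 7 rows.
overall <- with(Dataset, abs(ratio - median(ratio)))

# Restricting to one group by hand shows what a single by-group
# evaluation gets to see.
grp_sum <- with(Dataset[Dataset$LEAID == 6307L, ], sum(ratio))
print(grp_sum)  # 3.262381
```

The grouped query does the second step for every distinct LEAID, so the 
subsetting never has to be written by hand.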

A comment on each of the 3 expressions (the 3 arguments passed to DT() 
above) is perhaps useful :

ratio :   just repeats the ratio vector as is. You don't have to include 
this but I wanted to keep the input data in the output to demonstrate.

abs(ratio-median(ratio))  :   median() returns a scalar, which is subtracted 
from each element of ratio, giving a vector. abs() takes a vector and 
returns a vector. Standard R and basic stuff. Any R expression can be used, 
so it's more powerful than SQL in that sense, because SQL is restricted to a 
small set of functions (avg, min, max, etc).  That has been said before and 
has been true of R for a long time.  It's the overall syntax of the single 
'query' that I'm trying to demonstrate.
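
To make the arithmetic concrete, here is the same expression worked by hand 
for the LEAID 6307 group in plain R. The helper spread() is hypothetical, 
included only to show that a user-defined function fits in the same place.

```r
ratio <- c(0.7200000, 0.7623810, 0.8600000, 0.9200000)  # LEAID 6307 group

m <- median(ratio)        # one number: the mean of the two middle values
scaled <- abs(ratio - m)  # m is recycled against all 4 elements

print(m)       # 0.8111905
print(scaled)

# Any R expression can sit in the same position, e.g. a function of your own
# (spread() is a made-up helper, not part of any package):
spread <- function(x) (x - min(x)) / (max(x) - min(x))
print(spread(ratio))
```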

sum(ratio) :  returns a scalar aggregate of the vector input. That's what I 
meant by "others may return scalars".  Notice that the value of sum(ratio) 
is repeated in the final column of the output.  The reason is that at least 
one of the other expressions returns a vector, and R's standard silent 
recycling rules come into play inside DT().
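
The recycling described here is ordinary R behaviour and can be seen without 
data.table. In the sketch below data.frame() is used purely as a stand-in 
for DT(); that is an analogy for illustration, not a claim about how DT() 
is implemented.

```r
ratio <- c(0.5678462, 0.7700000, 0.8300000)   # LEAID 8300 group

# One column is length 3, so the length-1 sum is silently repeated.
out <- data.frame(
  ratio  = ratio,
  scaled = abs(ratio - median(ratio)),
  sum    = sum(ratio)
)
print(out)

# With only scalar expressions there is nothing to recycle against,
# so the group collapses to a single row.
onlysum <- data.frame(sum = sum(ratio))
print(nrow(onlysum))  # 1
```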

Then the 2 data.tables (one for each of the 2 groups) are combined and a 
single data.table is returned. This is very similar to SQL, and to some 
other ways of aggregating in R, but it is more compact, more natural, and 
easier and more convenient (and therefore quicker) to write, debug and 
maintain.
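
For comparison, the split / evaluate / combine steps can be spelled out 
longhand in base R. This is only a conceptual sketch of the steps described 
above, not how data.table does it internally; the names pieces and result 
are made up for illustration.

```r
Dataset <- data.frame(
  LEAID = c(6307L, 6307L, 6307L, 6307L, 8300L, 8300L, 8300L),
  ratio = c(0.7200000, 0.7623810, 0.8600000, 0.9200000,
            0.5678462, 0.7700000, 0.8300000)
)

# 1. Split into one data frame per LEAID.
# 2. Evaluate the three expressions on each piece (the scalar sum is recycled).
pieces <- lapply(split(Dataset, Dataset$LEAID), function(g) {
  data.frame(
    LEAID  = g$LEAID,
    ratio  = g$ratio,
    scaled = abs(g$ratio - median(g$ratio)),
    sum    = sum(g$ratio)
  )
})

# 3. Combine the per-group results into a single table.
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

The by= form expresses all three steps in one line, which is the 
compactness being argued for.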


"Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message 
news:hgnjev$3hk$1 at ger.gmane.org...
> or if Dataset is a data.table :
>
>> Dataset = data.table(Dataset)
>> Dataset[,abs(ratio-median(ratio)),by="LEAID"]
>     LEAID        V1
> [1,]  6307 0.0911905
> [2,]  6307 0.0488095
> [3,]  6307 0.0488095
> [4,]  6307 0.1088095
> [5,]  8300 0.2021538
> [6,]  8300 0.0000000
> [7,]  8300 0.0600000
> rather than :
>> Dataset$abs <- with(Dataset, ave(ratio, LEAID, FUN=function(x)abs(x-median(x))))
>
> This is less code and more natural (to me anyway) e.g. it doesn't require 
> use of function() or ave(). data.table knows that if the j expression 
> returns a vector it should silently repeat the groups to match the length 
> of the j result (which it is doing here).   If the j expression returns a 
> scalar you would just get 2 rows in this example.  Note that the 'by' 
> expression must evaluate to integer, or a list of integer vectors, so 
> in this case LEAID must either be integer already or coerced to integer 
> using by="as.integer(LEAID)".
>
> To give the aggregate expression a name, just wrap with the DT function. 
> This is also how to return multiple aggregate functions from each subset 
> (some may return vectors, others may return vectors) by listing them 
> inside DT() :
>
>> Dataset[,DT(ratio,scaled=abs(ratio-median(ratio)),sum=sum(ratio)),by="LEAID"]
>     LEAID     ratio    scaled      sum
> [1,]  6307 0.7200000 0.0911905 3.262381
> [2,]  6307 0.7623810 0.0488095 3.262381
> [3,]  6307 0.8600000 0.0488095 3.262381
> [4,]  6307 0.9200000 0.1088095 3.262381
> [5,]  8300 0.5678462 0.2021538 2.167846
> [6,]  8300 0.7700000 0.0000000 2.167846
> [7,]  8300 0.8300000 0.0600000 2.167846
>
>
> "William Dunlap" <wdunlap at tibco.com> wrote in message 
> news:77EB52C6DD32BA4D87471DCD70C8D7000243CBA1 at NA-PA-VBE03.na.tibco.com...
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of L.A.
>> Sent: Saturday, December 12, 2009 12:39 PM
>> To: r-help at r-project.org
>> Subject: Re: [R] by function ??
>>
>>
>>
>> Thanks for all the help. They all worked, but I'm stuck again.
>> I've tried searching, but I'm not sure how to word my search, as
>> nothing came up.
>> Here is my new hurdle: my data has 7 observations and my results
>> have 2 answers:
>>
>>
>> Here is my data
>>
>>      LEAID     ratio
>> 3 6307     0.7200000
>> 1 6307     0.7623810
>> 2 6307     0.8600000
>> 4 6307     0.9200000
>> 5 8300     0.5678462
>> 7 8300     0.7700000
>> 6 8300     0.8300000
>>
>>
>> > median<-summaryBy(ratio ~ LEAID, data = Dataset, FUN = median)
>>
>> > print(median)
>>   LEAID       ratio.median
>> 1 6307        0.8111905
>> 2 8300        0.7700000
>>
>> Now what I want is a way to compute abs(ratio - median) by LEAID
>> for each observation, to produce something like this:
>>
>> LEAID     ratio          abs
>> 3 6307     0.7200000     .0912
>> 1 6307     0.7623810     .0488
>> 2 6307     0.8600000     .0488
>> 4 6307     0.9200000     .1088
>> 5 8300     0.5678462     .2022
>> 7 8300     0.7700000     .0000
>> 6 8300     0.8300000     .0600
>
> Try ave(), as in
>   > Dataset$abs <- with(Dataset, ave(ratio, LEAID, FUN=function(x)abs(x-median(x))))
>   > Dataset
>     LEAID     ratio       abs
>   3  6307 0.7200000 0.0911905
>   1  6307 0.7623810 0.0488095
>   2  6307 0.8600000 0.0488095
>   4  6307 0.9200000 0.1088095
>   5  8300 0.5678462 0.2021538
>   7  8300 0.7700000 0.0000000
>   6  8300 0.8300000 0.0600000
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>
>> Thanks,
>> L.A.
>>
>>
>>
>>
>> Ista Zahn wrote:
>> >
>> > Hi,
>> > I think you want
>> >
>> > by(TestData[ , "RATIO"], LEAID, median)
>> >
>> > -Ista
>> >
>> > On Tue, Dec 8, 2009 at 8:36 PM, L.A. <romsa at millect.com> wrote:
>> >>
>> >> I'm just learning and this is probably very simple, but I'm stuck.
>> >> I'm trying to understand the by().
>> >> This works.
>> >> by(TestData, LEAID, summary)
>> >>
>> >> But, This doesn't.
>> >>
>> >> by(TestData, LEAID, median(RATIO))
>> >>
>> >>
>> >> ERROR: could not find function "FUN"
>> >>
>> >> HELP!
>> >> Thanks,
>> >> LA
>> >> --
>> >> View this message in context:
>> >> http://n4.nabble.com/by-function-tp955789p955789.html
>> >> Sent from the R help mailing list archive at Nabble.com.
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >
>> >
>> >
>> > -- 
>> > Ista Zahn
>> > Graduate student
>> > University of Rochester
>> > Department of Clinical and Social Psychology
>> > http://yourpsyche.org
>> >
>> >
>> >
>>
>> -- 
>> View this message in context:
>> http://n4.nabble.com/by-function-tp955789p962666.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>>
>


