[R] Multiple if function

Charles C. Berry ccberry at ucsd.edu
Thu Sep 17 18:29:38 CEST 2015


On Thu, 17 Sep 2015, Berend Hasselman wrote:

>
>> On 17 Sep 2015, at 01:42, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:
>>
>>
>>
>> On 09/16/2015 04:41 PM, Bert Gunter wrote:
>>> Yes! Chuck's use of mapply is exactly the split/combine strategy I was
>>> looking for. In retrospect, exactly how one should think about it.
>>> Many thanks to all for a constructive discussion.
>>>
>>> -- Bert
>>>
>>>
>>> Bert Gunter
>>>
>>>>>>
>>>>>> Use mapply like this on large problems:
>>>>>>
>>>>>> unsplit(
>>>>>>   mapply(
>>>>>>       function(x,z) eval( x, list( y=z )),
>>>>>>       expression( A=y*2, B=y+3, C=sqrt(y) ),
>>>>>>       split( dat$Flow, dat$ASB ),
>>>>>>       SIMPLIFY=FALSE),
>>>>>>   dat$ASB)
>>>>>>
>>>>>> Chuck
>>>>>>
>>
>>
>> Is there any reason not to use data.table for this purpose, especially if efficiency is of concern?
>>
>> ---
>>
>> # load data.table and microbenchmark
>> library(data.table)
>> library(microbenchmark)
>> #
>> # prepare data
>> DF <- data.frame(
>>    ASB = rep_len(factor(LETTERS[1:3]), 3e5),
>>    Flow = rnorm(3e5)^2)
>> DT <- as.data.table(DF)
>> DT[, ASB := as.character(ASB)]
>> #
>> # define functions
>> #
>> # Chuck's version
>> fnSplit <- function(dat) {
>>    unsplit(
>>        mapply(
>>            function(x,z) eval( x, list( y=z )),
>>            expression( A=y*2, B=y+3, C=sqrt(y) ),
>>            split( dat$Flow, dat$ASB ),
>>            SIMPLIFY=FALSE),
>>        dat$ASB)
>> }
>> #
>> # data.table-way (IMHO, much easier to read)
>> fnDataTable <- function(dat) {
>>    dat[,
>>        result :=
>>            if (.BY == "A") {
>>                2 * Flow
>>            } else if (.BY == "B") {
>>                3 + Flow
>>            } else if (.BY == "C") {
>>                sqrt(Flow)
>>            },
>>        by = ASB]
>> }
>> #
>> # benchmark
>> #
>> microbenchmark(fnSplit(DF), fnDataTable(DT))
>> identical(fnSplit(DF), fnDataTable(DT)[, result])
>>
>> ---
>>
>> Actually, in Chuck's version the unsplit() part is slow. If the order is not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is comparable to the DT-version.
>>
>
> But David’s version is faster than Chuck’s fnSplit. I modified David’s solution slightly to get a result that is identical to fnSplit.
>
> # David's version
> # my modification to return a vector just like fnSplit
> fnDavid <- function(dat) {
>    z <- mapply(
>          function(x,z) eval( x, list( y=z )),
>          expression(A= y*2, B=y+3, C=sqrt(y) ),
>          split( dat$Flow, dat$ASB ),
>          USE.NAMES=FALSE, SIMPLIFY=TRUE
>        )
>    as.vector(t(z))
> }
>
> Added this to Dénes's code.
> Benchmarking with the R package rbenchmark and testing the result like this:
>
> library(rbenchmark)
> benchmark(fnSplit(DF), fnDataTable(DT),fnDavid(DF))
> identical(fnSplit(DF), fnDataTable(DT)[, result])
> identical(fnSplit(DF), fnDavid(DF))
>
> gave this:
>
>             test replications elapsed relative user.self sys.self user.child
> 2 fnDataTable(DT)          100   0.829    1.000     0.762    0.066          0
> 3     fnDavid(DF)          100   1.615    1.948     1.515    0.098          0
> 1     fnSplit(DF)          100   2.878    3.472     2.685    0.190          0
>  sys.child
> 2         0
> 3         0
> 1         0
>
>> identical(fnSplit(DF), fnDataTable(DT)[, result])
> [1] TRUE
>> identical(fnSplit(DF), fnDavid(DF))
> [1] TRUE

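The `TRUE' above depends on the structure of ASB here: rep_len() cycles 
through A, B, C in a fixed order with equal group sizes, so 
as.vector(t(z)) happens to restore the original row order. 
identical(...) is often FALSE in the general case. A permutation of ASB 
is enough to show this: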

> DF$ASB <- sample(DF$ASB)
> identical(fnSplit(DF), fnDavid(DF))
[1] FALSE
>

unsplit() is the price you pay to cope with general orderings.
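
A minimal sketch of why this happens, using a small made-up data frame (the name `dat' and its six values are hypothetical and chosen only so the arithmetic is easy to check by hand). It runs the same mapply() call twice: once followed by unsplit(), and once followed by the as.vector(t(z)) shortcut, which assumes equal-sized groups whose rows arrive in a regular A, B, C cycle.

```r
# Hypothetical toy data: the group labels are NOT in a regular A, B, C cycle.
dat <- data.frame(
  ASB  = factor(c("B", "A", "C", "A", "B", "C")),
  Flow = c(1, 4, 9, 16, 25, 36)
)

# One expression per group, evaluated with y bound to that group's values.
exprs  <- expression(A = y * 2, B = y + 3, C = sqrt(y))
pieces <- split(dat$Flow, dat$ASB)

# Split/apply step, keeping one result vector per group.
res <- mapply(function(x, z) eval(x, list(y = z)),
              exprs, pieces, SIMPLIFY = FALSE)

# unsplit() puts every value back at the row it came from.
by_unsplit <- unsplit(res, dat$ASB)

# The shortcut: simplify to a matrix (one column per group), then
# interleave the columns. This only matches the original row order
# when the rows cycle A, B, C, A, B, C, ...
z <- mapply(function(x, z) eval(x, list(y = z)),
            exprs, pieces, USE.NAMES = FALSE, SIMPLIFY = TRUE)
by_transpose <- as.vector(t(z))

# Row-by-row reference computed directly from the definitions.
expected <- with(dat, ifelse(ASB == "A", Flow * 2,
                      ifelse(ASB == "B", Flow + 3, sqrt(Flow))))

isTRUE(all.equal(by_unsplit, expected))    # unsplit() agrees with the reference
isTRUE(all.equal(by_transpose, expected))  # the shortcut does not on this ordering
```

The shortcut skips the bookkeeping that unsplit() does, which is where its speed comes from, so it is only safe when the group layout is known in advance.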

Chuck

