[R] Functional Programming Problem Using purrr and R's data.table shift function

Dénes Tóth
Sat Dec 31 15:22:29 CET 2022


Hi Michael,

Note that you have to be very careful when using by-reference operations 
in data.table (see `?data.table::set`), especially in a functional 
programming approach. In your function, you avoid this problem by 
calling `data.table(A)`, which makes a copy of A even if it is already a 
data.table. However, for large data.tables, copying can be a very 
expensive operation (especially in terms of RAM usage). That cost can be 
avoided entirely by using data.tables the data.table way (e.g., joining, 
grouping, and aggregating in the same step by performing these 
operations within `[`, see `?data.table`).
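To make the by-reference pitfall concrete, here is a minimal sketch. The helper names `lag_in_place()` and `lag_on_copy()` and the toy data are made up for illustration; `copy()` is used for the explicit copy.

```r
library(data.table)

# Hypothetical helper: `:=` modifies the caller's table by reference.
lag_in_place <- function(DT) {
  DT[, A := shift(A, n = 1, fill = NA, type = "lag")]
  invisible(DT)
}

# Hypothetical helper: works on an explicit copy, so the caller's
# table is left untouched (at the price of duplicating the data).
lag_on_copy <- function(DT) {
  out <- copy(DT)
  out[, A := shift(A, n = 1, fill = NA, type = "lag")]
  out[]
}

original <- data.table(A = c(1, 2, 3))

shifted <- lag_on_copy(original)
original$A   # still the original values

lag_in_place(original)
original$A   # now lagged, although nothing was assigned in this scope
```

The second call is the side effect to watch out for: the function changed an object that lives outside of it.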

So instead of blindly functionalizing all your code, try to be 
pragmatic. Functional programming is not about using pure functions in 
*every* part of your code base, because that is infeasible for the vast 
majority of real-world problems. Even Haskell has `IO` and `do`; the 
point is that the imperative and functional parts of the code are 
clearly separated and the imperative components are kept as close to 
the top level as possible.

So when using data.table, a good strategy is to use pure functions for 
performing within-data.table operations, e.g., `DT[, lapply(.SD, mean), 
.SDcols = is.numeric]`, and when these operations alter `DT` by 
reference, to invoke the chains of these operations in "pure" wrappers - 
e.g., by calling `A <- copy(A)` at the top and then modifying `A` directly.
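A rough sketch of such a "pure" wrapper follows. The function name `add_group_mean()`, the new column names, and the toy data are assumptions made for this example only.

```r
library(data.table)

# Hypothetical wrapper: copy once at the top, then modify the copy
# by reference as often as needed.
add_group_mean <- function(DT) {
  DT <- copy(DT)                          # the only copy made
  DT[, z_mean := mean(z), by = "gender"]  # by-reference update of the copy
  DT[, z_centered := z - z_mean]
  DT[]
}

people <- data.table(gender = c("f", "f", "m"), z = c(1, 3, 10))
result <- add_group_mean(people)

names(people)   # unchanged: the caller's table was not touched
names(result)   # has the two additional columns

# A pure within-data.table operation on the result
result[, lapply(.SD, mean), .SDcols = is.numeric]
```

From the caller's point of view `add_group_mean()` has no side effects, while inside it the updates are still done by reference on a single copy.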

Cheers,
Denes

Side note: You do not need to use `DT[ , A:= shift(A, fill = NA, type = 
"lag", n = 1)] %>% `[`(return(DT))`. `[.data.table` returns the result 
(the modified DT) invisibly. If you want auto-printing to work, you can 
just use `DT[ , A:= shift(A, fill = NA, type = "lag", n = 1)][]`.
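Applied to the function from the original question, this could look like the sketch below; `lag_column()` is a hypothetical name for a simplified variant.

```r
library(data.table)

# Simplified variant of the function from the question (hypothetical name).
# The trailing [] returns the updated table so that it prints as usual.
lag_column <- function(A) {
  DT <- data.table(A)
  DT[, A := shift(A, fill = NA, type = "lag", n = 1)][]
}

lagged <- lag_column(c(1, 2))
lagged
```

No pipe and no explicit `return()` are needed, because the value of the last expression is the modified table.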

Note that this also means you usually do not need to use magrittr's 
or the base-R pipe when transforming data.tables. You can do this instead:
```
DT[
   ## filter rows where 'x' column equals "a"
   x == "a"
][
   ## calculate the mean of `z` for each gender and assign it to `y`
   , y := mean(z), by = "gender"
][
   ## do whatever you want
   ...
]
```
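For a version that can be run as-is, here is the same chaining pattern with made-up data; the column values and the final selection step are assumptions for illustration.

```r
library(data.table)

DT <- data.table(
  x      = c("a", "a", "a", "b"),
  gender = c("f", "m", "f", "m"),
  z      = c(1, 2, 5, 4)
)

res <- DT[
  ## filter rows where 'x' column equals "a" (this returns a new table)
  x == "a"
][
  ## calculate the mean of `z` for each gender and assign it to `y`
  , y := mean(z), by = "gender"
][
  ## keep only the columns of interest
  , .(gender, z, y)
]

res
```

Because the filtering step returns a new data.table, the `:=` in the second step adds `y` to that filtered table and not to `DT` itself.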


On 12/31/22 13:39, Rui Barradas wrote:
> Às 06:50 de 31/12/2022, Michael Lachanski escreveu:
>> Hello,
>>
>> I am trying to make a habit of "functionalizing" all of my code as
>> recommended by Hadley Wickham. I have found it surprisingly difficult
>> to do so because several intermediate features from data.table break
>> or give unexpected results using purrr and its data.table adaptation,
>> tidytable. Here is a minimal working example of what has stumped me
>> most recently:
>>
>> ===
>>
>> library(data.table); library(tidytable)
>>
>> minimal_failing_function <- function(A){
>>    DT <- data.table(A)
>>    DT[ , A:= shift(A, fill = NA, type = "lag", n = 1)] %>% `[`
>>    return(DT)}
>> # works
>> minimal_failing_function(c(1,2))
>> # fails
>> tidytable::pmap_dfr(.l = list(c(1,2)),
>>                      .f = minimal_failing_function)
>>
>>
>> ===
>> These should ideally give the same output, but do not. This also fails
>> using purrr::pmap_dfr rather than tidytable. I am using R 4.2.2 and I
>> am on Mac OS Ventura 13.1.
>>
>> Thank you for any help you can provide or general guidance.
>>
>>
>> ==
>> Michael Lachanski
>> PhD Student in Demography and Sociology
>> MA Candidate in Statistics
>> University of Pennsylvania
>> mikelach using sas.upenn.edu
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> Hello,
> 
> Use map_dfr instead of pmap_dfr.
> 
> 
> library(data.table)
> library(tidytable)
> 
> minimal_failing_function <- function(A) {
>    DT <- data.table(A)
>    DT[ , A:= shift(A, fill = NA, type = "lag", n = 1)] %>% `[`
>    return(DT)
> }
> 
> # works
> tidytable::map_dfr(.x = list(c(1,2)),
>                     .f = minimal_failing_function)
> #> # A tidytable: 2 × 1
> #>       A
> #>   <dbl>
> #> 1    NA
> #> 2     1
> 
> 
> Hope this helps,
> 
> Rui Barradas
> 