[R] dplyr - add/expand rows

Martin Morgan martin.morgan at roswellpark.org
Wed Nov 29 23:58:46 CET 2017


On 11/29/2017 05:47 PM, Tóth Dénes wrote:
> Hi Martin,
> 
> On 11/29/2017 10:46 PM, Martin Morgan wrote:
>> On 11/29/2017 04:15 PM, Tóth Dénes wrote:
>>> Hi,
>>>
>>> A benchmarking study with an additional (data.table-based) solution. 
>>
>> I don't think speed is the right benchmark (I do agree that 
>> correctness is!).
> 
> Well, agree, and sorry for the wording. It was really just an exercise 
> and not a full evaluation of the approaches. When I read the avalanche 
> of solutions neither of which mentioning data.table (my first choice for 
> data.frame-manipulations), I became curious how a one-liner data.table 
> code performs against the other solutions in terms of speed and 
> readability.
> Second, I quite often have the feeling that dplyr is extremely overused 
> among novice (and sometimes even experienced) R users nowadays. This is 
> unfortunate, as the present example also illustrates.

Another solution is Bill's approach and dplyr's implementation (adding 
the 1L to keep integers integers!)

fun_bill1 <- function(d) {
   i <- rep(seq_len(nrow(d)), d$to - d$from + 1L)
   j <- sequence(d$to - d$from + 1L)
   ## d[i,] %>% mutate(year = from + j - 1L, from = NULL, to = NULL)
   mutate(d[i,], year = from + j - 1L, from = NULL, to = NULL)
}

which is competitive with IRanges and data.table (the more dplyr-ish? 
solution

   d[i, ] %>% mutate(year = from + j - 1L) %>%
       select(station, record, year))

has intermediate performance) and might appeal to those introduced to R 
through dplyr but wanting more base R knowledge, and vice versa. I think 
if dplyr introduces new users to R, or exposes R users to new approaches 
for working with data, that's great!

Martin


> 
> Regards,
> Denes
> 
>>
>> For the R-help list, maybe something about least specialized R 
>> knowledge required would be appropriate? I'd say there were some 
>> 'hard' solutions -- Michael (deep understanding of Bioconductor and 
>> IRanges), Toth (deep understanding of data.table), Jim (at least for 
>> me moderate understanding of dplyr,especially the .$ notation; a 
>> simpler dplyr answer might have moved this response out of the 
>> 'difficult' category, especially given the familiarity of the OP with 
>> dplyr). I'd vote for Bill's as requiring the least specialized 
>> knowledge of R (though the +/- 1 indexing is an easy thing to get wrong).
>>
>> A different criteria might be reuse across analysis scenarios. Bill 
>> seems to win here again, since the principles are very general and at 
>> least moderately efficient (both Bert and Martin's solutions are 
>> essentially R-level iterations and have poor scalability, as 
>> demonstrated in the microbenchmarks; Bill's is mostly vectorized). 
>> Certainly data.table, dplyr, and IRanges are extremely useful within 
>> the confines of the problem domains they address.
>>
>> Martin
>>
>>> Enjoy! ;)
>>>
>>> Cheers,
>>> Denes
>>>
>>>
>>> --------------------------
>>>
>>>
>>> ## packages ##########################
>>>
>>> library(dplyr)
>>> library(data.table)
>>> library(IRanges)
>>> library(microbenchmark)
>>>
>>> ## prepare example dataset ###########
>>>
>>> ## use Bert's example, with 2000 stations instead of 2
>>> d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
>>>                      from = as.integer(c(60,61,71,72,76,60,65,82,83)),
>>>                      to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
>>>                      record = c("A","B","C","B","D","B","B","D","E"),
>>>                      stringsAsFactors = FALSE)
>>> stations <- rle(d_df$station)
>>> stations$value <- gsub(
>>>    " ", "0",
>>>    paste0("station", format(1:length(stations$value), width = 6)))
>>> d_df$station <- rep(stations$value, stations$lengths)
>>>
>>> ## prepare tibble and data.table versions
>>> d_tbl <- as_tibble(d_df)
>>> d_dt <- as.data.table(d_df)
>>>
>>> ## solutions ##########################
>>>
>>> ## Bert - by
>>> fun_bert <- function(d) {
>>>    out <- by(
>>>      d, d$station, function(x) with(x, {
>>>        i <- to - from +1
>>>        data.frame(record =rep(record,i),
>>>                   year =sequence(i) -1 + rep(from,i),
>>>                   stringsAsFactors = FALSE)
>>>      }))
>>>    data.frame(station = rep(names(out), sapply(out,nrow)),
>>>               do.call(rbind,out),
>>>               row.names = NULL,
>>>               stringsAsFactors = FALSE)
>>> }
>>>
>>> ## Bill - transform
>>> fun_bill <- function(d) {
>>>    i <- rep(seq_len(nrow(d)), d$to-d$from+1)
>>>    j <- sequence(d$to-d$from+1)
>>>    transform(d[i,], year=from+j-1, from=NULL, to=NULL)
>>> }
>>>
>>> ## Michael - IRanges
>>> fun_michael <- function(d) {
>>>    df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
>>>    expand(df, "year")
>>> }
>>>
>>> ## Jim - dplyr
>>> fun_jim <- function(d) {
>>>    d %>%
>>>      rowwise() %>%
>>>      do(tibble(station = .$station,
>>>                record = .$record,
>>>                year = seq(.$from, .$to))
>>>      )
>>> }
>>>
>>> ## Martin - Map
>>> fun_martin <- function(d) {
>>>    d$year <- with(d, Map(seq, from, to))
>>>    res0 <- with(d, Map(data.frame,
>>>                        station=station,
>>>                        record=record,
>>>                        year=year,
>>>                        MoreArgs = list(stringsAsFactors = FALSE)))
>>>    do.call(rbind, unname(res0))
>>> }
>>>
>>> ## Denes - simple data.table
>>> fun_denes <- function(d) {
>>>    out <- d[, .(year = from:to), by = .(station, from, record)]
>>>    out[, from := NULL]
>>> }
>>>
>>> ## Check equality ################################
>>> all.equal(fun_bill(d_df), fun_bert(d_df),
>>>            check.attributes = FALSE)
>>> all.equal(fun_bill(d_df), fun_martin(d_df),
>>>            check.attributes = FALSE)
>>> all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
>>>            check.attributes = FALSE)
>>> all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
>>>            check.attributes = FALSE)
>>> # Be prepared: this solution is super slow
>>> all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
>>>            check.attributes = FALSE)
>>>
>>> ## Benchmark #####################################
>>>
>>> ## Martin
>>> print(system.time(fun_martin(d_df)))
>>>
>>> ## Bert
>>> print(system.time(fun_bert(d_df)))
>>>
>>> ## Top 3
>>> print(
>>>    microbenchmark(
>>>      fun_bill(d_df),
>>>      fun_michael(d_df),
>>>      fun_denes(d_dt),
>>>      times = 100L
>>>    )
>>> )
>>>
>>>
>>> -------------------------
>>>
>>> On 11/28/2017 06:49 PM, Michael Lawrence wrote:
>>>> Or with the Bioconductor IRanges package:
>>>>
>>>> df <- with(input, DataFrame(station, year=IRanges(from, to), record))
>>>> expand(df, "year")
>>>>
>>>> DataFrame with 24 rows and 3 columns
>>>>          station     year      record
>>>>      <character> <integer> <character>
>>>> 1       07EA001      1960         QMS
>>>> 2       07EA001      1961         QMC
>>>> 3       07EA001      1962         QMC
>>>> 4       07EA001      1963         QMC
>>>> 5       07EA001      1964         QMC
>>>> ...         ...       ...         ...
>>>> 20      07EA001      1979         QRC
>>>> 21      07EA001      1980         QRC
>>>> 22      07EA001      1981         QRC
>>>> 23      07EA001      1982         QRC
>>>> 24      07EA001      1983         QRC
>>>>
>>>> If you tell the computer more about your data, it can do more things 
>>>> for
>>>> you.
>>>>
>>>> Michael
>>>>
>>>> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
>>>> martin.morgan at roswellpark.org> wrote:
>>>>
>>>>> On 11/26/2017 08:42 PM, jim holtman wrote:
>>>>>
>>>>>> try this:
>>>>>>
>>>>>> ##########################################
>>>>>>
>>>>>> library(dplyr)
>>>>>>
>>>>>> input <- tribble(
>>>>>>     ~station, ~from, ~to, ~record,
>>>>>>    "07EA001" ,    1960  ,  1960  , "QMS",
>>>>>>    "07EA001"  ,   1961 ,   1970  , "QMC",
>>>>>>    "07EA001" ,    1971  ,  1971  , "QMM",
>>>>>>    "07EA001" ,    1972  ,  1976  , "QMC",
>>>>>>    "07EA001" ,    1977  ,  1983  , "QRC"
>>>>>> )
>>>>>>
>>>>>> result <- input %>%
>>>>>>     rowwise() %>%
>>>>>>     do(tibble(station = .$station,
>>>>>>               year = seq(.$from, .$to),
>>>>>>               record = .$record)
>>>>>>     )
>>>>>>
>>>>>> ###########################
>>>>>>
>>>>>
>>>>> In a bit more 'base R' mode I did
>>>>>
>>>>>    input$year <- with(input, Map(seq, from, to))
>>>>>    res0 <- with(input, Map(data.frame, station=station, year=year,
>>>>>        record=record))
>>>>>     as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>>>
>>>>> resulting in
>>>>>
>>>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>>>     station  year record
>>>>>      <fctr> <int> <fctr>
>>>>>   1 07EA001  1960    QMS
>>>>>   2 07EA001  1961    QMC
>>>>>   3 07EA001  1962    QMC
>>>>>   4 07EA001  1963    QMC
>>>>>   5 07EA001  1964    QMC
>>>>>   6 07EA001  1965    QMC
>>>>>   7 07EA001  1966    QMC
>>>>>   8 07EA001  1967    QMC
>>>>>   9 07EA001  1968    QMC
>>>>> 10 07EA001  1969    QMC
>>>>> # ... with 14 more rows
>>>>>
>>>>> I though I should have been able to use `tibble` in the second 
>>>>> step, but
>>>>> that leads to a (cryptic) error
>>>>>
>>>>>> res0 <- with(input, Map(tibble, station=station, year=year,
>>>>> record=record))Error in captureDots(strict = `__quosured`) :
>>>>>    the argument has already been evaluated
>>>>>
>>>>> The 'station' and 'record' columns are factors, so different from the
>>>>> original input, but this seems the appropriate data type for theses 
>>>>> columns.
>>>>>
>>>>> It's interesting to compare the 'specialized' knowledge needed for 
>>>>> each
>>>>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), 
>>>>> maybe
>>>>> rbind() and Map() for base R.
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Jim Holtman
>>>>>> Data Munger Guru
>>>>>>
>>>>>> What is the problem that you are trying to solve?
>>>>>> Tell me what you want to do, not how you want to do it.
>>>>>>
>>>>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> To David W.'s point about lack of a suitable reprex ("reproducible
>>>>>>> example"), Bill's solution seems to be for only one station.
>>>>>>>
>>>>>>> Here is a reprex and modification that I think does what was 
>>>>>>> requested
>>>>>>> for
>>>>>>> multiple stations, again using base R and data frames, not dplyr and
>>>>>>> tibbles.
>>>>>>>
>>>>>>> First the reprex with **two** stations:
>>>>>>>
>>>>>>> d <- data.frame( station = rep(c("one","two"),c(5,4)),
>>>>>>>>
>>>>>>>                  from = c(60,61,71,72,76,60,65,82,83),
>>>>>>>                   to = c(60,70,71,76,83,64, 81, 82,83),
>>>>>>>                   record = c("A","B","C","B","D","B","B","D","E"))
>>>>>>>
>>>>>>> d
>>>>>>>>
>>>>>>>     station from to record
>>>>>>> 1     one   60 60      A
>>>>>>> 2     one   61 70      B
>>>>>>> 3     one   71 71      C
>>>>>>> 4     one   72 76      B
>>>>>>> 5     one   76 83      D
>>>>>>> 6     two   60 64      B
>>>>>>> 7     two   65 81      B
>>>>>>> 8     two   82 82      D
>>>>>>> 9     two   83 83      E
>>>>>>>
>>>>>>> ## Now the conversion code using base R, especially by():
>>>>>>>
>>>>>>> out <- by(d, d$station, function(x) with(x, {
>>>>>>>>
>>>>>>> +    i <- to - from +1
>>>>>>> +    data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD 
>>>>>>> =rep(record,i))
>>>>>>> + }))
>>>>>>>
>>>>>>>
>>>>>>> out <- data.frame(station =
>>>>>>>>
>>>>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names = 
>>>>>>> NULL)
>>>>>>>
>>>>>>>
>>>>>>> out
>>>>>>>>
>>>>>>>      station YEAR RECORD
>>>>>>> 1      one   60      A
>>>>>>> 2      one   61      B
>>>>>>> 3      one   62      B
>>>>>>> 4      one   63      B
>>>>>>> 5      one   64      B
>>>>>>> 6      one   65      B
>>>>>>> 7      one   66      B
>>>>>>> 8      one   67      B
>>>>>>> 9      one   68      B
>>>>>>> 10     one   69      B
>>>>>>> 11     one   70      B
>>>>>>> 12     one   71      C
>>>>>>> 13     one   72      B
>>>>>>> 14     one   73      B
>>>>>>> 15     one   74      B
>>>>>>> 16     one   75      B
>>>>>>> 17     one   76      B
>>>>>>> 18     one   76      D
>>>>>>> 19     one   77      D
>>>>>>> 20     one   78      D
>>>>>>> 21     one   79      D
>>>>>>> 22     one   80      D
>>>>>>> 23     one   81      D
>>>>>>> 24     one   82      D
>>>>>>> 25     one   83      D
>>>>>>> 26     two   60      B
>>>>>>> 27     two   61      B
>>>>>>> 28     two   62      B
>>>>>>> 29     two   63      B
>>>>>>> 30     two   64      B
>>>>>>> 31     two   65      B
>>>>>>> 32     two   66      B
>>>>>>> 33     two   67      B
>>>>>>> 34     two   68      B
>>>>>>> 35     two   69      B
>>>>>>> 36     two   70      B
>>>>>>> 37     two   71      B
>>>>>>> 38     two   72      B
>>>>>>> 39     two   73      B
>>>>>>> 40     two   74      B
>>>>>>> 41     two   75      B
>>>>>>> 42     two   76      B
>>>>>>> 43     two   77      B
>>>>>>> 44     two   78      B
>>>>>>> 45     two   79      B
>>>>>>> 46     two   80      B
>>>>>>> 47     two   81      B
>>>>>>> 48     two   82      D
>>>>>>> 49     two   83      E
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Bert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Bert Gunter
>>>>>>>
>>>>>>> "The trouble with having an open mind is that people keep coming 
>>>>>>> along
>>>>>>> and
>>>>>>> sticking things into it."
>>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>>
>>>>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help <
>>>>>>> r-help at r-project.org> wrote:
>>>>>>>
>>>>>>> dplyr may have something for this, but in base R I think the 
>>>>>>> following
>>>>>>>>
>>>>>>> does
>>>>>>>
>>>>>>>> what you want.  I've shortened the name of your data set to 'd'.
>>>>>>>>
>>>>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1)
>>>>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1)
>>>>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL, YEAR_TO=NULL)
>>>>>>>>
>>>>>>>>
>>>>>>>> Bill Dunlap
>>>>>>>> TIBCO Software
>>>>>>>> wdunlap tibco.com
>>>>>>>>
>>>>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) <
>>>>>>>> david.hutchinson at canada.ca> wrote:
>>>>>>>>
>>>>>>>> I have a returned tibble of station operational record similar 
>>>>>>>> to the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> data.collection
>>>>>>>>>>
>>>>>>>>> # A tibble: 5 x 4
>>>>>>>>>     STATION_NUMBER YEAR_FROM YEAR_TO RECORD
>>>>>>>>>              <chr>     <int>   <int>  <chr>
>>>>>>>>> 1        07EA001      1960    1960    QMS
>>>>>>>>> 2        07EA001      1961    1970    QMC
>>>>>>>>> 3        07EA001      1971    1971    QMM
>>>>>>>>> 4        07EA001      1972    1976    QMC
>>>>>>>>> 5        07EA001      1977    1983    QRC
>>>>>>>>>
>>>>>>>>> I would like to reshape this to one operational record (row) 
>>>>>>>>> per year
>>>>>>>>>
>>>>>>>> per
>>>>>>>
>>>>>>>> station. Something like:
>>>>>>>>>
>>>>>>>>> 07EA001              1960      QMS
>>>>>>>>> 07EA001              1961      QMC
>>>>>>>>> 07EA001              1962      QMC
>>>>>>>>> 07EA001              1963      QMC
>>>>>>>>> ...
>>>>>>>>> 07EA001              1971      QMM
>>>>>>>>>
>>>>>>>>> Can this be done in dplyr easily?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>>
>>>>>>>>> ______________________________________________
>>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>>>> posting-guide.html
>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>>> posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>
>>>>>>>>
>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>> posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>>
>>>>>>          [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>>>> ng-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>>
>>>>>
>>>>> This email message may contain legally privileged 
>>>>> and/or...{{dropped:2}}
>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>>> ng-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>     [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide 
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>
>>
>> This email message may contain legally privileged and/or confidential 
>> information.  If you are not the intended recipient(s), or the 
>> employee or agent responsible for the delivery of this message to the 
>> intended recipient(s), you are hereby notified that any disclosure, 
>> copying, distribution, or use of this email message is prohibited.  If 
>> you have received this message in error, please notify the sender 
>> immediately by e-mail and delete this email message from your 
>> computer. Thank you.
>>
> 


This email message may contain legally privileged and/or...{{dropped:2}}



More information about the R-help mailing list