[R] data.frame: data-driven column selections that vary by row??

Wed Apr 1 21:55:36 CEST 2015

David,

In data.table v1.9.5 (current development version, which you can get
from here: https://github.com/Rdatatable/data.table/wiki/Installation),
new features were added to both `melt` and `cast` for data.tables.
They both can handle multiple columns simultaneously. I think this
would be of interest for you..

Using 1.9.5, here's how I'd do it.

require(data.table) ## v1.9.5+
cols <- grep("^da2.*$", names(bw), value=TRUE)      ## (1)
splt <- split(cols, seq_len(length(cols)/2L))       ## (2)
vars <- unique(gsub("(.*?)_(.*$)", "\\1", cols))    ## (3)
vals <- unique(gsub("(.*?)_(.*$)", "\\2", cols))    ## (4)

ans1 = melt(setDT(bw), measure=splt, variable.name="disc",
value.name=vals) ## (5)
setattr(ans1$disc, 'levels', vars) ## (6)

Explanation:
-------------------

1. Get all cols you've to melt
2. Split them into column pairs that should be combined together
3. Get levels for 'variable' column
4. Get column names for molten result
5. Melt by providing list of columns with each element containing the
columns you'd want to combine together in the molten result directly.
6. Set levels for variable column appropriately.

Advantages:
------------------

1. melting by combining corresponding columns together, directly, is
straightforward and easy to understand, since that's the task you want
to perform. Having to combine all columns together and then split them
back seems roundabout.

2. "casting" (tidyr::spread internally uses reshape2::dcast) is a
relatively complicated operation, and in this case it can be
completely avoided which will save both time and memory (see benchmark
at the bottom of post). It also reorders the result which may not be
desirable.

3. In 'bw', columns `da20_dev_type` and `da2_dev_type` are type
'factor' while others are type 'numeric'. reshape2::melt (or)
tidyr::gather, since it combines all columns will have to coerce these
different types to a common type, here 'character'. So, you'll have to
convert the columns back to the right type after casting. I think
you'll agree that's unnecessary. `melt.data.table` preserves the type
as it combines only relevant columns together.

4. Since the operation is performed in a straightforward manner (and
in C for speed), it's incredibly fast *and* memory efficient.

Benchmark (on ~180,000 rows)
---------------------------------------------

library(tidyr)
library(dplyr)
require(data.table) ## v1.9.5+

# replacing timestamp so that rows for unique (for spread to work correctly)
bw.large = rbindlist(replicate(1e4, bw, simplify=FALSE))[, timestamp := .I][]
object.size(bw.large)/1024^2 # ~38MB

The data is 38MB, which is not at all large... but enough to illustrate.

# data.table
system.time({
cols <- grep("^da2.*$", names(bw), value=TRUE)      ## (1)
splt <- split(cols, seq_len(length(cols)/2L))       ## (2)
vars <- unique(gsub("(.*?)_(.*$)", "\\1", cols))    ## (3)
vals <- unique(gsub("(.*?)_(.*$)", "\\2", cols))    ## (4)

ans1 = melt(setDT(bw.large), measure=splt, variable.name="disc",
value.name=vals) ## (5)
setattr(ans1$disc, 'levels', vars) ## (6)
})
#    user  system elapsed
#   0.260   0.013   0.275

Memory used: 56MB

# tidyr
system.time({
ans2 <- gather(setDF(bw.large), key = "tmp", value = "value",
matches("^d[a-z]+[0-9]+"))
ans2 <- separate(ans2, tmp, c("disc", "var"), "_", extra = "merge")
ans2 <- spread(ans2, var, value)
})
#    user  system elapsed
#  15.818   1.128  17.063

Memory used : 750MB

And that's ~62x speedup.

HTH
Arun Srinivasan
Co-developer, data.table.

On Tue, Mar 31, 2015 at 8:35 PM, Tom Wright <tom at maladmin.com> wrote:
> Nice clean-up!!!
>
> On Tue, 2015-03-31 at 14:19 -0400, Ista Zahn wrote:
>> library(tidyr)
>> library(dplyr)
>> bw <- gather(bw, key = "tmp", value = "value",
>> matches("^d[a-z]+[0-9]+"))
>> bw <- separate(bw, tmp, c("disc", "var"), "_", extra = "merge")
>> bw <- spread(bw, var, value)
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.