[R] reshape to wide format takes extremely long

Thu Sep 2 12:25:51 CEST 2010

Dear Dennis,

cast() from the reshape package is much faster in this case: about 4
seconds versus about 35 seconds for ddply() on my machine.

> system.time(bigtab <- ddply(big, .(study, subject, cycle, day),
      function(x) xtabs(obs ~ type, data = x)))
   user  system elapsed 
  35.36    0.12   35.53 
> system.time(bigtab2 <- cast(data = big,
      study + subject + cycle + day ~ type, value = "obs", fun = mean))
   user  system elapsed 
   4.09    0.00    4.09 

I have the feeling that ddply() has a lot of overhead when the number of
levels is large.
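
To illustrate the point about per-group overhead, here is a rough base R
sketch that fills the wide table in one vectorised step instead of calling
a function once per study/subject/cycle/day combination. It is only a
sketch under assumptions: the names key, groups, types, wide and bigtab3
are made up for this example, set.seed() is added so the toy data are
reproducible, and it assumes exactly one obs per group and type (a
duplicate would silently overwrite the earlier value rather than be
summed or averaged). I have not timed it against cast().

```r
# Toy data as in Dennis' message; the seed is an addition for reproducibility.
set.seed(1)
big <- data.frame(study = factor(rep(1:5, each = 10000)),
                  subject = factor(rep(1:100, each = 500)),
                  cycle = rep(rep(1:25, each = 20), 100),
                  day = rep(rep(1:5, each = 4), 500),
                  type = rep(c('ALB', 'ALP', 'ALT', 'AST'), 12500),
                  obs = rpois(50000, 70))

# One label per row identifying its study/subject/cycle/day combination.
# ddply() calls the supplied function once for each distinct label.
key <- with(big, paste(study, subject, cycle, day, sep = "|"))
groups <- unique(key)
length(groups)

# Column for each type; as.character() guards against type being a factor.
type_chr <- as.character(big$type)
types <- sort(unique(type_chr))

# Fill a groups-by-types matrix in one assignment via matrix indexing.
# Assumption: each (group, type) pair occurs at most once.
wide <- matrix(NA_real_, nrow = length(groups), ncol = length(types),
               dimnames = list(NULL, types))
wide[cbind(match(key, groups), match(type_chr, types))] <- big$obs

# Id columns come from the first row of each group, in the same order
# as unique(key).
bigtab3 <- cbind(big[!duplicated(key), c("study", "subject", "cycle", "day")],
                 wide)
rownames(bigtab3) <- NULL
dim(bigtab3)
```

The result should have one row per combination and one column per type,
like bigtab above. If a combination can contain several observations of
the same type, this shortcut is not appropriate and an aggregating
approach such as cast() with a summary function is the safer choice.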

HTH,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek
team Biometrie & Kwaliteitszorg
Gaverstraat 4
9500 Geraardsbergen
Belgium

Research Institute for Nature and Forest
team Biometrics & Quality Assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium

tel. + 32 54/436 185
Thierry.Onkelinx op inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey

> -----Original message-----
> From: r-help-bounces op r-project.org 
> [mailto:r-help-bounces op r-project.org] On Behalf Of Dennis Murphy
> Sent: Thursday 2 September 2010 11:34
> To: Coen van Hasselt
> CC: r-help op r-project.org
> Subject: Re: [R] reshape to wide format takes extremely long
> 
> Hi:
> 
> I did the following test using function ddply() in the plyr 
> package on a toy data frame with 50000 observations using 
> five studies, 20 subjects per study, 25 cycles per subject, 
> five days per cycle and four observations by type per day. No 
> date-time variable was included.
> 
> # Test data frame
> big <- data.frame(study = factor(rep(1:5, each = 10000)),
>                   subject = factor(rep(1:100, each = 500)),
>                   cycle = rep(rep(1:25, each = 20), 100),
>                   day = rep(rep(1:5, each = 4), 500),
>                   type = rep(c('ALB', 'ALP', 'ALT', 'AST'), 12500),
>                   obs = rpois(50000, 70) )
> > dim(big)
> [1] 50000     6
> 
> # 64-bit R on a Windows 7 box with 8 GB RAM and a 2.93 GHz
> # Core Duo chip.
> system.time(bigtab <- ddply(big, .(study, subject, cycle,
> day), function(x) xtabs(obs ~ type, data = x)))
>    user  system elapsed
>   30.22    0.02   30.60
> 
> > dim(bigtab)
> [1] 12500     8
> > head(bigtab)
>   study subject cycle day ALB ALP ALT AST
> 1     1       1     1   1  77  80  67  70
> 2     1       1     1   2  60  54  70  70
> 3     1       1     1   3  71  77  69  65
> 4     1       1     1   4  62  71  73  68
> 5     1       1     1   5  78  67  69  78
> 6     1       1     2   1  71  69  74  69
> > tail(bigtab)
>       study subject cycle day ALB ALP ALT AST
> 12495     5     100    24   5  75  83  72  70
> 12496     5     100    25   1  85  52  62  70
> 12497     5     100    25   2  79  64  84  68
> 12498     5     100    25   3  67  65  74  81
> 12499     5     100    25   4  62  86  66  80
> 12500     5     100    25   5  58  76  85  84
> 
> There may be an easier or more efficient way to do this with 
> melt() and cast() in the reshape package, but I moved on when I 
> couldn't figure it out within ten minutes (probably because I 
> was thinking 'xtabs of obs by type for 
> study/subject/cycle/day combinations - that's the ticket!'). :) 
> Packages sqldf and data.table are other viable options for 
> this sort of task, and now that there is a test data set to 
> play with, it would be interesting to see what else can be 
> done. I'd be surprised if this couldn't be done within a few 
> seconds, because the data frame is not that large.
> 
> Anyway, HTH,
> Dennis
> 
> 
> 
> On Thu, Sep 2, 2010 at 12:24 AM, Coen van Hasselt
> <coenvanhasselt op gmail.com>wrote:
> 
> > Hello,
> >
> > I have a data.frame with the following format:
> >
> > > head(clin2)
> >    Study Subject  Type      Obs Cycle Day       Date  Time
> > 1 A001101   10108   ALB 44.00000    98   1 2004-03-11 14:26
> > 2 A001101   10108   ALP 95.00000    98   1 2004-03-11 14:26
> > 3 A001101   10108   ALT 61.00000    98   1 2004-03-11 14:26
> > 5 A001101   10108   AST 33.00000    98   1 2004-03-11 14:26
> >
> > I want to transform this data.frame so that I have "Obs" columns for
> > each "Type". The full dataset is 45000 rows long. For a short subset
> > of 100 rows, reshaping takes 0.2 seconds, and produces what I want.
> > All columns are either numeric or character format (incl. date/time).
> >
> > > reshape(clin2, v.names = "Obs", timevar = "Type", direction = "wide",
> >       idvar = c("Study", "Subject", "Cycle", "Day", "Date", "Time"))
> >      Study Subject Cycle Day       Date  Time Obs.ALB Obs.ALP Obs.ALT Obs.AST
> > 1  A001101   10108    98   1 2004-03-11 14:26      44      95      61      33
> > 11 A001101   10108     1   1 2004-03-12 14:01      41      85      39      33
> > 21 A001101   10108     1   8 2004-03-22 10:34      40      90      70      34
> > 30 A001101   10108     1  15 2004-03-29 09:56      45      97      66      48
> > [........]
> >
> > However, when using the same reshape command for the full data.frame
> > of 45000 rows, it still wasn't finished when run overnight (8 GB RAM +
> > 8 GB swap in use).
> >
> > The time to process this data.frame from a 100-row subset to a 
> > 1000-row subset increases from 0.2 sec to 60 sec.
> >
> > I would greatly appreciate advice on why the time for reshaping
> > grows so steeply with the number of rows, and how I can do this
> > in an elegant way.
> >
> > Thanks!
> >
> > Coen.
> >
> > ______________________________________________
> > R-help op r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 
