[Rd] reshape scaling with large numbers of times/rows

Gabor Grothendieck ggrothendieck at gmail.com
Thu Aug 24 14:57:05 CEST 2006


If your Z in reality is not naturally numeric try representing it as a
factor and using
the numeric levels as your numbers and then put the level labels back on:

m <- n <- 5
DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25])
Zn <- as.numeric(DF$Z)
system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
system.time({Zn <- as.numeric(DF$Z)
   w2 <- xtabs(Zn ~ Y + X, DF)
   w2[w2 > 0] <- levels(DF$Z)[w2]
   w2[w2 == 0] <- NA
})


On 8/24/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> If for each combination of X and Y there is at most one Z and Z is numeric
> (it could be made so in your example) then you could use xtabs which is
> faster:
>
> > m <- n <- 10
> > DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m)))
> > system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
> [1] 0.33 0.00 0.33   NA   NA
> > system.time(w2 <- xtabs(Z ~ Y + X, DF))
> [1] 0.01 0.00 0.01   NA   NA
>
> On 8/24/06, Mitch Skinner <lists at arctur.us> wrote:
> > Hello,
> >
> > I'm mailing r-devel because I think the performance problem I'm having
> > is best solved by re-writing reshape() in C.  I've been reading the
> > "writing R extensions" documentation and I have some questions about
> > they best way to write the C bits, but first let me describe my problem:
> >
> > I'm trying to reshape a long data frame with ~70 subjects measured at
> > ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> > a wide data frame with one column per locus ("time").  On my data
> > (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> > 3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
> > close to how my data actually looks):
> >
> > > version
> >               _
> > platform       i686-redhat-linux-gnu
> > arch           i686
> > os             linux-gnu
> > system         i686, linux-gnu
> > status
> > major          2
> > minor          3.1
> > year           2006
> > month          06
> > day            01
> > svn rev        38247
> > language       R
> > version.string Version 2.3.1 (2006-06-01)
> > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> > genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > > system.time(testWide <- reshape(test, v.names=c("genotype"),
> > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> > [1] 1107.506  120.755 1267.568    0.000    0.000
> >
> > I believe that this can be done a lot faster, and I think the problem
> > comes from this loop in reshape.R (in reshapeWide):
> >
> >        for(i in seq(length = length(times))) {
> >            thistime <- data[data[,timevar] %in% times[i],]
> >            rval[,varying[,i]] <-
> > thistime[match(rval[,idvar],thistime[,idvar]),
> > v.names]
> >        }
> >
> > I don't really understand everything going on under the hood here, but I
> > believe the complexity of this loop is something like
> > O(length(times)*(nrow(data)+(nrow(rval)*length(varying))).  The profile
> > shows the lion's share (90%) of the time being spent in [.data.frame.
> >
> > What I'd like to do is write a C loop to go through the source (long)
> > data frame once (rather than once per time), and put the values into the
> > destination rows/columns using hash tables to look up the right
> > row/column.
> >
> > Assuming the hashing is constant-time bounded, then the reshape becomes
> > O(nrow(data)*length(varying)).
> >
> > I'd like to use the abitrary-R-vector hashing functionality from
> > unique.c, but I'm not sure how to do it.  I think that functionality is
> > useful to other C code, but the functions that I'm interested in are not
> > part of the R api (they're defined with "static").  Assuming copy/paste
> > is out, I can see two options: 1. to remove "static" from the
> > declarations of some of the functions, and add prototypes for those
> > functions to a new src/library/stats/src/reshape.c file (or to one of
> > the header files in src/include), or 2. to add C functions to do the
> > reshaping to src/main/unique.c and call those from
> > src/library/stats/R/reshape.R.
> >
> > This is all assuming that it makes sense to include this in mainline
> > R--obviously I think it's worthwhile, and I'm surprised other people
> > aren't complaining more.  I would be happy to write/rewrite until y'all
> > are happy with how it's done, of course.
> >
> > I've written a proof-of-concept by copying and pasting the hashing
> > functions, which (on the above data frame) runs 20 times faster than the
> > R version of reshape.  It still needs some debugging, to put it mildly
> > (doesn't work properly on reals), but the basic idea appears to work.
> >
> > The change I made to the reshape R function looks like this:
> > =====================
> >         for(i in seq(length = length(times))) {
> > -            thistime <- data[data[,timevar] %in% times[i],]
> > -            rval[,varying[,i]] <-
> > thistime[match(rval[,idvar],thistime[,idvar]),
> > -                                           v.names]
> > +          for (j in seq(length(v.names))) {
> > +            rval[,varying[j,i]] <-
> > rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> > +                                       length.out=nrow(rval))
> > +          }
> >         }
> >
> > +        colMap <- match(varying, names(rval))
> > +        .Call("do_reshape_wide",
> > +              data[[idvar]], data[[timevar]], data[v.names],
> > +              rval, colMap,
> > +              v.names, times, rval[[idvar]])
> > +
> >         if (!is.null(new.row.names))
> >             row.names(rval) <- new.row.names
> > =====================
> >
> > This part:
> > rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
> > is to initialize the output with appropriately-typed vectors full of
> > NAs; if there's a better/right way to do this please let me know.
> >
> > The do_reshape_wide C function looks like this:
> > =====================
> > SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData,
> >                     SEXP wideFrame, SEXP colMap,
> >                     SEXP vnames, SEXP times, SEXP wideIds) {
> >    HashData idHash, timeHash;
> >    int longRows, numVarying;
> >    int rowNum, varying;
> >    int wideRow, curTime;
> >    SEXP wideCol, longCol;
> >
> >    HashTableSetup(wideIds, &idHash);
> >    PROTECT(idHash.HashTable);
> >    DoHashing(wideIds, &idHash);
> >
> >    HashTableSetup(times, &timeHash);
> >    PROTECT(timeHash.HashTable);
> >    DoHashing(times, &timeHash);
> >
> >    longRows = length(longIds);
> >    numVarying = length(vnames);
> >
> >    for (rowNum = 0; rowNum < longRows; rowNum++) {
> >        /* Lookup returns 1-based answers */
> >        wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
> >        curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;
> >
> >        for (varying = 0; varying < numVarying; varying++) {
> >            /* colMap is 1-based */
> >            wideCol = VECTOR_ELT(wideFrame,
> >                                 INTEGER(colMap)[(numVarying * curTime)
> >                                                 + varying] - 1);
> >            longCol = VECTOR_ELT(longData, varying);
> >
> >            SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
> > rowNum));
> >        }
> >    }
> >    UNPROTECT(2);
> > }
> > =====================
> >
> > None examples I recall from "Writing R extensions" had void C function
> > return types; is that a rule?
> >
> > I've put the code up here:
> > http://arctur.us/r/creshape.tar.gz
> >
> > I've never tried to make a package before, but you should be able to
> > just do a R CMD INSTALL on it, if you want to try it out.  Here's an R
> > session using that code:
> > > require(creshape)
> > Loading required package: creshape
> > [1] TRUE
> > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> > genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > > system.time(testWide <- creshape(test, v.names=c("genotype"),
> > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> > [1] 60.756  1.540 62.598  0.000  0.000
> > > system.time(testWide <- reshape(test, v.names=c("genotype"),
> > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> > [1] 1278.231   78.389 1387.739    0.000    0.000
> >
> > Any comments are appreciated,
> > Mitch
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>



More information about the R-devel mailing list