[R] Yet another set of codes to optimize

William Dunlap wdunlap at tibco.com
Wed Dec 17 19:40:57 CET 2008


> [R] Yet another set of codes to optimize
> Daren Tan daren76 at hotmail.com
> Fri Dec 5 03:41:23 CET 2008
>
> I have problems converting my dataset from long to wide format. Previous
> attempts using reshape package and aggregate function were unsuccessful
> as they took too long. Apparently, my simplified solution also lasted
> as long.
>
> My complete codes is given below. When sample.size = 10000, the
> execution takes about 20 seconds. But sample.size = 100000 seems to take
> eternity. My actual sample.size is 15000000 i.e. 15 million.
>
> sample.size <- 10000
>
> m <- data.frame(Name=sample(1:100000, sample.size, T),
>    Type=sample(1:1000, sample.size, T),
>    Predictor=sample(LETTERS[1:10], sample.size, T))
>
> res <- function(m) {
>     m.12.unique <- unique(m[,1:2])
>     m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
>     v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
>     v2 <- c(sort(unique(m[,3])))
>     res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
>     m.ids <- paste(m[,1], m[,2], sep=".")
>     for(i in 1:nrow(m)) {
>       x <- m.ids[i]
>       y <- m[i,3]
>       res[x, y] <- res[x, y] + 1
>     }
>    res <- data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)
>    colnames(res) <- c("Name", "Type", v2)
>    return(res)
> }
>
> res(m)

Your for loop is tabulating the items in m.ids and m[,3]
so think of using table().  E.g., replace
    res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
    for(i in 1:nrow(m)) {
      x <- m.ids[i]
      y <- m[i,3]
      res[x, y] <- res[x, y] + 1
    }
with
    res<-table(factor(m.ids,levels=v1), factor(m[,3]))
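
To see that the two forms give the same counts, here is a small sketch on
made-up data (the data frame toy and its values are invented for
illustration and are not from the original post).  Giving factor() explicit
levels is what fixes the row and column order of the table.

```r
# Toy data, invented for illustration only.
toy <- data.frame(Name = c(1, 1, 2, 2, 1),
                  Type = c(7, 7, 3, 3, 7),
                  Predictor = c("A", "B", "A", "A", "A"),
                  stringsAsFactors = FALSE)

ids <- paste(toy$Name, toy$Type, sep = ".")
v1 <- c("1.7", "2.3")                 # the row order we want
v2 <- sort(unique(toy$Predictor))     # the column order we want

# Loop version, as in the original function.
loopRes <- matrix(0, nrow = length(v1), ncol = length(v2),
                  dimnames = list(v1, v2))
for (i in seq_len(nrow(toy))) {
    x <- ids[i]
    y <- toy$Predictor[i]
    loopRes[x, y] <- loopRes[x, y] + 1
}

# table() version: the factor levels determine rows and columns.
tabRes <- table(factor(ids, levels = v1), factor(toy$Predictor, levels = v2))

# Compare cell by cell (unclass() drops the "table" class first).
sameCounts <- all(loopRes == unclass(tabRes))
sameCounts
```

The loop touches one matrix cell per input row, while table() does the whole
tabulation in a single vectorized call.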

There is a bit of trickiness in putting this table into
the data.frame.  Since as.data.frame(tableObject) works very
differently than as.data.frame(matrixObject), the naive
    data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)
fails.  You need to convert the table res into a matrix with
the same data, dimensions, and dimnames.
    data.frame(m.12.unique[,1], m.12.unique[,2], as.matrix(res), row.names=NULL)
also fails because a "table" object is a "matrix" object so
as.matrix(tableObject) returns its input, unchanged.

as(res,"matrix") seems to work, as does the wordier
but more explicit array(res,dim(res),dimnames(res)).
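
A small sketch of the difference, again on made-up data (the object names
tab, long, mat and wide are only for illustration).  It uses the array()
form, which rebuilds the object from its data, dim and dimnames and so
drops the "table" class.

```r
# Toy table, invented for illustration only.
tab <- table(factor(c("1.7", "1.7", "2.3"), levels = c("1.7", "2.3")),
             factor(c("A", "B", "A")))

# as.data.frame() on a table gives the long format: one row per cell,
# with columns Var1, Var2 and Freq.
long <- as.data.frame(tab)
dim(long)

# Rebuild as a plain matrix: same data, dimensions and dimnames,
# but no longer of class "table".
mat <- array(tab, dim(tab), dimnames(tab))
inherits(mat, "table")

# A plain matrix contributes one data.frame column per matrix column,
# which is the wide format wanted here.
wide <- data.frame(Name = c(1, 2), Type = c(7, 3), mat, row.names = NULL)
wide
```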

res1 <-
function(m) {
    m.12.unique <- unique(m[,1:2])
    m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
    v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
    v2 <- c(sort(unique(m[,3])))
    m.ids <- paste(m[,1], m[,2], sep=".")
    res <- table(factor(m.ids,levels=v1), factor(m[,3]))
    res <- data.frame(m.12.unique[,1], m.12.unique[,2],
            as(res, "matrix"), row.names=NULL)
    colnames(res) <- c("Name", "Type", v2)
    return(res)
}

Here is a table of times for your original function, time0,
and this modified one, time1.  It looks like res1 eventually
becomes worse than linear, but for a much larger size than
your original.  sort() and unique() cannot have linear time
so they may be becoming factors at size=1e6.

          size   time0  time1
    1       10   0.012  0.012
    2      100   0.032  0.014
    3      200   0.061  0.016
    4      400   0.126  0.020
    5      800   0.286  0.028
    6     1000   0.383  0.033
    7     2000   2.337  0.054
    8     4000   8.578  0.100
    9     8000  39.955  0.214
    10   10000  68.767  0.318
    11   20000 327.973  1.057
    12  100000      NA  3.021
    13 1000000      NA 89.881
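
The code used to collect these timings is not shown above.  One way to
gather a comparable table is sketched here, assuming elapsed seconds from
system.time().  The helpers makeData() and timeSizes() and the stand-in
function countPairs() are hypothetical names for illustration; makeData()
follows the data generation in the original post.

```r
# Hypothetical helper: simulate data as in the original post.
makeData <- function(sample.size) {
    data.frame(Name = sample(1:100000, sample.size, TRUE),
               Type = sample(1:1000, sample.size, TRUE),
               Predictor = sample(LETTERS[1:10], sample.size, TRUE))
}

# Hypothetical helper: time f() on simulated data of each size and
# return one row per size with the elapsed time in seconds.
timeSizes <- function(f, sizes) {
    elapsed <- sapply(sizes, function(n) {
        m <- makeData(n)
        system.time(f(m))[["elapsed"]]
    })
    data.frame(size = sizes, time = elapsed)
}

# Stand-in for res() or res1(): any function of the data frame works.
countPairs <- function(m) table(paste(m[, 1], m[, 2], sep = "."), m[, 3])

timings <- timeSizes(countPairs, c(10, 100, 1000))
timings
```

Passing res or res1 in place of countPairs would time those functions
instead; the numbers will of course depend on the machine and R version.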

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 


