[R] Fastest Way to Divide Elements of Row With Its RowSum

William Dunlap wdunlap at tibco.com
Thu Sep 17 19:02:36 CEST 2009


> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Thomas Lumley
> Sent: Thursday, September 17, 2009 6:59 AM
> To: William Revelle
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Fastest Way to Divide Elements of Row With Its RowSum
> 
> On Thu, 17 Sep 2009, William Revelle wrote:
> 
> > At 2:40 PM +0900 9/17/09, Gundala Viswanath wrote:
> >> I have a data frame (dat). What I want to do is, for each row,
> >> divide each element by the sum of that row.
> >> 
> >> The number of rows can be large (> 1 million).
> >> Is there a faster way than doing it this way?
> >> 
> >> datnorm;
> >> for (rw in 1:length(dat)) {
> >>     tmp <- dat[rw,]/sum(dat[rw,])
> >>     datnorm <- rbind(datnorm, tmp);
> >> }
> >> 
> >> 
> >> - G.V.
> >
> >
> > datnorm <- dat/rowSums(dat)
> >
> > this will be faster if dat is a matrix rather than a data.frame.
> >
> 
> Even if it's a data frame and he needs a data frame answer it 
> might be faster to do
>    mat<-as.matrix(dat)
>    matnorm<-mat/rowSums(mat)
>    datnorm<-as.data.frame(matnorm)
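
For illustration, here is that round trip on a small made-up data
frame (the toy data and the final check are my own additions, not
from the thread). The last step converts the normalized matrix, so
the result is a data frame whose rows each sum to 1:

```r
# Toy data, purely illustrative
dat <- data.frame(a = c(1, 2, 3), b = c(3, 2, 1))

mat     <- as.matrix(dat)          # one numeric matrix copy of the data
matnorm <- mat / rowSums(mat)      # vector is recycled down the columns
datnorm <- as.data.frame(matnorm)  # back to a data frame

rowSums(datnorm)                   # every row should now sum to 1
```

The division works row-wise because a matrix is stored column by
column, so element [i, j] lines up with the i-th row sum.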

If the data.frame has many more rows than columns and the
number of rows is large (e.g., dimensions 10^6 x 20) you may
find that you run out of space converting it to a matrix.  You
can use much less space by looping over the columns, both
to compute the row sums and to do the division.  E.g., the
following should require only 1 (maybe 2) column's worth of
scratch space:

f2 <- function(x){
   stopifnot(is.data.frame(x), ncol(x)>=1)
   # accumulate the row sums one column at a time
   rowsum <- x[[1]]
   if(ncol(x)>1) for(i in 2:ncol(x))
      rowsum <- rowsum + x[[i]]
   # divide each column by the row sums
   for(i in 1:ncol(x))
      x[[i]] <- x[[i]] / rowsum
   x
}

For a 10^6 by 20 all-numeric data.frame this runs in 13 seconds
on my machine, but things like x/rowSums(x) run out of memory.
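
A quick way to convince yourself that the column-wise approach gives
the same answer as x/rowSums(x) is to compare the two on a small
data.frame. The sketch below is only a sanity check on made-up data;
the name normalize_by_column is hypothetical and the function just
restates the idea of f2 so the example is self-contained:

```r
# Column-at-a-time row normalization (same idea as f2; name is illustrative)
normalize_by_column <- function(x) {
  stopifnot(is.data.frame(x), ncol(x) >= 1)
  total <- x[[1]]
  for (j in seq_along(x)[-1])      # no iterations when there is one column
    total <- total + x[[j]]
  for (j in seq_along(x))
    x[[j]] <- x[[j]] / total
  x
}

set.seed(42)                        # made-up test data
small <- as.data.frame(matrix(runif(15), nrow = 5))

by_column <- normalize_by_column(small)
direct    <- small / rowSums(small)

all.equal(as.matrix(by_column), as.matrix(direct))
```

This says nothing about memory use at 10^6 rows; it only checks
that the two computations agree.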

When working with data.frames it generally pays to think a column
at a time instead of a row at a time.
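
If you want to see the difference on your own machine, something
like the following rough sketch compares the two styles on a
modest data.frame. The sizes and data are arbitrary choices of mine,
and the timings will vary from machine to machine, so no numbers
are quoted here:

```r
set.seed(1)
x <- as.data.frame(matrix(runif(2000 * 5), ncol = 5))  # made-up data

# Row at a time: every x[i, ] builds a one-row data.frame
by_row <- x
t_row <- system.time(
  for (i in seq_len(nrow(x)))
    by_row[i, ] <- x[i, ] / sum(x[i, ])
)

# Column at a time: a few whole-vector operations
t_col <- system.time({
  total  <- Reduce(`+`, x)          # adds the columns together elementwise
  by_col <- x
  for (j in seq_along(x))
    by_col[[j]] <- x[[j]] / total
})

rbind(row = t_row, col = t_col)[, "elapsed"]
all.equal(as.matrix(by_row), as.matrix(by_col), check.attributes = FALSE)
```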

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap at tibco.com

> 
> The other advantage, apart from speed, of doing it with 
> dat/rowSums(dat) rather than the loop is he gets the right 
> answer. The loop goes from 1 to the number of columns if dat 
> is a data frame and 1 to the number of entries if dat is a 
> matrix, not from 1 to the number of rows.
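
A small illustration of that point, on toy data of my own (not
from the thread): length() counts columns for a data frame and
entries for a matrix, while nrow() is what the loop needed.

```r
dat <- data.frame(a = 1:5, b = 6:10)  # toy data: 5 rows, 2 columns
mat <- as.matrix(dat)

length(dat)  # 2, the number of columns
length(mat)  # 10, the number of entries
nrow(dat)    # 5, the number of rows
```
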
> 
>       -thomas
> 
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
> 



