[R] Competing with SPSS and SAS: improving code that loops throughrows (data manipulation)

Sat Mar 27 04:58:54 CET 2010

Hi Dimitri

Your code is complex, so this won't be easy for anyone to deparse.

I think part of the issue is in the calculation you are doing in your innermost loop

data[data[[group.var]] %in% subgroup, name][case]=
1-((1-data[data[[group.var]] %in% subgroup,
name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
var][case]*l*10)))

This looks to me to be a lag-1 calculation, and not all lagged calculations are vectorizable, so
this part may have to be done in a loop.  If so, working on a matrix instead of a dataframe
can help speed up code dramatically, especially with big dataframes (I've seen 10-fold
increases in speed by working with big matrices with no row or column names
instead of dataframes).

Pre-declare a matrix big enough to hold all your results outside of the loop, 
then do your calculations using the matrix inside the loop.  You can convert
the matrix to a dataframe after the loop completes if needed.

HTH

Steven McKinney

________________________________________
From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf Of Dimitri Liakhovitski [ld7631 at gmail.com]
Sent: March 26, 2010 6:40 PM
To: Bert Gunter
Cc: r-help
Subject: Re: [R] Competing with SPSS and SAS: improving code that loops throughrows (data manipulation)

My sincere apologies if it looked large. Let me try again with less code.
It's hard to do less than that. In fact - there is nothing in this
code but 1 formula and many loops, which is the problem I am not sure
how to solve.
I also tried to be as clear as possible with the comments.
Dimitri

## START OF THE CODE TO PRODUCE SMALL DATA EXAMPLE
  set.seed(123)
  data<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0,
sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2)))
  data                   # "data" it is the data frame to work with
## END OF THE CODE TO PRODUCE SMALL DATA EXAMPLE. In real life "data"
would contain up to 150-200 rows PER SUBGROUP

### Specifying useful parameters used in the slow code below:
vars<-names(data)[2:3]                    # names of variables used in
transformation; in real life - up to 50-60 variables
group.var<-names(data)[1]                # name of the grouping variable
subgroups<-levels(data[[group.var]])   # names of subgroups; in real
life - up to 30 subgroups

# OBJECTIVE:
# Need to create new variables based on the old ones (a & b)
# For each new variable, the value in a given row is a function of (a)
2 constants (that have several levels each),
# (b) value of the original variable (e.g., a.ind.to.max"), and the
value in the previous row on the same new variable
# Plus - it has to be done by subgroup (variable "group")

# Defining 2 constants:
constant1<-c(1:3)                # constant 1 used in transformation -
has 3 levels, in real life - up to 7 levels
constant2<-seq(.15,.45,.15)  # constant 2 used in transformation - has
3 levels, in real life - up to 7 levels

### CODE THAT IS SLOW. Reason - too many loops with the inner-most
loop being very slow - as it is looping through rows:

for(var in vars){                               # looping through variables
  for(c1 in 1:length(constant1)){        # looping through values of constant1
       for(c2 in 1:length(constant2)){   # looping through values of constant2
         d=log(0.5)/constant1[c1]
          l=-log(1-constant2[c2])
          name<-paste(var,constant1[c1],constant2[c2]*100,".transf",sep=".")
          data[[name]]<-NA
          for(subgroup in subgroups){     # looping through subgroups
            data[data[[group.var]] %in% subgroup, name][1] =
1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
var][1]*l*10)))

      ### THIS SECTION IS THE SLOWEST - BECAUSE I AM LOOPING THROUGH ROWS:
             for(case in 2:nrow(data[data[[group.var]] %in% subgroup,
])){ # looping through rows
               data[data[[group.var]] %in% subgroup, name][case]=
1-((1-data[data[[group.var]] %in% subgroup,
name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
var][case]*l*10)))
             }
      ### END OF THE SLOWEST SECTION (INNERMOST LOOP)
          }
      }
   }
}

### END OF THE CODE

On Fri, Mar 26, 2010 at 5:25 PM, Bert Gunter <gunter.berton at gene.com> wrote:
> Dmitri:
>
> If you follow the R posting guide you're more likely to get useful replies.
> In particular it asks for **small** reproducible examples -- your example is
> far more code then I care to spend time on anyway (others may be more
> willing or more able to do so of course). I suggest you try (if you haven't
> already):
>
> 1. Profiling the code using Rprof to isolate where the time is spent.And
> then...
>
> 2. Writing a **small** reproducible example to exercise that portion of the
> code and post it with your question to the list. If you need to...
> Typically, if you do these things you'll  figure out how to fix the
> situation on your own.
>
> Cheers,
>
> Bert Gunter
> Genentech Nonclinical Statistics
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Dimitri Liakhovitski
> Sent: Friday, March 26, 2010 2:06 PM
> To: r-help
> Subject: [R] Competing with SPSS and SAS: improving code that loops
> throughrows (data manipulation)
>
> Dear R-ers,
>
> In my question there are no statistics involved - it's all about data
> manipulation in R.
> I am trying to write a code that should replace what's currently being
> done in SAS and SPSS. Or, at least, I am trying to show to my
> colleagues R is not much worse than SAS/SPSS for the task at hand.
> I've written a code that works but it's too slow. Probably because
> it's looping through a lot of things. But I am not seeing how to
> improve it. I've already written a different code but it's 5 times
> slower than this one. The code below takes me slightly above 5 sec for
> the tiny data set. I've tried using it with a real one - was not done
> after hours.
> Need help of the list! Maybe someone will have an idea on how to
> increase the efficiency of my code (just one block of it - in the
> "DATA TRANSFORMATION" Section below)?
>
> Below - I am creating the data set whose structure is similar to the
> data sets the code should be applied to. Also - I have desribed what's
> actually being done - in comments.
> Thanks a lot to anyone for any suggestion!
>
> Dimitri
>
> ###### CREATING THE TEST DATA SET ################################
>
> set.seed(123)
> data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10)
> ,a=abs(round(rnorm(20)*10,0)),
> b=abs(round(rnorm(20)*100,0)))
> data
> dim(data)[1]  # !!! In real life I might have up to 150 (!) rows
> (weeks) within each subgroup
>
> ### Specifying parameters used in the code below:
> vars<-names(data)[3:4] # names of variables to be transformed
> nr.vars<-length(vars) # number of variables to be transformed;  !!!
> in real life I'll have to deal with up to 50-60 variables, not 2.
> group.var<-names(data)[1] # name of the grouping variable
> subgroups<-levels(data[[group.var]]) # names of subgroups;  !!! in
> real life I'll have up to 20-25 subgroups, not 2.
>
> # For EACH subgroup: indexing variables a and b to their maximum in
> that subgroup;
> # Further, I'll have to use these indexed variables to build the new ones:
> for(i in vars){
>        new.name<-paste(i,".ind.to.max",sep="")
>        data[[new.name]]<-NA
> }
>
> indexed.vars<-names(data)[grep("ind.to.max$", names(data))] #
> variables indexed to subgroup max
> for(subgroup in subgroups){
>        data[data[[group.var]] %in%
> subgroup,indexed.vars]<-lapply(data[data[[group.var]] %in%
> subgroup,vars],function(x){
>                y<-x/max(x)
>                return(y)
>        })
> }
> data
>
> ############# DATA TRANSFORMATION #########################################
>
> # Objective: Create new variables based on the old ones (a and b ind.to.max)
> # For each new variable, the value in a given row is a function of (a)
> 2 constants (that have several levels each),
> # (b) the corresponding value of the original variable (e.g.,
> a.ind.to.max"), and the value in the previous row on the same new
> variable
> # PLUS: - it has to be done by subgroup (variable "group")
>
> constant1<-c(1:3)            # constant 1 used for transformation -
> has 3 levels;  !!! in real life it will have up to 7 levels
> constant2<-seq(.15,.45,.15)  # constant 2 used for transformation -
> has 3 levels;  !!! in real life it will have up to 7 levels
>
> # CODE THAT IS TOO SLOW (it uses parameters specified in the previous
> code section):
> start1<-Sys.time()
> for(var in indexed.vars){     # looping through variables
>  for(c1 in 1:length(constant1)){     # looping through levels of constant1
>          for(c2 in 1:length(constant2)){    # looping through levels of
> constant2
>      d=log(0.5)/constant1[c1]
>      l=-log(1-constant2[c2])
>
> name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..t
> ransf",sep=".")
>      data[[name]]<-NA
>      for(subgroup in subgroups){     # looping through subgroups
>        data[data[[group.var]] %in% subgroup, name][1] =
> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
> var][1]*l*10)))  # this is just the very first row of each subgroup
>        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
>  # looping through the remaining rows of the subgroup
>          data[data[[group.var]] %in% subgroup, name][case]=
> 1-((1-data[data[[group.var]] %in% subgroup,
> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
> var][case]*l*10)))
>                        }
>                }
>        }
>  }
> }
> end1<-Sys.time()
> print(end1-start1) # Takes me ~0.53 secs
> names(data)
> data
>
> --
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>

--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.