[R] Competing with SPSS and SAS: improving code that loops throughrows (data manipulation)

Dimitri Liakhovitski ld7631 at gmail.com
Mon Mar 29 22:16:59 CEST 2010


Would like to thank every one once more for your great help.
I was able to reduce the time from god knows how many hours to about 2 minutes!
Really appreciate it!
Dimitri

On Sat, Mar 27, 2010 at 11:43 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 03/26/2010 06:40 PM, Dimitri Liakhovitski wrote:
>> My sincere apologies if it looked large. Let me try again with less code.
>> It's hard to do less than that. In fact - there is nothing in this
>> code but 1 formula and many loops, which is the problem I am not sure
>> how to solve.
>> I also tried to be as clear as possible with the comments.
>> Dimitri
>>
>> ## START OF THE CODE TO PRODUCE SMALL DATA EXAMPLE
>>   set.seed(123)
>>   data<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0,
>> sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2)))
>>   data                   # "data" it is the data frame to work with
>> ## END OF THE CODE TO PRODUCE SMALL DATA EXAMPLE. In real life "data"
>> would contain up to 150-200 rows PER SUBGROUP
>>
>> ### Specifying useful parameters used in the slow code below:
>> vars<-names(data)[2:3]                    # names of variables used in
>> transformation; in real life - up to 50-60 variables
>> group.var<-names(data)[1]                # name of the grouping variable
>> subgroups<-levels(data[[group.var]])   # names of subgroups; in real
>> life - up to 30 subgroups
>>
>> # OBJECTIVE:
>> # Need to create new variables based on the old ones (a & b)
>> # For each new variable, the value in a given row is a function of (a)
>> 2 constants (that have several levels each),
>> # (b) value of the original variable (e.g., a.ind.to.max"), and the
>> value in the previous row on the same new variable
>> # Plus - it has to be done by subgroup (variable "group")
>>
>> # Defining 2 constants:
>> constant1<-c(1:3)                # constant 1 used in transformation -
>> has 3 levels, in real life - up to 7 levels
>> constant2<-seq(.15,.45,.15)  # constant 2 used in transformation - has
>> 3 levels, in real life - up to 7 levels
>>
>> ### CODE THAT IS SLOW. Reason - too many loops with the inner-most
>> loop being very slow - as it is looping through rows:
>>
>> for(var in vars){                               # looping through variables
>>   for(c1 in 1:length(constant1)){        # looping through values of constant1
>>        for(c2 in 1:length(constant2)){   # looping through values of constant2
>>          d=log(0.5)/constant1[c1]
>>           l=-log(1-constant2[c2])
>>           name<-paste(var,constant1[c1],constant2[c2]*100,".transf",sep=".")
>>           data[[name]]<-NA
>>           for(subgroup in subgroups){     # looping through subgroups
>>             data[data[[group.var]] %in% subgroup, name][1] =
>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>> var][1]*l*10)))
>>
>>       ### THIS SECTION IS THE SLOWEST - BECAUSE I AM LOOPING THROUGH ROWS:
>>              for(case in 2:nrow(data[data[[group.var]] %in% subgroup,
>> ])){ # looping through rows
>>                data[data[[group.var]] %in% subgroup, name][case]=
>> 1-((1-data[data[[group.var]] %in% subgroup,
>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>> var][case]*l*10)))
>>              }
>>       ### END OF THE SLOWEST SECTION (INNERMOST LOOP)
>
> This is a good place to start! Look at the inner loop
>
> I notice there are a lot of expressions whose values don't change, but
> are calculated repeatedly, e.g., exp(1) and data[[group.var]] %in%
> subgroup. Let's move those common subexpressions out of the loop, e.g.,
>
> exp1 <- exp(1)
> idx <- data[[group.var]] %in% subgroup
>
> for(case in 2:nrow(data[idx,])) {
>    data[idx, name][case]=
>        1-((1-data[idx, name][case-1]*exp1^d) /
>           (exp1^(data[idx, var][case]*l*10)))
> }
>
> wow that's already looking better! Looks like data[idx,var] could also
> come out of the loop, and data[idx, name], too, so long as we assign
> back in at the end
>
> exp1 <- exp(1)
> idx <- data[[group.var]] %in% subgroup
> a <- data[idx, var]
> x <- data[idx, name]
>
> for(case in 2:length(x)) {
>    x[case] <- 1 - (1 - x[case-1]*exp1^d) /
>        (exp1^(a[case]*l*10))
> }
>
> data[idx, name] <- x
>
> Still some redundant calculations, e.g., exp1^d, l*10. Some of the
> things we've moved out of the innermost loop can be moved even further
> out (e.g., exp1 all the way to the top). And a calculation like
>
>  exp1^(a[case] * l * 10)
>
> would be better written out of the loop as
>
>  b <- exp1^(a * l * 10)
>
> Why not try re-implementing your code with these ideas in mind? It'll be
> important to make sure that the results of your original and new
> implementation are the same (using all.equal, for instance). The 'CODE
> THAT IS SLOW' part of your original code took 0.35s to execute on my
> machine (using system.time), whereas my revised code with changes like
> that above, and a couple of other small things, ran in 0.01s.
>
> What's your revised code look like now?
>
> Martin
>
>>           }
>>       }
>>    }
>> }
>>
>> ### END OF THE CODE
>>
>> On Fri, Mar 26, 2010 at 5:25 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>>> Dmitri:
>>>
>>> If you follow the R posting guide you're more likely to get useful replies.
>>> In particular it asks for **small** reproducible examples -- your example is
>>> far more code then I care to spend time on anyway (others may be more
>>> willing or more able to do so of course). I suggest you try (if you haven't
>>> already):
>>>
>>> 1. Profiling the code using Rprof to isolate where the time is spent.And
>>> then...
>>>
>>> 2. Writing a **small** reproducible example to exercise that portion of the
>>> code and post it with your question to the list. If you need to...
>>> Typically, if you do these things you'll  figure out how to fix the
>>> situation on your own.
>>>
>>> Cheers,
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Statistics
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>>> Behalf Of Dimitri Liakhovitski
>>> Sent: Friday, March 26, 2010 2:06 PM
>>> To: r-help
>>> Subject: [R] Competing with SPSS and SAS: improving code that loops
>>> throughrows (data manipulation)
>>>
>>> Dear R-ers,
>>>
>>> In my question there are no statistics involved - it's all about data
>>> manipulation in R.
>>> I am trying to write a code that should replace what's currently being
>>> done in SAS and SPSS. Or, at least, I am trying to show to my
>>> colleagues R is not much worse than SAS/SPSS for the task at hand.
>>> I've written a code that works but it's too slow. Probably because
>>> it's looping through a lot of things. But I am not seeing how to
>>> improve it. I've already written a different code but it's 5 times
>>> slower than this one. The code below takes me slightly above 5 sec for
>>> the tiny data set. I've tried using it with a real one - was not done
>>> after hours.
>>> Need help of the list! Maybe someone will have an idea on how to
>>> increase the efficiency of my code (just one block of it - in the
>>> "DATA TRANSFORMATION" Section below)?
>>>
>>> Below - I am creating the data set whose structure is similar to the
>>> data sets the code should be applied to. Also - I have desribed what's
>>> actually being done - in comments.
>>> Thanks a lot to anyone for any suggestion!
>>>
>>> Dimitri
>>>
>>> ###### CREATING THE TEST DATA SET ################################
>>>
>>> set.seed(123)
>>> data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10)
>>> ,a=abs(round(rnorm(20)*10,0)),
>>> b=abs(round(rnorm(20)*100,0)))
>>> data
>>> dim(data)[1]  # !!! In real life I might have up to 150 (!) rows
>>> (weeks) within each subgroup
>>>
>>> ### Specifying parameters used in the code below:
>>> vars<-names(data)[3:4] # names of variables to be transformed
>>> nr.vars<-length(vars) # number of variables to be transformed;  !!!
>>> in real life I'll have to deal with up to 50-60 variables, not 2.
>>> group.var<-names(data)[1] # name of the grouping variable
>>> subgroups<-levels(data[[group.var]]) # names of subgroups;  !!! in
>>> real life I'll have up to 20-25 subgroups, not 2.
>>>
>>> # For EACH subgroup: indexing variables a and b to their maximum in
>>> that subgroup;
>>> # Further, I'll have to use these indexed variables to build the new ones:
>>> for(i in vars){
>>>        new.name<-paste(i,".ind.to.max",sep="")
>>>        data[[new.name]]<-NA
>>> }
>>>
>>> indexed.vars<-names(data)[grep("ind.to.max$", names(data))] #
>>> variables indexed to subgroup max
>>> for(subgroup in subgroups){
>>>        data[data[[group.var]] %in%
>>> subgroup,indexed.vars]<-lapply(data[data[[group.var]] %in%
>>> subgroup,vars],function(x){
>>>                y<-x/max(x)
>>>                return(y)
>>>        })
>>> }
>>> data
>>>
>>> ############# DATA TRANSFORMATION #########################################
>>>
>>> # Objective: Create new variables based on the old ones (a and b ind.to.max)
>>> # For each new variable, the value in a given row is a function of (a)
>>> 2 constants (that have several levels each),
>>> # (b) the corresponding value of the original variable (e.g.,
>>> a.ind.to.max"), and the value in the previous row on the same new
>>> variable
>>> # PLUS: - it has to be done by subgroup (variable "group")
>>>
>>> constant1<-c(1:3)            # constant 1 used for transformation -
>>> has 3 levels;  !!! in real life it will have up to 7 levels
>>> constant2<-seq(.15,.45,.15)  # constant 2 used for transformation -
>>> has 3 levels;  !!! in real life it will have up to 7 levels
>>>
>>> # CODE THAT IS TOO SLOW (it uses parameters specified in the previous
>>> code section):
>>> start1<-Sys.time()
>>> for(var in indexed.vars){     # looping through variables
>>>  for(c1 in 1:length(constant1)){     # looping through levels of constant1
>>>          for(c2 in 1:length(constant2)){    # looping through levels of
>>> constant2
>>>      d=log(0.5)/constant1[c1]
>>>      l=-log(1-constant2[c2])
>>>
>>> name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..t
>>> ransf",sep=".")
>>>      data[[name]]<-NA
>>>      for(subgroup in subgroups){     # looping through subgroups
>>>        data[data[[group.var]] %in% subgroup, name][1] =
>>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>>> var][1]*l*10)))  # this is just the very first row of each subgroup
>>>        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
>>>  # looping through the remaining rows of the subgroup
>>>          data[data[[group.var]] %in% subgroup, name][case]=
>>> 1-((1-data[data[[group.var]] %in% subgroup,
>>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>>> var][case]*l*10)))
>>>                        }
>>>                }
>>>        }
>>>  }
>>> }
>>> end1<-Sys.time()
>>> print(end1-start1) # Takes me ~0.53 secs
>>> names(data)
>>> data
>>>
>>> --
>>> Dimitri Liakhovitski
>>> Ninah.com
>>> Dimitri.Liakhovitski at ninah.com
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>>
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>



-- 
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com



More information about the R-help mailing list