[R] how to get rid of 2 for-loops and optimize runtime

Tue Oct 20 18:13:26 CEST 2009

> -----Original Message-----
> From: Ian Willems [mailto:ian.willems at uz.kuleuven.ac.be] 
> Sent: Tuesday, October 20, 2009 6:46 AM
> To: William Dunlap; r-help at r-project.org
> Subject: RE: [R] how to get rid of 2 for-loops and optimize runtime
> 
> Hi William,
> 
> Your programs works perfect and very fast for the table I'm 
> using right now (only one match per row)
> If I want to reuse this code other tables, it can match with 
> more than one row.
> Is it possible to adapt your code easily, if I have to sum 
> the values of last month from different rows?

You can use aggregate() with one of those
keys to sum up the values with a common key
value.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> 
> Thank u for your help
> regards,
> 
> Ian
> 
> 
> -----Oorspronkelijk bericht-----
> Van: William Dunlap [mailto:wdunlap at tibco.com] 
> Verzonden: maandag 19 oktober 2009 18:08
> Aan: Ian Willems; r-help at r-project.org
> Onderwerp: RE: [R] how to get rid of 2 for-loops and optimize runtime
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org 
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Ian Willems
> > Sent: Monday, October 19, 2009 6:50 AM
> > To: 'r-help at r-project.org'
> > Subject: [R] how to get rid of 2 for-loops and optimize runtime
> > 
> > Short: get rid of the loops I use and optimize runtime
> > 
> > Dear all,
> > 
> > I want to calculate for each row the amount of the month ago. 
> > I use a matrix with 2100 rows and 22 colums (which is still a 
> > very small matrix. nrows of other matrixes can easily be more 
> > then 100000)
> > 
> > Table before
> > Year  month quarter yearmonth Service ...  Amount
> > 2009  9        Q3            092009          A                
> > ...    120
> > 2009  9        Q3            092009          B                
> >  ...     80
> > 2009  8        Q3           082009           A                
> >   ...     40
> > 2009  7        Q3           072009           A                
> >    ...      50
> > 
> > The result I want
> > Year month  quarter yearmonth Service ...    Amount   
> amound_lastmonth
> > 2009 9      Q3      092009          A ...    120      40
> > 2009 9      Q3      092009          B ...    80       ...
> > 2009 8      Q3      082009          A ...    40       50
> > 2009 7      Q3      072009          A ...    50       ...
> > 
> > Table is not exactly the same but gives a good idea what I 
> > have and what I want
> > 
> > The code I have written (see below) does what I want but it 
> > is very very slow. It takes 129s for 400 rows. And the time 
> > gets four times higher each time I double the amount of rows.
> > I'm new in programming in R, but I found that you can use 
> > Rprof and summaryRprof to analyse your code (output see below)
> > But I don't really understand the output
> > I guess I need code that requires linear time and need to get 
> > rid of the 2 for loops.
> > can someone help me or tell me what else I can do to optimize 
> > my runtime
> > 
> > I use R 2.9.2
> > windows Xp service pack3
> > 
> > Thank you in advance
> > 
> > Best regards,
> > 
> > Willems Ian
> > 
> > 
> > *****************************
> > dataset[,5]= month
> > dataset[,3]= year
> > dataset[,22]= amount
> > dataset[,14]= servicetype
> > 
> > [CODE]
> > #for each row of the matrix check if each row has..
> > > for (j in 1:Number_rows) {
> > + sum<-0
> > + for(i in 1:Number_rows){
> > + if (dataset[j,14]== dataset[i,14]) #..the same service type
> > +   {if (dataset[j,18]== dataset[i,18]) # .. the same department
> > +        {if (dataset[j,5]== "1")  # if month=1, month ago is 
> > 12 and year is -1
> > +           {if ("12"== dataset[i,5])
> > +            {if ((dataset[j,3]-1)== dataset[i,3])
> > +
> > +         { sum<-sum + dataset[i,22]}
> > +      }}
> > +      else {
> > +       if ((dataset[j,5]-1)== dataset[i,5]) " if month != 1, 
> > month ago is month -1
> > +         { if (dataset[j,3]== dataset[i,3])
> > +         {sum<-sum + dataset[i,22]}
> > +      }}}}}}
> 
> match() is often useful for quickly finding the locations of
> many items in a vector.  It has no special methods for data.frames
> so you must combine the columns of interest into a character
> vector of keys and use match on the key vectors.  E.g.
> 
> # your test data in a format that mail readers
> # can copy and paste into R:
> d <- read.table(header=TRUE, textConnection("
>    Year  month quarter yearmonth Service  Amount
>    2009  9        Q3   092009          A     120
>    2009  9        Q3   092009          B      80
>    2009  8        Q3   082009          A      40
>    2009  7        Q3   072009          A      50
>    "))
> # The key functions
> dKey <- function(d) {
>    with(d, paste(d$Year, d$month, d$Service, sep=";"))
> }
> keyThisMonth <- function(d)dKey(d)
> keyPrevMonth <- function(d) {
>    stopifnot(!is.null(d$Year), !is.null(d$month), !is.null(d$Service))
>    isJan <- d$month==1
>    d$Year[isJan] <- d$Year[isJan] - 1
>    d$month[isJan] <- 12
>    d$month[!isJan] <- d$month[!isJan] - 1
>    dKey(d)
> }
> # Make the new column:
> d$AmountPrevMonth <- d$Amount[ match(keyPrevMonth(d), 
> keyThisMonth(d)) ]
> # The result
> print(d)
> 
>   Year month quarter yearmonth Service Amount AmountPrevMonth
> 1 2009     9      Q3     92009       A    120              40
> 2 2009     9      Q3     92009       B     80              NA
> 3 2009     8      Q3     82009       A     40              50
> 4 2009     7      Q3     72009       A     50              NA
> 
> This assumes there is only one match per row.  Is this the
> result you are looking for?
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com 
> 
> > 
> > [\Code]
> > 
> > > summaryRprof()
> > $by.self
> >                self.time self.pct total.time total.pct
> > [.data.frame       33.92  26.2    80.90      62.5
> > NextMethod         12.68  9.8     12.68       9.8
> > [.factor            8.60  6.6      18.36      14.2
> > Ops.factor          8.10  6.3      40.08      31.0
> > sort.int            6.82  5.3      13.70      10.6
> > [                   6.70  5.2      85.44      66.0
> > names               6.54  5.1       6.54       5.1
> > length              5.66  4.4       5.66       4.4
> > ==                  5.04  3.9      44.92      34.7
> > levels              4.80  3.7       5.56       4.3
> > is.na               4.24  3.3       4.24       3.3
> > dim                 3.66  2.8       3.66       2.8
> > switch              3.60  2.8       3.80       2.9
> > vector              2.68  2.1       8.02       6.2
> > inherits            1.90  1.5       1.90       1.5
> > any                 1.68  1.3       1.68       1.3
> > noNA.levels         1.46  1.1       7.84       6.1
> > .Call               1.40  1.1       1.40       1.1
> > !                   1.26  1.0       1.26       1.0
> > attr<-              1.06  0.8       1.06       0.8
> > .subset             1.00  0.8       1.00       0.8
> > class<-             0.82  0.6       0.82       0.6
> > !=                  0.80  0.6       0.80       0.6
> > levels.default      0.68  0.5       0.76       0.6
> > all                 0.62  0.5       0.62       0.5
> > <                   0.54  0.4       0.54       0.4
> > -                   0.48  0.4       0.48       0.4
> > is.factor           0.44  0.3       2.34       1.8
> > .subset2            0.38  0.3       0.38       0.3
> > attr                0.36  0.3       0.36       0.3
> > is.character        0.28  0.2       0.28       0.2
> > is.null             0.28  0.2       0.28       0.2
> > |                   0.26  0.2       0.26       0.2
> > oldClass<-          0.20  0.2       0.20       0.2
> > is.atomic           0.16  0.1       0.16       0.1
> > nzchar              0.10  0.1       0.10       0.1
> > is.numeric          0.06  0.0       0.06       0.0
> > oldClass            0.06  0.0       0.06       0.0
> > (                   0.04  0.0       0.04       0.0
> > [.data              0.02  0.0       0.02       0.0
> > 
> > $by.total
> >                total.time total.pct self.time self.pct
> > [                   85.44  66.0      6.70      5.2
> > [.data.frame        80.90  62.5     33.92     26.2
> > ==                  44.92  34.7      5.04      3.9
> > Ops.factor          40.08  31.0      8.10      6.3
> > [.factor            18.36  14.2      8.60      6.6
> > sort.int            13.70  10.6      6.82      5.3
> > NextMethod          12.68  9.8     12.68      9.8
> > vector               8.02  6.2      2.68      2.1
> > noNA.levels          7.84  6.1      1.46      1.1
> > names                6.54  5.1      6.54      5.1
> > length               5.66  4.4      5.66      4.4
> > levels               5.56  4.3      4.80      3.7
> > is.na                4.24  3.3      4.24      3.3
> > switch               3.80  2.9      3.60      2.8
> > dim                  3.66  2.8      3.66      2.8
> > is.factor            2.34  1.8      0.44      0.3
> > inherits             1.90  1.5      1.90      1.5
> > any                  1.68  1.3      1.68      1.3
> > .Call                1.40  1.1      1.40      1.1
> > !                    1.26  1.0      1.26      1.0
> > attr<-               1.06  0.8      1.06      0.8
> > .subset              1.00  0.8      1.00      0.8
> > class<-              0.82  0.6      0.82      0.6
> > !=                   0.80  0.6      0.80      0.6
> > levels.default       0.76  0.6      0.68      0.5
> > all                  0.62  0.5      0.62      0.5
> > <                    0.54  0.4      0.54      0.4
> > -                    0.48  0.4      0.48      0.4
> > .subset2             0.38  0.3      0.38      0.3
> > attr                 0.36  0.3      0.36      0.3
> > is.character         0.28  0.2      0.28      0.2
> > is.null              0.28  0.2      0.28      0.2
> > |                    0.26  0.2      0.26      0.2
> > oldClass<-           0.20  0.2      0.20      0.2
> > is.atomic            0.16  0.1      0.16      0.1
> > nzchar               0.10  0.1      0.10      0.1
> > is.numeric           0.06  0.0      0.06      0.0
> > oldClass             0.06  0.0      0.06      0.0
> > (                    0.04  0.0      0.04      0.0
> > [.data               0.02  0.0      0.02      0.0
> > 
> > $sampling.time
> > [1] 129.38
> > 
> > Warning message:
> > In readLines(filename, n = chunksize) :
> >   incomplete final line found on 'Rprof.out'
> > 
> > 	[[alternative HTML version deleted]]
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 
>