[R] how to use 'which' inside of 'apply'?

William Dunlap wdunlap at tibco.com
Mon Oct 17 21:40:02 CEST 2011


data.frames are quite efficient when you use
a column at a time, but not when used a row
at a time.
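
A rough way to see this for yourself is to time the same row sums
done three ways.  This is only a sketch: the toy data, its size, and
the helper names row_df, row_mat and col_df are made up here and are
not part of the code quoted below, and the timings will depend on
your machine and R version.

```r
# Toy data: 2,000 rows and 5 numeric columns (sizes are arbitrary).
n  <- 2000
df <- as.data.frame(matrix(runif(n * 5), nrow = n))
m  <- as.matrix(df)

# Row at a time on the data.frame: each df[i, ] goes through
# data.frame subsetting and returns a one-row data.frame.
row_df <- function(df) {
    out <- numeric(nrow(df))
    for (i in seq_len(nrow(df))) out[i] <- sum(df[i, ])
    out
}

# Row at a time on the matrix: m[i, ] is a plain numeric vector.
row_mat <- function(m) {
    out <- numeric(nrow(m))
    for (i in seq_len(nrow(m))) out[i] <- sum(m[i, ])
    out
}

# Column at a time on the data.frame: one vectorized addition per column.
col_df <- function(df) {
    out <- numeric(nrow(df))
    for (j in seq_along(df)) out <- out + df[[j]]
    out
}

system.time(r1 <- row_df(df))
system.time(r2 <- row_mat(m))
system.time(r3 <- col_df(df))
```

All three return the same sums.  Only the first selects data.frame
rows inside the loop, which is where I would expect the time to go.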

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: Nathan Piekielek [mailto:npiekielek at gmail.com]
> Sent: Monday, October 17, 2011 12:37 PM
> To: William Dunlap
> Subject: RE: [R] how to use 'which' inside of 'apply'?
> 
> Wow, that dramatically improves performance. I never realized data frames
> were that inefficient.
> 
> Thanks for your help, incredible difference.
> 
> Nathan
> 
> -----Original Message-----
> From: William Dunlap [mailto:wdunlap at tibco.com]
> Sent: Monday, October 17, 2011 1:25 PM
> To: Nathan Piekielek
> Cc: r-help at r-project.org
> Subject: RE: [R] how to use 'which' inside of 'apply'?
> 
> Your original code works far faster when the input
> is a matrix than when it is a data.frame.  Selecting
> a row from a data.frame is a very slow operation;
> selecting a row from a matrix is quick.  Modifying
> a row or a single element in a data.frame is even
> slower relative to doing it on a matrix.  I compared your
> original code:
> 
> f0 <- function (df)
> {
>     for (i in seq_len(nrow(df))) {
>         t = which(df[i, 2:24] > df[i, 25])
>         r = min(t)
>         df[i, 26] = (r - 1) * 16 + 1
>     }
>     df[, 26] # for now just return the computed column
> }
> 
> to one that converts relevant parts of the data.frame
> df to matrices or vectors before doing the loop over
> rows:
> 
> f0.a <- function (df)
> {
>     thold <- df[, "thold"]
>     tmp <- as.matrix(df[,2:24])
>     ans <- df[,26]
>     for (i in seq_len(nrow(df))) {
>         t = which(tmp[i,]>thold[i])
>         r = min(t)
>         ans[i] = (r-1)*16+1
>     }
>     # df[,26] <- ans
>     ans
> }
> 
> On a 10,000 row data.frame f0 took 47.950 seconds and f0.a took
> 0.140 seconds.  (f1, below, took 0.012 seconds.)
> 
> 
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> > -----Original Message-----
> > From: William Dunlap
> > Sent: Monday, October 17, 2011 11:00 AM
> > To: Nathan Piekielek
> > Cc: r-help at r-project.org
> > Subject: RE: [R] how to use 'which' inside of 'apply'?
> >
> > Try vectorizing it a bit by looping over the columns.
> > E.g.,
> >
> >   f1 <- function (df)
> >   {
> >       # loop (backwards) over all columns in df whose
> >       # names start with "D" to find the earliest one
> >       # that is bigger than column "thold".  I tested with
> >       # df being a data.frame but a matrix should work too.
> >       i <- rep(NA_character_, nrow(df))
> >       colNames <- grep(value = TRUE, "^D", colnames(df))
> >       for (colName in rev(colNames)) {
> >           i[df[, colName] > df[, "thold"]] <- colName
> >       }
> >       # convert column name "D<number>" to <number>.
> >       doy <- as.numeric(sub("^D", "", i))
> >       doy
> >   }
> >   > f1(a)
> >   [1] 129 145 129 177 177 177
> >
> > You could also try looping over rows with something like
> > findInterval.  If there are far fewer columns than rows
> > then looping over columns is generally faster.
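> >
> > For comparison, here is a loop-free sketch of the same idea.  It is
> > not one of the functions in this thread: it swaps in base R's
> > max.col() on a 0/1 matrix, and the name first_exceed_doy, its
> > arguments, and the toy data are hypothetical.  It assumes there are
> > no NAs and that the value columns are named "D<number>".
> >
> > ```r
> > first_exceed_doy <- function(df, value_cols, thold_col = "thold") {
> >     vals <- as.matrix(df[, value_cols, drop = FALSE])
> >     # thold is recycled down each column, so row i is compared with thold[i]
> >     hit <- vals > df[[thold_col]]
> >     # index of the first TRUE in each row (ties resolved toward the left)
> >     first <- max.col(hit + 0L, ties.method = "first")
> >     # rows with no TRUE at all would otherwise report column 1
> >     first[rowSums(hit) == 0] <- NA
> >     # convert column name "D<number>" to <number>, as f1 does
> >     as.numeric(sub("^D", "", value_cols))[first]
> > }
> >
> > toy <- data.frame(pt = 1:4, D1 = c(0, 0, 0, 0), D17 = c(0.5, 0.1, 0, 0.2),
> >                   D33 = c(0.6, 0.4, 0, 0.9), thold = c(0.4, 0.3, 0.5, 0.8))
> > first_exceed_doy(toy, c("D1", "D17", "D33"))
> > # [1] 17 33 NA 33
> > ```
> >
> > Rows where nothing exceeds thold come back as NA rather than
> > silently pointing at the first column.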
> >
> > Your sample data.frame I called 'a' and in copy-and-pastable
> > form (from dput()) is
> > a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L,
> > 39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L,
> > 0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L,
> > 0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L,
> > 0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L,
> > 0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339,
> > 0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442
> > ), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147,
> > 0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142,
> > 0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368
> > ), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896),
> >     D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 = c(0.5583,
> >     0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676,
> >     0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682,
> >     0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115,
> >     0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341,
> >     0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705,
> >     0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L,
> >     0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592,
> >     0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("pt",
> > "D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129",
> > "D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257",
> > "D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy"
> > ), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
> > "6"))
> >
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of R. Michael Weylandt
> > > Sent: Monday, October 17, 2011 10:32 AM
> > > To: Nathan Piekielek
> > > Cc: r-help at r-project.org
> > > Subject: Re: [R] how to use 'which' inside of 'apply'?
> > >
> > > I think something like this should do it with a huge speed-up, though
> > > I'd advise you to check that it does exactly what you want.
> > > There's also nothing to guarantee that something beats the threshold
> > > in every row, so that might make the whole thing fall apart (though I
> > > don't think it will).
> > >
> > > # Sample data
> > > df = data.frame(x = sample(5, 15, replace = TRUE),
> > >                 y = sample(5, 15, replace = TRUE),
> > >                 z = sample(5, 15, replace = TRUE),
> > >                 w = (1:5)/2 + 0.5,
> > >                 th = (1:5)/2,
> > >                 doy = rep(0, 15))
> > >
> > > wd <- which(df[,1:4] > df[,5], arr.ind = TRUE)
> > > # identify all elements that beat the threshold value by their indices
> > >
> > > wd <- wd[!duplicated(wd[,1]),]
> > > # select only the first appearance of each "row" value in wd -- this
> > > # keeps the earliest column beating the threshold
> > >
> > > wd <- wd[order(wd[,"row"]),]
> > > # sort them by row
> > >
> > > df$doy = (wd[,"col"]-1)*16 + 1
> > > # The column transform you used.
> > >
> > > Hope this helps,
> > >
> > > Michael
> > >
> > >
> > > On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com> wrote:
> > > > Hello R-community,
> > > >
> > > > I am trying to populate a column (doy) in a large dataset with the first
> > > > column number that exceeds the value in another column (thold) using the
> > > > 'apply' function.
> > > >
> > > > Sample data:
> > > >     pt D1 D17 D33 D49 D65 D81 D97 D113   D129   D145      D161      D177   D193   D209   D225   D241   D257
> > > > 1 39177  0   0   0   0   0   0   0    0 0.4336 0.4754 0.5340667 0.5927334 0.6514 0.6966 0.5900 0.5583 0.5676
> > > > 2 39178  0   0   0   0   0   0   0    0 0.3420 0.4543 0.5397666 0.6252333 0.7107 0.7123 0.5591 0.4617 0.4206
> > > > 3 39164  0   0   0   0   0   0   0    0 0.4830 0.4943 0.5740333 0.6537667 0.7335 0.6255 0.6228 0.5255 0.5436
> > > > 4 39143  0   0   0   0   0   0   0    0 0.3088 0.3753 0.4466000 0.5179000 0.5892 0.6468 0.4794 0.4411 0.4307
> > > > 5 39144  0   0   0   0   0   0   0    0 0.3390 0.4152 0.5147000 0.6142000 0.7137 0.6914 0.6381 0.5704 0.5619
> > > > 6 39146  0   0   0   0   0   0   0    0 0.4232 0.4442 0.5084000 0.5726000 0.6368 0.5896 0.4703 0.4936 0.5353
> > > >    D273    D289   D305    D321 D337 D353    thold doy
> > > > 1 0.4682 0.35115 0.2341 0.11705    0    0 0.406825   0
> > > > 2 0.3867 0.25780 0.1289 0.00000    0    0 0.420600   0
> > > > 3 0.5541 0.46195 0.3698 0.18490    0    0 0.459200   0
> > > > 4 0.3632 0.34355 0.3239 0.00000    0    0 0.477800   0
> > > > 5 0.5347 0.49760 0.4605 0.00000    0    0 0.526350   0
> > > > 6 0.4067 0.39685 0.3870 0.00000    0    0 0.511900   0
> > > >
> > > > For the first record in above example I would expect doy = 129.
> > > >
> > > > I can achieve this with the following loop, but it takes several days to run
> > > > and there must be a more efficient solution:
> > > >
> > > > for (i in (1:152000)) {
> > > > t=which(data[i,2:24]>data[i,25])
> > > > r=min(t)
> > > > data[i,26]=(r-1)*16+1
> > > > }
> > > >
> > > > How do I write this using 'apply' or another function that will be more
> > > > efficient?
> > > >
> > > > I have tried the following:
> > > > data$doy=apply(which(data[,2:24]>data[,25]),1,min)
> > > >
> > > > Which returns the following error message:
> > > > "Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
> > > >  dim(X) must have a positive length"
> > > >
> > > > Any help would be much appreciated.
> > > >
> > > > Nathan
> > > >
> > > > ______________________________________________
> > > > R-help at r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
> > > >
> > >


