[R] Why does unique(sample) decrease performance?

jim holtman jholtman at gmail.com
Sun Mar 20 22:00:27 CET 2011


One thing it would be good to learn is how to use Rprof to look at
where time is being spent in your script.  Most of the time (over
80%) is spent in the 'unique' call, which is doing absolutely nothing
in your code since you are not assigning its output to an object.
Here is what I saw when running Rprof:

/cygdrive/c/perf: perl bin/readRprof.pl Rprof.out .5
  0 295.8 root
  1.  243.1 unique  # 82% of the time
  2. .  242.3 unique.matrix
  3. . .  231.9 apply
  4. . . .  191.9 FUN
  5. . . . |  188.3 paste
  6. . . . | .    1.8 list
  4. . . .    4.5 unlist
  5. . . . |    3.6 lapply
  4. . . .    0.8 aperm
  4. . . .    0.5 is.null
  4. . . .    0.5 !
  3. . .    2.8 alist
  4. . . .    2.5 as.list
  5. . . . |    0.6 as.list.default
  3. . .    2.4 duplicated
  4. . . .    0.8 duplicated.default
  3. . .    2.3 do.call
  1.   37.0 solve  # 12%
  2. .   32.8 solve.default
  3. . .    8.2 ifelse
  4. . . .    3.6 <Anonymous>
  5. . . . |    3.6 menu
  3. . .    7.0 diag
  4. . . .    3.4 array
  3. . .    4.2 as.matrix
  3. . .    2.5 .Call
  3. . .    1.9 colnames
  4. . . .    0.7 NCOL
  3. . .    1.1 rownames
  3. . .    0.8 rownames<-
  3. . .    0.8 colnames<-
  3. . .    0.6 is.qr
  2. .    0.6 crossprod
  1.    4.3 %*%
  1.    2.4 t
  1.    2.3 diag
  1.    1.4 sqrt
  1.    1.2 sample
  1.    0.7 crossprod
  1.    0.6 as.vector
  1.    0.6 -
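For reference, a profile like the one above can be produced by bracketing the code with Rprof() calls and summarizing the output. A minimal self-contained sketch (the temp file and toy matrix here are illustrative, not from the original script):

```r
Rprof(tmp <- tempfile())                  # start collecting profile samples
x <- matrix(rnorm(5e4), ncol = 5)         # toy 10000 x 5 matrix
for (i in 1:100) invisible(unique(x))     # unique.matrix pastes every row
Rprof(NULL)                               # stop profiling
head(summaryRprof(tmp)$by.total)          # time per function, callees included
```

summaryRprof() shows 'unique.matrix', 'apply', and 'paste' near the top, matching the listing above.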
So why is it there?  What were you expecting to happen with the
'unique' call?  If you want to use the unique values of Design.data,
then create an object with the unique values once, outside the loop,
instead of recomputing them inside it.  It would also be beneficial
to look at what 'unique.matrix' does.  Notice that a good amount of
the time (188/296 = 64%) is spent in 'paste', pasting together all the
values in each row so that 'unique' can finally be done on the matrix.
One nice thing is that most of R is written in R, so you can look
under the covers, once you know where to look, to understand what is
happening.
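If the intent was to bootstrap from the distinct rows of Design.data, one way to do that (a sketch using assumed toy data, not the poster's exact code) is to compute the unique rows once, before the loops:

```r
set.seed(1)
Design.data <- cbind(matrix(rnorm(250), ncol = 5), rnorm(50))  # toy 50 x 6 data

uniq.rows <- unique(Design.data)   # computed once, outside the loop
n <- nrow(uniq.rows)
for (i in 1:100) {
  ## resample the distinct rows; no unique() call inside the loop
  data <- uniq.rows[sample(n, n, replace = TRUE), ]
  ## ... rest of each bootstrap iteration would use 'data' ...
}
```

This moves the row-pasting work of 'unique.matrix' out of the inner loop entirely, so it runs once instead of 3100 times per j.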

On Sun, Mar 20, 2011 at 10:36 AM, ufuk beyaztas <ufukbeyaztas at gmail.com> wrote:
> Hi,
>
> I am interested in the difference in running time when the bootstrap sample
> consists of all elements versus only the distinct elements. When the sample
> consists of all elements it takes about 120 sec., but with only the distinct
> elements it takes about 4.5 to 5 times longer. I expected the opposite,
> because unique(sample) has fewer elements than the full sample. The code is
> as follows:
>
> e <- rnorm(n=50, mean=0, sd=sqrt(0.5625))
> x0 <- c(rep(1,50))
> x1 <- rnorm(n=50,mean=2,sd=1)
> x2 <- rnorm(n=50,mean=2,sd=1)
> x3 <- rnorm(n=50,mean=2,sd=1)
> x4 <- rnorm(n=50,mean=2,sd=1)
> y <- 1+ 2*x1+4*x2+3*x3+2*x4+e
> x2[1] = 10     #influential observation
> y[1] = 10      #influential observation
>
> X <- matrix(c(x0,x1,x2,x3,x4),ncol=5)
> Y <- matrix(y,ncol=1)
> Design.data <- cbind(X, Y)
>
> for (j in 1:nrow(X)) {
>
> result <- vector("list", 3100)
>
> for( i in 1: 3100) {
>
> data <- Design.data[sample(50,50,replace=TRUE),]  ##### and unique(Design.data.....)
> dataX <- data[,1:5]
> dataY <- data[,6]
>
> B.cap.simulation <- solve(crossprod(dataX)) %*% crossprod(dataX, dataY)
> P.simulation <- dataX %*% solve(crossprod(dataX)) %*% t(dataX)
> Y.cap.simulation <- P.simulation %*% dataY
> e.simulation <- dataY - Y.cap.simulation
> dX.simulation <- nrow(dataX) - ncol(dataX)
> var.cap.simulation <- crossprod(e.simulation) / (dX.simulation)
> ei.simulation <- as.vector(dataY - dataX %*% B.cap.simulation)
> pi.simulation <- diag(P.simulation)
> var.cap.i.simulation <- (((dX.simulation) * var.cap.simulation)/(dX.simulation - 1)) - (ei.simulation^2/((dX.simulation - 1) * (1 - pi.simulation)))
> ti.simulation <- ei.simulation / sqrt(var.cap.simulation * (1 - pi.simulation))
> ti.star.simulation <- ei.simulation / sqrt(var.cap.i.simulation * (1 - pi.simulation))
> pi.star.simulation <- pi.simulation + ei.simulation^2 / crossprod(e.simulation)
> WKi.simulation <- (ti.star.simulation)*sqrt(pi.simulation/(1-pi.simulation))
> Wi.simulation <- WKi.simulation * sqrt((nrow(dataX)-1)/(1-pi.simulation))
>
> result[[i]] <- list(outWi.simulation = Wi.simulation, influ.obs = any(dataY == Y[j,]))
>
> }
>
> i.obs <- sapply(result,function(x) {x$influ.obs})
> ni.result <- result[! i.obs]
> ni.Wi.simulation <- sapply(ni.result,function(x) {x$outWi.simulation})
> if (j==1) {
> ni.Wi.simulation1 <-  ni.Wi.simulation
> }else if (j==2) {
> ni.Wi.simulation49 <-  matrix(ni.Wi.simulation , nrow=1)
>
> }else{
> ni.Wi.simulation49 <- cbind(ni.Wi.simulation49, matrix(ni.Wi.simulation, nrow=1))
> }
> }
>
> Can someone give me an idea ? Many thanks.
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Why-unique-sample-decreases-the-performance-tp3391199p3391199.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


