[R] Has the for loop been improved in R?

Mon Aug 7 23:39:57 CEST 2017

If you run it under the profiler in RStudio, you will see that the 'lm'
call takes about 2 seconds longer inside the function, which might have to
do with resolving the reference.  So it is probably the function call in
'lapply' vs. the in-line statement in the 'for' loop that accounts for the
difference.  I have attached the output of the profiler.
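
For anyone without RStudio, a similar comparison can be sketched with base
R's Rprof() and summaryRprof(). This is only a sketch under assumptions: the
data setup mirrors the cleaned benchmark quoted below, the names res.for,
res.lapply, prof.for and prof.lapply are made up for illustration, and the
timings you see will depend on your machine.

```r
# Sketch: profile the 'for' and 'lapply' versions with base R's Rprof().
# Data setup mirrors the benchmark quoted below (n = 10000, k = 100 folds).
n <- 10000
set.seed(123)
x <- rnorm(n)
y <- x + rnorm(n)
rand.data <- data.frame(x, y)
k <- 100
samples <- split(sample(n), rep(seq_len(k), length.out = n))

# Fit on all rows except fold 'index', predict on that fold,
# and return the squared prediction errors.
cv.fold.fun <- function(index) {
  fit <- lm(y ~ x, data = rand.data[-samples[[index]], ])
  pred <- predict(fit, newdata = rand.data[samples[[index]], ])
  (pred - rand.data$y[samples[[index]]])^2
}

# Profile the in-line 'for' version (profile goes to a temporary file).
prof.for <- tempfile()
Rprof(prof.for)
res.for <- vector("list", length(samples))
for (index in seq_along(samples)) {
  fit <- lm(y ~ x, data = rand.data[-samples[[index]], ])
  pred <- predict(fit, newdata = rand.data[samples[[index]], ])
  res.for[[index]] <- (pred - rand.data$y[samples[[index]]])^2
}
Rprof(NULL)

# Profile the 'lapply' version, which calls cv.fold.fun once per fold.
prof.lapply <- tempfile()
Rprof(prof.lapply)
res.lapply <- lapply(seq_along(samples), cv.fold.fun)
Rprof(NULL)

# Show where the time went in each version (top entries by total time).
print(head(summaryRprof(prof.for)$by.total))
print(head(summaryRprof(prof.lapply)$by.total))
```

Both versions should produce the same list of squared errors, so any
difference in the two summaries reflects how the work is invoked rather
than what is computed.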

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Mon, Aug 7, 2017 at 10:57 AM, Thierry Onkelinx <thierry.onkelinx at inbo.be>
wrote:

> Dear Jesus,
>
> The difference is marginal when each code chunk does the same thing. Your
> for loop does not yield the same output as the lapply. Here is the cleaned
> version of your code.
>
> n<-10000
> set.seed(123)
> x<-rnorm(n)
> y<-x+rnorm(n)
> rand.data<-data.frame(x,y)
> k<-100
> samples <- split(sample(n), rep(seq_len(k),length=n))
>
> library(microbenchmark)
> microbenchmark(
>   "for" = {
>     res <- vector("list", length(samples))
>     for(index in seq_along(samples)) {
>       fit <- lm(y~x, data = rand.data[-samples[[index]],])
>       pred <- predict(fit, newdata = rand.data[samples[[index]],])
>       res[[index]] <- ((pred - rand.data$y[samples[[index]]])^2)
>     }
>   },
>   lapply = {
>     cv.fold.fun <- function(index){
>       fit <- lm(y~x, data = rand.data[-samples[[index]],])
>       pred <- predict(fit, newdata = rand.data[samples[[index]],])
>       return((pred - rand.data$y[samples[[index]]])^2)
>     }
>     lapply(seq_along(samples), cv.fold.fun)
>   }
> )
>
> Unit: milliseconds
>    expr      min       lq     mean   median       uq      max neval cld
>     for 866.4196 897.3137 949.8155 926.1918 946.8390 1767.463   100   a
>  lapply 837.7804 889.6620 947.2401 909.9946 939.6379 2476.415   100   a
>
> Best regards,
>
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2017-08-07 16:48 GMT+02:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>
> > The lapply loop and the for loop have very similar speed characteristics.
> > Differences seen are almost always due to how you use memory in the body of
> > the loop. This fact is not new. You may be under the incorrect assumption
> > that using lapply is somehow equivalent to "vectorization", which it is not.
> > --
> > Sent from my phone. Please excuse my brevity.
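
The point about memory use and vectorization can be illustrated with a toy
example. This is a minimal sketch, not taken from the thread: the task
(squaring the integers 1 to n) and the names grow, prealloc, applied and
vectorized are made up for illustration, and timings depend on the machine.

```r
# Sketch: four ways to compute the squares of 1..n.
n <- 20000

# (a) 'for' loop that grows the result, copying it on every iteration.
t.grow <- system.time({
  grow <- c()
  for (i in seq_len(n)) grow <- c(grow, i^2)
})

# (b) 'for' loop that fills a preallocated vector.
t.prealloc <- system.time({
  prealloc <- numeric(n)
  for (i in seq_len(n)) prealloc[i] <- i^2
})

# (c) vapply: still one R-level function call per element, like lapply.
t.applied <- system.time({
  applied <- vapply(seq_len(n), function(i) i^2, numeric(1))
})

# (d) Vectorized: a single call that loops in compiled code.
t.vectorized <- system.time({
  vectorized <- seq_len(n)^2
})

# Elapsed times side by side.
print(rbind(grow = t.grow, prealloc = t.prealloc,
            applied = t.applied, vectorized = t.vectorized)[, "elapsed"])
```

All four give the same result. Versions (b) and (c) both evaluate R code
once per element, which is why a well-written for loop and lapply tend to
time alike; only (d) is vectorized in the usual sense.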
> >
> > On August 7, 2017 7:29:58 AM PDT, "Jesús Para Fernández" <
> > j.para.fernandez at hotmail.com> wrote:
> > >Hi!
> > >
> > >I am comparing lapply and for, and I find that for is faster
> > >than lapply.
> > >
> > >
> > >What I have done:
> > >
> > >
> > >
> > >n<-100000
> > >set.seed(123)
> > >x<-rnorm(n)
> > >y<-x+rnorm(n)
> > >rand.data<-data.frame(x,y)
> > >k<-100
> > >samples<-split(sample(1:n),rep(1:k,length=n))
> > >
> > >res<-list()
> > >t<-Sys.time()
> > >for(i in 1:100){
> > >  modelo<-lm(y~x,rand.data[-samples[[i]]])
> > >  prediccion<-predict(modelo,rand.data[samples[[i]],])
> > >  res[[i]] <- (prediccion - rand.data$y[samples[[i]]])
> > >
> > >}
> > >print(Sys.time()-t)
> > >
> > >Which takes 8.042 seconds
> > >
> > >and using lapply
> > >
> > >cv.fold.fun <- function(index){
> > >   fit <- lm(y~x, data = rand.data[-samples[[index]],])
> > >   pred <- predict(fit, newdata = rand.data[samples[[index]],])
> > >   return((pred - rand.data$y[samples[[index]]])^2)
> > >  }
> > >
> > >
> > >t<-Sys.time()
> > >
> > >nuevo<-lapply(seq(along = samples),cv.fold.fun)
> > >print(Sys.time()-t)
> > >
> > >
> > >Which takes 9.56 seconds.
> > >
> > >So... has the for loop been improved in R?
> > >
> > >Thanks!
> > >
> > >______________________________________________
> > >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: profile.png
Type: image/png
Size: 21309 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20170807/c4a641b2/attachment.png>