[R] Vectorizing a for-loop for cross-validation in R

Aleksandre Gavashelishvili @|ek@@ndre@g@v@@he||@hv||| @end|ng |rom |||@un|@edu@ge
Thu Jan 24 07:50:24 CET 2019


Before posting on the r-help list I did run Rprof(). In my posting I asked
for help with re-writing the specific script into sapply() or
foreach()/doParallel format.

Thanks anyway for your time and suggestions,
Lexo

On Thu, Jan 24, 2019 at 12:38 AM Eric Berger <ericjberger using gmail.com> wrote:

> Charles writes about saving execution time by eliminating redundancies.
> If you see redundancies related to calling a time-consuming function
> multiple times with the same arguments, a very easy way to speed up your
> program is to memoise the functions using the package memoise.
>
> HTH,
> Eric
>
>
>
>
> On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles <ccberry using ucsd.edu> wrote:
>
>> See inline.
>>
>> > On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <
>> aleksandre.gavashelishvili using iliauni.edu.ge> wrote:
>> >
>> > I'm trying to speed up a script that otherwise takes days to handle
>> larger
>> > data sets. So, is there a way to completely vectorize or paralellize the
>> > following script:
>> >
>> >                *# k-fold cross validation*
>> >
>> > df <- trees # a data frame 'trees' from R.
>> > df <- df[sample(nrow(df)), ] # randomly shuffles the data.
>> > k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
>> > validation.
>> > folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
>> > unique numbers for k equally size folds.
>> > df$ID <- folds # adds fold IDs.
>> > df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
>> > "pred2" "pred3" to speed up the following loop.
>> >
>> > library(mgcv)
>> >
>>
>> Rprof()
>>
>> replicate(100, {
>>
>>
>> > for(i in 1:k) {
>> >  # looping for different models:
>> >  m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
>> >  m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
>> >  m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
>> >
>> >  # looping for predictions:
>> >  df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
>> >  df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
>> >  df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
>> > }
>> >
>>
>> })
>>
>> Rprof(NULL)
>>
>> summaryRprof()
>>
>> ## read ?Rprof to get a sense of what it does
>>
>> ## read the summary to determine where time is being spent.
>>
>> ## the result was surprising to me. YMMV.
>>
>> ## there may be redundancies that you can eliminate by
>> ##  - doing the setup within gam() one time and saving it
>> ##  - calling the worker functions by modifying the setup
>> ##    in a loop or function and saving the results
>>
>>
>> > # calculating residuals:
>> > df$res1 <- with(df, Volume - pred1)
>> > df$res2 <- with(df, Volume - pred2)
>> > df$res3 <- with(df, Volume - pred3)
>> >
>> > Model <- paste("m", 1:3, sep="") # creates a vector of model names.
>> >
>> > # creating a vector of mean-square errors (MSE):
>> > MSE <- with(df, c(
>> >  sum(res1^2) / nrow(df),
>> >  sum(res2^2) / nrow(df),
>> >  sum(res3^2) / nrow(df)
>> > ))
>> >
>> > model.mse <- data.frame(Model, MSE) # creates a data frame of model
>> names
>> > and mean-square errors.
>> > model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
>> > data frame in order of increasing mean-square errors.
>> >
>> > I'd appreciate any help. This code takes several days if run on >=30,000
>> > different GAM models and 3 predictors. Could you please help with
>> > re-writing the script into sapply() or foreach()/doParallel format?
>> >
>>
>> This is something you should learn to do. It is pretty standard practice.
>> Use the body of your for loop as the body of a function, add arguments, and
>> create a suitable return value. The something like
>>
>>         lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
>>
>> should work.  If it does, then parallel::mclapply(...) should also work.
>>
>> HTH,
>>
>> Chuck
>>
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list