[R] Ranger could not work with caret

Fri Jul 1 21:18:54 CEST 2022

@Rui Barradas <ruipbarradas using sapo.pt>

Thank you again for the useful explanation.

Best regards

On Fri, Jul 1, 2022 at 8:26 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:

> Hello,
>
> The error doesn't arise in randomForest because rf has a function tuneRF
> that looks for the best mtry (best relative to OOB error estimate). And
> it's this value that it uses.
>
> The question's code gives Ranger errors but it also gives R warnings:
>
> Warning messages:
> 1: model fit failed for Fold01: mtry=48, min.node.size=5,
> splitrule=variance Error in ranger::ranger(dependent.variable.name =
> ".outcome", data = x,  :
>    User interrupt or internal error.
>
>
> As you can see, mtry=48 is double ncol(tr), when it should *never* be
> greater than the number of variables in the data set. Why it is using
> this value, I don't know. Function bug? Ask the package maintainer?
>
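
One possible explanation, offered as an assumption rather than a confirmed diagnosis: caret's formula interface expands every factor into dummy columns, so the number of predictors passed on to ranger can be larger than ncol(tr). The sketch below uses a made-up data frame (toy, f1, f2, num and y are hypothetical names, not part of the tr data) and base R's model.matrix to show the effect.

```r
# Hypothetical toy data, NOT the tr data set from this thread.
toy <- data.frame(
  f1  = factor(c("a", "b", "c", "a", "b", "c")),
  f2  = factor(c("x", "y", "x", "y", "x", "y")),
  num = c(1.5, 2.0, 3.5, 4.0, 5.5, 6.0),
  y   = c(10, 12, 15, 11, 14, 18)
)

# Number of predictor columns in the raw data frame (everything except y)
n_raw <- ncol(toy) - 1

# Expand factors into dummy variables and drop the intercept column
mm <- model.matrix(y ~ ., data = toy)[, -1, drop = FALSE]
n_expanded <- ncol(mm)

cat("raw predictors:", n_raw, "\n")
cat("columns after dummy expansion:", n_expanded, "\n")
print(colnames(mm))
```

If this is what happens with tr, an mtry that is valid for the expanded predictors of the full data could still be too large inside a cross-validation fold where the "nzv" step removes additional columns.
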
> And, by the way, package caret does or can do a grid search for optimal
> parameter values. If that is giving errors and you are calling rf
> directly, why bother with caret's error? Use the original function. Here
> is an example with tuneRF. Setting argument doBest to TRUE, you'll have
> both the optimal value for mtry and the fitted random forest. 2 in 1.
>
>
> library(randomForest)
> #  randomForest 4.7-1.1
> #  Type rfNews() to see new features/changes/bug fixes.
>
> c2 <- tuneRF(
>    x = tr[-ncol(tr)],
>    y = tr$act_effort,
>    mtryStart = ncol(tr)/2,
>    doBest = TRUE
> )
> #  mtry = 12  OOB error = 139920.7
> #  Searching left ...
> #  mtry = 6     OOB error = 170909.3
> #  -0.2214729 0.05
> #  Searching right ...
> #  mtry = 23    OOB error = 128566.7
> #  0.08114586 0.05
>
> c2
> #
> #  Call:
> #   randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
> #                 Type of random forest: regression
> #                       Number of trees: 500
> #  No. of variables tried at each split: 23
> #
> #            Mean of squared residuals: 129734.8
> #                      % Var explained: 39.98
>
>
> Hope this helps,
>
> Rui Barradas
>
>
>
> At 17:18 on 01/07/2022, Neha gupta wrote:
> > Thank you so much for your help. I hope it will work.
> >
> > However, why doesn't the same error arise when I am using rf? They both
> > have the same parameters and default values.
> >
> > Best regards
> >
> > On Friday, July 1, 2022, Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >
> >     Hello,
> >
> >     The error is that Ranger's parameter mtry becomes greater than the
> >     number of variables (columns).
> >     mtry can be set manually in caret::train argument tuneGrid. But for
> >     random forests you must also set the split rule and the minimum node size.
> >
> >
> >     library(caret)
> >     library(farff)
> >
> >     boot <- trainControl(method = "cv", number = 10)
> >
> >     # set the maximum mtry manually to ncol(tr)
> >     # this creates a sequence of mtry values
> >     mtry <- var_seq(ncol(tr), len = 3)  # 3 is the default value
> >     mtry
> >     #  [1]  2 13 24
> >
> >     splitrule <- c("variance", "extratrees")
> >     min.node.size <- 1:10
> >     mtrygrid <- expand.grid(mtry, splitrule, min.node.size)
> >     names(mtrygrid) <- c("mtry", "splitrule", "min.node.size")
> >
> >     c1 <- train(act_effort ~ ., data = tr,
> >                 method = "ranger",
> >                 tuneLength = 5,  # ignored because tuneGrid is supplied
> >                 metric = "MAE",
> >                 preProc = c("center", "scale", "nzv"),
> >                 tuneGrid = mtrygrid,
> >                 trControl = boot)
> >     c1
> >     #  Random Forest
> >     #
> >     #  30 samples
> >     #  23 predictors
> >     #
> >     #  Pre-processing: centered (48), scaled (48), remove (58)
> >     #  Resampling: Cross-Validated (10 fold)
> >     #  Summary of sample sizes: 28, 27, 27, 28, 27, 27, ...
> >     #  Resampling results across tuning parameters:
> >     #
> >     #    mtry  splitrule   min.node.size  RMSE      Rsquared   MAE
> >     #     2    variance     1             256.6391  0.8103759  186.3609
> >     #     2    variance     2             249.7120  0.8628109  183.6696
> >     #     2    variance     3             258.8240  0.8284449  189.0712
> >     #
> >     # [...omit...]
> >     #
> >     #    13    extratrees  10             254.9569  0.8918014  191.2524
> >     #    24    variance     1             177.7188  0.9458652  112.2800
> >     #    24    variance     2             172.6826  0.9204287  108.5943
> >     #    24    variance     3             172.9954  0.9271006  109.2554
> >     #    24    variance     4             172.2467  0.9523067  110.0776
> >     #    24    variance     5             175.2485  0.9283317  112.8798
> >     #    24    variance     6             177.9285  0.9369881  115.8970
> >     #    24    variance     7             180.5959  0.9485035  117.5816
> >     #    24    variance     8             178.8037  0.9358033  117.8725
> >     #    24    variance     9             176.5849  0.9210959  117.0055
> >     #    24    variance    10             178.6439  0.9257969  119.8035
> >     #    24    extratrees   1             219.1368  0.8801770  141.0720
> >     #    24    extratrees   2             216.1900  0.8550002  140.9263
> >     #    24    extratrees   3             212.4138  0.8979379  141.4282
> >     #    24    extratrees   4             218.2631  0.9121471  146.2908
> >     #    24    extratrees   5             212.5679  0.9279598  144.2715
> >     #    24    extratrees   6             218.9856  0.9141754  152.2099
> >     #    24    extratrees   7             222.8540  0.9412682  152.4614
> >     #    24    extratrees   8             228.1156  0.9423414  161.8456
> >     #    24    extratrees   9             226.6182  0.9408306  160.5264
> >     #    24    extratrees  10             226.9280  0.9429413  165.6878
> >     #
> >     #  MAE was used to select the optimal model using the smallest value.
> >     #  The final values used for the model were mtry = 24, splitrule = variance
> >     #   and min.node.size = 2.
> >     plot(c1)
> >
> >
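
The sequence 2, 13, 24 above can be reproduced in base R. The helper mtry_seq below is hypothetical and only mimics what var_seq returned here (evenly spaced values from 2 to p, rounded down); it is not caret's implementation. The sketch also builds the same kind of grid as the expand.grid call above and keeps only the rows whose mtry does not exceed an assumed number of available predictors, p_max.

```r
# Hypothetical helper: evenly spaced mtry candidates from 2 to p, rounded down.
# This is an illustration only, not the code used inside caret.
mtry_seq <- function(p, len = 3) {
  unique(floor(seq(2, p, length.out = len)))
}

mtry_24 <- mtry_seq(24, len = 3)  # same setting as var_seq(ncol(tr), len = 3)
mtry_48 <- mtry_seq(48, len = 5)  # p = 48 with five candidates

# Build a tuning grid with the column names caret expects for "ranger"
grid <- expand.grid(
  mtry = mtry_48,
  splitrule = c("variance", "extratrees"),
  min.node.size = 1:10,
  stringsAsFactors = FALSE
)

# Assumed upper bound on the number of predictors available in every fold
p_max <- 24

# Keep only the rows whose mtry cannot exceed that bound
safe_grid <- grid[grid$mtry <= p_max, ]

print(mtry_24)
print(mtry_48)
print(nrow(safe_grid))
```

With p = 48 and len = 5 the largest candidate is 48, which would be consistent with the mtry=48 reported in the warning elsewhere in this thread. Filtering the grid before passing it as tuneGrid is one way to avoid asking for more variables than are available.
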
> >
> >     Hope this helps,
> >
> >     Rui Barradas
> >
> >
> >     At 23:03 on 30/06/2022, Neha gupta wrote:
> >
> >         Ok, the data is pasted below
> >
> >         But on the same data (everything the same) and with other models
> >         like RF, SVM etc, it works fine.
> >
> >           > dput(head(tr, 30))
> >         structure(list(recordnumber = c(0, 0.02, 0.04, 0.06, 0.07, 0.08,
> >         0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.16, 0.17, 0.18, 0.23, 0.24,
> >         0.25, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.38,
> >         0.4, 0.41), projectname = structure(c(1L, 1L, 1L, 1L, 2L, 3L,
> >         3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
> >         4L, 4L, 4L, 4L, 4L, 4L, 5L, 6L), levels = c("de", "erb", "gal",
> >         "X", "hst", "slp", "spl", "Y"), class = "factor"), cat2 =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L,
> >         9L, 11L, 5L, 4L, 6L, 8L, 3L, 9L, 9L, 9L, 9L, 6L, 7L), levels =
> >         c("Avionics",
> >         "application_ground", "avionicsmonitoring",
> >         "batchdataprocessing",
> >         "communications", "datacapture", "launchprocessing",
> >         "missionplanning",
> >         "monitor_control", "operatingsystem", "realdataprocessing",
> >         "science",
> >         "simulation", "utility"), class = "factor"), forg =
> >         structure(c(2L,
> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels =
> >         c("f",
> >         "g"), class = "factor"), center = structure(c(2L, 2L, 2L, 2L,
> >         2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 6L), levels = c("1", "2",
> >         "3", "4", "5", "6"), class = "factor"), year = c(0.5, 0.5, 0.5,
> >         0.5, 0.6875, 0.5625, 0.5625, 0.8125, 0.5625, 0.875, 0.5625, 0.75,
> >         0.5625, 0.8125, 0.75, 0.9375, 0.9375, 0.9375, 0.6875, 0.6875,
> >         0.6875, 0.6875, 0.875, 1, 0.9375, 0.9375, 0.9375, 0.9375, 0.5625,
> >         0.25), mode = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> >         3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> >         3L, 3L, 3L, 3L, 3L), levels = c("embedded", "organic",
> >         "semidetached"
> >         ), class = "factor"), rely = structure(c(4L, 4L, 4L, 4L, 4L,
> >         4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L, 3L, 3L, 3L,
> >         3L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 4L), levels = c("vl", "l", "n",
> >         "h", "vh", "xh"), class = "factor"), data = structure(c(2L, 2L,
> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
> >         5L, 5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L), levels = c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), cplx =
> >         structure(c(4L,
> >         4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L,
> >         3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), time =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L,
> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), stor =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L,
> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), virt =
> >         structure(c(2L,
> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 3L, 3L,
> >         3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), turn =
> >         structure(c(2L,
> >         2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
> >         3L, 4L, 4L, 4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), acap =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), aexp =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 4L, 5L, 5L, 4L, 5L, 4L, 4L,
> >         4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), pcap =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 4L, 5L, 4L, 5L, 3L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
> >         4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 4L, 4L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), vexp =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
> >         3L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), lexp =
> >         structure(c(4L,
> >         4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 1L, 4L, 4L, 4L, 4L, 3L, 3L,
> >         3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 3L, 4L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), modp =
> >         structure(c(4L,
> >         4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> >         3L, 5L, 5L, 5L, 5L, 4L, 4L, 3L, 3L, 4L, 3L, 4L, 4L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), tool =
> >         structure(c(3L,
> >         3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 1L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), sced =
> >         structure(c(2L,
> >         2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> >         3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 3L), levels =
> >         c("vl",
> >         "l", "n", "h", "vh", "xh"), class = "factor"), equivphyskloc =
> >         c(0.025534,
> >         0.006945, 0.008988, 0.002655, 0.067102, 0.006741, 0.019508,
> >         0.005209,
> >         0.101215, 0.010622, 0.101215, 0.019508, 0.152283, 0.031253,
> >         0.014401,
> >         0.014401, 0.037892, 0.009294, 0.015729, 0.012154, 0.032377,
> >         0.035339,
> >         0.004698, 0.009703, 0.00572, 0.012358, 0.091002, 0.007252,
> >         0.180778,
> >         0.307527), act_effort = c(117.6, 31.2, 25.2, 10.8, 352.8, 72,
> >         72, 24, 360, 36, 215, 48, 324, 60, 48, 90, 210, 48, 82, 62, 170,
> >         192, 18, 50, 42, 60, 444, 42, 1248, 2400)), row.names = c(1L,
> >         3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 17L, 18L, 19L,
> >         24L, 25L, 26L, 29L, 30L, 31L, 32L, 33L, 34L, 36L, 37L, 38L, 39L,
> >         41L, 42L), class = "data.frame")
> >
> >
> >
> >         On Thu, Jun 30, 2022 at 11:28 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >
> >              Hello,
> >
> >              Please post data in dput format; without it, it's difficult to tell.
> >              If I substitute
> >
> >              mpg for act_effort
> >              mtcars for tr
> >
> >              keeping everything else, I don't get any errors.
> >              And the error message says clearly that the error is in tr (data).
> >
> >              Can you post the output of dput(head(tr, 30))?
> >
> >              Rui Barradas
> >
> >
> >              At 19:32 on 30/06/2022, Neha gupta wrote:
> >               > I posted it for the second time as I didn't get any response from
> >               > group members. I am not sure if some problem is with the question.
> >               >
> >               >
> >               >
> >               > I cannot run the "ranger" model with caret. I am only using the
> >               > farff and caret libraries and the following code:
> >               >
> >               > boot <- trainControl(method = "cv", number=10)
> >               >
> >               > c1 <- train(act_effort ~ ., data = tr,
> >               >             method = "ranger",
> >               >             tuneLength = 5,
> >               >             metric = "MAE",
> >               >             preProc = c("center", "scale", "nzv"),
> >               >             trControl = boot)
> >               >
> >               > The error I get is the following message, repeated until I
> >               > interrupt it.
> >               >
> >               > Error: mtry can not be larger than number of variables in data.
> >               > Ranger will EXIT now.
> >               >
> >               >       [[alternative HTML version deleted]]
> >               >
> >               > ______________________________________________
> >               > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >               > https://stat.ethz.ch/mailman/listinfo/r-help
> >               > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >               > and provide commented, minimal, self-contained, reproducible code.
> >
>
