[R] randomForest gives different results for formula call v. x, y methods. Why?

Gavin Simpson gavin.simpson at ucl.ac.uk
Sun Apr 29 15:38:39 CEST 2007


On Sat, 2007-04-28 at 21:13 -0400, David L. Van Brunt, Ph.D. wrote:
> Just out of curiosity, I took the default "iris" example in the RF
> helpfile...
> but seeing the admonition against using the formula interface for large data
> sets, I wanted to play around a bit to see how the various options affected
> the output. Found something interesting I couldn't find documentation for...
> 
> Just like the example...
> > set.seed(12) # to be sure I have reproducibility

No differences between runs for me on FC4 using R 2.4.1 and 2.5.0 with:

> require(randomForest)
Loading required package: randomForest
randomForest 4.5-18

*if* I reset the seed before each call to randomForest.

Your example code doesn't seem to reset the random seed before each
run. As such, each run draws different bootstrap samples and different
random subsets of variables at each split, so the results differ.
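
The effect of the seed can be seen without randomForest at all. The following is a minimal base-R sketch; the variable names are illustrative only, and the bootstrap draw merely stands in for the resampling a forest does internally:

```r
# Draw bootstrap-style samples of row indices, as a forest would for each tree.
n <- nrow(iris)

set.seed(12)
first_draw  <- sample(n, replace = TRUE)  # uses the stream started by seed 12
second_draw <- sample(n, replace = TRUE)  # continues the same stream

set.seed(12)                              # reset the seed
third_draw  <- sample(n, replace = TRUE)  # restarts the stream from seed 12

identical(first_draw, second_draw)  # FALSE: no reset between the two draws
identical(first_draw, third_draw)   # TRUE: seed was reset before the draw
```

Two calls made one after the other without a reset behave like first_draw and second_draw.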

E.g. the runs are all the same when the seed is reset:

> set.seed(12)
> randomForest(Species ~ ., data=iris)

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,1:4],y=iris[,5])

Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,c(1:4)],y=iris[,5])

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
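
Rather than comparing the printed output by eye, the fits can be compared directly. This is only a sketch: fit_with_seed is a hypothetical helper, not part of randomForest, and it assumes the package is installed. Going by the output above, both comparisons would be expected to report TRUE:

```r
library(randomForest)

# Hypothetical helper: reset the seed, then run the supplied fitting function.
fit_with_seed <- function(fit_fun, seed = 12) {
  set.seed(seed)
  fit_fun()
}

fit_formula <- fit_with_seed(function() randomForest(Species ~ ., data = iris))
fit_xy      <- fit_with_seed(function() randomForest(x = iris[, 1:4], y = iris[, 5]))

# Compare the OOB confusion matrices and per-tree OOB error rates.
all.equal(fit_formula$confusion, fit_xy$confusion)
all.equal(fit_formula$err.rate,  fit_xy$err.rate)
```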

HTH

G
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [t] +44 (0)20 7679 0522
ECRC                              [f] +44 (0)20 7679 0565
UCL Department of Geography
Pearson Building                  [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street
London, UK                        [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT                          [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


