[R] Random Forest - Strata

Coll gbcoll2 at gmail.com
Tue Jul 27 20:46:53 CEST 2010


Thanks for all the help.

I tried using the "index" argument in caret to dictate which rows of the
sample would be used to build each tree in the RF (e.g. use all data from
sites A and B for training, and hold out all data from site C for testing).
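To make the intended split concrete, here is a minimal sketch in base R of a leave-one-site-out index list. The `site` vector and the names `siteIndex` and `siteHoldout` are made up for illustration; real data would supply its own site column. It only shows the shape of the list (one vector of training rows per resample) and says nothing about how train uses it.

```r
# Hypothetical site labels, one per row of the data (made-up values).
site <- c("A", "A", "B", "B", "B", "C", "C", "A", "C", "B")

# One resample per site: train on every row that is NOT from that site.
sites <- sort(unique(site))
siteIndex <- lapply(sites, function(s) which(site != s))
names(siteIndex) <- paste0("holdout_", sites)

# Rows held out in each resample (the complement of the training rows).
siteHoldout <- lapply(siteIndex, function(idx) setdiff(seq_along(site), idx))

str(siteIndex)
```

A list like `siteIndex` has the same structure as the `tmpIrisIndex` list produced by createDataPartition, so it could be supplied in the same place.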

However, after running it, when I cross-checked the "index" passed to the
train function against the "inbag" in the resulting randomForest object, I
found that the two did not match.

As shown below:

> library(caret)
> data(iris)
> tmpIrisIndex <- createDataPartition(iris$Species, p=0.632, times = 10)
> head(tmpIrisIndex,3)
[[1]]
 [1]   1   2   3   7  10  11  12  13  16  18  20  22  24  25  26  27  28  29  31
[20]  34  35  36  37  38  39  40  41  43  46  47  48  50  52  53  55  56  57  58
[39]  61  64  65  66  67  68  69  71  74  75  76  77  79  82  83  84  85  86  88
[58]  90  91  92  94  96  98  99 102 103 104 106 108 109 111 112 113 114 115 116
[77] 117 119 120 121 123 126 128 129 130 131 132 134 136 139 140 141 143 146 147
[96] 150

[[2]]
 [1]   1   3   6   7   8  10  12  13  14  16  18  20  21  22  23  24  26  27  28
[20]  29  30  32  34  35  36  38  42  44  46  47  48  50  51  53  54  55  58  60
[39]  61  62  67  68  69  70  72  73  74  76  77  79  81  82  83  85  86  88  89
[58]  90  92  93  95  97  99 100 103 104 105 107 108 109 111 112 113 114 117 119
[77] 120 121 122 123 124 125 127 130 132 133 134 135 137 139 140 141 142 145 147
[96] 149

[[3]]
 [1]   1   5   7   9  10  11  12  14  18  20  21  22  23  24  26  29  30  31  33
[20]  34  35  36  37  38  39  40  44  45  46  47  48  49  51  52  53  54  56  58
[39]  61  63  65  66  69  70  72  74  75  76  77  78  79  80  82  83  85  86  87
[58]  90  91  92  93  94  98 100 102 103 105 106 107 109 110 113 114 115 116 117
[77] 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141 142 146
[96] 150

> irisTrControl <- trainControl(method = "oob", index = tmpIrisIndex)
> rf.iris.obj <- train(Species ~ ., data = iris, method = "rf", ntree = 10,
+ keep.inbag = TRUE, trControl = irisTrControl)
Fitting: mtry=2 
Fitting: mtry=3 
Fitting: mtry=4 
> head(rf.iris.obj$finalModel$inbag,20)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    1    0    0    0    1    0    1     1
 [2,]    1    1    1    1    1    0    1    0    1     0
 [3,]    1    1    1    0    0    1    1    0    0     0
 [4,]    1    0    1    0    1    1    0    1    0     1
 [5,]    0    1    1    1    1    1    0    1    0     1
 [6,]    1    1    0    1    0    0    1    1    1     0
 [7,]    1    1    0    0    1    1    0    0    0     0
 [8,]    1    1    1    1    1    0    1    1    1     1
 [9,]    1    1    0    1    0    1    0    1    1     0
[10,]    1    1    1    0    1    1    0    0    0     1
[11,]    1    1    1    1    1    1    1    0    1     0
[12,]    1    1    1    1    1    0    1    0    1     1
[13,]    1    0    1    1    1    1    1    1    0     1
[14,]    0    1    1    1    0    1    0    0    0     0
[15,]    1    1    1    1    1    1    1    1    1     0
[16,]    1    1    0    0    0    0    1    0    1     1
[17,]    1    0    1    0    0    0    1    1    0     1
[18,]    1    0    1    1    1    1    1    1    1     1
[19,]    1    0    1    0    1    1    1    0    1     1
[20,]    1    0    1    0    1    1    1    0    1     0

My understanding is that the 1st tree in the RF should be built with
tmpIrisIndex[[1]], i.e. "1   2   3   7  10  11  12  13  ..."?
But the inbag in the resulting forest shows that the 1st tree uses
"1 2 3 4 6 7 8 9 ..." as its in-bag rows.
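The same cross-check can be done programmatically instead of by eye. The following is a minimal sketch in base R; `inbagRows` and `compareInbag` are hypothetical helper names, and the toy matrix and index list are made up so that the block runs without caret or randomForest. It assumes that a positive entry in column k of the inbag matrix means the row was in-bag for tree k.

```r
# Hypothetical helper: rows marked in-bag (entry > 0) for one tree.
inbagRows <- function(inbag, tree) which(inbag[, tree] > 0)

# Hypothetical helper: for each tree k, compare its in-bag rows
# with the k-th element of the index list.
compareInbag <- function(inbag, index) {
  k <- min(ncol(inbag), length(index))
  data.frame(
    tree = seq_len(k),
    # TRUE when the in-bag rows and index[[i]] contain the same rows
    same = vapply(seq_len(k), function(i)
      setequal(inbagRows(inbag, i), index[[i]]), logical(1)),
    # number of rows the two sets have in common
    shared = vapply(seq_len(k), function(i)
      length(intersect(inbagRows(inbag, i), index[[i]])), integer(1))
  )
}

# Toy data (made up): 6 rows, 2 trees.
toyInbag <- cbind(c(1, 1, 0, 1, 0, 0),
                  c(0, 1, 1, 0, 1, 1))
toyIndex <- list(c(1, 2, 4), c(1, 2, 3))

res <- compareInbag(toyInbag, toyIndex)
print(res)
```

With the objects from the session above, the call would be compareInbag(rf.iris.obj$finalModel$inbag, tmpIrisIndex).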

Why does the index passed to train not match the inbag in the rf object?
Or have I looked in the wrong place to check this?

Any help / comments would be appreciated. Thanks a lot.

Regards,
Coll


