[R] Question on RandomForest in unsupervised mode

Irilenia Nobeli irilenia.nobeli at kcl.ac.uk
Wed Jun 6 18:27:23 CEST 2007


Hi,

I attempted to run the randomForest() function on a dataset without  
predefined classes. According to the manual, running randomForest  
without a response variable/class labels should result in the  
function assuming you are running in unsupervised mode. In this case,
I understand that all of my data is assigned to one class, while a
synthetic data set is generated and assigned to a second class. The
online manual suggests that an OOB misclassification error of ~40% or
more in this two-class problem would indicate that the x-variables
look like independent variables to random forests (and I
assume that in this case the proximities obtained by the randomForest  
would not be informative for clustering).
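
A minimal sketch of the two-class construction described above, written out by hand so that the OOB error can actually be inspected. It assumes the synthetic class is built by sampling each column independently from its own observed values; this mimics the description in the manual and is not necessarily identical to what randomForest() does internally. The names x_synth, x_all, y_all and oob_error are made up for illustration, and the seed is arbitrary.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only for reproducibility

x <- iris[, -5]
n <- nrow(x)

# Assumed construction of the synthetic class: resample every column
# independently, which keeps the marginal distributions but destroys
# the dependence between the variables.
x_synth <- as.data.frame(lapply(x, function(col) sample(col, n, replace = TRUE)))

# Stack original and synthetic rows and label them as two classes.
x_all <- rbind(x, x_synth)
y_all <- factor(rep(c("original", "synthetic"), each = n))

# Ordinary supervised forest on the two-class problem.
rf <- randomForest(x_all, y_all)

# OOB error after the last tree, plus the confusion matrix.
oob_error <- rf$err.rate[rf$ntree, "OOB"]
oob_error
rf$confusion
```

Because this is a supervised fit, err.rate and confusion are filled in, so the OOB error can be compared directly with the ~40% rule of thumb mentioned above.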

When I run randomForest() in the unsupervised mode my first problem  
is that I get NULL entries for the confusion matrix and the err.rate,  
but I suppose this is normal behaviour. My only information (apart  
from the proximities of course), seems to be the votes, from which I  
can deduce whether the variables are meaningful or not. The second  
problem is that, in my case, all my observations seem to have about  
20-40% of the votes from class 1 and the rest from class 2 (i.e.  
class 2 "wins" always). Assuming that class 1 was assigned to my  
original data, I'd say this is rather surprising.
Initially I thought this was simply a problem of my data not being
meaningful, but I repeated the analysis with the "iris" example
data and got more or less the same result.
I simply ran:

library(randomForest)
iris.urf <- randomForest(iris[,-5])  # no response given: unsupervised mode
iris.urf$votes                       # OOB vote fractions per observation

and again most of the votes came from class 2, although here the
vote percentages are slightly more balanced than with my data
(approximately 40 to 60% most of the time).
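
A short sketch of how the votes from the unsupervised fit can be summarised rather than read row by row. It assumes, as supposed earlier in this post, that the first column of the votes matrix corresponds to the class assigned to the original data; the names votes and frac_class1_wins are illustrative only, and the seed is arbitrary.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility

iris.urf <- randomForest(iris[, -5])  # no response: unsupervised mode

# In unsupervised mode these components come back as NULL.
is.null(iris.urf$err.rate)
is.null(iris.urf$confusion)

# One row per original observation, one column per class.
votes <- iris.urf$votes
dim(votes)

# Distribution of the vote fraction for the first column
# (assumed here to be the class of the original data).
summary(votes[, 1])

# Fraction of observations for which the first class "wins".
frac_class1_wins <- mean(votes[, 1] > 0.5)
frac_class1_wins
```

No expected output is shown, since the exact vote fractions depend on the seed and the package version.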

Has anyone got experience with unsupervised randomForest() in R and  
can explain to me why I'm observing this behaviour? In general, any  
hints about pitfalls regarding random forests in unsupervised mode  
would be very much appreciated.

Many thanks in advance,

Irilenia

-----------------------------
Irilenia (Irene) Nobeli
Randall Division of Cell and Molecular Biophysics
New Hunt's House (room 3.14)
King's College London, Guy's Campus
London, SE1 1UL
U.K.
irilenia.nobeli at kcl.ac.uk
+44(0)207-8486329


