[R] randomForest parameters for image classification

Liaw, Andy andy_liaw at merck.com
Thu Nov 18 14:39:02 CET 2010


1. Memory issue: You may want to try increasing nodesize (e.g., to 5,
11, or even 21) and see whether that degrades performance.  If not, you
should be able to grow more trees with the larger nodesize.  Another
option is the sampsize argument, which has randomForest() do the random
subsampling for you (on a per-tree basis, rather than one random subset
for the entire forest).
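To make both suggestions concrete, here is a minimal sketch. The data below are simulated stand-ins for the image training set, and the nodesize/sampsize values are illustrative rather than tuned:

```r
library(randomForest)

set.seed(42)
# Simulated stand-in for the training data: 2000 cases, 24 predictors
x <- data.frame(matrix(rnorm(2000 * 24), ncol = 24))
y <- factor(sample(c("water", "forest", "crop"), 2000, replace = TRUE))

# Larger nodesize -> shallower trees -> less memory per tree
rf_big_nodes <- randomForest(x, y, ntree = 200, nodesize = 11)

# sampsize: each tree is grown on its own random subsample of 500 cases,
# rather than one fixed subset shared by the entire forest
rf_subsample <- randomForest(x, y, ntree = 500, sampsize = 500,
                             nodesize = 11)
```

With per-tree subsampling, the forest as a whole can still see every training case across trees, which is usually preferable to training all trees on one fixed random subset.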

2. predict() giving NA: I have no idea why you are calling predict() that
way.  The first argument of every predict() method I know of (not just
randomForest's) needs to be the model object, followed by the data you
want to predict on, not the other way around.
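For example, with randomForest's own predict method the call looks like this (iris is used purely as a stand-in dataset):

```r
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 50)

# Model object first, then the new data -- the order the
# predict() generic expects
preds <- predict(fit, newdata = iris, type = "response")
```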

Andy

> -----Original Message-----
> From: Deschamps, Benjamin [mailto:Benjamin.Deschamps at AGR.GC.CA] 
> Sent: Tuesday, November 16, 2010 11:16 AM
> To: r-help at r-project.org
> Cc: Liaw, Andy
> Subject: RE: [R] randomForest parameters for image classification
> 
> I have modified my code since asking my original question. The
> classifier is now generated correctly (with a good, low error rate, as
> expected). However, I am running into two issues: 
> 
> 1) At the prediction stage I get only NA's when I try to run data
> down the forest;
> 2) I run out of memory when generating the forest with more than 200
> trees, due to the large block of memory already occupied by the
> training data.
> 
> Here is my code:
> 
> 
> library(raster)
> library(randomForest)
> 
> # Set some user variables
> fn = "image.pix"
> outraster = "output.pix"
> training_band = 2
> validation_band = 1
> 
> # Get the training data
> myraster = stack(fn)
> training_class = subset(myraster, training_band)
> training_class[training_class == 0] = NA
> training_class = Which(training_class != 0, cells=TRUE)
> training_data = extract(myraster, training_class)
> training_response = as.factor(as.vector(training_data[,training_band]))
> training_predictors = training_data[,3:nlayers(myraster)]
> remove(training_data)
> 
> # Create and save the forest
> r_tree = randomForest(training_predictors, y=training_response,
>     ntree = 200, keep.forest=TRUE) # Runs out of memory with ntree > ~200
> remove(training_predictors, training_response)
> 
> # Classify the whole image
> predictor_data = subset(myraster, 3:nlayers(myraster))
> layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
> predictions = predict(predictor_data, r_tree, filename=outraster,
>     format="PCIDSK", overwrite=TRUE, progress="text",
>     type="response") # All NA!?
> remove(predictor_data)
> 
> 
> See also a thread I started on
> http://stackoverflow.com/questions/4186507/rgdal-efficiently-reading-large-multiband-rasters
> about improving the efficiency of collecting the training data...
> 
> Thanks, Benjamin
> 
> 
> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com] 
> Sent: November 11, 2010 7:02 AM
> To: Deschamps, Benjamin; r-help at r-project.org
> Subject: RE: [R] randomForest parameters for image classification
> 
> Please show us the code you used to run randomForest, the output, as
> well as what you get with other algorithms (on the same random subset
> for comparison).  I have yet to see a dataset where randomForest does
> _far_ worse than other methods.
> 
> Andy 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Deschamps, Benjamin
> > Sent: Tuesday, November 09, 2010 10:52 AM
> > To: r-help at r-project.org
> > Subject: [R] randomForest parameters for image classification
> > 
> > I am implementing an image classification algorithm using the
> > randomForest package. The training data consists of 31000+ training
> > cases over 26 variables, plus one factor predictor variable (the
> > training class). The main issue I am encountering is very low overall
> > classification accuracy (a lot of confusion between classes).
> > However, I know from other classifications (including a regular
> > decision tree classifier) that the training and validation data are
> > sound and capable of producing good accuracies.
> > 
> >  
> > 
> > Currently, I am using the default parameters (500 trees, mtry not set
> > (default), nodesize = 1, replace=TRUE). Does anyone have experience
> > using this with large datasets? Currently I need to randomly sample
> > my training data because giving it the full 31000+ cases returns an
> > out-of-memory error; the same thing happens with large numbers of
> > trees. From what I read in the documentation, perhaps I do not have
> > enough trees to fully capture the training data?
> > 
> >  
> > 
> > Any suggestions or ideas will be greatly appreciated.
> > 
> >  
> > 
> > Benjamin
> > 
> > 
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 