[R] questions for using randomForest/pamr to predict biological data
zj29 at cornell.edu
Thu May 9 17:10:32 CEST 2013
I am using randomForest and pamr to analyze some biological data. Basically
the data show how many of each bacterium (bug) is present in each soil
sample (sample) at each location (loc) and from each plant genotype (gen).
I want to use the bugs to predict the plant genotypes. Some sample data are
attached in dput format as a PDF file (a long file; it can be saved as a
plain txt file and read back with dget).
I ran randomForest with the following command:
rf1=randomForest(gen~., mydata, ntree=1000, mtry=21, importance=T,
I got a very high OOB error rate (87%), and high classification errors for
each genotype: even the lowest error rate was 40%.
I realized my data were somewhat unbalanced, so I experimented with the mtry,
sampsize, and strata parameters. However, the OOB error rates and
classification errors were still high.
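For readers following along, here is a minimal sketch of what balancing with strata and sampsize can look like. It assumes a data frame with a factor column gen and one numeric column per bug. The toy data are simulated and purely hypothetical (they are not the attached data), and the names toy, n_min, samp and rf_bal are illustrative only.

```r
library(randomForest)

set.seed(1)
# Hypothetical toy data: 3 unbalanced genotypes, 10 bug-count columns
gen <- factor(rep(c("genA", "genB", "genC"), times = c(30, 12, 8)))
bugs <- matrix(rpois(length(gen) * 10, lambda = 5), ncol = 10,
               dimnames = list(NULL, paste0("bug", 1:10)))
toy <- data.frame(gen = gen, bugs)

# Draw the same number of samples from every genotype for each tree,
# using the size of the smallest class
n_min <- min(table(toy$gen))
samp <- rep(n_min, nlevels(toy$gen))

rf_bal <- randomForest(gen ~ ., data = toy, ntree = 500,
                       strata = toy$gen, sampsize = samp,
                       importance = TRUE)

# Per-class OOB error is in the class.error column of the confusion matrix
print(rf_bal$confusion)
```

Balancing in this way evens out how often each genotype is seen by each tree; it cannot create signal that is not in the predictors, so high error rates may well persist.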
I noticed that with only two genotypes, the OOB error rate and
classification errors went down significantly to 20%.
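A sketch of the two-genotype comparison, again on simulated and hypothetical data rather than the attached file. The names toy, two and rf_two are illustrative. The droplevels() call is there because subsetting a factor keeps the unused level, and to my knowledge randomForest refuses a response with empty classes.

```r
library(randomForest)

set.seed(1)
# Hypothetical toy data standing in for the real table
gen <- factor(rep(c("genA", "genB", "genC"), each = 15))
bugs <- matrix(rpois(length(gen) * 10, lambda = 5), ncol = 10,
               dimnames = list(NULL, paste0("bug", 1:10)))
toy <- data.frame(gen = gen, bugs)

# Keep two genotypes and drop the now-empty factor level
two <- droplevels(subset(toy, gen %in% c("genA", "genB")))

rf_two <- randomForest(gen ~ ., data = two, ntree = 500)

# OOB error rate after the last tree
rf_two$err.rate[rf_two$ntree, "OOB"]
```

Note that with two balanced classes a random guess is already wrong only 50% of the time, so an error rate from a two-class fit is not directly comparable with one from a fit on many genotypes.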
I also tried the varSelRF package to select variables, but this did not
lower the OOB errors much.
With pamr, no variable was left after the default 30 threshold values, and
the FDR rates were all 1. If I ordinate the data, I can see that there is
no obvious cluster among the genotypes.
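A minimal sketch of the pamr workflow referred to above, on simulated and hypothetical counts (not the attached data). The names counts, pam_data, pam_fit and pam_cv are illustrative. The one detail worth checking is the orientation: pamr expects the features (here, bugs) in rows and the samples in columns.

```r
library(pamr)

set.seed(1)
# Hypothetical toy data: 30 samples (rows) by 20 bugs (columns)
gen <- factor(rep(c("genA", "genB", "genC"), each = 10))
counts <- matrix(rpois(30 * 20, lambda = 5), nrow = 30,
                 dimnames = list(NULL, paste0("bug", 1:20)))

# pamr wants a list with x = features-by-samples matrix and y = class labels
pam_data <- list(x = t(counts), y = gen)

pam_fit <- pamr.train(pam_data)       # default: 30 threshold values
pam_cv <- pamr.cv(pam_fit, pam_data)  # cross-validated error per threshold

# Cross-validated misclassification error at each threshold
data.frame(threshold = pam_cv$threshold, cv_error = pam_cv$error)
```

If the cross-validated error stays near the chance level across all thresholds, that is consistent with the ordination showing no separation among genotypes.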
So my questions are:
1) Is randomForest or pamr a valid approach for this problem?
2) Can I further improve the randomForest or pamr predictions?
3) Can I at least use the bugs to predict some genotypes confidently (e.g.
gen18, with a classification error of 40% by randomForest), if not all?
Thanks a lot for any comment or suggestion,
Ruth Ley Lab
Field of Microbiology, Cornell University
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 636048 bytes
Desc: not available