[R] R help-classification accuracy of DFA and RF using caret

David Winsemius dwinsemius at comcast.net
Wed Nov 6 22:57:06 CET 2013


On Nov 6, 2013, at 10:07 AM, Henderson, Robin Michelle wrote:

> Hi,
> 
> I am a graduate student applying published R scripts to compare the classification accuracy of 2 predictive models, one built using discriminant function analysis and one using random forests (webpage link for these scripts is provided below).  The purpose of these models is to predict the biotic integrity of streams.  Specifically, I am trying to compare the classification accuracy (i.e., prediction of group membership) of both the DFA and RF models using k-fold cross-validation for the following metrics: AUC ROC, percent correctly classified, specificity, sensitivity, and Kappa.

Sensitivity, "accuracy" (= percent correct), and specificity are only defined once you establish a particular decision threshold. There is no single "sensitivity" or "specificity" that accrues to a classification model as such. AUC is an effort at presenting such an overall value, but it has deficiencies and is insensitive to statistically significant differences in models.
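
To illustrate the threshold point, here is a minimal sketch in base R. The scores are simulated for illustration only (they are not your data), and `metrics_at` is just a hypothetical helper name. It computes the three quantities at several cutoffs applied to the same set of predicted scores.

```r
# Simulated scores, for illustration only (not the poster's data).
set.seed(1)
truth <- rep(c(1, 0), times = c(40, 60))        # 1 = positive class
score <- c(rnorm(40, mean = 0.65, sd = 0.15),   # scores for true positives
           rnorm(60, mean = 0.40, sd = 0.15))   # scores for true negatives

# Sensitivity, specificity and accuracy at one decision threshold.
metrics_at <- function(threshold) {
  pred <- as.integer(score >= threshold)
  c(threshold   = threshold,
    sensitivity = mean(pred[truth == 1] == 1),
    specificity = mean(pred[truth == 0] == 0),
    accuracy    = mean(pred == truth))
}

# One row per threshold; the model (the scores) never changes.
res <- t(sapply(c(0.3, 0.5, 0.7), metrics_at))
print(round(res, 2))
```

As the cutoff rises, sensitivity can only fall and specificity can only rise, so any single reported pair reflects the chosen threshold as much as the model.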

> I would also like to obtain the F statistic, Wilks lambda, MSE or RMSE for the random forest models as the script does not contain code to get this data.

I doubt very much that is by accident or oversight on the part of the randomForest developers.

>  I think I need to use the caret package to obtain the classification accuracy, but I keep getting error messages when I apply the train function to my data.  As I am relatively new to R and my thesis committee is unable to help as they are also unfamiliar with R, I thought it best to ask for help.

I think you need to add a statistician to your committee. The difficulties you are facing (of which you appear to be unaware) are not just related to being new to R.


>  Would someone be willing to help me?
> 
> 
> Thanks,
> Robin
> 
> http://www.epa.gov/wed/pages/models/rivpacs/rivpacs.htm
> 
> 
>> TrainDataDFAgrps2 <-predcal
>> TrainClassesDFAgrps2 <-grp.2;
>> DFAgrps2Fit1 <- train(TrainDataDFAgrps2, TrainClassesDFAgrps2,
> +  method = "lda",
> + tuneLength = 10,
> + trControl = trainControl(method = "cv"));
> Error in train.default(TrainDataDFAgrps2, TrainClassesDFAgrps2, method = "lda",  :
>  wrong model type for regression

That error is pointing out a mismatch between the form of your outcome and the method you chose. Your grp.2 appears to be a numeric vector of 1s and 2s, so train() (from the `caret` package) treats the problem as regression, whereas "lda" is a classification method that requires a categorical outcome (an R factor). Converting grp.2 to a factor should get you past this particular message. I think this is further evidence of the need for competent statistical consultation. You would be advised to study further in Venables and Ripley's MASS (4th ed.) or in Hastie, Tibshirani, and Friedman's ESL (2nd ed.).
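
For what it is worth, with caret's train() the message "wrong model type for regression" typically appears when the outcome is numeric and the method is classification-only. A minimal sketch of the factor conversion follows. It assumes the caret and MASS packages are installed, and it uses the built-in iris data reduced to two classes as a stand-in, since your predcal and grp.2 objects are not available here; the labels "grp1" and "grp2" are made up.

```r
library(caret)  # assumes caret (and MASS, used by method = "lda") are installed

# Stand-in data: two iris species, with the outcome coded 1/2 like grp.2.
dat   <- iris[iris$Species != "setosa", ]
x     <- dat[, 1:4]                            # numeric predictors only
y_num <- as.integer(droplevels(dat$Species))   # numeric codes 1 and 2

# A numeric outcome makes train() assume regression, which "lda" rejects.
# Converting to a factor tells train() this is a classification problem.
y <- factor(y_num, levels = c(1, 2), labels = c("grp1", "grp2"))

set.seed(1)
fit <- train(x, y,
             method    = "lda",
             trControl = trainControl(method = "cv", number = 10))

# Cross-validated accuracy and kappa.
print(fit$results)
```

"lda" has no tuning parameters in caret, so tuneLength has nothing to vary here. The same conversion should apply to the outcome passed to method = "rf"; predictor columns that are not numeric (Reference_Test appears to be one) would likely need separate handling.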

This link, found with a simple Google search, suggests that the author of the cited code is at an academic institution only one state away from you: fw.oregonstate.edu/system/files/Van%20Sickle%20CV%20consult.pdf. He may be willing to offer assistance.

-- 
David.

> 
>> RFgrps2Fit1 <- train(TrainDataRFgrps2, TrainClassesRFgrps2,
> +  method = "rf",
> + tuneLength = 10,
> + trControl = trainControl(method = "cv"));
> There were 50 or more warnings (use warnings() to see the first 50)
> 
> Clip of predcal (same length as grp.2, but too much data to display all):
>> predcal
>          Reference_Test HUC12_AREA_HA_log10 ELEV_m M_Slp_sqt Precip_mm Temp_CX10
> 2370                   R                 3.7  588.0       2.2      1751       148
> 559                    R                 4.0  643.1       1.8      1674       141
> 2062                   R                 4.0  643.1       1.8      1674       141
> 2467                   R                 4.0  643.1       1.8      1674       141
> 1176                   R                 3.9  694.3       2.4      1534       131
> 1840                   R                 3.9  694.3       2.4      1534       131
> 2052                   R                 3.9  694.3       2.4      1534       131
> 1174                   R                 4.1  605.0       2.1      1382       138
> 1841                   R                 4.1  605.0       2.1      1382       138
> 2051                   R                 4.1  605.0       2.1      1382       138
> 1831                   R                 4.1  363.9       1.7       937       156
> 
> 
> Grps.2:
> grp.2
>  [1] 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1
> [45] 2 2 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1
> [89] 1 2 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 2 1 1 2
> 

David Winsemius
Alameda, CA, USA


