[R] caret: Error when using rpart and CV != LOOCV

Max Kuhn mxkuhn at gmail.com
Thu May 17 20:05:52 CEST 2012


Dominik,

There are a number of formulations of this statistic (see the
Kvålseth[*] reference below).

I tend to think of R^2 as the proportion of variance explained by the
model[**]. With the "traditional" formula (1 - SSE/SST), it is possible
to get negative values (if there are extreme outliers in the
predictions, the negative value can be very large in magnitude). I used
the correlation-based formulation in caret because it is always in
[0, 1]. It is called "R^2" after all!
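
To make the two formulations concrete, here is a minimal sketch (not
caret's internal code; the customSummary() used in the output below is
Dominik's function, quoted further down in this thread):

# squared correlation: always in [0, 1]
rsq_cor <- function(obs, pred) cor(obs, pred)^2

# "traditional" 1 - SSE/SST: can go negative
rsq_trad <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}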

Here is an example:

> set.seed(1)
> simObserved <- rnorm(100)
> simPredicted <- simObserved + rnorm(100)*.1
>
> cor(simObserved, simPredicted)^2
[1] 0.9887525
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
      RMSE   Rsquared
0.09538273 0.98860908
>
> simPredicted[1]
[1] -0.6884905
> simPredicted[1] <- 10
>
> cor(simObserved, simPredicted)^2
[1] 0.3669257
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
     RMSE  Rsquared
 1.066900 -0.425169

It is somewhat extreme, but it does happen. For scale: the total sum of
squares depends only on the observed values and stays at about 79.9
here, while moving that one prediction to 10 inflates the error sum of
squares from about 0.91 to about 113.8, so 1 - 113.8/79.9 is about
-0.43; the squared correlation, being bounded by [0, 1], only drops to
0.37.
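
If you want to avoid the NA (and the "missing values in resampled
performance measures" warning) at the source, a summary function can
check for constant predictions before calling cor(). This is only a
sketch following the summaryFunction contract shown elsewhere in this
thread; the name robustSummary and the fallback choice are mine, not
part of caret:

robustSummary <- function(data, lev = NULL, model = NULL) {
  mse <- mean((data$pred - data$obs)^2)
  # cor() cannot return a correlation when the predictions are constant,
  # so fall back to the 1 - SSE/SST form in that case
  rsq <- if (sd(data$pred) == 0) {
    1 - sum((data$obs - data$pred)^2) / sum((data$obs - mean(data$obs))^2)
  } else {
    cor(data$obs, data$pred)^2
  }
  c(RMSE = sqrt(mse), Rsquared = rsq)
}

It plugs in the same way as Dominik's function below, e.g.
trainControl(method = 'cv', summaryFunction = robustSummary).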

Max


* Kvålseth, T. (1985). Cautionary Note about R^2. The American
Statistician, 39(4), 279–285.
** This is a very controversial statement when non-linear models are
used. I'd rather use RMSE, but many scientists I work with still think
in terms of R^2 regardless of the model. The randomForest function
also computes this statistic, but calls it "% Var explained" instead
of explicitly labeling it as "R^2". This statistic has generated
heated debates and I hope that I will not have to wear a scarlet R in
Nashville in a few weeks.


On Thu, May 17, 2012 at 1:35 PM, Dominik Bruhn <dominik at dbruhn.de> wrote:
> Hi Max,
> thanks again for the answer.
>
> I checked the caret implementation and you were right. If the
> predictions from the model are constant (i.e. sd(pred)==0), then the
> implementation returns NA for the R^2 (in postResample). This is
> mainly because the caret implementation uses `cor` (from the stats
> package), which cannot compute a correlation when sd(pred)==0.
>
> Do you know why it is implemented this way? I wrote my own
> summaryFunction which calculates R^2 by hand and it works fine. It
> nevertheless does NOT(!) produce the same values as the original
> implementation, so the calculation of R^2 does not seem to be
> consistent. I took mine from Wikipedia [1].
>
> Here is my code:
> ---
> customSummary <- function (data, lev = NULL, model = NULL) {
>         # Calculate R^2 as 1 - SSE/SST
>         ssTot <- sum((data$obs - mean(data$obs))^2)
>         ssErr <- sum((data$obs - data$pred)^2)
>         rSquare <- 1 - (ssErr/ssTot)
>
>         # Calculate MSE
>         mse <- mean((data$pred - data$obs)^2)
>
>         # Aggregate
>         out <- c(sqrt(mse), rSquare)
>         names(out) <- c("RMSE", "Rsquared")
>
>         return(out)
> }
> ---
>
> [1]: http://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions
>
> Thanks!
> Dominik
>
>
>
>
> On 17/05/12 04:10, Max Kuhn wrote:
>> Dominik,
>>
>> See this line:
>>
>>>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  30.37   30.37   30.37   30.37   30.37   30.37
>>
>> The variance of the predictions is zero. caret computes R^2 as the
>> squared correlation between the observed data and the predictions,
>> which involves sd(pred), and that is zero here. I believe that the
>> same would occur with other formulas for R^2.
>>
>> Max
>>
>> On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
>>> Thanks Max for your answer.
>>>
>>> First, I do not understand your post. Why is it a problem if two of
>>> the predictions match? From the formula for calculating R^2 I can see
>>> that there will be a division by zero iff the total sum of squares is
>>> 0. This is only true if all the predictions for the points in the
>>> test-set are equal to the mean of the test-set. Why should this happen?
>>>
>>> Anyway, I wrote the following code to check what you were trying to say:
>>>
>>> --
>>> library(caret)
>>> data(trees)
>>> formula=Volume~Girth+Height
>>>
>>> customSummary <- function (data, lev = NULL, model = NULL) {
>>>    print(summary(data$pred))
>>>    return(defaultSummary(data, lev, model))
>>> }
>>>
>>> tc=trainControl(method='cv', summaryFunction=customSummary)
>>> train(formula, data=trees,  method='rpart', trControl=tc)
>>> --
>>>
>>> This outputs:
>>> ---
>>>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  18.45   18.45   18.45   30.12   35.95   53.44
>>>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  22.69   22.69   22.69   32.94   38.06   53.44
>>>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>  30.37   30.37   30.37   30.37   30.37   30.37
>>> [cut many values like this]
>>> Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
>>> method = method,  :
>>>  There were missing values in resampled performance measures.
>>> -----
>>>
>>> As I didn't understand your post, I don't know if this confirms your
>>> assumption.
>>>
>>> Thanks anyway,
>>> Dominik
>>>
>>>
>>> On 16/05/12 17:30, Max Kuhn wrote:
>>>> More information is needed to be sure, but it is most likely that some
>>>> of the resampled rpart models produce the same prediction for the
>>>> hold-out samples (likely the result of no viable split being found).
>>>>
>>>> Almost every incarnation of R^2 requires the variance of the
>>>> prediction. This particular failure mode would result in a divide by
>>>> zero.
>>>>
>>>> Try using your own summary function (see ?trainControl) and put a
>>>> print(summary(data$pred)) in there to verify my claim.
>>>>
>>>> Max
>>>>
>>>>> On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
>>>>>> Hi,
>>>>>> I got the following problem when trying to build an rpart model using
>>>>>> anything but LOOCV. Originally, I wanted to use k-fold partitioning,
>>>>>> but every partitioning method except LOOCV throws the following warning:
>>>>>>
>>>>>> ----
>>>>>> Warning message: In nominalTrainWorkflow(dat = trainData, info =
>>>>>> trainInfo, method = method, : There were missing values in resampled
>>>>>> performance measures.
>>>>>> -----
>>>>>>
>>>>>> Below are some simplified test cases which reproduce the warning on my
>>>>>> system.
>>>>>>
>>>>>> Question: What does this warning mean? How can I avoid it?
>>>>>>
>>>>>> System-Information:
>>>>>> -----
>>>>>>> sessionInfo()
>>>>>> R version 2.15.0 (2012-03-30)
>>>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>>>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] rpart_3.1-52   caret_5.15-023 foreach_1.4.0  cluster_1.14.2
>>>>>> reshape_0.8.4
>>>>>> [6] plyr_1.7.1     lattice_0.20-6
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0     iterators_1.0.6
>>>>>> [5] tools_2.15.0
>>>>>> -------
>>>>>>
>>>>>>
>>>>>> Simplified Testcase I: Throws warning
>>>>>> ---
>>>>>> library(caret)
>>>>>> data(trees)
>>>>>> formula=Volume~Girth+Height
>>>>>> train(formula, data=trees,  method='rpart')
>>>>>> ---
>>>>>>
>>>>>> Simplified Testcase II: Every other CV-method also throws the warning,
>>>>>> for example using 'cv':
>>>>>> ---
>>>>>> library(caret)
>>>>>> data(trees)
>>>>>> formula=Volume~Girth+Height
>>>>>> tc=trainControl(method='cv')
>>>>>> train(formula, data=trees,  method='rpart', trControl=tc)
>>>>>> ---
>>>>>>
>>>>>> Simplified Testcase III: The only CV-method which works is 'LOOCV':
>>>>>> ---
>>>>>> library(caret)
>>>>>> data(trees)
>>>>>> formula=Volume~Girth+Height
>>>>>> tc=trainControl(method='LOOCV')
>>>>>> train(formula, data=trees,  method='rpart', trControl=tc)
>>>>>> ---
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> --
>>>>>> Dominik Bruhn
>>>>>> mailto: dominik at dbruhn.de
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Max
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dominik Bruhn
>>> mailto: dominik at dbruhn.de
>>>
>>
>>
>>
>
>
> --
> Dominik Bruhn
> mailto: dominik at dbruhn.de
>



-- 

Max


