[R] prediction error for test set (cross-validation)

Frank E Harrell Jr f.harrell at vanderbilt.edu
Wed Mar 11 14:02:11 CET 2009


Uwe Ligges wrote:
> 
> 
> Mehmet U Ayvaci wrote:
>> Hi,
>>  
>>
>> I have a database of 2211 rows with 31 entries each, and I manually
>> split my data into 10 folds for cross-validation. I build a logistic
>> regression model as:
>>  
>>
>>> model <- glm(qual ~ AgGr + FaHx + PrHx + PrSr + PaLp + SvD + IndExam +
>>              Rad + BrDn + BRDS + PrinFin + SkRtr + NpRtr + SkThck +
>>              TrThkc + SkLes + AxAdnp + ArcDst + MaDen + CaDt + MaMG +
>>              MaMrp + MaSh + SCTub + SCFoc + MaSz,
>>              data = trainSet, family = binomial(link = "logit"))
>>
>>  
>>
>> The variables are taken from trainSet, which is of size 1989x31; the
>> test set, testSet, is 222x31. My question is that when I try to predict
>> on the test set, it gives me the error:
>>
>>  
>>
>>> predict.glm(model, testSet, type = "response")
>>
>> "Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
>>   subscript out of bounds"
>>
>>  
>>
>> It does fine on trainSet, so it is something about the testSet. On the
>> other hand, I realized that some independent variables, e.g. "MaSz",
>> take 3 different values in the trainSet vs. 4 different ones in the
>> testSet. I am not sure if this is the cause. If so, what would be the
>> remedy?
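
A quick way to confirm this is to compare, for every factor, the levels
present in the two data frames. A minimal sketch, assuming they are named
trainSet and testSet as above:

## report, for each factor column, the levels that occur only in testSet
facs <- names(trainSet)[sapply(trainSet, is.factor)]
for (v in facs) {
  extra <- setdiff(unique(as.character(testSet[[v]])),
                   unique(as.character(trainSet[[v]])))
  if (length(extra) > 0)
    cat(v, "has levels seen only in testSet:",
        paste(extra, collapse = ", "), "\n")
}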
>>
>>  
>>
>> Since I can retrieve the coefficients of the logistic regression, I
>> could manually calculate the response for each entry in the testSet.
>> That would solve my problem, although it is burdensome, and I don't
>> know an easy way of doing it, as my logistic regression has 80+
>> coefficients.
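
For what it is worth, the hand calculation need not be burdensome even
with 80+ coefficients. A sketch, assuming the model was fit with
data = trainSet as above and that testSet carries no factor levels unseen
in training:

## rebuild the test design matrix with the training factor levels and
## contrasts, then apply the fitted coefficients and the inverse logit
X   <- model.matrix(delete.response(terms(model)), data = testSet,
                    contrasts.arg = model$contrasts, xlev = model$xlevels)
eta <- drop(X %*% coef(model))   # linear predictor
p   <- plogis(eta)               # predicted probabilities
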
> 
> 
> Well, if "MaSz takes 3 different values in the trainSet vs. 4 different
> ones in the testSet", then you won't even be able to calculate it by
> hand, because you have no coefficient for the 4th level of that factor.
> Either you need data from which to estimate that coefficient, or you
> cannot predict.
> 
> Uwe Ligges
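
One pragmatic workaround, sketched here on the assumption that MaSz is a
factor and the only offending variable, is to predict only for the test
rows whose MaSz level occurred in the training data; the cleaner fix is
to build the folds so that every factor level is represented in each
training set.

## keep only the test rows whose MaSz value was seen during training
seen <- testSet$MaSz %in% unique(as.character(trainSet$MaSz))
testOK <- testSet[seen, ]
testOK$MaSz <- factor(testOK$MaSz, levels = levels(factor(trainSet$MaSz)))
pred <- predict(model, newdata = testOK, type = "response")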

And note that your test sample is far too small to yield reliable 
results.  You need to use resampling (e.g., the bootstrap or 50 repeats 
of 10-fold cross-validation).  See the validate function in the Design 
package.  Note that validate does not implement the proportion 
classified correctly, because that is an improper scoring rule with 
minimum information, lowest precision, and lowest power.

Frank Harrell
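
A sketch of that resampling route, assuming the Design package (the
predecessor of today's rms) and the same trainSet; lrm must store the
design matrix and response (x = TRUE, y = TRUE) so that validate can
resample:

library(Design)                        # library(rms) in current versions
f <- lrm(qual ~ AgGr + FaHx + MaSz,    # formula abbreviated for illustration
         data = trainSet, x = TRUE, y = TRUE)
validate(f, method = "boot", B = 200)  # optimism-corrected indexes (Dxy etc.)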

> 
> 
> 
>>  
>>
>>  
>>
>> Could somebody advise?
>>
>>  
>>
>>  
>>
>> Thanks,
>> M
>>
>>


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



