[R] cross validation function translated from stata

Frank E Harrell Jr f.harrell at Vanderbilt.Edu
Thu Jan 21 15:33:39 CET 2010


Take a look at the validate.lrm function in the rms package.

Note that the use of threshold probabilities results in an improper 
scoring rule which will mislead you.  Also note that you need to repeat 
10-fold CV 50-100 times for precision, and that at each repeat you have 
to start from zero in analyzing associations.

Frank

zhu yao wrote:
> Hi, everyone:
> 
> I ask for help about translating a stata program into R.
> 
> The program perform cross validation as it stated.
> 
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal
> numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted
> probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model
> to the ith group, i = 2, 3... 10. Every subject now has a predicted
> probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various
> threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
> threshold probability is the mean across the 200 replications.
> 
> =================================================================
> First is stata code.
> 
> forvalues i=1(1)200 {
> local event="cancer"
> local predictors1 = "total_psa"
> local predictors2 = "total_psa free_psa"
> local prediction1 = "base"
> local prediction2 = "full"
> 
> g `prediction1'=.
> g `prediction2'=.
> 
> quietly g u = uniform()
> sort `event' u
> g set = mod(_n, 10) + 1
> 
> forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
> 
> 
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
> tempfile dca`i'
> quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''")
> 
> drop u set `prediction1' `prediction2'
> }
> 
> 
> use "`dca1'", clear
> forvalues i=2(1)200 {
> append using "`dca`i''"
> }
> 
> collapse all none modelp1 modelp2, by(threshold)
> 
> save "cross validation dca output.dta", replace
> 
> twoway(line none all modelp1 modelp2 threshold, sort)
> 
> =================================================================
> Here is my draft of R code. cMain is my dataset.
> 
> predca<-rep(0,40000)
> dim(predca)<-c(200,200)
> 
> for (i in 1:200) {
>  cvgroup<-rep(1:10,length=110)
> cvgroup<-sample(cvgroup)
> cvpre<-rep(0,length=110)
> cvMain<-cbind(cMain,cvgroup,cvpre)
> 
> for (j in 1:10) {
> cvdev<-cvMain[cvMain$cvgroup!=j,]
> cvval<-cvMain[cvMain$cvgroup==j,]
>  cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
> cvprej<-predict(cvfit,cvval,type="fitted")
> 
> #put the fitted value in dataset
> cvMain[cvgroup==j,]$cvpre<-prej
> }
> 
> cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
> cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
> cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
> cvnr<-cvnb/cvtpthres
> predca[cvn,1:99]<-cvnb
> predca[cvn,101:199]<-cvnr
> }
> 
> =================================================================
> 
> My questions are
> 1. How to ensure equal numbers of events in each set in R?
> 2. A part of stata code is
>                         forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
> 
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
>   I don't understand what's the difference between prediction1 and
> prediction2
> 3. Is my code right?
> 
> Thanks !
> 
> Yao Zhu
> Department of Urology
> Fudan University Shanghai Cancer Center
> Shanghai, China
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University



More information about the R-help mailing list