[R] cross validation function translated from stata

Thu Jan 21 16:05:41 CET 2010

Hi,

On Thu, Jan 21, 2010 at 8:55 AM, zhu yao <mailzhuyao at gmail.com> wrote:
> Hi, everyone:
>
> I ask for help about translating a stata program into R.
>
> The program perform cross validation as it stated.
>
> #1. Randomly divide the data set into 10 sets of equal size, ensuring equal
> numbers of events in each set
> #2. Fit the model leaving out the 1st set
> #3. Apply the fitted model in (2) to the 1st set to obtain the predicted
> probability of a prostate cancer diagnosis.
> #4. Repeat steps (2) to (3) leaving out and then applying the fitted model
> to the ith group, i = 2, 3... 10. Every subject now has a predicted
> probability of a prostate cancer diagnosis.
> #5. Using the predicted probabilities, compute the net benefit at various
> threshold probabilities.
> #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
> threshold probability is the mean across the 200 replications.
>
> =================================================================
> First is stata code.
>
> forvalues i=1(1)200 {
> local event="cancer"
> local predictors1 = "total_psa"
> local predictors2 = "total_psa free_psa"
> local prediction1 = "base"
> local prediction2 = "full"
>
> g `prediction1'=.
> g `prediction2'=.
>
> quietly g u = uniform()
> sort `event' u
> g set = mod(_n, 10) + 1
>
> forvalues j=1(1)10{
> quietly logit `event' `predictors1' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction1' = ptemp if set==`j'
> drop ptemp
>
>
> quietly logit `event' `predictors2' if set~=`j'
> quietly predict ptemp if set==`j'
> quietly replace `prediction2' = ptemp if set==`j'
> drop ptemp
> }
> tempfile dca`i'
> quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''")
>
> drop u set `prediction1' `prediction2'
> }
>
>
> use "`dca1'", clear
> forvalues i=2(1)200 {
> append using "`dca`i''"
> }
>
> collapse all none modelp1 modelp2, by(threshold)
>
> save "cross validation dca output.dta", replace
>
> twoway(line none all modelp1 modelp2 threshold, sort)
>
> =================================================================
> Here is my draft of R code. cMain is my dataset.
>
> predca<-rep(0,40000)
> dim(predca)<-c(200,200)
>
> for (i in 1:200) {
>  cvgroup<-rep(1:10,length=110)
> cvgroup<-sample(cvgroup)
> cvpre<-rep(0,length=110)
> cvMain<-cbind(cMain,cvgroup,cvpre)
>
> for (j in 1:10) {
> cvdev<-cvMain[cvMain$cvgroup!=j,]
> cvval<-cvMain[cvMain$cvgroup==j,]
>  cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
> cvprej<-predict(cvfit,cvval,type="fitted")
>
> #put the fitted value in dataset
> cvMain[cvgroup==j,]$cvpre<-prej
> }
>
> cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
> cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
> cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
> cvnr<-cvnb/cvtpthres
> predca[cvn,1:99]<-cvnb
> predca[cvn,101:199]<-cvnr
> }
>
> =================================================================
>
> My questions are
> 1. How to ensure equal numbers of events in each set in R?

I just wanted to point you to the "createFolds" and
"createDataPartition" functions in the caret package ... they try to
do something similar, so perhaps you can see how others have tried to
solve this problem:

http://cran.r-project.org/web/packages/caret/index.html

For example, from their help page:

"""For other data splitting, the random sampling is done within the
levels of y when y is a factor in an attempt to balance the class
distributions within the splits. For numeric y, the sample is split
into groups sections based on quantiles and sampling is done within
these subgroups. Also, for very small class sizes (<= 3) the classes
may not show up in both the training and test data"""

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact