[R] problem with BootCV for coxph in pec after feature selection with glmnet (lasso)
E Joffe
ejoffe at hotmail.com
Thu Jul 11 15:13:52 CEST 2013
pec accepts only R-objects for which a predictSurvProb method exists and
glmnet is not such an object.
Currently, predictSurvProb methods are available for the following
R-objects:
matrix
aalen, cox.aalen from library(timereg)
mfp from library(mfp)
phnnet, survnnet from library(survnnet)
rpart (from library(rpart))
coxph, survfit from library(survival)
cph, psm from library(rms)
prodlim from library(prodlim)
glm from library(stats)
For calculating the brier score for glmnet one needs to use the peperr
package with the c060 library that wraps glmnet as an object suitable for
peperr.
peperr_glmnet_noerror <- peperr(response=Surv(time, status), x=x,
fit.fun=fit.glmnet, args.fit=list(family="cox"),
complexity=complexity.glmnet,args.complexity=list(family="cox"),
indices=resample.indices(n=length(time),
method="boot", sample.n=10))
To get the integrated brier score for the entire model it seems one needs to
use the ipec function but I still need to research that.
Many thanks to Thomas Hielscher who authored the c060 package and was
extremely kind to help me with this.
-----Original Message-----
From: E Joffe [mailto:ejoffe at hotmail.com]
Sent: Friday, July 05, 2013 7:16 PM
To: 'r-help at R-project.org'
Subject: problem with BootCV for coxph in pec after feature selection with
glmnet (lasso)
Hi,
I am attempting to evaluate the prediction error of a coxph model that was
built after feature selection with glmnet.
In the preprocessing stage I used na.omit (dataset) to remove NAs.
I reconstructed all my factor variables into binary variables with dummies
(using model.matrix) I then used glmnet lasso to fit a cox model and select
the best performing features.
Then I fit a coxph model using only these feature.
When I try to evaluate the model using pec and a bootstrap I get an error
that the prediction matrix has wrong dimensions.
Suddenly the cox object has 318 variables instead of 356 variables in the
dataset.
I don't know why this is happening.
The cox object I assign to pec and the dataframe are both of the same size.
However, once pec refits the model its size changes (356 -> 318 variables).
Apparently something is happening during the bootstrap sampling that removes
some variables.
As mentioned, I used na.omit in the preprocessing so there should not be any
NAs.
Here are some details from my workspace:
Reformat_dataSet: 368 obs. Of 356 variables print (glmnet.cox) ----> 354 df,
p=1 n= 368, number of events= 288 (354 df = 354 variables + time and status
=> 356 variables)
Here is the pec function and the error:
pec.f <- as.formula(Hist(time,status) ~ 1) brierGlmnet <-
pec(list(glmnet.cox), data = reformat_Dataset, splitMethod="BootCV", B=50,
formula = pec.f)
> Error in predictSurvProb.coxph(object = structure(list(coefficients =
structure(c(-4.27787223119601, :
Prediction matrix has wrong dimensions:
368 rows and 318 columns.
But requested are predicted probabilities for
118 subjects (rows) in newdata and 356 time points (columns)
This may happen when some covariate values are missing in newdata!?
Here are the relevant sections of the code:
trainSet <- na.omit (dataset)
#creat Y (survival matrix) for glmnet
surv_obj <- Surv(trainSet$time,trainSet$status)
## tranform categorical variables into binary variables with dummy for
trainSet
predict_matrix <- model.matrix(~ ., data=trainSet,
contrasts.arg = lapply
(trainSet[,sapply(trainSet, is.factor)], contrasts, contrasts=FALSE))
## remove the statu/time variables from the predictor matrix (x) for glmnet
predict_matrix <- subset (predict_matrix, select=c(-time,-status))
## create a glmnet cox object using lasso regularization
glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox")
# find lambda for which dev.ratio is max
max.dev.index <- which.max(glmnet.obj$dev.ratio)
optimal.lambda <- glmnet.obj$lambda[max.dev.index]
# take beta for optimal lambda
optimal.beta <- glmnet.obj$beta[,max.dev.index]
# find non zero beta coef
nonzero.coef <- abs(optimal.beta)>0
selectedBeta <- optimal.beta[nonzero.coef]
# take only covariates for which beta is not zero
selectedVar <- predict_matrix[,nonzero.coef]
# create a dataframe for trainSet with time, status and selected variables
in binary representation for evaluation in pec
reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
# create coxph object with pre-defined coefficients
glmnet.cox <- coxph(surv_obj ~ selectedVar, init=selectedBeta,iter=0)
## Brier score for the cox-glmnet model
brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset,
splitMethod="BootCV", B=50,
formula = pec.f)
Thank you !!!
More information about the R-help
mailing list