[BioC] undefined columns selected error when using bagging{ipred}

Valerie Obenchain vobencha at fhcrc.org
Sun Sep 9 16:45:46 CEST 2012


Hi Constanze,

The problem appears to be with how bagging() deals with the column 
names of the sample data frame. The immediate solution is to change the 
column names so they are no longer numbers:

 > bagg <- bagging(response ~., data = exprDF[,selected], ntrees = 100)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
undefined columns selected
 > dat <- exprDF[,selected]
 > colnames(dat) <- paste0("A", 1:ncol(dat))
 > bagg <- bagging(response ~., data = dat, ntrees = 100)
 > bagg

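Bagging survival trees with 25 bootstrap replications

Call: bagging.data.frame(formula = response ~ ., data = dat, ntrees = 100)

Note the 25 bootstrap replications in the output: bagging() takes the 
number of replications as nbagg (default 25), so ntrees = 100 does not 
change it.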
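
If you would rather not invent new names by hand, base R's make.names() 
does the same job. The following is only a sketch on a toy data frame 
(the column names "101", "2057" and "33" are made up to stand in for 
exprDF[,selected]); it also keeps a lookup so the selected variables can 
be mapped back to their original names afterwards.

```r
# Toy stand-in for exprDF[, selected]: column names are numbers stored as text.
# (Hypothetical names, chosen only for illustration.)
dat <- data.frame(matrix(1:12, nrow = 4))
names(dat) <- c("101", "2057", "33")

# Keep the original names so results can be mapped back later.
orig_names <- names(dat)

# make.names() turns each name into a syntactically valid one;
# names starting with a digit get an "X" prefix.
names(dat) <- make.names(orig_names, unique = TRUE)

# Named vector: new name -> original name.
lookup <- setNames(orig_names, names(dat))

names(dat)
lookup[["X2057"]]
```

The renamed data frame can then be passed to bagging() in place of dat 
above.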

As you've seen from error messages as you've worked through these 
examples, several packages are no longer maintained and many functions 
have evolved since the book was written. ipred is currently maintained 
and it is the package that bagging() comes from. I'm cc'ing the 
maintainer because this issue may be a bug.

Hi Torsten,

It looks like bagging() does not like column names that are numbers 
coerced to character. Using a modified example from ?bagging,

data(DLBCL)
## first example works fine
mod <- bagging(Surv(time,cens) ~ ., data=DLBCL, coob=TRUE)

## change the column names of the data.frame
names(DLBCL) <- c("DLCL.Sample", "Gene.Expression", "time", "cens", 
"IPI", 1:10)
 > names(DLBCL)
[1] "DLCL.Sample" "Gene.Expression" "time" "cens"
[5] "IPI" "1" "2" "3"
[9] "4" "5" "6" "7"
[13] "8" "9" "10"
 > mod <- bagging(Surv(time,cens) ~ ., data=DLBCL, coob=TRUE)
Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
undefined columns selected

The error is thrown from this line in the irpart() function,
isord <- unlist(lapply(m[attr(Terms, "term.labels")], tfun))

When the 'Terms' variable is created, the term labels for the numeric 
column names are wrapped in backticks ("`"), which prevents them from 
being matched to the column names of the data.frame (m),

debugging in: irpart(y ~ ., data = mydata, control = control, bcontrol = 
list(nbagg = nbagg,
ns = ns, replace = REPLACE))
...
Browse[2]>
debug: Terms <- attr(m, "terms")
...
Browse[2]> attr(Terms, "term.labels")
[1] "DLCL.Sample" "Gene.Expression" "IPI" "`1`"
[5] "`2`" "`3`" "`4`" "`5`"
[9] "`6`" "`7`" "`8`" "`9`"
[13] "`10`"
...
Browse[2]> colnames(m)
[1] "y" "DLCL.Sample" "Gene.Expression" "IPI"
[5] "1" "2" "3" "4"
[9] "5" "6" "7" "8"
[13] "9" "10"
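
The mismatch can be reproduced without ipred. This is a minimal sketch 
using only base R and a made-up three-column data frame (d, labs and 
clean are illustrative names, not anything from irpart()). It shows the 
backtick-quoted term labels, the failing lookup, and that stripping the 
backticks makes the labels match again. It is not a patch for ipred, 
just an illustration of the mechanism.

```r
# Toy data frame whose predictor names are numbers stored as text.
d <- data.frame(y = c(1.2, 0.7, 3.1, 2.4),
                a = c(5, 3, 8, 1),
                b = c(0, 1, 0, 1))
names(d) <- c("y", "1", "2")

# Build the model frame and pull the term labels, as irpart() does.
m <- model.frame(y ~ ., data = d)
labs <- attr(attr(m, "terms"), "term.labels")

labs          # labels are quoted with backticks
colnames(m)   # column names are not

# Indexing the model frame by the quoted labels fails; capture the message.
res <- tryCatch(m[labs], error = function(e) conditionMessage(e))

# Dropping the backticks makes the labels match the column names again.
clean <- gsub("`", "", labs, fixed = TRUE)
ok <- m[clean]
```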


Valerie



On 09/05/12 08:21, Constanze [guest] wrote:
> Dear All,
>
> I'm trying to reproduce the results of the survival analysis in Chapter 17, p.307 of "Bioinformatics and Computational Biology Solutions using R and Bioconductor" using the code chunks from http://www.bioconductor.org/help/publications/books/bioinformatics-and-computational-biology-solutions/chapter-code/Computational_Inference.R
> The call to the bagging function throws an error, although I decreased the number of variables selected to p=25 (so the model fit wouldn't be over-determined). The code is below.
>
> Thanks a lot,
>
> Constanze
>
>
>> library("exactRankTests")
>   Package ‘exactRankTests’ is no longer under development.
>   Please consider using package ‘coin’ instead.
>
>> # library("coin")
>> library("ipred")
> Loading required package: rpart
> Loading required package: MASS
> Loading required package: mlbench
> Loading required package: nnet
> Loading required package: class
>> library("kidpack")
> *** Deprecation warning ***:
> The package 'kidpack' is deprecated and will not be supported after Bioconductor release 2.1.
>
>
>> data(eset)
>> var_selection<- function(indx, expressions, response, p = 100) {
> +
> +     y<- switch(class(response),
> +         "factor" = { model.matrix(~ response - 1)[indx, ,drop = FALSE] },
> +         "Surv" = { matrix(cscores(response[indx]), ncol = 1) },
> +         "numeric" = { matrix(rank(response[indx]), ncol = 1) }
> +     )
> +
> +     x<- expressions[,indx, drop = FALSE]
> +     n<- nrow(y)
> +     linstat<- x %*% y
> +     Ey<- matrix(colMeans(y), nrow = 1)
> +     Vy<- matrix(rowMeans((t(y) - as.vector(Ey))^2), nrow = 1)
> +
> +     rSx<- matrix(rowSums(x), ncol = 1)
> +     rSx2<- matrix(rowSums(x^2), ncol = 1)
> +     E<- rSx %*% Ey
> +     V<- n / (n - 1) * kronecker(Vy, rSx2)
> +     V<- V - 1 / (n - 1) * kronecker(Vy, rSx^2)
> +
> +     stats<- abs(linstat - E) / sqrt(V)
> +     stats<- do.call("pmax", as.data.frame(stats))
> +     return(which(stats>  sort(stats)[length(stats) - p]))
> + }
>>
>> remove<- is.na(eset$survival.time)
>> seset<- eset[,!remove]
>> response<- Surv(seset$survival.time, seset$died)
>> response[response[,1] == 0]<- 1
>> expressions<- t(apply(exprs(seset), 1, rank))
>> exprDF<- as.data.frame(t(expressions))
>>
>> I<- nrow(exprDF)
>> Iindx<- 1:I
>> selected<- var_selection(Iindx, expressions, response,p=25)
>> bagg<- bagging(response ~., data = exprDF[,selected],ntrees = 100)
> Error in `[.data.frame`(m, attr(Terms, "term.labels")) :
>    undefined columns selected
>
>
>   -- output of sessionInfo():
>
> R version 2.15.1 (2012-06-22)
> Platform: i486-pc-linux-gnu (32-bit)
>
> locale:
>   [1] LC_CTYPE=de_DE.utf8       LC_NUMERIC=C
>   [3] LC_TIME=de_DE.utf8        LC_COLLATE=de_DE.utf8
>   [5] LC_MONETARY=de_DE.utf8    LC_MESSAGES=de_DE.utf8
>   [7] LC_PAPER=C                LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=de_DE.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] kidpack_1.5.10        ipred_0.8-8           class_7.3-4
>   [4] nnet_7.3-4            mlbench_2.1-1         MASS_7.3-21
>   [7] rpart_3.1-54          exactRankTests_0.8-22 affy_1.26.0
> [10] Biobase_2.8.0         survival_2.36-14
>
> loaded via a namespace (and not attached):
> [1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.15.1
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>


