[R] data lost in cv.tree?

Alexy Khrabrov braver at pobox.com
Thu Sep 25 07:26:51 CEST 2003


Greetings -- I'm programming a data mining system
in R for protein structural data.  As a seasoned
Perl, Python, Ada, ML, etc. programmer,
I am severely befuddled by the environment problem,
where data is not found in a function nested three
levels deep.  I did read up on the idea that the parent
frame is not on the search path, and came up with a hack
which kind of works; it is shown below, after the code
which should work but doesn't.  However, until I fully
understand the issue, I cannot trust my model, which is
serious.  So here's a toy example I extracted from my code,
reproducing the problem:


##################################################################


library(tree)  # provides tree(), cv.tree() and prune.tree()

where.is.X <- function() {
  nx <- 3
  ny <- 4
  y <- as.factor(c(1,0,1,1))
  X <- data.frame(matrix(c(1:(nx*ny)), nrow = ny, ncol = nx,
                         dimnames=list(c(),paste("x",1:nx,sep=""))))

  btr <- best.tree.lost(y~., X)
  print(summary(btr))
}


best.tree.lost <- function (fmla, X) {
  tr <- tree(fmla, X)
  print(summary(tr))
  cvtr <- cv.tree(tr, envir=parent.frame())
  size <- cvtr$size[order(cvtr$dev)[1]]
  print(size)
  btr <- prune.tree(tr, best=size)
  btr
}


##################################################################


here.is.X <- function() {
  nx <- 3
  ny <- 4
  y <- as.factor(c(1,0,1,1))
  X <- data.frame(matrix(c(1:(nx*ny)), nrow = ny, ncol = nx,
                         dimnames=list(c(),paste("x",1:nx,sep=""))))

  btr <- best.tree.found(y~., X)
  print(summary(btr))
}


best.tree.found <- function (fmla, X) {

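  # Hack: copy the arguments into the global environment (frame 0)
  # so that the calls nested inside cv.tree can find them by name.
  assign(".fmla", fmla, sys.frame(0))
  assign(".X", X, sys.frame(0))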

  assign(".tr", tree(.fmla, .X), sys.frame(0))
  print(summary(.tr))
  cvtr <- cv.tree(.tr)
  size <- cvtr$size[order(cvtr$dev)[1]]
  print(size)
  btr <- prune.tree(.tr, best=size)
  btr
}

Now, if you ask
> where.is.X()
you get:
> Error in model.frame.default(formula = fmla, data = X, subset = c("1",  : 
	Object "X" not found

and if you say
> here.is.X()

you get a normal error :)  (the toy tree is a singleton; if you know
of an easy way to generate a meaningful X for it, please show me).
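
One possibility for a less degenerate toy data set, sketched in base R
only.  The sizes, the seed and the rule tying y to x1 are arbitrary
assumptions of mine, and I have not checked that four rows being too few
is really why the tree comes out as a singleton:

```r
set.seed(1)                       # arbitrary seed, for reproducibility
ny <- 100                         # many more rows than the 4 used above
nx <- 3
X <- data.frame(matrix(rnorm(nx * ny), nrow = ny, ncol = nx,
                       dimnames = list(NULL, paste("x", 1:nx, sep = ""))))

# The class label depends on x1 plus some noise, so there is
# structure for a classification tree to pick up.
y <- as.factor(as.integer(X$x1 + rnorm(ny, sd = 0.5) > 0))

table(y)                          # both classes should be present
```

These X and y could then replace the ones built inside where.is.X and
here.is.X.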

At this point, I went looking for ways to achieve the effect of .found
in .lost without the global assignment.  To my horror, I found that
you can supply environments or local=list(...), try to assign frames,
or say something like data=parent.frame(); that formulas have frames
associated with them somewhere; that I am no longer sure, for y~., what
y and . actually are at a given point in space and time; that
model.frame(tr) magically finds out that I supplied data=X, even though
I didn't name X in tree(fmla, X); and that cv.tree can't find X even
though it's not a parameter.  If tr needs to know X, it should make sure
it remembers where it took it from in the first place!
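
To make the "formulas have frames" part concrete, here is a minimal
base-R sketch (no tree package involved).  The helper make.fmla is a
hypothetical name invented for this illustration; the sketch only shows
that a formula carries the environment it was created in, and that
model.frame looks in the data first and then in that environment:

```r
# Hypothetical helper: creates a formula inside a function call.
make.fmla <- function() {
  y <- factor(c(1, 0, 1, 1))   # exists only in this call's frame
  f <- y ~ x1
  f
}

f   <- make.fmla()
env <- environment(f)          # the frame in which the formula was created
ls(env)                        # names living in that frame

X  <- data.frame(x1 = 1:4)     # note: X has no column called y
mf <- model.frame(f, data = X) # x1 comes from X, y from environment(f)
names(mf)
```

So y is found through the formula, not through the caller of model.frame.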

Horrors!  Please enlighten me as to where formulas and models keep their
training data sets, and how I can verify they are what they should have
been, or I will never trust R models.  As a pro I can quickly hack anything
with globals, but copying stuff around is not the answer.  I need the R
model.frame enlightenment!
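
A small sketch of what I mean by "verify", using lm from base R purely
as a stand-in (tree may well behave differently, which is exactly what
I cannot tell).  The names fit.inside and local.dat are made up for this
example.  It shows that the fitted object records the *name* of the data
argument in its call, even when the data was passed positionally, and
that re-evaluating that call elsewhere fails if the name is not visible:

```r
# Hypothetical wrapper: the data frame exists only inside this function.
fit.inside <- function() {
  local.dat <- data.frame(x1 = 1:4, y = c(2.1, 3.9, 6.2, 7.8))
  lm(y ~ x1, local.dat)        # data passed positionally, like tree(fmla, X)
}

fit <- fit.inside()
fit$call                       # the matched call names the data argument
is.name(fit$call$data)         # the data is stored as a symbol, not a copy

mf <- model.frame(fit)         # lm keeps its model frame (model = TRUE by default)

# Re-evaluating the recorded call from the top level has to look up
# local.dat again, which is not visible here.
res <- try(eval(fit$call), silent = TRUE)
inherits(res, "try-error")
```

If a model object does not keep its own model frame, anything that
re-evaluates the call later depends on where that symbol can be found.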

The same problem arises in stepAIC, and global assignment to frame 0 solves
it, but there should be (a) a better way and (b) a clear general understanding
as to where formulas and data frames are associated and found!

For dessert, I want to run R under cygwin; the Windows distribution is
stand-alone and quits under cygwin; is there a cygwin-ready distribution?
The howtos for compiling with mingw seem to be for the stand-alone build
as well...  And just by itself, it doesn't compile easily (?)...

Cheers,
Alexy



