[Rd] problem using model.frame()

Martin Maechler maechler at stat.math.ethz.ch
Wed Aug 17 20:24:12 CEST 2005


>>>>> "GS" == Gavin Simpson <gavin.simpson at ucl.ac.uk>
>>>>>     on Tue, 16 Aug 2005 18:44:23 +0100 writes:

    GS> On Tue, 2005-08-16 at 12:35 -0400, Gabor Grothendieck
    GS> wrote:
    >> On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
    >> > On Tue, 2005-08-16 at 11:25 -0400, Gabor Grothendieck wrote:
    >> > > It can handle data frames like this:
    >> > >
    >> > > model.frame(y1)
    >> > > or
    >> > > model.frame(~., y1)
    >> > 
    >> > Thanks Gabor,
    >> > 
    >> > Yes, I know that works, but I want the function
    >> > coca.formula to accept a formula like this y2 ~ y1,
    >> > with both y1 and y2 being data frames. It is
    >> 
    >> The expressions I gave work generally (i.e. lm, glm,
    >> ...), not just in model.matrix, so would it be ok if the
    >> user just does this?
    >> 
    >> yourfunction(y2 ~., y1)

    GS> Thanks again Gabor for your comments,

    GS> I'd prefer the y1 ~ y2 as data frames - as this is the
    GS> most natural way of doing things. I'd like to have (y2
    GS> ~., y1) as well, and (y2 ~ spp1 + spp2 + spp3, y1) also
    GS> work - silently without any trouble.

I'm sorry, Gavin, I tend to disagree quite a bit.

The formula notation has quite a history in the S language, and
AFAIK the idea was never to use data.frames as formula
components, but rather as "environments" in which formula
components are looked up --- exactly as Gabor has explained.

To break with such a deeply rooted principle,
you should have very, very good reasons, because you're breaking
the concepts on which all other uses of formulae are based.
And this would potentially lead to much confusion among your users,
at least in the way they should learn to think about what
formulae mean.
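
A minimal sketch of that convention, for readers who want to see it in action. The data frame `dat` and its values are made up for illustration and do not come from this thread; only `model.frame()` and `lm()` are real functions here.

```r
## Small data frame with made-up values, for illustration only.
dat <- data.frame(y    = c(2.1, 3.9, 6.2, 7.8),
                  spp1 = c(3, 0, 0, 0),
                  spp2 = c(1, 1, 0, 0))

## Variables named in the formula are looked up in 'dat'.
mf1 <- model.frame(y ~ spp1 + spp2, data = dat)

## '.' expands to every column of 'dat' not already in the formula.
mf2 <- model.frame(y ~ ., data = dat)

## The same convention carries over to the modelling functions.
fit <- lm(y ~ ., data = dat)
```

In both calls the data frame is the place where the names are resolved, not a term of the formula itself.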

Martin


    >> If it really is important to do it the way you describe,
    >> are the data frames necessarily numeric? If so you could
    >> preprocess your formula by placing as.matrix around all
    >> the variables representing data frames using something
    >> like this:
    >> 
    >> https://www.stat.math.ethz.ch/pipermail/r-help/2004-December/061485.html

    GS> Yes, they are numeric matrices (as data frames). I've
    GS> looked at this, but I'd prefer to not have to do too
    GS> much messing with the formula.

    >> Of course, if they are necessarily numeric maybe they can
    >> be matrices in the first place?

    GS> Because read.table etc. produce data.frames and this is
    GS> the natural way to work with data in R.

but it is also slightly inefficient if they are numeric.
There are places for data frames and for matrices.

Why should it be a problem to use 
    M <- as.matrix(read.table(..))
?

For large files, it could be quite a bit more efficient,
at the cost of a bit more code, to
use scan() to read the numeric data directly:

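      ## read the header line as character strings
      h1 <- scan(..., what = "", nlines = 1) ## <read variable names>
      nc <- length(h1)
      ## scan() returns the values row by row, hence byrow = TRUE
      a <- matrix(scan(...., what = numeric(), ...),
                  ncol = nc, byrow = TRUE, dimnames = list(NULL, h1))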

Maybe it would be useful to package this into
a small utility with usage

      read.matrix(...,  type = numeric(), ...)      
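
One way such a utility might look is sketched below. This is only an assumption about the interface: `read.matrix()` is not a base R function, and the arguments `file`, `header` and `sep` are chosen here for illustration. It assumes a rectangular file of whitespace-separated values with an optional first line of column names.

```r
## Hypothetical helper, not part of base R: read a rectangular
## numeric file straight into a matrix via scan().
read.matrix <- function(file, header = TRUE, sep = "", type = numeric()) {
    ## The first line gives the number of columns (and the names, if any).
    first <- scan(file, what = "", sep = sep, nlines = 1, quiet = TRUE)
    ## Read the remaining values in one go, skipping the header line.
    x <- scan(file, what = type, sep = sep,
              skip = if (header) 1 else 0, quiet = TRUE)
    ## scan() returns the values row by row, so fill the matrix by row.
    matrix(x, ncol = length(first), byrow = TRUE,
           dimnames = list(NULL, if (header) first else NULL))
}

## Usage on a small temporary file
tf <- tempfile()
writeLines(c("spp1 spp2 spp3", "3 1 0", "0 1 1"), tf)
m <- read.matrix(tf)
```

The first line is scanned twice when there is no header; for a sketch that seemed simpler than tracking the column count separately.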


    GS> Following your suggestions, I altered my code to
    GS> evaluate the rhs of the formula and check if it was of
    GS> class "data.frame". If it is then I stop processing and
    GS> return it as a data.frame at this point. If not, it
    GS> eventually gets passed on to model.frame() for it to
    GS> deal with it.
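
The approach described here could look roughly like the sketch below. It is not Gavin's actual code: the helper name `rhsFrame()` and its arguments are hypothetical, and the way evaluation failures are handled is an assumption made for the example.

```r
## Hypothetical helper illustrating the check on the rhs of a formula.
rhsFrame <- function(formula, data = environment(formula)) {
    rhs <- formula[[length(formula)]]
    ## Evaluating the rhs can fail (e.g. for '.'); treat a failure
    ## as "not a data frame" and fall through to model.frame().
    val <- tryCatch(eval(rhs, data, environment(formula)),
                    error = function(e) NULL)
    if (is.data.frame(val))
        return(val)
    ## Otherwise drop the response, if any, and defer to model.frame().
    if (length(formula) == 3L)
        formula[[2]] <- NULL
    model.frame(formula, data = data, na.action = na.fail)
}

## Made-up data for illustration
y1 <- data.frame(spp1 = c(3, 0, 0), spp2 = c(1, 1, 0))
y2 <- data.frame(env1 = c(0.5, 1.2, 2.0))

r1 <- rhsFrame(y2 ~ y1)               # rhs is a data frame: returned as is
r2 <- rhsFrame(y2 ~ spp1 + spp2, y1)  # ordinary terms: model.frame() handles them
r3 <- rhsFrame(y2 ~ ., y1)            # '.' is expanded against 'data'
```

Arguments such as `subset` are not handled in this sketch and would still need to be passed on to `model.frame()`.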

    GS> So far - limited testing - it seems to do what I wanted
    GS> all along. I'm sure there's a gotcha in there somewhere
    GS> but at least the code runs so I can check for problems
    GS> against my examples.

    GS> Right, back to writing documentation...

    GS> G

    >> > more intuitive, to my mind at least for this particular
    >> > example and analysis, to specify the formula with a
    >> > data frame on the rhs.
    >> > 
    >> > model.frame doesn't work with the formula "~ y1" if the
    >> > object y1, in the environment when model.frame
    >> > evaluates the formula, is a data.frame. It works if y1
    >> > is a matrix, however. I'd like to work around this
    >> > problem, say by creating an environment in which y1 is
    >> > modified to be a matrix, if possible. Can this be done?
    >> > 
    >> > At the moment I have something working by grabbing the
    >> > bits of the formula and then using get() to grab the
    >> > named object. Of course, this won't work if someone
    >> > wants to use R's formula interface with the following
    >> > formula y2 ~ var1 + var2 + var3, data = y1, or to use the
    >> > subset argument common to many formula
    >> > implementations. I'd like to have the function work in
    >> > as general a manner as possible, so I'm fishing around
    >> > for potential solutions.
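
As a rough illustration of the environment idea raised in this question, the sketch below builds a child environment in which any data frame named in the formula is shadowed by its `as.matrix()` version. The helper name `matrixEnv()` is hypothetical, and this is only one possible way to do it, not something proposed in the thread.

```r
## Hypothetical helper: return the formula with a new environment in
## which data frames named in the formula are replaced by matrices.
matrixEnv <- function(formula) {
    old <- environment(formula)
    env <- new.env(parent = old)
    for (nm in all.vars(formula)) {
        if (exists(nm, envir = old)) {
            obj <- get(nm, envir = old)
            ## Shadow the data frame with a matrix copy in 'env' only.
            if (is.data.frame(obj))
                assign(nm, as.matrix(obj), envir = env)
        }
    }
    environment(formula) <- env
    formula
}

## Made-up data for illustration
y1 <- data.frame(spp1 = c(3, 0, 0), spp2 = c(1, 1, 0))

## model.frame() looks up y1 in the formula's environment and now
## finds the matrix copy; the original y1 is left untouched.
mf <- model.frame(matrixEnv(~ y1))
```

This only helps when no `data` argument is supplied, since variables found in `data` take precedence over the formula's environment.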
    >> > 
    >> > All the best,
    >> > 
    >> > Gav
    >> > 
    >> > >
    >> > > On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
    >> > > > Hi I'm having a problem with model.frame,
    >> > > > encapsulated in this example:
    >> > > >
    >> > > > y1 <- matrix(c(3,1,0,1,0,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1),
    >> > > >              nrow = 5, byrow = TRUE)
    >> > > > y1 <- as.data.frame(y1)
    >> > > > rownames(y1) <- paste("site", 1:5, sep = "")
    >> > > > colnames(y1) <- paste("spp", 1:4, sep = "")
    >> > > > y1
    >> > > >
    >> > > > model.frame(~ y1)
    >> > > > Error in model.frame(formula, rownames, variables, varnames,
    >> > > > extras, extranames, :
    >> > > > invalid variable type
    >> > > >
    >> > > > temp <- as.matrix(y1)
    >> > > > model.frame(~ temp)
    >> > > >   temp.spp1 temp.spp2 temp.spp3 temp.spp4
    >> > > > 1         3         1         0         1
    >> > > > 2         0         1         1         0
    >> > > > 3         0         0         1         0
    >> > > > 4         0         0         1         1
    >> > > > 5         0         1         1         1
    >> > > >
    >> > > > Ideally the above wouldn't have names like
    >> > > > temp.var1, temp.var2, but one could deal with that
    >> > > > later.
    >> > > >
    >> > > > I have tracked down the source of the error message
    >> > > > to line 1330 in model.c - here I'm stumped as I
    >> > > > don't know any C, but it looks as if the code is
    >> > > > looping over the variables in the formula and checking if
    >> > > > they are the right "type". So a matrix of variables
    >> > > > gets through, but a data.frame doesn't.
    >> > > >
    >> > > > It would be good if model.frame could cope with
    >> > > > data.frames in formulae, but seeing as I am
    >> > > > incapable of providing a patch, is there a way around
    >> > > > this problem?
    >> > > >
    >> > > > Below is the head of the function I am currently
    >> > > > using, including the function for parsing the
    >> > > > formula - borrowed and hacked from
    >> > > > ordiParseFormula() in package vegan.
    >> > > >
    >> > > > I can work out the class of the rhs of the
    >> > > > formula. Is there a way to create a suitable
    >> > > > environment for the data argument of parseFormula()
    >> > > > such that it contains the rhs dataframe coerced to a
    >> > > > matrix, which then should get through
    >> > > > model.frame.default without error? How would I go
    >> > > > about manipulating/creating such an environment? Any
    >> > > > other ideas?
    >> > > >
    >> > > > Thanks in advance
    >> > > >
    >> > > > Gav
    >> > > >
    >> > > > coca.formula <- function(formula, method = c("predictive", "symmetric"),
    >> > > >                          reg.method = c("simpls", "eigen"), weights = NULL,
    >> > > >                          n.axes = NULL, symmetric = FALSE, data)
    >> > > > {
    >> > > >     parseFormula <- function (formula, data)
    >> > > >     {
    >> > > >         browser()
    >> > > >         Terms <- terms(formula, "Condition", data = data)
    >> > > >         flapart <- fla <- formula <- formula(Terms, width.cutoff = 500)
    >> > > >         specdata <- formula[[2]]
    >> > > >         X <- eval(specdata, data, parent.frame())
    >> > > >         X <- as.matrix(X)
    >> > > >         formula[[2]] <- NULL
    >> > > >         if (formula[[2]] == "1" || formula[[2]] == "0")
    >> > > >             Y <- NULL
    >> > > >         else {
    >> > > >             mf <- model.frame(formula, data, na.action = na.fail)
    >> > > >             Y <- model.matrix(formula, mf)
    >> > > >             if (any(colnames(Y) == "(Intercept)")) {
    >> > > >                 xint <- which(colnames(Y) == "(Intercept)")
    >> > > >                 Y <- Y[, -xint, drop = FALSE]
    >> > > >             }
    >> > > >         }
    >> > > >         list(X = X, Y = Y)
    >> > > >     }
    >> > > >     if (missing(data))
    >> > > >         data <- parent.frame()
    >> > > >     #browser()
    >> > > >     dat <- parseFormula(formula, data)
    >> > > >
    >> > > > --
    >> > > >
    >> > > > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
    >> > > > Gavin Simpson                     [T] +44 (0)20 7679 5522
    >> > > > ENSIS Research Fellow             [F] +44 (0)20 7679 7565
    >> > > > ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
    >> > > > UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
    >> > > > 26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
    >> > > > London.  WC1H 0AP.
    >> > > > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
    >> > > >
    >> > > > ______________________________________________
    >> > > > R-devel at r-project.org mailing list
    >> > > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >


