[Rd] problem using model.frame()

Gavin Simpson gavin.simpson at ucl.ac.uk
Thu Aug 18 08:53:05 CEST 2005


On Wed, 2005-08-17 at 21:48 -0400, Gabor Grothendieck wrote:
> If it's just a matter of specifying two data frames how about just
> letting the user specify them as the first two arguments without
> injecting formulas into it so that any of these are allowed but
> data frames are still not allowed in formulas other than in the
> data argument:
> 
> yourfunction(df1, df2)
> yourfunction(y ~ sp1 + sp2)
> yourfunction(y ~., df)
> 
> This could easily be implemented by having yourfunction be
> generic in which case the first one would dispatch
> yourfunction.data.frame and the second and third would
> dispatch yourfunction.formula.

Hi Gabor,

yourfunction() is already generic; I have .default and .formula methods.
The default implementation of the method (Co-correspondence analysis) is
akin to a regression and uses a form of multivariate PLS, so one data
matrix plays the role of the response and the other the predictor. That
is the reason for wanting to use a formula interface.
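
To make the dispatch concrete, here is a minimal sketch of a generic with
.default and .formula methods. The name cocaSketch and the returned list
are made up for illustration (they are not the package's code), and the
formula method assumes each side of the formula is a single object that is
visible from the environment in which the formula was created.

```r
# Hypothetical generic; cocaSketch is an illustrative name only.
cocaSketch <- function(y, ...) UseMethod("cocaSketch")

# Default method: response and predictor tables are passed directly.
cocaSketch.default <- function(y, x, ...) {
  y <- as.matrix(y)
  x <- as.matrix(x)
  stopifnot(nrow(y) == nrow(x))
  list(response = y, predictor = x)
}

# Formula method: evaluate both sides where the formula was created,
# then hand the two tables to the default method.
cocaSketch.formula <- function(y, ...) {
  env <- environment(y)
  cocaSketch.default(eval(y[[2]], env), eval(y[[3]], env), ...)
}

spp  <- data.frame(spp1 = c(3, 0, 0), spp2 = c(1, 1, 0))
resp <- data.frame(r1 = c(2, 5, 1))
fitDirect  <- cocaSketch(resp, spp)    # dispatches to the default method
fitFormula <- cocaSketch(resp ~ spp)   # dispatches to the formula method
```

Both calls end up in the same code path, so the formula method only has to
work out what the two sides of the formula refer to.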

Cheers,

G

> On 8/17/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
> > On Wed, 2005-08-17 at 20:24 +0200, Martin Maechler wrote:
> > > >>>>> "GS" == Gavin Simpson <gavin.simpson at ucl.ac.uk>
> > > >>>>>     on Tue, 16 Aug 2005 18:44:23 +0100 writes:
> > >
> > >     GS> On Tue, 2005-08-16 at 12:35 -0400, Gabor Grothendieck
> > >     GS> wrote:
> > >     >> On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
> > >     >> > On Tue, 2005-08-16 at 11:25 -0400, Gabor Grothendieck wrote:
> > >     >> > > It can handle data frames like this:
> > >     >> > >
> > >     >> > > model.frame(y1)
> > >     >> > > or
> > >     >> > > model.frame(~., y1)
> > >     >> >
> > >     >> > Thanks Gabor,
> > >     >> >
> > >     >> > Yes, I know that works, but I want the function coca.formula to accept a
> > >     >> > formula like this y2 ~ y1, with both y1 and y2 being data frames. It is
> > >     >>
> > >     >> The expressions I gave work generally (i.e. lm, glm,
> > >     >> ...), not just in model.matrix, so would it be ok if the
> > >     >> user just does this?
> > >     >>
> > >     >> yourfunction(y2 ~., y1)
> > >
> > >     GS> Thanks again Gabor for your comments,
> > >
> > >     GS> I'd prefer the y1 ~ y2 as data frames - as this is the
> > >     GS> most natural way of doing things. I'd like to have (y2
> > >     GS> ~., y1) as well, and (y2 ~ spp1 + spp2 + spp3, y1) also
> > >     GS> work - silently without any trouble.
> > >
> > > I'm sorry, Gavin, I tend to disagree quite a bit.
> > >
> > > The formula notation has quite a history in the S language, and
> > > AFAIK the idea was never to use data.frames as formula
> > > components, but rather as "environments" in which formula
> > > components are looked up --- exactly as Gabor has explained.
> > 
> > Hi Martin, thanks for your comments,
> > 
> > But then one could have a matrix of variables on the rhs of the formula
> > and it would work - whether this is a documented feature or un-intended
> > side-effect of matrices being stored as vectors with dims, I don't know.
> > 
> > And whilst the formula may have a long history, a number of packages
> > have extended the interface to implement specific features that don't
> > work with standard functions like lm, glm and friends. I don't see how
> > what I wanted to achieve is greatly different to that or using a matrix.
> > 
> > > To break with such a deeply rooted principle,
> > > you should have very very good reasons, because you're breaking
> > > the concepts on which all other uses of formulae are based.
> > > And this would potentially lead to much confusion of your users,
> > > at least in the way they should learn to think about what
> > > formulae mean.
> > 
> > In the end I managed to treat y1 ~ y2 (both data frames) as a special
> > case, which allows the existing formula notation to work as well, so I
> > can use y1 ~ y2, y1 ~ ., data = y2, or y1 ~ var + var2, data = y2. This
> > is what I wanted all along, to extend my interface (not do anything to
> > R's formulae), but to also work in the traditional sense.
> > 
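
A rough sketch of that kind of special-casing, for anyone following the
thread. The helper name rhsMatrix is hypothetical and this is not the code
in my package; it assumes the rhs is either a single name referring to a
data frame, or something model.frame() can already handle when given a
data argument.

```r
# Hypothetical helper (not the package's code): right-hand side as a matrix.
rhsMatrix <- function(formula, data = NULL) {
  env <- environment(formula)
  rhs <- formula[[length(formula)]]
  # Special case: the rhs is a single name that refers to a data frame.
  if (is.name(rhs) && !identical(rhs, quote(.))) {
    obj <- tryCatch(
      if (is.null(data)) eval(rhs, env) else eval(rhs, data, env),
      error = function(e) NULL
    )
    if (is.data.frame(obj)) return(as.matrix(obj))
  }
  # Traditional case: drop the response and let model.frame() do the work.
  if (length(formula) == 3L) formula[[2]] <- NULL
  mf <- model.frame(formula, data = data, na.action = na.fail)
  mm <- model.matrix(attr(mf, "terms"), mf)
  mm[, colnames(mm) != "(Intercept)", drop = FALSE]
}

y1 <- data.frame(spp1 = c(3, 0, 0, 0, 0), spp2 = c(1, 1, 0, 0, 1),
                 spp3 = c(0, 1, 1, 1, 1))
y2 <- data.frame(r1 = c(1, 2, 3, 4, 5))
m1 <- rhsMatrix(y2 ~ y1)                       # data frame on the rhs
m2 <- rhsMatrix(y2 ~ ., data = y1)             # all variables in data
m3 <- rhsMatrix(y2 ~ spp1 + spp2, data = y1)   # named variables in data
```

The check for a data frame happens before model.frame() is ever called, so
the standard formula forms are left to the standard machinery.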
> > The model I am writing code for really is modelling the relationship
> > between two matrices of data. In one version of the method, there is
> > real equivalence between both sides of the formula so it would seem odd
> > to treat the two sides of the formula differently. At least to me ;-)
> > 
> > > Martin
> > >
> > >
> > >     >> If it really is important to do it the way you describe,
> > >     >> are the data frames necessarily numeric? If so you could
> > >     >> preprocess your formula by placing as.matrix around all
> > >     >> the variables representing data frames using something
> > >     >> like this:
> > >     >>
> > >     >> https://www.stat.math.ethz.ch/pipermail/r-help/2004-December/061485.html
> > >
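
For reference, the as.matrix preprocessing idea could look roughly like the
sketch below. This is not the code from the linked r-help post; the helper
name wrapDataFrames is hypothetical, and it assumes a simple formula whose
symbols can be looked up from the formula's environment.

```r
# Hypothetical helper: wrap as.matrix() around symbols that are data frames.
wrapDataFrames <- function(formula) {
  env <- environment(formula)
  walk <- function(e) {
    if (is.name(e)) {
      # Look the symbol up; anything that cannot be found is left alone.
      obj <- tryCatch(get(as.character(e), envir = env),
                      error = function(err) NULL)
      if (is.data.frame(obj)) call("as.matrix", e) else e
    } else if (is.call(e)) {
      # Keep the function part, recurse into the arguments.
      as.call(c(list(e[[1]]), lapply(as.list(e)[-1], walk)))
    } else {
      e
    }
  }
  for (i in seq_along(formula)[-1]) formula[[i]] <- walk(formula[[i]])
  formula
}

y1 <- data.frame(spp1 = c(3, 0, 0), spp2 = c(1, 1, 0))
f  <- wrapDataFrames(~ y1)   # rhs becomes as.matrix(y1)
mf <- model.frame(f)         # now accepted, as a single matrix column
```

The rewritten formula keeps its environment, so model.frame() still finds
y1 when it evaluates as.matrix(y1).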
> > >     GS> Yes, they are numeric matrices (as data frames). I've
> > >     GS> looked at this, but I'd prefer to not have to do too
> > >     GS> much messing with the formula.
> > >
> > >     >> Of course, if they are necessarily numeric maybe they can
> > >     >> be matrices in the first place?
> > >
> > >     GS> Because read.table etc. produce data.frames and this is
> > >     GS> the natural way to work with data in R.
> > >
> > > but it is also slightly inefficient if they are numeric.
> > > There are places for data frames and for matrices.
> > 
> > I agree - and in the code I've written, y1 and y2 quickly get coerced to
> > matrices before the real number crunching begins.
> > 
> > However, all the other R modelling functions I have used work with
> > data.frames. Arguably, it could cause more confusion to write a function
> > that looked, walked and quacked like an R modelling function but needed
> > the user to apply an extra step to use - a step not usually required
> > under normal R usage.
> > 
> > All the best,
> > 
> > Gav
> > 
> > > Why should it be a problem to use
> > >     M <- as.matrix(read.table(..))
> > > ?
> > >
> > > For large files, it could be quite a bit more efficient, at the cost
> > > of a bit more code, to use scan() to read the numeric data directly:
> > >
> > >       h1 <- scan(..., nlines = 1) ## <read variable names>
> > >       nc <- length(h1)
> > >       a <- matrix(scan(...., what = numeric(), ...),
> > >                   ncol = nc, byrow = TRUE, dimnames = list(NULL, h1))
> > >
> > > maybe this would be useful to be packaged into
> > > a small utility with usage
> > >
> > >       read.matrix(...,  type = numeric(), ...)
> > >
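
A minimal sketch of such a utility, assuming a whitespace-separated file
whose first line holds the column names. The name readMatrixSketch and its
arguments are hypothetical; nothing like it is being claimed for base R.

```r
# Hypothetical utility: read a purely numeric table straight into a matrix.
readMatrixSketch <- function(file, sep = "") {
  # First line: column names.
  header <- scan(file, what = character(), nlines = 1, sep = sep,
                 quiet = TRUE)
  nc <- length(header)
  # Remaining lines: numeric values, which scan() returns row by row.
  values <- scan(file, what = numeric(), skip = 1, sep = sep, quiet = TRUE)
  matrix(values, ncol = nc, byrow = TRUE, dimnames = list(NULL, header))
}

tmp <- tempfile()
writeLines(c("spp1 spp2", "3 1", "0 1", "0 0"), tmp)
m <- readMatrixSketch(tmp)
```

Note the byrow = TRUE: scan() reads the file line by line, so the values
have to be laid out by row to end up in the right columns.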
> > >
> > >     GS> Following your suggestions, I altered my code to
> > >     GS> evaluate the rhs of the formula and check if it was of
> > >     GS> class "data.frame". If it is then I stop processing and
> > >     GS> return it as a data.frame at this point. If not, it
> > >     GS> eventually gets passed on to model.frame() for it to
> > >     GS> deal with it.
> > >
> > >     GS> So far - limited testing - it seems to do what I wanted
> > >     GS> all along. I'm sure there's a gotcha in there somewhere
> > >     GS> but at least the code runs so I can check for problems
> > >     GS> against my examples.
> > >
> > >     GS> Right, back to writing documentation...
> > >
> > >     GS> G
> > >
> > >     >> > more intuitive, to my mind at least for this particular example and
> > >     >> > analysis, to specify the formula with a data frame on the rhs.
> > >     >> >
> > >     >> > model.frame doesn't work with the formula "~ y1" if the object y1, in
> > >     >> > the environment when model.frame evaluates the formula, is a data.frame.
> > >     >> > It works if y1 is a matrix, however. I'd like to work around this
> > >     >> > problem, say by creating an environment in which y1 is modified to be a
> > >     >> > matrix, if possible. Can this be done?
> > >     >> >
> > >     >> > At the moment I have something working by grabbing the bits of the
> > >     >> > formula and then using get() to grab the named object. Of course, this
> > >     >> > won't work if someone wants to use R's formula interface with the
> > >     >> > following formula y2 ~ var1 + var2 + var3, data = y1, or to use the
> > >     >> > subset argument common to many formula implementations. I'd like to have
> > >     >> > the function work in as general a manner as possible, so I'm fishing
> > >     >> > around for potential solutions.
> > >     >> >
> > >     >> > All the best,
> > >     >> >
> > >     >> > Gav
> > >     >> >
> > >     >> > >
> > >     >> > > On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
> > >     >> > > > Hi I'm having a problem with model.frame, encapsulated in this example:
> > >     >> > > >
> > >     >> > > > y1 <- matrix(c(3,1,0,1,0,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1),
> > >     >> > > >              nrow = 5, byrow = TRUE)
> > >     >> > > > y1 <- as.data.frame(y1)
> > >     >> > > > rownames(y1) <- paste("site", 1:5, sep = "")
> > >     >> > > > colnames(y1) <- paste("spp", 1:4, sep = "")
> > >     >> > > > y1
> > >     >> > > >
> > >     >> > > > model.frame(~ y1)
> > >     >> > > > Error in model.frame(formula, rownames, variables, varnames, extras, extranames, :
> > >     >> > > >         invalid variable type
> > >     >> > > >
> > >     >> > > > temp <- as.matrix(y1)
> > >     >> > > > model.frame(~ temp)
> > >     >> > > >   temp.spp1 temp.spp2 temp.spp3 temp.spp4
> > >     >> > > > 1         3         1         0         1
> > >     >> > > > 2         0         1         1         0
> > >     >> > > > 3         0         0         1         0
> > >     >> > > > 4         0         0         1         1
> > >     >> > > > 5         0         1         1         1
> > >     >> > > >
> > >     >> > > > Ideally the above wouldn't have names like temp.var1, temp.var2, but one
> > >     >> > > > could deal with that later.
> > >     >> > > >
> > >     >> > > > I have tracked down the source of the error message to line 1330 in
> > >     >> > > > model.c - here I'm stumped as I don't know any C, but it looks as if the
> > >     >> > > > code is looping over the variables in the formula and checking if they
> > >     >> > > > are the right "type". So a matrix of variables gets through, but a
> > >     >> > > > data.frame doesn't.
> > >     >> > > >
> > >     >> > > > It would be good if model.frame could cope with data.frames in formulae,
> > >     >> > > > but seeing as I am incapable of providing a patch, is there a way around
> > >     >> > > > this problem?
> > >     >> > > >
> > >     >> > > > Below is the head of the function I am currently using, including the
> > >     >> > > > function for parsing the formula - borrowed and hacked from
> > >     >> > > > ordiParseFormula() in package vegan.
> > >     >> > > >
> > >     >> > > > I can work out the class of the rhs of the formula. Is there a way to
> > >     >> > > > create a suitable environment for the data argument of parseFormula()
> > >     >> > > > such that it contains the rhs dataframe coerced to a matrix, which then
> > >     >> > > > should get through model.frame.default without error? How would I go
> > >     >> > > > about manipulating/creating such an environment? Any other ideas?
> > >     >> > > >
> > >     >> > > > Thanks in advance
> > >     >> > > >
> > >     >> > > > Gav
> > >     >> > > >
> > >     >> > > > coca.formula <- function(formula, method = c("predictive", "symmetric"),
> > >     >> > > >                          reg.method = c("simpls", "eigen"), weights = NULL,
> > >     >> > > >                          n.axes = NULL, symmetric = FALSE, data)
> > >     >> > > > {
> > >     >> > > >     parseFormula <- function (formula, data)
> > >     >> > > >     {
> > >     >> > > >         browser()
> > >     >> > > >         Terms <- terms(formula, "Condition", data = data)
> > >     >> > > >         flapart <- fla <- formula <- formula(Terms, width.cutoff = 500)
> > >     >> > > >         specdata <- formula[[2]]
> > >     >> > > >         X <- eval(specdata, data, parent.frame())
> > >     >> > > >         X <- as.matrix(X)
> > >     >> > > >         formula[[2]] <- NULL
> > >     >> > > >         if (formula[[2]] == "1" || formula[[2]] == "0")
> > >     >> > > >             Y <- NULL
> > >     >> > > >         else {
> > >     >> > > >             mf <- model.frame(formula, data, na.action = na.fail)
> > >     >> > > >             Y <- model.matrix(formula, mf)
> > >     >> > > >             if (any(colnames(Y) == "(Intercept)")) {
> > >     >> > > >                 xint <- which(colnames(Y) == "(Intercept)")
> > >     >> > > >                 Y <- Y[, -xint, drop = FALSE]
> > >     >> > > >             }
> > >     >> > > >         }
> > >     >> > > >         list(X = X, Y = Y)
> > >     >> > > >     }
> > >     >> > > >     if (missing(data))
> > >     >> > > >         data <- parent.frame()
> > >     >> > > >     #browser()
> > >     >> > > >     dat <- parseFormula(formula, data)
> > >     >> > > >
> > >     >> > > > --
> > >     >> > > > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> > >     >> > > > Gavin Simpson                     [T] +44 (0)20 7679 5522
> > >     >> > > > ENSIS Research Fellow             [F] +44 (0)20 7679 7565
> > >     >> > > > ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
> > >     >> > > > UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
> > >     >> > > > 26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
> > >     >> > > > London.  WC1H 0AP.
> > >     >> > > > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> > >     >> > > >
> > >     >> > > > ______________________________________________
> > >     >> > > > R-devel at r-project.org mailing list
> > >     >> > > > https://stat.ethz.ch/mailman/listinfo/r-devel
> > >     >> > > >
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


