[Rd] problem using model.frame()

Thu Aug 18 06:57:39 CEST 2005

On 18 Aug 2005, at 1:49, Gavin Simpson wrote:

> On Wed, 2005-08-17 at 20:24 +0200, Martin Maechler wrote:
>>>>>>> "GS" == Gavin Simpson <gavin.simpson at ucl.ac.uk>
>>>>>>>     on Tue, 16 Aug 2005 18:44:23 +0100 writes:
>>
>>     GS> On Tue, 2005-08-16 at 12:35 -0400, Gabor Grothendieck
>>     GS> wrote:
>>>> On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk>
>>>> wrote: > On Tue, 2005-08-16 at 11:25 -0400, Gabor
>>>> Grothendieck wrote: > > It can handle data frames like
>>>> this:
>>>>>>
>>>>>> model.frame(y1) > > or > > model.frame(~., y1)
>>>>>
>>>>> Thanks Gabor,
>>>>>
>>>>> Yes, I know that works, but I want the function
>>>> coca.formula to accept a > formula like this y2 ~ y1,
>>>> with both y1 and y2 being data frames. It is
>>>>
>>>> The expressions I gave work generally (i.e. lm, glm,
>>>> ...), not just in model.matrix, so would it be ok if the
>>>> user just does this?
>>>>
>>>> yourfunction(y2 ~., y1)
>>
>>     GS> Thanks again Gabor for your comments,
>>
>>     GS> I'd prefer the y1 ~ y2 as data frames - as this is the
>>     GS> most natural way of doing things. I'd like to have (y2
>>     GS> ~., y1) as well, and (y2 ~ spp1 + spp2 + spp3, y1) also
>>     GS> work - silently without any trouble.
>>
>> I'm sorry, Gavin, I tend to disagree quite a bit.
>>
>> The formula notation has quite a history in the S language, and
>> AFAIK never was the idea to use data.frames as formula
>> components, but rather as "environments" in which formula
>> components are looked up --- exactly as Gabor has explained.
>
> Hi Martin, thanks for your comments,
>
> But then one could have a matrix of variables on the rhs of the formula
> and it would work - whether this is a documented feature or un-intended
> side-effect of matrices being stored as vectors with dims, I don't 
> know.
>
> And whilst the formula may have a long history, a number of packages
> have extended the interface to implement a specific feature, which 
> don't
> work with standard functions like lm, glm and friends. I don't see how
> what I wanted to achieve is greatly different to that or using a 
> matrix.
>
>> To break with such a deeply rooted principle,
>> you should have very very good reasons, because you're breaking
>> the concepts on which all other uses of formulae are based.
>> And this would potentially lead to much confusion of your users,
>> at least in the way they should learn to think about what
>> formulae mean.
>
> In the end I managed to treat y1 ~ y2 (both data frames) as a special
> case, which allows the existing formula notation to work as well, so I
> can use y1 ~ y2, y1 ~ ., data = y2, or y1 ~ var + var2, data = y2. This
> is what I wanted all along, to extend my interface (not do anything to
> R's formulae), but to also work in the traditional sense.
>
> The model I am writing code for really is modelling the relationship
> between two matrices of data. In one version of the method, there is
> real equivalence between both sides of the formula so it would seem odd
> to treat the two sides of the formula differently. At least to me ;-)

It seems that I may be responsible for one of these extensions (lhs as 
a data.frame in cca and rda in vegan package). There the response (lhs) 
is multivariate or a multispecies community, and you must take that as 
a whole without manipulation (and if you tried using VGAM you see there 
really is painful to define lhs with, say, 127 elements). However, in 
general you shouldn't use models where you use all the 'explanatory' 
variables (rhs) that yo happen to have by accident. So much bad science 
has been created with that approach even in your field, Gav. The whole 
idea of formula is the ability to choose from candidate variables. That 
is: to build a model. Therefore you have one-sided formulae in prcomp() 
and princomp(): you can say prcomp(~ x1 + log(x2) +x4, data) or 
prcomp(~ . - x3, data). I think you should try to keep it so. Do 
instead like Gabor suggested: you could have a function coca.default or 
coca.matrix with interface:

coca.matrix(matx, maty, matz) -- or you can name this as coca.default.

and coca.formula which essentially parses your formula and returns a 
list of matrices you need:

coca.formula <- function(formula, data)
{
      matricesout <- parsemyformula(formula, data)
     coca(matricesout$matx, matricesout$maty, matricesoutz)
}
Then you need the generic: coca <- function(...) UseMethod("coca") and 
it's done (but fails in R CMD check unless you add "..." in all 
specific functions...). The real work is always done in coca.matrix (or 
coca.default), and the others just chew your data into suitable form 
for your workhorse.

If then somebody thinks that they need all possible variables as 
'explanatory' variables (or perhaps constraints in your case), they 
just call the function as

coca(matx, maty, matz)

And if you have coca.data.frame they don't need 'quacking' with extra 
steps:

coca.data.frame <- function(dfx, dfy dfz) coca(as.matrix(dfx), 
as.matrix(dfy), as.matrix(dfz)).

This you call as coca(dfx, dfy, dfz) and there you go.

The essential feature in formula is the ability to define the model. 
Don't give it away.

cheers, jazza
--
Jari Oksanen, Oulu, Finland