[R] issue building dataframes with matrices.

Bill.Venables at csiro.au Bill.Venables at csiro.au
Wed Aug 13 07:03:36 CEST 2008


It's a feature and it's been there forever.  (It's even present in
another system not unlike R.)

Suppose you set

y <- matrix(1:3)

and construct

dfr <- data.frame(x=1:3, y)

then you invoke the constructor function, data.frame, which by default
splits things like matrices into separate single columns, naming them
as necessary.
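
A minimal sketch of that default simplification, using only base R. The
second data frame and the names m and dfr2 are illustrative assumptions,
not part of the original question:

```r
y <- matrix(1:3)                 # single-column matrix, as in the question
dfr <- data.frame(x = 1:3, y)    # goes through the constructor

is.matrix(dfr$y)                 # FALSE: the matrix became a plain column
ncol(dfr)                        # 2

# A two-column matrix (hypothetical example) is split into two columns.
m <- matrix(1:6, nrow = 3)
dfr2 <- data.frame(x = 1:3, m = m)
ncol(dfr2)                       # 3: x plus one column per matrix column
```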

Now if you directly modify dfr by adding another component, like

dfr$yy <- y

you bypass the constructor function and its default simplifications, but
you do not bypass the structure tests.  This is, in fact, the simplest
way to put a matrix inside a data frame intact, but it must have the
same number of rows as the data frame itself.
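
As a rough check of both halves of that statement, the sketch below
assigns the matrix directly and then tries a matrix with the wrong number
of rows. The component name bad and the use of try() are assumptions made
for illustration only:

```r
y <- matrix(1:3)
dfr <- data.frame(x = 1:3, y)

dfr$yy <- y                      # direct assignment, no simplification
is.matrix(dfr$yy)                # TRUE: stored intact
dim(dfr$yy)                      # 3 1

# The structure test still applies: 4 rows do not fit a 3-row data frame.
res <- try(dfr$bad <- matrix(1:4), silent = TRUE)
inherits(res, "try-error")       # TRUE
```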

There are other ways of getting a matrix into a data frame intact, and
sometimes it is mildly useful to do this.  Consider, for example, the
following:

dfr <- within(data.frame(x = 1:5), {
    y <- rbinom(5, 100, plogis((x-3)/2))
    SF <- cbind(S = y, F = 100-y)
    rm(y)
  })
  
names(dfr)  ### Note the apparent discrepancy
dfr         ### with the printed version.

(fm <- glm(SF ~ x, binomial, dfr))
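
The discrepancy noted in the example can be inspected directly. This is
only a sketch: set.seed() is added so the run is repeatable, and the
I() variant at the end is a further way of keeping a matrix intact that
is documented in ?data.frame but not mentioned in this message (dfr2 is
a hypothetical name):

```r
set.seed(1)                      # assumption: only for repeatability
dfr <- within(data.frame(x = 1:5), {
    y <- rbinom(5, 100, plogis((x - 3)/2))
    SF <- cbind(S = y, F = 100 - y)
    rm(y)
})

length(names(dfr))               # 2 components, although 3 columns print
is.matrix(dfr$SF)                # TRUE
dim(dfr$SF)                      # 5 2
colnames(dfr$SF)                 # "S" "F"

# I() protects the matrix from the constructor's simplification.
dfr2 <- data.frame(x = 1:3, yy = I(matrix(1:3)))
is.matrix(dfr2$yy)               # TRUE
```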

Bill Venables
http://www.cmis.csiro.au/bill.venables/ 


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Daryl Morris
Sent: Wednesday, 13 August 2008 11:31 AM
To: r-help at r-project.org
Subject: [R] issue building dataframes with matrices.

Hello,
Is this a bug or a feature?  I am using R 2.7.1 on Apple OS X.


 > y <- matrix(1:3,nrow=3)     # y is a single-column matrix
 > df <-data.frame(x=1:3,y=y)
 > sapply(df,data.class)
        x         y
"numeric" "numeric"
 > df$yy <- y
 > sapply(df,data.class)
        x         y        yy
"numeric" "numeric"  "matrix"


I'm not sure why dataframes are allowed to have matrices as members.
It's also weird to me that y & yy have different classes.  It seems like
there has been a blurring of the line between lists and dataframes.
When did dataframes start taking members other than vectors?

This is an issue if one, for example, builds a dataframe to fit a model
and then subsequently wants to use predict.  You have to work a bit to
avoid a type mismatch error.

 > df$out = df$x+df$y+df$yy + rnorm(3)
 > df
  x y yy       out
1 1 1  1  3.066348
2 2 2  2  5.516017
3 3 3  3 11.073452

 
 > glmout = glm(out~x+y+yy,data=df)
 > predict(glmout,newdata=data.frame(x=1:3,y=1:3,yy=1:3))
Error: variable 'yy' was fitted with type "nmatrix.1" but type "numeric" was supplied
 >
 > predict(glmout,newdata=data.frame(x=1:3,y=1:3,yy=matrix(1:3)))
Error: variable 'yy' was fitted with type "nmatrix.1" but type "numeric" was supplied
 > predict(glmout,newdata=df[,-4])
        1         2         3
 2.548387  6.551939 10.555491
Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
  prediction from a rank-deficient fit may be misleading

I'm not really looking for a "solution", as I can already identify 
several workarounds.  I guess I'm mainly trying to figure out what the 
philosophy is here.

This is also weird to me:

 > df$yy <- as.data.frame(y)
 > df
  x y V1       out
1 1 1  1  3.066348
2 2 2  2  5.516017
3 3 3  3 11.073452
 > glmout = glm(out~x+y+V1,data=df)
Error in eval(expr, envir, enclos) : object "V1" not found
 > glmout = glm(out~x+y+yy,data=df)
Error in model.frame.default(formula = out ~ x + y + yy, data = df, drop.unused.levels = TRUE) :
  invalid type (list) for variable 'yy'
 > glmout = glm(out~x+y+yy$VI,data=df)
Error in model.frame.default(formula = out ~ x + y + yy$VI, data = df, :
  invalid type (NULL) for variable 'yy$VI'

Is it impossible to build a model from a dataframe built this way?


thanks, Daryl Morris
(Biostatistics, Univ. of Washington)