[R] problem with predict()

Liaw, Andy andy_liaw at merck.com
Fri Jun 28 14:46:06 CEST 2002


You can try:

  names(test) <- names(train)

before calling predict() to make sure that the variable names match.
Without your data files, it's hard to tell why your first example worked.

Andy

> -----Original Message-----
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> Sent: Thursday, June 27, 2002 3:29 PM
> To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
> 
> 
> 
> # Yes. You are *still* using a matrix in a data frame.  
> Please do read more
> # carefully.
> 
> I have read some more R documentation trying to understand difference
> between
> matrices and data frames etc... and I repeat my example this time
> executing EXACTLY the same code with only difference being 
> that in one case
> I use smaller data sets ({train,test}-small.csv) and in the 
> second case I
> use larger
> data sets ({train,test}.csv) - and I got different behaviour.
> 
> Small case (10*4) runs fine, larger case (164*119) fails.
> 
> Any ideas why this happens ? 
> 
> R
> 
> > rm(list=ls())
> > train.data <- read.csv("train-small.csv", header = TRUE, row.names =
> "mol", comment.char="")
> > test.data <- read.csv("test-small.csv", header = TRUE, 
> row.names = "mol",
> comment.char="")
> > yr <- train.data[,1]; xr <- train.data[,-1]
> > xr <- scale(xr)
> > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, 
> "scaled:scale")
> > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > xr <- xr[,!mask] # rm NA's
> > ys <- test.data[,1]; xs <- test.data[,-1]
> > xs <- scale(xs, center = x.center, scale = x.scale)
> > xs <- xs[,!mask]
> > train <- data.frame(y = yr, x = xr)
> > test <- data.frame(y = ys, x = xs)
> > model <- lm(y~., train)
> > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> dim(train) = 10 4 ; dim(test) = 10 4 
> > length(predict(model, test))
> [1] 10
> > train.data <- read.csv("train.csv", header = TRUE, 
> row.names = "mol",
> comment.char="")
> > test.data <- read.csv("test.csv", header = TRUE, row.names = "mol",
> comment.char="")
> [snip...]
> > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> dim(train) = 164 119 ; dim(test) = 35 119 
> > length(predict(model, test))
> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : 
>         subscript out of bounds
> >
> 
> Ryszard Czerminski   phone: (781)994-0479
> ArQule, Inc.         email:ryszard at arqule.com
> 19 Presidential Way  http://www.arqule.com
> Woburn, MA 01801     fax: (781)994-0679
> 
> 
> -----Original Message-----
> From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> Sent: Friday, June 21, 2002 3:41 PM
> To: Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
> 
> 
> On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:
> 
> > --- first problem
> >
> > If I store 'simulated' data in data frames:
> > # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> > # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> > it still works the same way i.e. the code below works fine
> > for simulated data and fails for 'real' data the only
> > difference being in actual numeric values stored in data
> > structures of the same shape and type.
> >
> > Any suggestions why this happens ?
> 
> Yes. You are *still* using a matrix in a data frame.  Please 
> do read more
> carefully.
> 
> > --- second problem
> >
> > > As Andy Liaw pointed out, xr is a matrix.  Take a look at 
> the names of
> > > train.  Hint: they do not contain `x'.
> >
> > Following your hint I am guessing that the fact that names 
> do not contain
> > 'x'
> > explains why lm(y~., train) form works and lm(y~x, train) fails
> > and "lm(y~., train)" means roughly: correlate column "y" to 
> all other
> colums
> 
> No, it means regress y on all the remaining colums in the 
> data argument.
> 
> >
> > Where I can find more detail specification of this syntax ?
> > In help(lm) I find this paragraph:
> >
> >      Models for `lm' are specified symbolically.  A typical 
> model has
> >      the form `response ~ terms' where `response' is the 
> (numeric)...
> >
> > which does not quite cover this case.
> 
> In any good book on the subject.
> 
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> -.-.-.-.-.-.-.
> -.-
> r-help mailing list -- Read 
> http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: 
> r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> _._._._._._._.
> _._
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> -.-.-.-.-.-.-.-.-
> r-help mailing list -- Read 
> http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: 
> r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> _._._._._._._._._
> 

------------------------------------------------------------------------------
Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that may be confidential, proprietary copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named on this message.  If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.

==============================================================================

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._



More information about the R-help mailing list