[R] problem with predict()

Czerminski, Ryszard ryszard at arqule.com
Fri Jun 28 15:27:39 CEST 2002


This time I use the same file for both train.data and test.data, and add
"names(test) <- names(train)" before predict() as extra protection (:-),
and it still fails...

Is it some weird problem with this particular data set, or a bug?
(Why "subscript out of bounds"?)

> rm(list=ls())
> train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> test.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> yr <- train.data[,1]; xr <- train.data[,-1]
> xr <- scale(xr)     # matrix <- scale(data.frame)
> x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> mask <- apply(xr, 2, function(x) any(is.na(x)))
> xr <- xr[,!mask] # rm NA's
> ys <- test.data[,1]; xs <- test.data[,-1]
> xs <- scale(xs, center = x.center, scale = x.scale)
> xs <- xs[,!mask]
> train <- data.frame(y = yr, x = xr)
> test <- data.frame(y = ys, x = xs)
> model <- lm(y~., train)
> cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 164 119 ; dim(test) = 164 119 
> names(test) <- names(train)
> length(predict(model, test))
Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : 
        subscript out of bounds
>
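
One way to narrow down an error like this is to compare the pieces named in the message. The sketch below uses simulated data with a deliberately collinear column (an assumption for illustration only; nothing in this thread establishes that the real data are rank-deficient). It also assumes, going by the error text, that X is the design matrix built from the new data and that piv indexes the coefficients that were actually estimated. All object names are hypothetical.

```r
set.seed(3)
# Simulated stand-in for the real files; 'ab' is exactly collinear on purpose
train <- data.frame(y = rnorm(30), a = rnorm(30), b = rnorm(30))
train$ab <- train$a + train$b
test <- train

model <- lm(y ~ ., data = train)

# Design matrix that the fitted terms produce from the new data
X <- model.matrix(delete.response(terms(model)), test)

n_coef  <- length(coef(model))                    # includes aliased (NA) terms
aliased <- names(coef(model))[is.na(coef(model))] # terms dropped by the fit

# If ncol(X) differs from n_coef, or the rank is below n_coef,
# the column indexing inside predict() deserves a closer look
c(ncol_X = ncol(X), n_coef = n_coef, rank = model$rank)
aliased
```

If the column count of X and the number of coefficients disagree for the real data, that points at the construction of test; if the rank is smaller than the number of coefficients, the fit itself has dropped some terms.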

Ryszard Czerminski   phone: (781)994-0479
ArQule, Inc.         email:ryszard at arqule.com
19 Presidential Way  http://www.arqule.com
Woburn, MA 01801     fax: (781)994-0679


-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com]
Sent: Friday, June 28, 2002 8:46 AM
To: 'Czerminski, Ryszard'
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()


You can try:

  names(test) <- names(train)

before calling predict() to make sure that the variable names match.
Without your data files, it's hard to tell why your first example worked.
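
A minimal sketch of why the names matter, using simulated data (the objects and column names below are hypothetical, not the poster's files): predict() looks up each predictor in the new data by name, so a column whose name differs is not found until the names are restored.

```r
set.seed(1)
# Simulated data; column names x1, x2 are made up for illustration
train <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
test  <- data.frame(y = rnorm(5),  x1 = rnorm(5),  x2 = rnorm(5))
names(test)[3] <- "X2"             # name no longer matches the fitted model

model <- lm(y ~ ., data = train)

# The lookup is by name, so the mismatched column makes predict() fail
res <- try(predict(model, test), silent = TRUE)
name_mismatch_fails <- inherits(res, "try-error")

names(test) <- names(train)        # the suggested fix
pred <- predict(model, test)
length(pred)
```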

Andy

> -----Original Message-----
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> Sent: Thursday, June 27, 2002 3:29 PM
> To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
> 
> 
> 
> # Yes. You are *still* using a matrix in a data frame.  Please do read more
> # carefully.
> 
> I have read some more R documentation, trying to understand the difference
> between matrices and data frames etc., and I repeat my example, this time
> executing EXACTLY the same code. The only difference is that in one case
> I use smaller data sets ({train,test}-small.csv) and in the second case I
> use larger data sets ({train,test}.csv), and I get different behaviour.
> 
> Small case (10*4) runs fine, larger case (164*119) fails.
> 
> Any ideas why this happens?
> 
> R
> 
> > rm(list=ls())
> > train.data <- read.csv("train-small.csv", header = TRUE, row.names = "mol", comment.char="")
> > test.data <- read.csv("test-small.csv", header = TRUE, row.names = "mol", comment.char="")
> > yr <- train.data[,1]; xr <- train.data[,-1]
> > xr <- scale(xr)
> > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > xr <- xr[,!mask] # rm NA's
> > ys <- test.data[,1]; xs <- test.data[,-1]
> > xs <- scale(xs, center = x.center, scale = x.scale)
> > xs <- xs[,!mask]
> > train <- data.frame(y = yr, x = xr)
> > test <- data.frame(y = ys, x = xs)
> > model <- lm(y~., train)
> > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> dim(train) = 10 4 ; dim(test) = 10 4 
> > length(predict(model, test))
> [1] 10
> > train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> > test.data <- read.csv("test.csv", header = TRUE, row.names = "mol", comment.char="")
> [snip...]
> > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> dim(train) = 164 119 ; dim(test) = 35 119 
> > length(predict(model, test))
> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : 
>         subscript out of bounds
> >
> 
> 
> 
> -----Original Message-----
> From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> Sent: Friday, June 21, 2002 3:41 PM
> To: Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
> 
> 
> On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:
> 
> > --- first problem
> >
> > If I store 'simulated' data in data frames:
> > # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> > # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> > it still works the same way i.e. the code below works fine
> > for simulated data and fails for 'real' data the only
> > difference being in actual numeric values stored in data
> > structures of the same shape and type.
> >
> > Any suggestions why this happens ?
> 
> Yes. You are *still* using a matrix in a data frame.  Please do read more
> carefully.
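
The matrix point can be illustrated on simulated data (a sketch only; the object names are hypothetical stand-ins for the poster's descriptors). scale() applied to a data frame returns a matrix, and data.frame(y = ..., x = <matrix>) spreads the matrix columns out under prefixed names, so no column is actually called x.

```r
set.seed(2)
# Hypothetical stand-in for the descriptor columns
raw <- data.frame(p = rnorm(6), q = rnorm(6), r = rnorm(6))
yr  <- rnorm(6)

xr <- scale(raw)                     # scale() on a data frame returns a matrix
is.matrix(xr)

train <- data.frame(y = yr, x = xr)  # matrix columns become x.p, x.q, x.r
names(train)
```

Checking is.matrix() and names() at each step is a quick way to see which kind of object is being passed on.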
> 
> > --- second problem
> >
> > > As Andy Liaw pointed out, xr is a matrix.  Take a look at the names of
> > > train.  Hint: they do not contain `x'.
> >
> > Following your hint I am guessing that the fact that names do not contain
> > 'x' explains why lm(y~., train) form works and lm(y~x, train) fails
> > and "lm(y~., train)" means roughly: correlate column "y" to all other
> > columns
> 
> No, it means regress y on all the remaining columns in the data argument.
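
As a small sketch of that statement (toy data, hypothetical names): with a data argument supplied, the dot in the formula expands to every column other than the response, so the two fits below are the same model.

```r
# Toy data frame, made up for illustration
d <- data.frame(y = c(1.2, 2.3, 2.9, 4.1, 5.2),
                a = c(1, 2, 3, 4, 5),
                b = c(2, 1, 4, 3, 6))

fit_dot  <- lm(y ~ ., data = d)      # dot: all remaining columns of d
fit_full <- lm(y ~ a + b, data = d)  # the same terms written out

# Terms that the dot expanded to
dot_terms <- attr(terms(fit_dot), "term.labels")
dot_terms

# Both fits give the same coefficients
same_fit <- isTRUE(all.equal(coef(fit_dot), coef(fit_full)))
same_fit
```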
> 
> >
> > Where can I find a more detailed specification of this syntax?
> > In help(lm) I find this paragraph:
> >
> >      Models for `lm' are specified symbolically.  A typical model has
> >      the form `response ~ terms' where `response' is the (numeric)...
> >
> > which does not quite cover this case.
> 
> In any good book on the subject.
> 
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> 
