[R] problem with predict()

Fri Jun 28 19:42:21 CEST 2002

I will try.

I tried to use lm() to have basic linear "reference".

If this is the reason (i.e. train data set is rank-deficient) and lm
cannot handle such situation, what are the other packages in R
you could recommend which would handle this type of data ?
Also in such case: should not lm() report a problem in a model building
phase ?

So far I have used SVM approach to do regression with this data set and I am
getting
rather poor r^2 (~0.25 on test set), but I do not have any numerical
problems with SVM.
I am also planning to try randomForest() to do classification. This was my
immediate
motivation to turn to R.

All the best,

R

Ryszard Czerminski   phone: (781)994-0479
ArQule, Inc.         email:ryszard at arqule.com
19 Presidential Way  http://www.arqule.com
Woburn, MA 01801     fax: (781)994-0679

-----Original Message-----
From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
Sent: Friday, June 28, 2002 12:39 PM
To: Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

Have you tried the R debugging tools?  If not, please make use of them.
My guess is that you have a rank-deficient problem.

?debugger
?recover
?dump.frames
...

On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:

> This time I use the same file for train.data and test.data
> throwing in "names(test) <- names(train)" before predict() for double
> protection (:-)
> and it still fails...
>
> Is it some weird problem with this particular data set ? or a bug ?
> (why "subscript out of bounds" ?)

That's what the debugging tools are for.

>
> > rm(list=ls())
> > train.data <- read.csv("train.csv", header = TRUE, row.names = "mol",
> comment.char="")
> > test.data <- read.csv("train.csv", header = TRUE, row.names = "mol",
> comment.char="")
> > yr <- train.data[,1]; xr <- train.data[,-1]
> > xr <- scale(xr)     # matrix <- scale(data.frame)
> > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,
"scaled:scale")
> > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > xr <- xr[,!mask] # rm NA's
> > ys <- test.data[,1]; xs <- test.data[,-1]
> > xs <- scale(xs, center = x.center, scale = x.scale)
> > xs <- xs[,!mask]
> > train <- data.frame(y = yr, x = xr)
> > test <- data.frame(y = ys, x = xs)
> > model <- lm(y~., train)
> > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> dim(train) = 164 119 ; dim(test) = 164 119
> > names(test) <- names(train)
> > length(predict(model, test))
> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
>         subscript out of bounds
> >
>
> Ryszard Czerminski   phone: (781)994-0479
> ArQule, Inc.         email:ryszard at arqule.com
> 19 Presidential Way  http://www.arqule.com
> Woburn, MA 01801     fax: (781)994-0679
>
>
> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com]
> Sent: Friday, June 28, 2002 8:46 AM
> To: 'Czerminski, Ryszard'
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
>
>
> You can try:
>
>   names(test) <- names(train)
>
> before calling predict() to make sure that the variable names match.
> Without your data files, it's hard to tell why your first example worked.
>
> Andy
>
> > -----Original Message-----
> > From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> > Sent: Thursday, June 27, 2002 3:29 PM
> > To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> > Cc: r-help at stat.math.ethz.ch
> > Subject: RE: [R] problem with predict()
> >
> >
> >
> > # Yes. You are *still* using a matrix in a data frame.
> > Please do read more
> > # carefully.
> >
> > I have read some more R documentation trying to understand difference
> > between
> > matrices and data frames etc... and I repeat my example this time
> > executing EXACTLY the same code with only difference being
> > that in one case
> > I use smaller data sets ({train,test}-small.csv) and in the
> > second case I
> > use larger
> > data sets ({train,test}.csv) - and I got different behaviour.
> >
> > Small case (10*4) runs fine, larger case (164*119) fails.
> >
> > Any ideas why this happens ?
> >
> > R
> >
> > > rm(list=ls())
> > > train.data <- read.csv("train-small.csv", header = TRUE, row.names =
> > "mol", comment.char="")
> > > test.data <- read.csv("test-small.csv", header = TRUE,
> > row.names = "mol",
> > comment.char="")
> > > yr <- train.data[,1]; xr <- train.data[,-1]
> > > xr <- scale(xr)
> > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,
> > "scaled:scale")
> > > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > > xr <- xr[,!mask] # rm NA's
> > > ys <- test.data[,1]; xs <- test.data[,-1]
> > > xs <- scale(xs, center = x.center, scale = x.scale)
> > > xs <- xs[,!mask]
> > > train <- data.frame(y = yr, x = xr)
> > > test <- data.frame(y = ys, x = xs)
> > > model <- lm(y~., train)
> > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> > dim(train) = 10 4 ; dim(test) = 10 4
> > > length(predict(model, test))
> > [1] 10
> > > train.data <- read.csv("train.csv", header = TRUE,
> > row.names = "mol",
> > comment.char="")
> > > test.data <- read.csv("test.csv", header = TRUE, row.names = "mol",
> > comment.char="")
> > [snip...]
> > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> > dim(train) = 164 119 ; dim(test) = 35 119
> > > length(predict(model, test))
> > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> >         subscript out of bounds
> > >
> >
> > Ryszard Czerminski   phone: (781)994-0479
> > ArQule, Inc.         email:ryszard at arqule.com
> > 19 Presidential Way  http://www.arqule.com
> > Woburn, MA 01801     fax: (781)994-0679
> >
> >
> > -----Original Message-----
> > From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> > Sent: Friday, June 21, 2002 3:41 PM
> > To: Czerminski, Ryszard
> > Cc: r-help at stat.math.ethz.ch
> > Subject: RE: [R] problem with predict()
> >
> >
> > On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:
> >
> > > --- first problem
> > >
> > > If I store 'simulated' data in data frames:
> > > # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> > > # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> > > it still works the same way i.e. the code below works fine
> > > for simulated data and fails for 'real' data the only
> > > difference being in actual numeric values stored in data
> > > structures of the same shape and type.
> > >
> > > Any suggestions why this happens ?
> >
> > Yes. You are *still* using a matrix in a data frame.  Please
> > do read more
> > carefully.
> >
> > > --- second problem
> > >
> > > > As Andy Liaw pointed out, xr is a matrix.  Take a look at
> > the names of
> > > > train.  Hint: they do not contain `x'.
> > >
> > > Following your hint I am guessing that the fact that names
> > do not contain
> > > 'x'
> > > explains why lm(y~., train) form works and lm(y~x, train) fails
> > > and "lm(y~., train)" means roughly: correlate column "y" to
> > all other
> > colums
> >
> > No, it means regress y on all the remaining colums in the
> > data argument.
> >
> > >
> > > Where I can find more detail specification of this syntax ?
> > > In help(lm) I find this paragraph:
> > >
> > >      Models for `lm' are specified symbolically.  A typical
> > model has
> > >      the form `response ~ terms' where `response' is the
> > (numeric)...
> > >
> > > which does not quite cover this case.
> >
> > In any good book on the subject.
> >
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > -.-.-.-.-.-.-.
> > -.-
> > r-help mailing list -- Read
> > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > Send "info", "help", or "[un]subscribe"
> > (in the "body", not the subject !)  To:
> > r-help-request at stat.math.ethz.ch
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > _._._._._._._.
> > _._
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > -.-.-.-.-.-.-.-.-
> > r-help mailing list -- Read
> > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > Send "info", "help", or "[un]subscribe"
> > (in the "body", not the subject !)  To:
> > r-help-request at stat.math.ethz.ch
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > _._._._._._._._._
> >
>
>
----------------------------------------------------------------------------
> --
> Notice: This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA)
that
> may be confidential, proprietary copyrighted and/or legally privileged,
and
> is intended solely for the use of the individual or entity named on this
> message.  If you are not the intended recipient, and have received this
> message in error, please immediately return this by e-mail and then delete
> it.
>
>
============================================================================
> ==
>
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
-.-
> r-help mailing list -- Read
http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
>
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
_._
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._