[R] can predict ignore rows with insufficient info

Tue Sep 16 22:17:59 CEST 2003

Peter  -

Your subsequent email seems just right.  You have to determine
ahead of time which rows can be estimated.  Here's a strategy,
and possibly some code to implement it.

Let  supported(i,y,d)  be a user-written function which returns
a logical vector that is TRUE for rows which should be omitted
from the prediction, because column i of data frame d holds a
factor level that never occurs in a row where the outcome variable
y is non-missing.  Apply this function to all columns in your data
frame using  lapply().  Then do the "or" of all the logical vectors
by calculating the row sums of the numeric (0 or 1) equivalents.
Last, convert back to logical, negate, and subscript your data
frame with this in the call to  predict().

Here's some rough code:

supported <- function(i,y,d)  {
   result <- rep(FALSE, dim(d)[1])  # default return value when
   if (is.factor(d[[i]]))           #  d[[i]] is not a factor.
     # TRUE marks a row to omit: its level never occurs with a non-missing y.
     result <- !(d[[i]] %in% unique(d[[i]][ !is.na(d[[y]]) ]))
   result  }

tmp.1 <- lapply(seq(along=const), supported, "days", const)
tmp.2 <- matrix(unlist(tmp.1[ names(const) != "days" ]), nrow=dim(const)[1])
tmp.3 <- as.logical(as.vector(tmp.2 %*% rep(1, dim(tmp.2)[2])))

x <- predict(g, const[ is.na(const$days) & !tmp.3, ])

This code uses a few arcane maneuvers.  Look at help pages for
the relevant functions to dope out what it is doing.  Particularly
for  lapply(), seq(), rep(), unlist(), unique(), "%*%", "%in%".
(The last two must be quoted in order to see the help, for
example  help("%in%")  or  ?"%*%" .)
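
To make the two least obvious operators concrete, here is a small sketch on made-up vectors.  The names  flags, m, seen  and  lev  are invented for this illustration and have nothing to do with your data.  It shows that multiplying a logical matrix by a vector of ones gives the row sums, so converting back to logical is a row-wise "or", and that  %in%  tests membership element by element.

```r
# Hypothetical toy inputs, invented for illustration only.
flags <- list(c(TRUE, FALSE, FALSE),    # flags from one column
              c(FALSE, FALSE, TRUE))    # flags from another column

# One matrix column per logical vector, one matrix row per data row.
m <- matrix(unlist(flags), nrow = 3)

# Row sums via matrix multiplication, then back to logical:
# TRUE wherever at least one column flagged the row.
any.flag <- as.logical(as.vector(m %*% rep(1, ncol(m))))

# The same row-wise "or", written with rowSums().
any.flag.2 <- rowSums(m) > 0

# %in% asks, for each element on the left, whether it occurs on the right.
seen    <- c("BOSTON", "DENVER")
lev     <- c("BOSTON", "ALBANY", "DENVER")
covered <- lev %in% seen
```

Here  any.flag  and  any.flag.2  agree, and  covered  is FALSE only for ALBANY.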

However, the code might work for you right out of the box!
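
If you want to try the idea end to end before touching your real data, here is a minimal sketch on a toy data frame.  Everything in it is an assumption made for illustration: the seven rows of data are invented, and  uncovered()  is a hypothetical helper (a compact variant of the function above that takes column names), not anything built into R.  It predicts from the city model where the city is covered, then falls back to the state model for the remaining rows, which is what you described wanting.

```r
# Toy data invented for illustration; only the column names follow the thread.
const <- data.frame(
  days  = c(30, 45, 50, 60, NA, NA, NA),
  city  = factor(c("BOSTON", "DENVER", "BOSTON", "BUFFALO",
                   "DENVER", "ALBANY", "BOSTON")),
  state = factor(c("MA", "CO", "MA", "NY", "CO", "NY", "MA"))
)

# Hypothetical helper: TRUE for rows where any factor column named in 'cols'
# holds a level that never occurs in a row with non-missing d[[y]].
uncovered <- function(cols, y, d) {
  flags <- sapply(cols, function(i) {
    if (!is.factor(d[[i]])) return(rep(FALSE, nrow(d)))
    !(d[[i]] %in% unique(d[[i]][!is.na(d[[y]])]))
  })
  rowSums(matrix(flags, nrow = nrow(d))) > 0   # row-wise "or"
}

todo <- is.na(const$days)           # rows that still need a prediction
x    <- rep(NA_real_, nrow(const))  # stays NA for rows we do not predict

# First choice: the city model, only for rows whose city it has seen.
g.city    <- lm(days ~ city, data = const)
skip.city <- uncovered("city", "days", const)
use.city  <- todo & !skip.city
x[use.city] <- predict(g.city, const[use.city, ])

# Fallback: the state model for rows the city model could not handle.
g.state    <- lm(days ~ state, data = const)
skip.state <- uncovered("state", "days", const)
fill       <- todo & is.na(x) & !skip.state
x[fill] <- predict(g.state, const[fill, ])
```

In this toy case the ALBANY row is skipped by the city model and filled from the NY state estimate.  With more predictors in the formula, pass all of the factor column names used by the model to  uncovered() .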

-  tom blackwell  -  u michigan medical school  -  ann arbor  -

On Tue, 16 Sep 2003, Peter Whiting wrote:

> I need predict to ignore rows that contain levels not in the
> model.
>
> Consider a data frame, "const", that has columns for the number of
> days required to construct a site and the city and state the site
> was constructed in.
>
> g<-lm(days~city,data=const)
>
> Some of the sites in const have not yet been completed, and therefore
> they have days==NA. I want to predict how many days these sites
> will take to complete (I've simplified the above discussion to
> remove many of the other factors involved.)
>
> nconst<-subset(const,is.na(const$days))
> x<-predict(g,nconst)
> Error in model.frame.default(object, data, xlev = xlev) :
>         factor city has new level(s) ALBANY
>
> This is because we haven't yet completed a site in Albany.
> If I just had one to worry about I could easily fix it (choose
> a nearby market with similar characteristics) but I am dealing
> with several hundred cities. Instead, for the cities not
> modeled by g I'd simply like to use the state, even though I
> don't expect it to be as good:
>
> g<-lm(days~state,data=const)
> x<-predict(g,nconst)
>
> I'm not sure how to identify the cities in nconst that are not
> modeled by g (my actual model has many more predictors in the
> formula). Is there a way to instruct predict to only predict the
> rows for which it has enough information and not complain about
> the others?
>
> g<-lm(days~city,data=const)
> x<-predict(g,nconst) ## the rows of x with city=ALBANY will be NA
> g<-lm(days~state,data=const)
> y<-predict(g,nconst)
> x[is.na(x)]<-y[is.na(x)]
>
> thanks,
> pete
>