[Rd] stringsAsFactors

Duncan Murdoch murdoch.duncan at gmail.com
Mon Feb 11 21:17:37 CET 2013


On 11/02/2013 2:34 PM, Terry Therneau wrote:
> The root of this problem is that the .getXlevels function does not return the levels for
> character variables.
Thanks, that looks easy to fix (not by changing .getXlevels, but by
having model.frame convert the character variables to factors, instead
of waiting for model.matrix).
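
A minimal sketch of the same idea done in user code, not the actual change to model.frame: if character columns are converted to factors before fitting, the levels seen by the fit are recorded in the model's xlevels, and new data can be re-coded against them. The helper chars_to_factors is hypothetical, the data frame is the one from Bill Dunlap's example quoted below, and returning NA for unseen levels is an assumption about the desired behaviour.

```r
# Hypothetical helper (not part of R): turn every character column into a factor.
chars_to_factors <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) df[is_chr] <- lapply(df[is_chr], factor)
  df
}

d <- data.frame(x = 1:10,
                f = rep(c("A", "B", "C"), c(4, 3, 3)),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                stringsAsFactors = FALSE)

d2 <- chars_to_factors(d)
fit <- lm(y ~ x + f, data = d2, subset = f != "B")

# lm() drops unused levels after subsetting, so only the fitted levels are stored.
lev <- fit$xlevels$f

# Assumption: rows whose level was never fitted should predict as NA.
# Re-coding f against the stored levels turns the unseen level "B" into NA.
nd <- d
nd$f <- factor(nd$f, levels = lev)
p <- predict(fit, newdata = nd)
```

Re-coding the new data against the stored levels avoids the "factor f has new levels" error, at the cost of silently masking unseen levels, so it is only appropriate when NA is an acceptable answer for those rows.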

Duncan
> Future predictions depend on that information.
>
> On 02/11/2013 11:50 AM, Duncan Murdoch wrote:
> > On 11/02/2013 12:13 PM, William Dunlap wrote:
> >> Note that changing this does not just mean getting rid of "silly warnings".
> >> Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.
> >>
> >> > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
> >> +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
> >> > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
> >>    Warning message:
> >>    In model.matrix.default(mt, mf, contrasts) :
> >>      variable 'f' converted to a factor
> >> > predict(fit_ab, newdata=d)
> >>     1  2  3  4  5  6  7  8  9 10
> >>     1  2  3  4 25 26 27  8  9 10
> >>    Warning messages:
> >>    1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
> >>      variable 'f' converted to a factor
> >>    2: In predict.lm(fit_ab, newdata = d) :
> >>      prediction from a rank-deficient fit may be misleading
> >>
> >> fit_ab is not rank-deficient and the predict should report
> >>     1 2 3 4 NA NA NA 28 29 30
> >
> > In R-devel, the two warnings about factor conversions are no longer given, but the
> > predictions are the same and the warning about rank deficiency still shows up.  If f is
> > set to be a factor, an error is generated:
> >
> > Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> > object$xlevels) :
> >   factor f has new levels B
> >
> > I think both the warning and error are somewhat reasonable responses.  The fit is rank
> > deficient relative to the model that includes f == "B",  because the column of the
> > design matrix corresponding to f level B would be completely zero.  In this particular
> > model, we could still do predictions for the other levels, but it also seems reasonable
> > to quit, given that clearly something has gone wrong.
> >
> > I do think that it's unfortunate that we don't get the same result in both cases, and
> > I'd like to have gotten the predictions you suggested, but I don't think that's going to
> > happen.  The reason for the difference is that the subsetting is done before the
> > conversion to a factor, but I think that is unavoidable without really big changes.
> >
> > Duncan Murdoch
> >
> >
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >>
> >> > -----Original Message-----
> >> > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
> >> > Of Terry Therneau
> >> > Sent: Monday, February 11, 2013 5:50 AM
> >> > To: r-devel at r-project.org; Duncan Murdoch
> >> > Subject: Re: [Rd] stringsAsFactors
> >> >
> >> > I think your idea to remove the warnings is excellent, and a good compromise.
> >> > Characters already work fine in modeling functions except for the silly warning.
> >> >
> >> > It is interesting how often the defaults for a program reflect the data sets in use
> >> > at the time the defaults were chosen.  There are some such in my own survival package
> >> > whose proper value is no longer as "obvious" as it was when I chose them.  Factors are
> >> > very handy for variables which have only a few levels and will be used in modeling.
> >> > Every character variable of every dataset in "Statistical Models in S", which
> >> > introduced factors, is of this type so auto-transformation made a lot of sense.  The
> >> > "solder" data set there is one for which Helmert contrasts are proper so guess what
> >> > the default contrast option was?  (I think there are only a few data sets in the
> >> > world for which Helmert makes sense, however, and R eventually changed the default.)
> >> >
> >> > For character variables that should not be factors, such as a street address,
> >> > stringsAsFactors can be a real PITA, and I expect that people's preference for the
> >> > option depends almost entirely on how often these arise in their own work.  As long as
> >> > there is an option that can be overridden I'm okay.  Yes, I'd prefer FALSE as the
> >> > default, partly because the current value is a tripwire in the hallway that
> >> > eventually catches every new user.
> >> >
> >> > Terry Therneau
> >> >
> >> > On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
> >> > > Both of these were discussed by R Core.  I think it's unlikely the
> >> > > default for stringsAsFactors will be changed (some R Core members like
> >> > > the current behaviour), but it's fairly likely the show.signif.stars
> >> > > default will change.  (That's if someone gets around to it:  I
> >> > > personally don't care about that one.  P-values are commonly used
> >> > > statistics, and the stars are just a simple graphical display of them.
> >> > > I find some p-values to be useful, and the display to be harmless.)
> >> > >
> >> > > I think it's really unlikely the more extreme changes (i.e. dropping
> >> > > show.signif.stars completely, or dropping p-values) will happen.
> >> > >
> >> > > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
> >> > > I'll let the people who like it defend it.  What I will likely do is
> >> > > make a few changes so that character vectors are automatically changed
> >> > > to factors in modelling functions, so that operating with
> >> > > stringsAsFactors=FALSE doesn't trigger silly warnings.
> >> >
> >> > ______________________________________________
> >> > R-devel at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >


