[Rd] stringsAsFactors

Wed Feb 13 13:54:03 CET 2013

On Wed, Feb 13, 2013 at 7:33 AM, Michael Dewey <info at aghmed.fsnet.co.uk> wrote:
> At 18:01 11/02/2013, Ista Zahn wrote:
>>
>> FWIW my view is that for data cleaning and organizing, factors just get
>> in the way. For modeling I like them because they make it easier to
>> understand what is happening. For example I can look at the levels()
>> to see what the reference group will be. With characters one has to
>> know a) that levels are created in alphabetical order and b) the
>> alphabetical order of the unique values in the character vector.
>> Ugh. So my habit is to turn off stringsAsFactors, then explicitly
>> convert to factors before modeling (I also use factors to change the
>> order in which things are displayed in tables and graphs, another
>> place where converting to factors myself is useful but creating
>> them in alphabetical order by default is not).
>>
>> All this is to say that I would like options(stringsAsFactors=FALSE) to
>> be the default, but I like the warning about converting to factors in
>> modeling functions because it reminds me that I forgot to convert them,
>> which I like to do anyway...
>
>
> I seem to be one of the few people who find the current default helpful.
> When I read in a dataset I am nearly always going to follow it with one or
> more of the modelling functions and so I do want to treat the categorical
> variables as factors. I cannot off-hand think of an example where I have had
> to convert them to characters.

Your data must reach you in a much better state than mine reaches me.
I spend most of my time organizing, combining, fixing typos,
reshaping, merging and so on. Then I see the dreaded warning

In `[<-.factor`(`*tmp*`, 6, value = "z") :
  invalid factor level, NAs generated

which reminds me that I've forgotten to set stringsAsFactors=FALSE.
However, I'm not saying I don't like factors. Once the data is cleaned
up, they are very useful. But often I find that when I'm trying to
clean up a messy data set they just get in the way. And since that is
what I spend most of my time doing, factors get in the way most of the
time for me.
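
A minimal sketch of what I mean, using a made-up vector (the names raw,
x, y and g are only for illustration, not from any real data set):

```r
# Hypothetical messy column; the values are invented for illustration.
raw <- c("a", "b", "a", "c", "b", "a")

# As a factor: assigning a value that is not an existing level
# triggers the "invalid factor level" warning and stores NA instead.
x <- factor(raw)
x[6] <- "z"
is.na(x[6])

# As a character vector: the same fix-up just works.
y <- raw
y[6] <- "z"
y[6]

# Once the cleanup is done, convert explicitly and pick the level order
# (and hence the reference group) instead of relying on alphabetical order.
g <- factor(y, levels = c("b", "a", "c", "z"))
levels(g)
```

Keeping the column as character until the cleanup is finished, and only
then calling factor() with explicit levels, is the habit I described
earlier; it is one way of working, not the only one.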

>
> Incidentally xkcd has, while this discussion has been going on, posted
> something relevant
> http://www.xkcd.com/1172/
>
>
>
>
>> Best,
>> Ista
>>
>> On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>> > On 11/02/2013 12:13 PM, William Dunlap wrote:
>> >>
>> >> Note that changing this does not just mean getting rid of "silly
>> >> warnings".  Currently, predict.lm() can give wrong answers when
>> >> stringsAsFactors is FALSE.
>> >>
>> >>    > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)), y=c(1:4, 15:17, 28.1,28.8,30.1))
>> >>    > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>> >>    Warning message:
>> >>    In model.matrix.default(mt, mf, contrasts) :
>> >>      variable 'f' converted to a factor
>> >>    > predict(fit_ab, newdata=d)
>> >>     1  2  3  4  5  6  7  8  9 10
>> >>     1  2  3  4 25 26 27  8  9 10
>> >>    Warning messages:
>> >>    1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>> >>      variable 'f' converted to a factor
>> >>    2: In predict.lm(fit_ab, newdata = d) :
>> >>      prediction from a rank-deficient fit may be misleading
>> >>
>> >> fit_ab is not rank-deficient, and predict() should report
>> >>     1 2 3 4 NA NA NA 28 29 30
>> >
>> >
>> > In R-devel, the two warnings about factor conversions are no longer
>> > given, but the predictions are the same and the warning about rank
>> > deficiency still shows up.  If f is set to be a factor, an error is
>> > generated:
>> >
>> > Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
>> >   factor f has new levels B
>> >
>> > I think both the warning and error are somewhat reasonable responses.
>> > The fit is rank deficient relative to the model that includes f == "B",
>> > because the column of the design matrix corresponding to f level B would
>> > be completely zero.  In this particular model, we could still do
>> > predictions for the other levels, but it also seems reasonable to quit,
>> > given that clearly something has gone wrong.
>> >
>> > I do think that it's unfortunate that we don't get the same result in
>> > both cases, and I'd like to have gotten the predictions you suggested,
>> > but I don't think that's going to happen.  The reason for the difference
>> > is that the subsetting is done before the conversion to a factor, but I
>> > think that is unavoidable without really big changes.
>> >
>> > Duncan Murdoch
>> >
>> >
>> >
>> >>
>> >> Bill Dunlap
>> >> Spotfire, TIBCO Software
>> >> wdunlap tibco.com
>> >>
>> >> > -----Original Message-----
>> >> > From: r-devel-bounces at r-project.org
>> >> > [mailto:r-devel-bounces at r-project.org] On Behalf
>> >> > Of Terry Therneau
>> >> > Sent: Monday, February 11, 2013 5:50 AM
>> >> > To: r-devel at r-project.org; Duncan Murdoch
>> >> > Subject: Re: [Rd] stringsAsFactors
>> >> >
>> >> > I think your idea to remove the warnings is excellent, and a good
>> >> > compromise.  Characters already work fine in modeling functions except
>> >> > for the silly warning.
>> >> >
>> >> > It is interesting how often the defaults for a program reflect the data
>> >> > sets in use at the time the defaults were chosen.  There are some such in
>> >> > my own survival package whose proper value is no longer as "obvious" as it
>> >> > was when I chose them.  Factors are very handy for variables which have
>> >> > only a few levels and will be used in modeling.  Every character variable
>> >> > of every dataset in "Statistical Models in S", which introduced factors, is
>> >> > of this type so auto-transformation made a lot of sense.  The "solder" data
>> >> > set there is one for which Helmert contrasts are proper so guess what the
>> >> > default contrast option was?  (I think there are only a few data sets in
>> >> > the world for which Helmert makes sense, however, and R eventually changed
>> >> > the default.)
>> >> >
>> >> > For character variables that should not be factors, such as a street
>> >> > address, stringsAsFactors can be a real PITA, and I expect that people's
>> >> > preference for the option depends almost entirely on how often these arise
>> >> > in their own work.  As long as there is an option that can be overridden
>> >> > I'm okay.  Yes, I'd prefer FALSE as the default, partly because the current
>> >> > value is a tripwire in the hallway that eventually catches every new user.
>> >> >
>> >> > Terry Therneau
>> >> >
>> >> > On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
>> >> > > Both of these were discussed by R Core.  I think it's unlikely the
>> >> > > default for stringsAsFactors will be changed (some R Core members like
>> >> > > the current behaviour), but it's fairly likely the show.signif.stars
>> >> > > default will change.  (That's if someone gets around to it:  I
>> >> > > personally don't care about that one.  P-values are commonly used
>> >> > > statistics, and the stars are just a simple graphical display of them.
>> >> > > I find some p-values to be useful, and the display to be harmless.)
>> >> > >
>> >> > > I think it's really unlikely the more extreme changes (i.e. dropping
>> >> > > show.signif.stars completely, or dropping p-values) will happen.
>> >> > >
>> >> > > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
>> >> > > I'll let the people who like it defend it.  What I will likely do is
>> >> > > make a few changes so that character vectors are automatically changed
>> >> > > to factors in modelling functions, so that operating with
>> >> > > stringsAsFactors=FALSE doesn't trigger silly warnings.
>> >> >
>> >> > ______________________________________________
>> >> > R-devel at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >
>> >
>
>
> Michael Dewey
> info at aghmed.fsnet.co.uk
> http://www.aghmed.fsnet.co.uk/home.html
>