[Rd] stringsAsFactors

Duncan Murdoch murdoch.duncan at gmail.com
Wed Feb 13 14:39:23 CET 2013


On 13-02-13 8:30 AM, Milan Bouchet-Valat wrote:
> On Wednesday 13 February 2013 at 12:33 +0000, Michael Dewey wrote:
>> At 18:01 11/02/2013, Ista Zahn wrote:
>>> FWIW my view is that for data cleaning and organizing factors just get
>>> in the way. For modeling I like them because they make it easier to
>>> understand what is happening. For example I can look at the levels()
>>> to see what the reference group will be. With characters one has to
>>> know a) that levels are created in alphabetical order and b) the
>>> alphabetical order of the unique values in the character vector.
>>> Ugh. So my habit is to turn off stringsAsFactors, then explicitly
>>> convert to factors before modeling (I also use factors to change the
>>> order in which things are displayed in tables and graphs, another
>>> place where converting to factors myself is useful but creating
>>> them in alphabetical order by default is not)
>>>
>>> All this is to say that I would like options(stringsAsFactors=FALSE) to
>>> be the default, but I like the warning about converting to factors in
>>> modeling functions because it reminds me that I forgot to convert them,
>>> which I like to do anyway...
>>
>> I seem to be one of the few people who find the current default
>> helpful. When I read in a dataset I am nearly always going to follow
>> it with one or more of the modelling functions and so I do want to
>> treat the categorical variables as factors. I cannot off-hand think
>> of an example where I have had to convert them to characters.
> If the changes to modeling functions that are discussed in this thread
> can finally be applied (i.e. a solution is found), characters would be
> converted to factors automatically, so you would not notice the
> difference. And if you need to change the order of levels of your
> factors, calling factor(myVar, levels=c(...)) works the same whether myVar
> is a character or a factor.
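
A minimal sketch of that point, using made-up data (the names my_var_chr,
my_var_fct and wanted are hypothetical, not from the thread): re-leveling
with factor() should give the same result whether the input is a character
vector or an existing factor.

```r
# Hypothetical data: the same values stored as character and as factor.
my_var_chr <- c("low", "high", "medium", "low")
my_var_fct <- factor(my_var_chr)   # default levels are sorted: high, low, medium

# The level order we actually want for tables, plots and reference groups.
wanted <- c("low", "medium", "high")

f_from_chr <- factor(my_var_chr, levels = wanted)
f_from_fct <- factor(my_var_fct, levels = wanted)

# Compare the two results and inspect the reference (first) level.
identical(f_from_chr, f_from_fct)
levels(f_from_chr)[1]
```

The first level is the one treatment contrasts use as the reference group,
so setting levels explicitly removes the dependence on alphabetical order.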

I think most of the changes *have* been applied.  Please try R-devel, 
and point out problems.

The only change that I would like to apply but haven't (and probably 
won't) is to change the default for stringsAsFactors to FALSE.  Users 
who think that is a bad idea can bolster their cases by setting 
options(stringsAsFactors=FALSE), and posting descriptions of all the 
horrors that ensue.
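
For anyone who wants to see the difference without touching the global
option, here is a small sketch that passes stringsAsFactors per call. The
street-address data are invented for illustration, and the per-call
argument is used here only as an assumption-free way to show both
behaviours side by side.

```r
# Hypothetical data: a column that should stay character (street addresses).
streets <- c("12 Elm St", "4 Oak Ave", "9 Pine Rd")

# Same input, two explicit choices for the conversion.
d_chr <- data.frame(id = 1:3, street = streets, stringsAsFactors = FALSE)
d_fct <- data.frame(id = 1:3, street = streets, stringsAsFactors = TRUE)

is.character(d_chr$street)   # column kept as character
is.factor(d_fct$street)      # column converted to a factor

# String operations work directly on the character column ...
nchar(d_chr$street)
# ... but need an explicit as.character() on the factor column.
nchar(as.character(d_fct$street))
```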

Duncan Murdoch

>
>> Incidentally xkcd has, while this discussion has been going on,
>> posted something relevant
>> http://www.xkcd.com/1172/
> Truly hilarious, indeed. But beware, it sounds like an argument in favor
> of the change, while you are lobbying against it. :-p
>
>
> Regards
>
>
>
>>
>>
>>> Best,
>>> Ista
>>>
>>> On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
>>> <murdoch.duncan at gmail.com> wrote:
>>>> On 11/02/2013 12:13 PM, William Dunlap wrote:
>>>>>
>>>>> Note that changing this does not just mean getting rid of "silly
>>>>> warnings".
>>>>> Currently, predict.lm() can give wrong answers when stringsAsFactors is
>>>>> FALSE.
>>>>>
>>>>>     > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
>>>>>     +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
>>>>>     > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>>>>>     Warning message:
>>>>>     In model.matrix.default(mt, mf, contrasts) :
>>>>>       variable 'f' converted to a factor
>>>>>     > predict(fit_ab, newdata=d)
>>>>>      1  2  3  4  5  6  7  8  9 10
>>>>>      1  2  3  4 25 26 27  8  9 10
>>>>>     Warning messages:
>>>>>     1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>>>>>       variable 'f' converted to a factor
>>>>>     2: In predict.lm(fit_ab, newdata = d) :
>>>>>       prediction from a rank-deficient fit may be misleading
>>>>>
>>>>> fit_ab is not rank-deficient and the predict should report
>>>>>      1 2 3 4 NA NA NA 28 29 30
>>>>
>>>>
>>>> In R-devel, the two warnings about factor conversions are no longer given,
>>>> but the predictions are the same and the warning about rank deficiency still
>>>> shows up.  If f is set to be a factor, an error is generated:
>>>>
>>>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
>>>> object$xlevels) :
>>>>    factor f has new levels B
>>>>
>>>> I think both the warning and error are somewhat reasonable responses.  The
>>>> fit is rank deficient relative to the model that includes f == "B", because
>>>> the column of the design matrix corresponding to f level B would be
>>>> completely zero.  In this particular model, we could still do predictions
>>>> for the other levels, but it also seems reasonable to quit, given that
>>>> clearly something has gone wrong.
>>>>
>>>> I do think that it's unfortunate that we don't get the same result in both
>>>> cases, and I'd like to have gotten the predictions you suggested, but I
>>>> don't think that's going to happen.  The reason for the difference is that
>>>> the subsetting is done before the conversion to a factor, but I think that
>>>> is unavoidable without really big changes.
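
A minimal sketch of one way to get NA-padded predictions of the kind Bill
suggested, assuming f is stored as a factor. It predicts only for rows
whose level was seen when fitting and leaves the rest NA. The names known
and pred are hypothetical, and this is a workaround in user code, not what
predict.lm() itself does.

```r
# Data from the example in this thread, with f stored explicitly as a factor.
d <- data.frame(x = 1:10,
                f = factor(rep(c("A", "B", "C"), c(4, 3, 3))),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1))

# Fit without the "B" rows; lm() drops the unused level from the fit.
fit_ab <- lm(y ~ x + f, data = d, subset = f != "B")

# Rows whose level of f was present in the fitted model.
known <- d$f %in% fit_ab$xlevels$f

# Start with NA everywhere, then fill in predictions for the known rows only.
pred <- rep(NA_real_, nrow(d))
pred[known] <- predict(fit_ab, newdata = d[known, ])
pred
```
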
>>>>
>>>> Duncan Murdoch
>>>>
>>>>
>>>>
>>>>>
>>>>> Bill Dunlap
>>>>> Spotfire, TIBCO Software
>>>>> wdunlap tibco.com
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: r-devel-bounces at r-project.org
>>>>>> [mailto:r-devel-bounces at r-project.org] On Behalf
>>>>>> Of Terry Therneau
>>>>>> Sent: Monday, February 11, 2013 5:50 AM
>>>>>> To: r-devel at r-project.org; Duncan Murdoch
>>>>>> Subject: Re: [Rd] stringsAsFactors
>>>>>>
>>>>>> I think your idea to remove the warnings is excellent, and a good
>>>>>> compromise.  Characters already work fine in modeling functions
>>>>>> except for the silly warning.
>>>>>>
>>>>>> It is interesting how often the defaults for a program reflect the
>>>>>> data sets in use at the time the defaults were chosen.  There are
>>>>>> some such in my own survival package whose proper value is no longer
>>>>>> as "obvious" as it was when I chose them.  Factors are very handy for
>>>>>> variables which have only a few levels and will be used in modeling.
>>>>>> Every character variable of every dataset in "Statistical Models in
>>>>>> S", which introduced factors, is of this type so auto-transformation
>>>>>> made a lot of sense.  The "solder" data set there is one for which
>>>>>> Helmert contrasts are proper so guess what the default contrast
>>>>>> option was?  (I think there are only a few data sets in the world for
>>>>>> which Helmert makes sense, however, and R eventually changed the
>>>>>> default.)
>>>>>>
>>>>>> For character variables that should not be factors, such as a street
>>>>>> address, stringsAsFactors can be a real PITA, and I expect that
>>>>>> people's preference for the option depends almost entirely on how
>>>>>> often these arise in their own work.  As long as there is an option
>>>>>> that can be overridden I'm okay.  Yes, I'd prefer FALSE as the
>>>>>> default, partly because the current value is a tripwire in the
>>>>>> hallway that eventually catches every new user.
>>>>>>
>>>>>> Terry Therneau
>>>>>>
>>>>>> On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
>>>>>>> Both of these were discussed by R Core.  I think it's unlikely the
>>>>>>> default for stringsAsFactors will be changed (some R Core members like
>>>>>>> the current behaviour), but it's fairly likely the show.signif.stars
>>>>>>> default will change.  (That's if someone gets around to it:  I
>>>>>>> personally don't care about that one.  P-values are commonly used
>>>>>>> statistics, and the stars are just a simple graphical display of them.
>>>>>>> I find some p-values to be useful, and the display to be harmless.)
>>>>>>>
>>>>>>> I think it's really unlikely the more extreme changes (i.e. dropping
>>>>>>> show.signif.stars completely, or dropping p-values) will happen.
>>>>>>>
>>>>>>> Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
>>>>>>> I'll let the people who like it defend it.  What I will likely do is
>>>>>>> make a few changes so that character vectors are automatically changed
>>>>>>> to factors in modelling functions, so that operating with
>>>>>>> stringsAsFactors=FALSE doesn't trigger silly warnings.
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>>
>>
>> Michael Dewey
>> info at aghmed.fsnet.co.uk
>> http://www.aghmed.fsnet.co.uk/home.html
>>
>


