[R] Problem with factor state when subset()ing a data frame

Terry Therneau therneau at mayo.edu
Mon Feb 12 15:36:10 CET 2007


  The solution to most "factors" questions on the R mailing list is
to set the global option stringsAsFactors to FALSE.  Make it part of your
default R startup.  Even better, do what we have done at Mayo for the
last 10+ years and make it the default for your whole unit.  (150+ users,
20+ years of S experience).  We were one of the groups that whined to
Insightful until they added this feature, which unfortunately did not
become a part of R until fairly recently. 
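
  As a minimal sketch of the startup setting described above: the line
below would typically go in a startup file such as ~/.Rprofile.  The data
frame and its column names are made up for illustration.  On recent
versions of R (4.0.0 and later) FALSE is already the default, so the
option only matters on older installations.

```r
# Typically placed in a startup file (e.g. ~/.Rprofile) so that it
# applies to every session.
options(stringsAsFactors = FALSE)

# Hypothetical example data: character columns now stay character.
df <- data.frame(id = 1:3,
                 street = c("12 Elm St", "9 Oak Ave", "4 Main St"))
class(df$street)
```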

  For some character variables the factor logic makes sense, for others it
does not.  If you set the option above, then you can use an explicit
    mydata$variable  <- factor(mydata$variable)
for the variables that should be factors.   In my experience, with a wide
variety of data analysis, that is about 1/10 of my character variables.
Others may disagree about the fraction, but one of the really bad aspects 
of the default design is that it forces 100% conversion of characters
to another class, which is certainly not the best state.  (Street address,
for instance, never makes sense as a factor.)
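
  A small sketch of the explicit approach, assuming a made-up data frame
(the names mydata, street and treatment are illustrative only): read
everything in as character, then convert just the columns that really
are categorical.

```r
# Hypothetical data, created with characters left as characters.
mydata <- data.frame(
  street    = c("12 Elm St", "9 Oak Ave", "4 Main St", "7 Pine Rd"),
  treatment = c("placebo", "drug", "drug", "placebo"),
  stringsAsFactors = FALSE
)

# Convert only the variable that should be a factor.
mydata$treatment <- factor(mydata$treatment)

str(mydata)   # street remains chr, treatment is now a Factor
```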

  When factors are the right thing, they are very useful.  I would agree with
Peter Dalgaard's assessment of past discussion about automatically dropping
unused levels: there is no approach that always works best, and the current
default has been extensively talked over and appears to be the best current
default.  Factors most certainly should not disappear from the language, or have
major changes without a lot of discussion.  
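
  For readers who have not run into the unused-levels behavior, here is a
short illustration with made-up data (the names mydata, group and value
are assumptions for the example).  subset() keeps the full set of levels;
re-applying factor() is one way to drop the unused ones when that is what
you want.

```r
# Hypothetical data with a three-level factor.
mydata <- data.frame(
  group = factor(c("a", "b", "c", "a")),
  value = c(1.5, 2.0, 3.5, 4.0)
)

# After subsetting, the level "c" has no rows but is still a level.
sub <- subset(mydata, group != "c")
levels_before <- levels(sub$group)

# Re-applying factor() rebuilds the levels from the values present.
sub$group <- factor(sub$group)
levels_after <- levels(sub$group)

print(levels_before)
print(levels_after)
```

Whether to drop the unused levels depends on the analysis: keeping them
preserves empty cells in tables, dropping them avoids empty groups in
models and plots.  Later versions of R also provide droplevels() for the
same purpose.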

   Terry Therneau
   Biostatistics, Mayo Clinic


