[R] dummy variable encoding

Richard.Cotton at hsl.gov.uk
Thu Mar 5 16:49:41 CET 2009


>    can anyone tell me why an encoding of 1/2 for a dummy variable for
>    two groups (e.g. gender) seems to be preferred over 0/1?
>    It's been bugging me for a while, 0/1 seems more natural, but I have
>    been told (without explanation) that 1/2 is better. Why?

The best encoding depends upon which language you would like to manipulate
the variable in.  In R, genders are most naturally represented as factors.
That means that in an external data source (like a spreadsheet of data),
you should ideally have the gender recorded as human-understandable text
("male" and "female", or "M" and "F").  When the data is read into R with
read.table or read.csv, the strings can be converted to factors (keeping
the human-readable labels); this was the default before R 4.0.0, and later
versions need stringsAsFactors = TRUE.  This way you avoid having to
remember that 1 means male (or whatever).
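
As a rough illustration of the point above, here is a minimal sketch of
reading text-coded genders into a factor.  The data, the column names and
the object names (csv_text, dat) are made up for the example; the inline
text stands in for a spreadsheet exported as CSV.

```r
# Hypothetical data, standing in for a spreadsheet exported as CSV.
csv_text <- "gender,score
male,10
female,12
female,9
male,11"

# stringsAsFactors = TRUE is given explicitly, so the example behaves the
# same whether or not it is the default in your version of R.
dat <- read.csv(text = csv_text, stringsAsFactors = TRUE)

is.factor(dat$gender)    # the column is a factor, not a character vector
levels(dat$gender)       # the human-readable labels
as.integer(dat$gender)   # the integer codes R stores internally
table(dat$gender)        # counts are reported by label, not by number
```

The integer codes that R stores behind the labels start at 1, but you
rarely need to look at them, since summaries and model output use the
labels.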

If you were manipulating the data in a different language that didn't have
factors, then it might be more appropriate to use an integer.  Which
integers you use doesn't matter; whatever you choose, you need a look-up
table to know what each number refers to.
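
If the data does arrive integer-coded, something like the following sketch
turns it back into a labelled factor.  The codes and the mapping (1 or 0
for male, 2 or 1 for female) are assumptions made for the example; the
levels/labels pair plays the role of the look-up table.

```r
# Hypothetical integer-coded genders, first with a 1/2 coding...
code_12 <- c(1, 2, 2, 1, 2)
gender_12 <- factor(code_12, levels = c(1, 2),
                    labels = c("male", "female"))

# ...and the same observations with a 0/1 coding.
code_01 <- c(0, 1, 1, 0, 1)
gender_01 <- factor(code_01, levels = c(0, 1),
                    labels = c("male", "female"))

# Once the labels are attached, the original choice of integers no longer
# matters: both codings give the same factor.
identical(gender_12, gender_01)
```

Either way, the mapping from numbers to labels has to be written down
somewhere, which is the argument for storing the labels in the first
place.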

Regards,
Richie.

Mathematical Sciences Unit
HSL

