[R] Retrieving Factors with Levels Ordered

Berwin A Turlach berwin at maths.uwa.edu.au
Sat Jan 1 07:59:19 CET 2011


G'day H.T.

On Sat,  1 Jan 2011 00:41:10 -0500 (EST)
"H. T. Reynolds" <htr at udel.edu> wrote:

> When I create a factor with labels in the order I want, write the
> data as a text file, and then retrieve them, the factor levels are no
> longer in the proper order.

Not surprisingly. :)
 
[..big snip..]
> testdf <- read.table(file = 'Junkarea/test.txt')

Did you look at the file Junkarea/test.txt, e.g. with a text editor?
You will see that write.table() stores only the observed values and no
information about the mode of each variable in the data frame.
In particular, it does not record that a variable is a factor, let
alone the levels and their ordering.

Actually, write.table() saves a factor by writing out the observed
labels as character strings.  You end up with a factor again only
because read.table() by default turns character data into factors (a
behaviour that some useRs don't like, which is why the option
stringsAsFactors exists).

> # But levels are no longer ordered

But the help file of factor (see ?factor) states in the warning section:

     The levels of a factor are by default sorted, but the sort order
     may well depend on the locale at the time of creation, and should
     not be assumed to be ASCII.

Thus, arguably, the levels are ordered, just not the order you want. :)
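
A minimal sketch of how the intended order can be put back after
reading the file: keep the labels as plain strings and pass the order
you want to factor() via its levels argument.  The variable name edu
and its labels are made up for illustration, and tempfile() stands in
for your own path.

```r
# Hypothetical data; the level order is set explicitly at creation.
wanted <- c("HS", "College", "Grad")
df <- data.frame(edu = factor(c("HS", "College", "Grad", "HS"),
                              levels = wanted))

tmp <- tempfile(fileext = ".txt")
write.table(df, file = tmp)

# The text file holds only the labels, so read them as plain strings ...
back <- read.table(tmp, stringsAsFactors = FALSE)

# ... and re-impose the intended level order by hand.
back$edu <- factor(back$edu, levels = wanted)
levels(back$edu)

# Without the levels argument the levels come out sorted instead.
levels(factor(as.character(back$edu)))
```

The price is that the level order has to be kept somewhere outside
the text file, e.g. in the script that reads the data.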

> Clearly I am missing something obvious, but I can't see it. If I save
> "feducord" and load it, the order of the levels is as it should be.
> But I don't know why writing to a test file should change anything.
> Any help would be greatly appreciated.

If you save() and load() an object, then it is saved in binary format,
thus much more information about it can be stored.  Indeed, I would
expect that a faithful internal representation of the object is stored
in the binary format so that a save(), rm() and load() would restore
exactly the same object.
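
A small sketch of that round trip (the factor and the temporary file
are invented for the example):

```r
# A factor whose level order differs from the sorted order.
fac <- factor(c("White", "Black", "White"), levels = c("White", "Black"))
original <- fac            # keep a copy to compare against

tmp <- tempfile(fileext = ".RData")
save(fac, file = tmp)      # binary representation of the object
rm(fac)                    # remove it from the workspace
load(tmp)                  # restores an object called 'fac'

identical(fac, original)   # compares values, levels and attributes
levels(fac)
```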

Saving objects in text formats is prone to loss of information.  E.g.,
as you experienced with write.table() and read.table(), no information
about the ordering of levels is stored by these functions.  If care is
not taken when writing numbers to text files, the internal
representation can change, e.g.:

R> x <- data.frame(x=seq(from=0, to=1, length.out=11))
R> write.table(x, file="/tmp/junk1")
R> y <- read.table("/tmp/junk1")
R> identical(x,y)
[1] FALSE
R> x-y
              x
1  0.000000e+00
2  0.000000e+00
3  0.000000e+00
4  5.551115e-17
5  0.000000e+00
6  0.000000e+00
7  1.110223e-16
8  1.110223e-16
9  0.000000e+00
10 0.000000e+00
11 0.000000e+00
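
When numbers have been through a text file, a comparison with a
tolerance is usually more appropriate than identical().  A small
sketch, with tempfile() replacing the path used above; write.table()
writes numbers with at most 15 significant digits, so the last bits
can be lost:

```r
x <- data.frame(x = seq(from = 0, to = 1, length.out = 11))

tmp <- tempfile()
write.table(x, file = tmp)
y <- read.table(tmp)

# all.equal() compares numeric vectors up to a small tolerance
# (about 1.5e-8 by default), so rounding in the last bits is ignored.
isTRUE(all.equal(x$x, y$x))

# The largest absolute difference is of the order of machine precision.
max(abs(x$x - y$x))
```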

Having said all this, if you want to save your data in a text file
with a representation that remembers the ordering of the factor
levels, look at dput():

R> fac <- gl(2,4, labels=c("White", "Black"))
R> fac
[1] White White White White Black Black Black Black
Levels: White Black
R> write.table(fac, file="/tmp/junk")
R> str(read.table("/tmp/junk"))
'data.frame':	8 obs. of  1 variable:
 $ x: Factor w/ 2 levels "Black","White": 2 2 2 2 1 1 1 1
R> dput(fac, file="/tmp/junk")
R> str(dget(file="/tmp/junk"))
 Factor w/ 2 levels "White","Black": 1 1 1 1 2 2 2 2

It might just be that the text representation used by dput() is not
particularly digestible for the human eye. :)
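
To see what that text representation looks like, call dput() without a
file argument so that it prints to the console.  A short sketch,
recreating fac from above and using tempfile() instead of a fixed
path; the exact text printed depends on the R version, so no output is
shown:

```r
fac <- gl(2, 4, labels = c("White", "Black"))

# Print the text representation to the console.
dput(fac)

# Write it to a file and read it back; the level order survives.
tmp <- tempfile()
dput(fac, file = tmp)
back <- dget(tmp)
levels(back)
```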

> (You're right, I don't have anything better to do on New Year's eve.)

New Year's eve?  The first day of the new year is already nearly
over! :)

HTH.

Cheers,

	Berwin

========================== Full address ============================
Berwin A Turlach                      Tel.: +61 (8) 6488 3338 (secr)
School of Maths and Stats (M019)            +61 (8) 6488 3383 (self)
The University of Western Australia   FAX : +61 (8) 6488 1028
35 Stirling Highway                   
Crawley WA 6009                e-mail: berwin at maths.uwa.edu.au
Australia                        http://www.maths.uwa.edu.au/~berwin


