[R] Appending new values to an existing factor vector

David Hall (coding) hacking at gringer.org
Sat Mar 15 01:25:03 CET 2008


Hello,

I've recently come across a situation where I'm trying to read in [genotype
data] files that have around 80,000,000 lines and 4 fields, with a high
proportion of repeated strings. Here's a sample:

rsXXXXXXX       SAMPLE0001      CG      0.05302
rsXXXXXX        SAMPLE0001      CC      0.06817
rsXXXXXXXX      SAMPLE0001      CC      0.01369
rsXXXXXXY       SAMPLE0001      GG      0.01816
rsXXXXXXZ       SAMPLE0001      GG      0.006711
rsXXXXXXX       SAMPLE0002      GG      0.05813

[For the purpose of the work I'm doing at the moment, I don't care about the
last column]

What's the best way to read in these data?

My understanding of what happens when I do read.table on such a file is that it
reads the file into a matrix (or perhaps a list) of character strings, then
carries out the character conversions [i.e. as.factor(data[[i]])].

infile.df <- read.table(gzfile("large_file.txt.gz"), nrows = 82000000)

Doing this all in one go results in R complaining about not having enough memory
to store a data structure of that size [I'm running on Linux, with 1.5GB memory
 + 2GB swap], so I need to do it piecewise, but I suspect the memory issues will
still be present if I do that.

What I'd like is a way to read in, say, a million lines at a time, do the factor
conversion, then append to my existing data frame, which has columns of factors.
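
A minimal sketch of the piecewise approach described above, assuming the file
is tab-separated and that the last column can be dropped on input. The
temporary file, the column names (marker, sample, genotype, score) and the
chunk size are placeholders for illustration, not part of the original data;
for the real file the connection would be gzfile("large_file.txt.gz") and the
chunk size something like 1e6.

```r
# Small stand-in for the large genotype file (hypothetical contents).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "rs1\tSAMPLE0001\tCG\t0.05302",
  "rs2\tSAMPLE0001\tCC\t0.06817",
  "rs3\tSAMPLE0001\tCC\t0.01369",
  "rs1\tSAMPLE0002\tGG\t0.05813",
  "rs2\tSAMPLE0002\tCG\t0.01816"
), tmp)

chunk.size <- 2                 # assumed value; e.g. 1e6 for a real file
con <- file(tmp, open = "r")    # an open connection keeps its read position
chunks <- list()
repeat {
  lines <- readLines(con, n = chunk.size)
  if (length(lines) == 0) break # end of file reached
  # Parse one chunk; colClasses "NULL" skips the unwanted last column,
  # and "factor" converts the other columns as they are read.
  chunk <- read.table(text = lines, sep = "\t",
                      colClasses = c("factor", "factor", "factor", "NULL"),
                      col.names = c("marker", "sample", "genotype", "score"))
  chunks[[length(chunks) + 1]] <- chunk
}
close(con)

# rbind on data frames merges the factor levels of the individual chunks.
infile.df <- do.call(rbind, chunks)
str(infile.df)
```

Collecting the chunks in a list and binding them once at the end avoids
re-copying a growing data frame on every iteration, although whether the
final result fits in memory still depends on the data.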

However, something I came across while participating in the ICFP 2007 contest
(http://www.icfpcontest.org/) using R was the strange behaviour when adding
new/unknown values to a factor vector:

> (a <- factor(c("I","C","I","C","F","I")))
[1] I C I C F I
Levels: C F I
> append(a,"P")
[1] "3" "1" "3" "1" "2" "3" "P"

What would be nice is for unknown levels to be added and encoded as a new value,
without having to refactor the whole list, as follows:

> factor(append(as.character(a),"P"))
[1] I C I C F I P
Levels: C F I P

Is there a better way to do this that means I don't need to do the character
conversion process?
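
One possible way to avoid converting the whole vector, sketched here under the
assumption that it is acceptable for new levels to be appended after the
existing ones rather than sorted: extend the levels first, then assign. The
helper combine_factors is a hypothetical name introduced for illustration; it
joins two factors by remapping integer codes, so only the (short) level
vectors are compared as strings.

```r
a <- factor(c("I", "C", "I", "C", "F", "I"))

# Appending a single value: add the level, then assign past the end.
levels(a) <- c(levels(a), "P")  # existing integer codes are unchanged
a[length(a) + 1] <- "P"
a

# Joining two factors without converting either to character
# (hypothetical helper, not a base R function).
combine_factors <- function(x, y) {
  lv <- union(levels(x), levels(y))
  # Map each of y's codes to the position of its level in the merged set.
  y.codes <- match(levels(y), lv)[as.integer(y)]
  structure(c(as.integer(x), y.codes), levels = lv, class = "factor")
}

x <- factor(c("I", "C", "I", "C", "F", "I"))
y <- factor(c("P", "I"))
combined <- combine_factors(x, y)
combined
```

Note that the level order here is C F I P only because "P" happens to sort
last; in general the merged levels are not re-sorted.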

The need to do this character conversion seems to remove one of the useful
features of a factored vector, namely that it substantially reduces space
requirements.

Thanks for your help,
David Hall


