[R] character type and memory usage

Mike Miller mbmiller+l at gmail.com
Sat Jan 17 07:21:17 CET 2015


First, a very easy question:  What is the difference between using 
what="character" and what=character() in scan()?  What is the reason for 
the character() syntax?
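
For concreteness, these are the two calls I mean (using my file from 
below); if I read ?scan correctly, only the type of 'what' matters, not 
its contents or length:

snps <- scan("SNPs.txt", what = "character")  # string form
snps <- scan("SNPs.txt", what = character())  # empty-vector form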

I am working with some character vectors that are up to about 27.5 million 
elements long.  The elements are always unique.  Specifically, these are 
names of genetic markers.  This is how much memory those names take up:

> snps <- scan("SNPs.txt", what=character())
Read 27446736 items
> object.size(snps)
1756363648 bytes
> object.size(snps)/length(snps)
63.9917128215173 bytes

As you can see, that's about 1.76 GB of memory for the vector at an 
average of 64 bytes per element.  The longest string is only 14 bytes, 
though.  The file takes up 313 MB.

Using 64 bytes per element instead of 14 bytes per element is costing me a 
total of 1,372,336,800 bytes.  In a different example where the longest 
string is 4 characters, the elements each use 8 bytes.  So it looks like 
I'm stuck with either 8 bytes or 64 bytes per element.  Is that true?  
Is there no way to change it?
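
If it helps to make this concrete, here is a small experiment along 
those lines.  My guess is that the 8-byte case comes from repeated 
strings being shared in R's string cache and counted only once, so 
something like this should show the difference on a 64-bit build (the 
exact constants may differ elsewhere):

n <- 1e5
unique14 <- sprintf("rs%012d", seq_len(n))  # n unique 14-character strings
repeated <- rep("rs42", n)                  # one 4-character string, repeated
object.size(unique14) / n  # expect ~64 bytes: 8-byte pointer + ~56-byte string
object.size(repeated) / n  # expect ~8 bytes: pointers to one shared string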

By the way...

It turns out that 99.72% of those character strings are of the form 
paste0("rs", Int) where Int is an integer of no more than 9 digits.  So if 
I use only those markers, drop the "rs" off, and load them as integers, I 
see a huge improvement:

> snps <- scan("SNPs_rs.txt", what=integer())
Read 27369706 items
> object.size(snps)
109478864 bytes
> object.size(snps)/length(snps)
4.00000146146985 bytes
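
For the record, SNPs_rs.txt holds just the numeric parts of the 
rs-numbered markers.  The same integer vector could be built in R 
without the intermediate file, starting from the full character vector; 
the pattern below is my assumption about what counts as an rs name:

snps_chr <- scan("SNPs.txt", what = character())
is_rs <- grepl("^rs[0-9]{1,9}$", snps_chr)             # rs + up to 9 digits
rs_int <- as.integer(sub("^rs", "", snps_chr[is_rs]))  # 9 digits fit in an integer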

That saves 93.8% of the memory by dropping 0.28% of the markers and 
encoding as integers instead of strings.  I might end up doing this for 
all of the markers by encoding the non-rs names as negative integers.
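
A rough sketch of what that combined encoding might look like, 
continuing from the code above (the scheme is just one possibility):

lookup <- unique(snps_chr[!is_rs])                  # decode table, non-rs names
codes <- integer(length(snps_chr))                  # 4 bytes per element overall
codes[is_rs] <- rs_int                              # positive: the rs number
codes[!is_rs] <- -match(snps_chr[!is_rs], lookup)   # negative: index into lookup

A positive code then decodes as paste0("rs", code) and a negative one as 
lookup[-code], so no markers need to be dropped.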

Mike


