[R] character type and memory usage
mbmiller+l at gmail.com
Sat Jan 17 07:21:17 CET 2015
First, a very easy question: What is the difference between using
what="character" and what=character() in scan()? What is the reason for
the character() syntax?
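A small sketch for comparing the two forms. It relies on the documented behaviour of scan(), where the type of the what argument determines the type of data read. The sample strings are made up for illustration.

```r
# Both forms pass a character vector as `what`; scan() looks at its type.
a <- scan(text = "rs123 rs456 rs789", what = "character", quiet = TRUE)
b <- scan(text = "rs123 rs456 rs789", what = character(), quiet = TRUE)

identical(a, b)        # TRUE
typeof("character")    # "character"
typeof(character())    # "character"
length(character())    # 0: an empty character vector
```

So "character" is a length-one character vector and character() is an empty one. For a plain vector read like this, both should give the same result.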
I am working with some character vectors that are up to about 27.5 million
elements long. The elements are always unique. Specifically, these are
names of genetic markers. This is how much memory those names take up:
> snps <- scan("SNPs.txt", what=character())
Read 27446736 items
That's about 1.76 GB of memory for the vector, an average of 64 bytes
per element, even though the longest string is only 14 bytes. The file
itself takes up 313 MB.
Using 64 bytes per element instead of 14 bytes per element is costing me a
total of 1,372,336,800 bytes. In a different example where the longest
string is 4 characters, the elements each use 8 bytes. So it looks like
I'm stuck with either 8 bytes or 64 bytes. Is that true? Is there no way
to change that?
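The arithmetic behind those figures, as a quick check. The variable names are only illustrative, and the per-element sizes are the ones quoted in this message.

```r
n_items <- 27446736        # element count reported by scan()
bytes_per_string <- 64     # average bytes per element quoted above
longest_string <- 14       # longest marker name, in bytes

total_bytes <- n_items * bytes_per_string
overhead_bytes <- n_items * (bytes_per_string - longest_string)

total_bytes      # 1756591104 bytes, about 1.76 GB
overhead_bytes   # 1372336800 bytes spent beyond 14 bytes per element
```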
By the way...
It turns out that 99.72% of those character strings are of the form
paste0("rs", Int), where Int is an integer of no more than 9 digits. So if
I use only those markers, drop the "rs" off, and load them as integers, I
see a huge improvement:
> snps <- scan("SNPs_rs.txt", what=integer())
Read 27369706 items
That saves 93.8% of the memory by dropping 0.28% of the markers and
encoding as integers instead of strings. I might end up doing this by
encoding the remaining marker names as negative integers.
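A rough sketch of what that negative-integer encoding might look like. The function names encode_snps() and decode_snps() are hypothetical, and the sketch assumes the rs numbers have no leading zeros. Any name that does not match that pattern is placed in a small lookup table and stored as a negative index into it.

```r
# Hypothetical helpers, not an existing API.
encode_snps <- function(ids) {
  # "rs" followed by 1 to 9 digits with no leading zero fits in an R integer
  is_rs <- grepl("^rs[1-9][0-9]{0,8}$", ids)
  codes <- integer(length(ids))
  codes[is_rs] <- as.integer(sub("^rs", "", ids[is_rs]))
  # Remaining names: keep each distinct name once, store -(its position)
  other <- unique(ids[!is_rs])
  codes[!is_rs] <- -match(ids[!is_rs], other)
  list(codes = codes, other = other)
}

decode_snps <- function(enc) {
  out <- character(length(enc$codes))
  is_rs <- enc$codes > 0L
  out[is_rs] <- paste0("rs", enc$codes[is_rs])
  out[!is_rs] <- enc$other[-enc$codes[!is_rs]]
  out
}

# Made-up marker names for illustration
ids <- c("rs123", "chr1:1000", "rs999999999", "exm-55")
enc <- encode_snps(ids)
enc$codes                          # 123 -1 999999999 -2
identical(decode_snps(enc), ids)   # TRUE
```

The integer vector costs 4 bytes per element, and the lookup table only holds the small fraction of names that are not of the rs form.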