[R] character type and memory usage

Sat Jan 17 11:01:43 CET 2015

On 01/16/2015 10:21 PM, Mike Miller wrote:
> First, a very easy question:  What is the difference between using
> what="character" and what=character() in scan()?  What is the reason for the
> character() syntax?
>
> I am working with some character vectors that are up to about 27.5 million
> elements long.  The elements are always unique.  Specifically, these are names
> of genetic markers.  This is how much memory those names take up:
>
>> snps <- scan("SNPs.txt", what=character())
> Read 27446736 items
>> object.size(snps)
> 1756363648 bytes
>> object.size(snps)/length(snps)
> 63.9917128215173 bytes
>
> As you can see, that's about 1.76 GB of memory for the vector at an average of
> 64 bytes per element.  The longest string is only 14 bytes, though.  The file
> takes up 313 MB.
>
> Using 64 bytes per element instead of 14 bytes per element is costing me a total
> of 1,372,336,800 bytes.  In a different example where the longest string is 4
> characters, the elements each use 8 bytes.  So it looks like I'm stuck with
> either 8 bytes or 64 bytes.  Is that true?  There is no way to modify that?

Hi Mike --
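
On the first question, a minimal sketch (the input strings are made up for illustration): as far as ?scan describes it, only the type of `what` matters, so the string "character" and the empty vector character() should behave the same. The character() form is less error-prone, because a string naming some other type is still a character value.

```r
# scan() looks at the *type* of `what`, not at its contents.
a <- scan(text = "rs123 rs456", what = "character", quiet = TRUE)
b <- scan(text = "rs123 rs456", what = character(), quiet = TRUE)
identical(a, b)

# The string "integer" is itself of type character, so this reads
# character data, not integers.
ch <- scan(text = "1 2 3", what = "integer", quiet = TRUE)
typeof(ch)

# An actual integer vector is needed to read integers.
int <- scan(text = "1 2 3", what = integer(), quiet = TRUE)
typeof(int)
```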

R represents the atomic vector types as so-called S-expressions, which in 
addition to the actual data contain information about whether they have been 
referenced by one or more symbols etc.; you can get a sense of this with

     > x <- 1:5
     > .Internal(inspect(x))
     @4c732940 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5

where the number after @ is the memory location, INTSXP indicates that the type 
of data is an integer, etc. So a vector requires memory for the S-expression, 
and for the actual data.
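
A rough way to see the fixed per-vector cost is to compare object.size() with the raw payload of 4 bytes per integer. This is only a sketch: the exact numbers depend on the R version and platform, so none are quoted here.

```r
# Use integer(k) so the vector is fully materialized; 1:k may be stored
# compactly in newer versions of R.
n <- c(0, 1, 10, 100, 1000)
total <- sapply(n, function(k) as.numeric(object.size(integer(k))))
payload <- 4 * n              # 4 bytes per integer element
overhead <- total - payload   # S-expression header plus allocation rounding
data.frame(n, total, payload, overhead)
```

The overhead is paid once per vector, so it becomes negligible as the vector grows.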

A character vector is represented by an S-expression for the vector itself, 
an S-expression for each element of the vector, and of course the data itself:

     > y <- c("a", "a", "b")
     > .Internal(inspect(y))
     @4ce72090 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
       @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
       @137ccd8 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "a"
       @15a6698 09 CHARSXP g0c1 [gp=0x61] [ASCII] [cached] "b"

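The large per-string S-expression overhead is recouped by long (in the nchar() 
sense) or re-used strings (both "a" elements above point to the same cached 
CHARSXP), but that's not the case for your data, where every name is short and 
unique.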
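
The effect can be sketched with made-up marker names (the vectors and helper functions below are illustrative, not your data). object.size() takes sharing within a character vector into account, so repeated strings come out far cheaper per element than unique ones, and long strings far cheaper per character than short ones.

```r
n <- 1e5
unique_ids   <- paste0("rs", seq_len(n))                  # every element distinct
repeated_ids <- rep("rs1", n)                             # one string, referenced n times
long_ids     <- paste0(strrep("A", 1000), seq_len(1000))  # unique but long

# Hypothetical helpers for this illustration only.
bytes_per_element <- function(x) as.numeric(object.size(x)) / length(x)
bytes_per_char    <- function(x) as.numeric(object.size(x)) / sum(nchar(x))

bytes_per_element(unique_ids)     # pointer plus one CHARSXP per element
bytes_per_element(repeated_ids)   # roughly just the pointer
bytes_per_char(unique_ids)        # overhead dominates short strings
bytes_per_char(long_ids)          # overhead amortized over many characters
```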

There is no way around this in base R. There are general-purpose solutions like 
the data.table package, or keeping your large data in a database (like SQLite) 
that you access from within R using, e.g., sqldf or dplyr, so that as much data 
reduction as possible happens in the database (and outside R). In your 
particular case, BStringSet() from the Bioconductor Biostrings package might be 
relevant:

   http://bioconductor.org/packages/release/bioc/html/Biostrings.html

This will consume memory more along the lines of 1 byte per character + 1 byte 
per string, and is of particular relevance because you are likely doing other 
genetic operations for which the Bioconductor project has relevant packages (see 
especially the GenomicRanges package).
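
A minimal sketch of the BStringSet() route, assuming Biostrings is installed from Bioconductor (the code is skipped otherwise). The marker names are made up, and the sizes that object.size() reports for such objects should be treated as indicative only.

```r
ids <- paste0("rs", seq_len(1e5))   # made-up marker names

if (requireNamespace("Biostrings", quietly = TRUE)) {
    # Store all names in one packed container instead of one CHARSXP each.
    bs <- Biostrings::BStringSet(ids)

    print(object.size(ids), units = "MB")
    print(object.size(bs), units = "MB")

    # Converting back should recover the original names.
    stopifnot(all(unname(as.character(bs)) == ids))
}
```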

If your work is not particularly domain-specific, data.table would be a good bet 
(it also has an implementation for working with overlapping ranges, which is a 
very common task with SNPs). A lot of SNP data management is really relational, 
for which the SQL representation (and dplyr, for me) is the obvious choice. 
Bioconductor would be the choice if there is to be extensive domain-specific 
work. I am involved in the Bioconductor project, so not exactly impartial.

Martin

>
> By the way...
>
> It turns out that 99.72% of those character strings are of the form paste("rs",
> Int) where Int is an integer of no more than 9 digits.  So if I use only those
> markers, drop the "rs" off, and load them as integers, I see a huge improvement:
>
>> snps <- scan("SNPs_rs.txt", what=integer())
> Read 27369706 items
>> object.size(snps)
> 109478864 bytes
>> object.size(snps)/length(snps)
> 4.00000146146985 bytes
>
> That saves 93.8% of the memory by dropping 0.28% of the markers and encoding as
> integers instead of strings.  I might end up doing this by encoding the other
> characters as negative integers.
>
> Mike
>

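The integer encoding described in the quoted text could be sketched as follows. encode_ids() and decode_ids() are hypothetical helper names, and the scheme (positive integers for "rs" names, negative integers indexing a small lookup table for everything else) is one possible reading of the idea, not a tested recipe.

```r
# Encode "rs<digits>" names as positive integers; every other name gets a
# negative code that indexes a lookup table of the distinct "other" names.
encode_ids <- function(ids) {
    # At most 9 digits and no leading zero, so the value fits in an R
    # integer and converts back to the same string.
    is_rs <- grepl("^rs[1-9][0-9]{0,8}$", ids)
    other <- unique(ids[!is_rs])
    codes <- integer(length(ids))
    codes[is_rs] <- as.integer(substring(ids[is_rs], 3))
    codes[!is_rs] <- -match(ids[!is_rs], other)
    list(codes = codes, other = other)
}

# Reverse the encoding.
decode_ids <- function(enc) {
    out <- character(length(enc$codes))
    pos <- enc$codes > 0L
    out[pos] <- paste0("rs", enc$codes[pos])
    out[!pos] <- enc$other[-enc$codes[!pos]]
    out
}

ids <- c("rs123", "chr1:1000", "rs999999999", "rs0123", "chr1:1000")
enc <- encode_ids(ids)
enc$codes
identical(decode_ids(enc), ids)
```

Names that would not survive the round trip (for example a leading zero after "rs") are deliberately sent to the lookup table.
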
-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793