[Rd] Memory allocation in read.table

Wed Aug 28 20:10:17 CEST 2013

On Aug 28, 2013, at 1:59 PM, Hadley Wickham wrote:

>>> Why do those lines need any allocations? I thought class<- and attr<-
>>> were primitives, and hence would modify in place.
>>> 
>> 
>> ... but only if there is no other reference to the data (i.e. NAMED < 2). If there are two references, they have to copy, because modifying in place would also change the value seen through the other reference.
>> Here, however, it already has NAMED=2 because of
>> 
>> data <- data[keep]
> 
> Ah, got it - thanks!
> 
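A minimal sketch of the copy-on-modify behaviour described above, assuming a stock R build. The vector `x`, its alias `y`, and the "units" attribute are made up for illustration and are not taken from the read.table() source. The tracemem() call is guarded because it only works when R was built with memory profiling, and the exact bookkeeping behind the copy differs between R versions.

```r
# Two names bound to the same vector: the data is shared until one is modified.
x <- c(1.5, 2.5, 3.5)
y <- x                       # second reference, no copy yet

# tracemem() reports duplications, but only on builds with memory profiling.
traced <- capabilities("profmem")
if (traced) invisible(tracemem(x))

# attr<- cannot modify in place here: doing so would also alter x.
attr(y, "units") <- "cm"     # this is where the duplicate is made

if (traced) untracemem(x)

# x is untouched, y carries the new attribute.
stopifnot(is.null(attributes(x)))
stopifnot(identical(attr(y, "units"), "cm"))
```

On a build with memory profiling, the assignment to attr(y, "units") should print a tracemem line reporting the duplication.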
>> PS: if you are loading any sizable data, the one thing you don't want to do is to use read.table() ;)
> 
> Yes ;)  Romain and I (mostly Romain) are working on some faster
> alternatives at https://github.com/romainfrancois/fastread.
> 
> One surprising finding so far (to me at least), is that when loading a
> file full of doubles, you pretty quickly get to the point where strtod
> is the bottleneck.
> 

Yup - parsing is the most expensive part. That's why for high-throughput data you don't want to use an ASCII representation. It's amazing that disk speeds are now so high that the CPU is the bottleneck rather than the disk.
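
A rough way to see the parsing cost for yourself, using base R only. This is a sketch under assumptions: the vector length, the temporary file names, and the choice of scan() as the text reader are all arbitrary, and the timings will vary by machine, so none are quoted here.

```r
# Write the same doubles once as ASCII text and once as raw binary.
set.seed(1)
x <- runif(1e5)
txt <- tempfile(fileext = ".txt")
bin <- tempfile(fileext = ".bin")
writeLines(format(x, digits = 17), txt)   # 17 significant digits per value
writeBin(x, bin)                          # native 8-byte doubles

# Text input has to be parsed number by number; binary input is copied as-is.
t_txt <- system.time(a <- scan(txt, what = double(), quiet = TRUE))
t_bin <- system.time(b <- readBin(bin, what = "double", n = length(x)))

# The binary round trip is exact; the text round trip is compared with a tolerance.
stopifnot(identical(b, x))
stopifnot(isTRUE(all.equal(a, x)))

# Elapsed seconds for each reader.
print(c(text = t_txt[["elapsed"]], binary = t_bin[["elapsed"]]))
unlink(c(txt, bin))
```

With a warm file cache, any gap between the two timings is mostly number parsing rather than disk access.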

Re fast loading - yes, that's something I have also been working on in iotools: https://github.com/s-u/iotools

Cheers,
Simon