[R] aggregate function - na.action/ performance issues re structs and algorithms

Mon Feb 7 18:15:58 CET 2011

----------------------------------------
> From: hadley at rice.edu
> Date: Mon, 7 Feb 2011 11:00:59 -0600
> To: mdowle at mdowle.plus.com
> CC: r-help at stat.math.ethz.ch
> Subject: Re: [R] aggregate function - na.action
>
> > Does FAQ 1.8 answer that ok ?
> >   "Ok, I'm starting to see what data.table is about, but why didn't you
> > enhance data.frame in R? Why does it have to be a new package?"
> >   http://datatable.r-forge.r-project.org/datatable-faq.pdf
>
> Kind of. I think there are two sets of features data.table provides:
>
> * a compact syntax for expressing many common data manipulations
> * high performance data manipulation
>
> FAQ 1.8 answers the question for the syntax, but not for the
> performance related features.
>
> Basically, I'd love to be able to use the high performance components
> of data table in plyr, but keep using my existing syntax. Currently
> the only way to do that is for me to dig into your C code to
> understand why it's fast, and then implement those ideas in plyr.

Without having looked at the code (the original subject would have caused me to miss
most of this thread), my guess is that the usual problem is data structures that
don't know about the algorithm's access patterns, or that are characterized only by
things like the order of an operation on a collection (O(n) to access an element, for
example). IIRC the author suggested page loading time as a contributing factor, and this
would be great news since that is one of my personal rants :) People complain
about "running out of memory", but it is unlikely that you have an algorithm that
just randomly picks one of those "billions and billions" of bits after the
prior memory operation. Cache-aware structures and algorithms can be a big
deal; see, for example, the many good white papers on Intel's site. Tables generally connote
random access, but usually you just want to stream the data or, hopefully, operate on
local blocks. Long before VM thrashing sets in, low-level cache pollution can become a problem.
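
To make the "stream the data in local blocks" point concrete, here is a minimal
sketch of a one-pass grouped mean. It is written in Python only for brevity and is
not how data.table or plyr do it; the names blocks() and streaming_group_mean() are
made up for illustration. The point is that each block is touched once, in order,
and only the small running sums stay resident.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (group key, value); a hypothetical row layout


def blocks(rows: List[Row], block_size: int) -> Iterable[List[Row]]:
    """Yield consecutive slices so every block is visited once, in order."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def streaming_group_mean(source: Iterable[List[Row]]) -> Dict[str, float]:
    """Compute per-group means in a single sequential pass over the blocks."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for block in source:
        for key, value in block:
            # Only the running totals are kept; the block can be dropped afterwards.
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 6.0), ("a", 5.0)]
result = streaming_group_mean(blocks(rows, block_size=2))
print(result)
```

Python will not show the cache effects themselves; the sketch only illustrates the
access pattern (sequential, block at a time) as opposed to random lookups into a table.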

Personally I've always thought a streaming source would be nice. I'm not sure whether you
would want a prefetch() or similar interface to let your algorithm prepare its structs, etc.
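
Roughly what I have in mind, as a sketch only: the StreamSource class and its
prefetch()/read() methods below are hypothetical names, not an existing R or
data.table interface, and Python is used just to keep it short. The consumer
hints how much it will need next, then reads a block at a time.

```python
from collections import deque
from typing import Deque, Iterator, List


class StreamSource:
    """Hypothetical streaming source with a prefetch hint (illustration only)."""

    def __init__(self, items: Iterator[float]) -> None:
        self._items = items
        self._buffer: Deque[float] = deque()

    def prefetch(self, n: int) -> int:
        """Buffer items until n are ready or the source ends; return the count buffered."""
        while len(self._buffer) < n:
            try:
                self._buffer.append(next(self._items))
            except StopIteration:
                break
        return len(self._buffer)

    def read(self, n: int) -> List[float]:
        """Return up to n items, in order; an empty list means the source is exhausted."""
        self.prefetch(n)
        return [self._buffer.popleft() for _ in range(min(n, len(self._buffer)))]


src = StreamSource(iter([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
total = 0.0
blocks_seen = 0
while True:
    block = src.read(3)
    if not block:
        break
    src.prefetch(3)  # hint: get the next block ready while this one is processed
    total += sum(block)
    blocks_seen += 1
print(total, blocks_seen)
```

In this toy version prefetch() just fills a buffer synchronously; a real one would
presumably overlap the I/O or memory fetch with the computation on the current block.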

>
> Hadley
>