[R] fast or space-efficient lookup?

Steve Lianoglou mailinglist.honeypot at gmail.com
Mon Oct 10 05:14:46 CEST 2011


Hi Ivo,

I'll just be brief, but regarding data.table's syntax: one person's
"strange" is another's "intuitive" :-)

Note also that data.table provides a `merge.data.table` method which
works pretty much as you would expect merge.data.frame to. The one
exception is that its `by` argument will, by default, use the
intersection of the keys of the two data.tables being merged, rather
than all columns shared by name between the two tables.
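
A minimal sketch of the difference (the tables and column names here
are invented for illustration):

library(data.table)
dt1 <- data.table(day = 1:3, stockreturn = rnorm(3), key = "day")
dt2 <- data.table(day = 2:4, marketreturn = rnorm(3), key = "day")

merge(dt1, dt2)                 # joins on the shared key, "day"
merge(dt1, dt2, all.x = TRUE)   # keeps all rows of dt1, as in merge.data.frame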

Hope that's helpful.

-steve

On Sun, Oct 9, 2011 at 10:44 PM, ivo welch <ivo.welch at gmail.com> wrote:
> hi patrick.  thanks.  I think you are right.
>
> combined <- merge( main, aggregate.data, by="day", all.x=TRUE, all.y=FALSE )
> lm( stockreturn ~ marketreturn, data=combined )
>
> becomes something like
>
> main <- as.data.table(main)
> setkey(main, "yyyymmdd")                 # index both tables by trading day
> aggregate.data <- as.data.table(aggregate.data)
> setkey(aggregate.data, "yyyymmdd")
> main <- main[ aggregate.data ]           # keyed join replaces the merge
>
> this is fast and memory efficient.
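>
> one subtlety: main[ aggregate.data ] is driven by the rows of
> aggregate.data, so days in main without a market observation are
> dropped.  to mimic all.x=TRUE exactly, reversing the join looks
> closer (a sketch, assuming the same keys as above):
>
> main <- aggregate.data[ main ]   # one output row per row of main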
>
> for me, on the plus side, a data.table is a data.frame, so it can be
> used easily elsewhere, and it is super-efficient.  on the minus side,
> data.table often has very strange syntax.  for example,
> main[aggregate.data] is counterintuitive, and passing functions in a
> slot that should be a tensor index is also strange (an illustration
> below).  I would much prefer it if all non-tensor functionality were
> in functions, and not in arguments following [ ].
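>
> to illustrate what I mean by a function in the index slot (a sketch;
> the column names are guesses from my regression above):
>
> main[ , mean(stockreturn), by=day ]   # j is an expression, not an index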
>
> I have written this before:  Given that applied end users of
> statistics typically use data.frame as their main container for data
> sets, data.frame should be as efficient and tuned as possible.  cell
> assignments should be fast.  indexing and copying should be fast.  it
> would give R a whole lot more appeal.  the functionality in
> data.table should be core functionality, not requiring an add-on with
> strange syntax.  just my 5 cents... of course, the R developers are
> saints, putting in a lot of effort with no compensation, so
> complaining is unfair.  and thanks to Matthew Dowle for writing
> data.table, without which I couldn't do this AT ALL.
>
> regards,
>
> /iaw
>
> ----
> Ivo Welch (ivo.welch at gmail.com)
>
>
>
>
>
> On Sun, Oct 9, 2011 at 10:42 AM, Patrick Burns <pburns at pburns.seanet.com> wrote:
>> I think you are looking for the 'data.table'
>> package.
>>
>> On 09/10/2011 17:31, ivo welch wrote:
>>>
>>> Dear R experts---I am struggling with memory and speed issues.  Advice
>>> would be appreciated.
>>>
>>> I have a long data set (of financial stock returns, with stock name
>>> and trading day).  All three variables (stock return, id, and day)
>>> are irregular.  the object is about 1.3GB in object.size (200MB on
>>> disk).  now, I need to merge the main data set with some aggregate
>>> data (e.g., the S&P500 market rate of return, with a day index) from
>>> the same day.  this "market data set" is not big (object.size=300K,
>>> 5 columns, 12000 rows).
>>>
>>> let's say my (dumb statistical) plan is to run one grand regression,
>>> where the individual rate of return is y and the market rate of return
>>> is x.  the following should work without a problem:
>>>
>>> combined <- merge( main, aggregate.data, by="day", all.x=TRUE, all.y=FALSE )
>>> lm( stockreturn ~ marketreturn, data=combined )
>>>
>>> alas, the merge is neither space-efficient nor fast.  in fact, I run
>>> out of memory on my 16GB linux machine.  my guess is that by
>>> whittling it down I could make it work (perhaps doing it in chunks
>>> and then rbinding the pieces, roughly as sketched below), but this
>>> is painful.
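>>>
>>> something like the following is what I mean by chunking (a sketch
>>> only, reusing the column names from the merge above):
>>>
>>> grp <- rep(1:20, length.out=nrow(main))   # split the rows into 20 chunks
>>> pieces <- lapply(split(main, grp), merge,
>>>                  aggregate.data, by="day", all.x=TRUE, all.y=FALSE)
>>> combined <- do.call(rbind, pieces)        # reassemble the merged chunks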
>>>
>>> in perl, I would define a hash with the day as key and the market
>>> return as value, and then loop over the main data set to supplement
>>> it.
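>>>
>>> the closest R translation I can think of (a sketch; the column names
>>> are guesses from the merge above) is a vectorized lookup via match(),
>>> which plays the role of the perl hash:
>>>
>>> idx <- match(main$day, aggregate.data$day)             # hash-style key lookup
>>> main$marketreturn <- aggregate.data$marketreturn[idx]  # supplement main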
>>>
>>> is there a recommended way of doing such tasks in R, either super-fast
>>> (so that I merge many many times) or space efficient (so that I merge
>>> once and store the results)?
>>>
>>> sincerely,
>>>
>>> /iaw
>>>
>>> ----
>>> Ivo Welch (ivo.welch at gmail.com)
>>>
>>
>> --
>> Patrick Burns
>> pburns at pburns.seanet.com
>> twitter: @portfolioprobe
>> http://www.portfolioprobe.com/blog
>> http://www.burns-stat.com
>> (home of 'Some hints for the R beginner'
>> and 'The R Inferno')
>>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


