[R] Working with semantic data triplets

Mon Nov 8 14:49:33 CET 2010

Dear R Help List,

I have the following data set:
<id> <date> <factor>
e.g.
1, date11, f1
1, date12, f2
1, date13, f3
[...]
2, date21, fi
2, date22, fj
[…]

f1 – fn are various levels of a factor variable.

Each ID may have one or many entries. Together, the entries are essentially semantic data (triplets).

I want to construct the corresponding contingency table, detailing all direct transitions:
factor_i => factor_j; in other words:
__ f1 f2 f3 f4 … fn
f1
f2
f3
…
fn
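
To make the target table concrete, here is a minimal sketch of one loop-free way it might be built, assuming the data are already in a data frame sorted by ID and date. The column names `id`, `date` and `f`, and all the toy values, are assumptions for illustration only. The idea is to pair each row with the next one and keep only the pairs that share an ID.

```r
# Toy data; the column names and values are assumptions for illustration.
d <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("2010-01-01", "2010-01-05", "2010-01-09",
                   "2010-02-01", "2010-02-03", "2010-03-01")),
  f    = factor(c("f1", "f2", "f3", "f1", "f2", "f2"),
                levels = c("f1", "f2", "f3"))
)

# Sort by id and date so that adjacent rows are consecutive entries.
d <- d[order(d$id, d$date), ]

n <- nrow(d)
# Rows k and k + 1 form a direct transition only if they share an id.
same_id <- d$id[-n] == d$id[-1]

from <- d$f[-n][same_id]
to   <- d$f[-1][same_id]

# Contingency table of direct transitions: rows = from, columns = to.
trans <- table(from = from, to = to)
print(trans)
```

Because `from` and `to` stay factors with the full set of levels, the table keeps a row and a column for every level, including those with no transitions. IDs with a single entry contribute nothing here.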

Beyond the simple counts, I would also like to compute various statistics on the time intervals between the (fi, fj) transitions, including the mean, sd, median and various quantiles.
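
As an illustration of the interval statistics, the following sketch computes the time difference for each direct transition and then summarises it per (from, to) pair. It is only one possible approach; the column names `id`, `date` and `f`, the toy values, the choice of days as the unit, and the particular quantiles are all assumptions.

```r
# Toy data; the column names and values are assumptions for illustration.
d <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("2010-01-01", "2010-01-05", "2010-01-09",
                   "2010-02-01", "2010-02-03", "2010-03-01")),
  f    = factor(c("f1", "f2", "f3", "f1", "f2", "f2"),
                levels = c("f1", "f2", "f3"))
)
d <- d[order(d$id, d$date), ]

n <- nrow(d)
same_id <- d$id[-n] == d$id[-1]   # adjacent rows belonging to the same id

# One row per direct transition, with the elapsed time in days.
tr <- data.frame(
  from = d$f[-n][same_id],
  to   = d$f[-1][same_id],
  days = as.numeric(difftime(d$date[-1][same_id], d$date[-n][same_id],
                             units = "days"))
)

# Group the intervals by (from, to) pair; drop pairs that never occur.
by_pair <- split(tr$days, list(tr$from, tr$to), drop = TRUE)

# One row of summary statistics per observed pair (row names like "f1.f2").
stats <- t(sapply(by_pair, function(x) c(
  n      = length(x),
  mean   = mean(x),
  sd     = sd(x),                 # NA when a pair occurs only once
  median = median(x),
  q25    = quantile(x, 0.25, names = FALSE),
  q75    = quantile(x, 0.75, names = FALSE)
)))
print(stats)
```

The intermediate data frame `tr` holds one row per transition, so any other summary can be computed from it in the same way.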

(fi, fi) transitions are only possible if only one entry is available for that ID; the (fi, fi) cell is therefore the number of unique IDs with exactly one triplet, that triplet having factor fi. The solution is not required to compute this. I am not very interested in these counts and mention them only for completeness.

However, the solution should compute both of the following:
all (fi, fj) direct transitions
all (fi, fj) direct transitions where fi is the baseline value (baseline = the entry with the smallest date for that ID); such a subset may be easy to create once the first problem is solved, so the real issue is solving the general problem
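
For the baseline case, one possible sketch is to flag the first row of each ID after sorting and keep only the transitions that start from a flagged row. As before, the column names `id`, `date` and `f` and the toy values are assumptions for illustration.

```r
# Toy data; the column names and values are assumptions for illustration.
d <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("2010-01-01", "2010-01-05", "2010-01-09",
                   "2010-02-01", "2010-02-03", "2010-03-01")),
  f    = factor(c("f1", "f2", "f3", "f1", "f2", "f2"),
                levels = c("f1", "f2", "f3"))
)
d <- d[order(d$id, d$date), ]

n <- nrow(d)
same_id  <- d$id[-n] == d$id[-1]  # adjacent rows belonging to the same id
is_first <- !duplicated(d$id)     # TRUE on each id's earliest row after sorting

# Keep only transitions whose starting row is the baseline entry of its id.
keep <- same_id & is_first[-n]

base <- table(from = d$f[-n][keep], to = d$f[-1][keep])
print(base)
```

In this toy data only the two f1 => f2 transitions start from a baseline entry, so the f2 => f3 transition of ID 1 is excluded.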

What would be the best way to do this?
I invariably end up with solutions based on loops, but I feel this is entirely wrong. Also, the dataset contains over 600,000 triplets and is growing fast, so loops would be pretty much unusable on my home computer (my impression is that loops are very slow).

I can sort the data externally (by ID, DATE) to ease computations.

Thank you very much for your help.

Sincerely,

Leonard
