[R] Subset of time observations where timediff > 60 secs

Peter Dalgaard p.dalgaard at biostat.ku.dk
Wed Dec 9 08:05:06 CET 2009


Gabor Grothendieck wrote:
> There is an example at the end of the Prices and Returns section of
> the zoo-quickref vignette in the zoo package.
> 
> library(zoo)
> vignette("zoo-quickref")
> 
> If speed is your main concern check this recent thread that was posted
> on R-sig-finance:
> http://n4.nabble.com/SUMMARY-Reducing-an-intra-day-dataset-into-one-obs-per-second-td949612.html

Hum, not sure the above actually solve the same problem. Looks like one 
of those rare cases where SAS's data step solves something almost 
trivially (apologies if the syntax below is not quite right), but it 
won't vectorize in any sensible way for R:

data fee;
  set foo;
  retain last:
  if time < last + 60 then delete; else last=time;

I think the easiest fix could be to code the core loop in C (e.g., 
returning a vector of indices to retain).



> On Mon, Dec 7, 2009 at 10:57 AM, Karl Ove Hufthammer <karl at huftis.org> wrote:
>> Dear list members
>>
>> I have a rather large vector (part of a data frame) giving the time
>> (date + time, POSIXct) of observations. The times are irregular (with
>> both small and large jumps) but increasing, and there are several
>> millions of them.
>>
>> I now wish to reduce my data set, so that I only have observations which
>> are at least (for example) 60 seconds apart. Basically, I need (all) the
>> indices of my time variable where the difference in times are at least
>> 60 seconds.
>>
>> I thought this would be a rather simple task, but perhaps I'm tired, for
>> I couldn't figure out how to do it in a even moderately elegant way (not
>> looping over all the values, which is quite slow).
>>
>> This solution seemed sensible:
>>
>> x=cumsum(diff(timevar) %/% 60)
>> ind=c(1,cumsum(rle(x)$lengths)+1) # And perhaps removing the last value
>>
>> but doesn't work, as it only captures the 'first times' in each
>> 60-second interval following the first time value, and thus may include
>> times with values that are closer than 60 seconds.
>>
>> I also considered round.POSIXct and trunc.POSIXct, but these are not
>> appropriate either, for obvious reasons.
>>
>> So, any ideas how to do this in an elegant and efficient way?
>>
>> --
>> Karl Ove Hufthammer
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907




More information about the R-help mailing list