[R] Must be a better way to collate sequenced data

Petr PIKAL petr.pikal at precheza.cz
Mon Jun 8 15:14:44 CEST 2009


Hi


"Burke, Robin" <rburke at cs.depaul.edu> napsal dne 08.06.2009 11:28:46:

> Thanks for the quick response. Sorry for being unclear with my example. 
> Here is something more concrete:
> 
> user <- c(1, 2, 1, 2, 3, 1, 3, 4, 2,  3,  4,  1);
> time <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200);
> userCount <- c(1, 1, 2, 2, 1, 3, 2, 1, 3,  3,  2,  4);
> 
> period <- 100
> 
> utime.data <- data.frame(USER=user, TIME=time, USER_COUNT=userCount);
> 
> The answer
> 
> >utime.rcount
>   TIME TIME      PERC
> 1    0    0 1.4166667
> 2    1    4 1.4166667
> 3    3    9 0.9166667
> 4    6    6 0.2500000

Only partial

This code should do what you want; however, I did not check its speed:

# bin times into periods
utime.data$TIME <- utime.data$TIME %/% period

# split by user and take each user's first (binned) time
lll <- split(utime.data, utime.data$USER)
utime.tstart <- lapply(lll, function(x) x[1, 2])
utime.tstart <- as.numeric(unlist(utime.tstart))

# maximum USER_COUNT per user, i.e. the profile length
utime.userMax <- aggregate(utime.data["USER_COUNT"], utime.data["USER"], max)

# shift each user's times to start at 0 and replace counts by 1/profile length
for (i in 1:length(utime.tstart)) lll[[i]]["TIME"] <- lll[[i]]["TIME"] - utime.tstart[i]
for (i in 1:length(utime.tstart)) lll[[i]]["USER_COUNT"] <- 1/utime.userMax[i, 2]

# recombine and aggregate by relative time
augdata <- do.call(rbind, lll)[, 2:3]
utime.rcount <- aggregate(augdata, augdata["TIME"], sum)

However, it can probably be improved further.
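
For instance, a fully vectorized version using ave() might look like this. 
This is an untested sketch, starting again from the original utime.data; 
utime.rcount2 is just a name chosen here to avoid clobbering the result above.

tm   <- utime.data$TIME %/% period                               # bin times into periods
rel  <- tm - ave(tm, utime.data$USER, FUN = min)                 # time since each user's first bin
wmax <- ave(utime.data$USER_COUNT, utime.data$USER, FUN = max)   # each user's profile length
utime.rcount2 <- aggregate(data.frame(PERC = 1 / wmax), list(TIME = rel), sum)

This avoids the explicit loops and the split/rbind step and returns only the 
TIME and PERC columns, but I have not timed it against the version above.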

Regards
Petr


> 
> I'm investigating the plyr package. I think splitting by users and 
> re-merging may do the trick, providing I can re-merge in order of the 
> transformed time value. That would avoid the costly sort operation in 
> aggregate.
> 
> Robin Burke
> Associate Professor
> School of Computer Science, Telecommunications, and
>    Information Systems
> DePaul University 
> (currently on leave at University College Dublin)
> 
> http://josquin.cti.depaul.edu/~rburke/
> 
> "The universe is made of stories, not of atoms" - Muriel Rukeyser
> 
> 
> 
> -----Original Message-----
> From: Petr PIKAL [mailto:petr.pikal at precheza.cz] 
> Sent: Monday, June 08, 2009 8:36 AM
> To: Burke, Robin
> Cc: r-help at r-project.org
> Subject: Odp: [R] Must be a better way to collate sequenced data
> 
> Hi
> 
> nobody has your data, so your code is not reproducible. Here are just a 
> few comments.
> 
> augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc))
> 
> data.frame(utime.atimes, utime.aperc) is enough. cbind-ing is rather 
> dangerous, as it produces a matrix, which can hold only one type of 
> value.
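> 
> A quick illustration of that coercion:
> 
> cbind(1:2, c("a", "b"))                # a matrix, so the numbers are coerced to character
> data.frame(x = 1:2, y = c("a", "b"))   # a data frame keeps x numeric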
> 
> I am a little bit puzzled by your example.
> 
> u.profile<-c(50,20,10)
> u.days<-c(1,2,3)
> proc.prof<-u.profile/sum(u.profile)
> data.frame(u.days, proc.prof)
>   u.days proc.prof
> 1      1     0.625
> 2      2     0.250
> 3      3     0.125
> 
> OTOH, you speak about normalization by the max value:
> 
> proc.prof<-u.profile/max(u.profile)
> data.frame(u.days, proc.prof)
>   u.days proc.prof
> 1      1       1.0
> 2      2       0.4
> 3      3       0.2
> 
> Some suggestions that come to mind:
> 
> 1. Convert the time stamp to a POSIXct class.
> 2. Split your data according to users:
>    mylist <- split(data, users)
> 3. Transform your data by lapply(mylist, desired transformation).
> 4. Perform the aggregation by days for each part of the list.
> 5. Reprocess the list back into a data frame (a rough skeleton of steps 
>    2-5 follows below).
> 
> Maybe some functions from the plyr or doBy packages could help you.
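> 
> A rough, untested skeleton of steps 2-5, assuming a data frame like the 
> utime.data defined above (columns USER, TIME, USER_COUNT), with rows 
> ordered by time within each user and TIME already binned into days:
> 
> mylist <- split(utime.data, utime.data$USER)
> mylist <- lapply(mylist, function(x) {
>     x$TIME <- x$TIME - x$TIME[1]       # time relative to the user's first entry
>     x$PERC <- 1 / max(x$USER_COUNT)    # each transaction's share of the profile
>     x
> })
> flat <- do.call(rbind, mylist)
> aggregate(flat["PERC"], flat["TIME"], sum)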
> 
> Regards
> Petr
> 
> 
> 
> 
r-help-bounces at r-project.org wrote on 07.06.2009 23:55:00:
> 
> > I have data that looks like this
> > 
> > time_stamp (seconds)  user_id
> > 
> > The data is (partially) ordered by time, in that sometimes transactions 
> > occur at the same timestamp. The output I want is collated by 
> > transaction time on a per-user basis, normalized by the maximum number 
> > of transactions per user, and aggregated over each day. So, if the 
> > users have 50 transactions on the first day, 20 transactions on the 
> > second day, and 10 transactions on the third day, the output would be 
> > as follows, if each transaction represents 0.01% of each user's total 
> > profile. (In reality, they all have different profile lengths, so a 
> > transaction represents a different percentage for each user.)
> > 
> > time_since_first_transaction (days)        percent_of_profile
> > 1       0.50
> > 2       0.20
> > 3       0.10
> > 
> > I have the following code that computes the right answer, but it is 
> > really inefficient, so I'm sure that I'm doing something wrong. Really 
> > inefficient means > 30 minutes for a 100 k item data frame on a 2.2 GHz 
> > machine, and my 1-million data set has never finished. I'm no stranger 
> > to functional programming (Lisp programmer), but I can't figure out a 
> > way to subtract the first timestamp for user A from all of the other 
> > timestamps for user A without either (a) building a separate table of 
> > "first entries for each user", which I do here, or (b) re-computing the 
> > initial entry for each user with every row, which is what I did before 
> > and is even more inefficient. Another killer operation seems to be the 
> > aggregate step on the last line, which I use to collate the data by 
> > days. It seems very slow, but I don't know any other way to do this. I 
> > realize that I am living proof that one can program in C no matter what 
> > language one uses - so I would appreciate any enlightenment on offer. 
> > If there's no better way, I'll pre-process everything in Perl, but I'd 
> > rather learn the "R" way to do things like this. Thanks.
> > 
> > # Build table of times
> > utime.times <<- utime.data["TIME"] %/% period;
> > utime.tstart <<- vector("numeric", length=max(utime.data["USER"]));
> > for (i in 1:nrow(utime.data))
> > {
> >     if (as.numeric(utime.data[i, "USER_COUNT"])==1)
> >     {
> >         day <- utime.times[i, "TIME"];
> >         user <- utime.data[i, "USER"];
> >         utime.tstart[user] <<- day;
> >     }
> > }
> > 
> > # Build table of maximum profile sizes
> > utime.userMax <<- aggregate(utime.data["USER_COUNT"], utime.data["USER"], max);
> > 
> > utime.atimes <<- vector("numeric", length=nrow(utime.data));
> > utime.aperc <<- vector("numeric", length=nrow(utime.data));
> > augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc));
> > names(augdata) <<- c("TIME", "PERC");
> > for (i in 1:nrow(utime.data))
> > {
> >     # adjust time according to user start time
> >     augdata[i, "TIME"] <<- utime.times[i, "TIME"] - utime.tstart[utime.data[i, "USER"]];
> >     # look up maximum user count
> >     umax <- subset(utime.userMax, USER==as.numeric(utime.data[i, "USER"]))["USER_COUNT"];
> >     augdata[i, "PERC"] <<- 1.0/umax;
> > }
> > 
> > utime.rcount <<- aggregate(augdata, augdata["TIME"], sum);
> > ....
> > 
> > 
> > Robin Burke
> > Associate Professor
> > School of Computer Science, Telecommunications, and
> >    Information Systems
> > DePaul University
> > (currently on leave at University College Dublin)
> > 
> > http://josquin.cti.depaul.edu/~rburke/
> > 
> > "The universe is made of stories, not of atoms" - Muriel Rukeyser
> > 
> > 



