[R] what is the effective method to apply the below logic for ~1.2 million records in R

Ista Zahn istazahn at gmail.com
Sun Sep 20 04:48:43 CEST 2015


This assumes that the data are sorted by customer and that, for each
customer, the first value of Time_Diff is missing and no other value
is. If those assumptions hold you can do something like:

A <- read.table(text = "customer   Time_Diff    flag_1
1                   NA           1
1                   10           2
1                    8           3
1                   15           1
1                    9           2
1                   10           3
2                   NA           1
2                    2           2
2                    5           3",
header = TRUE)

A$flag_1 <- NULL

library(data.table)

A <- as.data.table(A)
## start a new group at every row where the counter resets,
## i.e. where Time_Diff is NA or greater than 12 (this also handles
## two or more consecutive values above 12)
A[ , g15 := cumsum(is.na(Time_Diff) | Time_Diff > 12)]
A[ , flag_1 := 1:.N, by = c("customer", "g15")]
A[ , g15 := NULL]
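
As a cross-check, here is a minimal base R sketch (not part of the original
exchange) that compares the grouping idea with a plain loop. It assumes the
rule from the question: the counter restarts at 1 whenever Time_Diff is NA or
greater than 12, and otherwise increases by 1. The helper names flag_loop and
flag_vec are made up for illustration, and the toy data add two consecutive
values above 12 to the example from the thread.

```r
## Toy data: the example from the thread plus consecutive values above 12
A <- data.frame(
  customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Time_Diff = c(NA, 10, 8, 15, 20, 10, NA, 2, 5, 13)
)

## Reference implementation: the rule written as a simple loop
## (hypothetical helper, slow on large data)
flag_loop <- function(x) {
  out <- integer(length(x))
  for (i in seq_along(x)) {
    prev <- if (i == 1L) 0L else out[i - 1L]
    out[i] <- if (is.na(x[i]) || x[i] > 12) 1L else prev + 1L
  }
  out
}

## Vectorised version: a new run starts at each reset row, then rows are
## numbered within each (customer, run) combination
flag_vec <- function(customer, x) {
  reset <- is.na(x) | x > 12   # TRUE where the counter restarts
  g <- cumsum(reset)           # run identifier
  ave(seq_along(x), customer, g, FUN = seq_along)
}

A$flag_1 <- flag_vec(A$customer, A$Time_Diff)

## Both approaches should agree on the toy data
stopifnot(all(A$flag_1 == flag_loop(A$Time_Diff)))
```

The vector g here plays the same role as the g15 column above. The two
versions agree only under the stated assumptions (data sorted by customer,
first Time_Diff of each customer missing), because the loop never looks at
the customer column when it resets.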

Best,
Ista

On Sat, Sep 19, 2015 at 5:09 PM, Ravi Teja <raviteja2504 at gmail.com> wrote:
> Hi,
>
> I am trying to apply the below logic to generate flag_1 column on a data
> set consisting of ~1.2 million records in R.
>
> Code :
>
> for(i in 1: nrows)
>   {
>               if(A$customer[i]==A$customer[i+1])
>                 {
>
>                   if(is.na(A$Time_Diff[i]))
>                      A$flag_1[i] <- 1
>                      else if (A$Time_Diff[i] > 12)
>                      A$flag_1[i] <- 1
>                      else
>                      A$flag_1[i] <- A$flag_1[i-1]+1
>
>                }
>
>             else
>             {
>
>               if(is.na(A$Time_Diff[i]))
>                      A$flag_1[i] <- 1
>                      else if (A$Time_Diff[i] > 12)
>                      A$flag_1[i] <- 1
>                      else
>                      A$flag_1[i] <- A$flag_1[i-1]+1
>
>                }
> }
>
>
> Resultant dataset should look like
>
> Customer   Time_diff    flag_1
> 1                   NA           1
> 1                   10           2
> 1                    8           3
> 1                   15           1
> 1                    9           2
> 1                   10           3
> 2                   NA           1
> 2                    2           2
> 2                    5           3
>
> The above logic will take approximately 60 hours to generate the flag_1
> column on a dataset consisting of ~1.2 million records. Is there an
> efficient way to implement this logic in R?
>
> Appreciate your help.
>
> Thanks,
> Ravi


