[R] what is the effective method to apply the below logic for ~1.2 million records in R

Jim Lemon drjimlemon at gmail.com
Sun Sep 20 05:31:25 CEST 2015


Hi Ravi,
Try this:

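# Assumes A is sorted so that each customer's rows are contiguous,
# and that no customer ID equals the initial value 0.
current_customer<-0
for(row in seq_len(nrow(A))) {
 if(current_customer == A$Customer[row]) {
  # restart the count on a missing gap or a gap of more than 12
  if(is.na(A$Time_Diff[row]) || A$Time_Diff[row] > 12) A$flag_1[row]<-1
  else A$flag_1[row]<-A$flag_1[row-1]+1
 }
 else {
  # first row for a new customer
  current_customer<-A$Customer[row]
  A$flag_1[row]<-1
 }
}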
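
With ~1.2 million rows, the same counting rule could probably be computed
without an explicit loop. The following is only a minimal sketch, not tested
on the real data. It assumes that A has at least one row, that the rows for
each customer are contiguous, that the columns are named Customer and
Time_Diff, and that a missing Time_Diff restarts the count (as in the
original post). The toy data frame is copied from the expected output in the
question; the helper names new_customer, restart and idx are made up for
illustration.

```r
# Toy data taken from the expected result in the original question
A <- data.frame(Customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
                Time_Diff = c(NA, 10, 8, 15, 9, 10, NA, 2, 5))

# TRUE on the first row of each customer (assumes contiguous customer rows)
new_customer <- c(TRUE, A$Customer[-1] != A$Customer[-nrow(A)])

# The count restarts on a new customer, a missing gap, or a gap of more than 12
restart <- new_customer | is.na(A$Time_Diff) | A$Time_Diff > 12

# For each row, find the index of the most recent restart row,
# then count the rows since that restart (inclusive)
idx <- seq_len(nrow(A))
A$flag_1 <- idx - cummax(idx * restart) + 1L

print(A$flag_1)
# 1 2 3 1 2 3 1 2 3
```

Because cummax() carries the index of the last restart row forward, every
step works on whole vectors, which is usually far faster in R than assigning
into a data frame one row at a time.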

Jim

On Sun, Sep 20, 2015 at 12:25 PM, David Winsemius <dwinsemius at comcast.net>
wrote:

>
> On Sep 19, 2015, at 2:09 PM, Ravi Teja wrote:
>
> > Hi,
> >
> > I am trying to apply the below logic to generate flag_1 column on a data
> > set consisting of ~1.2 million records in R.
> >
> > Code :
> >
> > for(i in 1: nrows)
> >  {
> >              if(A$customer[i]==A$customer[i+1])
> >                {
> >
> >                  if(is.na(A$Time_Diff[i]))
> >                     A$flag_1[i] <- 1
> >                     else if (A$Time_Diff[i] > 12)
> >                     A$flag_1[i] <- 1
> >                     else
> >                     A$flag_1[i] <- A$flag_1[i-1]+1
> >
> >               }
> >
> >            else
> >            {
> >
> >              if(is.na(A$Time_Diff[i]))
> >                     A$flag_1[i] <- 1
> >                     else if (A$Time_Diff[i] > 12)
> >                     A$flag_1[i] <- 1
> >                     else
> >                     A$flag_1[i] <- A$flag_1[i-1]+1
> >
> >               }
> > }
>
> The inner logic of the consequent and alternative appear identical.
> Vectorized approaches would surely be faster. You should post some code
> that matches the data. In R customer is not the same as Customer, and
> Time_diff is not Time_Diff,  and my patience for this code review has
> expired.
>
> Post the output from the following, and do include code to create `nrows`:
>
>  dput( head (A, 20) )
>
>
> >
> > Resultant dataset should look like
> >
> > Customer   Time_diff   flag_1
> > 1          NA          1
> > 1          10          2
> > 1          8           3
> > 1          15          1
> > 1          9           2
> > 1          10          3
> > 2          NA          1
> > 2          2           2
> > 2          5           3
> >
> > The above logic will take approximately 60 hours to generate the flag_1
> > column on a dataset consisting of ~1.2 million records. Is there an
> > efficient way to implement this logic in R?
> >
> > Appreciate your help.
> >
> > Thanks,
> > Ravi
> >
> >       [[alternative HTML version deleted]]
>
> AND R-help is a plain text only mailing list.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
