[R] merging and obtaining the nearest value

William Dunlap wdunlap at tibco.com
Mon Aug 20 03:50:57 CEST 2012


Yes, if the Special_Dates are not sorted then f4 needs to sort them.
Perhaps closestValue should just sort its vec argument.

I didn't realize that the output should not have any duplicate
entries.  I thought it should have the same number of rows
as the input A.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
> Sent: Sunday, August 19, 2012 3:54 PM
> To: William Dunlap; Francesco
> Cc: r-help
> Subject: Re: [R] merging and obtaining the nearest value
> 
> Hello,
> 
> You're right, your solution is much faster, but it doesn't remove
> duplicates.
> When I ran f4() with larger datasets it poduced an error,
> 
> Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly
> 
> So here they all are.
> 
> f1 <- function(A, B){
>      m <- merge(A, B)
>      result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)),
> function(x){
>          if(nrow(x)){
>                a <- abs(x$DATE - x$Special_Date)
>                x[which.min(a), ] }}) )
>      result$Difference <- result$DATE - result$Special_Date
>      result$Special_Date <- NULL
>      rownames(result) <- seq_len(nrow(result))
>      result
> }
> 
> closestValue <- function (x, vec)
> {
>      # for each value in x, find closest value in vec.
>      # Break ties by using highest.
>      # Assume vec is sorted.
>      intervalNo <- findInterval(x, vec)
>      lowerValue <- vec[pmax(1, intervalNo)]
>      upperValue <- vec[pmin(length(vec), intervalNo+1)]
>      ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
> }
> f4 <- function (A, B)  {
>      A$TYPE <- as.factor(A$TYPE)
>      uA <- levels(A$TYPE)
>      As <- split(A$DATE, A$TYPE)
>      B <- B[order(B$TYPE, B$Special_Date), ]
>      Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
>      closest <- numeric(nrow(A))
>      split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
>      A$Difference <- A$DATE - closest
>      A
> }
> 
> # Test data, not many types
> nA <- 1e3
> nB <- 1e4
> set.seed(1)
> ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
> tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)
> 
> aa <- data.frame(TYPE = ta, DATE = da)
> bb <- data.frame(TYPE = tb, Special_Date = db)
> 
> t1 <- system.time(r1 <- f1(aa, bb))
> t4 <- system.time(r4 <- f4(aa, bb))
> rbind(t1 = t1, t4 = t4)
> 
> sum( duplicated(r4) )  # 165
> 
> Rui Barradas
> Em 19-08-2012 22:58, William Dunlap escreveu:
> > And the following, f4,  uses the same algorithm as f2 but codes
> > it somewhat more efficiently.  It uses the same closestValue()
> > function.
> > f4 <- function (A, B)  {
> >      A$TYPE <- as.factor(A$TYPE)
> >      uA <- levels(A$TYPE)
> >      As <- split(A$DATE, A$TYPE)
> >      Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
> >      closest <- numeric(nrow(A))
> >      split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
> >      A$Difference <- A$DATE - closest
> >      A
> > }
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf
> >> Of William Dunlap
> >> Sent: Sunday, August 19, 2012 1:49 PM
> >> To: Francesco; r-help at r-project.org
> >> Subject: Re: [R] merging and obtaining the nearest value
> >>
> >> The following, f2(A,B), should do well with lots of rows in A and B
> >> as long as the number of types is not huge.
> >>
> >> f2 <- function(A, B) {
> >>      types <- as.character(unique(A$TYPE))
> >>      result <- numeric(nrow(A))
> >>      Bs <- split(B$Special_Date, B$TYPE)
> >>      for(type in types) {
> >>          w <- A$TYPE == type
> >>          # can omit the sort() below if you know that B$Special_Date is sorted.
> >>          result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
> >>      }
> >>      A$Difference <- A$DATE - result
> >>      A
> >> }
> >>
> >> closestValue <- function (x, vec)
> >> {
> >>      # for each value in x, find closest value in vec.
> >>      # Break ties by using highest.
> >>      # Assume vec is sorted.
> >>      intervalNo <- findInterval(x, vec)
> >>      lowerValue <- vec[pmax(1, intervalNo)]
> >>      upperValue <- vec[pmin(length(vec), intervalNo+1)]
> >>      ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
> >> }
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf
> >>> Of William Dunlap
> >>> Sent: Sunday, August 19, 2012 9:43 AM
> >>> To: Francesco; r-help at r-project.org
> >>> Subject: Re: [R] merging and obtaining the nearest value
> >>>
> >>> How many different types are there?  Just a handful or many thousands?
> >>> For this sort of problem it is often handy to write a function which generates
> >>> datasets of the sort you are thinking of but parameterized by the
> >>> number of rows, levels, etc., so you can see how the execution time
> >>> varies with these things.
> >>>
> >>> If there are just a few types, try looping over types and using findInterval
> >>> to see where A$Date fits into the sequence of B$Special_Date.
> >>>
> >>>
> >>> Bill Dunlap
> >>> Spotfire, TIBCO Software
> >>> wdunlap tibco.com
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> >> Behalf
> >>>> Of Francesco
> >>>> Sent: Sunday, August 19, 2012 4:01 AM
> >>>> To: r-help at r-project.org
> >>>> Subject: Re: [R] merging and obtaining the nearest value
> >>>>
> >>>> Dear Riu, Many thanks for your suggestion
> >>>>
> >>>> However these are just simplified examples... in reality the dataset A
> >>>> contains millions of observations and B several thousands of rows...
> >>>> Could I still use a modified form of your suggestion?
> >>>>
> >>>> Thanks
> >>>>
> >>>> On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Try the following.
> >>>>>
> >>>>>
> >>>>> A <- read.table(text="
> >>>>>
> >>>>> TYPE   DATE
> >>>>> A            2
> >>>>> A            5
> >>>>> A            20
> >>>>> B            10
> >>>>> B            2
> >>>>> ", header = TRUE)
> >>>>>
> >>>>>
> >>>>> B <- read.table(text="
> >>>>>
> >>>>> TYPE  Special_Date
> >>>>> A              2
> >>>>> A              6
> >>>>> A              20
> >>>>> A              22
> >>>>> B              5
> >>>>> B              6
> >>>>> ", header = TRUE)
> >>>>>
> >>>>> result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
> >>>>> function(x){
> >>>>>          a <- abs(x$DATE - x$Special_Date)
> >>>>>          if(nrow(x)) x[which(min(a) == a), ] }) )
> >>>>> result$Difference <- result$DATE - result$Special_Date
> >>>>> result$Special_Date <- NULL
> >>>>> rownames(result) <- seq_len(nrow(result))
> >>>>> result
> >>>>>
> >>>>>
> >>>>> Also, it's a good practice to post data examples using dput(). For instance,
> >>>>>
> >>>>> dput(A)
> >>>>> structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
> >>>>> "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
> >>>>> "DATE"), class = "data.frame", row.names = c(NA, -5L))
> >>>>>
> >>>>> Now all we have to do is run the statement A <- structure(... etc...) to
> >>>>> have an exact copy of the data example.
> >>>>> Anyway, your example with input and the wanted result was very welcome.
> >>>>>
> >>>>> Hope this helps,
> >>>>>
> >>>>> Rui Barradas
> >>>>>
> >>>>> Em 19-08-2012 11:10, Francesco escreveu:
> >>>>>> Dear R-help
> >>>>>>
> >>>>>> Î would like to know if there is a short solution in R for this
> >>>>>> merging problem...
> >>>>>>
> >>>>>> Let say I have a dataset A as:
> >>>>>>
> >>>>>> TYPE   DATE
> >>>>>> A            2
> >>>>>> A            5
> >>>>>> A            20
> >>>>>> B            10
> >>>>>> B            2
> >>>>>>
> >>>>>> (there can be duplicates for the same type and date)
> >>>>>>
> >>>>>> and I have another dataset B as :
> >>>>>>
> >>>>>> TYPE  Special_Date
> >>>>>> A              2
> >>>>>> A              6
> >>>>>> A              20
> >>>>>> A              22
> >>>>>> B              5
> >>>>>> B              6
> >>>>>>
> >>>>>> The question is : I would like to obtain the difference between the
> >>>>>> date of each observation in A and the closest special date in B with
> >>>>>> the same type. In case of ties I would take the latest date of the
> >>>>>> two.
> >>>>>>
> >>>>>> For example I would obtain here
> >>>>>>
> >>>>>> TYPE   DATE   Difference
> >>>>>> A            2            0=2-2
> >>>>>> A            5            -1=5-6
> >>>>>> A            20            0=20-20
> >>>>>> B            10           +4=10-6
> >>>>>> B            2             -3=2-5
> >>>>>>
> >>>>>> Do you know how to (simply?) obtain this in R?
> >>>>>>
> >>>>>> Many thanks!
> >>>>>> Best Regards
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help at r-project.org mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide
> >>>>>> http://www.R-project.org/posting-guide.html
> >>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list