[R] merging and obtaining the nearest value

William Dunlap wdunlap at tibco.com
Sun Aug 19 22:49:15 CEST 2012


The following, f2(A,B), should do well with lots of rows in A and B
as long as the number of types is not huge.

f2 <- function(A, B) {
    types <- as.character(unique(A$TYPE))
    result <- numeric(nrow(A))
    Bs <- split(B$Special_Date, B$TYPE)
    for(type in types) {
        w <- A$TYPE == type
        # can omit the sort() below if you know that B$Special_Date is sorted.
        result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
    }
    A$Difference <- A$DATE - result
    A
}

closestValue <- function (x, vec) 
{
    # for each value in x, find closest value in vec.
    # Break ties by using highest.
    # Assume vec is sorted.
    intervalNo <- findInterval(x, vec)
    lowerValue <- vec[pmax(1, intervalNo)]
    upperValue <- vec[pmin(length(vec), intervalNo+1)]
    ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of William Dunlap
> Sent: Sunday, August 19, 2012 9:43 AM
> To: Francesco; r-help at r-project.org
> Subject: Re: [R] merging and obtaining the nearest value
> 
> How many different types are there?  Just a handful or many thousands?
> For this sort of problem it is often handy to write a function which generates
> datasets of the sort you are thinking of but parameterized by the
> number of rows, levels, etc., so you can see how the execution time
> varies with these things.
> 
> If there are just a few types, try looping over types and using findInterval
> to see where A$Date fits into the sequence of B$Special_Date.
> 
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> > Of Francesco
> > Sent: Sunday, August 19, 2012 4:01 AM
> > To: r-help at r-project.org
> > Subject: Re: [R] merging and obtaining the nearest value
> >
> > Dear Riu, Many thanks for your suggestion
> >
> > However these are just simplified examples... in reality the dataset A
> > contains millions of observations and B several thousands of rows...
> > Could I still use a modified form of your suggestion?
> >
> > Thanks
> >
> > On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > Hello,
> > >
> > > Try the following.
> > >
> > >
> > > A <- read.table(text="
> > >
> > > TYPE   DATE
> > > A            2
> > > A            5
> > > A            20
> > > B            10
> > > B            2
> > > ", header = TRUE)
> > >
> > >
> > > B <- read.table(text="
> > >
> > > TYPE  Special_Date
> > > A              2
> > > A              6
> > > A              20
> > > A              22
> > > B              5
> > > B              6
> > > ", header = TRUE)
> > >
> > > result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
> > > function(x){
> > >         a <- abs(x$DATE - x$Special_Date)
> > >         if(nrow(x)) x[which(min(a) == a), ] }) )
> > > result$Difference <- result$DATE - result$Special_Date
> > > result$Special_Date <- NULL
> > > rownames(result) <- seq_len(nrow(result))
> > > result
> > >
> > >
> > > Also, it's a good practice to post data examples using dput(). For instance,
> > >
> > > dput(A)
> > > structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
> > > "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
> > > "DATE"), class = "data.frame", row.names = c(NA, -5L))
> > >
> > > Now all we have to do is run the statement A <- structure(... etc...) to
> > > have an exact copy of the data example.
> > > Anyway, your example with input and the wanted result was very welcome.
> > >
> > > Hope this helps,
> > >
> > > Rui Barradas
> > >
> > > Em 19-08-2012 11:10, Francesco escreveu:
> > >>
> > >> Dear R-help
> > >>
> > >> Î would like to know if there is a short solution in R for this
> > >> merging problem...
> > >>
> > >> Let say I have a dataset A as:
> > >>
> > >> TYPE   DATE
> > >> A            2
> > >> A            5
> > >> A            20
> > >> B            10
> > >> B            2
> > >>
> > >> (there can be duplicates for the same type and date)
> > >>
> > >> and I have another dataset B as :
> > >>
> > >> TYPE  Special_Date
> > >> A              2
> > >> A              6
> > >> A              20
> > >> A              22
> > >> B              5
> > >> B              6
> > >>
> > >> The question is : I would like to obtain the difference between the
> > >> date of each observation in A and the closest special date in B with
> > >> the same type. In case of ties I would take the latest date of the
> > >> two.
> > >>
> > >> For example I would obtain here
> > >>
> > >> TYPE   DATE   Difference
> > >> A            2            0=2-2
> > >> A            5            -1=5-6
> > >> A            20            0=20-20
> > >> B            10           +4=10-6
> > >> B            2             -3=2-5
> > >>
> > >> Do you know how to (simply?) obtain this in R?
> > >>
> > >> Many thanks!
> > >> Best Regards
> > >>
> > >> ______________________________________________
> > >> R-help at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > >> PLEASE do read the posting guide
> > >> http://www.R-project.org/posting-guide.html
> > >> and provide commented, minimal, self-contained, reproducible code.
> > >
> > >
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list