[R] Efficient means to link two data frames

Ben Tupper btupper at bigelow.org
Thu Sep 29 16:10:20 CEST 2016


Hi,

I have a solution based upon findInterval() which depends upon the ordered nature of the 'Day' column.  I can't speak to whether or not it is efficient but it is handy.  I love the findInterval() function but have often wished it works with look up tables in descending order.  This function, find_interval(), is my first reasonable pass at working with descending order.

https://gist.github.com/btupper/5fc6cc4e7d86f39f9e4f

## start
source("find_interval.R")

x <- structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 
27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128, 
96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76, 
89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA, 
-27L), class = "data.frame")

TESTDAYS <- data.frame(TestDay = c(4, 11, 15))

x <- x[rev(1:nrow(x)),]

ix <- find_interval(TESTDAYS[,'TestDay'], x[,'Day'])

TESTDAYS[,'TestValue'] <- x[ix,'Value']

TESTDAYS
#   TestDay TestValue
# 1       4       103
# 2      11        84
# 3      15       109

### end

Will that do the trick with your large dataset?

Ben


> On Sep 29, 2016, at 9:38 AM, Dennis Fisher <fisher at plessthan.com> wrote:
> 
> R 3.3.1 
> OS X
> 
> Colleagues,
> 
> I have two large data frames that I am trying to link efficiently.   A small example is as follows:
> 
> structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
> 13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 
> 27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128, 
> 96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76, 
> 89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA, 
> -27L), class = "data.frame")
> 
> which becomes:
>  Day Value
> 1    1    76
> 2    2   116
> 3    3   111
> 4    5   103
> 5    6   114
> 6    7    99
> 7    8   128
> 8    9    96
> 9   10    81
> 10  11    84
> 11  13    81
> 12  14   108
> 13  16   109
> 14  17   106
> 15  18   125
> 16  19   128
> 17  20    92
> 18  21    90
> 19  22    83
> 20  23    89
> 21  24    76
> 22  25    89
> 23  26   101
> 24  27    93
> 25  28    98
> 26  29    77
> 27  30    92
> 
> The second dataframe is merely:
> TESTDAYS			<- data.frame(TestDay = c(4, 11, 15))
> 
> For each row in the second dataframe, I would like to identify the first row in the first dataframe in which Day is >= TestDay.
> For example, for TestDay == 4, Day would equal 5.  I would then append the corresponding “Value” in the TestValue column
> The result would be:
> TestDay TestValue
> 1       4       103
> 2      11        84
> 3      15       109
> 
> I can accomplish this with brute force but I suspect that there is some clever day to vectorize this.  Any help would be appreciated.
> 
> Dennis
> 
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone / Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com <http://www.plessthan.com/>
> 
> 
> 
> 
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org



More information about the R-help mailing list