[R] Trying to avoid the loop while merging two data frames

Dimitri Liakhovitski dimitri.liakhovitski at gmail.com
Tue Dec 22 21:56:50 CET 2015


Actually, the correct merge line should be:
my.merge <- merge(myinfo, mydata, by="version", all.x = T, all.y = F)

On Tue, Dec 22, 2015 at 3:50 PM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> You are right, guys, merge is working. Somehow I was under the
> erroneous impression that because the second data frame (myinfo)
> contains no column 'myid' merge will not work.
> Below is the cleaner code and comparison:
>
> #########################################
> ### Example with smaller data frames
> #########################################
>
> set.seed(123)
> mydata <- data.frame(myid = 1001:1020,
>                      version = sample(1:10, 20, replace = T))
> head(mydata)
> table(mydata$version)
>
> set.seed(12)
> myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b =
> rnorm(50), c = rnorm(50), d = rnorm(50))
> head(myinfo, 40)
> table(myinfo$version)
>
> ###----------------------------------------
> ### METHOD 1 - Looping through each id of mydata and grabbing
> ### all columns of myinfo for the corresponding 'version':
>
>
> # Create placeholder list for the results:
> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
> length(result)
> (result)[1:3]
>
>
> # Looping through each element of 'result':
> for(i in 1:length(result)){
>       id <- result[[i]]$myid
>       result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>       result[[i]]$myid <- id
>       result[[i]] <- result[[i]][c(6, 1:5)]
> }
> result <- do.call(rbind, result)
> result.order <- arrange(result,  myid, version, a, b, c, d)
> head(result.order) # This is the desired result
>
> ###----------------------------------------
> ### METHOD 2 - merge
>
> my.merge <- merge(myinfo, mydata, by="version")
> names(my.merge)
> result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")]
> names(result2)
> result2.order <- arrange(result2, myid, version, a, b, c, d)
> dim(result2.order)
> head(result2.order)
>
> # Same result?
> all.equal(result.order, result2.order)
>
> On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>> I know I am overwriting.
>> merge doesn't solve it because each version in mydata is given to more
>> than one id. Hence, I thought I can't merge by version.
>> I am not sure how to answer the question about "the problem".
>> I described the current state and the desired state. If possible, I'd
>> like to get from the current state to the desired state faster than
>> when using a loop.
>>
>> On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:
>>> You seem to be saving 'myid' and then overwriting it with the last
>>> statement:
>>>
>>>  result[[i]] <- result[[i]][c(5, 1:4)]
>>>
>>> Why doesn't 'merge' work for you?  I tried it on your data, and seem to get
>>> back the same number of rows; may not be in the same order, but the content
>>> looks the same, and it does have 'myid' on it.
>>>
>>>
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski
>>> <dimitri.liakhovitski at gmail.com> wrote:
>>>>
>>>> Hello!
>>>> I have a solution for my task that is based on a loop. However, it's
>>>> too slow for my real-life problem that is much larger in scope.
>>>> However, I cannot use merge. Any advice on how to do it faster?
>>>> Thanks a lot for any hint on how to speed it up!
>>>>
>>>> # I have 'mydata' data frame:
>>>> set.seed(123)
>>>> mydata <- data.frame(myid = 1001:1100,
>>>>                      version = sample(1:20, 100, replace = T))
>>>> head(mydata)
>>>> table(mydata$version)
>>>>
>>>> # I have 'myinfo' data frame that contains information for each 'version':
>>>> set.seed(12)
>>>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b =
>>>> rnorm(60),
>>>>                                  c = rnorm(60), d = rnorm(60))
>>>> head(myinfo, 40)
>>>>
>>>> ### MY SOLUTION WITH A LOOP:
>>>> ### Looping through each id of mydata and grabbing
>>>> ### all columns from 'myinfo' for the corresponding 'version':
>>>>
>>>> # 1. Creating placeholder list for the results:
>>>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>>>> length(result)
>>>> (result)[1:3]
>>>>
>>>>
>>>> # 2. Looping through each element of 'result':
>>>> for(i in 1:length(result)){
>>>>       id <- result[[i]]$myid
>>>>       result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>>>       result[[i]]$myid <- id
>>>>       result[[i]] <- result[[i]][c(5, 1:4)]
>>>> }
>>>> result <- do.call(rbind, result)
>>>> head(result) # This is the desired result
>>>>
>>>> --
>>>> Dimitri Liakhovitski
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>>
>> --
>> Dimitri Liakhovitski
>
>
>
> --
> Dimitri Liakhovitski



-- 
Dimitri Liakhovitski



More information about the R-help mailing list