[R] Trying to avoid the loop while merging two data frames

Dimitri Liakhovitski dimitri.liakhovitski at gmail.com
Tue Dec 22 22:10:23 CET 2015


Taking it back - no need for all.x = T, all.y = F

On Tue, Dec 22, 2015 at 3:56 PM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> Actually, the correct merge line should be:
> my.merge <- merge(myinfo, mydata, by="version", all.x = T, all.y = F)
>
> On Tue, Dec 22, 2015 at 3:50 PM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>> You are right, guys, merge is working. Somehow I was under the
>> erroneous impression that because the second data frame (myinfo)
>> contains no column 'myid' merge will not work.
>> Below is the cleaner code and comparison:
>>
>> #########################################
>> ### Example with smaller data frames
>> #########################################
>>
>> set.seed(123)
>> mydata <- data.frame(myid = 1001:1020,
>>                      version = sample(1:10, 20, replace = T))
>> head(mydata)
>> table(mydata$version)
>>
>> set.seed(12)
>> myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b =
>> rnorm(50), c = rnorm(50), d = rnorm(50))
>> head(myinfo, 40)
>> table(myinfo$version)
>>
>> ###----------------------------------------
>> ### METHOD 1 - Looping through each id of mydata and grabbing
>> ### all columns of myinfo for the corresponding 'version':
>>
>>
>> # Create placeholder list for the results:
>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>> length(result)
>> (result)[1:3]
>>
>>
>> # Looping through each element of 'result':
>> for(i in 1:length(result)){
>>       id <- result[[i]]$myid
>>       result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>       result[[i]]$myid <- id
>>       result[[i]] <- result[[i]][c(6, 1:5)]
>> }
>> result <- do.call(rbind, result)
>> result.order <- arrange(result,  myid, version, a, b, c, d)
>> head(result.order) # This is the desired result
>>
>> ###----------------------------------------
>> ### METHOD 2 - merge
>>
>> my.merge <- merge(myinfo, mydata, by="version")
>> names(my.merge)
>> result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")]
>> names(result2)
>> result2.order <- arrange(result2, myid, version, a, b, c, d)
>> dim(result2.order)
>> head(result2.order)
>>
>> # Same result?
>> all.equal(result.order, result2.order)
>>
>> On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski
>> <dimitri.liakhovitski at gmail.com> wrote:
>>> I know I am overwriting.
>>> merge doesn't solve it because each version in mydata is given to more
>>> than one id. Hence, I thought I can't merge by version.
>>> I am not sure how to answer the question about "the problem".
>>> I described the current state and the desired state. If possible, I'd
>>> like to get from the current state to the desired state faster than
>>> when using a loop.
>>>
>>> On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:
>>>> You seem to be saving 'myid' and then overwriting it with the last
>>>> statement:
>>>>
>>>>  result[[i]] <- result[[i]][c(5, 1:4)]
>>>>
>>>> Why doesn't 'merge' work for you?  I tried it on your data, and seem to get
>>>> back the same number of rows; may not be in the same order, but the content
>>>> looks the same, and it does have 'myid' on it.
>>>>
>>>>
>>>> Jim Holtman
>>>> Data Munger Guru
>>>>
>>>> What is the problem that you are trying to solve?
>>>> Tell me what you want to do, not how you want to do it.
>>>>
>>>> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski
>>>> <dimitri.liakhovitski at gmail.com> wrote:
>>>>>
>>>>> Hello!
>>>>> I have a solution for my task that is based on a loop. However, it's
>>>>> too slow for my real-life problem that is much larger in scope.
>>>>> However, I cannot use merge. Any advice on how to do it faster?
>>>>> Thanks a lot for any hint on how to speed it up!
>>>>>
>>>>> # I have 'mydata' data frame:
>>>>> set.seed(123)
>>>>> mydata <- data.frame(myid = 1001:1100,
>>>>>                      version = sample(1:20, 100, replace = T))
>>>>> head(mydata)
>>>>> table(mydata$version)
>>>>>
>>>>> # I have 'myinfo' data frame that contains information for each 'version':
>>>>> set.seed(12)
>>>>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b =
>>>>> rnorm(60),
>>>>>                                  c = rnorm(60), d = rnorm(60))
>>>>> head(myinfo, 40)
>>>>>
>>>>> ### MY SOLUTION WITH A LOOP:
>>>>> ### Looping through each id of mydata and grabbing
>>>>> ### all columns from 'myinfo' for the corresponding 'version':
>>>>>
>>>>> # 1. Creating placeholder list for the results:
>>>>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>>>>> length(result)
>>>>> (result)[1:3]
>>>>>
>>>>>
>>>>> # 2. Looping through each element of 'result':
>>>>> for(i in 1:length(result)){
>>>>>       id <- result[[i]]$myid
>>>>>       result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>>>>       result[[i]]$myid <- id
>>>>>       result[[i]] <- result[[i]][c(5, 1:4)]
>>>>> }
>>>>> result <- do.call(rbind, result)
>>>>> head(result) # This is the desired result
>>>>>
>>>>> --
>>>>> Dimitri Liakhovitski
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Dimitri Liakhovitski
>>
>>
>>
>> --
>> Dimitri Liakhovitski
>
>
>
> --
> Dimitri Liakhovitski



-- 
Dimitri Liakhovitski



More information about the R-help mailing list