[R] Trying to avoid the loop while merging two data frames

Dimitri Liakhovitski dimitri.liakhovitski at gmail.com
Tue Dec 22 21:50:39 CET 2015


You are right, guys, merge is working. Somehow I was under the
erroneous impression that because the second data frame (myinfo)
contains no column 'myid' merge will not work.
Below is the cleaner code and comparison:

#########################################
### Example with smaller data frames
#########################################

set.seed(123)
mydata <- data.frame(myid = 1001:1020,
                     version = sample(1:10, 20, replace = T))
head(mydata)
table(mydata$version)

set.seed(12)
myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b =
rnorm(50), c = rnorm(50), d = rnorm(50))
head(myinfo, 40)
table(myinfo$version)

###----------------------------------------
### METHOD 1 - Looping through each id of mydata and grabbing
### all columns of myinfo for the corresponding 'version':


# Create placeholder list for the results:
result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
length(result)
(result)[1:3]


# Looping through each element of 'result':
for(i in 1:length(result)){
      id <- result[[i]]$myid
      result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
      result[[i]]$myid <- id
      result[[i]] <- result[[i]][c(6, 1:5)]
}
result <- do.call(rbind, result)
result.order <- arrange(result,  myid, version, a, b, c, d)
head(result.order) # This is the desired result

###----------------------------------------
### METHOD 2 - merge

my.merge <- merge(myinfo, mydata, by="version")
names(my.merge)
result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")]
names(result2)
result2.order <- arrange(result2, myid, version, a, b, c, d)
dim(result2.order)
head(result2.order)

# Same result?
all.equal(result.order, result2.order)

On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> I know I am overwriting.
> merge doesn't solve it because each version in mydata is given to more
> than one id. Hence, I thought I can't merge by version.
> I am not sure how to answer the question about "the problem".
> I described the current state and the desired state. If possible, I'd
> like to get from the current state to the desired state faster than
> when using a loop.
>
> On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:
>> You seem to be saving 'myid' and then overwriting it with the last
>> statement:
>>
>>  result[[i]] <- result[[i]][c(5, 1:4)]
>>
>> Why doesn't 'merge' work for you?  I tried it on your data, and seem to get
>> back the same number of rows; may not be in the same order, but the content
>> looks the same, and it does have 'myid' on it.
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski
>> <dimitri.liakhovitski at gmail.com> wrote:
>>>
>>> Hello!
>>> I have a solution for my task that is based on a loop. However, it's
>>> too slow for my real-life problem that is much larger in scope.
>>> However, I cannot use merge. Any advice on how to do it faster?
>>> Thanks a lot for any hint on how to speed it up!
>>>
>>> # I have 'mydata' data frame:
>>> set.seed(123)
>>> mydata <- data.frame(myid = 1001:1100,
>>>                      version = sample(1:20, 100, replace = T))
>>> head(mydata)
>>> table(mydata$version)
>>>
>>> # I have 'myinfo' data frame that contains information for each 'version':
>>> set.seed(12)
>>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b =
>>> rnorm(60),
>>>                                  c = rnorm(60), d = rnorm(60))
>>> head(myinfo, 40)
>>>
>>> ### MY SOLUTION WITH A LOOP:
>>> ### Looping through each id of mydata and grabbing
>>> ### all columns from 'myinfo' for the corresponding 'version':
>>>
>>> # 1. Creating placeholder list for the results:
>>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>>> length(result)
>>> (result)[1:3]
>>>
>>>
>>> # 2. Looping through each element of 'result':
>>> for(i in 1:length(result)){
>>>       id <- result[[i]]$myid
>>>       result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>>       result[[i]]$myid <- id
>>>       result[[i]] <- result[[i]][c(5, 1:4)]
>>> }
>>> result <- do.call(rbind, result)
>>> head(result) # This is the desired result
>>>
>>> --
>>> Dimitri Liakhovitski
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
>
> --
> Dimitri Liakhovitski



-- 
Dimitri Liakhovitski



More information about the R-help mailing list