[BioC] Merge dataframes

Sat Oct 8 18:29:14 CEST 2011

Hi Joao,

A simple one line solution using two functions from the 'caroline' CRAN 
package could suffice as well:

nerge(list(data=data1, col=nv(color, rownames(data2))), all.x=T)

nv : named vector
nerge : named merge

Dave

On Sat, 8 Oct 2011, "João Daniel N. Duarte" wrote:

> Hi everybody,
>
> Paniagua explanation and solution worked very good! Thank you a lot!
> I got the code to what I need.
>
> As Paniagua said, I was losing row.names when I merged a data.frame with a 
> factor object. Instead of merging data1 to data2$color, I merged it to 
> data2["color"]. It preserves row.names.
>
> When I do:
>
> d1 <- merge(data1, data2["color"], by=0, all.x=T)
>
> It creates a new variable on column 1, with the original row.names, so I have 
> to transform this data.frame in a new one, using this variable as the 
> row.names:
>
> d2 <- data.frame(d1, row.names=1)
>
> It does the job, but row.names are not in the same order as in data1, so I 
> reorder it:
>
> d2[row.names(data1),]
>
> And it gives me the dataframe data1, binded to the new variable color, with 
> the NAs in the proper position.
>
> Thank you all!
>
> Cheers,
>
> Joao D.
>
> Em 07/10/2011, às 20:36, Paniagua, Eric escreveu:
>
>> Hi Joao,
>> 
>> To save space, I'm not going to paste the data frames in.  Just start a 
>> session, run your original code, and then try out the commands below.
>> 
>> The problem is a simple type confusion.  If you check the class of 
>> data2$color, you'll see it's not a data frame:
>> 
>>> class(data2$color)
>> [1] "factor"
>> 
>> The problem is that a factor is not a data frame, but merge expects data 
>> frames as arguments.  If either one is not a data frame, merge will coerce 
>> the type if possible.  In this case, that means that passing data2$color to 
>> merge is equivalent to this:
>> 
>>> merge(data1, as.data.frame(data2$color), by=0, all.x=T)
>> 
>> If you inspect the coerced version of the argument, you'll see that new row 
>> names have been assigned during coercion.  Well, technically not "new" ones 
>> because a factor does not have row names.  In other words, the row names 
>> were lost when you used the '$'.  Just run the following to see:
>> 
>>> data2$color
>>> as.data.frame(data2$color)
>> 
>> The solution is to just subset the data frame instead of pulling out the 
>> factor.  If you replace data2$color in your code with data2["color"], your 
>> code behaves as intended.
>> 
>> A good first clue in discovering this issue is the fact that in your 
>> original code, you end up in the <NA> elements all at the bottom, rows 
>> 16-20, every time.  The odds of this specific arrangement happening any one 
>> particular time are 1 in 15504, or about 0.00645%.  Coincidentally, the 15 
>> non <NA> values are occupying rows 1-15, making the hypothesis that your 
>> rows are being relabeled somewhere along the way much more probable.  And, 
>> in this case, correct.
>> 
>> Underlying that insight or intuition is the much more general debugging 
>> approach of actually checking and looking at every intermediate value in 
>> your computation that you can.  And, of course, reading the documentation 
>> :)
>> 
>> For instance, check out the class and value of data2["color"]:
>> 
>>> class(data2["color"])
>> [1] "data.frame"
>> 
>> You can also see that data2["color"] preserves the row names:
>> 
>>> data2["color"]
>> 
>> You can try this to verify that with my solution the values work out 
>> correctly (though I'm sure there's a better way to do it directly in R):
>> 
>>> data3 <- merge(data1, data2["color"], by=0, all.x=T)
>>> data3
>>> data4 <- data.frame(data3, row.names=1)
>>> data4
>>> data4[row.names(data1),]
>>> data5 <- merge(data1, data4, by=0)
>>> data5
>>> data6 <- data.frame(data5, row.names=1)
>>> data6
>>> data6[row.names(data1),]
>> 
>> (Of course, I'm using way more variables there than are required for the 
>> computation.  Consider their addition as purely didactic.)
>> 
>> Anyway, I hope that helps!  Happy scripting.
>> 
>> Best,
>> Eric
>> 
>> ________________________________________
>> From: bioconductor-bounces at r-project.org 
>> [bioconductor-bounces at r-project.org] on behalf of João Daniel Nunes Duarte 
>> [jdanielnd at gmail.com]
>> Sent: Friday, October 07, 2011 6:32 PM
>> Subject: [BioC] Merge dataframes
>> 
>> Hello,
>> 
>> I am having some problems to use the 'merge' function. I'm not sure if I 
>> got
>> its working right.
>> 
>> What I want to do is:
>> 
>> 1) Suppose I have a dataframe like:
>>
>>       height        width
>> 1        1.1                2.3
>> 2        2.1                2.5
>> 3        1.8                1.9
>> 4        1.6                2.1
>> 5        1.8                2.4
>> 
>> 2) And I generate a second dataframe sampled from this one, like:
>>
>>       height        width
>> 1        1.1                2.3
>> 3        1.8                1.9
>> 5        1.8                2.4
>> 
>> 3) Next, I add a new variable from this dataframe:
>>
>>       height        width        color
>> 1        1.1                2.3            red
>> 3        1.8                1.9            red
>> 5        1.8                2.4            blue
>> 
>> 4) So, I want to merge those dataframes, so that the new variable, color, 
>> is
>> binded to the first dataframe. Of course some cases won't have value for 
>> it,
>> since I generated this variable in a smaller dataframe. In those cases I
>> want the value to be NA. The result dataframe should be:
>>
>>       height        width        color
>> 1        1.1                2.3            red
>> 2        2.1                2.5            NA
>> 3        1.8                1.9            red
>> 4        1.6                2.1            NA
>> 5        1.8                2.4            blue
>> 
>> I have written some codes, but they're not working properly. The new
>> variable has its values mixed up, and they do not correspond to its
>> row.names.
>> 
>> # Generate the first dataframe
>> data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5))
>> # Sample a smaller dataframe from data1
>> data2 <- data1[sample(1:20,15,replace=F),]
>> # Generate the new variable
>> color <- sample(c("red","blue"),15,replace=T)
>> # Bind the new variable to data2
>> data2 <- cbind(data2, color)
>> # Merge the data1 and data2$color by row.names, and force it to has the 
>> same
>> values that data1. Next it generates a new dataframe where column 1 is the
>> row.name, and then sort it by the row.name from data1.
>> data.frame(merge(data1,data2$color, by=0,
>> all.x=T),row.names=1)[row.names(data1),]
>> 
>> I'm not sure what am I doing wrong. Can anyone see where the mistake is?
>> 
>> Thank you!
>> 
>> Cheers,
>> 
>> Joao D.
>>
>>       [[alternative HTML version deleted]]
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor