[R] merge gives me too many rows

Denis Chabot chabotd at globetrotter.net
Mon Sep 18 03:11:22 CEST 2006


Hi,

I am using merge to add some variables to an existing dataframe. I  
use the option "all.x=F", expecting the final dataframe to have only  
as many rows as the first dataframe I name in the call to merge.

With a large dataframe using a lot of "by" variables, the number of  
rows of the merged dataframe increases from 177325 to 179690:

 > dim(test)
[1] 177325      9
 > test2 <- merge(test, fish, by=c("predateu", "origin", "navire",  
"nbpc", "no_rel", "trait", "tagno"), all.x=F)
 > dim(test2)
[1] 179690     11

I tried to build a smaller dataset with R commands that I could post  
here so that other people could reproduce the problem, but merge  
behaved as expected: the final number of rows was the same as the  
number of rows in the first dataframe named in the call to merge.
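
For reference, a minimal sketch of the kind of small test where merge behaves as expected. The data frames and column values below are made up for illustration (they are not the real test and fish dataframes); the point is only that each combination of the "by" variables occurs at most once in the second dataframe.

```r
# Toy data, invented for illustration only
small_x <- data.frame(trait = c(1, 1, 2, 3),
                      tagno = c("a", "b", "a", "a"),
                      len   = c(10, 12, 9, 15))

# Each (trait, tagno) combination appears exactly once here
small_y <- data.frame(trait = c(1, 1, 2, 3),
                      tagno = c("a", "b", "a", "a"),
                      sex   = c("F", "M", "F", "M"))

small_m <- merge(small_x, small_y, by = c("trait", "tagno"), all.x = FALSE)

nrow(small_x)  # 4
nrow(small_m)  # 4: same number of rows as the first dataframe
```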

I took a subset of my large dataframe and could mail this to anyone  
interested in verifying the problem.

 > test3 <- test[100001:160000,]
 >
 > dim(test3)
[1] 60000     9
 > test4 <- merge(test3, fish, by=c("predateu", "origin", "navire",  
"nbpc", "no_rel", "trait", "tagno"), all.x=F)
 >
 > dim(test4)
[1] 60043    11

I compared test3 and test4 line by line. The first 11419 lines were  
the same (except for added variables, obviously) in both dataframes,  
but then lines 11420 to 11423 were repeated in test4. Then no problem  
for a lot of rows, until rows 45756-45760 in test3. These are offset  
by 4 in test4 because of the first group of extraneous lines just  
reported, and are found on lines 45760 to 45764. But they are also  
repeated on lines 45765 to 45769. And so on a few more times.

Thus merge added lines (repeated a small number of lines) to the  
final dataframe despite my use of all.x=F.
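
One possible explanation worth checking (an assumption, since the actual data are not shown here): if some combination of the "by" variables occurs more than once in the second dataframe, merge returns one row for every matching pair, so rows of the first dataframe are repeated whatever the value of all.x. A minimal sketch with made-up data, using duplicated() to look for such keys:

```r
# Toy data, invented for illustration; "t2" occurs twice in y
x <- data.frame(tagno = c("t1", "t2", "t3"), len = c(10, 12, 9))
y <- data.frame(tagno = c("t1", "t2", "t2", "t3"),
                sex   = c("F", "M", "M", "F"))

m <- merge(x, y, by = "tagno", all.x = FALSE)
nrow(x)  # 3
nrow(m)  # 4: the row for "t2" is matched twice

# Count rows of y whose key combination has already appeared
keys <- "tagno"
dup <- duplicated(y[, keys, drop = FALSE])
sum(dup)  # 1

# Keeping only the first occurrence of each key restores the row count
y_unique <- y[!dup, , drop = FALSE]
m2 <- merge(x, y_unique, by = keys, all.x = FALSE)
nrow(m2)  # 3
```

With the real data, keys would be the full vector of "by" variables and y would be fish. Whether dropping the later duplicates is appropriate depends on why they are there; the sketch only shows how to find them.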

Am I doing something wrong? If not, is there a solution? Not being  
able to merge is a setback! I was attempting to move the last few  
things I was doing with SAS to R...

Please let me know if you want the file test3 (2.3 MB as a csv file,  
but only 352 KB in R (.rda) format).

Sincerely,

Denis Chabot

 > R.Version()
$platform
[1] "powerpc-apple-darwin8.6.0"

$arch
[1] "powerpc"

$os
[1] "darwin8.6.0"

$system
[1] "powerpc, darwin8.6.0"

$status
[1] ""

$major
[1] "2"

$minor
[1] "3.1"

$year
[1] "2006"

$month
[1] "06"

$day
[1] "01"

$`svn rev`
[1] "38247"

$language
[1] "R"

$version.string
[1] "Version 2.3.1 (2006-06-01)"


