[R] merge gives me too many rows

Mon Sep 18 07:38:43 CEST 2006

I think you may misunderstand the meaning of all.x = FALSE.

Setting all.x = FALSE ensures that only rows of x that have matches 
in y will be included. Equivalently, if a row of x is not matched in 
y, it will not be in the output. It does not cap the number of output 
rows, however: if a row in x is matched by more than one row in y, 
that row will be repeated as many times as there are matching rows 
in y. That is, you have a one-to-many match (one in x to many in y). 
SAS behaves the same way.
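
A minimal sketch of the one-to-many case, using made-up data frames (the names x, y, id, a and b are illustrative only and do not come from the original data):

```r
# Toy data: the key id = 2 occurs twice in y
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(1, 2, 2, 3), b = c("s", "t", "u", "v"))

# all.x = FALSE is the default, so it is not spelled out here
m <- merge(x, y, by = "id")

nrow(x)  # 3 rows in x
nrow(m)  # 4 rows in the result: the x row with id = 2 appears twice
m
```

Every row of x has a match here, so nothing is dropped; the extra row comes entirely from the duplicated key in y.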

Are you sure this is not what is happening?

Also, all.x = FALSE is the default; it is not necessary to specify 
it. In fact, the default is to output only rows that are found in 
both x and y (matching on the specified variables, of course).
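
One way to check for this, sketched on invented data: the data frame names test and fish and the key columns trait and tagno are borrowed from the quoted message, but the values, the extra columns w and len, and the shortened key list are assumptions for illustration. Dropping duplicates is only sensible if the repeated rows in y really are redundant.

```r
# Toy stand-ins for the real data; values are invented
keys <- c("trait", "tagno")
test <- data.frame(trait = c(1, 1, 2), tagno = c(10, 11, 20),
                   w = c(0.5, 0.6, 0.7))
fish <- data.frame(trait = c(1, 1, 2, 2), tagno = c(10, 11, 20, 20),
                   len = c(5, 6, 7, 8))

# Rows of fish whose key combination occurs more than once
dup <- duplicated(fish[, keys]) | duplicated(fish[, keys], fromLast = TRUE)
fish[dup, ]

# Keep the first row for each key combination, then merge.
# all.x = TRUE keeps every row of test, matched or not.
fish_unique <- fish[!duplicated(fish[, keys]), ]
merged <- merge(test, fish_unique, by = keys, all.x = TRUE)
nrow(merged) == nrow(test)
```

With unique keys in the second data frame and all.x = TRUE, the result has exactly one row per row of the first data frame, which appears to be what the original poster wanted.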

-Don

At 9:11 PM -0400 9/17/06, Denis Chabot wrote:
>Hi,
>
>I am using merge to add some variables to an existing dataframe. I 
>use the option "all.x=F" so that my final dataframe will only have as 
>many rows as the first file I name in the call to merge.
>
>With a large dataframe using a lot of "by" variables, the number of 
>rows of the merged dataframe increases from 177325 to 179690:
>
>  >dim(test)
>[1] 177325      9
>  > test2 <- merge(test, fish, by=c("predateu", "origin", "navire", 
>"nbpc", "no_rel", "trait", "tagno"), all.x=F)
>  > dim(test2)
>[1] 179690     11
>
>I tried to make a smaller dataset with R commands that I could post 
>here so that other people could reproduce, but merge behaved as 
>expected: final number of rows was the same as the number of rows in 
>the first file named in the call to merge.
>
>I took a subset of my large dataframe and could mail this to anyone 
>interested in verifying the problem.
>
>  > test3 <- test[100001:160000,]
>  >
>  > dim(test3)
>[1] 60000     9
>  > test4 <- merge(test3, fish, by=c("predateu", "origin", "navire", 
>"nbpc", "no_rel", "trait", "tagno"), all.x=F)
>  >
>  > dim(test4)
>[1] 60043    11
>
>I compared test3 and test4 line by line. The first 11419 lines were 
>the same (except for added variables, obviously) in both dataframes, 
>but then lines 11420 to 11423 were repeated in test4. Then no problem 
>for a lot of rows, until rows 45756-45760 in test3. These are offset 
>by 4 in test4 because of the first group of extraneous lines just 
>reported, and are found on lines 45760 to 45764. But they are also 
>repeated on lines 45765 to 45769. And so on a few more times.
>
>Thus merge added lines (repeated a small number of lines) to the 
>final dataframe despite my use of all.x=F.
>
>Am I doing something wrong? If not, is there a solution? Not being 
>able to merge is a setback! I was attempting to move the last few 
>things I was doing with SAS to R...
>
>Please let me know if you want the file test3 (2.3 MB as a csv file, 
>but only 352 KB in R (.rda) format).
>
>Sincerely,
>
>Denis Chabot
>
>  > R.Version()
>$platform
>[1] "powerpc-apple-darwin8.6.0"
>
>$arch
>[1] "powerpc"
>
>$os
>[1] "darwin8.6.0"
>
>$system
>[1] "powerpc, darwin8.6.0"
>
>$status
>[1] ""
>
>$major
>[1] "2"
>
>$minor
>[1] "3.1"
>
>$year
>[1] "2006"
>
>$month
>[1] "06"
>
>$day
>[1] "01"
>
>$`svn rev`
>[1] "38247"
>
>$language
>[1] "R"
>
>$version.string
>[1] "Version 2.3.1 (2006-06-01)"
>

-- 
---------------------------------
Don MacQueen
Lawrence Livermore National Laboratory
Livermore, CA, USA