[R] Solved: [Fwd: Matching failure in merge()]

Sat Mar 21 12:57:53 CET 2009

I've found where the problem was and a way to solve it.
One dataset was encoded (and read) as UTF-8
and the other one was encoded (and read) as latin3.
In this case, even if you see the same characters
at the terminal, R states that the two
elements are not equal.

I don't know whether this is the intended behaviour
or a bug.

Anyway, editing the second file (encoded as latin3)
with OpenOffice Calc,
saving it as UTF-8 and reading it into R with encoding="UTF-8"
solved the problem.
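
The same conversion can probably be done inside R rather than through OpenOffice. What follows is only a minimal sketch, not the procedure described above: the two tiny data frames are made-up stand-ins for the real tables, and "latin1" is used in place of latin3 purely as a commonly available encoding name for the illustration; the real source encoding should be substituted.

```r
# Made-up miniature versions of the two tables (not the real data).
left <- data.frame(POBLACION = enc2utf8("SANT VICEN\u00c7 DELS HORTS"),
                   CODPOSTAL = "08620",
                   stringsAsFactors = FALSE)

# Simulate a key column that holds latin-encoded bytes with no encoding mark.
latin_bytes <- iconv(left$POBLACION, from = "UTF-8", to = "latin1",
                     toRaw = TRUE)[[1]]
right <- data.frame(NOMMUNI = rawToChar(latin_bytes),
                    CODMUN = "082634",
                    stringsAsFactors = FALSE)

# Re-encode the latin column to UTF-8, stating the source encoding explicitly.
right$NOMMUNI <- iconv(right$NOMMUNI, from = "latin1", to = "UTF-8")

# Both key columns now hold the same UTF-8 bytes, so the keys can match.
merged <- merge(left, right, by.x = "POBLACION", by.y = "NOMMUNI",
                all.x = TRUE, sort = FALSE)
print(merged$CODMUN)
```

In current versions of R, read.table() and read.csv() also accept a fileEncoding argument that re-encodes the file while reading it, whereas the encoding argument only marks the strings without converting them.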

Agus

-------- Original Message --------
Subject: Matching failure in merge()
Date: Thu, 19 Mar 2009 20:04:48 +0100
From: Agustin Lobo <Agustin.Lobo at ija.csic.es>
Reply-To: Agustin.Lobo at ija.csic.es
To: r-help at r-project.org

Hi!

I've merged two data frames using merge():

delme <- merge(miDUNS50peqB, Bnomscodmunicipis,
               by.x = "POBLACION", by.y = "NOMMUNI",
               all.x = TRUE, sort = FALSE)

After noticing some problems in the resulting dataset,
I found that, in some cases, there was no match between
the by.x and the by.y elements, even though they
apparently should match. Specifically, I get no match
for the cases in which both the by.x and the by.y variables
are equal to "SANT VICENÇ DELS HORTS". In the following example I select
fields POBLACION and NOMMUNI in two cases for which both fields
should be identical to "SANT VICENÇ DELS HORTS" and I get:
(082634 is the municipality code for that town in Bnomscodmunicipis
and 08620 is the postal code for that town in delme)

> x <- Bnomscodmunicipis[Bnomscodmunicipis$CODMUN=="082634",1][1]
> y <- miDUNS50peqB[miDUNS50peqB$CODPOSTAL=="08620","POBLACION"][1]

> str(x)
  chr "SANT VICENÇ DELS HORTS"
> str(y)
  chr "SANT VICENÇ DELS HORTS"
> x==y
[1] FALSE

which I cannot understand. If I just cut and paste
those values and run the equivalent logical operation:
> "SANT VICENÇ DELS HORTS" == "SANT VICENÇ DELS HORTS"
[1] TRUE

The problem is that the values for "SANT VICENÇ DELS HORTS"
in the resulting merged dataframe are wrong.

Any help with this issue would be greatly appreciated. I'm really
astonished. I think it might involve an encoding problem
with the non-ASCII characters, but I can't see where.
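
One way to check an encoding hypothesis like this is to look at the raw bytes and the declared encoding of each string rather than at what the terminal prints. The sketch below uses an invented value, not the real columns, and uses "latin1" only as a stand-in for whatever encoding the second file actually has.

```r
# Illustrative string, explicitly UTF-8 (the real data would be x and y).
a <- enc2utf8("SANT VICEN\u00c7 DELS HORTS")

# The same text as raw latin-encoded bytes.
b_raw <- iconv(a, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

# Declared encoding of the UTF-8 string.
print(Encoding(a))

# Raw bytes: the C-cedilla takes two bytes in UTF-8 and one in latin1,
# so the two byte sequences differ in content and in length.
print(charToRaw(a))
print(b_raw)
print(c(utf8 = length(charToRaw(a)), latin = length(b_raw)))
```

Applied to the real data, charToRaw(x) and charToRaw(y) would show whether the two values that print identically are in fact stored as different bytes.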

I'm using R 2.8.1 on Ubuntu 8.04 (in English; R is in English too).

Agus

-- 
Dr. Agustin Lobo
Institut de Ciencies de la Terra "Jaume Almera" (CSIC)
LLuis Sole Sabaris s/n
08028 Barcelona
Spain
Tel. 34 934095410
Fax. 34 934110012
email: Agustin.Lobo at ija.csic.es
http://www.ija.csic.es/gt/obster
