[R] Chinese characters encoding problem with XML

Wind2 windspeedo at qq.com
Wed Dec 31 12:48:42 CET 2008


The problem seems to lie in the XML methods.
The parsed document xml itself is fine.  Its header reads as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta http-equiv="Content-Type" content="text/html;
charset=gb2312"><title>深圳国投</title>

The charset here is correctly gb2312, which matches the encoding of the web page itself.

> doc <- xmlRoot(xml)
> doc[[1]]
<head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8"><title>娣卞湷鍥芥姇</title>

The charset has been changed to UTF-8, and the title is now garbled.
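
One possible reading of this output, offered as a sketch and not a confirmed diagnosis: the title is held as UTF-8 bytes, and the CP936 console shows those bytes as if they were GBK. The snippet below uses only base R; the name misread is made up for illustration, and it assumes the platform's iconv supports "GBK". If the reading is right, the printed string should look like the garbled title above.

```r
# Hypothetical reproduction: the page title from the post, written with
# Unicode escapes so the script does not depend on the file's own encoding.
title <- "\u6df1\u5733\u56fd\u6295"

# The UTF-8 byte sequence that the parser is assumed to hold internally.
utf8_bytes <- charToRaw(enc2utf8(title))

# Decode those same bytes as if they were GBK, which is roughly what a
# CP936 console does with a string that is not marked as UTF-8.
misread <- iconv(rawToChar(utf8_bytes), from = "GBK", to = "UTF-8")

print(utf8_bytes)
print(misread)
```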

> doc1 <- xmlRoot(xml, encoding="gb2312")
> doc1[[1]]
<head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8"><title>娣卞湷鍥芥姇</title>

It seems that some methods in the XML package change the charset to UTF-8 on
their own.
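
If the strings really are UTF-8, a workaround in base R may be to declare that encoding or to convert to the native one. This is only a sketch under that assumption: the bytes below are the dt value from my earlier message (quoted below), rebuilt from raw values, and dt_gbk is a name made up for illustration. Encoding(), iconv() and rawToChar() are base R functions.

```r
# Bytes of dt as R printed them in the quoted message, assumed to be UTF-8.
dt_bytes <- as.raw(c(0x32, 0x30, 0x30, 0x38,   # "2008"
                     0xe5, 0xb9, 0xb4,         # \345\271\xb4
                     0x31, 0x32,               # "12"
                     0xe6, 0x9c, 0x88,         # \346\234\x88
                     0x32, 0x35,               # "25"
                     0xe6, 0x97, 0xa5))        # \346\227\xa5
dt <- rawToChar(dt_bytes)

# Option 1: keep the bytes and only tell R that they are UTF-8.
Encoding(dt) <- "UTF-8"
print(dt)

# Option 2: re-encode to GBK for code that expects the CP936 locale.
# Assumes the platform's iconv knows the name "GBK".
dt_gbk <- iconv(dt, from = "UTF-8", to = "GBK")
print(charToRaw(dt_gbk))
```

The same Encoding() assignment could be applied to the whole vector returned by xpathApply() before the data frame is built.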




Wind2 wrote:
> 
> XML is a good tool for reading data from the web within R.  But I wonder how I
> can get the encoding right.
> 
> library(XML)
> url <- 'http://www.szitic.com/docc/jz-lmzq.html'
> xml <- htmlTreeParse(url, useInternal=TRUE)
> q <- "//tbody/tr/td"
> dat <- unlist(xpathApply(xml, q, xmlValue))
> df <- as.data.frame(t(matrix(dat, 4)))
> dt<-as.character(df[15,1])
> 
> The first column of df holds dates in Chinese, and dt is one of those
> dates.
> When I copied the content of dt into the email, it became the following:
>> dt
> [1]
> "2008&#229;岹砀戀&#13312;&#12544;&#12800;鰀\x8825&#230;岗砀愀&#13568;&#8704;&#3328;&#2560;&#15872;&#8192;
> 
> In R itself, it looks like this:
>>dt
> [1] "2008\345\271\xb412\346\234\x8825\346\227\xa5"
> 
> and the digits are shown in a slightly different color.
> 
>> getOption("encoding")
> [1] "native.enc"
>> Sys.getlocale()
> [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of
> China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of
> China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of
> China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of
> China.936"
>> 
> 
> Package:              XML
> Version:              1.98-1
> Date:                 2008/10/17
> 
> R version 2.8.0 (2008-10-20)
> Windows Vista Basic, Simplified Chinese edition.
> 
> There is no problem using Chinese characters in R code.
> 
> I wonder how I can get the Chinese characters correctly with XML.  Or is
> there any method that could help me convert the encoding of characters from
> UTF-8 to Unicode in R?
> 
> Regards,
> Wind
> 
> ------------------
> http://windspeedo.spaces.live.com
