[R] Getting htmlParse to work with Hebrew? (on windows)

Duncan Temple Lang duncan at wald.ucdavis.edu
Tue Jan 31 02:33:54 CET 2012


After some off-line discussion and testing by Tal, the latest
version of the XML package (3.9-4) should resolve these issues:
the encoding declared in the document is now used as the default
in more cases.

It is often important to specify the encoding explicitly for HTML
files in the call to htmlParse(), and to write it as "UTF-8" rather
than the lower-case "utf-8".
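
As a minimal sketch of that advice (the inline HTML string is a made-up
stand-in for a downloaded page, not the page from the original post, and
the variable names are only illustrative):

```r
library(XML)

# Hypothetical stand-in for page content fetched elsewhere (e.g. via getURL()):
# a small HTML string containing a Hebrew word, written with \u escapes.
html <- "<html><head><title>test</title></head><body><p>\u05e9\u05dc\u05d5\u05dd</p></body></html>"
html <- enc2utf8(html)  # make sure the string itself is UTF-8

# Parse the string as text and state the encoding explicitly, in upper case.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Pull the paragraph text back out to check that it survived parsing.
txt <- xpathSApply(doc, "//p", xmlValue)
```

If txt prints as readable Hebrew, the encoding was picked up; whether the
console displays it correctly can still depend on the locale and font.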

I'll add code to make this simpler when I get a chance.

  Thanks Tal

    D.

On 1/30/12 5:35 AM, Tal Galili wrote:
> Hello dear R-help mailing list.
> 
> 
> 
> I would like htmlParse to work well with Hebrew, but it keeps
> scrambling the Hebrew text in the pages I feed into it.
> 
> For example:
> 
> # why can't I parse the Hebrew correctly?
> 
> library(RCurl)
> library(XML)
> u = "http://humus101.com/?p=2737"
> a = getURL(u)
> a # Here the Hebrew is fine.
> a2 <- htmlParse(a)
> a2 # Here it is a mess...
> 
> None of these seem to fix it:
> 
> htmlParse(a, encoding = "utf-8")
> 
> htmlParse(a, encoding = "iso8859-8")
> 
> This is my locale:
> 
>> Sys.getlocale()
> 
> [1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
>>
> 
> Any suggestions?
> 
> 
> Thanks in advance,
> Tal
> 
> 
> 
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com |  972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------


