[R] Removing Embedded Null characters from text/html

Duncan Temple Lang duncan at wald.ucdavis.edu
Fri Oct 16 17:58:31 CEST 2009


[David contacted me directly, so I am sending my off-line reply to the list
 just for the record in case others encounter a similar problem.]

Hi David.

 No problem contacting me at all.
I saw your mail at one point on the mailing list,
but didn't have a chance to respond.

Indeed, it seems like there is some embedded null in the string.
I need to investigate more about what is happening with the encoding, etc.
and whether it is on the RCurl or R side.

But in the meantime, the following two approaches seem to get around the problem:

 1) just use htmlParse(url) on the URL directly, i.e. don't use RCurl.
    We only need basic HTTP facilities and htmlParse() (or more specifically
    libxml2) provides these for us.

 2) If you need RCurl to manage the connection and communication for the HTTP request,
    use
      txt = rawToChar(getURLContent(url, binary = TRUE))

       # You'll see a warning about truncation

      htmlParse(txt, asText = TRUE)
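
If the truncation in approach 2 loses content you need, one option is to drop
the null bytes from the raw vector before converting it to a string. The
following is only a minimal sketch under assumptions: the `bytes` vector is a
made-up stand-in for what getURLContent(url, binary = TRUE) returns, and it
assumes the nulls are stray bytes rather than part of a multi-byte encoding
such as UTF-16, in which case converting the encoding would be the better fix.

```r
# Hypothetical stand-in for the raw vector from getURLContent(url, binary = TRUE):
# a short HTML fragment with one embedded null byte in the middle.
bytes <- c(charToRaw("<html><body>Data"),
           as.raw(0),
           charToRaw(" and more</body></html>"))

# Keep every byte except 0x00, then convert to a character string.
clean <- bytes[bytes != as.raw(0)]
txt <- rawToChar(clean)
```

The resulting txt can then be passed to htmlParse(txt, asText = TRUE) as above.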

BTW, use htmlTreeParse() or htmlParse(). I use the latter and then XPath
expressions via getNodeSet() or xpathApply() to extract content from the document.
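
As a rough illustration of that XPath step, here is a small sketch. The inline
HTML string and the XPath expressions are made up for the example and are not
taken from the Yahoo page; the real expressions would depend on that page's
structure.

```r
library(XML)

# Hypothetical inline page standing in for the downloaded HTML.
txt <- paste0("<html><body><table>",
              "<tr><td>AAA</td><td>1.5</td></tr>",
              "<tr><td>BBB</td><td>2.5</td></tr>",
              "</table></body></html>")
doc <- htmlParse(txt, asText = TRUE)

# getNodeSet() returns the list of nodes matching the XPath expression.
rows <- getNodeSet(doc, "//table//tr")

# xpathApply() applies a function (here xmlValue) to each matching node.
cells <- unlist(xpathApply(doc, "//td", xmlValue))
```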

 HTH,
   D.

David Young wrote:
> Hi,
> 
> I'm trying to download some data from the web and am running into
> problems with 'embedded null' characters.  These seem to indicate to R
> that it should stop processing the page so I'd like to remove them.
> I've been looking around and can't seem to identify exactly what the
> character is and consequently how to remove it.
> 
> # THE CODE WORKS ON THIS PAGE
> library(RCurl)
> library(XML)
> theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
> webpage <- getURL(theurl)
> 
> # BUT DOES NOT WORK HERE DUE TO EMBEDDED NULL CHARACTERS
> theurl <- "http://screen.yahoo.com/b?pr=1/&s=nm&db=stocks&vw=0&b=21"
> webpage <- getURL(theurl)
> 
> Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
>   Failed writing body (1371 != 1461)
> In addition: Warning messages:
> 1: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
>   truncating string with embedded nul: 'ttp://finance.  
>   ## I DELETED SOME HERE FOR BREVITY##  al>\nData and  [... truncated]
> 2: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
>   only read 1371 of the 1461 input bytes/characters
> 
> # THIS CODE COPIES THE PROBLEMATIC PAGE TO MY COMPUTER
> destfile<-"file:///C:/projects/stock data/data/test.htm"
> download.file ( theurl , destfile , quiet = TRUE )
> 
> # WHICH LEAVES ME WITH JUST IDENTIFYING WHAT CHARACTER IS CAUSING THE
> # PROBLEM AND THEN GETTING RID OF IT.
> 
> I'd appreciate any advice.
> 
> 
>



