[R] Externalptr class to character class from (web) scrape

Duncan Murdoch murdoch.duncan at gmail.com
Fri Jul 26 19:32:41 CEST 2013


By the way, here's a quicker (but slightly more dangerous) way to find the
print.XMLInternalDocument function:  just call debug(print) before printing
the object.  Two steps take you to the method, and let you see what it's
doing. The "danger" comes because now print() will always trigger the
debugger, which can be a little confusing!  Remember undebug(print) at the
end.
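
A minimal sketch of that debug/undebug cycle, using only base R. The calls to isdebugged() are added here just to make the flag visible, and print(Sys.Date()) is a stand-in for printing the scraped document; neither comes from the original exchange. debugonce() is shown as an alternative because it clears the flag itself after the first call.

```r
# Flag print() for debugging; every later call to print() would now
# open the browser until the flag is cleared.
debug(print)
flag_set <- isdebugged(print)       # TRUE while the flag is on

# Clear the flag again, as recommended above.
undebug(print)
flag_cleared <- !isdebugged(print)  # TRUE once the flag is off

# In an interactive session, debugonce() arms the debugger for a single
# call and then removes the flag on its own.
if (interactive()) {
  debugonce(print)
  print(Sys.Date())  # stand-in object; step through in the browser
}
```

If a session is left in a confusing state, isdebugged(print) is a quick way to check whether the flag is still set.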

Duncan Murdoch


On 26/07/2013 1:26 PM, Duncan Murdoch wrote:
> On 26/07/2013 12:43 PM, Nick McClure wrote:
> > I'm hitting a wall. When I use the 'scrape' function from the package
> > 'scrapeR' to get the pagesource from a web page, I do the following:
> > (as an example)
> >
> > website.doc = parse("http://www.google.com")
> >
> > When I look at it, it seems fine:
> >
> > website.doc[[1]]
> >
> > This seems to have the information I need.  Then when I try to get it
> > into a character vector,
> >
> > character.website = as.character(website.doc[[1]])
> >
> > I get the error:
> >
> > Error in as.vector(x, "character") :
> > cannot coerce type 'externalptr' to vector of type 'character'
> >
> > I'm trying very very hard to wrap my head around how to get this
> > external pointer to a character, but after reading many help files, I
> > cannot understand how to do this. Any ideas?
>
> You should use str() in cases like this. When I look at
> str(website.doc[[1]]) (after producing website.doc with scrape(), not
> parse()), I see
>
>   > str(website.doc[[1]])
> Classes 'HTMLInternalDocument', 'HTMLInternalDocument',
> 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
> - attr(*, "headers")= Named chr [1:2] "<HTML><HEAD><meta
> http-equiv=\"content-type\"
> content=\"text/html;charset=utf-8\">\n<TITLE>302
> Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"
> ..- attr(*, "names")= chr [1:2] "<HTML><HEAD><meta
> http-equiv=\"content-type\"
> content=\"text/html;charset=utf-8\">\n<TITLE>302
> Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"
>
> So it is an external pointer with a number of classes. One or more of
> those will have a print method. methods(print) will list all the print
> methods, and I see there's a (hidden) print.XMLInternalDocument method
> somewhere. Then
>
>   > getAnywhere("print.XMLInternalDocument")
> A single object matching ‘print.XMLInternalDocument’ was found
> It was found in the following places
> registered S3 method for print from namespace XML
> namespace:XML
> with value
>
> function (x, ...)
> {
> cat(as(x, "character"), "\n")
> }
> <environment: namespace:XML>
>
> shows that the as() generic should work, even though as.character()
> doesn't, and indeed as(website.doc[[1]], "character") does display
> something.
>
> Duncan Murdoch
>
>
>
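
To illustrate why as() can succeed where as.character() fails, here is a small self-contained sketch that does not need the XML or scrapeR packages. It builds a bare external pointer with a made-up S3 class, "DemoDoc", and registers a coercion for it with setAs(). The class name and the returned string are hypothetical, and whether the XML package defines its coercion in exactly this way is an assumption; the print method shown above only tells us that as(x, "character") works for its documents.

```r
library(methods)

# A bare external pointer carrying an S3 class attribute, standing in for
# the parsed document. "DemoDoc" is a hypothetical class name.
ptr <- new("externalptr")
class(ptr) <- "DemoDoc"

# as.character() has no method for this class, so it falls through to the
# default coercion, which cannot handle an external pointer.
res <- tryCatch(as.character(ptr), error = function(e) e)

# Make the S3 class known to the methods package, then register a
# coercion to "character". The returned text is purely illustrative.
setOldClass("DemoDoc")
setAs("DemoDoc", "character", function(from) "<demo document text>")

# as() looks up the registered coercion and succeeds.
txt <- as(ptr, "character")
```

With the real object from the thread, the corresponding call is as(website.doc[[1]], "character").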


