[R] retrieve certain part from html

Tony Breyal tony.breyal at googlemail.com
Wed Sep 23 14:43:26 CEST 2009


Maybe you could modify the following to suit your situation (I use
this XPath expression to get links from Google):

?htmlTreeParse
?getNodeSet

> library(XML)
> link <- 'http://www.google.co.uk/search?hl=en&client=firefox-a&rls=org.mozilla:en-GB:official&hs=2XR&ei=mxa6SojjOeaMjAfJkcDuBQ&sa=X&oi=spell&resnum=0&ct=result&cd=1&q=Doctor+Who&spell=1'
> html <- htmlTreeParse(link, useInternalNodes = TRUE, error=function(...){})
> nodes <- getNodeSet(html, "//a[@href][@class='l']")
> sapply(nodes, xmlGetAttr, "href")
 [1] "http://www.bbc.co.uk/doctorwho/"
 [2] "http://www.bbc.co.uk/doctorwho/classic/"
 [3] "http://en.wikipedia.org/wiki/Doctor_Who"
 [4] "http://www.youtube.com/watch?v=LF2x5IKxmAQ"
 [5] "http://www.youtube.com/watch?v=DnKNupdSH8g"
 [6] "http://www.telegraph.co.uk/culture/tvandradio/doctor-who/6199603/Doctor-Who-Top-10-fans-vote-for-all-time-best-episode.html"
 [7] "http://www.google.com/hostednews/ap/article/ALeqM5i17A4FXTLhJX10-sCbhhnhdqY9HwD9ASO6A00"
 [8] "http://www.telegraph.co.uk/news/newstopics/celebritynews/6200053/Doctor-Who-star-David-Tennant-voted-pupils-dream-head-teacher.html"
 [9] "http://www.imdb.com/title/tt0436992/"
[10] "http://www.imdb.com/title/tt0056751/"
[11] "http://www.gallifreyone.com/"
[12] "http://www.doctorwho.co.uk/"
[13] "http://www.drwhoguide.com/"
[14] "http://www.bbcamerica.com/content/123/index.jsp"
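
Applied to the snippet from your question, the same idea might look roughly like the sketch below. It assumes the HTML is already held in a character string (the variable names `snippet`, `doc` and `hrefs` are made up for illustration), so it parses the text directly with `asText = TRUE` instead of fetching a URL, and it reads the attribute by name rather than by position.

```r
library(XML)

# The HTML fragment from the original question, held as one string
# (assumption: in practice this would come from readLines() or similar)
snippet <- paste0(
  "<td><a href='2005-01.html'>2005-01</a></td>",
  "<td><a href='2006-01.html'>2006-01</a></td>",
  "<td><a href='2007-01.html'>2007-01</a></td>",
  "<td><a href='2008-01.html'>2008-01</a></td>",
  "<td><a href='2009-01.html'>2009-01</a></td>"
)

# Parse the string itself rather than treating it as a file name or URL
doc <- htmlParse(snippet, asText = TRUE)

# Select every <a> element that has an href attribute and
# extract that attribute by name
hrefs <- xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")
print(hrefs)
```

If only some of the links are wanted, the XPath expression can be narrowed, for example with a predicate such as `//td/a[@href]`.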



On 23 Sep, 13:29, "Rene" <kaixinma... at gmail.com> wrote:
> Dear All,
>
> Can someone please show me how to extract a certain part of a long HTML
> string?
>
> e.g.
>
> "<td><a href='2005-01.html'>2005-01</a></td><td><a
> href='2006-01.html'>2006-01</a></td><td><a
> href='2007-01.html'>2007-01</a></td><td><a
> href='2008-01.html'>2008-01</a></td><td><a
> href='2009-01.html'>2009-01</a></td>"
>
> How can I get only the strings "2005-01.html", "2006-01.html",
> "2007-01.html", "2008-01.html", "2009-01.html" from the above HTML code? I
> have tried to use the gsub function, but it did not work.
>
> Please guide me on this.
>
> Thanks a lot.
>
> Rene.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-h... at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



