[R] Analyzing Publications from Pubmed via XML

Gabor Grothendieck ggrothendieck at gmail.com
Sun Dec 16 05:39:30 CET 2007


If we can assume that the abstract is always the 4th paragraph then we
can try something like this:

library(XML)

# Parse the RSS feed into an internal document so that XPath can be used on it.
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
	isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)

out <- cbind(
	Author = unlist(xpathApply(doc, "//author", xmlValue)),
	# guid looks like "PubMed:17589729"; keep only the part after the colon
	PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
	# each <description> holds HTML as text, so it is parsed a second time
	Abstract = unlist(xpathApply(doc, "//description",
		function(x) {
			on.exit(free(doc2)) # release the inner document on exit
			doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
				useInternalNodes = TRUE, trim = TRUE)
			xpathApply(doc2, "//p[4]", xmlValue)
		}
	)))
free(doc) # release the feed document
substring(out, 1, 25) # display first 25 chars of each field


The last line produces (it may look messed up in this email):

> substring(out, 1, 25) # display first 25 chars of each field
      Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
 [5,] " Hopp R, Natarajan N, Lew" "17908862" ""
 [6,] " Preuss SF, Klussmann JP," "17851940" "CONCLUSIONS: The presente"
 [7,] " Mouadeb DA, Belafsky PC"  "17765779" "OBJECTIVES: The 585nm pul"
 [8,] " Thompson L"               "17702311" ""
 [9,] " Schaffer A, Brotherton J" "17688640" ""
[10,] " Stephen JK, Vaught LE, C" "17638782" "OBJECTIVE: To investigate"
[11,] " Shah KV, Westra WH"       "17627059" ""
[12,] " Koufman JA, Rees CJ, Fra" "17599582" "BACKGROUND: Unsedated off"
[13,] " Akst LM, Broadhurst MS, " "17592395" ""
[14,] " Pignatari SS, Liriano RY" "17589729" "Evidence of a relation be"
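
Several rows above come back empty, either because the record has no
abstract or because it does not sit in the fourth paragraph. If the
fixed-position assumption turns out to be too fragile, a rough alternative
is to drop the paragraphs we can recognise and keep whatever is left. This
is only a sketch: extract_abstract is a made-up helper, not part of the XML
package, and it assumes the layout shown in the quoted item below (title,
citation, "Authors: ...", optional abstract, "PMID: ...").

```r
library(XML)

# Hypothetical helper: return the abstract from one <description> payload,
# or NA when no abstract paragraph is present.
extract_abstract <- function(html) {
  doc2 <- htmlParse(html, asText = TRUE)
  on.exit(free(doc2))
  p <- trimws(as.character(unlist(xpathSApply(doc2, "//p", xmlValue))))
  # Assumption: the first two paragraphs are the title and the citation.
  body <- p[-(1:2)]
  # Assumption: the author and PMID paragraphs start with these labels.
  body <- body[!grepl("^(Authors:|PMID:)", body)]
  if (length(body) == 0) NA_character_ else paste(body, collapse = " ")
}

# Two made-up payloads, one with an abstract and one without.
with_abs <- paste0("<p><b>A title.</b></p><p>J Example. 2007;1(1):1-2</p>",
                   "<p>Authors: Doe J</p><p>Some abstract text.</p>",
                   "<p>PMID: 1 [PubMed]</p>")
no_abs   <- paste0("<p><b>A title.</b></p><p>J Example. 2007;1(1):1-2</p>",
                   "<p>Authors: Doe J</p><p>PMID: 2 [PubMed]</p>")

abstracts <- sapply(c(with_abs, no_abs), extract_abstract, USE.NAMES = FALSE)
```

Returning NA rather than nothing keeps one value per item, so the column
stays aligned with Author and PMID when it goes into cbind.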


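On the one-step data.frame() failure quoted below: xpathApply returns a
list with one element per node, and data.frame() most likely treats each
element of such a list as a separate one-row column, hence "differing
number of rows: 15, 1". Flattening with unlist first should let it work in
one call. A small sketch with made-up stand-in values rather than the real
feed:

```r
# Stand-ins for the feed results: a character vector of titles and two
# lists shaped like what xpathApply() returns (one element per node).
titles  <- c("Title A", "Title B", "Title C")
authors <- list("Doe J", "Roe R", "Poe E")
jrnls   <- list("J One", "J Two", "J Three")

# Flatten each list to a character vector so every column has one value
# per row.
cites <- data.frame(titles  = titles,
                    authors = unlist(authors),
                    jrnls   = unlist(jrnls),
                    stringsAsFactors = FALSE)
```

With the real data the same idea would be unlist(colvals[[2]]) and
unlist(colvals[[3]]).
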
On Dec 15, 2007 10:13 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> David Winsemius <dwinsemius at comcast.net> wrote in
> news:Xns9A077F740B4A0dNOTwinscomcast at 80.91.229.13:
>
> > "Farrel Buchinsky" <fjbuch at gmail.com> wrote in
> > news:bd93cdad0712141216s23071d27n17d87a487ad06950 at mail.gmail.com:
> >
> >> On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org>
> >> wrote:
> >>> or just try looking in the annotate package from Bioconductor
> >>>
> >>
> >> Yip. annotate seems to be the most streamlined way to do this.
> >> 1) How does one turn the list that is created into a dataframe whose
> >> column names are along the lines of date, title, journal, authors etc
> >
> > Gabor's example already did that task.
> >
>
> Actually the object returned by Gabor's method was a list of lists. Here
> is one way (probably very inefficient) of getting "doc" into a
> data.frame:
>
> colvals <-sapply(c("//title", "//author", "//category"), xpathApply,
>           doc = doc, fun = xmlValue)
>
> titles=as.vector(unlist(colvals[1])[3:17])
>
> # needed to drop extraneous titles for search name and an NCBI header
> #>str(colvals)
> #List of 3
> # $ //title   :List of 17
> #  ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
> #  ..$ : chr "NCBI PubMed"
>
> authors=colvals[[2]]
> jrnls=colvals[[3]]
>
> # not sure why, but trying to do it in one step failed:
> #  cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),
> #                     authors=colvals[[2]],jrnls=colvals[[3]])
> # Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
> # authors = colvals[[2]],  :
> #  arguments imply differing number of rows: 15, 1
> # but the following worked
>
>  cites<-data.frame(titles=as.vector(titles))
>  cites$author<-authors
>  cites$jrnls<-jrnls
>  cites
>
> I am still wondering how to extract material that does not have an XML
> tag.  Each item looks like:
>
>  <item>
>   <title>Gastroesophageal reflux in patients with recurrent laryngeal
> papillomatosis.</title>
>   <link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
> tmpl=NoSidebarfile&amp;db=PubMed&amp;cmd=Retrieve&amp;list_uids=17589729
> &amp;dopt=Abstract</link>
>   <description>
>    <![CDATA[
>    <table border="0" width="100%"><tr><td align="left"><a
> href="http://www.scielo.br/scielo.php?script=sci_arttext&amp;pid=S0034-
> 72992007000200011&amp;lng=en&amp;nrm=iso&amp;tlng=en"><img
> src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
> img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
> href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
> db=PubMed&amp;cmd=Display&amp;dopt=PubMed_PubMed&amp;from_uid=17589729">
> Related Articles</a></td></tr></table>
>        <p><b>Gastroesophageal reflux in patients with recurrent
> laryngeal papillomatosis.</b></p>
>        <p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
> </p>
>        <p>Authors:  Pignatari SS, Liriano RY, Avelino MA, Testa JR,
> Fujita R, De Marco EK</p>
>        <p>Evidence of a relation between gastroesophaeal reflux and
> pediatric respiratory disorders increases every year. Many respiratory
> symptoms and clinical conditions such as stridor, chronic cough, and
> recurrent pneumonia and bronchitis appear to be related to
> gastroesophageal reflux. Some studies have also suggested that
> gastroesophageal reflux may be associated with recurrent laryngeal
> papillomatosis, contributing to its recurrence and severity. AIM: the aim
> of this study was to verify the frequency and intensity of
> gastroesophageal reflux in children with recurrent laryngeal
> papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
> between 3 and 12 years, presenting laryngeal papillomatosis, were
> included in this study. The children underwent 24-hour double-probe pH-
> metry. RESULTS: fifty percent of the patients had evidence of
> gastroesophageal reflux at the distal sphincter; 90% presented reflux at
> the proximal sphincter. CONCLUSION: the frequency of proximal
> gastroesophageal reflux is significantly increased in patients with
> recurrent laryngeal papillomatosis.</p>
>        <p>PMID: 17589729 [PubMed - in process]</p>    ]]>
>   </description>
>   <author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
> Marco EK</author>
>   <category>Rev Bras Otorrinolaringol (Engl Ed)</category>
>   <guid isPermaLink="false">PubMed:17589729</guid>
>  </item>
>
> I would like to access, for instance, the PMID or the abstract within the
> <description> element, but I do not think they have names in the same
> way that <author> and <category> are named XML nodes. I suspect that
> getting the output in a different format, say as MEDLINE, might produce
> output that was tagged more completely.
>
>
> --
> David Winsemius


