[R] Analyzing Publications from Pubmed via XML

David Winsemius dwinsemius at comcast.net
Sun Dec 16 04:13:52 CET 2007

David Winsemius <dwinsemius at comcast.net> wrote in
news:Xns9A077F740B4A0dNOTwinscomcast at 

> "Farrel Buchinsky" <fjbuch at gmail.com> wrote in
> news:bd93cdad0712141216s23071d27n17d87a487ad06950 at mail.gmail.com: 
>> On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org>
>> wrote: 
>>> or just try looking in the annotate package from Bioconductor
>> Yip. annotate seems to be the most streamlined way to do this.
>> 1) How does one turn the list that is created into a dataframe whose
>> column names are along the lines of date, title, journal, authors etc
> Gabor's example already did that task.

Actually, the object returned by Gabor's method was a list of lists. Here 
is one way (probably very inefficient) of getting "doc" into a data.frame:

colvals <- sapply(c("//title", "//author", "//category"), xpathApply, 
                  doc = doc, fun = xmlValue)


# The first two titles (the search name and an NCBI header) are extraneous
# and need to be dropped; str(colvals) shows:
#List of 3
# $ //title   :List of 17
#  ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
#  ..$ : chr "NCBI PubMed"


# not sure why, but trying to do it in one step failed:
#  cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),  
#                     authors=colvals[[2]],jnrls=colvals[[3]])
# Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]), 
# authors = colvals[[2]],  : 
#  arguments imply differing number of rows: 15, 1
# but the following worked
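
The error above most likely arises because colvals[[2]] and colvals[[3]] are 
still lists, and data.frame() turns a list into one row with many columns. 
A minimal sketch of one way around it (not necessarily the code originally 
posted) is to unlist() every component first. The colvals object below is a 
mock with invented values so that the example runs without the XML package:

```r
# Mock stand-in for the list of lists returned by sapply(..., xpathApply, ...);
# the values are invented purely for illustration.
colvals <- list(
  "//title"    = list("PubMed: my search", "NCBI PubMed", "Paper A", "Paper B"),
  "//author"   = list("Smith J", "Jones K"),
  "//category" = list("Journal One", "Journal Two")
)

# Drop the two feed-level titles, then flatten each inner list to a vector
# so that all three columns have the same length.
titles <- unlist(colvals[["//title"]])[-(1:2)]
cites <- data.frame(
  titles  = titles,
  authors = unlist(colvals[["//author"]]),
  jnrls   = unlist(colvals[["//category"]]),
  stringsAsFactors = FALSE
)
str(cites)
```

With the real feed the same pattern would apply, dropping the first two of 
the 17 titles to leave 15 rows.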


I am still wondering how to extract material that does not have an XML 
tag.  Each item looks like:

   <title>Gastroesophageal reflux in patients with recurrent laryngeal 
    <table border="0" width="100%"><tr><td align="left"><a 
img-scielo_en.gif" border="0"/></a> </td><td align="right"><a 
Related Articles</a></td></tr></table>
        <p><b>Gastroesophageal reflux in patients with recurrent 
laryngeal papillomatosis.</b></p>
        <p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
        <p>Authors:  Pignatari SS, Liriano RY, Avelino MA, Testa JR, 
Fujita R, De Marco EK</p>
        <p>Evidence of a relation between gastroesophaeal reflux and 
pediatric respiratory disorders increases every year. Many respiratory 
symptoms and clinical conditions such as stridor, chronic cough, and 
recurrent pneumonia and bronchitis appear to be related to 
gastroesophageal reflux. Some studies have also suggested that 
gastroesophageal reflux may be associated with recurrent laryngeal 
papillomatosis, contributing to its recurrence and severity. AIM: the aim 
of this study was to verify the frequency and intensity of 
gastroesophageal reflux in children with recurrent laryngeal 
papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged 
between 3 and 12 years, presenting laryngeal papillomatosis, were 
included in this study. The children underwent 24-hour double-probe pH-
metry. RESULTS: fifty percent of the patients had evidence of 
gastroesophageal reflux at the distal sphincter; 90% presented reflux at 
the proximal sphincter. CONCLUSION: the frequency of proximal 
gastroesophageal reflux is significantly increased in patients with 
recurrent laryngeal papillomatosis.</p>
        <p>PMID: 17589729 [PubMed - in process]</p>    ]]>
   <author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De 
Marco EK</author>
   <category>Rev Bras Otorrinolaringol (Engl Ed)</category>
   <guid isPermaLink="false">PubMed:17589729</guid>

I would like to access, for instance, the PMID or the abstract within the 
<description> element, but I do not think they are named in the same way 
that <author> and <category> are named XML nodes. I suspect that requesting 
the output in a different format, say MEDLINE, might produce output that is 
tagged more completely.
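
In the meantime, one rough possibility is to treat the description as plain 
text and pick it apart with regular expressions. The sketch below assumes 
the description has already been pulled out as a single character string 
(for example via xmlValue on the <description> node, which is an assumption 
about how the feed parses) and that the abstract is always the paragraph 
just before the "PMID:" line. The string desc is an abbreviated copy of the 
item shown above:

```r
# Abbreviated copy of one item's description text (HTML inside the feed).
desc <- paste0(
  "<p><b>Gastroesophageal reflux in patients with recurrent laryngeal ",
  "papillomatosis.</b></p>",
  "<p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4</p>",
  "<p>Authors:  Pignatari SS, Liriano RY, Avelino MA</p>",
  "<p>Evidence of a relation between reflux and pediatric respiratory ",
  "disorders increases every year.</p>",
  "<p>PMID: 17589729 [PubMed - in process]</p>"
)

# Split on the opening <p> tag (tolerates a missing </p>), strip the
# remaining tags, and discard empty pieces.
parts <- strsplit(desc, "<p>", fixed = TRUE)[[1]]
parts <- trimws(gsub("<[^>]+>", "", parts))
parts <- parts[nzchar(parts)]

# Locate the PMID paragraph; the abstract is assumed to precede it.
pmid_line <- grep("^PMID:", parts)
pmid      <- sub("^PMID: *([0-9]+).*$", "\\1", parts[pmid_line])
abstract  <- parts[pmid_line - 1]
```

This relies on the layout of the paragraphs rather than on any tags, so it 
would break if PubMed changed the order of the pieces.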

David Winsemius
