[BioC] help with PubMed Central OAI

Duncan Temple Lang duncan at wald.ucdavis.edu
Fri Apr 20 19:55:19 CEST 2012


Hi Chris

 The problem is that the <abstract> node is in a namespace, so the unprefixed
 XPath expression "//abstract" does not match it.

So the following will do what you want (and also avoids using readLines() by retrieving
 the URL directly in xmlParse().)

 library(XML)

 url <-
"http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"

 doc2 <- xmlParse(url)

 getNodeSet(doc2, "//x:abstract", c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))

or

  xpathSApply(doc2, "//x:abstract", xmlValue,
                namespaces = c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))


The namespace is defined on the <article> node.
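
To see the effect without fetching anything, here is a minimal sketch on a made-up miniature document. The XML string and the names txt, toy, plain, ns and found are assumptions for illustration only; the namespace URI is the one used above, and the prefix "x" is arbitrary.

```r
library(XML)  # provides xmlParse(), getNodeSet(), xpathSApply()

# Hypothetical miniature document: the default namespace is declared on
# <article>, so <abstract> inherits it even though it carries no prefix.
txt <- paste0(
  '<record>',
  '<article xmlns="http://dtd.nlm.nih.gov/2.0/xsd/archivearticle">',
  '<abstract>Example abstract text</abstract>',
  '</article>',
  '</record>')
toy <- xmlParse(txt, asText = TRUE)

# An unprefixed name in XPath matches only elements in no namespace,
# so this query finds nothing.
plain <- getNodeSet(toy, "//abstract")
length(plain)

# Bind a prefix of our own choosing to the namespace URI and use it in the path.
ns <- c(x = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle")
found <- xpathSApply(toy, "//x:abstract", xmlValue, namespaces = ns)
found
```

The prefix only has to agree between the XPath expression and the namespaces vector; it need not match anything in the document itself.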

  D.

On 4/20/12 10:33 AM, Chris Stubben wrote:
> I've been using Efetch to get some full text articles from Pubmed Central,  which works fine...
> 
> url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC2784878"
> x<-readLines(url)
> doc <- xmlParse(x )   # requires XML package
> xpathSApply(doc, "//abstract", xmlValue)
> [1] "The majority of all genes have so far been identified and annotated systematically through in silico gene finding.
> Here we report the finding of 3662 strand-specific transcriptionally active regions (TARs) in the genome of Bacillus
> subtilis by the use of tiling arrays.
> 
> 
> I recently noticed the PMC copyright says to use the FTP or OAI service for any "automated" retrievals, so I thought I
> would try OAI, but I can't get the same xpath queries to work.
> 
> url <-
> "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
> 
> x2<-readLines(url)  # will warn about incomplete final line
> doc2 <- xmlParse(x2 )
> xpathSApply(doc2, "//abstract", xmlValue)
> list()
> 
> This query does work, so I know there's an abstract tag.
> 
> table(xpathSApply(doc2, "//*", xmlName))
> 
>            abstract                 ack           addr-line                 aff             article  article-categories
>                   1                   1                   1                   1                   1                   1
>          article-id        article-meta       article-title        author-notes                back                body
>                   3                   1                  79                   1                   1                   1
>             caption             contrib       contrib-group copyright-statement             corresp                date
>                   7                   3                   1                   1                   1                   1
> 
> Thanks for any help.
> Chris Stubben
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


