[BioC] paper - download - pubmed

Chris Stubben stubben at lanl.gov
Wed Jan 16 18:52:26 CET 2013


>>
>> So, the problem is not that, for each paper I have to download the
>> pdfs (which are available if I go to the pubmed and search directly
>> there) and the corresponding supplementary files.
>>

   Nooshin,

   You can download a PDF from PubMed Central if you have its PMC id.

       # mode="wb" keeps the binary PDF intact on Windows
       download.file("http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3446303/pdf",
                     "PMC3446303.pdf", mode="wb")

   However, NCBI clearly states that you may NOT use any kind of automated
   process to download articles in bulk from the main PMC site, so I would use
   the ftp site for Open Access articles (see
   [2]http://www.ncbi.nlm.nih.gov/pmc/tools/ftp ).  The ftp site also includes
   the supplemental files.  First, read the list of available files:

       pmcftp <- read.delim("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt",
                            skip=1, header=FALSE, stringsAsFactors=FALSE)
       nrow(pmcftp)
       # [1] 552677
       names(pmcftp) <- c("dir", "citation", "id")
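
To try the parsing step without fetching the full list (over half a million rows), here is a minimal sketch that reads a two-row stand-in through the text argument of read.delim. The first line of the sample is a made-up placeholder for whatever line skip=1 discards, and the tab-separated layout is an assumption based on the read.delim call above; the two data rows are the ones printed further down in this message.

```r
# Stand-in for file_list.txt: one line to skip, then tab-separated rows.
# The first line is a hypothetical placeholder, not the real file's content.
sample_list <- paste(
  "placeholder line skipped by skip=1",
  "75/e9/Genome_Biol_2012_Apr_24_13(4)_R29.tar.gz\tGenome Biol. 2012 Apr 24; 13(4):R29\tPMC3446303",
  "04/0f/Bioinformatics_2012_Oct_1_28(19)_2532-2533.tar.gz\tBioinformatics. 2012 Oct 1; 28(19):2532-2533\tPMC3463124",
  sep = "\n")

# Same arguments as the real call, but reading from a string
toy <- read.delim(text = sample_list, skip = 1, header = FALSE,
                  stringsAsFactors = FALSE)
names(toy) <- c("dir", "citation", "id")
toy$id
```

Working on a small stand-in like this makes it easy to check column names and matching logic before pointing the same code at the ftp site.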

   Then match PMC ids and loop through the results to download and untar the
   files:

       y <- subset(pmcftp, id %in% c("PMC3446303", "PMC3463124"))
       y
       # 509377          75/e9/Genome_Biol_2012_Apr_24_13(4)_R29.tar.gz
       #                 Genome Biol. 2012 Apr 24; 13(4):R29 PMC3446303
       # 514389 04/0f/Bioinformatics_2012_Oct_1_28(19)_2532-2533.tar.gz
       #        Bioinformatics. 2012 Oct 1; 28(19):2532-2533 PMC3463124

       # seq_len() skips the loop cleanly when no ids match
       for (i in seq_len(nrow(y))) {
           destfile <- paste(y$id[i], ".tar.gz", sep="")
           # mode="wb" keeps the binary archive intact on Windows
           download.file(paste("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc", y$dir[i],
                               sep="/"), destfile, mode="wb")
           untar(destfile, compressed=TRUE)
       }
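
One thing the loop does not show: subset() silently drops ids that are not in the Open Access file list, and a single failed download stops everything. The following is only a sketch of one way to handle both. plan_downloads and fetch_all are hypothetical helper names, not part of any package, and "PMC0000000" is a made-up id used to show the missing case.

```r
# Build a download plan for the requested ids and report ids that are
# absent from the file list (subset() would drop them without a message).
plan_downloads <- function(filelist, ids,
                           base = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc") {
  hits <- filelist[filelist$id %in% ids, , drop = FALSE]
  list(url      = paste(base, hits$dir, sep = "/"),
       destfile = paste0(hits$id, ".tar.gz"),
       missing  = setdiff(ids, filelist$id))
}

# Download and unpack each archive; an error on one file is reported
# and the loop moves on to the next.
fetch_all <- function(plan) {
  for (i in seq_along(plan$url)) {
    if (file.exists(plan$destfile[i])) next   # skip archives already on disk
    tryCatch({
      download.file(plan$url[i], plan$destfile[i], mode = "wb")
      untar(plan$destfile[i], compressed = TRUE)
    }, error = function(e) {
      message("Failed: ", plan$url[i], " (", conditionMessage(e), ")")
    })
  }
}

# One-row stand-in for pmcftp, taken from the listing above
toy <- data.frame(dir = "75/e9/Genome_Biol_2012_Apr_24_13(4)_R29.tar.gz",
                  citation = "Genome Biol. 2012 Apr 24; 13(4):R29",
                  id = "PMC3446303", stringsAsFactors = FALSE)
plan <- plan_downloads(toy, c("PMC3446303", "PMC0000000"))
plan$missing
# fetch_all(plan)   # uncomment to actually download
```

Splitting the planning from the fetching means the list of urls and missing ids can be inspected before anything touches the network.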

   Also, if you need to get a list of PMC ids in R, I have a package called
   genomes on BioC that includes E-utility scripts.  Something like this
   query would get the 49 PMC ids for articles with Bioconductor in the title:

       library(genomes)
       x2 <- esummary(esearch("bioconductor[TITLE] AND open access[FILTER]",
                              db="pmc"), version="2.0")

   esummary uses a generic parser by default, so the PMC ids are mashed together
   in a column with other ids.  Pull them out and match against the file list:

       # the trailing .* drops any text after the PMC id
       ids <- sub(".*(PMC[0-9]+).*", "\\1", x2$ArticleIds)
       y <- subset(pmcftp, id %in% ids)
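
A substitution returns strings with no PMC id unchanged, so they can slip into ids unnoticed. Here is a minimal sketch that returns NA for those entries instead. extract_pmcid is a hypothetical helper, and the sample strings are invented for illustration; the real layout of x2$ArticleIds may differ.

```r
# Return the first PMC id found in each string, or NA when there is none.
extract_pmcid <- function(x) {
  m <- regexpr("PMC[0-9]+", x)
  hit <- !is.na(m) & m > -1L          # regexpr gives -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches returns only the matches
  out
}

# Invented examples of a combined id field (not real esummary output)
article_ids <- c("pmid: 00000000, pmcid: PMC3446303",
                 "pmcid: PMC3463124, other: text after the id",
                 "no id here")
ids <- extract_pmcid(article_ids)
ids
ids[!is.na(ids)]   # keep only the entries that carried a PMC id
```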

   You could also run esummary with parse=FALSE to get the XML results and
   parse them any way you like.  Or even use esearch alone and set
   usehistory="n":

       ids2 <- paste("PMC", esearch("bioconductor[TITLE] AND open access[FILTER]",
                                    db="pmc", usehistory="n", retmax=100), sep="")

   Chris
--

Chris Stubben

Los Alamos National Lab
Bioscience Division
MS M888
Los Alamos, NM 87545

References

   1. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3446303/pdf
   2. http://www.ncbi.nlm.nih.gov/pmc/tools/ftp
   3. ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt
   4. ftp://ftp.ncbi.nlm.nih.gov/pub/pmc

