[R] regexp problem (was: Re: publication statistics from Web of Science)

baptiste auguie ba208 at exeter.ac.uk
Thu Jan 15 11:19:37 CET 2009


Whoops, it seems I could use some help with regular expressions...

Consider the following two functions: the first builds a search string,
the second retrieves the content from the URL.
>
> makeURLsearch <- function(key, dates=c(NULL, NULL)){
> 	
> 	base.search <- "http://scholar.google.co.uk/scholar?"
> 	key.search <- paste("as_q=", key,"&",  sep="")
> 	other.search <- "num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&"
> 	dates.search <- paste("as_ylo=", dates[1], "&as_yhi=", dates[2],  
> "&as_allsubj=all&hl=en&lr=", sep="")
> 	
> 	full.search <- paste(base.search, key.search, other.search,  
> dates.search, sep="")
> 	return(full.search)
> }
>
>
> makeURLsearch("plasmonics")
> makeURLsearch("photonics", c(1980, NULL))
>
> retrieveNumberPublications <- function(url){
> 	
> 	x <- readLines(url)
> 	y <- grep('of about',x, value=TRUE)
> 	# the next line does not do what I wanted
> 	z <- gsub('of about\\s+</b>','\\1',y[1],perl=TRUE)
>
>         # the bit to retrieve is the number below
> 	#  <b>10</b> of about <b>21,900</b> for <b><b>photonics</b>
> 	z
> }
>
> retrieveNumberPublications( makeURLsearch("photonics", c(2008,  
> NULL)) )
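One caveat with the calls above: c(1980, NULL) is the same as c(1980), so dates[2] is NA and paste() writes a literal "NA" into the URL. A minimal sketch of one way around this, assuming NA (rather than NULL) is used to mark an open-ended bound; makeDateQuery is a hypothetical helper, not part of the functions above:

```r
# Hypothetical helper: builds only the date part of the query string.
# Assumption: NA marks a missing lower or upper year.
makeDateQuery <- function(dates = c(NA, NA)) {
  # dates[1:2] pads a length-1 vector with NA
  yrs <- dates[1:2]
  # replace missing bounds by empty strings so "NA" never reaches the URL
  bounds <- ifelse(is.na(yrs), "", as.character(yrs))
  paste("as_ylo=", bounds[1], "&as_yhi=", bounds[2], sep = "")
}

makeDateQuery(c(1980, NA))
makeDateQuery()
```

The same substitution could be dropped into makeURLsearch in place of the direct use of dates[1] and dates[2].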

I can isolate the long line containing the result I want, but I cannot
single out the value itself, i.e. the 21,900 in
" <b>10</b> of about <b>21,900</b> for <b><b>photonics</b> ".

Any regexp guru to help me out? I've never got my head around these,  
other than trivial cases.
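For what it is worth, my attempt has no parenthesised group, so '\\1' has nothing to refer to. A rough sketch of what I think is needed, assuming the line always has the form shown in the comment above (extractCount is a hypothetical name, and the sample line is copied from that comment rather than fetched from the web):

```r
# Sample line, copied from the comment in retrieveNumberPublications
line <- "<b>10</b> of about <b>21,900</b> for <b><b>photonics</b>"

# Hypothetical helper: returns the count as a number, or NA if not found.
extractCount <- function(x) {
  # match the whole line, capturing the digits and commas inside
  # the <b>...</b> pair that follows "of about"
  pattern <- ".*of about[[:space:]]*<b>([0-9,]+)</b>.*"
  if (!grepl(pattern, x)) return(NA_real_)
  count <- sub(pattern, "\\1", x)
  # drop the thousands separators before converting
  as.numeric(gsub(",", "", count))
}

extractCount(line)
```

Matching the whole line and replacing it with the captured group leaves only the number; this will break if the page layout differs from the sample.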

Many thanks,

baptiste


On 15 Jan 2009, at 09:45, baptiste auguie wrote:

> For the record, I thought I'd share two findings:
>
> First, the web of science website does seem to have some sort of API,
> as discussed here:
>
> http://scientific.thomson.com/support/faq/webservices/
> It does not seem like a trivial thing to set up though.
>
> Second, because I could not pass the search term easily in the
> address, I looked into Google scholar instead, where a typical search
> looks like:
> http://scholar.google.co.uk/scholar?as_q=plasmonics&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=1960&as_allsubj=all&hl=en&lr=
>
> here it is trivial to create such a string with the desired keyword
> and dates, and retrieve the number of results using readLines(url) and
> grep.
>
>
> Thanks to Phil Spector for some pointers.
>
> Best wishes,
>
> baptiste

_____________________________

Baptiste Auguié

School of Physics
University of Exeter
Stocker Road,
Exeter, Devon,
EX4 4QL, UK

Phone: +44 1392 264187

http://newton.ex.ac.uk/research/emag



