[BioC] biomaRt: connection stopping

Steffen Durinck durincks at mail.nih.gov
Wed Sep 13 18:19:15 CEST 2006


Hi,

I would like to add that biomaRt in RCurl mode can handle big queries 
but will break when you use it in a big loop.
An alternative to what Jim suggests could be to do the query for all ids 
at once:

A <- getBM(attributes = c("hgnc_symbol", "refseq_dna"), mart = mart,
           filters = "refseq_dna", values = RS)

Because refseq_dna is also requested as an attribute, each HUGO symbol 
in A comes back paired with its RefSeq identifier. If needed, you can 
then loop over the result in A. This avoids doing 18000+ separate 
database queries, so it will also be faster.
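
If you want the symbols grouped per RefSeq id afterwards, something along 
these lines might work. This is only a rough sketch: the data frame A below 
is a made-up stand-in for what getBM() returns (the identifiers and symbols 
are placeholders, not real query results), RS is a hypothetical input 
vector, and by_id is just a name chosen for illustration.

```r
# Mock stand-in for the result of the single batched getBM() call.
# Values are illustrative placeholders, not real BioMart output.
A <- data.frame(
  hgnc_symbol = c("GENE1", "GENE2", "GENE2B", "GENE3"),
  refseq_dna  = c("NM_000001", "NM_000002", "NM_000002", "NM_000003"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of queried RefSeq ids; the last one has no hit.
RS <- c("NM_000001", "NM_000002", "NM_000003", "NM_000004")

# Group the symbols by RefSeq id. Using RS as the factor levels keeps
# the list in the order of RS and gives ids without a hit an empty entry.
by_id <- split(A$hgnc_symbol, factor(A$refseq_dna, levels = RS))
```

by_id is then a list with one element per id in RS, so by_id[["NM_000002"]] 
holds both symbols returned for that id, and no per-id query is needed.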

best,
Steffen




James W. MacDonald wrote:
> J.delasHeras at ed.ac.uk wrote:
>   
>> Hi,
>>
>> I suspect this is something to do purely with my connection, but I 
>> thought I'd ask, just in case:
>>
>> I have a list of refseq ids (NM_xxxxx), 18028 of them.
>> I wanted to get the gene symbols for those genes, so I used biomaRt on 
>> the whole list. What I got was a single column data frame longer than 
>> 18028, as I get multiple results with some of these refseq ids. There 
>> doesn't seem to be an easy way to regroup them together, so I do the 
>> following instead:
>>     
>
> Using the RCurl interface for a big query like that isn't ideal. You 
> would be better off installing RMySQL and using the MySQL interface 
> (note: you can get RMySQL using biocLite(), thanks to the fine folks in 
> Seattle). Also, you can have getBM() put things in a list, so any 
> duplicated gene symbols will be grouped together.
>
> A <- getBM("hgnc_symbol", "refseq_dna", RS, mart = mart, output = 
> "list", mysql = TRUE)
>
> Should do the trick.
>
> HTH,
>
> Jim
>
>
>   
>> #create an empty list of the right length
>> A<-vector(mode="list", length=18028)
>> #now loop filling elements of the list from the biomaRt queries
>> for (i in 1:18028){
>> K<-i
>> A[[i]]<-getBM(attributes=c("hgnc_symbol"),mart=mart,filters="refseq_dna",values=c(RS[i]))
>> }
>> print(K)
>>
>> RS is a vector containing the 18028 refseq ids.
>> the K value is only so that I know where it breaks... because that's 
>> what happens... after a while, it breaks with an error message:
>>
>> Error in postForm(paste(mart@host, "?", sep = ""), query = xmlQuery) :
>>          couldn't connect to host
>>
>> This doesn't happen if I send the whole query in ONE go, in a vector... 
>> but if I do it element by element it breaks after 3-4000 queries.
>> Any ideas to do this in a simpler/better way? Or at least one that 
>> doesn't have me coming back to re-start the loop at the position of the 
>> last break?
>>
>> thanks!
>>
>> Jose
>>
>>     
>
>
>   


-- 
Steffen Durinck, Ph.D.

Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/

Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877


