[BioC] biomaRt queries: optimal size?

J.delasHeras at ed.ac.uk
Mon Dec 21 17:09:52 CET 2009


Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:

> Hi Jose,
>
> J.delasHeras at ed.ac.uk wrote:
>>
>> I've recently started to use biomaRt seriously. In the past I just
>> did a few tens of searches and everything worked fine. Now I have
>> several datasets of several thousand IDs each.
>>
>> I imagine that sending a single search with 3000 IDs might not be a
>> good idea. I tried, and it broke after a while... and I got no
>> results.
>
> A query of 3000 IDs is no problem for biomaRt - you should be able to
> do a much larger query than that without any trouble.
>
> It would be helpful if you tried your query again and if it fails, send
> the results of a traceback().


Hi James,

Thanks for the reply.
After what you said, I tried my 1545 IDs again in a single query,
rather than in blocks of 200. This time I got a different error (after
a good 30-40 min), which suggests a memory issue:

"Error in gsub("\n", "", postRes) :
   Calloc could not allocate (841769536 of 1) memory"

which surprised me because, as far as I can tell, I have plenty of
memory available (the failed allocation is only ~800 MB)...

I do expect the results to be a large data frame, as I'm retrieving a
number of different attributes, so each original ID ends up producing
a good number of rows (which I would process later).

For completeness, in case it matters, my query is:

> BMresults <- getBM(attributes = dataset.attributes,
+                    filters = "entrezgene",
+                    values = geneids, mart = ensembl)


where

geneids (values): 1545 Entrez Gene IDs (human)
dataset: human ("hsapiens_gene_ensembl")
mart: ensembl
attributes:
              "entrezgene",
              "ensembl_gene_id",
              "go_cellular_component_id",
              "go_biological_process_id",
              "go_molecular_function_id",
              "go_cellular_component__dm_name_1006",
              "name_1006",
              "go_molecular_function__dm_name_1006",
              "goslim_goa_accession",
              "goslim_goa_description"

Similar queries on the mouse and rat datasets (1200 and 950 IDs
respectively) worked fine.

In this case traceback() only shows that it was removing end-of-line
characters from some object:

> traceback()
2: gsub("\n", "", postRes)
1: getBM(attributes = dataset.attributes, filters = "entrezgene",
        values = geneids, mart = ensembl)

If I'm running out of memory (I'm running Windows XP, 32-bit, with 4 GB
of RAM... but I suspect R may not be able to use the 3 GB I try to make
available via the memory.size() function), then I suppose dividing the
task into two or three queries might help... just not dozens of them.
Any other suggestions?
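
In case it's useful, this is the kind of split I have in mind (a
sketch only; the number of blocks is arbitrary, and it assumes the
geneids, dataset.attributes and ensembl objects from above):

## split the IDs into a few blocks, query each block separately,
## then rbind the partial results into one data frame
n.blocks <- 3
blocks <- split(geneids, cut(seq_along(geneids), n.blocks, labels = FALSE))
BMresults <- do.call(rbind,
                     lapply(blocks, function(ids)
                         getBM(attributes = dataset.attributes,
                               filters = "entrezgene",
                               values = ids, mart = ensembl)))

Each block would then only need a fraction of the memory for its
intermediate result.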

Jose

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
