[BioC] biomaRt queries: optimal size?

Wolfgang Huber whuber at embl.de
Mon Dec 21 17:45:28 CET 2009


Dear Jose

Try these:

1. Set
	options(error=recover)
and then use the 'post-mortem' debugger to see why postRes (a character
string) is so large (a sketch follows below). Let us know what you find!

2. Rather than splitting up the query genes, you could split up the
attributes and ask for only a few at a time, and/or see which one
causes the large result (also sketched below).

3. Send us a reproducible example (i.e. one that others can reproduce by 
copy-pasting from your email).
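
For (1), a minimal sketch of what such a session might look like. The
frame numbers in the comments are only illustrative, and 'postRes' is an
object internal to getBM, so the exact call stack you see may differ:

	## enable the post-mortem debugger
	options(error = recover)

	BMresults <- getBM(attributes = dataset.attributes,
	                   filters    = "entrezgene",
	                   values     = geneids,
	                   mart       = ensembl)

	## When the error occurs, recover() lists the call stack, e.g.
	##   1: getBM(attributes = dataset.attributes, ...)
	##   2: gsub("\n", "", postRes)
	## Select the getBM frame (where postRes lives), then inspect it:
	##   Browse[1]> object.size(postRes)
	##   Browse[1]> nchar(postRes)
	## Quit the browser with Q, and reset the error option with
	options(error = NULL)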
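
For (2), a sketch of one possible way to split up the attributes. The
grouping below is just an example; the filter ("entrezgene") is kept in
every group so the pieces can be matched up afterwards:

	attribute.groups <- list(
	    c("entrezgene", "ensembl_gene_id"),
	    c("entrezgene", "go_cellular_component_id",
	      "go_cellular_component__dm_name_1006"),
	    c("entrezgene", "go_biological_process_id", "name_1006"),
	    c("entrezgene", "go_molecular_function_id",
	      "go_molecular_function__dm_name_1006"),
	    c("entrezgene", "goslim_goa_accession", "goslim_goa_description"))

	partial <- lapply(attribute.groups, function(a)
	    getBM(attributes = a, filters = "entrezgene",
	          values = geneids, mart = ensembl))

	## which piece is responsible for the huge result?
	sapply(partial, object.size)
	sapply(partial, nrow)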

	Best wishes
	Wolfgang




J.delasHeras at ed.ac.uk wrote on 12/21/2009 05:09 PM:
> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
> 
>> Hi Jose,
>>
>> J.delasHeras at ed.ac.uk wrote:
>>>
>>> I've recently started to use biomaRt seriously. In the past I just
>>> did a few tens of searches and all worked fine. Now I have several
>>> datasets of several thousand IDs each.
>>>
>>> I imagine that sending a single search with 3000 IDs might not be a
>>> good idea. I tried, and it broke after a while... and got no results.
>>
>> A query of 3000 IDs is no problem for biomaRt - you should be able to
>> do a much larger query than that without any trouble.
>>
>> It would be helpful if you tried your query again and if it fails, send
>> the results of a traceback().
> 
> 
> Hi James,
> 
> thanks for the reply.
> After what you said, I tried my 1545 IDs again in one single query,
> rather than in blocks of 200. I got a different error (after a good
> 30-40 min), which suggests a memory issue now:
> 
> "Error in gsub("\n", "", postRes) :
>   Calloc could not allocate (841769536 of 1) memory"
> 
> which surprised me because, as far as I can tell, I have plenty of memory
> available...
> 
> I do expect the results to be a large data frame, as I'm retrieving a
> number of different attributes, so each original ID ends up producing a
> good number of rows (which I would later process).
> 
> For completeness, in case it matters, my query is:
> 
>> BMresults<-getBM(attributes=dataset.attributes,
> +                   filters="entrezgene",
> +                   values=geneids, mart=ensembl)
> 
> 
> where
> 
> geneids (values): 1545 Entrez Gene IDs (human)
> dataset: human ("hsapiens_gene_ensembl")
> mart: ensembl
> attributes:
>              "entrezgene",
>              "ensembl_gene_id",
>              "go_cellular_component_id",
>              "go_biological_process_id",
>              "go_molecular_function_id",
>              "go_cellular_component__dm_name_1006",
>              "name_1006",
>              "go_molecular_function__dm_name_1006",
>              "goslim_goa_accession",
>              "goslim_goa_description"
> 
> Similar queries on the mouse and rat datasets (1200 and 950 IDs
> respectively) worked OK.
> 
> In this case traceback() only shows that it was removing end-of-line
> characters from some object:
> 
>> traceback()
> 2: gsub("\n", "", postRes)
> 1: getBM(attributes = dataset.attributes, filters = "entrezgene",
>        values = geneids, mart = ensembl)
> 
> If I'm running out of memory (running Windows XP, 32-bit, 4 GB RAM... but
> I suspect R may not be able to use the 3 GB I try to make available using
> the memory.size() function), then I suppose dividing the task into two
> queries (or three) might help... just not dozens of them. Any other
> suggestions?
> 
> Jose




--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact


