[BioC] biomaRt queries: optimal size?

Wolfgang Huber whuber at embl.de
Tue Dec 22 14:04:59 CET 2009


Hola José

Sorry for the name confusion. The way that BioMart presents many-to-one
relationships (producing one single big table with all queried
attributes, and possibly lots of repetitions in some columns) can be
very space-inefficient. This is the price that system's design pays
for its simplicity.
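
To illustrate the row multiplication, here is a minimal sketch in base R.
The data are made up (the gene id, GO terms and probe names are
hypothetical) and no BioMart server is contacted; the merge() only mimics,
as far as I understand it, how a single table for several multi-valued
attributes ends up with one row per combination:

```r
# Hypothetical toy data: one gene with 3 GO terms and 4 probe ids.
go    <- data.frame(gene = "g1", go_id = c("GO:a", "GO:b", "GO:c"),
                    stringsAsFactors = FALSE)
probe <- data.frame(gene = "g1", probe = c("p1", "p2", "p3", "p4"),
                    stringsAsFactors = FALSE)

# One big table holding both attributes: every GO term is paired
# with every probe id of the same gene.
wide   <- merge(go, probe, by = "gene")
n_wide <- nrow(wide)                 # product of the counts: 3 * 4

# One table per attribute: the same information in far fewer rows.
n_split <- nrow(go) + nrow(probe)    # sum of the counts: 3 + 4
```

With more attributes and more values per gene, the product grows much
faster than the sum, which is why asking for a few attributes at a time
keeps the results small.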

Anyway, I don't think it should return table rows that are completely
identical. If you (or someone else here) come across such an
instance, then please report it on this list!
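
A quick way to check a result for completely identical rows, sketched
here on a made-up data frame (`res` is a hypothetical stand-in for
whatever table your query returned):

```r
# Hypothetical result table; rows 1 and 2 are completely identical.
res <- data.frame(gene  = c("g1", "g1", "g1"),
                  go_id = c("GO:a", "GO:a", "GO:b"),
                  stringsAsFactors = FALSE)

dup        <- duplicated(res)   # TRUE for rows identical to an earlier row
n_dup      <- sum(dup)          # number of redundant rows
res_unique <- unique(res)       # the table with those rows dropped
```

If n_dup is greater than zero on a real query result, that would be the
kind of instance worth reporting.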

	Best wishes
	Wolfgang


PS Do you know the way to San ... :)

J.delasHeras at ed.ac.uk scripsit 12/21/2009 07:03 PM:
> Quoting Wolfgang Huber <whuber at embl.de>:
> 
>>
>> Dear Javier
>>
>> Try these:
>>
>> 1. Set
>>     options(error=recover)
>> and then use the 'post mortem' debugger to see why postRes (a character
>> string) is so large. Let us know what you find!
>>
>> 2. Rather than splitting up the query genes, you could split up the
>> attributes, and only ask for a few at a time, and/or see which one
>> causes the large size of the result
>>
>> 3. Send us a reproducible example (i.e. one that others can reproduce
>> by copy-pasting from your email).
>>
>>     Best wishes
>>     Wolfgang
> 
> 
> "My name is not Javier!!!"
> 
> (you had to be in Spain in the 80s to get the joke... never mind, it was
> a silly pop song ;-)
> 
> Thank you for the suggestions. I managed to finish what I was doing 
> (breaking the query into chunks of 200 ids at a time) but I have some
> more searches coming and will definitely use a different approach, and 
> try the options(error=recover) method to investigate if I have problems.
> 
> My query, as you suggest above, would be better performed by using fewer
> attributes, rather than splitting the ids. I just didn't have enough
> experience in this. When using multiple attributes, the resulting data
> frame may contain quite a few more rows of data, if there are multiple
> values for some of the attributes... and this happens a lot when looking
> at gene ontologies.
> I may have started with a 1545 id vector, but ended up with a data frame
> containing nearly 4 million rows! (assembled from 8 individual queries
> of ~200 ids at a time) I will definitely not do it again this way!
> Much better to pick fewer attributes and then process the data, and then
> I'll probably be able to process all IDs at once.
> 
> Thank you for your help, Wolfgang and Jim.
> 
> Jose
> 


--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact


