[BioC] GEOquery and parsing SOFT files
huber at ebi.ac.uk
Mon May 25 21:15:56 CEST 2009
Thank you for the feedback and for pointing this out. Two general remarks:
1. Please include a reproducible example (R script) so that others can
reproduce what you observed, followed by the output of sessionInfo().
2. Robert Gentleman's book "R Programming for Bioinformatics" (as well
as many free sources on the web) describes how to profile R code in
order to see in which functions the CPU time is spent. Based on this,
you can investigate where to invest developer time for improving the code.
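To illustrate both remarks, here is a minimal sketch of a self-contained
script that profiles a function with Rprof() and summaryRprof() from the
utils package and ends with sessionInfo(). The function slow_parse and its
input are hypothetical stand-ins, not GEOquery code; in practice the profiled
call would be something like getGEO(filename = ...) on the local SOFT file.

```r
# Hypothetical stand-in for a slow parser (NOT GEOquery code): it grows a
# vector one element at a time, a common cause of quadratic run time in R.
slow_parse <- function(lines) {
  out <- character(0)
  for (ln in lines) {
    out <- c(out, toupper(ln))  # reallocates and copies 'out' on every pass
  }
  out
}

# Made-up input that only mimics the look of a SOFT attribute line.
lines <- rep("!Sample_title = example", 30000)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start the sampling profiler
res <- slow_parse(lines)           # replace with the call under investigation
Rprof(NULL)                        # stop profiling

prof <- summaryRprof(prof_file)
print(head(prof$by.self))    # time spent in each function itself
print(head(prof$by.total))   # time including the functions it calls

# Report the R version and loaded packages along with the example.
print(sessionInfo())
```

The by.self table is usually the place to start: functions near the top of
it are where the CPU time is actually consumed.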
Wacek Kusnierczyk wrote:
> The getGEO function from GEOquery parses GEO soft files. With a
> particular GSE file (GSE13638), it took over 15 minutes on my
> not-so-crappy machine to parse the file (a local file, download time
> excluded). I've written a simple parser in perl, and parsing the same
> file and storing the data in a nested hash/array structure takes ca. 2
> seconds. I'm pretty sure there is more essential processing done by
> getGEO to organize the data into a GSE object, but still, there seems to
> be an incredibly inefficient implementation underneath.
> I haven't looked at the source code yet, but here's a question: what is
> the likely reason getGEO is so slow? Is it the parsing itself, or
> rather wrapping the data into the appropriate structure? Where should I
> start to look for code to be improved?
Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber