[BioC] Batch sequence retrieval

Sean Davis sdavis2 at mail.nih.gov
Mon Jun 18 14:50:03 CEST 2007


Daniel Brewer wrote:
> Hi all,
> 
> I am in a situation where I would like to download all the sequences
> associated with human IMAGE clones and then BLAST them against a range
> of other sequences (~3.4 million).  I have the accession numbers for
> all of them.
> I have tried a number of ways to do this:
> 1) Search for "IMAGE: homo sapiens" and download the fasta sequence.
> This fails after a while for no reason.
> 2) A script using getSeq from the annotate library.  This is very slow,
> but is chugging away.
> 3) The batchentrez utility.  There seems to be a problem with the link
> at the moment.
> 
> Has anyone got any suggestions for a better way to do this?  Does
> GenBank allow SQL access?

GenBank is not stored in an SQL database.  The closest NCBI gets to
programmatic access is E-utilities (Eutils).  Have you considered
downloading the appropriate BLAST database and then limiting it by GI
number?  That technique is made for exactly what you describe.  You
simply need a file of the GI numbers associated with your sequences,
and you can then use formatdb to create a custom BLAST database.
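
As a rough illustration of the GI-list approach, here is a minimal
Python sketch.  It writes a file with one GI number per line and builds
(but does not run) a formatdb command line that restricts an existing
nucleotide database to those GIs.  The helper names, the database name
"est_human", and the file and alias names are assumptions made for the
example, and the -F (GI file) and -L (alias name) flags reflect my
understanding of the legacy formatdb tool, so please check them against
the documentation for your installed version.

```python
from pathlib import Path


def write_gi_list(gis, path):
    """Write one GI number per line, skipping blanks, non-numeric
    entries, and duplicates. Returns the number of GIs written."""
    seen = set()
    cleaned = []
    for gi in gis:
        gi = str(gi).strip()
        if not gi.isdigit():
            continue  # not a plain integer GI, so leave it out
        if gi not in seen:
            seen.add(gi)
            cleaned.append(gi)
    Path(path).write_text("".join(g + "\n" for g in cleaned))
    return len(cleaned)


def formatdb_alias_command(source_db, gi_file, alias_name):
    """Build an argument list for legacy formatdb (assumed flags):
    -i source database, -p F for nucleotide, -F GI list file,
    -L name of the alias database to create. Nothing is executed."""
    return ["formatdb", "-i", source_db, "-p", "F",
            "-F", str(gi_file), "-L", alias_name]


if __name__ == "__main__":
    # Hypothetical GI numbers, for illustration only.
    count = write_gi_list(["1000001", "1000002", "1000001"],
                          "image_clones.gi")
    print(count, "GIs written")
    print(" ".join(formatdb_alias_command("est_human",
                                          "image_clones.gi",
                                          "image_clones")))
```

The command is returned as a list so that it can be inspected first
and then passed to subprocess.run once the flags have been confirmed.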

Sean


