[BioC] BLAST search sequence for species ID from R?

Sean Davis seandavi at gmail.com
Mon Dec 14 13:03:51 CET 2009


On Mon, Dec 14, 2009 at 6:02 AM, jos matejus <matejus106 at googlemail.com> wrote:
> Dear list members,
>
> A colleague has asked whether I can help him with a bioinformatics
> problem he has as he knows I use R (although I don't usually use R for
> this type of problem) and I was hoping someone might be able to point
> me in the right direction. I have searched the mailing list archives
> and also Googled this particular query, but without success. I ask
> forgiveness in advance if the question is not appropriate for this
> forum.
>
> Anyway, the background is that my colleague has a sample collected
> from the field containing many species of related insects (same genus)
> which he has obtained lots of sequence information (from 454). The
> sequences are saved in a single fasta file. What he wants to do is to
> query Genbank to match each sequence from the fasta file to particular
> species (A nucleotide blast search I believe) and return the top
> ranked match for each sequence. He can do this manually via the web
> page, but he will have a lot of these files in the future and was
> looking for some way of automating the process (hence using R). He
> ultimately wants to be able to restrict the Blast search to a list of
> preselected  Accession numbers or within genus.
>
> As I am not familiar with this field I was wondering whether anyone
> knows of an existing function (or functions) that can do the job. I am
> looking at the package seqinr at the moment to see whether this would
> fit the bill and also whether the Biostrings package would be
> appropriate. However, the learning curve looks a little steep and I
> wanted to make sure I was going down the right road before investing
> lots of time.
>
> Also, is there a package that I can use to access the Genbank database
> directly from within R to do the Blast searches?

Hi, Jos.

R/Bioconductor has the tools that you could use to rewrite blast if
you like, but for cross-species comparisons, blast is a pretty good
tool.  You might consider downloading the blast source or a convenient
binary from NCBI and running it locally.  Of course, a command-line
utility like blast can be easily scripted from R.  As for getting
records from GenBank, consider whether you need programmatic control
over the process at all.  It may be that your colleague already has a
way to get the sequences he/she wants directly and just wants help
with running blast.  However, if it does appear that programmatic
access is necessary, have a look at the EUtilities web services.  With
EUtilities, it is pretty straightforward to construct a URL, use
something like getURL() from the Rcurl package, and get the results.

Hope that helps.

Sean



More information about the Bioconductor mailing list