[BioC] BLAST search sequence for species ID from R?

Mon Dec 14 18:55:01 CET 2009

Dear Jos, Michael.
To add to Michael's good advice. 

Roche 454, as I understand it, generates ~450bp reads, so there should be no problem with the default BLAST word size parameter.

Once you downloaded your nucleotide database files and your copy of NCBI blastall program, do something like:

formatdb -i nucleotide_file -o T -p F -n nucleotide_db

This generates a database to search; then to search your 454 sequences, do:

blastall -a 2 -p blastn -i 454_file -d nucleotide_db -e 0.01 -F "m S" -m 8 -o output_file

-F "m S" means it will ignore simple sequences in look up phase of BLAST but use them in extending HSPs (I usually use it with protein searches), you can also try -F F or -F T if it runs slow etc.
Type: blastall with no arguments to see full list of what the arguments mean, also formatdb to see those too.  

You could also try; for species sequences, I usually find ftp://ftp.ncbi.nih.gov/refseq/release download site quite useful, though for bacterial species I don't know.

Also sounds like some Perl programming magic would be very useful to you here.

Good luck. 

Kind regards,

John.  

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Michael Dondrup
Sent: 14 December 2009 12:34
To: jos matejus
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] BLAST search sequence for species ID from R?

Hi,

if your search involves > thousands of sequences, it might be a good idea to download and install the NCBI blast software 
and the nucleotide (nt) database locally and perform the searches in a program call on a fast machine. (So this is not an R task)

You can then use R to further process the output files, e.g. counting top-hits to single species. If the blast output is in tabular form 
( blast option for this is "-m 8" )
this should be the easiest way of getting the data into R.

A different story will be to tune the blast parameters to work effectively in this short-read 454 setting. I would like to 
propose you have a look at other metagenomics approaches, e.g. the MEGAN software:
 http://www-ab.informatik.uni-tuebingen.de/software/megan This might also be a good tool to try.
It does not use R though, but you can get some ideas about how to set blast parameters like decrease word size and turn of low
complexity filtering.

Best
Michael

Am Dec 14, 2009 um 12:02 PM schrieb jos matejus:

> Dear list members,
> 
> A colleague has asked whether I can help him with a bioinformatics
> problem he has as he knows I use R (although I don't usually use R for
> this type of problem) and I was hoping someone might be able to point
> me in the right direction. I have searched the mailing list archives
> and also Googled this particular query, but without success. I ask
> forgiveness in advance if the question is not appropriate for this
> forum.
> 
> Anyway, the background is that my colleague has a sample collected
> from the field containing many species of related insects (same genus)
> which he has obtained lots of sequence information (from 454). The
> sequences are saved in a single fasta file. What he wants to do is to
> query Genbank to match each sequence from the fasta file to particular
> species (A nucleotide blast search I believe) and return the top
> ranked match for each sequence. He can do this manually via the web
> page, but he will have a lot of these files in the future and was
> looking for some way of automating the process (hence using R). He
> ultimately wants to be able to restrict the Blast search to a list of
> preselected  Accession numbers or within genus.
> 
> As I am not familiar with this field I was wondering whether anyone
> knows of an existing function (or functions) that can do the job. I am
> looking at the package seqinr at the moment to see whether this would
> fit the bill and also whether the Biostrings package would be
> appropriate. However, the learning curve looks a little steep and I
> wanted to make sure I was going down the right road before investing
> lots of time.
> 
> Also, is there a package that I can use to access the Genbank database
> directly from within R to do the Blast searches?
> 
> Many many thanks in advance
> Jos
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor