[BioC] BLAST search sequence for species ID from R?

Martin Morgan mtmorgan at fhcrc.org
Mon Dec 14 19:37:54 CET 2009


John Herbert wrote:
> Dear Jos, Michael.
> To add to Michael's good advice. 
>
> Roche 454, as I understand it, generates ~450bp reads, so there should be no problem with the default BLAST word size parameter.
>
> Once you downloaded your nucleotide database files and your copy of NCBI blastall program, do something like:
>
> formatdb -i nucleotide_file -o T -p F -n nucleotide_db
>
> This generates a database to search; then to search your 454 sequences, do:
>
> blastall -a 2 -p blastn -i 454_file -d nucleotide_db -e 0.01 -F "m S" -m 8 -o output_file
>
> -F "m S" means it will ignore simple sequences in look up phase of BLAST but use them in extending HSPs (I usually use it with protein searches), you can also try -F F or -F T if it runs slow etc.
> Type: blastall with no arguments to see full list of what the arguments mean, also formatdb to see those too.  
>
> You could also try; for species sequences, I usually find ftp://ftp.ncbi.nih.gov/refseq/release download site quite useful, though for bacterial species I don't know.
>
> Also sounds like some Perl programming magic would be very useful to you here.
>
>   
The Bioconductor infrastructure here is really powerful (especially if 
you're an R user and not a perl user ;) -- library(ShortRead); fq = 
readFastq("/some/file.fastq") gets the 454 data in as a DNAStringSet. 
There are really great string manipulation / pattern matching tools in 
IRanges / Biostrings / ShortRead to operate on this data in just a few 
lines, .e.g., selecting those reads >250  and < 400 nucleotides (fq = 
fq[width(fq) > 250 & width(fq) < 400]), and then removing the trailing 
10 (?) bp (narrow(fq, start=1, end=width(fq)-10)) which tend to have 
lower quality -- easy to check with q1 = narrow(quality(fq), 
start=width(fq)-10, end=width(fq)); plot(colMeans(as(q1, "matrix"))) -- 
and save to a fastq file again (with writeFastq). Since these sorts of 
data often involve a PCR reaction during sample prep, trimLRPatterns is 
useful for removing primer remnants. The tables() function in ShortRead 
can be useful for summarizing the reads that you do have, including on 
narrowed portions (e.g., first 10 bp) where sequence / sample prep 
artifacts are likely. And 454 data tends to be 'small' so the memory 
requirements are often reasonable.

Perl is also useful as a scripting language, but to launch blast form 
within R is just a call to system().

Martin

> Good luck. 
>
> Kind regards,
>
> John.  
>
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Michael Dondrup
> Sent: 14 December 2009 12:34
> To: jos matejus
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] BLAST search sequence for species ID from R?
>
> Hi,
>
> if your search involves > thousands of sequences, it might be a good idea to download and install the NCBI blast software 
> and the nucleotide (nt) database locally and perform the searches in a program call on a fast machine. (So this is not an R task)
>
> You can then use R to further process the output files, e.g. counting top-hits to single species. If the blast output is in tabular form 
> ( blast option for this is "-m 8" )
> this should be the easiest way of getting the data into R.
>
> A different story will be to tune the blast parameters to work effectively in this short-read 454 setting. I would like to 
> propose you have a look at other metagenomics approaches, e.g. the MEGAN software:
>  http://www-ab.informatik.uni-tuebingen.de/software/megan This might also be a good tool to try.
> It does not use R though, but you can get some ideas about how to set blast parameters like decrease word size and turn of low
> complexity filtering.
>
> Best
> Michael
>
>
> Am Dec 14, 2009 um 12:02 PM schrieb jos matejus:
>
>   
>> Dear list members,
>>
>> A colleague has asked whether I can help him with a bioinformatics
>> problem he has as he knows I use R (although I don't usually use R for
>> this type of problem) and I was hoping someone might be able to point
>> me in the right direction. I have searched the mailing list archives
>> and also Googled this particular query, but without success. I ask
>> forgiveness in advance if the question is not appropriate for this
>> forum.
>>
>> Anyway, the background is that my colleague has a sample collected
>> from the field containing many species of related insects (same genus)
>> which he has obtained lots of sequence information (from 454). The
>> sequences are saved in a single fasta file. What he wants to do is to
>> query Genbank to match each sequence from the fasta file to particular
>> species (A nucleotide blast search I believe) and return the top
>> ranked match for each sequence. He can do this manually via the web
>> page, but he will have a lot of these files in the future and was
>> looking for some way of automating the process (hence using R). He
>> ultimately wants to be able to restrict the Blast search to a list of
>> preselected  Accession numbers or within genus.
>>
>> As I am not familiar with this field I was wondering whether anyone
>> knows of an existing function (or functions) that can do the job. I am
>> looking at the package seqinr at the moment to see whether this would
>> fit the bill and also whether the Biostrings package would be
>> appropriate. However, the learning curve looks a little steep and I
>> wanted to make sure I was going down the right road before investing
>> lots of time.
>>
>> Also, is there a package that I can use to access the Genbank database
>> directly from within R to do the Blast searches?
>>
>> Many many thanks in advance
>> Jos
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>     
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>   


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list