[BioC] BLAST search sequence for species ID from R?

Mon Dec 14 19:44:29 CET 2009

Martin Morgan wrote:
> John Herbert wrote:
>> Dear Jos, Michael.
>> To add to Michael's good advice.
>> Roche 454, as I understand it, generates ~450bp reads, so there 
>> should be no problem with the default BLAST word size parameter.
>>
>> Once you downloaded your nucleotide database files and your copy of 
>> NCBI blastall program, do something like:
>>
>> formatdb -i nucleotide_file -o T -p F -n nucleotide_db
>>
>> This generates a database to search; then to search your 454 
>> sequences, do:
>>
>> blastall -a 2 -p blastn -i 454_file -d nucleotide_db -e 0.01 -F "m S" 
>> -m 8 -o output_file
>>
>> -F "m S" means it will ignore simple sequences in look up phase of 
>> BLAST but use them in extending HSPs (I usually use it with protein 
>> searches), you can also try -F F or -F T if it runs slow etc.
>> Type: blastall with no arguments to see full list of what the 
>> arguments mean, also formatdb to see those too. 
>> You could also try; for species sequences, I usually find 
>> ftp://ftp.ncbi.nih.gov/refseq/release download site quite useful, 
>> though for bacterial species I don't know.
>>
>> Also sounds like some Perl programming magic would be very useful to 
>> you here.
>>
>>   
> The Bioconductor infrastructure here is really powerful (especially if 
> you're an R user and not a perl user ;) -- library(ShortRead); fq = 
> readFastq("/some/file.fastq") gets the 454 data in as a DNAStringSet.

Oops, that's assuming it's in fastq format; maybe it's just fasta 
(Biostrings::read.DNAStringSet) or in separate files (ShortRead's 
read454) or even in columns (ShortRead's readXStringColumns).

Martin
> There are really great string manipulation / pattern matching tools in 
> IRanges / Biostrings / ShortRead to operate on this data in just a few 
> lines, .e.g., selecting those reads >250  and < 400 nucleotides (fq = 
> fq[width(fq) > 250 & width(fq) < 400]), and then removing the trailing 
> 10 (?) bp (narrow(fq, start=1, end=width(fq)-10)) which tend to have 
> lower quality -- easy to check with q1 = narrow(quality(fq), 
> start=width(fq)-10, end=width(fq)); plot(colMeans(as(q1, "matrix"))) 
> -- and save to a fastq file again (with writeFastq). Since these sorts 
> of data often involve a PCR reaction during sample prep, 
> trimLRPatterns is useful for removing primer remnants. The tables() 
> function in ShortRead can be useful for summarizing the reads that you 
> do have, including on narrowed portions (e.g., first 10 bp) where 
> sequence / sample prep artifacts are likely. And 454 data tends to be 
> 'small' so the memory requirements are often reasonable.
>
> Perl is also useful as a scripting language, but to launch blast form 
> within R is just a call to system().
>
> Martin
>
>> Good luck.
>> Kind regards,
>>
>> John. 
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch 
>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Michael 
>> Dondrup
>> Sent: 14 December 2009 12:34
>> To: jos matejus
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] BLAST search sequence for species ID from R?
>>
>> Hi,
>>
>> if your search involves > thousands of sequences, it might be a good 
>> idea to download and install the NCBI blast software and the 
>> nucleotide (nt) database locally and perform the searches in a 
>> program call on a fast machine. (So this is not an R task)
>>
>> You can then use R to further process the output files, e.g. counting 
>> top-hits to single species. If the blast output is in tabular form ( 
>> blast option for this is "-m 8" )
>> this should be the easiest way of getting the data into R.
>>
>> A different story will be to tune the blast parameters to work 
>> effectively in this short-read 454 setting. I would like to propose 
>> you have a look at other metagenomics approaches, e.g. the MEGAN 
>> software:
>>  http://www-ab.informatik.uni-tuebingen.de/software/megan This might 
>> also be a good tool to try.
>> It does not use R though, but you can get some ideas about how to set 
>> blast parameters like decrease word size and turn of low
>> complexity filtering.
>>
>> Best
>> Michael
>>
>>
>> Am Dec 14, 2009 um 12:02 PM schrieb jos matejus:
>>
>>  
>>> Dear list members,
>>>
>>> A colleague has asked whether I can help him with a bioinformatics
>>> problem he has as he knows I use R (although I don't usually use R for
>>> this type of problem) and I was hoping someone might be able to point
>>> me in the right direction. I have searched the mailing list archives
>>> and also Googled this particular query, but without success. I ask
>>> forgiveness in advance if the question is not appropriate for this
>>> forum.
>>>
>>> Anyway, the background is that my colleague has a sample collected
>>> from the field containing many species of related insects (same genus)
>>> which he has obtained lots of sequence information (from 454). The
>>> sequences are saved in a single fasta file. What he wants to do is to
>>> query Genbank to match each sequence from the fasta file to particular
>>> species (A nucleotide blast search I believe) and return the top
>>> ranked match for each sequence. He can do this manually via the web
>>> page, but he will have a lot of these files in the future and was
>>> looking for some way of automating the process (hence using R). He
>>> ultimately wants to be able to restrict the Blast search to a list of
>>> preselected  Accession numbers or within genus.
>>>
>>> As I am not familiar with this field I was wondering whether anyone
>>> knows of an existing function (or functions) that can do the job. I am
>>> looking at the package seqinr at the moment to see whether this would
>>> fit the bill and also whether the Biostrings package would be
>>> appropriate. However, the learning curve looks a little steep and I
>>> wanted to make sure I was going down the right road before investing
>>> lots of time.
>>>
>>> Also, is there a package that I can use to access the Genbank database
>>> directly from within R to do the Blast searches?
>>>
>>> Many many thanks in advance
>>> Jos
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>     
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>   
>
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793