[BioC] Blast analysis of two sequences in R

Cook, Malcolm MEC at stowers.org
Wed Apr 23 23:47:45 CEST 2014


Hi,

I tend in cases like this to shirk wrappers and stay closer to the command line usage of tools such as blast.

Below is a generic example of run NCBI blastn (part of blast+ package) under R, and post filter the results.

The approach should work with minor changes if you have not upgraded to NCBI's blast+.

~Malcolm

s<-
    ## The sequences to be blasted and their fasta 'deflines'
    c('>s1','CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC'
      ,'>s2','CTATGTCATCCTCAATCTCTATGAAAGCAACACCGCTACCATAGA'
      )

blast.f6<-
    ## The fields you want back from blast.  c.f. `blastn -help`
    c('qseqid', 'sseqid', 'pident', 'qcovs') 

blast.out<-
    ## The system call to blastn
    system2('blastn',c('-db',"'nt'"
                       ,'-outfmt',sprintf('"6 %s"',paste(collapse=' ',blast.f6))
                       , '-perc_identity',"'.90'"
                       ,'-num_threads', 15 # use 'em if you got 'em !
                       )
            ,input=s
            ,stdout=TRUE
            )

blast.out.df<-
    ## parse blast.out as a table and assign names to it
    `names<-`(read.table(sep='\t',textConnection(b)),blast.f6)

# Query the data frame
head(blast.out.df[with(b.df,pident>=90 & qcovs>95),],3)

  qseqid                      sseqid pident qcovs
1     s1 gi|380719094|gb|JQ281544.1|    100   100
2     s1 gi|161015434|gb|EU143287.2|    100   100
3     s1 gi|161015430|gb|EU143282.2|    100   100


~ Malcolm

 >-----Original Message-----
 >From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Martin Morgan
 >Sent: Wednesday, April 23, 2014 12:12 PM
 >To: prabhakar ghorpade; bioconductor at r-project.org
 >Subject: Re: [BioC] Blast analysis of two sequences in R
 >
 >On 04/21/2014 03:51 AM, prabhakar ghorpade wrote:
 >> Hello,
 >>        I have following sequences for which I want to BLAST and see result only for sequences showing > 95% Query coverage and
 >>90% identity?
 >>
 >> Sequences
 >>> 1
 >> CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC
 >>> 2
 >> CTATGTCATCCTCAATCTCTATGAAAGCAACACCGCTACCATAGA
 >
 >The 'annotate' package has 'blastSequences'; I'm not sure that it's useful
 >enough for your purposes. In the 'devel' branch (see
 >
 >   http://bioconductor.org/developers/how-to/useDevel/
 >
 >it has been updated to be more responsive and to return richer data, e.g.,
 >
 >    df = blastSequences("CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC",
 >                        timeout=40, as="data.frame")
 >
 > > head(df, 1)
 >   Hit_num                      Hit_id
 >1       1 gi|380719094|gb|JQ281544.1|
 >                                         Hit_def Hit_accession Hit_len Hsp_num
 >1 Expression vector pAV-UCSF, complete sequence      JQ281544   11534       1
 >   Hsp_bit-score Hsp_score Hsp_evalue Hsp_query-from Hsp_query-to Hsp_hit-from
 >1       82.4379        90 1.2063e-13              1           45         2126
 >   Hsp_hit-to Hsp_query-frame Hsp_hit-frame Hsp_identity Hsp_positive Hsp_gaps
 >1       2170               1             1           45           45        0
 >   Hsp_align-len                                      Hsp_qseq
 >1            45 CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC
 >                                        Hsp_hseq
 >1 CTTTGTCTTCCCCTATGTCATCCTCAATCTCTATGAAAGCAACAC
 >                                     Hsp_midline
 >1 |||||||||||||||||||||||||||||||||||||||||||||
 > >
 >
 >Hope that helps; would be happy to hear of other R solutions.
 >
 >Martin
 >
 >>
 >>
 >> Can you please Suggest how can I select them in R in NCBI BLAST so that I get sequences showing > 95% Query coverage and >90%
 >identity. Is there program in R to select them? I want to detect number of organism showing uniques results for given sequences.
 >>   Thanks.
 >>
 >>
 >> Dr. Ghorpade Prabhakar B.
 >> PhD Scholar ( Veterinary Biochemistry),
 >> IVRI,
 >> Izatnagar,
 >> Bareilly, U.P.,
 >> India
 >> 	[[alternative HTML version deleted]]
 >>
 >>
 >>
 >> _______________________________________________
 >> Bioconductor mailing list
 >> Bioconductor at r-project.org
 >> https://stat.ethz.ch/mailman/listinfo/bioconductor
 >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
 >>
 >
 >
 >--
 >Computational Biology / Fred Hutchinson Cancer Research Center
 >1100 Fairview Ave. N.
 >PO Box 19024 Seattle, WA 98109
 >
 >Location: Arnold Building M1 B861
 >Phone: (206) 667-2793
 >
 >_______________________________________________
 >Bioconductor mailing list
 >Bioconductor at r-project.org
 >https://stat.ethz.ch/mailman/listinfo/bioconductor
 >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list