[BioC] Problem with BioStrings: patternMatch ALSO BLAST

Thu Jun 7 08:04:32 CEST 2007

Hi Paul,

Quoting Paul Leo <p.leo at uq.edu.au>:

> These commands are straight from the Biostrings; PatternMatch manual.
> Could not find any help online.
> Any advice on the key 85 error?

See below for the explanation.

> 
> 
> Otherwise has anyone do genomic BLAST from within R. Any advice on the
> best path.... Use Blastcl3 ? ,another standalone blast , ?
> Maybe a 1000 or do short sequences to blast....

There is no BLAST in Biostrings, but there is the needwunsQS() function
for _global_ alignment of 2 sequences. I just rewrote this function in C
in Biostrings devel (2.5.9) to make it faster. It is now possible to align
two 20000-letter sequences (nucleotide or amino acid) in 2 or 3 seconds,
provided that you have enough memory for this.

Look at the Alignment.pdf vignette, the man page for needwunsQS()
(?needwunsQS) and at ?scoring_matrices for more information on this.
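
To give an idea of what "global alignment" means here, below is a minimal
base-R sketch of the classic Needleman-Wunsch score computation. It is not
the needwunsQS() code: the function name nw_score and the match/mismatch/gap
values are made up for illustration, and it uses a flat match/mismatch score
instead of a real scoring matrix.

```r
# Toy Needleman-Wunsch global alignment score (illustration only,
# NOT the needwunsQS() implementation; scoring values are assumptions).
nw_score <- function(s1, s2, match = 1, mismatch = -1, gap = -2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- length(a)
  m <- length(b)
  # score[i + 1, j + 1] holds the best score for aligning the first i
  # letters of s1 with the first j letters of s2.
  score <- matrix(0, nrow = n + 1, ncol = m + 1)
  score[, 1] <- gap * (0:n)   # s1 prefix aligned against gaps only
  score[1, ] <- gap * (0:m)   # s2 prefix aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      subst <- if (a[i] == b[j]) match else mismatch
      score[i + 1, j + 1] <- max(score[i, j] + subst,      # align a[i] with b[j]
                                 score[i, j + 1] + gap,    # gap in s2
                                 score[i + 1, j] + gap)    # gap in s1
    }
  }
  score[n + 1, m + 1]
}

nw_score("ACGT", "ACGT")
nw_score("ACGT", "AGT")
```

The table in this sketch has (n + 1) * (m + 1) cells, which hints at why
memory can become a concern when both sequences are long.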

[continued below]

> 
> Thanks in advance 
> Paul
> 
> CODE
> 
> >  cI <- Mmusculus$chr13
> > length(cI)
> [1] 120614378
> > class(cI)
> [1] "DNAString"
> attr(,"package")
> [1] "Biostrings"
> 
> 
> 
> >  p <- "UUACAGUUGUUCAACCAGUUACU"

> 
> *********** ERROR****************
> >  matchPattern(p, cI, mismatch = 0)
> Error in CharBuffer.write(data, 1, length, value = src, enc = lkup) : 
>         key 85 not in lookup table
> 
> *************************************

You are applying matchPattern() to a DNAString subject (cI), so the pattern (p)
is expected to be a DNAString object too, or a string that can be converted
into a DNAString object. In your example p contains the letter U (it looks like
an RNA sequence), so it can't be converted to a DNAString object.
Key 85 is the ASCII code of the letter U: this code does not belong to
the lookup table used internally to encode the letters of a DNAString object
(the letters stored in a DNAString or RNAString object are encoded; this is a
trick that allows some fast searching algorithms).
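
A quick way to see where the "85" comes from, and to check a pattern before
calling matchPattern(), is sketched below in base R. The helper
non_dna_letters is hypothetical (it is not part of Biostrings), and it
assumes a plain A/C/G/T alphabet, which is a simplification of what a
DNAString may accept.

```r
p <- "UUACAGUUGUUCAACCAGUUACU"

# ASCII code of the offending letter: this is the "key" in the error message.
utf8ToInt("U")

# Hypothetical helper: return the letters of x that are not in the
# given alphabet (assumed here to be plain A, C, G, T).
non_dna_letters <- function(x, alphabet = c("A", "C", "G", "T")) {
  letters_in_x <- unique(strsplit(x, "")[[1]])
  setdiff(letters_in_x, alphabet)
}

bad <- non_dna_letters(p)
bad
```

If bad is non-empty, the pattern needs converting before it can be matched
against a DNAString subject.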

I agree that the current error message is not really helpful, sorry! I plan
to work on this and to improve the overall documentation too.

Your pattern (p) can first be converted into an RNAString object with:
  rnap <- RNAString(p)
and then into a DNAString object with:
  dnap <- DNAString(rnap)
This will not necessarily give you what you want, since when converting
from RNAString to DNAString (or vice versa) the following translation is
currently applied:
  U <-> A
  G <-> C
  C <-> G
  A <-> T
(which mimics the transcription/reverse transcription process), so
you might want to apply complement() (or reverseComplement()) to
the result.
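
To make the effect of that translation concrete, here is a base-R sketch that
emulates it with chartr() on plain character strings. It only mimics the
mapping listed above and does not call Biostrings; the variable names are
made up, and it assumes the pattern contains only A, C, G and U.

```r
p <- "UUACAGUUGUUCAACCAGUUACU"

# Emulate the RNAString -> DNAString translation described above
# (U->A, G->C, C->G, A->T). Assumption: plain A/C/G/U letters only.
dna_translated <- chartr("UGCA", "ACGT", p)

# DNA complement (A<->T, C<->G), standing in for complement().
dna_pattern <- chartr("ACGT", "TGCA", dna_translated)

# Reverse of the complement, standing in for reverseComplement().
dna_revcomp <- paste(rev(strsplit(dna_pattern, "")[[1]]), collapse = "")

dna_translated
dna_pattern
dna_revcomp
```

In this sketch, complementing the translated sequence gives back the original
letters with every U replaced by T, which is probably the pattern you want to
search for on the given strand; the reverse complement would be the one to use
for the opposite strand.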

Cheers,
H.

> 
> > sessionInfo()
> R version 2.5.0 (2007-04-23) 
> i386-pc-mingw32 
> 
> locale:
> LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
> 
> attached base packages:
> [1] "tools"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
> [7] "methods"   "base"
> 
> other attached packages:
> BSgenome.Mmusculus.UCSC.mm8                    BSgenome 
>                     "1.2.0"                     "1.4.0" 
>                     Biobase                  Biostrings 
>                    "1.14.0"                     "2.4.5" 
> 
> 