[BioC] question regarding BSGenome/Biostrings

Elizabeth Purdom epurdom at stat.berkeley.edu
Fri Nov 21 01:37:43 CET 2008


I am trying to run code like that in the pdf documentation of BSGenome 
to match patterns to all the chromosomes of a genome. I have copied the 
function 'runOneStrandAnalysis' from the documentation and have changed 
runAnalysis2 to be the human genome (BSgenome.Hsapiens.UCSC.hg18) as 

  runAnalysis2 <- function(dict0, outfile="")
  seqnames <- seqnames(Hsapiens)
  runOneStrandAnalysis(dict0, Celegans, seqnames, "+", outfile=outfile, 
  runOneStrandAnalysis(dict0, Celegans, seqnames, "-", outfile=outfile, 

When I run the code using the patterns in the example, it runs fine 
(though of course the patterns in the dictionary were intended for C 
elegans, but that's irrelevant):

runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")

Instead I want to find if regions in the genome are repeated. So I made 
a dictionary that is class 'XStringViews' by using views to grab the 

myDictViews<-views(Hsapiens$chr1, 1:WRITELENGTH, 

And I get the following error:
runAnalysis2(myDictViews, outfile="sample.txt")
 >>> Finding all hits in strand + of chromosome chr1 ...
Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
   Biostrings internal error in _init_chrtrtable(): invalid code -1

The problem is in matchPDict:
subject <- Hsapiens[["chr1"]]
mindex <- matchPDict(mypdict, subject)
Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
   Biostrings internal error in _init_chrtrtable(): invalid code -1

I have checked and there are only A,C,G,T values in this region. Is it 
because I have a 'XStringViews' object rather than a DNAStringSet 
object? If this is a bug (I know this part is still under progress) is 
there a quick fix, like converting from 'XStringViews' to 'DNAStringSet'?

Also, if I want to do this on a large scale, is this a reasonable way to 
go or are there tricks for this that would speed it up? I'm still not 
sure that I will ultimately do this in R, but just to know.

UC, Berkeley

More information about the Bioconductor mailing list