[BioC] question regarding BSGenome/Biostrings

Elizabeth Purdom epurdom at stat.berkeley.edu
Fri Nov 21 01:37:43 CET 2008


Hello,

I am trying to run code like that in the pdf documentation of BSGenome 
to match patterns to all the chromosomes of a genome. I have copied the 
function 'runOneStrandAnalysis' from the documentation and have changed 
runAnalysis2 to be the human genome (BSgenome.Hsapiens.UCSC.hg18) as 
follows:

  runAnalysis2 <- function(dict0, outfile="")
  {
  seqnames <- seqnames(Hsapiens)
  runOneStrandAnalysis(dict0, Celegans, seqnames, "+", outfile=outfile, 
append=FALSE)
  runOneStrandAnalysis(dict0, Celegans, seqnames, "-", outfile=outfile, 
append=TRUE)
  }

When I run the code using the patterns in the example, it runs fine 
(though of course the patterns in the dictionary were intended for C 
elegans, but that's irrelevant):

runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")

Instead I want to find if regions in the genome are repeated. So I made 
a dictionary that is class 'XStringViews' by using views to grab the 
regions:

MAPLENGTH<-36
WRITELENGTH<-100
myDictViews<-views(Hsapiens$chr1, 1:WRITELENGTH, 
(1:WRITELENGTH)+MAPLENGTH-1)

And I get the following error:
runAnalysis2(myDictViews, outfile="sample.txt")
 >>> Finding all hits in strand + of chromosome chr1 ...
Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
   Biostrings internal error in _init_chrtrtable(): invalid code -1

The problem is in matchPDict:
mypdict<-PDict(myDictViews)
subject <- Hsapiens[["chr1"]]
mindex <- matchPDict(mypdict, subject)
Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
   Biostrings internal error in _init_chrtrtable(): invalid code -1

I have checked and there are only A,C,G,T values in this region. Is it 
because I have a 'XStringViews' object rather than a DNAStringSet 
object? If this is a bug (I know this part is still under progress) is 
there a quick fix, like converting from 'XStringViews' to 'DNAStringSet'?

Also, if I want to do this on a large scale, is this a reasonable way to 
go or are there tricks for this that would speed it up? I'm still not 
sure that I will ultimately do this in R, but just to know.

Thanks,
Elizabeth
UC, Berkeley



More information about the Bioconductor mailing list