[BioC] Restriction enzyme digestion

Herve Pages hpages at fhcrc.org
Tue May 13 19:05:52 CEST 2008


Hi Anh,

I fixed a few things in Biostrings:

- The "as.character" method for XStringViews (and XStringSet) objects is much
   faster now (about 250x faster than yesterday for your use case):

     library(BSgenome.Hsapiens.UCSC.hg18)
     c1 <- Hsapiens$chr1
     m <- matchPattern('CTAG', c1)
     mgaps <- gaps(m)
     mgaps2 <- as.character(mgaps)  # now takes about 5 sec. instead of 20 minutes

- As a consequence, the getSeq() function is much faster too (it uses as.character()
   internally):

     mgaps3_start <- end(m)[-length(m)] + 1L
     mgaps3_end <- start(m)[-1] - 1L
     ## The empty gaps must be dropped because the 'start' and 'end' args of getSeq()
     ## must verify 'start' <= 'end'
     nonemptygaps <- mgaps3_start <= mgaps3_end
     mgaps3_start <- mgaps3_start[nonemptygaps]
     mgaps3_end <- mgaps3_end[nonemptygaps]
     mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start, end=mgaps3_end)

     identical(mgaps3, mgaps2)  # TRUE

This will be available in BioC 2.2 (Biostrings 2.8.5) tonight around midnight
(Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings 2.9.8) around noon
(Seattle time). As always, install with biocLite().

Please let me know by posting here if you run into other problems. Thanks!

H.


hpages at fhcrc.org wrote:
> Hi Anh,
> 
> Quoting Anh Tran <popophobia at gmail.com>:
> 
>> Hi all,
>>
>> I'm new to this mailing list and have a very simple question about 
>> digestion
>> with restriction enzyme for the whole genome.
>>
>> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of
>> enzyme. Is there a faster way (ie. pre-built package) to get the fragment
>> sequence and it's location in UCSC genome browser format (Chr#: start 
>> bp -
>> end bp).
>>
>> I'm planning on finding the restriction site location, then find the
>> substring between two consecutive cut sites.
>>
>> Is there any faster way to do this?
>>
>> Here's the code that I'm thinking of:
>>
>> library(BSgenome.Hsapiens.UCSC.hg18)
>> c1<-Hsapiens$chr1
>> m<-matchPatterns('CTAG', c1)
>>
>> #And the for loop
>> reslt<-NULL
>> for (i in 1:(length(m)-1)) {
>>    reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3,
>> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2)))
>> }
>>
>> It seems like this code is no where near perfect. Please give me some 
>> input.
> 
> Yes looping on an 623303-element vector at the R level will take a long
> time. To "invert" the views of an XStringViews object i.e. to make the
> current gaps between the views become the new views and the current views
> become the new gaps, just do:
> 
>   mgaps <- gaps(m)
> 
> This is very fast! (< 0.3 sec. on my machine)
> 
> Then if you really want this as a standard character vector, do:
> 
>   mgaps2 <- as.character(mgaps)
> 
> but, unfortunately this is still very slow at the moment because the
> main loop in the "as.character" method for XStringViews objects is still
> written in R and not in C (we need to change this ASAP).
> 
> You could also use the getSeq() function in a vectorized way if you
> really wanted to use getSeq(). Then your code would look like:
> 
>   reslt_start <- end(m)[-length(m)] + 1L
>   reslt_end <- start(m)[-1] - 1L
>   reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2))
> 
> at least in theory... but I realized that there are a few problems
> with the current getSeq() implementation that I need to address too,
> sorry!
> 
> Note that you'll need the latest available version of Biostrings in order
> to use the gaps() function.
> 
> Cheers,
> H.
> 
> 
>>
>> Thanks
>>
>> -- 
>> Regards,
>> Anh Tran
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:  
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list