[BioC] Restriction enzyme digestion

Mon May 12 22:15:56 CEST 2008

Anh Tran writes:

> Hi all,
>
> I'm new to this mailing list and have a very simple question
> about digestion with restriction enzyme for the whole genome.
>
> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut
> site of enzyme. Is there a faster way (ie. pre-built package)
> to get the fragment sequence and it's location in UCSC genome
> browser format (Chr#: start bp - end bp).
>

> I'm planning on finding the restriction site location, then
> find the substring between two consecutive cut sites.
>

getSeq accepts vectors of start and stop sites, so you can try this:

library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg18)
c1<-Hsapiens$chr1
m<-matchPattern('CTAG', c1)
starts <- m at start[1:length(m at start)-1]
ends <- m at start[2:length(m at start)]
seqs <-  getSeq(Hsapiens, 'chr1', starts, ends)

Which works great for a "few" sequences, but stalls on my windows box with this many sequences due to memory constraints.  While Biostrings is fantastic (for example, the matchPattern above takes only a few seconds), it does come at a cost (IMHO) in terms of memory overhead (the authors know this, and explicitly state that retrieving this many sequences at once is not likely to work).  But you might be able to loop through in chunks and rbind together.

An alternative may be to use biomaRt's getSequence in mysql mode (requires mySql libraries to be installed and the RMySQL package).  I am not sure if you can send multiple sites, though (and I can't test it at the moment due to firewall issues), but you could try:

mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE)
getSequence(chromosome="1", start=starts, end=ends, mart=ensembl)

<DriftingOffTopic>

An unpublished alternative is to use a new library I have concocted which will suck sequences directly out of a 2bit file.  You have to download the human genome .2bit file (800Mb) from here:

ftp://hgdownload.cse.ucsc.edu/gbdb/hg18/hg18.2bit

And then, if you had my library installed (which I can send to you if you want it), you could do this:

library(r2bit)
chr <- rep("chr1", length(starts))
seq <- fetch2bit("/path/to/hg18.2bit", chr, starts, ends)

On my box, this takes 35 seconds to retrieve the 623,302 sequences (as character strings, not at Biostrings objects) from my local copy of hg18.2bit.

Personally, as we move into the massively high throughput sequencing era, I wonder if this isn't a good alternative (i.e., coupling biomaRt's ability to quickly figure out what sequences are of interest, and then retrieving those sequences quickly out of a local compressed file)

</DriftingOffTopic>

Cheers,
Eric

> Is there any faster way to do this?
>
> Here's the code that I'm thinking of:
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> c1<-Hsapiens$chr1
> m<-matchPatterns('CTAG', c1)
>
> #And the for loop
> reslt<-NULL
> for (i in 1:(length(m)-1)) {
>    reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3,
> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) }
>
> It seems like this code is no where near perfect. Please give
> me some input.
>
> Thanks
>
> --
> Regards,
> Anh Tran
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

This email message, including any attachments, is for th...{{dropped:6}}