[BioC] obtain DNA sequence

Hervé Pagès hpages at fhcrc.org
Mon Sep 14 21:50:55 CEST 2009


Hi Simon,

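The getSeq() function from the BSgenome package is provided for that
purpose. It is vectorized over the chromosome names and the start/end
coordinates, so a single call handles all your rows (with
BSgenome.Mmusculus.UCSC.mm9 loaded as in your example):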

myseqs <- data.frame(
             Chr=c("chr9", "chr6", "chr8", "chrX", "chr4", "chr11"),
             Start=c(79466420, 50495860, 19687900,
                     90313740, 117732780, 4090400),
             Stop=c(79466570, 50496010, 19688050,
                    90313890, 117732930, 4090550))

 > myseqs
     Chr     Start      Stop
1  chr9  79466420  79466570
2  chr6  50495860  50496010
3  chr8  19687900  19688050
4  chrX  90313740  90313890
5  chr4 117732780 117732930
6 chr11   4090400   4090550

 > getSeq(Mmusculus, myseqs$Chr, start=myseqs$Start, end=myseqs$Stop)
[1] 
"CTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCCAAGTGCTGGGATTAACGGTGTGCACCACCACTGCCTGGC"
[2] 
"TTACTGTCACCCTCAGAATCACGTGTTCAGATATCCAGCTTCCGGGTGACAAACCCACAAAATTCTCTTTTTTCTTTAACCTTACTCTCTCCAACACTTGACCTTTCTTTGTTTATTCCTTCTGGAGTGCCCAGGTCCTTATGCATTATGA"
[3] 
"GGTAGGTAAGTAATGGTCACCTATTCTCTTTCTATCTGGTATGTCTGCAGGTTGACAGGCTGGTGCCTGCCCTTCAACCCAGGAAGCAGAGCTTGTGTTCAATCATTATTGCACATTAACAAGGAAAAAAATGCCTTGTTGGATTCTTAAA"
[4] 
"TCAAAATGGCAAGAAAAACACTTAAGTTTTTATTACTCAGGGCTCACAGCAGCTAAAAGGTTTCAGCAATATTATATGGCATACAAATTGCAGATTTAACTTGGTTGAGGAGCGTCCCCATGCAATCACCATAATATTTTATTGTAGAATA"
[5] 
"TTCAAAACGTCCTCCTGCTTCCTCTGTGGTGACCAGCTATGACTCGGGGCATCCCTCCTCAAGGCCTTAGTGTTTTGGCTTTGCTCAGTTTCCATGAGGCCTGACCAACCCCTAGGAGTCTCCTCTTTCTGCCTCCTGCTACCTGGATGCA"
[6] 
"AGCCTGCTCTGTAGGGAACCTTTAGTGGGCTTGAAGTGTTCCCTGACTGCTCTTGAGCACTGGCCAAAAGCAAGAAAGCAGCTAGCCCATGAATGGCCCTGTGGGTGGCACAGGCACAGGCAGTGAAACCCCAAGAAGACCAGGTATAATG"
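
To get the sequences into a single csv file next to their coordinates,
something along these lines should work. Take it as a sketch only:
write_seq_table() is a made-up helper name (it is not part of BSgenome),
the output file name is arbitrary, and as.character() is there as a
precaution because, depending on the BSgenome version, getSeq() may
return either a character vector or a DNAStringSet object.

```r
# Coordinates from the example above.
myseqs <- data.frame(
    Chr   = c("chr9", "chr6", "chr8", "chrX", "chr4", "chr11"),
    Start = c(79466420, 50495860, 19687900, 90313740, 117732780, 4090400),
    Stop  = c(79466570, 50496010, 19688050, 90313890, 117732930, 4090550),
    stringsAsFactors = FALSE)

# Hypothetical helper (not part of BSgenome): append the sequences to the
# coordinate table as a 'Sequence' column and write everything to a csv file.
write_seq_table <- function(coords, seqs, path) {
    stopifnot(nrow(coords) == length(seqs))
    out <- cbind(coords, Sequence = as.character(seqs),
                 stringsAsFactors = FALSE)
    write.csv(out, file = path, row.names = FALSE)
    invisible(out)
}

# Only runs if the mm9 genome package is installed.
if (requireNamespace("BSgenome.Mmusculus.UCSC.mm9", quietly = TRUE)) {
    library(BSgenome.Mmusculus.UCSC.mm9)
    seqs <- getSeq(Mmusculus, as.character(myseqs$Chr),
                   start = myseqs$Start, end = myseqs$Stop)
    # The file name below is an arbitrary choice.
    write_seq_table(myseqs, seqs, "myseqs_with_sequences.csv")
}
```

The same call works unchanged for a 15,000-row table, since nothing in
it depends on the number of rows.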

See ?getSeq for more information about this function.

Cheers,

H.


Biddie, Simon (NIH/NCI) [F] wrote:
> Dear All,
> 
> I am trying to obtain mouse DNA sequences from chromosome coordinates. I am relatively new to R and Bioconductor and would appreciate any help.
> 
> I have a matrix in the following style:
> 
>     Chr     Start      Stop
> 1  chr9  79466420  79466570
> 2  chr6  50495860  50496010
> 3  chr8  19687900  19688050
> 4  chrX  90313740  90313890
> 5  chr4 117732780 117732930
> 6 chr11   4090400   4090550
> 
> I can use the following code to obtain a single sequence by typing in the chromosome number, start and stop manually:
> 
>> library(BSgenome.Mmusculus.UCSC.mm9)
> 
>> seq1 = subseq(Mmusculus$chr9,79466420,79466570)
> 
>> as(seq1, "character")
> 
> How would I do this for all the rows of the matrix, writing the output to a single txt or csv file, without having to type in each row one at a time (I have up to 15,000!)? Please find the sessionInfo below.
> 
> Thank you for any help,
> 
> Simon
> 
>> sessionInfo()
> R version 2.8.1 (2008-12-22)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> 
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.10.5
> [3] Biostrings_2.10.22                 IRanges_1.0.16
> [5] R.utils_1.1.3                      R.oo_1.4.6
> [7] R.methodsS3_1.0.3
> 
> loaded via a namespace (and not attached):
> [1] grid_2.8.1         lattice_0.17-25    Matrix_0.999375-23
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


