[BioC] obtain DNA sequence

Patrick Aboyoun paboyoun at fhcrc.org
Tue Sep 1 20:45:29 CEST 2009


Simon,
I had a typo in my code: the column name should have been "Stop" 
rather than "End". Try

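library(BSgenome.Mmusculus.UCSC.mm9)  # provides the Mmusculus genome object

mymat <- <<the matrix you have below>>
# as.character() guards against Chr being stored as a factor
uniqueChr <- unique(as.character(mymat[,"Chr"]))
extractedDNA <- character(nrow(mymat))
# Load each chromosome once and extract all of its regions together
for (chr in uniqueChr) {
  selected <- which(mymat[,"Chr"] == chr)
  extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]],
      mymat[selected,"Start"], mymat[selected,"Stop"]))
}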
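
To show the shape of the loop without loading a genome package, here is a minimal base R sketch of the same per-chromosome grouping, followed by the CSV export asked about in the original question. Everything in it is an assumption made for illustration: toyGenome is a made-up named character vector standing in for Mmusculus, substring() stands in for Views(), and the output goes to a temporary file.

```r
# Toy stand-in for the genome: a named character vector of made-up
# sequences (hypothetical data, NOT the real Mmusculus object).
toyGenome <- c(chrA = "ACGTTGCAGGCCAATT", chrB = "TTTTGGGGCCCCAAAA")

# Regions in the same Chr/Start/Stop layout as the table in this thread.
mymat <- data.frame(Chr   = c("chrA", "chrB", "chrA"),
                    Start = c(1, 5, 9),
                    Stop  = c(4, 8, 12),
                    stringsAsFactors = FALSE)

extractedDNA <- character(nrow(mymat))
for (chr in unique(mymat[, "Chr"])) {
  # Rows that fall on the current chromosome
  selected <- which(mymat[, "Chr"] == chr)
  # substring() is vectorised over start/stop, so all regions of one
  # chromosome are extracted in a single call
  extractedDNA[selected] <- substring(toyGenome[[chr]],
                                      mymat[selected, "Start"],
                                      mymat[selected, "Stop"])
}

# Attach the sequences to the coordinates and write one row per region
result <- mymat
result$Sequence <- extractedDNA
outFile <- tempfile(fileext = ".csv")
write.csv(result, file = outFile, row.names = FALSE)
```

With the real data, the extractedDNA vector produced by the Views() loop could be attached to mymat and written out in the same way.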


Patrick


Biddie, Simon (NIH/NCI) [F] wrote:
> Hi Patrick,
>
> Thanks for your response. I will look into IRanges and Xstring.
> I also tried your code, however it gives me the following error:
>
> > mymat
>     Chr     Start      Stop
> 1  chr9  79466420  79466570
> 2  chr6  50495860  50496010
> 3  chr8  19687900  19688050
> 4  chrX  90313740  90313890
> 5  chr4 117732780 117732930
> 6 chr11   4090400   4090550
>
> > uniqueChr <- unique(mymat[,"Chr"])
> > extractedDNA <- character(nrow(mymat))
> > for (chr in uniqueChr) {
> +   selected <- which(mymat[,"Chr"] == chr)
> +   extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]], 
> + mymat[selected,"Start"], mymat[selected,"End"]))
> + }
>
> Error in newViews(subject, start = start, end = end, names = names, Class = "XStringViews") : 
>   'start' and 'end' must be numeric vectors
> In addition: Warning message:
> In Views(Mmusculus[[chr]], mymat[selected, "Start"], mymat[selected,  :
>   masks were dropped
>
>
> Simon
>
> -----Original Message-----
> From: Patrick Aboyoun [mailto:paboyoun at fhcrc.org] 
> Sent: Tuesday, September 01, 2009 2:21 PM
> To: Biddie, Simon (NIH/NCI) [F]
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] obtain DNA sequence
>
> Simon,
> Below is code that meets the needs of your explicit question:
>
> mymat <- <<the matrix you have below>>
> uniqueChr <- unique(mymat[,"Chr"])
> extractedDNA <- character(nrow(mymat))
> for (chr in uniqueChr) {
>   selected <- which(mymat[,"Chr"] == chr)
>   extractedDNA[selected] <- as.character(Views(Mmusculus[[chr]], 
> mymat[selected,"Start"], mymat[selected,"End"]))
> }
>
> One question for you: have you tried using the IRanges 
> framework to represent your ranges? It would make this type of 
> processing easier to perform. There are also write functions, such as 
> write.XStringSet and write.XStringViews, that provide export 
> functionality without requiring you to coerce the DNA sequences into 
> character vectors.
>
>
>
> Patrick
>
>
>
> Biddie, Simon (NIH/NCI) [F] wrote:
>   
>> Dear All,
>>
>> I am trying to obtain DNA sequences (mouse) from chromosome coordinates. I am relatively new with R and Bioconductor and would appreciate any help.
>>
>> I have the following style matrix:
>>
>>     Chr     Start      Stop
>> 1  chr9  79466420  79466570
>> 2  chr6  50495860  50496010
>> 3  chr8  19687900  19688050
>> 4  chrX  90313740  90313890
>> 5  chr4 117732780 117732930
>> 6 chr11   4090400   4090550
>>
>> I can use the following code to obtain a single sequence by typing in the chromosome number, start and stop manually:
>>
>> > library(BSgenome.Mmusculus.UCSC.mm9)
>> > seq1 = subseq(Mmusculus$chr9,79466420,79466570)
>> > as(seq1, "character")
>>
>> How would I do this for all the rows in a matrix to be output as a single txt or csv file? ... without having to type each row (I have up to 15,000!) one at a time. Please find below the sessionInfo.
>>
>> Thank you for any help,
>>
>> Simon
>>
>> > sessionInfo()
>> R version 2.8.1 (2008-12-22)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>
>> other attached packages:
>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.10.5
>> [3] Biostrings_2.10.22                 IRanges_1.0.16
>> [5] R.utils_1.1.3                      R.oo_1.4.6
>> [7] R.methodsS3_1.0.3
>>
>> loaded via a namespace (and not attached):
>> [1] grid_2.8.1         lattice_0.17-25    Matrix_0.999375-23
>
>


