[R] "ACCTGMX" to "1223400" in R?

Martin Morgan mtmorgan at fhcrc.org
Tue Jul 20 06:09:28 CEST 2010


On 07/19/2010 06:37 PM, David Winsemius wrote:
> 
> On Jul 19, 2010, at 5:31 PM, John1983 wrote:
> 
>>
>> Hi,
>>
>> I am a newbie in R and was working on some DNA data represented as
>> strings
>> of A,C,T and G (also wild-character like M and X). I use the Bioconductor
>> package in R.
> 
> Well, I guess it's sort of a "meta" package, but it is really more of a
> subculture. It also has its own mailing list.

  http://bioconductor.org/docs/mailList.html

choose the "bioconductor" list.

> 
>> Currently I need to convert a string of the form "ACCTGMX" to
>> "1223400" i.e. A is replaced by 1, C with 2, T with 3, G with 4 and any
>> other character with a 0. I checked with 'replace' and also with a
>> function
>> called 'copySubstitute' found in the Biobase package but this is only for
>> files.
>> The data here is a string ("ACCTGMX" ) and we need to convert it to yet
>> another string ("1223400"). Now I use the strsplit function to split
>> "ACCTGM" into "A" "C" "C" "T" "G" "M" and then use 'which' to assign the
>> corresponding numbers.
>> Is there a faster way to do this or some function I can make use of?
> 
>> tst <- rep( "ACCTGMX", 5)
>> newtst <- gsub("A", "1", tst)
>> newtst <- gsub("C", "2", newtst)
>> newtst <- gsub("T", "3", newtst)
>> newtst <- gsub("G", "4", newtst)
>> newtst <- gsub("[[:alpha:]]", "0", newtst)
>> newtst
> [1] "1223400" "1223400" "1223400" "1223400" "1223400"
> 
> There is also a rollapply function in the zoo package and an strapply
> function in the gsubfn package that might be even more powerful, but I
> am insufficiently talented to give you a one-liner using them.

It sounds like an unusual operation to perform, and I wonder what you're
trying to do in the bigger picture? The place in Bioconductor for this
is the Biostrings package. If you had many of these you might end up at
the DNAStringSet class, and you'd probably like to use IUPAC rather than
ad-hoc encoding. So rather than

> library(Biostrings)
> DNAStringSet("ACCTGMX")
Error in .charToSharedRaw(x, start = start, end = end, width = width:
  key 88 not in lookup table

(key 88 is the ASCII code for "X", which is not in the DNA alphabet),
one might discover (see also ?DNA_ALPHABET and eventually
?IUPAC_CODE_MAP for an appropriate citation)

> DNA_ALPHABET
 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"

and use

> x = DNAStringSet("ACCTGMN")
> x
  A DNAStringSet instance of length 1
    width seq
[1]     7 ACCTGMN

(one would usually have 'several' (e.g., millions) of DNA strings in a
set). A DNAStringSet is one of several types of string sets, the most
general of which (no restrictions on alphabet) is a BStringSet.
BStringSet has a chartr() method like that described in another response, so

> old = paste(DNA_ALPHABET, collapse="")
> new = paste(c(1:4, rep(0, length(DNA_ALPHABET)-4)), collapse="")
> chartr(old, new, as(x, "BStringSet"))
  A BStringSet instance of length 1
    width seq
[1]     7 1224300

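as.character(chartr(...)) would get a character vector back. Note that
this numbers the bases in DNA_ALPHABET order (A, C, G, T), so G becomes
3 and T becomes 4; to get the coding asked for in the original post
(T = 3, G = 4), put A, C, T, G first, in that order, when building 'old'.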
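
For a plain character vector, outside Biostrings, a minimal base-R
sketch of the coding asked for in the original post (A = 1, C = 2,
T = 3, G = 4, everything else 0) might look like the following. The
function name encode_dna is made up for illustration, and the sketch
assumes the input strings contain no digits to begin with.

```r
# Hypothetical helper (not part of Biostrings or base R):
# A -> 1, C -> 2, T -> 3, G -> 4, any other character -> 0.
# Assumes the input contains no digits before encoding.
encode_dna <- function(x) {
  # Translate the four named bases in a single vectorized pass.
  x <- chartr("ACTG", "1234", x)
  # Whatever is not one of the digits 1-4 by now was some other symbol.
  gsub("[^1-4]", "0", x)
}

encoded <- encode_dna(c("ACCTGMX", "GATTACA"))
print(encoded)
```

Both chartr() and gsub() are vectorized over the input, so the same
call works unchanged on a long vector of strings.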

Martin

> 
>>
>> Please advise.
>>
>> Thank you.
>> -- 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


