[R] Data parsing question: adding characters within a string of characters

Duncan Murdoch murdoch.duncan at gmail.com
Thu Jan 2 13:27:05 CET 2014


On 14-01-01 10:55 PM, Joshua Banta wrote:
> Dear Listserve,
>
> I have a data-parsing question for you. I recognize this is more in the domain of PERL/Python, but I don't know those languages! On the other hand, I am pretty good overall with R, so I'd rather get the job done within the R "ecosphere."
>
> Here is what I want to do. Consider the following data:
>
> string <- "ATCGCCCGTA[AGA]TAACCG"
>
> I want to alter string so that it looks like this:
>
> ATCGCCCGTA[A][G][A]TAACCG
>
> In other words, I want to design a piece of code that will scan a character string, find bracketed groups of characters, break up each character within the bracket into its own individual bracketed character, and then put the group of individually bracketed characters back into the character string. The lengths of the character strings enclosed by a bracket will vary, but in every case, I want to do the same thing: break up each character within the bracket into its own individual bracketed character, and then put the group of individually bracketed characters back into the character string.
>
> So, for example, another string may look like this:
>
> string2 <- "ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA"
>
> I want to alter string2 so that it looks like this:
>
> "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"

R is fine for that sort of operation, using regular expressions for 
matching and sub() or gsub() for substitution.  For example, this code 
finds all the bracketed strings of 1 or more ATCG letters:

matches <- gregexpr("[[][ATCG]+]", string)

In the result, which looks like this for your example string,

[[1]]
[1] 11
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE


the 11 is the start position of the bracketed expression and the 5 is the 
length of the match, brackets included.  (With multiple bracketed 
expressions, both vectors have one entry per match.)  So use substr() to 
extract the matches.
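
As a minimal sketch of that extraction step (the names starts, lens and 
pieces are illustrative and not from the original post), the positions and 
lengths can be passed to substring(), which is vectorised over its start 
and stop arguments:

```r
string <- "ATCGCCCGTA[AGA]TAACCG"
matches <- gregexpr("[[][ATCG]+]", string)

# Start positions and lengths of every bracketed group in the first string
starts <- as.integer(matches[[1]])
lens   <- attr(matches[[1]], "match.length")

# substring() is vectorised over first/last, so it copes with several matches
pieces <- substring(string, starts, starts + lens - 1)
pieces   # "[AGA]"
```

regmatches(string, matches) should return the same pieces as a list, one 
element per input string, if you would rather not do the arithmetic by hand.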

You need to be a little careful putting the string back together after 
adding the extra brackets, because `substr<-` won't replace a string 
with one of a different length.  I use this version instead:

`mysubstr<-` <- function(x, start, stop, value)
   paste0(substr(x, 1, start-1), value, substr(x, stop+1, nchar(x)))
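
To illustrate the difference, here is a small comparison of the built-in 
`substr<-` with the replacement function above. The variable names s1 and 
s2 are illustrative only:

```r
# Replacement function from the post, with balanced parentheses
`mysubstr<-` <- function(x, start, stop, value)
  paste0(substr(x, 1, start - 1), value, substr(x, stop + 1, nchar(x)))

s1 <- "ATCGCCCGTA[AGA]TAACCG"
s2 <- s1

# Built-in replacement: only the 5 characters in positions 11..15 are
# overwritten, so the 9-character value is cut off to fit
substr(s1, 11, 15) <- "[A][G][A]"
s1   # "ATCGCCCGTA[A][GTAACCG"

# Custom replacement: the 5-character range is swapped for the whole value
mysubstr(s2, 11, 15) <- "[A][G][A]"
s2   # "ATCGCCCGTA[A][G][A]TAACCG"
```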

I'll leave the details of the substitutions to you...
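
For readers who want a starting point, one possible way to fill in those 
details is sketched below. It is only a sketch, not necessarily the approach 
the post had in mind: the helper name split_brackets is hypothetical, and it 
assumes the bracketed groups contain only the letters A, T, C and G. Matches 
are processed from last to first so that lengthening the string does not 
shift the positions still to be handled.

```r
`mysubstr<-` <- function(x, start, stop, value)
  paste0(substr(x, 1, start - 1), value, substr(x, stop + 1, nchar(x)))

# Hypothetical helper: re-bracket every letter inside each [..] group
split_brackets <- function(x) {
  m <- gregexpr("[[][ATCG]+]", x)[[1]]
  if (m[1] == -1) return(x)                # no bracketed group found
  starts <- as.integer(m)
  stops  <- starts + attr(m, "match.length") - 1
  # Work from the last match to the first so earlier positions stay valid
  for (i in rev(seq_along(starts))) {
    inner <- substr(x, starts[i] + 1, stops[i] - 1)   # letters without brackets
    chars <- strsplit(inner, "")[[1]]
    mysubstr(x, starts[i], stops[i]) <- paste0("[", chars, "]", collapse = "")
  }
  x
}

split_brackets("ATCGCCCGTA[AGA]TAACCG")
# "ATCGCCCGTA[A][G][A]TAACCG"
split_brackets("ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA")
# "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"
```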

Duncan Murdoch



