[R] String processing - is there a better way

Martin Morgan mtmorgan at fhcrc.org
Thu Jul 22 04:27:41 CEST 2010


On 07/21/2010 10:02 AM, Davis, Brian wrote:
> I have a two part question
> 
> Part 1) I am trying to remove characters in a string based on the
> position of a key character in another string. I have a solution that
> works but it requires a for-loop. A vectorized way of doing this has
> eluded me.

Hi Brian --

This sounds like processing short reads from DNA sequencing experiments.
The Bioconductor project has well-developed tools for doing these types
of operations. See the Bioconductor mailing list, the Biostrings,
ShortRead, IRanges, ... packages including their vignettes, and perhaps
some of the recent course / training material accessible from the web site.

  http://bioconductor.org/

Also Thomas Girke's group has a straightforward resource describing use
of these tools at

  http://manuals.bioinformatics.ucr.edu/home/ht-seq

If you explore this avenue, then please post messages to the
Bioconductor mailing list, where a suitable audience of experienced
users will give you prompt advice.

Martin

> 
> CleanRead<-function(x,y) {
> 
>   if (!is.character(x)) 
>     x <- as.character(x)
>   if (!is.character(y)) 
>     y <- as.character(y)
> 
>   idx<-grep("\\*", x, value=FALSE)
>   starpos<-gregexpr("\\*", x[idx])
>   
>   ysplit<-strsplit(y[idx], '')
>   n<-length(idx)
>   for(i in seq_len(n)) {  # seq_len() skips the loop when n == 0; 1:n would not
>     ysplit[[i]][starpos[[i]]] = ""
>   }
> 
>   y[idx]<-unlist(lapply(ysplit, paste, sep='', collapse=''))
>   return(y)
> }
> 
> x<-c("AA*.*A,,,", "**a.a*,,,A", "C*c..", "**aA") 
> y<-c("abcdefghi", "abcdefghij", "abcde", "abcd")
> 
> CleanRead(x,y)
> [1] "abdfghi" "cdeghij" "acde"    "cd"
> 
> 
> Is there a better way to do this?
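
For Part 1, one base-R possibility (a sketch only, separate from the
Bioconductor tools suggested above) is to split both strings into
characters and keep the characters of y wherever x is not "*". The name
clean_read is made up for illustration, and the code assumes that each
x[i] has the same number of characters as y[i], as in the example data.
vapply() still loops internally, so this is tidier rather than truly
vectorized.

```r
# Hypothetical helper: drop the characters of y that sit at the
# positions where x has a "*".
clean_read <- function(x, y) {
  x <- as.character(x)
  y <- as.character(y)
  # Assumption: x and y are parallel vectors of equal-width strings.
  stopifnot(length(x) == length(y), all(nchar(x) == nchar(y)))

  x_chars <- strsplit(x, "")   # list of single-character vectors
  y_chars <- strsplit(y, "")

  vapply(seq_along(y), function(i) {
    # Logical mask: TRUE wherever x is not a star.
    paste(y_chars[[i]][x_chars[[i]] != "*"], collapse = "")
  }, character(1))
}

x <- c("AA*.*A,,,", "**a.a*,,,A", "C*c..", "**aA")
y <- c("abcdefghi", "abcdefghij", "abcde", "abcd")
cleaned <- clean_read(x, y)
```

On the example vectors, cleaned holds the same four strings that
CleanRead(x, y) printed: "abdfghi" "cdeghij" "acde" "cd". Strings
without a star pass through unchanged, so no separate grep() step is
needed.
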
> 
> Part 2) 
> My next step in the string processing is to take the characters in the
> output of CleanRead and subtract 33 from the ASCII value of each
> character to obtain an integer. Again I have a solution that works,
> involving splitting the string into characters, then converting them to
> factors (starting at ASCII 34) and using unclass to get the integer
> value (kind of an atoi(x)-33 all in one step).
> 
> I looked for the C equivalent of atoi, but the only help I could find (R-help 2003) suggested using as.numeric.  However, the help file (and testing) shows you get 'NA'.   
> 
> Am I missing an easier way to do this?
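
For Part 2, base R has utf8ToInt(), which returns the integer code
points of a single string, so subtracting 33 gives the values directly
without going through factors. The sketch below is a minimal
illustration: the helper name ascii_offset and the sample strings are
made up, and it assumes plain ASCII input with no NA values.

```r
# Hypothetical helper: integer code of every character, minus an offset.
# utf8ToInt() accepts one string at a time, hence the lapply().
ascii_offset <- function(reads, offset = 33L) {
  reads <- as.character(reads)
  lapply(reads, function(s) utf8ToInt(s) - offset)
}

# "!" is ASCII 33, "5" is 53, "I" is 73.
scores <- ascii_offset(c("!5I", "II"))
```

Here scores[[1]] is 0 20 40 and scores[[2]] is 40 40. For ASCII-only
input, as.integer(charToRaw(s)) - 33L should give the same numbers.
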
> 
> 
> 
> Thanks in advance,
> 
> Brian
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


