[R] Parsing a Simple Chemical Formula

jim holtman jholtman at gmail.com
Mon Dec 27 01:19:48 CET 2010


Try this:

> f.extract <- function(formula)
+ {
+     # pattern to match the initial chemical
+     # assumes chemical starts with an upper case and optional lower case
+     # followed by zero or more digits.
+     first <- "^([[:upper:]][[:lower:]]?)([0-9]*).*"
+     # inverse of above to remove the initial chemical
+     last <- "^[[:upper:]][[:lower:]]?[0-9]*(.*)"
+     result <- list()
+     extract <- formula
+     # repeat as long as there is data
+     while ((start <- nchar(extract)) > 0){
+         chem <- sub(first, '\\1 \\2', extract)
+         extract <- sub(last, '\\1', extract)
+         # if the number of characters is the same, then there was an error
+         if (nchar(extract) == start){
+             warning("Invalid formula:", formula)
+             return(NULL)
+         }
+         # append to the list
+         result[[length(result) + 1L]] <- strsplit(chem, ' ')[[1]]
+     }
+     result
+ }
> f.extract("C5H11BrO")
[[1]]
[1] "C" "5"

[[2]]
[1] "H"  "11"

[[3]]
[1] "Br"

[[4]]
[1] "O"

> f.extract("H2O")
[[1]]
[1] "H" "2"

[[2]]
[1] "O"

> f.extract("CCC")
[[1]]
[1] "C"

[[2]]
[1] "C"

[[3]]
[1] "C"

> f.extract("Crr")  # bad
NULL
Warning message:
In f.extract("Crr") : Invalid formula:Crr
>
>
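
Note that the list above leaves the count out when it is implied (Br and O come back without a "1"). As a rough sketch of one alternative, not a replacement for the function above, the version below pulls all tokens at once with gregexpr()/regmatches() and fills in the implied 1, so the result can go straight into the molecular weight calculation. The names parse_formula and atomic_weight are made up for illustration, and the weight table is an assumption: rounded standard atomic weights for just the four elements in the example. Unlike f.extract, this sketch also treats an empty string as invalid.

```r
# Hypothetical helper (illustration only): parse a simple formula into a
# data frame of elements and counts, treating a missing count as 1.
parse_formula <- function(formula) {
    # one token = upper-case letter, optional lower-case letter, optional digits
    token <- "[[:upper:]][[:lower:]]?[0-9]*"
    pieces <- regmatches(formula, gregexpr(token, formula))[[1]]
    # the tokens must account for every character, otherwise reject the input
    if (length(pieces) == 0L || sum(nchar(pieces)) != nchar(formula)) {
        warning("Invalid formula: ", formula)
        return(NULL)
    }
    element <- sub("[0-9]*$", "", pieces)        # strip trailing digits
    count <- sub("^[[:alpha:]]+", "", pieces)    # strip leading letters
    count[count == ""] <- "1"                    # implied count of 1
    data.frame(element = element, count = as.integer(count),
               stringsAsFactors = FALSE)
}

# Assumed lookup table: rounded standard atomic weights, extend as needed
atomic_weight <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

parsed <- parse_formula("C5H11BrO")
mol_weight <- sum(atomic_weight[parsed$element] * parsed$count)
```

Indexing atomic_weight by parsed$element lines the weights up with the counts, so the sum is the molecular weight. An element missing from the table would give NA, which is probably what you want as a signal that the table needs extending.
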
On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
> Hello R Folks...
>
> I've been looking around the 'net and I see many complex solutions in
> various languages to this question, but I have a pretty simple need (and I'm
> not much good at regex).  I want to use a chemical formula as a function
> argument.  The formula would be in "Hill order" which is to list C, then H,
> then all other elements in alphabetical order.  My example will have only a
> limited number of elements, few enough that one can search directly for each
> element.  So some examples would be C5H12, or C5H12O or C5H11BrO (note that
> for oxygen and bromine, O or Br, there is no following number meaning a 1 is
> implied).
>
> Let's say
>
>> form <- "C5H11BrO"
>
> I'd like to get the count of each element, so in this case I need to extract
> C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular
> weight by multiplying).  Sounds pretty simple, but my experiments with grep
> and strsplit don't immediately clue me into an obvious solution.  As I said,
> I don't need a general solution to the problem of calculating molecular
> weight from an arbitrary formula, that seems quite challenging, just a way
> to convert "form" into a list or data frame which I can then do the math on.
>
> Here's hoping this is a simple issue for more experienced R users!  TIA,
>  Bryan
> ***********
> Bryan Hanson
> Professor of Chemistry & Biochemistry
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


