[R] Parsing a Simple Chemical Formula

Mike Marchywka marchywka at hotmail.com
Mon Dec 27 14:00:28 CET 2010




> From: hanson at depauw.edu
> To: dwinsemius at comcast.net
> Date: Sun, 26 Dec 2010 22:36:49 -0500
> CC: r-help at stat.math.ethz.ch
> Subject: Re: [R] Parsing a Simple Chemical Formula
>
> Hi David & others...
>
> I did find the function you recommended, plus, it's even easier (but a
> little hidden in the doc): >element(form, "mass"). But, this uses the
> atomic masses from the periodic table, which are weighted averages of
> the isotopes of each element. What I'm doing actually involves mass
> spectrometry, so I need the isotope mass numbers, which are integers (think

There are probably packages specialized for that field. For example, see

http://www.bioconductor.org/help/course-materials/2010/HeidelbergNovember2010/MSintro_LaurentGatto.pdf

which presumably includes relevant physical data files in some format.
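
In the meantime, here is a minimal sketch of the integer-mass calculation
being asked about. It assumes the element counts are already available as a
named vector, and it uses a small hand-entered table of isotope mass numbers
rather than data from any package. The names iso_mass, isotope and
nominal_mass are made up for illustration.

```r
# Mass numbers of a few isotopes (hand-entered, illustrative subset only).
iso_mass <- c("12C" = 12, "13C" = 13, "1H" = 1, "2H" = 2,
              "16O" = 16, "79Br" = 79, "81Br" = 81)

# Element counts for C5H11BrO, e.g. as produced by a formula parser.
counts <- c(C = 5, H = 11, Br = 1, O = 1)

# Which isotope to use for each element (an assumption made by the caller).
isotope <- c(C = "12C", H = "1H", Br = "79Br", O = "16O")

# Sum of count * mass number over all elements in the formula.
nominal_mass <- function(counts, isotope, iso_mass) {
  sum(counts * iso_mass[isotope[names(counts)]])
}

nominal_mass(counts, isotope, iso_mass)
```

Changing the Br entry of isotope to "81Br" raises the result by 2, which is
the familiar pair of peaks two mass units apart for a compound with one
bromine.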


> 12C, 13C, 14C, but the periodic table says 12.011 reflecting the
> relative abundances). I used Gabor's solution and got my little
> function humming. Plus, I have several things to read through from
> the various recommendations.
>
> Thanks again, Bryan
>
> On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
>
> >
> > On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
> >
> >> Thanks Spencer, I'll definitely have a look at this package and
> >> its vignettes. I believe I have looked at it before, but didn't
> >> catch it on this particular search. Bryan
> >
> > Using the thermo list that the makeup function accesses to get its
> > valid atomic symbols, one can arrive at the answer you posited
> > would be too difficult in your first posting, the molecular weight
> > from the formula:
> >
> > > str(thermo$element)
> > 'data.frame': 130 obs. of 6 variables:
> > $ element: chr "Z" "O" "H" "He" ...
> > $ state : chr "aq" "gas" "gas" "gas" ...
> > $ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
> > $ mass : num 0 16 1.01 4 20.18 ...
> > $ s : num -15.6 49 31.2 30.2 35 ...
> > $ n : int 1 2 2 1 1 1 1 1 2 2 ...
> >
> > patts <- paste("^", rownames(makeup(form)), "$", sep="")
> > makuform <- makeup(form)
> > makuform$amass <- sapply(patts, function(x) {
> >     thermo$element[grep(x, thermo$element[[1]])[1], "mass"] })
> > sum(makuform$amass * makuform$count)
> > # [1] 167.0457
> >
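
The same lookup can probably be written with match() instead of anchored
grep patterns. The sketch below is self-contained and does not load CHNOSZ:
atomic_weight is a hand-entered stand-in for thermo$element, with rounded
standard atomic weights, and makuform is built by hand in the shape the
makeup() output is shown to have in this thread. Both are assumptions for
illustration only.

```r
# Stand-in for thermo$element: a few hand-entered standard atomic weights.
atomic_weight <- data.frame(
  element = c("C", "H", "O", "Br"),
  mass    = c(12.011, 1.008, 15.999, 79.904),
  stringsAsFactors = FALSE
)

# Element counts for C5H11BrO, with the symbols as row names.
makuform <- data.frame(count = c(5, 11, 1, 1),
                       row.names = c("C", "H", "Br", "O"))

# match() compares whole strings, so "B" can never hit "Br" by accident.
idx <- match(rownames(makuform), atomic_weight$element)
stopifnot(!anyNA(idx))  # stop on a symbol that is missing from the table

makuform$amass <- atomic_weight$mass[idx]
mw <- sum(makuform$amass * makuform$count)
mw
```

With these rounded weights the sum comes out close to the 167.0457 quoted
above. Small differences are expected because the table values are rounded.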
> >>
> >> On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
> >>
> >>> p.s. help(package = "CHNOSZ") reveals that this package has 3 vignettes.
> >>> I have not looked at these vignettes, but most vignettes provide
> >>> excellent introductions (though rarely with complete coverage) of
> >>> important capabilities of the package. (The 'sos' package
> >>> includes a vignette, which exposes more capabilities than the
> >>> example below.)
> >>>
> >>>
> >>> ######################
> >>> Have you considered the 'CHNOSZ' package?
> >>>
> >>>
> >>>> makeup("C5H11BrO" )
> >>> count
> >>> C 5
> >>> H 11
> >>> Br 1
> >>> O 1
> >>>
> >>>
> >>> I found this using the 'sos' package as follows:
> >>>
> >>>
> >>> library(sos)
> >>> cf <- ???'chemical formula'
> >>> found 21 matches; retrieving 2 pages
> >>> cf
> >>>
> >>>
> >>> The print method for "cf" opened the results in a web browser,
> >>> which showed that the "CHNOSZ" package had 14 of these 21 matches,
> >>> and the other 7 were in 7 different packages. Moreover, the
> >>> "CHNOSZ" package is devoted to "Chemical Thermodynamics and
> >>> Activity Diagrams" and provides many more capabilities that might
> >>> interest you.
> >>>
> >>>
> >>> Hope this helps.
> >>> Spencer
> >>>
> >>>
> >>> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
> >>>> Well let me just say thanks and WOW! Four great ideas, each worthy of
> >>>> study and I'll learn several things from each. Interestingly, these
> >>>> solutions seem more general and more compact than the solutions I
> >>>> found on the 'net using python and perl. More evidence for the power
> >>>> of R! A big thanks to each of you! Bryan
> >>>>
> >>>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
> >>>>
> >>>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson wrote:
> >>>>>> Hello R Folks...
> >>>>>>
> >>>>>> I've been looking around the 'net and I see many complex solutions in
> >>>>>> various languages to this question, but I have a pretty simple need
> >>>>>> (and I'm not much good at regex). I want to use a chemical formula as
> >>>>>> a function argument. The formula would be in "Hill order" which is to
> >>>>>> list C, then H, then all other elements in alphabetical order. My
> >>>>>> example will have only a limited number of elements, few enough that
> >>>>>> one can search directly for each element. So some examples would be
> >>>>>> C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or
> >>>>>> Br, there is no following number meaning a 1 is implied).
> >>>>>>
> >>>>>> Let's say
> >>>>>>
> >>>>>>> form <- "C5H11BrO"
> >>>>>>
> >>>>>> I'd like to get the count of each element, so in this case I need to
> >>>>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
> >>>>>> molecular weight by multiplying). Sounds pretty simple, but my
> >>>>>> experiments with grep and strsplit don't immediately clue me into an
> >>>>>> obvious solution. As I said, I don't need a general solution to the
> >>>>>> problem of calculating molecular weight from an arbitrary formula,
> >>>>>> that seems quite challenging, just a way to convert "form" into a
> >>>>>> list or data frame which I can then do the math on.
> >>>>>>
> >>>>>> Here's hoping this is a simple issue for more experienced R users!
> >>>>>> TIA,
> >>>>>
> >>>>> This can be done with strapply in the gsubfn package. It matches the
> >>>>> regular expression against the target string and passes the back
> >>>>> references (the parenthesized portions of the regular expression) to
> >>>>> a specified function as successive arguments.
> >>>>>
> >>>>> Thus the first arg is form, your input string. The second arg is the
> >>>>> regular expression, which matches an upper case letter optionally
> >>>>> followed by lower case letters, all of that optionally followed by
> >>>>> digits. The third arg is a function written in formula notation; it
> >>>>> receives the two back references as ..1 and ..2 and substitutes 1
> >>>>> when the digits are missing. The simplify arg is another function in
> >>>>> formula notation, which turns the result into a matrix and then a
> >>>>> data frame. Finally we make the second column of the data frame
> >>>>> numeric.
> >>>>>
> >>>>> library(gsubfn)
> >>>>>
> >>>>> DF <- strapply(form,
> >>>>>     "([A-Z][a-z]*)(\\d*)",
> >>>>>     ~ c(..1, if (nchar(..2)) ..2 else 1),
> >>>>>     simplify = ~ as.data.frame(t(matrix(..1, 2)),
> >>>>>                                stringsAsFactors = FALSE))
> >>>>> DF[[2]] <- as.numeric(DF[[2]])
> >>>>>
> >>>>> DF looks like this:
> >>>>>
> >>>>>> DF
> >>>>> V1 V2
> >>>>> 1 C 5
> >>>>> 2 H 11
> >>>>> 3 Br 1
> >>>>> 4 O 1
> >>>>>
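
For comparison, roughly the same table can be built with base R alone. This
is only a sketch under the same assumption as the regular expression above
(each element is an upper case letter, optional lower case letters, then
optional digits). The function name parse_formula is made up for
illustration.

```r
# Parse a simple formula such as "C5H11BrO" into elements and counts.
parse_formula <- function(form) {
  # One token per element: symbol followed by an optional count.
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  element <- sub("[0-9]+$", "", tokens)      # drop the trailing digits
  digits  <- sub("^[A-Za-z]+", "", tokens)   # drop the leading symbol
  digits[!nzchar(digits)] <- "1"             # a missing count implies 1
  data.frame(element = element, count = as.numeric(digits),
             stringsAsFactors = FALSE)
}

parse_formula("C5H11BrO")
```

Like the strapply version, this does not handle parentheses, charges or
hydrates, and it does not check that the symbols are real elements.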
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Statistics & Software Consulting
> >>>>> GKX Group, GKX Associates Inc.
> >>>>> tel: 1-877-GKX-GROUP
> >>>>> email: ggrothendieck at gmail.com
> >>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Spencer Graves, PE, PhD
> >>> President and Chief Operating Officer
> >>> Structure Inspection and Monitoring, Inc.
> >>> 751 Emerson Ct.
> >>> San José, CA 95126
> >>> ph: 408-655-4567
> >>>
> >>
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
>