[R] String based chemical name identification

Law, Jason Jason.Law at portlandoregon.gov
Thu Jul 4 00:21:59 CEST 2013

You might be better off using a web service like ChemSpider to do the matching for you <http://www.chemspider.com/AboutServices.aspx?>.  Expecting to identify synonyms from the name strings alone is probably optimistic unless they are exact matches.

Here's some Python code that seems to make it pretty easy: https://github.com/mcs07/ChemSpiPy.  Search the names, extract the InChI for the best match, and then you can match them in R via the InChI.  It might require some fixing by hand afterwards.
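A rough sketch of that workflow, assuming the name-to-InChI lookup is available as a plain function. The `lookup` argument and the `match_by_inchi` name are hypothetical, made up for illustration. In practice `lookup` would wrap a ChemSpider search and return the InChI of the best hit, or None when nothing is found. The join is shown in Python only to keep the example in one language; the same join on an InChI column could be done in R with merge().

```python
def match_by_inchi(names_a, synonym_records, lookup):
    """Match names from the first database to records in the second via InChI.

    names_a         -- list of names from the first database
    synonym_records -- list of synonym lists, one list per compound
    lookup          -- hypothetical callable: name -> InChI string, or None
    """
    # Index the second database: InChI -> position of the record.
    inchi_to_record = {}
    for idx, synonyms in enumerate(synonym_records):
        for synonym in synonyms:
            inchi = lookup(synonym)
            if inchi:
                # Keep the first record seen for a given InChI.
                inchi_to_record.setdefault(inchi, idx)
                break  # one resolved synonym per record is enough

    matches = {}    # name from the first database -> record index
    unmatched = []  # names left over for fixing by hand
    for name in names_a:
        inchi = lookup(name)
        if inchi in inchi_to_record:
            matches[name] = inchi_to_record[inchi]
        else:
            unmatched.append(name)
    return matches, unmatched
```

Whatever ends up in `unmatched` is the part that would still need checking by hand.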


Jason Law

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Zsurzsa Laszlo
Sent: Wednesday, July 03, 2013 7:28 AM
To: r-help at r-project.org
Subject: [R] String based chemical name identification

The problem is the following:

I have two big databases. The first one looks like this:

  2-Methyl-4-trimethylsilyloxyoct-5-yne
  Benzoic acid, methyl ester
  Benzoic acid, 2-methyl-, methyl ester
  Acetic acid, phenylmethyl ester
  2,7-Dimethyl-4-trimethylsilyloxyoct-7-en-5-yne
  etc.

The second one looks like this:

  Name: D-Tagatose 1,6-bisphosphate
  Name: 1-Phosphatidyl-D-myo-inositol;: 1-Phosphatidyl-1D-myo-inositol;: 1-Phosphatidyl-myo-inositol;: Phosphatidyl-1D-myo-inositol;: (3-Phosphatidyl)-1-D-inositol;: 1,2-Diacyl-sn-glycero-3-phosphoinositol;: Phosphatidylinositol
  Name: Androstenedione;: Androst-4-ene-3,17-dione;: 4-Androstene-3,17-dione
  Name: Spermine;: N,N'-Bis(3-aminopropyl)-1,4-butanediamine
  Name: H+;: Hydron
  Name: 3-Iodo-L-tyrosine
  etc.
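The second format could be split into one synonym list per compound along these lines. This is only a sketch, assuming every record starts with "Name:" and synonyms are separated by ";:" exactly as in the sample above; the function name `parse_synonym_records` is made up for illustration.

```python
import re


def parse_synonym_records(text):
    """Split 'Name: a;: b  Name: c' style text into [[a, b], [c]]."""
    records = []
    # Every record begins with "Name:", so split the text on that marker.
    for chunk in re.split(r"\bName:", text):
        # Collapse line breaks and runs of spaces left over from wrapping.
        chunk = " ".join(chunk.split())
        if not chunk:
            continue  # text before the first "Name:" marker
        # Synonyms within one record are separated by ";:".
        synonyms = [s.strip() for s in chunk.split(";:") if s.strip()]
        records.append(synonyms)
    return records
```

Each inner list then holds all names for one compound, so a hit on any synonym identifies the record.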

Both of them have more than 3000 lines. Matching the names by hand is not an option because I don't know chemistry.

*Possible solution I came up with*:

Go through all the names of the first database and then try to match each one against the other database. I'm using the *regexec* and *strsplit* functions for the matching. Basically I split the name into small chunks and try to get some hits in the other database.
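One way the chunk idea could look, as a minimal sketch rather than the code actually used: split each name into lowercase alphanumeric chunks and score candidates by the share of chunks they have in common (Jaccard similarity). It is written in Python instead of R, and the function names and the 0.5 threshold are arbitrary choices made for illustration.

```python
import re


def tokens(name):
    """Lowercase a name and split it into alphanumeric chunks."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def jaccard(a, b):
    """Share of chunks common to both names, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(name, candidates, threshold=0.5):
    """Return (candidate, score) for the best-scoring candidate, or None."""
    best = None
    for candidate in candidates:
        score = jaccard(name, candidate)
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best
```

Chunk overlap is crude: "Benzoic acid, methyl ester" and "Benzoic acid, 2-methyl-, methyl ester" share four of five chunks and score 0.8, although they are different compounds. Any match found this way would need to be checked.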

I can supply code if needed, but I did not want to spam in the first mail.

Any solution is welcome! It can be in pseudo-code or any kind of logical reasoning. It does not matter.

Laszlo-Andras Zsurzsa

Msc. Informatics, Technical University Munchen


R-help at r-project.org mailing list
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
