[R] categorical variable coefficients in QSAR

rlittle at ula.ve
Thu Aug 30 16:02:37 CEST 2007


Dear list:
I am interested in the following sort of problem, which arises frequently
in the field of QSAR. I have biological activity as a function of chemical
structure, with structure described categorically: each POSITION is a
factor, and its levels are the SUBSTITUENTS found at that position. For
example, data from Kubinyi (http://www.kubinyi.de/dd-12.pdf) for this type
of analysis are presented as follows:
factor para:
H
F
Cl
Br
I
Me
H
H
H
H
H
F
F
F
Cl
Cl
Cl
Br
Br
Br
Me
Me
factor meta:
H
H
H
H
H
H
F
Cl
Br
I
Me
Cl
Br
Me
Cl
Br
Me
Cl
Br
Me
Me
Br
observed biological activity:
7.46
8.16
8.68
8.89
9.25
9.30
7.52
8.16
8.30
8.40
8.46
8.19
8.57
8.82
8.89
8.92
8.96
9.00
9.35
9.22
9.30
9.52

I think the following analysis should then be appropriate:


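# Read the substituent at each position as a factor, one value per compound
meta <- factor(scan(file = "meta", what = "character"))
para <- factor(scan(file = "para", what = "character"))
# Read the observed biological activities
ba <- scan(file = "ba")

# Additive model in both positions, with the intercept removed
rslt <- lm(ba ~ meta + para - 1)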
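
For anyone without the three input files, the same objects can be built inline. This is only a sketch: it assumes the values in the three lists above are paired row by row, and it repeats the fit from this message unchanged.

```r
# Values retyped from the lists above, assumed to be paired row by row.
para <- factor(c("H", "F", "Cl", "Br", "I", "Me", "H", "H", "H", "H", "H",
                 "F", "F", "F", "Cl", "Cl", "Cl", "Br", "Br", "Br", "Me", "Me"))
meta <- factor(c("H", "H", "H", "H", "H", "H", "F", "Cl", "Br", "I", "Me",
                 "Cl", "Br", "Me", "Cl", "Br", "Me", "Cl", "Br", "Me", "Me", "Br"))
ba <- c(7.46, 8.16, 8.68, 8.89, 9.25, 9.30, 7.52, 8.16, 8.30, 8.40, 8.46,
        8.19, 8.57, 8.82, 8.89, 8.92, 8.96, 9.00, 9.35, 9.22, 9.30, 9.52)

# Guard against a transcription slip: all three vectors must have 22 entries
stopifnot(length(para) == 22, length(meta) == 22, length(ba) == 22)

# Same model as in the message: both factors, intercept removed
rslt <- lm(ba ~ meta + para - 1)

# List which coefficients were actually estimated
print(names(coef(rslt)))
```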

What I wish to obtain is a coefficient for each substituent at each
position, as does Kubinyi:

          H      F     Cl     Br      I     Me
meta   0.00  -0.30   0.21   0.43   0.58   0.45
para   0.00   0.34   0.77   1.02   1.43   1.26


However, I do not get a coefficient for the Br substituent at the para
position. I would like to know if there is an error in this formulation.
The technique is quite well established in the field of medicinal
chemistry and it is traditional that the binary incidence matrix is formed
"by hand" as an intermediate step in the analysis, instead of the much
simpler formulation that I am considering here.
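
One possible reading of the missing coefficient, offered as a sketch rather than a definitive answer: under R's default treatment contrasts, the first level of `para` in sorted order (here `Br`) serves as the baseline and gets no coefficient of its own, while `-1` only removes the intercept and lets `meta` keep all six levels. Assuming the goal is a layout like Kubinyi's, with H fixed at 0.00, `relevel()` can make H the baseline of both factors, and `model.matrix()` displays the binary incidence matrix that `lm()` builds internally. The data are retyped from the lists above and assumed to be paired row by row; `fit` and `X` are names chosen for this example only.

```r
# Values retyped from the lists above, assumed to be paired row by row.
para <- factor(c("H", "F", "Cl", "Br", "I", "Me", "H", "H", "H", "H", "H",
                 "F", "F", "F", "Cl", "Cl", "Cl", "Br", "Br", "Br", "Me", "Me"))
meta <- factor(c("H", "H", "H", "H", "H", "H", "F", "Cl", "Br", "I", "Me",
                 "Cl", "Br", "Me", "Cl", "Br", "Me", "Cl", "Br", "Me", "Me", "Br"))
ba <- c(7.46, 8.16, 8.68, 8.89, 9.25, 9.30, 7.52, 8.16, 8.30, 8.40, 8.46,
        8.19, 8.57, 8.82, 8.89, 8.92, 8.96, 9.00, 9.35, 9.22, 9.30, 9.52)

# factor() stores levels in sorted order, so "Br" is the default baseline
print(levels(para))

# Make H the baseline level at both positions
meta <- relevel(meta, ref = "H")
para <- relevel(para, ref = "H")

# Keep the intercept: it is the fitted activity of the H/H compound, and
# every other coefficient is a substituent contribution relative to H
fit <- lm(ba ~ meta + para)
print(round(coef(fit), 2))

# The 0/1 incidence matrix derived from the factors
# (the first column is the intercept)
X <- model.matrix(fit)
print(dim(X))
print(head(X))
```

If this reading is right, the printed coefficients should be comparable with the table quoted from Kubinyi, with H implicitly at zero for both positions.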

Thank you for whatever insight you may give.

Prof. Roy Little
Dept. Chem.
Universidad de los Andes
Mérida, Venezuela


