[R] categorical variable coefficients in QSAR [Broadcast]

Liaw, Andy andy_liaw at merck.com
Fri Sep 7 03:26:33 CEST 2007


No one seems to have picked up on this, so I'll take a stab:

You need to read para and meta into R as factors. If you want the coefficients to match the ones you showed, the factor levels must also be in the same order as the columns of your coefficient table, so that H comes first and serves as the reference level.

I cut-and-pasted the three columns of data into R separately, like so:

[copy "para" data to the clipboard]
R> para <- factor(scan("clipboard", what=""))
Read 22 items
[copy "meta" data to the clipboard]
R> meta <- factor(scan("clipboard", what=""))
Read 22 items
[copy biological activity to the clipboard]
R> y <- scan("clipboard")
Read 22 items
[copy the column heading of the coefficient table to the clipboard]
R> lvl <- scan("clipboard", what="")
Read 6 items
R> para <- factor(as.character(para), levels=lvl)
R> meta <- factor(as.character(meta), levels=lvl)
R> qsar <- lm(y ~ para + meta)
R> qsar

Call:
lm(formula = y ~ para + meta)

Coefficients:
(Intercept)        paraF       paraCl       paraBr        paraI       paraMe  
     7.8213       0.3400       0.7675       1.0200       1.4287       1.2560  
      metaF       metaCl       metaBr        metaI       metaMe  
    -0.3013       0.2068       0.4340       0.5787       0.4540  

Rounded to two decimal places, these coefficients match the ones you showed.
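
For anyone who wants to rerun this without the clipboard steps, here is a minimal sketch with the data typed in directly. The vector literals are a transcription of the three columns in the quoted message below, and the object names follow the session above; treat it as an illustration, not a replacement for reading your own data files.

```r
# Level order taken from the column heading of the coefficient table,
# so that H is the first (reference) level for both positions.
lvl <- c("H", "F", "Cl", "Br", "I", "Me")

# Substituents at the para and meta positions, 22 compounds each,
# transcribed from the quoted message.
para <- factor(c("H", "F", "Cl", "Br", "I", "Me", "H", "H", "H", "H", "H",
                 "F", "F", "F", "Cl", "Cl", "Cl", "Br", "Br", "Br", "Me", "Me"),
               levels = lvl)
meta <- factor(c("H", "H", "H", "H", "H", "H", "F", "Cl", "Br", "I", "Me",
                 "Cl", "Br", "Me", "Cl", "Br", "Me", "Cl", "Br", "Me", "Me", "Br"),
               levels = lvl)

# Observed biological activity.
y <- c(7.46, 8.16, 8.68, 8.89, 9.25, 9.30, 7.52, 8.16, 8.30, 8.40, 8.46,
       8.19, 8.57, 8.82, 8.89, 8.92, 8.96, 9.00, 9.35, 9.22, 9.30, 9.52)

# Additive model with an intercept; each coefficient is a difference from H.
qsar <- lm(y ~ para + meta)
print(round(coef(qsar), 2))  # compare with the printed coefficients above
```

Passing levels= when the factor is created avoids the separate re-leveling step used in the session.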

If you don't reorder the levels of the factors, then by default R orders them alphabetically, so that Br becomes the "reference" and all coefficients are differences from Br.
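
The default ordering is easy to check on a toy example. The sketch below uses made-up six-element vectors (not your data) purely to show the level order, what relevel() does, and which dummy columns R builds for a formula like meta + para - 1. As far as I can tell, that last part is why your fit had no para Br coefficient: with the intercept removed, the first factor in the formula gets a column for every level, but the second still drops its first level, which is Br under alphabetical ordering.

```r
# Toy vector containing each substituent once (illustrative only).
x <- c("H", "F", "Cl", "Br", "I", "Me")

# Default: levels are sorted, so Br comes first and becomes the reference.
default_levels <- levels(factor(x))
print(default_levels)

# relevel() moves the chosen level to the front and keeps the rest in order.
releveled <- levels(relevel(factor(x), ref = "H"))
print(releveled)

# Dummy columns for a no-intercept formula with default level ordering:
# meta (listed first) keeps all six levels, para loses its first level.
meta <- factor(x)
para <- factor(rev(x))
cn <- colnames(model.matrix(~ meta + para - 1))
print(cn)
```

If you only need H as the reference and don't care about the order of the remaining levels, relevel(para, ref = "H") is a shorter alternative to specifying all the levels.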

HTH,
Andy


From: rlittle at ula.ve
> Dear list:
> I am interested in the following sort of problem, as is found frequently
> in the field of QSAR. I have biological activity as a function of chemical
> structure, with structure defined in a categorical manner in that the
> SUBSTITUENT is the levels of the POSITION factor. For example, data from
> Kubinyi (http://www.kubinyi.de/dd-12.pdf) for this type of analysis is
> presented as follows:
> factor para:
> H
> F
> Cl
> Br
> I
> Me
> H
> H
> H
> H
> H
> F
> F
> F
> Cl
> Cl
> Cl
> Br
> Br
> Br
> Me
> Me
> factor meta:
> H
> H
> H
> H
> H
> H
> F
> Cl
> Br
> I
> Me
> Cl
> Br
> Me
> Cl
> Br
> Me
> Cl
> Br
> Me
> Me
> Br
> observed biological activity:
> 7.46
> 8.16
> 8.68
> 8.89
> 9.25
> 9.30
> 7.52
> 8.16
> 8.30
> 8.40
> 8.46
> 8.19
> 8.57
> 8.82
> 8.89
> 8.92
> 8.96
> 9.00
> 9.35
> 9.22
> 9.30
> 9.52
> 
> I then think the following analysis should be appropriate
> 
> 
> meta<-factor(scan(file="meta",what="character"))
> para<-factor(scan(file="para",what="character"))
> ba<-scan(file="ba")
> 
> rslt<-lm(ba~meta+para-1)
> 
> What I wish to obtain is a coefficient for each substituent at each
> position, as does Kubinyi:
> 
>          H      F     Cl     Br      I     Me
> meta  0.00  -0.30   0.21   0.43   0.58   0.45
> para  0.00   0.34   0.77   1.02   1.43   1.26
> 
> 
> However, I do not get a coefficient for the Br substituent at the para
> position. I would like to know if there is an error in this formulation.
> The technique is quite well established in the field of medicinal
> chemistry and it is traditional that the binary incidence matrix is formed
> "by hand" as an intermediate step in the analysis, instead of the much
> simpler formulation that I am considering here.
> 
> Thank you for whatever insight you may give.
> 
> Prof. Roy Little
> Dept. Chem.
> Universidad de los Andes
> Mérida, Venezuela
> 

