[R] Constrained Regression

Spencer Graves spencer.graves at structuremonitoring.com
Mon Nov 1 03:55:02 CET 2010


<replies in line below>


On 10/31/2010 6:26 PM, Jim Silverton wrote:
> I thought I would 'add' some meat to the problem I sent.  This is all I know
> (1) f = a*X1 + (1-a)*X2

How do you know "f = a*X1 + (1-a)*X2"?


Why does this relationship make more sense than, e.g., log(f/(1-f)) = 
a*X1 + (1-a)*X2?
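(To make the comparison concrete, here is a hypothetical illustration in R; the numbers for a, X1 and X2 are made up:)

```r
# Hypothetical numbers: compare the two candidate relationships.
a  <- 0.3
X1 <- c(0.1, 0.5, 0.9)
X2 <- c(0.2, 0.4, 0.6)
f_convex <- a * X1 + (1 - a) * X2          # model (1): convex combination of probabilities
f_logit  <- plogis(a * X1 + (1 - a) * X2)  # alternative: logit(f) linear in X1 and X2
```

Both keep f in (0, 1) here, but they imply different data-generating mechanisms, which is the point of the question.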

> (2) I know n values of f and X1, which happen to be probabilities

What do you mean that f and X1 are probabilities?  Where do they come from?


Are they the results of binomial experiments, like tosses of a biased coin?


> (3) I know nothing about X2 except that it also lies in (0,1)
> (4) X1 is the probability under the null (Fisher's exact test) and X2 is the
> alternative. But I have no idea what the alternative to Fisher's exact test
> can be..

How do you compute X1?  Do you compute it from data?  If so, then it's 
more an estimate of a probability than a probability itself.  Ditto 
for X2.


The preferred method for estimating almost anything, whenever feasible, is 
to write a likelihood function, i.e., the probability of the data as a 
function of unknown parameters, and then estimate those parameters by 
maximizing the likelihood.  Even most nonparametric procedures can be 
expressed this way under an appropriately chosen probability model.
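As a toy illustration of that recipe (simulated Bernoulli data, which is an assumption on my part, not necessarily the poster's setting):

```r
# Maximum likelihood for a single Bernoulli probability p, done "by hand".
set.seed(1)
y <- rbinom(100, size = 1, prob = 0.3)   # simulated biased-coin tosses

# Negative log-likelihood of the data as a function of the unknown p.
negLogLik <- function(p) -sum(dbinom(y, size = 1, prob = p, log = TRUE))

fit <- optimize(negLogLik, interval = c(1e-6, 1 - 1e-6))
fit$minimum   # numerically matches the closed-form MLE, mean(y)
```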


In most situations, I prefer to eliminate constraints by 
transformations: replace f by ph = log(f/(1-f)), X1 by xi1 = 
log(X1/(1-X1)), X2 by xi2 = log(X2/(1-X2)), and a by alpha = 
log(a/(1-a)); then write a relationship between these unconstrained 
variables and estimate alpha by maximum likelihood.
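A sketch of what I mean, under assumptions I am inventing purely for illustration (X2 observed, Gaussian errors on the logit scale):

```r
# Simulate data where ph is an a-weighted combination of logits plus noise.
set.seed(2)
n  <- 200
X1 <- runif(n); X2 <- runif(n)
a_true <- 0.7
f <- plogis(a_true * qlogis(X1) + (1 - a_true) * qlogis(X2) + rnorm(n, sd = 0.1))

# Reparameterize: a = plogis(alpha) is automatically in (0, 1), and
# sigma = exp(s) is automatically positive, so optim() runs unconstrained.
negLogLik <- function(par) {
  a     <- plogis(par[1])
  sigma <- exp(par[2])
  mu    <- a * qlogis(X1) + (1 - a) * qlogis(X2)
  -sum(dnorm(qlogis(f), mu, sigma, log = TRUE))
}
fit <- optim(c(0, 0), negLogLik)
plogis(fit$par[1])   # estimate of a; should be near a_true = 0.7
```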


Hope this helps.
Spencer

> (5) I need to estimate a (which is supposed to be a proportion).
> (6) I was thinking about imputing values from (0,1) using  a beta as the
> values of X2.
>
> Any help is greatly appreciated.
>
> Jim...
>
> On Sun, Oct 31, 2010 at 1:44 PM, David Winsemius<dwinsemius at comcast.net>wrote:
>
>> On Oct 31, 2010, at 12:54 PM, Spencer Graves wrote:
>>
>>   Have you tried the 'sos' package?
>> I have, and I am taking this opportunity to load it with my .Rprofile to
>> make it more accessible. It works very well. Very clean display. I also have
>> constructed a variant of RSiteSearch that I find more useful than the
>> current defaults.
>>
>> rhelpSearch<- function(string,
>>                   restrict = c("Rhelp10", "Rhelp08", "Rhelp02", "functions"
>> ),
>>                    matchesPerPage = 100, ...) {
>>               RSiteSearch(string=string,
>>               restrict = restrict,
>>               matchesPerPage = matchesPerPage, ...)}
>>
>>
>>
>>> install.packages('sos') # if not already installed
>>> library(sos)
>>> cr<- ???'constrained regression' # found 149 matches
>>> summary(cr) # in 69 packages
>>> cr # opens a table in a browser listing all 149 matches with links to the
>>> help pages
>>>
>>>      However, I agree with Ravi Varadhan:  I'd want to understand the
>>> physical mechanism generating the data.  If each is, for example, a
>>> proportion, then I'd want to use logistic regression, possibly after some
>>> approximate logistic transformation of X1 and X2 that prevents logit(X) from
>>> going to +/-Inf.  This is a different model, but it achieves the need to
>>> avoid predictions of Y going outside the range (0, 1).
>>>
>> No argument. I defer to both of your greater experiences in such problems
>> and your interest in educating us less knowledgeable users. I also need to
>> amend my suggested strategy in situations where a linear model _might_ be
>> appropriate, since I think the inclusion of the surrogate variable in the
>> solve.QP setup is very probably creating confusion. After reconsideration I
>> think one should keep the two approaches separate. These are two approaches
>> to the non-intercept versions of the model that yield the same estimate (but
>> only because the constraints do not get invoked):
>>
>>> lm(medv~I(age-lstat) +offset(lstat) -1, data=Boston)
>> Call:
>> lm(formula = medv ~ I(age - lstat) + offset(lstat) - 1, data = Boston)
>>
>> Coefficients:
>> I(age - lstat)
>>         0.1163
>>
>>
>>> library(MASS)   ## to access the Boston data
>>>   designmat<- model.matrix(medv~age+lstat-1, data=Boston)
>>>   Dmat<- crossprod(designmat, designmat)
>>>   dvec<- crossprod(designmat, Boston$medv)
>>>   Amat<- cbind(1, diag(NROW(Dmat)))
>>>   bvec<- c(1, rep(0, NROW(Dmat)))
>>>   meq<- 1
>>>   library(quadprog);
>>>   res<- solve.QP(Dmat, dvec, Amat, bvec, meq)
>>> zapsmall(res$solution)  # zapsmall not really needed in this instance
>> [1] 0.1163065 0.8836935
>>
>> --
>> David.
>>
>>
>>>      Spencer
>>>
>>>
>>> On 10/31/2010 9:01 AM, David Winsemius wrote:
>>>
>>>> On Oct 31, 2010, at 2:44 AM, Jim Silverton wrote:
>>>>
>>>>   Hello everyone,
>>>>> I have 3 variables Y, X1 and X2. Each variable lies between 0 and 1. I
>>>>> want
>>>>> to do a constrained regression such that a>0 and (1-a)>0
>>>>>
>>>>> for the model:
>>>>>
>>>>> Y = a*X1 + (1-a)*X2
>>>>>
>>>>
>>>> It would not accomplish the constraint that a > 0, but you could
>>>> accomplish the other constraint within an lm fit:
>>>>
>>>> X3<- X1-X2
>>>> lm(Y ~ X3 + offset(X2) )
>>>>
>>>> Since beta1 is the slope in the model Y ~ 1 + beta1*(X1 - X2) + 1*X2,
>>>> which rearranges to Y ~ intercept + beta1*X1 + (1 - beta1)*X2,
>>>>
>>>> ... so beta1 is a.
>>>>
>>>> In the case beta1 < 0, then I suppose a would be assigned 0. This might be
>>>> accomplished within an iterative calculation framework by a large
>>>> penalization for negative values. In a reply (1) to a question by Carlos
>>>> Alzola in 2008 on r-help, Berwin Turlach offered a solution to a similar
>>>> problem (sum(coef) == 1 AND coef non-negative). Modifying his code to
>>>> incorporate the above strategy (and choosing two variables for which
>>>> parameter values might be inside the constraint boundaries) we get:
>>>>
>>>> library(MASS)   ## to access the Boston data
>>>>   designmat<- model.matrix(medv~I(age-lstat) +offset(lstat), data=Boston)
>>>>   Dmat<-crossprod(designmat, designmat); dvec<- crossprod(designmat,
>>>>   Boston$medv)
>>>>   Amat<- cbind(1, diag(NROW(Dmat)))
>>>>   bvec<- c(1,rep(0,NROW(Dmat)))
>>>>   meq<- 1
>>>> library(quadprog)
>>>>   res<- solve.QP(Dmat, dvec, Amat, bvec, meq)
>>>>
>>>>> zapsmall(res$solution)
>>>> [1] 0.686547 0.313453
>>>>
>>>> Turlach specifically advised against any interpretation of this
>>>> particular result, which was only constructed to demonstrate the mathematical
>>>> mechanics.
>>>>
>>>>
>>>>> I tried the help on the constrained regression in R but I concede that
>>>>> it
>>>>> was not helpful.
>>>>>
>>>> I must not have that package installed because I got nothing that
>>>> appeared to be useful with ??"constrained regression" .
>>>>
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
>>>>
>>>> 1) http://finzi.psych.upenn.edu/Rhelp10/2008-March/155990.html
>>>>
>>>> ______________________________________________
>>>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
>


-- 
Spencer Graves, PE, PhD
President and Chief Operating Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph:  408-655-4567


