[R] Re gression by groups questions

xavier.chardon at free.fr xavier.chardon at free.fr
Thu Jun 18 09:25:39 CEST 2009


Hi,

It seems like your multivariate regression works for groups A and B. For group C, values are calculated using an interaction between x1 and x2. Try to include it in the regression model to find the right coeffs:

y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1

It works for me. Notice in summary() that the coefficients for the x1:x2 interaction are not significantly different from 0 in groups A and B, which is what we expected.

HTH,

Xavier



----- Mail Original -----
De: "Chris Friedl" <cfriedalek at gmail.com>
À: r-help at r-project.org
Envoyé: Mercredi 17 Juin 2009 12h07:08 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienne
Objet: [R] Re gression by groups questions


I have a large dataset grouped by a factor and I want to perform a regression
on each data subset based on this factor. There are many ways to do this,
posted here and elsewhere. I have tried several. However I found one method
posted on the R wiki which works exactly as I want, and I like the elegance
and simplicity of the solution,  but I don't understand how it works. Its
all in the formula specification. Problem is also that I want to extend it
from a single variable case to a multivariable case and haven't really got a
clue how to go about it. Hope someone can help.

# The code below creates a data.frame comprising two marginally noisy
straight lines and performs a 
# regression on each based on the grp factor. This is a single variable case
based on R wiki one liner.
# y = x		for grp A
# y = 2 + 5x	for grp B
set.seed(5)
ind <- as.vector(runif(50))
d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
data1 <- rbind(d1,d2)
fits <- lm(y ~ grp:x + grp - 1, data=data1)
summary(fits)
with(data1, plot(x,y))
abline(coef(fits)[c(1,3)], col="red")
abline(coef(fits)[c(2,4)], col="blue")

Output:
Call:
lm(formula = y ~ grp:x + grp - 1, data = data1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.206414 -0.057582 -0.001376  0.062737  0.242111 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
grpA   -0.02935    0.02811  -1.044    0.299    
grpB    1.96200    0.02811  69.803   <2e-16 ***
grpA:x  1.06276    0.04921  21.596   <2e-16 ***
grpB:x  5.09351    0.04921 103.502   <2e-16 ***
---

We can see that the intercepts of are close to 0 and 2 and slopes close to 1
and 5 as suggested by the data creation equation. In fact these results are
identical to running lm individually on each subset of data with formula y ~
x. What I don't understand is how this result comes about from the formula
specification y ~ grp:x + grp -1 . Can someone offer an explanation?


In addition I want to apply the same idea for a multivariate data set. The
code below shows a multivariate dataset which I hope to fit by group.

# The code below creates a data.frame comprising three marginally noisy
planes. I want to perform a 
# regression on each based on the grp factor. 
# y = x1 + x2 			for grp A
# y = 2 + 2x1 + 4x2 		for grp B
# y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
ind <- matrix(runif(200), ncol=2, dimnames=list(c(), c("x1","x2")))
d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), grp=rep("A",
100))
d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("B", 100))
d3 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
100))
data2 <- rbind(d1,d2,d3)
# visualise
ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
summary(fits)

Output:

Call:
lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.338348 -0.092069 -0.002844  0.094813  1.259094 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
grpA    -0.01943    0.08442  -0.230    0.818    
grpB     1.98345    0.08442  23.495  < 2e-16 ***
grpC     0.51002    0.08442   6.042 4.66e-09 ***
grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2838 on 291 degrees of freedom
Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 

Whilst there seems to be some correlation between coefficients and terms
that doesn't apply in all cases. I've tried many other combinations but noe
that gave me coefficients similar to the defining equations. Bottom line is
I don't understand what the formula specification should be for the
multivariate version.

Thanks for any help given.

Chris

-- 
View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list