[R] Re gression by groups questions

Chris Friedl cfriedalek at gmail.com
Thu Jun 18 14:02:38 CEST 2009


Thanks Xavier. That's exactly the solution. 



xavier.chardon wrote:
> 
> Hi,
> 
> It seems like your multivariate regression works for groups A and B. For
> group C, values are calculated using an interaction between x1 and x2. Try
> to include it in the regression model to find the right coeffs:
> 
> y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1
> 
> It works for me. Notice in summary() that the coefficients for the x1:x2
> interaction are not significantly different from 0 in groups A and B,
> which is what we expected.
> 
> HTH,
> 
> Xavier
> 
> 
> 
> ----- Mail Original -----
> De: "Chris Friedl" <cfriedalek at gmail.com>
> À: r-help at r-project.org
> Envoyé: Mercredi 17 Juin 2009 12h07:08 GMT +01:00 Amsterdam / Berlin /
> Berne / Rome / Stockholm / Vienne
> Objet: [R] Re gression by groups questions
> 
> 
> I have a large dataset grouped by a factor and I want to perform a
> regression
> on each data subset based on this factor. There are many ways to do this,
> posted here and elsewhere. I have tried several. However I found one
> method
> posted on the R wiki which works exactly as I want, and I like the
> elegance
> and simplicity of the solution,  but I don't understand how it works. Its
> all in the formula specification. Problem is also that I want to extend it
> from a single variable case to a multivariable case and haven't really got
> a
> clue how to go about it. Hope someone can help.
> 
> # The code below creates a data.frame comprising two marginally noisy
> straight lines and performs a 
> # regression on each based on the grp factor. This is a single variable
> case
> based on R wiki one liner.
> # y = x		for grp A
> # y = 2 + 5x	for grp B
> set.seed(5)
> ind <- as.vector(runif(50))
> d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
> d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
> data1 <- rbind(d1,d2)
> fits <- lm(y ~ grp:x + grp - 1, data=data1)
> summary(fits)
> with(data1, plot(x,y))
> abline(coef(fits)[c(1,3)], col="red")
> abline(coef(fits)[c(2,4)], col="blue")
> 
> Output:
> Call:
> lm(formula = y ~ grp:x + grp - 1, data = data1)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.206414 -0.057582 -0.001376  0.062737  0.242111 
> 
> Coefficients:
>        Estimate Std. Error t value Pr(>|t|)    
> grpA   -0.02935    0.02811  -1.044    0.299    
> grpB    1.96200    0.02811  69.803   <2e-16 ***
> grpA:x  1.06276    0.04921  21.596   <2e-16 ***
> grpB:x  5.09351    0.04921 103.502   <2e-16 ***
> ---
> 
> We can see that the intercepts of are close to 0 and 2 and slopes close to
> 1
> and 5 as suggested by the data creation equation. In fact these results
> are
> identical to running lm individually on each subset of data with formula y
> ~
> x. What I don't understand is how this result comes about from the formula
> specification y ~ grp:x + grp -1 . Can someone offer an explanation?
> 
> 
> In addition I want to apply the same idea for a multivariate data set. The
> code below shows a multivariate dataset which I hope to fit by group.
> 
> # The code below creates a data.frame comprising three marginally noisy
> planes. I want to perform a 
> # regression on each based on the grp factor. 
> # y = x1 + x2 			for grp A
> # y = 2 + 2x1 + 4x2 		for grp B
> # y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
> ind <- matrix(runif(200), ncol=2, dimnames=list(c(), c("x1","x2")))
> d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("A",
> 100))
> d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("B", 100))
> d3 <- data.frame(ind,
> y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
> 100))
> data2 <- rbind(d1,d2,d3)
> # visualise
> ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
> fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
> summary(fits)
> 
> Output:
> 
> Call:
> lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -1.338348 -0.092069 -0.002844  0.094813  1.259094 
> 
> Coefficients:
>         Estimate Std. Error t value Pr(>|t|)    
> grpA    -0.01943    0.08442  -0.230    0.818    
> grpB     1.98345    0.08442  23.495  < 2e-16 ***
> grpC     0.51002    0.08442   6.042 4.66e-09 ***
> grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
> grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
> grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
> grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
> grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
> grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 
> Residual standard error: 0.2838 on 291 degrees of freedom
> Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
> F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 
> 
> Whilst there seems to be some correlation between coefficients and terms
> that doesn't apply in all cases. I've tried many other combinations but
> noe
> that gave me coefficients similar to the defining equations. Bottom line
> is
> I don't understand what the formula specification should be for the
> multivariate version.
> 
> Thanks for any help given.
> 
> Chris
> 
> -- 
> View this message in context:
> http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 

-- 
View this message in context: http://www.nabble.com/Re%3A-Re-gression-by-groups-questions-tp24087803p24091678.html
Sent from the R help mailing list archive at Nabble.com.




More information about the R-help mailing list