[R] Re gression by groups questions

Chris Friedl cfriedalek at gmail.com
Thu Jun 18 14:01:05 CEST 2009


pasting reply by Xavier into this thread for the sake of others who might
come across it in their R travels. Thanks Xavier

Hi,

It seems like your multivariate regression works for groups A and B. For
group C, values are calculated using an interaction between x1 and x2. Try
to include it in the regression model to find the right coeffs:

y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1

It works for me. Notice in summary() that the coefficients for the x1:x2
interaction are not significantly different from 0 in groups A and B, which
is what we expected.

HTH,

Xavier


Chris Friedl wrote:
> 
> I have a large dataset grouped by a factor and I want to perform a
> regression on each data subset based on this factor. There are many ways
> to do this, posted here and elsewhere. I have tried several. However I
> found one method posted on the R wiki which works exactly as I want, and I
> like the elegance and simplicity of the solution,  but I don't understand
> how it works. Its all in the formula specification. Problem is also that I
> want to extend it from a single variable case to a multivariable case and
> haven't really got a clue how to go about it. Hope someone can help.
> 
> # The code below creates a data.frame comprising two marginally noisy
> straight lines and performs a 
> # regression on each based on the grp factor. This is a single variable
> case based on R wiki one liner.
> # y = x		for grp A
> # y = 2 + 5x	for grp B
> set.seed(5)
> ind <- as.vector(runif(50))
> d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
> d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50))
> data1 <- rbind(d1,d2)
> fits <- lm(y ~ grp:x + grp - 1, data=data1)
> summary(fits)
> with(data1, plot(x,y))
> abline(coef(fits)[c(1,3)], col="red")
> abline(coef(fits)[c(2,4)], col="blue")
> 
> Output:
> Call:
> lm(formula = y ~ grp:x + grp - 1, data = data1)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.206414 -0.057582 -0.001376  0.062737  0.242111 
> 
> Coefficients:
>        Estimate Std. Error t value Pr(>|t|)    
> grpA   -0.02935    0.02811  -1.044    0.299    
> grpB    1.96200    0.02811  69.803   <2e-16 ***
> grpA:x  1.06276    0.04921  21.596   <2e-16 ***
> grpB:x  5.09351    0.04921 103.502   <2e-16 ***
> ---
> 
> We can see that the intercepts of are close to 0 and 2 and slopes close to
> 1 and 5 as suggested by the data creation equation. In fact these results
> are identical to running lm individually on each subset of data with
> formula y ~ x. What I don't understand is how this result comes about from
> the formula specification y ~ grp:x + grp -1 . Can someone offer an
> explanation?
> 
> 
> In addition I want to apply the same idea for a multivariate data set. The
> code below shows a multivariate dataset which I hope to fit by group.
> 
> # The code below creates a data.frame comprising three marginally noisy
> planes. I want to perform a 
> # regression on each based on the grp factor. 
> # y = x1 + x2 			for grp A
> # y = 2 + 2x1 + 4x2 		for grp B
> # y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
> ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2")))
> d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("A", 100))
> d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("B", 100))
> d3 <- data.frame(ind,
> y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
> 100))
> data2 <- rbind(d1,d2,d3)
> # visualise
> ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found"))
> fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
> summary(fits)
> 
> Output:
> 
> Call:
> lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -1.338348 -0.092069 -0.002844  0.094813  1.259094 
> 
> Coefficients:
>         Estimate Std. Error t value Pr(>|t|)    
> grpA    -0.01943    0.08442  -0.230    0.818    
> grpB     1.98345    0.08442  23.495  < 2e-16 ***
> grpC     0.51002    0.08442   6.042 4.66e-09 ***
> grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
> grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
> grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
> grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
> grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
> grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 
> Residual standard error: 0.2838 on 291 degrees of freedom
> Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
> F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 
> 
> Whilst there seems to be some correlation between coefficients and terms
> that doesn't apply in all cases. I've tried many other combinations but
> noe that gave me coefficients similar to the defining equations. Bottom
> line is I don't understand what the formula specification should be for
> the multivariate version.
> 
> Thanks for any help given.
> 
> Chris
> 
> 

-- 
View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24091659.html
Sent from the R help mailing list archive at Nabble.com.




More information about the R-help mailing list