[R] Fixed effect model: different estimation approaches with R return different results

Thu Oct 6 16:45:18 CEST 2022

Thank you Bert, I see your point.
The problem is that I am not sure if this is a statistical issue -
i.e., there is something wrong with my procedure - or, in a manner of
speaking, a software issue: e.g. lm is not expected to return the same
estimates for different approaches because of the way in which the
function estimates the solution, and it should not be used the way I
do. Maybe I should have better stressed this.
Valerio

On Thu, Oct 6, 2022 at 4:31 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>
> You could get lucky here, but strictly speaking, this list is about R programming and statistical issues are typically off topic Someone might respond privately, though.
>
> Cheers,
> Bert
>
> On Thu, Oct 6, 2022 at 4:24 AM Valerio Leone Sciabolazza <sciabolazza using gmail.com> wrote:
>>
>> Good morning,
>> I am trying to use R to estimate a fixed effects model (i.e., a panel
>> regression model controlling for unobserved time-invariant
>> heterogeneities across agents) using different estimation approaches
>> (e.g. replicating xtreg from Stata, see e.g.
>> https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/).
>> I have already asked this question on different stacks exchange forums
>> and contacted package creators who dealt with this issue before, but I
>> wasn't able to obtain an answer to my doubts.
>> I hope to have better luck on this list.
>>
>> Let me introduce the problem, and note that I am using an unbalanced panel.
>>
>> The easiest way to estimate my fixed effect model is using the function lm.
>>
>> Example:
>>
>> # load packages
>> library(dplyr)
>> # set seed for replication purposes
>> set.seed(123)
>> # create toy dataset
>> x <- rnorm(4000)
>> x2 <- rnorm(length(x))
>> id <- factor(sample(500,length(x),replace=TRUE))
>> firm <- data.frame(id = id) %>%
>> group_by(id) %>%
>> mutate(firm = 1:n()) %>%
>> pull(firm)
>> id.eff <- rlnorm(nlevels(id))
>> firm.eff <- rexp(length(unique(firm)))
>> y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
>> db = data.frame(y = y, x = x, id = id, firm = firm)
>> rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>%
>> filter(firm == 1) %>% pull(id)
>> db = db[-which(db$id %in% rm), ]
>> # Run regression
>> test <- lm(y ~ x + id, data = db)
>>
>> Another approach is demeaning the variables included into the model
>> specification.
>> In this way, one can exclude the fixed effects from the model. Of
>> course, point estimates will be correct, while standard errors will be
>> not (because we are not accounting for the degrees of freedom used in
>> the demeaning).
>>
>> # demean data
>> dbm <- as_tibble(db) %>%
>> group_by(id) %>%
>> mutate(y = y - mean(y),
>>        x = x - mean(x)) %>%
>> ungroup()
>> # run regression
>> test2 <- lm(y ~ x, data = dbm)
>> # compare results
>> summary(test)$coefficients[2,1]
>> > 0.9753364
>> summary(test2)$coefficients[2,1]
>> > 0.9753364
>>
>> Another way to do this is to demean the variables and add their grand
>> average (I believe that this is what xtreg from Stata does)
>>
>> # create data
>> n = length(unique(db$id))
>> dbh <- dbm %>%
>> mutate(yh = y + (sum(db$y)/n),
>>        xh = x + (sum(db$x)/n))
>> # run regression
>> test3 <- lm(yh ~ xh, dbh)
>> # compare results
>> summary(test)$coefficients[2,1]
>> > 0.9753364
>> summary(test2)$coefficients[2,1]
>> > 0.9753364
>> summary(test3)$coefficients[2,1]
>> > 0.9753364
>>
>> As one can see, the three approaches report the same point estimates
>> (again, standard errors will be different instead).
>>
>> When I include an additional set of fixed effects in the model
>> specification, the three approaches no longer return the same point
>> estimate. However, differences seem to be negligible and they could be
>> due to rounding.
>>
>> db$firm <- as.factor(db$firm)
>> dbm$firm <- as.factor(dbm$firm)
>> dbh$firm <- as.factor(dbh$firm)
>> testB <- lm(y ~ x + id + firm, data = db)
>> testB2 <- lm(y ~ x + firm, data = dbm)
>> testB3 <- lm(yh ~ xh + firm, data = dbh)
>> summary(testB)$coefficients[2,1]
>> > 0.9834414
>> summary(testB2)$coefficients[2,1]
>> > 0.9833334
>> summary(testB3)$coefficients[2,1]
>> > 0.9833334
>>
>> A similar behavior occurs if I use a dummy variable rather than a
>> continous one. For the only purpose of the example, I show this by
>> transforming my target variable x from a continuous to a dummy
>> variable.
>>
>> # create data
>> x3 <- ifelse(db$x > 0, 1, 0)
>> db <- db %>% mutate(x3 = x3)
>> dbm <- dbm %>%
>> mutate(x3 = x3) %>%
>> group_by(id) %>%
>> mutate(x3 = x3 - mean(x3)) %>%
>> ungroup()
>> dbh <- dbh %>% mutate(x3 = dbm$x3) %>%
>> mutate(x3 = x3 + (sum(db$x3)/n)) %>%
>> ungroup()
>> # Run regressions
>> testC <- lm(y ~ x3 + id + firm, data = db)
>> testC2 <- lm(y ~ x3 + firm, data = dbm)
>> testC3 <- lm(yh ~ x3 + firm, data = dbh)
>> summary(testC)$coefficients[2, 1]
>> > 1.579883
>> summary(testC2)$coefficients[2, 1]
>> > 1.579159
>> summary(testC3)$coefficients[2, 1]
>> > 1.579159
>>
>> Now, I want to estimate both the impact of x when this is higher than
>> 0 (i.e., x3) and when this is lower or equal to zero (call it x4).
>> Again, observe that x3 might as well be a real dummy variable, not a
>> transformation of a continuous variable.
>>
>> In order to do that, I exclude the intercept from my model.
>> Specifically, I do the following:
>>
>> # create data
>> x4 <- ifelse(db$x <= 0, 1, 0)
>> db <- db %>% mutate(x4 = x4)
>> dbm <- dbm %>%
>> mutate(x4 = x4) %>%
>> group_by(id) %>%
>> mutate(x4 = x4 - mean(x4)) %>%
>> ungroup()
>> dbh <- dbh %>% mutate(x4 = dbm$x4) %>%
>> mutate(x4 = x4 + (sum(db$x4)/n)) %>%
>> ungroup()
>> testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db)
>> testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm)
>> testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh)
>> summary(testD)$coefficients[1:2, ]
>> > 1.2372816 -0.3426011
>> summary(testD2)
>> > Call:
>> lm(formula = y ~ x3 + x4 + firm - 1, data = dbm)
>>
>> Residuals:
>>     Min      1Q  Median      3Q     Max
>> -3.8794 -0.7497  0.0010  0.7442  3.8486
>>
>> Coefficients: (1 not defined because of singularities)
>>        Estimate Std. Error t value Pr(>|t|)
>> x3      1.57916    0.03779  41.788  < 2e-16 ***
>> x4           NA         NA      NA       NA
>> ... redacted
>> summary(testD3)$coefficients[1:2]
>> > 3.254654 1.675495
>>
>> As you can see, the second approach is not able to estimate the impact
>> of x4 on y. At the same time, the first and the third approach return
>> very different point estimates.
>>
>> Is anyone able to explain me why I cannot obtain the same point
>> estimates for this last exercise?
>>
>> Is there anything wrong in the way I include the second set of fixed effects?
>> Is there anything wrong in the way I include the variables x3 and x4?
>> Or this is simply a problem due to some internal functions in R?
>>
>> Any hint would be much appreciated.
>>
>> Best,
>> Valerio
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.