[R] Fixed effect model: different estimation approaches with R return different results

Thu Oct 6 16:31:26 CEST 2022

You could get lucky here, but strictly speaking, this list is about R
programming and statistical issues are typically off topic Someone might
respond privately, though.

Cheers,
Bert

On Thu, Oct 6, 2022 at 4:24 AM Valerio Leone Sciabolazza <
sciabolazza using gmail.com> wrote:

> Good morning,
> I am trying to use R to estimate a fixed effects model (i.e., a panel
> regression model controlling for unobserved time-invariant
> heterogeneities across agents) using different estimation approaches
> (e.g. replicating xtreg from Stata, see e.g.
>
> https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/
> ).
> I have already asked this question on different stacks exchange forums
> and contacted package creators who dealt with this issue before, but I
> wasn't able to obtain an answer to my doubts.
> I hope to have better luck on this list.
>
> Let me introduce the problem, and note that I am using an unbalanced panel.
>
> The easiest way to estimate my fixed effect model is using the function lm.
>
> Example:
>
> # load packages
> library(dplyr)
> # set seed for replication purposes
> set.seed(123)
> # create toy dataset
> x <- rnorm(4000)
> x2 <- rnorm(length(x))
> id <- factor(sample(500,length(x),replace=TRUE))
> firm <- data.frame(id = id) %>%
> group_by(id) %>%
> mutate(firm = 1:n()) %>%
> pull(firm)
> id.eff <- rlnorm(nlevels(id))
> firm.eff <- rexp(length(unique(firm)))
> y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
> db = data.frame(y = y, x = x, id = id, firm = firm)
> rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>%
> filter(firm == 1) %>% pull(id)
> db = db[-which(db$id %in% rm), ]
> # Run regression
> test <- lm(y ~ x + id, data = db)
>
> Another approach is demeaning the variables included into the model
> specification.
> In this way, one can exclude the fixed effects from the model. Of
> course, point estimates will be correct, while standard errors will be
> not (because we are not accounting for the degrees of freedom used in
> the demeaning).
>
> # demean data
> dbm <- as_tibble(db) %>%
> group_by(id) %>%
> mutate(y = y - mean(y),
>        x = x - mean(x)) %>%
> ungroup()
> # run regression
> test2 <- lm(y ~ x, data = dbm)
> # compare results
> summary(test)$coefficients[2,1]
> > 0.9753364
> summary(test2)$coefficients[2,1]
> > 0.9753364
>
> Another way to do this is to demean the variables and add their grand
> average (I believe that this is what xtreg from Stata does)
>
> # create data
> n = length(unique(db$id))
> dbh <- dbm %>%
> mutate(yh = y + (sum(db$y)/n),
>        xh = x + (sum(db$x)/n))
> # run regression
> test3 <- lm(yh ~ xh, dbh)
> # compare results
> summary(test)$coefficients[2,1]
> > 0.9753364
> summary(test2)$coefficients[2,1]
> > 0.9753364
> summary(test3)$coefficients[2,1]
> > 0.9753364
>
> As one can see, the three approaches report the same point estimates
> (again, standard errors will be different instead).
>
> When I include an additional set of fixed effects in the model
> specification, the three approaches no longer return the same point
> estimate. However, differences seem to be negligible and they could be
> due to rounding.
>
> db$firm <- as.factor(db$firm)
> dbm$firm <- as.factor(dbm$firm)
> dbh$firm <- as.factor(dbh$firm)
> testB <- lm(y ~ x + id + firm, data = db)
> testB2 <- lm(y ~ x + firm, data = dbm)
> testB3 <- lm(yh ~ xh + firm, data = dbh)
> summary(testB)$coefficients[2,1]
> > 0.9834414
> summary(testB2)$coefficients[2,1]
> > 0.9833334
> summary(testB3)$coefficients[2,1]
> > 0.9833334
>
> A similar behavior occurs if I use a dummy variable rather than a
> continous one. For the only purpose of the example, I show this by
> transforming my target variable x from a continuous to a dummy
> variable.
>
> # create data
> x3 <- ifelse(db$x > 0, 1, 0)
> db <- db %>% mutate(x3 = x3)
> dbm <- dbm %>%
> mutate(x3 = x3) %>%
> group_by(id) %>%
> mutate(x3 = x3 - mean(x3)) %>%
> ungroup()
> dbh <- dbh %>% mutate(x3 = dbm$x3) %>%
> mutate(x3 = x3 + (sum(db$x3)/n)) %>%
> ungroup()
> # Run regressions
> testC <- lm(y ~ x3 + id + firm, data = db)
> testC2 <- lm(y ~ x3 + firm, data = dbm)
> testC3 <- lm(yh ~ x3 + firm, data = dbh)
> summary(testC)$coefficients[2, 1]
> > 1.579883
> summary(testC2)$coefficients[2, 1]
> > 1.579159
> summary(testC3)$coefficients[2, 1]
> > 1.579159
>
> Now, I want to estimate both the impact of x when this is higher than
> 0 (i.e., x3) and when this is lower or equal to zero (call it x4).
> Again, observe that x3 might as well be a real dummy variable, not a
> transformation of a continuous variable.
>
> In order to do that, I exclude the intercept from my model.
> Specifically, I do the following:
>
> # create data
> x4 <- ifelse(db$x <= 0, 1, 0)
> db <- db %>% mutate(x4 = x4)
> dbm <- dbm %>%
> mutate(x4 = x4) %>%
> group_by(id) %>%
> mutate(x4 = x4 - mean(x4)) %>%
> ungroup()
> dbh <- dbh %>% mutate(x4 = dbm$x4) %>%
> mutate(x4 = x4 + (sum(db$x4)/n)) %>%
> ungroup()
> testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db)
> testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm)
> testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh)
> summary(testD)$coefficients[1:2, ]
> > 1.2372816 -0.3426011
> summary(testD2)
> > Call:
> lm(formula = y ~ x3 + x4 + firm - 1, data = dbm)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -3.8794 -0.7497  0.0010  0.7442  3.8486
>
> Coefficients: (1 not defined because of singularities)
>        Estimate Std. Error t value Pr(>|t|)
> x3      1.57916    0.03779  41.788  < 2e-16 ***
> x4           NA         NA      NA       NA
> ... redacted
> summary(testD3)$coefficients[1:2]
> > 3.254654 1.675495
>
> As you can see, the second approach is not able to estimate the impact
> of x4 on y. At the same time, the first and the third approach return
> very different point estimates.
>
> Is anyone able to explain me why I cannot obtain the same point
> estimates for this last exercise?
>
> Is there anything wrong in the way I include the second set of fixed
> effects?
> Is there anything wrong in the way I include the variables x3 and x4?
> Or this is simply a problem due to some internal functions in R?
>
> Any hint would be much appreciated.
>
> Best,
> Valerio
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]