[R] Fixed effect model: different estimation approaches with R return different results

Thu Oct 6 13:23:23 CEST 2022

Good morning,
I am trying to use R to estimate a fixed effects model (i.e., a panel
regression model controlling for unobserved time-invariant
heterogeneities across agents) using different estimation approaches
(e.g. replicating xtreg from Stata, see e.g.
https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/).
I have already asked this question on different stacks exchange forums
and contacted package creators who dealt with this issue before, but I
wasn't able to obtain an answer to my doubts.
I hope to have better luck on this list.

Let me introduce the problem, and note that I am using an unbalanced panel.

The easiest way to estimate my fixed effect model is using the function lm.

Example:

# load packages
library(dplyr)
# set seed for replication purposes
set.seed(123)
# create toy dataset
x <- rnorm(4000)
x2 <- rnorm(length(x))
id <- factor(sample(500,length(x),replace=TRUE))
firm <- data.frame(id = id) %>%
group_by(id) %>%
mutate(firm = 1:n()) %>%
pull(firm)
id.eff <- rlnorm(nlevels(id))
firm.eff <- rexp(length(unique(firm)))
y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
db = data.frame(y = y, x = x, id = id, firm = firm)
rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>%
filter(firm == 1) %>% pull(id)
db = db[-which(db$id %in% rm), ]
# Run regression
test <- lm(y ~ x + id, data = db)

Another approach is demeaning the variables included into the model
specification.
In this way, one can exclude the fixed effects from the model. Of
course, point estimates will be correct, while standard errors will be
not (because we are not accounting for the degrees of freedom used in
the demeaning).

# demean data
dbm <- as_tibble(db) %>%
group_by(id) %>%
mutate(y = y - mean(y),
       x = x - mean(x)) %>%
ungroup()
# run regression
test2 <- lm(y ~ x, data = dbm)
# compare results
summary(test)$coefficients[2,1]
> 0.9753364
summary(test2)$coefficients[2,1]
> 0.9753364

Another way to do this is to demean the variables and add their grand
average (I believe that this is what xtreg from Stata does)

# create data
n = length(unique(db$id))
dbh <- dbm %>%
mutate(yh = y + (sum(db$y)/n),
       xh = x + (sum(db$x)/n))
# run regression
test3 <- lm(yh ~ xh, dbh)
# compare results
summary(test)$coefficients[2,1]
> 0.9753364
summary(test2)$coefficients[2,1]
> 0.9753364
summary(test3)$coefficients[2,1]
> 0.9753364

As one can see, the three approaches report the same point estimates
(again, standard errors will be different instead).

When I include an additional set of fixed effects in the model
specification, the three approaches no longer return the same point
estimate. However, differences seem to be negligible and they could be
due to rounding.

db$firm <- as.factor(db$firm)
dbm$firm <- as.factor(dbm$firm)
dbh$firm <- as.factor(dbh$firm)
testB <- lm(y ~ x + id + firm, data = db)
testB2 <- lm(y ~ x + firm, data = dbm)
testB3 <- lm(yh ~ xh + firm, data = dbh)
summary(testB)$coefficients[2,1]
> 0.9834414
summary(testB2)$coefficients[2,1]
> 0.9833334
summary(testB3)$coefficients[2,1]
> 0.9833334

A similar behavior occurs if I use a dummy variable rather than a
continous one. For the only purpose of the example, I show this by
transforming my target variable x from a continuous to a dummy
variable.

# create data
x3 <- ifelse(db$x > 0, 1, 0)
db <- db %>% mutate(x3 = x3)
dbm <- dbm %>%
mutate(x3 = x3) %>%
group_by(id) %>%
mutate(x3 = x3 - mean(x3)) %>%
ungroup()
dbh <- dbh %>% mutate(x3 = dbm$x3) %>%
mutate(x3 = x3 + (sum(db$x3)/n)) %>%
ungroup()
# Run regressions
testC <- lm(y ~ x3 + id + firm, data = db)
testC2 <- lm(y ~ x3 + firm, data = dbm)
testC3 <- lm(yh ~ x3 + firm, data = dbh)
summary(testC)$coefficients[2, 1]
> 1.579883
summary(testC2)$coefficients[2, 1]
> 1.579159
summary(testC3)$coefficients[2, 1]
> 1.579159

Now, I want to estimate both the impact of x when this is higher than
0 (i.e., x3) and when this is lower or equal to zero (call it x4).
Again, observe that x3 might as well be a real dummy variable, not a
transformation of a continuous variable.

In order to do that, I exclude the intercept from my model.
Specifically, I do the following:

# create data
x4 <- ifelse(db$x <= 0, 1, 0)
db <- db %>% mutate(x4 = x4)
dbm <- dbm %>%
mutate(x4 = x4) %>%
group_by(id) %>%
mutate(x4 = x4 - mean(x4)) %>%
ungroup()
dbh <- dbh %>% mutate(x4 = dbm$x4) %>%
mutate(x4 = x4 + (sum(db$x4)/n)) %>%
ungroup()
testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db)
testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm)
testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh)
summary(testD)$coefficients[1:2, ]
> 1.2372816 -0.3426011
summary(testD2)
> Call:
lm(formula = y ~ x3 + x4 + firm - 1, data = dbm)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8794 -0.7497  0.0010  0.7442  3.8486

Coefficients: (1 not defined because of singularities)
       Estimate Std. Error t value Pr(>|t|)
x3      1.57916    0.03779  41.788  < 2e-16 ***
x4           NA         NA      NA       NA
... redacted
summary(testD3)$coefficients[1:2]
> 3.254654 1.675495

As you can see, the second approach is not able to estimate the impact
of x4 on y. At the same time, the first and the third approach return
very different point estimates.

Is anyone able to explain me why I cannot obtain the same point
estimates for this last exercise?

Is there anything wrong in the way I include the second set of fixed effects?
Is there anything wrong in the way I include the variables x3 and x4?
Or this is simply a problem due to some internal functions in R?

Any hint would be much appreciated.

Best,
Valerio