[R] plm issues: error for "within" or "random", but not for "pooling"

Liviu Andronic landronimirc at gmail.com
Wed Feb 24 12:11:27 CET 2010


Dear Giovanni
Thank you for the quick reply, and sorry for not being able to respond
in kind: since our last e-mail we decided to change the way we measure
the variables, and this took some time. I managed to track down the
original issue, I think, to an improperly specified subset vector
passed via the "data=df[ , ]" argument. I guess this counts as a user error.

Working with plm I encountered some other potential issues:
- [, "var"] subsetting: on my data the following works fine
> summary(ibes.kld.df.p[ , ]$ibes1.delta1y.diff)
total sum of squares : 2472.4
      id     time
0.289638 0.032026

but the following takes 100% CPU for about a minute and then fails:
> summary(ibes.kld.df.p[ , "ibes1.delta1y.diff"])
Error in substring(blanks, 1, pad) : invalid substring argument(s)

I am not sure which characteristics of my data cause this (perhaps the
many NAs?), but I cannot reproduce it with a dummy example based on EmplUK:
> data("EmplUK", package = "plm")
> E <- pdata.frame(EmplUK, index = c("firm", "year"), drop.index = TRUE,row.names = TRUE)
> summary(E$emp)
total sum of squares : 261540
       id      time
0.9807654 0.0091085
> summary(E[, "emp"])  ##in the dummy, both ways of subsetting work fine
total sum of squares : 261540
       id      time
0.9807654 0.0091085
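
What I may try next (a rough sketch, not yet run) is to sprinkle NAs
into the dummy data and see whether missing values alone are enough to
trigger the substring() error:

## hypothetical attempt to reproduce: inject many NAs into the
## EmplUK-based pdata.frame and retry the [, "var"] form of subsetting
set.seed(1)
E2 <- E
E2$emp[sample(nrow(E2), floor(0.5 * nrow(E2)))] <- NA
summary(E2[, "emp"])   # does this now fail the same way as on my data?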


- p.value of coef t test == p.value of regression F test (for pooling
and within, but not for random):
> x.pool <- try(plm(get(x.ibes.diff1) ~ get(x.kld.diff1), ibes.kld.df.p, model="pooling"))
> summary(x.pool); x.ibes.diff1; x.kld.diff1
Oneway (individual) effect Pooling Model

Call:
plm(formula = get(x.ibes.diff1) ~ get(x.kld.diff1), data = ibes.kld.df.p,
    model = "pooling")

Unbalanced Panel: n=2336, T=1-15, N=9330

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max.
-5.4500 -0.1500  0.0799  0.2100  4.0500

Coefficients :
                 Estimate Std. Error t-value Pr(>|t|)
(Intercept)       -0.1199     0.0056   -21.4   <2e-16 ***
get(x.kld.diff1)   0.0297     0.0165     1.8    0.071 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    2720
Residual Sum of Squares: 2720
F-statistic: 3.25802 on 1 and 9328 DF, p-value: 0.0711
[1] "ibes2.delta12y.diff"
[1] "kld.delta1y_prod.diff"
> x.fe <- try(plm(get(x.ibes.diff1) ~ get(x.kld.diff1), ibes.kld.df.p, model="within"))
> summary(x.fe); x.ibes.diff1; x.kld.diff1
Oneway (individual) effect Within Model

Call:
plm(formula = get(x.ibes.diff1) ~ get(x.kld.diff1), data = ibes.kld.df.p,
    model = "within")

Unbalanced Panel: n=2336, T=1-15, N=9330

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max.
-4.1000 -0.1200  0.0121  0.1600  4.1300

Coefficients :
                 Estimate Std. Error t-value Pr(>|t|)
get(x.kld.diff1)   0.0324     0.0166    1.95    0.051 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    1790
Residual Sum of Squares: 1780
F-statistic: 3.80843 on 1 and 6993 DF, p-value: 0.051
[1] "ibes2.delta12y.diff"
[1] "kld.delta1y_prod.diff"


I suppose that this is OK, since for the pooling case I can confirm it
with plain lm(), but I am not sure I understand why it happens (a quick
sanity check is sketched below, after the random-model output):
> x.simp <- try(lm(get(x.ibes.diff1) ~ get(x.kld.diff1), ibes.kld.df.p))
> summary(x.simp); x.ibes.diff1; x.kld.diff1

Call:
lm(formula = get(x.ibes.diff1) ~ get(x.kld.diff1), data = ibes.kld.df.p)

Residuals:
    Min      1Q  Median      3Q     Max
-5.4501 -0.1501  0.0799  0.2099  4.0499

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.1199     0.0056   -21.4   <2e-16 ***
get(x.kld.diff1)   0.0297     0.0165     1.8    0.071 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.54 on 9328 degrees of freedom
  (3966 observations deleted due to missingness)
Multiple R-squared: 0.000349,	Adjusted R-squared: 0.000242
F-statistic: 3.26 on 1 and 9328 DF,  p-value: 0.0711

[1] "ibes2.delta12y.diff"
[1] "kld.delta1y_prod.diff"


For random, the two are different:
> x.re <- try(plm(get(x.ibes.diff1) ~ get(x.kld.diff1), ibes.kld.df.p, model="random"))
> summary(x.re); x.ibes.diff1; x.kld.diff1
Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = get(x.ibes.diff1) ~ get(x.kld.diff1), data = ibes.kld.df.p,
    model = "random")

Unbalanced Panel: n=2336, T=1-15, N=9330

Effects:
                var std.dev share
idiosyncratic 0.255   0.505  0.88
individual    0.036   0.190  0.12
theta  :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0639  0.1620  0.2340  0.2640  0.4060  0.4340

Residuals :
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-5.24000 -0.14300  0.06630 -0.00171  0.19700  3.79000

Coefficients :
                 Estimate Std. Error t-value Pr(>|t|)
(Intercept)      -0.11510    0.00708  -16.26   <2e-16 ***
get(x.kld.diff1)  0.02935    0.01592    1.84    0.065 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    2420
Residual Sum of Squares: 2420
F-statistic: -0.417224 on 1 and 9328 DF, p-value: 1
[1] "ibes2.delta12y.diff"
[1] "kld.delta1y_prod.diff"


- no R-squared in summary() output: I was a bit surprised to see no
R-squared reported by summary(). It appears neither in my plm()
regressions nor in the vignette, although it is present in the output
included in the AER book.
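
For now I compute an R-squared by hand, assuming the plain 1 - RSS/TSS
definition and assuming the model frame is kept in the $model component
of the plm object (I am not sure this matches whatever the AER output
reports):

## rough manual R-squared for the pooling fit (assumptions noted above)
y   <- x.pool$model[[1]]          # response column of the model frame
rss <- sum(residuals(x.pool)^2)
tss <- sum((y - mean(y))^2)
1 - rss / tss                     # for pooling this should agree with lm()'s 0.000349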


- pFtest() returns an NA p-value. Any ideas on what could cause this?
> pFtest(x.pool, x.fe)

	F test for individual effects

data:  get(x.ibes.diff1) ~ get(x.kld.diff1)
F = 1.3694, df1 = -2335, df2 = 9328, p-value = NA
alternative hypothesis: significant effects

Warning message:
In pf(q, df1, df2, lower.tail, log.p) : NaNs produced
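
The NA seems to follow directly from the negative df1 above: pf()
returns NaN whenever a degrees-of-freedom argument is negative. Perhaps
I simply have the arguments in the wrong order; the pFtest() examples I
have seen pass the within model first, so the call below may be the
intended one (just a guess on my part):

## pf() with a negative df1 gives NaN, matching the warning above
pf(1.3694, -2335, 9328, lower.tail = FALSE)   # NaN, with a "NaNs produced" warning

## possibly the intended argument order (within model first):
pFtest(x.fe, x.pool)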


Thank you
Liviu


On 2/4/10, Millo Giovanni <Giovanni_Millo at generali.com> wrote:
> Dear Liviu,
>
>  it's difficult to tell without seeing the data. I might guess that you have some completely empty groups about which tapply complains when doing the time-demeaning, but it would be just a guess.
>
>  I realize you can't share the data in the present form, but may I suggest you try and subset your data in some random way, find a "problematic" subset (one which gives the error) then change labels and everything so that the data become unrecognizable, and send us that example? You can also randomly transform them, as this is likely to be a missing values issue.
>
>  Giovanni
>


