[R] plm(..., model="within", effect="twoways") is very slow on unablanaced data (was: Re: Regressions with fixed-effect in R)

Mon May 17 17:16:25 CEST 2010

Hello Giovanni
I made a minor modification to your function, which now allows to
compute the within R-sq in Twoways Within models (see below).

However I ran into an issue that I have already encountered before:
whenever I try to fit Twoways Within models on my unbalanced data, the
process is strangely slow and I usually terminate it either after
~15min or when my CPU hits 100C. This is similar to what I mentioned
on the list some time ago [1].
[1] http://www.mail-archive.com/r-help@r-project.org/msg89421.html

Unfortunately I get the same slow behaviour when using
pmodel.response(x, model="within", effect="twoways") to compute teh
within R-sq. Below is an example that approximates the structure of my
data.

### define fun to compute the within R-sq
pmodel.response<-plm:::pmodel.response.plm
plmr2 <-
    function(x, adj=TRUE, effect="individual")
{
 ## fetch response and residuals
 y <- pmodel.response(x, model="within", effect=effect)
 myres <- resid(x)
 n <- length(myres)

 if(adj) {
   adjssr <- x$df.residual
   adjtss <- n-1
   } else {
   adjssr <- 1
   adjtss <- 1
   }

 ssr <- sum(myres^2)/adjssr
 tss <- sum(y^2)/adjtss
 return(1-ssr/tss)
}

## prepare the data
set.seed(1)
n <- 2000
T <- 15
x=rnorm(T*n)
x1=runif(T*n)
fe=rep(rnorm(1*n),each=T)
id=rep(1:(1*n),each=T)
ti=rep(1:T,1*n)
e=rnorm(T*n)
y=x+x1+fe+e
subs <- as.logical(rbinom(T*n, 1, .6)); sum(subs)

data <- data.frame(y,x,x1,id,ti)
dim(data)
#[1] 30000     5

pdata <- plm.data(data[subs , ], index = c("id", "ti"))
dim(pdata)
#[1] 18018     5

paste("n=", length(unique(pdata$id))); paste("T=",
length(unique(pdata$ti))); paste("N=", length((pdata$ti)))
#[1] "n= 2000"
#[1] "T= 15"
#[1] "N= 18018"

## individual within model: no issues, 5sec affair
pd_fe1 <- plm(y ~ x + x1, data = pdata,
   model = "within", effect="individual")
summary(pd_fe1)
plmr2(pd_fe1, F, "individual")

## twoways within model: never finishes in reasonable time (prepare to
terminate the process)
pd_fe1 <- plm(y ~ x + x1, data = pdata,
   model = "within", effect="twoways")
summary(pd_fe1)
plmr2(pd_fe1, F, "twoways")  ##neither this on pd_fe1 fitted previously

## individual within with manual time effs: ~10sec
pd_fe2 <- plm(y ~ x + x1 + ti, data = pdata,
   model = "within", effect="individual")
summary(pd_fe2)
plmr2(pd_fe2, F, "individual")
#[1] 0.52888

## extract within R-sq from the above fails (prepare to terminate the process)
plmr2(pd_fe2, F, "twoways")

I would have went ahead and computed the within R-sq manually, from
pd_fe2, however the below computes the "individual" within R-sq
RSS2 <-  sum(residuals(pd_fe2)^2); RSS2
TSS2 <- plm:::tss.plm(pd_fe2); TSS2
1 - RSS2/TSS2
#[1] 0.52888

while the below yields zero
SSR2 <-  sum(fitted(pd_fe2)^2); SSR2
SSR2/TSS2
#[1] 0

because
fitted(pd_fe2)
#NULL

I could also report that plm(..., model="within", effect="twoways") is
quick when the number of individuals is kept low; re-run the same
example with n <- 200.

Any ideas on how to obtain the within R-sq from a Twoways Within fit
on unbalanced panel data with n>2000? Thank you
Liviu

On 5/12/10, Millo Giovanni <Giovanni_Millo at generali.com> wrote:
> Dear Liviu,
>
>  we're still working on measures of fit for panels. If I get you right,
>  what you mean is the R^2 of the demeaned, or "within", regression. A
>  quick and dirty function to do this is:
>
>  pmodel.response<-plm:::pmodel.response.plm # needs this to make the
>  method accessible
>
>  r2<-function(x, adj=TRUE) {
>   ## fetch response and residuals
>   y <- pmodel.response(x, model="within")
>   myres <- resid(x)
>   n <- length(myres)
>
>   if(adj) {
>     adjssr <- x$df.residual
>     adjtss <- n-1
>     } else {
>     adjssr <- 1
>     adjtss <- 1
>     }
>
>   ssr <- sum(myres^2)/adjssr
>   tss <- sum(y^2)/adjtss
>   return(1-ssr/tss)
>   }
>
>  and then
>
>  > r2(yourmodel)
>
>  Hope this helps
>  Giovanni
>
>  ------------------------------