[R] GLM: include observations with missing explanatory variables

David Winsemius dwinsemius at comcast.net
Tue Dec 29 22:03:19 CET 2015

> On Dec 29, 2015, at 5:33 AM, Frank van Berkum <frankieboytje at hotmail.com> wrote:
> Hi all,
> My problem is the following. 
> Suppose I have a dataset with observations Y and explanatory variables X1, ..., Xn, and suppose one of these explanatory variables is geographical area (of which there are ten, j=1,...,10).  For some observations I know the area, but for others it is unknown and therefore record as NA.
> I want to estimate a model of the form Y[i] ~ Poisson( lambda[i] ) with log(lambda[i]) = constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j]
> In words: we estimate a constant for all observations and a factor for each area. If it is unknown what the area is, we only include the constant. 
> When estimating this model using glm(), the records with is.na(area[i]) are 'deleted' from the dataset, and this I do not want. I had hoped that the model as described above could be estimated using the function I() (interpret as), but so far my attempts have not succeeded. 
> Any help on how to approach this is kindly appreciated.
> Kind regards,
> Frank van Berkum 		 	   		  
> 	[[alternative HTML version deleted]]

I don't understand why you don't just recode the NA's to "unknown" and redo the model. This code is untested, but I think demonstrates the two steps needed: add a level and then recode the NA values assuming this factor is named `X6`:

dat$X6 <- factor(dat$X6, levels=c( levels(dat$X6), "unknown") )
dat$X6[ is.na(dat$X6) ] <- "unknown"

(I think this might be is a bit quicker than Rolf's approach.)

In R the default contrasts are "treatment" (which is different than the contrasts you describe) and so each area will be referenced to the first area (and the first level of all other factors) in the lexical ordering of area names. This ordering and the contrast type can be changed. The are many postings on rhelp over the years demonstrating how to do this.

Study these and the basics of R before reposting and if you do so, then post in plain text and include a small example constructed in R:



David Winsemius
Alameda, CA, USA

More information about the R-help mailing list