[R] Imputing missing values

Jan Smit janpsmit at yahoo.co.uk
Thu Sep 2 13:13:23 CEST 2004


Many thanks to Dimitris Rizopoulos, Mahbub Latif,
Manoj, and Frank Harrell for their suggestions and
comments. Dimitris' code gave me what I wanted. 
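
For the archive, here is a minimal sketch of the same cell-mean
substitution done with base R's ave(), which returns its result in the
original row order and so needs no sorting beforehand. This is not the
code Dimitris posted; the helper name impute_mean is mine and purely
illustrative, and the data are the toy values from the thread with the
crop spelled "Rice".

```r
# Toy data from the thread: nine rows, three missing prices
dat <- data.frame(
  Price  = c(10, 12, NA, 8, 7, 9, NA, 9, NA),
  Crop   = c(rep("Rice", 5), rep("Wheat", 4)),
  Season = c(rep("Summer", 3), rep("Winter", 4), rep("Summer", 2))
)

# Hypothetical helper (illustrative name): replace the NAs in x
# by the mean of the non-missing values of x
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each Crop x Season cell and
# puts the results back in the original row order
dat$Price.imp <- ave(dat$Price, dat$Crop, dat$Season, FUN = impute_mean)
dat
```

A cell in which every Price is NA has no mean to borrow, so its imputed
values come out as NaN and would need separate handling.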

The data pertain to an impact evaluation of 32
irrigation projects. For each project, there is little
variability in the price farmers receive for each of
their crops within the same season, so I think
mean-imputation is reasonably safe. I have downloaded
Hmisc, though, and will have a close look.
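
The caution about extra variability can be checked on a single cell of
the toy data. The sketch below assumes the three Rice/Summer prices
from the example (10, 12, NA) and only compares sample variances
before and after the substitution; it is an illustration, not a full
treatment of the problem.

```r
# Rice/Summer prices from the toy example: two observed, one missing
x <- c(10, 12, NA)

# Replace the NA by the mean of the observed values
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

v_obs <- var(x, na.rm = TRUE)  # variance of the two observed prices
v_imp <- var(x_imp)            # variance after mean imputation

c(observed = v_obs, imputed = v_imp)
```

The imputed value sits exactly at the cell mean, so the sum of squared
deviations is unchanged while the number of observations grows, and the
variance estimate within the cell shrinks.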

Jan Smit

--- Frank E Harrell Jr <f.harrell at vanderbilt.edu> wrote:
> Dimitris Rizopoulos wrote:
> > Hi Jan,
> > 
> > you could try the following:
> > 
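> > dat <- data.frame(Price=c(10,12,NA,8,7,9,NA,9,NA),
> >                   Crop=c(rep("Rice", 5), rep("Wheat", 4)),
> >                   Season=c(rep("Summer", 3), rep("Winter", 4),
> >                            rep("Summer", 2)))
> > ######
> > # sort first: unlist(tapply()) returns the values cell by cell
> > # (Crop within Season), so the rows must be in that order
> > dat <- dat[order(dat$Season, dat$Crop),]
> > dat$Price.imp <- unlist(tapply(dat$Price,
> >                                list(dat$Crop, dat$Season),
> >                                function(x){
> >   mx <- mean(x, na.rm=TRUE)   # cell mean of the non-missing prices
> >   ifelse(is.na(x), mx, x)     # replace NAs by the cell mean
> >   }))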
> > 
> > dat
> > 
> > However, you should be careful using this imputation technique,
> > since you don't take into account the extra variability of imputing
> > new values in your data set. I don't know what analysis you are
> > planning to do, but in any case I would recommend reading some
> > standard references on missing values, e.g., Little, R. and
> > Rubin, D. (2002). Statistical Analysis with Missing Data, New York:
> > Wiley.
> > 
> > I hope this helps.
> > 
> > Best,
> > Dimitris
> > 
> > ----
> > Dimitris Rizopoulos
> > Doctoral Student
> > Biostatistical Centre
> > School of Public Health
> > Catholic University of Leuven
> > 
> > Address: Kapucijnenvoer 35, Leuven, Belgium
> > Tel: +32/16/396887
> > Fax: +32/16/337015
> > Web: http://www.med.kuleuven.ac.be/biostat/
> >      http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> > 
> > 
> > ----- Original Message ----- 
> > From: "Jan Smit" <janpsmit at yahoo.co.uk>
> > To: <R-help at stat.math.ethz.ch>
> > Sent: Wednesday, September 01, 2004 10:43 AM
> > Subject: [R] Imputing missing values
> > 
> > 
> > 
> >>Dear all,
> >>
> >>Apologies for this beginner's question. I have a
> >>variable Price, which is associated with factors
> >>Season and Crop, each of which has several levels.
> >>The Price variable contains missing values (NA), which
> >>I want to substitute by the mean of the remaining
> >>(non-NA) Price values of the same Season-Crop
> >>combination of levels.
> >>
> >>Price     Crop    Season
> >>10        Rice    Summer
> >>12        Rice    Summer
> >>NA        Rice    Summer
> >>8         Rice    Winter
> >>9         Wheat   Summer
> >>
> >>Price[is.na(Price)] gives me the missing values, and
> >>by(Price, list(Crop, Season), mean, na.rm = TRUE) the
> >>values I want to impute. What I've not been able to
> >>figure out, by looking at by and the various
> >>incarnations of apply, is how to do the actual
> >>substitution.
> >>
> >>Any help would be much appreciated.
> >>
> >>Jan Smit
> 
> Or see the impute function in the Hmisc package and more general
> solutions also in Hmisc.
> 
> 
> -- 
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>



