[R] Imputing missing values

Frank E Harrell Jr f.harrell at vanderbilt.edu
Wed Sep 1 14:10:54 CEST 2004


Dimitris Rizopoulos wrote:
> Hi Jan,
> 
> you could try the following:
> 
> dat <- data.frame(Price=c(10,12,NA,8,7,9,NA,9,NA),
>                   Crop=c(rep("Rice", 5), rep("Wheat", 4)),
>                   Season=c(rep("Summer", 3), rep("Winter", 4),
>                            rep("Summer", 2)))
> ######
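> # sort so the rows match the cell order in which unlist() flattens the
> # tapply() result (Crop varies fastest within Season)
> dat <- dat[order(dat$Season, dat$Crop),]
> dat$Price.imp <- unlist(tapply(dat$Price, list(dat$Crop, dat$Season),
>                                function(x){
>   mx <- mean(x, na.rm=TRUE)   # mean of the observed prices in this cell
>   ifelse(is.na(x), mx, x)     # replace only the missing values
>   }))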
> 
> dat
> 
> However, you should be careful with this imputation technique, since
> it does not account for the extra variability introduced by imputing
> values in your data set. I don't know what analysis you are planning,
> but in any case I would recommend reading some standard references
> on missing values, e.g., Little, R. and Rubin, D. (2002).
> Statistical Analysis with Missing Data, New York: Wiley.
> 
> I hope this helps.
> 
> Best,
> Dimitris
> 
> ----
> Dimitris Rizopoulos
> Doctoral Student
> Biostatistical Centre
> School of Public Health
> Catholic University of Leuven
> 
> Address: Kapucijnenvoer 35, Leuven, Belgium
> Tel: +32/16/396887
> Fax: +32/16/337015
> Web: http://www.med.kuleuven.ac.be/biostat/
>      http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> 
> 
> ----- Original Message ----- 
> From: "Jan Smit" <janpsmit at yahoo.co.uk>
> To: <R-help at stat.math.ethz.ch>
> Sent: Wednesday, September 01, 2004 10:43 AM
> Subject: [R] Imputing missing values
> 
> 
> 
>>Dear all,
>>
>>Apologies for this beginner's question. I have a
>>variable Price, which is associated with factors
>>Season and Crop, each of which has several levels.
>>The Price variable contains missing values (NA), which
>>I want to replace with the mean of the remaining
>>(non-NA) Price values of the same Season-Crop
>>combination of levels.
>>
>>Price     Crop      Season
>>10        Rice      Summer
>>12        Rice      Summer
>>NA        Rice      Summer
>>8         Rice      Winter
>>9         Wheat     Summer
>>
>>Price[is.na(Price)] gives me the missing values, and
>>by(Price, list(Crop, Season), mean, na.rm = T) the
>>values I want to impute. What I've not been able to
>>figure out, by looking at by and the various
>>incarnations of apply, is how to do the actual
>>substitution.
>>
>>Any help would be much appreciated.
>>
>>Jan Smit

Or see the impute function in the Hmisc package, which also provides
more general solutions.
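
For the substitution step itself, a minimal base R sketch (not using
Hmisc) might look like the following. It assumes the example data from
the quoted reply, with the crop spelled "Rice" as in the original
question. The helper name fill_with_mean is made up for illustration.
ave() applies the function within each Crop-Season cell and returns the
values in the original row order, so no sorting is needed.

```r
# Example data from the quoted reply
dat <- data.frame(
  Price  = c(10, 12, NA, 8, 7, 9, NA, 9, NA),
  Crop   = c(rep("Rice", 5), rep("Wheat", 4)),
  Season = c(rep("Summer", 3), rep("Winter", 4), rep("Summer", 2))
)

# Hypothetical helper: replace NAs in x by the mean of the
# non-missing values of x
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper separately within each Crop-Season combination;
# the result is aligned with the rows of dat as they are
dat$Price.imp <- ave(dat$Price, dat$Crop, dat$Season, FUN = fill_with_mean)
dat
```

A cell whose prices are all missing has no observed mean, so its values
stay missing (as NaN). The caveat about ignoring the extra variability
of imputed values applies here just as it does to the tapply version.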


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



