[R] replacing missing values in a dataframe with reference values.

Jim Regetz regetz at nceas.ucsb.edu
Fri Aug 22 02:29:16 CEST 2008


Gary Collins wrote:
> I'd be most grateful for any thoughts on the following. I'm sure there is 
> an easy and quick way to do this, but I'm having a mental block this 
> evening. Essentially, I'm trying to replace missing data in my dataset 
> with reference values based on age and sex.
> 
> So an example dataset is
> set.seed(1)
> X = data.frame(age=rnorm(10, 50, 10), sex=rbinom(10, 1, 0.5), 
> A=rnorm(10), B=rnorm(10))
> 
> X$agegroup = cut(X$age, breaks=c(20, 30, 40, 50, 60, 70, 80), labels = 
> c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79"))
> 
> # make some missing
> X$A[c(1, 5, 6)]=NA
> X$B[c(2, 3, 8)]=NA
> 
> and my reference dataset is
> refval = data.frame(agegroup = rep(c("20-29", "30-39", "40-49", "50-59", 
> "60-69", "70-79"), 2), sex = c(rep(0,6), rep(1,6)),A = c(seq(0, 1, 
> length=6), seq(4,5,length=6)), B = c(seq(1, 2, length=6), 
> seq(3,4,length=6)))

Especially if you're certain every combination of agegroup and sex in X 
has a matching entry in the refval data frame, I would just create a 
concatenated key from your two reference variables, and then use that to 
do the lookups using rownames:

rownames(refval) <- paste(refval$sex, refval$agegroup, sep=".")
X.key <- paste(X$sex, X$agegroup, sep=".")

X$A[is.na(X$A)] <- refval[X.key[is.na(X$A)], "A"]
X$B[is.na(X$B)] <- refval[X.key[is.na(X$B)], "B"]
and so on for each remaining column that needs filling.
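
With 5 or 6 columns to fill, the per-column lines above can be folded into a loop. The following is only a sketch under some assumptions: it rebuilds the example data from the original post so it runs on its own, it uses match() on the pasted keys rather than rownames (a swapped-in variant of the same lookup idea), and the names groups, idx, vars and miss are made up for illustration. The stopifnot() call covers the case where some sex/agegroup combination in X has no row in refval.

```r
# Rebuild the example data from the original post
set.seed(1)
X <- data.frame(age = rnorm(10, 50, 10), sex = rbinom(10, 1, 0.5),
                A = rnorm(10), B = rnorm(10))
groups <- c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79")
X$agegroup <- cut(X$age, breaks = seq(20, 80, by = 10), labels = groups)
X$A[c(1, 5, 6)] <- NA
X$B[c(2, 3, 8)] <- NA

refval <- data.frame(agegroup = rep(groups, 2),
                     sex = rep(c(0, 1), each = 6),
                     A = c(seq(0, 1, length.out = 6), seq(4, 5, length.out = 6)),
                     B = c(seq(1, 2, length.out = 6), seq(3, 4, length.out = 6)))

# For each row of X, the index of the matching row in refval
# (NA where the sex/agegroup combination has no reference row)
idx <- match(paste(X$sex, X$agegroup, sep = "."),
             paste(refval$sex, refval$agegroup, sep = "."))
stopifnot(!anyNA(idx))  # fail early instead of silently leaving NAs

# Hypothetical list of columns to fill; extend as needed
vars <- c("A", "B")
for (v in vars) {
  miss <- is.na(X[[v]])
  X[[v]][miss] <- refval[[v]][idx[miss]]
}
```

The key is computed once and reused for every column, so the only per-column work is one is.na() and one vectorized assignment, which should matter on a data frame with a million rows.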

HTH,
Jim

> My ugly "solution" is
> 
> for(i in 1:nrow(X)){
>     if(is.na(X$A[i])){
>         X$A[i] = refval$A[refval$sex == X$sex[i] & refval$agegroup == 
> X$agegroup[i]]
>     }
>     if(is.na(X$B[i])){
>         X$B[i] = refval$B[refval$sex == X$sex[i] & refval$agegroup == 
> X$agegroup[i]]
>     }
> }
> 
> Not only is this ugly, it is also slow: my actual dataset has 
> over 1 million rows with 5 or 6 variables to check and replace missing 
> values.
> 
> Thanks in advance for any help.
> 
> Gary
> ------------------------------------
> Dr Gary S Collins
> Medical Statistician
> Centre for Statistics in Medicine
> Wolfson College Annexe
> University of Oxford
> Linton Road
> Oxford, OX2 6UD
> 
> Tel: +44 (0)1865 284418
> Fax: +44 (0)1865 284424
> www.csm-oxford.org.uk


