[R] replacing missing values in a dataframe with reference values.

Gary Collins collins.gs at gmail.com
Thu Aug 21 21:35:15 CEST 2008


Any thoughts on the following I'd be most grateful - I'm sure there is 
an easy and quick way to do this but I'm having a mental block this 
evening. Essentially, I'm trying to replace missing data in my dataset 
with reference values based on age and sex.

So an example dataset is
set.seed(1)
X = data.frame(age=rnorm(10, 50, 10), sex=rbinom(10, 1, 0.5), 
A=rnorm(10), B=rnorm(10))

X$agegroup = cut(X$age, breaks=c(20, 30, 40, 50, 60, 70, 80), labels = 
c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79"))

# make some missing
X$A[c(1, 5, 6)]=NA
X$B[c(2, 3, 8)]=NA

and my reference dataset is
refval = data.frame(agegroup = rep(c("20-29", "30-39", "40-49", "50-59", 
"60-69", "70-79"), 2), sex = c(rep(0,6), rep(1,6)),A = c(seq(0, 1, 
length=6), seq(4,5,length=6)), B = c(seq(1, 2, length=6), 
seq(3,4,length=6)))

My ugly "solution" is

for(i in 1:nrow(X)){
	if(is.na(X$A[i])){
		X$A[i] = refval$A[refval$sex == X$sex[i] & refval$agegroup == 
X$agegroup[i]]
	}
	if(is.na(X$B[i])){
		X$B[i] = refval$B[refval$sex == X$sex[i] & refval$agegroup == 
X$agegroup[i]]
	}
}

Which not only is this ugly, but is slow, in that my actual dataset is 
over 1 million rows with 5 or 6 variables to check and replace missing 
values.

Thanks in advance for any help.

Gary
------------------------------------
Dr Gary S Collins
Medical Statistician
Centre for Statistics in Medicine
Wolfson College Annexe
University of Oxford
Linton Road
Oxford, OX2 6UD

Tel: +44 (0)1865 284418
Fax: +44 (0)1865 284424
www.csm-oxford.org.uk



More information about the R-help mailing list