# [R] replace NA values with the mean of the column which contains them

John Fox jfox at mcmaster.ca
Mon Jul 29 19:29:09 CEST 2013

```
Dear iza.ch1,

I hesitate to say this, because mean imputation is such a bad idea, but it's easy to do what you want with a loop, rather than puzzling over a "cleverer" way to accomplish the task. Here's an example using the Freedman data set in the car package:

> colSums(is.na(Freedman))
population   nonwhite    density      crime
        10          0         10          0

> means <- colMeans(Freedman, na.rm=TRUE)

> for (j in 1:ncol(Freedman)){
+     Freedman[is.na(Freedman[, j]), j] <- means[j]
+ }

> colSums(is.na(Freedman))
population   nonwhite    density      crime
         0          0          0          0

> colMeans(Freedman)
population   nonwhite    density      crime
1135.99000   10.80273  765.67000 2714.08182

> means
population   nonwhite    density      crime
1135.99000   10.80273  765.67000 2714.08182
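
# A minimal sketch, not part of the session above, of why the one-liner in
# the quoted message misfires, and of a loop-free alternative. The matrix `m`
# and the names `col_means` and `idx` are made up for illustration.
m <- matrix(c(NA, NA, 1, 3,
               2,  4, 6, 8,
              NA,  1, 2, 3), nrow = 4)      # filled column by column
col_means <- colMeans(m, na.rm = TRUE)       # one mean per column

# which(is.na(m)) returns one linear index per NA, so assigning the vector
# of column means to those positions hands out the means to the NAs in
# order (recycling as needed), regardless of which column each NA sits in.

# With arr.ind = TRUE, which() also reports each NA's column, so every NA
# can pick up the mean of its own column.
idx <- which(is.na(m), arr.ind = TRUE)       # columns named "row" and "col"
m[idx] <- col_means[idx[, "col"]]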

Now you should probably think about whether you really want to do this...

Best,
John

On Mon, 29 Jul 2013 18:39:48 +0200
"iza.ch1" <iza.ch1 at op.pl> wrote:
> Hi everyone
>
> I have a problem replacing NA values with the mean of the column that contains them. If I replace each NA with the mean of the remaining values in its column, the mean of the whole column stays the same as if I had omitted the NA values. I have the following data:
>
> de
>              [,1]        [,2]       [,3]
>  [1,]          NA -0.26928087 -0.1192078
>  [2,]          NA  1.20925752  0.9325334
>  [3,]          NA  0.38012008 -1.8927164
>  [4,]          NA -0.41778861  1.4330507
>  [5,]          NA -0.49677462  0.2892706
>  [6,]          NA -0.13248754  1.3976522
>  [7,]          NA -0.54179054  0.2295291
>  [8,]          NA  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
>
> and I wrote the code
> de[which(is.na(de))]<-sapply(seq_len(ncol(de)),function(i) {mean(de[,i],na.rm=TRUE)})
>
> I get as the result
>              [,1]        [,2]       [,3]
>  [1,] -0.50575168 -0.26928087 -0.1192078
>  [2,] -0.12222376  1.20925752  0.9325334
>  [3,] -0.13412312  0.38012008 -1.8927164
>  [4,] -0.50575168 -0.41778861  1.4330507
>  [5,] -0.12222376 -0.49677462  0.2892706
>  [6,] -0.13412312 -0.13248754  1.3976522
>  [7,] -0.50575168 -0.54179054  0.2295291
>  [8,] -0.12222376  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
>
> It has replaced the first NA in the first column with the mean of the first column (-0.505...), the second NA with the mean of the second column, and so on.
> I want the result to look like this:
>              [,1]        [,2]       [,3]
>  [1,] -0.50575168 -0.26928087 -0.1192078
>  [2,] -0.50575168  1.20925752  0.9325334
>  [3,] -0.50575168  0.38012008 -1.8927164
>  [4,] -0.50575168 -0.41778861  1.4330507
>  [5,] -0.50575168 -0.49677462  0.2892706
>  [6,] -0.50575168 -0.13248754  1.3976522
>  [7,] -0.50575168 -0.54179054  0.2295291
>  [8,] -0.50575168  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
>