[R] Replacing NAs with the average

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Tue Oct 19 09:32:00 CEST 2021


>>>>> Richard O'Keefe 
>>>>>     on Tue, 19 Oct 2021 14:22:53 +1300 writes:

    > It *sounds* as though you are trying to impute missing data.
    > There are better approaches than just plugging in means.
    > You might want to look into CALIBERrfimpute or missForest.

Yes, indeed!
Put even more strongly:  "Imputation" has been an
important topic for decades, and it has been shown since the
1980s that plugging in column means can be *very misleading*
for everything you do later with that modified data set.
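
To make the point concrete, here is a minimal simulation sketch (not taken
from the book or the papers mentioned in this message; the sample size, the
30% missingness rate and the seed are arbitrary assumptions). It deletes
values completely at random, fills them with the mean of the observed values,
and compares the spread of the variable and its correlation with a second
variable before and after:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)                # population correlation is 1/sqrt(2), about 0.71

y_mis <- y
y_mis[sample(n, 300)] <- NA      # delete 30% of y completely at random

y_imp <- y_mis
y_imp[is.na(y_imp)] <- mean(y_mis, na.rm = TRUE)   # mean imputation

# Spread of y: observed values only versus the mean-imputed vector
c(sd_observed = sd(y_mis, na.rm = TRUE), sd_imputed = sd(y_imp))

# Correlation with x: complete cases versus the mean-imputed vector
c(cor_observed = cor(x, y_mis, use = "complete.obs"),
  cor_imputed  = cor(x, y_imp))
```

The imputed vector necessarily has a smaller standard deviation than the
observed values, because every filled-in value sits exactly at the mean, and
the correlation with x will typically be pulled towards zero as well.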

The Wikipedia page is quite good as a short intro
  https://en.wikipedia.org/wiki/Imputation_(statistics)

When I've been teaching about this, I've strongly recommended
multiple imputation and the "state-of-the-art" package  'mice'
which comes with a really good textbook:

  Stef van Buuren (2012) -- Flexible Imputation of Missing Data 
  https://doi.org/10.1201/b11826 
  (= reference [12] in the Wikipedia article)

where in the first chapter you see a nice example of how bad
mean imputation typically is.

The JSS paper on mice is more technical (I'd say "to be used
once you are already aware that 'mean imputation' should rarely be used"):

> citation(package="mice")

To cite mice in publications use:

  Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate
  Imputation by Chained Equations in R. Journal of Statistical Software, 45(3),
  1-67. URL https://www.jstatsoft.org/v45/i03/.
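
For orientation, a minimal sketch of the usual mice workflow (impute,
analyse each completed data set, pool the results). It uses the small
'nhanes' example data that ships with mice rather than the poster's data;
the choice of m = 5, the seed and the regression model are assumptions made
purely for illustration, and the code is skipped if mice is not installed:

```r
# Sketch only: needs the 'mice' package; does nothing if it is not installed.
if (requireNamespace("mice", quietly = TRUE)) {
  data(nhanes, package = "mice")        # 25 rows: age, bmi, hyp, chl, with NAs

  # Step 1: create m = 5 imputed data sets (seed chosen arbitrarily)
  imp <- mice::mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

  # Step 2: fit the same model to each of the completed data sets
  fit <- with(imp, lm(chl ~ age + bmi))

  # Step 3: pool the five sets of estimates into one
  pooled <- mice::pool(fit)
  print(summary(pooled))

  # A single completed data set, e.g. for inspection (not for final analysis)
  completed1 <- mice::complete(imp, 1)
}
```

The pooled standard errors reflect the uncertainty about the missing values,
which is exactly what a single mean-filled data set cannot do.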


Best regards,
Martin Maechler
ETH Zurich   and  R Core team


    > On Tue, 19 Oct 2021 at 01:39, Admire Tarisirayi Chirume
    > <atchirume using gmail.com> wrote:
    >> 
    >> Good day colleagues. Below is the CSV file which I am using in my
    >> > analysis.
    >> >
    >> > household.id  hd17.perm  hd17employ  health.exp  total.food.exp  total.nfood.exp
    >> >            1          2  yes               1654           23654            23655
    >> >            2          2  yes                 NA              NA            65984
    >> >            3          6  no                2547          123311            52416
    >> >            4          8  NA                2365           13648            12544
    >> >            5          6  NA                1254           36549            12365
    >> >            6          8  yes               1236          236541            26522
    >> >            7          8  no                  NA           13264            23698
    >> >
    >> > So I created a data frame from the above CSV file as follows:
    >> >
    >> > wbpractice <- read.csv("world_practice.csv")
    >> >
    >> > Now I am doing data cleaning and trying to replace all missing values with
    >> > the averages of the respective columns.
    >> >
    >> > The dimensions of the actual dataset are:
    >> >
    >> > dim(wbpractice)
    >> [1] 31998    6
    >> 
    >> I used the following script, but when I executed it I got some error messages:
    >> 
    >> for (i in 1:ncol(wbpractice)) {
    >>   wbpractice[is.na(wbpractice[, i]), i] <- mean(wbpractice[, i],
    >>                                                 na.rm = TRUE)
    >> }
    >> 
    >> Any help to replace all NAs with average values in my dataframe?
    >>
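
For completeness: the messages from the loop above are presumably warnings
from calling mean() on the non-numeric column hd17employ, for which mean()
returns NA. If, despite the caveats above, column means are all that is
wanted, a minimal sketch is to restrict the replacement to numeric columns.
The data frame 'wb' below is a hypothetical stand-in built from the seven
rows shown in the question, not the real 31998-row file:

```r
# Toy data copied from the table in the question (7 rows only).
wb <- data.frame(
  household.id    = 1:7,
  hd17.perm       = c(2, 2, 6, 8, 6, 8, 8),
  hd17employ      = c("yes", "yes", "no", NA, NA, "yes", "no"),
  health.exp      = c(1654, NA, 2547, 2365, 1254, 1236, NA),
  total.food.exp  = c(23654, NA, 123311, 13648, 36549, 236541, 13264),
  total.nfood.exp = c(23655, 65984, 52416, 12544, 12365, 26522, 23698)
)

# Only numeric columns have a meaningful mean; skip character/factor columns.
num <- vapply(wb, is.numeric, logical(1))
for (j in names(wb)[num]) {
  miss <- is.na(wb[[j]])
  wb[[j]][miss] <- mean(wb[[j]], na.rm = TRUE)
}

wb
```

The NAs in hd17employ are left untouched, since an average of "yes" and "no"
is not defined; how to treat them is a separate modelling decision.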


