[R] Adding NA values in random positions in a dataframe

Bert Gunter gunter.berton at gene.com
Fri Nov 29 22:00:04 CET 2013


An essentially identical approach that may be a tad clearer -- but
requires additional space -- first creates a logical vector for the
locations of the NA's in the unlisted data.frame. Further NA positions
are randomly added and then the augmented vector is used as a logical
matrix to index where the NA's should go in the data frame:

df <- data.frame(a = c(1:3, NA, 4:6),
                 b = c(letters[1:6], NA),
                 c = c(1, NA, runif(5)))

nr <- nrow(df); nc <- ncol(df)
p <- .3 ## desired total proportion of NA's

ina <- is.na(unlist(df)) ## logical vector, TRUE corresponds to NA positions
n2 <- floor(p*nr*nc) - sum(ina)  ## number of new NA's

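## sample only from positions that are not already NA
## (ina contains no NA's itself, so !is.na(ina) would select every position)
ina[sample(which(!ina), n2)] <- TRUE
df[matrix(ina, nrow = nr, ncol = nc)] <- NA ## using matrix indexing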

df
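
For reuse, the steps above could be wrapped in a small function. This is only a sketch under a few assumptions of mine, not part of the code above: the name add_random_na is hypothetical, the function returns the data frame unchanged when the target proportion is already met, and it indexes through sample.int() because sample(x, n) treats a length-one x as 1:x. It also builds the NA mask column by column with lapply() rather than unlist(df), which avoids coercing mixed column types.

```r
## Hypothetical helper; the name and the guards are illustrative assumptions.
add_random_na <- function(df, p) {
  nr <- nrow(df); nc <- ncol(df)
  ## existing NA positions, column by column (same order as a matrix fill)
  ina <- unlist(lapply(df, is.na), use.names = FALSE)
  n2 <- floor(p * nr * nc) - sum(ina)   ## number of new NA's needed
  if (n2 <= 0) return(df)               ## already at or above the target
  free <- which(!ina)                   ## positions that are not yet NA
  ## sample.int() on the index avoids the sample(x) surprise when length(free) == 1
  ina[free[sample.int(length(free), n2)]] <- TRUE
  df[matrix(ina, nrow = nr, ncol = nc)] <- NA
  df
}

set.seed(1)
df <- data.frame(a = c(1:3, NA, 4:6),
                 b = c(letters[1:6], NA),
                 c = c(1, NA, runif(5)))
out <- add_random_na(df, p = 0.3)
sum(is.na(out))   ## floor(0.3 * 21) = 6 NA's in total, 3 of them new
```

The original NA positions are never touched, because new positions are drawn only from `free`.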

Cheers,
Bert

On Fri, Nov 29, 2013 at 10:09 AM, arun <smartpink111 at yahoo.com> wrote:
> Hi,
> I used that because 10% of the values in the data were already NA.
>
>
> You are right.  Sorry, ?match() is unnecessary.  I was trying another solution with match() which didn't work out and forgot to check whether it was adequate or not.
> set.seed(49)
> dat1[!is.na(dat1)][sample(seq(dat1[!is.na(dat1)]),length(dat1[!is.na(dat1)])*(0.20))] <- NA
> A.K.
>
>
> Thanks for the reply. I don't understand the 0.20 multiplied by the number of non-NA values; where did you take it from?
>
> Furthermore, why do we have to use the function match? Wouldn't it be enough to use the sample function?
>
>
> On Thursday, November 28, 2013 12:57 PM, arun <smartpink111 at yahoo.com> wrote:
> Hi,
> One way would be:
>  set.seed(42)
>  dat1 <- as.data.frame(matrix(sample(c(1:5,NA),50,replace=TRUE,prob=c(10,15,15,20,30,10)),ncol=5))
> set.seed(49)
>  dat1[!is.na(dat1)][ match( sample(seq(dat1[!is.na(dat1)]),length(dat1[!is.na(dat1)])*(0.20)),seq(dat1[!is.na(dat1)]))] <- NA
> length(dat1[is.na(dat1)])/length(unlist(dat1))
> #[1] 0.28
>
> A.K.
>
>
> Hello, I'm quite new at R so I don't know which is the most efficient
> way to execute a function that I could write easily in other languages.
>
> This is my problem: I have a data frame with a certain number of
> NAs (approximately 10%). I want to add further NA values at random
> positions in the data frame until the overall proportion of NA
> values reaches 30% (the positions that are already NA must not
> change). I looked at iterating functions in R such as apply and sapply,
> but I can't figure out how to use them in this case. Thank you.



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

(650) 467-7374


