[R] Randomly drop a percent of data from a data.frame

arun smartpink111 at yahoo.com
Sat Aug 17 00:32:05 CEST 2013



Hi,
Suppose the dataset had an odd number of columns:
set.seed(6458)
data2 <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5))
n <- prod(dim(data2))
n
#[1] 15
# n/2 is 7.5 here, so rep() and `:` both truncate it to 7
dummy <- rep(FALSE,n/2)
dummy[sample(1:(n/2),n*.2)] <- TRUE
dummy
#[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

data2[,c("x2", "x3")][matrix(dummy, nc = 2)]  <- NA
#Error in `[<-.data.frame`(`*tmp*`, matrix(dummy, nc = 2), value = NA) : 
 # unsupported matrix index in replacement
#In addition: Warning message:
#In matrix(dummy, nc = 2) :
 # data length [7] is not a sub-multiple or multiple of the number of rows [4]
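
The failure can be traced to the lengths involved. A minimal sketch in base R (the names len_dummy, idx and dim_idx are only for illustration) suggests that the logical vector ends up with 7 elements, which cannot fill the 5 x 2 block of x2 and x3:

```r
# Sketch: why the index matrix does not match the 5 x 2 target block
n <- 5 * 3                       # cells in a 5-row, 3-column data frame
dummy <- rep(FALSE, n / 2)       # n/2 is 7.5; rep() truncates it to 7
len_dummy <- length(dummy)
# matrix() builds 4 rows here (with a warning), not the 5 rows needed
idx <- suppressWarnings(matrix(dummy, ncol = 2))
dim_idx <- dim(idx)
```

Since the index matrix does not have the same shape as the two selected columns, the replacement is rejected.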

I might do:
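n1 <- 2*nrow(data2)  # number of cells in the 2 target columns (10)
dummy <- rep(FALSE,n1)
# note: n1*.2 is 20% of the x2/x3 cells (2 of 10), not 20% of all 15 cells
dummy[sample(1:n1,n1*.2)] <- TRUE
data2[,c("x2","x3")][matrix(dummy,nc=2)] <- NA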
data2
#           x1         x2         x3
#1 -0.55899744  0.6622481 -0.3305958
#2  0.12776368         NA         NA
#3 -1.09734838  0.2069539 -0.6997853
#4  0.75919499 -0.5683809  0.4752002
#5 -0.03063141 -0.7549605  2.6038635
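
If the goal is still 20% of all cells (3 of 15 here), taken only from the chosen columns, one option is to compute the count from the whole data frame and sample positions within the target block. This is only a sketch: drop_cells is a made-up helper name (not from any package), and rounding the count with round() is an assumption.

```r
# Hypothetical helper: set `frac` of ALL cells of `df` to NA,
# choosing the cells only from the columns named in `cols`.
drop_cells <- function(df, cols, frac) {
  n_target <- nrow(df) * length(cols)        # cells available for dropping
  k <- round(frac * prod(dim(df)))           # cells to drop (assumed rounding)
  if (k > n_target) stop("not enough cells in 'cols' to drop that many")
  idx <- matrix(FALSE, nrow = nrow(df), ncol = length(cols))
  idx[sample.int(n_target, k)] <- TRUE       # k random positions in the block
  block <- df[, cols, drop = FALSE]
  block[idx] <- NA                           # logical-matrix replacement
  df[, cols] <- block
  df
}

set.seed(1)
d <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))
d_na <- drop_cells(d, c("x2", "x3"), 0.2)
```

With the 5 x 3 example this should set 3 cells in x2 and x3 to NA and leave x1 untouched; sum(is.na(d_na)) gives a quick check of the count.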


A.K.
________________________________
From: Richard Kwock <richardkwock at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Cc: Christopher Desjardins <cddesjardins at gmail.com>; R help <r-help at r-project.org> 
Sent: Friday, August 16, 2013 5:55 PM
Subject: Re: [R] Randomly drop a percent of data from a data.frame



Try this:

data <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
data <- round(data,digits=3)

#get the total counts
n <- prod(dim(data))

#set up a logical vector with one entry per cell of x3 and x4
dummy <- rep(FALSE, n/2)
dummy[sample(1:(n/2), n*.2)] <- TRUE

# 5x2 dummy matrix with T and F
matrix(dummy, nc = 2)


#subset the T indices in x3 and x4 and replace with NAs
data[,c("x3", "x4")][matrix(dummy, nc = 2)]  <- NA

data

#      x1     x2     x3     x4
#1 -1.310  0.659     NA  0.510
#2 -3.003 -0.004     NA     NA
#3  0.584  0.310     NA -0.087
#4  1.644 -2.792 -0.390 -0.382
#5 -1.791  0.840  1.137  0.820

Richard



On Fri, Aug 16, 2013 at 2:34 PM, arun <smartpink111 at yahoo.com> wrote:

>Hi,
>Maybe this helps:
>#data1 (changed `data` to `data1`)
>set.seed(6245)
> data1 <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
> data1<- round(data1,digits=3)
>
>data2<- data1
>
>data1[,3:4] <- lapply(data1[,3:4], function(x) {
>  x1 <- match(x, sample(unlist(data1[,3:4]), round(0.8*length(unlist(data1[,3:4])))))
>  x[is.na(x1)] <- NA
>  x
>})
> data1
>#      x1     x2     x3     x4
>#1  0.482  1.320     NA -0.142
>#2 -0.753 -0.041 -0.063  0.886
>#3  0.028 -0.256 -0.069  0.354
>#4 -0.086  0.475  0.244  0.781
>#5  0.690 -0.181  1.274  1.633
>
>
>#or
>data2[,3:4] <- lapply(data2[,3:4], function(x) {
>  x1 <- match(x, sample(unlist(data2[,3:4]), round(0.8*length(unlist(data2[,3:4])))))
>  x[is.na(x1)] <- NA
>  x
>})
> data2
>#      x1     x2     x3     x4
>#1  0.482  1.320 -0.859 -0.142
>#2 -0.753 -0.041     NA     NA
>#3  0.028 -0.256 -0.069  0.354
>#4 -0.086  0.475  0.244  0.781
>#5  0.690 -0.181  1.274  1.633
>A.K.
>
>
>
>
>----- Original Message -----
>From: Christopher Desjardins <cddesjardins at gmail.com>
>To: "r-help at r-project.org" <r-help at r-project.org>
>Cc:
>Sent: Friday, August 16, 2013 3:02 PM
>Subject: [R] Randomly drop a percent of data from a data.frame
>
>Hi,
>I have the following data.
>
>> set.seed(6245)
>> data <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
>> round(data,digits=3)
>      x1     x2     x3     x4
>1  0.482  1.320 -0.859 -0.142
>2 -0.753 -0.041 -0.063  0.886
>3  0.028 -0.256 -0.069  0.354
>4 -0.086  0.475  0.244  0.781
>5  0.690 -0.181  1.274  1.633
>
>What I would like to do is drop 20% of the data, but I want this 20% to
>come only from x3 and x4. It doesn't have to be split evenly,
>i.e. I don't need to drop 2 from x3 and 2 from x4, or to make sure that an
>observation has missing data on only one variable. I just want to drop 20%
>of the data through x3 and x4 only.  In other words,
>
>      x1     x2     x3     x4
>1  0.482  1.320 -0.859     NA
>2 -0.753 -0.041 -0.063  0.886
>3  0.028 -0.256     NA  0.354
>4 -0.086  0.475     NA  0.781
>5  0.690 -0.181     NA  1.633
>
>OR
>
>      x1     x2     x3     x4
>1  0.482  1.320     NA -0.142
>2 -0.753 -0.041 -0.063  0.886
>3  0.028 -0.256     NA     NA
>4 -0.086  0.475  0.244     NA
>5  0.690 -0.181  1.274  1.633
>
>OR
>
>      x1     x2     x3     x4
>1  0.482  1.320 -0.859 -0.142
>2 -0.753 -0.041 -0.063     NA
>3  0.028 -0.256 -0.069     NA
>4 -0.086  0.475  0.244     NA
>5  0.690 -0.181  1.274     NA
>
>ETC. are all fine.
>
>Any ideas how I can do this?
>Chris
>
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
>


