[R] Subsampling-oversampling from a data frame

B77S bps0002 at auburn.edu
Wed Nov 2 05:53:07 CET 2011

# Perhaps I misunderstand your original need, but....

## I added a few lines to your data and used dput() to get the data below
## (I named it "df")

df<- structure(list(age = c(15L, 20L, 15L, 10L, 10L, 12L, 17L, 17L, 
11L, 12L, 16L, 20L, 23L, 14L, 22L, 16L, 10L, 11L, 21L, 10L, 13L, 
17L), sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("f", 
"m"), class = "factor"), class = structure(c(2L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 
2L, 1L), .Label = c("high", "low"), class = "factor")), .Names = c("age", 
"sex", "class"), class = "data.frame", row.names = c(NA, -22L))

## The following line uses which(), sample(), and rbind(), along with some
## indexing, to build a new data frame; see ?which, ?sample, and ?rbind for more.
## To get a feel for the indexing, play with it: type in df[1:3,1:2] as an example.

new_df <- rbind(df[sample(which(df$class=="low"), 4),],
df[sample(which(df$class=="high"), 4),])

Now replace 4 with the number of rows you want from each class. Without
replace=TRUE, sample() cannot draw more rows than a class actually has.
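
If you would rather not hard-code the number, here is a minimal sketch that cuts every class down to the size of the smallest one, which gives the 0.5/0.5 split asked about. The data frame "toy" and the names n_min, idx and balanced are made up for illustration; the 1200 rows and the 0.3/0.7 split are taken from the question, but the values themselves are random.

```r
# Toy data with the same columns as in the question (values are invented):
# 1200 rows, 30% "low" and 70% "high".
set.seed(1)
toy <- data.frame(
  age   = sample(10:23, 1200, replace = TRUE),
  sex   = factor(sample(c("f", "m"), 1200, replace = TRUE)),
  class = factor(rep(c("low", "high"), times = c(360, 840)))
)

# Size of the smallest class; every class is cut down to this many rows.
n_min <- min(table(toy$class))

# For each class level, pick n_min of its row numbers at random,
# without replacement.
idx <- unlist(lapply(levels(toy$class), function(lv) {
  rows <- which(toy$class == lv)
  rows[sample.int(length(rows), n_min)]
}))

# Keep only the sampled rows; both classes now have n_min rows.
balanced <- toy[idx, ]
table(balanced$class)
```

To oversample the smaller class instead, the same pattern should work with max(table(toy$class)) as the target size and replace = TRUE in sample.int(), at the cost of repeated rows.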

hgwelec wrote:
> Thank you for your answer.
> The problem is that i am learning R now, so i do not know how i could do
> this.
> I have found the following code but it does not work unfortunately
> (=create distribution 0.1 "low" class - 0.9 high) :
> data[c(rownames(data.df[data.df$class=="high",]),
> sample(rownames(data[data.df$class=="low"]), 0.1)) , ]

Dear members, 

Consider the following data frame (first 4 rows shown) 

  age sex class 
  15   m   low 
  20   f  high 
  15   f   low 
  10   m   low 

In my original data set I have 1200 rows and a class distribution of low=0.3
and high=0.7.

My question: how can I create a new data frame like the one shown above, but
with the 'high' class subsampled so that in the new data frame the class
distribution is low=0.5 and high=0.5?

I tried looking at the sample function and its prob option, but none of the
examples I have seen deal with an imbalanced class problem like the one shown above.

Thank you in advance 


View this message in context: http://r.789695.n4.nabble.com/Subsampling-oversampling-from-a-data-frame-tp3965771p3971840.html
Sent from the R help mailing list archive at Nabble.com.
