[R] sample equal number of cases per class

Rui Barradas ruipbarradas at sapo.pt
Sun Nov 4 13:32:42 CET 2012


Hello,

Function caret::createDataPartition preserves the proportions of 
classes, as its documentation says, so you should expect the result 
to be balanced only if the original data.frame is also balanced. A 
solution is to write a small function that chooses a balanced set of 
indices. Note that the function below does _not_ use the same arguments 
as caret::createDataPartition. Its arguments are:

x - the original vector, matrix or data.frame.
y - a vector of class labels, one per row of x; the classes to balance.
p - proportion of x to choose.


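createSets <- function(x, y, p){
     nr <- NROW(x)
     # number of cases to draw from each class
     size <- (nr * p) %/% length(unique(y))
     # sample row indices within each class; indexing via sample.int()
     # avoids sample(n, size) drawing from 1:n when a class has one row.
     # Note: this errors if any class has fewer than 'size' cases.
     idx <- lapply(split(seq_len(nr), y),
                   function(.x) .x[sample.int(length(.x), size)])
     unlist(idx)
}
ind <- createSets(df, df$class, 0.8)
lrn <- df[ind,]
summary(lrn)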
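
If some class has fewer cases than the requested per-class size, sample() 
stops with an error. The sketch below is one possible variant, not part of 
the function above: it caps the size at the smallest class and also builds 
the test set from the rows that were not selected. The name 
createBalancedSets, the seed and the object name 'dat' are assumptions made 
for illustration only.

```r
# Hypothetical variant of createSets(): caps the per-class size at the
# smallest class so sample.int() never asks for more cases than exist.
createBalancedSets <- function(x, y, p) {
    nr <- NROW(x)
    groups <- split(seq_len(nr), y, drop = TRUE)   # row indices per class
    size <- min((nr * p) %/% length(groups), lengths(groups))
    idx <- lapply(groups, function(.x) .x[sample.int(length(.x), size)])
    unlist(idx, use.names = FALSE)
}

set.seed(1)  # assumed seed, only to make the example reproducible
dat <- data.frame(class = factor(sample(1:3, 20, replace = TRUE)),
                  var1 = rnorm(20, 3), var2 = rnorm(20, 6))

ind <- createBalancedSets(dat, dat$class, 0.8)
lrn <- dat[ind, ]    # balanced learning set
tst <- dat[-ind, ]   # remaining rows form the test set

table(lrn$class)     # every class appears the same number of times
```

With the cap in place the learning set may be smaller than p * nrow(x) 
when the classes are very unequal; that is the price of keeping it balanced 
without sampling with replacement.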


Also, 'df' is a bad name for a variable, since it is already the name of 
an R function. Use, for instance, 'dat'.
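
To illustrate why the name matters: df is the density function of the F 
distribution in the stats package, so if the data frame was never created, 
the name still resolves to that function and the error you get mentions a 
'closure' rather than a missing object. A minimal sketch, assuming a fresh 
session in which no object named 'df' has been assigned:

```r
# 'df' resolves to stats::df, the density of the F distribution
is.function(df)

# Subsetting the function fails; capture the error instead of stopping
res <- tryCatch(df[1, ], error = function(e) e)
conditionMessage(res)
```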

Hope this helps,

Rui Barradas
On 04-11-2012 10:47, ollestrat wrote:
> Dear community
>
> I have a dataframe and want to split it into a learn and a test partition.
> However the learnset should be balanced, i.e. each class should have the
> same number of cases. I tried and searched a lot, without success so far.
> Maybe you can help?
>
> Some example code
> # generate example data
> df <- data.frame(class = as.factor(sample(1:3, 20, replace = T)), var1 =
> rnorm(20,3), var2 = rnorm(20,6))
> summary(df)
>
> # split into learn and test sets using the caret package
> require(caret)
> ind <- createDataPartition(df$class, p=.8, list = F, times = 1)
>
> # The problem is here: class sizes are not equal
> learnset <- df[ind,]
> summary(learnset)
>
> Version info:
> > R.Version()
> $platform
> [1] "x86_64-pc-mingw32"
> $arch
> [1] "x86_64"
> $os
> [1] "mingw32"
> $system
> [1] "x86_64, mingw32"
> $major
> [1] "2"
> $minor
> [1] "15.1"
>



