[R] Removing columns that are na or constant

Rui Barradas ruipbarradas at sapo.pt
Tue Nov 20 23:53:07 CET 2012


Hello,

Comments inline.
On 20-11-2012 22:03, Brian Feeny wrote:
> I have a dataset that has many columns which are NA or constant, and so I remove them like so:
>
>
> same <- sapply(dataset, function(.col){
>    all(is.na(.col))  || all(.col[1L] == .col)
> })
> dataset <- dataset[!same]
>
> This works GREAT (thanks to the r-users list archive I found this)
>
> however, then when I do my data sampling like so:
>
> testSize <- floor(nrow(x) * 10/100)
> test <- sample(1:nrow(x), testSize)
>      
> train_data <- x[-test,]
> test_data <- x[test, -1]
> test_class <- x[test, 1]
>
> It is now possible that test_data or train_data contains columns that are constant, even though no column was constant in the full dataset.

Suppose they do. If you now remove those columns from one of train_data 
or test_data, and not from the other, then their structures are no 
longer the same.
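
A small sketch of that situation, using made-up data and a fixed (non-random) split so the result is reproducible. The data frame, the column names and the helper is_constant are hypothetical and only serve the illustration; the constant test is the one from the original code, wrapped in isTRUE() so that a column mixing NAs and values yields FALSE instead of NA.

```r
# Toy data: "rare" varies in the full data, but only in the last row.
x <- data.frame(
  class = c("a", "b", "a", "b", "a", "b"),
  v1    = c(1, 2, 3, 4, 5, 6),
  rare  = c(0, 0, 0, 0, 0, 1)
)

# Fixed split instead of sample(), so the outcome does not depend on a seed.
test <- c(1L, 2L)
train_data <- x[-test, ]
test_data  <- x[test, -1]

# Hypothetical helper: TRUE for all-NA or constant columns.
is_constant <- function(.col) {
  all(is.na(.col)) || isTRUE(all(.col[1L] == .col))
}

sapply(x, is_constant)           # checks the full data
sapply(test_data, is_constant)   # checks the test rows only
sapply(train_data, is_constant)  # checks the training rows only
```

Here "rare" is flagged only in test_data, so removing it there and not in train_data would leave the two subsets with different columns.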
>
> So the solution for me is to just re-run lines to remove all constants

Or write a function. I would have the function return the indices of the 
good columns and then intersect the results for train_data and test_data.

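notSame <- function(dataset){
     same <- sapply(dataset, function(.col){
         # isTRUE() turns an NA comparison result (column mixing NAs and
         # values) into FALSE, so such a column counts as not constant
         all(is.na(.col))  || isTRUE(all(.col[1L] == .col))
     })
     # positions of the columns that are neither all NA nor constant
     which(!same)
}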

good1 <- notSame(train_data)
good2 <- notSame(test_data)
dataset <- dataset[intersect(good1, good2)]


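Now dataset contains only the columns that vary in both train_data and 
test_data. A different random sample can produce constant columns again, 
so repeat the check after each new split.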
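
One caveat when intersecting positions: in the code quoted above, test_data is built as x[test, -1], so its columns are shifted by one relative to train_data. A possible variant, sketched here under that assumption, matches columns by name instead. The helper varyingNames and the toy data are hypothetical, not part of the original code.

```r
# Hypothetical helper: names of columns that are neither all NA nor constant.
varyingNames <- function(dataset) {
  same <- sapply(dataset, function(.col) {
    all(is.na(.col)) || isTRUE(all(.col[1L] == .col))
  })
  names(dataset)[!same]
}

# Toy data and a fixed split, for illustration only.
x <- data.frame(
  class = c("a", "b", "a", "b", "a", "b"),
  v1    = c(1, 2, 3, 4, 5, 6),
  rare  = c(0, 0, 0, 0, 0, 1)
)
test <- c(1L, 2L)
train_data <- x[-test, ]
test_data  <- x[test, -1]

# Predictors that vary in both subsets, matched by name rather than position.
keep <- intersect(varyingNames(train_data), varyingNames(test_data))

# Apply the same predictor set to both subsets; the class column, which
# test_data does not contain, is kept explicitly in train_data.
train_data <- train_data[c("class", keep)]
test_data  <- test_data[keep]
```

Matching by name keeps the two subsets aligned even when one of them lacks the class column.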

> ......not a problem, but is this normal?  is this how I should
> be handling this in R?  many models I am attempting to use (SVM, lda, etc) don't like if a column has all the same value.......
> so as a beginner, this is how I am handling it in R, but I am looking for someone to sanity check what I am doing is sound.

Only you can tell whether it's sound to eliminate variables from your 
analysis, and which ones.

Hope this helps,

Rui Barradas
>
> Brian
>