[R] subset data.frame with value != in all columns

Tim Howard tghoward at gw.dec.state.ny.us
Fri Feb 4 17:17:40 CET 2005


Because I'll be doing this on big datasets and time is important, I
thought I'd time all the different approaches that were suggested on a
small dataframe. The results were very instructive so I thought I'd pass
them on. I also discovered that my numeric columns (e.g. -9999.000)
weren't found by apply() but were found by which() and the simple
replace. Was it apply's fault or something else?

Note how much faster unique(which()) is; wow! Thanks to Marc Schwartz
for this blazing solution.

> nrow(in.df)
[1] 40000
#extract rows with no -9999
> system.time(x <- subset(in.df, apply(in.df, 1,
function(in.df){all(in.df != -9999)})))
[1] 3.25 0.00 3.25   NA   NA
> system.time(y<- in.df[-unique(which(in.df == -9999, arr.ind = TRUE)[,
1]), ])
[1] 0.17 0.00 0.17   NA   NA
> system.time({is.na(in.df) <-in.df == -9999; z <- na.omit(in.df)})
[1] 0.25 0.02 0.26   NA   NA

> nrow(x);nrow(y);nrow(z)
[1] 39990
[1] 39626
[1] 39626

#extract rows with -9999
> system.time(d<-subset(in.df, apply(in.df, 1,
function(in.df){any(in.df == -9999)})))
[1] 3.40 0.00 3.45   NA   NA
> system.time(e<-in.df[unique(which(in.df == -9999, arr.ind = TRUE)[,
1]), ])
[1] 0.11 0.00 0.11   NA   NA

> nrow(d); nrow(e)
[1] 10
[1] 374

Tim Howard


>>> Marc Schwartz <MSchwartz at MedAnalytics.com> 02/03/05 03:24PM >>>
On Thu, 2005-02-03 at 14:57 -0500, Tim Howard wrote: 
  ... snip...
> My questions: 
> Is there a cleaner way to extract all rows containing a specified
> value?
> How can I extract all rows that don't have this value in any col?
> 
> #create dummy dataset
> x <- data.frame(
> c1=c(-99,-99,-99,4:10),
> c2=1:10,
> c3=c(1:3,-99,5:10),
> c4=c(10:1),
> c5=c(1:9,-99))
> 
..snip...

How about this, presuming that your data frame is all numeric:

For rows containing -99:

> x[unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
    c1 c2  c3 c4  c5
1  -99  1   1 10   1
2  -99  2   2  9   2
3  -99  3   3  8   3
4    4  4 -99  7   4
10  10 10  10  1 -99


For rows not containing -99:

> x[-unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
  c1 c2 c3 c4 c5
5  5  5  5  6  5
6  6  6  6  5  6
7  7  7  7  4  7
8  8  8  8  3  8
9  9  9  9  2  9


What I have done here is to use which(), setting arr.ind = TRUE. This
returns the row, column indices for the matches to the boolean
statement. The first column returned by which() in this case are the
row
numbers matching the statement, so I take the first column only.

Since it is possible that more than one element in a row can match the
boolean, I then use unique() to get the singular row values.

Thus, I can use the returned row indices above to subset the data
frame.

HTH,

Marc Schwartz




More information about the R-help mailing list