[R] conditional selection of dataframe rows

Marc Schwartz marc_schwartz at me.com
Thu Aug 12 23:11:19 CEST 2010


On Aug 12, 2010, at 3:06 PM, Toby Gass wrote:

> Thank you all for the quick responses.  So far as I've checked, 
> Marc's solution works perfectly and is quite speedy.  I'm still 
> trying to figure out what it is doing. :)
> 
> Henrique's solution seems to need some columns somewhere.  David's 
> solution does not find all the other measurements, possibly with 
> positive values, taken on the same day.
> 
> Thank you again for your efforts.
> 
> Toby

<snip>

Toby,

Working from the inside out:

The ave() function splits a vector (here, a column of the data frame) into sub-groups defined by one or more grouping factors. Internally it uses split(), then applies the function given by the FUN argument to each sub-group using lapply(). By default, that function is mean().

The great thing about ave() is that it replicates the scalar sub-group result of the function once for each row in the sub-group. In addition, the result vector is in the order of the rows in the original data frame, rather than in the order of the sub-group rows. So in this case, if any row in a sub-group has a negative SLOPE, all rows in that sub-group get a TRUE.
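
To see the replication with the default function, here is a small sketch. The 'toy' data frame is rebuilt from the output printed in this message, so its construction is an assumption; 'grp_mean' is just an illustrative name.

```r
# 'toy' reconstructed from the printed output in this message (an assumption;
# the original code that created it is not shown here)
toy <- data.frame(CH    = c(3, 4, 5, 3, 4, 5, 3, 4, 5),
                  DAY   = c(4, 4, 4, 4, 4, 5, 5, 5, 5),
                  SLOPE = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.1, 0, -0.1))

# With no FUN given, ave() uses mean(): every row receives the mean SLOPE
# of its own CH/DAY sub-group, in the original row order
grp_mean <- with(toy, ave(SLOPE, CH, DAY))

cbind(toy, grp_mean)
```

Rows 1 and 4 share CH 3 / DAY 4, so both should show the same group mean, and likewise for the other sub-groups.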


You can get an initial feel for the internal data organizing process by using:

> split(toy, list(toy$CH, toy$DAY))
$`3.4`
  CH DAY SLOPE
1  3   4   0.2
4  3   4   0.5

$`4.4`
  CH DAY SLOPE
2  4   4   0.3
5  4   4   0.6

$`5.4`
  CH DAY SLOPE
3  5   4   0.4

$`3.5`
  CH DAY SLOPE
7  3   5   0.1

$`4.5`
  CH DAY SLOPE
8  4   5     0

$`5.5`
  CH DAY SLOPE
6  5   5   0.2
9  5   5  -0.1
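
As a rough sketch of what happens to each of those pieces, the split/apply/reassemble steps can be written out by hand. This is only an illustration of the idea, not the actual source of ave(); 'toy' is rebuilt from the printed output (an assumption), and 'g', 'pieces', 'per_group' and 'flag' are illustrative names.

```r
# 'toy' reconstructed from the printed output in this message (an assumption)
toy <- data.frame(CH    = c(3, 4, 5, 3, 4, 5, 3, 4, 5),
                  DAY   = c(4, 4, 4, 4, 4, 5, 5, 5, 5),
                  SLOPE = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.1, 0, -0.1))

# One grouping factor per CH/DAY combination
g <- interaction(toy$CH, toy$DAY)

# Split only the SLOPE column, as ave() works on a vector
pieces <- split(toy$SLOPE, g)

# One value per sub-group: does it contain any negative SLOPE?
per_group <- sapply(pieces, function(x) any(x < 0))
per_group

# Repeat each sub-group result once per row, then put the values back
# into the original row order
flag <- unsplit(lapply(pieces, function(x) rep(any(x < 0), length(x))), g)
flag
```

Only the CH 5 / DAY 5 sub-group contains a negative SLOPE, so only its two rows (6 and 9) end up flagged.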



So the first step is:

> with(toy, ave(SLOPE, CH, DAY, FUN = function(x) any(x < 0)))
[1] 0 0 0 0 0 1 0 0 1


Note that I use with() so that SLOPE, CH and DAY are all evaluated (found) within the 'toy' data frame. That is easier than using:

> ave(toy$SLOPE, toy$CH, toy$DAY, FUN = function(x) any(x < 0))
[1] 0 0 0 0 0 1 0 0 1


This returns a vector of 0's and 1's (FALSE and TRUE coerced to numeric, since the SLOPE vector being filled in is numeric). Note that the returned vector does not correspond to the sequence of rows in the result of split() above, but to the sequence of rows in the original 'toy' data frame. That is, rows 6 and 9 are 1 (TRUE):

> cbind(toy, flag = with(toy, ave(SLOPE, CH, DAY, 
                                  FUN = function(x) any(x < 0))))
  CH DAY SLOPE flag
1  3   4   0.2    0
2  4   4   0.3    0
3  5   4   0.4    0
4  3   4   0.5    0
5  4   4   0.6    0
6  5   5   0.2    1
7  3   5   0.1    0
8  4   5   0.0    0
9  5   5  -0.1    1
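
If the 0/1 coding is a distraction, the flag can be converted back to logical, or the coercion can be avoided by handing ave() a logical vector in the first place. This is a sketch of an alternative, not the approach used in this message; 'toy' is rebuilt from the printed output (an assumption), and 'flag_lgl' and 'flag_direct' are illustrative names.

```r
# 'toy' reconstructed from the printed output in this message (an assumption)
toy <- data.frame(CH    = c(3, 4, 5, 3, 4, 5, 3, 4, 5),
                  DAY   = c(4, 4, 4, 4, 4, 5, 5, 5, 5),
                  SLOPE = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.1, 0, -0.1))

# Numeric 0/1 result, as in the message
flag <- with(toy, ave(SLOPE, CH, DAY, FUN = function(x) any(x < 0)))

# Convert back to TRUE/FALSE and list the flagged row positions
flag_lgl <- as.logical(flag)
which(flag_lgl)

# Alternative: do the comparison first, so ave() is filling a logical vector
# and the result stays logical
flag_direct <- with(toy, ave(SLOPE < 0, CH, DAY, FUN = any))
flag_direct
```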


The next step is to remove those rows. You could do that with regular indexing, but the arguments to subset() are evaluated within the data frame given as its first argument, just as with with() above. Thus, I can eliminate the use of with() and have a shorter solution. Negating the result of ave() turns 0 (FALSE) into TRUE and 1 (TRUE) into FALSE, so only the rows where the ave() result was 0 are retained:

> subset(toy, !ave(SLOPE, CH, DAY, FUN = function(x) any(x < 0)))
  CH DAY SLOPE
1  3   4   0.2
2  4   4   0.3
3  5   4   0.4
4  3   4   0.5
5  4   4   0.6
7  3   5   0.1
8  4   5   0.0
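
For comparison, the regular indexing version mentioned above might look like the following sketch. 'toy' is rebuilt from the printed output (an assumption), and 'flag', 'keep' and 'keep_subset' are illustrative names.

```r
# 'toy' reconstructed from the printed output in this message (an assumption)
toy <- data.frame(CH    = c(3, 4, 5, 3, 4, 5, 3, 4, 5),
                  DAY   = c(4, 4, 4, 4, 4, 5, 5, 5, 5),
                  SLOPE = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.2, 0.1, 0, -0.1))

# Flag every row whose CH/DAY sub-group contains a negative SLOPE
flag <- with(toy, ave(SLOPE, CH, DAY, FUN = function(x) any(x < 0)))

# Regular indexing: keep the rows where the flag is 0
keep <- toy[!flag, ]
keep

# The subset() form from the message, for comparison
keep_subset <- subset(toy, !ave(SLOPE, CH, DAY, FUN = function(x) any(x < 0)))
```

Both forms should keep the same rows; subset() just saves the with() call and the explicit row/column indexing.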


I hope that clarifies the process.

Marc


