[R] Manipulating groups of boolean data subject to group size and distance from other groups

David Winsemius dwinsemius at comcast.net
Mon Nov 28 21:25:14 CET 2016

> On Nov 28, 2016, at 9:38 AM, Morway, Eric <emorway at usgs.gov> wrote:
> The example below is a pared-down version of a much larger dataset.  My
> goal is to use the binary data contained in DF$col2 to guide manipulation
> of the binary data itself, subject to the following:
>   - Groups of '1' that are separated from other, larger groups of "1's" in
>   'col2' by 2 or more years should be converted to "0"
>   - Groups of '1' need to be at least 2 consecutive years to be preserved
> So in the example provided below, DF$col2 would be manipulated such that
> its values are overridden to:
> c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
> That is, the first group of 1's in positions 3 through 6 is separated from
> other groups of 1's by 2 (or more) years, and the second group of 1's
> (positions 11 & 12) span only a single year and do not meet the criteria of
> being at least 2 years long.
> The example R script below shows a small example I'm working with, called
> "DF".  The code that comes after the first line is my attempt to go through
> some R-gymnastics to append a column to DF called "isl2" that reflects the
> number of consecutive years in the 0/1 groups, where the +/- sign acts as
> (or denotes) the original binary condition: 0 = negative, 1 = positive.
> However, I'm stuck with how to proceed further.  Could someone please help
> me come up with script that modifies DF$col2 shown below to be like that
> shown above?
> DF <- data.frame(col1=rep(1991:2004, each=2),
>                  col2=c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

It's not clear from your verbal description why the first group of 1's, with length 4, is discarded while a later group of 1's, also of length 4, is preserved. The rules are ambiguous about how large a run must be in order to be "safe" from removal.

In any case, the answer will almost surely involve the rle function. If you have not encountered it, its help page (?rle) should be your next stop.
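
As a minimal sketch of what rle gives you, here it is applied to your first example, handling only the unambiguous rule (a run of 1's must span at least 2 years). It assumes every run covers whole years, i.e. exactly two rows per year as in your DF; the names rows_per_year, too_short and col2_min_len are made up for illustration.

```r
# Data from the original post: two rows per year, 1991-2004
DF <- data.frame(col1 = rep(1991:2004, each = 2),
                 col2 = c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,
                          1,1,1,1,0,0,1,1,1,1,1,1,1,1))

# Run-length encode the binary column
r <- rle(DF$col2)
r$lengths  # number of rows in each run
r$values   # the 0/1 value carried by each run

# Minimum-length rule only: a run of 1's must span at least 2 years.
# Assumption: 2 rows per year, so "2 years" means "4 rows".
rows_per_year <- 2
too_short <- r$values == 1 & r$lengths < 2 * rows_per_year
r$values[too_short] <- 0

# Expand the modified runs back to a full-length vector
col2_min_len <- inverse.rle(r)
```

This removes only the single-year group in rows 11 and 12; the distance rule still has to be applied on top of it once it is pinned down.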

> DF$inc <- c(0, abs(diff(DF$col2)))
> DF$cum <- cumsum(DF$inc)
> ex1 <- aggregate(col1 ~ cum, data=DF, function(x) length(unique(x)))
> names(ex1) <- c('cum','isl')
> tmp1a <- merge(DF, ex1, by="cum", all.x=TRUE)
> tmp1a$isl2 <- (-1*tmp1a$col2) * tmp1a$isl
> tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0]
> DF$grpng <- tmp1a$isl2
> At this point I was thinking I could use DF$grpng to sweep through col2 and
> make adjustments, but I didn't know how to proceed.
> For debugging purposes, a slightly different example would go from:
> DF <- data.frame(col1=rep(1991:2004, each=2),col2=c(1,1,1,1,
> 1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1))
> to 'col2' looking like:
> c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
> That is, even though the first group of 1's is greater than two consecutive
> years, it is separated from a larger group of 1's by 2 (or more years).
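
One possible reading of the rules that happens to reproduce both of your expected vectors: first drop runs of 1's shorter than 2 years, then treat any gap of 2 or more years as a break between clusters of runs, and keep only the cluster that contains the longest run. This is a guess at the intent, not a confirmed specification. The function prune_runs and its arguments are hypothetical names for this sketch, and it assumes two rows per year with runs covering whole years.

```r
# Hypothetical helper: one interpretation of the two rules, built on rle()
prune_runs <- function(x, rows_per_year = 2, min_years = 2, gap_years = 2) {
  # Step 1: zero out runs of 1's shorter than min_years
  r <- rle(x)
  r$values[r$values == 1 & r$lengths < min_years * rows_per_year] <- 0
  r <- rle(inverse.rle(r))  # re-encode so neighbouring runs of 0 merge

  ones <- r$values == 1
  if (!any(ones)) return(inverse.rle(r))

  # Step 2: a run of 0's lasting gap_years or more starts a new cluster
  long_gap <- r$values == 0 & r$lengths >= gap_years * rows_per_year
  cluster  <- cumsum(long_gap)

  # Step 3: keep only the cluster holding the longest run of 1's
  # (ties go to the earliest such run, an arbitrary choice)
  best <- cluster[ones][which.max(r$lengths[ones])]
  r$values[ones & cluster != best] <- 0
  inverse.rle(r)
}

ex1 <- c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
ex2 <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
out1 <- prune_runs(ex1)
out2 <- prune_runs(ex2)
```

If the intended rule differs (for instance, if a small group should be judged only against its immediate neighbours), Step 3 is the part to change.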


David Winsemius
Alameda, CA, USA
