[R] Manipulating groups of boolean data subject to group size and distance from other groups

Mon Nov 28 18:38:02 CET 2016

The example below is a pared-down version of a much larger dataset.  My
goal is to use the binary data contained in DF$col2 to guide manipulation
of the binary data itself, subject to the following:

   - Groups of 1's that are separated from other, larger groups of 1's in
   'col2' by 2 or more years should be converted to 0
   - Groups of 1's must span at least 2 consecutive years to be preserved

So in the example provided below, DF$col2 would be manipulated such that
its values are overwritten with:

c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)

That is, the first group of 1's (positions 3 through 6) is separated from
other groups of 1's by 2 (or more) years, and the second group of 1's
(positions 11 & 12) spans only a single year and does not meet the criterion
of being at least 2 years long.
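
To make the groups referred to above concrete, one way to list them is to
collapse the data to one value per year and run-length encode that series.
This is only a sketch: it assumes that all rows of a given year carry the
same value (true in this example), and the names 'yr', 'r', 'end' and 'runs'
are made up for illustration.

```r
DF <- data.frame(col1 = rep(1991:2004, each = 2),
                 col2 = c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,
                          1,1,1,1,0,0,1,1,1,1,1,1,1,1))

# One value per year (assumes the rows of a year all agree, as they do here)
yr <- tapply(DF$col2, DF$col1, max)

# Run-length encode the yearly series: one entry per group of 0's or 1's
r   <- rle(as.vector(yr))
end <- cumsum(r$lengths)

# One row per group: its value, its length in years, and the years it spans
runs <- data.frame(value = r$values,
                   years = r$lengths,
                   first = as.integer(names(yr))[end - r$lengths + 1],
                   last  = as.integer(names(yr))[end])
runs
```

Each row of 'runs' is one group, so the two criteria can be read off the
'years' column and the lengths of the 0-groups lying between groups of 1's.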

The example R script below shows a small example I'm working with, called
"DF".  The code that comes after the first line is my attempt to go through
some R-gymnastics to append a column to DF called "isl2" that reflects the
number of consecutive years in the 0/1 groups, where the +/- sign acts as
(or denotes) the original binary condition: 0 = negative, 1 = positive.
However, I'm stuck on how to proceed further.  Could someone please help
me come up with a script that modifies DF$col2 shown below to be like that
shown above?

DF <- data.frame(col1=rep(1991:2004, each=2),
                 col2=c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,
                        1,1,1,1,0,0,1,1,1,1,1,1,1,1))

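# Flag every change in col2, then number the groups cumulatively
DF$inc <- c(0, abs(diff(DF$col2)))
DF$cum <- cumsum(DF$inc)

# Length of each group, counted in distinct years
ex1 <- aggregate(col1 ~ cum, data=DF, function(x) length(unique(x)))
names(ex1) <- c('cum','isl')

tmp1a <- merge(DF, ex1, by="cum", all.x=TRUE)
# merge() sorts by the key; make sure rows are back in the order of DF
tmp1a <- tmp1a[order(tmp1a$cum, tmp1a$col1), ]
# Signed group length: positive for groups of 1, negative for groups of 0
tmp1a$isl2 <- ifelse(tmp1a$col2 == 1, tmp1a$isl, -tmp1a$isl)

DF$grpng <- tmp1a$isl2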

At this point I was thinking I could use DF$grpng to sweep through col2 and
make adjustments, but I didn't know how to proceed.
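
One possible way to proceed is sketched below, working on a run-length
encoding of the yearly series rather than on DF$grpng.  It rests on several
assumptions that are not stated in the criteria above: years are consecutive
with no gaps, all rows of a year carry the same value, the minimum-length
rule is applied first, and a surviving group is dropped when its nearest
surviving neighbour is 2 or more years away and some longer group exists
elsewhere.  The function name 'fix_groups' and its arguments 'min_len' and
'min_gap' are hypothetical.

```r
# Hypothetical helper: a sketch under the assumptions listed above
fix_groups <- function(year, x, min_len = 2, min_gap = 2) {
  yrs <- sort(unique(year))
  yr  <- as.vector(tapply(x, year, max))      # one value per year, in year order

  # Criterion 2: groups of 1's shorter than min_len years become 0
  r <- rle(yr)
  r$values[r$values == 1 & r$lengths < min_len] <- 0
  yr <- inverse.rle(r)

  # Criterion 1: drop isolated groups of 1's when a longer group exists
  r    <- rle(yr)                             # runs now strictly alternate 0/1
  ones <- which(r$values == 1)
  if (length(ones) > 1) {
    # Length of the 0-run between each pair of neighbouring groups of 1's
    gap     <- r$lengths[ones[-length(ones)] + 1]
    nearest <- pmin(c(Inf, gap), c(gap, Inf)) # distance to the closest neighbour
    len     <- r$lengths[ones]
    drop    <- nearest >= min_gap & len < max(len)
    r$values[ones[drop]] <- 0
  }
  yr <- inverse.rle(r)

  yr[match(year, yrs)]                        # expand back to one value per row
}

years <- rep(1991:2004, each = 2)
in1 <- c(0,0,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1)
in2 <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)

out1 <- fix_groups(years, in1)
out2 <- fix_groups(years, in2)
```

With other data the "larger group" rule may need a different reading (for
example, comparing a group only with its immediate neighbours), so the
'drop' condition is the line to adjust.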

For debugging purposes, a slightly different example would go from:

DF <- data.frame(col1=rep(1991:2004, each=2),
                 col2=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,
                        1,1,1,1,0,0,1,1,1,1,1,1,1,1))

to 'col2' looking like:

c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1)

That is, even though the first group of 1's spans more than two consecutive
years, it is separated from a larger group of 1's by 2 (or more) years.
