[R] interval partition problem [was: (no subject)]

Sat Feb 5 05:49:43 CET 2005

Gabor Grothendieck <ggrothendieck <at> myway.com> writes:

: 
: Soukup, Matt <SoukupM <at> cder.fda.gov> writes:
: 
: : 
: : Hi.
: : 
: : I have a problem that I can't seem to find an optimal way of solving other
: : than by doing things manually. I'm trying to subset a data frame by the
: : number of observations that occurred at a given row but want to take into
: : account the number of observations of preceding rows. Here's an example.
: : 
: : I'm looking at intervals of data [10,20), [10,30), ..., [10,120) which
: : contain a certain number of observations for treatment A and treatment B. An
: : example is given by the following code.
: : 
: : > int <- as.factor(paste("[", rep(10, 11), ",", seq(20, 120, by = 10), ")"))
: : > nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
: : > nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
: : 
: : > df0 <- data.frame(int, nsamA, nsamB)
: : > df0
: : 
: : Since the interval [10, s) with n_s samples is nested in [10, t) with n_t
: : samples for s < t, we know n_t - n_s samples exist in the interval [s, t). If
: : this difference in sample size is small I want to exclude the interval
: : [10, s). This can be done by comparing each row with the preceding row using
: : the following.
: : 
: : > df0$itagA <- ifelse(c(10, diff(nsamA)) <= 4, 1, 0)
: : > df0$itagB <- ifelse(c(10, diff(nsamB)) <= 4, 1, 0)
: : > df0
: : > # Subset df0 on the tag results
: : > df1 <- df0[df0$itagA != 1 & df0$itagB != 1,]
: : > df1
: : 
: : This works fine, but here is my problem. This looks only at the
: : immediately preceding row and not at rows further "down the line". What I
: : would like to do is include the next interval that includes 5 or more
: : samples from each group, since earlier intervals are nested in the later
: : intervals. In the example given this would include the final interval [10,
: : 120) as this contains more than 4 samples for each treatment. I can do this
: : by hand using something like
: : 
: : > df0[c(1:7,11),]
: : 
: : But this is not an attractive solution as it requires me to actually look at
: : the data set each time and determine the row numbers. This works for this
: : case, but I have many intervals (rows of data) to look at and this would be
: : cumbersome. I've considered using diff with different lag arguments, but
: : this still doesn't seem to work. I also want to note that I need to keep the
: : int factor (as used in the example above) as this is used throughout my
: : analysis (i.e. this is a true factor variable and not simply denoting an
: : interval). I'd be grateful for any possible suggestions as I'm stumped at
: : this moment.
: : 
: 
: Delete the rows one by one and then recalculate diff
: after each deletion (rather than diff'ing all at once 
: and then deleting all at once).  Also, assuming you want 
: every interval to be covered, force the last interval to 
: end at the last row.
: 
: Assume too.few(df0, i) is a function, not shown here, which 
: returns TRUE if there are too few As or Bs in row i minus row 
: i-1 of df0 and otherwise FALSE. Then:
: 
: last.row <- df0[nrow(df0),]
: i <- 1
: while(i < nrow(df0)) if (too.few(df0, i)) df0 <- df0[-i,] else i <- i + 1

That should be i <= nrow(df0)

: df0[nrow(df0),] <- last.row
: 
: P.S.
: 
: Please start a new thread rather than replying to an existing thread
: and please use a meaningful subject.
: 
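
For completeness, here is a minimal sketch of the whole procedure, including one possible too.few(). The helper body, its min.n argument and the name res are assumptions made for illustration; the original suggestion left too.few() unspecified. The threshold of 5 follows the "5 or more samples" in the question, and row 1 is treated as never having too few because it has no preceding row.

```r
# Example data from the original question
int   <- as.factor(paste("[", rep(10, 11), ",", seq(20, 120, by = 10), ")"))
nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
df0   <- data.frame(int, nsamA, nsamB)

# Hypothetical helper (not from the original post): TRUE if row i adds
# fewer than min.n samples over row i - 1 for treatment A or treatment B.
too.few <- function(df, i, min.n = 5) {
  if (i == 1) return(FALSE)  # assumption: the first row is always kept
  (df$nsamA[i] - df$nsamA[i - 1] < min.n) ||
    (df$nsamB[i] - df$nsamB[i - 1] < min.n)
}

last.row <- df0[nrow(df0), ]
res <- df0   # work on a copy so df0 stays intact
i <- 1
# Delete one row at a time; the next comparison is then made against
# the most recently kept row rather than the deleted one.
while (i <= nrow(res)) {
  if (too.few(res, i)) res <- res[-i, ] else i <- i + 1
}
# Force the last kept interval to end at the last row of the data
res[nrow(res), ] <- last.row
rownames(res)[nrow(res)] <- rownames(last.row)
res
```

On the example data this keeps the same rows as df0[c(1:7, 11),]. The int column is still a factor with its original levels, since rows are only dropped or overwritten with another row of the same data frame.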