[R] interval partition problem [was: (no subject)]

Gabor Grothendieck ggrothendieck at myway.com
Sat Feb 5 05:36:14 CET 2005


Soukup, Matt <SoukupM <at> cder.fda.gov> writes:

: 
: Hi.
: 
: I have a problem that I can't seem to find an optimal way of solving other
: than by doing things manually. I'm trying to subset a data frame by the
: number of observations that occurred at a given row but want to take into
: account the number of observations of preceding rows. Here's an example.
: 
: I'm looking at intervals of data [10,20), [10, 30), ....., [10,120) which
: contain a certain number of observations for treatment A and treatment B. An
: example is given by the following code.
: 
: > int <- as.factor(paste("[", rep(10, 11), ",", seq(20, 120, by=10), ")"))
: > nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
: > nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
: 
: > df0 <- data.frame(int, nsamA, nsamB)
: > df0
: 
: Since the interval [10, s) with n_s samples is nested in [10, t) with n_t
: samples for s < t, we know n_t - n_s samples exist in the interval [s, t). If
: this sample size of the difference is small I want to exclude the interval
: [10,s). This can be done comparing adjacent preceding rows using the
: following.
: 
: > df0$itagA <- ifelse(c(10, diff(nsamA)) <= 4, 1, 0)
: > df0$itagB <- ifelse(c(10, diff(nsamB)) <= 4, 1, 0)
: > df0
: > # Subset df0 on the tag results
: > df1 <- df0[df0$itagA != 1 & df0$itagB != 1,]
: > df1
: 
: This works fine, but here is my problem. This looks only at the
: immediately preceding row and not at rows further "down the line". What I
: would like to do is include the next interval that includes 5 or more
: samples from each group since earlier intervals are nested in the latter
: intervals. In the example given this would include the final interval [10,
: 120) as this contains more than 4 samples for each treatment. I can do this
: by hand using something like
: 
: > df0[c(1:7,11),]
: 
: But this is not an attractive solution as it requires me to actually look at
: the data set each time and determine the row numbers. This works for this
: case, but I have many intervals (rows of data) to look at and this would be
: cumbersome. I've considered using diff with different lag arguments, but
: this still doesn't seem to work. I also want to note that I need to keep the
: int factor (as used in the example above) as this is used throughout my
: analysis (i.e. this is a true factor variable and not simply denoting an
: interval). I'd be grateful for any possible suggestions as I'm stumped at
: this moment. 
: 


Delete the rows one by one and then recalculate diff
after each deletion (rather than diff'ing all at once 
and then deleting all at once).  Also, assuming you want 
every interval to be covered, force the last interval to 
end at the last row.

Assume too.few(df0, i) is a function, not shown here, which 
returns TRUE if there are too few As or Bs in row i minus row 
i-1 of df0 and otherwise FALSE. Then:

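last.row <- df0[nrow(df0),]   # remember the final interval
i <- 2                        # row 1 has no preceding row to compare with
while (i <= nrow(df0)) {      # <= so the current last row is examined too
  if (too.few(df0, i)) df0 <- df0[-i,] else i <- i + 1
}
df0[nrow(df0),] <- last.row   # make the last kept interval end at the last row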
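
For concreteness, here is a minimal sketch of how the pieces might fit
together on the data from the original post. The body of too.few() is only
an assumption (it is not shown above): it flags row i when either treatment
gains 4 or fewer samples over row i-1, mirroring the "<= 4" test in the
question. The wrapper thin() and its cutoff argument are likewise
hypothetical names used for illustration. This sketch starts the scan at
row 2 and lets it run through the last row, so that a final row with too
few new samples is merged into the previous kept interval.

```r
# Data from the original post
int <- as.factor(paste("[", rep(10, 11), ",", seq(20, 120, by = 10), ")"))
nsamA <- c(62, 83, 118, 151, 180, 201, 212, 215, 216, 217, 218)
nsamB <- c(65, 90, 128, 163, 190, 199, 209, 214, 215, 216, 218)
df0 <- data.frame(int, nsamA, nsamB)

# Hypothetical too.few(): TRUE if row i adds `cutoff` or fewer samples
# over row i-1 in either treatment. Row 1 has no predecessor, so FALSE.
too.few <- function(df, i, cutoff = 4) {
  if (i < 2) return(FALSE)
  (df$nsamA[i] - df$nsamA[i - 1] <= cutoff) ||
    (df$nsamB[i] - df$nsamB[i - 1] <= cutoff)
}

# Hypothetical wrapper: delete rows one at a time, so each comparison is
# made against the most recent row that was kept.
thin <- function(df, cutoff = 4) {
  last.row <- df[nrow(df), ]
  i <- 2
  while (i <= nrow(df)) {
    if (too.few(df, i, cutoff)) df <- df[-i, ] else i <- i + 1
  }
  # Overwrite the last kept row so the final interval ends at the last row
  df[nrow(df), ] <- last.row
  df
}

df1 <- thin(df0)
df1
```

On this data the result holds the same values as df0[c(1:7,11),]. Note that
the overwritten last row keeps the row name of the row it replaced, so
reset the row names if they matter to you. The int factor and its levels
are carried along unchanged.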


P.S.

Please start a new thread rather than replying to an existing thread
and please use a meaningful subject.



