[R] select rows by criteria
savicky at cs.cas.cz
Thu Mar 1 18:23:54 CET 2012
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
> On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
> > Hello,
> > I am stuck with selecting the right rows from a data frame. I think the
> > problem is rather how to select them
> > then how to implement the R code.
> > Consider the following data frame:
> > df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value =
> > c(34,12,23,25,34,42,48,29,30,27))
> > What I want to achieve is to select 7 rows (values) so that the mean value
> > of those rows are closest
> > to the value of 35 and the remaining 3 rows (values) are closest to 45.
> > However, each value is only
> > allowed to be sampled once!
> If some 3 rows have mean close to 45, then they have sum close
> to 3*45, so the remaining 7 rows have sum close to
> sum(df$value) - 3*45 #  169
> and they have mean close to 169/7 = 24.14286. In other words,
> the two criteria cannot be optimized together.
> For this reason, let me choose the criterion on 3 rows.
> The closest solution may be found as follows.
> # generate all triples and compute their means
> tripleMeans <- colMeans(combn(df$value, 3))
> # select the index of the triple with mean closest to 35
> indClosest <- which.min(abs(tripleMeans - 35))
> # generate the indices, which form the closest triple in df$value
> tripleInd <- combn(1:length(df$value), 3)[, indClosest]
> tripleInd #  1 3 7
> # check the mean of the triple
> mean(df$value[tripleInd]) #  35
> This code constructs all triples. If it is used for k-tuples
> for a larger k and for a set of n values, its complexity
> will be proportional to choose(n, k), so it will be large
> even for moderate n, k. It is hard to provide a significant
> speed up, since some variants of "knapsack problem", which
> is NP-complete, may be reduced to your question. Consequently,
> it is, in general, NP-complete.
Also this statement requires a correction. It applies to the
search of an exact optimum if the numbers in df$value are large.
There are efficient algorithms, which find an approximate solution.
Also, if the numbers in df$value are integers (or may be rounded
to integers after an appropriate scaling), then there is an
algorithm, whose complexity is O(k*n*max(df$value)). This
may be significantly less than choose(n, k).
CRAN task view Optimization and Mathematical Programming
may suggest also other solutions.
More information about the R-help