[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Martin Maechler maechler at stat.math.ethz.ch
Fri Jul 28 21:55:37 CEST 2006


>>>>> "Kevin" == Kevin B Hendricks <kevin.hendricks at sympatico.ca>
>>>>>     on Fri, 28 Jul 2006 14:53:57 -0400 writes:

    [.........]

    Kevin> The idea is to somehow make functions that work well
    Kevin> over small sub-sequences of a much longer vector
    Kevin> without resorting to splitting the vector into many
    Kevin> smaller vectors.

    Kevin> In my particular case, the problem was my data frame
    Kevin> had over 1 million lines and probably over 500,000
    Kevin> unique sort keys (i.e., think of it as an R factor with
    Kevin> over 500,000 levels).  The implementation of "by"
    Kevin> uses "tapply" which in turn uses "split".  So "split"
    Kevin> simply ate up all the time trying to create 500,000
    Kevin> vectors each of short length 1, 2, or 3; and the
    Kevin> associated garbage collection.

Not that I have spent enough time thinking about this thread's
topic, but I have seen more than one case where using  tapply()
unnecessarily slowed down computations.
I don't remember the details, but I know that in one case, replacing
tapply() by a few lines of code (one of which used lapply(), IIRC)
sped up that computation by a factor (of 2? or more?).

I also vaguely remember that I thought about making tapply()
faster, but came to the conclusion that it could not be
sped up easily, because it works in a considerably more general
context than the one it was used in for that application (and maybe yours?).
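
As a rough illustration of the kind of replacement meant here (a sketch
only, not the code from that application; the vectors x and g are made
up), a single-factor tapply() call can be written with split() and
lapply() directly. Note that this still calls split(), so by itself it
would not remove the cost of creating very many short vectors.

```r
# Hypothetical example data: values x grouped by a single key g.
x <- c(1, 2, 3, 4, 5, 6)
g <- c("a", "a", "b", "c", "c", "c")

# The general-purpose route.
res_tapply <- tapply(x, g, sum)

# The special case written out by hand: split into groups, apply, recombine.
res_lapply <- unlist(lapply(split(x, g), sum))
```

Both results hold one sum per group; tapply() returns a one-dimensional
array, while the hand-written version returns a plain named vector.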


    Kevin> A simple loop that walked the short sequence of
    Kevin> values (since the data frame was already sorted)
    Kevin> calculating what it needed, would work much faster
    Kevin> than splitting the original vector into so very many
    Kevin> smaller vectors (and the associated copying of data).
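
The idea of exploiting the sort order can be sketched in plain R without
an explicit loop. The function below is only an illustration under
stated assumptions: the name sorted_group_sum is hypothetical, the key
vector must already be sorted (equal keys adjacent) and free of NAs, and
only group sums are computed. It finds the end of each run of equal keys
and differences a cumulative sum, so split() is never called.

```r
# Group sums for data whose keys are already sorted, without split().
# sorted_group_sum is a hypothetical helper name, not an existing R function.
sorted_group_sum <- function(x, key) {
  stopifnot(length(x) == length(key))
  n <- length(x)
  if (n == 0L) return(numeric(0))
  # TRUE at the last element of each run of equal keys.
  is_last <- c(key[-1L] != key[-n], TRUE)
  ends <- which(is_last)
  # Cumulative sum at each run end; successive differences are the run sums.
  cs <- cumsum(x)
  sums <- diff(c(0, cs[ends]))
  names(sums) <- as.character(key[ends])
  sums
}

key <- c(1, 1, 2, 3, 3, 3)
x   <- c(10, 20, 5, 1, 2, 3)
res <- sorted_group_sum(x, key)
```

Differencing a cumulative sum can lose floating-point accuracy on very
long vectors with values of widely varying magnitude, so this is a
starting point rather than a general replacement for tapply().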

    Kevin> That problem is very similar to the
    Kevin> calculation of basic stats on a short moving window
    Kevin> over a very long vector.
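
For the moving-window case, a minimal base-R sketch of the same idea
follows. The function name moving_mean and the window width k are
assumptions made for illustration; this is not the code of any package
mentioned in this thread. It computes the mean over every full window of
k consecutive values from one cumulative sum, without creating a vector
per window.

```r
# Mean over each full window of k consecutive values.
# moving_mean is a hypothetical helper name used only for this sketch.
moving_mean <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 1, k <= n)
  # Pad with a leading 0 so that cs[i + 1] is the sum of x[1..i].
  cs <- c(0, cumsum(x))
  # Window sum = cumulative sum at window end minus the one before window start.
  (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k
}

mm <- moving_mean(c(1, 2, 3, 4, 5), 3)
```

The result has length(x) - k + 1 elements, one per full window.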

    >> The author of that message ultimately wrote the caTools R
    >> package which contains some optimized versions.

    Kevin> I will look into that package and maybe use it for a
    Kevin> model for what I want to do.

    Kevin> Thanks,

    Kevin> Kevin

