[R] Median of streaming data

Mohan Radhakrishnan radhakrishnan.mohan at gmail.com
Wed Sep 24 16:55:46 CEST 2014


I meant the papers. I hit a paywall. Can we reconstruct the code from the
papers ?

Thanks,
Mohan

On Wed, Sep 24, 2014 at 3:59 PM, Martyn Byng <martyn.byng at nag.co.uk> wrote:

> Something else that might be of interest ...
>
> Zhang Q and Wang W (2007) A fast algorithm for approximate quantiles in
> high speed data streams Proceedings of the 19th International Conference on
> Scientific and Statistical Database Management IEEE Computer Society 29
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Martin Maechler
> Sent: 24 September 2014 09:17
> To: Rolf Turner
> Cc: R-help at r-project.org; R-SIG-robust at r-project.org
> Subject: Re: [R] Median of streaming data
>
> >>>>> Rolf Turner <r.turner at auckland.ac.nz>
> >>>>>     on Wed, 24 Sep 2014 18:43:34 +1200 writes:
>
>     > On 24/09/14 17:31, Mohan Radhakrishnan wrote:
>     >> Hi,
>     >>
>     >> I have streaming data(1 TB) that can't fit in memory. Is
>     >> there a way for me to find the median of these streaming
>     >> integers assuming I can fit only a small part in memory ?
>     >> This is about the statistical approach to find the median
>     >> of a large number of values when I can inspect only a
>     >> part of them due to memory constraints.
>
>     > You cannot, I'm pretty sure, calculate the median
>     > recursively.  However there are "approximate" recursive
>     > median algorithms which provide an estimate of location
>     > that has the same asymptotic properties as the median.
>
>     > See:
>
>     > * U. Holst, Recursive estimators of location.
>     > Commun. Statist. Theory Meth., vol. 16, 1987,
>     > pp. 2201--2226.
>
>     > and
>
>     > * Murray A. Cameron and T. Rolf Turner, Recursive location
>     > and scale estimators, Commun. Statist. Theory Meth.,
>     > vol. 22, 1993, pp. 2503--2515.
>
> This is really interesting to me, thank you, Rolf!
>
> OTOH,
>
> 1) has your proposal ever been provided in R?
>    I'd be happy to add it to the robustX
>    (http://cran.ch.r-project.org/web/packages/robustX) or even
>    robustbase (http://cran.ch.r-project.org/web/packages/robustbase)
> package.
>
> 2) Would anybody know of more recent research on the subject?
>    (I quickly "googled around" and found research more geared
>     for the time series situation which is more involved anyway)
>
>    --> Hence CC'ing the experts' list  R-SIG-robust
>
>
> Martin Maechler,  ETH Zurich
>
>
>     > cheers,
>     > Rolf Turner
>
>     > --
>     > Rolf Turner Technical Editor ANZJS
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ________________________________________________________________________
> This e-mail has been scanned for all viruses by Star.\ _...{{dropped:3}}
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list