[R] Mixture of Normals with Large Data

Bert Gunter gunter.berton at gene.com
Wed Aug 8 01:18:18 CEST 2007

Have you considered the situation of wanting to characterize
probability densities of prevalence estimates based on a complex
random sample of some large population?

No -- and I stand by my statement. The empirical distribution of the data
themselves is the best "characterization" of the density. You and others
are free to disagree.
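
A minimal illustration of that point (x stands for whatever numeric vector
holds the data; this is a generic sketch, not code from the thread):

    Fhat <- ecdf(x)                    # empirical CDF, usable as a function
    quantile(x, c(0.025, 0.5, 0.975))  # any summary, straight from the data
    plot(density(x))                   # kernel density estimate for a look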

-- Bert



On 8/7/07, Bert Gunter <gunter.berton at gene.com> wrote:
> Why would anyone want to fit a mixture of normals with 110 million
> observations?? Any questions about the distribution that you would care to
> ask can be answered directly from the data. Of course, any test of
> normality (or anything else) would be rejected.
>
> More to the point, the data are certainly not a random sample of anything.
> There will be all kinds of systematic nonrandom structure in them. This is
> clearly a situation where the researcher needs to think more carefully
> about the substantive questions of interest and how the data may shed light
> on them, instead of arbitrarily and perhaps reflexively throwing some silly
> statistical methodology at them.
>
> Bert Gunter
> Genentech Nonclinical Statistics
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Tim Victor
> Sent: Tuesday, August 07, 2007 3:02 PM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] Mixture of Normals with Large Data
>
> I wasn't aware of this literature; thanks for the references.
>
> On 8/5/07, RAVI VARADHAN <rvaradhan at jhmi.edu> wrote:
> > Another possibility is to use "data squashing" methods.  Relevant papers
> > are: (1) DuMouchel et al. (1999), (2) Madigan et al. (2002), and (3)
> > Owen (1999).
> >
> > Ravi.
> > ____________________________________________________________________
> >
> > Ravi Varadhan, Ph.D.
> > Assistant Professor,
> > Division of Geriatric Medicine and Gerontology
> > School of Medicine
> > Johns Hopkins University
> >
> > Ph. (410) 502-2619
> > email: rvaradhan at jhmi.edu
> >
> >
> > ----- Original Message -----
> > From: "Charles C. Berry" <cberry at tajo.ucsd.edu>
> > Date: Saturday, August 4, 2007 8:01 pm
> > Subject: Re: [R] Mixture of Normals with Large Data
> > To: tvictor at dolphin.upenn.edu
> > Cc: r-help at stat.math.ethz.ch
> >
> >
> > > On Sat, 4 Aug 2007, Tim Victor wrote:
> > >
> > >  > All:
> > >  >
> > >  > I am trying to fit a mixture of 2 normals with > 110 million
> > >  > observations. I am running R 2.5.1 on a box with 1 GB RAM running
> > >  > 32-bit Windows and I continue to run out of memory. Does anyone
> > >  > have any suggestions?
> > >
> > >  If the first few million observations can be regarded as an SRS of
> > >  the rest, then just use them. Or read in blocks of a convenient size
> > >  and sample some observations from each block. You can repeat this
> > >  process a few times to see if the results are sufficiently accurate.
> > >
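> > >  A rough sketch of that block-sampling idea (the file name, block
> > >  size, and subsample size are illustrative assumptions, not from the
> > >  original post):
> > >
> > >    con <- file("values.txt", open = "r")  # hypothetical one-column file
> > >    keep <- numeric(0)
> > >    repeat {
> > >        block <- scan(con, what = double(), n = 1e6, quiet = TRUE)
> > >        if (length(block) == 0) break   # end of file
> > >        keep <- c(keep, sample(block, min(1e4, length(block))))
> > >    }
> > >    close(con)
> > >    # 'keep' is now a manageable subsample; refit on a few independent
> > >    # subsamples to check that the estimates are stable.
> > >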
> > >  Otherwise, read in blocks of a convenient size (perhaps 1 million
> > >  observations at a time), quantize the data to a manageable number of
> > >  intervals - maybe a few thousand - and tabulate it. Add the counts
> > >  over all the blocks.
> > >
> > >  Then use mle() to fit a multinomial likelihood whose probabilities
> > >  are the masses associated with each bin under a mixture of normals law.
> > >
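> > >  A sketch of that whole pipeline with stats4::mle() (bin limits,
> > >  block size, and starting values are illustrative assumptions):
> > >
> > >    library(stats4)
> > >    breaks <- c(-Inf, seq(-10, 10, length.out = 2001), Inf)
> > >    counts <- integer(length(breaks) - 1L)
> > >    con <- file("values.txt", open = "r")  # hypothetical one-column file
> > >    repeat {
> > >        block <- scan(con, what = double(), n = 1e6, quiet = TRUE)
> > >        if (length(block) == 0) break
> > >        counts <- counts + tabulate(cut(block, breaks), length(counts))
> > >    }
> > >    close(con)
> > >
> > >    # multinomial negative log-likelihood under a 2-component normal
> > >    # mixture; eta and the log.s* terms are unconstrained versions of
> > >    # the mixing weight and the standard deviations
> > >    nll <- function(eta, mu1, log.s1, mu2, log.s2) {
> > >        w <- plogis(eta)           # mixing weight kept inside (0, 1)
> > >        F <- function(x) w * pnorm(x, mu1, exp(log.s1)) +
> > >                         (1 - w) * pnorm(x, mu2, exp(log.s2))
> > >        p <- diff(F(breaks))       # mixture mass in each bin
> > >        -sum(counts[counts > 0] * log(p[counts > 0]))
> > >    }
> > >    fit <- mle(nll, start = list(eta = 0, mu1 = -1, log.s1 = 0,
> > >                                 mu2 = 1, log.s2 = 0))
> > >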
> > >  Chuck
> > >
> > >  >
> > >  > Thanks so much,
> > >  >
> > >  > Tim
> > >  >
> > >
> > >  Charles C. Berry                     (858) 534-2098
> > >  cberry at tajo.ucsd.edu              Dept of Family/Preventive Medicine
> > >                                       UC San Diego
> > >                                       La Jolla, San Diego 92093-0901
> > >
> >
>
