[R] how to generate random data from an empirical distribution

Tue Jul 27 16:35:34 CEST 2010

On 7/27/2010 6:00 AM, r-help-request at r-project.org wrote:
> Date: Mon, 26 Jul 2010 11:36:29 -0700 (PDT)
> From: xin wei <xinwei at stat.psu.edu>
> To: r-help at r-project.org
> Subject: [R] how to generate random data from an empirical distribution
>
>
> Hi, this is more a statistical question than an R question, but I do want to
> know how to implement this in R.
> I have 10,000 data points. Is there any way to generate an empirical
> probability distribution from them? (The problem is that I do not know which
> distribution the data follow: normal, beta, or something else.) My ultimate
> goal is to generate an additional 20,000 data points from the empirical
> distribution created from the existing 10,000 data points.
> Thank you all in advance.

Ah! This brings back memories of the halcyon days of my youth when, as a 
junior in college, I took a course in introductory probability theory 
around this time during the summer in preparation for working as a co-op 
student the coming fall.

Conceptually, why not treat your empirical sample as an "urn" containing 
10,000 items? Then draw a sample of 20,000 with equal probabilities and 
with replacement (otherwise you'll run out of cases before reaching 
20,000). Remember that all the common distributions (normal, etc.) 
either were derived because they fit certain common situations (e.g., 
binomial), are of particular use (e.g., Student's t), can be derived 
from other distributions (e.g., normal and the Central Limit Theorem), 
or some combination of such things. In other words, whether or not an 
empirical sample fits one of them is always contingent, although 
understanding any underlying processes that generate the sample might 
point in the direction of certain distributions over others. 
Nonetheless, for something like a Monte Carlo simulation, knowledge of 
an underlying distribution is not necessary.
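
In R, the urn idea comes down to a single call to sample(). Here is a
minimal sketch, assuming the 10,000 observations sit in a numeric vector
named x. The name and the simulated gamma values are only stand-ins so the
example runs on its own; substitute the real data.

```r
set.seed(1)  # only to make this illustration reproducible

# Stand-in for the real data (an assumption): replace with your own
# 10,000 observations.
x <- rgamma(10000, shape = 2, rate = 0.5)

# Draw 20,000 values from the empirical distribution of x: each observed
# value is picked with probability 1/10000, with replacement.
y <- sample(x, size = 20000, replace = TRUE)

# The resample should track the original sample closely.
summary(x)
summary(y)
quantile(y, c(0.05, 0.5, 0.95))
```

Every value in y is one that already occurs in x, so this reproduces the
observed distribution without producing any values that were not in the
original sample.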

Also remember that many things in statistics were developed largely 
because they made certain problems mathematically tractable. (Hence, for 
example, the large number of situations involving independent, 
identically distributed random samples or the popularity of ordinary 
least-squares regression.) Today, most of us have more computing power 
at our desks than entire mainframe computing centers had a few decades 
ago. So in many instances, we don't need no stinkin' complex formulas 
anymore.

If you suspect the distribution corresponds to one of the mathematically 
studied distributions, why not fit a curve to a plot of your data points 
and see if it looks familiar? Then do some kind of goodness-of-fit test 
to see if the theoretical distribution is a reasonable approximation.
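
One way to carry out such a check in R, offered as a sketch rather than a
recipe: overlay a candidate density on a histogram of the data, look at a
quantile-quantile plot, and run a Kolmogorov-Smirnov test. The vector x and
the choice of the normal as the candidate distribution are assumptions for
illustration; x is simulated here so the example is self-contained.

```r
set.seed(2)  # only to make this illustration reproducible

# Stand-in for the real data (an assumption): replace with your own values.
x <- rnorm(10000, mean = 50, sd = 10)

# Estimate the parameters of the candidate (normal) distribution.
m <- mean(x)
s <- sd(x)

# Visual check: histogram, kernel density estimate (solid line), and the
# fitted normal density (dashed line).
hist(x, breaks = 50, freq = FALSE, main = "Data vs. fitted normal")
lines(density(x), lwd = 2)
grid <- seq(min(x), max(x), length.out = 200)
lines(grid, dnorm(grid, mean = m, sd = s), lty = 2)

# Normal quantile-quantile plot: points near the line suggest a good fit.
qqnorm(x)
qqline(x)

# Kolmogorov-Smirnov test against the fitted normal distribution.
fit <- ks.test(x, "pnorm", mean = m, sd = s)
print(fit)
```

Two caveats apply. Because m and s are estimated from the same data being
tested, the reported p-value is only approximate. And with 10,000 points,
even small, practically unimportant departures from the candidate
distribution can yield a small p-value, so the plots deserve at least as
much weight as the test.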

-- 
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research 
<http://www.uri.edu/prov/research/urbanstudies.html>
The University of Rhode Island <http://www.uri.edu>
email: marsh @ uri .edu (remove spaces)