[R] Nonparametric bivariate distribution estimation and sampling
David Winsemius
dwinsemius at comcast.net
Fri Mar 23 22:29:00 CET 2012
On Mar 23, 2012, at 3:55 PM, heyi xiao wrote:
> David,
> Thanks a lot for the specific suggestions. That’s very helpful. My
> question 1 is fully answered now. I guess I am not clear enough for
> my question 2. I would like to generate a random sample using the
> estimated probability density (as a result of my question 1) as the
> reference distribution.
> Say, I get a matrix of the estimated density (at some grid points)
> using MASS::kde2d. How can I use that result as a reference
> distribution to sample data from? I know it is a trivial issue for
> parametric distributions like bivariate normal, but what about such
> a nonparametric bivariate reference distribution? Any particular
> procedures or functions I can use?
See if this works:
data(geyser, package="MASS")
x <- cbind(geyser$duration, geyser$waiting)
est <- bkde2D(x, bandwidth=c(0.7, 7))
# Heh, I realized after I did this that I started with
KernSmooth::bkde2D
# and checked the results with MASS::kde2d
# only difference appears to be name of density matrix
# Construct a dataframe with X.Y information and the data from the
bivariate density.
# The output of bkde2D with n=50 is:
#List of 3
#$ x1 : num [1:51] -0.2167 -0.0823 0.052 0.1863 0.3207 ...
# $ x2 : num [1:51] 32.5 34.2 35.9 37.7 39.4 ...
# $ fhat: num [1:51, 1:51] 3.05e-19 2.17e-19 3.25e-19 2.17e-19 0.00 ...
# The index X.Y could be X + 51*Y and there would be a 1:1 mapping
from (X,Y) to X.Y
# and the fhat values would be properly arranged
dfrm <- expand.grid(X=1:51, Y=1:51)
dfrm$fhat <- c(est$fhat)
#Sample randomly from X.Y with length=51*51 using the fhat values for
prob.
# The X.Y "index" never actually gets computed
# but is implicit in the order of the data.frame
sampfrm <- dfrm[sample(51*51, 300, prob=est$fhat) , ]
f2 <- with(sampfrm, MASS::kde2d(X, Y, n = 50, lims = c(0, 51, 0,
51)) )
persp(f2)
# Looks reasonable to my eye anyway.
--
David.
> The reason I don’t want to use sampling (with replacement, I can
> sample more data than I have without replacement), as this will
> generate lots of duplicate data points, if I want to generated
> bigger dataset yet my raw data do not have a big sample size. The
> scatter plot of the sampled data doesn’t look good this way.
> Heyi
>
>
> --- On Fri, 3/23/12, David Winsemius <dwinsemius at comcast.net> wrote:
>
>> From: David Winsemius <dwinsemius at comcast.net>
>> Subject: Re: [R] Nonparametric bivariate distribution estimation
>> and sampling
>> To: "heyi xiao" <xiaoheyiyh at yahoo.com>
>> Cc: "Sarah Goslee" <sarah.goslee at gmail.com>, r-help at r-project.org
>> Date: Friday, March 23, 2012, 2:20 PM
>>
>> On Mar 23, 2012, at 1:53 PM, heyi xiao wrote:
>>
>>> Sarah,
>>> Thanks for the response. I actually have several years
>> of working experience with R and statistics, although may
>> not be as good as you. that’s why I am here ;) I dug
>> deeper into R documentations and previous R-help posts, and
>> couldn’t found anything particular help. Again, I want to
>> do two things: (1) estimate the probability density of this
>> bivariate distribution using some nonparametric method
>> (kernel, spline etc);
>>
>> ?MASS::kde2d
>> ?KernSmooth::bkde2D
>> ?ade4::s.kde2d
>> help(package=locfit)
>>
>>> (2) sample a big dataset from this bivariate
>> distribution for a simulation study.
>>
>> What is wrong with `sample`?
>>
>> # to get sample of size n without replacement
>> set.seed(42)
>> dfrm[ sample(1:NROW(dfrm), n) , ]
>>
>> --David.
>>> If my questions are not clear enough show my how I can
>> improve, or which part is not clear enough. If you have any
>> particular suggestions/comments, you are more than welcome.
>> Thanks!
>>> Heyi
>>>
>>>
>>> --- On Fri, 3/23/12, Sarah Goslee <sarah.goslee at gmail.com>
>> wrote:
>>>
>>>> From: Sarah Goslee <sarah.goslee at gmail.com>
>>>> Subject: Re: [R] Nonparametric bivariate
>> distribution estimation and sampling
>>>> To: "heyi xiao" <xiaoheyiyh at yahoo.com>
>>>> Cc: r-help at r-project.org
>>>> Date: Friday, March 23, 2012, 12:26 PM
>>>> R can do all of that and more.
>>>>
>>>> But you'll need to put some work in reading about
>> how to use
>>>> R, about
>>>> the statistical methods involved, and about how to
>> use them
>>>> to best
>>>> effect. You might want, for instance, generalized
>> additive
>>>> models. Or
>>>> not. If your question isn't more fully-formed than
>> this,
>>>> your best bet
>>>> is almost certainly to talk to a local
>> statistician, spend
>>>> some time
>>>> working with R, and then come back to the list
>> with
>>>> specific
>>>> questions.
>>>>
>>>> Sarah
>>>>
>>>> On Fri, Mar 23, 2012 at 12:17 PM, heyi xiao <xiaoheyiyh at yahoo.com>
>>>> wrote:
>>>>> Dear all,
>>>>> I have a bivariate dataset from a preliminary
>> study. I
>>>> want to do two things: (1) estimate the probability
>> density
>>>> of this bivariate distribution using some
>> nonparametric
>>>> method (kernel, spline etc); (2) sample a big
>> dataset from
>>>> this bivariate distribution for a simulation
>> study.
>>>>> Is there any good method or package I can use
>> in R for
>>>> my work? I don’t want parametric models like
>> bivariate
>>>> normal distribution etc, as I would like to
>> accurate model
>>>> my data. I don’t want to use the bootstrapping
>> approach,
>>>> i.e. sampling with replacement, as this will
>> generate lots
>>>> of duplicate data points. Any thoughts or input
>> will be
>>>> highly appreciated!
>>>>> Heyi
>>>>>
>>>>>
>>>>
>>>> --Sarah Goslee
>>>> http://www.functionaldiversity.org
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org
>> mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained,
>> reproducible code.
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list