[R] Nonparametric bivariate distribution estimation and sampling

heyi xiao xiaoheyiyh at yahoo.com
Sun Mar 25 17:40:02 CEST 2012


David,
Great idea! It is simple yet very effective. I didn’t know that sample can be used with probability/weight. I feel that I learned a lot. Thanks a lot for the help!
Heyi


--- On Fri, 3/23/12, David Winsemius <dwinsemius at comcast.net> wrote:

> From: David Winsemius <dwinsemius at comcast.net>
> Subject: Re: [R] Nonparametric bivariate distribution estimation and sampling
> To: "heyi xiao" <xiaoheyiyh at yahoo.com>
> Cc: "Sarah Goslee" <sarah.goslee at gmail.com>, r-help at r-project.org
> Date: Friday, March 23, 2012, 5:29 PM
> 
> On Mar 23, 2012, at 3:55 PM, heyi xiao wrote:
> 
> > David,
> > Thanks a lot for the specific suggestions. That’s
> very helpful. My question 1 is fully answered now. I guess I
> am not clear enough for my question 2. I would like to
> generate a random sample using the estimated probability
> density (as a result of my question 1) as the reference
> distribution.
> 
> > Say, I get a matrix of the estimated density (at some
> grid points) using MASS::kde2d. How can I use that result as
> a reference distribution to sample data from? I know it is a
> trivial issue for parametric distributions like bivariate
> normal, but what about such a nonparametric bivariate
> reference distribution? Any particular procedures or
> functions I can use?
> 
> See if this works:
> 
> data(geyser, package="MASS")
> x <- cbind(geyser$duration, geyser$waiting)
> est <- bkde2D(x, bandwidth=c(0.7, 7))
> 
> # Heh, I realized after I did this that I started with
> KernSmooth::bkde2D
> # and checked the results with MASS::kde2d
> # only difference appears to be name of density matrix
> # Construct a dataframe with X.Y information and the data
> from the bivariate density.
> # The output of bkde2D with n=50 is:
> 
> #List of 3
>  #$ x1  : num [1:51] -0.2167 -0.0823 0.052 0.1863
> 0.3207 ...
> # $ x2  : num [1:51] 32.5 34.2 35.9 37.7 39.4 ...
> # $ fhat: num [1:51, 1:51] 3.05e-19 2.17e-19 3.25e-19
> 2.17e-19 0.00 ...
> 
> # The index X.Y could be X + 51*Y and there would be a 1:1
> mapping from (X,Y) to X.Y
> # and the fhat values would be properly arranged
> 
> dfrm <- expand.grid(X=1:51, Y=1:51)
> dfrm$fhat <- c(est$fhat)
> 
> #Sample randomly from X.Y with length=51*51 using the fhat
> values for prob.
> 
> # The X.Y "index" never actually gets computed
> # but is implicit in the order of the data.frame
>  sampfrm <- dfrm[sample(51*51, 300, prob=est$fhat) , ]
>  f2 <- with(sampfrm,  MASS::kde2d(X, Y, n = 50, lims
> = c(0, 51, 0, 51)) )
>  persp(f2)
> 
> # Looks reasonable to my eye anyway.
> 
> --David.
> 
> > The reason I don’t want to use sampling (with
> replacement, I can sample more data than I have without
> replacement), as this will generate lots of duplicate data
> points, if I want to generated bigger dataset yet my raw
> data do not have a big sample size. The scatter plot of the
> sampled data doesn’t look good this way.
> > Heyi
> > 
> > 
> > --- On Fri, 3/23/12, David Winsemius <dwinsemius at comcast.net>
> wrote:
> > 
> >> From: David Winsemius <dwinsemius at comcast.net>
> >> Subject: Re: [R] Nonparametric bivariate
> distribution estimation and sampling
> >> To: "heyi xiao" <xiaoheyiyh at yahoo.com>
> >> Cc: "Sarah Goslee" <sarah.goslee at gmail.com>,
> r-help at r-project.org
> >> Date: Friday, March 23, 2012, 2:20 PM
> >> 
> >> On Mar 23, 2012, at 1:53 PM, heyi xiao wrote:
> >> 
> >>> Sarah,
> >>> Thanks for the response. I actually have
> several years
> >> of working experience with R and statistics,
> although may
> >> not be as good as you. that’s why I am here ;) I
> dug
> >> deeper into R documentations and previous R-help
> posts, and
> >> couldn’t found anything particular help. Again, I
> want to
> >> do two things: (1) estimate the probability density
> of this
> >> bivariate distribution using some nonparametric
> method
> >> (kernel, spline etc);
> >> 
> >> ?MASS::kde2d
> >> ?KernSmooth::bkde2D
> >> ?ade4::s.kde2d
> >> help(package=locfit)
> >> 
> >>> (2) sample a big dataset from this bivariate
> >> distribution for a simulation study.
> >> 
> >> What is wrong with `sample`?
> >> 
> >> # to get sample of size n without replacement
> >> set.seed(42)
> >> dfrm[ sample(1:NROW(dfrm), n) , ]
> >> 
> >> --David.
> >>> If my questions are not clear enough show my
> how I can
> >> improve, or which part is not clear enough. If you
> have any
> >> particular suggestions/comments, you are more than
> welcome.
> >> Thanks!
> >>> Heyi
> >>> 
> >>> 
> >>> --- On Fri, 3/23/12, Sarah Goslee <sarah.goslee at gmail.com>
> >> wrote:
> >>> 
> >>>> From: Sarah Goslee <sarah.goslee at gmail.com>
> >>>> Subject: Re: [R] Nonparametric bivariate
> >> distribution estimation and sampling
> >>>> To: "heyi xiao" <xiaoheyiyh at yahoo.com>
> >>>> Cc: r-help at r-project.org
> >>>> Date: Friday, March 23, 2012, 12:26 PM
> >>>> R can do all of that and more.
> >>>> 
> >>>> But you'll need to put some work in reading
> about
> >> how to use
> >>>> R, about
> >>>> the statistical methods involved, and about
> how to
> >> use them
> >>>> to best
> >>>> effect. You might want, for instance,
> generalized
> >> additive
> >>>> models. Or
> >>>> not. If your question isn't more
> fully-formed than
> >> this,
> >>>> your best bet
> >>>> is almost certainly to talk to a local
> >> statistician, spend
> >>>> some time
> >>>> working with R, and then come back to the
> list
> >> with
> >>>> specific
> >>>> questions.
> >>>> 
> >>>> Sarah
> >>>> 
> >>>> On Fri, Mar 23, 2012 at 12:17 PM, heyi xiao
> <xiaoheyiyh at yahoo.com>
> >>>> wrote:
> >>>>> Dear all,
> >>>>> I have a bivariate dataset from a
> preliminary
> >> study. I
> >>>> want to do two things: (1) estimate the
> probability
> >> density
> >>>> of this bivariate distribution using some
> >> nonparametric
> >>>> method (kernel, spline etc); (2) sample a
> big
> >> dataset from
> >>>> this bivariate distribution for a
> simulation
> >> study.
> >>>>> Is there any good method or package I
> can use
> >> in R for
> >>>> my work? I don’t want parametric models
> like
> >> bivariate
> >>>> normal distribution etc, as I would like
> to
> >> accurate model
> >>>> my data. I don’t want to use the
> bootstrapping
> >> approach,
> >>>> i.e. sampling with replacement, as this
> will
> >> generate lots
> >>>> of duplicate data points. Any thoughts or
> input
> >> will be
> >>>> highly appreciated!
> >>>>> Heyi
> >>>>> 
> >>>>> 
> >>>> 
> >>>> --Sarah Goslee
> >>>> http://www.functionaldiversity.org
> >>>> 
> >>> 
> >>> ______________________________________________
> >>> R-help at r-project.org
> >> mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal,
> self-contained,
> >> reproducible code.
> >> 
> >> David Winsemius, MD
> >> West Hartford, CT
> >> 
> >> 
> 
> David Winsemius, MD
> West Hartford, CT
> 
>



More information about the R-help mailing list