[R] generate distribution based on summary data and add random noise

Thu Feb 3 22:16:47 CET 2022

I suggest taking Bert's suggestion and examining it more closely. Take a different dataset where you have a measurement for each particle, then apply the binning function. Use Bert's approach to recover the "raw" data, and then plot the real raw data against the approximated raw data. If you are happy with the result, then proceed. If you are not happy, then consider that there is always the option of not doing it.
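
A rough sketch of that check, in case it helps. Everything here is an assumption for illustration: the lognormal `real` vector only stands in for genuine per-particle measurements, the bin edges mimic the sizes in the thread, and `approx_raw` and `max_gap` are names made up for this example.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Stand-in for a dataset with one measurement per particle
# (assumption: lognormal sizes; substitute your own measurements here).
real <- rlnorm(5000, meanlog = log(150), sdlog = 0.4)

# Binning step: cumulative fraction of particles below each bin edge.
# The edges mimic the size bins in the thread, extended to cover all data.
edges <- sort(unique(c(0, seq(10, 200, by = 10), 250, 300, 400, 500,
                       ceiling(max(real)))))
cum <- ecdf(real)(edges)
p   <- diff(cum)                       # fraction of particles in each bin

# Recovery step: choose a bin for each particle with probability p,
# then draw uniformly between that bin's lower and upper edge.
bin        <- sample(seq_along(p), length(real), replace = TRUE, prob = p)
approx_raw <- runif(length(bin), min = edges[bin], max = edges[bin + 1])

# Compare: largest gap between the two empirical CDFs at the bin edges.
max_gap <- max(abs(ecdf(real)(edges) - ecdf(approx_raw)(edges)))
print(max_gap)

if (interactive()) {
  plot(ecdf(real), main = "Real vs. approximated raw data")
  plot(ecdf(approx_raw), add = TRUE, col = "red")
}
```

The two curves agree at the bin edges by construction, so the place to look is inside the wide bins, where the uniform assumption may or may not resemble the real measurements.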

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Bert Gunter
Sent: Thursday, February 3, 2022 12:35 PM
To: PIKAL Petr <petr.pikal using precheza.cz>
Cc: R-help <r-help using r-project.org>
Subject: Re: [R] generate distribution based on summary data and add random noise

Nope. I think I provided what you asked for: random data in each bin, with the amount of data proportional to the bin percentage and the distribution of that data uniform (not normal) within the bin. So maybe someone else can give you what you want if this ain't it.
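
For concreteness, a minimal sketch of that idea might look like the following. It assumes the `size` values are upper bin edges and `percent` is the cumulative percentage, as in the original post; the total `N`, the seed, and the variable names are arbitrary choices for this example.

```r
# Summary data from the original post: size = upper bin edge,
# percent = cumulative percent of particles below that size.
size    <- c(seq(10, 200, by = 10), 250, 300, 400, 500)
percent <- c(0, 0, 0, 1, 1, 2, 4, 8, 13, 18, 24, 31, 38, 44, 50,
             57, 65, 72, 76, 83, 95, 98, 100, 100)

lower <- head(size, -1)        # lower edge of each bin
upper <- tail(size, -1)        # upper edge of each bin
p     <- diff(percent) / 100   # proportion of particles in each bin

set.seed(1)                    # arbitrary seed, only for reproducibility
N     <- 1000                  # assumed total sample size
n_bin <- round(p * N)          # number of values to generate per bin

# Repeat each bin index n_bin times, then draw uniformly within that bin.
bin <- rep(seq_along(p), times = n_bin)
sim <- runif(length(bin), min = lower[bin], max = upper[bin])

if (interactive()) {
  plot(ecdf(sim))
  points(size, percent / 100, col = "red", pch = 19)  # reported summary points
}
```

Because the per-bin counts are fixed at `p * N`, the empirical CDF of `sim` passes through the reported cumulative percentages at the bin edges; only the shape within each bin is an assumption.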

Cheers,
Bert

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Thu, Feb 3, 2022 at 8:44 AM PIKAL Petr <petr.pikal using precheza.cz> wrote:

> Hallo Bert
>
> probably not, sorry. Did you try my examples?
>
> To make it maybe simpler:
> 1. sample a vector with given proportions and generate new data
> 2. add random noise to each generated value, with sd given by the value
>    of the vector.
>
> Let's say
>
> x <- c(10, 100)
> y <- c(.6, .4)
> set.seed(200)
> z <- sample(x, 10, rep=TRUE, prob=y)
> ind <- order(z)
> bins <- rle(z[ind])
> bin1 <- rnorm(bins$lengths[1], mean = 0, sd=bins$values[1]/5)
> bin2 <- rnorm(bins$lengths[2], mean = 0, sd=bins$values[2]/5)
> z[ind] + c(bin1, bin2)
>
> Sorry that I did not explain myself more clearly, I hoped that the example
> showed what I have in mind.
>
> Basically it is a cumulative particle size distribution, but size is
> expressed as size bins. Normally I have an exact size measurement for
> each particle.
>
> S pozdravem | Best Regards
> RNDr. Petr PIKAL
> Vedoucí Výzkumu a vývoje | Research Manager PRECHEZA a.s.
> nábř. Dr. Edvarda Beneše 1170/24 | 750 02 Přerov | Czech Republic
> Tel: +420 581 252 256 | GSM: +420 724 008 364 
> mailto:petr.pikal using precheza.cz | https://www.precheza.cz/
>
> From: Bert Gunter <bgunter.4567 using gmail.com>
> Sent: Thursday, February 3, 2022 5:10 PM
> To: PIKAL Petr <petr.pikal using precheza.cz>
> Cc: R-help <r-help using r-project.org>
> Subject: Re: [R] generate distribution based on summary data and add 
> random noise
>
> If I understand correctly:
> To generate a sample of total size N, generate a uniform sample of 
> size p*N for a bin with proportion p?
> ?runif
>
>
> Bert Gunter
>
> On Thu, Feb 3, 2022 at 7:52 AM PIKAL Petr <petr.pikal using precheza.cz> wrote:
>
> Hallo all
>
> I have summary data with size bins and percentage below that size.
>
> dat <- structure(list(size = c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 
> 90L, 100L, 110L, 120L, 130L, 140L, 150L, 160L, 170L, 180L, 190L, 200L, 
> 250L, 300L, 400L, 500L), percent = c(0L, 0L, 0L, 1L, 1L, 2L, 4L, 8L, 
> 13L, 18L, 24L, 31L, 38L, 44L, 50L, 57L, 65L, 72L, 76L, 83L, 95L, 98L, 
> 100L, 100L)), class = "data.frame", row.names = c(NA,
> -24L))
>
> # I want to generate the original distribution (I know it is better not to
> # do it, but I have no other choice), so I calculated mids of those bins.
>
> xd <- dat$size - c(5, diff(dat$size)/2)
> xd <- xd[-1]
>
> # I can sample the size bins with probability given by percent.
> Result <- sample(xd, 1000, replace = TRUE, prob = diff(dat$percent)/100)
> plot(ecdf(Result))
>
> # And I can add some noise to it, which is satisfactory for lower size
> # bins but not enough for higher size bins.
>
> Result <- sample(xd, 1000, replace = TRUE, prob = diff(dat$percent)/100) +
>   rnorm(1000, mean = 0, sd = 5)
> plot(ecdf(Result))
>
> I can increase sd to satisfy the bigger bin sizes, but in that case the
> noise is too big for the lower bin sizes.
>
> I would like to add smaller random noise to lower size bins and bigger
> random noise to higher size bins, which seems to be an easy task, but I am
> stuck on how to do it. The noise should be somehow proportional to the
> size value. The only way forward I see is to sort the generated result
> and to use something like
>
> + rnorm(1000, mean=xd, sd=xd/10)
> But it is not correct.
>
> I'd appreciate any hint on how to add random noise to values in an ordered
> manner.
>
> Best regards.
> Petr
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
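
On the question in the quoted original post about noise that grows with bin size: one possible sketch, not necessarily what anyone in the thread intended, relies on `rnorm()` being vectorized over `sd`, so each sampled value can get its own standard deviation without any sorting. The factor `1/10` is taken from the attempt in the post; the names `mids`, `draw` and `noise` are made up for this example.

```r
# Summary data from the quoted post: size = upper bin edge,
# percent = cumulative percent of particles below that size.
size    <- c(seq(10, 200, by = 10), 250, 300, 400, 500)
percent <- c(0, 0, 0, 1, 1, 2, 4, 8, 13, 18, 24, 31, 38, 44, 50,
             57, 65, 72, 76, 83, 95, 98, 100, 100)

mids <- (head(size, -1) + tail(size, -1)) / 2   # bin midpoints
p    <- diff(percent) / 100                     # proportion in each bin

set.seed(200)                                   # arbitrary seed
N    <- 1000
draw <- sample(mids, N, replace = TRUE, prob = p)

# One noise value per sampled midpoint; sd is matched element by element,
# so larger midpoints receive proportionally larger noise.
noise  <- rnorm(N, mean = 0, sd = draw / 10)
Result <- draw + noise

if (interactive()) plot(ecdf(Result))
```

Whether normal noise with sd equal to a tenth of the midpoint is a sensible model for the within-bin spread is a separate question; the uniform-within-bin approach at least guarantees that values stay inside their bins.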
