[R] [FORGED] Generating random data with non-linear correlation between two variables

peter dalgaard pdalgd at gmail.com
Sat Apr 9 14:42:13 CEST 2016


> On 09 Apr 2016, at 13:09 , Muhammad Bilal <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> 
> The goal is to test a developed model against two sets of hypothetical data, where the relationship is linear in one data set and non-linear (e.g., polynomial) in the other. However, the distributions of v1 and v2 should be normal, or at most slightly positively or slightly negatively skewed.
> 
> In Oracle, random data is generated with packaged function dbms_random.value(lowerbound, upperbound), which can be called from SQL query with where clause (level <= no_of_rows) for the number of rows you want.
> 
> After the rows are generated, we can write custom functions to spread the data points along the y-axis, so that they wouldn't overlap. 
> 
> I hope this may clear the use case further.

Not really...

You can do lots of stuff with random number generation in R, but it is not clear to what extent we should take your requirements seriously. E.g., you say you want the range of v1 to be 500-1500 and the mean to be 1100. It is easy enough to generate uniform random numbers between 500 and 1500: 

> v1 <- runif(1000,500,1500)

but the theoretical mean of v1 is 1000, not 1100:

> mean(v1)
[1] 985.7375
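As a small check of that claim, the sketch below compares the theoretical mean of a uniform distribution on [500, 1500] with a sample mean and its standard error. The seed and the variable names lo, hi and se are illustrative assumptions, not part of the original post.

```r
# Theoretical mean of Uniform(lo, hi) is the midpoint of the range.
lo <- 500
hi <- 1500
theoretical_mean <- (lo + hi) / 2      # 1000, regardless of the sample drawn

set.seed(1)                            # arbitrary seed, for reproducibility only
v1 <- runif(1000, lo, hi)

# The sd of Uniform(lo, hi) is (hi - lo) / sqrt(12), so the standard error
# of the sample mean is that value divided by sqrt(n).
se <- (hi - lo) / sqrt(12) / sqrt(length(v1))

c(theoretical = theoretical_mean, sample = mean(v1), se = se)
```

The sample mean will differ from 1000 by an amount comparable to the standard error, but no choice of seed will move it systematically towards 1100.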

To increase the mean, one could play around with scaled beta distributions, e.g.

> v1 <- 500 + 1000 * rbeta(1000, 1.2, 0.8)
> mean(v1)
[1] 1093.685

but it is not clear how you ever passed the same requirement to Oracle. 
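The shape parameters 1.2 and 0.8 work because the mean of a Beta(a, b) variable is a / (a + b), here 0.6, which rescales to 500 + 1000 * 0.6 = 1100. A minimal sketch of picking shapes for any target mean follows; the helper beta_shapes() and its concentration argument k (the sum a + b) are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: beta shape parameters giving a chosen mean on [lo, hi].
# k = shape1 + shape2 is an assumed "concentration" knob, not from the thread;
# larger k gives a tighter, more bell-shaped distribution around the mean.
beta_shapes <- function(target_mean, lo, hi, k = 2) {
  p <- (target_mean - lo) / (hi - lo)  # target mean rescaled to (0, 1)
  stopifnot(p > 0, p < 1, k > 0)
  c(shape1 = p * k, shape2 = (1 - p) * k)
}

s   <- beta_shapes(1100, 500, 1500)          # k = 2 gives shapes 1.2 and 0.8
s10 <- beta_shapes(1100, 500, 1500, k = 10)  # shapes 6 and 4

set.seed(1)                                  # arbitrary seed
v1 <- 500 + 1000 * rbeta(1000, s10[["shape1"]], s10[["shape2"]])
mean(v1)    # should be close to 1100
range(v1)   # stays inside [500, 1500]
```

With shapes 1.2 and 0.8 the second shape is below 1, so the density is unbounded at the upper limit. If something closer to "normal or slightly skewed" is wanted, a larger k such as 10 keeps the same mean while giving a unimodal, mildly left-skewed shape.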

Next, you wanted v2 with the following requirements

v2 between 300 and 950
mean(v2) == 400
v2 nonlinearly related to v1

If we postulate a relation where the conditional expectation is, say, 
E(v2 | v1) = a0 + a1 * v1 - a2 * v1^2, and v1 is as above, then the constants can be twiddled to satisfy E(v2) = 400. Then to generate random output with that mean and range, one could again use a scaled beta distribution. 
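One way this could look in practice is sketched below, using the range 300 to 950 from the original request. The constants a1 and a2, the concentration k, the seed, and the Beta(6, 4) choice for v1 are all illustrative assumptions; only the quadratic form of E(v2 | v1) and the scaled beta idea come from the text above.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
n  <- 1000
v1 <- 500 + 1000 * rbeta(n, 6, 4)         # assumed v1: mean 1100 on [500, 1500]

# Assumed curvature constants (illustrative only).
# a1 > 2 * a2 * 1500 keeps the curve increasing over the whole range of v1.
a1 <- 0.2
a2 <- 5e-5
# Choose a0 so that the conditional means average to exactly 400 in this sample.
a0 <- 400 - a1 * mean(v1) + a2 * mean(v1^2)

m <- a0 + a1 * v1 - a2 * v1^2             # E(v2 | v1) for each row
stopifnot(all(m > 300), all(m < 950))     # the mean curve must lie inside the range

# Scaled beta scatter around the curve; k is an assumed concentration parameter.
k  <- 20
p  <- (m - 300) / 650                     # conditional mean rescaled to (0, 1)
v2 <- 300 + 650 * rbeta(n, p * k, (1 - p) * k)

mean(v2)    # should be close to 400
range(v2)   # stays inside [300, 950]
```

A quick plot(v1, v2) shows the curved, increasing trend. Because the target mean of 400 sits close to the lower bound of 300, v2 comes out clearly right-skewed under these assumptions, which illustrates how the stated requirements pull against each other.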

It is, however, not at all clear that this is in fact the kind of solution that you want....

-pd

> 
> Many Thanks and 
> 
> Kind Regards
> --
> Muhammad Bilal
> 
> 
> Research Assistant and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY 
> 
> 
> muhammad2.bilal at live.uwe.ac.uk
> 
> 
> ________________________________________
> From: David R Forrest <drf at vims.edu>
> Sent: 09 April 2016 11:48
> To: Muhammad Bilal
> Cc: Rolf Turner; r-help at r-project.org
> Subject: Re: [R] [FORGED] Generating random data with non-linear correlation between two variables
> 
> Please specify your goal in the oracle/psql analytical functions you know or specify what you mean by nonlinear correlation
> 
> Sent from my iPhone
> 
>> On Apr 9, 2016, at 6:09 AM, Muhammad Bilal <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
>> 
>> No, it's not. I am doing all these experiments for my own learning. I am an Oracle SQL & PL/SQL programmer, and I can do these things with Oracle analytical functions.
>> 
>> However at present I am keen to learn R, with no other interest right now.
>> 
>> Thanks
>> --
>> Muhammad Bilal
>> Research Assistant and PhD Student,
>> Bristol Enterprise, Research and Innovation Centre (BERIC),
>> University of the West of England (UWE),
>> Frenchay Campus,
>> Bristol,
>> BS16 1QY
>> 
>> muhammad2.bilal at live.uwe.ac.uk
>> 
>> 
>> ________________________________________
>> From: Rolf Turner <r.turner at auckland.ac.nz>
>> Sent: 09 April 2016 04:46
>> To: Muhammad Bilal
>> Cc: r-help at r-project.org
>> Subject: Re: [FORGED] [R] Generating random data with non-linear correlation between two variables
>> 
>>> On 09/04/16 06:57, Muhammad Bilal wrote:
>>> Hi All,
>>> 
>>> I am new to R and don't know how to achieve it.
>>> 
>>> I am interested in generating a hypothetical data frame consisting of, say, two variables named v1 and v2, based on the following constraints:
>>> 1. The range of v1 is 500-1500.
>>> 2. The mean of v1 is say 1100
>>> 3. The range of v2 is 300-950.
>>> 4. The mean of v2 is say 400
>>> 5. There exists a positive trend between these two variables, meaning that as v1 increases, v2 also increases.
>>> 6. But the trend should be slightly non-linear, i.e., a curved line.
>>> 
>>> Is it possible to generate this automatically through functions like rnorm?
>>> 
>>> Any help will be highly appreciated.
>> 
>> This sounds to me very much like a homework problem.  We don't do
>> people's homework for them on this list.
>> 
>> cheers,
>> 
>> Rolf Turner
>> 
>> --
>> Technical Editor ANZJS
>> Department of Statistics
>> University of Auckland
>> Phone: +64-9-373-7599 ext. 88276
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


