[R] generate ordered categorical variable in R

Marc Schwartz marc_schwartz at me.com
Wed Sep 16 23:07:00 CEST 2015


> On Sep 16, 2015, at 3:40 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> 
> Nope. Take it back. I stand uncorrected.
> 
>> system.time(z <-sample(1:10,1e6, rep=TRUE))
>   user  system elapsed
>  0.045   0.001   0.047
> 
>> system.time(z <-sample.int(10,1e6,rep=TRUE))
>   user  system elapsed
>  0.012   0.000   0.013
> 
> 
> sample() has to do subscripting in the general case; sample.int doesn't.
> 
> But I would agree that the difference is likely almost always unnoticeable.


Well, in your defense, Bert, given the nuance of the example you provided, the difference actually gets worse as the initial sample space grows, when that space is defined as a vector rather than a scalar.

On my MacBook Pro, with 16 GB of RAM and a 2.5 GHz i7, running R version 3.2.2 (2015-08-14):

> system.time(x1 <- sample(1:1e10, 1e8, replace = TRUE))
Killed: 9

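That ran for a couple of minutes and eventually crashed R, presumably because 1:1e10 has to be built as a full vector of 1e10 doubles (about 80 GB) before sample() can subscript it.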

However, as below:

> system.time(x1 <- sample(1e10, 1e8, replace = TRUE))
   user  system elapsed 
  2.943   0.238   3.191 

> system.time(x1 <- sample.int(1e10, 1e8, replace = TRUE))
   user  system elapsed 
  3.135   0.198   3.336 


Here is another example that works, showing a larger time difference with the sample space as a vector:

> system.time(x1 <- sample(1:1e9, 1e8, replace = TRUE))
   user  system elapsed 
  7.069   1.317   8.399 

> system.time(x1 <- sample(1e9, 1e8, replace = TRUE))
   user  system elapsed 
  1.324   0.111   1.438 

> system.time(x1 <- sample.int(1e9, 1e8, replace = TRUE))
   user  system elapsed 
  1.328   0.116   1.450 
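
As a quick check that only the overhead differs, here is a minimal sketch. It assumes a much smaller, hypothetical sample space of 1e6 so that it runs instantly; the names n, size, x_vec and x_int are illustrative. Under the same seed, the vector and scalar forms should return the same draws, with the vector form only adding the cost of building 1:n and subscripting it:

```r
n <- 1e6      # hypothetical sample space size, far smaller than in the timings
size <- 1e4   # hypothetical number of draws

set.seed(1)
# Vector form: sample() ends up doing x[sample.int(length(x), ...)]
x_vec <- sample(1:n, size, replace = TRUE)

set.seed(1)
# Scalar form: indices only, no subscripting step
x_int <- sample.int(n, size, replace = TRUE)

# Compare the two sets of draws
identical(x_vec, x_int)
```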


If one is running Monte Carlo simulations and repeating the above a very large number of times, the difference can become meaningful.

Thus, there is an incentive to specify the sample space as a scalar and, if needed, to treat the resulting vector as indices (1:x) into the actual sample space desired.
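
A minimal sketch of that idea follows, under stated assumptions: the "actual" sample space here is a hypothetical one (the even numbers 2, 4, ..., 2e9), and the number of draws is kept small so the example runs quickly. The names n_space, n_draws, idx and x are illustrative only:

```r
set.seed(1)

n_space <- 1e9   # size of the sample space, passed as a scalar
n_draws <- 1e5   # kept small here so the sketch runs quickly

# Draw indices only; sample.int() is never handed the vector 1:n_space
idx <- sample.int(n_space, n_draws, replace = TRUE)

# Hypothetical actual sample space: the even numbers 2, 4, ..., 2e9,
# computed from the indices instead of subscripting a stored vector.
# Multiplying by the double 2 avoids integer overflow.
x <- 2 * idx

range(x)
```

When the actual sample space cannot be computed from the index like this, the indices can still be drawn first and used to subscript only the pieces that are needed.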

Interesting...

Regards,

Marc


> 
> 
> -- Bert
> Bert Gunter
> 
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>   -- Clifford Stoll
> 
> 
> On Wed, Sep 16, 2015 at 1:34 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> Yes. Thanks Marc. I stand corrected.
>> 
>> -- Bert
>> Bert Gunter
>> 
>> 
>> On Wed, Sep 16, 2015 at 1:28 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
>>> 
>>>> On Sep 16, 2015, at 1:06 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>> 
>>>> Yikes! The uniform distribution is a **continuous** distribution over
>>>> an interval. You seem to want to sample over a discrete distribution.
>>>> See ?sample for that, as in:
>>>> 
>>>> sample(1:4,100,rep=TRUE)
>>>> 
>>>> ## or for this special case and faster
>>>> 
>>>> sample.int(4,size=100,rep=TRUE)
>>> 
>>> 
>>> Bert,
>>> 
>>> I am not sure that it is really faster, since internally, sample() calls sample.int():
>>> 
>>>> sample
>>> function (x, size, replace = FALSE, prob = NULL)
>>> {
>>>    if (length(x) == 1L && is.numeric(x) && x >= 1) {
>>>        if (missing(size))
>>>            size <- x
>>>        sample.int(x, size, replace, prob)
>>>    }
>>>    else {
>>>        if (missing(size))
>>>            size <- length(x)
>>>        x[sample.int(length(x), size, replace, prob)]
>>>    }
>>> }
>>> 
>>> 
>>> set.seed(1)
>>> 
>>>> system.time(x1 <- sample(1e10, 1e8, replace = TRUE))
>>>   user  system elapsed
>>>  2.755   0.170   2.925
>>> 
>>> 
>>> set.seed(1)
>>>> system.time(x2 <- sample.int(1e10, 1e8, replace = TRUE))
>>>   user  system elapsed
>>>  2.767   0.183   2.951
>>> 
>>> 
>>>> all(x1 == x2)
>>> [1] TRUE
>>> 
>>> 
>>> Regards,
>>> 
>>> Marc
>>> 
>>> 
>>>> 
>>>> Cheers,
>>>> Bert
>>>> 
>>>> Bert Gunter
>>>> 
>>>> 
>>>> On Wed, Sep 16, 2015 at 10:11 AM, thanoon younis
>>>> <thanoon.younis80 at gmail.com> wrote:
>>>>> Dear R- users
>>>>> 
>>>>> I want to generate an ordered categorical variable vector with 200x1 dimension
>>>>> and from 1 to 4 categories, and I tried this code:
>>>>> 
>>>>> Q1=runif(200,1,4) the results are not just 1, 2, 3, 4, but the results with
>>>>> decimals like 1.244, 2.342,4,321 and so on ... My question is: how can I
>>>>> generate a vector and also a matrix with ordered categorical variables and
>>>>> without decimals, just 1,2,3,4,1,2,3,4, ....
>>>>> 
>>>>> Many thanks in advance
>>> 


