[R] How important is set.seed

Bert Gunter bgunter.4567 using gmail.com
Tue Mar 22 17:30:37 CET 2022


Well, of course! -- any procedure that incorporates "randomness" will
produce *different* random results from different random choices.
set.seed() ensures you get the same random choices and hence the same
random results.

> set.seed(567)
> sample(1:5,1)
[1] 5
> sample(1:5,1)
[1] 4
> sample(1:5,1)
[1] 5
> sample(1:5,1)
[1] 5
> sample(1:5,1)
[1] 2

## change seed
> set.seed(123)
> sample(1:5,1)
[1] 3
> sample(1:5,1)
[1] 3
> sample(1:5,1)
[1] 2
> sample(1:5,1)
[1] 2
> sample(1:5,1)
[1] 3
> sample(1:5,1)
[1] 5

## back to the original seed: all subsequent random values repeat as before
> set.seed(567)
> sample(1:5,1)
[1] 5
> sample(1:5,1)
[1] 4
> sample(1:5,1)
[1] 5
> sample(1:5,1)
[1] 5
> sample(1:5,1)
[1] 2
> sample(1:5,1)
[1] 5
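
The same comparison can be made in code rather than by eye. A minimal
sketch, assuming nothing beyond base R (the helper name draw_after_seed
is made up for this illustration), that replays the single draws above
and compares two runs with the same seed:

```r
# Set the seed, then make n separate single draws from 1:5,
# mirroring the repeated sample(1:5, 1) calls in the transcript.
draw_after_seed <- function(seed, n = 6) {
  set.seed(seed)
  vapply(seq_len(n), function(i) sample(1:5, 1), integer(1))
}

first  <- draw_after_seed(567)
second <- draw_after_seed(567)   # same seed again
other  <- draw_after_seed(123)   # a different seed

identical(first, second)   # TRUE: same seed, same sequence of draws
```

Comparing `first` with `other` shows whether the two seeds happen to
give different sequences on your R version, as they did in the session
above.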


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Tue, Mar 22, 2022 at 9:19 AM Neha gupta <neha.bologna90 using gmail.com> wrote:
>
> I read a paper two days ago (and that's why I then posted here about set.seed) which used interpretable machine learning.
>
> According to the authors, the ML models will produce different explanations (of the black-box models) if different seeds are used or if no seed is set.
>
>
>
> On Tue, Mar 22, 2022 at 5:12 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>>
>> OK, I'm somewhat puzzled by this discussion. Maybe I'm just clueless. But...
>>
>> 1. set.seed() is used to make any procedure that uses R's
>> pseudo-random number generator -- including, for example, sampling
>> from a distribution, random data splitting, etc. -- "reproducible".
>> That is, if the procedure is repeated *exactly,* by invoking
>> set.seed() with its original argument values (once!) *before* the
>> procedure begins, exactly the same results should be produced by the
>> procedure. Full stop. It does not matter how many times random number
>> generation occurs within the procedure thereafter -- R preserves the
>> state of the rng between invocations (but see the notes in ?set.seed
>> for subtle qualifications of this claim).
>>
>> 2. Hence, if no (pseudo-) random number generation is used, set.seed()
>> is irrelevant. Full stop.
>>
>> 3. Hence, if you don't care about reproducibility (you should! -- if
>> for no other reason than debugging), you don't need set.seed()
>>
>> 4. The "randomness" of any sequence of results from any particular
>> set.seed() arguments (including further calls to the rng) is a complex
>> issue. ?set.seed has some discussion of this, but one needs
>> considerable expertise to make informed choices here. As usual, we
>> untutored users should be guided by the expert recommendations of the
>> Help file.
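
Point 1 can be sketched as follows. This is only an illustration: the
function name procedure, its contents, and the seed 2022 are arbitrary
choices, not anything from the thread. The procedure calls the rng
several times, but the seed is set once, before it starts:

```r
# Hypothetical procedure that uses the rng more than once.
procedure <- function() {
  idx   <- sample(10, 5)   # e.g. a random data split
  noise <- rnorm(3)        # e.g. simulated noise
  list(idx = idx, noise = noise)
}

set.seed(2022)        # seed set once, before the procedure begins
run1 <- procedure()

set.seed(2022)        # same seed, procedure repeated exactly
run2 <- procedure()

run3 <- procedure()   # no reseeding: the rng state simply carries on

identical(run1, run2)   # TRUE
identical(run1, run3)   # FALSE
```

There is no need to call set.seed() again between the sample() and
rnorm() steps; doing so would only restart the stream partway through.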
>>
>> *** If anything I have said above is wrong, I would greatly appreciate
>> a public response here showing my error.***
>>
>> Bert Gunter
>>
>> On Tue, Mar 22, 2022 at 7:48 AM Neha gupta <neha.bologna90 using gmail.com> wrote:
>> >
>> > Hello Tim
>> >
>> > In some of the examples I see in the tutorials, they put the random seed
>> > just before the model training, e.g. the train function in the caret library.
>> > Should I follow this?
>> >
>> > Best regards
>> > On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>> >
>> > > Ah, so maybe what you need is to think of “set.seed()” as a treatment in
>> > > an experiment. You could use a random number generator to select an
>> > > appropriate number of seeds, then use those seeds repeatedly in the
>> > > different models to see how seed selection influences outcomes. I am not
>> > > quite sure how many seeds would constitute a good sample. For me that would
>> > > depend on what I find and how long a run takes.
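
A rough sketch of the "seed as a treatment" idea, under stated
assumptions: the toy analysis (a 70/30 split of the built-in mtcars
data and a linear model), the helper name rmse_for_seed, and the choice
of ten seeds are all made up for illustration.

```r
# One run of a toy analysis under a given seed: random 70/30 split of
# mtcars, linear model on the training rows, RMSE on the held-out rows.
rmse_for_seed <- function(seed) {
  set.seed(seed)
  n     <- nrow(mtcars)
  train <- sample(n, size = floor(0.7 * n))
  fit   <- lm(mpg ~ wt + hp, data = mtcars[train, ])
  pred  <- predict(fit, newdata = mtcars[-train, ])
  sqrt(mean((mtcars$mpg[-train] - pred)^2))
}

set.seed(1)                     # arbitrary seed used to pick the seeds
seeds <- sample.int(1e6, 10)    # ten seeds; the number is a guess
rmse  <- vapply(seeds, rmse_for_seed, numeric(1))

summary(rmse)   # how much the result moves from seed to seed
sd(rmse)
```

A large spread relative to the effect you care about suggests that the
result depends on the seed rather than on the model.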
>> > >
>> > >   In parallel processing you set the seed in the master process and then
>> > > use a random number generator to set seeds in each worker.
>> > >
>> > > Tim
>> > >
>> > >
>> > >
>> > > *From:* Neha gupta <neha.bologna90 using gmail.com>
>> > > *Sent:* Tuesday, March 22, 2022 6:33 AM
>> > > *To:* Ebert,Timothy Aaron <tebert using ufl.edu>
>> > > *Cc:* Jeff Newmiller <jdnewmil using dcn.davis.ca.us>; r-help using r-project.org
>> > > *Subject:* Re: How important is set.seed
>> > >
>> > > Thank you all.
>> > >
>> > >
>> > >
>> > > Actually I need set.seed because I have to evaluate the consistency of
>> > > feature selection generated by different models, so I think it's
>> > > recommended to use the seed for this.
>> > >
>> > >
>> > >
>> > > Warm regards
>> > >
>> > > On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>> > >
>> > > If you are using the program for data analysis then set.seed() is not
>> > > necessary unless you are developing a reproducible example. In a standard
>> > > analysis it is mostly counter-productive because one should then ask if
>> > > your presented results are an artifact of a specific seed that you selected
>> > > to get a particular result. However, when you need a reproducible example,
>> > > are debugging a program, or otherwise need the same result with every
>> > > run of the program, set.seed() is an essential tool.
>> > > Tim
>> > >
>> > > -----Original Message-----
>> > > From: R-help <r-help-bounces using r-project.org> On Behalf Of Jeff Newmiller
>> > > Sent: Monday, March 21, 2022 8:41 PM
>> > > To: r-help using r-project.org; Neha gupta <neha.bologna90 using gmail.com>; r-help
>> > > mailing list <r-help using r-project.org>
>> > > Subject: Re: [R] How important is set.seed
>> > >
>> > > First off, "ML models" do not all use random numbers (for prediction I
>> > > would guess very few of them do). Learn and pay attention to what the
>> > > functions you are using do.
>> > >
>> > > Second, if you use random numbers properly and understand the precision
>> > > that your specific use case offers, then you don't need to use set.seed.
>> > > However, in practice, using set.seed can allow you to temporarily avoid
>> > > chasing precision gremlins, or set up specific test cases for testing code,
>> > > not results. It is your responsibility to not let this become a crutch... a
>> > > randomized simulation that is actually sensitive to the seed is unlikely to
>> > > offer an accurate result.
>> > >
>> > > Where to put set.seed depends a lot on how you are performing your
>> > > simulations. In general each process should set it once uniquely at the
>> > > beginning, and if you use parallel processing then use the features of your
>> > > parallel processing framework to ensure that this happens. Beware of
>> > > setting all worker processes to use the same seed.
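
One such framework feature, as a sketch only: the base "parallel"
package provides clusterSetRNGStream(), which derives a separate
L'Ecuyer-CMRG stream for each worker from a single seed. The cluster
size of 2 and the seed 42 below are arbitrary assumptions.

```r
library(parallel)

cl <- makeCluster(2)   # two local workers (arbitrary)

# Give each worker its own random number stream derived from one seed.
clusterSetRNGStream(cl, iseed = 42)
draws1 <- parLapply(cl, 1:2, function(i) runif(3))

# Resetting the streams with the same seed replays the same draws.
clusterSetRNGStream(cl, iseed = 42)
draws2 <- parLapply(cl, 1:2, function(i) runif(3))

stopCluster(cl)

same_across_runs    <- identical(draws1, draws2)
same_across_workers <- identical(draws1[[1]], draws1[[2]])
```

With a fixed cluster size, same_across_runs should come out TRUE and
same_across_workers FALSE. Calling set.seed() with one shared value on
every worker would instead make the workers produce identical draws.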
>> > >
>> > > On March 21, 2022 5:03:30 PM PDT, Neha gupta <neha.bologna90 using gmail.com>
>> > > wrote:
>> > > >Hello everyone
>> > > >
>> > > >I want to know
>> > > >
>> > > >(1) In which cases do we need to use set.seed while building ML models?
>> > > >
>> > > >(2) Where exactly should we put the set.seed function, i.e. when we
>> > > >split data into train/test sets, or just before we train a model?
>> > > >
>> > > >Thank you
>> > > >
>> > > >       [[alternative HTML version deleted]]
>> > > >
>> > > >______________________________________________
>> > > >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > > >https://stat.ethz.ch/mailman/listinfo/r-help
>> > > >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > > >and provide commented, minimal, self-contained, reproducible code.
>> > >
>> > > --
>> > > Sent from my phone. Please excuse my brevity.
>> > >
>> > >
>> >


