[BioC] set.seed() in gcrma

Kasper Daniel Hansen khansen at stat.Berkeley.EDU
Wed Feb 13 20:54:13 CET 2008


On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote:

> On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen
> <khansen at stat.berkeley.edu> wrote:
>> In my opinion gcrma should _not_ play with the seed. That is
>> something for the user to decide.
>
> It's a trade-off between providing reproducible examples and adding to
> the overall confusion.  I think it is important to be able to generate
> identical results for identical calls, and since it is not clear to
> most users that the gcRMA parameters are estimated based on a sample
> one I can see how the current solution avoids the problem of having
> ask people to set the seed explicitly before calling gcrma() [also I
> have a hard time to keep track on what is going on in all methods].

When a function sets the seed within itself and then does a sample()  
immediately afterwards what it really does is using a fixed index  
vector for arrays of a certain size. To call this "a random sample"  
is simply wrong. It would be easier and more transparent to just  
choose a regular grid like
   seg(from = 1, to = maxIndex, length.out = sampleSize)
Now some people would argue that it seems wrong to use a regular grid  
like this and that the probes should be picked at random. To which I  
repeat what i just wrote: when you do set.seed you are not selecting  
anything at random - but you are obscuring this fact. If it really is  
important to not always use the same probes, the seed should not be  
set and you should live with the fact that you do not get  
reproducible effects.

I agree that for gcRMA it would make sense to have the results be  
highly reproducible (otherwise I foresee many posts about this), but  
that is achieved even better by using a grid as above. Or perhaps do  
something like having an argument called (say) randomSample = FALSE  
as default and then do something like
if(randomSample)
   gridIdx = sample(maxIndex, size = sampleSize)
else
   gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize)
(appropriately modified if you want to exclude certain indices  
corresponding to QC probes). Setting the seed inside a function  
should in my opinion always be discouraged.

Kasper



> One solution would be to move the value of seed to be an argument with
> a default value and if NA/NULL the seed is *not* set.  That would
> provide a backward-compatible upgrade to gcrma().

This is the solution taken by most people who want a similar  
functionality. I still


> Comments?
>
> Henrik
>
>>
>> Kasper
>>
>>
>> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote:
>>
>>> Bill
>>> Thank you for that suggestion. We will edit gcrma to restore the
>>> seed in
>>> the next release.
>>> Zhijin
>>> Henrik Bengtsson wrote:
>>>> Hi,
>>>>
>>>> I think you have spotted something very important, and I agree that
>>>> the random seed should be reset by any function that set it in  
>>>> order
>>>> to get reproducible samples.  We've been working with a port/ 
>>>> wrapper
>>>> for gcrma() but this never struck us - thanks for bringing it up.
>>>>
>>>> FYI, I've just sent a question to R-devel, as it is more  
>>>> appropriate
>>>> there, on how to best push the random generator back to the  
>>>> state it
>>>> was before calling set.seed().
>>>>
>>>> Best,
>>>>
>>>> Henrik
>>>>
>>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at duke.edu>  
>>>> wrote:
>>>>> Hello,
>>>>>
>>>>> While recently using the gcrma package, I noticed some strange
>>>>> behavior
>>>>> that I want to report.  I was applying the pre-processing
>>>>> algorithm (using
>>>>> default settings) inside a resampling scheme and noticed that
>>>>> subsequent
>>>>> calls of sample() were returning a constant vector of values.
>>>>> Example
>>>>> output shown below:
>>>>>
>>>>>> temp <- gcrma(celdat[,1:2])
>>>>> Adjusting for optical effect..Done.
>>>>> Computing affinities.Done.
>>>>> Adjusting for non-specific binding..Done.
>>>>> Normalizing
>>>>> Calculating Expression
>>>>>
>>>>>> sample(1:10000,3)
>>>>> [1] 7030 5473 8109
>>>>>
>>>>>> temp <- gcrma(celdat[,1:2])
>>>>> Adjusting for optical effect..Done.
>>>>> Computing affinities.Done.
>>>>> Adjusting for non-specific binding..Done.
>>>>> Normalizing
>>>>> Calculating Expression
>>>>>
>>>>>> sample(1:10000,3)
>>>>> [1] 7030 5473 8109
>>>>>
>>>>>> sessionInfo()
>>>>> R version 2.6.1 (2007-11-26)
>>>>> x86_64-redhat-linux-gnu
>>>>>
>>>>> locale:
>>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e 
>>>>> n_
>>>>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= 
>>>>> en
>>>>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ 
>>>>> US
>>>>> .UTF-8;LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] splines   tools     stats     graphics  grDevices utils
>>>>> datasets
>>>>> [8] methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] hu6800probe_2.0.0    hu6800cdf_2.0.0      gcrma_2.10.0
>>>>> [4] matchprobes_1.10.0   affy_1.16.0          preprocessCore_1.0.0
>>>>> [7] affyio_1.6.1         Biobase_1.16.2
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] rcompgen_0.1-17
>>>>>
>>>>> In looking at the gcrma functions, it appears that gcrma.engine2
>>>>> () is
>>>>> setting the seed without saving and restoring .Random.seed .
>>>>> This was
>>>>> easily corrected in the external loop of the resampling scheme,
>>>>> but I was
>>>>> unsure whether this would be considered a 'bug' in gcrma.
>>>>>
>>>>> Kind regards,
>>>>> Bill
>>>>>
>>>>>
>>>>> -------------------------------
>>>>> William T. Barry, Ph.D.
>>>>> Dept. of Biostatistics & Bioinformatics, DUMC
>>>>> Duke Institute for Genome Science & Policy
>>>>> Hock Plaza, Room 8030
>>>>> DUMC Box 2717
>>>>> Durham, NC 27710
>>>>>
>>>>> Phone: 919-681-5047
>>>>> Fax: 919-668-9335
>>>>>
>>>>>
>>>>>         [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/
>>>>> gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/
>>>> gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> -------------------------------------------
>>> Zhijin (Jean) Wu
>>> Assistant Professor of Biostatistics
>>> Brown University, Box G-S121
>>> Providence, RI  02912
>>>
>>> Tel: 401 863 1230
>>> Fax: 401 863 9182
>>> http://stat.brown.edu/~zwu
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/
>>> gmane.science.biology.informatics.conductor
>>
>>



More information about the Bioconductor mailing list