[BioC] set.seed() in gcrma

lgautier at altern.org lgautier at altern.org
Mon Feb 18 21:16:03 CET 2008


> Anderson, S. Keith scripsit:
>> Pierre,
>> I have not tried that.
>> And yes, I would expect to obtain slightly different results with just
shuffling the order of the CEL files.
>> S. Keith Anderson
>> -----Original Message-----
>> From: Pierre Neuvial [mailto:pierre.neuvial at curie.fr]
>> Sent: Friday, February 15, 2008 3:22 AM
>> To: Anderson, S. Keith
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] set.seed() in gcrma
>> Hi,
>> Does all this imply that one may (and will, in fact) obtain slightly
different results when reanalysing exactly the same CEL files if their
order is simply shuffled (e.g. if one of the CEL files is renamed) ??
Pierre.
>
>
> Hi,
>
> Yes, this has been my experience. Also, if you reorder the order of the
probes within the probesets in the CDF environment. And there was a
discussion some time ago on differences in gcrma results between linux and
windows:
> https://stat.ethz.ch/pipermail/bioconductor/2007-February/015924.html
AfaIk the set.seed in the gcrma function was introduced in response to
that, followed later by improvements in the sampling schemes as
> mentioned by Jean in this thread.
>
> Given the wide usage and usefulness of the method it would be good to
have a better understanding and documentation of its precision under what
people deem "small" variations of the input data.
>
>   Best wishes
>     Wolfgang
>

I think that the order in which CEL files are submitted, or
the order in which probes in probesets are defined should not qualify as a
particular of source of variations (in the later case, that order is often
the one of probes along the matching reference sequence). If there is a
random component, it could probably be explicit and users educated about
it and understand that the resulting expression values, as well as the
outcome of an analysis is likely to change.

I do agree that a certain amount of understanding and documentation of the
uncertainty of the results is needed. However, analysis
performed with gcRMA might just require to be done "a few times" (just
like k-means clustering on random starting seeds is probably often
performed several times).

With the characterization of the variability of results obtained with
gcRMA, one could also look into having procedures for common microarray
data analysis tasks that would account for the intrinsic variability of
gcRMA. There is actually a chance that this improves results obtained with
gcRMA (by cutting down the number of false-positives).


Laurent




>
>
>> Anderson, S. Keith a écrit :
>>> Hi All,
>>> The sample step occurs in the gcrma.engine/gcrma.engine2 sub function.
A random sample of the entire matrix of PM values is selected to
calculate an linear regression coefficient that is used in the
full_model and affinities model.
>>> I believe the set.seed allows one to obtain the same results for any
reanalysis of exactly the same CEL files.
>>> If one adds or subtracts even one CEL file the sampling of that entire
matrix change and a different result is obtained.
>>> S. Keith Anderson, MS
>>> Statistician, Cancer Center Statistics
>>> Mayo Clinic
>>> email:  anderson.s at mayo.edu
>>> -----Original Message-----
>>> From: bioconductor-bounces at stat.math.ethz.ch
>>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik
Bengtsson
>>> Sent: Wednesday, February 13, 2008 2:41 PM
>>> To: Kasper Daniel Hansen
>>> Cc: William T Barry; bioconductor at stat.math.ethz.ch
>>> Subject: Re: [BioC] set.seed() in gcrma
>>> On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen
>>> <khansen at stat.berkeley.edu> wrote:
>>>> On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote:
>>>>> On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen
>>>>> <khansen at stat.berkeley.edu> wrote:
>>>>>> In my opinion gcrma should _not_ play with the seed. That is
something for the user to decide.
>>>>> It's a trade-off between providing reproducible examples and adding
>>> to
>>>>> the overall confusion.  I think it is important to be able to
>>> generate
>>>>> identical results for identical calls, and since it is not clear to
most users that the gcRMA parameters are estimated based on a sample one I
can see how the current solution avoids the problem of having ask people
to set the seed explicitly before calling gcrma() [also I have a hard time
to keep track on what is going on in all methods].
>>>> When a function sets the seed within itself and then does a sample()
immediately afterwards what it really does is using a fixed index
vector for arrays of a certain size. To call this "a random sample" is
simply wrong. It would be easier and more transparent to just
choose a regular grid like
>>>>    seg(from = 1, to = maxIndex, length.out = sampleSize)
>>>> Now some people would argue that it seems wrong to use a regular grid
like this and that the probes should be picked at random. To which I
repeat what i just wrote: when you do set.seed you are not selecting
anything at random - but you are obscuring this fact. If it really is
important to not always use the same probes, the seed should not be set
and you should live with the fact that you do not get
>>>> reproducible effects.
>>>> I agree that for gcRMA it would make sense to have the results be
highly reproducible (otherwise I foresee many posts about this), but that
is achieved even better by using a grid as above. Or perhaps do something
like having an argument called (say) randomSample = FALSE as default and
then do something like
>>>> if(randomSample)
>>>>    gridIdx = sample(maxIndex, size = sampleSize)
>>>> else
>>>>    gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize)
>>>> (appropriately modified if you want to exclude certain indices
corresponding to QC probes). Setting the seed inside a function
should in my opinion always be discouraged.
>>> I don't disagree with your arguments.  However, there is a risk of
selecting probes/units with common properties if seq() is used.  This
comes down to how Affymetrix has outlined the probes on the particular
chip types being analyzed, and since we don't know what chip types will
show up in the future, I would like to argue that a per-chip-type
pre-determined sample() is safer than a uniform seq().
>>> I've checked the latest BioC devel version of 'gcrma' and I noticed
that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is
called, but I cannot see any code that actually sample:s anything. Are
those set.seed() left overs or am I missing something?
>>> /Henrik
>>>> Kasper
>>>>> One solution would be to move the value of seed to be an argument
>>> with
>>>>> a default value and if NA/NULL the seed is *not* set.  That would
provide a backward-compatible upgrade to gcrma().
>>>> This is the solution taken by most people who want a similar
>>>> functionality. I still
>>>>> Comments?
>>>>> Henrik
>>>>>> Kasper
>>>>>> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote:
>>>>>>> Bill
>>>>>>> Thank you for that suggestion. We will edit gcrma to restore the
seed in
>>>>>>> the next release.
>>>>>>> Zhijin
>>>>>>> Henrik Bengtsson wrote:
>>>>>>>> Hi,
>>>>>>>> I think you have spotted something very important, and I agree
>>> that
>>>>>>>> the random seed should be reset by any function that set it in
order
>>>>>>>> to get reproducible samples.  We've been working with a port/
wrapper
>>>>>>>> for gcrma() but this never struck us - thanks for bringing it up.
FYI, I've just sent a question to R-devel, as it is more
>>>>>>>> appropriate
>>>>>>>> there, on how to best push the random generator back to the state
it
>>>>>>>> was before calling set.seed().
>>>>>>>> Best,
>>>>>>>> Henrik
>>>>>>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at duke.edu>
wrote:
>>>>>>>>> Hello,
>>>>>>>>> While recently using the gcrma package, I noticed some strange
behavior
>>>>>>>>> that I want to report.  I was applying the pre-processing
algorithm (using
>>>>>>>>> default settings) inside a resampling scheme and noticed that
subsequent
>>>>>>>>> calls of sample() were returning a constant vector of values.
Example
>>>>>>>>> output shown below:
>>>>>>>>>> temp <- gcrma(celdat[,1:2])
>>>>>>>>> Adjusting for optical effect..Done.
>>>>>>>>> Computing affinities.Done.
>>>>>>>>> Adjusting for non-specific binding..Done.
>>>>>>>>> Normalizing
>>>>>>>>> Calculating Expression
>>>>>>>>>> sample(1:10000,3)
>>>>>>>>> [1] 7030 5473 8109
>>>>>>>>>> temp <- gcrma(celdat[,1:2])
>>>>>>>>> Adjusting for optical effect..Done.
>>>>>>>>> Computing affinities.Done.
>>>>>>>>> Adjusting for non-specific binding..Done.
>>>>>>>>> Normalizing
>>>>>>>>> Calculating Expression
>>>>>>>>>> sample(1:10000,3)
>>>>>>>>> [1] 7030 5473 8109
>>>>>>>>>> sessionInfo()
>>>>>>>>> R version 2.6.1 (2007-11-26)
>>>>>>>>> x86_64-redhat-linux-gnu
>>>>>>>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e
>>>>>>>>> n_
>>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER>>>>>>>>
en
>>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_
>>>>>>>>> US
>>>>>>>>> .UTF-8;LC_IDENTIFICATION=C
>>>>>>>>> attached base packages:
>>>>>>>>> [1] splines   tools     stats     graphics  grDevices utils
datasets
>>>>>>>>> [8] methods   base
>>>>>>>>> other attached packages:
>>>>>>>>> [1] hu6800probe_2.0.0    hu6800cdf_2.0.0      gcrma_2.10.0 [4]
matchprobes_1.10.0   affy_1.16.0
>>> preprocessCore_1.0.0
>>>>>>>>> [7] affyio_1.6.1         Biobase_1.16.2
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>> [1] rcompgen_0.1-17
>>>>>>>>> In looking at the gcrma functions, it appears that gcrma.engine2
() is
>>>>>>>>> setting the seed without saving and restoring .Random.seed .
This was
>>>>>>>>> easily corrected in the external loop of the resampling scheme,
but I was
>>>>>>>>> unsure whether this would be considered a 'bug' in gcrma. Kind
regards,
>>>>>>>>> Bill
>>>>>>>>> -------------------------------
>>>>>>>>> William T. Barry, Ph.D.
>>>>>>>>> Dept. of Biostatistics & Bioinformatics, DUMC
>>>>>>>>> Duke Institute for Genome Science & Policy
>>>>>>>>> Hock Plaza, Room 8030
>>>>>>>>> DUMC Box 2717
>>>>>>>>> Durham, NC 27710
>>>>>>>>> Phone: 919-681-5047
>>>>>>>>> Fax: 919-668-9335
>>>>>>>>>         [[alternative HTML version deleted]]
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives: http://news.gmane.org/
>>>>>>>>> gmane.science.biology.informatics.conductor
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives: http://news.gmane.org/
>>>>>>>> gmane.science.biology.informatics.conductor
>>>>>>> --
>>>>>>> -------------------------------------------
>>>>>>> Zhijin (Jean) Wu
>>>>>>> Assistant Professor of Biostatistics
>>>>>>> Brown University, Box G-S121
>>>>>>> Providence, RI  02912
>>>>>>> Tel: 401 863 1230
>>>>>>> Fax: 401 863 9182
>>>>>>> http://stat.brown.edu/~zwu
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/
>>>>>>> gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor .
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Best wishes
>   Wolfgang
>
> ------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list