[BioC] Is GCRMA influenced by lexicographical order?

Mon Nov 8 20:30:22 CET 2010

On Mon, Nov 8, 2010 at 9:11 AM, Zhijin Wu <zwu at stat.brown.edu> wrote:
> Henrik is correct that the non-identical result is due to the sampling of
> the subsets in adjusting for specific binding (GSB.adj).
>
> The discrepancy I get from re-ordering the arrays on some human arrays I
> have is ranged between -.004 and 0.006. However I can change the code so
> that one does get absolutely identical result on the same data set.

I would like to suggest that for backward compatibility you add some
argument to the methods so that it is possible for users/developers to
use older variants too.  This lower the risk for surprises. For
example:

raw <- ReadAffy(filenames=getPathnames(cs));

es <- gcrma(raw, flavor="2010-11-08", verbose=TRUE);  # The new one
es <- gcrma(raw, flavor="2010-04-08", verbose=TRUE);  # The current one (say)

where the default

es <- gcrma(raw, verbose=TRUE);

is one of the "flavors" recommended (there may be better options than
using dates to specify the variant).  Argument 'flavor' need to be
added to more low-level functions, e.g. gcrma.engine() etc.

The above strategy allows you to add new variants of the estimators
and when you find them be stable you can make one of them the new
default, while still not breaking support for the old ones.   Also,
with an explicit argument 'flavor', it is more clear to the
user/developer that there are/have been different variants of the
algorithm.  This would avoid surprises such as when gcRMA was
significantly restructured a couple of years ago.  Eventually you can
deprecate some variants and drop them in the future.  This makes the
transition smoother for everyone.

Being able to reproduce also old results is an important feature.

Just some thoughts

/Henrik

>
> Jean
> On 11/8/2010 11:44 AM, Henrik Bengtsson wrote:
>>
>> On Mon, Nov 8, 2010 at 8:10 AM, Benilton Carvalho
>> <beniltoncarvalho at gmail.com>  wrote:
>>>
>>> just set the random seed to the same value prior to running
>>> bg.correct.gcrma() and you should be fine...
>>>
>>> ## setwd(CASEI)
>>> set.seed(1)
>>> CEL = ReadAffy()
>>> ebag.1 = exprs(bg.adjust.gcrma(CEL))
>>> ## setwd(CASEII)
>>> set.seed(1)
>>> CEL = ReadAffy()
>>> ebag.2 = exprs(bg.adjust.gcrma(CEL))
>>
>> I don't think that works, because of how subsets used for estimates
>> are drawn from all arrays, e.g.
>>
>>     Subset<- sample(1:length(as.matrix(pms)[index.affinities,]),25000)
>>     y<- log2(pms)[index.affinities,][Subset]
>>
>> It samples 25,000 signals out of all probes across all arrays - note,
>> not the same probes across all arrays.  Thus, the ordering of the
>> arrays (the columns) in 'pms' matters.
>>
>> Basically, by design of the estimators, that is due to the random
>> sampling in GCRMA, one cannot/should not treat GCRMA to give identical
>> estimates and signals from run to run.  Any comparison between
>> multiple runs should be done allowing for some deviation where the
>> assumption is that it gives almost the same results from run to run.
>> The reintroduction of "set.seed(1)" in the internal GCRMA code, makes
>> this less clear and a bit confusing.
>>
>> In aroma.affymetrix we have ported GCRMA partly by reimplementing
>> GCRMA and partly by using wrappers.  Part of the redundancy testing is
>> to validate the correctness.  See
>>
>>   http://aroma-project.org/replication/gcRMA
>>
>> which show how closely we reproduce the results.  Those results are
>> without fixing the seed, meaning that they illustrate the amount of
>> randomness you should expect in the estimates/GCRMA signals.
>>
>> /Henrik
>>
>>>
>>> b
>>>
>>> On 8 November 2010 15:50, Wolfgang Huber<whuber at embl.de>  wrote:
>>>>
>>>> Hi Markus,
>>>>
>>>> Jean or Rafa will be able to give a more competent response, but as far
>>>> as I
>>>> can see from the code, there are some places where the first array is
>>>> treated specially:
>>>>
>>>> ~/madman/Rpacks/gcrma/R$ grep ",1]" *
>>>> gcrma.R:      index2=which(!is.na(anc[,1]))
>>>> gcrma.engine2.R:    index.affinities<- which(!is.na(pm.affinities[,1]))
>>>> justGCRMA.R:       mm<- read.probematrix(filenames=filenames[i],
>>>> which="mm")$mm[,1]
>>>> justGCRMA.R:    mm<- read.probematrix(filenames=filenames[i],
>>>> which="mm")$mm[,1]
>>>>
>>>> I agree that this is not the most desirable feature.
>>>>
>>>> Best wishes
>>>>        Wolfgang
>>>>
>>>>
>>>> Il Nov/8/10 10:19 AM, Markus Boenn ha scritto:
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I've made some experiment about GCRMA using bg.adjust.gcrma() contained
>>>>> in the gcrma package.
>>>>>
>>>>> I take two CEL files, A.cel and B.cel (CASE I). Both are from arrays of
>>>>> the same type, for instance Mouse Gene 430 2.0 Arrays. In addition, I
>>>>> rename both files in a way, which changes the lexicographical order,
>>>>> say, I rename A.cel to BA.cel and I rename B.cel to AB.cel (CASE II).
>>>>>
>>>>> Using bg.adjust.gcrma() and exprs() to obtain the expression values,
>>>>> for
>>>>> some probe sets I get different values for A.cel (in CASE I) than for
>>>>> BA.cel (in CASE II), although both files contain the same data. Of
>>>>> course, for B.cel and AB.cel the same phenomenon becomes obvious.
>>>>>
>>>>> How can this be possible? To me, it seems as if GCRMA normalizes the
>>>>> data with respect to the lexicographical order, but as far as I know,
>>>>> GCRMA treats each array independently.
>>>>>
>>>>> This effect is not reproducible using call.exprs() with argument
>>>>> algorithm="rma" or "mas5". But is reproducible for other arrays like
>>>>> HGU95A, for example.
>>>>>
>>>>> Please, can anybody try to explain me, if this effect is a bug or a
>>>>> feature (and why)?
>>>>>
>>>>> Best wishes,
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>> #################################################################
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.12.0 (2010-10-15)
>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>> other attached packages:
>>>>> [1] hgu133plus2probe_2.7.0 AnnotationDbi_1.12.0 hgu133plus2cdf_2.7.0
>>>>> [4]
>>>>> simpleaffy_2.26.0 genefilter_1.32.0 gcrma_2.22.0 [7] affy_1.28.0
>>>>> Biobase_2.10.0
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biostrings_2.18.0 DBI_0.2-5 IRanges_1.8.2 [4] RSQLite_0.9-2
>>>>> affyio_1.18.0 annotate_1.28.0 [7] preprocessCore_1.12.0 splines_2.12.0
>>>>> survival_2.35-8 [10] tools_2.12.0 xtable_1.5-6
>>>>>
>>>>> Also tried on other machine with similar packages
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.11.1 Patched (2010-09-16 r52943)
>>>>> Platform: i686-pc-linux-gnu (32-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>>>> [5] LC_MONETARY=C LC_MESSAGES=de_DE.UTF-8
>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] simpleaffy_2.24.0 genefilter_1.30.0 gcrma_2.20.0 affy_1.26.1
>>>>> [5] Biobase_2.8.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] affyio_1.16.0 annotate_1.26.1 AnnotationDbi_1.10.2
>>>>> [4] Biostrings_2.16.9 DBI_0.2-5 IRanges_1.6.17
>>>>> [7] preprocessCore_1.10.0 RSQLite_0.9-2 splines_2.11.1
>>>>> [10] survival_2.35-8 tools_2.11.1 xtable_1.5-6
>>>>>
>>>>>
>>>>> #################################################################
>>>>>
>>>>> Code example
>>>>>
>>>>> setwd("PATH TO CASE I DATA") #the directory containing A.cel and B.cel
>>>>> CEL<- ReadAffy()
>>>>> bag.1<- bg.adjust.gcrma(CEL)
>>>>> ebag.1<- exprs(bag.1)
>>>>>
>>>>> setwd("PATH TO CASE II DATA") #the directory containing BA.cel and
>>>>> AB.cel
>>>>> CEL<- ReadAffy()
>>>>> bag.2<- bg.adjust.gcrma(CEL)
>>>>> ebag.2<- exprs(bag.2)
>>>>>
>>>>> par(mfrow=c(2,1))
>>>>> # differences should be zero
>>>>> plot(ebag.1[,1]-ebag.2[,2])
>>>>> plot(ebag.1[,2]-ebag.2[,1])
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> ------------------------------------
> Zhijin (Jean) Wu
> Assistant Professor of Biostatistics
> Brown University, Box G-S121
> Providence, RI  02912
>
> Tel: 401 863 1230
> Fax: 401 863 9182
> http://www.stat.brown.edu/zwu
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>