[BioC] Is GCRMA influenced by lexicographical order?

Mon Nov 8 17:44:06 CET 2010

On Mon, Nov 8, 2010 at 8:10 AM, Benilton Carvalho
<beniltoncarvalho at gmail.com> wrote:
> just set the random seed to the same value prior to running
> bg.correct.gcrma() and you should be fine...
>
> ## setwd(CASEI)
> set.seed(1)
> CEL = ReadAffy()
> ebag.1 = exprs(bg.adjust.gcrma(CEL))
> ## setwd(CASEII)
> set.seed(1)
> CEL = ReadAffy()
> ebag.2 = exprs(bg.adjust.gcrma(CEL))

I don't think that works, because of how subsets used for estimates
are drawn from all arrays, e.g.

    Subset <- sample(1:length(as.matrix(pms)[index.affinities,]),25000)
    y <- log2(pms)[index.affinities,][Subset]

It samples 25,000 signals out of all probes across all arrays - note,
not the same probes across all arrays.  Thus, the ordering of the
arrays (the columns) in 'pms' matters.

Basically, by design of the estimators, that is due to the random
sampling in GCRMA, one cannot/should not treat GCRMA to give identical
estimates and signals from run to run.  Any comparison between
multiple runs should be done allowing for some deviation where the
assumption is that it gives almost the same results from run to run.
The reintroduction of "set.seed(1)" in the internal GCRMA code, makes
this less clear and a bit confusing.

In aroma.affymetrix we have ported GCRMA partly by reimplementing
GCRMA and partly by using wrappers.  Part of the redundancy testing is
to validate the correctness.  See

  http://aroma-project.org/replication/gcRMA

which show how closely we reproduce the results.  Those results are
without fixing the seed, meaning that they illustrate the amount of
randomness you should expect in the estimates/GCRMA signals.

/Henrik

>
> b
>
> On 8 November 2010 15:50, Wolfgang Huber <whuber at embl.de> wrote:
>> Hi Markus,
>>
>> Jean or Rafa will be able to give a more competent response, but as far as I
>> can see from the code, there are some places where the first array is
>> treated specially:
>>
>> ~/madman/Rpacks/gcrma/R$ grep ",1]" *
>> gcrma.R:      index2=which(!is.na(anc[,1]))
>> gcrma.engine2.R:    index.affinities <- which(!is.na(pm.affinities[,1]))
>> justGCRMA.R:       mm <- read.probematrix(filenames=filenames[i],
>> which="mm")$mm[,1]
>> justGCRMA.R:    mm <- read.probematrix(filenames=filenames[i],
>> which="mm")$mm[,1]
>>
>> I agree that this is not the most desirable feature.
>>
>> Best wishes
>>        Wolfgang
>>
>>
>> Il Nov/8/10 10:19 AM, Markus Boenn ha scritto:
>>>
>>> Dear all,
>>>
>>> I've made some experiment about GCRMA using bg.adjust.gcrma() contained
>>> in the gcrma package.
>>>
>>> I take two CEL files, A.cel and B.cel (CASE I). Both are from arrays of
>>> the same type, for instance Mouse Gene 430 2.0 Arrays. In addition, I
>>> rename both files in a way, which changes the lexicographical order,
>>> say, I rename A.cel to BA.cel and I rename B.cel to AB.cel (CASE II).
>>>
>>> Using bg.adjust.gcrma() and exprs() to obtain the expression values, for
>>> some probe sets I get different values for A.cel (in CASE I) than for
>>> BA.cel (in CASE II), although both files contain the same data. Of
>>> course, for B.cel and AB.cel the same phenomenon becomes obvious.
>>>
>>> How can this be possible? To me, it seems as if GCRMA normalizes the
>>> data with respect to the lexicographical order, but as far as I know,
>>> GCRMA treats each array independently.
>>>
>>> This effect is not reproducible using call.exprs() with argument
>>> algorithm="rma" or "mas5". But is reproducible for other arrays like
>>> HGU95A, for example.
>>>
>>> Please, can anybody try to explain me, if this effect is a bug or a
>>> feature (and why)?
>>>
>>> Best wishes,
>>> Markus
>>>
>>>
>>>
>>> #################################################################
>>>
>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>> other attached packages:
>>> [1] hgu133plus2probe_2.7.0 AnnotationDbi_1.12.0 hgu133plus2cdf_2.7.0 [4]
>>> simpleaffy_2.26.0 genefilter_1.32.0 gcrma_2.22.0 [7] affy_1.28.0
>>> Biobase_2.10.0
>>> loaded via a namespace (and not attached):
>>> [1] Biostrings_2.18.0 DBI_0.2-5 IRanges_1.8.2 [4] RSQLite_0.9-2
>>> affyio_1.18.0 annotate_1.28.0 [7] preprocessCore_1.12.0 splines_2.12.0
>>> survival_2.35-8 [10] tools_2.12.0 xtable_1.5-6
>>>
>>> Also tried on other machine with similar packages
>>>
>>> sessionInfo()
>>> R version 2.11.1 Patched (2010-09-16 r52943)
>>> Platform: i686-pc-linux-gnu (32-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>> [5] LC_MONETARY=C LC_MESSAGES=de_DE.UTF-8
>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] simpleaffy_2.24.0 genefilter_1.30.0 gcrma_2.20.0 affy_1.26.1
>>> [5] Biobase_2.8.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] affyio_1.16.0 annotate_1.26.1 AnnotationDbi_1.10.2
>>> [4] Biostrings_2.16.9 DBI_0.2-5 IRanges_1.6.17
>>> [7] preprocessCore_1.10.0 RSQLite_0.9-2 splines_2.11.1
>>> [10] survival_2.35-8 tools_2.11.1 xtable_1.5-6
>>>
>>>
>>> #################################################################
>>>
>>> Code example
>>>
>>> setwd("PATH TO CASE I DATA") #the directory containing A.cel and B.cel
>>> CEL <- ReadAffy()
>>> bag.1 <- bg.adjust.gcrma(CEL)
>>> ebag.1 <- exprs(bag.1)
>>>
>>> setwd("PATH TO CASE II DATA") #the directory containing BA.cel and AB.cel
>>> CEL <- ReadAffy()
>>> bag.2 <- bg.adjust.gcrma(CEL)
>>> ebag.2 <- exprs(bag.2)
>>>
>>> par(mfrow=c(2,1))
>>> # differences should be zero
>>> plot(ebag.1[,1]-ebag.2[,2])
>>> plot(ebag.1[,2]-ebag.2[,1])
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>