[BioC] Is GCRMA influenced by lexicographical order?

Zhijin Wu zwu at stat.brown.edu
Mon Nov 8 18:11:52 CET 2010


Henrik is correct that the non-identical result is due to the sampling 
of the subsets in adjusting for specific binding (GSB.adj).

The discrepancy I get from re-ordering the arrays on some human arrays I 
have is ranged between -.004 and 0.006. However I can change the code so 
that one does get absolutely identical result on the same data set.

Jean
On 11/8/2010 11:44 AM, Henrik Bengtsson wrote:
> On Mon, Nov 8, 2010 at 8:10 AM, Benilton Carvalho
> <beniltoncarvalho at gmail.com>  wrote:
>> just set the random seed to the same value prior to running
>> bg.correct.gcrma() and you should be fine...
>>
>> ## setwd(CASEI)
>> set.seed(1)
>> CEL = ReadAffy()
>> ebag.1 = exprs(bg.adjust.gcrma(CEL))
>> ## setwd(CASEII)
>> set.seed(1)
>> CEL = ReadAffy()
>> ebag.2 = exprs(bg.adjust.gcrma(CEL))
>
> I don't think that works, because of how subsets used for estimates
> are drawn from all arrays, e.g.
>
>      Subset<- sample(1:length(as.matrix(pms)[index.affinities,]),25000)
>      y<- log2(pms)[index.affinities,][Subset]
>
> It samples 25,000 signals out of all probes across all arrays - note,
> not the same probes across all arrays.  Thus, the ordering of the
> arrays (the columns) in 'pms' matters.
>
> Basically, by design of the estimators, that is due to the random
> sampling in GCRMA, one cannot/should not treat GCRMA to give identical
> estimates and signals from run to run.  Any comparison between
> multiple runs should be done allowing for some deviation where the
> assumption is that it gives almost the same results from run to run.
> The reintroduction of "set.seed(1)" in the internal GCRMA code, makes
> this less clear and a bit confusing.
>
> In aroma.affymetrix we have ported GCRMA partly by reimplementing
> GCRMA and partly by using wrappers.  Part of the redundancy testing is
> to validate the correctness.  See
>
>    http://aroma-project.org/replication/gcRMA
>
> which show how closely we reproduce the results.  Those results are
> without fixing the seed, meaning that they illustrate the amount of
> randomness you should expect in the estimates/GCRMA signals.
>
> /Henrik
>
>>
>> b
>>
>> On 8 November 2010 15:50, Wolfgang Huber<whuber at embl.de>  wrote:
>>> Hi Markus,
>>>
>>> Jean or Rafa will be able to give a more competent response, but as far as I
>>> can see from the code, there are some places where the first array is
>>> treated specially:
>>>
>>> ~/madman/Rpacks/gcrma/R$ grep ",1]" *
>>> gcrma.R:      index2=which(!is.na(anc[,1]))
>>> gcrma.engine2.R:    index.affinities<- which(!is.na(pm.affinities[,1]))
>>> justGCRMA.R:       mm<- read.probematrix(filenames=filenames[i],
>>> which="mm")$mm[,1]
>>> justGCRMA.R:    mm<- read.probematrix(filenames=filenames[i],
>>> which="mm")$mm[,1]
>>>
>>> I agree that this is not the most desirable feature.
>>>
>>> Best wishes
>>>         Wolfgang
>>>
>>>
>>> Il Nov/8/10 10:19 AM, Markus Boenn ha scritto:
>>>>
>>>> Dear all,
>>>>
>>>> I've made some experiment about GCRMA using bg.adjust.gcrma() contained
>>>> in the gcrma package.
>>>>
>>>> I take two CEL files, A.cel and B.cel (CASE I). Both are from arrays of
>>>> the same type, for instance Mouse Gene 430 2.0 Arrays. In addition, I
>>>> rename both files in a way, which changes the lexicographical order,
>>>> say, I rename A.cel to BA.cel and I rename B.cel to AB.cel (CASE II).
>>>>
>>>> Using bg.adjust.gcrma() and exprs() to obtain the expression values, for
>>>> some probe sets I get different values for A.cel (in CASE I) than for
>>>> BA.cel (in CASE II), although both files contain the same data. Of
>>>> course, for B.cel and AB.cel the same phenomenon becomes obvious.
>>>>
>>>> How can this be possible? To me, it seems as if GCRMA normalizes the
>>>> data with respect to the lexicographical order, but as far as I know,
>>>> GCRMA treats each array independently.
>>>>
>>>> This effect is not reproducible using call.exprs() with argument
>>>> algorithm="rma" or "mas5". But is reproducible for other arrays like
>>>> HGU95A, for example.
>>>>
>>>> Please, can anybody try to explain me, if this effect is a bug or a
>>>> feature (and why)?
>>>>
>>>> Best wishes,
>>>> Markus
>>>>
>>>>
>>>>
>>>> #################################################################
>>>>
>>>> sessionInfo()
>>>> R version 2.12.0 (2010-10-15)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>> other attached packages:
>>>> [1] hgu133plus2probe_2.7.0 AnnotationDbi_1.12.0 hgu133plus2cdf_2.7.0 [4]
>>>> simpleaffy_2.26.0 genefilter_1.32.0 gcrma_2.22.0 [7] affy_1.28.0
>>>> Biobase_2.10.0
>>>> loaded via a namespace (and not attached):
>>>> [1] Biostrings_2.18.0 DBI_0.2-5 IRanges_1.8.2 [4] RSQLite_0.9-2
>>>> affyio_1.18.0 annotate_1.28.0 [7] preprocessCore_1.12.0 splines_2.12.0
>>>> survival_2.35-8 [10] tools_2.12.0 xtable_1.5-6
>>>>
>>>> Also tried on other machine with similar packages
>>>>
>>>> sessionInfo()
>>>> R version 2.11.1 Patched (2010-09-16 r52943)
>>>> Platform: i686-pc-linux-gnu (32-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>>> [5] LC_MONETARY=C LC_MESSAGES=de_DE.UTF-8
>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] simpleaffy_2.24.0 genefilter_1.30.0 gcrma_2.20.0 affy_1.26.1
>>>> [5] Biobase_2.8.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] affyio_1.16.0 annotate_1.26.1 AnnotationDbi_1.10.2
>>>> [4] Biostrings_2.16.9 DBI_0.2-5 IRanges_1.6.17
>>>> [7] preprocessCore_1.10.0 RSQLite_0.9-2 splines_2.11.1
>>>> [10] survival_2.35-8 tools_2.11.1 xtable_1.5-6
>>>>
>>>>
>>>> #################################################################
>>>>
>>>> Code example
>>>>
>>>> setwd("PATH TO CASE I DATA") #the directory containing A.cel and B.cel
>>>> CEL<- ReadAffy()
>>>> bag.1<- bg.adjust.gcrma(CEL)
>>>> ebag.1<- exprs(bag.1)
>>>>
>>>> setwd("PATH TO CASE II DATA") #the directory containing BA.cel and AB.cel
>>>> CEL<- ReadAffy()
>>>> bag.2<- bg.adjust.gcrma(CEL)
>>>> ebag.2<- exprs(bag.2)
>>>>
>>>> par(mfrow=c(2,1))
>>>> # differences should be zero
>>>> plot(ebag.1[,1]-ebag.2[,2])
>>>> plot(ebag.1[,2]-ebag.2[,1])
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
------------------------------------
Zhijin (Jean) Wu
Assistant Professor of Biostatistics
Brown University, Box G-S121
Providence, RI  02912

Tel: 401 863 1230
Fax: 401 863 9182
http://www.stat.brown.edu/zwu



More information about the Bioconductor mailing list