[BioC] LARGE Change in GCRMA expression values between versions (REPEAT QUESTION: PLEASE REPLY)

Fri Jul 11 11:03:12 CEST 2008

Hi

One thing I have noticed with the "new" GCRMA (2.12.x) is that for low-expressed probesets, the expression values are exactly the same across arrays. I'm presuming that this explains the peak in p-values near 1 that Guido is seeing. I also note that the  "infinitesimal amount of uniformly distributed noise" mentioned in the Lim et al paper is not added in the gcrma package. I think the result of this is you are likely to see many probesets with zero difference between conditions, and hence presumably a p-value of 1.

When the Lim et al paper was presented at ISMB, I asked whether the new version of GCRMA had been benchmarked using benchmarks such as affycomp. I was told that it hadn't been but that it would be.

FYI, I am currently working on a benchmark for combinations of summarisation/differential expression methods, and will include the "old" and "new" versions of GCRMA.

Best wishes

Richard.

Hooiveld, Guido wrote:
> Hi,
> 
> Since we by default use GCRMA in our pipeline, I (also) had a look how
> the newest GCRMA library (2.12.1) performed on several of our datasets.
> I must admit the outcomes were rather startling to me....
> 
> Based on some testing we did few years ago we decided to use GCRMA
> _only_ using the option fast=FALSE. Thus:
> gcrma.data <- gcrma(data, fast=FALSE)
> 
> The reason for this was mainy because when comparing the effects of a
> synthetic drug (agonist) in wild-type and receptor-null mice we noticed
> a very strange p-value distribution when we used the option 'fast=TRUE'.
> As expected, it peaked at the left (= because of the differential
> expessed genes), but instead of gradually fading out to the right we
> noticed a small bump in the middle of this distribution graph. This
> became extremely prominent when we compared the effects of the drug in
> the receptor null mice; we clearly observed a bump in the middle of this
> curve although we expected an almost flat curve.... (The receptor this
> drug binds to was knocked out in these null mice, so we _knew_ treatment
> with this drug should only have a very little effect, if all, on gene
> expression! [in contrast to the WT mice]).
> 
> Remarkably, this unexpected bump completely disappeared when using the
> same CEL files and GCRMA with the option  'fast=FALSE'. 
> See slides 3 and 4 here:
> http://members.chello.nl/g.hooiveld/pics/GCRMA_evaluation.pdf
> We therefore concluded that we better should not use the maximum
> likelihood estimate (MLE) for background, but rather the Emperical Bayes
> (EB) estimate for BG correction.
> 
> 
> 
> Wei Keat Lim et al. then reported about a specific problem with GCRMA,
> and proposed a modification. Since as a non-statistician I could not
> fully understand what his modification did, I decided to ask him about
> some practical aspects (see below for full conversation). Based on this
> I learned that both ML and EB estimates for bakckground are affected,
> although at different levels.
> 
> So, considering the above, with the release of GCRMA v2.12.1 I decided
> to analyse/compare the before mentioned arrays again. Results of this
> are at slides 1 and 2 [note however that the width of the bins is
> smaller than those in slides 3&4, but this does not affect the message].
> Thus, I expected that the latest GCRMA would not show this strange bump
> in the p-value distribution plots anymore, regardles whether the ML
> (fast=TRUE) or EB (fast=FALSE) estimate for BG was used. However, this
> is in contrast with what I observed, for fast=TRUE I noticed this bump
> is still there, but (this is new) also that the p-values 'peaked'
> completely at the right of the graph (p=0.95-1), both in the WT and
> receptor null mice! Surprisingly, for fast=FALSE the p-value distr
> looked as expected....
> 
> What the exact cause (and solution) is for this strange behaviour I
> leave to the real knowledgeable people/experts, but I thought it is good
> to share my observations.
> 
> Regards,
> Guido 
> 
> 
> 
> 
> 
> E-mail conversation with Wei Keat Lim in August 2007:
> Q (Guido): do you expect that the specific artifacts introduced using
> the ML estimate also occur when using the EB estimate (because our data
> suggest EB seems not to be affected)? 
> A (Lim): Yes, the artifact introduced in GCRMA is independent of the
> method used (MLE or EB). You observed a difference in your results
> because the default threshold used to truncate uninformative probe is
> lower in EB. The threshold is a function of 'fast', i.e.
> k=6*fast+0.5*(1-fast). Hence, k=6 in MLE and k=0.5 in EB. The artifact
> still exists but less obvious in EB implementation.
> 
> 
> I then followed up on this:
>  
> Q (Guido):
>> Thus, based on these analyses I conclude that when using the EB
> estimate for NSB calculation, your mod doesn't really alter/improve the
> expression level determination. However, in your first reply you
> mentioned that your mod should also affect the "slow" gcRMA. Based on my
> data, do you agree this is only marginally and therefore there is no
> real need to apply your mod? Or did I overlook something and am I
> completely wrong? Related to this, have you checked how the "slow"
> default gcRMA is positioned in Fig8 of your paper?
> A (WKL):
>> Thanks a lot for sharing the information. The part in gcrma that we
> suggest to modify does not relate to MLE or EB methods. The fact that
> you observed a difference between them is just because they are using a
> different cutoff value (k) by default. You will see a big difference if
> you choose k=6 (default in MLE) in EB implementation.
>> Another thing to note is that we did not expect to see big changes in
> differential expression analysis, and what we showed earlier is just the
> effects in gene-gene statistical dependencies. You can always use
> gcrma-EB if that does not affect downstream analysis in your dataset. We
> just proposed a correction that could eliminate artificial gene-gene
> correlation in the implementation, independent of MLE, EB and parameter
> k.
>> Thanks,
>> Wei Keat
> 
>  
> This was followed by:
> Q (Guido):
> I just read your paper again and you clearly show that your GCRMA
> modification improves the estimation of gene-gene correlations. Since
> you applied the MLE method of GCRMA this effect is [likely] more
> pronounced than when applying the EB implementation (for reasons you
> described; it's dependent on setting of k). 
> Since we normally use the EB implementation, I used your code on one of
> our datasets, and found that this did not really affect the calculated
> gene expression levels and subsequent statistical analysis. Therefore I
> concluded that for us there is no need to use your modded GCRMA
> algorithm. However, this may be a wrong conclusion since you adjusted
> the GSB procedure, which is the same for the MLE or EB implementation.
> In other words, it still would be wise to use your modified GCRMA code,
> because I did not look at parameters that specifically 'tested' the
> impact of the mod (which you did in your paper). Thus form our point of
> view: your modification will at least gives outcomes identical to the
> default GCRMA (EB imlementation) ("worst" case), but it will likely
> (although slightly) improve our analyses as well.
> Guido
> 
> A (WKL):
> Guido,
> Yes. You got the point :)
> It's possible that you observe changes in gene-gene correlation but not
> in differential expression. Imagine that there are 2 genes in 10
> samples. Both of them have low expression in control samples and high
> expression in treatment samples. Artificial gene-gene correlation in
> these samples does not change the fact that level of expression is
> different between control and treatment. The effect is just to increase
> the correlation among control/treatment samples. 
> Best,
> Wei Keat
>   
> 
> 
> ------------------------------------------------ 
> Guido Hooiveld, PhD 
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition 
> Wageningen University 
> Biotechnion, Bomenweg 2 
> NL-6703 HD Wageningen 
> the Netherlands 
> tel: (+)31 317 485788 
> fax: (+)31 317 483342 
> internet:   http://nutrigene.4t.com
> email:      guido.hooiveld at wur.nl
> 
> 
> 
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch 
>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Zhijin Wu
>> Sent: 08 July 2008 18:22
>> To: Richard Friedman
>> Cc: Rafael A. Irizarry; bio c bioconductor
>> Subject: Re: [BioC] LARGE Change in GCRMA expression values 
>> between versions (REPEAT QUESTION: PLEASE REPLY)
>>
>> Yes, the 2.12 version has made the change so probes without 
>> affinity information do not go through GSB(gene specific 
>> binding) adjustment.  So there may be a small number of 
>> probesets that are affected.
>>
>> best
>> Zhijin
>>  Richard Friedman wrote:
>>> Dear Zhijin,
>>>
>>>     (I sent the following in May and  June to the list (worded 
>>> slightly differently) but did not receive a reply.
>>>  I would appreciate it if you, or someone on the list could  could 
>>> clear up this problem)
>>>
>>>     I am noticing a large change in the absolute values of 
>> intensity 
>>> measurements
>>>
>>>     For the same probeset and and array normalized with the same 8 
>>> other arrays done with GCRMA 2.10 I got 5.27 but for GCRMA 
>> 2.12  I got 
>>> 3.14
>>>
>>> Does this sound like a change that can be expected between versions 
>>> 2.10 and 2.12, or does it sound as if I had made an error 
>> of some kind?
>>> Is the change due to revising GCRMA to take the criticism  
>> of Lim et 
>>> al. into account?
>>> Is so has version 2.12 been benchmarked against affycomp?
>>>
>>>
>>> Thanks and best wishes,
>>> Rich
>>> ------------------------------------------------------------
>>> Richard A. Friedman, PhD
>>> Associate Research Scientist,
>>> Biomedical Informatics Shared Resource Herbert Irving Comprehensive 
>>> Cancer Center (HICCC) Lecturer, Department of Biomedical 
>> Informatics 
>>> (DBMI) Educational Coordinator, Center for Computational 
>> Biology and 
>>> Bioinformatics (C2B2)/ National Center for Multiscale Analysis of 
>>> Genomic Networks (MAGNet) Box 95, Room 130BB or P&S 1-420C Columbia 
>>> University Medical Center 630 W. 168th St.
>>> New York, NY 10032
>>> (212)305-6901 (5-6901) (voice)
>>> friedman at cancercenter.columbia.edu
>>> http://cancercenter.columbia.edu/~friedman/
>>>
>>> In Memoriam,
>>> Algirdas Jonas Budrys
>>>
>>>
>>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Richard D. Pearson             richard.pearson at postgrad.manchester.ac.uk
School of Computer Science,    http://www.cs.man.ac.uk/~pearsonr
University of Manchester,      Tel: +44 161 275 6178
Oxford Road,                   Mob: +44 7971 221181
Manchester M13 9PL, UK.        Fax: +44 161 275 6204