[BioC] summarizing probe intensites before or after normalization- 1. how to do with RMA 2. Opinions?

k. brand k.brand at erasmusmc.nl
Mon Sep 11 17:54:51 CEST 2006


Thank you Jim,

Given the QPCR results, i believe the differences are consistent; so i 
will definitely follow up your suggestions.

Apologies for 'falling off' the list, the jpeg attachments kept 
bouncing. One way or another will stick to it in future.

Again, thank you for your thoughts,

Karl

on 9/11/2006 5:37 PM James W. MacDonald said the following:
> Hi Karl,
> 
> Please keep this on the list.
> 
> Since you have repeated the experiment with new samples, and there is 
> such a huge difference between the two sets, I would tend to separate 
> them (e.g., run RMA separately on the two batches), and then fit a mixed 
> effects model. You can use the GeneMetaEx package to do this. See this 
> post for more info:
> 
> http://article.gmane.org/gmane.science.biology.informatics.conductor/7981/match=gentleman+mixed+model 
> 
> 
> There may not be enough replication in each set to accurately estimate 
> the variance/covariance matrix (somebody like Robert Gentleman would be 
> better able to answer that question).
> 
> Alternatively, if the old arrays are consistently 'dimmer' than the new 
> by a relatively constant amount, you might just be able to fit a fixed 
> effects ANOVA with a batch effect. An example of doing that can be found 
> here, where they fit a 'day' effect:
> 
> http://bioinf.wehi.edu.au/marray/jsm2005/lab5/lab5.html
> 
> Best,
> 
> Jim
> 
> k. brand wrote:
>> Jim,
>>
>> Yes, i succeeded in normalizing the RMA data with:
>>
>> datrma.postqnorm <- normalize(datrma)
>>
>> Sorry to shock you...wait till you see the un-normalized data 
>> (attached- "RMA no norm.jpeg"!
>>
>> Unfortunately the disparity is technical not biological. The low 
>> arrays were 2 years past expiration, i know the quality is 
>> significantly impaired. However good QPCR results justified my 
>> department allowing me to use some fresh arrays to finish the experiment.
>>
>> So i have half my experiment (2 replicates for three tissues) on old 
>> arrays, and the other half (2 replicates for three tissues) on fresh 
>> arrays. The data is ok, i get some pathways at the end which look ok. 
>> BUt i want to wring the most from this imperfect data set.
>>
>> And yes, it is definitely the correct RMA output. Normalizing after 
>> summarizing DOES however enforce equal distributions for all arrays 
>> (see attached "RMA then BB norm.jpeg"). Yet i still try to understand 
>> what is the better approach, at least in my "disparate" situation- to 
>> normalize before or after summarization.
>>
>> What do you think?
>>
>> Karl
>>
>> PS running another 2 arrays is not an option....
>>
>>
>> on 9/11/2006 4:57 PM James W. MacDonald said the following:
>>
>>> Hi Karl,
>>>
>>> You have to convert the expression data to a data.frame in order to 
>>> get a boxplot for each column.
>>>
>>> boxplot(data.frame(exprs(datrma)))
>>>
>>> I am shocked at the boxplots for rma(). I have never seen RMA 
>>> processed data look that different, which makes me wonder what the 
>>> raw data look like. Also, are you sure these are RMA processed data 
>>> (e.g., you didn't accidently mask the rma() processed data with other 
>>> data)? Unless the raw data are completely horrible, I don't know how 
>>> you could get such disparate results.
>>>
>>>
>>>
>>> Best,
>>>
>>> Jim
>>>
>>> k. brand wrote:
>>>
>>>> James,
>>>>
>>>> Thank you for your fast detailed response.
>>>>
>>>> At your suggetion i tried your suugestion- ie.,
>>>>
>>>> library(affyPLM)
>>>> dat <- ReadAffy()
>>>> datrma <- rma(dat, normalize=FALSE)
>>>> datrma <- normalize.quantiles(exprs(datrma))
>>>> boxplot(datrma)
>>>>
>>>> Find attached the boxplot output "RMA then QnormJimscript.jpeg" 
>>>> which indicates some missing syntax since there is only one box for 
>>>> the 12 arrays?! In fact in my attempts to realize this endeavour i 
>>>> saw this ouput too. Unfortunately my lack of R knowledge prevented 
>>>> me getting around whatever the problem is. If you have a further 
>>>> suggestion im all ears.
>>>>
>>>> "data", the result of inheriting my teachers bad habits, is now 
>>>> "dat"...
>>>>
>>>> Further i attach a boxplot (better than histogram right- "RMA vs 
>>>> Qunt norm of MAS5 preproced & summed.jpg") comparing the two methods 
>>>> in my orignal post. When i look at the variation of the RMA 
>>>> processed and normed data i question how effectively can i compare 
>>>> these arrays with each. Especially along side the method shown which 
>>>> is MAS5 preprocessed then quantile normalized using a colleagues 
>>>> script. These arrays look much more comparable, even ideally so. Why 
>>>> dont i just use this appraoch you may ask? A: im convinced RMA is a 
>>>> superior preprocessing and summarizing method...i just need to be 
>>>> able to reconcile the apparent variation in the final output. Or 
>>>> perhaps, better understand it.
>>>>
>>>> Your further thoughts, suggests greatly appreciated,
>>>>
>>>> karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> on 9/11/2006 3:42 PM James W. MacDonald said the following:
>>>>
>>>>> k. brand wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I compared two normalization approaches for an experiment using 
>>>>>> twelve affy 430-2.0 chips. (histogram plot comparing bith methods 
>>>>>> forwarded on request).
>>>>>>
>>>>>> #1. RMA
>>>>>> library(affy)
>>>>>> data <- ReadAffy()
>>>>>> datarma <- rma(data)
>>>>>> exprs2excel(datarma, file="dataRMA.csv")
>>>>>>
>>>>>> Plotting histograms of the output shows arrays NOT perfectly 
>>>>>> aligning at the means and spreads.
>>>>>>
>>>>>> I used a custom script to effect a quantile normalization on MAS5 
>>>>>> preprocessed but unnormalized data-
>>>>>>
>>>>>> #2. Mas5 sans interchip normalization
>>>>>> library(affy)
>>>>>> data <- ReadAffy()
>>>>>> datamas5sannorm <- mas5(data, normalize=FALSE)
>>>>>> exprs2excel(datamas5sannorm, file="datamas5sannorm.csv")
>>>>>> f.qnorm <- function(x,qinit=0.75,perc=100)  {...
>>>>>>
>>>>>> The means and spreads of this normalization approach do align 
>>>>>> perfectly.
>>>>>>
>>>>>> THUS- summarizing probe intensites before or after normalization 
>>>>>> does appear to make a noticeable difference, as may be expected.
>>>>>>
>>>>>> My questions/requests-
>>>>>>
>>>>>> 1. Help to effect Bolstad normalization of the RMA preprocessed 
>>>>>> and summarized data. Whilst I succeed in generating unnormalized 
>>>>>> RMA preprocessed data with-
>>>>>>
>>>>>> library(affy)
>>>>>> data <- ReadAffy()
>>>>>> datarma <- rma(data, normalize=FALSE)
>>>>>
>>>>>
>>>>>
>>>>> Next step would be
>>>>>
>>>>> datarma <- normalize.quantiles(exprs(datarma))
>>>>>
>>>>> also note that 'data' is not a very good variable name, as you are 
>>>>> masking an existing function. When creating variable names it is 
>>>>> often enlightening to type the name first at an R prompt to see if 
>>>>> you get any response.
>>>>>
>>>>>
>>>>>>
>>>>>> As a result of my limited R experience, I failed in finding a 
>>>>>> method to effect Bolstad (quantile) normalization on this output.
>>>>>>
>>>>>> 2. Thoughts/comments on the benefits/caveats of normalizing before 
>>>>>> or after summarizing probe intensities.
>>>>>
>>>>>
>>>>>
>>>>> Normalizing after summarization for something like rma() seems 
>>>>> questionable to me. Since the expression values are based on 
>>>>> fitting a model to the PM probe values, if you don't normalize 
>>>>> first you are ignoring any non-biological variability which may end 
>>>>> up biasing your results. Using median polish for the model fit 
>>>>> should help protect against this, but I don't know that I would 
>>>>> want to take chances.
>>>>>
>>>>> As an aside, how far off are the histograms? Are you sure that 
>>>>> there is a reasonable difference? Eyeballing a histogram isn't the 
>>>>> best way to determine if the mean and variance are different or 
>>>>> not. A quick run through with some data here shows very little 
>>>>> differences:
>>>>>
>>>>>  > eset <- justRMA(filenames=list.celfiles()[1:10])
>>>>>  > apply(exprs(eset),2,summary)
>>>>>         Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7
>>>>> Min.       4.085    4.070    4.091    4.051    4.068    4.090    4.087
>>>>> 1st Qu.    5.835    5.859    5.832    5.812    5.842    5.858    5.852
>>>>> Median     7.079    7.069    7.048    7.061    7.070    7.077    7.080
>>>>> Mean       7.225    7.227    7.224    7.227    7.229    7.225    7.232
>>>>> 3rd Qu.    8.352    8.324    8.351    8.363    8.361    8.330    8.347
>>>>> Max.      14.550   14.440   14.420   14.400   14.490   14.430   14.260
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>>
>>>>>> I look forward to any thoughts, advice & suggestions from users.
>>>>>>
>>>>>> thanks in advance,
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> ===========================================
>>>>>>
>>>>>>    > sessionInfo()
>>>>>> Version 2.3.0 (2006-04-24)
>>>>>> i386-pc-mingw32
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] "tools"     "methods"   "stats"     "graphics"  "grDevices" 
>>>>>> "utils"
>>>>>>        "datasets"  "base"
>>>>>>
>>>>>> other attached packages:
>>>>>>        affy   affyio  Biobase
>>>>>> "1.10.0"  "1.0.0" "1.10.0"
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 

-- 
Karl Brand <k.brand at erasmusmc.nl>
Department of Cell Biology and Genetics
Erasmus MC
Dr Molewaterplein 50
3015 GE Rotterdam
lab +31 (0)10 408 7409 fax +31 (0)10 408 9468



More information about the Bioconductor mailing list