[BioC] expresso: performing RMA on NON-Affy data?

Wed Apr 29 17:56:33 CEST 2009

Quoting Mark Robinson <mrobinson at wehi.EDU.AU>:

> Jose,
>
> Can I add a (possibly naive) suggestion?
>
> You say normalization is not the issue, it's the summarization of 3-8
> probes per probeset for your 1-colour Nimblegen data.  So maybe you
> just want to fit the log-additive RMA linear model to your data.  This
> is akin to estimating a model with probe effects and chip effects for
> each probeset ... of which you are really interested in the chip
> effects.
>
> Say you were able to collect your data from a single probeset into a
> matrix Y (concocted example below):
>
> ce <- rnorm(10,mean=6)   # 10 samples
> pe <- rnorm(5)           # 5 probes in probeset
> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 )  # add noise
>
> One way to do the fits in a quick and robust way is median polish:
>
> f <- medpolish(Y)
>
> ---------------
>> f
>
> Median Polish Results (Dataset: "Y")
>
> Overall: 6.949745
>
> Row Effects:
>  [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360 -0.3210895
>  [7]  0.5180266 -0.4351636 -1.2578855  0.4802634
>
> Column Effects:
> [1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01 -4.283675e-01
>
> Residuals:
>              [,1]      [,2]      [,3]       [,4]       [,5]
>  [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
>  [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
>  [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
>  [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
>  [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
>  [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
>  [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
>  [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
>  [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
> [10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157
>
>> dim(Y)
> [1] 10  5
>> f$overall + f$row  # estimated chip effects
>  [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 6.514582
>  [9] 5.691860 7.430009
>> ce                 # true chip effects
>  [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 6.101933
>  [9] 5.117349 6.905161
> ---------------
>
> quick sketch ... it would be (fairly) easy to split up your table of
> log-transformed normalized probe intensities into a matrix for each
> probeset (e.g. using split() ...), robustly fit the model, extract the
> chip effects and whammo, there is your table of summarized expression
> values ... this would only be a few lines of R code and probably not
> too inefficient (?) ... this is effectively what goes on under the hood
> of affy::rma() and the like (although it uses C code and in a very
> general way that uses CDF environments).
>
> I suspect you could use the 'oligo' package to do much the same thing,
> after using pdInfoBuilder() to correctly assign probes to probesets ...
>
> ... anyways, just some thoughts.
>
> Cheers,
> Mark

Hi Mark,

Thanks for your email! Your *naive* suggestion actually solves my most  
important issue.
I wanted to apply the method and then learn what happened under the  
hood, but I should have probably gone the other way ;-)

I tried the median polish and it reproduces almost exactly the
summarised data I had, so I'm now sure this is the method they employed.

I just altered your example code in one place.

In your example the matrix had one row per chip (10 replicates) and one
column per probe in the probeset (5), one matrix per probeset. My
matrices are arranged the other way round, with probes in rows and chips
in columns.
In your code you suggested I take:

f$overall + f$row (where f=medpolish(matrix))

to obtain the estimated chip effects. With my layout this summarises the
chips into a single figure... per probe.
What I wanted was to summarise each probeset into a single number, so  
I take instead:

f$overall + f$col

This gives me the summarised probeset *per chip*, reducing my table of  
72000ish rows (probes) x 12 chips to one of 24000ish rows (probesets)  
x 12 chips.

That is great.
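
For the record, the per-probeset step could be sketched roughly as below.
This is only a sketch under assumptions: the data are simulated, the names
`probes`, `probeset` and `summarise_probeset` are made up for illustration,
and it assumes a matrix of log-transformed, normalised intensities with one
row per probe and one column per chip.

```r
set.seed(1)

# Simulated stand-in for the real table: 6 probesets x 3 probes each, 12 chips.
# (Hypothetical data; real input would be log-transformed normalised intensities.)
n_sets <- 6; n_probes <- 3; n_chips <- 12
probeset <- rep(paste0("ps", seq_len(n_sets)), each = n_probes)
probes <- matrix(rnorm(n_sets * n_probes * n_chips, mean = 8),
                 ncol = n_chips,
                 dimnames = list(NULL, paste0("chip", seq_len(n_chips))))

# Median polish on one probeset's probes-by-chips matrix;
# with chips in columns, overall + column effects gives one value per chip.
summarise_probeset <- function(idx) {
  f <- medpolish(probes[idx, , drop = FALSE], trace.iter = FALSE)
  f$overall + f$col
}

# Split row indices by probeset, summarise each, and stack the results
# into a probesets-by-chips matrix.
idx_by_set <- split(seq_len(nrow(probes)), probeset)
summarised <- t(sapply(idx_by_set, summarise_probeset))

dim(summarised)
```

The result has one row per probeset and one column per chip, which is the
shape of the reduced table described above.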

The obvious question is... would it be reasonable to repeat the  
process to summarise replicate chips?
Essentially it'll use the whole matrix, and the result could be  
obtained as a single column (24000ish values, one per probeset) like  
this:

f$overall + f$row (as you suggested initially).

probably a slow process, but does this make sense at least?
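
To make the question concrete, the second pass would look something like
the sketch below. Here `summarised` is a simulated stand-in for the
probesets-by-chips table (hypothetical data, and it assumes all 12 chips
are replicates of one condition), and `per_probeset` is just an
illustrative name.

```r
set.seed(2)

# Hypothetical summarised table: 100 probesets x 12 replicate chips.
summarised <- matrix(rnorm(100 * 12, mean = 8), nrow = 100,
                     dimnames = list(paste0("ps", 1:100),
                                     paste0("chip", 1:12)))

# One median polish over the whole matrix (rows = probesets, columns = chips).
f <- medpolish(summarised, trace.iter = FALSE)

# Overall + row effects: a single value per probeset across the replicates.
per_probeset <- f$overall + f$row

length(per_probeset)
head(per_probeset)
```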

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
