[BioC] Array normalisation with Limma: would this be reasonable?

Mon Dec 4 14:30:21 CET 2006

Hi,

I am having trouble trying to normalise my data properly.
Briefly, I have a number of 2-colour cDNA arrays. Every slide is  
hybridised to 1) a reference sample (non-transfected RNA from a cell  
line), and 2) a transfected sample (from the same cell line).
So the question is transfection vs. non-transfection. So far so good.

What's the problem?
The transfection is of a plasmid that will "activate" expression of  
many genes (it's a fusion protein between a DNA-binding domain that  
would target many gene promoters, especially silenced ones, and a  
potent transactivator domain). This means that a large proportion of  
genes are differentially expressed, with most going from no or very  
little expression, to a clearly detectable level.

This means that loess normalisation doesn't work very well. Actually,  
it works "too well". On the raw data, if you plot Cy3 vs Cy5 (logged),  
there's the usual diagonal with the bulk of the data, and then a  
(usually) large spike with low Cy3 and varying Cy5 (parallel to the  
Cy5 axis), or vice versa, depending on how the transfection was labelled.
(See http://mcnach.com/MISC/RG_scatterplots.png).
But then, after print-tip group loess, what I see is that the spike  
gets severely distorted, pulled towards the bulk of the data in the  
diagonal, and this results in a clear underestimation of the number of  
real DE genes.
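
To make the geometry concrete, here is a minimal sketch (Python/NumPy, purely for illustration, not limma code) of how the two logged channels map to the M and A values that loess works on. The function name and the intensities are made up; R stands for Cy5 and G for Cy3, as usual.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A): log2 ratio and average log2 intensity per spot."""
    log_r = np.log2(np.asarray(red, dtype=float))
    log_g = np.log2(np.asarray(green, dtype=float))
    return log_r - log_g, (log_r + log_g) / 2.0

# Made-up intensities: two "bulk" spots on the diagonal, and one
# "spike" spot with near-background Cy3 (green) but clear Cy5 (red).
red = np.array([1024.0, 4096.0, 2048.0])
green = np.array([1024.0, 4096.0, 64.0])
m, a = ma_values(red, green)
```

A spike spot ends up with a large M at a low A, so a loess curve fitted through all spots is dragged towards the spike at the low-intensity end, and subtracting that curve shrinks the spike's log-ratios.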

I'm exploring alternatives, and I had an idea. It seems a bit "rough",  
so I wonder what more experienced people think.
This is the idea: I can identify most of the spots on the "spike" because  
they have roughly background-level signal in one channel and decent  
signal in the other. I can do this on the raw data, either  
by looking at the foreground and background intensities on each  
slide, or at the signal to noise ratio (SNR) that Genepix produces.  
Once these are located, I can assign zero weight to them, which means  
that the normalisation (loess) is applied using only the bulk of the  
spots, that mostly don't change that much.
My hope is that this would remove the distortion of the spike due to  
loess, but would still be adequate enough to "balance" the Cy3 and Cy5  
channels appropriately.
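
A rough sketch of what I mean, in Python/NumPy for illustration only. This is not limma's loess: it is a plain tricube-weighted local linear fit, without print-tip groups or robustness iterations. The function names and the foreground/background ratio cut-offs (`low`, `high`) are hypothetical and would need tuning on real slides.

```python
import numpy as np

def flag_spike_spots(fg_r, bg_r, fg_g, bg_g, low=1.5, high=3.0):
    """Flag spots near background in one channel but well above it in the other.

    `low` and `high` are hypothetical foreground/background ratio cut-offs.
    """
    ratio_r = np.asarray(fg_r, dtype=float) / np.asarray(bg_r, dtype=float)
    ratio_g = np.asarray(fg_g, dtype=float) / np.asarray(bg_g, dtype=float)
    return ((ratio_r < low) & (ratio_g > high)) | ((ratio_g < low) & (ratio_r > high))

def local_linear_fit(a, m, weights, span=0.3):
    """Tricube-weighted local linear fit of M on A.

    Only spots with weight > 0 shape the curve, but the curve is
    evaluated at every spot, so zero-weight spots still get normalised.
    """
    a = np.asarray(a, dtype=float)
    m = np.asarray(m, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = weights > 0
    a_fit, m_fit, w_fit = a[keep], m[keep], weights[keep]
    # Number of neighbours in each local window (a fraction `span` of kept spots).
    k = min(max(3, int(np.ceil(span * a_fit.size))), a_fit.size)
    fitted = np.empty_like(a)
    for i, a0 in enumerate(a):
        dist = np.abs(a_fit - a0)
        # Bandwidth: distance to the k-th nearest kept spot, slightly inflated
        # so that spot keeps a tiny positive weight.
        h = max(np.partition(dist, k - 1)[k - 1], 1e-12) * (1.0 + 1e-9)
        u = np.clip(dist / h, 0.0, 1.0)
        kw = w_fit * (1.0 - u**3) ** 3  # tricube kernel times spot weight
        total = kw.sum()
        mean_a = (kw * a_fit).sum() / total
        mean_m = (kw * m_fit).sum() / total
        sxx = (kw * (a_fit - mean_a) ** 2).sum()
        sxy = (kw * (a_fit - mean_a) * (m_fit - mean_m)).sum()
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted[i] = mean_m + slope * (a0 - mean_a)
    return fitted

def normalise_m(a, m, weights, span=0.3):
    """Subtract the fitted curve from M for every spot, spike included."""
    return np.asarray(m, dtype=float) - local_linear_fit(a, m, weights, span)

# Usage with made-up foreground/background intensities for three spots:
spike = flag_spike_spots(fg_r=[1000, 1000, 60], bg_r=[50, 50, 50],
                         fg_g=[55, 900, 800], bg_g=[50, 50, 50])
weights = np.where(spike, 0.0, 1.0)  # zero weight for flagged spike spots
```

The point is that the flagged spots contribute nothing to the curve, yet still have the curve subtracted from their M values, so the two channels are balanced using the bulk only.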

I have experimented with different values for the 'span' parameter  
in loess, from the default 0.3 up to 1.0. The higher the span, the  
smaller the distortion, although the angle of the spike varies and  
it's still not quite right.

In the light of what the raw data scatterplots look like (attachment),  
does anyone have objections to my "solution"?

I realise that the best would be to have a set of control spots for  
these arrays, but unfortunately I don't have that luxury. I have  
identified a small set of genes that do not change expression,  
consistently across experiments, even when done in another cell line.  
But these are only 7 genes, which cover the effective range of A  
values, and I don't think that 7 genes is enough (when I tried limma's  
normalisation method 'control' it gave me an error that appears to be  
due to too few spots used as controls).

I'd be grateful for any comments.

Thanks!

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK