[BioC] data normalization

Tue Jul 21 16:37:36 CEST 2009

No one pays me for my opinions on this subject, so you may have mine  
for free.

First, normalization is a slightly nasty business when it comes to  
microarrays. The basic
idea, of course, is to use some mechanism to remove obvious systematic  
effects.

For example, in a two-color system, the two dyes may have slightly
different intensity profiles when measuring the same sample. My first
piece of advice is that you use some mechanism to SEE what that effect
looks like in your system. I think the limma package (a big favorite in
these parts) has a function called plotDensities(); you can also make
such plots in base R using the density() function. Another useful plot
is the log fold change against the average log intensity. For this type
of array, you will generally observe a pattern that looks like a banana.
In other words, the residuals of the regression line through this plot
show obvious local trends. If you ignore this, then you are accepting
that your experimental conditions have somehow conjured this up, and
this is not at all likely.
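
A minimal sketch of the two quantities on that plot, assuming you have
background-corrected red and green intensities for each spot. The
function name ma_values and the intensities are made up for
illustration; this is plain Python, not a limma call.

```python
import math

def ma_values(red, green):
    """Return (M, A): log2 fold change and average log2 intensity per spot."""
    m, a = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m.append(log_r - log_g)          # M: log fold change (red vs green)
        a.append((log_r + log_g) / 2)    # A: average log intensity
    return m, a

# Hypothetical intensities for four spots (not real data).
red = [120.0, 480.0, 2100.0, 9800.0]
green = [100.0, 500.0, 2000.0, 16000.0]

M, A = ma_values(red, green)
for m_i, a_i in zip(M, A):
    print(f"A = {a_i:6.2f}   M = {m_i:+.2f}")
```

Plotting M against A for every spot on an array is what shows the
banana, if there is one.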

The most intuitively obvious solution is to straighten the banana out,  
and you can achieve this by loess,
also available in limma. Loess creates a local regression curve  
through the middle of the banana,
then applies predictions based on this line to adjust one channel or  
the other, straightening it.
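
To make the idea concrete, here is a rough sketch of local regression
on a simulated banana. It is a simplified stand-in (a tricube-weighted
local linear fit with no robustness iterations), not limma's loess
code. The function name local_linear_fit, the span value, and the
simulated data are all assumptions for illustration.

```python
def local_linear_fit(x, y, span=0.3):
    """Fitted value at each x from a tricube-weighted local linear regression."""
    n = len(x)
    k = max(3, int(round(span * n)))                  # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12                     # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))         # local line evaluated at x0
    return fitted

# Simulated banana: M depends on A through a curve, with no real biology.
A = [6 + 0.2 * i for i in range(41)]
M = [0.05 * (a - 10) ** 2 - 0.3 for a in A]

curve = local_linear_fit(A, M)
straightened = [m - c for m, c in zip(M, curve)]      # subtract the fitted trend

print("largest |M| before:", round(max(abs(m) for m in M), 3))
print("largest |M| after: ", round(max(abs(m) for m in straightened), 3))
```

Here the correction is applied to the log ratio as a whole, which is
one way of adjusting one channel relative to the other. On real data
the fit also has to cope with outliers and genuinely changed genes,
which this sketch ignores.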

Interestingly, you can get a rather similar result by quantile  
normalization, which forces two data sets
to share a common distribution. It took me a minute to envision why  
this is true, but it is.
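
One way to see it (my reading, not necessarily the reasoning meant
above): if one channel is roughly a smooth, increasing distortion of
the other, then matching the two distributions rank by rank undoes that
distortion. A minimal sketch for two equal-length channels, ignoring
ties; the function name quantile_normalize and the numbers are
hypothetical.

```python
def quantile_normalize(a, b):
    """Force two equal-length samples to share one common distribution."""
    order_a = sorted(range(len(a)), key=a.__getitem__)   # indices of a, smallest first
    order_b = sorted(range(len(b)), key=b.__getitem__)   # indices of b, smallest first
    # Reference distribution: the mean of the two samples at each rank.
    ref = [(a[i] + b[j]) / 2 for i, j in zip(order_a, order_b)]
    new_a, new_b = [0.0] * len(a), [0.0] * len(b)
    for rank, (i, j) in enumerate(zip(order_a, order_b)):
        new_a[i] = ref[rank]      # each value is replaced by the reference
        new_b[j] = ref[rank]      # value that has the same rank
    return new_a, new_b

# Hypothetical log intensities for three spots.
red_log = [5.0, 2.0, 3.0]
green_log = [4.0, 1.0, 6.0]
new_red, new_green = quantile_normalize(red_log, green_log)
print(new_red, new_green)
```

After the call, both channels contain exactly the same set of values;
only the assignment of values to spots differs.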

Another possibility, one that I have not tried, is based on variance
stabilization. This makes a rather different set of assumptions, and I
am also going to play with this in the near future.
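
As a rough illustration only: one common family of variance-stabilizing
transforms replaces the logarithm with an arsinh-based "generalized
log", which behaves like a log at high intensities but stays defined
and roughly linear near zero. The function name glog2 and the constant
c below are assumptions for this sketch; a real method estimates its
calibration parameters from the data.

```python
import math

def glog2(x, c=50.0):
    """arsinh-based transform on a log2-like scale; c is a made-up constant."""
    return math.asinh(x / c) / math.log(2)

# Unlike log2, this is defined for zero and negative (background-subtracted) values.
for x in (-20.0, 0.0, 20.0, 2000.0, 20000.0):
    print(f"x = {x:8.1f}   glog2 = {glog2(x):7.3f}")
```

For large x the result differs from log2(x) only by a constant shift,
so fold changes between bright spots are nearly unchanged, while
differences between dim spots are compressed.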

Whatever approach you choose, you can be assured that your
normalization will create new artifacts in your data. There is no
perfect world here. This fact alone makes people edgy sometimes.
Second, there are many other systematic effects that are much more
complicated than intensity-dependent dye effects. The good news is that
if you understand the magnitude of your unwanted systematic effects
pretty well, you can hopefully do enough normalization of the right
sort to partly compensate for them without introducing enormous
artifacts.

In summary, this is not a turnkey system where you just drop all the
numbers into a magical grinder and out pops the correct answer without
any thought or understanding on anyone's part. It takes time and
consideration to do these things, a fact that most (but not all) of the
people who pay the rent around here understand.

Best,

Tom

On Jul 21, 2009, at 8:56 AM, James W. MacDonald wrote:

> Hi Barbara,
>
> Barbara Uszczynska wrote:
>> Dear R-Users,
>> I use home-made spotted arrays to do some research connected with
>> allergies.
>> The matrix consists of 60% genes that are up-regulated and 40% genes
>> that are down-regulated, plus spikes. I didn't use any genes with
>> constant expression. How should I analyse this experiment? According
>> to statistics I should focus on external spike controls and compare
>> all genes with spikes. It is a two-colour experiment, so I have to
>> build quite a complicated statistical model. I'm not sure if this is
>> the right path. What do you think?
>
> I think two things:
>
> First, asking the same question over and over will not endear you to  
> the listserv community, and will increase the likelihood that your  
> posts will simply be deleted by those who might help you.
>
> Second, what you are asking for is statistical help in analyzing  
> your experiment rather than help using software. Since many of the  
> people on this list are practicing statisticians, what you are  
> asking is for them to do what they get paid to do for you for free.  
> I would suggest that a more reasonable approach is to find a local  
> statistician to help you with your analysis, as you are unlikely to  
> get any (reasonable) help on a listserv.
>
> Best,
>
> Jim
>
>
>> Regards,
>> Barbara
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826