[BioC] Microarray data normalization

Wed Jul 30 11:32:04 CEST 2014

Dear Bernard,

my preference would be option 2, but the first thing to do if you’re unsure is to try both and see if it makes any difference. Presumably the differences are minimal and within the uncertainty of your analysis.
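
A rough way to "try both" can be sketched on synthetic data. This is only an illustration under stated assumptions: it implements plain quantile normalization (one of the steps inside rma, which also does background correction and probe-set summarization), it uses made-up log-intensities instead of real .CEL files, and every function name and number in it is hypothetical. With real data one would instead run rma once on all 300 files and once on the 250 of interest, then compare the resulting rankings in the same way.

```python
import random


def quantile_normalize(arrays):
    """Quantile-normalize arrays (each a list of log-intensities, one per probe).

    Each array's k-th smallest value is replaced by the mean of the k-th
    smallest values across all arrays passed in. Ties get no special handling.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


def probe_ranking(arrays):
    """Rank probes from highest to lowest mean value across arrays."""
    n_probes = len(arrays[0])
    means = [sum(a[i] for a in arrays) / len(arrays) for i in range(n_probes)]
    return sorted(range(n_probes), key=lambda i: means[i], reverse=True)


def simulate(seed=0, n_probes=500, n_solid=25, n_blood=5):
    """Synthetic data; the wider distribution for blood arrays is an assumption."""
    rng = random.Random(seed)
    base = [rng.gauss(8.0, 2.0) for _ in range(n_probes)]
    solid = [[b + rng.gauss(0.0, 0.3) for b in base] for _ in range(n_solid)]
    blood = [
        [1.5 * b - 4.0 + rng.gauss(0.0, 0.3) for b in base] for _ in range(n_blood)
    ]
    return solid, blood


solid, blood = simulate()

# Option 1: normalize everything, then drop the blood arrays.
option1 = quantile_normalize(solid + blood)[: len(solid)]
# Option 2: normalize only the arrays of interest.
option2 = quantile_normalize(solid)

rank1 = probe_ranking(option1)
rank2 = probe_ranking(option2)
top = 50
overlap = len(set(rank1[:top]) & set(rank2[:top]))
max_shift = max(
    abs(x - y) for a1, a2 in zip(option1, option2) for x, y in zip(a1, a2)
)
print(f"top-{top} overlap between options: {overlap}/{top}")
print(f"largest change in a normalized value: {max_shift:.3f}")
```

Quantile normalization preserves the order of probes within each array, so the two options can only differ through the reference distribution that the values are mapped onto. That is why the comparison looks at the ranking of across-sample means and at the size of the value changes.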

If option 1 were the right thing to do, then by the same logic you could go to the internet (ArrayExpress, GEO), download a few thousand more arrays, throw them in, and get even better results.

The view that "the purpose of normalization is to remove batch effects" is not quite right. Batch effects can affect the data in all sorts of ways, but rma, for example, only addresses those types of effects that act on all the data on an array in the same way, i.e. overall higher or lower background, overall more or less cDNA used, or overall longer or shorter exposure to the scanner. What it does not remove is, for instance, a change in the way that the signal depends on probe GC content or cDNA length (and this can happen as reagents and materials change).
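
The distinction can be illustrated with a small sketch, again on synthetic data and under loud assumptions: quantile normalization stands in for the normalization step of rma, the "GC fraction" per probe is invented, and the effect sizes are arbitrary. One array differs from the reference by a constant shift on the log scale (an array-wide effect), the other by a shift that depends on each probe's GC value (a probe-dependent effect).

```python
import random


def quantile_normalize(arrays):
    """Map each array's k-th smallest value to the across-array mean of k-th smallest values."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        out = [0.0] * n
        for rank, probe in enumerate(sorted(range(n), key=lambda i: a[i])):
            out[probe] = reference[rank]
        result.append(out)
    return result


def mean_abs_diff(x, y):
    """Average absolute per-probe difference between two arrays."""
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)


rng = random.Random(1)
n_probes = 200
signal = [rng.gauss(8.0, 2.0) for _ in range(n_probes)]  # synthetic log-intensities
gc = [rng.uniform(0.3, 0.7) for _ in range(n_probes)]    # hypothetical GC fraction

reference_array = list(signal)
# Array-wide effect: every probe shifted by the same amount on the log scale.
shifted_array = [s + 1.0 for s in signal]
# Probe-dependent effect: the shift depends on each probe's GC value.
gc_biased_array = [s + 3.0 * (g - 0.5) for s, g in zip(signal, gc)]

norm_ref, norm_shift = quantile_normalize([reference_array, shifted_array])
residual_shift = mean_abs_diff(norm_ref, norm_shift)

norm_ref2, norm_gc = quantile_normalize([reference_array, gc_biased_array])
residual_gc = mean_abs_diff(norm_ref2, norm_gc)

print(f"residual after array-wide shift:    {residual_shift:.6f}")
print(f"residual after GC-dependent effect: {residual_gc:.6f}")
```

The constant shift leaves the order of probes within the array unchanged, so normalization maps both arrays onto the same values. The GC-dependent shift reorders probes, and that reordering survives normalization.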

Best wishes
Wolfgang

On Jul 29, 2014, at 20:32 EDT, Bernard Lee Kok Bang <bernard.lee at carif.com.my> wrote:

> Dear all, I would like to ask a question regarding microarray data normalization.
> 
> Scenario:
> I have in hand the raw '.CEL' files for a collection of 300 cancer cell lines (multiple cancer types), all from the same study/batch. My aim is to obtain the gene expression values and use them downstream. However, I am only interested in a subset of these .CEL files, for example only the NON-blood cancer cell lines (n=250).
> 
> I’m wondering which of these two options is more appropriate for my scenario:
> 
> Option 1:
> 1)	Normalize all 300 .CEL by rma.
> 2)	After normalization, manually remove the 50 blood samples I am NOT interested in
> 3)	Use the normalized data of 250 samples for downstream analysis
> 
> Option 2:
> 1)	Normalize ONLY the 250 .CEL by rma (as if the 50 blood samples do not exist)
> 2)	Use the normalized data of 250 samples for downstream analysis
> 
> My downstream analysis simply involves ranking the genes from highest expression to lowest.
> 
> From my point of view, I favor the first option. Since I have all the solid tumor and blood cell line data, I might as well normalize them all together first before manually excluding the blood cell lines, as to my knowledge the purpose of normalization is to remove batch effects? So the larger the sample size during rma normalization, the better?
> 
> 
> Thanks in advance.
> 
> Bernard Lee
> Research Assistant
> Cancer Research Initiatives Foundation (CARIF)
> University of Malaya (UM)