[BioC] Affy normalization question

Mark W Kimpel mwkimpel at gmail.com
Sat Dec 22 23:08:45 CET 2007


Jim,

My understanding is that our lab normally randomizes by
1. treatment
2. RNA extraction
3. labeling
4. hybridization

In addition, we sometimes have multiple brain regions and, for the 
purpose of the MA run, each region is treated as an independent 
experiment; thus there is no randomization across brain regions for the 
above factors.
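
For concreteness, a minimal sketch of that kind of randomization (the 
sample labels are hypothetical; the point is just to draw an independent 
random order for each wet-lab step):

samples <- paste(rep(c("ctrl", "treat"), each = 8), 1:8, sep = "")
set.seed(1)  # fixed seed so the illustration is reproducible
## an independent random processing order for each step, so no two
## steps share the same grouping of samples
extraction.order    <- sample(samples)
labeling.order      <- sample(samples)
hybridization.order <- sample(samples)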

My question arises because of two recent situations. First, in one 
experiment, for a reason that is not clear to me, the labeling and 
hybridization groups were combined, and there is a clear batch effect 
when this labeling-hybridization factor is put into limma. In such a 
case, would separate normalization be suggested? Separate normalization 
will make the batch effect larger, but that would seem to be addressed 
by including the batch effect as a factor.
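
For concreteness, a minimal limma sketch of what I mean by putting the 
labeling-hybridization batch in as a factor (the targets columns and the 
eset object are hypothetical, assuming rma()-summarized expression values 
and a two-level treatment):

library(limma)
treatment <- factor(targets$Treatment)  # e.g. control vs. treated
batch     <- factor(targets$LabelHyb)   # combined labeling/hybridization group
design    <- model.matrix(~ batch + treatment)
fit <- lmFit(eset, design)
fit <- eBayes(fit)
## the last coefficient is the treatment effect, adjusted for batch
topTable(fit, coef = ncol(design))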

Secondly, in another experiment I need to perform an analysis across 5 
brain regions to look for overall gene expression differences resulting 
from genetic differences between strains. In that experiment the 4 
factors mentioned at the beginning were randomized for, so there is no 
batch effect within brain region, but there is one across brain regions. 
In this experiment I am not trying to find differences across brain 
regions, which would be impossible to separate from a batch effect, but 
rather between two treatments that are independent of brain region. One 
way I have done this in the past has been to simply average all 5 brain 
regions together to come up with an average-brain expression measure, 
but I wonder whether it would be better to include brain region as a 
factor. Regardless of whether I average or not, I need to decide whether 
to normalize all brain regions together or, because they were run as 
separate MA experiments, to normalize them individually.
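
As a sketch of the normalize-individually, region-as-factor route 
(region.files, a named list of CEL-file vectors per region, and 
strain.labels are hypothetical):

library(affy)
library(limma)
## normalize each brain region's chips as its own RMA run
esets <- lapply(region.files, function(f) rma(ReadAffy(filenames = f)))
## probesets come back in the same order for a given chip type, so the
## separately normalized matrices can be bound together
exprs.all <- do.call("cbind", lapply(esets, exprs))
region <- factor(rep(names(esets), sapply(esets, function(e) ncol(exprs(e)))))
strain <- factor(strain.labels)  # the between-strain comparison of interest
design <- model.matrix(~ region + strain)
fit <- eBayes(lmFit(exprs.all, design))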

Really, the question seems to be threefold: whether RMA should be used 
on a group of CEL files in the presence of a non-chip-related batch 
effect; if so, whether it will make the batch effect "go away" (not in 
my experience); and, if not, how to incorporate the batch effect into a 
model.
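
One crude way to check whether a joint RMA run makes the batch effect go 
away is to normalize everything together and see whether the arrays still 
cluster by batch (the batch assignments are assumed known):

library(affy)
eset.all <- justRMA()          # reads all CEL files in the working directory
d <- dist(t(exprs(eset.all)))  # distances between arrays
## if arrays group by batch rather than by treatment, the effect remains
plot(hclust(d), labels = as.character(batch))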

Finally, I realize that by randomizing at each step mentioned at the 
top, one spreads any variance out so that it cannot be picked up as a 
batch effect. With the "n" we usually use, if one were to take each of 
the 4 factors into account one would usually run out of degrees of 
freedom. Nevertheless, the variance induced at each step of the wet-lab 
is still there; it is just not apparent and presumably doesn't induce 
bias. It does, however, decrease power, and I wonder if it wouldn't be 
better to block by treatment, so that each group contains equal numbers 
from each treatment, but then process each group entirely together. 
There the batch effect would be large, but it would be present as only 
one factor, which, with a large enough "n", one could take into account 
in a statistical model (a rough count is sketched below). That, it 
seems, might increase power to detect differential expression. Maybe 
this is counterintuitive, and it would probably only work if "n" were 
large enough to provide enough degrees of freedom, but it makes some 
sense to me. Am I nuts? (Many people think so, so don't be shy about 
saying so ;) ).
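
To put rough numbers on the degrees-of-freedom count mentioned above 
(the numbers are made up for illustration): with 16 arrays, 2 
treatments, and 4 processing groups fit additively,

n.arrays <- 16
n.trt    <- 2
n.batch  <- 4
## residual df for an additive model with an intercept
resid.df <- n.arrays - 1 - (n.trt - 1) - (n.batch - 1)  # = 11

so a single 4-level batch factor costs only 3 df, whereas fitting all 4 
wet-lab factors separately would quickly exhaust what is left.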

Thanks so much for your helpful input,
Mark

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)

mwkimpel<at>gmail<dot>com

******************************************************************


James W. MacDonald wrote:
> Hi Mark,
> 
> Mark W Kimpel wrote:
>> Not infrequently on this list the question arises as to how to perform 
>> RMA on a large number of CEL files. The simple answer, of course, is 
>> to use "justRMA" or buy more RAM.
>>
>> As I have learned more about the wet-lab side of microarray 
>> experiments it has come to my attention that there is a technical 
>> limitation in our lab as to how many chips can actually be run at one 
>> time and that there is a substantial batch effect between batches.
>>
>> So, in my case at least, it seems to me that it would be incorrect to 
>> normalize 60 CEL files at once when in fact they have been run in 4 
>> batches of 16. Would it not be better to normalize them separately, 
>> within-batch, and then include a batch effect in an analytical model?
> 
> Ideally you would randomize the samples when you are processing them (we 
> randomize at four different steps) so you don't have batches that are 
> processed together all the way through.
> 
> Whether or not you fit a batch effect in a linear model depends on how 
> the samples were processed. If the lab processed all the same type of 
> samples in each of the batches (please say they didn't), then any batch 
> effect will be aliased with the sample types and fitting an effect won't 
> really help.
> 
> If the batches were at least semi-randomized, then with 60 samples you 
> won't be losing that many degrees of freedom by fitting a batch effect; 
> it probably won't hurt, and it just might help.
> 
>>
>> Is my situation unique or, in fact, is this the way most MA wet-labs 
>> are set up? If the latter is correct, should the recommendation then be 
>> *not* to use justRMA on 80 CEL files if they have been run in batches?
> 
> Regardless of how the lab is set up, once you get to large sample sets 
> there will always be batches. If you do proper randomization of the 
> samples during processing, IMO there should be no need for any 
> post-processing adjustments for the batches.
> 
> Best,
> 
> Jim
> 
> 
>>
>> Thanks,
>> Mark
>


