[R] How to normalize to a set of internal references

Frank E Harrell Jr f.harrell at vanderbilt.edu
Mon Mar 2 14:04:09 CET 2009


Waverley wrote:
> Thanks for the advice.  My question is more about how to do this.
> 
> Let me use a biology gene analysis example to illustrate:
> In biology, there are always some housekeeping genes that differ
> little even under pathological conditions.
> 
> We know that external factors affect the measurements differently
> from batch to batch.  For example, overall signal intensity might
> differ due to lab reagents.
> A simplified picture:
> Day 1: Using control samples, I measured genes #1 to #110.
> Day 2: Using disease samples, I measured genes #1 to #110 again.
> 
> Comparing the two data sets, I noticed that the overall signal
> intensity for each gene is higher on Day 1 than on Day 2.
> I know from the biological literature that genes 101 to 110 are
> "housekeeping" genes and should not change much between disease and
> control.  My question is, technically, how do I use the values of
> genes 101 to 110 to adjust the signals of genes 1 to 100 so that the
> batch effect is corrected?  The differences revealed by the
> comparative analysis of genes 1 to 100 between disease and control
> should then be due to biology rather than lab artifacts.
> 
> So the question is how to do that mathematically.  If I had only one
> housekeeping gene, I could divide every gene by it to normalize and
> then compare.  But I have 10 genes that can be used for
> normalization.  I assume that, in this context, the more reference
> genes used, the better.
> 
> Can you help again?
> 
> Thanks much in advance.

That is an inappropriate experimental design that has caused major 
problems in the biomedical research literature (look up the famous 
Petricoin fiasco - google for petricoin baggerly; Baggerly discovered 
the error).  You have day and disease completely confounded and no model 
can correct for that (day and disease are completely collinear).  Once 
you randomize the order of samples to be run and analyzed, you can 
include day as a blocking factor to adjust for any day effect.  If 
analyzing log intensity, the regression adjustment for day will involve 
a ratio correction on the original scale.
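
To make the last point concrete, here is a minimal sketch in Python (standard library only). The intensities are made up for illustration, and centering each day's log intensities is a simplified stand-in for a regression with a day term; in a balanced layout like this one it gives the same disease-versus-control contrast. All names in the block are hypothetical.

```python
import math

# Hypothetical intensities for one gene. Both groups are run on both days,
# so day is a blocking factor and is not confounded with disease.
runs = [
    # (day, group, intensity)
    (1, "control", 200.0),
    (1, "disease", 400.0),
    (2, "control", 100.0),
    (2, "disease", 200.0),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Mean log intensity per day: in this balanced layout it captures the
# additive day effect on the log scale (up to a constant).
days = sorted({day for day, _, _ in runs})
day_mean_log = {
    d: mean(math.log(y) for day, _, y in runs if day == d) for d in days
}

# Adjustment on the log scale: subtract the day's mean log intensity.
adjusted_log = [(day, grp, math.log(y) - day_mean_log[day]) for day, grp, y in runs]

# The same adjustment on the original scale: divide by the day's geometric mean.
day_geo_mean = {d: math.exp(m) for d, m in day_mean_log.items()}
adjusted_ratio = [(day, grp, y / day_geo_mean[day]) for day, grp, y in runs]

def group_mean(rows, group):
    return mean(v for _, g, v in rows if g == group)

# Disease-versus-control contrast after the day adjustment (log scale).
log_fold_change = group_mean(adjusted_log, "disease") - group_mean(adjusted_log, "control")
```

Exponentiating each value in `adjusted_log` reproduces the corresponding value in `adjusted_ratio`, which is the sense in which a subtraction on the log scale is a ratio correction on the original scale. None of this works when every control sample is run on one day and every disease sample on another, because the day means then absorb the disease effect as well.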

If you are completely correct that the housekeeping genes cannot be 
disease-related, there is hope for some kind of internal control if you 
make a strong assumption about the time effect being the same for 
housekeeping genes as for other genes.  But why not just do the proper 
design?
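
For what that internal control might look like, here is a rough sketch in Python (standard library only). It is one possible approach, not a recommendation: each sample is scaled by the geometric mean of its housekeeping genes, which is only valid under the strong assumption above that the day effect multiplies every gene by the same factor. The gene names, the intensities, and the helper `normalize_to_references` are all hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean, computed as exp of the mean log."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalize_to_references(sample, reference_ids):
    """Divide each non-reference gene by the geometric mean of the
    reference (housekeeping) genes measured in the same sample."""
    scale = geometric_mean(sample[g] for g in reference_ids)
    return {g: v / scale for g, v in sample.items() if g not in reference_ids}

# Hypothetical toy data: two genes of interest and three housekeeping genes.
housekeeping = ["hk1", "hk2", "hk3"]
day1_control = {"g1": 300.0, "g2": 80.0, "hk1": 100.0, "hk2": 200.0, "hk3": 400.0}
# Assumed scenario: every Day 2 signal is halved by a batch effect, g1 is
# truly doubled in disease, and g2 is truly unchanged.
day2_disease = {"g1": 300.0, "g2": 40.0, "hk1": 50.0, "hk2": 100.0, "hk3": 200.0}

norm_control = normalize_to_references(day1_control, housekeeping)
norm_disease = normalize_to_references(day2_disease, housekeeping)

# Disease/control ratio per gene, before and after normalization.
raw_ratio = {g: day2_disease[g] / day1_control[g] for g in norm_control}
fold_change = {g: norm_disease[g] / norm_control[g] for g in norm_control}
```

In this toy case the raw ratios hide the change in `g1` and invent one in `g2`, while the normalized ratios recover the assumed truth. That only happens because the data were built to satisfy the assumption. The scale factor is also treated as if it were known exactly, so any uncertainty statement based on the normalized values ignores the imprecision in the normalizing factors.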

Frank

> 
> 
> Waverley wrote:
>> Hi,
>>
>> I have a question about how to normalize data sets against a set of
>> internal measurements.
>>
>> For example, I have performed two batches of experiments contrasting
>> two different conditions (positive versus negative conditions): one at
>> a time.
>>
>> 1. In each experiment, I measure signals of variables v1 to v100.  I
>> want to understand how v1 to v100 change under these two contrasting
>> conditions.
>>
>> 2. I also know that variables v101 to v110, 10 in total, differ from
>> each other but should have the same or similar values under these
>> two contrasting conditions.
>>
>> 3. How do I do the internal normalization?  How can I use the values
>> of variables v101 to v110 to normalize the measurements of v1 to
>> v100 under either the positive or the negative condition to minimize
>> the batch effect?  I hope this makes the comparisons of v1 to v100
>> between the two conditions more accurate and more robust to
>> external noise.
>>
>> In general, I have a couple of matrices of the same dimensions and a
>> reference matrix of values to normalize them to.  How should I do
>> that?
>>
> 
> I don't understand your problem well, but in general internal
> normalization is by and large an attempt to avoid appropriate modeling
> (e.g., incorporating block effects or certain covariates in a regression
> model), and results in overstated confidence of the final estimates by
> not taking into account the imprecision in the normalizing factors.
> 
> Frank


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



