[BioC] Bugs in HTqPCR implementation of data normalization

Tue Jul 1 23:48:16 CEST 2014

Hello,

I was hoping to use HTqPCR to analyze my data, but found some bugs
that make the package hard to use (or giving wrong results...).

Looking at the source code of normalizeCtData, it appears that there
is a bug in the geometric.mean and scale.rankinvariant methods - that
data seem to be multiplied by scaling factors instead of being divided
by them. This magnifies inter-sample differences instead of reducing
them!

Here's an example script that illustrates the problem. I generate 4
samples with a small number of measurements, with all measurements the
same in each sample.

library(HTqPCR)
n = 10;
m = 4;
htData =  new( "qPCRset", exprs = t(matrix(20 + 2*(1:m), m, n)));
featureNames(htData) = paste0("mir.", 1:n);
sampleNames(htData) = paste0("sample", 1:m);
featureType(htData) = rep("Target", n);

exprs(htData)

#       sample1 sample2 sample3 sample4
#mir.1       22      24      26      28
#mir.2       22      24      26      28
#mir.3       22      24      26      28
#mir.4       22      24      26      28
#mir.5       22      24      26      28
#mir.6       22      24      26      28
#mir.7       22      24      26      28
#mir.8       22      24      26      28
#mir.9       22      24      26      28
#mir.10      22      24      26      28

htNorm.gm = normalizeCtData(htData, norm = "geometric.mean")
exprs(htNorm.gm);

#Scaling Ct values
#   Using geometric mean within each sample
#  Scaling factors: 1.00 1.09 1.18 1.27
> exprs(htNorm.gm);
#       sample1  sample2  sample3  sample4
#mir.1       22 26.18182 30.72727 35.63636
#mir.2       22 26.18182 30.72727 35.63636
#mir.3       22 26.18182 30.72727 35.63636
#mir.4       22 26.18182 30.72727 35.63636
#mir.5       22 26.18182 30.72727 35.63636
#mir.6       22 26.18182 30.72727 35.63636
#mir.7       22 26.18182 30.72727 35.63636
#mir.8       22 26.18182 30.72727 35.63636
#mir.9       22 26.18182 30.72727 35.63636
#mir.10      22 26.18182 30.72727 35.63636

Note that the sample differences were maginfied, not reduced.

Same thing happens for the scale.rankinvariant method:

n = 100;
m = 4;
htData.2 =  new( "qPCRset", exprs = t(matrix(20 + 2*(1:m), m, n)));
set.seed(1);
exprs(htData.2) = exprs(htData.2) + runif(m)
htNorm.sri = normalizeCtData(htData.2, norm = "scale.rankinvariant")

head(exprs(htNorm.sri))

Furthermore, the geometric.mean normalization has been implemented
differently from the article referenced in connection with it
(Mestdagh et al., 2009).

While the article is written so it lacks exact details and formulas, I
have confirmed with the authors that "geometric mean" applies to the
linearized expression data, that is, to  exponentiated Ct-values; this
is equivalent to averaging Ct using arithmetic mean, not geometric
mean as implemented in normalizeCtData. Further, the article seems to
suggest dividing the linearized values by the geometric mean, which
means subtracting, not dividing, the Ct values by the arithmetic mean.

This distinction should be pointed out in the vignette, or the method
should be re-implemented according to the article.

I also keep getting errors when trying to print objects. In the above
example, typing

htData

produces the following output:

An object of class "qPCRset"
Size:  10 features, 4 samples
Feature types:
Feature names:         mir.1 mir.2 mir.3 ...
Feature classes:
Error in `rownames<-`(`*tmp*`, value = c("mir.1", "mir.2", "mir.3", "mir.4",  :
  attempt to set 'rownames' on an object with no dimensions

Best,

Peter