[BioC] Unexpected results using limma with numerical factor

Wed Aug 25 09:40:48 CEST 2004

I've recently asked a similar question but have got no feedback, see
point 2 onwards in the following.
https://www.stat.math.ethz.ch/pipermail/bioconductor/2004-August/005802.
html

As I now understand it (perhaps/probably wrongly) the standard approach
(equilivent to ANOVA?) is to use a non-numeric factor for the fit.
However, lm() in R is also capable of regression fits. Looking (but not
understanding) lm.fit in limma, it appears to be an independent function
(lm bit written in fortran?) and doesn't call lm() from stats. So the
question is really if lm.fit can do numeric regression?

Another consideration is that you wouldn't use a design (factor) but a
numeric vector. My guess is that your design is being taken as a factor,
(and if you look in the user guide) the -ve values may indicate dye
swaps, which could be interesting! I've just tried lmfit with a numeric
vector (but as there are replicates each number appears 3 times) and got
meaningless results - all lods>30 and all tiny p-values 1e-24. So
initially it looks like you can't use limma like this, but I'd like to
hear an expert verdict as if the approach is completely wrong.

Anyway if you're looking for correlations then you could perhaps try
pearson (see my previous post about p-values and R2 from lm().
try-

Test1 <- c(0.58,-2.36,-12.24,-14.84,0.15,-3.23,-11.66,-12.91)
Correl <- esApply(eset, 1, function(x) {
cor(x, Test1)
})

be aware that you might get alot of correlations by chance, particularly
as your scores seem to be in 2 groups, close to zero and < -11. And a
straight line between two groups gives a good pearson as it's sensitive
to outliers.

Perhaps it's best to change your test scores to factors anyway-
Test1.high, Test1.low, Test2.high, Test2.low, and do a conventional
limma analysis. Without a regular distribution of test scores,
correlations are going to be largely meaningless.

HTH, and that someone can answer this more definetely.

Cheers,
Matt

Hi again,

For an experiment I'm analyzing, I do not have a series of factors.
Rather, I 
have a pair of test-score (numerical) for each replicate animal.  This
is 
Affymetrix data, and for my initial pass I tried with only a single
test-score.  
The commands I used are below:

> library(gcrma);
Welcome to Bioconductor 
         Vignettes contain introductory material.  To view, 
         simply type: openVignette() 
         For details on reading vignettes, see
         the openVignette help page.
> library(limma);
> cel.files <- c(
+ 'RAE230_2_060104_LH_IM07T.CEL',
+ 'RAE230_2_060104_LH_IM08T.CEL',
+ 'RAE230_2_060104_LH_IM09T.CEL',
+ 'RAE230_2_060104_LH_IM10T.CEL',
+ 'RAE230_2_060204_LH_IM07T.CEL',
+ 'RAE230_2_060204_LH_IM08T.CEL',
+ 'RAE230_2_060204_LH_IM09T.CEL',
+ 'RAE230_2_060204_LH_IM10T.CEL'
+ );
> eset <- ReadAffy(filenames=cel.files, phenoData="phenodata.txt");
> eset <- ReadAffy(filenames=cel.files, phenoData="phenodata.txt");
> eset;
AffyBatch object
size of arrays=834x834 features (43476 kb)
cdf=Rat230_2 (31099 affyids)
number of samples=8
number of genes=31099
annotation=rat2302
> pData(eset);
                               TestScore1
RAE230_2_060104_LH_IM07T.CEL         0.58
RAE230_2_060104_LH_IM08T.CEL        -2.36
RAE230_2_060104_LH_IM09T.CEL       -12.24
RAE230_2_060104_LH_IM10T.CEL       -14.84
RAE230_2_060204_LH_IM07T.CEL         0.15
RAE230_2_060204_LH_IM08T.CEL        -3.23
RAE230_2_060204_LH_IM09T.CEL       -11.66
RAE230_2_060204_LH_IM10T.CEL       -12.91
> eset <- rma(eset);
Background correcting
Normalizing
Calculating Expression
> design <- model.matrix(~-1 + TestScore1, pData(eset));
> design;
                               TestScore1
RAE230_2_060104_LH_IM07T.CEL         0.58
RAE230_2_060104_LH_IM08T.CEL        -2.36
RAE230_2_060104_LH_IM09T.CEL       -12.24
RAE230_2_060104_LH_IM10T.CEL       -14.84
RAE230_2_060204_LH_IM07T.CEL         0.15
RAE230_2_060204_LH_IM08T.CEL        -3.23
RAE230_2_060204_LH_IM09T.CEL       -11.66
RAE230_2_060204_LH_IM10T.CEL       -12.91
attr(,"assign")
[1] 1
> fit1 <- lmFit(eset, design);
> fit3 <- eBayes(fit1);

All proceeds well without any error-messages, so I believed I had
successfully 
fit my model.  When I extract the data, however, I get some unexpected
results:

> topTable(fit3, coef=1, number=20, adjust="fdr");
                       ID         M        A         t      P.Value
B
104            1367555_at -1.212175 14.77173 -6.808742 3.912627e-07
14.75482
105          1367556_s_at -1.203275 14.67175 -6.762709 3.912627e-07
14.48816
3411           1370862_at -1.185363 14.42457 -6.674599 3.912627e-07
13.98156
2777           1370228_at -1.184768 14.46704 -6.666414 3.912627e-07
13.93475
549            1368000_at -1.175866 14.33987 -6.622929 3.912627e-07
13.68683
2697           1370148_at -1.174858 14.32747 -6.617795 3.912627e-07
13.65764
837            1368288_at -1.170412 14.25957 -6.596453 3.912627e-07
13.53649
710          1368161_a_at -1.169621 14.27702 -6.589664 3.912627e-07
13.49801
420            1367871_at -1.166552 14.09481 -6.588461 3.912627e-07
13.49120
2558           1370009_at -1.162241 14.18252 -6.552347 3.912627e-07
13.28707
147            1367598_at -1.161072 14.19550 -6.543620 3.912627e-07
13.23787
2576         1370027_a_at -1.158955 14.13682 -6.536068 3.912627e-07
13.19533
1136           1368587_at -1.154447 14.04036 -6.517046 3.912627e-07
13.08836
196            1367647_at -1.154999 14.07951 -6.516673 3.912627e-07
13.08627
31081 AFFX-r2-P1-cre-5_at -1.153568 14.05140 -6.510380 3.912627e-07
13.05094
19150          1387082_at -1.153501 14.05431 -6.509674 3.912627e-07
13.04697
2898         1370349_a_at -1.153164 14.07047 -6.505935 3.912627e-07
13.02599
1235           1368686_at -1.152660 14.04966 -6.504796 3.912627e-07
13.01960
8200           1375651_at -1.152557 14.04428 -6.504674 3.912627e-07
13.01892
19359          1387291_at -1.151666 14.03699 -6.499744 3.912627e-07
12.99127

The near-identical M, A, and p-values indicate a problem, and none of
the genes 
on this list are very plausible biologically for our test-system.  Based
on that 
I'm pretty sure I've gone astray somewhere.

Is it possible to use numeric scores in fitting a linear model with
limma?  If 
so, am I asking the question in the right manner?  If not, are there any

BioConductor tools appropriate for this kind of question?

Any help very much appreciated,
Paul