[BioC] Limma: Error with dupcor.series/duplicateCorrelatoin function

Wed Mar 17 00:20:03 MET 2004

At 08:50 AM 17/03/2004, Valtteri Wirta wrote:
>Hello Limma developers and users,
>
>(sorry for the long posting)
>
>Today I encountered a similar problem as the one that was reported by 
>Jason Skelton.
>
>I'm using Limma version 1.5.7 and R 1.8.1 on Windows XP platform.

You're certainly up to data, limma 1.5.7 has only been out for a few hours!

>I have 16 slides divided into four comparisons (four in each).
>Each array contains 30 000 elements, representing 15 000 probes printed in 
>duplicate (two identical fields => spacing=15000).
>The M-values are stored in a matrix called Ms
>Data for unreliable features is removed through filtration which causes 
>some probes to have no valid measurements.
>
>The modelMatrix (previously known as designMatrix) looks like this:
> > design
>    HF HM LF LM
>1   1  0  0  0
>2   0  0  0  1
>3   0  0  1  0
>4   0  1  0  0
>5   0  1  0  0
>6   1  0  0  0
>7   0  0  0  1
>8   0  0  1  0
>9   0  1  0  0
>10  1  0  0  0
>11  0  0  0  1
>12  0 -1  0  0
>13 -1  0  0  0
>14  0  0  0 -1
>15  0  0  1  0
>16  0  0 -1  0
>
>When estimating the within-slide correlation using dupcor.series (or 
>duplicateCorrelation)
> > cor <- dupcor.series(Ms, design=design, ndups=2, spacing=15000)
>
>the following error message is displayed:
>Error in if (rho[i] < -1) rho[i] <- -1 : missing value where TRUE/FALSE needed

This doesn't seem to be your problem, rather it seems to be a limma bug. I 
have put a fix in limma 1.5.8.

A little background: in limma 1.3.15 (8 Feb 2004) I replaced all the 
numeric routines which underly the dupcor.series/duplicateCorrelation 
functions. In particular I replaced with function 'gls' from the nlme 
package with my own REML/randomized block functions from the statmod 
library. 'gls' is rather slow but is very reliable. The new functions are 
several times as quick as 'gls' on most microarray problems. On my machine 
this means about 8-10 minutes for a typical 'duplicateCorrelation' run 
rather than 40-45 minutes, so it's a worthwhile saving. But it seems that 
the new functions can fail on some data sets. More testing and feedback 
would be welcome.

>Ok. I replaced all NAs with 1 and now it works nicely.
>It seems therefore that this has something to do with the NAs.
>
>As the purpose of this step is to estimate the within-slide duplicate spot 
>correlation, I thought that changing the design object into a new one 
>considering all slides belonging to the same comparison should be ok.
>
> > cor <- dupcor.series(Ms, design=rep(1,16), ndups=2, spacing=15000)
>
>And this works nicely. (Only the warning message: Max iterations exceeded 
>in: glmgam.fit(dx, dy, start = c(mean(dy), 0)) is displayed, but that 
>should not matter).
>Do you think it is ok to do like this?

You can't do this unless you've subsetted Ms appropriately as well. You 
will get a correlation which is too small.

>Now, back to the similarity with Jason's example:
>I tried to further investigate the problem. Now I'm back to the old design 
>with four comparisons (shown above) and for each comparison I change the 
>M-value for all genes with no valid measurement on any of the four slides to 1.
>
> > cor <- dupcor.series(Ms.new, design=design, ndups=2, spacing=15000)
>
>This works nicely.
>
>This brings up the question of "how many valid measurements are needed?"
>For genes with no valid measurement on any of the four slides I replaced 1 
>of the measurements (always the first one) with 1.
>Then
>
> > cor <- dupcor.series(Ms.new2, design=design, ndups=2, spacing=15000)
>
>gives the following error:
>  Error in if (rho[i] < -1) rho[i] <- -1 : missing value where TRUE/FALSE 
> needed
>
>When I change two of the four NAs to 1, this error is displayed:
>Error in chol(XVX + lambda * I) : the leading minor of order 2 is not 
>positive definite
>
>Same goes for three of four, and four of four into 1 works nicely, as 
>expected.
>
>To make this even more difficult to understand... If I break out the four 
>slides corresponding to the first comparison (HF in the design matrix 
>above) the dupcor.series function works nicely also on the matrix containg 
>the original NA values:
>
>Ms.new3 <- Ms[,c(1,6,10,13)]
>cor.3 <- dupcor.series(Ms.new3, design=c(1,1,1,-1), ndups=2, spacing=15000)
>
>
>To conclude, it seems like there is a problem when a modelMatrix 
>(designMatrix) is used in the dupcor.series/dupliceCorrelation function 
>and that this is related to the number of valid measurements. I'd 
>appreciate if someone could comment on this and perhaps explain what is 
>wrong or what I am doing wrong
>
>Finally, I have a small suggestions. Would it be possibly to have LIMMA 
>display the version number when the package is loaded?

Well, this would be easy to implement. But I am reluctant to do so because 
no other R package, with the exception of Biobase, outputs any information 
on loading successfully. I don't think that limma has any unique importance 
which means it should be different. You can always type

help(package="limma")

to get the version number.

Cheers
Gordon

>regards,
>Valtteri