[BioC] Influence of expression correlation on false positive ratio

Mon Jul 9 15:42:26 CEST 2012

hi,

On 07/09/2012 01:53 PM, January Weiner wrote:
> Hi Jeff, many thanks.
>
>> If the tests are only dependent in small groups, say because genes are
>> grouped into small modules,  then most FDR methods in the p.adjust()
>> function or the methods in the qvalue package will work.
>
> Yes, but I wonder how it behaves in light of a more extensive
> co-expression network (see for example the 2003 Stuart paper in
> Science, http://www.sciencemag.org/content/302/5643/249.short). Has
> anyone tried to simulate this? The co-expression modules, as have been
> found in many papers, are sometimes anything but small.

i guess a simple way to simulate this kind of situation would be 
sampling multivariate normal observations from a covariance matrix with 
a specific pattern of conditional independences reflected on its 
inverse, this can be accomplished with the function qpG2Sigma() from the 
Bioconductor package 'qpgraph'. i'll work out a very simple example with 
1000 genes out of which 100 are differentially expressed (DE):

library(limma)   ## for DE analysis: lmFit(), etc.
library(mvtnorm) ## for rmvnorm()

p <- 1000 ## number of genes
pDE <- 100 ## number of DE genes

## let's say all genes are truly mutually independent and simulate
## observations for two groups of samples with differences in means
## (4 vs 5) in the first 100 genes
set.seed(123)
y_c1 <- rmvnorm(n=20, mean=rep(4, p), sigma=diag(p))
y_c2 <- rmvnorm(n=20, mean=c(rep(5, pDE), rep(4, p-pDE)), sigma=diag(p))
y <- t(rbind(y_c1, y_c2))
rownames(y) <- paste0("g", 1:p)
dim(y)
[1] 1000   40

## find DE genes using limma
design <- cbind(grp1=1, grp2vs1=c(rep(0, 20), rep(1, 20)))
fit <- lmFit(y, design)
fit <- eBayes(fit)
summary(decideTests(fit)) ## number DE genes at 5% FDR
    grp1 grp2vs1
-1    0       0
0     0     953
1  1000      47
tt <- topTable(fit, coef=2, p.value=0.05, n=Inf)

## how many of the previous 47 genes are truly DE?
sum(as.integer(gsub("g", "", tt$ID)) <= 100)
[1] 47

## now let's simulate the setting with dependence among genes

library(graph)   ## for randomEgraph()
library(qpgraph) ## for qpG2Sigma()

set.seed(123)

## simulate an E-R random graph with 5% density
G <- randomEGraph(V=as.character(1:p), p=0.05)

## simulate a covariance matrix whose inverse contains zeroes
## on the row and column indexed by missing edges in G and
## edges have a mean marginal Pearson correlation rho=0.5
set.seed(123)
Sigma <- qpG2Sigma(G, rho=0.5, verbose=TRUE) ## some 2 or 3 minutes

## just check that we get those 0s in the inverse covariance
## for the missing edges of G ...
invSigma <- solve(Sigma)
mean(invSigma[upper.tri(invSigma) & as(G, "matrix") == 0])
[1] -4.836399e-07

## ... and the mean marginal correlation is rho=0.5 as specified
## in the call to qpG2Sigma()
mean(cov2cor(as.matrix(Sigma))[upper.tri(Sigma) & as(G, "matrix")])
[1] 0.501911

## simulate data again under this pattern of dependences and the
## previous grouping of samples and DE gene distribution
set.seed(123)
y_c1 <- rmvnorm(n=20, mean=rep(4, p), sigma=as.matrix(Sigma))
y_c2 <- rmvnorm(n=20, mean=c(rep(5, pDE), rep(4, p-pDE)), 
sigma=as.matrix(Sigma))
y <- t(rbind(y_c1, y_c2))
rownames(y) <- paste0("g", 1:p)
dim(y)
[1] 1000   40

## find DE genes using limma
design <- cbind(grp1=1, grp2vs1=c(rep(0, 20), rep(1, 20)))
fit <- lmFit(y, design)
fit <- eBayes(fit)
summary(decideTests(fit))
    grp1 grp2vs1
-1    0       0
0     0     995
1  1000       2

i.e., under dependence, a classical DE analysis assuming independence 
among genes can become overly conservative, at least with this very 
straightforward simulated data and this very specific dependence pattern.

robert.

>> The Bonferroni
>> correction controls a more conservative error rate, but also holds under
>> dependence.
>
> Sure, Bonferroni does not assume independence of the test, but it's
> meager power means that many are not even considering this as an
> option (I definitely use it when the test is strictly in hypothesis
> testing mode and not further experiments are planned).
>
> Best regards,
> j.
>
>
>
>>
>> If the sources of dependence are more pervasive, like if there are batch
>> effects:
>>
>> http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
>>
>> Then you can either use the batch correction methods in Limma if, say, you
>> know the date the samples were processed. Or, if you don't know the sources
>> of large scale dependence, you can use the sva package:
>>
>> http://www.bioconductor.org/packages/devel/bioc/html/sva.html
>>
>> which implements the methods described here:
>>
>> http://www.pnas.org/content/early/2008/11/24/0808709105.abstract
>>
>>
>> Best,
>>
>>
>> Jeff
>>
>>
>>
>> On Jul 9, 2012 7:08 AM, "January Weiner"
>> <january.weiner at mpiib-berlin.mpg.de>  wrote:
>>>
>>> Hello,
>>>
>>> statistical methods for assessing significance of differences in
>>> expression assume, correct me if I'm wrong, independence of the tests.
>>> Does anyone have at hand any papers on the performance -- in terms of
>>> type I error -- of methods such as limma / eBayes? I'm sure this issue
>>> has been investigated in depth.
>>>
>>> Kind regards,
>>>
>>> January
>>>
>>> --
>>> -------- Dr. January Weiner 3 --------------------------------------
>>> Max Planck Institute for Infection Biology
>>> Charitéplatz 1
>>> D-10117 Berlin, Germany
>>> Web   : www.mpiib-berlin.mpg.de
>>> Tel     : +49-30-28460514
>>> Fax    : +49-30-28450505
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550