[BioC] DESeq for two-factor models

Tue Mar 27 15:06:58 CEST 2012

Hi all,

i have some questions regarding a two-factor-glm using DESeq. The design of my study has two factors, genotype and treatment. Genotype has three levels (A,B,C), "treatment" has two levels ("control", "stress"). 

I am interested in the following questions:

1. Which genes are affected by "treatment"?
2. Which genes are affected by "genotype"?
3. Which genes are affected by both, "treatment" and "genotype"?
4. Which genes are affected by an interaction of "treatment" and "genotype"?

Pairwise comparisons:
5. Which genes are affected by "treatment" in a particular genotype?
6. Which genes are affected by "genotype" in a particular treatment

There are some posts concerning similar questions in the archive, e.g.
https://stat.ethz.ch/pipermail/bioconductor/2012-January/042823.html

However, it is still not clear to me how to correctly address these questions using DESeq.

Let me delineate my problems using some code:

###### Start Code
library(DESeq)

# Generate a dataframe of neg.binom counts
set.seed(50)
s <- as.data.frame(matrix(rnbinom(750,mu=20,size=20),ncol=3))
u <- as.data.frame(matrix(rnbinom(750,mu=30,size=25),ncol=3))
v <- as.data.frame(matrix(rnbinom(750,mu=40,size=20),ncol=3))
w <- as.data.frame(matrix(rnbinom(750,mu=50,size=25),ncol=3))
x <- as.data.frame(matrix(rnbinom(750,mu=60,size=20),ncol=3))
y <- as.data.frame(matrix(rnbinom(750,mu=70,size=25),ncol=3))
z <- as.data.frame(cbind(s,u,v,w,x,y))
names(z) <- c("A_Con1", "A_Con2","A_Con3","A_Str1", "A_Str2","A_Str3", 
"B_Con1","B_Con2","B_Con3","B_Str1","B_Str2","B_Str3",
"C_Con1","C_Con2","C_Con3","C_Str1","C_Str2","C_Str3")

# Design matrix
genotype <- c(rep("A",6),rep("B",6),rep("C",6))
treatment <-rep(c(rep("Control",3),rep("Stress",3)),3)

Design <- as.data.frame(cbind(genotype,treatment),row.names=colnames(z))  

# Creating the CountDataSet-Object
CDS <- newCountDataSet(z,Design)

### Calculating size factors
CDS <- estimateSizeFactors(CDS)

### Estimate dispersion
CDS <- estimateDispersions(CDS,method="pooled")

### Models
fit0 <- fitNbinomGLMs(CDS,count~1)
fit1 <- fitNbinomGLMs(CDS,count~treatment)
fit2 <- fitNbinomGLMs(CDS,count~genotype)
fit3 <- fitNbinomGLMs(CDS,count~treatment+genotype)
fit4 <- fitNbinomGLMs(CDS,count~treatment*genotype)

### Inference
#
# 1. Which genes are affected by treatment?

Pvals_Trt <- nbinomGLMTest(fit1,fit0)
Padj_Trt <- p.adjust(Pvals_Trt,method="BH")
tmp1 <- Padj_Trt < 0.3 # arbitrary value to get some affected genes
length(Padj_Trt[tmp1==TRUE])
# There is one gene affected by treatment

# 2. Which genes are affected by genotype?

Pvals_gen <- nbinomGLMTest(fit2,fit0)
Padj_gen <- p.adjust(Pvals_gen,method="BH")
tmp2 <- Padj_gen < 0.3                     
length(Padj_gen[tmp2==TRUE])
# 2 genes are affected by treatment

# 3. Which genes are affected by both, treatment and genotype?

#In this case, the full model is "fit3", but what is the correct reduced model, #"fit0", "fit1" or "fit2"?

#Since I expect a stronger effect of "treatment" compared to "genotype" in my #real data, i decided to go on with to test for an additional effect of #"genotype" on genes affected by "treatment":

Pvals <- nbinomGLMTest(fit3,fit1)
Padj <- p.adjust(Pvals,method="BH")
tmp3 <- Padj < 0.3                     
length(Padj[tmp3==TRUE])
# for 2 genes, there is an additional effect of "genotype"

# 4. Which genes are affected by an interaction of "treatment" and "genotype"?
# Intuitively, i would test for this by comparing "fit3" and "fit4", but again # I am not absolutely sure that "fit3" is the 
# correct choice for the reduced model?

Pvals_int <- nbinomGLMTest(fit4,fit3)
Padj_int <- p.adjust(Pvals_int,method="BH")
tmp4 <- Padj_int < 0.3                     
length(Padj_int[tmp4==TRUE])
# no gene is affected by an interaction of both factors

sessionInfo()
#R version 2.14.0 (2011-10-31)
#Platform: x86_64-pc-linux-gnu (64-bit)
#
#locale:
# [1] LC_CTYPE=de_DE.UTF-8          LC_NUMERIC=C                 
# [3] LC_TIME=de_DE.UTF-8           LC_COLLATE=de_DE.UTF-8       
# [5] LC_MONETARY=de_DE.UTF-8       LC_MESSAGES=de_DE.UTF-8      
# [7] LC_PAPER=de_DE.UTF-8          LC_NAME=de_DE.UTF-8          
# [9] LC_ADDRESS=de_DE.UTF-8        LC_TELEPHONE=de_DE.UTF-8     
#[11] LC_MEASUREMENT=de_DE.UTF-8    LC_IDENTIFICATION=de_DE.UTF-8
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
#[1] DESeq_1.6.1    locfit_1.5-6   lattice_0.20-0 akima_0.5-7    Biobase_2.14.0
#[6] JGR_1.7-9      iplots_1.1-4   JavaGD_0.5-5   rJava_0.9-3   
#
#loaded via a namespace (and not attached):
# [1] annotate_1.32.1       AnnotationDbi_1.16.18 DBI_0.2-5            
# [4] genefilter_1.36.0     geneplotter_1.32.1    grid_2.14.0          
# [7] IRanges_1.12.6        RColorBrewer_1.0-5    RSQLite_0.11.1       
#[10] splines_2.14.0        survival_2.36-12      tools_2.14.0         
#[13] xtable_1.7-0         

##### END code

Now, since both factors have an effect on some genes, i would like to know: 
5. Which genes are affected by treatment in a particular genotype, say genotype "A"?

In modelling approach, this question is addressed by selecting contrasts or coefficient from the model matrix. In the
vignette of DESeq, there is an example of a two-factorial design, but it is not clear to me whether DESeq has implemented an option to 
specify contrasts.
If this is not the case, then the only way to answer question number 5 would be to run DESeq genotype-wise and check for treatment-effects
in one-factor models (separate models per genotype)?

Thanks for your comments and help,

Henning 

----------------------------------------------
Dr. Henning Wildhagen

Chair of Tree Physiology, University of Freiburg, Germany

-- 

Jetzt informieren: http://mobile.1und1.de/?ac=OM.PW.PW003K20328T7073a