[BioC] edgeR - coefficients in 3-factor experiment, complex contrasts and decideTestsDGE

Sun Oct 16 13:39:09 CEST 2011

Hi all,

I have some questions regarding multi-factor-glms in edgeR. I am
working on a RNA-seq experiment:

I have 24 samples from 3 "treatments" each having two levels. This
means 3 biological replicates per treatment combination. I want to
model the full set of possible interactions
(sex.conds*eel.conds*pop.conds), as expecially two-fold interactions
could be very interetsing for my research-question.

I want to categorize genes (contigs form a 454-transcriptome assembly,
genome is unsequenced) I mapped my reads against into categoris:
a) only different for sex
b) only different for eel
c) only different for pop
d1)d2)d3) different for sex:eel, sex:pop, eel:pop

In glmLRT giving simple coefficients would compare the complete model
to a model removing one coefficient at a time. From application of
glms in ecology I remember that an interaction effect should not be
left in the model if the main effect is removed. Does this apply here?
Should I compare the full model against e.g. the model minus pop,
sex:pop, eel:pop and sex:eel:pop, when I want to remove condition
"pop" from the model?

Hope this code demonstrates what I mean:

## CODE start
library(edgeR)

## generate a df of neg. bionm. counts
y <- as.data.frame(matrix(rnbinom(6000,mu=10,size=10),ncol=24))
names(y) <- c("AA_R11M", "AA_R16M", "AA_R18F", "AA_R28F",
              "AA_R2M",  "AA_R8F",  "AA_T12F", "AA_T20F",
              "AA_T24M", "AA_T3M", "AA_T42M", "AA_T45F",
              "AJ_R1F",  "AJ_R1M",  "AJ_R3F",  "AJ_R3M",
              "AJ_R5F",  "AJ_R5M",  "AJ_T19M", "AJ_T20M",
              "AJ_T25M", "AJ_T26F", "AJ_T5F",  "AJ_T8F")

sex.conds <- factor(ifelse(grepl("M$", names(y)), "male", "female" ))
eel.conds <- factor(ifelse(grepl("^AA", names(y)), "Aa", "Aj" ))
pop.conds <- factor(ifelse(grepl("\\w\\w_R.*", names(y)), "EU", "TW" ))

design <- model.matrix(~sex.conds*eel.conds*pop.conds)

## Counts frame to full DGEList
d <- DGEList(y, lib.size=colSums(y))
d <- calcNormFactors(d)
d <- estimateGLMCommonDisp(d, design=design)
d <- estimateGLMTrendedDisp(d, design=design)
d <- estimateGLMTagwiseDisp(d, design=design)

glmfit <- glmFit(d, design, dispersion=d$tagwise.dispersion)

glm.wrapper <- function(de.obj, fit.obj, conds.regex){
  glm.list <- list()
  coe <- names(as.data.frame(fit.obj$design))
  coe.l <- lapply(conds.regex, function (x) grep(x, coe))
  for (i in 1:length(conds.regex)){
    glm.list[[conds.regex[[i]]]] <- glmLRT(de.obj, fit.obj, coef=coe.l[[i]])
  }
  return(glm.list)
}

## selects all coefficients being contained in each other hierachically
combi.conds <- gsub(":", ".*", names(as.data.frame(glmfit$design))[2:8])
glm.l <- glm.wrapper(d, glmfit, combi.conds)

## show what is compared
lapply(glm.l, function (x) x$comparison)

## topTags works (same as using p.adjust directly)
topTags.l <- lapply(glm.l, function (x){
  tt <- topTags(x, n=40000) ## set as high as the length
  tt[tt$table$adj.P.Val<0.05] ## get only below adj.P
})

## Code End

Then I would look through the topTags list to categorize the contigs
as stated above. E.g. from topTags.l[[1]] I would take only those not
in topTags.l[[c(4, 5, 7]] to get category a) stated above, from
topTags.l[[4]] only those not in topTagl.l[[7]] to get d1. This seems
all a bit complicated to me, is this a correct way of doing this?

I am alos a bit worried that decideTestsDGE seems to not work on
DGELRT-objects with complicated coefficients. Eg.

## Code Start
## decideTestsDGE does not work
decideTestsDGE.l <- lapply(glm.l, function (x){
   subset(x$table, (decideTestsDGE(x, p.value = .05))!=0)})
## Code End

I saw that for simple coefficients the results of decideTestsDGE
differ from topTags. Is this expected, what is the difference between
the two, thought both do p-value adjustment the same way (I could
provide code if these differenced would not be the expected
behaviour)?

These are my questions for now. I would be very greatful for help!

All the Best,
Emanuel

sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets
methods   base

other attached packages:
[1] edgeR_2.2.6

loaded via a namespace (and not attached):
[1] limma_3.8.3  tools_2.13.0

-- 
Emanuel Heitlinger

Karlsruhe Institute of Technology (KIT)
Zoological Institute; Ecology/Parasitology
Kornblumenstr. 13
76131 Karlsruhe
Germany
Telephone +49 (0)721-608 47654

or
University of Edinburgh
Institute of Evolutionary Biology
Kings Buildings, Ashworth Laboratories, West Mains Road
Edinburgh EH9 3JT
Scotland, UK
Telephone:+44 (0)131-650 7403