[BioC] edgeR - coefficients in 3-factor experiment, complex contrasts and decideTestsDGE

Gordon K Smyth smyth at wehi.EDU.AU
Tue Oct 18 01:19:12 CEST 2011


Dear Emanuel,

> From: Emanuel Heitlinger <emanuelheitlinger at googlemail.com>
> To: bioconductor at r-project.org
> Subject: [BioC] edgeR - coefficients in 3-factor experiment,	complex
> 	contrasts and decideTestsDGE
>
> Hi all,
>
> I have some questions regarding multi-factor-glms in edgeR. I am working 
> on a RNA-seq experiment:
>
> I have 24 samples from 3 "treatments" each having two levels. This means 
> 3 biological replicates per treatment combination. I want to model the 
> full set of possible interactions (sex.conds*eel.conds*pop.conds), as 
> expecially two-fold interactions could be very interetsing for my 
> research-question.
>
> I want to categorize genes (contigs form a 454-transcriptome assembly,
> genome is unsequenced) I mapped my reads against into categoris:
> a) only different for sex
> b) only different for eel
> c) only different for pop
> d1)d2)d3) different for sex:eel, sex:pop, eel:pop

I am always a bit uneasy about the use of factorial models in the context 
of genomic research.  As a statistician, I've used factorial models all my 
working life, but the hierarchical hypotheses implied by two and three 
level interactions in a three factor model don't seem to me to correspond 
to scientific questions that one would want to ask in genomic research. I 
have to admit that is why I haven't made it particularly easy to input 
factorial models into limma or edgeR.  In your case, I'd probably feel 
more comfortable making direct contrasts between your eight distinct 
groups.

I wonder what you will do with genes that show a significant 3-way 
interaction?  In the factorial model framework, there is nothing you can 
do with such genes, no further hypotheses you can test.  You would have to 
put them into all of your categories d1, d2 and d3.

Anyway, I'll try to answer your questions directly.  Thanks for giving a 
code example.  Let's say that you want to find genes whose expression does 
or does not depend on eel or pop in any way.  If not, then sex would be 
the only predictor.  In your example, sex in the second column of the 
design matrix:

   > colnames(design)
   [1] "(Intercept)"               "sex.condsmale"
   [3] "eel.condsAj"               "pop.condsTW"
   [5] "sex.condsmale:eel.condsAj" "sex.condsmale:pop.condsTW"
   [7] "eel.condsAj:pop.condsTW"   "sex.condsmale:eel.condsAj:pop.condsTW"

So you can use:

   d <- DGEList(y)
   d <- estimateGLMCommonDisp(d, design=design)
   fit <- glmFit(d, design)
   lrt <- glmLRT(d, fit, coef=3:8)
   topTags(lrt)

This does a test on 6 degrees of freedom.  Genes that appear in the 
toptable do depend on eel or pop either in a main effect or in an 
interaction.

Some of your questions are composite questions that don't have simple 
statistical answers even for univariate data.  For example, when you want 
genes that depend only on sex, you want sex to be significant as main 
effect but not the test given above.  Traditional statistics has a bit of 
difficulty with testing whether a difference is not present.  I think 
you'll have to make your own ad hoc judgement as to what constitutes 
non-significance, then make up a truth vector yourself for each hypothesis 
you want to test, then find overlaps between the results yourself, rather 
than using decideTestsDGE.

> In glmLRT giving simple coefficients would compare the complete model to 
> a model removing one coefficient at a time. From application of glms in 
> ecology I remember that an interaction effect should not be left in the 
> model if the main effect is removed. Does this apply here? Should I 
> compare the full model against e.g. the model minus pop, sex:pop, 
> eel:pop and sex:eel:pop, when I want to remove condition "pop" from the 
> model?

Yes, that's true.  To test whether pop makes any contribution to 
expression changes, you need:

   lrt <- glmFit(d,design,coef=grep("pop",colnames(design)))

etc.  That does a test on 4 degrees of freedom.

Best wishes
Gordon

> Hope this code demonstrates what I mean:
>
> ## CODE start
> library(edgeR)
>
> ## generate a df of neg. bionm. counts
> y <- as.data.frame(matrix(rnbinom(6000,mu=10,size=10),ncol=24))
> names(y) <- c("AA_R11M", "AA_R16M", "AA_R18F", "AA_R28F",
>              "AA_R2M",  "AA_R8F",  "AA_T12F", "AA_T20F",
>              "AA_T24M", "AA_T3M", "AA_T42M", "AA_T45F",
>              "AJ_R1F",  "AJ_R1M",  "AJ_R3F",  "AJ_R3M",
>              "AJ_R5F",  "AJ_R5M",  "AJ_T19M", "AJ_T20M",
>              "AJ_T25M", "AJ_T26F", "AJ_T5F",  "AJ_T8F")
>
> sex.conds <- factor(ifelse(grepl("M$", names(y)), "male", "female" ))
> eel.conds <- factor(ifelse(grepl("^AA", names(y)), "Aa", "Aj" ))
> pop.conds <- factor(ifelse(grepl("\\w\\w_R.*", names(y)), "EU", "TW" ))
>
> design <- model.matrix(~sex.conds*eel.conds*pop.conds)
>
> ## Counts frame to full DGEList
> d <- DGEList(y, lib.size=colSums(y))
> d <- calcNormFactors(d)
> d <- estimateGLMCommonDisp(d, design=design)
> d <- estimateGLMTrendedDisp(d, design=design)
> d <- estimateGLMTagwiseDisp(d, design=design)
>
> glmfit <- glmFit(d, design, dispersion=d$tagwise.dispersion)
>
> glm.wrapper <- function(de.obj, fit.obj, conds.regex){
>  glm.list <- list()
>  coe <- names(as.data.frame(fit.obj$design))
>  coe.l <- lapply(conds.regex, function (x) grep(x, coe))
>  for (i in 1:length(conds.regex)){
>    glm.list[[conds.regex[[i]]]] <- glmLRT(de.obj, fit.obj, coef=coe.l[[i]])
>  }
>  return(glm.list)
> }
>
> ## selects all coefficients being contained in each other hierachically
> combi.conds <- gsub(":", ".*", names(as.data.frame(glmfit$design))[2:8])
> glm.l <- glm.wrapper(d, glmfit, combi.conds)
>
> ## show what is compared
> lapply(glm.l, function (x) x$comparison)
>
> ## topTags works (same as using p.adjust directly)
> topTags.l <- lapply(glm.l, function (x){
>  tt <- topTags(x, n=40000) ## set as high as the length
>  tt[tt$table$adj.P.Val<0.05] ## get only below adj.P
> })
>
> ## Code End
>
> Then I would look through the topTags list to categorize the contigs
> as stated above. E.g. from topTags.l[[1]] I would take only those not
> in topTags.l[[c(4, 5, 7]] to get category a) stated above, from
> topTags.l[[4]] only those not in topTagl.l[[7]] to get d1. This seems
> all a bit complicated to me, is this a correct way of doing this?
>
> I am alos a bit worried that decideTestsDGE seems to not work on
> DGELRT-objects with complicated coefficients. Eg.
>
> ## Code Start
> ## decideTestsDGE does not work
> decideTestsDGE.l <- lapply(glm.l, function (x){
>   subset(x$table, (decideTestsDGE(x, p.value = .05))!=0)})
> ## Code End
>
> I saw that for simple coefficients the results of decideTestsDGE
> differ from topTags. Is this expected, what is the difference between
> the two, thought both do p-value adjustment the same way (I could
> provide code if these differenced would not be the expected
> behaviour)?
>
> These are my questions for now. I would be very greatful for help!
>
> All the Best,
> Emanuel
>
>
> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets
> methods   base
>
> other attached packages:
> [1] edgeR_2.2.6
>
> loaded via a namespace (and not attached):
> [1] limma_3.8.3  tools_2.13.0
>
> -- 
> Emanuel Heitlinger
>
> Karlsruhe Institute of Technology (KIT)
> Zoological Institute; Ecology/Parasitology
> Kornblumenstr. 13
> 76131 Karlsruhe
> Germany
> Telephone +49 (0)721-608 47654
>
> or
> University of Edinburgh
> Institute of Evolutionary Biology
> Kings Buildings, Ashworth Laboratories, West Mains Road
> Edinburgh EH9 3JT
> Scotland, UK
> Telephone:+44 (0)131-650 7403

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list