[BioC] edgeR - coefficients in 3-factor experiment, complex contrasts and decideTestsDGE

Tue Oct 18 18:47:05 CEST 2011

Dear Gordon,

thanks a lot for your reply! This helps me so much to think more deeply
and focused about the kind of analyses I want to conduct on my dataset.
Excuse me for going a bit into the biology and my conceptions of how
I want to tackle some questions with this RNA-seq dataset. If you feel that
some of my point are related to capabilities of edgeR and the way it/you
approach glms in genomics I would be very happy about further comments.

If I go to much in detail feel free to skip to a more specific
question at the end
of this mail!

On 18/10/2011, Gordon K Smyth <smyth at wehi.edu.au> wrote:
> Dear Emanuel,
>
>> From: Emanuel Heitlinger <emanuelheitlinger at googlemail.com>
>> To: bioconductor at r-project.org
>> Subject: [BioC] edgeR - coefficients in 3-factor experiment,	complex
>> 	contrasts and decideTestsDGE
>>
>> Hi all,
>>
>> I have some questions regarding multi-factor-glms in edgeR. I am working
>> on a RNA-seq experiment:
>>
>> I have 24 samples from 3 "treatments" each having two levels. This means
>> 3 biological replicates per treatment combination. I want to model the
>> full set of possible interactions (sex.conds*eel.conds*pop.conds), as
>> expecially two-fold interactions could be very interetsing for my
>> research-question.
>>
>> I want to categorize genes (contigs form a 454-transcriptome assembly,
>> genome is unsequenced) I mapped my reads against into categoris:
>> a) only different for sex
>> b) only different for eel
>> c) only different for pop
>> d1)d2)d3) different for sex:eel, sex:pop, eel:pop
>
> I am always a bit uneasy about the use of factorial models in the context
> of genomic research.  As a statistician, I've used factorial models all my
> working life, but the hierarchical hypotheses implied by two and three
> level interactions in a three factor model don't seem to me to correspond
> to scientific questions that one would want to ask in genomic research. I
> have to admit that is why I haven't made it particularly easy to input
> factorial models into limma or edgeR.  In your case, I'd probably feel
> more comfortable making direct contrasts between your eight distinct
> groups.

I hope this is not to hard to reconcile with my approach to the whole
project. I had
some training in ecology and want to make bringing this together with
"questions one
would want to ask in genomic research"  the centre of my work during
the next years.

The experiment I conducted was a "cross infection experiment under common
garden conditions".
The main focus is to test evolutionary divergence of two pop[ulations]
of parasites
in two species of eel[-hosts]. PopEU has accquired eelAa as a new host, upon
Introduction from Asia, where popTW is still found in the native host
eelAj. PopEU
since then spent ~90 generations in its new host eelAa.
So my main focus is on differences in populations: Evolutionary divergence of
gene-expression. I have a specific question on this at the end of this mail...

I obtained data for both sexes of the parasite mainly as a backup to
have something to
work on and publish if the much more interesting populations and host
comparisons
would not pan out (and I also felt better having de-facto 12replicates
for pop and eel).

>From my gut-feeling I want to model all the data at once and include
sex and its
interactions to have more power while focusing on pop. It could
however be that the
differences in sex are so abundant (~5000 of 40000 contigs differing -
p.adjust <0.05 - for sex and its interactions) that it would make more
sense to do
separate modeling for the 12 samples from each sex. Difference for pop were in
only ~150 contigs)...?

>
> I wonder what you will do with genes that show a significant 3-way
> interaction?  In the factorial model framework, there is nothing you can
> do with such genes, no further hypotheses you can test.  You would have to
> put them into all of your categories d1, d2 and d3.
>

The 3 way interaction could be probably left out. I am not too worried
about the reduced
sensitivity of the analysis so missing some genes behaving differently
according to all
three factors will be probably okay...

> Anyway, I'll try to answer your questions directly.  Thanks for giving a
> code example.  Let's say that you want to find genes whose expression does
> or does not depend on eel or pop in any way.  If not, then sex would be
> the only predictor.  In your example, sex in the second column of the
> design matrix:
>
>    > colnames(design)
>    [1] "(Intercept)"               "sex.condsmale"
>    [3] "eel.condsAj"               "pop.condsTW"
>    [5] "sex.condsmale:eel.condsAj" "sex.condsmale:pop.condsTW"
>    [7] "eel.condsAj:pop.condsTW"   "sex.condsmale:eel.condsAj:pop.condsTW"
>
> So you can use:
>
>    d <- DGEList(y)
>    d <- estimateGLMCommonDisp(d, design=design)
>    fit <- glmFit(d, design)
>    lrt <- glmLRT(d, fit, coef=3:8)
>    topTags(lrt)
>
> This does a test on 6 degrees of freedom.  Genes that appear in the
> toptable do depend on eel or pop either in a main effect or in an
> interaction.
>
> Some of your questions are composite questions that don't have simple
> statistical answers even for univariate data.  For example, when you want
> genes that depend only on sex, you want sex to be significant as main
> effect but not the test given above.  Traditional statistics has a bit of
> difficulty with testing whether a difference is not present.  I think
> you'll have to make your own ad hoc judgement as to what constitutes
> non-significance, then make up a truth vector yourself for each hypothesis
> you want to test, then find overlaps between the results yourself, rather
> than using decideTestsDGE.
>

I hope that I don't rely on inference about absence of differences and
try to view the
categories rather as highlighting genes at different level of interest:

Everything to do with pop differences is highly interesting, twoway
pop:eel could be
even more interesting as it could be interpreted as an evolution
towards the optimal
expression levels in the particular host.
(Aj/TW and Aa/EU are the species/host associations found in nature).

Aj   Aa       Aj     Aa         :   host
EU  EU      TW   TW         :  population
a     a          b      b          :  simple diverged expression seen in "pop"
a     b          b      a          :  "adapted expression" seen in pop:eel
where a and b are replaced by "up" or "down" in a row.

This seems doable, leaving sex in the model and just ignoring it for
these comparisons.
To establish a clean cut between pop-only and pop:eel is not necessary.

So  after identifying
lrt <- glmLRT(d,glmfit,coef=grep("pop",colnames(design))) ## thanks
for the update!
## and something like
subset(topTags(lrt, n=40000), table$adj.P.Val<0.05])

I will do some checking to see that logFC, fitted values and original
values are
following the pattern above and end with some candidate genes for "diverged
expression" and "adapted expression".

Here is the hopefully specific question:

Would there be a way to test my basic asumption of divergence of
gene-expression. I.e.
to test whether the extend of changes in expression betweeen
populations -across genes-
is bigger than what is expected by change?
I am asking because when I use my replicates as a pseudo-condition
(and leave out
pop for it), there are up to half as many false-positives between
replicates found than between pops.
Would it be valid to obtain p.values, adjusted.p.values, or logFC from
permutations of replicates
used as pseudo-condition and then compare this to the p.values,
adjusted.p.values and logFC
from the real model?
Or would it be better to look at the size of the sets identified as DE
by pseudo vs. pop conditions?

Thank you very much for time,
Emanuel

>> In glmLRT giving simple coefficients would compare the complete model to
>> a model removing one coefficient at a time. From application of glms in
>> ecology I remember that an interaction effect should not be left in the
>> model if the main effect is removed. Does this apply here? Should I
>> compare the full model against e.g. the model minus pop, sex:pop,
>> eel:pop and sex:eel:pop, when I want to remove condition "pop" from the
>> model?
>
> Yes, that's true.  To test whether pop makes any contribution to
> expression changes, you need:
>
>    lrt <- glmFit(d,design,coef=grep("pop",colnames(design)))
>
> etc.  That does a test on 4 degrees of freedom.
>
> Best wishes
> Gordon
>
>> Hope this code demonstrates what I mean:
>>
>> ## CODE start
>> library(edgeR)
>>
>> ## generate a df of neg. bionm. counts
>> y <- as.data.frame(matrix(rnbinom(6000,mu=10,size=10),ncol=24))
>> names(y) <- c("AA_R11M", "AA_R16M", "AA_R18F", "AA_R28F",
>>              "AA_R2M",  "AA_R8F",  "AA_T12F", "AA_T20F",
>>              "AA_T24M", "AA_T3M", "AA_T42M", "AA_T45F",
>>              "AJ_R1F",  "AJ_R1M",  "AJ_R3F",  "AJ_R3M",
>>              "AJ_R5F",  "AJ_R5M",  "AJ_T19M", "AJ_T20M",
>>              "AJ_T25M", "AJ_T26F", "AJ_T5F",  "AJ_T8F")
>>
>> sex.conds <- factor(ifelse(grepl("M$", names(y)), "male", "female" ))
>> eel.conds <- factor(ifelse(grepl("^AA", names(y)), "Aa", "Aj" ))
>> pop.conds <- factor(ifelse(grepl("\\w\\w_R.*", names(y)), "EU", "TW" ))
>>
>> design <- model.matrix(~sex.conds*eel.conds*pop.conds)
>>
>> ## Counts frame to full DGEList
>> d <- DGEList(y, lib.size=colSums(y))
>> d <- calcNormFactors(d)
>> d <- estimateGLMCommonDisp(d, design=design)
>> d <- estimateGLMTrendedDisp(d, design=design)
>> d <- estimateGLMTagwiseDisp(d, design=design)
>>
>> glmfit <- glmFit(d, design, dispersion=d$tagwise.dispersion)
>>
>> glm.wrapper <- function(de.obj, fit.obj, conds.regex){
>>  glm.list <- list()
>>  coe <- names(as.data.frame(fit.obj$design))
>>  coe.l <- lapply(conds.regex, function (x) grep(x, coe))
>>  for (i in 1:length(conds.regex)){
>>    glm.list[[conds.regex[[i]]]] <- glmLRT(de.obj, fit.obj,
>> coef=coe.l[[i]])
>>  }
>>  return(glm.list)
>> }
>>
>> ## selects all coefficients being contained in each other hierachically
>> combi.conds <- gsub(":", ".*", names(as.data.frame(glmfit$design))[2:8])
>> glm.l <- glm.wrapper(d, glmfit, combi.conds)
>>
>> ## show what is compared
>> lapply(glm.l, function (x) x$comparison)
>>
>> ## topTags works (same as using p.adjust directly)
>> topTags.l <- lapply(glm.l, function (x){
>>  tt <- topTags(x, n=40000) ## set as high as the length
>>  tt[tt$table$adj.P.Val<0.05] ## get only below adj.P
>> })
>>
>> ## Code End
>>
>> Then I would look through the topTags list to categorize the contigs
>> as stated above. E.g. from topTags.l[[1]] I would take only those not
>> in topTags.l[[c(4, 5, 7]] to get category a) stated above, from
>> topTags.l[[4]] only those not in topTagl.l[[7]] to get d1. This seems
>> all a bit complicated to me, is this a correct way of doing this?
>>
>> I am alos a bit worried that decideTestsDGE seems to not work on
>> DGELRT-objects with complicated coefficients. Eg.
>>
>> ## Code Start
>> ## decideTestsDGE does not work
>> decideTestsDGE.l <- lapply(glm.l, function (x){
>>   subset(x$table, (decideTestsDGE(x, p.value = .05))!=0)})
>> ## Code End
>>
>> I saw that for simple coefficients the results of decideTestsDGE
>> differ from topTags. Is this expected, what is the difference between
>> the two, thought both do p-value adjustment the same way (I could
>> provide code if these differenced would not be the expected
>> behaviour)?
>>
>> These are my questions for now. I would be very greatful for help!
>>
>> All the Best,
>> Emanuel
>>
>>
>> sessionInfo()
>> R version 2.13.0 (2011-04-13)
>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8
>> [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>> LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> LC_ADDRESS=C
>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] splines   stats     graphics  grDevices utils     datasets
>> methods   base
>>
>> other attached packages:
>> [1] edgeR_2.2.6
>>
>> loaded via a namespace (and not attached):
>> [1] limma_3.8.3  tools_2.13.0
>>
>> --
>> Emanuel Heitlinger
>>
>> Karlsruhe Institute of Technology (KIT)
>> Zoological Institute; Ecology/Parasitology
>> Kornblumenstr. 13
>> 76131 Karlsruhe
>> Germany
>> Telephone +49 (0)721-608 47654
>>
>> or
>> University of Edinburgh
>> Institute of Evolutionary Biology
>> Kings Buildings, Ashworth Laboratories, West Mains Road
>> Edinburgh EH9 3JT
>> Scotland, UK
>> Telephone:+44 (0)131-650 7403
>
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the
> addressee.
> You must not disclose, forward, print or use it without the permission of
> the sender.
> ______________________________________________________________________
>

-- 
Emanuel Heitlinger

Karlsruhe Institute of Technology (KIT)
Zoological Institute; Ecology/Parasitology
Kornblumenstr. 13
76131 Karlsruhe
Germany
Telephone +49 (0)721-608 47654

or
University of Edinburgh
Institute of Evolutionary Biology
Kings Buildings, Ashworth Laboratories, West Mains Road
Edinburgh EH9 3JT
Scotland, UK
Telephone:+44 (0)131-650 7403