[BioC] edgeR - coefficients in 3-factor experiment, complex contrasts and decideTestsDGE

Wed Oct 19 04:33:57 CEST 2011

Hi Emanuel,

I don't have time for a longer reply, but one way to see overall 
differences between samples is to use plotMDS(): distances on the plot are 
biological coefficients of variation between samples.

I don't see the need to treat replicates as pseudo samples, or necessarily 
to do permutations.  The regular analysis already tries to assess 
differences between populations relative to variation between replicates.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Tue, 18 Oct 2011, Emanuel Heitlinger wrote:

> Dear Gordon,
>
> thanks a lot for your reply! This helps me so much to think more deeply
> and focused about the kind of analyses I want to conduct on my dataset.
> Excuse me for going a bit into the biology and my conceptions of how
> I want to tackle some questions with this RNA-seq dataset. If you feel that
> some of my point are related to capabilities of edgeR and the way it/you
> approach glms in genomics I would be very happy about further comments.
>
> If I go to much in detail feel free to skip to a more specific
> question at the end
> of this mail!
>
> On 18/10/2011, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>> Dear Emanuel,
>>
>>> From: Emanuel Heitlinger <emanuelheitlinger at googlemail.com>
>>> To: bioconductor at r-project.org
>>> Subject: [BioC] edgeR - coefficients in 3-factor experiment,	complex
>>> 	contrasts and decideTestsDGE
>>>
>>> Hi all,
>>>
>>> I have some questions regarding multi-factor-glms in edgeR. I am working
>>> on a RNA-seq experiment:
>>>
>>> I have 24 samples from 3 "treatments" each having two levels. This means
>>> 3 biological replicates per treatment combination. I want to model the
>>> full set of possible interactions (sex.conds*eel.conds*pop.conds), as
>>> expecially two-fold interactions could be very interetsing for my
>>> research-question.
>>>
>>> I want to categorize genes (contigs form a 454-transcriptome assembly,
>>> genome is unsequenced) I mapped my reads against into categoris:
>>> a) only different for sex
>>> b) only different for eel
>>> c) only different for pop
>>> d1)d2)d3) different for sex:eel, sex:pop, eel:pop
>>
>> I am always a bit uneasy about the use of factorial models in the context
>> of genomic research.  As a statistician, I've used factorial models all my
>> working life, but the hierarchical hypotheses implied by two and three
>> level interactions in a three factor model don't seem to me to correspond
>> to scientific questions that one would want to ask in genomic research. I
>> have to admit that is why I haven't made it particularly easy to input
>> factorial models into limma or edgeR.  In your case, I'd probably feel
>> more comfortable making direct contrasts between your eight distinct
>> groups.
>
> I hope this is not to hard to reconcile with my approach to the whole
> project. I had
> some training in ecology and want to make bringing this together with
> "questions one
> would want to ask in genomic research"  the centre of my work during
> the next years.
>
> The experiment I conducted was a "cross infection experiment under common
> garden conditions".
> The main focus is to test evolutionary divergence of two pop[ulations]
> of parasites
> in two species of eel[-hosts]. PopEU has accquired eelAa as a new host, upon
> Introduction from Asia, where popTW is still found in the native host
> eelAj. PopEU
> since then spent ~90 generations in its new host eelAa.
> So my main focus is on differences in populations: Evolutionary divergence of
> gene-expression. I have a specific question on this at the end of this mail...
>
> I obtained data for both sexes of the parasite mainly as a backup to
> have something to
> work on and publish if the much more interesting populations and host
> comparisons
> would not pan out (and I also felt better having de-facto 12replicates
> for pop and eel).
>
> From my gut-feeling I want to model all the data at once and include
> sex and its
> interactions to have more power while focusing on pop. It could
> however be that the
> differences in sex are so abundant (~5000 of 40000 contigs differing -
> p.adjust <0.05 - for sex and its interactions) that it would make more
> sense to do
> separate modeling for the 12 samples from each sex. Difference for pop were in
> only ~150 contigs)...?
>
>>
>> I wonder what you will do with genes that show a significant 3-way
>> interaction?  In the factorial model framework, there is nothing you can
>> do with such genes, no further hypotheses you can test.  You would have to
>> put them into all of your categories d1, d2 and d3.
>>
>
> The 3 way interaction could be probably left out. I am not too worried
> about the reduced
> sensitivity of the analysis so missing some genes behaving differently
> according to all
> three factors will be probably okay...
>
>> Anyway, I'll try to answer your questions directly.  Thanks for giving a
>> code example.  Let's say that you want to find genes whose expression does
>> or does not depend on eel or pop in any way.  If not, then sex would be
>> the only predictor.  In your example, sex in the second column of the
>> design matrix:
>>
>>   > colnames(design)
>>    [1] "(Intercept)"               "sex.condsmale"
>>    [3] "eel.condsAj"               "pop.condsTW"
>>    [5] "sex.condsmale:eel.condsAj" "sex.condsmale:pop.condsTW"
>>    [7] "eel.condsAj:pop.condsTW"   "sex.condsmale:eel.condsAj:pop.condsTW"
>>
>> So you can use:
>>
>>    d <- DGEList(y)
>>    d <- estimateGLMCommonDisp(d, design=design)
>>    fit <- glmFit(d, design)
>>    lrt <- glmLRT(d, fit, coef=3:8)
>>    topTags(lrt)
>>
>> This does a test on 6 degrees of freedom.  Genes that appear in the
>> toptable do depend on eel or pop either in a main effect or in an
>> interaction.
>>
>> Some of your questions are composite questions that don't have simple
>> statistical answers even for univariate data.  For example, when you want
>> genes that depend only on sex, you want sex to be significant as main
>> effect but not the test given above.  Traditional statistics has a bit of
>> difficulty with testing whether a difference is not present.  I think
>> you'll have to make your own ad hoc judgement as to what constitutes
>> non-significance, then make up a truth vector yourself for each hypothesis
>> you want to test, then find overlaps between the results yourself, rather
>> than using decideTestsDGE.
>>
>
> I hope that I don't rely on inference about absence of differences and
> try to view the
> categories rather as highlighting genes at different level of interest:
>
> Everything to do with pop differences is highly interesting, twoway
> pop:eel could be
> even more interesting as it could be interpreted as an evolution
> towards the optimal
> expression levels in the particular host.
> (Aj/TW and Aa/EU are the species/host associations found in nature).
>
> Aj   Aa       Aj     Aa         :   host
> EU  EU      TW   TW         :  population
> a     a          b      b          :  simple diverged expression seen in "pop"
> a     b          b      a          :  "adapted expression" seen in pop:eel
> where a and b are replaced by "up" or "down" in a row.
>
> This seems doable, leaving sex in the model and just ignoring it for
> these comparisons.
> To establish a clean cut between pop-only and pop:eel is not necessary.
>
> So  after identifying
> lrt <- glmLRT(d,glmfit,coef=grep("pop",colnames(design))) ## thanks
> for the update!
> ## and something like
> subset(topTags(lrt, n=40000), table$adj.P.Val<0.05])
>
> I will do some checking to see that logFC, fitted values and original
> values are
> following the pattern above and end with some candidate genes for "diverged
> expression" and "adapted expression".
>
> Here is the hopefully specific question:
>
> Would there be a way to test my basic asumption of divergence of
> gene-expression. I.e.
> to test whether the extend of changes in expression betweeen
> populations -across genes-
> is bigger than what is expected by change?
> I am asking because when I use my replicates as a pseudo-condition
> (and leave out
> pop for it), there are up to half as many false-positives between
> replicates found than between pops.
> Would it be valid to obtain p.values, adjusted.p.values, or logFC from
> permutations of replicates
> used as pseudo-condition and then compare this to the p.values,
> adjusted.p.values and logFC
> from the real model?
> Or would it be better to look at the size of the sets identified as DE
> by pseudo vs. pop conditions?
>
> Thank you very much for time,
> Emanuel
>
>
>>> In glmLRT giving simple coefficients would compare the complete model to
>>> a model removing one coefficient at a time. From application of glms in
>>> ecology I remember that an interaction effect should not be left in the
>>> model if the main effect is removed. Does this apply here? Should I
>>> compare the full model against e.g. the model minus pop, sex:pop,
>>> eel:pop and sex:eel:pop, when I want to remove condition "pop" from the
>>> model?
>>
>> Yes, that's true.  To test whether pop makes any contribution to
>> expression changes, you need:
>>
>>    lrt <- glmFit(d,design,coef=grep("pop",colnames(design)))
>>
>> etc.  That does a test on 4 degrees of freedom.
>>
>> Best wishes
>> Gordon
>>
>>> Hope this code demonstrates what I mean:
>>>
>>> ## CODE start
>>> library(edgeR)
>>>
>>> ## generate a df of neg. bionm. counts
>>> y <- as.data.frame(matrix(rnbinom(6000,mu=10,size=10),ncol=24))
>>> names(y) <- c("AA_R11M", "AA_R16M", "AA_R18F", "AA_R28F",
>>>              "AA_R2M",  "AA_R8F",  "AA_T12F", "AA_T20F",
>>>              "AA_T24M", "AA_T3M", "AA_T42M", "AA_T45F",
>>>              "AJ_R1F",  "AJ_R1M",  "AJ_R3F",  "AJ_R3M",
>>>              "AJ_R5F",  "AJ_R5M",  "AJ_T19M", "AJ_T20M",
>>>              "AJ_T25M", "AJ_T26F", "AJ_T5F",  "AJ_T8F")
>>>
>>> sex.conds <- factor(ifelse(grepl("M$", names(y)), "male", "female" ))
>>> eel.conds <- factor(ifelse(grepl("^AA", names(y)), "Aa", "Aj" ))
>>> pop.conds <- factor(ifelse(grepl("\\w\\w_R.*", names(y)), "EU", "TW" ))
>>>
>>> design <- model.matrix(~sex.conds*eel.conds*pop.conds)
>>>
>>> ## Counts frame to full DGEList
>>> d <- DGEList(y, lib.size=colSums(y))
>>> d <- calcNormFactors(d)
>>> d <- estimateGLMCommonDisp(d, design=design)
>>> d <- estimateGLMTrendedDisp(d, design=design)
>>> d <- estimateGLMTagwiseDisp(d, design=design)
>>>
>>> glmfit <- glmFit(d, design, dispersion=d$tagwise.dispersion)
>>>
>>> glm.wrapper <- function(de.obj, fit.obj, conds.regex){
>>>  glm.list <- list()
>>>  coe <- names(as.data.frame(fit.obj$design))
>>>  coe.l <- lapply(conds.regex, function (x) grep(x, coe))
>>>  for (i in 1:length(conds.regex)){
>>>    glm.list[[conds.regex[[i]]]] <- glmLRT(de.obj, fit.obj,
>>> coef=coe.l[[i]])
>>>  }
>>>  return(glm.list)
>>> }
>>>
>>> ## selects all coefficients being contained in each other hierachically
>>> combi.conds <- gsub(":", ".*", names(as.data.frame(glmfit$design))[2:8])
>>> glm.l <- glm.wrapper(d, glmfit, combi.conds)
>>>
>>> ## show what is compared
>>> lapply(glm.l, function (x) x$comparison)
>>>
>>> ## topTags works (same as using p.adjust directly)
>>> topTags.l <- lapply(glm.l, function (x){
>>>  tt <- topTags(x, n=40000) ## set as high as the length
>>>  tt[tt$table$adj.P.Val<0.05] ## get only below adj.P
>>> })
>>>
>>> ## Code End
>>>
>>> Then I would look through the topTags list to categorize the contigs
>>> as stated above. E.g. from topTags.l[[1]] I would take only those not
>>> in topTags.l[[c(4, 5, 7]] to get category a) stated above, from
>>> topTags.l[[4]] only those not in topTagl.l[[7]] to get d1. This seems
>>> all a bit complicated to me, is this a correct way of doing this?
>>>
>>> I am alos a bit worried that decideTestsDGE seems to not work on
>>> DGELRT-objects with complicated coefficients. Eg.
>>>
>>> ## Code Start
>>> ## decideTestsDGE does not work
>>> decideTestsDGE.l <- lapply(glm.l, function (x){
>>>   subset(x$table, (decideTestsDGE(x, p.value = .05))!=0)})
>>> ## Code End
>>>
>>> I saw that for simple coefficients the results of decideTestsDGE
>>> differ from topTags. Is this expected, what is the difference between
>>> the two, thought both do p-value adjustment the same way (I could
>>> provide code if these differenced would not be the expected
>>> behaviour)?
>>>
>>> These are my questions for now. I would be very greatful for help!
>>>
>>> All the Best,
>>> Emanuel
>>>
>>>
>>> sessionInfo()
>>> R version 2.13.0 (2011-04-13)
>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> LC_TIME=en_US.UTF-8
>>> [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>>> LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>>> LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] splines   stats     graphics  grDevices utils     datasets
>>> methods   base
>>>
>>> other attached packages:
>>> [1] edgeR_2.2.6
>>>
>>> loaded via a namespace (and not attached):
>>> [1] limma_3.8.3  tools_2.13.0
>>>
>>> --
>>> Emanuel Heitlinger
>>>
>>> Karlsruhe Institute of Technology (KIT)
>>> Zoological Institute; Ecology/Parasitology
>>> Kornblumenstr. 13
>>> 76131 Karlsruhe
>>> Germany
>>> Telephone +49 (0)721-608 47654
>>>
>>> or
>>> University of Edinburgh
>>> Institute of Evolutionary Biology
>>> Kings Buildings, Ashworth Laboratories, West Mains Road
>>> Edinburgh EH9 3JT
>>> Scotland, UK
>>> Telephone:+44 (0)131-650 7403

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}