[BioC] edgeR - coefficients in 3-factor experiment, complex contrasts and decideTestsDGE

Wed Oct 19 09:00:39 CEST 2011

Hi Gordon,

thanks for the reply! Posting the question and then explaining the
background helped me.

Maybe I will come back to my problem with TypeI-Error in a post with
code and data, if it persists.

Best wishes,
Emanuel

On 19/10/2011, Gordon K Smyth <smyth at wehi.edu.au> wrote:
> Hi Emanuel,
>
> I don't have time for a longer reply, but one way to see overall
> differences between samples is to use plotMDS(): distances on the plot are
> biological coefficients of variation between samples.
>
> I don't see the need to treat replicates as pseudo samples, or necessarily
> to do permutations.  The regular analysis already tries to assess
> differences between populations relative to variation between replicates.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> smyth at wehi.edu.au
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Tue, 18 Oct 2011, Emanuel Heitlinger wrote:
>
>> Dear Gordon,
>>
>> thanks a lot for your reply! This helps me so much to think more deeply
>> and focused about the kind of analyses I want to conduct on my dataset.
>> Excuse me for going a bit into the biology and my conceptions of how
>> I want to tackle some questions with this RNA-seq dataset. If you feel
>> that
>> some of my point are related to capabilities of edgeR and the way it/you
>> approach glms in genomics I would be very happy about further comments.
>>
>> If I go to much in detail feel free to skip to a more specific
>> question at the end
>> of this mail!
>>
>> On 18/10/2011, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>> Dear Emanuel,
>>>
>>>> From: Emanuel Heitlinger <emanuelheitlinger at googlemail.com>
>>>> To: bioconductor at r-project.org
>>>> Subject: [BioC] edgeR - coefficients in 3-factor experiment,	complex
>>>> 	contrasts and decideTestsDGE
>>>>
>>>> Hi all,
>>>>
>>>> I have some questions regarding multi-factor-glms in edgeR. I am working
>>>> on a RNA-seq experiment:
>>>>
>>>> I have 24 samples from 3 "treatments" each having two levels. This means
>>>> 3 biological replicates per treatment combination. I want to model the
>>>> full set of possible interactions (sex.conds*eel.conds*pop.conds), as
>>>> expecially two-fold interactions could be very interetsing for my
>>>> research-question.
>>>>
>>>> I want to categorize genes (contigs form a 454-transcriptome assembly,
>>>> genome is unsequenced) I mapped my reads against into categoris:
>>>> a) only different for sex
>>>> b) only different for eel
>>>> c) only different for pop
>>>> d1)d2)d3) different for sex:eel, sex:pop, eel:pop
>>>
>>> I am always a bit uneasy about the use of factorial models in the context
>>> of genomic research.  As a statistician, I've used factorial models all
>>> my
>>> working life, but the hierarchical hypotheses implied by two and three
>>> level interactions in a three factor model don't seem to me to correspond
>>> to scientific questions that one would want to ask in genomic research. I
>>> have to admit that is why I haven't made it particularly easy to input
>>> factorial models into limma or edgeR.  In your case, I'd probably feel
>>> more comfortable making direct contrasts between your eight distinct
>>> groups.
>>
>> I hope this is not to hard to reconcile with my approach to the whole
>> project. I had
>> some training in ecology and want to make bringing this together with
>> "questions one
>> would want to ask in genomic research"  the centre of my work during
>> the next years.
>>
>> The experiment I conducted was a "cross infection experiment under common
>> garden conditions".
>> The main focus is to test evolutionary divergence of two pop[ulations]
>> of parasites
>> in two species of eel[-hosts]. PopEU has accquired eelAa as a new host,
>> upon
>> Introduction from Asia, where popTW is still found in the native host
>> eelAj. PopEU
>> since then spent ~90 generations in its new host eelAa.
>> So my main focus is on differences in populations: Evolutionary divergence
>> of
>> gene-expression. I have a specific question on this at the end of this
>> mail...
>>
>> I obtained data for both sexes of the parasite mainly as a backup to
>> have something to
>> work on and publish if the much more interesting populations and host
>> comparisons
>> would not pan out (and I also felt better having de-facto 12replicates
>> for pop and eel).
>>
>> From my gut-feeling I want to model all the data at once and include
>> sex and its
>> interactions to have more power while focusing on pop. It could
>> however be that the
>> differences in sex are so abundant (~5000 of 40000 contigs differing -
>> p.adjust <0.05 - for sex and its interactions) that it would make more
>> sense to do
>> separate modeling for the 12 samples from each sex. Difference for pop
>> were in
>> only ~150 contigs)...?
>>
>>>
>>> I wonder what you will do with genes that show a significant 3-way
>>> interaction?  In the factorial model framework, there is nothing you can
>>> do with such genes, no further hypotheses you can test.  You would have
>>> to
>>> put them into all of your categories d1, d2 and d3.
>>>
>>
>> The 3 way interaction could be probably left out. I am not too worried
>> about the reduced
>> sensitivity of the analysis so missing some genes behaving differently
>> according to all
>> three factors will be probably okay...
>>
>>> Anyway, I'll try to answer your questions directly.  Thanks for giving a
>>> code example.  Let's say that you want to find genes whose expression
>>> does
>>> or does not depend on eel or pop in any way.  If not, then sex would be
>>> the only predictor.  In your example, sex in the second column of the
>>> design matrix:
>>>
>>>   > colnames(design)
>>>    [1] "(Intercept)"               "sex.condsmale"
>>>    [3] "eel.condsAj"               "pop.condsTW"
>>>    [5] "sex.condsmale:eel.condsAj" "sex.condsmale:pop.condsTW"
>>>    [7] "eel.condsAj:pop.condsTW"
>>> "sex.condsmale:eel.condsAj:pop.condsTW"
>>>
>>> So you can use:
>>>
>>>    d <- DGEList(y)
>>>    d <- estimateGLMCommonDisp(d, design=design)
>>>    fit <- glmFit(d, design)
>>>    lrt <- glmLRT(d, fit, coef=3:8)
>>>    topTags(lrt)
>>>
>>> This does a test on 6 degrees of freedom.  Genes that appear in the
>>> toptable do depend on eel or pop either in a main effect or in an
>>> interaction.
>>>
>>> Some of your questions are composite questions that don't have simple
>>> statistical answers even for univariate data.  For example, when you want
>>> genes that depend only on sex, you want sex to be significant as main
>>> effect but not the test given above.  Traditional statistics has a bit of
>>> difficulty with testing whether a difference is not present.  I think
>>> you'll have to make your own ad hoc judgement as to what constitutes
>>> non-significance, then make up a truth vector yourself for each
>>> hypothesis
>>> you want to test, then find overlaps between the results yourself, rather
>>> than using decideTestsDGE.
>>>
>>
>> I hope that I don't rely on inference about absence of differences and
>> try to view the
>> categories rather as highlighting genes at different level of interest:
>>
>> Everything to do with pop differences is highly interesting, twoway
>> pop:eel could be
>> even more interesting as it could be interpreted as an evolution
>> towards the optimal
>> expression levels in the particular host.
>> (Aj/TW and Aa/EU are the species/host associations found in nature).
>>
>> Aj   Aa       Aj     Aa         :   host
>> EU  EU      TW   TW         :  population
>> a     a          b      b          :  simple diverged expression seen in
>> "pop"
>> a     b          b      a          :  "adapted expression" seen in pop:eel
>> where a and b are replaced by "up" or "down" in a row.
>>
>> This seems doable, leaving sex in the model and just ignoring it for
>> these comparisons.
>> To establish a clean cut between pop-only and pop:eel is not necessary.
>>
>> So  after identifying
>> lrt <- glmLRT(d,glmfit,coef=grep("pop",colnames(design))) ## thanks
>> for the update!
>> ## and something like
>> subset(topTags(lrt, n=40000), table$adj.P.Val<0.05])
>>
>> I will do some checking to see that logFC, fitted values and original
>> values are
>> following the pattern above and end with some candidate genes for
>> "diverged
>> expression" and "adapted expression".
>>
>> Here is the hopefully specific question:
>>
>> Would there be a way to test my basic asumption of divergence of
>> gene-expression. I.e.
>> to test whether the extend of changes in expression betweeen
>> populations -across genes-
>> is bigger than what is expected by change?
>> I am asking because when I use my replicates as a pseudo-condition
>> (and leave out
>> pop for it), there are up to half as many false-positives between
>> replicates found than between pops.
>> Would it be valid to obtain p.values, adjusted.p.values, or logFC from
>> permutations of replicates
>> used as pseudo-condition and then compare this to the p.values,
>> adjusted.p.values and logFC
>> from the real model?
>> Or would it be better to look at the size of the sets identified as DE
>> by pseudo vs. pop conditions?
>>
>> Thank you very much for time,
>> Emanuel
>>
>>
>>>> In glmLRT giving simple coefficients would compare the complete model to
>>>> a model removing one coefficient at a time. From application of glms in
>>>> ecology I remember that an interaction effect should not be left in the
>>>> model if the main effect is removed. Does this apply here? Should I
>>>> compare the full model against e.g. the model minus pop, sex:pop,
>>>> eel:pop and sex:eel:pop, when I want to remove condition "pop" from the
>>>> model?
>>>
>>> Yes, that's true.  To test whether pop makes any contribution to
>>> expression changes, you need:
>>>
>>>    lrt <- glmFit(d,design,coef=grep("pop",colnames(design)))
>>>
>>> etc.  That does a test on 4 degrees of freedom.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Hope this code demonstrates what I mean:
>>>>
>>>> ## CODE start
>>>> library(edgeR)
>>>>
>>>> ## generate a df of neg. bionm. counts
>>>> y <- as.data.frame(matrix(rnbinom(6000,mu=10,size=10),ncol=24))
>>>> names(y) <- c("AA_R11M", "AA_R16M", "AA_R18F", "AA_R28F",
>>>>              "AA_R2M",  "AA_R8F",  "AA_T12F", "AA_T20F",
>>>>              "AA_T24M", "AA_T3M", "AA_T42M", "AA_T45F",
>>>>              "AJ_R1F",  "AJ_R1M",  "AJ_R3F",  "AJ_R3M",
>>>>              "AJ_R5F",  "AJ_R5M",  "AJ_T19M", "AJ_T20M",
>>>>              "AJ_T25M", "AJ_T26F", "AJ_T5F",  "AJ_T8F")
>>>>
>>>> sex.conds <- factor(ifelse(grepl("M$", names(y)), "male", "female" ))
>>>> eel.conds <- factor(ifelse(grepl("^AA", names(y)), "Aa", "Aj" ))
>>>> pop.conds <- factor(ifelse(grepl("\\w\\w_R.*", names(y)), "EU", "TW" ))
>>>>
>>>> design <- model.matrix(~sex.conds*eel.conds*pop.conds)
>>>>
>>>> ## Counts frame to full DGEList
>>>> d <- DGEList(y, lib.size=colSums(y))
>>>> d <- calcNormFactors(d)
>>>> d <- estimateGLMCommonDisp(d, design=design)
>>>> d <- estimateGLMTrendedDisp(d, design=design)
>>>> d <- estimateGLMTagwiseDisp(d, design=design)
>>>>
>>>> glmfit <- glmFit(d, design, dispersion=d$tagwise.dispersion)
>>>>
>>>> glm.wrapper <- function(de.obj, fit.obj, conds.regex){
>>>>  glm.list <- list()
>>>>  coe <- names(as.data.frame(fit.obj$design))
>>>>  coe.l <- lapply(conds.regex, function (x) grep(x, coe))
>>>>  for (i in 1:length(conds.regex)){
>>>>    glm.list[[conds.regex[[i]]]] <- glmLRT(de.obj, fit.obj,
>>>> coef=coe.l[[i]])
>>>>  }
>>>>  return(glm.list)
>>>> }
>>>>
>>>> ## selects all coefficients being contained in each other hierachically
>>>> combi.conds <- gsub(":", ".*", names(as.data.frame(glmfit$design))[2:8])
>>>> glm.l <- glm.wrapper(d, glmfit, combi.conds)
>>>>
>>>> ## show what is compared
>>>> lapply(glm.l, function (x) x$comparison)
>>>>
>>>> ## topTags works (same as using p.adjust directly)
>>>> topTags.l <- lapply(glm.l, function (x){
>>>>  tt <- topTags(x, n=40000) ## set as high as the length
>>>>  tt[tt$table$adj.P.Val<0.05] ## get only below adj.P
>>>> })
>>>>
>>>> ## Code End
>>>>
>>>> Then I would look through the topTags list to categorize the contigs
>>>> as stated above. E.g. from topTags.l[[1]] I would take only those not
>>>> in topTags.l[[c(4, 5, 7]] to get category a) stated above, from
>>>> topTags.l[[4]] only those not in topTagl.l[[7]] to get d1. This seems
>>>> all a bit complicated to me, is this a correct way of doing this?
>>>>
>>>> I am alos a bit worried that decideTestsDGE seems to not work on
>>>> DGELRT-objects with complicated coefficients. Eg.
>>>>
>>>> ## Code Start
>>>> ## decideTestsDGE does not work
>>>> decideTestsDGE.l <- lapply(glm.l, function (x){
>>>>   subset(x$table, (decideTestsDGE(x, p.value = .05))!=0)})
>>>> ## Code End
>>>>
>>>> I saw that for simple coefficients the results of decideTestsDGE
>>>> differ from topTags. Is this expected, what is the difference between
>>>> the two, thought both do p-value adjustment the same way (I could
>>>> provide code if these differenced would not be the expected
>>>> behaviour)?
>>>>
>>>> These are my questions for now. I would be very greatful for help!
>>>>
>>>> All the Best,
>>>> Emanuel
>>>>
>>>>
>>>> sessionInfo()
>>>> R version 2.13.0 (2011-04-13)
>>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> LC_TIME=en_US.UTF-8
>>>> [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>>>> LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> LC_ADDRESS=C
>>>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>>>> LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] splines   stats     graphics  grDevices utils     datasets
>>>> methods   base
>>>>
>>>> other attached packages:
>>>> [1] edgeR_2.2.6
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] limma_3.8.3  tools_2.13.0
>>>>
>>>> --
>>>> Emanuel Heitlinger
>>>>
>>>> Karlsruhe Institute of Technology (KIT)
>>>> Zoological Institute; Ecology/Parasitology
>>>> Kornblumenstr. 13
>>>> 76131 Karlsruhe
>>>> Germany
>>>> Telephone +49 (0)721-608 47654
>>>>
>>>> or
>>>> University of Edinburgh
>>>> Institute of Evolutionary Biology
>>>> Kings Buildings, Ashworth Laboratories, West Mains Road
>>>> Edinburgh EH9 3JT
>>>> Scotland, UK
>>>> Telephone:+44 (0)131-650 7403
>
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the
> addressee.
> You must not disclose, forward, print or use it without the permission of
> the sender.
> ______________________________________________________________________
>

-- 
Emanuel Heitlinger

Karlsruhe Institute of Technology (KIT)
Zoological Institute; Ecology/Parasitology
Kornblumenstr. 13
76131 Karlsruhe
Germany
Telephone +49 (0)721-608 47654

or
University of Edinburgh
Institute of Evolutionary Biology
Kings Buildings, Ashworth Laboratories, West Mains Road
Edinburgh EH9 3JT
Scotland, UK
Telephone:+44 (0)131-650 7403