[BioC] edgeR: Analyze mini-time-series MeDIP data of pooled DNAs without replicates

Wed Aug 13 07:00:52 CEST 2014

Dear Vang,

There is still a misunderstanding.  You can do the analysis just as easily 
using either voom-limma or edgeR, but you cannot combine the two.  You 
certainly cannot input voom output to edgeR.

You simply use:

dge <- calcNormFactors(dge)
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, contrasts=c(3,5))
topTags(lrt)

Once the design matrix is made, it's just the same basic pipeline as for 
any example in the edgeR User's Guide.

Best wishes
Gordon

On Wed, 13 Aug 2014, Vang Quy Le / Region Nordjylland wrote:

> Dear Gordon,

> Your email is absolutely helpful, I would consider it’s enough help from 
> design point of view. And sorry about putting almost everything into the 
> R script. I thought that would be the easier way for you to understand 
> my questions. I will post my workflow to benefit others when I am done 
> fixing everything around it. Currently there are still some difficulty 
> of feeding voom’s output into edgeR workflow. Any further help from you 
> or others is much appreciated. But you can ignore this if you don’t have 
> time to look at it. I will try something around and see what’s going on.
>
> The error message:
>
> ######ERROR####
>> y <- estimateGLMCommonDisp(v, design, verbose = T)
> Error in aveLogCPM.default(y, weights = weights) : y must be non-negative
>
> In addition:
> Warning message:
>
> In log(colSums(y)) : NaNs produced
> #####/ERROR####
> Commands leading to the error:
>
>    dge <- DGEList(counts=counts, group = Group)
>    dge <- calcNormFactors(dge)
>    v <- voom(dge,design,plot=TRUE)
>    y <- estimateGLMCommonDisp(v, design, verbose = T)
>
> This is structure of ‘v'
>> str(v)
>
> Formal class 'EList' [package "limma"] with 1 slots
>  ..@ .Data:List of 4
>  .. ..$ :'data.frame':	8 obs. of  3 variables:
>  .. .. ..$ group       : Factor w/ 2 levels "Control","Treatment": 1 1 1 1 2 2 2 2
>  .. .. ..$ lib.size    : num [1:8] 3.23e+08 4.24e+08 4.72e+08 3.81e+08 2.29e+08 ...
>  .. .. ..$ norm.factors: num [1:8] 1.008 1.001 1.047 1.005 0.977 ...
>  .. ..$ : num [1:18013364, 1:8] -6.17 -3.72 -3.72 -3.66 -3.84 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:18013364] "00000002" "00000003" "00000004" "00000005" ...
>  .. .. .. ..$ : chr [1:8] "C1" "C2" "C3" "C4" ...
>  .. ..$ : num [1:18013364, 1:8] 2.99 16.96 17.11 17.62 16.06 ...
>  .. ..$ : num [1:8, 1:6] 1 1 1 1 1 1 1 1 0 0 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:8] "1" "2" "3" "4" ...
>  .. .. .. ..$ : chr [1:6] "(Intercept)" "GroupTreatment" "GroupControl:Linear" "GroupTreatment:Linear" ...
>  .. .. ..- attr(*, "assign")= int [1:6] 0 1 2 2 3 3
>  .. .. ..- attr(*, "contrasts")=List of 1
>  .. .. .. ..$ Group: chr "contr.treatment"
> On 12 Aug 2014, at 02:46, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Dear Vang,
>>
>> You don't describe in the text of your email what analysis your are trying to do, but after looking at the code it appears you are trying to follow my suggestion to fit a quadratic trend to the time courses.  Indeed, you seem to have already done the analysis without any problem.
>>
>> Here is a slight variant of what you have done:
>>
>>> targets
>>  Sample.Name     Group Week
>> 1          C1   Control    2
>> 2          C2   Control    4
>> 3          C3   Control    6
>> 4          C4   Control    8
>> 5          T1 Treatment    2
>> 6          T2 Treatment    4
>> 7          T3 Treatment    6
>> 8          T4 Treatment    8
>>> Linear <- poly(targets$Week,2)[,1]
>>> Quadratic <- poly(targets$Week,2)[,2]
>>> design <- model.matrix(~Group+Group:Linear+Group:Quadratic,data=targets)
>>> colnames(design)
>> [1] "(Intercept)"              "GroupTreatment"
>> [3] "GroupControl:Linear"      "GroupTreatment:Linear"
>> [5] "GroupControl:Quadratic"   "GroupTreatment:Quadratic"
>>
>> Then you can use
>>
>>  glmLRT(fit, contrasts=c(3,5))
>>
>> to find genes that change over time in the control, or
>>
>>  glmLRT(fit, contrasts=c(4,6))
>>
>> to find genes that change over time in the treatment group, or
>>
>>  glmLRT(fit, contrasts=c(0,0,-1,1,-1,1))
>>
>> to find genes that change differently between the treatment group and control, or
>>
>>  glmLRT(fit, coef=4)
>>
>> to find genes that go up or down overall in the treatment group, etc etc.
>>
>> I hope this is enough help for you.  I can't promise to look at detailed code for an individual dataset such as this one.
>>
>> Best wishes
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Mon, 11 Aug 2014, Vang Quy Le / Region Nordjylland wrote:
>>
>>> Dear Gordon,
>>
>>> I hope this email will reach you this time. Thank you for your expert inputs. You made two precise assumptions. I generated counts per 100-base windows across all chromosomes with MEDIPS package. I also read "What to do if you have no replicates” and found it does not completely fit our case so I want to seek for other approaches. I read through the publications and the limma documentation you mentioned. They gave much useful background information.
>>
>>> I wrote a rough script while I still have little knowledge about what really going on behind the code, especially the ‘design’ and subsequent analysis choices. Could you or some expert take a look and provide some more detailed inputs about how logics (execution choices and function choices) in this R script?  In case there is some problem with attaching the script to the maillinglist, here is the Dropbox link to it: https://www.dropbox.com/s/miltokwipqm34bs/test_edgeR.limma.METF01.R (Link may expire after 1 month).
>>>
>>> Thank you very much.
>>>
>>> Best regards
>>> Vang
>>>
>>> On 08 Aug 2014, at 05:26, Gordon K Smyth <smyth at wehi.edu.au<mailto:smyth at wehi.edu.au>> wrote:
>>>
>>> Dear Vang Quy Le,
>>>
>>> I don't have any experience analysing MeDIP data myself, so I will assume that you know how to generate counts from MeDIP that are suitable for edgeR, and I will just answer the linear modelling question.
>>>
>>> I also assume that you have read the section of the edgeR User's Guide called "What to do if you have no replicates."
>>>
>>> In terms of an edgeR analysis, you have two main choices.  First, you could simply omit the dispersion estimation steps of the edgeR pipeline and run glmFit() and glmLRT() with a manually set value for the dispersion.  You have already mentioned this possibility.  There is no problems with this except that the number of DE genes will be highly dependent on the dispersion value you choice.  Nevertheless, it is much better than assuming Poisson variability.  We took this approach for our paper in Cell Reports:
>>>
>>> http://www.ncbi.nlm.nih.gov/pubmed/23375371
>>>
>>> Second, you could manufacture residual degrees of freedom by fitting a smooth curve to the time course trends.  This approach is explained the section called "Many time points" in the limma User's Guide.  We took this approach to analyse the development stages of Drosophila melanogaster in the voom paper:
>>>
>>> http://genomebiology.com/2014/15/2/R29
>>>
>>> Personally, I would take the second approach.  I would fit an orthogonal quadratic polynomial to the time trend for the Control and Treated groups separately.  This will allow you to use the complete edgeR pipeline including dispersion estimation.  This will allow you to test for time trends for each gene in each of the groups.  It will also allow you examine at which time point each gene has its peak expression.  See the analysis in the voom paper.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> Vang Quy Le / Region Nordjylland vql at rn.dk<http://rn.dk>
>>> Wed Aug 6 10:19:58 CEST 2014
>>>
>>> Hello edgeR users,
>>>
>>> I have been checking around to find out how to best analyze our data as well as remedy our experiment design. It's hard to decide so I will try my best to explain what we have, what we want in a rather long message. Please bear with me.
>>>
>>> FIRSTLY SOME WORDS ABOUT OUR SAMPLES:
>>
>>> We did experiments with two groups of test animals namely treated "T" and control "C" (without treatment) groups. Bot T and C have 20 subjects each. We took samples 4 times with 2-week intervals. For example, at week 2, we extracted DNA samples from 5 subjects in the C group and 5 subjects in T group. At week 4, we extracted DNA for each group from each 5 other subjects. And so on. However, before MeDIP-seq we pooled all 5 DNA samples of the same group and same week together to save time and money. This is like making direct biological everage. So the sequencing samples look like this:
>>>
>>> targets
>>> Week Control Treated
>>> 1 week2      C2      T2
>>> 2 week4      C4      T4
>>> 3 week6      C6      T6
>>> 4 week8      C8      T8
>>>
>>> Where: C2 is the DNA pool of 5 control subjects at Week2; T2 is the DNA pool of 5 treated subjects at Week 2 and so on. We then subject the pools through MeDIP-Seq protocols and sequence them on NGS platform (color-space).
>>>
>>> OUR RESEARCH QUESTIONS:
>>> 1. Which genes are hypo/hyper-methylated in response to our treatment?
>>> 2. How does methylation rate/status change from one time point to another (i.e. which one responded early, which one responded later)?
>>>
>>> CONSIDERATIONS SO FAR:
>>> I have looked through R's package, MEDIPS. But it is not intended for our case because it does not have functions to handle time-series data. It also relies very much on edgeR to do its job. So I think it would be more flexible to use edgeR directly myself. Reading edgeRUsersGuide.pdf, I found that our case is similar to the example in Section "3.3 Treatment effects over all times". However, we don't have replicates the way edgeR expects.
>>>
>>> CONSIDERING WE HAVE TIME-SERIES AND DATA AT EACH TIME POINT IS POOLED DATA, MY QUESTIONS ARE:
>>> 1. Can we find the answers for our research questions based on the current data by using edgeR? And how, in high-level view?
>>> 2. I am thinking since we have the pools (biological mean) we can somehow skip some statistics treatment (i.e. relax p.value, set dispersion value to something reasonable) and get on with the workflow on Section 3.3. Will this be alright in terms of data analysis practice and edgeR expectations?
>>> 3. If we must do something from sequencing step, what would be the most economical and time-saveing things to do?
>>> 4. Would you suggest anything else to make the best out of this case?
>>>
>>> Thank you for your time.
>>>
>>> Kind regards,
>>>
>>> Vang Quy Le
>>> Bioinformatician, Molecular Biologist, PhD
>>>
>>> +45 97 66 56 29
>>> vql at rn.dk<http://rn.dk>
>>>
>>> AALBORG UNIVERSITY HOSPITAL
>>> Section for Molecular Diagnostics,
>>> Clinical Biochemistry
>>> Reberbansgade
>>> DK 9000 Aalborg
>>> www.aalborguh.rn.dk<http://www.aalborguh.rn.dk>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}