[BioC] edgeR: Analyze mini-time-series MeDIP data of pooled DNAs without replicates?

Wed Aug 6 10:19:58 CEST 2014

Hello edgeR users,

I have been looking into how best to analyze our data and how to remedy our experimental design. It is hard to decide, so I will try to explain what we have and what we want in a rather long message. Please bear with me.

FIRSTLY SOME WORDS ABOUT OUR SAMPLES:
We ran experiments with two groups of test animals, a treated group "T" and an untreated control group "C". Both T and C have 20 subjects each. We sampled 4 times at 2-week intervals. For example, at week 2 we extracted DNA from 5 subjects in the C group and 5 subjects in the T group. At week 4 we extracted DNA from 5 other subjects in each group, and so on. However, before MeDIP-seq we pooled the 5 DNA samples from the same group and the same week to save time and money. This is like taking a direct biological average. So the sequencing samples look like this:

> targets
   Week Control Treated
1 week2      C2      T2
2 week4      C4      T4
3 week6      C6      T6
4 week8      C8      T8

Where: C2 is the DNA pool of 5 control subjects at week 2, T2 is the DNA pool of 5 treated subjects at week 2, and so on. We then ran the pools through the MeDIP-seq protocol and sequenced them on an NGS platform (color-space).
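
To make the layout concrete, the following is a minimal sketch in base R of how the eight pooled libraries could be encoded as a sample sheet and turned into design matrices. The column names (Sample, Week, Treatment) and the two model formulas are illustrative assumptions, not something prescribed by edgeR or MEDIPS.

```r
# Sample sheet for the eight pooled libraries described above.
# Column names (Sample, Week, Treatment) are illustrative assumptions.
samples <- data.frame(
  Sample    = c("C2", "T2", "C4", "T4", "C6", "T6", "C8", "T8"),
  Week      = factor(rep(c("week2", "week4", "week6", "week8"), each = 2)),
  Treatment = factor(rep(c("C", "T"), times = 4), levels = c("C", "T"))
)

# Additive model: one treatment effect shared by all weeks.
design_additive <- model.matrix(~ Week + Treatment, data = samples)

# Interaction model: a separate treatment effect at every week.
design_interaction <- model.matrix(~ Week * Treatment, data = samples)

# Residual degrees of freedom = number of libraries minus number of coefficients.
df_additive    <- nrow(samples) - ncol(design_additive)
df_interaction <- nrow(samples) - ncol(design_interaction)

print(c(additive = df_additive, interaction = df_interaction))
```

With one pool per group per week, the additive model leaves 3 residual degrees of freedom, while the interaction model uses all 8 libraries for its coefficients and leaves none. This is the sense in which the design has no replicates for week-specific treatment effects.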

OUR RESEARCH QUESTIONS:
1. Which genes are hypo/hyper-methylated in response to our treatment?
2. How does methylation rate/status change from one time point to another (i.e. which one responded early, which one responded later)?

CONSIDERATIONS SO FAR:
I have looked at the R package MEDIPS, but it is not intended for our case because it has no functions for handling time-series data. It also relies heavily on edgeR to do its job, so I think it would be more flexible to use edgeR directly. Reading edgeRUsersGuide.pdf, I found that our case is similar to the example in Section "3.3 Treatment effects over all times". However, we don't have replicates the way edgeR expects.

CONSIDERING WE HAVE TIME-SERIES AND DATA AT EACH TIME POINT IS POOLED DATA, MY QUESTIONS ARE:
1. Can we answer our research questions from the current data using edgeR? If so, how, at a high level?
2. Since the pools represent a biological mean, I am thinking we could relax some of the statistical treatment (e.g. relax the p-value threshold, set the dispersion to some reasonable value) and proceed with the workflow in Section 3.3. Would this be acceptable in terms of data analysis practice and edgeR's expectations?
3. If we have to redo something from the sequencing step onward, what would be the most economical and time-saving option?
4. Would you suggest anything else to make the best out of this case?
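
For question 2, the idea of fixing the dispersion by hand could look roughly like the sketch below. It assumes the edgeR package is installed, uses simulated placeholder counts instead of real MeDIP-seq window counts, and picks a BCV of 0.2 purely for illustration. The variable names and the additive design are assumptions as well, so this is an exploratory sketch rather than a recommended analysis.

```r
library(edgeR)

# Simulated placeholder counts (200 hypothetical genomic windows x 8 pools).
# Real MeDIP-seq window counts would replace this matrix.
set.seed(1)
sample_names <- c("C2", "T2", "C4", "T4", "C6", "T6", "C8", "T8")
counts <- matrix(
  rnbinom(200 * 8, mu = 50, size = 10),
  nrow = 200,
  dimnames = list(paste0("window", 1:200), sample_names)
)

# Experimental factors for the eight pools, in the same order as the columns.
week      <- factor(rep(c("week2", "week4", "week6", "week8"), each = 2))
treatment <- factor(rep(c("C", "T"), times = 4), levels = c("C", "T"))
design    <- model.matrix(~ week + treatment)

y <- DGEList(counts = counts)
y <- calcNormFactors(y)

# Assumed biological coefficient of variation; this value is a guess,
# not an estimate from the data.
bcv <- 0.2

# Fit the GLM with the dispersion fixed at bcv^2 instead of estimating it,
# then test the overall treatment effect.
fit <- glmFit(y, design, dispersion = bcv^2)
lrt <- glmLRT(fit, coef = "treatmentT")
topTags(lrt)
```

Because the dispersion is chosen rather than estimated, the resulting p-values depend directly on the assumed BCV, so trying a few plausible values would show how sensitive the ranking is to that choice.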

Thank you for your time.

Kind regards,

Vang Quy Le
Bioinformatician, Molecular Biologist, PhD

+45 97 66 56 29
vql at rn.dk

AALBORG UNIVERSITY HOSPITAL
Section for Molecular Diagnostics,
Clinical Biochemistry
Reberbansgade
DK 9000 Aalborg
www.aalborguh.rn.dk