[BioC] Questions on doing EdgeR Analysis of Timeseries Data

Fri Jun 17 06:50:33 CEST 2011

Dear Mark,

I'd suggest that you filter out genes that fail to achieve a reasonable 
count in any library.  I can't give you a firm cutoff for this, it depends 
on the sequencing depth, and the analysis should not be sensitive to this 
anyway.  In my lab, we have tended to keep genes that achieve at least 1 
count per thousand reads in at least a minimum number of libraries.  If y 
is a DGEList object, you might use

   cpm <- t(t(y$counts)/y$samples$lib.size)
   keep <- rowSums(cpm>1)>0
   y.filtered <- y[keep,]

Since you have two replicates of one lifecycle, you could base your 
estimate of variability (the biological coefficient of variation) on these 
replicates.

   design <- model.matrix(~lifecycle)
   y.filtered <- estimateGLMCommonDisp(y.filtered,design)

Perhaps also

   y.filtered <- estimateGLMTrendedDisp(y.filtered,design)

Then

   fit <- glmFit(y.filtered,design)

Then you could test for genes that change at all through the 6 lifecycles 
by

   lrt <- glmLRT(y.filtered,fit,coef=2:6)
   topTags(lrt)

This will perform a likelihood ratio test for each gene on 5 degrees of 
freedom.  Then you could examine the significant genes to see what 
patterns they show across the lifecycles:

   is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1
   library(limma)
   plotlines(lrt$coef[is.sig,],first.column.origin=TRUE)

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote:

> Hi Gordon
>
> Thanks for getting back to me on this. I don't mind the transfer, 
> whatever leads to answers to my questions is fine by me.
>
> I am sorry that I forgot to classify the lifecycles of the samples. They 
> represent 6 distinct lifecycles (the first measured lifecycle has two 
> samples). Will this affect the prior.n you suggested?
>
> ~ Mark
>
> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote:
>
>> Hi Mark,
>>
>> This is an analysis question rather than a devel suggestion, so I've 
>> transferred it from Bioc-sig-seq to Bioconductor.  Hope you don't mind.
>>
>> If you have divided your samples into two groups, and you have 7 
>> samples, then I recommend you set prior.n=4.  See
>>
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-June/002042.html
>>
>> for an explanation.
>>
>> I can't answer your other questions without knowing whether you have 
>> biological replicates in your experiment.  You have 7 samples.  Do 
>> these samples all correspond to distinct lifecycle stages, or are some 
>> stages observed more than once?  How many stages are there?
>>
>> Best wishes
>> Gordon
>>
>>> Date: Thu, 9 Jun 2011 18:19:42 +0000
>>> From: "Lawson, Mark (mjl3p)" <mjl3p at virginia.edu>
>>> To: "bioc-sig-sequencing at r-project.org"
>>> 	<bioc-sig-sequencing at r-project.org>
>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of
>>> 	Timeseries Data
>>>
>>> (I apologize if this message was sent more than once)
>>>
>>> Dear Analysis Gurus
>>>
>>> I am currently performing a gene expression analysis on a plant 
>>> parasite. I have mapped Illumina read counts for various stages in 
>>> this parasites lifecycle. Of interest for us in this analysis are 
>>> genes that are differentially expressed during these lifecycles. To 
>>> determine this, I have focused on two types of differential 
>>> expression: "peaks" and "cliffs." "Peaks" occur when a gene is 
>>> differentially expressed in one time sample (either higher or lower 
>>> than the remaining samples) and "cliffs" occur when a gene is 
>>> differentially expressed between two groups of sample (for instance 
>>> higher expression in the first three samples than the last three).
>>>
>>> To determine these peaks and cliffs, I have been creating groups in 
>>> which the desired peak/cliff is "case" and the remaining samples are 
>>> "control." I then run common dispersion and/or tagwise dispersion and 
>>> extract those reads with an FDR of less than 0.1. So, my questions:
>>>
>>> 1.) How much filtering of data should I do? Right now I have a fair 
>>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems 
>>> logical that I would filter out genes that have no expression, but at 
>>> what level should it stop? Also, should there be different filtering 
>>> depending on the analysis (peak or cliff)?
>>>
>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I 
>>> currently have 7 samples)? Does it depend on the type of analysis?
>>>
>>> 3.) Should I investigate using a more advanced glm based analysis? Any 
>>> advice on crafting a design for this?
>>>
>>> 4.) Any other ideas on analyses to perform on a set of timeseries data 
>>> with EdgeR?
>>>
>>> I greatly appreciate any help/advice and thank you in advance!
>>>
>>>
>>> Mark J. Lawson, Ph.D.
>>> Bioinformatics Research Scientist
>>> Center for Public Health Genomics, UVA
>>> mlawson at virginia.edu

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}