[BioC] Questions on doing EdgeR Analysis of Timeseries Data

Tue Jun 21 16:53:19 CEST 2011

Thanks for you help on this Gordon. 

I am still playing around with the analysis but just to make sure I am on the right track, I want to ensure that I created my design correctly.

lifecycles <- c("1G", "1G", "2G", "3G", "4G", "61G", "62G")
lifecycles <- factor(lifecycles, levels=c("1G", "2G", "3G", "4G", "61G", "62G"))
design <- model.matrix(~0+lifecycles)
colnames(design) <- levels(lifecycles)

(Recall, the first lifecycle has two sets of data)

~ Mark

On Jun 17, 2011, at 7:45 PM, Gordon K Smyth wrote:

> Counts per million should have been:
> 
>  cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size)
> 
> Gordon
> 
> On Fri, 17 Jun 2011, Gordon K Smyth wrote:
> 
>> Dear Mark,
>> 
>> I'd suggest that you filter out genes that fail to achieve a reasonable count in any library.  I can't give you a firm cutoff for this, it depends on the sequencing depth, and the analysis should not be sensitive to this anyway. In my lab, we have tended to keep genes that achieve at least 1 count per thousand reads in at least a minimum number of libraries.  If y is a DGEList object, you might use
>> 
>> cpm <- t(t(y$counts)/y$samples$lib.size)
>> keep <- rowSums(cpm>1)>0
>> y.filtered <- y[keep,]
>> 
>> Since you have two replicates of one lifecycle, you could base your estimate of variability (the biological coefficient of variation) on these replicates.
>> 
>> design <- model.matrix(~lifecycle)
>> y.filtered <- estimateGLMCommonDisp(y.filtered,design)
>> 
>> Perhaps also
>> 
>> y.filtered <- estimateGLMTrendedDisp(y.filtered,design)
>> 
>> Then
>> 
>> fit <- glmFit(y.filtered,design)
>> 
>> Then you could test for genes that change at all through the 6 lifecycles by
>> 
>> lrt <- glmLRT(y.filtered,fit,coef=2:6)
>> topTags(lrt)
>> 
>> This will perform a likelihood ratio test for each gene on 5 degrees of freedom.  Then you could examine the significant genes to see what patterns they show across the lifecycles:
>> 
>> is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1
>> library(limma)
>> plotlines(lrt$coef[is.sig,],first.column.origin=TRUE)
>> 
>> Best wishes
>> Gordon
>> 
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>> smyth at wehi.edu.au
>> http://www.wehi.edu.au
>> http://www.statsci.org/smyth
>> 
>> On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote:
>> 
>>> Hi Gordon
>>> Thanks for getting back to me on this. I don't mind the transfer, whatever leads to answers to my questions is fine by me.
>>> I am sorry that I forgot to classify the lifecycles of the samples. They represent 6 distinct lifecycles (the first measured lifecycle has two samples). Will this affect the prior.n you suggested?
>>> ~ Mark
>>> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote:
>>>> Hi Mark,
>>>> This is an analysis question rather than a devel suggestion, so I've transferred it from Bioc-sig-seq to Bioconductor.  Hope you don't mind.
>>>> If you have divided your samples into two groups, and you have 7 samples, then I recommend you set prior.n=4.  See
>>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-June/002042.html
>>>> for an explanation.
>>>> I can't answer your other questions without knowing whether you have biological replicates in your experiment.  You have 7 samples.  Do these samples all correspond to distinct lifecycle stages, or are some stages observed more than once?  How many stages are there?
>>>> Best wishes
>>>> Gordon
>>>>> Date: Thu, 9 Jun 2011 18:19:42 +0000
>>>>> From: "Lawson, Mark (mjl3p)" <mjl3p at virginia.edu>
>>>>> To: "bioc-sig-sequencing at r-project.org"
>>>>> 	<bioc-sig-sequencing at r-project.org>
>>>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of
>>>>> 	Timeseries Data
>>>>> (I apologize if this message was sent more than once)
>>>>> Dear Analysis Gurus
>>>>> I am currently performing a gene expression analysis on a plant parasite. I have mapped Illumina read counts for various stages in this parasites lifecycle. Of interest for us in this analysis are genes that are differentially expressed during these lifecycles. To determine this, I have focused on two types of differential expression: "peaks" and "cliffs." "Peaks" occur when a gene is differentially expressed in one time sample (either higher or lower than the remaining samples) and "cliffs" occur when a gene is differentially expressed between two groups of sample (for instance higher expression in the first three samples than the last three).
>>>>> To determine these peaks and cliffs, I have been creating groups in which the desired peak/cliff is "case" and the remaining samples are "control." I then run common dispersion and/or tagwise dispersion and extract those reads with an FDR of less than 0.1. So, my questions:
>>>>> 1.) How much filtering of data should I do? Right now I have a fair amount of genes that are expressed in 0, 1, 2 etc. samples. It seems logical that I would filter out genes that have no expression, but at what level should it stop? Also, should there be different filtering depending on the analysis (peak or cliff)?
>>>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I currently have 7 samples)? Does it depend on the type of analysis?
>>>>> 3.) Should I investigate using a more advanced glm based analysis? Any advice on crafting a design for this?
>>>>> 4.) Any other ideas on analyses to perform on a set of timeseries data with EdgeR?
>>>>> I greatly appreciate any help/advice and thank you in advance!
>>>>> Mark J. Lawson, Ph.D.
>>>>> Bioinformatics Research Scientist
>>>>> Center for Public Health Genomics, UVA
>>>>> mlawson at virginia.edu
>> 
> 
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:17}}