[BioC] Questions on doing EdgeR Analysis of Timeseries Data

Sat Jun 18 01:45:24 CEST 2011

Counts per million should have been:

   cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size)

Gordon

On Fri, 17 Jun 2011, Gordon K Smyth wrote:

> Dear Mark,
>
> I'd suggest that you filter out genes that fail to achieve a reasonable count 
> in any library.  I can't give you a firm cutoff for this, it depends on the 
> sequencing depth, and the analysis should not be sensitive to this anyway. 
> In my lab, we have tended to keep genes that achieve at least 1 count per 
> thousand reads in at least a minimum number of libraries.  If y is a DGEList 
> object, you might use
>
>  cpm <- t(t(y$counts)/y$samples$lib.size)
>  keep <- rowSums(cpm>1)>0
>  y.filtered <- y[keep,]
>
> Since you have two replicates of one lifecycle, you could base your estimate 
> of variability (the biological coefficient of variation) on these replicates.
>
>  design <- model.matrix(~lifecycle)
>  y.filtered <- estimateGLMCommonDisp(y.filtered,design)
>
> Perhaps also
>
>  y.filtered <- estimateGLMTrendedDisp(y.filtered,design)
>
> Then
>
>  fit <- glmFit(y.filtered,design)
>
> Then you could test for genes that change at all through the 6 lifecycles by
>
>  lrt <- glmLRT(y.filtered,fit,coef=2:6)
>  topTags(lrt)
>
> This will perform a likelihood ratio test for each gene on 5 degrees of 
> freedom.  Then you could examine the significant genes to see what patterns 
> they show across the lifecycles:
>
>  is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1
>  library(limma)
>  plotlines(lrt$coef[is.sig,],first.column.origin=TRUE)
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> smyth at wehi.edu.au
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote:
>
>> Hi Gordon
>> 
>> Thanks for getting back to me on this. I don't mind the transfer, whatever 
>> leads to answers to my questions is fine by me.
>> 
>> I am sorry that I forgot to classify the lifecycles of the samples. They 
>> represent 6 distinct lifecycles (the first measured lifecycle has two 
>> samples). Will this affect the prior.n you suggested?
>> 
>> ~ Mark
>> 
>> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote:
>> 
>>> Hi Mark,
>>> 
>>> This is an analysis question rather than a devel suggestion, so I've 
>>> transferred it from Bioc-sig-seq to Bioconductor.  Hope you don't mind.
>>> 
>>> If you have divided your samples into two groups, and you have 7 samples, 
>>> then I recommend you set prior.n=4.  See
>>> 
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-June/002042.html
>>> 
>>> for an explanation.
>>> 
>>> I can't answer your other questions without knowing whether you have 
>>> biological replicates in your experiment.  You have 7 samples.  Do these 
>>> samples all correspond to distinct lifecycle stages, or are some stages 
>>> observed more than once?  How many stages are there?
>>> 
>>> Best wishes
>>> Gordon
>>> 
>>>> Date: Thu, 9 Jun 2011 18:19:42 +0000
>>>> From: "Lawson, Mark (mjl3p)" <mjl3p at virginia.edu>
>>>> To: "bioc-sig-sequencing at r-project.org"
>>>> 	<bioc-sig-sequencing at r-project.org>
>>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of
>>>> 	Timeseries Data
>>>> 
>>>> (I apologize if this message was sent more than once)
>>>> 
>>>> Dear Analysis Gurus
>>>> 
>>>> I am currently performing a gene expression analysis on a plant parasite. 
>>>> I have mapped Illumina read counts for various stages in this parasites 
>>>> lifecycle. Of interest for us in this analysis are genes that are 
>>>> differentially expressed during these lifecycles. To determine this, I 
>>>> have focused on two types of differential expression: "peaks" and 
>>>> "cliffs." "Peaks" occur when a gene is differentially expressed in one 
>>>> time sample (either higher or lower than the remaining samples) and 
>>>> "cliffs" occur when a gene is differentially expressed between two groups 
>>>> of sample (for instance higher expression in the first three samples than 
>>>> the last three).
>>>> 
>>>> To determine these peaks and cliffs, I have been creating groups in which 
>>>> the desired peak/cliff is "case" and the remaining samples are "control." 
>>>> I then run common dispersion and/or tagwise dispersion and extract those 
>>>> reads with an FDR of less than 0.1. So, my questions:
>>>> 
>>>> 1.) How much filtering of data should I do? Right now I have a fair 
>>>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems 
>>>> logical that I would filter out genes that have no expression, but at 
>>>> what level should it stop? Also, should there be different filtering 
>>>> depending on the analysis (peak or cliff)?
>>>> 
>>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I 
>>>> currently have 7 samples)? Does it depend on the type of analysis?
>>>> 
>>>> 3.) Should I investigate using a more advanced glm based analysis? Any 
>>>> advice on crafting a design for this?
>>>> 
>>>> 4.) Any other ideas on analyses to perform on a set of timeseries data 
>>>> with EdgeR?
>>>> 
>>>> I greatly appreciate any help/advice and thank you in advance!
>>>> 
>>>> 
>>>> Mark J. Lawson, Ph.D.
>>>> Bioinformatics Research Scientist
>>>> Center for Public Health Genomics, UVA
>>>> mlawson at virginia.edu
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}