[BioC] Questions on doing EdgeR Analysis of Timeseries Data

Fri Jun 17 17:52:50 CEST 2011

Hi Dr. Smith,

Sorry if I am coming back to an issue previously discussed. I am following up on Dr. Lawson's question because it can be useful to him as well, I think.

Can you ,please, clarify  how to interpret a topTag table containing more than one coefficient?
As an example, those are the first two lines of the table I have. I understand why there are the coefficients for day34 and groupNT.  What does locConc, LR P.value and FDR refers to?

> topTags(lrt_glmfit_ENS_exon_data_c_TMM)
Coefficient:  day34 groupNT 
                  		   logConc    day34     groupNT       LR       P.Value           FDR
ENSBTAG00000020056 -7.499402 6.832396  0.16466282 559.8360 2.711126e-122 4.432692e-118
ENSBTAG00000010050 -8.214241 7.723259 -0.23952379 531.3709 4.114163e-116 3.363329e-112

Thanks so much, sir
Fernando

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] VennDiagram_1.0.1 baySeq_1.6.0      DESeq_1.4.1       locfit_1.5-6      lattice_0.19-26   akima_0.5-4       edgeR_2.2.5       Biobase_2.12.1   

loaded via a namespace (and not attached):
 [1] annotate_1.30.0      AnnotationDbi_1.14.1 DBI_0.2-5            genefilter_1.34.0    geneplotter_1.30.0   limma_3.8.2          RColorBrewer_1.0-2   RSQLite_0.9-4       
 [9] survival_2.36-9      tools_2.13.0         xtable_1.5-6        
>

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Gordon K Smyth
Sent: Thursday, June 16, 2011 11:51 PM
To: Lawson, Mark (mjl3p)
Cc: Bioconductor mailing list
Subject: Re: [BioC] Questions on doing EdgeR Analysis of Timeseries Data

Dear Mark,

I'd suggest that you filter out genes that fail to achieve a reasonable count in any library.  I can't give you a firm cutoff for this, it depends on the sequencing depth, and the analysis should not be sensitive to this anyway.  In my lab, we have tended to keep genes that achieve at least 1 count per thousand reads in at least a minimum number of libraries.  If y is a DGEList object, you might use

   cpm <- t(t(y$counts)/y$samples$lib.size)
   keep <- rowSums(cpm>1)>0
   y.filtered <- y[keep,]

Since you have two replicates of one lifecycle, you could base your estimate of variability (the biological coefficient of variation) on these replicates.

   design <- model.matrix(~lifecycle)
   y.filtered <- estimateGLMCommonDisp(y.filtered,design)

Perhaps also

   y.filtered <- estimateGLMTrendedDisp(y.filtered,design)

Then

   fit <- glmFit(y.filtered,design)

Then you could test for genes that change at all through the 6 lifecycles by

   lrt <- glmLRT(y.filtered,fit,coef=2:6)
   topTags(lrt)

This will perform a likelihood ratio test for each gene on 5 degrees of freedom.  Then you could examine the significant genes to see what patterns they show across the lifecycles:

   is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1
   library(limma)
   plotlines(lrt$coef[is.sig,],first.column.origin=TRUE)

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth

On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote:

> Hi Gordon
>
> Thanks for getting back to me on this. I don't mind the transfer, 
> whatever leads to answers to my questions is fine by me.
>
> I am sorry that I forgot to classify the lifecycles of the samples. 
> They represent 6 distinct lifecycles (the first measured lifecycle has 
> two samples). Will this affect the prior.n you suggested?
>
> ~ Mark
>
> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote:
>
>> Hi Mark,
>>
>> This is an analysis question rather than a devel suggestion, so I've 
>> transferred it from Bioc-sig-seq to Bioconductor.  Hope you don't mind.
>>
>> If you have divided your samples into two groups, and you have 7 
>> samples, then I recommend you set prior.n=4.  See
>>
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-June/002042.h
>> tml
>>
>> for an explanation.
>>
>> I can't answer your other questions without knowing whether you have 
>> biological replicates in your experiment.  You have 7 samples.  Do 
>> these samples all correspond to distinct lifecycle stages, or are 
>> some stages observed more than once?  How many stages are there?
>>
>> Best wishes
>> Gordon
>>
>>> Date: Thu, 9 Jun 2011 18:19:42 +0000
>>> From: "Lawson, Mark (mjl3p)" <mjl3p at virginia.edu>
>>> To: "bioc-sig-sequencing at r-project.org"
>>> 	<bioc-sig-sequencing at r-project.org>
>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of
>>> 	Timeseries Data
>>>
>>> (I apologize if this message was sent more than once)
>>>
>>> Dear Analysis Gurus
>>>
>>> I am currently performing a gene expression analysis on a plant 
>>> parasite. I have mapped Illumina read counts for various stages in 
>>> this parasites lifecycle. Of interest for us in this analysis are 
>>> genes that are differentially expressed during these lifecycles. To 
>>> determine this, I have focused on two types of differential
>>> expression: "peaks" and "cliffs." "Peaks" occur when a gene is 
>>> differentially expressed in one time sample (either higher or lower 
>>> than the remaining samples) and "cliffs" occur when a gene is 
>>> differentially expressed between two groups of sample (for instance 
>>> higher expression in the first three samples than the last three).
>>>
>>> To determine these peaks and cliffs, I have been creating groups in 
>>> which the desired peak/cliff is "case" and the remaining samples are 
>>> "control." I then run common dispersion and/or tagwise dispersion 
>>> and extract those reads with an FDR of less than 0.1. So, my questions:
>>>
>>> 1.) How much filtering of data should I do? Right now I have a fair 
>>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems 
>>> logical that I would filter out genes that have no expression, but 
>>> at what level should it stop? Also, should there be different 
>>> filtering depending on the analysis (peak or cliff)?
>>>
>>> 2.) When doing tagwise dispersion, what should I set my prior.n to 
>>> (I currently have 7 samples)? Does it depend on the type of analysis?
>>>
>>> 3.) Should I investigate using a more advanced glm based analysis? 
>>> Any advice on crafting a design for this?
>>>
>>> 4.) Any other ideas on analyses to perform on a set of timeseries 
>>> data with EdgeR?
>>>
>>> I greatly appreciate any help/advice and thank you in advance!
>>>
>>>
>>> Mark J. Lawson, Ph.D.
>>> Bioinformatics Research Scientist
>>> Center for Public Health Genomics, UVA mlawson at virginia.edu

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:9}}