[BioC] working with two factors in DESeq to analyze RNA-seq

Simon Anders anders at embl.de
Wed Feb 9 11:53:45 CET 2011


Dear Fernando

On 02/08/2011 06:02 PM, Biase, Fernando wrote:
> Dear Dr Anders,
>
> Based on my readings I am willing to use DESeq to analyze my RNA-seq
> data. I found it very easy to follow your instructions written on the
> package vignette, and to work with it for one factor comparison.
> However, I am not sure if I am using it correctly to account for a
> second factor in the model. So, here are my two questions:
>
> How do I compare two groups, accounting for a second factor in the
> model?  (it is not a factorial model that I am looking for)
>
> Can I simply write: The condition  names as "Ax", "Ax", "Ax", "Ax",
> "Ay", "Ay", "Ay", "Bx", "Bx", "Bx", "By", "By", "By"
 >
 > nbinomTest ( cds, “A”, “B”)
 >
 > being A/B may main factor and x/y my secondary factor?
[...]

Not quite. You need to use the new GLM functionality that I added to 
DESeq in the newest release (current version: 1.2.1).

First, when calling 'newCountDataSet', pass for 'condition' not a vector 
but a data frame with the design, i.e., something like this:

   design <- data.frame(
      treatment = c( "A", "A", "A", "A", "A", "A", "A",
         "B", "B", "B", "B", "B", "B" ),
      block = c( "x", "x", "x", "x", "y", "y", "y",
         "x", "x", "x", "y", "y", "y" ) )

   cds <- newCountDataSet( countTable, design )

Then, estimate the size factors and the variance functions, the later 
using the new "pooled" mode:

   cds <- estimateSizeFactors( cds)
   cds <- estimateVarianceFunctions( cds, method="pooled" )

Now, you can fit two models, a null model, which only account for your 
blocking factor, and the full model, which also takes into account your 
factor of interest:

   fit0 <- nbinomFitGLM( cds, count ~ block )
   fit1 <- nbinomFitGLM( cds, count ~ block + treatment )

The fitting might take a few minutes. Once it is finished, you get your 
p values with a chi-squared test:

   pvals <- nbinomGLMTest( fit1, fit0 )

These p values inform you whether the treatment has a significant effect 
on the counts, after the effect of the blocking factor has been taken 
into account.

Remember to adjust the p values for multiple testing before you look for 
hits:

   padj <- p.adjust( pvals, method="BH" )

If your count table had gene symbols in the row names, you can now list 
your hits:

    rownames(counts(cds))[ padj < .1 ]


This new functionality is not yet properly documented in the vignette 
because I have not yet found a public data set that would be suitable to 
demonstrate it.

Hence, as it has not been used that widely, I'd be interested to hear 
how well it works for you.

Best regards
   Simon



More information about the Bioconductor mailing list