[BioC] DESeq: residuals instead of read counts

Simon Anders anders at embl.de
Mon Jan 17 11:54:42 CET 2011


Dear Tuuli

On 01/16/2011 10:30 AM, Tuuli Lappalainen wrote:
> I've been using DESeq for RNASeq data analysis of differential
> expression of genes and exons. However, I have a dilemma: Instead of
> using read count data, I would like to use residuals of linear
> regression (after rescaling), but I'm unsure whether this is a valid
> thing to do.
>
> The reason to use residuals is the correction for variation in insert
> size between lanes (as e.g. in Montgomery et al. Nature 2010) - or
> potentially some other confounders as well. I have been planning to
> rescale the residuals to have all the values >0, and the median across
> the samples per gene the same as in the original count data, and use
> these data then similarly to raw counts. Would this be a proper way to
> analyze the data?

No, this would not work well, because it invalidates the shot noise 
estimation. The proper place to introduce such corrections is 
alongside the size factors. Unfortunately, DESeq does not, at the 
moment, offer the functionality to supply such extra factors, but it 
seems I really should add this, as it is actually trivial.

The alternative would be to use GLMs.

Here are a few notes on both possibilities.

(a) DESeq models the counts K_ij of gene i in sample j as following a 
negative-binomial distribution with mean s_j q_ij. Here, q_ij is the 
expected concentration of transcripts from gene i under the condition 
(treatment) of sample j, i.e., q_ij is the same for all samples j within 
a replicate group. But even within a replicate group, the expectation 
values of the counts K_ij will be different -- primarily because the 
sequencing depth varies from sample to sample. Hence, we multiply q_ij 
by the "size factor" s_j, which models the relation between 
(biological) transcript concentration and (technical) read-out in terms 
of read counts.
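
To make this concrete, here is a tiny illustration (not DESeq code, and 
the numbers are made up) of how counts arise under this model:

s <- c( 0.8, 1.0, 1.3 )   # size factors of three samples in one replicate group
q <- 120                  # expected concentration q_ij, shared within the group
rnbinom( 3, mu = s * q, size = 10 )   # size = 1/dispersion, chosen arbitrarily here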

At the moment, this factor s_j depends only on the sample j but not on 
the gene i. If your confounder also depends only on j and not on i, you 
can simply multiply it into the size factor (provided you have estimated 
the confounder as a multiplicative rather than an additive correction).
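
A minimal sketch of this, assuming 'insertSizeCorrection' is a 
per-sample multiplicative factor that you have estimated yourself (the 
name and the way you obtain it are up to you):

cds <- newCountDataSet( countTable, conditions )
cds <- estimateSizeFactors( cds )
# fold the per-sample correction into the size factors
sizeFactors( cds ) <- sizeFactors( cds ) * insertSizeCorrection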

If it depends on i, too, the math stays just the same -- only, there is 
no slot in the CountDataSet object to store this information. Maybe I 
should add this, as you are not the first one to ask. Other users wanted 
to incorporate a GC correction (as used in the Pickrell et al. paper), 
and this would be done the same way.

(b) In the last release, DESeq gained the functionality to fit GLMs. 
This is not yet mentioned in the vignette, but it is documented in the 
help pages. With this, you can do a per-gene confounder estimation and 
the main effect estimation in a unified fashion.

In brief, it works as follows:

- When calling 'newCountDataSet', the second argument is no longer a 
factor of conditions but rather a data frame with the predictors, one 
row for each column in the count table, and one column for each factor. 
(In principle, some of the predictors could be numerical rather than 
factors.) A combined setup sketch is given after the code below.

- When calling 'estimateVarianceFunctions', use the option 
'method="pooled"'. This estimates one pooled variance for all samples.

- Instead of 'nbinomTest', use 'nbinomFitGLM' and 'nbinomGLMTest', as 
follows (assuming that the model frame contains two columns, 
'confounder' and 'treatment'):

# Fit the reduced model (takes a few minutes):
fit0 <- nbinomFitGLM( cds, count ~ confounder )

# Fit the full model (takes a few minutes):
fit1 <- nbinomFitGLM( cds, count ~ confounder + treatment )

# Calculate the p values by a chi-squared test comparing the two fits:
pvals <- nbinomGLMTest( fit1, fit0 )
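
For completeness, here is how the setup steps above fit together; 
'countTable', 'confounder' and 'treatment' are placeholders for your 
count matrix and per-sample covariates:

# one row per column of the count table, one column per predictor
modelFrame <- data.frame(
    confounder = confounder,
    treatment  = factor( treatment ) )

cds <- newCountDataSet( countTable, modelFrame )
cds <- estimateSizeFactors( cds )
cds <- estimateVarianceFunctions( cds, method="pooled" )
# ... then fit the two models and test as shown above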

This approach makes sense only if you expect your confounder to change 
in value from gene to gene. If it doesn't, it should go into the size 
factors (or, more generally, into the GLM's offset).

The catch is that the new "pooled" variance estimation only works if the 
design has "cell replicates", i.e., if the design matrix has replicated 
rows, and this is often not the case. The authors of the "edgeR" package 
found an ingenious way around this problem (their newest version uses 
what they call Cox-Reid dispersion estimation), and I guess I should 
adopt this idea.

Cheers
   Simon


