[BioC] edgeR: pseudocounts, logConc and logFC

Fri Apr 9 23:20:57 CEST 2010

Hi Ann.

Good questions.  See comments below.

> I am experimenting with edgeR for high throughput (next gen) sequence
> data and proteomics spectral count data and have a few questions.
>
> 1.  Is it correct to think of the pseudocounts (pseudo.alt produced by
> estimateCommonDisp) as normalized counts?  According to the edgeR
> vignette “The pseudocounts are calculated using a quantile-to-quantile
> method for the negative binomial so that the library sizes for the
> pseudocounts are equal to the geometric mean of the original library
> sizes.”  For the data that I am working with, the column sums for
> pseudo.alt are very close to the common.lib.size, but the boxplots do
> not “line-up”.  Is this because the pseudocounts are “generated under
> the alternative hypothesis”?

Yes, you could use the pseudodata as normalized counts.  If it's RNA-seq
data, you might also want to account for gene length (e.g. RPKMs).  The
reason for your boxplots not lining up (and another consideration for
normalization) may be what we call composition bias:

http://genomebiology.com/2010/11/3/R25
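
To make the composition bias idea concrete, here is a toy sketch in Python.
The counts are made up and the helper scale_to_library_size is a
hypothetical name, not an edgeR function.  It only shows what can happen
when libraries are scaled to a common total and one sample contains a
highly expressed gene that the other lacks.

```python
# Toy illustration of composition bias (made-up numbers, not real data).
# Genes A-D are expressed identically in both samples; gene E is
# highly expressed only in sample 2 and accounts for half of its reads.
sample1 = {"A": 100, "B": 200, "C": 300, "D": 400, "E": 0}
sample2 = {"A": 100, "B": 200, "C": 300, "D": 400, "E": 1000}


def scale_to_library_size(counts, target):
    """Rescale counts so that they sum to `target` (total-count scaling)."""
    total = sum(counts.values())
    return {gene: n * target / total for gene, n in counts.items()}


target = 1000  # arbitrary common library size for the illustration
norm1 = scale_to_library_size(sample1, target)
norm2 = scale_to_library_size(sample2, target)

# Print each gene with its scaled value in sample 1 and sample 2.
for gene in sample1:
    print(gene, round(norm1[gene], 1), round(norm2[gene], 1))
```

After scaling, genes A-D come out at half the value in sample 2 even
though their raw counts are identical, so the two distributions will not
line up in a boxplot.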

As you may know, the Berkeley folks have methods for normalization:

http://www.biomedcentral.com/1471-2105/11/94

In general for differential expression, we make the statistical models
operate directly on the raw counts (and incorporate 'normalization' into
the model); for us, the normalized data is just for looking at, not for
doing statistics on.
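
For the "just for looking at" use, and the gene-length point above, a
minimal Python sketch of the usual RPKM calculation (reads per kilobase of
gene per million mapped reads) follows.  The function name and the example
numbers are hypothetical and chosen only for illustration.

```python
def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene per million mapped reads."""
    # count / (gene_length_bp / 1e3) / (library_size / 1e6)
    return count * 1e9 / (gene_length_bp * library_size)


# Hypothetical gene: 500 reads, 2 kb long, in a library of 10 million reads.
value = rpkm(count=500, gene_length_bp=2000, library_size=10_000_000)
print(value)
```

Values like this are convenient for plots and tables; as noted above, the
testing itself would still be done on the raw counts.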

> 2.  I noticed that within the estimatePs function, the minimum value
> is set to 8.783496e-16.  I think the choice of this minimum will
> affect the estimated logConc and logFC values, but will it affect the
> test results (p-values)?

Yes, it definitely will affect logFC and logConc.  It shouldn't affect the
exact testing, since the test is based on sums of group pseudocounts, which
are at roughly the original scale of measurement.
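
A small Python sketch of why the minimum matters for logFC and logConc.
This is not the estimatePs code.  It assumes, for illustration only, that
logFC is the log2 ratio of the two group proportions and that logConc is
the mean of the two log2 proportions; the counts and library sizes are
made up.

```python
import math

FLOOR = 8.783496e-16  # minimum value mentioned in the question


def log_fc_and_conc(count1, lib1, count2, lib2, floor=FLOOR):
    """Return (logFC, logConc) for one tag from two group totals.

    ASSUMPTION for illustration: logFC is the log2 ratio of the group
    proportions and logConc is the mean of the two log2 proportions.
    """
    p1 = max(count1 / lib1, floor)  # proportions are floored at `floor`
    p2 = max(count2 / lib2, floor)
    log_fc = math.log2(p2) - math.log2(p1)
    log_conc = 0.5 * (math.log2(p1) + math.log2(p2))
    return log_fc, log_conc


# Tag with zero counts in group 1: both values depend on the floor.
zero_default = log_fc_and_conc(0, 1_000_000, 50, 1_000_000)
zero_other = log_fc_and_conc(0, 1_000_000, 50, 1_000_000, floor=1e-8)

# Tag observed in both groups: the floor plays no role.
seen_default = log_fc_and_conc(20, 1_000_000, 50, 1_000_000)
seen_other = log_fc_and_conc(20, 1_000_000, 50, 1_000_000, floor=1e-8)

print(zero_default, zero_other)
print(seen_default, seen_other)
```

Only tags with a zero in one group change when the floor changes, and the
counts that go into the test are the same in either case.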

> 3.  The ranges for logConc and logFC seems different when comparing
> the graph produced by smearPlot and output produced by exactTest (for
> a single comparison).  Specifically, for each of the examples in the
> edgeR vignette (and in my own data examples), the minimum logConc in
> the smearPlot is ~ -16, while in the table from topTags the minimum is
> ~32.   For logFC, the max shown in smearPlot is ~10, while the max in
> topTags is ~40.   After changing xlim and ylim in plotSmear, this
> doesn’t seem to be an issue of setting the axes.

Actually, this is the whole reason for the 'smear' plots.  The smear
itself is composed of those genes/tags that have the minimum value (the
one from your question 2) in one of the two groups.  The X values for the
smear are chosen as random uniform (hence, the smear), just to the left of
the non-minimum genes/tags.  The Y values are a 'compressed' logFC, so that
they are not so far out.  So, plotSmear() gives a different visual
representation of logFC/logConc than the exactTest() output table.
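
The idea can be sketched in Python as below.  This is not the plotSmear()
source: the function name, the width and gap parameters and the clipping
rule used to 'compress' Y are all hypothetical choices made for the
illustration, and the input values are made up.

```python
import random


def smear_coordinates(log_conc, log_fc, at_floor, width=2.0, gap=0.5, seed=0):
    """Return plotting (x, y) lists with floor tags moved into a smear.

    `at_floor[i]` is True when tag i sits at the minimum value in one group.
    `width`, `gap` and the clipping rule for y are hypothetical choices.
    Assumes at least one tag is not at the floor.
    """
    rng = random.Random(seed)
    regular = [i for i, flag in enumerate(at_floor) if not flag]
    x_min = min(log_conc[i] for i in regular)
    y_max = max(abs(log_fc[i]) for i in regular)

    xs, ys = [], []
    for conc, fc, flag in zip(log_conc, log_fc, at_floor):
        if flag:
            # X: uniform random position in a band just left of the regular tags.
            right = x_min - gap
            xs.append(rng.uniform(right - width, right))
            # Y: keep the sign of logFC but clip its magnitude to the
            # largest |logFC| seen among the regular tags.
            ys.append(max(-y_max, min(y_max, fc)))
        else:
            xs.append(conc)
            ys.append(fc)
    return xs, ys


log_conc = [-12.0, -14.5, -16.0, -31.0, -33.0]
log_fc = [0.5, -2.0, 3.0, 38.0, -36.0]
at_floor = [False, False, False, True, True]

xs, ys = smear_coordinates(log_conc, log_fc, at_floor)
print(xs)
print(ys)
```

The tabled values (here -31 and 38, for example) are left untouched; only
the plotted coordinates differ, which is why the ranges on the plot and in
the table do not agree.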

Hope that helps.

Cheers,
Mark

>
> I am using edgeR_1.4.7 with R version 2.10.1.
>
> Thanks!
>
> Ann
