[BioC] DEseq2 metagenomic analysis without replicates

Mon Jan 13 23:54:56 CET 2014

Hi Kristina

On 13/01/14 22:08, Kristina M Fontanez wrote:
> Thank you for your insights. Replicates often aren’t an option in marine
> metagenomics because many of us are working at depths where dropping
> a bucket is not a feasible option. Notably, we are working with
> medium-term deployments of in situ incubations where the number of
> sampling spots was extremely limited. It’s a first pass at a difficult
> issue in marine microbiology but until we have the physical
> infrastructure (sampling mechanisms) to get a large number of samples,
> we often sacrifice replicates for better coverage across depths. Looking
> forward, replicates are definitely a goal.

First, you may (or may not) have a valid substitute for replication (see
below), and if so, there might be little to complain about in your design.

Of course, doing all four depths twice might be overkill for a first
try. If I had money for four samples, I might have opted for three
depths and taken, say, the middle depth twice, to get some idea how much
variation there can be. Precisely because this is a pilot study and a
first try for a new field, it seems important to me to establish that
the measurements for a given depth are stable and reproducible, _before_
rolling out a bigger experiment, and this is why I'm always so puzzled
that people consider replicates as important for follow-ups but not for
pilot studies. I would say it is the other way round.

> The treatments are in situ incubations representing live microbes,
> poisoned microbes (dead) and surrounding seawater (not an incubation,
> can be thought of as a control).
> 
> Your point about guessing the dispersions is a difficult one for me to
> agree with. First, the differences between the treatments are likely to
> far exceed the differences within the depths. Preliminary analyses in
> baySeq where samples are grouped by treatment bear this point out. So, I
> think it’s reasonable to combine the dispersions across depths so that I
> can compare live and dead treatments. 

Actually, I fully agree.

If treatment causes much stronger differences than depth, then using the
samples from different depths as replicates to assess the effect of
treatment is entirely reasonable and valid. I somehow thought you wanted
to do it the other way round (pooling treatments to find differences
between depths -- though this is not what you wrote), and this won't
work, because nothing will be significant.

> Second, as you point out, guessing
> the appropriate dispersion makes the data difficult to publish given I
> won’t have a good reason to argue for any value. These particular marine
> metagenomes would be the first of their kind so there really isn’t a
> good reference point.

For the treatment effect, you are fine: the dispersion estimated across
different depths also absorbs any variation due to depth, so you can
argue that it is certainly a conservative estimate of what you would
have gotten from replicates.

Problem is, you cannot reason about the effect of depth this way, but
you would not want to throw away your knowledge about depths.

However, I neglected in my first post that depth is an _ordered_ factor,
and this might make quite a difference.

I am not a marine biologist, but I imagine that most taxa will either
keep increasing or keep decreasing in abundance when you go deeper and
deeper, at least if your four depths are not that extreme. I assume that
only a few of your taxa will be happiest or unhappiest at the medium
depths, i.e. have strongest or weakest abundance not at the most shallow
or deepest sample but in one of the middle ones.

If so (and only if so), you have a good substitute for replication.
Strong differences between depths would then nearly always be
consistent: the direction of change for all three steps (from depth 1 to
depth 2, from depth 2 to depth 3 and from depth 3 to depth 4) is the
same. Taxa without depth preference will fluctuate randomly: the
differences between adjacent depths are not only weak but also
inconsistent in their directions. Hence, the taxa with the strongest
differences between depths will nearly always be consistent while those
with the weakest differences will have random directions of change, and
between the two, the line for significance can be drawn. This is why an
ordered factor can substitute for replication.
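
The consistency argument above can be illustrated with a small sketch.
This is plain Python, not DESeq2: the helper name and the abundance
values are invented for illustration, and the samples are assumed to be
ordered from shallowest to deepest.

```python
def direction_is_consistent(abundances):
    """Return True if every step between adjacent depths goes the same way.

    `abundances` holds one (normalised) abundance per depth, ordered from
    shallowest to deepest. Hypothetical helper, for illustration only.
    """
    # Differences for depth 1 -> 2, depth 2 -> 3, depth 3 -> 4, ...
    steps = [b - a for a, b in zip(abundances, abundances[1:])]
    all_up = all(s > 0 for s in steps)
    all_down = all(s < 0 for s in steps)
    return all_up or all_down


# Invented example values for four depths.
depth_loving_taxon = [10, 25, 60, 140]  # keeps increasing with depth
indifferent_taxon = [50, 46, 53, 49]    # fluctuates without a trend

print(direction_is_consistent(depth_loving_taxon))
print(direction_is_consistent(indifferent_taxon))
```

The check only captures the direction part of the argument; a
regression on depth also takes the size of the changes into account.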

To do this with DESeq2, specify depth in your sample table not as a
factor, but as a numerical vector, e.g. by putting in the depth in
metres. The estimated log2 fold change is then a change per metre, and
its reciprocal can be read as a doubling or halving distance: a log2
fold change of +0.2, for example, would mean that the taxon's abundance
doubles every 5 metres of depth (1/0.2=5), while -0.4 would mean that
its abundance halves every 2.5 metres.
If this assumption of exponential change is not too far from reality
(and at least for light, it is true, and hence maybe also for
light-dependent microbes), you will get valid p values.
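
To make the arithmetic behind the doubling and halving distances
explicit, here is a minimal sketch. It assumes depth was entered in
metres, so that the coefficient is a log2 fold change per metre; the
function names are hypothetical and nothing here calls DESeq2.

```python
def doubling_distance(log2_fc_per_metre):
    """Metres over which abundance doubles (positive slope) or halves
    (negative slope), given a log2 fold change per metre of depth."""
    if log2_fc_per_metre == 0:
        raise ValueError("no change with depth: distance is undefined")
    return 1.0 / abs(log2_fc_per_metre)


def fold_change_over(log2_fc_per_metre, metres):
    """Multiplicative change in abundance across `metres` of depth,
    under the assumption of exponential change with depth."""
    return 2.0 ** (log2_fc_per_metre * metres)


# The two example coefficients from the text.
for slope in (0.2, -0.4):
    distance = doubling_distance(slope)
    print(slope, distance, fold_change_over(slope, distance))
```

For +0.2 the distance comes out as 5 metres with a fold change of 2
over that distance; for -0.4 it is 2.5 metres with a fold change of 0.5.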

Your design for this should then be: ~ treatment + depth

I should caution that we haven't tested regression on continuous
variables much, so if it does not work, ask again. It should work but
may need some tweaking; in particular, you may need to switch off
coefficient shrinkage.

> One remaining question is whether I need the design ~ 1 option. I’m still
> not clear on that.

When you create a DESeqDataSet, you have to pass a design formula.
However, the 'estimateSizeFactors' function's output does not depend on
this. The function 'rlogTransformation' takes an option, 'blind'. Its
default, TRUE, means that the function should _ignore_ the design
formula and dispersion values stored in the object and recompute them
using the design formula '~ 1'. With '~ 1', the function does not know
anything about the samples' assignment to groups, which is why we called
it 'blind'. If you use 'rlogData', you have to specify '~ 1' manually.

  Simon