[BioC] Statistics question for multi-factor interaction test in edgeR

Mon May 13 15:17:43 CEST 2013

One further question. I'm glad to hear it's acceptable to run the
cpm>1)>=4 filtering, as well as the glmFit(Y, design), once on all 8
groups (32 libraries), and then just use the makeContrasts function. Yet
if I use makeContrasts to specify the 6 tests as noted in the prior
email/post, should I lower my "significant" FDR cutoff from 0.05 to
(0.05/6) = 0.0083? I know edgeR does a BH correction / FDR to account for
the multiple testing of all the different genes or tags. However, I am
unsure whether it's necessary to also have a correction for the 6
different contrasts I am testing, or if simply making the model >design =
model.matrix(~0+Group, data=Y$samples) where Y = DGEList(counts=D,
group=Group), and Group contains all 8 groups (2 species * 2 treatments *
2 times), intrinsically accounts for this in how it sets up the model. I
assume makeContrasts or the design function accounts for this, but I just
want to be sure.

Thank you again,
Hilary

--------------------------------------------------
Dr. Hilary April Smith
Postdoctoral Research Associate
University of Notre Dame
Department of Biological Sciences
321 Galvin Life Sciences

On 5/11/13 11:53 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:

>Dear Hilary,
>
>Makes sense to me.
>
>Personally, I would analyse all the libraries together (with 8 groups)
>instead of using separate glmFits for the two ages.  The same contrasts
>will still work.  Then there is no issue with different numbers of rows
>etc.
>
>Best wishes
>Gordon
>
>On Sat, 11 May 2013, Hilary Smith wrote:
>
>> Dear Gordon,
>> Thank you. We cannot really remove the 3-way interaction, because there
>> are some genes that respond to the 3-way interaction (even though a
>> classic parameterization with the intercept, leads to about an order of
>> magnitude more genes responding to a 2-way vs. 3-way interaction).
>>
>> Testing for species*treatment at each time, definitely seems the closest
>> way to address the questions we have. Roughly following section 3.3.1 of
>> the edgeR user guide ("Defining each treatment combination as a group")
>>on
>> the creation of specific contrasts (and many thanks for your updated
>>User
>> Guides to show this formulation in detail), I set up tests as below. If
>> this is valid, that's great, as it does make much more intuitive sense
>>to
>> me (and biological in terms of addressing our questions of interest).
>>
>> Thank you for your help; I REALLY appreciate it!
>> Best,
>> Hilary
>>
>>> design1 = model.matrix(~0+Group_time1, data=Ytime1$samples)
>>> design2 = model.matrix(~0+Group_time2, data=Ytime2$samples)
>>
>>> myC_TimeI = makeContrasts(
>> + Treatment1_vs_Treatment2_SpeciesA_TimeI = SpeciesA_TimeI_Treatment2 -
>> SpeciesA_TimeI_Treatment1,
>> + Treatment1_vs_Treatment2_SpeciesB_TimeI = SpeciesB_TimeI_Treatment2 -
>> SpeciesB_TimeI_Treatment1,
>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI =
>> (SpeciesA_TimeI_Treatment2 - SpeciesA_TimeI_Treatment1)-(
>> SpeciesB_TimeI_Treatment2 - SpeciesB_TimeI_Treatment1),
>> + levels=design1)
>>>
>>> myC_TimeII = makeContrasts(
>> + Treatment1_vs_Treatment2_SpeciesA_TimeII = SpeciesA_TimeII_Treatment2
>>-
>> SpeciesA_TimeII_Treatment1,
>> + Treatment1_vs_Treatment2_SpeciesB_TimeII = SpeciesB_TimeII_Treatment2
>>-
>> SpeciesB_TimeII_Treatment1,
>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeII =
>> (SpeciesA_TimeII_Treatment2 - SpeciesA_TimeII_Treatment1)-(
>> SpeciesB_TimeII_Treatment2 - SpeciesB_TimeII_Treatment1),
>> + levels=design2)
>>
>> Then, after using glmFit on each of the two makeContrasts above, I ran
>> glmLRT 6 times, once on each of the 6 contrasts (i.e., on
>> Treatment1_vs_Treatment2_SpeciesA_TimeI,
>> Treatment1_vs_Treatment2_SpeciesB_TimeI, etc. through
>> Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI.
>>
>> Ultimately I may then try to use the VennCounts/Diagram from limma to
>>show
>> the overlap between the 4 sets visually:
>> Treatment1vsTreatment2_SpeciesA_TimeI; Treatment1
>> vsTreatment2_SpeciesA_TimeII; Treatment1 vsTreatment2_SpeciesB_TimeI;
>>and
>> Treatment1 vsTreatment2_SpeciesB_TimeII. However, this may take a bit to
>> set up, because my two time points have slightly different gene lists
>> (i.e., to run the analyses on the Times differently, after filtering out
>> reads to only retain those with cpm>1 in „4 libraries, TimesI and II did
>> not retain exactly the same genes, though it¹s rather close).
>>
>>
>>
>>
>>
>> --------------------------------------------------
>>
>>
>>
>>
>>
>>
>> On 5/10/13 8:32 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>>
>>> Dear Hilary,
>>>
>>> Your test for the 3-way interaction is correct, although 3-way
>>> interactions are pretty hard to interpret.
>>>
>>> However testing for the 2-way interaction in the presence of a 3-way
>>> interaction does not make statistical sense.  This is because the
>>> parametrization of the 2-way interaction as a subset of the 3-way is
>>> somewhat arbitrary.  Before you can test the 2-way interaction
>>> species*treatment in a meaningful way you would need to accept that the
>>> 3-way interaction is not necessary and remove it from the model.
>>>
>>> In general, I am of the opinion that classical statistical factorial
>>> interation models do not usually provide the most meaningful
>>> parametrizations for genomic experiments.  In most cases, I prefer to
>>>fit
>>> the saturated model (a different level for each treatment combination)
>>> and
>>> make specific contrasts.  There is some discussion of this in the limma
>>> User's Guide.
>>>
>>> In your case, I guess that you might want to test for species*treatment
>>> interaction separately at each time point.  It is almost impossible to
>>>do
>>> this within the classical 3-way factorial setup. However it is easy
>>>with
>>> the one-way approach I just mentioned, or else you could use:
>>>
>>>   ~Age + Age:Species + Age:Treatment + Age:Species:Treatment
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>>> Date: Thu, 9 May 2013 14:55:46 -0400
>>>> From: Hilary Smith <Hilary.A.Smith.964 at nd.edu>
>>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>>> Subject: [BioC] Statistics question for multi-factor interaction test
>>>> 	in	edgeR
>>>>
>>>> Hi. I need to generate two GLM tests of a factorial design with
>>>>RNA-Seq
>>>> count data. I have 3 factors with 2 levels apiece (2 species X 2
>>>> treatments X 2 times), and 4 separate replicates each (i.e., we made a
>>>> total of 2*2*2*4 = 32 separate libraries). Our main interest is in the
>>>> interaction of species*treatment, as we think species A will alter
>>>>gene
>>>> expression in the treatment stress vs. treatment benign, whereas
>>>> species
>>>> B is expected to show little change. However, we¹d like to also do
>>>> another test of species*treatment*time, because it is possible that
>>>>the
>>>> ability of species A to alter gene expression in response to the
>>>>stress
>>>> treatment may differ at the 1st versus 2nd time point.
>>>>
>>>> I think the way to set this up, is to create a design matrix as
>>>> follows,
>>>> with the lrt test with coef 5 giving the differentially expressed
>>>>genes
>>>> for the species*treatment test, and coef 8 giving the the
>>>> differentially
>>>> expressed gene for the species*treatment*time test (after calling
>>>> topTags that is). Yet to ensure I have the statistics correct, my
>>>> questions are: (1) is this thinking correct, as I don¹t see many 3x2
>>>> factorial models to follow, and (2) do I need to set up a reference
>>>> somehow (which I assume would be the set of four samples with
>>>> TreatmentBenign*SpeciesB*Time2, but I¹m not fully sure if that is
>>>> correct or needed).
>>>>
>>>> Many thanks in advance for your insight!
>>>> ~Hilary
>>>>
>>>>> designFF <- model.matrix(~Treatment*Species*Age)
>>>>> colnames(designFF)
>>>> [1] "(Intercept)"
>>>> [2] " TreatmentStress"
>>>> [3] "SpeciesA "
>>>> [4] "Time1"
>>>> [5] "TreatmentStress:SpeciesA"
>>>> [6] "TreatmentStress:Time1"
>>>> [7] "SpeciesA:Time1"
>>>> [8] "TreatmentStress:SpeciesA:Time1"
>>>>
>>>> And then to run tests with:
>>>>> fit <- glmFit(y, designFF)
>>>>
>>>>> lrtInteractionStressSpecies <- glmLRT(fitFF, coef=5)
>>>>> lrtInteractionStressSpeciesTime <- glmLRT(fitFF, coef=8)
>
>______________________________________________________________________
>The information in this email is confidential and intended solely for the
>addressee.
>You must not disclose, forward, print or use it without the permission of
>the sender.
>______________________________________________________________________