[BioC] Statistics question for multi-factor interaction test in edgeR

Tue May 14 14:12:43 CEST 2013

Excellent; that is very good to hear. Thank you, and I'll take a closer
look at the limma user guide. I haven't done microarrays, so I had not
gone through the limma guide closely enough.

--------------------------------------------------
Dr. Hilary April Smith
Postdoctoral Research Associate
University of Notre Dame
Department of Biological Sciences
321 Galvin Life Sciences

On 5/13/13 7:53 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:

>Dear Hilary,
>
>Unlike p-values (Type I error control), FDR is a scalable loss function,
>so it is not generally necessary to decrease the cutoff when adding more
>tests.  It is not correct or necessary to divide FDR by the number of
>tests done (a la Bonferroni p-values).
>
>The limma package (decideTests function) has a number of strategies for
>combining FDR across multiple contrasts as well as multiple genes.
>However the most commmon practice is the default, which is to simply do
>the FDR adjustment separately for each contrast.  For the limited number
>of contrasts you are considering, I would be happy enough with this
>strategy.
>
>Best wishes
>Gordon
>
>---------------------------------------------
>Professor Gordon K Smyth,
>Bioinformatics Division,
>Walter and Eliza Hall Institute of Medical Research,
>1G Royal Parade, Parkville, Vic 3052, Australia.
>http://www.statsci.org/smyth
>
>On Mon, 13 May 2013, Hilary Smith wrote:
>
>> One further question. I'm glad to hear it's acceptable to run the
>> cpm>1)>=4 filtering, as well as the glmFit(Y, design), once on all 8
>> groups (32 libraries), and then just use the makeContrasts function. Yet
>> if I use makeContrasts to specify the 6 tests as noted in the prior
>> email/post, should I lower my "significant" FDR cutoff from 0.05 to
>> (0.05/6) = 0.0083? I know edgeR does a BH correction / FDR to account
>>for
>> the multiple testing of all the different genes or tags. However, I am
>> unsure whether it's necessary to also have a correction for the 6
>> different contrasts I am testing, or if simply making the model >design
>>=
>> model.matrix(~0+Group, data=Y$samples) where Y = DGEList(counts=D,
>> group=Group), and Group contains all 8 groups (2 species * 2 treatments
>>*
>> 2 times), intrinsically accounts for this in how it sets up the model. I
>> assume makeContrasts or the design function accounts for this, but I
>>just
>> want to be sure.
>>
>> Thank you again,
>> Hilary
>>
>>
>>
>> --------------------------------------------------
>> Dr. Hilary April Smith
>> Postdoctoral Research Associate
>> University of Notre Dame
>> Department of Biological Sciences
>> 321 Galvin Life Sciences
>>
>>
>>
>>
>>
>> On 5/11/13 11:53 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>>
>>> Dear Hilary,
>>>
>>> Makes sense to me.
>>>
>>> Personally, I would analyse all the libraries together (with 8 groups)
>>> instead of using separate glmFits for the two ages.  The same contrasts
>>> will still work.  Then there is no issue with different numbers of rows
>>> etc.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> On Sat, 11 May 2013, Hilary Smith wrote:
>>>
>>>> Dear Gordon,
>>>> Thank you. We cannot really remove the 3-way interaction, because
>>>>there
>>>> are some genes that respond to the 3-way interaction (even though a
>>>> classic parameterization with the intercept, leads to about an order
>>>>of
>>>> magnitude more genes responding to a 2-way vs. 3-way interaction).
>>>>
>>>> Testing for species*treatment at each time, definitely seems the
>>>>closest
>>>> way to address the questions we have. Roughly following section 3.3.1
>>>>of
>>>> the edgeR user guide ("Defining each treatment combination as a
>>>>group")
>>>> on
>>>> the creation of specific contrasts (and many thanks for your updated
>>>> User
>>>> Guides to show this formulation in detail), I set up tests as below.
>>>>If
>>>> this is valid, that's great, as it does make much more intuitive sense
>>>> to
>>>> me (and biological in terms of addressing our questions of interest).
>>>>
>>>> Thank you for your help; I REALLY appreciate it!
>>>> Best,
>>>> Hilary
>>>>
>>>>> design1 = model.matrix(~0+Group_time1, data=Ytime1$samples)
>>>>> design2 = model.matrix(~0+Group_time2, data=Ytime2$samples)
>>>>
>>>>> myC_TimeI = makeContrasts(
>>>> + Treatment1_vs_Treatment2_SpeciesA_TimeI = SpeciesA_TimeI_Treatment2
>>>>-
>>>> SpeciesA_TimeI_Treatment1,
>>>> + Treatment1_vs_Treatment2_SpeciesB_TimeI = SpeciesB_TimeI_Treatment2
>>>>-
>>>> SpeciesB_TimeI_Treatment1,
>>>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI =
>>>> (SpeciesA_TimeI_Treatment2 - SpeciesA_TimeI_Treatment1)-(
>>>> SpeciesB_TimeI_Treatment2 - SpeciesB_TimeI_Treatment1),
>>>> + levels=design1)
>>>>>
>>>>> myC_TimeII = makeContrasts(
>>>> + Treatment1_vs_Treatment2_SpeciesA_TimeII =
>>>>SpeciesA_TimeII_Treatment2
>>>> -
>>>> SpeciesA_TimeII_Treatment1,
>>>> + Treatment1_vs_Treatment2_SpeciesB_TimeII =
>>>>SpeciesB_TimeII_Treatment2
>>>> -
>>>> SpeciesB_TimeII_Treatment1,
>>>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeII =
>>>> (SpeciesA_TimeII_Treatment2 - SpeciesA_TimeII_Treatment1)-(
>>>> SpeciesB_TimeII_Treatment2 - SpeciesB_TimeII_Treatment1),
>>>> + levels=design2)
>>>>
>>>> Then, after using glmFit on each of the two makeContrasts above, I ran
>>>> glmLRT 6 times, once on each of the 6 contrasts (i.e., on
>>>> Treatment1_vs_Treatment2_SpeciesA_TimeI,
>>>> Treatment1_vs_Treatment2_SpeciesB_TimeI, etc. through
>>>> Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI.
>>>>
>>>> Ultimately I may then try to use the VennCounts/Diagram from limma to
>>>> show
>>>> the overlap between the 4 sets visually:
>>>> Treatment1vsTreatment2_SpeciesA_TimeI; Treatment1
>>>> vsTreatment2_SpeciesA_TimeII; Treatment1 vsTreatment2_SpeciesB_TimeI;
>>>> and
>>>> Treatment1 vsTreatment2_SpeciesB_TimeII. However, this may take a bit
>>>>to
>>>> set up, because my two time points have slightly different gene lists
>>>> (i.e., to run the analyses on the Times differently, after filtering
>>>>out
>>>> reads to only retain those with cpm>1 in „4 libraries, TimesI and II
>>>>did
>>>> not retain exactly the same genes, though it¹s rather close).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 5/10/13 8:32 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>>>>
>>>>> Dear Hilary,
>>>>>
>>>>> Your test for the 3-way interaction is correct, although 3-way
>>>>> interactions are pretty hard to interpret.
>>>>>
>>>>> However testing for the 2-way interaction in the presence of a 3-way
>>>>> interaction does not make statistical sense.  This is because the
>>>>> parametrization of the 2-way interaction as a subset of the 3-way is
>>>>> somewhat arbitrary.  Before you can test the 2-way interaction
>>>>> species*treatment in a meaningful way you would need to accept that
>>>>>the
>>>>> 3-way interaction is not necessary and remove it from the model.
>>>>>
>>>>> In general, I am of the opinion that classical statistical factorial
>>>>> interation models do not usually provide the most meaningful
>>>>> parametrizations for genomic experiments.  In most cases, I prefer to
>>>>> fit
>>>>> the saturated model (a different level for each treatment
>>>>>combination)
>>>>> and
>>>>> make specific contrasts.  There is some discussion of this in the
>>>>>limma
>>>>> User's Guide.
>>>>>
>>>>> In your case, I guess that you might want to test for
>>>>>species*treatment
>>>>> interaction separately at each time point.  It is almost impossible
>>>>>to
>>>>> do
>>>>> this within the classical 3-way factorial setup. However it is easy
>>>>> with
>>>>> the one-way approach I just mentioned, or else you could use:
>>>>>
>>>>>   ~Age + Age:Species + Age:Treatment + Age:Species:Treatment
>>>>>
>>>>> Best wishes
>>>>> Gordon
>>>>>
>>>>>
>>>>>> Date: Thu, 9 May 2013 14:55:46 -0400
>>>>>> From: Hilary Smith <Hilary.A.Smith.964 at nd.edu>
>>>>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>>>>> Subject: [BioC] Statistics question for multi-factor interaction
>>>>>>test
>>>>>> 	in	edgeR
>>>>>>
>>>>>> Hi. I need to generate two GLM tests of a factorial design with
>>>>>> RNA-Seq
>>>>>> count data. I have 3 factors with 2 levels apiece (2 species X 2
>>>>>> treatments X 2 times), and 4 separate replicates each (i.e., we
>>>>>>made a
>>>>>> total of 2*2*2*4 = 32 separate libraries). Our main interest is in
>>>>>>the
>>>>>> interaction of species*treatment, as we think species A will alter
>>>>>> gene
>>>>>> expression in the treatment stress vs. treatment benign, whereas
>>>>>> species
>>>>>> B is expected to show little change. However, we¹d like to also do
>>>>>> another test of species*treatment*time, because it is possible that
>>>>>> the
>>>>>> ability of species A to alter gene expression in response to the
>>>>>> stress
>>>>>> treatment may differ at the 1st versus 2nd time point.
>>>>>>
>>>>>> I think the way to set this up, is to create a design matrix as
>>>>>> follows,
>>>>>> with the lrt test with coef 5 giving the differentially expressed
>>>>>> genes
>>>>>> for the species*treatment test, and coef 8 giving the the
>>>>>> differentially
>>>>>> expressed gene for the species*treatment*time test (after calling
>>>>>> topTags that is). Yet to ensure I have the statistics correct, my
>>>>>> questions are: (1) is this thinking correct, as I don¹t see many 3x2
>>>>>> factorial models to follow, and (2) do I need to set up a reference
>>>>>> somehow (which I assume would be the set of four samples with
>>>>>> TreatmentBenign*SpeciesB*Time2, but I¹m not fully sure if that is
>>>>>> correct or needed).
>>>>>>
>>>>>> Many thanks in advance for your insight!
>>>>>> ~Hilary
>>>>>>
>>>>>>> designFF <- model.matrix(~Treatment*Species*Age)
>>>>>>> colnames(designFF)
>>>>>> [1] "(Intercept)"
>>>>>> [2] " TreatmentStress"
>>>>>> [3] "SpeciesA "
>>>>>> [4] "Time1"
>>>>>> [5] "TreatmentStress:SpeciesA"
>>>>>> [6] "TreatmentStress:Time1"
>>>>>> [7] "SpeciesA:Time1"
>>>>>> [8] "TreatmentStress:SpeciesA:Time1"
>>>>>>
>>>>>> And then to run tests with:
>>>>>>> fit <- glmFit(y, designFF)
>>>>>>
>>>>>>> lrtInteractionStressSpecies <- glmLRT(fitFF, coef=5)
>>>>>>> lrtInteractionStressSpeciesTime <- glmLRT(fitFF, coef=8)
>
>______________________________________________________________________
>The information in this email is confidential and intended solely for the
>addressee.
>You must not disclose, forward, print or use it without the permission of
>the sender.
>______________________________________________________________________