[BioC] Simultaneous FDR control for multiple contrasts in edgeR

Tue May 14 02:02:35 CEST 2013

Hi Hilary,

Just to make it clear what I mean by scalability:  Suppose that you 
successfully control the FDR to be less than 0.05 for each contrast 
separately.  Then it is an automatic consequence that the FDR for all the 
tests pooled together is also less than 0.05.

Best wishes
Gordon

On Tue, 14 May 2013, Gordon K Smyth wrote:

> Dear Hilary,
>
> Unlike p-values (Type I error control), FDR is a scalable loss function, so 
> it is not generally necessary to decrease the cutoff when adding more tests. 
> It is not correct or necessary to divide FDR by the number of tests done (a 
> la Bonferroni p-values).
>
> The limma package (decideTests function) has a number of strategies for 
> combining FDR across multiple contrasts as well as multiple genes. However 
> the most commmon practice is the default, which is to simply do the FDR 
> adjustment separately for each contrast.  For the limited number of contrasts 
> you are considering, I would be happy enough with this strategy.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.statsci.org/smyth
>
> On Mon, 13 May 2013, Hilary Smith wrote:
>
>> One further question. I'm glad to hear it's acceptable to run the
>> cpm>1)>=4 filtering, as well as the glmFit(Y, design), once on all 8
>> groups (32 libraries), and then just use the makeContrasts function. Yet
>> if I use makeContrasts to specify the 6 tests as noted in the prior
>> email/post, should I lower my "significant" FDR cutoff from 0.05 to
>> (0.05/6) = 0.0083? I know edgeR does a BH correction / FDR to account for
>> the multiple testing of all the different genes or tags. However, I am
>> unsure whether it's necessary to also have a correction for the 6
>> different contrasts I am testing, or if simply making the model >design =
>> model.matrix(~0+Group, data=Y$samples) where Y = DGEList(counts=D,
>> group=Group), and Group contains all 8 groups (2 species * 2 treatments *
>> 2 times), intrinsically accounts for this in how it sets up the model. I
>> assume makeContrasts or the design function accounts for this, but I just
>> want to be sure.
>> 
>> Thank you again,
>> Hilary
>> 
>> 
>> 
>> --------------------------------------------------
>> Dr. Hilary April Smith
>> Postdoctoral Research Associate
>> University of Notre Dame
>> Department of Biological Sciences
>> 321 Galvin Life Sciences
>> 
>> 
>> 
>> 
>> 
>> On 5/11/13 11:53 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>> 
>>> Dear Hilary,
>>> 
>>> Makes sense to me.
>>> 
>>> Personally, I would analyse all the libraries together (with 8 groups)
>>> instead of using separate glmFits for the two ages.  The same contrasts
>>> will still work.  Then there is no issue with different numbers of rows
>>> etc.
>>> 
>>> Best wishes
>>> Gordon
>>> 
>>> On Sat, 11 May 2013, Hilary Smith wrote:
>>> 
>>>> Dear Gordon,
>>>> Thank you. We cannot really remove the 3-way interaction, because there
>>>> are some genes that respond to the 3-way interaction (even though a
>>>> classic parameterization with the intercept, leads to about an order of
>>>> magnitude more genes responding to a 2-way vs. 3-way interaction).
>>>> 
>>>> Testing for species*treatment at each time, definitely seems the 
>>>> closest way to address the questions we have. Roughly following 
>>>> section 3.3.1 of the edgeR user guide ("Defining each treatment 
>>>> combination as a group") on the creation of specific contrasts (and 
>>>> many thanks for your updated User Guides to show this formulation in 
>>>> detail), I set up tests as below. If this is valid, that's great, as 
>>>> it does make much more intuitive sense to me (and biological in terms 
>>>> of addressing our questions of interest).
>>>> 
>>>> Thank you for your help; I REALLY appreciate it!
>>>> Best,
>>>> Hilary
>>>> 
>>>>> design1 = model.matrix(~0+Group_time1, data=Ytime1$samples)
>>>>> design2 = model.matrix(~0+Group_time2, data=Ytime2$samples)
>>>> 
>>>>> myC_TimeI = makeContrasts(
>>>> + Treatment1_vs_Treatment2_SpeciesA_TimeI = SpeciesA_TimeI_Treatment2 -
>>>> SpeciesA_TimeI_Treatment1,
>>>> + Treatment1_vs_Treatment2_SpeciesB_TimeI = SpeciesB_TimeI_Treatment2 -
>>>> SpeciesB_TimeI_Treatment1,
>>>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI =
>>>> (SpeciesA_TimeI_Treatment2 - SpeciesA_TimeI_Treatment1)-(
>>>> SpeciesB_TimeI_Treatment2 - SpeciesB_TimeI_Treatment1),
>>>> + levels=design1)
>>>>> 
>>>>> myC_TimeII = makeContrasts(
>>>> + Treatment1_vs_Treatment2_SpeciesA_TimeII = SpeciesA_TimeII_Treatment2
>>>> -
>>>> SpeciesA_TimeII_Treatment1,
>>>> + Treatment1_vs_Treatment2_SpeciesB_TimeII = SpeciesB_TimeII_Treatment2
>>>> -
>>>> SpeciesB_TimeII_Treatment1,
>>>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeII =
>>>> (SpeciesA_TimeII_Treatment2 - SpeciesA_TimeII_Treatment1)-(
>>>> SpeciesB_TimeII_Treatment2 - SpeciesB_TimeII_Treatment1),
>>>> + levels=design2)
>>>> 
>>>> Then, after using glmFit on each of the two makeContrasts above, I ran
>>>> glmLRT 6 times, once on each of the 6 contrasts (i.e., on
>>>> Treatment1_vs_Treatment2_SpeciesA_TimeI,
>>>> Treatment1_vs_Treatment2_SpeciesB_TimeI, etc. through
>>>> Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI.
>>>> 
>>>> Ultimately I may then try to use the VennCounts/Diagram from limma to 
>>>> show the overlap between the 4 sets visually: 
>>>> Treatment1vsTreatment2_SpeciesA_TimeI; Treatment1 
>>>> vsTreatment2_SpeciesA_TimeII; Treatment1 vsTreatment2_SpeciesB_TimeI; 
>>>> and Treatment1 vsTreatment2_SpeciesB_TimeII. However, this may take a 
>>>> bit to set up, because my two time points have slightly different 
>>>> gene lists (i.e., to run the analyses on the Times differently, after 
>>>> filtering out reads to only retain those with cpm>1 in „4 
>>>> libraries, TimesI and II did not retain exactly the same genes, 
>>>> though it¹s rather close).
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --------------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 5/10/13 8:32 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>>>> 
>>>>> Dear Hilary,
>>>>> 
>>>>> Your test for the 3-way interaction is correct, although 3-way 
>>>>> interactions are pretty hard to interpret.
>>>>> 
>>>>> However testing for the 2-way interaction in the presence of a 3-way 
>>>>> interaction does not make statistical sense.  This is because the 
>>>>> parametrization of the 2-way interaction as a subset of the 3-way is 
>>>>> somewhat arbitrary.  Before you can test the 2-way interaction 
>>>>> species*treatment in a meaningful way you would need to accept that 
>>>>> the 3-way interaction is not necessary and remove it from the model.
>>>>> 
>>>>> In general, I am of the opinion that classical statistical factorial 
>>>>> interation models do not usually provide the most meaningful 
>>>>> parametrizations for genomic experiments.  In most cases, I prefer 
>>>>> to fit the saturated model (a different level for each treatment 
>>>>> combination) and make specific contrasts.  There is some discussion 
>>>>> of this in the limma User's Guide.
>>>>> 
>>>>> In your case, I guess that you might want to test for 
>>>>> species*treatment interaction separately at each time point.  It is 
>>>>> almost impossible to do this within the classical 3-way factorial 
>>>>> setup. However it is easy with the one-way approach I just 
>>>>> mentioned, or else you could use:
>>>>>
>>>>>   ~Age + Age:Species + Age:Treatment + Age:Species:Treatment
>>>>> 
>>>>> Best wishes
>>>>> Gordon
>>>>> 
>>>>> 
>>>>>> Date: Thu, 9 May 2013 14:55:46 -0400
>>>>>> From: Hilary Smith <Hilary.A.Smith.964 at nd.edu>
>>>>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>>>>> Subject: [BioC] Statistics question for multi-factor interaction test
>>>>>> 	in	edgeR
>>>>>> 
>>>>>> Hi. I need to generate two GLM tests of a factorial design with 
>>>>>> RNA-Seq count data. I have 3 factors with 2 levels apiece (2 
>>>>>> species X 2 treatments X 2 times), and 4 separate replicates each 
>>>>>> (i.e., we made a total of 2*2*2*4 = 32 separate libraries). Our 
>>>>>> main interest is in the interaction of species*treatment, as we 
>>>>>> think species A will alter gene expression in the treatment stress 
>>>>>> vs. treatment benign, whereas species B is expected to show little 
>>>>>> change. However, we¹d like to also do another test of 
>>>>>> species*treatment*time, because it is possible that the ability of 
>>>>>> species A to alter gene expression in response to the stress 
>>>>>> treatment may differ at the 1st versus 2nd time point.
>>>>>> 
>>>>>> I think the way to set this up, is to create a design matrix as 
>>>>>> follows, with the lrt test with coef 5 giving the differentially 
>>>>>> expressed genes for the species*treatment test, and coef 8 giving 
>>>>>> the the differentially expressed gene for the 
>>>>>> species*treatment*time test (after calling topTags that is). Yet to 
>>>>>> ensure I have the statistics correct, my questions are: (1) is this 
>>>>>> thinking correct, as I don¹t see many 3x2 factorial models to 
>>>>>> follow, and (2) do I need to set up a reference somehow (which I 
>>>>>> assume would be the set of four samples with 
>>>>>> TreatmentBenign*SpeciesB*Time2, but I¹m not fully sure if that is 
>>>>>> correct or needed).
>>>>>> 
>>>>>> Many thanks in advance for your insight!
>>>>>> ~Hilary
>>>>>> 
>>>>>>> designFF <- model.matrix(~Treatment*Species*Age)
>>>>>>> colnames(designFF)
>>>>>> [1] "(Intercept)"
>>>>>> [2] " TreatmentStress"
>>>>>> [3] "SpeciesA "
>>>>>> [4] "Time1"
>>>>>> [5] "TreatmentStress:SpeciesA"
>>>>>> [6] "TreatmentStress:Time1"
>>>>>> [7] "SpeciesA:Time1"
>>>>>> [8] "TreatmentStress:SpeciesA:Time1"
>>>>>> 
>>>>>> And then to run tests with:
>>>>>>> fit <- glmFit(y, designFF)
>>>>>> 
>>>>>>> lrtInteractionStressSpecies <- glmLRT(fitFF, coef=5)
>>>>>>> lrtInteractionStressSpeciesTime <- glmLRT(fitFF, coef=8)

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}