[BioC] How to design matrix on edgeR to study genotype x environmental interaction

Gordon K Smyth smyth at wehi.EDU.AU
Fri Nov 23 04:27:22 CET 2012


Dear Daniela,

I think you would be very well advised to seek out a statistical 
bioinformatician with whom you can collaborate on an ongoing basis.  A GxE 
anova analysis would be statistically sophisticated even if you were 
analysing a simple univariate phenotypic trait.  Attempting to do that 
sort of analysis in the context of an RNA-Seq experiment on miRNAs is far 
more difficult again.  The design matrices you have created may be 
correct, but that's just the start of the analysis, and there are many 
layers of possible complexity.

The BCV in your experiment is so large that I feel there must be quality 
issues with your data that you have not successfully dealt with.  It seems 
very likely, for example, that there are batch effects that you have not 
yet described.

To answer some specific questions:

You might be better off with prior.df=10 instead the default, but this has 
little to do with the size of the BCV.

You ask why one variety and one stage are disappearing from your design 
matrix.  If you omit the "0+" in the first formula (and you should), you 
will find that one vineyard will disappear as well.  This is because the 
number of contrasts for any factor must be one less than the number of 
leveles.  This is a very fundamental feature of factors and model formula 
that you need to become familiar with before you can make sense of any 
model formula.

Your email makes no mention of library sizes or sequencing depths, but 
obviously that has a fundamental effect on what is significantly different 
from what.

I think you know now how to use edgeR in principle.  However, as you 
probably already appreciate, deciding what is the right analysis for your 
data is beyond the scope of the mailing list.

Best wishes
Gordon


On Thu, 22 Nov 2012, bioconductor-request at r-project.org wrote:

> Date: Thu, 22 Nov 2012 10:07:19 +0100
> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x
> 	environmental interaction
> Message-ID:
>
> Dear Gordon,
>
> Thank you so much for your valuable input. I took sometime to study a bit
> more and be able to consider all the aspects you pointed out. At this time
> I reconsider the analysis and started again, with the data exploration of
> all 48 samples.
>
> First I filtered out the low reads, considering just the ones with more
> than 1 cpm in at least 2 libraries (I have two replicates of each library);
> the MDS plot clearly separate one of the locations from the other two
> (dimension 1) and with less distinction the two varieties (dimension 2).
> The stages also seems to be separated in two groups (the first two ones
> together and separate of the two last ones) but as the varieties, not so
> distinct. The two replicates are also consistent.
>
> With the BCV plot I could observe that reads with lower logCPM have bigger
> BCV (the BCV value was equal to 0.5941), and then comes my first question:
>
> Should I choose *prior.df* different from the default, due to this
> behavior, when estimating genewise dispersion?
>
> To proceed with the DE analysis, I tried two approaches, this time with all
> the 48 samples, as suggested.
> For both approaches, I have the following data frame:
>
>> target
>   Sample Vineyard Variety Stage
> 1       1     mont      CS    ps
> 2       2     mont      CS    ps
> 3       4     mont      CS    bc
> 4       5     mont      CS    bc
> 5       7     mont      CS   19b
> 6       8     mont      CS   19b
> 7      10     mont      CS    hv
> 8      11     mont      CS    hv
> 9      13     mont      SG    ps
> 10     14     mont      SG    ps
> 11     16     mont      SG    bc
> 12     17     mont      SG    bc
> 13     19     mont      SG   19b
> 14     20     mont      SG   19b
> 15     22     mont      SG    hv
> 16     23     mont      SG    hv
> 17     25      Bol      CS    ps
> 18     26      Bol      CS    ps
> 19     28      Bol      CS    bc
> 20     29      Bol      CS    bc
> 21     31      Bol      CS   19b
> 22     32      Bol      CS   19b
> 23     34      Bol      CS    hv
> 24     35      Bol      CS    hv
> 25     37      Bol      SG    ps
> 26     38      Bol      SG    ps
> 27     40      Bol      SG    bc
> 28     41      Bol      SG    bc
> 29     43      Bol      SG   19b
> 30     44      Bol      SG   19b
> 31     46      Bol      SG    hv
> 32     47      Bol      SG    hv
> 33     49      Ric      CS    ps
> 34     50      Ric      CS    ps
> 35     52      Ric      CS    bc
> 36     53      Ric      CS    bc
> 37     55      Ric      CS   19b
> 38     56      Ric      CS   19b
> 39     58      Ric      CS    hv
> 40     59      Ric      CS    hv
> 41     61      Ric      SG    ps
> 42     62      Ric      SG    ps
> 43     64      Ric      SG    bc
> 44     65      Ric      SG    bc
> 45     67      Ric      SG   19b
> 46     68      Ric      SG   19b
> 47     70      Ric      SG    hv
> 48     71      Ric      SG    hv
>
> At the first instance, I used the full interaction formula as the following
> code:
>
>> d <- DGEList(counts=file)
>> keep <- rowSums(cpm(DGElist) > 1) >= 2
>> DGElist <- DGElist[keep,]
>> DGElist$samples$lib.size <- colSums(DGElist$counts)
>> DGElist_norm <- calcNormFactors(DGElist)
> *> design <- model.matrix(~0 + Vineyard + Variety + Stage +
> Vineyard:Variety + Vineyard:Stage + Variety:Stage + Vineyard:Variety:Stage,
> data=target)*
>
> [or even  (*> design <- model.matrix(~0 + Vineyard*Variety*Stage,
> data=target)*) which gives the same result]
>
>> rownames(design) <- colnames(DGEList_norm)
>
> However, when I call the *design* I see that one Variety (i.e., CS) and one
> Stage (i.e., 19b) are not present in the design matrix, as individual
> effect or even in the interactions.
>
> Then I passed to the second approach, in which, I create groups:
>
>> group <-
> factor(paste(target$Vineyard,target$Variety,target$Stage,sep="_"))
>> cbind(target,Group=group)
>> DGElist <- DGEList(counts=file,group=group)
>> keep <- rowSums(cpm(DGElist) > 1) >= 2
>> DGElist <- DGElist[keep,]
>> DGElist$samples$lib.size <- colSums(DGElist$counts)
>> DGElist_norm <- calcNormFactors(DGElist)
>> design <- model.matrix(~0+group, data=DGElist_norm$samples)
>> colnames(design) <- levels(group)
>
> The design matrix in this case include all the groups, and then I proceed
> doing:
>
>> commondisp <- estimateGLMCommonDisp(DGElist_norm, design, verbose=TRUE)
> Disp = 0.35294 , BCV = 0.5941
>> trenddisp <- estimateGLMTrendedDisp(commondisp, design)
>> tagwisedisp <- estimateGLMTagwiseDisp(trenddisp, design)
>> fit <- glmFit(tagwisedisp, design)
>> my.contrasts <- makeContrasts(CS_ps_BolvsMont = Bol_CS_ps-mont_CS_ps,
> CS_ps_BolvsRic = Bol_CS_ps-Ric_CS_ps, Bol_ps_CSvsSG = Bol_CS_ps-Bol_SG_ps,
> levels=design)  #Just as some examples of the contrasts I am interested on.
>> lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_ps_BolvsMont"])
>
> With this code, I got the results, but I am afraid that they are not very
> consistent with the data. To give one example, the DE results tell me that
> a given miRNA which has 0 and 1 reads respectively in the two replicates of
> one sample is significantly different when comparing with  other sample in
> which this miRNA has 5 and 10 reads in the two replicates  respectively,
> but in the same set of results another miRNA which has 4259 and 2198  reads
> respectively in the two replicates of one sample is not significantly
> different when comparing with  the other sample   in which this miRNA has
> 352 and 599 reads respectively in the two replicates. In other words, 0 and
> 1 are significantly different from 5 and 10  but 4259 and 2198 are
> not significantly different from  352 and 599. With this comparisons, I am
> just trying to interpret my data based on these results.
>
> I know that the test for differential expression is not made based on the
> raw reads, but I do not know exactly how it is made, anyway I expect that
> if I used the correct model to describe my data, the results will describe
> the differences consistently.
> Could you make any suggestions about my analysis? Creating the groups as I
> showed above, is it correct for testing all the interactions? Is there any
> explanation for the fact that the one variety and one stage "disappear"
> from the design matrix when using the full interaction formula?
>
> Sorry for the long email and thank you for all the advises,
>
> Best wishes
>
> Daniela Lopes Paim Pinto
> PhD student - Agrobiosciences
> Scuola Superiore Sant'Anna, Italy
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] edgeR_3.0.3  limma_3.14.1
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.2
>
>
>
>
>
>
>
>
>
>
> 2012/11/11 Gordon K Smyth <smyth at wehi.edu.au>
>
>> Dear Daniela,
>>
>> What version of the edgeR are you using?  The posting guide asks you to
>> give sessionInfo() output so we can see package versions.
>>
>> Your codes looks correct for testing an interaction, although you could
>> estimate the same interaction more directly using an interaction formula as
>> in Section 3.3.4 of the edgeR User's Guide.
>>
>> However the model you have used is correct only if all 12 samples
>> correspond to the same physiological stage.  I wonder why you are not
>> analysing all the 48 samples together.  I would start with data exploration
>> of all 48 samples, including exploration measures like transcript
>> filtering, library sizes, normalization factors, an MDS plot, a BCV plot,
>> and so on.  The first step is to check the data quality before going on to
>> test for differential expression.
>>
>> edgeR has very high statistical power, even giving p-values smaller than I
>> would like in some cases.  So if you're not getting any differential
>> expression, it is because there is none or because you have data quality
>> problems.
>>
>> Best wishes
>> Gordon
>>
>>  Date: Fri, 9 Nov 2012 14:44:28 +0100
>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>> To: bioconductor at r-project.org
>>> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x
>>>         environmental interaction
>>>
>>> Dear Gordon,
>>>
>>> Thank you so much for the reference. I read all the chapter regarding to
>>> the models and I tried to  set up the following code considering a data
>>> frame like this:
>>>
>>>  target
>>>>
>>>   Sample Variety Location
>>> 1       1      CS     Mont
>>> 2       2      CS     Mont
>>> 3      25      CS      Bol
>>> 4      26      CS      Bol
>>> 5      49      CS      Ric
>>> 6      50      CS      Ric
>>> 7      13      SG     Mont
>>> 8      14      SG     Mont
>>> 9      37      SG      Bol
>>> 10     38      SG      Bol
>>> 11     61      SG      Ric
>>> 12     62      SG      Ric
>>>
>>>  group <- factor(paste(target$Variety,**target$Location,sep="_"))
>>>> cbind(target,Group=group)
>>>> d <- DGEList(counts=file,group=**group)
>>>> DGEnorm <- calcNormFactors(d)
>>>> design <- model.matrix(~0+group, data=DGEnorm$samples)
>>>> colnames(design) <- levels(group)
>>>>
>>>
>>> Which gave me the design matrix:
>>>
>>>  design
>>>>
>>>          CS_Bol CS_Mont CS_Ric SG_Bol SG_Mont SG_Ric
>>> CS_Mont        0       1      0      0       0      0
>>> CS_Mont.1      0       1      0      0       0      0
>>> CS_Bol         1       0      0      0       0      0
>>> CS_Bol.1       1       0      0      0       0      0
>>> CS_Ric         0       0      1      0       0      0
>>> CS_Ric.1       0       0      1      0       0      0
>>> SG_Mont        0       0      0      0       1      0
>>> SG_Mont.1      0       0      0      0       1      0
>>> SG_Bol         0       0      0      1       0      0
>>> SG_Bol.1       0       0      0      1       0      0
>>> SG_Ric         0       0      0      0       0      1
>>> SG_Ric.1       0       0      0      0       0      1
>>> attr(,"assign")
>>> [1] 1 1 1 1 1 1
>>> attr(,"contrasts")
>>> attr(,"contrasts")$group
>>> [1] "contr.treatment"
>>>
>>> And then I estimated the trended and tag wise dispersion and fit the model
>>> doing:
>>>
>>>  disp.tren <- estimateGLMTrendedDisp(**DGEnorm,design)
>>>> disp.tag <- estimateGLMTagwiseDisp(disp.**tren,design)
>>>> fit <- glmFit(disp.tag,design)
>>>>
>>>
>>> When I made some contrasts to find DE miRNAs, for example:
>>>
>>>  my.constrasts <- makeContrasts(CS_BolvsMont = CS_Bol-CS_Mont,
>>>>
>>> CSvsSG_BolvsMont = (CS_Bol-CS_Mont)-(SG_Bol-SG_**Mont), levels=design)
>>>
>>>> lrt <- glmLRT(fit, contrast=my.constrasts[,"CS_**BolvsMont"])
>>>>
>>>
>>> I expected to find DE miRNAs due the environment effect (CS_BolvsMont) and
>>> for example DE miRNAs due the interaction genotypeXenvironment  (
>>> CSvsSG_BolvsMont).
>>>
>>> However the results do not seems to reflect it, since I did not get even a
>>> single DE miRNA with   significant FDR (even less than 20%!!!!) and going
>>> back to the counts in the raw data I find reasonable differences in their
>>> expression, which was expected. I forgot to mention that I decided to
>>> consider stage by stage separately and not add one more factor on the
>>> model, since I am not interested, for the moment, on the time course (as I
>>> wrote in the previous email - see below).
>>>
>>> Could you (or any body else from the list) give me some advise regarding
>>> the code? Is this matrix appropriate for the kind of comparisons I am
>>> interested on?
>>>
>>> Thank you in advance for any input.
>>>
>>> Daniela
>>>
>>>
>>>
>>>
>>> 2012/10/30 Gordon K Smyth <smyth at wehi.edu.au>
>>>
>>>  Dear Daniela,
>>>>
>>>> edgeR can work with any design matrix.  Just setup your interaction
>>>> model using standard R model formula.  See for example Chapter 11 of:
>>>>
>>>>
>>>>  http://cran.r-project.org/doc/****manuals/R-intro.pdf<http://cran.r-project.org/doc/**manuals/R-intro.pdf>
>> <http://**cran.r-project.org/doc/**manuals/R-intro.pdf<http://cran.r-project.org/doc/manuals/R-intro.pdf>
>>>
>>
>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>>  Date: Mon, 29 Oct 2012 16:24:31 +0100
>>>>
>>>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>>>> To: bioconductor at r-project.org
>>>>> Subject: [BioC] How to design matrix on edgeR to study genotype x
>>>>>         environmental interaction
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I'm currently working with data coming from deep sequencing of 48 small
>>>>> RNAs libraries and using edgeR to identify DE miRNAs. I could not figure
>>>>> out how to design my matrix for the following experimental design:
>>>>>
>>>>> I have 2 varieties (genotypes), cultivated in 3 different locations
>>>>> (environments) and collected in 4 physiological stages. None of them
>>>>> represent a control treatment. I'm particulary interested on identifying
>>>>> those miRNAs which modulate their expression dependent on genotypes (G),
>>>>> environments (E) and G x E interaction. For instance the same variety in
>>>>> the 3 different locations, both varieties in the same location and both
>>>>> varieties in the 3 different locations.
>>>>>
>>>>> I was wondering if I could use the section 3.3 of edgeR user guide as
>>>>> reference or if someone could suggest me any other alternative method.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Daniela
>>>>>
>>>>>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list