[BioC] edgeR: generating a correct design matrix - multifactorial design

James W. MacDonald jmacdon at uw.edu
Fri Jul 27 16:26:54 CEST 2012


Hi Natasha,

On 7/27/12 6:58 AM, Natasha Sahgal wrote:
> Dear James,
>
> Thank you for the prompt response.
> When it comes to interactions, I do tend to get stumped easily.
>
> I had some additional queries:
>
> 1) I realized I made a typo in the array design model which would be: design = model.matrix(~0+dat$Group): to give the contrast of (Mut+ - Mut) - (WT+ - WT).
>
> As you mentioned, I did notice the glmLRT function has a contrast parameter, but was/am not sure if the above design model (i.e. ~0+dat$Group) would work with the aforementioned contrast? Since that was limma based for arrays and this is RNA-Seq.

Yes it will. You specify the design/contrast matrices exactly as you 
would for limma.
> 2) Thank you for the detailed alternative parameterization method, I realize the error in my dat object.
>
> On another note:
>
> 3) I do not completely follow the cpm (counts per million) concept for filtering the data. (Probably very naïve but) How does one decide on the cpm? How does one equate it to (read) counts - that's what the count table is right - number of reads? Why not just use those (i.e. say something like>= 10 reads in at least 2 samples (in my case here, as I gather from reading the user guide its about 50% of samples)?

You use cpm rather than raw counts because the different samples may 
have different numbers of total reads. For example, one sample may have 
twice as many reads as another, in which case you wouldn't want to use 
raw reads to do any comparisons or filtering (e.g., if you have twice as 
many total reads in one sample vs another, you would expect 
approximately twice as many reads per transcript as well, but that 
wouldn't indicate differential expression).

As for how many cpm to use to filter, this is sort of ad hoc, and is 
really intended to get rid of all those transcripts for which there is 
really no evidence for any expression.

> 4) Since tag-wise dispersion is recommended and one must calculate either common or trended dispersion prior to that.
> How does one decide to use common dispersion or trended dispersion or both before calculating tag-wise dispersion? (Or am I missing something obvious here).

The default is to use trended dispersions if both common and trended 
dispersion data exist, so I assume that means the better idea is to use 
trended. But I don't know empirically how one might decide between the two.

Best,

Jim


>
> Many Thanks in advance.
>
> Best Wishes,
> Natasha
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: 26 July 2012 16:03
> To: Natasha Sahgal; 'bioconductor at r-project.org'
> Subject: Re: [BioC] edgeR: generating a correct design matrix - multifactorial design
>
> Hi Natasha,
>
> On 7/26/12 8:29 AM, Natasha Sahgal wrote:
>> Dear Prof. Gordon and List,
>>
>> I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design.
>> Having gone through the user guide, I am a bit confused as to how to generate the model for my expt.
>>
>> The expt: 2 cell-lines (mut,wt), 2 conditions(stimulated, unstimulated), n=2 in each group.
>> My aim: to detect DE genes based on the effect of stimulus on mut cells.
>>
>> Thus,
>> dat
>>     Sample Group Stim
>> 1      1    WT   No
>> 2      2    WT   No
>> 3      3   WT+  Yes
>> 4      4   WT+  Yes
>> 5      5   Mut   No
>> 6      6   Mut   No
>> 7      7  Mut+  Yes
>> 8      8  Mut+  Yes
>>
>> Now, if this were array data the model would be:
>> design = model.matrix(~dat$Group)
>> and whilst fitting the model I could make a contrast such as (Mut+ -
>> Mut) - (WT+ - WT)
>>
>> I am not sure how to do this for the RNA-Seq data (i.e. what should
>> the model be? And what coefficients should I pull out?)
> You use the same exact design matrix you would have used for array data.
> Note that glmLRT() has a contrast argument, so if you parameterize the way you are doing here, you will need to use that (I am assuming you are just doing something like model.matrix(~Group), as including Stim doesn't make sense).
>
> Alternatively you could parameterize by separating out the group from the stim and fit the interaction as part of the design matrix:
>
>   >  Group<- factor(rep(c("WT", "Mut"), each=4))>  Group<- relevel(Group, "WT")>  Stim<- factor(rep(c("No","Yes"), each=2, times=2))>  model.matrix(~Group*Stim)
>     (Intercept) GroupMut StimYes GroupMut:StimYes
> 1           1        0       0                0
> 2           1        0       0                0
> 3           1        0       1                0
> 4           1        0       1                0
> 5           1        1       0                0
> 6           1        1       0                0
> 7           1        1       1                1
> 8           1        1       1                1
>
> And glmLRT() will test the interaction term by default.
>
> Best,
>
> Jim
>> Whether the model should be:
>> 1) model.matrix(~dat$Group) and somehow in the glmLRT function specify the above contrast in some manner?
>>
>> 2) model.matrix(~dat$Group+dat$Group*dat$Stim) (coefficient/contrast?)
>>
>> 3) model.matrix(~dat$Group*dat$Stim) (coefficient/contrast?)
>>
>>
>> I'd appreciate any help and advice.
>>
>>
>> Many Thanks,
>> Natasha
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list