[BioC] edgeR design matrix, one group vs average of other groups

Fri Mar 14 20:26:08 CET 2014

Thanks a lot, Yunshun and Ryan, for your informative answers. I
understand that for my purposes it is preferable to use a design matrix
like this:

> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 0 1 0
sample.4 0 1 0
sample.5 0 0 1

and to compare C with the average of A and B using a contrast like this:

> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
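
To spell out what this contrast tests: with a no-intercept design each
coefficient is one group's level on the log scale, so the contrast
(-0.5, -0.5, 1) is C minus the unweighted mean of A and B. A minimal
base-R sketch of that arithmetic, in which the `beta` values are
hypothetical and only there for illustration (they are not from my data,
and no edgeR call is involved):

```r
# Group labels for the five samples in the design above
group <- factor(c("A", "A", "B", "B", "C"))

# No-intercept design: one indicator column per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Hypothetical per-group coefficients (log scale) for a single gene
beta <- c(A = 2, B = 4, C = 5)

# Contrast comparing C with the average of A and B
contrast <- c(A = -0.5, B = -0.5, C = 1)

# The tested quantity is C - (A + B) / 2
effect <- sum(contrast * beta)
print(effect)
```

The contrast weights sum to zero, so the test is of a difference between
group levels and not of any overall expression level.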

But what happens if the number of samples in groups A and B is strongly
imbalanced, e.g.:

> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 1 0 0
sample.4 1 0 0
sample.5 1 0 0
sample.6 1 0 0
sample.7 0 1 0
sample.8 0 1 0
sample.9 0 0 1
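
For reference, a design like this can be generated rather than typed in
by hand. This is only a sketch: it assumes the samples are ordered six A,
two B, one C as in the table, and the `group` vector is constructed here
purely for illustration.

```r
# Six A samples, two B samples, one C sample, as in the table above
group <- factor(rep(c("A", "B", "C"), times = c(6, 2, 1)))

# No-intercept design with one indicator column per group
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
rownames(design) <- paste0("sample.", seq_along(group))

# Number of samples per group
n_per_group <- colSums(design)
print(n_per_group)
```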

Should I still use the above approach or is it more advisable to put A
and B in one group and test AB vs C?

> design
         A.B C
sample.1   1 0
sample.2   1 0
sample.3   1 0
sample.4   1 0
sample.5   1 0
sample.6   1 0
sample.7   1 0
sample.8   1 0
sample.9   0 1

> lrt <- glmLRT(fit, contrast=c(-1,1))
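
To make the two options concrete, here is a sketch using an ordinary
linear model on made-up log-expression values. It is an analogy only:
edgeR fits negative binomial GLMs, and `y` below is hypothetical data
for a single gene, not anything from my experiment.

```r
# Imbalanced groups: six A, two B, one C
group <- factor(rep(c("A", "B", "C"), times = c(6, 2, 1)))

# Hypothetical log-expression values for one gene: A = 2, B = 4, C = 5
y <- c(rep(2, 6), rep(4, 2), 5)

# Approach 1: three-group design, C versus the unweighted mean of A and B
fit3 <- lm(y ~ 0 + group)
effect_contrast <- sum(c(-0.5, -0.5, 1) * coef(fit3))

# Approach 2: pool A and B into one group, C versus the pooled mean
pooled <- factor(ifelse(group == "C", "C", "A.B"))
fit2 <- lm(y ~ 0 + pooled)
effect_pooled <- sum(c(-1, 1) * coef(fit2))

print(c(contrast = effect_contrast, pooled = effect_pooled))
```

In this toy example the contrast gives 2 (5 minus the midpoint of 2 and
4), while pooling gives 2.5 (5 minus the size-weighted mean of the eight
A and B samples), so the pooled baseline is pulled towards the larger
group A. In the pooled fit, the A-versus-B difference also ends up in the
residual variation.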

Thanks a lot and best wishes,

Georg

Georg Otto <georg.otto at imm.ox.ac.uk> writes:

> Dear Bioconductors,
>
> I am working on RNA-seq data with multiple experimental factors and I am
> trying to follow the GLM approach in chapter 3.2.3 of the edgeR manual.
>
>
>> design <- model.matrix(~0+group, data=y$samples)
>> colnames(design) <- levels(y$samples$group)
>> design
>          A B C
> sample.1 1 0 0
> sample.2 1 0 0
> sample.3 0 1 0
> sample.4 0 1 0
> sample.5 0 0 1
>
>> fit <- glmFit(y, design)
>
>
> I want to know which genes are differentially expressed in C compared to
> the other groups, so I chose to compare C to the average of A and B:
>
>> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
>
>
> Alternatively I could put A and B in a single group
>
>> design
>          A.B C
> sample.1   1 0
> sample.2   1 0
> sample.3   1 0
> sample.4   1 0
> sample.5   0 1
>
>> fit <- glmFit(y, design)
>
> and compare C to A.B
>
>> lrt <- glmLRT(fit, contrast=c(-1,1))
>
>
> When I try this with my own data, the first approach gives me many more
> differentially expressed genes than the second one, but the second gene
> set is a subset of the first one. I would be very grateful if somebody
> could explain to me what is the difference between the approaches, and
> which one is more appropriate for my purpose (finding genes specific
> to condition C).
>
> Best wishes,
>
> Georg
>
>> sessionInfo()
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
>  [7] LC_PAPER=C                 LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
>
> other attached packages:
> [1] limma_3.18.13
>
> loaded via a namespace (and not attached):
> [1] compiler_3.0.1 tools_3.0.1