[BioC] edgeR for RNA-seq: nested paired analysis with uneven groups

Mon May 26 17:29:30 CEST 2014

Hello,
I am new to the field and have a question regarding analysis of RNA-seq data with the edgeR package. I am looking at the difference in gene expression between two groups of patients (control and disease) after stimuli. I need to perform a paired analysis for each participant within each group, and then compare groups (different individuals between groups). My problem is that I have a different number of participants between in each group. 
Hereâ€™s a summary of my experimental design:

1) Paired analysis for control group (n=7)
control patient 1: measurement 1 -> disease stimuli -> measurement 2
control patient 2: measurement 1 -> disease stimuli -> measurement 2
â€¦. Control group expression (n=7)

2) Paired analysis for disease group (n=6)
diseased patient1: measurement 1 -> disease stimuli -> measurement 2
diseased patient2: measurement 1 -> disease stimuli -> measurement 2
â€¦ Disease group expression (n=6)

3) Differential expression Diseases vs healthy groups

So I am basically looking for genes that demonstrate the highest variation after stimuli between the two groups.

Based on the section 3.5 of the edgeR users guide, I first came up with the following nested paired analysis:
--------------------------------------------------------------------------------------------------------
#Individual patients
> patient <- factor(c('patient1', 'patient1', 'patient2', 'patient2', 'patient3', 'patient3', 'patient4', 'patient4', 'patient5', 'patient5','patient6', 'patient6', 'patient7', 'patient7', 'patient8', 'patient8',  'patient9', 'patient9', 'patient10', 'patient10', 'patient11', 'patient11', 'patient12', 'patient12', 'patient13', 'patient13' ))

#I re-numbered each patient
> n <- factor(c(1, 1,2, 2, 3, 3, 4, 4,5, 5, 6, 6, 7, 7,1, 1,2, 2, 3, 3, 4, 4, 5, 5, 6, 6), levels=c(1,2,3,4,5,6,7))

#I assigned them to their respective group
> group <- factor(c('control', 'control','control', 'control','control','control','control','control','control','control','control','control','control','control', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease','disease', 'disease', 'disease', 'disease'), levels=c("control","disease"))

# time 1 refers to measurement before stimuli and time 2 refers to measurement after stimuli
> time <- factor(c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2), levels=c(1,2))

> data.frame(sample=colnames(y),patient,n,time,group)

sample       patient            n         time        group

1                patient1          1            1         control
2                patient1          1            2         control
3                patient2          2            1         control
4                patient2          2            2         control
5                patient3          3            1         control
6                patient3          3            2         control
7                patient4          4            1         control
8                patient4          4            2         control
9                patient5          5            1         control
10              patient5          5            2         control
11              patient6          6            1         control
12              patient6          6            2         control
13              patient7          7            1         control
14              patient7          7            2         control
15              patient8          1            1         disease
16              patient8          1            2         disease
17              patient9          2            1         disease
18              patient9          2            2         disease
19              patient10        3            1         disease
20              patient10        3            2         disease
21              patient11        4            1         disease
22              patient11        4            2         disease
23              patient12        5            1         disease
24              patient12        5            2         disease
25              patient13        6            1         disease
26              patient13        6            2         disease

> design <- model.matrix(~group+group:n+group:time)
> colnames(design)

[1] "(Intercept)"        "groupdisease"       "groupcontrol:n2"    "groupdisease:n2"    "groupcontrol:n3"    "groupdisease:n3"    "groupcontrol:n4"    "groupdisease:n4"   
 [9] "groupcontrol:n5"    "groupdisease:n5"    "groupcontrol:n6"    "groupdisease:n6"    "groupcontrol:n7"    "groupdisease:n7"    "groupcontrol:time2" "groupdisease:time2"

#Estimate dispersion

> y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
> y$common.dispersion

> y <- estimateGLMTrendedDisp(y, design)
> names(y)
> summary(y$tagwise.dispersion)

> y <- estimateGLMTagwiseDisp(y, design)

#Differential expression

> fit <- glmFit(y, design)

> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))

---------------------------------------------------------------------------------------------
Now, it obviously didn't work since my groups are not of equal number and that re-numbering my sample this way and the contrast matrix make no sense. So I thought of re-numbering them this way:

>n <- factor(c(1, 1,2, 2, 3, 3, 4, 4,5, 5, 6, 6, 7, 7,8, 8,9, 9, 10, 10, 11, 11, 12, 12, 13, 13), levels=c(1,2,3,4,5,6,7,8,9,10,11,12,13))

And designing my matrix this way:
> design <- model.matrix(~group+group:n+n:time)
> colnames(design)

With the entire analysis: 
---------------------------------------------------------------------------------------------

> patient <- factor(c('patient1', 'patient1', 'patient2', 'patient2', 'patient3', 'patient3', 'patient4', 'patient4', 'patient5', 'patient5','patient6', 'patient6', 'patient7', 'patient7', 'patient8', 'patient8',  'patient9', 'patient9', 'patient10', 'patient10', 'patient11', 'patient11', 'patient12', 'patient12', 'patient13', 'patient13' ))

> n <- factor(c(1, 1,2, 2, 3, 3, 4, 4,5, 5, 6, 6, 7, 7,8, 8,9, 9, 10, 10, 11, 11, 12, 12, 13, 13), levels=c(1,2,3,4,5,6,7,8,9,10,11,12,13))

> group <- factor(c('control', 'control','control', 'control','control','control','control','control','control','control','control','control','control','control', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease','disease', 'disease', 'disease', 'disease'), levels=c("control","disease"))

> time <- factor(c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2), levels=c(1,2))

 sample        patient           n         time        group

1                 patient1          1            1         control
2                 patient1          1            2         control
3                 patient2          2            1         control
4                 patient2          2            2         control
5                 patient3          3            1         control
6                 patient3          3            2         control
7                 patient4          4            1         control
8                 patient4          4            2         control
9                 patient5          5            1         control
10               patient5          5            2         control
11               patient6          6            1         control
12               patient6          6            2         control
13               patient7          7            1         control
14               patient7          7            2         control
15               patient8          8            1         disease
16               patient8          8           2          disease
17               patient9          9            1         disease
18               patient9          9            2         disease
19               patient10        10          1         disease
20               patient10        10          2         disease
21               patient11        11          1         disease
22               patient11        11          2         disease
23               patient12        12          1         disease
24               patient12        12          2         disease
25               patient13        13          1         disease
26               patient13        13          2         disease

>design <- model.matrix(~group+group:n+n:time)
> colnames(design)

[1] "(Intercept)"      "groupdisease"     "groupcontrol:n2"  "groupdisease:n2"  "groupcontrol:n3"  "groupdisease:n3"  "groupcontrol:n4"  "groupdisease:n4"  "groupcontrol:n5" 
[10] "groupdisease:n5"  "groupcontrol:n6"  "groupdisease:n6"  "groupcontrol:n7"  "groupdisease:n7"  "groupcontrol:n8"  "groupdisease:n8"  "groupcontrol:n9"  "groupdisease:n9" 
[19] "groupcontrol:n10" "groupdisease:n10" "groupcontrol:n11" "groupdisease:n11" "groupcontrol:n12" "groupdisease:n12" "groupcontrol:n13" "groupdisease:n13" "n1:time2"        
[28] "n2:time2"         "n3:time2"         "n4:time2"         "n5:time2"         "n6:time2"         "n7:time2"         "n8:time2"         "n9:time2"         "n10:time2"       
[37] "n11:time2"        "n12:time2"        "n13:time2"
---------------------------------------------------------------------------------------------

However, now I am not sure if my matrix is right and I am confused as to what would be the best way to compare the two groups. Any ideas would be greatly appreciated.

Thank you very much,

L

 -- output of sessionInfo(): 

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.7.1  limma_3.21.3

--
Sent via the guest posting facility at bioconductor.org.