[BioC] edgeR: paired samples DGE and GLM

Alessandra [guest] guest at bioconductor.org
Tue Mar 12 12:20:08 CET 2013


Hi,

I am trying to use edgeR to compute differential gene expression.

I have a quite simple experimental design:
labExpId Patient sex    Treatment
sample.4  1   F            A
sample.3  2   M           A
sample.5  2   M           B
sample.7  3   F            A
sample.0  4   M           A
sample.8  3   F            B
sample.1  4   M           B
sample.2  1   F            B


I am comparing the effect of treatment across all patients, and it is fine when I apply exactTest. Then I wanted to include also the Patient in my model and I try using GLM, but I found several problems. So, I tried to apply GLM only including the Treatment just to troubleshoot. I follow the instructions on page 22 from the user's guide revised in December 2012.

My counts are stored in a matrix called m.

>colnames(m)
[1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" sample.8"
[7] "sample.0" "sample.1"

>design=model.matrix(~0+Treatment)
> design
          TreatmentA          TreatmentB
sample.4             1               0
sample.3             1               0
sample.5             0               1
sample.7             1               0
sample.0             1               0
sample.8             0               1
sample.1             0               1
sample.2             0               1

>M = DGEList(counts=na.omit(m))

>cpm.m <- cpm(M)
>M <- M[rowSums(cpm.m >1) >=1,]
>M <- calcNormFactors(M)

>M <- estimateGLMCommonDisp(M, design, verbose=T)
> names(M)
[1] "counts"            "samples"           "abundance"        
[4] "logCPM"            "common.dispersion"

First thing I notice, I don't see pseudo.counts and other attributes I saw when calling estimateCommonDisp().

> M <- estimateGLMTrendedDisp(M, design)
> M <- estimateGLMTagwiseDisp(M, design)

> fit <- glmFit(M, design)
> lrt <- glmLRT(fit, contrast=c(-1,1))
> res = topTags(lrt, n=nrow(lrt))$table

> head(res)
                     logFC    logCPM       LR       PValue          FDR
ENSG00000244115  15.512665  2.266789 35.30878 2.813602e-09 3.482958e-05
ENSG00000256329 -17.760869  4.514903 32.89653 9.719681e-09 6.015996e-05
ENSG00000223638  11.777617 -1.467638 18.46673 1.728967e-05 7.134295e-02
ENSG00000134321  -5.328058  5.959204 16.76318 4.234703e-05 1.310535e-01
ENSG00000196861   7.776972 -1.722111 15.83143 6.924259e-05 1.494412e-01
ENSG00000229292  11.368590 -1.876601 15.74621 7.243290e-05 1.494412e-01

> m[rownames(head(res)),]
                sample.2 sample.3 sample.4 sample.5 sample.7 sample.8
ENSG00000244115         0         0         0         0         0      6412
ENSG00000256329         0         0         0     32070     0         0
ENSG00000223638         0         0         2         0         0        0
ENSG00000134321      1178       590   890     83130   754   1116
ENSG00000196861         0         3       363       0         0        53
ENSG00000229292         0         0         0         0         0         0

                sample.0 sample.1
ENSG00000244115         0      4415
ENSG00000256329         0         0
ENSG00000223638       434       550
ENSG00000134321       852      1092
ENSG00000196861       500        57
ENSG00000229292       344       406

I realized that it may be a problem of the ordering of the design matrix rows

> colnames(m)
[1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" "sample.8"
[7] "sample.0" "sample.1"
> rownames(design)
[1] "sample.4" "sample.3" "sample.5" "sample.7" "sample.0" "sample.8"
[7] "sample.1" "sample.2"

so I redo it forcing the row names of my design matrix to be the same order as the column names in the matrix count.
Now my design matrix looks like:

> design
          TreatmentA TreatmentB
sample.2             0               1
sample.3             1               0
sample.4             1               0
sample.5             0               1
sample.7             1               0
sample.8             0               1
sample.0             1               0
sample.1             0               1

I execute exactly the same commands as before.
Again I cannot see the pseudocounts.
> names(M)
[1] "counts"            "samples"           "abundance"        
[4] "logCPM"            "common.dispersion"

> head(res)
                    logFC     logCPM       LR PValue FDR
ENSG00000074855 -32.17749  0.2759965 2121.104      0   0
ENSG00000076351 -32.17749 -0.4306713 1549.345      0   0
ENSG00000105552 -32.17749 -0.4864164 1502.658      0   0
ENSG00000106991 -32.17749  0.4622641 2144.947      0   0
ENSG00000123009 -32.17749 -0.2422835 1719.625      0   0
ENSG00000133216 -32.17749  0.2206127 2127.575      0   0

I get very different results still from the exactTest with no GLM.
I attribute this to the missing pseudocounts in the DGEList object?

> m[rownames(head(res)),]
                sample.2 sample.3 sample.4 sample.5 sample.7 sample.8
ENSG00000074855         0       948       902         0       676         0
ENSG00000076351         0       494       522         0       538         0
ENSG00000105552         0       490       454         0       486         0
ENSG00000106991         0      1094       798        0       698         0
ENSG00000123009         0       556       520         0       566         0
ENSG00000133216         0       808       762         0       730         0
                sample.0 sample.1
ENSG00000074855      1166        0
ENSG00000076351       698         0
ENSG00000105552       764         0
ENSG00000106991      1733        0
ENSG00000123009       980         0
ENSG00000133216      1304        0

I see that these genes have all zero counts for TreatmentB and this explains the low logFC, but I don't get this when I don't use GLM at all (exactTest).

So in summary, my questions are:

1) Does the ordering of rows in the design matrix have to be the same as the column order in the count matrix, 
or the sample name is taken into account so the order should not matter?

2) Do you have any idea why I don't get pseudocounts after estimating the dispersion, and is it really this that causes such low logFC?

I am looking forward to hearing from you.

Many thanks in advance,
Ale


 -- output of sessionInfo(): 

R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] plyr_1.7.1      optparse_1.0.1  ggplot2_0.9.3.1 reshape2_1.2.1 
[5] edgeR_3.0.8     limma_3.14.4   

loaded via a namespace (and not attached):
 [1] MASS_7.3-17        RColorBrewer_1.0-5 colorspace_1.1-1   dichromat_1.2-4   
 [5] digest_0.5.2       getopt_1.19.1      grid_2.15.0        gtable_0.1.2      
 [9] labeling_0.1       munsell_0.3        proto_0.3-9.2      scales_0.2.3      
[13] stringr_0.6        tools_2.15.0      

--
Sent via the guest posting facility at bioconductor.org.



More information about the Bioconductor mailing list