[BioC] edgeR on differentially expressed genes with low read counts

Gordon K Smyth smyth at wehi.EDU.AU
Wed Nov 30 00:36:35 CET 2011

Dear Luca,

The edgeR function glmLRT() conducts a likelihood ratio test, a standard 
procedure for generalized linear models.  It is virtually impossible to 
get the p-value that you give from the counts that you give, regardless of 
the dispersion estimation, although the p-value does depend on the total 
library sizes, which I can't determine from your email.  So I would guess 
that there is a code error somewhere in the analysis that you've used, 
perhaps the different data tables have got out of sync during the analysis 
somehow.  I can see that the design matrix doesn't seem to agree with the 
columns of the data matrix.  For example, patient CT61 seems to be 
associated with columns one, two and seven of your data, but is associated 
with libraries one, four and six in your design matrix.  There may be 
other issues as well.

As Naomi has mentioned, I would recommend that any gene with very low 
counts be filtered from the data table before analysis, because such genes 
can never legitimately be significantly differentially expressed.  For 
your experiment, I would probably choose to keep genes with at least 1 
count-per-million in at least three libraries.  See the edgeR User's Guide 
for more information about this.  However this doesn't seem to your main 

If you are still having a problem and need further advice, please give 
complete uninterrupted code that takes you from the original data table to 
topTags results.

Best wishes

------------------- original message ----------------------

Luca [guest] guest at bioconductor.org
Tue Nov 29 14:54:51 CET 2011

I am using the package edgeR to determine differentially expressed genes 
in RNA-seq data. I have 8 samples, corresponding to 3 patients (2 
conditions per patient and 2 patients duplicated). I performed the 
contrast of these two conditions, following the user guide (section 11 
Case study: Oral carcinomas vs matched normal tissue). When analyzing some 
border line differentially expressed genes (FDR ~ 0.02) I found that in 
some cases, the read counts was really low, e.g. only one sample with 2 
reads, and the others (7) with 0 counts.
                 61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli 
70_CT.tot 61m2_CT.tot 67m2_CT.tot
ENSG00000207696          0         0          0         0          2 
0           0           0

The design matrix used was:
>  	design
   patientsCT61 patientsCT67 patientsCT70 as.factor(1 - 
(as.numeric(groupsCT) - 1))1
1            1            0            0 
2            0            1            0 
3            0            0            1 
4            1            0            0 
5            0            1            0 
6            1            0            0 
7            0            1            0 
8            0            0            1 

The result for the gene in question was:
> tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",]
                 table.logConc table.logFC table.PValue table.FDR 
adjust.method                                 comparison
ENSG00000207696        -18.34       7.726     0.005222   0.03941 
BH as.factor(1 - (as.numeric(groupsCT) - 1))1

When investigating the result of the gene, I found that the glm-fitted 
values looked like this:

    61_CT.tot    67_CT.tot    70_CT.tot  61m2_CT.tot  67m2_CT.tot 
61_CT.poli   67_CT.poli   70_CT.poli
"0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959" 
"0.03479503" "0.03522840" "1.84247667"

Since I was not able to find the details used in edgeR to calculate the 
model easily, I was wondering if from this glm-fitted values is it 
reasonable to obtain such a low FDR?
I used the common dispersion for the variance estimate:
glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion = 
lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT)

Could someone help me with this?
Thank you very much in advanced!


  -- output of sessionInfo():

R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.6.0 edgeR_2.0.5

loaded via a namespace (and not attached):
[1] RCurl_1.4-3 XML_3.2-0   limma_3.6.9

Sent via the guest posting facility at bioconductor.org.

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list