[BioC] edgeR on differentially expressed genes with low read counts

Gordon K Smyth smyth at wehi.EDU.AU
Wed Nov 30 06:41:51 CET 2011


Dear Luca,

Also, I notice from the sessionInfo (thanks for including it) that you're 
using versions of R and edgeR from two Bioconductor releases ago, i.e., 
more than a year out of date.  You're using a very, very early version of 
the edgeR glm code.  It's important to install the current software, 
although it will not fix your main issue, which arises from another cause.

Best wishes
Gordon

On Wed, 30 Nov 2011, Gordon K Smyth wrote:

> Dear Luca,
>
> The edgeR function glmLRT() conducts a likelihood ratio test, a standard 
> procedure for generalized linear models.  It is virtually impossible to get 
> the p-value that you give from the counts that you give, regardless of the 
> dispersion estimation, although the p-value does depend on the total library 
> sizes, which I can't determine from your email.  So I would guess that there 
> is a code error somewhere in the analysis that you've used, perhaps the 
> different data tables have got out of sync during the analysis somehow.  I 
> can see that the design matrix doesn't seem to agree with the columns of the 
> data matrix.  For example, patient CT61 seems to be associated with columns 
> one, two and seven of your data, but is associated with libraries one, four 
> and six in your design matrix.  There may be other issues as well.
>
> As Naomi has mentioned, I would recommend that any gene with very low counts 
> be filtered from the data table before analysis, because such genes can never 
> legitimately be significantly differentially expressed.  For your experiment, 
> I would probably choose to keep genes with at least 1 count-per-million in at 
> least three libraries.  See the edgeR User's Guide for more information about 
> this.  However this doesn't seem to your main problem.
>
> If you are still having a problem and need further advice, please give 
> complete uninterrupted code that takes you from the original data table to 
> topTags results.
>
> Best wishes
> Gordon
>
> ------------------- original message ----------------------
>
> Luca [guest] guest at bioconductor.org
> Tue Nov 29 14:54:51 CET 2011
>
> Hi,
> I am using the package edgeR to determine differentially expressed genes in 
> RNA-seq data. I have 8 samples, corresponding to 3 patients (2 conditions per 
> patient and 2 patients duplicated). I performed the contrast of these two 
> conditions, following the user guide (section 11 Case study: Oral carcinomas 
> vs matched normal tissue). When analyzing some border line differentially 
> expressed genes (FDR ~ 0.02) I found that in some cases, the read counts was 
> really low, e.g. only one sample with 2 reads, and the others (7) with 0 
> counts.
>> 
> countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colnames(countsTable))]
>                61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli 
> 70_CT.tot 61m2_CT.tot 67m2_CT.tot
> ENSG00000207696          0         0          0         0          2 0 
> 0           0
>
> The design matrix used was:
>>  	design
>  patientsCT61 patientsCT67 patientsCT70 as.factor(1 - (as.numeric(groupsCT) 
> - 1))1
> 1            1            0            0 0
> 2            0            1            0 0
> 3            0            0            1 0
> 4            1            0            0 0
> 5            0            1            0 0
> 6            1            0            0 1
> 7            0            1            0 1
> 8            0            0            1 1
>
> The result for the gene in question was:
>> tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",]
>                table.logConc table.logFC table.PValue table.FDR 
> adjust.method                                 comparison
> ENSG00000207696        -18.34       7.726     0.005222   0.03941 BH 
> as.factor(1 - (as.numeric(groupsCT) - 1))1
>
> When investigating the result of the gene, I found that the glm-fitted values 
> looked like this:
>
>> 
> format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitted.values)=="ENSG00000207696",],scientific=FALSE,digits=4)
>   61_CT.tot    67_CT.tot    70_CT.tot  61m2_CT.tot  67m2_CT.tot 61_CT.poli 
> 67_CT.poli   70_CT.poli
> "0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959" "0.03479503" 
> "0.03522840" "1.84247667"
>
> Since I was not able to find the details used in edgeR to calculate the model 
> easily, I was wondering if from this glm-fitted values is it reasonable to 
> obtain such a low FDR?
> I used the common dispersion for the variance estimate:
> DGEensCT.CR<-estimateCRDisp(DGEensCT,design)
> glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion = 
> DGEensCT.CR$CR.common.dispersion)
> lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT)
>
> Could someone help me with this?
> Thank you very much in advanced!
>
> Cheers,
> Luca
>
> -- output of sessionInfo():
>
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_2.6.0 edgeR_2.0.5
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.4-3 XML_3.2-0   limma_3.6.9
>
> --
> Sent via the guest posting facility at bioconductor.org.
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list