[BioC] questions on edgeR package

Gordon K Smyth smyth at wehi.EDU.AU
Thu Feb 6 06:39:50 CET 2014


Dear Ming,

I have said this in a separate thread, but thought I should summarize it 
for the original thread.

After some investigation, it is clear that edgeR's problem with the TCGA 
RSEM "raw counts" is that they are not integers.  edgeR is a likelihood- 
based package and it evaluates the negative binomial likelihood, which is 
always exactly zero at non-integer values.  Hence edgeR cannot estimate 
the dispersion sensibly if any of the input counts are fractional.

If you round the RSEM counts to integers, then edgeR will run ok.

I have made some changes to the edgeR GLM code on Bioc-devel so that it 
does this rounding automatically.  (The edgeR classic code was doing this 
anyway.)  Note that this work-around is only sensible for input values 
that are estimates of counts.

Best wishes
Gordon

On Wed, 29 Jan 2014, Gordon K Smyth wrote:

> Dear Ming,
>
> Something is seriously wrong -- you shouldn't get these warnings, you 
> shouldn't get such a large dispersion estimate and, if you do, you shouldn't 
> get such small p-values.
>
> I suspect that the culprit is RSEM.  edgeR is designed to accept raw read 
> counts rather than estimated abundance levels as might be output from RSEM or 
> Cufflinks.
>
> There are a number of tools that produce genewise read counts from RNA-seq 
> data. My lab routinely uses subread and featureCounts:
>
> http://www.ncbi.nlm.nih.gov/pubmed/24227677
>
> which are available from the Bioconductor package Rsubread.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.statsci.org/smyth
>
> On Tue, 28 Jan 2014, Ming Yi wrote:
>
>> 
>> Hi,
>> 
>> I have a few questions on edgeR package. I used it a while ago. Recently, I 
>> started to use it again for an ongoing project and realized that it has 
>> been updated a lot with many new feaures.
>> 
>> The data I am working on has 29 pairs of tumor and matched normal tissue 
>> samples from the same patients that have been run through RNAseq, the data 
>> has been processed by the popular RSEM method to be raw counts for each 
>> sample (tumor or macthed normal tissue of each patients). So I have a 
>> matrix of raw counts for 58 columns of data for all genes. we are looking 
>> for differential expressed genes between tumors and normals. Here are some 
>> main commands I used, eliminating some basic commands for simplicity:
>> 
>>> show(head(targets));
>>  SampleName      Subject   Type
>> 1  T4626_01A 38_4626  Tumor
>> 2  N4626_11A 38_4626 Normal
>> 3  T2668_01A 44_2668  Tumor
>> 4  N2668_11A 44_2668 Normal
>> ......
>>> Patient<-factor(targets$Subject);
>>> Type<-factor(targets$Type);
>> ......
>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>>> show(dim(y));
>> [1] 20531    58
>> 
>>> #filter low expressed genes
>>> y<-y[rowSums(y$counts)>=50,]
>>> #Recompute the library sizes:
>>> y$samples$lib.size <- colSums(y$counts)
>>> #Use Entrez Gene IDs as row names:
>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>>> ##Apply TMM normalization
>>> y <- calcNormFactors(y,method=c("TMM"));
>>> show(y$samples);
>>          group lib.size norm.factors
>> T4626_01A     1 97435132    0.9834587
>> N4626_11A     1 62583672    0.9154837
>> T2668_01A     1 77034746    0.9687860
>> N2668_11A     1 58631311    0.8644352
>> ......
>>> design<-model.matrix(~Patient+Type);
>>> rownames(design) <- colnames(y);
>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
>> Disp = 3.99994 , BCV = 2
>> There were 50 or more warnings (use warnings() to see the first 50)
>>> show(warnings());
>> Warning messages:
>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>  non-integer x = 4.520000
>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>  non-integer x = 5.380000
>> ....
>>> y <- estimateGLMTrendedDisp(y, design);
>> Loading required package: splines
>> There were 50 or more warnings (use warnings() to see the first 50)
>>> y <- estimateGLMTagwiseDisp(y, design)
>> There were 50 or more warnings (use warnings() to see the first 50)
>>> fit <- glmFit(y, design);
>>> lrt <- glmLRT(fit);
>>> show(topTags(lrt));
>> Coefficient:  TypeTumor
>>        symbols entrezID        gene_id     logFC    logCPM       LR PValue 
>> FDR
>> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0 
>> 0
>> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0 
>> 0
>> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0 
>> 0
>> .......
>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>>> cpmRslt<-cpm(y)
>>> cpmRslt<-cpmRslt[rownames(Rslt),];
>>> show(cpmRslt[1:10,1:6]);
>>          T4626_01A    N4626_11A    T2668_01A    N2668_11A   T6145_01A
>> 6532      2.0558647 2.140002e+02   0.02679881 1.568574e+01   0.2147611
>> 1301      1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
>> 5047     13.4413898 1.221761e-01   7.62426085 3.946099e-02  60.7774032
>> .....
>>> show(Rslt[1:10,]);
>> Coefficient:  TypeTumor
>>        symbols entrezID        gene_id     logFC    logCPM       LR PValue 
>> FDR
>> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0 
>> 0
>> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0 
>> 0
>> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0 
>> 0
>> .....
>>> show(summary(de <- decideTestsDGE(lrt)));
>>   [,1]
>> -1 6179
>> 0  4832
>> 1  7204
>> 
>> My questions are below:
>> 
>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower 
>> expressed using 50 as cutoff here,kind of arbitrary.  is there any better 
>> way to assess the cutoff value? can any one suggest a good cutoff based on 
>> his or her experience in such setting?
>> 
>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives 
>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why 
>> we got warning here (see warning message as above inside my commands), any 
>> issue on the warnings?
>> 
>> 3. The top gene lists shown in both cpmRslt and  Rslt are quite consistent, 
>> for example, gene 6532 always have tumor samples consistently much lower in 
>> cpm counts compared to that of matched normal samples, which is consistent 
>> with the logFC result in Rslt. So from the top genes, the result looks very 
>> promising. However, when I look into the full table of Rslt,  there are so 
>> many genes appear to be differential expressed (see result of summary(de <- 
>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in 
>> the dataset) in this table with FDR <0.05, which make me a bit nervous 
>> considering the issues as I asked in last two questions. Is this what was 
>> usually seen in tumor vs normal comparison. This is a paired test of 
>> course, which may increase the power and may see more differntial genes. 
>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs 
>> normal sample pairs? which one shall be better for RNA-seq data, paired 
>> test vs pooled test?
>> 
>> 4. For pooled test, I may use exactTest(), also based on the guide, there 
>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm 
>> model. anything like that worthy of trying? Any advice?
>> 
>> 5. I did MDS plot, which looks nice in that the tumor and normal are 
>> separated very well in one side vs other. In the plotSmear plot, I did see 
>> some red spots inside the black spots zones in the center zone (supposed to 
>> be lower logFC), why they were called as significant? May this explain why 
>> we get so many DEGs >13k of them? Or large variations across samples?
>> 
>> Thanks so much in advance for your help!
>> 
>> Mike
>> NIH
>> Maryland, USA.
>> 
>>> show(sessionInfo());
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C                 LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> attached base packages:
>> [1] splines   stats     graphics  grDevices utils     datasets  methods
>> [8] base
>> other attached packages:
>> [1] edgeR_3.4.2  limma_3.16.8
>> 
>> 
>> 
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list