[BioC] questions on edgeR package

Gordon K Smyth smyth at wehi.EDU.AU
Wed Jan 29 05:11:54 CET 2014


Dear Ming,

Thanks for the further information.

It is obvious that numbers like 4.67 are not raw integer counts.  I 
suspect that they are posterior expected counts from RSEM.  The column 
heading "raw_count" does seem rather misleading.

If you want to analyse these expected counts from RSEM, then limma-voom 
would be a far preferable pipeline than edgeR (or DESeq or any other 
negative binomial based package).

I understand that the RSEM authors did claim that their expected counts 
could be input to edgeR, but edgeR is designed to accept integers, and 
your results appear to confirm some of the problems that arise.  The 
problems cannot be solved by changing the filtering.

The raw FastQ files for the TCGA samples should be publicly available from 
GEO and SRA.  My lab has downloaded similar TCGA data as SRA files and 
converted them to FastQ.

Best wishes
Gordon


On Wed, 29 Jan 2014, Ming Yi wrote:

> Dear Dr Smyth:
>
> Thanks for the quick response. The data is in fact TCGA RNAseq data, one of the files looks like below:
> file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.genes.results"
>
> its content looks below:
>      gene_id raw_count scaled_estimate         transcript_id
> 1 ?|100130426      0.00    0.000000e+00            uc011lsn.1
> 2 ?|100133144      4.67    1.794813e-07 uc010unu.1,uc010uoa.1
> 3 ?|100134869     15.33    4.271899e-07 uc002bgz.2,uc002bic.2
> 4     ?|10357    218.79    1.933490e-05            uc010zzl.1
> 5     ?|10431   1255.00    5.033411e-05 uc001jiu.2,uc010qhg.1
> 6    ?|136542      0.00    0.000000e+00            uc011krn.1
> ......
> through contacting with TCGA staff, we know the column "raw_count" is the raw counts of reads after running through RSEM and the normalized data in a different file (I used the raw counts not normalized data for edgeR). The basic description of the data as below (at URL: https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/luad/cgcc/unc.edu/illuminahiseq_rnaseqv2/rnaseqv2/unc.edu_LUAD.IlluminaHiSeq_RNASeqV2.mage-tab.1.13.0/DESCRIPTION.txt)
> step 4 of: V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1
> ...The reads aligned to the reference genome sequences are translated to transcriptome coordinates
> prior to transcript level quantification.  This step is performed using the UNC Bioinformatics Utilities (UBU)
> software (https://github.com/mozack/ubu). RSEM is used to estimate gene and transcript abundances
> (http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html).  ....
>
> Unfortunately, we do not have access to the raw RNA-seq data nor bam files etc. The authors of these data files also suggest to use the raw counts as mentioned above for analysis using edgeR or DESeq. Unless there are something wrong with their procedure, we have no control for that.
>
> Assuming the data is OK, any possibility other steps in edgeR may cause this? Such as filtering step as I asked:
> #filter low expressed genes
>> y<-y[rowSums(y$counts)>=50,]
> this step using arbitraay cutoff 50 has filtered out quite a few genes (rows) with sum of rows in counts of reads less than 50.
>
> or other steps?
>
> Any idea?
>
> Thanks again!
>
> Best
>
> Ming
>
>
>> Date: Wed, 29 Jan 2014 11:05:48 +1100
>> From: smyth at wehi.EDU.AU
>> To: yi02 at hotmail.com
>> CC: bioconductor at r-project.org
>> Subject: Re: questions on edgeR package
>>
>> Dear Ming,
>>
>> Something is seriously wrong -- you shouldn't get these warnings, you
>> shouldn't get such a large dispersion estimate and, if you do, you
>> shouldn't get such small p-values.
>>
>> I suspect that the culprit is RSEM.  edgeR is designed to accept raw read
>> counts rather than estimated abundance levels as might be output from RSEM
>> or Cufflinks.
>>
>> There are a number of tools that produce genewise read counts from RNA-seq
>> data. My lab routinely uses subread and featureCounts:
>>
>>   http://www.ncbi.nlm.nih.gov/pubmed/24227677
>>
>> which are available from the Bioconductor package Rsubread.
>>
>> Best wishes
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Tue, 28 Jan 2014, Ming Yi wrote:
>>
>>>
>>> Hi,
>>>
>>> I have a few questions on edgeR package. I used it a while ago.
>>> Recently, I started to use it again for an ongoing project and realized
>>> that it has been updated a lot with many new feaures.
>>>
>>> The data I am working on has 29 pairs of tumor and matched normal tissue
>>> samples from the same patients that have been run through RNAseq, the
>>> data has been processed by the popular RSEM method to be raw counts for
>>> each sample (tumor or macthed normal tissue of each patients). So I have
>>> a matrix of raw counts for 58 columns of data for all genes. we are
>>> looking for differential expressed genes between tumors and normals.
>>> Here are some main commands I used, eliminating some basic commands for
>>> simplicity:
>>>
>>>> show(head(targets));
>>>  SampleName      Subject   Type
>>> 1  T4626_01A 38_4626  Tumor
>>> 2  N4626_11A 38_4626 Normal
>>> 3  T2668_01A 44_2668  Tumor
>>> 4  N2668_11A 44_2668 Normal
>>> ......
>>>> Patient<-factor(targets$Subject);
>>>> Type<-factor(targets$Type);
>>> ......
>>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>>>> show(dim(y));
>>> [1] 20531    58
>>>
>>>> #filter low expressed genes
>>>> y<-y[rowSums(y$counts)>=50,]
>>>> #Recompute the library sizes:
>>>> y$samples$lib.size <- colSums(y$counts)
>>>> #Use Entrez Gene IDs as row names:
>>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>>>> ##Apply TMM normalization
>>>> y <- calcNormFactors(y,method=c("TMM"));
>>>> show(y$samples);
>>>          group lib.size norm.factors
>>> T4626_01A     1 97435132    0.9834587
>>> N4626_11A     1 62583672    0.9154837
>>> T2668_01A     1 77034746    0.9687860
>>> N2668_11A     1 58631311    0.8644352
>>> ......
>>>> design<-model.matrix(~Patient+Type);
>>>> rownames(design) <- colnames(y);
>>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
>>> Disp = 3.99994 , BCV = 2
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> show(warnings());
>>> Warning messages:
>>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>>  non-integer x = 4.520000
>>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>>  non-integer x = 5.380000
>>> ....
>>>> y <- estimateGLMTrendedDisp(y, design);
>>> Loading required package: splines
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> y <- estimateGLMTagwiseDisp(y, design)
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> fit <- glmFit(y, design);
>>>> lrt <- glmLRT(fit);
>>>> show(topTags(lrt));
>>> Coefficient:  TypeTumor
>>>        symbols entrezID        gene_id     logFC    logCPM       LR PValue FDR
>>> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0   0
>>> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0   0
>>> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0   0
>>> .......
>>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>>>> cpmRslt<-cpm(y)
>>>> cpmRslt<-cpmRslt[rownames(Rslt),];
>>>> show(cpmRslt[1:10,1:6]);
>>>          T4626_01A    N4626_11A    T2668_01A    N2668_11A   T6145_01A
>>> 6532      2.0558647 2.140002e+02   0.02679881 1.568574e+01   0.2147611
>>> 1301      1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
>>> 5047     13.4413898 1.221761e-01   7.62426085 3.946099e-02  60.7774032
>>> .....
>>>> show(Rslt[1:10,]);
>>> Coefficient:  TypeTumor
>>>        symbols entrezID        gene_id     logFC    logCPM       LR PValue FDR
>>> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0   0
>>> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0   0
>>> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0   0
>>> .....
>>>> show(summary(de <- decideTestsDGE(lrt)));
>>>   [,1]
>>> -1 6179
>>> 0  4832
>>> 1  7204
>>>
>>> My questions are below:
>>>
>>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower expressed using 50 as cutoff here,kind of arbitrary.  is there any better way to assess the cutoff value? can any one suggest a good cutoff based on his or her experience in such setting?
>>>
>>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why we got warning here (see warning message as above inside my commands), any issue on the warnings?
>>>
>>> 3. The top gene lists shown in both cpmRslt and  Rslt are quite consistent, for example, gene 6532 always have tumor samples consistently much lower in cpm counts compared to that of matched normal samples, which is consistent with the logFC result in Rslt. So from the top genes, the result looks very promising. However, when I look into the full table of Rslt,  there are so many genes appear to be differential expressed (see result of summary(de <- decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in the dataset) in this table with FDR <0.05, which make me a bit nervous considering the issues as I asked in last two questions. Is this what was usually seen in tumor vs normal comparison. This is a paired test of course, which may increase the power and may see more differntial genes. what if I try pooled tumors vs pooled normals disregard of matched tumor vs normal sample pairs? which one shall be better for RNA-seq data, paired test vs pooled test?
>>>
>>> 4. For pooled test, I may use exactTest(), also based on the guide, there is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm model. anything like that worthy of trying? Any advice?
>>>
>>> 5. I did MDS plot, which looks nice in that the tumor and normal are
>>> separated very well in one side vs other. In the plotSmear plot, I did
>>> see some red spots inside the black spots zones in the center zone
>>> (supposed to be lower logFC), why they were called as significant? May
>>> this explain why we get so many DEGs >13k of them? Or large variations
>>> across samples?
>>>
>>> Thanks so much in advance for your help!
>>>
>>> Mike
>>> NIH
>>> Maryland, USA.
>>>
>>>> show(sessionInfo());
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C                 LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> attached base packages:
>>> [1] splines   stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>> other attached packages:
>>> [1] edgeR_3.4.2  limma_3.16.8

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list