[BioC] EdgeR - problems running estimateCRDisp

Gordon K Smyth smyth at wehi.EDU.AU
Sun Nov 28 06:34:19 CET 2010


Dear Josquin,

We have committed a major revision of edgeR, version 2.0.0, to 
Bioconductor that runs successfully on all your data without errors.

Previously we simply relied on the glm code in the stats package of R to 
fit the negative binomial distribution to one gene at a time.  Now we have 
written new code designed to fit thousands of genes simultaneously.  The 
new code implements a line search to prevent divergence, hence it now 
works on all your data.  An added bonus is that the new code is a couple 
of orders of magnitude faster than the old.

Best wishes
Gordon

On Fri, 5 Nov 2010, Gordon K Smyth wrote:

> Dear Josquin,
>
> Thanks for providing some data.  We've confirmed the problem is that the glm 
> functions provided with R fail to converge, when used to fit a negative 
> binomial glm, to some of the rows in your data, especially rows with very 
> large counts.  The problem affects glmFit and glmLRT, as well as 
> estimateCRDisp, because they all make calls to glm.fit.
>
> We're writing some new glm code, implementing convergence checking etc, that 
> should solve the problem.  This will be incorporated into the edgeR package 
> during the next week.
>
> Regards
> Gordon
>
> On Sun, 24 Oct 2010, Gordon K Smyth wrote:
>
>> Dear Josquin,
>> 
>> estimateCDDisp() has worked on all the datasets we have tested it out on so 
>> far (although that's admitedly a very limited number), so we are not able 
>> to reproduce your problems.  The function is liable to convergence issues 
>> however because it needs to fit a very large number of generalized linear 
>> models, one or more of which could conceivably fail to converge. That might 
>> be the problem with your data.  I don't see any problems with the code 
>> example that you give, you seem to be using the functions correctly, so the 
>> issue must be with the count data itself.  Would you be willing to share 
>> your data with us offline so that we diagnose and perhaps solve the 
>> problem?
>> 
>> To see estimateCDDisp() working as intended, you can run the example:
>>
>>  library(edgeR)
>>  example(glmFit)
>>  CR <- estimateCRDisp(d, design)
>>  fit <- glmFit(d, design, dispersion=CR$CR.common.dispersion)
>>  results <- glmLRT(d, fit, coef=2)
>>  topTags(results)
>> 
>> Best wishes
>> Gordon
>> 
>> 
>> ----- ORIGINAL MESSAGE -------------
>> [BioC] EdgeR - problems running estimateCRDisp
>> josquin.tibbits at dpi.vic.gov.au josquin.tibbits at dpi.vic.gov.au
>> Fri Oct 22 07:44:00 CEST 2010
>> 
>> I am running the new EdgeR package 1.8 and have updated R to 1.12.0 and
>> updated all the packages I have installed.
>> 
>> I have an RNAseq experiment with 24 samples and am wanting to run a glm
>> analysis using the Cox-Reid common dispersion (and tagwise) paramaters.  I
>> have created a DGEList object using a targets file and this successfully
>> ran in the calcNormFactors() and in the plotMDS.dge() functions. I have
>> also created what appears to be a sensible and ok design matrix using
>> model.matrix().  These then are being input to estimateCRDisp() to
>> estimate the common dispersion.  The code I have run is as follows (in
>> red) with output (blue) which is pretty much exactly the same as the
>> worked example in the manual on pp 51-- and does not seem abnormal at
>> all......
>> 
>> targets <- read.delim("Lp_NUE_DGEbymSEQ_TARGETS_EdgeRIN.txt",
>> stringsAsFactors = FALSE, row.names = "label")
>> targets$Nitrogen        <- factor(targets$Nitrogen)
>> targets$Endophyte       <- factor(targets$Endophyte)
>> targets$Tissue <- factor(targets$Tissue)
>> targets
>> 
>> ## CREATE THE DGEList Object
>> 
>> DGEList.object <- readDGE(targets, skip = 1, comment.char = '#')
>> colnames(DGEList.object) <- row.names(targets)
>> 
>> 
>> 
>> ## CALCULATE THE NORMALISATION (f) FACTORS
>> 
>> DGEList.object <- calcNormFactors(DGEList.object)
>> 
>> ## Exclude data where rowsums are < 30 YOU CAN CHANGE THIS (there being 24
>> samples in the experiment)
>> 
>> DGEList.object <- DGEList.object[rowSums(DGEList.object$counts) > 30, ]
>> DGEList.object
>> 
>> An object of class "DGEList"
>> $samples
>>                                       files Endophyte Tissue Nitrogen
>> lib.size norm.factors
>> Lp_NUE_LFFULL  Lp_NUE_LFFULL_ER_RAWREADS.txt      Free   Leaf     Full
>> 3786614    0.9244799
>> Lp_NUE_LFNO3    Lp_NUE_LFNO3_ER_RAWREADS.txt      Free   Leaf  Partial
>> 5311856    0.8120410
>> Lp_NUE_LFK        Lp_NUE_LFK_ER_RAWREADS.txt      Free   Leaf     Full
>> 3482772    0.7224960
>> Lp_NUE_LFPO4    Lp_NUE_LFPO4_ER_RAWREADS.txt      Free   Leaf     Full
>> 6208397    0.7367309
>> Lp_NUE_LFCa      Lp_NUE_LFCa_ER_RAWREADS.txt      Free   Leaf     Full
>> 3597004    0.8147619
>> 19 more rows ...
>> 
>> $counts
>>             Lp_NUE_LFFULL  Lp_NUE_LFNO3  Lp_NUE_LFK  Lp_NUE_LFPO4
>> Lp_NUE_LFCa Lp_NUE_LFNH4  Lp_NUE_RFFULL   Lp_NUE_RFNO3  Lp_NUE_RFK
>> Lp_NUE_RFPO4
>>> prg_cds_44             345           382         120           261  141
>>     215             888           717         698         1112
>>> prg_cds_141            290           451         126           422  235
>>     459             676           559         474          879
>>> prg_cds_2              233           277         131           333  239
>>     376             246           185         124          282
>>> prg_cds_16             433           608         184           614  346
>>     423             924           756         595          913
>>> prg_cds_84             287           353          90           241  248
>>     394             823           612         989         1061
>>             Lp_NUE_RFCa  Lp_NUE_RFNH4  Lp_NUE_LSFULL  Lp_NUE_LSNO3
>> Lp_NUE_LSK Lp_NUE_LSPO4 Lp_NUE_LSCa  Lp_NUE_LSNH4  Lp_NUE_RSFULL
>> Lp_NUE_RSNO3
>>> prg_cds_44          1107           439            954           257  205
>>       107          868           106             316         1147
>>> prg_cds_141          583          1052            773           325  230
>>       193          600           264             414         1046
>>> prg_cds_2            235           454            177           232  177
>>       203          138           239             334          228
>>> prg_cds_16           817          1040           2373           464  220
>>       276          895           291             714         2655
>>> prg_cds_84           890          1653            894           268  169
>>       123          714           209             311         1078
>>             Lp_NUE_RSK  Lp_NUE_RSPO4  Lp_NUE_RSCa  Lp_NUE_RSNH4
>>> prg_cds_44          622           903          114          474
>>> prg_cds_141         505           711          205         1166
>>> prg_cds_2            93           136          157          340
>>> prg_cds_16         1398          1660          311         1258
>>> prg_cds_84          893           771          154         1269
>> 63905 more rows ...
>> 
>> 
>> ## Produce an MDS plot of the data
>> 
>> #plotMDS.dge(DGEList.object, main = "MDS plot for Lp_NUE_DGEbymSEQ")
>> 
>> 
>> ## DEFINE THE DESIGN MATRIX
>> 
>> design <- model.matrix(~ Endophyte + Tissue + Nitrogen , data = targets)
>> print("This is the design matrix")
>> design
>>
>>                (Intercept) EndophyteSTWT TissueRoot NitrogenPartial
>> Lp_NUE_LFFULL             1             0          0               0
>> Lp_NUE_LFNO3              1             0          0               1
>> Lp_NUE_LFK                1             0          0               0
>> Lp_NUE_LFPO4              1             0          0               0
>> Lp_NUE_LFCa               1             0          0               0
>> Lp_NUE_LFNH4              1             0          0               1
>> Lp_NUE_RFFULL             1             0          1               0
>> Lp_NUE_RFNO3              1             0          1               1
>> Lp_NUE_RFK                1             0          1               0
>> Lp_NUE_RFPO4              1             0          1               0
>> Lp_NUE_RFCa               1             0          1               0
>> Lp_NUE_RFNH4              1             0          1               1
>> Lp_NUE_LSFULL             1             1          0               0
>> Lp_NUE_LSNO3              1             1          0               1
>> Lp_NUE_LSK                1             1          0               0
>> Lp_NUE_LSPO4              1             1          0               0
>> Lp_NUE_LSCa               1             1          0               0
>> Lp_NUE_LSNH4              1             1          0               1
>> Lp_NUE_RSFULL             1             1          1               0
>> Lp_NUE_RSNO3              1             1          1               1
>> Lp_NUE_RSK                1             1          1               0
>> Lp_NUE_RSPO4              1             1          1               0
>> Lp_NUE_RSCa               1             1          1               0
>> Lp_NUE_RSNH4              1             1          1               1
>> attr(,"assign")
>> [1] 0 1 2 3
>> attr(,"contrasts")
>> attr(,"contrasts")$Endophyte
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$Tissue
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$Nitrogen
>> [1] "contr.treatment"
>> 
>> 
>> ## ESTIMATE THE Cox-Reid common dispersion
>> 
>> DGEList.object <- estimateCRDisp(DGEList.object, design)
>> names(DGEList.object)
>> 
>> 
>> This step is giving  the following error:::
>> 
>> 
>> Loading required package: MASS
>> Error: NA/NaN/Inf in foreign function call (arg 1)
>> In addition: Warning messages:
>> 1: glm.fit: algorithm did not converge
>> 2: step size truncated due to divergence
>>> names(DGEList.object)
>> [1] "samples" "counts"
>> 
>> Clearly the function is not working. I have tried two different computers
>> and a run it in 64 bit but always with the same result. I have also tried
>> using a balanced design and a simple design all with the same results. I
>> have used the DGEList object to analyse the data using the original
>> estimateCommonDisp() and exactTest() which worked fine.
>> 
>> 
>> I was hoping someone might be able to help resolve this for me????
>> 
>> 
>> Thanks in advance....................... Josquin
>> 
>> The sessionInfo() and traceback() are below:
>> 
>> 
>>> sessionInfo()
>> R version 2.12.0 (2010-10-15)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>> 
>> locale:
>> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
>> LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_Australia.1252
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] MASS_7.3-8  edgeR_1.8.1
>> 
>> loaded via a namespace (and not attached):
>> [1] limma_3.6.0
>> 
>> 
>>> traceback()
>> 4: .Fortran("dqrls", qr = x[good, ] * w, n = ngoodobs, p = nvars,
>>       y = w * z, ny = 1L, tol = min(1e-07, control$epsilon/1000),
>>       coefficients = double(nvars), residuals = double(ngoodobs),
>>       effects = double(ngoodobs), rank = integer(1L), pivot = 1L:nvars,
>>       qraux = double(nvars), work = double(2 * nvars), PACKAGE = "base")
>> 3: glm.fit(design, y[i, ], offset = offset[i, ], family = f)
>> 2: adjustedProfileLik(spline.disp[i], y.select, design = design,
>>       offset = offset.mat.select + lib.size.mat.select)
>> 1: estimateCRDisp(DGEList.object, design)
>> Notice:
>> This email and any attachments may contain information t...{{dropped:14}}

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list