[BioC] DEXSeq on two-exon genes: how to specify a formula without redundant terms

Fri May 24 17:08:35 CEST 2013

Great, thank you Alejandro - I will check out the code and let you know how it goes. 

In the mean time, I also found a small typo in the DEXSeqHTML function. The line 
  results[which(is.na(results$pvalue)), ][, c("pvalue", "padjust")] = 1
breaks with an error if there are no NA pvalues, and can be patched by changing to:
    results[which(is.na(results$pvalue)),c("pvalue", "padjust")] = 1 

While you are at it, it would also be useful if extraCols is indexed by geneIDs as rownames - that way we can send in extra info for as many 
genes in any order and it would be properly added to genetable for display in the DEXSeqHTML report. In detail, if you could change:
  genetable <- cbind(extraCols, genetable)
to 
  genetable <- cbind(extraCols[match(genetable$geneID, rownames(extraCols)),], genetable]) 
or something similar and change the documentation appropriatley, that would be quite helpful to users too!

Thanks, 
Mani

________________________________________
From: Alejandro Reyes [alejandro.reyes at embl.de]
Sent: Friday, May 24, 2013 8:19 AM
To: Narayanan, Manikandan (NIH/NIAID) [E]
Cc: bioconductor at r-project.org
Subject: Re: [BioC] DEXSeq on two-exon genes: how to specify a formula without redundant terms

Dear Manikandan Narayanan,

Sorry for the delayed reply!

Thanks for your e-mail! It has led us to spot two minor bugs and
therefore modified the DEXSeq code in two aspects:

Firstly, redundant terms were removed in DEXSeq in a very naive way:  a
lm.fit model was first fitted and the columns from the model matrix with
NA coefficients were removed from the model matrix, now a proper
function has been written for this purpose (with basically the code to
what you also proposed). Secondly, our "if" statement that you mentioned
was wrongly located within the function, which was causing your error
message. It has been relocated and now it seems to work for its purpose
( I tested with your model frame and formula).

You will find the changes in the last version of the svn (1.7.2). Let me
know if it works for you or if you have additional questions.

Best regards,
Alejandro Reyes

> Hi,
>    Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below?
>
>    Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks!
>
>> mf
>          sample exon sizeFactor condition   batch dispersion count
> 1 Untr_biorep1 E001  1.0517803      Untr biorep1 0.13523549    25
> 2 Untr_biorep1 E002  1.0517803      Untr biorep1 0.01656906   708
> 3  LPS_biorep1 E001  1.0010474       LPS biorep1 0.13523549    20
> 4  LPS_biorep1 E002  1.0010474       LPS biorep1 0.01656906   442
> 5 Untr_biorep2 E001  0.8920483      Untr biorep2 0.13523549    26
> 6 Untr_biorep2 E002  0.8920483      Untr biorep2 0.01656906   947
> 7  LPS_biorep2 E001  1.1090401       LPS biorep2 0.13523549    25
> 8  LPS_biorep2 E002  1.1090401       LPS biorep2 0.01656906   884
>
>> mm
>    (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2
> 1           1                 0                  1                  0
> 2           1                 0                  1                  0
> 3           1                 0                  0                  0
> 4           1                 0                  0                  0
> 5           1                 0                  0                  1
> 6           1                 0                  0                  1
> 7           1                 1                  0                  0
> 8           1                 1                  0                  0
>    conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002
> 1             1            0        0                      0
> 2             1            0        1                      1
> 3             0            0        0                      0
> 4             0            0        1                      0
> 5             1            1        0                      0
> 6             1            1        1                      1
> 7             0            1        0                      0
> 8             0            1        1                      0
>    batchbiorep2:exonE002
> 1                     0
> 2                     0
> 3                     0
> 4                     0
> 5                     0
> 6                     1
> 7                     0
> 8                     1
> attr(,"assign")
> [1] 0 1 1 1 2 3 4 5 6
> attr(,"contrasts")
> attr(,"contrasts")$sample
> [1] "contr.treatment"
>
> attr(,"contrasts")$condition
> [1] "contr.treatment"
>
> attr(,"contrasts")$batch
> [1] "contr.treatment"
>
> attr(,"contrasts")$exon
> [1] "contr.treatment"
>
>
>
>
>
>
>
> From: Narayanan, Manikandan (NIH/NIAID) [E]
> Sent: Thursday, May 16, 2013 11:14 AM
> To: bioconductor at r-project.org
> Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms
>
> Hi DEXSeq users/developers,
>    I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error.
>
>    I figure this is due to redundant terms in my formula as shown in PS below. So my questions are:
>
> 1) Is there a way to specify the formula  count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed?
>
> 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame:
>      if (nrow(mm) <= ncol(mm))
>          stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.")
>
> Would changing the code this way violate any assumptions of the DEXSeq model?
>
>
> Thank you,
> Mani
>
>
> PS: # condition + batch terms are redundant as sample term is already present!
>> formulaDispersion
> count ~ sample + (condition + batch) * exon
>
>> design(ecs)
>                   condition       batch
> Untr_biorep1      Untr     biorep1
> LPS_biorep1        LPS     biorep1
> Untr_biorep2      Untr     biorep2
> LPS_biorep2        LPS     biorep2
>
>> colnames(model.matrix(formulaDispersion, mf))
> [1] "(Intercept)"            "sampleLPS_biorep2"      "sampleUntr_biorep1"
> [4] "sampleUntr_biorep2"     "conditionUntr"          "batchbiorep2"
> [7] "exonE002"               "conditionUntr:exonE002" "batchbiorep2:exonE002"
>
>       [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor