[BioC] DEXSeq continous variable in model

Mon Aug 4 09:52:21 CEST 2014

Hi Alejandro,

Just to confirm that your testing was right. I'm not sure what I did 
wrong the first time I ran the continuous model, but when I reran it 
over the weekend it worked much better. Apologies for the erroneous bug 
report! Nice to see my instinct was right and the continuous model gives 
much more power (see dxr.age.padj vs dxr1 results below - E27/E28 show 
DEU), this is a really cool result for us, so thank you!

It would still be nice to get a log2fold column for the continuous model 
though, so I will keep an eye out for the next version.

> subset(dxr.age.padj, groupID == "ENSG00000169432" & padj < 0.1)

LRT p-value: full vs reduced

DataFrame with 3 rows and 10 columns
                              groupID   featureID exonBaseMean dispersion
                          <character> <character>    <numeric>  <numeric>
ENSG00000169432:E027 ENSG00000169432        E027    169.71429 0.03685301
ENSG00000169432:E028 ENSG00000169432        E028    161.35714 0.03184911
ENSG00000169432:E019 ENSG00000169432        E019     64.89286 0.03007973
                           stat       pvalue         padj
                      <numeric>    <numeric>    <numeric>
ENSG00000169432:E027  54.62674 1.457392e-13 2.834817e-10
ENSG00000169432:E028  36.52308 1.508693e-09 6.930432e-07
ENSG00000169432:E019   8.76212 3.075513e-03 4.810069e-02
                                     genomicData       countData 
transcripts
                                       <GRanges>        <matrix>      
<list>
ENSG00000169432:E027 2:-:[167160541, 167160632] 411 494 432 ...    
########
ENSG00000169432:E028 2:-:[167160748, 167160839] 144 222 126 ...    
########
ENSG00000169432:E019 2:-:[167140963, 167140995]   86 151 96 ...    
########
> subset(dxr1, groupID == "ENSG00000169432" & padj < 0.1)

LRT p-value: full vs reduced

DataFrame with 2 rows and 13 columns
                              groupID   featureID exonBaseMean dispersion
                          <character> <character>    <numeric>  <numeric>
ENSG00000169432:E027 ENSG00000169432        E027     169.7143 0.06699247
ENSG00000169432:E028 ENSG00000169432        E028     161.3571 0.04371863
                           stat       pvalue        padj X.1.99.8.   
X.8.14.
                      <numeric>    <numeric>   <numeric> <numeric> 
<numeric>
ENSG00000169432:E027  21.16153 4.221541e-06 0.001944887  18.08856  
22.48066
ENSG00000169432:E028  20.71878 5.319188e-06 0.002257030  22.56749  
18.23336
                      log2fold_.1.99.8._.8.14.                genomicData
                                     <numeric>                  <GRanges>
ENSG00000169432:E027               -0.3136070 2:-:[167160541, 167160632]
ENSG00000169432:E028                0.3076653 2:-:[167160748, 167160839]
                            countData transcripts
                             <matrix>      <list>
ENSG00000169432:E027 411 494 432 ...    ########
ENSG00000169432:E028 144 222 126 ...    ########

On 03-08-2014 18:39, Alejandro Reyes wrote:
> Hi Alex,
> 
> Yes, I think that the testing using DEXSeq should also work with the
> continuous model!
> 
> I had some discussions within the lab and there was a misunderstanding
> from my side with regards to the fold changes with continuous
> variables. Wolfgang (cc-ed) clarified to me that if f(x) is the
> expression level for condition x, the log fold change is then the
> derivate of f(x).  As you mention this is how DESeq2 deals with it,
> but the current implementation for DEXSeq does not deal with this at
> the moment.  I will work on this and keep you posted, in the meantime
> the breakdown that you mention is probably the best!
> 
> Alejandro
> 
>> Hi Alejandro,
>> 
>> Thanks for the reply! For your first question I will have to get back 
>> to you as I need to rerun DEXSeq with the continuous design and make 
>> that comparison - it'll take a few hours. I assume from your reply 
>> that the continuous variable *should* work in fact?
>> 
>> For your second question I guess what I was expecting was to see the 
>> fold change implied from the the model from a unit change in Age (in 
>> this case 1 week). This is for instance how DESeq2 reports it. But I 
>> agree when it comes to actually plotting then a factorial breakdown is 
>> probably simplest.
>> 
>> Alex Gutteridge
>> 
>> On 01-08-2014 10:29, Alejandro Reyes wrote:
>>> Dear Alex Gutteridge,
>>> 
>>> Thanks for your detailed e-mails and sorry for the delay in the 
>>> reply.
>>> 
>>> About the fact that you do get significant hits if you specify 
>>> factors
>>> instead of your continuous variable, Is it a thresholding effect or
>>> the exons that you detect when specifying the factorial design have
>>> large p-values with the continuous design?
>>> 
>>> About the exon fold changes error, this function does not support
>>> continuous variables.  I am unsure how would a fold change be with a
>>> continuous variable? I would maybe partition the "Age" into more than
>>> 2 factors based on e.g. quantiles or simply make plots like the one
>>> you showed!
>>> 
>>> Bests,
>>> Alejandro Reyes
>>> 
>>> 
>>> 
>>>> On 30-07-2014 10:13, Alex Gutteridge wrote:
>>>>> On 29-07-2014 10:00, Alex Gutteridge wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I have a time course RNASeq experiment and I'd like to detect DEU
>>>>>> between early and late stages. I am trying to use 'Age' as a
>>>>>> continuous variable in my design, but I'm getting an error which I
>>>>>> suspect is related to this e.g:
>>>>>> 
>>>>>>> summary(sample.data.ipsc[,1:6])
>>>>>>     Sample          CellType     Subtype Donor         Age
>>>>>>  Length:28          iPSC:28   iPSC   :28   4555   :28 Min.   : 
>>>>>> 2.000
>>>>>>  Class :character             HBR    : 0   D4721  : 0 1st Qu.: 
>>>>>> 4.000
>>>>>>  Mode  :character             L2     : 0   D4749  : 0 Median : 
>>>>>> 8.000
>>>>>>                               L3     : 0   D6002  : 0 Mean   : 
>>>>>> 7.821
>>>>>>                               L4     : 0   D6005  : 0 3rd 
>>>>>> Qu.:10.500
>>>>>>                               L5     : 0   D6008  : 0 Max.   
>>>>>> :14.000
>>>>>>                               (Other): 0   (Other): 0
>>>>>>  Passage
>>>>>>  42:3
>>>>>>  48:7
>>>>>>  49:5
>>>>>>  52:6
>>>>>>  56:7
>>>>>> 
>>>>>>> dxd = DEXSeqDataSetFromHTSeq(countFiles,
>>>>>> sampleData=sample.data.ipsc,
>>>>>>                                design= ~ sample + exon + Age:exon,
>>>>>> flattenedfile=flattenedFile)
>>>>>> 
>>>>>>> BPPARAM = MulticoreParam(workers=24)
>>>>>> 
>>>>>>> dxd = estimateSizeFactors(dxd)
>>>>>>> dxd = estimateDispersions(dxd, BPPARAM=BPPARAM)
>>>>>>> dxd = testForDEU(dxd, BPPARAM=BPPARAM)
>>>>>>> dxd = estimateExonFoldChanges(dxd, fitExpToVar="Age", 
>>>>>>> BPPARAM=BPPARAM)
>>>>>> Error: 8346 errors; first error:
>>>>>>   Error in FUN(1:3[[1L]], ...): Non-factor in model frame
>>>>>> 
>>>>>> For more information, use bplasterror(). To resume calculation, 
>>>>>> re-call
>>>>>>   the function and set the argument 'BPRESUME' to TRUE or wrap the
>>>>>>   previous call in bpresume().
>>>>>> 
>>>>>> First traceback:
>>>>>>   31: estimateExonFoldChanges(dxd, fitExpToVar = "Age", BPPARAM =
>>>>>> BPPARAM)
>>>>>>   30: bplapply(testablegenes, geteffects, BPPARAM = BPPARAM)
>>>>>>   29: bplapply(testablegenes, geteffects, BPPARAM = BPPARAM)
>>>>>>   28: mclapply(X = X, FUN = FUN, ..., mc.set.seed = 
>>>>>> BPPARAM$setSeed,
>>>>>>           mc.silent = !BPPARAM$verbose, mc.cores = 
>>>>>> bpworkers(BPPARAM),
>>>>>>           mc.cleanup = if (BPPARAM$cleanup) BPPARAM$cleanupSignal
>>>>>> else FALSE)
>>>>>>   27: lapply(seq_len(cores), inner.do)
>>>>>>   26: lapply(seq_len(cores), inner.do)
>>>>>>   25: FUN(1:24[[1L]], ...)
>>>>>>   24: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = 
>>>>>> TRUE))
>>>>>>   23: parallel:::sendMaster(...)
>>>>>>   22: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
>>>>>>   21: tryCatch(expr, error = function(e)
>>>>>> 
>>>>>> Can DEXSeq accept a model with a continuous variable? Does it make
>>>>>> sense to do so? (I do the same thing with DESeq2 to detected DE 
>>>>>> and it
>>>>>> works fine). Is this error due to that? Note that all the other 
>>>>>> steps
>>>>>> seem to run fine and I can get results (though I don't have many
>>>>>> significant hits - not sure if this is related or not). If not 
>>>>>> what is
>>>>>> best practice? Just split the data set into 'early' and 'late' 
>>>>>> samples
>>>>>> and run that as a factor?
>>>>>> 
>>>>>> Alex Gutteridge
>>>>> 
>>>>> To sort of answer my own question: cutting the age vector into two
>>>>> bins (early <8 weeks and late >8 weeks) and using that factor seems 
>>>>> to
>>>>> work and give many significant hits:
>>>>> 
>>>>>> sample.data.ipsc$AgeBin = cut(sample.data.ipsc$Age,2)
>>>>> 
>>>>>> dxd = DEXSeqDataSetFromHTSeq(countFiles,
>>>>>                                sampleData=sample.data.ipsc,
>>>>>                                design= ~ sample + exon + 
>>>>> AgeBin:exon,
>>>>>                                flattenedfile=flattenedFile)
>>>>> 
>>>>>> BPPARAM = MulticoreParam(workers=12)
>>>>> 
>>>>>> dxd = estimateSizeFactors(dxd)
>>>>>> dxd = estimateDispersions(dxd, BPPARAM=BPPARAM)
>>>>>> dxd = testForDEU(dxd, BPPARAM=BPPARAM)
>>>>> 
>>>>> Which makes me think that DEXSeq wasn't doing what I hoped with the
>>>>> previous analysis (quite apart from the error). Is there any way to
>>>>> rejig the design to cope with a continuous variable as opposed to a
>>>>> factor? I *think* the underlying biological question makes sense: 
>>>>> Are
>>>>> there exons showing DEU that correlates with time?
>>>>> 
>>>>> Alex Gutteridge
>>>> 
>>>> I'm just going to bump this question one last time as I'd love to 
>>>> know
>>>> if there's a better solution to what I have now. In short: I have a
>>>> reasonable detailed time course RNASeq experiment (8 timepoints; 
>>>> 2-14
>>>> weeks; 3-4 replicates at all time points except one) and would like 
>>>> to
>>>> detect time dependent DEU using DEXSeq. Slicing the timecourse into 
>>>> two
>>>> bins (at 8 weeks) and doing a straight early vs late comparison 
>>>> works
>>>> and detects plenty of interesting stuff. Trying to model age as a
>>>> continuous variable gives an error when determining fold changes and
>>>> doesn't report any significant DEU.
>>>> 
>>>> See first plot here for an example (exons 27/28 show strong DEU
>>>> comparing <8 weeks to >8 week samples) shown with a DEXSeq plot:
>>>> 
>>>> http://dx.doi.org/10.6084/m9.figshare.1124172
>>>> 
>>>> But the 8 week split is entirely arbitrary, really time (Age) is a
>>>> continuous variable and so I'd like to model it as such (maybe this 
>>>> is
>>>> just OCD on my behalf, but I'm sure I must also be loosing some 
>>>> power by
>>>> crudely splitting the timecourse like this; also DESeq2 lets me 
>>>> model
>>>> the genewise DE with a continuous variable).
>>>> 
>>>> The second plot from the figshare link shows the RPKMs of those two
>>>> exons (27/28). On top of a background of increased gene-wise 
>>>> expression
>>>> there is clearly a time dependent effect on the relative usage of 
>>>> these
>>>> two exons, which it must be possible to model better than a crude 
>>>> split
>>>> at 8w (black dashed line). Is there any trick to do this with 
>>>> DEXSeq?
>>>> 
>>>> Alex Gutteridge
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor