[BioC] (EdgeR) statistical justification of partitioning dataset for multiple analysis

Fri Jan 31 16:01:51 CET 2014

Dear Ryan,
Thanks for your input. I did as you suggested.
For all treatment groups combined I got a common BCV of 0.08.

When I split my dataset into the 3 treatment groups and calculate the
BCV for each separately, I get these common BCVs:
control: 0.081   treatment1: 0.085   treatment2: 0.096

When I split the data for each pairwise analysis, I get these common BCVs:
control + treat1: 0.078     control + treat2: 0.084     treat1 + treat2: 0.082

So it seems that treatment2 has some extra BCV compared to the others, but
these differences are not so big when you look at each analysis for
treatment comparison. I also don't think the BCVs for each analysis look
much different when you look at the BCV plots themselves (attached).
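
For reference, the BCV is the square root of the negative binomial dispersion, so the common BCVs quoted above can be put on the dispersion scale by squaring them. The following is a minimal base R sketch using only the numbers reported in this message; the vector names and labels are illustrative, not edgeR output.

```r
# Common BCV values as reported in this message (labels are illustrative).
bcv <- c(all = 0.08,
         control = 0.081, treatment1 = 0.085, treatment2 = 0.096,
         control_treat1 = 0.078, control_treat2 = 0.084, treat1_treat2 = 0.082)

# BCV is the square root of the dispersion, so squaring gives the dispersion.
dispersion <- bcv^2

# Each dispersion relative to the common dispersion of the full dataset.
ratio_to_all <- dispersion / dispersion["all"]

print(round(cbind(bcv, dispersion, ratio_to_all), 4))
```

On this scale, treatment2 on its own has a common dispersion 1.44 times that of the full dataset, while the three pairwise subsets lie between about 0.95 and 1.10 times the full-dataset value.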

I have to revise my statement about finding more genes after splitting the
dataset compared to an analysis on the full dataset.
I find more genes (almost double) for treatment 1 vs control when I split
the dataset.
I find fewer genes (almost half) for treatment 2 vs control when I split the
dataset.
I find more or fewer genes (depending on which timepoint you look at) for
treatment 2 vs treatment 1 when I split the dataset.

This puzzles me a bit.

But in general, when all BCVs are more or less the same, would you gain
anything by splitting the dataset, or does that not make much sense
statistically?
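
One concrete difference between the two strategies is the number of residual degrees of freedom left for estimating dispersions. The sketch below builds the design from the original post in base R for the full dataset and for one pairwise subset. It assumes exactly one library per subject x treatment x timepoint combination (24 samples in total), which the thread does not state explicitly, and the level names (S1..S4, control/treat1/treat2, t1/t2) are hypothetical.

```r
# Hypothetical sample table: 4 subjects x 3 treatments x 2 timepoints,
# one library per combination (an assumption, not stated in the thread).
samples <- expand.grid(
  subject   = factor(paste0("S", 1:4)),
  Treatment = factor(c("control", "treat1", "treat2")),
  Time      = factor(c("t1", "t2"))
)

# Residual degrees of freedom = number of samples - rank of the design matrix.
residual_df <- function(d) {
  design <- model.matrix(~ subject + Treatment + Time + Treatment:Time, data = d)
  nrow(design) - qr(design)$rank
}

# Strategy 1: fit the design to all samples.
df_full <- residual_df(samples)

# Strategy 2: keep only two treatments (here control and treat1),
# dropping the unused factor level before building the design.
subset_samples <- droplevels(samples[samples$Treatment %in% c("control", "treat1"), ])
df_subset <- residual_df(subset_samples)

cat("full dataset:", df_full, "residual df\n")
cat("pairwise subset:", df_subset, "residual df\n")
```

Under these assumptions the full fit has 24 samples and 9 coefficients, and the subset fit has 16 samples and 7 coefficients. More residual degrees of freedom means more information per gene for the dispersion estimate, but only if the assumption of one shared dispersion per gene across treatments is reasonable, which is what the BCV comparison above is meant to check.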

Best regards
Adriaan

2014-01-30 Ryan <rct at thompsonclan.org>:

> Hi Adriaan,
>
> If I understand correctly, you have 3 different treatments, i.e. control,
> treatment 1, and treatment 2, and you have fit the same model formula to
> the full dataset as well as all 3 combinations of only 2 treatments, and
> you are getting significantly different results between the 3-treatment fit
> and the 2-treatment fits. I think the first thing you need to do is to look
> at the result of plotBCV for each analysis. It is possible that one of your
> treatments has significantly more biological variability across all genes
> than the others. edgeR assumes that each gene has the same BCV across all
> conditions, so that it can more robustly estimate a single dispersion value
> for each gene. So look at the plotBCV output from all your analyses, and
> see if the BCV estimates differ significantly. This would surely explain
> what you are seeing. You may also want to estimate dispersions from each
> treatment group individually (drop Treatment from the model formula in this
> case). The tagwise dispersions will not be very robust in this case, but
> the trend and common dispersions can help you figure out which treatment
> has the most biological variability.
>
> If the dispersion estimates don't explain your differing p-values, ask
> back here and maybe someone else will have another idea.
>
> Good luck,
>
> -Ryan
>
>
> On 1/30/14, 9:43 AM, Adriaan Sticker wrote:
>
>> Dear all,
>>
>> I'm analysing already-mapped reads from sequencing data for differential
>> expression with edgeR. My experimental setup is as follows:
>> I have samples from 4 different subjects. Material from each subject was
>> treated with 2 different treatments (and a control) at 2 timepoints.
>>
>> I want to analyse the effect of the treatments (compared to control and
>> compared to each other).
>>
>> In edgeR I used the following design:
>> model.matrix(~ subject + Treatment + Time + Treatment:Time)
>>
>> I considered 2 strategies to analyse the data:
>>
>> Estimate parameters from the above-mentioned design with all data (all
>> samples) and use different contrasts to get the differentially expressed
>> genes I want.
>>
>> OR
>>
>> Use only the samples of the two treatments I want to compare (e.g. control
>> vs treatment 1, treatment 1 vs treatment 2) to fit the parameters. Repeat
>> this 3 times until I have compared all 3 treatments with each other.
>> So effectively 3 different analyses, each using only a subset (2/3) of the
>> data.
>>
>> I noticed that I could find considerably more significantly differentially
>> expressed genes between 2 treatments with the latter approach. But I
>> wonder how correct this approach is. Will I, for example, have problems
>> with multiple testing? (I control each analysis at an FDR of 5% with
>> Benjamini-Hochberg.)
>>
>> Thanks in advance
>> Kind regards
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bcv_all.png
Type: image/png
Size: 32597 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bcv_control_treat1.png
Type: image/png
Size: 30368 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bcv_control_treat2.png
Type: image/png
Size: 31007 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bcv_treat1_treat2.png
Type: image/png
Size: 30864 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment-0003.png>