[BioC] confusing P-value of one gene

Thu Aug 29 04:15:50 CEST 2013

Dear Xinwei,

This is a correct result.  The reason that the interaction is not 
statistically significant is inherent in the log-linear model, and hence 
in the definition of interaction for this sort of model.

You are probably thinking that the cpm values are much higher for the 
joint condition CX&RGF than for the other conditions, hence there should 
be a positive interaction, and this should be statistically significant.

Indeed, had you tested the joint condition vs the other three conditions 
it would certainly be significantly higher.

However, the interaction is different.  The problem is that there are zero 
counts for the controls.  Hence the fold change from control to CX is 
infinite, and so is the fold change from control to RGF.  A log-linear 
model with no interaction multiplies these two fold changes together, so 
the counts in the joint condition can be indefinitely large even in the 
absence of any positive interaction.  Hence there is no evidence for any 
positive interaction.  In fact, you could make the counts for the CX&RGF 
libraries as large as you like, and the interaction would never become 
significant.  To make this clear, the counts could have been:

   0 0 0 0 0 1 0 0 1 1e10 1e10 1e10

and this would not give a significant interaction.  So long as there are 
zero counts for the controls, and at least one count for the single 
treatments CX and RGF, the interaction will never become significant.
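
To put rough numbers on this, here is a small sketch in base R.  It is 
not edgeR output.  The group means are illustrative values that I have 
made up to resemble the example counts above, and the variable names are 
my own.  It simply computes the joint-condition mean that a log-linear 
model with no interaction would predict as the control mean shrinks 
towards zero.

```r
# Illustrative group means (assumed values, not taken from the real data):
# one read across three libraries for each single treatment.
mu_cx  <- 1 / 3
mu_rgf <- 1 / 3

# Let the control mean shrink towards zero.
mu_ctrl <- c(1, 1e-2, 1e-4, 1e-8)

# No interaction on the log scale means
#   log(joint) = log(CX) + log(RGF) - log(control)
joint_no_interaction <- exp(log(mu_cx) + log(mu_rgf) - log(mu_ctrl))

print(data.frame(mu_ctrl, joint_no_interaction))
```

The predicted joint mean grows without bound as the control mean 
approaches zero, so any observed CX&RGF count, however large, remains 
compatible with the no-interaction model.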

You should ignore the logFC in this case, because the interaction logFC is 
not defined in any meaningful way for this data.

On the other hand, if you had any positive counts for the controls, then 
the interaction would suddenly become significant, because the fold 
changes from control to CX and control to RGF would now be finite.

I suspect that you might find it more meaningful to test for

   CX&RGF - (control+CX+RGF)/3

This will certainly be significant.  Or else test for CX&RGF vs each of 
the other three individually.
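
For what it is worth, one way to set up such a contrast is sketched 
below.  This is only a sketch under assumptions: the counts are 
simulated purely for illustration, the group label "CX.RGF" and the 
object names are my own choices rather than anything from your script, 
and estimateDisp() assumes a reasonably recent edgeR.  The idea is to 
use one coefficient per group instead of the factorial parametrization, 
and then test the contrast directly.

```r
library(edgeR)  # edgeR depends on limma, which provides makeContrasts()

set.seed(1)
# Simulated counts for illustration only: 200 genes x 12 libraries.
group <- factor(rep(c("control", "CX", "RGF", "CX.RGF"), each = 3),
                levels = c("control", "CX", "RGF", "CX.RGF"))
counts <- matrix(rnbinom(200 * 12, mu = 50, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), NULL))
# Give the first gene a pattern similar to the one under discussion.
counts[1, ] <- c(0, 0, 0, 0, 1, 0, 0, 1, 1, 100, 180, 140)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Group-means parametrization: one column per treatment group.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# CX&RGF versus the average of the other three groups.
con <- makeContrasts(CX.RGF - (control + CX + RGF) / 3, levels = design)
lrt <- glmLRT(fit, contrast = con)
topTags(lrt, n = 5)
```

Testing CX&RGF against each of the other groups individually works the 
same way, with contrasts such as CX.RGF - control.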

As I've said before, I am not a fan of factorial interaction models for 
genomic data, and this is yet another example of why this is so.

Best wishes
Gordon

On Wed, 28 Aug 2013, Xinwei Han wrote:

> Hi,
>
> I manually checked p-values from edgeR and found the p-value of this 
> particular gene, AT1G04500, difficult to understand. The CPM of this 
> gene is like this:
>
> control replicate1: 0
> control replicate2: 0
> control replicate3: 0
> CX replicate1: 0
> CX replicate2: 0.24
> CX replicate3: 0
> RGF replicate1: 0
> RGF replicate2: 0.14
> RGF replicate3: 0.19
> CX&RGF replicate1: 25.14
> CX&RGF replicate2: 44.36
> CX&RGF replicate3: 34.62
>
> I fitted a GLM with model.matrix(~RGF + CX + RGF:CX).  To find genes 
> with a significant interaction effect, I ran lrt <- glmLRT(fit, coef=4), 
> which gives the following results for this gene:
>
> logFC: 5.43
> logCPM: 3.19
> LR: 0.012
> PValue: 0.91
>
> I do not understand why such a dramatic change and such a large logFC 
> have a p-value of 0.91. I attached the data and R script I used. Could 
> you take a look to see whether I did something wrong in the script? Or 
> is there some other reason for this?
>
> I used the latest version of R and edgeR. "ms" in the data and script is
> the control.
>
> Thanks
> Xinwei
>
