[BioC] BH vs BY in p.adjust + p-value distribution

lemerle at embl.de lemerle at embl.de
Sat Jul 29 18:14:42 CEST 2006


>> by the way, in cases where it's not uniformly distributed, can we not get an
>> idea, from the p-value range of the over-represented bins on the histogram,
>> of the effect size associated with the differential probesets responsible
>> for this non-uniformity?
>> or the other way around, if i happened to know that there were differential
>> probesets but all of only moderate effect size, i might expect a bulge at
>> moderate p-values, while lower ones could well instead be uniformly
>> distributed, right?
>
> In principle yes, but that would mean that your test is underpowered. 
> Also, the p-value is (generally) the result of two things: effect 
> size and sample size.
>
>> but then if that were the case, could it also be that if all differential
>> probesets had similar p-values, say 0.2, they could more easily be
>> discovered than the same number associated with a lower but wider range of
>> p-values, only because they would add significance to each other?
>
> This seems like a very artificial scenario, and unlikely due to 
> stochastic effects.
>
>> this doesn't quite sound right if it's true that the adjustment procedure
>> preserves the rank that the genes have from the p-value.
>
> a very small p-value is no guarantee of being differentially
> expressed; it could still arise by chance (from the uniform
> distribution), it is just very unlikely.
>

hi wolfgang,
ok. what i was naively alluding to with hypothetical bulgy p-value
distributions was more related to your present comment that it'd be difficult
to identify which are the differential probesets in the over-represented range
of p-values.
as i understand it, it's actually impossible: the only probeset-specific
information used for calculating the adjusted values is the p-value itself, and
since, by definition, probesets with equal p-values don't differ in that
respect, this forces ranks to be preserved.
so a real differential probeset with the same p-value as a non-differential one
cannot, after adjustment, get a smaller value than the latter, ie cannot be
picked up without the other being picked up too.
this is easily seen by plotting raw p-values against adjusted values: the curve
is monotonic, ie it never decreases (apologies that this may be very obvious to
you and many others)
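
a minimal sketch of that check, using simulated p-values rather than the real
fit (the mixture below and the deliberate tie are assumptions made only for
illustration; p.adjust and its "BH" and "BY" methods are the functions
discussed in this thread):

```r
# Sketch: BH and BY adjusted values depend on the raw p-values alone,
# so they can never reorder probesets. The p-values here are simulated.
set.seed(1)
p <- c(runif(900), rbeta(100, 0.5, 10))  # hypothetical mix of null and non-null
p <- c(p, p[1])                          # deliberate tie with the first 'probeset'

bh <- p.adjust(p, method = "BH")
by <- p.adjust(p, method = "BY")

o <- order(p)
# adjusted values never decrease as the raw p-value increases
stopifnot(!is.unsorted(bh[o]), !is.unsorted(by[o]))
# tied raw p-values receive identical adjusted values
stopifnot(bh[1] == bh[length(p)], by[1] == by[length(p)])

# raw against adjusted p-values: both curves are non-decreasing
plot(p[o], bh[o], type = "l", xlab = "raw p-value", ylab = "adjusted p-value")
lines(p[o], by[o], lty = 2)
legend("bottomright", legend = c("BH", "BY"), lty = 1:2)
```

the curves can have flat stretches, so the ordering is preserved but not
necessarily strictly.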

but about the bulge, i think i am again making a mistake in thinking these
might arise in a distribution of p-values from selected datasets...
i tried to have a look at that empirically, by generating datasets in which a
fraction of the expression values of a subset of 'probesets' is drawn from a
normal distribution with a slightly higher mean, to see at what magnitude of
change i started detecting departures from the uniform distribution, and in
what range of p-values. and what i see is that the only trend emerging is
always that the smallest p-value range is the most populated one, with counts
decreasing until the distribution is again uniform.
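
roughly the kind of simulation i mean, as a sketch only: the number of
probesets, the group size, the size of the shift and the use of a plain
two-sample t.test are all assumptions chosen for illustration, not the settings
of the real experiment:

```r
# Sketch: p-value histogram when a subset of 'probesets' has a shifted mean.
set.seed(42)
n_probes    <- 5000  # hypothetical total number of probesets
n_diff      <- 500   # hypothetical number of differential probesets
n_per_group <- 5     # hypothetical replicates per condition
shift       <- 1     # hypothetical mean shift, in units of the noise sd

a <- matrix(rnorm(n_probes * n_per_group), nrow = n_probes)
b <- matrix(rnorm(n_probes * n_per_group), nrow = n_probes)
is_diff <- seq_len(n_probes) <= n_diff
b[is_diff, ] <- b[is_diff, ] + shift   # shift the first n_diff rows only

# one two-sample t-test per probeset
pvals <- vapply(seq_len(n_probes),
                function(i) t.test(a[i, ], b[i, ])$p.value,
                numeric(1))

brk <- seq(0, 1, by = 0.05)
h   <- hist(pvals, breaks = brk, col = "orange",
            main = "simulated raw p-values")
# counts per bin, split by the (here known) status of each probeset
counts <- cbind(mids = h$mids,
                null = hist(pvals[!is_diff], breaks = brk, plot = FALSE)$counts,
                diff = hist(pvals[is_diff],  breaks = brk, plot = FALSE)$counts)
print(counts)
```

changing shift or n_per_group moves mass of the 'diff' column towards or away
from the first bin, which is the pattern described above.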

can anybody confirm that a bulge at moderate p-values is, in theory, not
expected?
what sorts of shapes can one expect from a p-value distribution when it isn't
uniform?

sorry that this is no longer directly related to bioc
but thanks in advance for comments,

caroline

Quoting Wolfgang Huber <huber at ebi.ac.uk>:

> Hi Caroline,
>
>> hi wolfgang,
>> well.... not at all uniform:
>
> That is good - the distribution that you see is expected to be a 
> mixture of uniform (for the non differentially expressed genes) and 
> something which is concentrated near p=0 (for the differentially 
> expressed genes). The power of your test (e.g. the sample size) 
> determines how well the differentially expressed genes indeed get 
> p-values close to 0.
>
>>> x <- hist(fit2$p.value[,1], breaks=30, col="orange",
>>>           main="distribution of raw p-values", labels=TRUE)
>>
>>> cbind(x$mids,x$counts)
>>        [,1] [,2]
>>  [1,] 0.025 1891
>>  [2,] 0.075  270
>>  [3,] 0.125  209
>>  [4,] 0.175  123
>>  [5,] 0.225  100
>>  [6,] 0.275  101
>>  [7,] 0.325   79
>>  [8,] 0.375   79
>>  [9,] 0.425   85
>> [10,] 0.475   57 .....
>>
>> but from here on, the distribution is uniform (around 50 in every bin until
>> p-val=1). so there are a lot of differential probesets in this contrast. but
>> between 519 and 1032 as estimated from BY and BH adjustments with 1% FDR,
>> there's quite a difference.... or can i estimate it directly from this
>> histogram ..... subtracting the baseline gives me 2439 probesets, almost 70%
>> of the whole set:
>>
>>> baseline <- mean(x$counts[11:20])
>>> sum(x$counts-baseline)
>> [1] 2439
>>
>> how safe is this ?
>
> This is a good estimate of the number of differentially expressed genes if
>
> - your p-values are indeed uniformly distributed for those genes that
>   fall under the null hypothesis
> - your test has an OK power to find the alternatives
>
> and of course it is more difficult to decide which ones they are.
>

>  Cheers
>  Wolfgang
>
>>
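
for reference, the baseline subtraction quoted above can be written as a small
function. this is only a sketch: estimate_n_diff, its arguments and the
simulated p-values are hypothetical, and it relies on the two conditions listed
above (uniform null p-values, and a flat histogram beyond the chosen cut-off):

```r
# Sketch: estimate the number of differential probesets by subtracting
# the flat (null) baseline from every bin of the p-value histogram.
estimate_n_diff <- function(p, width = 0.05, flat_from = 0.5) {
  h <- hist(p, breaks = seq(0, 1, by = width), plot = FALSE)
  # average count in the bins assumed to contain only null p-values
  baseline <- mean(h$counts[h$mids > flat_from])
  sum(h$counts - baseline)
}

# simulated example: 4000 null p-values plus 1000 concentrated near 0
set.seed(7)
p_sim <- c(runif(4000), rbeta(1000, 0.3, 20))
est <- estimate_n_diff(p_sim)
print(est)
```

with width = 0.05 and flat_from = 0.5 the baseline is the mean of the last ten
bins, as in mean(x$counts[11:20]) above.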


