[R] strange (to me) p-value distribution

Mark Kimpel mwkimpel at gmail.com
Sun Jun 8 08:24:36 CEST 2008


Wolfgang,

Thank you for both the explanation and the beautiful R code to
demonstrate your point. Even after seeing the empirical evidence,
however, I couldn't get the underlying mechanism into my head. I
tweaked your code a bit to make the batch effect even bigger, to the
point where, ah ha, the distribution no longer approximates normal but
is clearly bivariate (additional histograms).

I went back to my original data and looked at the histogram of logged
expression values. Although not as clear cut, the distribution is not
normal and indeed there is a hint of several "humps" corresponding to
different batches. I need to see what effect their inclusion into my
model is.

Lesson relearned about the importance of visualizing data before
starting an analysis.

My slightly tweaked code is below, in case anyone wants to look at it.

Mark


library("genefilter")

nr <- 31000
nc <- 20

x.1  <- matrix(rnorm(nr*nc), nrow = nr, ncol = nc)

fact <- factor(1:nc %% 2)

sapply(split(x.1, fact), mean)
sapply(split(x.1, fact), sd)

rt1 <- rowttests(x.1, fact)

## add a batch effect
x.2 <- x.1
x.2[, 1:10] <- x.2[, 1:10] + pi

sapply(split(x.2, fact), mean)
sapply(split(x.2, fact), sd)

rt2 <- rowttests(x.2, fact)

par(mfrow = c(2,2))
hist(x.1, breaks = 50, col = "mintcream")
hist(x.2, breaks = 50, col = "mistyrose")
hist(rt1$p.value, breaks = 100, col = "mintcream")
hist(rt2$p.value, breaks = 100, col = "mistyrose")


On Sat, Jun 7, 2008 at 6:40 PM, Wolfgang Huber <huber at ebi.ac.uk> wrote:
>
> Dear Mark,
>
> try out the example code below. Such a p-value distribution often occurs  if
> you have "batch" effects, i.e. if the between-group variability is in fact
> less than the within-group variability.
>
> In the example below, I do, for each row of x, a t-test between the values
> in the even and odd columns; for rt2, a "batch effect" has been added to
> columns 1:10.
>
>  hope this helps
>        Wolfgang
>
>
> library("genefilter")
>
> nr = 31000
> nc = 20
>
> x  = matrix(rnorm(nr*nc), nrow=nr, ncol=nc)
>
> rt1 = rowttests(x, factor(1:nc %% 2))
>
> ## add a batch effect
> x[, 1:10] = x[, 1:10] + pi/2
> rt2 = rowttests(x, factor(1:nc %% 2))
>
> par(mfrow=c(2,1))
> hist(rt1$p.value, breaks=100, col="mistyrose")
> hist(rt2$p.value, breaks=100, col="mistyrose")
>
>
> ------------------------------------------------------------------
> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>
>
> Mark Kimpel a écrit 07/06/2008 18:39:
>>
>> I'm working with a genomic data-set with ~31k end-points and have
>> performed an F-test across 5 groups for each end-point. The QA
>> measurments on the individual micro-arrays all look good. One of the
>> first things I do in my work-flow is take a look at the p-valued
>> distribution. it is my understanding that, if the findings are due to
>> chance alone, the p-value distribution should be uniform. In this case
>> the histogram, even with 1000 break points, starts low on the left and
>> climbs almost linearly to the right. In other words, very skewed
>> towards high p-values. I understand that this could be happening by
>> chance alone, but the same behavior is seen in the two contrasts of
>> interest I looked at and I have seen it in a couple of our other
>> genomic, high-dimensional experiments as well. I might also add that I
>> looked at the actual numbers of genes with p-val < X and indeed, for
>> each X < 0.05, there are far fewer sig. genes than one would expect by
>> chance.
>>
>> I can't figure out what is causing this and, if there is a cause, I'd
>> like to be able to tell the experimenter if it indicates a technical
>> factor. I've had other experiments where the p-value dist approximates
>> normal and of course those that have nice spikes at low p-values
>> indicating we have some significant genes.
>>
>> I'm addressing this hear rather than to BioC because I suspect there
>> is some basis statistical mechanism that could explain this. Is there?
>>
>> Mark
>>
>
>
>



-- 
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN 46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 663-0513 Home (no voice mail please)



More information about the R-help mailing list