[BioC] multtest MTP raw P values = 0

Tue Mar 4 06:00:59 CET 2008

Houston,

Thank you very much for your advice!

1) Yes, I could reproduce both the problem and your solution with a
much smaller data set (see code below).

I made a toy dataset with just 1 (one) gene and 2 groups of 5 samples
each, with all values in group 1 lower than values in group
2. Bootstrap with 100 and 1000 iterations gave P=0, and with 10,000
and 100,000 gave P>0, the latter values also consistent with both
the exact permutation test and t-test. It would take a lot of cpu time
to do B >= 10000 iterations with my real data set (and zeros would
still be possible), but I think there is no other way around it.

2) I used ssmaxT procedure with augmentation to get FDR, reported as
adjp. I used this method instead of Benjamini-Hochberg ("BH" in
mt.rawp2adjp) because it requires less assumptions, although it is
"conservative". When "BH" in mt.rawp2adjp is used, the adjp become
much lower, as expected (see below). BTW, with BH, the problem I
reported (same rawp values correspond to different adjp) is gone.

I am not sure which of these 2 methods give more realistic FDR
estimates. I could not find a consensus on multiple testing correction
procedures. There are reasons not to use Benjamini-Hochberg FDR
method, see for example:

http://view.ncbi.nlm.nih.gov/pubmed/16644791
Jung SH, Jang W. Bioinformatics. 2006 Jul 15;22(14):1730-6.

http://www.urmc.rochester.edu/smd/biostat/people/faculty/AOAS102.pdf
Gordon, A., Glazko, G., Qiu, X., and Yakovlev, A. (2007). The Annals
of Applied Statistics 1:179-190.

Thanks again.
Best regards,

Timur
--
Timur Shtatland, PhD
Center for Molecular Imaging Research
Massachusetts General Hospital
149 13th Street, Room 5408
Charlestown, MA 02129
tshtatland at mgh dot harvard dot edu

############################################################

> library(multtest)
> nSam <- 5
> X <- matrix(c(1:nSam, 1:nSam+10), nrow=1)
> print(X)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5   11   12   13   14    15
> Y <- c(rep(0,nSam), rep(1,nSam))
> print(Y)
 [1] 0 0 0 0 0 1 1 1 1 1
> 
> seed <- 99
> for (B in c(100, 1000, 10000, 100000)) {
+     TTBoot <- MTP(X=X, Y=Y, test = "t.twosamp.unequalvar", alternative =
"two.sided", typeone="fdr", method="ss.maxT", fdr.method="conservative", B=B,
seed=seed)
+     MTPRes <- data.frame(PRaw=TTBoot at rawp, PAdj=TTBoot at adjp)
+     cat(sprintf("%d iterations:\n", B))
+     print(format(MTPRes, digits=10, scientific=TRUE))
+ }
running bootstrap...
iteration = 100

100 iterations:
   PRaw  PAdj
1 0e+00 0e+00
running bootstrap...
iteration = 100 200 300 400 500 600 700 800 900 1000

1000 iterations:
   PRaw  PAdj
1 0e+00 0e+00
running bootstrap...
iteration = 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100
3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700
4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 6100 6200 6300
6400 6500 6600 6700 6800 6900 7000 7100 7200 7300 7400 7500 7600 7700 7800 7900
8000 8100 8200 8300 8400 8500 8600 8700 8800 8900 9000 9100 9200 9300 9400 9500
9600 9700 9800 9900 10000

10000 iterations:
   PRaw  PAdj
1 3e-04 6e-04
running bootstrap...
iteration = 100 200 300... 
99900 100000

100000 iterations:
   PRaw  PAdj
1 1e-04 2e-04
> 

> library(exactRankTests)

> perm.test(X[Y==1],X[Y==0])

	2-sample Permutation Test

data:  X[Y == 1] and X[Y == 0] 
T = 65, p-value = 0.007937
alternative hypothesis: true mu is not equal to 0 

> t.test(X[Y==1],X[Y==0])

	Welch Two Sample t-test

data:  X[Y == 1] and X[Y == 0] 
t = 10, df = 8, p-value = 8.488e-06
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
  7.693996 12.306004 
sample estimates:
mean of x mean of y 
       13         3 

############################################################

## same rawp, added BH:
> TTMargFDR <- mt.rawp2adjp(rawp=TTBoot at rawp, proc="BH")
> MTPRes <- data.frame(PRaw=TTBoot at rawp, PAdj=TTBoot at adjp, PAdjBH=
TTMargFDR$adjp[order(TTMargFDR$index),-1])
> print(format(subset(MTPRes[order(MTPRes$PRaw, MTPRes$PAdj), ], PRaw < 2e-3),
digits=10, scientific=TRUE))
             PRaw            PAdj          PAdjBH
202508_s_at 0e+00 0.000000000e+00 0.000000000e+00
203130_s_at 0e+00 0.000000000e+00 0.000000000e+00
206780_at   0e+00 0.000000000e+00 0.000000000e+00
218820_at   0e+00 0.000000000e+00 0.000000000e+00
205825_at   0e+00 8.000000000e-03 0.000000000e+00
203000_at   0e+00 1.400000000e-02 0.000000000e+00
204465_s_at 0e+00 1.400000000e-02 0.000000000e+00
206282_at   0e+00 1.400000000e-02 0.000000000e+00
206502_s_at 0e+00 1.400000000e-02 0.000000000e+00
208399_s_at 0e+00 1.400000000e-02 0.000000000e+00
209598_at   0e+00 1.400000000e-02 0.000000000e+00
209755_at   0e+00 1.400000000e-02 0.000000000e+00
212805_at   0e+00 1.400000000e-02 0.000000000e+00
203001_s_at 0e+00 2.400000000e-02 0.000000000e+00
204811_s_at 0e+00 2.400000000e-02 0.000000000e+00
205174_s_at 0e+00 2.400000000e-02 0.000000000e+00
...

218816_at   1e-03 1.000000000e+00 1.051067961e-02
219939_s_at 1e-03 1.000000000e+00 1.051067961e-02
221020_s_at 1e-03 1.000000000e+00 1.051067961e-02

############################################################
############################################################

On Feb 29, 2008, at 7:26 PM, Houston Gilbert wrote:

Two things:

1.) The rawp digits are exactly zero because they are calculated as follows
(or similarly):

Here - Tn are the observed test statistics, Z is the matrix of bootstrap
test statistics.

rawp <- apply(Tn < Z, 1, mean)

This means that if no observed test statistics are greater than the sample
of resampled null test statistics (marginally, within rows), the raw
p-values will be zero, not 1/B.  The rawp here are very coarse estimates and
really reflect the probability of seeing a value of test statistic that
extreme given each marginal null distibution. This is a matter of taste I
guess, and might be something for to look into.  If you want non-zero
p-values, I imagine the best way to increase B, since you will get more
accurate resolution in the tails of your test statistics null distribution
estimate.  They may still just be zero, though.

2.)  The procedure you are actually doing is an augmentation procedure which
takes the _adjp_ calculated from the ssmaxT FWER-controlling procedure and
augments them to control the FDR in one of two ways (here,
"conservative"-ly").  Thus, your final FDR adjp are the FWER ssmaxT adjusted
p-values calculated at the observed values of the test statistics (based off
of an ecdf() call performed over column maxima of the null matrix) which
then get augmented to FDR-controlling p-values.  So I imagine the ssmaxT

adjp to be way more smooth. The equality of adjps in the output is also
often the byproduct of enforcing some kind of monotonicity constraint on the
p-values themselves when going over subsets of rows.

As an aside:  The augmentation procedures can be conservative in practice.
You may just want to try "BH" in mt.rawp2adjp and see what you get.

Houston

Houston Gilbert
Doctoral Student in Biostatistics
University of California, Berkeley
houston at stat.berkeley.edu
www.stat.berkeley.edu/~houston

On 2/29/08, James Bullard <bullard at berkeley.edu> wrote:

a post on the bioconductor mailing list - if you are not already on it

Begin forwarded message:

From: Timur Shtatland <tshtatland at mgh.harvard.edu>
Date: February 29, 2008 10:37:28 AM PST
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] multtest MTP raw P values = 0

Dear all,

I used MTP from the multtest package to compute bootstrap based raw
and adjusted t-test P values (group 1 = 7 samples, group 2 = 3
samples). Why are some raw P values in the output exactly zero?

I could not change this by using scientific notation, with varying
number of significant digits. In the example below, 10 digits are
used; increasing this does not make a difference. I expected all P
values > 0, with the minimum P value determined by the number of
bootstrap iterations, here 1000. Hence the raw P value minimum should
be 1e-03, as indeed the lowest *positive* P values are.

Another question is: why some raw P values that are *equal* correspond
to adjusted P values that are *different*?

For example, raw P = 1e-03 corresponds to adjusted P = 8.4e-02,
8.6e-02, etc.

I could not find the answer in the MTP help page, its vignettes, FAQ,
etc. A similar, but not identical, problem was reported on this
mailing list before, without a solution:

https://stat.ethz.ch/pipermail/bioconductor/2007-December/020492.html

Thank you for your help.

Best regards,

Timur
--
Timur Shtatland, PhD
Center for Molecular Imaging Research
Massachusetts General Hospital
149 13th Street, Room 5408
Charlestown, MA 02129
tshtatland at mgh dot harvard dot edu

############################################################

B <- 1000
seed <- 99
TTBoot <- MTP(X=esetGcrmaExprsFiltered, Y=TT, test =
"t.twosamp.unequalvar", alternative = "two.sided", typeone="fdr",
method="ss.maxT", fdr.method="conservative", B=B, seed=seed)

running bootstrap...
iteration = 100 200 300 400 500 600 700 800 900 1000

print(TTBoot)
      Multiple Testing Procedure

Object of class:  MTP
sample size = 10
number of hypotheses = 5413

test statistics = t.twosamp.unequalvar
type I error rate = fdr
nominal level alpha = 0.05
multiple testing procedure = ss.maxT

Call: MTP(X = esetGcrmaExprsFiltered, Y = TT, test =
"t.twosamp.unequalvar",
    alternative = "two.sided", typeone = "fdr", fdr.method =
"conservative",
    B = B, method = "ss.maxT", seed = seed)

Slots:
            Class    Mode  Length Dimension
statistic numeric numeric    5413
estimate  numeric numeric    5413
sampsize  integer numeric       1
rawp      numeric numeric    5413
adjp      numeric numeric    5413
conf.reg    array logical       0     0,0,0
cutoff     matrix logical       0       0,0
reject     matrix logical    5413    5413,1
nulldist   matrix numeric 5413000 5413,1000
call         call    call      10
seed      integer numeric       1
...

MTPRes <- data.frame(PRaw=TTBoot at rawp, PAdj=TTBoot at adjp)

print(format(subset(MTPRes[order(MTPRes$PRaw, MTPRes$PAdj), ],
PAdj < 0.1), digits=10, scientific=TRUE))
             PRaw            PAdj
202508_s_at 0e+00 0.000000000e+00
203130_s_at 0e+00 0.000000000e+00
206780_at   0e+00 0.000000000e+00
218820_at   0e+00 0.000000000e+00
205825_at   0e+00 8.000000000e-03
203000_at   0e+00 1.400000000e-02
204465_s_at 0e+00 1.400000000e-02
206282_at   0e+00 1.400000000e-02
206502_s_at 0e+00 1.400000000e-02
208399_s_at 0e+00 1.400000000e-02
209598_at   0e+00 1.400000000e-02
209755_at   0e+00 1.400000000e-02
212805_at   0e+00 1.400000000e-02
203001_s_at 0e+00 2.400000000e-02
204811_s_at 0e+00 2.400000000e-02
205174_s_at 0e+00 2.400000000e-02
205646_s_at 0e+00 2.400000000e-02
206051_at   0e+00 2.400000000e-02
212624_s_at 0e+00 2.800000000e-02
204035_at   0e+00 3.000000000e-02
221261_x_at 0e+00 3.200000000e-02
206915_at   0e+00 3.400000000e-02
206104_at   0e+00 4.000000000e-02
210036_s_at 0e+00 5.000000000e-02
218380_at   0e+00 5.000000000e-02
213135_at   0e+00 6.200000000e-02
204870_s_at 0e+00 6.400000000e-02
218952_at   0e+00 6.400000000e-02
205120_s_at 0e+00 6.600000000e-02
211597_s_at 0e+00 6.666666667e-02
205348_s_at 0e+00 6.800000000e-02
212190_at   0e+00 7.000000000e-02
213186_at   0e+00 7.000000000e-02
203485_at   0e+00 7.200000000e-02
203572_s_at 0e+00 8.400000000e-02
209583_s_at 0e+00 8.400000000e-02
212895_s_at 0e+00 8.400000000e-02
206001_at   0e+00 8.600000000e-02
206135_at   0e+00 8.600000000e-02
212843_at   0e+00 8.600000000e-02
204059_s_at 0e+00 8.800000000e-02
210826_x_at 0e+00 9.000000000e-02
213438_at   0e+00 9.000000000e-02
218675_at   0e+00 9.000000000e-02
201116_s_at 0e+00 9.200000000e-02
203272_s_at 0e+00 9.600000000e-02
204793_at   0e+00 9.800000000e-02
204720_s_at 1e-03 8.400000000e-02
221207_s_at 1e-03 8.400000000e-02
204540_at   1e-03 8.600000000e-02
209992_at   1e-03 8.888888889e-02
210246_s_at 1e-03 9.000000000e-02

print(dim(esetGcrmaFiltered))
Features  Samples
    5413       10

sessionInfo()
R version 2.6.0 (2007-10-03)
powerpc-apple-darwin8.10.1

locale:
en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   tools     stats     graphics  grDevices utils
datasets  methods   base

other attached packages:
 [1] multtest_1.18.0      affydata_1.11.3      affyQCReport_1.16.0
geneplotter_1.16.0   lattice_0.16-5
 [6] annotate_1.16.1      AnnotationDbi_1.0.6  RSQLite_0.6-4
DBI_0.2-4            RColorBrewer_1.0-2
[11] affyPLM_1.14.0       xtable_1.5-2         simpleaffy_2.14.05
gcrma_2.10.0         matchprobes_1.10.0
[16] genefilter_1.16.0    survival_2.32        affy_1.16.0
preprocessCore_1.0.0 affyio_1.6.1
[21] Biobase_1.16.1

loaded via a namespace (and not attached):
[1] KernSmooth_2.22-21 grid_2.6.0

version
               _
platform       powerpc-apple-darwin8.10.1
arch           powerpc
os             darwin8.10.1
system         powerpc, darwin8.10.1
status
major          2
minor          6.0
year           2007
month          10
day            03
svn rev        43063
language       R
version.string R version 2.6.0 (2007-10-03)

# platform: PowerPC G4, Mac OS 10.4.11.

The information transmitted in this electronic communica...{{dropped:16}}