[BioC] question about mt.maxT output (multtest)

Les Dethlefsen dethlefs at stanford.edu
Sat Jun 4 00:43:27 CEST 2011

Hi All,

I'm a biologist, not a statistician, and a bit of a newbie with R, so please forgive me if this is a stupid question, and thanks for your help.

My data are from shotgun metagenomic sequencing of multiple samples, but it's essentially similar to microarray data.  I have a matrix of 3312 functional gene categories by 24 samples, with real-valued entries indicating the proportion of interpretable gene sequence data for each sample that is assigned to each functional category.  There are a number of ways to bin the samples into categories, based on some other analysis I'm most interested in a breakdown into two subjects and 5 temporal categories within each subject, i.e. a total of 10 categories.  Two of the categories have 4 samples each, the remaining 8 categories have 2 samples each.

When I ask mt.maxT to carry out a 2-sided F test on the 10 categories, I see something I don't understand: the adjusted p values are not a monotonic transformation of the raw p values.  They are a monotonic with respect to the test statistic...as the test statistic falls, the adjusted p values rise, as expected.  But the raw p values are bouncing around quite a bit, with the lowest raw p values of 0.0001 having a range of adjusted p values, all much larger than the minimum adjusted p value.  There are 244 raw p values lower than that which corresponds to the minimum adjusted p value.

My (perhaps mistaken) understanding was that permutations were used to generate the raw p values (which I definitely want, since I don't think the distributions of values for the different functional gene categories will follow any regular distribution or be similar to each other), and then some sort of adjustment was made to the raw values to adjust for multiple testing considerations, but the adjustment itself was not based on permutations.  I would expect the permutations to be fairly noisy in my case with only 2 samples in most of my sample categories, but wouldn't that only affect the reliability of the raw p values?  How can the adjusted p values have a different rank order than the raw p values?

I did run the same command with the samples binned only to subject, i.e. only 2 categories of 12 subject each. While the range of p values is much lower (not surprisingly) there is still the same pattern with a monotonic relationship between the test statistic and the adjusted p values, but not between the raw p values and adjusted p values.  Hence, it doesn't seem that this behavior is due to the messiness of 10 unequally sized categories for only 24 samples.

I'm including some snips of my R session below; if anyone is interested in playing with the original data there's a compressed file at

ftp://asiago.stanford.edu/multtest_issue.tgz

that has the data matrix and the class labels.

I'm running R 2.13.0 (on Mac OSX with the R.app GUI) with newly-installed BioConductor, so everything should be up to date.

Thanks for any help or insight!
Les

Les Dethlefsen
Relman Lab
Microbiology & Immunology
Stanford University

> class.label.df
Sub CpNon Cp12Non CpPIP Cp12PIP Cp12PELP SubCp12PIP
V2    0     0       0     0       0        0          0
V3    0     0       0     0       0        0          0
V4    0     1       1     1       1        1          1
V5    0     1       1     1       1        1          1
V6    0     0       0     2       2        2          2
V7    0     0       0     2       2        2          2
V8    0     0       0     2       2        3          2
V9    0     0       0     2       2        3          2
V10   0     1       2     1       3        4          3
V11   0     1       2     1       3        4          3
V12   0     0       0     3       4        5          4
V13   0     0       0     3       4        5          4
V14   1     0       0     0       0        0          5
V15   1     0       0     0       0        0          5
V16   1     1       1     1       1        1          6
V17   1     1       1     1       1        1          6
V18   1     0       0     2       2        2          7
V19   1     0       0     2       2        2          7
V20   1     0       0     2       2        3          7
V21   1     0       0     2       2        3          7
V22   1     1       2     1       3        4          8
V23   1     1       2     1       3        4          8
V24   1     0       0     3       4        5          9
V25   1     0       0     3       4        5          9
> ko.SubCp12PIP.maxT.F=mt.maxT(ko.data.df,classlabel=class.label.df\$SubCp12PIP,test="f",side="abs",nonpara="n")
b=100        b=200        b=300        b=400        b=500        b=600        b=700        b=800        b=900        b=1000
b=1100        b=1200        b=1300        b=1400        b=1500        b=1600        b=1700        b=1800        b=1900        b=2000
b=2100        b=2200        b=2300        b=2400        b=2500        b=2600        b=2700        b=2800        b=2900        b=3000
b=3100        b=3200        b=3300        b=3400        b=3500        b=3600        b=3700        b=3800        b=3900        b=4000
b=4100        b=4200        b=4300        b=4400        b=4500        b=4600        b=4700        b=4800        b=4900        b=5000
b=5100        b=5200        b=5300        b=5400        b=5500        b=5600        b=5700        b=5800        b=5900        b=6000
b=6100        b=6200        b=6300        b=6400        b=6500        b=6600        b=6700        b=6800        b=6900        b=7000
b=7100        b=7200        b=7300        b=7400        b=7500        b=7600        b=7700        b=7800        b=7900        b=8000
b=8100        b=8200        b=8300        b=8400        b=8500        b=8600        b=8700        b=8800        b=8900        b=9000
b=9100        b=9200        b=9300        b=9400        b=9500        b=9600        b=9700        b=9800        b=9900        b=10000
> ko.SubCp12PIP.maxT.F
K05394  1975 4.536904e+03 0.0278000 0.0896
K01455   648 4.746965e+02 0.0238000 0.4072
K02394  1154 2.566388e+02 0.0243000 0.6114
K06990  2301 5.749225e+01 0.0297000 0.8393
K05795  2021 5.399053e+01 0.0157000 0.8511
K06641  2223 4.468233e+01 0.0265000 0.8778
K05916  2059 3.936887e+01 0.0296000 0.9033
K08321  2678 3.068437e+01 0.0001000 0.9339
K02074  1047 2.780696e+01 0.0044000 0.9476
K07783  2616 2.692854e+01 0.0023000 0.9512
K07318  2472 2.594598e+01 0.0297000 0.9572
K08137  2639 2.563818e+01 0.0297000 0.9596
K02479  1207 2.561986e+01 0.0297000 0.9596
K02082  1052 2.484221e+01 0.0252000 0.9614
K02771  1312 2.433006e+01 0.0265000 0.9667
K03446  1580 2.314993e+01 0.0028000 0.9694
K07214  2423 2.232653e+01 0.0001000 0.9757
K03325  1529 1.881246e+01 0.0001000 0.9850
K09124  2747 1.813063e+01 0.0002000 0.9858
K03762  1790 1.802233e+01 0.0303000 0.9866
K05791  2018 1.781049e+01 0.0309000 0.9876
K03739  1776 1.685187e+01 0.0265000 0.9894

####################  SNIP  ################################

> ko.Sub.maxT.F=mt.maxT(ko.data.df,classlabel=class.label.df\$Sub,test="f",side="abs",nonpara="n")
b=100        b=200        b=300        b=400        b=500        b=600        b=700        b=800        b=900        b=1000
b=1100        b=1200        b=1300        b=1400        b=1500        b=1600        b=1700        b=1800        b=1900        b=2000
b=2100        b=2200        b=2300        b=2400        b=2500        b=2600        b=2700        b=2800        b=2900        b=3000
b=3100        b=3200        b=3300        b=3400        b=3500        b=3600        b=3700        b=3800        b=3900        b=4000
b=4100        b=4200        b=4300        b=4400        b=4500        b=4600        b=4700        b=4800        b=4900        b=5000
b=5100        b=5200        b=5300        b=5400        b=5500        b=5600        b=5700        b=5800        b=5900        b=6000
b=6100        b=6200        b=6300        b=6400        b=6500        b=6600        b=6700        b=6800        b=6900        b=7000
b=7100        b=7200        b=7300        b=7400        b=7500        b=7600        b=7700        b=7800        b=7900        b=8000
b=8100        b=8200        b=8300        b=8400        b=8500        b=8600        b=8700        b=8800        b=8900        b=9000
b=9100        b=9200        b=9300        b=9400        b=9500        b=9600        b=9700        b=9800        b=9900        b=10000
> ko.Sub.maxT.F
K07214  2423 4.388760e+01 0.0001 0.0017
K01812   867 3.582776e+01 0.0001 0.0072
K01209   554 3.493056e+01 0.0001 0.0087
K03111  1446 3.368798e+01 0.0001 0.0110
K03313  1519 3.286514e+01 0.0001 0.0129
K05878  2049 3.030097e+01 0.0001 0.0212
K01188   539 2.861773e+01 0.0001 0.0285
K07053  2349 2.625740e+01 0.0001 0.0451
K09705  2793 2.596417e+01 0.0001 0.0476
K07277  2451 2.584051e+01 0.0001 0.0492
K03543  1633 2.551410e+01 0.0003 0.0526
K10852  2989 2.493226e+01 0.0003 0.0594
K01191   541 2.465159e+01 0.0001 0.0634
K02238  1108 2.458557e+01 0.0003 0.0651
K03701  1751 2.365698e+01 0.0001 0.0812
K06998  2305 2.344949e+01 0.0001 0.0856
K12340  3131 2.284225e+01 0.0001 0.0983
K01678   779 2.280762e+01 0.0004 0.0996
K03771  1796 2.255402e+01 0.0006 0.1042
K03325  1529 2.252884e+01 0.0001 0.1047
K00647   262 2.246549e+01 0.0001 0.1061
K00282   135 2.227595e+01 0.0002 0.1099
K03657  1726 2.220074e+01 0.0001 0.1122
K01551   698 2.161330e+01 0.0001 0.1278
K05970  2072 2.160496e+01 0.0005 0.1281
K03453  1583 2.131683e+01 0.0003 0.1387
K08321  2678 2.118411e+01 0.0001 0.1428
K01811   866 2.092958e+01 0.0003 0.1512
K09687  2783 2.025709e+01 0.0001 0.1788
K07407  2494 1.909975e+01 0.0002 0.2376
K11741  3080 1.908456e+01 0.0001 0.2381

####################  SNIP  ################################

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
 en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 multtest_2.8.0 Biobase_2.12.1

loaded via a namespace (and not attached):
 MASS_7.3-13     splines_2.13.0  survival_2.36-9 tools_2.13.0