[Rd] Are r2dtable and C_r2dtable behaving correctly?

Thu Aug 24 16:42:36 CEST 2017

Hello,

While doing some enrichment tests using chisq.test() with simulated
p-values, I noticed some strange behaviour. The computed p-value was
extremely small, so I decided to dig a little deeper and debug
chisq.test(). I noticed then that the simulated statistics returned by the
following call

tmp <- .Call(C_chisq_sim, sr, sc, B, E)

were all the same, very small numbers. This, at first, seemed strange to
me. So I decided to do some simulations myself, and started playing around
with the r2dtable() function. Problem is, using my row and column
marginals, r2dtable() always returns the same matrix. Let's provide a
minimal example:

rr <- c(209410, 276167)
cc <- c(25000, 460577)
ms <- r2dtable(3, rr, cc)

I have tested this code in two machines and it always returned the same
list of length three containing the same matrix three times. The repeated
matrix is the following:

[[1]]
      [,1]   [,2]
[1,] 10782 198628
[2,] 14218 261949

[[2]]
      [,1]   [,2]
[1,] 10782 198628
[2,] 14218 261949

[[3]]
      [,1]   [,2]
[1,] 10782 198628
[2,] 14218 261949

I also coded a small function returning the value of the chi-squared
statistic using the previous fixed marginals and taking the value at [1, 1]
as input. This helped me to plot a curve and notice that the repeating
matrix was the one that yielded the minimum chi-squared statistic.

This behaviour persists if I use greater marginals (summing 100000 to every
element of the marginal for example),

> rr <- c(309410, 376167)
> cc <- c(125000, 560577)
> r2dtable(3, rr, cc)
[[1]]
      [,1]   [,2]
[1,] 56414 252996
[2,] 68586 307581

[[2]]
      [,1]   [,2]
[1,] 56414 252996
[2,] 68586 307581

[[3]]
      [,1]   [,2]
[1,] 56414 252996
[2,] 68586 307581

 but not if we use smaller ones:

> rr <- c(9410, 76167)
> cc <- c(25000, 60577)
> r2dtable(3, rr, cc)
[[1]]
      [,1]  [,2]
[1,]  2721  6689
[2,] 22279 53888

[[2]]
      [,1]  [,2]
[1,]  2834  6576
[2,] 22166 54001

[[3]]
      [,1]  [,2]
[1,]  2778  6632
[2,] 22222 53945

I have looked inside the C code for the C_r2dtable() and rcont2()
functions, but I cannot do much more than guess where this behaviour could
originate, so I would like to ask for help from anybody more experienced in
the R implementation. Guess there is some kind of inflection point
depending on the total sample size of the table, or maybe the generation
algorithm tends to output matrices concentrated around the minimum.

This is the output from my sessionInfo()

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
LC_TIME=es_ES.UTF-8
 [4] LC_COLLATE=es_ES.UTF-8     LC_MONETARY=es_ES.UTF-8
 LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C

[10] LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8
LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
 methods   base

other attached packages:
 [1] profvis_0.3.3                           bindrcpp_0.2

 [3] FDb.InfiniumMethylation.hg19_2.2.0      org.Hs.eg.db_3.4.1

 [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.28.4

 [7] AnnotationDbi_1.38.2                    Biobase_2.36.2

 [9] GenomicRanges_1.28.4                    GenomeInfoDb_1.12.2

[11] IRanges_2.10.2                          S4Vectors_0.14.3

[13] BiocGenerics_0.22.0                     epian_0.1.0

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.3 purrr_0.2.3                reshape2_1.4.2

 [4] lattice_0.20-35            htmltools_0.3.6
 rtracklayer_1.36.4
 [7] blob_1.1.0                 XML_3.98-1.9               rlang_0.1.2

[10] foreign_0.8-67             glue_1.1.1                 DBI_0.7

[13] BiocParallel_1.10.1        bit64_0.9-7
 matrixStats_0.52.2
[16] GenomeInfoDbData_0.99.0    bindr_0.1                  plyr_1.8.4

[19] stringr_1.2.0              zlibbioc_1.22.0
 Biostrings_2.44.2
[22] htmlwidgets_0.9            psych_1.7.5                memoise_1.1.0

[25] biomaRt_2.32.1             broom_0.4.2                Rcpp_0.12.12

[28] DelayedArray_0.2.7         XVector_0.16.0             bit_1.1-12

[31] Rsamtools_1.28.0           mnormt_1.5-5               digest_0.6.12

[34] stringi_1.1.5              dplyr_0.7.2                grid_3.4.0

[37] tools_3.4.0                bitops_1.0-6               magrittr_1.5

[40] RCurl_1.95-4.8             tibble_1.3.3               RSQLite_2.0

[43] tidyr_0.7.0                pkgconfig_2.0.1            Matrix_1.2-9

[46] assertthat_0.2.0           R6_2.2.2
GenomicAlignments_1.12.1
[49] nlme_3.1-131               compiler_3.4.0

Any hint or help would be much appreciated. We do not use a lot the
simulated version of the chisq.test at the lab, but I would like to
understand better what is happening.

Kind regards,
Gustavo

	[[alternative HTML version deleted]]