[BioC] DMR identification: minfi speed, memory problems

Mon Feb 24 20:37:46 CET 2014

Hi Allegra,

On 2/24/2014 1:41 PM, Allegra Petti [guest] wrote:
> Dear All,
>
> I am trying to identify Differentially Methylated Regions (DMRs) using a tab-delimited file of beta values obtained with Illumina's 450k methylation arrays. I am using minfi and its implementation of bumphunter (development version 1.9.11).
>
> However, I run out of memory when I use a mere 100 permutations, and have tried different amounts of memory to no avail. Of further concern is the fact that the bumphunter reference (Jaffe, et al (2011) International J. of Epidemiology: 41:200-209) states that each permutation takes two hours (see p. 205). At that rate, a standard run with 1000 permutations would take over 83 days.
>
> My questions are as follows (code and output included below):
> 1. Is there a better way to run minfi/bumphunter, or specific resources I should allocate? (I am currently using parallel processing with four cores, and have tried various amounts of memory.)
> 2. Does minfi/bumphunter include a non-permutation-based method for estimating statistical significance? Jaffe et al state that they implemented a second, faster, p-value calculation following the method of Effron and Tibshirani (see p. 205). Does anyone know if this method is implemented in minfi?
> 3. Is there a better method for identifying DMRs? (I am also experimenting with methyAnalysis, which has nice visualization options, but (a) is poorly documented and (b) surprisingly found no DMRs in my data set using default settings.)
>
> Many thanks in advance for any insight and/or suggestions!
>
> Sincerely,
> Allegra
> _________________________________________________________________
> Code:
>
> # This R script reads in a tab-delimited matrix of Beta values (with a header line)
> # and finds differentially-methylated regions using minfi
>
> # sample command: bsub -oo beta2dmr.140224.out -e beta2dmr.140224.err -q long -M 16000000 -R 'select[type==LINUX64 && mem>16000] span[hosts=1] rusage[mem=16000]' -n 4 -J beta2dmr "~/gsc/R3.0.2/bin/Rscript beta2dmr_140220.r"
>
> library(minfi); # currently requires development version 1.9.11
> library(IlluminaHumanMethylation450kanno.ilmn12.hg19); # Note: had to be installed manually
> library(foreach); # used for parallel processing
> library(doRNG); # is this actually used??
> library(doParallel); # "backend" for foreach package
> registerDoParallel(cores = 4);
>
> #betas.frame <- read.table(file="testbetas.txt",sep="\t",header=TRUE,na.strings=c(NA),row.names=1,quote="");
> betas.frame <- read.table(file="betas_140220.txt",sep="\t",header=TRUE,na.strings=c(NA),row.names=1,quote="");
> betas <- as.matrix(betas.frame); # convert data frame to matrix
> data.rs <- RatioSet(Beta = betas,annotation=c(array= "IlluminaHumanMethylation450k",  annotation = "ilmn12.hg19")); # create RatioSet
> data.grs <- mapToGenome(data.rs); # create GenomicRatioSet
> sampleNames(data.rs); # display the sample names (ie column headings)
> targets <- read.450k.sheet('/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis',recursive=FALSE); # read Sample Sheet
> samples <- sampleNames(data.rs); # get sample names (ie column headers) from input file of beta values
> j <- match(samples,targets$Sample); # get row indices from sample sheet that correspond to sample names (column headers) in beta value input file
> pheno <- targets$Sample_Group[j]; # extract corresponding phenotype (e.g. "tumor" or "normal") from Sample_Group column of Sample Sheet
> designMatrix <- model.matrix(~ pheno); # specify design matrix for regression
> dmr <- bumphunter(data.grs, design = designMatrix, cutoff = 0.05, B=100); # run bumphunter with B permutations

That's your problem ^^^^^^^^^^^^^^^^^^^^

You can interpret the cutoff argument this way; 'Select the bumps that 
are larger than the Xth quantile of all bumps in my data set'. By using 
0.05, you are selecting almost all of the bumps in your data, which is 
not what you want to do.

Instead, you want to use a cutoff of 0.95 or 0.99, which is 'Select the 
bumps that are larger than 95% or 99% of all the bumps in my data'. Or 
to put it another way, you want to select the bumps that are big enough 
that they are likely to represent differential methylation, and ignore 
all the little bumps that likely arise by chance alone.

Best,

Jim

>                                                                                                              
> save(dmr, file="dmr.minfi.100_140222");
>
> __________________________________________________________________
> Standard Output:
>
>   [1] "TCGA.AB.2940.03A.01D.0742.05" "TCGA.AB.2944.03A.01D.0743.05"
>   [3] "TCGA.AB.2866.03A.01D.0741.05" "TCGA.AB.2871.03A.01D.0742.05"
>   [5] "TCGA.AB.2934.03A.01D.0743.05" "TCGA.AB.2807.03A.01D.0741.05"
>   [7] "TCGA.AB.2809.03A.01D.0741.05" "TCGA.AB.2984.03A.01D.0742.05"
>   [9] "TCGA.AB.2921.03A.01D.0743.05" "TCGA.AB.2952.03A.01D.0742.05"
> [11] "TCGA.AB.2829.03A.01D.0741.05" "TCGA.AB.2901.03A.01D.0742.05"
> [13] "TCGA.AB.2827.03A.01D.0741.05" "TCGA.AB.2884.03A.01D.0742.05"
> [15] "TCGA.AB.3011.03A.01D.0742.05" "TCGA.AB.2878.03A.01D.0742.05"
> [17] "TCGA.AB.2933.03A.01D.0742.05" "TCGA.AB.2843.03A.01D.0741.05"
> [19] "TCGA.AB.2814.03A.01D.0741.05" "TCGA.AB.2856.03A.01D.0741.05"
> [21] "TCGA.AB.2855.03A.01D.0741.05" "TCGA.AB.2919.03A.01D.0743.05"
> [23] "TCGA.AB.2967.03A.01D.0742.05" "TCGA.AB.2817.03A.01D.0742.05"
> [25] "TCGA.AB.2887.03A.01D.0742.05" "TCGA.AB.2894.03A.01D.0742.05"
> [27] "TCGA.AB.2895.03A.01D.0742.05" "TCGA.AB.2947.03A.01D.0743.05"
> [29] "TCGA.AB.2930.03A.01D.0743.05" "TCGA.AB.2896.03A.01D.0742.05"
> [31] "TCGA.AB.2983.03A.01D.0742.05" "TCGA.AB.2945.03A.01D.0741.05"
> [33] "TCGA.AB.2918.03A.01D.0743.05" "TCGA.AB.2928.03A.01D.0743.05"
> [35] "TCGA.AB.2979.03A.01D.0742.05" "TCGA.AB.2969.03A.01D.0741.05"
> [37] "TCGA.AB.2891.03A.01D.0742.05" "TCGA.AB.2971.03A.01D.0741.05"
> [39] "TCGA.AB.2964.03A.01D.0741.05" "TCGA.AB.2835.03A.01D.0741.05"
> [41] "TCGA.AB.3002.03A.01D.0742.05" "TCGA.AB.2853.03A.01D.0741.05"
> [43] "TCGA.AB.2836.03A.01D.0741.05" "TCGA.AB.2909.03A.01D.0743.05"
> [45] "TCGA.AB.2932.03A.01D.0743.05" "TCGA.AB.2926.03A.01D.0742.05"
> [47] "TCGA.AB.2826.03A.01D.0741.05" "TCGA.AB.2911.03A.01D.0741.05"
> [49] "TCGA.AB.2920.03A.01D.0742.05" "TCGA.AB.2917.03A.01D.0741.05"
> [51] "TCGA.AB.2869.03A.01D.0742.05" "TCGA.AB.2873.03A.01D.0742.05"
> [53] "TCGA.AB.2879.03A.01D.0742.05" "TCGA.AB.2831.03A.01D.0741.05"
> [55] "TCGA.AB.2816.03A.01D.0742.05" "TCGA.AB.2990.03A.01D.0742.05"
> [read.450k.sheet] Found the following CSV files:
> [1] "/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv"
> [2] "/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv"
>
> ------------------------------------------------------------
> Sender: LSF System <lsfadmin at blade14-4-11.gsc.wustl.edu>
> Subject: Job 5329542: <beta2dmr> Exited
>
> Job <beta2dmr> was submitted from host <linus2091.gsc.wustl.edu> by user <apetti> in cluster <lsfcluster1>.
> Job was executed on host(s) <4*blade14-4-11.gsc.wustl.edu>, in queue <long>, as user <apetti> in cluster <lsfcluster1>.
> </gscuser/apetti> was used as the home directory.
>
> __________________________________________________________________
> Standard Error:
>
> Loading required package: methods
> Loading required package: BiocGenerics
> Loading required package: parallel
>
> Attaching package: â€˜BiocGenericsâ€™
>
> The following objects are masked from â€˜package:parallelâ€™:
>
>      clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
>      clusterExport, clusterMap, parApply, parCapply, parLapply,
>      parLapplyLB, parRapply, parSapply, parSapplyLB
>
> The following object is masked from â€˜package:statsâ€™:
>
>      xtabs
>
> The following objects are masked from â€˜package:baseâ€™:
>
>      Filter, Find, Map, Position, Reduce, anyDuplicated, append,
>      as.data.frame, as.vector, cbind, colnames, duplicated, eval, evalq,
>      get, intersect, is.unsorted, lapply, mapply, match, mget, order,
>      paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rep.int,
>      rownames, sapply, setdiff, sort, table, tapply, union, unique,
>      unlist
>
> Loading required package: Biobase
> Welcome to Bioconductor
>
>      Vignettes contain introductory material; view with
>      'browseVignettes()'. To cite Bioconductor, see
>      'citation("Biobase")', and for packages 'citation("pkgname")'.
>
> Loading required package: lattice
> Loading required package: GenomicRanges
> Loading required package: IRanges
> Loading required package: XVector
> Loading required package: Biostrings
> Loading required package: bumphunter
> Loading required package: foreach
> Loading required package: iterators
> Loading required package: locfit
> locfit 1.5-9.1   2013-03-22
>
> Attaching package: â€˜locfitâ€™
>
> The following objects are masked from â€˜package:GenomicRangesâ€™:
>
>      left, right
>
> Loading required package: rngtools
> Loading required package: pkgmaker
> Loading required package: registry
>
> Attaching package: â€˜pkgmakerâ€™
>
> The following object is masked from â€˜package:IRangesâ€™:
>
>      new2
>
> Warning messages:
> 1: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
>    Could not infer array name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv
> 2: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
>    Could not infer slide name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv
> 3: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
>    Could not infer array name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv
> 4: In FUN(c("/gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/IdatSampleSheet.csv",  :
>    Could not infer slide name for file: /gscmnt/gc3016/info/medseq/apetti/AMLrefractory/dmr.analysis/SampleSheet140219.csv
> [bumphunterEngine] Parallelizing using 4 workers/cores (backend: doParallel, version: 1.0.6).
> [bumphunterEngine] Computing coefficients.
> [bumphunterEngine] Performing 100 permutations.
> [bumphunterEngine] Computing marginal permutation p-values.
> [bumphunterEngine] cutoff: 0.05
> [bumphunterEngine] Finding regions.
> [bumphunterEngine] Found 233469 bumps.
> [bumphunterEngine] Computing regions for each permutation.
>
> Warning message:
> In regionFinder(x = beta, chr = chr, pos = pos, cluster = cluster,  :
>    NAs found and removed. ind changed.
> Execution halted
>
>
>   -- output of sessionInfo():
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.utf8        LC_COLLATE=C
>   [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>   [7] LC_PAPER=en_US.utf8       LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] doParallel_1.0.6
>   [2] doRNG_1.5.5
>   [3] rngtools_1.2.3
>   [4] pkgmaker_0.17.4
>   [5] registry_0.2
>   [6] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
>   [7] minfi_1.9.11
>   [8] bumphunter_1.2.0
>   [9] locfit_1.5-9.1
> [10] iterators_1.0.6
> [11] foreach_1.4.1
> [12] Biostrings_2.30.1
> [13] GenomicRanges_1.14.4
> [14] XVector_0.2.0
> [15] IRanges_1.20.6
> [16] lattice_0.20-24
> [17] Biobase_2.22.0
> [18] BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.24.0  DBI_0.2-7             MASS_7.3-29
>   [4] R.methodsS3_1.6.1     RColorBrewer_1.0-5    RSQLite_0.11.4
>   [7] XML_3.98-1.1          annotate_1.40.0       beanplot_1.1
> [10] codetools_0.2-8       digest_0.6.4          genefilter_1.44.0
> [13] grid_3.0.2            illuminaio_0.2.0      itertools_0.1-1
> [16] limma_3.18.4          matrixStats_0.8.14    mclust_4.2
> [19] multtest_2.18.0       nlme_3.1-113          nor1mix_1.1-4
> [22] preprocessCore_1.24.0 reshape_0.8.4         siggenes_1.36.0
> [25] splines_3.0.2         stats4_3.0.2          stringr_0.6.2
> [28] survival_2.37-7       tools_3.0.2           xtable_1.7-1
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099