[BioC] Excessive memory requirements of PING or bug?

Martin Morgan mtmorgan at fhcrc.org
Tue May 22 14:48:39 CEST 2012


On 05/21/2012 10:09 AM, Xuekui Zhang wrote:
> Hi Lars,
>
>    Thanks for your feedback!
>    Yes, cutting the chromosome before segmentation is not preferred, since the cut points might fall where the peaks/nucleosomes are.
>    To avoid this problem, you could run a sliding window (e.g., window size 300 bp, step size 10 bp) along a chromosome, count the reads in each window, and cut at the valleys of the resulting read-count curve.
>    In the next version of PING, we could integrate cutting into the segmentation step to avoid cutting a chromosome in the wrong place.
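
The sliding-window idea quoted above can be sketched in base R. This is only
an illustration on simulated read start positions, not PING code: the toy
chromosome length, the read densities, and the names pos, counts and valley
are all assumptions made for the example.

```r
set.seed(1)

# Simulated read start positions on a hypothetical 100 kb chromosome:
# two read-dense regions separated by a sparse gap (assumed toy data).
chr_len <- 100000
pos <- c(sample(1:40000, 4000, replace = TRUE),
         sample(60001:100000, 4000, replace = TRUE),
         sample(40001:60000, 20, replace = TRUE))

window <- 300   # window size in bp, as suggested in the quoted message
step   <- 10    # step size in bp

# Reads per sliding window, computed from cumulative per-base counts.
per_base <- tabulate(pos, nbins = chr_len)
cum      <- c(0, cumsum(per_base))
starts   <- seq(1, chr_len - window + 1, by = step)
counts   <- cum[starts + window] - cum[starts]

# Candidate cut point: centre of the window with the fewest reads.
# which.min() returns the first minimum if there are ties.
valley <- starts[which.min(counts)] + window %/% 2

# Split the reads at the candidate cut point.
chunks <- split(pos, pos > valley)
```

On real data one would probably look for several valleys per chromosome and
require a minimum distance between cuts; the single global minimum used here
is just the simplest case.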

The original problem sounded like an integer overflow in the underlying 
C code in PING. Can that be fixed? If there is a simple reproducible 
example (e.g., using R's random number generators to simulate large 
enough data) I can perhaps help to identify this.
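
As a rough plausibility check of the overflow reading, the reported size is
almost exactly 2^36 GB. The sketch below reproduces a number of that form
under assumptions that have not been checked against the PING source: that a
negative 32-bit count is converted to an unsigned 64-bit size and multiplied
by a 4-byte element size. The value of n_elem is hypothetical and chosen only
for illustration.

```r
reported_gb <- 68719476735.9   # value from the error message in this thread
log2(reported_gb)              # very close to 36, i.e. about 2^36 GB

# Hypothetical scenario (an assumption, not taken from the PING source):
n_elem  <- -26e6               # hypothetical negative count after overflow
as_size <- 2^64 + n_elem       # the same value after wrap-around to unsigned 64-bit
bytes   <- as_size * 4         # assumed 4 bytes per element (int-sized)
gb      <- bytes / 1024^3
sprintf("%.1f", gb)            # "68719476735.9"
```

Any negative count in the tens of millions gives the same printed value, so
this does not pin down the actual count, only that the message is consistent
with a wrapped-around size rather than a genuine memory requirement.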

Martin

> Xuekui
>
> On May 20, 2012, at 3:25 AM, Lars Hennig wrote:
>
>> Yes, I tried. Restricting the analysis to single chromosomes of ~20 Mb did not help, but going to much smaller subchromosomal domains did eventually solve the problem. Still, slicing the genome into many small sections is not a preferred option.
>>
>> Lars
>>
>> -----Original Message-----
>> From: Dan Tenenbaum [mailto:dtenenba at fhcrc.org]
>> Sent: Sunday, May 20, 2012 12:30 AM
>> To: Xuekui Zhang
>> Cc: Raphael Gottardo; Lars Hennig; Renan Sauteraud; bioconductor at r-project.org
>> Subject: Re: [BioC] Excessive memory requirements of PING or bug?
>>
>> [cc'ing Bioconductor list so others can benefit...]
>>
>> On Sat, May 19, 2012 at 3:28 PM, Xuekui Zhang <ubcxzhang at gmail.com> wrote:
>>> Hi Lars,
>>>
>>>    Did you try to analyze each chromosome separately?
>>>    Please let me know if that still does not solve the problem.
>>>
>>> Xuekui
>>>
>>> On May 19, 2012, at 5:35 PM, Raphael Gottardo wrote:
>>>
>>> Hi Lars,
>>>
>>> Xuekui, cc'ed here, will look into it.
>>>
>>> Raphael
>>>
>>> --
>>> Raphael Gottardo, Associate Member
>>> http://www.rglab.org
>>> Fred Hutchinson Cancer Research Center Vaccine and Infectious Disease
>>> Division Public Health Sciences Division
>>>
>>>
>>>
>>> On May 18, 2012, at 11:56 AM, Dan Tenenbaum wrote:
>>>
>>> I'm cc'ing one of the PING maintainers who can perhaps shed more light
>>> on this.
>>> Dan
>>>
>>>
>>> On Thu, May 17, 2012 at 2:55 PM, Lars Hennig <Lars.Hennig at slu.se> wrote:
>>>
>>> Dear PING maintainers,
>>>
>>>
>>> Running PING with the example from the vignette works fine, but
>>> segmentReads causes a "cannot allocate memory block of size
>>> 68719476735.9 Gb" error when using my own ChIP-seq sample data (16 million
>>> paired-end reads mapped with bowtie). This is an Arabidopsis sample (genome size = 130 Mb).
>>>
>>> A sample of 100,000 of our own reads runs smoothly, but 2.5 million reads
>>> crash with a similarly large memory request as mentioned above.
>>> Including snowfall or not has no effect.
>>>
>>>
>>> Is there a way to trick PING into processing more than a few hundred thousand
>>> reads with "normal" memory (I have 48 GB available)? If PING really
>>> has a very high memory need, this could be mentioned in the documentation.
>>>
>>>
>>> Thank you very much,
>>>
>>>
>>> Lars
>>>
>>>
>>> Script:
>>>
>>> library(ShortRead)
>>>
>>> reads <- readAligned("reads_sorted.bam", type="BAM")
>>> reads <- reads[!is.na(position(reads))]
>>> reads <- reads[chromosome(reads) %in% c("Chr4")]
>>>
>>> #reads <- reads[1:100000]
>>>
>>> library(PING)
>>> library(snowfall)
>>> sfInit(parallel=TRUE, cpus=4)
>>> sfLibrary(PING)
>>>
>>> reads <- as(reads, "RangesList")
>>> reads <- as(reads, "RangedData")
>>> reads <- as(reads, "GenomeData")
>>>
>>> seg <- segmentReads(reads, minReads=5, maxLregion=1200, minLregion=80,
>>>                     jitter=T)
>>>
>>>
>>>
>>>
>>>
>>> traceback()
>>> 2: .Call("segReadsAll", data, dataC, start, end, as.integer(jitter),
>>>        paraSW, as.integer(maxStep), as.integer(minLregion), PACKAGE = "PING")
>>> 1: segmentReads(reads_gd, minReads = 5, maxLregion = 1200, minLregion = 80,
>>>        jitter = T)
>>>
>>>
>>>
>>> sessionInfo()
>>> R version 2.15.0 (2012-03-30)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C                 LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] snowfall_1.84       snow_0.3-9          PING_1.0.0
>>> [4] chipseq_1.6.0       ShortRead_1.14.3    latticeExtra_0.6-19
>>> [7] RColorBrewer_1.0-5  Rsamtools_1.8.4     lattice_0.20-6
>>> [10] BSgenome_1.24.0     Biostrings_2.24.1   GenomicRanges_1.8.6
>>> [13] IRanges_1.14.3      BiocGenerics_0.2.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.16.0      biomaRt_2.12.0      bitops_1.0-4.1
>>> [4] GenomeGraphs_1.16.0 grid_2.15.0         hwriter_1.3
>>> [7] RCurl_1.91-1        stats4_2.15.0       tools_2.15.0
>>> [10] XML_3.9-4           zlibbioc_1.2.0
>>>
>>>
>>>
>>> Dr. Lars Hennig
>>>
>>> Professor of Genetics
>>>
>>> Swedish University of Agricultural Sciences
>>>
>>> Uppsala BioCenter
>>>
>>> Department of Plant Biology and Forest Genetics
>>>
>>> PO-Box 7080
>>>
>>> SE-75007 Uppsala, Sweden
>>>
>>> Lars.Hennig at vbsg.slu.se
>>>
>>> Tel. +46 18 67 3326
>>>
>>> Fax  +46 18 67 3389
>>>
>>>
>>> Visiting address:
>>>
>>> Uppsala BioCenter
>>>
>>> Almas Allé 5
>>>
>>> SE-75651 Uppsala, Sweden
>>>
>>> Room A-489
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>
>>> Bioconductor mailing list
>>>
>>> Bioconductor at r-project.org
>>>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


