[BioC] Reproducibility of DNAcopy segmentation

Wed Jan 13 20:00:17 CET 2010

On Wed, Jan 13, 2010 at 1:42 PM, Ross Patterson <rossjp at gmail.com> wrote:
> While performing some copy number analysis on data segmented with the
> DNAcopy package, I have noticed some variations in the output data, and was
> hoping someone here could help shed some light on that.  Specifically, while
> running the DNAcopy segmentation on the exact same input data multiple
> times, I have noticed that the resultant segment data output sometimes
> contains "extra" segments, caused by the discovery of "extra" breakpoints.
> In fact, the resultant output data is always different.  Digging into the
> source code a little bit, I saw what appeared to be calls to some random
> number generating functions, although not being very familiar with Fortran
> code I could not tell how or why these numbers were being used, or even if
> that is the source of segmentation discrepancies.  I know that in the last
> few years there have been some changes to the segmentation algorithm to
> allow it to run in near linear time.  Did that require introducing
> non-deterministic behavior?  Is there a way to force the segmentation
> algorithm to run deterministically, such that the output data can be
> identically reproduced every time the segmentation is run?

Hi, Ross.

DNAcopy determines significance from an empirical reference
distribution, which is presumably where the random numbers you saw
come in.  The help for segment() gives some details.  Perhaps the
authors can comment on whether there is a way to make the
segmentation run deterministically.
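
To illustrate why an empirical significance test can give different
answers on identical input, here is a minimal sketch in Python.  It is
not DNAcopy's algorithm: perm_pvalue, the split index, and the
log-ratio values are all hypothetical, chosen only to show that a
permutation-based p-value depends on the random draws, and that fixing
the generator's seed makes it repeatable.  In R the analogous step
would presumably be calling set.seed() before segment(); that assumes
the package draws from R's own random number generator, which nothing
in this thread confirms.

```python
import random
from statistics import mean


def perm_pvalue(values, split, nperm=1000, rng=None):
    """Toy permutation p-value for a mean shift at index `split`."""
    if rng is None:
        rng = random.Random()  # unseeded, so repeated runs can differ

    def stat(xs):
        # Absolute difference between the means on either side of the split.
        return abs(mean(xs[:split]) - mean(xs[split:]))

    observed = stat(values)
    shuffled = list(values)
    hits = 0
    for _ in range(nperm):
        rng.shuffle(shuffled)  # one random relabelling of the markers
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (nperm + 1)


# Hypothetical log-ratio values with a weak shift after the sixth marker.
log_ratios = [0.02, -0.05, 0.10, -0.08, 0.04, 0.01,
              0.21, 0.12, 0.30, 0.05, 0.18, 0.09]

# Same seed, same permutations, same p-value.
p_first = perm_pvalue(log_ratios, split=6, rng=random.Random(42))
p_second = perm_pvalue(log_ratios, split=6, rng=random.Random(42))

# No seed: the estimate can fluctuate from run to run.
p_unseeded = perm_pvalue(log_ratios, split=6)
```

When a p-value like this sits close to the significance cutoff, the
run-to-run fluctuation can be enough to accept a breakpoint in one run
and reject it in the next, which would be consistent with the "extra"
segments described above.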

Sean