[BioC] clustering million points : transcript start and end sites

Thu Apr 5 17:02:13 CEST 2012

Hey Guys

I have about a million points per chromosome that are basically
putative start and end regions of transcripts. My goal is to cluster
them with an unsupervised learning algorithm such as DBSCAN (I am open
to suggestions) that can handle big datasets, in O(n log n) time if
possible, treating each start/end pair as (X, Y) coordinates.

Any ideas? I need to make sure I don't run out of memory during the
distance calculation.

What I would like to get out is a set of clusters, with the number of
points in each, which will give me a putative set of transcripts
(possibly fragmented) from these points.
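
To make the kind of output I am after concrete, here is a rough sketch
in plain Python. It is not DBSCAN itself but a simpler grid-based
stand-in: points are binned into square cells of side eps, and occupied
cells that touch are merged with union-find. No pairwise distance
matrix is ever built, so memory stays proportional to the number of
points. The function names, the eps parameter, and the sample
coordinates are all made up for illustration.

```python
from collections import Counter, defaultdict


def grid_cluster(points, eps):
    """Label (start, end) pairs by merging occupied grid cells that touch.

    points: iterable of (start, end) integer pairs (hypothetical input).
    eps:    cell side length in bases (hypothetical tuning parameter).
    Returns a list with one cluster label per input point.
    """
    points = list(points)

    # Bin every point into a cell; this index is the only extra storage.
    cells = defaultdict(list)
    for i, (start, end) in enumerate(points):
        cells[(start // eps, end // eps)].append(i)

    # Union-find over occupied cells.
    parent = {cell: cell for cell in cells}

    def find(cell):
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path halving
            cell = parent[cell]
        return cell

    # Merge each occupied cell with its occupied neighbours (8-neighbourhood).
    for (cx, cy) in cells:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbour = (cx + dx, cy + dy)
                if neighbour in cells:
                    root_a, root_b = find((cx, cy)), find(neighbour)
                    if root_a != root_b:
                        parent[root_a] = root_b

    # Turn cell roots into consecutive integer labels.
    root_to_label = {}
    labels = [0] * len(points)
    for cell, indices in cells.items():
        label = root_to_label.setdefault(find(cell), len(root_to_label))
        for i in indices:
            labels[i] = label
    return labels


def cluster_sizes(labels):
    """Count how many points fall in each cluster."""
    return Counter(labels)


# Made-up example: two groups of nearby (start, end) pairs.
example = [(100, 500), (102, 503), (105, 498), (10000, 12000), (10003, 12001)]
example_labels = grid_cluster(example, eps=10)
example_sizes = cluster_sizes(example_labels)
```

This is coarser than DBSCAN: there is no minimum-points threshold, so
isolated points become their own clusters instead of noise, and two
points in touching cells are linked even if they are somewhat more
than eps apart. It is only meant to show the shape of the result,
a cluster label per point plus a count per cluster.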

Thanks!
-Abhi