[BioC] Clustering question

Wed Jul 11 11:14:39 CEST 2012

Hi everybody. 

Imagine the following scenario: I have a methylation ExpressionSet with 40 samples and 450K probes (Illumina 450K array). The samples are divided into two classes, and I would like to characterize families of probes according to their behavior. That is, I would like to find one set of probes that is hypermethylated with respect to the covariate that separates the classes, another set whose variability increases from one class to the other, and so on.

I have been trying some ideas around the following workflow:

1) Filtering of the data (non-specific filtering, sex-chromosome genes, ...)
2) Transformation into a lower-dimensional summary subspace. For example, with 20 beta values in one class and 20 in the other, the transformation takes each probe's 40-dimensional vector of beta values and summarizes it as a 2-dimensional vector: the first component is the difference between the medians of the two classes, and the second is the difference between their IQRs. The idea is to summarize the data and work only with the transformed variables that characterize what I am looking for.
3) Clustering in the new subspace. For now, I am using k-means as a baseline clustering method. I also plan to test a hierarchical method and perhaps a Bayesian DP-means, among others.
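
To make steps 2 and 3 concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the beta values are simulated (300 probes instead of 450K), the function names (summarize_probe, kmeans, simulate) are made up for this example, and k = 3 is an arbitrary choice. On real data a vectorized implementation would be needed for speed.

```python
import random
from statistics import median, quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def summarize_probe(betas, is_class_b):
    """Map one probe's beta values to (median difference, IQR difference)."""
    a = [x for x, flag in zip(betas, is_class_b) if not flag]
    b = [x for x, flag in zip(betas, is_class_b) if flag]
    return (median(b) - median(a), iqr(b) - iqr(a))


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on 2-D points; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct probes as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def simulate(n_per_group=100, n_per_class=20, seed=1):
    """Simulated beta values (an assumption, not real array data)."""
    rng = random.Random(seed)

    def clip(x):
        return min(1.0, max(0.0, x))

    probes = []
    for kind in ("unchanged", "hyper", "more_variable"):
        for _ in range(n_per_group):
            mean_b = 0.7 if kind == "hyper" else 0.4
            sd_b = 0.15 if kind == "more_variable" else 0.03
            a = [clip(rng.gauss(0.4, 0.03)) for _ in range(n_per_class)]
            b = [clip(rng.gauss(mean_b, sd_b)) for _ in range(n_per_class)]
            probes.append(a + b)
    return probes


# 20 samples of class A followed by 20 samples of class B.
is_class_b = [False] * 20 + [True] * 20
probes = simulate()

# Step 2: one 2-D summary per probe. Step 3: baseline k-means on the summaries.
features = [summarize_probe(betas, is_class_b) for betas in probes]
labels, centers = kmeans(features, k=3)

for j, c in enumerate(centers):
    print(f"cluster {j}: median diff = {c[0]:.3f}, IQR diff = {c[1]:.3f}, "
          f"size = {labels.count(j)}")
```

Note that the two summary variables are on different scales here, so the Euclidean distance used by k-means may weight the median difference more heavily than the IQR difference; rescaling the features before clustering is one option to consider.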

This is mainly an exploratory workflow. I want to know how these probes behave with respect to the above variables, and I am testing different ideas on my data. But I was wondering whether summarizing the beta values into the new variables is a sound approach, or whether there is some alternative (maybe model-based) for doing this kind of exploratory work. Apart from losing a lot of information along the way, am I running into problems by doing that?

Any hint or suggestion will be appreciated.

Regards,
Gus
