[BioC] Clustering question

Thu Jul 12 10:35:15 CEST 2012

Hi again, Tim!   

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, July 11, 2012 at 17:06, Tim Triche, Jr. wrote:

> try the 'lpc' or 'superpc' package if what you want is supervision.

As far as I can see, these are packages for supervised principal components, aren't they? Either I am misunderstanding something (quite possible; it seems I am getting dumber with age), or I have not explained clearly what I am trying to do.

The main point is that I would like to "cluster" probes, not samples (is that possible?), according to their behavior with respect to a given covariate. This covariate represents time, but on a two-class, discrete scale (e.g. newborns vs. nonagenarians).

The goal is to find "families" of probes that behave similarly with respect to the above covariate. For example, one family could be the probes whose methylation goes up with age, but with no difference in variability. That is why I started playing around with the means and variabilities of the beta values in both age groups. I thought: if I transform the beta value matrix into a p x 4 matrix (one row per probe) with columns (mean of betas in newborns, mean of betas in nonagenarians, deviation of betas in newborns, deviation of betas in nonagenarians), and then cluster in this new subspace, I could find similar patterns of behavior in the way I had in mind.
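To make the p x 4 idea concrete, here is a minimal sketch in plain Python (standard library only, chosen for illustration; the thread itself contains no code). The function name `summarize_probes`, the group labels, and the toy beta values are all made up for this example, and "deviation" is assumed to mean the sample standard deviation.

```python
from statistics import mean, stdev

def summarize_probes(betas, groups):
    """Map each probe (one row of beta values) to (mean_a, mean_b, sd_a, sd_b).

    betas  : list of rows, one row of beta values per probe
    groups : one label per sample (column); exactly two distinct labels
    Groups a and b are the two labels in sorted order.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("expected exactly two groups")
    a, b = labels
    idx_a = [j for j, g in enumerate(groups) if g == a]
    idx_b = [j for j, g in enumerate(groups) if g == b]
    summary = []
    for row in betas:
        xa = [row[j] for j in idx_a]
        xb = [row[j] for j in idx_b]
        summary.append((mean(xa), mean(xb), stdev(xa), stdev(xb)))
    return summary

# Hypothetical toy data: 3 probes x 6 samples (3 per age group).
groups = ["newborn"] * 3 + ["nonagenarian"] * 3
betas = [
    [0.10, 0.12, 0.11, 0.60, 0.62, 0.61],  # mean goes up, spread unchanged
    [0.50, 0.50, 0.50, 0.20, 0.50, 0.80],  # same mean, spread goes up
    [0.30, 0.31, 0.29, 0.30, 0.31, 0.29],  # nothing changes
]
summary = summarize_probes(betas, groups)
```

Each row of `summary` is one probe in the new four-dimensional space, so any clustering method that takes a points-by-features matrix could be run on it.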

The reason is that I am not looking for differentially methylated probes. I am instead looking for probes that behave similarly. (Well, it has just occurred to me while writing this that clustering the beta values with some kind of correlation distance might fit this scenario. What do you think?)
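As a rough illustration of the correlation-distance idea, the sketch below uses 1 minus the Pearson correlation between two probe profiles. This is one common definition, assumed here; the probe names and values are invented. Note that this distance only captures the shape of a profile across samples: two probes that differ by a constant offset are at distance 0, and a probe with constant values has no defined correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)

def correlation_distance(x, y):
    """1 - r: near 0 for similar shapes, near 2 for opposite shapes."""
    return 1.0 - pearson(x, y)

# Hypothetical probes measured on the same 6 samples.
probes = {
    "up_1":   [0.10, 0.12, 0.11, 0.60, 0.62, 0.61],
    "up_2":   [0.30, 0.32, 0.31, 0.80, 0.82, 0.81],  # up_1 shifted by 0.2
    "down_1": [0.70, 0.72, 0.71, 0.20, 0.22, 0.21],
}
names = list(probes)
dist = {(p, q): correlation_distance(probes[p], probes[q])
        for p in names for q in names}
```

In this toy example `up_1` and `up_2` end up essentially on top of each other, while `down_1` is far from both, which is the "same behavior" grouping described above. A difference in variability alone would not be picked up by this distance.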

My doubt, then, is about using these summary variables (means, medians, deviations, IQRs, differences between them, …) as components of vectors in a new space. I do not know whether that is correct.

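One practical point when summary variables become coordinates: Euclidean distance weights each column by its scale, so it may help to standardize the columns before clustering. The sketch below z-scores two hypothetical summary columns (difference of means, difference of deviations) and runs a bare-bones k-means. The farthest-point initialization is a simplification chosen only to keep the example deterministic; it is not what any particular package does, and all names and numbers are invented.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so no single summary variable dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: leave unscaled
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda i: sqdist(p, centers[i]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:  # keep the old center if a cluster empties
                centers[i] = [mean(col) for col in zip(*members)]
    return labels, centers

# Hypothetical summary features per probe: (diff of means, diff of deviations).
features = [
    [0.50, 0.00], [0.48, 0.01], [0.52, -0.01],   # mean shifts, spread stable
    [0.00, 0.30], [0.01, 0.28], [-0.01, 0.32],   # mean stable, spread grows
]
labels, centers = kmeans(standardize(features), k=2)
```

With these toy values the two "families" separate cleanly; on real data the result will depend on the choice of summaries, the scaling, and k.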
> or if you have a crapload of covariates/mutations/whatever, do CCA
> (try 'PMA').

Same as above. Moreover, I have to admit that I had not heard of these methods before. I have put them on my to-read list. You are an incredible source of knowledge, Tim. :)

> consider using a logit transform on the betas if you are doing linear
> modeling; note that big changes are more or less invariant to the
> transformation.

I am happy because I think this is the first paragraph where I more or less understand what you are saying. I think you are right. I have been in bioinformatics for only three months, but I am slowly starting to use log transformations in my work. Presentations and materials by Mr. W. Huber found on the Internet have helped a lot in understanding the benefits of this way of working.

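For reference, a small sketch of the logit transform mentioned above, using base 2 as is common for methylation M-values. The clamping constant `eps` is an assumption added here so that betas of exactly 0 or 1 do not produce infinities; it is not something taken from the thread.

```python
from math import log2

def logit2(beta, eps=1e-6):
    """Base-2 logit of a beta value, clamped away from 0 and 1."""
    b = min(max(beta, eps), 1.0 - eps)
    return log2(b / (1.0 - b))

# Hypothetical beta values spanning the range.
m_values = [logit2(b) for b in (0.02, 0.2, 0.5, 0.8, 0.98)]

# The same 0.01 step in beta is stretched much more near the boundary
# than in the middle of the range.
step_edge = logit2(0.02) - logit2(0.01)
step_mid = logit2(0.51) - logit2(0.50)
```

The transform is symmetric around beta = 0.5 and expands the scale near 0 and 1, which is where small differences in beta are compressed.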
> and switch to using SummarizedExperiments if you want flexibility in
> slicing up the data genomically. I wrote some coercions and generics
> for this class and I'll be submitting a package since they've been
> incredibly useful to me, for e.g. subsetting by GRanges.  

I guess I have to give them a try. I promise.   
> Will present
> some examples at Bioc2012.

Good luck. I wish I could be there, but, as you know, we are running very low on budget here. :(

> variance testing is tricky, look at what Haim Bar and Jim Booth have
> done with empirical Bayes mixture modeling for p1 vs. p0.

Another one for the to-read list. Thanks again for the references.  

Well, I guess that most of the time I am not explaining my intentions clearly. I have been in bioinformatics for only three months, and I am still, as I like to say, learning its vocabulary.
>  
>  
> On Wed, Jul 11, 2012 at 2:14 AM, Gustavo Fernández Bayón
> <gbayon at gmail.com> wrote:
> > Hi everybody.
> >  
> > Imagine the following scenario: I have a methylation ExpressionSet with 40 samples and 450K probes (Illumina platform). The samples are divided into two classes, and I would like to characterize families of probes according to their behavior. That is, I would like to find one set of probes that become hypermethylated with respect to the covariate that separates the classes, another set whose variability increases between classes, etc.
> >  
> > I have been trying some ideas around the following workflow:
> >  
> > 1) Filtering of the data (non-specific probes, sex-chromosome genes, ...).
> > 2) Transformation into a lower-dimensional summary subspace. For example, if I have 20 beta values for one class and 20 for the other, the transformation takes the 40-dimensional vector of beta values and summarizes it as a two-dimensional vector, whose first component is the difference between the medians of the two classes and whose second component is the difference between their IQRs. My idea was to summarize the data and work with those transformed variables that really characterize what I am looking for.
> > 3) Clustering in the new subspace. For now, I am using k-means as a baseline clustering method. My idea was to also test a hierarchical method and maybe a Bayesian dp-means, among others.
> >  
> > This is mainly an exploratory workflow. I want to know how these probes behave according to the above variables, and I am testing different ideas on my data. But I was wondering whether summarizing the beta values into the new variables is the right thing to do, or whether there is some alternative (maybe model-based) for this kind of exploratory work. Apart from losing a lot of information along the way, am I getting into trouble by doing that?
> >  
> > Any hint or suggestion will be appreciated.
> >  
> > Regards,
> > Gus
> >  
> >  
> > ---------------------------
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> >  
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>  
> --  
> A model is a lie that helps you see the truth.
>  
> Howard Skipper