[BioC] clustering methods for gene expression

Donna Toleno toleno at usc.edu
Wed Oct 10 01:25:55 CEST 2007


Hello list,

I have a question about using the Bioconductor tools to reproduce a certain illustration I have seen in publications. I took some sample data and used hcluster on my microarray arrays as in the following example. I subset on the genes and then cluster the arrays based on the genes of interest. I use Euclidean distance and average linking. Is this the same as using PHYLIP to make a UPGMA tree with distances calculated based on pairwise "mean of the squared difference in expression intensities"?  If I am not mistaken, the Euclidean distance would be the square root of the sum of the squared differences. It seems it is a similar distance measure. But what if we took the mean probe signal intensities for each of two previously defined groups and then took a mean squared difference between the group means? Then within one of the groups the expression variation is defined in the literature as "mean probe signal intensity variation". These seem like unusual ways to measure the with
in and between group variation. Wouldn't it be more acceptable to use a linear model in limma to quantify the variance components within and between groups?Also, is there any reason to use PHYLIP to display the relationships of expression profiles instead of using hcluster or some other clustering algorithm? Any feedback at all would be appreciated. Feel free to criticize my R code appearing below. I am somewhat new to microarray data analysis and Bioconductor and I recently pieced together this code to carry out clustering.

mydata <- ReadAffy()
eset <- rma(mydata)
#transpose the expression set so that the rows are the genes
data_frame_expression <- t(as.data.frame(eset))
vars <- apply(data_frame_expression,1, var)
means <- apply(data_frame_expression,1, mean)
stdevs <- apply(data_frame_expression,1, sd)
CV <- stdevs/means
#subset the data on the rows (genes)
# an example:subset <- vars > quantile(vars,(nrow(data_frame_expression -200)/nrow(data_frame_expression))
subset <- CV > 0.15 
expression_subset <- data_frame_expression[subset, ]
#cluster on the columns (samples)
hc <- hcluster(t(expression_subset), link = "ave")
plot (hc)
write(hc2Newick(hc),file='hclust_subset.newick')
subset_distances <- as.matrix(dist(t(expression_subset), method = "euclidean", diag = FALSE, upper = FALSE))
write (subset_distances, file='subset_of_genes_dist_file') # the problem here is that the distance matrix didn't print to a file properly.


Thank you for reading!



More information about the Bioconductor mailing list