[BioC] Clustering in R....

Sean Davis sdavis2 at mail.nih.gov
Tue Nov 11 12:39:01 MET 2003


Marcus,

Here is a fairly general method for working with heatmap that I have used.
You can substitute any function you want for the distance (e.g.,
1-correlation) and for the clustering (you don't have to use hclust).  Just
make sure that you do the coercion (to distance or dendrogram objects, as
needed).  Also, some distance functions that you can dream up will not work
with NAs, but dist does.
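
As a minimal sketch of that substitution (not part of the original session), the following uses 1-correlation as the distance and average linkage for the clustering. The matrix x, the seed, and the choice of method="average" are assumptions made purely for illustration.

```r
# Sketch: 1 - correlation as the distance, average linkage for the clustering.
set.seed(1)                                    # arbitrary seed, only for reproducibility
x <- matrix(rnorm(100), nrow = 10, ncol = 10)  # stand-in for a genes x samples matrix

# cor() correlates columns, so transpose to correlate rows (genes).
# use = "pairwise.complete.obs" skips NA pairs; x has no NAs, so it changes nothing here.
genedist <- as.dist(1 - cor(t(x), use = "pairwise.complete.obs"))
sampdist <- as.dist(1 - cor(x, use = "pairwise.complete.obs"))

# as.dist() is the coercion step: hclust() expects a "dist" object,
# not a plain square matrix.
gclus <- hclust(genedist, method = "average")
sclus <- hclust(sampdist, method = "average")

# Second coercion: hand heatmap() dendrograms rather than hclust objects.
heatmap(x, Rowv = as.dendrogram(gclus), Colv = as.dendrogram(sclus))
```

Note that 1-correlation is undefined for any pair of rows with fewer than two shared non-NA columns, so this particular distance can still produce NAs on sparse data.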

> m <- matrix(rnorm(100),nrow=10,ncol=10)
> m
            [,1]        [,2]       [,3]        [,4]       [,5]       [,6]
 [1,] -1.0326191  1.09744204  0.9923254 -0.05780237  1.6853566 -0.5938021
 [2,] -0.6493561 -0.58846041  0.8735639  0.34492342 -0.1398261  1.4288108
 [3,] -1.0020073  0.75130128 -2.6110435  1.27265445  0.1211387  0.7048981
 [4,] -0.1658810  0.45351434 -0.8973168 -0.17738084 -0.1056792 -1.7251339
 [5,]  0.1466563  0.11917823  0.9372353  0.29040600  0.8463049  0.9192848
 [6,]  0.6020565 -0.90338771 -0.7453363 -1.34284821 -0.7684490  0.2177409
 [7,]  0.5290555  0.58798246  0.4085396  0.63305003  0.2014624 -0.5613248
 [8,]  1.4456958  0.06372875  0.1829127  0.20681971  0.5745696 -0.3555856
 [9,]  0.5973093 -0.35483585  1.1074023  0.63930734 -1.2452399 -1.2721422
[10,]  1.2563169  0.92249574 -0.7103717 -0.41067056  0.2277188  0.3861969
             [,7]       [,8]       [,9]       [,10]
 [1,] -1.63852314 -1.0773165  0.5601368  1.05115476
 [2,] -0.14026278 -0.9013605  0.1581475  0.36730440
 [3,]  0.45517561 -1.5211124 -1.1641732  1.97321531
 [4,]  0.08338336  1.4846938  0.3096862  0.44513675
 [5,]  0.85917332  1.0337033 -0.1784938 -0.48848017
 [6,]  0.05054810  1.3712665 -0.6545246  0.10251154
 [7,]  2.30894410 -0.6089214  1.5761573  0.66912925
 [8,] -0.85946317  0.0855971 -0.7014037 -2.19050881
 [9,]  1.53911617  1.1185075  0.2428764 -0.09556405
[10,] -1.61446618  1.0605298  0.5160358  0.04152571
> m[10,1:8] <- NA
> m
            [,1]        [,2]       [,3]        [,4]       [,5]       [,6]
 [1,] -1.0326191  1.09744204  0.9923254 -0.05780237  1.6853566 -0.5938021
 [2,] -0.6493561 -0.58846041  0.8735639  0.34492342 -0.1398261  1.4288108
 [3,] -1.0020073  0.75130128 -2.6110435  1.27265445  0.1211387  0.7048981
 [4,] -0.1658810  0.45351434 -0.8973168 -0.17738084 -0.1056792 -1.7251339
 [5,]  0.1466563  0.11917823  0.9372353  0.29040600  0.8463049  0.9192848
 [6,]  0.6020565 -0.90338771 -0.7453363 -1.34284821 -0.7684490  0.2177409
 [7,]  0.5290555  0.58798246  0.4085396  0.63305003  0.2014624 -0.5613248
 [8,]  1.4456958  0.06372875  0.1829127  0.20681971  0.5745696 -0.3555856
 [9,]  0.5973093 -0.35483585  1.1074023  0.63930734 -1.2452399 -1.2721422
[10,]         NA          NA         NA          NA         NA         NA
             [,7]       [,8]       [,9]       [,10]
 [1,] -1.63852314 -1.0773165  0.5601368  1.05115476
 [2,] -0.14026278 -0.9013605  0.1581475  0.36730440
 [3,]  0.45517561 -1.5211124 -1.1641732  1.97321531
 [4,]  0.08338336  1.4846938  0.3096862  0.44513675
 [5,]  0.85917332  1.0337033 -0.1784938 -0.48848017
 [6,]  0.05054810  1.3712665 -0.6545246  0.10251154
 [7,]  2.30894410 -0.6089214  1.5761573  0.66912925
 [8,] -0.85946317  0.0855971 -0.7014037 -2.19050881
 [9,]  1.53911617  1.1185075  0.2428764 -0.09556405
[10,]          NA         NA  0.5160358  0.04152571
> sampdist=dist(t(m))
> sclus=hclust(sampdist) # sclus is an hclust tree; plot(sclus) draws the dendrogram
> genedist=dist(m)
> gclus=hclust(genedist) # gclus is also an hclust tree
> heatmap(m,Rowv=gclus,Colv=sclus) # this doesn't work!
Error in lV + rV : non-numeric argument to binary operator
> # the hclust objects must be coerced to dendrograms for this to work
> heatmap(m,Rowv=as.dendrogram(gclus),Colv=as.dendrogram(sclus))
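
The coercion step can be checked by looking at the classes involved. The sketch below is an illustration on a made-up matrix x (an assumption, not the data above). It also shows an alternative that I believe is equivalent in spirit: passing functions through heatmap's distfun and hclustfun arguments so that heatmap does the clustering itself. The Manhattan metric and average linkage are arbitrary choices.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
x <- matrix(rnorm(100), nrow = 10)  # made-up data without NAs

gclus <- hclust(dist(x))
class(gclus)                        # "hclust"

gdend <- as.dendrogram(gclus)       # the coercion that Rowv/Colv need
class(gdend)                        # "dendrogram"

# Alternative: give heatmap() the functions and let it build both trees.
# heatmap() applies distfun to the rows and to the transposed matrix.
heatmap(x,
        distfun   = function(z) dist(z, method = "manhattan"),
        hclustfun = function(d) hclust(d, method = "average"))
```

Precomputing the dendrograms, as in the session above, has the advantage that you can inspect or plot each tree before drawing the heatmap.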

Although this works, note that using a gene that has 16 NA values and only 2
measurements (as in your row 22) is probably not going to be useful.  In this
example, row 10 has 8 NA values out of 10, and the distance matrix for the
genes is:

> genedist
          1        2        3        4        5        6        7        8
2  3.673241        
3  5.235695 4.536603
4  4.381494 4.522069 5.046200
5  4.367649 2.821795 5.437622 3.688942
6  5.408318 3.863713 5.380546 3.014530 3.345877
7  4.764409 3.915998 5.194822 3.911820 3.548220 4.830247
8  4.825510 4.216357 6.212646 4.149383 3.314914 3.844966 5.041345
9  5.536079 4.169987 6.179576 3.158424 3.249127 3.637840 3.149486 4.264858
10 2.259752 1.082164 5.724739 1.013612 1.953558 2.621002 2.754763 5.685128
           9
2           
3           
4           
5           
6           
7           
8           
9           
10 0.6834093

See how different the distances involving row 10 are from the others: dist
simply dropped the columns with NA values and scaled up the result from the
two columns that were left.  You will probably have to either deal with
the missing values beforehand or use another distance measure that is not
sensitive to NA values.  I can't tell you what to do on that part, as that
also depends somewhat on your need to use that gene's data and the
practicality of doing more experiments.
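
To make the two points above concrete, here is a rough sketch on made-up data (x and the seed are assumptions). It first reproduces by hand how dist treats a row with NAs: only columns observed in both rows are used, and the sum of squares is scaled up by the total number of columns over the number used. It then shows one possible way of dealing with the missing values beforehand, by dropping rows with too few measurements. The cutoff min_obs is hypothetical; pick whatever suits your data.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- matrix(rnorm(100), nrow = 10)   # made-up data
x[10, 1:8] <- NA                     # same NA pattern as in the session above

# How dist() handles NAs (Euclidean): use only the columns observed in both
# rows, then scale the sum of squares by ncol / (number of columns used).
used   <- !is.na(x[9, ]) & !is.na(x[10, ])
byhand <- sqrt(sum((x[9, used] - x[10, used])^2) * ncol(x) / sum(used))
all.equal(byhand, as.matrix(dist(x))[10, 9])

# One option: drop rows with too few observed values before clustering.
min_obs <- 5                         # hypothetical cutoff, chosen for illustration
keep    <- rowSums(!is.na(x)) >= min_obs
xf      <- x[keep, , drop = FALSE]

heatmap(xf,
        Rowv = as.dendrogram(hclust(dist(xf))),
        Colv = as.dendrogram(hclust(dist(t(xf)))))
```

Filtering throws the gene away entirely, so it only makes sense if you can live without that row; imputing the missing values would be the other route.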

Hope that helps.

Sean
--
Clinical Fellow
Pediatric Oncology
Johns Hopkins/
National Institutes of Health
NCI/NHGRI
-- 



On 11/10/03 7:41 AM, "Marcus" <marcusb at biotech.kth.se> wrote:
>> Hello again. Back from some weeks of laboratory work, I still have some
>> questions on clustering in R.
>> 
>> My problem is that I have some spots flagged as NA in a matrix of M-values
>> organised slidewise. I want to cluster those, but I get error messages when
>> using heatmap due to the NAs in the matrix. I mailed Andy Liaw (who wrote
>> the heatmap function) and he gave me the tip to look into the daisy
>> function. And the daisy function is supposed to handle NAs.
>> 
>> But what do you get out of the function?
>> 
>> test <- daisy(mymatrix)
>> This creates an object of type dissimilarity, right? And you can convert it
>> into a matrix with the help of
>> testII <- as.matrix(test)
>> Is this what I should use hclust on? Or should I do
>> testIII <- as.dist(testII) first? Neither works, so I do not really know
>> what is right.
>> 
>> And I tried to use daisy directly with heatmap, but that didn't work; it
>> produced the same error as with dist.
>> 
>> heatmap(mymatrix[1:22,], distfun = dist)
>> Error in hclustfun(distfun(x)) : NA/NaN/Inf in foreign function call (arg 11)
>> This is due to the fact that I only have 2 M-values in the twenty-second
>> row and 16 NAs.
>> 
>> So basically my question is: how do you get heatmap to work with a
>> matrix of M-values that has spots flagged NA in it? What distance
>> function works and how do you use it?


