[R] agnes clustering and NAs

Sun Jan 30 04:00:05 CET 2011

Hello,

Thank you for the clarification about the NAs. For your interest, thankfully my end goal was not to plot a dendrogram with 23371 elements, but just to use the output of the clustering to re-order the rows of a matrix before plotting it with image(). Since clara() and pam() are partitioning-based approaches, I suppose I could instead stay with hclust() after removing the offending rows, so that I have the ordering position of each gene, not its cluster membership. I have 12 GB RAM on my 64-bit system, so the time it takes to run should be my only problem.
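
A minimal sketch of what I have in mind, on a small made-up matrix rather than the real iMatrix. The names x, bad, keep and x_ordered are only for illustration, and dropping every row involved in an undefined distance is just the simplest choice, not necessarily the best one:

```r
# Toy stand-in for iMatrix (hypothetical data, not the real matrix)
set.seed(1)
x <- matrix(rnorm(40), nrow = 8)
x[2, 1:3] <- NA   # row 2 is observed only in columns 4-5
x[5, 4:5] <- NA   # row 5 is observed only in columns 1-3: no overlap with row 2
x[7, 2]   <- NA   # harmless NA: row 7 still overlaps with every other row

# Rows involved in at least one undefined (NA) pairwise distance
d   <- as.matrix(dist(x))
bad <- which(rowSums(is.na(d)) > 0)

# Cluster the remaining rows and reorder them by the dendrogram order
keep      <- setdiff(seq_len(nrow(x)), bad)
hc        <- hclust(dist(x[keep, ]))
x_ordered <- x[keep, ][hc$order, ]

# image() draws the rows of its argument along the x axis, hence t()
image(t(x_ordered))
```

Here hc$order gives the position of each retained row along the dendrogram, which is all I need for the plot.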

- Dario.

---- Original message ----
>Date: Fri, 28 Jan 2011 12:34:26 +0100
>From: Martin Maechler <maechler at stat.math.ethz.ch>  
>Subject: Re: [R] agnes clustering and NAs  
>To: gavin.simpson at ucl.ac.uk
>Cc: D.Strbenac at garvan.org.au, r-help at r-project.org, Uwe Ligges <ligges at statistik.tu-dortmund.de>
>
>>>>>> Gavin Simpson <gavin.simpson at ucl.ac.uk>
>>>>>>     on Fri, 28 Jan 2011 09:23:05 +0000 writes:
>
>    > On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
>    >> Hello,
>    >> 
>    >> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
>    >> 
>    >> i.e.
>    >> 
>    >> > str(iMatrix)
>    >> num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
>    >> - attr(*, "dimnames")=List of 2
>    >> ..$ : NULL
>    >> ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...
>
>Ok, so in the end you want to draw a dendrogram for  23'371
>observational units, really ?
>
>I think I would not use a hierarchical clustering method for so
>many units, but rather  clara() or maybe pam() or then model
>based or other methods, rather than fully hierarchical ones....
>...
>but yes, that's not the issue here, and see further down ...
>
>BTW:  The object 'iMatrix' you provided for download has only 50
>      columns, not 56...
>    >> 
>    >> For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.
>
>    GS> Sorry, my bad. Try this:
>
>    GS> apply(iMatrix, 1, function(x) all(is.na(x)))
>
>    GS> will check that you have no fully `NA` rows.
>
>    GS> Also look at str(iMatrix) for potential problems.
>
>    GS> Finally, try:
>
>    GS> out <- dist(iMatrix)
>    GS> any(is.na(out))
>
>    GS> should repeat what agnes is doing to compute the
>    GS> dissimilarity matrix.  If that returns TRUE, go and find
>    GS> which samples are giving NA dissimilarity and why.
>
>    GS> The issue is not NA in the input data, but that your
>    GS> input data is leading to NA in the computed
>    GS> dissimilarities. This might be due to NA's in your input
>    GS> data, where a pair of samples has no common set of data
>    GS> for example.
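
To convince myself of this, here is a small sketch with made-up rows (r1, r2, r3 are hypothetical, not taken from iMatrix). As far as I understand dist(), it drops the columns where either row is missing, and returns NA when nothing is left to compare:

```r
# r1 and r2 have no column in which both are observed
r1 <- c(1, 2, NA, NA)
r2 <- c(NA, NA, 3, 4)
# r3 shares one observed column with each of them
r3 <- c(2, NA, 5, NA)

dm <- as.matrix(dist(rbind(r1, r2, r3)))

is.na(dm["r1", "r2"])   # undefined: no common non-missing columns
is.na(dm["r1", "r3"])   # defined: column 1 is observed in both
```

So each row on its own is acceptable, and it is only the pair r1, r2 that produces the NA dissimilarity.
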
>
>Yes, that's right on spot, thank you Gavin.
>
>This is indeed true:
>It *does* allow for NA's (in the data matrix), but if the
>pattern of NA's is such that the dissimilarity between two
>observations becomes undefined, namely e.g. if they have no
>common non-missings, then ``that's too much''.
>
>In general, I'd recommend using
>  dm <- daisy(....,...) 
>trying methods that are better with NAs, e.g. Gower's metric,
>until dm has {nearly} no NAs,
>and then figure out some imputation to replace all NA's in   dm
>by "reasonable values",
>then do clustering with the resulting dissimilarity "matrix" dm.
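
A rough sketch of how I read this recipe, on a tiny made-up matrix. Replacing the NA entries by the largest observed dissimilarity is purely my own placeholder for "reasonable values", and the names m, dm_mat and ag are hypothetical:

```r
library(cluster)

# Rows 1 and 2 have no common non-missing column
m <- rbind(c(1, NA, 3),
           c(NA, 5, NA),
           c(2, 4, 6),
           c(3, 6, 5))

# Dissimilarities with Gower's metric; NAs in the data are allowed
dm <- daisy(m, metric = "gower")
sum(is.na(dm))   # how many dissimilarities are still undefined

# Placeholder imputation (an assumption, not a recommendation):
# treat undefined pairs as being as far apart as the most distant observed pair
dm_mat <- as.matrix(dm)
dm_mat[is.na(dm_mat)] <- max(dm_mat, na.rm = TRUE)

# Cluster on the completed dissimilarities
ag <- agnes(as.dist(dm_mat), diss = TRUE)
ag$order
```

Whether such an imputation is sensible will depend on the data, so this only shows the mechanics.
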
>
>HOWEVER, in your case, dm would correspond to 
> 23371 x 23371 dissimilarity matrix,
>stored as a double precision matrix (on a 64-bit platform)
>that's an object of size 4.4 GBytes, not very convenient to work
>with.
>As a dissimilarity object it will only be about half of that size,
>but that's still ``a bit large''..
>As I said above, for such data, I would never do fully
>hierarchical clustering,
>but rather something else.
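
For reference, the size estimate above can be reproduced with a little arithmetic, assuming 8 bytes per double and counting a GByte as 1e9 bytes:

```r
n <- 23371

# Full n x n matrix of doubles
bytes_full <- n^2 * 8

# A "dist"-style object stores only the lower triangle: n * (n - 1) / 2 values
bytes_dist <- n * (n - 1) / 2 * 8

# Sizes in GBytes: about 4.4 for the full matrix, about 2.2 for the dist object
round(c(full = bytes_full, dist = bytes_dist) / 1e9, 1)
```

Either way it should fit in my 12 GB, though intermediate copies during clustering may need a good deal more than the object itself.
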
>
>Martin Maechler, ETH Zurich
>
>
>    GS> HTH
>    GS> G
>
>    >> The part of the agnes documentation I was referring to is :
>    >> 
>    >> "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric.  Missing values (NAs) are allowed."
>    >> 
>    >> So, I'm under the impression it handles NAs on its own ?
>    >> 
>    >> - Dario.
>    >> 
>    >> ---- Original message ----
>    >> >Date: Thu, 27 Jan 2011 12:53:27 +0000
>    >> >From: Gavin Simpson <gavin.simpson at ucl.ac.uk>  
>    >> >Subject: Re: [R] agnes clustering and NAs  
>    >> >To: Uwe Ligges <ligges at statistik.tu-dortmund.de>
>    >> >Cc: D.Strbenac at garvan.org.au, r-help at r-project.org
>    >> >
>    >> >On Thu, 2011-01-27 at 10:45 +0100, Uwe Ligges wrote:
>    >> >> 
>    >> >> On 27.01.2011 05:00, Dario Strbenac wrote:
>    >> >> > Hello,
>    >> >> >
>    >> >> > In the documentation for agnes in the package 'cluster', it says that NAs are allowed, and sure enough it works for a small example like :
>    >> >> >
>    >> >> >> m<- matrix(c(
>    >> >> > 1, 1, 1, 2,
>    >> >> > 1, NA, 1, 1,
>    >> >> > 1, 2, 2, 2), nrow = 3, byrow = TRUE)
>    >> >> >> agnes(m)
>    >> >> > Call:    agnes(x = m)
>    >> >> > Agglomerative coefficient:  0.1614168
>    >> >> > Order of objects:
>    >> >> > [1] 1 2 3
>    >> >> > Height (summary):
>    >> >> >     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    >> >> >    1.155   1.247   1.339   1.339   1.431   1.524
>    >> >> >
>    >> >> > Available components:
>    >> >> > [1] "order"  "height" "ac"     "merge"  "diss"   "call"   "method" "data"
>    >> >> >
>    >> >> > But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it runs for about a minute, then gives an error :
>    >> >> >
>    >> >> >> agnes(iMatrix)
>    >> >> > Error in agnes(iMatrix) :
>    >> >> >    No clustering performed, NA-values in the dissimilarity matrix.
>    >> >> >
>    >> >> > I've also tried getting rid of rows with all NAs in them, and it still gave me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim made by its documentation.
>    >> >> 
>    >> >> 
>    >> >> I haven't looked in the file, but you need to get rid of all NA, or in 
>    >> >> other words, all rows that contain *any* NA values.
>    >> >
>    >> >If one believes the documentation, then that only applies to the case
>    >> >where `x` is a dissimilarity matrix. `NA`s are allowed if x is the raw
>    >> >data matrix or data frame.
>    >> >
>    >> >The only way the OP could have gotten that error with the call shown is
>    >> >if iMatrix were not a dissimilarity matrix inheriting from class "dist",
>    >> >so `NA`s should be allowed.
>    >> >
>    >> >My guess would be that the OP didn't get rid of all the `NA`s.
>    >> >
>    >> >Dario: what does:
>    >> >
>    >> >sapply(iMatrix, function(x) any(is.na(x)))
>    >> >
>    >> >or if iMatrix is a matrix:
>    >> >
>    >> >apply(iMatrix, 2, function(x) any(is.na(x)))
>    >> >
>    >> >say?
>    >> >
>    >> >G
>    >> >
>    >> >> Uwe Ligges
>    >> >> 
>    >> >> 
>    >> >> 
>    >> >> > The matrix I'm using can be obtained here :
>    >> >> > http://129.94.136.7/file_dump/dario/iMatrix.obj
>    >> >> >
>    >> >> > --------------------------------------
>    >> >> > Dario Strbenac
>    >> >> > Research Assistant
>    >> >> > Cancer Epigenetics
>    >> >> > Garvan Institute of Medical Research
>    >> >> > Darlinghurst NSW 2010
>    >> >> > Australia
>    >> >> >
>
>    >> >-- 
>    >> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>    >> > Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
>    >> > ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
>    >> > Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
>    >> > Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
>    >> > UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
>    >> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

--------------------------------------
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia