[R] agnes clustering and NAs

Fri Jan 28 10:23:05 CET 2011

On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
> Hello,
> 
> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
> 
> i.e.
> 
> > str(iMatrix)
>  num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
>  - attr(*, "dimnames")=List of 2
>   ..$ : NULL
>   ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...
> 
> For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.

Sorry, my bad. Try this:

apply(iMatrix, 1, function(x) all(is.na(x)))

This returns one logical value per row, TRUE where the row is entirely
`NA`. Any TRUE means you still have a fully `NA` row to remove.
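
A minimal sketch of acting on that check, in case it helps. The matrix `x` below is a made-up stand-in for iMatrix (the values are purely illustrative), and the names `allNA`, `allNA2` and `xClean` are mine, not anything from your code:

```r
# Toy stand-in for iMatrix; values are invented for illustration
x <- matrix(c( 1, NA,  2,
              NA, NA, NA,
               3,  4, NA),
            nrow = 3, byrow = TRUE)

# TRUE for each row in which every entry is NA
allNA <- apply(x, 1, function(r) all(is.na(r)))

# Same test without apply(); usually quicker on a large matrix
allNA2 <- rowSums(!is.na(x)) == 0

sum(allNA)                           # how many fully-NA rows there are
xClean <- x[!allNA, , drop = FALSE]  # keep only rows with some data
```

The `drop = FALSE` keeps the result a matrix even if only one row survives.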

Also look at str(iMatrix) for potential problems.

Finally, try:

out <- dist(iMatrix)
any(is.na(out))

This should repeat what agnes is doing to compute the dissimilarity
matrix. If it returns TRUE, go and find which pairs of samples are
giving an NA dissimilarity, and why.
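
One way to locate the offending pairs, sketched on a small invented matrix rather than on iMatrix (the values and the names `d`, `dm` and `bad` are assumptions for illustration only):

```r
# Toy data: rows 1 and 2 are never both observed in the same column
x <- matrix(c( 1, NA,  2, NA,
              NA,  5, NA,  7,
               3,  4,  1,  2),
            nrow = 3, byrow = TRUE)

d <- dist(x)
any(is.na(d))        # are there NA dissimilarities at all?

# Expand to a square matrix so that entries can be indexed by row pair
dm <- as.matrix(d)

# Row/column indices of NA entries; upper.tri() lists each pair once
bad <- which(is.na(dm) & upper.tri(dm), arr.ind = TRUE)
bad
```

Be aware that as.matrix() on the dist object for 23371 rows needs several gigabytes, so on the real data you may prefer to run this on a subset of rows first.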

The issue is not the NAs in the input data as such, but that your input
data lead to NAs in the computed dissimilarities. This can happen when,
for example, a pair of samples has no variable observed in both of them,
so there is nothing from which to compute their dissimilarity.
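
To illustrate the "no common set of data" case, here is a hedged sketch that counts, for every pair of rows, how many columns are non-NA in both. It again uses an invented matrix, and the names `present`, `shared` and `noOverlap` are hypothetical:

```r
# Toy data: rows 1 and 2 have no column in which both are observed
x <- matrix(c( 1, NA,  2, NA,
              NA,  5, NA,  7,
               3,  4,  1,  2),
            nrow = 3, byrow = TRUE)

# 1 where a value is present, 0 where it is NA (dimensions are kept)
present <- ifelse(is.na(x), 0, 1)

# shared[i, j] = number of columns observed in both row i and row j
shared <- tcrossprod(present)
shared

# Pairs of rows with nothing in common, each pair listed once
noOverlap <- which(shared == 0 & upper.tri(shared), arr.ind = TRUE)
noOverlap
```

Any pair reported in `noOverlap` has nothing to base a distance on, which is the situation I suspect you have somewhere in your data.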

HTH

G

> The part of the agnes documentation I was referring to is :
> 
> "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric.  Missing values (NAs) are allowed."
> 
> So, I'm under the impression it handles NAs on its own ?
> 
> - Dario.
> 
> ---- Original message ----
> >Date: Thu, 27 Jan 2011 12:53:27 +0000
> >From: Gavin Simpson <gavin.simpson at ucl.ac.uk>  
> >Subject: Re: [R] agnes clustering and NAs  
> >To: Uwe Ligges <ligges at statistik.tu-dortmund.de>
> >Cc: D.Strbenac at garvan.org.au, r-help at r-project.org
> >
> >On Thu, 2011-01-27 at 10:45 +0100, Uwe Ligges wrote:
> >> 
> >> On 27.01.2011 05:00, Dario Strbenac wrote:
> >> > Hello,
> >> >
> >> > In the documentation for agnes in the package 'cluster', it says that NAs are allowed, and sure enough it works for a small example like :
> >> >
> >> >> m<- matrix(c(
> >> > 1, 1, 1, 2,
> >> > 1, NA, 1, 1,
> >> > 1, 2, 2, 2), nrow = 3, byrow = TRUE)
> >> >> agnes(m)
> >> > Call:    agnes(x = m)
> >> > Agglomerative coefficient:  0.1614168
> >> > Order of objects:
> >> > [1] 1 2 3
> >> > Height (summary):
> >> >     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> >> >    1.155   1.247   1.339   1.339   1.431   1.524
> >> >
> >> > Available components:
> >> > [1] "order"  "height" "ac"     "merge"  "diss"   "call"   "method" "data"
> >> >
> >> > But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it runs for about a minute, then gives an error :
> >> >
> >> >> agnes(iMatrix)
> >> > Error in agnes(iMatrix) :
> >> >    No clustering performed, NA-values in the dissimilarity matrix.
> >> >
> >> > I've also tried getting rid of rows with all NAs in them, and it still gave me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim made by its documentation.
> >> 
> >> 
> >> I haven't looked in the file, but you need to get rid of all NA, or in 
> >> other words, all rows that contain *any* NA values.
> >
> >If one believes the documentation, then that only applies to the case
> >where `x` is a dissimilarity matrix. `NA`s are allowed if x is the raw
> >data matrix or data frame.
> >
> >The only way the OP could have gotten that error with the call shown is
> >if iMatrix were not a dissimilarity matrix inheriting from class "dist",
> >so `NA`s should be allowed.
> >
> >My guess would be that the OP didn't get rid of all the `NA`s.
> >
> >Dario: what does:
> >
> >sapply(iMatrix, function(x) any(is.na(x)))
> >
> >or if iMatrix is a matrix:
> >
> >apply(iMatrix, 2, function(x) any(is.na(x)))
> >
> >say?
> >
> >G
> >
> >> Uwe Ligges
> >> 
> >> 
> >> 
> >> > The matrix I'm using can be obtained here :
> >> > http://129.94.136.7/file_dump/dario/iMatrix.obj
> >> >
> >> > --------------------------------------
> >> > Dario Strbenac
> >> > Research Assistant
> >> > Cancer Epigenetics
> >> > Garvan Institute of Medical Research
> >> > Darlinghurst NSW 2010
> >> > Australia
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> 
> >
> >-- 
> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> > Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
> > ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
> > Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
> > Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
> > UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> >
> 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%