Christian Hennig
chrish at stats.ucl.ac.uk
Sat Jun 14 21:46:35 CEST 2008
Dear Laura,
I have R 2.6.0. I tried dist on a vector of length 200,000 and it told me
that it is too long. Theoretically, if you have 260,000 observations, the
length of the dist object should be 260,000*259,999/2, which is too large
for our computers, I guess. Which means that unfortunately cluster.stats
won't work for such a large data set, because it needs the full casewise
dissimilarity information.
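
The size estimate above can be checked with a few lines of base R. This is only a rough sketch: it uses the 262,144 observations reported further down in the thread rather than the rounded 260,000, it assumes each distance is stored as an 8-byte double, and n_pairs is a helper name introduced here purely for illustration.

```r
# Number of pairwise distances a dist object holds for n observations.
# n_pairs is a hypothetical helper, not part of base R or fpc.
n_pairs <- function(n) n * (n - 1) / 2

n <- 262144                  # observations reported by str(data) in this thread
pairs <- n_pairs(n)          # entries a full dist object would need
gib <- pairs * 8 / 2^30      # assumption: 8 bytes per stored distance

cat(format(pairs, big.mark = ","), "distances, roughly", round(gib), "GiB\n")
```

Under these assumptions the full dist object would need on the order of 256 GiB, which is why dist refuses the input.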
I don't understand how you managed to produce a dist object of length
only 130,000 out of your data, but it certainly doesn't give all
pairwise distance information for 260,000 points and therefore cannot be
used in cluster.stats with a clustering vector of length 260,000 or so.
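
One way to see the mismatch is to work out how many observations a dist object of a given length covers and compare that with the length of the clustering vector. The sketch below is illustrative only: the toy data x and the two-cluster kmeans call are assumptions made up for the example, not the data from this thread.

```r
# Solve m = n * (n - 1) / 2 for n, given the length m of a dist object.
m <- 130816                        # length reported by str(ddata) in the thread
n <- (1 + sqrt(1 + 8 * m)) / 2     # positive root of the quadratic

# A dist object also records this number in its "Size" attribute.
set.seed(1)
x <- matrix(rnorm(20), ncol = 1)   # toy data: 20 observations, 1 variable
d <- dist(x)
cl <- kmeans(x, 2)$cluster

# Checks one might run before passing d and cl to cluster.stats:
stopifnot(attr(d, "Size") == length(cl))
stopifnot(length(d) == 20 * 19 / 2)
```

Here n comes out as 512, far fewer than the 262,144 entries in kl$cluster, which would be consistent with the "logical subscript too long" error.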
Sorry,
Christian
On Sat, 14 Jun 2008, Laura Poggio wrote:
> Thanks. See below.
>
> Laura
>
> 2008/6/14 Christian Hennig <chrish at stats.ucl.ac.uk>:
>
>> What does str(ddata) give?
>
>
> Class 'dist' atomic [1:130816] 69.2 117.1 145.6 179.9 195.6 ...
>
>
>>
>> dcent doesn't make sense as input for cluster.stats, because you need a
>> dissimilarity matrix between all objects.
>>
>
> Yes, I know ... I was simply trying to see whether anything changed with a
> different data structure
>
>
>
>>
>> Christian
>>
>> On Sat, 14 Jun 2008, Laura Poggio wrote:
>>
>>> I am sorry I did not provide enough information.
>>> I am not using img later, but data, which is a data.frame.
>>> I wrote that img is an "image" just to explain what kind of data it comes
>>> from, but the object I am using is data and it is a data.frame (checked
>>> many times).
>>>
>>> I am not using as.dist, but dist in order to calculate the distance matrix
>>> among the data I have. Then the whole code I am using is:
>>>
>>> data <- as(img, "data.frame")[1:1]  # img is a 256x256 px image
>>> kl <- kmeans(data, 5)
>>> library(fpc)
>>> ddata <- dist(data)
>>> dcent <- dist(kl$centers)
>>>
>>> cluster.stats(ddata, kl$cluster)
>>> cluster.stats(dcent, kl$cluster)
>>>
>>> In both cases I got the same error:
>>> Error in as.dist(dmat[clustering == i, clustering == i]) : (subscript)
>>> logical subscript too long
>>>
>>> The structure of the different objects is detailed below:
>>> data is "'data.frame': 262144 obs. of 1 variable"
>>> kl$centers is "num [1:5, 1]"
>>> kl$cluster is "Named int [1:262144]"
>>>
>>> I hope this is more informative. I am sorry, but I did not find any
>>> explanation for the error message I am getting.
>>>
>>> Thank you very much in advance
>>>
>>> Laura
>>>
>>>
>>>
>>> 2008/6/14 Christian Hennig <chrish at stats.ucl.ac.uk>:
>>>
>>>> The given information is not enough to tell you what's going on. as.dist
>>>> doesn't appear in the given code and it's not clear to me what kind of
>>>> object img is ("a small image" doesn't tell me what R makes of it).
>>>> Also, try to read the help pages first and find out whether img is of the
>>>> format that is required by the functions. And check (using str, for
>>>> example) whether "data" is what you expect it to be.
>>>>
>>>> Christian
>>>>
>>>>
>>>> On Sat, 14 Jun 2008, Laura Poggio wrote:
>>>>
>>>>> Thank you very much for your answer.
>>>>>
>>>>> I tried to run the function on my data and now I am getting this error
>>>>> message:
>>>>> Error in as.dist(dmat[clustering == i, clustering == i]) : (subscript)
>>>>> logical subscript too long
>>>>>
>>>>> Below is the code I am using (version 2.7.0 of R with all packages updated):
>>>>>
>>>>> data <- as(img, "data.frame")[1:1]  # img is a small 256 x 256 px image
>>>>> kl <- kmeans(data, 5)
>>>>> library(fpc)
>>>>> cluster.stats(data, kl$cluster)
>>>>>
>>>>> Thank you for any hints on the reasons and meaning of the error!
>>>>>
>>>>> Laura
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2008/6/13 Christian Hennig <chrish at stats.ucl.ac.uk>:
>>>>>
>>>>>> Dear Laura,
>>>>>>
>>>>>>> Dear list,
>>>>>>>
>>>>>>> I just tried to use the function cluster.stats in the package fpc.
>>>>>>> I just have a couple of questions about the syntax:
>>>>>>>
>>>>>>> cluster.stats(d,clustering,alt.clustering=NULL,
>>>>>>> silhouette=TRUE,G2=FALSE,G3=FALSE)
>>>>>>>
>>>>>>> 1) the distance object (d) is an object obtained by the function
>>>>>>> dist() on my own original matrix?
>>>>>>
>>>>>> d is allowed to be an object of class dist or a dissimilarity matrix.
>>>>>> The answer to your question depends on what your "original matrix" is.
>>>>>> If it is something on which you can compute a distance by dist(), you're
>>>>>> right, at least if dist() delivers the distance you are interested in.
>>>>>>
>>>>>>> 2) clustering is the clusters vector as result of one of the many
>>>>>>> clustering methods?
>>>>>>
>>>>>> The help page tells you what clustering can be. So it could be the
>>>>>> clustering/partition vector of a clustering method or it could be
>>>>>> something else. Note that cluster.stats doesn't depend on any particular
>>>>>> clustering method. It computes the statistics regardless of where the
>>>>>> clustering vector comes from.
>>>>>>
>>>>>> Best regards,
>>>>>> Christian
>>>>>>
>>>>>>> Thank you very much in advance and sorry for such a basic question, but
>>>>>>> I did not manage to clarify my mind.
>>>>>>>
>>>>>>> Laura
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>
>
