[R] R and DBSCAN

Christian Hennig chrish at stats.ucl.ac.uk
Wed Jun 8 13:11:41 CEST 2011


Dear Paco,

I tried dbscan on my computer with method="hybrid" and a 155000*3 
data matrix and it works. Needs some time though.
(You can track the progress using something like
countmode=c(1:10,100,1000,10000,100000).)
Note that, for a reason I don't fully understand, it takes *much*
longer for 1-dimensional data (I need to look into this), so if you have
only tried 1-d data so far, it may be worth running the whole thing on
the full 3-d dataset.
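
A minimal sketch of such a call is given here, on synthetic data rather
than the actual SST file. The two Gaussian blobs, the eps value of 0.5 and
the shortened countmode vector are assumptions chosen only so that the
example runs quickly; they are not recommendations for the real data.

```r
library(fpc)

set.seed(1)
# Synthetic stand-in for the lon/lat/sst matrix: two well-separated 3-d blobs.
n <- 200
x <- rbind(
  matrix(rnorm(n * 3, mean = 0, sd = 0.1), ncol = 3),
  matrix(rnorm(n * 3, mean = 3, sd = 0.1), ncol = 3)
)
colnames(x) <- c("lon", "lat", "sst")

# method = "hybrid" avoids holding the full distance matrix in memory;
# countmode lists the point numbers at which progress is reported.
fit <- fpc::dbscan(x, eps = 0.5, MinPts = 5, scale = FALSE,
                   method = "hybrid", seeds = FALSE, showplot = FALSE,
                   countmode = c(1:10, 100))

# Cluster label per row; 0 marks noise points.
table(fit$cluster)
```

For the real data the first argument would be the whole three-column
object (for example as.matrix(sst2)) instead of the single column
sst2$sst.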

So I'm not sure what goes wrong on your side. Perhaps look at str(sst2) in 
order to make sure that it is what you think it is.
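
One way to run that check, and to get a feel for why a full distance
matrix is out of reach at this size, is sketched here. The small data
frame is a hypothetical stand-in built from a few of the rows quoted
further down, and dist_gib is an illustrative helper, not part of fpc.
The estimate assumes 8-byte doubles and the lower-triangle storage used
by base R's dist().

```r
# Hypothetical miniature of the filtered data frame.
sst2 <- data.frame(lon = c(35.18, 35.22, 35.27),
                   lat = c(24.98, 24.98, 24.98),
                   sst = c(26.78, 26.78, 26.85))

# Inspect the structure: expect a data.frame with three numeric columns.
str(sst2)
stopifnot(is.data.frame(sst2), all(sapply(sst2, is.numeric)))

# Approximate size in GiB of a full pairwise distance object for n points:
# n * (n - 1) / 2 distances, 8 bytes each.
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 2^30

dist_gib(152243)  # about 86 GiB for the 152243 rows reported for sst2
```

This only bounds the cost of a complete distance matrix; it does not by
itself explain the specific allocation that failed.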

I can't advise you on precisely how to take longitude and latitude into
account, because this depends on your application and would probably
require professional statistical advice that goes well beyond what
R-help can offer. Note, however, that dbscan treats all variables equally.
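
To illustrate what equal treatment means in practice, the sketch here
computes plain Euclidean distances between three hypothetical points in
base R. The points, the euclid helper and the weight vector w are
assumptions made for illustration only; whether and how to rescale the
variables remains an application-specific decision.

```r
# Three hypothetical points (lon, lat, sst).
a  <- c(lon = 35.18, lat = 24.98, sst = 26.78)
b  <- c(lon = 35.22, lat = 24.98, sst = 27.78)  # near a, 1 degree C warmer
c3 <- c(lon = 36.18, lat = 24.98, sst = 26.78)  # same sst, 1 degree lon away

euclid <- function(p, q) sqrt(sum((p - q)^2))

# Unscaled: one degree of longitude counts about as much as one degree C.
euclid(a, b)
euclid(a, c3)

# Multiplying a column by a weight changes its influence on the distance.
w <- c(1, 1, 10)  # illustrative weights: sst differences count ten times more
euclid(w * a, w * b)
euclid(w * a, w * c3)
```

Any eps value has to be read against whatever scaling is applied, since
it is a radius in this combined space.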

Best wishes,
Christian

On Tue, 7 Jun 2011, Paco Pastor wrote:

> Hello Christian
>
> Thanks for answering. Yes, I have tried dbscan from fpc, but I'm still stuck
> on the memory problem. Regarding your answer, I'm not sure which memory
> parameter I should look at. Below is the code I tried, with the dbscan
> parameters; maybe you can see whether there is a mistake.
>
> sstdat=read.csv("sst.dat",sep=";",header=F,col.names=c("lon","lat","sst"))
>
> library(fpc)
> sst1=subset(sstdat, sst<50)
> sst2=subset(sst1, lon>-6)
> sst2=subset(sst2, lon<40)
> sst2=subset(sst2, lat<46)
>
>> dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c("hybrid"), 
> seeds = FALSE, showplot = FALSE, countmode = NULL)
> Error: cannot allocate vector of size 858.2 Mb
>> head(sst2)
>             lon   lat   sst
> 1257 35.18 24.98 26.78
> 1258 35.22 24.98 26.78
> 1259 35.27 24.98 26.78
> 1260 35.31 24.98 26.78
> 1261 35.35 24.98 26.78
> 1262 35.40 24.98 26.85
>
>
> In this example I only apply dbscan to the temperature values, not lon/lat,
> so the eps parameter is 0.1. As it is a gridded data set, every point is
> surrounded by eight data points, so I thought that at least 5 of the
> surrounding points should be within the reachability distance. But I'm not
> sure that considering only the temperature value is the right approach; I
> may be missing spatial information. How should I deal with longitude and
> latitude data?
>
> The dimensions of sst2 are 152243 rows x 3 columns.
>
> Thanks again
>
> On 03/06/2011 18:24, Christian Hennig wrote:
>> Have you considered the dbscan function in library fpc, or was it another 
>> one?
>> dbscan in fpc doesn't have a "distance" parameter but several options, one
>> of which may resolve your memory problem (look up the documentation of the 
>> "memory" parameter).
>> 
>> Using a distance matrix for hundreds of thousands of points is a recipe for 
>> disaster (memory-wise). I'm not sure whether the function that you used did 
>> that, but dbscan in fpc can avoid it.
>> 
>> It is true that dbscan requires tuning constants that the user has to
>> provide. There is unfortunately no general rule for how to do this; it
>> would be necessary to understand the method and the meaning of the
>> constants, and how this translates into the requirements of your
>> application.
>> 
>> You may try several different choices and do some cluster validation to see 
>> what works, but I can't explain this in general terms easily via email.
>> 
>> Hope this helps at least a bit.
>> 
>> Best regards,
>> Christian
>> 
>> 
>> On Fri, 3 Jun 2011, Paco Pastor wrote:
>> 
>>> Hello everyone,
>>> 
>>> When looking for information about clustering of spatial data in R I was
>>> directed towards DBSCAN. I've read some docs about it and new
>>> questions have arisen.
>>> 
>>> DBSCAN requires some parameters; one of them is "distance". As my data are
>>> three-dimensional (longitude, latitude and temperature), which "distance"
>>> should I use? Which dimension is related to that distance? I suppose it
>>> should be temperature. How do I find such a minimum distance with R?
>>> 
>>> Another parameter is the minimum number of points needed to form a cluster.
>>> Is there any method to find that number? Unfortunately I haven't found one.
>>> 
>>> Searching through Google I could not find an R example of using dbscan on
>>> a dataset similar to mine. Do you know of any website with such
>>> examples, so that I can read them and try to adapt them to my case?
>>> 
>>> The last question is that my first R attempt with DBSCAN (without a proper
>>> answer to the prior questions) resulted in a memory problem. R says it
>>> cannot allocate a vector. I start with a 4 km spaced grid with 779191
>>> points that ends up as approximately 300000 rows x 3 columns (latitude,
>>> longitude and temperature) after removing invalid SST points. Any hints on
>>> addressing this memory problem? Does it depend on my computer or on DBSCAN
>>> itself?
>>> 
>>> Thanks for the patience to read a long and probably boring message and for 
>>> your help.
>>> 
>>> -- 
>>> -----------
>>> Francisco Pastor
>>> Meteorology department, Instituto Universitario CEAM-UMH
>>> http://www.ceam.es
>>> -----------
>>> mail: paco at ceam.es
>>> skype: paco.pastor.guzman
>>> Researcher ID: http://www.researcherid.com/rid/B-8331-2008
>>> Cosis profile: http://www.cosis.net/profile/francisco.pastor
>>> -----------
>>> Parque Tecnologico, C/ Charles R. Darwin, 14
>>> 46980 PATERNA (Valencia), Spain
>>> Tlf. 96 131 82 27 - Fax. 96 131 81 90
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> This message and the attached files are confidential. They contain 
>>> reserved information belonging to our centre and are not to be broadcast. 
>>> If you have received this email by mistake, please delete it from your 
>>> system and alert the sender by returning it to his/her email address. You 
>>> must not copy or divulge the contents of the message to anyone.
>>> 
>>> Your email address and personal data are included in a file belonging to 
>>> the Fundación de la Comunidad Valenciana Centro de Estudios Ambientales 
>>> del Mediterráneo - CEAM, con CIF: G-46957213. The purpose of this file is 
>>> to allow us to keep in contact with you. In accordance with Organic Law 
>>> 15/1999, you are permitted to access, rectify, cancel or oppose the 
>>> contents of this file by submitting a written request, accompanied by a 
>>> photocopy of your DNI, to: Fundación de la Comunidad Valenciana Centro de 
>>> Estudios Ambientales del Mediterráneo - CEAM. C/ Charles R. Darwin, 14. 
>>> Parque Tecnológico.46980 PATERNA (Valencia).
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> *** --- ***
>> Christian Hennig
>> University College London, Department of Statistical Science
>> Gower St., London WC1E 6BT, phone +44 207 679 1698
>> chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
>

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

