[R] help formatting data for clustering

Tue Nov 13 23:38:27 CET 2012

This is easier if you read the data into a list instead of creating a
data frame, since the number of values on each row is different. You may
be able to modify this to fit your needs. The steps are: 1) read the file
with readLines(); 2) split the lines into numeric vectors (one for each
line); 3) repeat the first column (id) once for each brand in the line
and build a data.frame with the column names id and brand; 4) use table()
to cross-tabulate id by brand, which gives the number of times each brand
appears for each user; 5) cluster using the table or, if necessary,
convert it to a data frame (this will add X to the front of each brand
number, since data frame column names cannot start with a digit).

dta <- readLines(con=stdin(), n=3)
1 , 45 , 32, 45, 23
2 , 34
4, 11, 43, 45

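# split on commas, allowing optional spaces around each comma
lst <- strsplit(dta, "\\s*,\\s*")
# lapply() always returns a list, even when every line has the same length
lst <- lapply(lst, as.numeric)
# for each line, pair the id (first value) with each brand on that line
a <- lapply(seq_along(lst), function(x) cbind(rep(lst[[x]][1],
      length(lst[[x]])-1), lst[[x]][-1]))
a <- data.frame(do.call(rbind, a))
colnames(a) <- c("id", "brand")
# rows are ids, columns are brands, cells are counts (0 if never ordered)
newdat <- table(a$id, a$brand)
# optional: convert to a data frame (brand columns become X11, X23, ...)
newdf <- data.frame(unclass(newdat))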
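
Step 5 is not shown above, so here is a minimal sketch of one way it
could go. It rebuilds the count table from the three example lines so it
runs on its own, converts the counts to row proportions, and clusters
with dist() and hclust(). The names counts, props, d, hc and groups are
made up for illustration, and the choices of row proportions, Euclidean
distance, average linkage and k = 2 are assumptions, not recommendations
for your data.

```r
# Example data from the original question (one user per line).
dta <- c("1 , 45 , 32, 45, 23", "2 , 34", "4, 11, 43, 45")

# Split on commas with optional surrounding whitespace.
lst <- lapply(strsplit(dta, "\\s*,\\s*"), as.numeric)

# Long format: one row per (id, brand) pair.
a <- data.frame(id    = rep(sapply(lst, `[`, 1), lengths(lst) - 1),
                brand = unlist(lapply(lst, `[`, -1)))

# User-by-brand count matrix; brands a user never ordered are 0.
counts <- unclass(table(a$id, a$brand))

# Assumption: scale each row to proportions so that users with many
# orders do not dominate the distances.
props <- prop.table(counts, margin = 1)

# Euclidean distances between users, then average-linkage clustering.
d  <- dist(props, method = "euclidean")
hc <- hclust(d, method = "average")

# Cut the tree into k groups (k = 2 is arbitrary for this tiny example).
groups <- cutree(hc, k = 2)
```

With real data you would probably look at plot(hc) before picking k, and
a different distance (for example a binary one on presence/absence) may
suit brand data better.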

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Raphael Bauduin
Sent: Tuesday, November 13, 2012 4:47 AM
To: r-help at r-project.org
Subject: [R] help formatting data for clustering

Hi,

I'm an R beginner. I have data of this form:

user_id, brand_id1, brand_id2, .....

for example:
1 , 45 , 32, 45, 23
2 , 34
4, 11, 43, 45

I'm looking for the right procedure to be able to cluster users. I am
especially interested to know which functions to use at each step.

I am currently able to load the data in a data frame, each row's name
being the user id.

# extract user brands, i.e. all columns except the first
user_brands <- userclustering[,-1]

# extract user ids, i.e. the first column
user_ids  <- userclustering[,1]

# set user ids as row names
row.names(user_brands) <- user_ids

But now I'm stuck replacing the brand ids by a count for each brand the
user ordered, all other brand counters being implicitly 0 for that
user.

Then I'll need to be sure I can use it for clustering (normalising,
correct handling of brands absent from a user's list, etc).

thanks in advance for your help!

Raph
