[R] Help deciding on data format for sales data (newbie)

Daniel Malter daniel at umd.edu
Tue Jun 2 18:18:30 CEST 2009


Depending on what you want to do, the second format (wide) may be useful, but
it can get very clumsy if there are many potential items. You may want to
consider the long format:

OrderNo  Customer  Item
1        1         a
1        1         b
1        1         c
2        1         d
3        2         a
3        2         d

The long (above) and the wide format (your second example) are the most
widely used and most software packages have functions to reshape the dataset
from one into the other.
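
In R, one way to get from the comma-separated column in your first example to
the long format, and from there to a wide layout, might look like the sketch
below. It uses only base R (strsplit, rep, table, reshape). The names
"orders", "long", "wide_counts", "wide_df" and the helper column "Bought" are
made up for illustration and are not from your data.

```r
# Orders as in the original question: one row per order, items comma-separated
orders <- data.frame(
  OrderNo      = c(1, 2, 3),
  CustomerNo   = c(1, 1, 2),
  ItemsInOrder = c("a,b,c", "d", "a,d"),
  stringsAsFactors = FALSE
)

# Split each item string into a character vector (one list element per order)
items   <- strsplit(orders$ItemsInOrder, ",", fixed = TRUE)
n_items <- sapply(items, length)

# Long format: one row per (order, item) pair
long <- data.frame(
  OrderNo  = rep(orders$OrderNo, n_items),
  Customer = rep(orders$CustomerNo, n_items),
  Item     = unlist(items),
  stringsAsFactors = FALSE
)

# Wide format, variant 1: order-by-item table of counts (0 where not bought)
wide_counts <- table(OrderNo = long$OrderNo, Item = long$Item)

# Wide format, variant 2: reshape() with NA where an item was not bought.
# "Bought" is a hypothetical indicator column added only for reshaping.
long$Bought <- 1
wide_df <- reshape(long, idvar = c("OrderNo", "Customer"),
                   timevar = "Item", direction = "wide")
```

Note that table() fills the gaps with 0 while reshape() fills them with NA, as
in your second example. Either way the wide result has one column per distinct
item, which is what makes it clumsy when there are many items.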

You may also want to think about reducing the dimensionality of your
data. If, say, "a" were a right shoe and "b" were a left shoe, then "a" is
almost always sold with "b." Then you would want to create one indicator
from the two (admittedly, a pretty stupid example) to reduce the number of
unique items. Alternatively, you may also be interested in only a
subset of the items (e.g. shoes only).
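
As a rough sketch of both ideas in base R, starting from the long format: the
mapping "item_group" (which treats "a" and "b" as one product "ab") and the
choice of c("a", "b") as the subset of interest are purely hypothetical.

```r
# Long-format data as in the table above
long <- data.frame(
  OrderNo  = c(1, 1, 1, 2, 3, 3),
  Customer = c(1, 1, 1, 1, 2, 2),
  Item     = c("a", "b", "c", "d", "a", "d"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping from items to coarser groups: "a" and "b" become "ab"
item_group <- c(a = "ab", b = "ab", c = "c", d = "d")
long$Group <- unname(item_group[long$Item])

# Collapse to one row per (order, group); merging items can create duplicates
reduced <- unique(long[, c("OrderNo", "Customer", "Group")])

# Alternatively, keep only the items of interest
shoes_only <- long[long$Item %in% c("a", "b"), ]
```

Here "reduced" has three distinct groups instead of four items, and
"shoes_only" keeps just the rows for the chosen items.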

Hope this helps,
Daniel

-------------------------
cuncta stricte discussurus
-------------------------

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of jonathanbriggs
Sent: Tuesday, June 02, 2009 11:47 AM
To: r-help at r-project.org
Subject: [R] Help deciding on data format for sales data (newbie)


Dear All

Beginning data mining and need some help working out the best way to
represent data. I have searched here and online and not found any real help.
Imagine that I have a file of order (sales) data:

OrderNo  CustomerNo  ItemsInOrder
1        1           a,b,c
2        1           d
3        2           a,d

I can represent this as a data.frame but then need to parse my ItemsInOrder?
This seems quite clumsy. Alternatively I can try this sort of representation

OrderNo  CustomerNo  a   b   c   d
1        1           1   1   1   NA
2        1           NA  NA  NA  1
3        2           1   NA  NA  1

Are these really the two choices and how well does the second representation
scale? (I have 50,000 SKUs)

Can anyone point me in the direction of some worked examples that take such
data and manipulate it; looking for association rules and clusters?

Thanks

Jonathan
--
View this message in context:
http://www.nabble.com/Help-deciding-on-data-format-for-sales-data-%28newbie%29-tp23835331p23835331.html
Sent from the R help mailing list archive at Nabble.com.
