[R] Statistical analysis of a large database

Luis Torgo ltorgo at liacc.up.pt
Wed Oct 13 21:03:03 CEST 2004


On Tue, 2004-10-12 at 08:36, Prof Brian Ripley wrote:
> > Luís Torgo, Data Mining with R. Learning by case
> > studies, May 2003
> > http://www.liacc.up.pt/~ltorgo/DataMiningWithR/
> 
> Please note that that reference is not about large datasets, nor about 
> `data mining' in the generally used sense.  It has two studies, one 
> incomplete, on linear regression (with 200 samples) and on time series.

I would like to add some information in response to these incomplete
comments on the book I'm writing. The book is unfinished, as mentioned on
its Web page. It currently has two reasonably finished chapters: an introduction
to R and MySQL and a case study. As mentioned in the book, the first
case study is small by data mining standards (200 observations) and has
the goal of illustrating techniques that are shared by data mining and
other disciplines as well as smoothly introducing the reader to R and
its power. It addresses data pre-processing techniques, data
visualization, model construction (yes, linear regression but also
regression trees), and model evaluation, selection and combination, so I
think it is somewhat inaccurate to say that it is about linear regression,
which accounts for only 5 of the 50 pages of that chapter.
 
The third (unfinished) chapter (2nd case study) is about financial
trading. It includes topics like connections to databases as well as
many other components of a knowledge discovery process. Among those
components is model construction, which obviously involves time
series models given the nature of the data. The chapter will include
other steps, such as moving from predictions to actions and
creating variables from the original time series. It is
currently being rewritten and I expect to upload a revised
version of this chapter soon.

The book will include at least two further case studies that will be
larger. Still, I would note that the financial trading case study is
potentially very large, as it is a problem where data is constantly
growing. The final version of that chapter will address this issue of
having a system that is online in the sense that it is receiving new
data in real time (also known as mining data streams in the data mining
field).

I'm sorry for writing at such length, but I think it is dangerous to try to
summarize around 200 pages of an unfinished work in two lines of text.

Still, all comments on this ongoing project are very welcome, and I
would like to take this opportunity to thank everyone who has been
sending me encouraging comments/emails.

Luis Torgo

-- 
Luis Torgo
  FEP/LIACC, University of Porto   Phone : (+351) 22 607 88 30
  Machine Learning Group           Fax   : (+351) 22 600 36 54
  R. Campo Alegre, 823             email : ltorgo at liacc.up.pt
  4150 PORTO   -  PORTUGAL         WWW   : http://www.liacc.up.pt/~ltorgo



