[R] Re: Problems for 13 year old

Jim_Garrett@bd.com Jim_Garrett at bd.com
Mon Jan 27 17:11:03 CET 2003


How about spam filtering?

Granted, there's some infrastructure involved, which means gratification is
not instant.  But it involves something that most people who use computers
care about:  e-mail, and spam.

I mention this because the following web site sparked some interest in
statistics among some acquaintances who were otherwise very cool to it:

     http://www.paulgraham.com/spam.html

This outlines a "Bayesian" spam filter.  I'm not sure it's wholly Bayesian,
but it comes close, the author's explanations are good, and I hear that it
performs well, in fact better than many commercial spam filters.
Moreover, the web site virtually gushes about the virtues of statistical
methods.  The interesting thing about the filter is that you get to see
what "features" it's discovering.
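
To make the idea concrete, here is a rough sketch in R of a token-based
filter in that spirit.  It is a simplified toy, not the algorithm from the
essay: the training messages are invented, and the function names
(tokenize, doc_freq, token_prob, token_evidence, spam_score), the 0.4 value
for unseen words, and the 0.01/0.99 clamps are all assumptions made for
illustration.

```r
# Toy training messages, invented purely for illustration.
spam <- c("buy cheap pills now", "cheap loans click now", "click to win money")
ham  <- c("meeting moved to monday", "lunch on monday",
          "draft of the report attached")

# Split a message into distinct lower-case word tokens.
tokenize <- function(msg) {
  tok <- strsplit(tolower(msg), "[^a-z]+")[[1]]
  unique(tok[nzchar(tok)])
}

# Fraction of messages in a corpus that contain each token.
doc_freq <- function(msgs) {
  tab <- table(unlist(lapply(msgs, tokenize)))
  setNames(as.numeric(tab) / length(msgs), names(tab))
}

spam_freq <- doc_freq(spam)
ham_freq  <- doc_freq(ham)

# Estimated probability that a message containing this token is spam,
# clamped away from 0 and 1 so no single word is treated as decisive.
token_prob <- function(tok) {
  s <- if (tok %in% names(spam_freq)) spam_freq[[tok]] else 0
  h <- if (tok %in% names(ham_freq))  ham_freq[[tok]]  else 0
  if (s + h == 0) return(0.4)  # unseen word: assumed mildly innocent
  min(0.99, max(0.01, s / (s + h)))
}

# The "features": each word in a message with its spam probability.
token_evidence <- function(msg) {
  sort(vapply(tokenize(msg), token_prob, numeric(1)), decreasing = TRUE)
}

# Combine the n_top most informative words (furthest from 0.5) into one
# score, treating the words as independent pieces of evidence.
spam_score <- function(msg, n_top = 15) {
  p <- vapply(tokenize(msg), token_prob, numeric(1))
  p <- head(p[order(abs(p - 0.5), decreasing = TRUE)], n_top)
  prod(p) / (prod(p) + prod(1 - p))
}

print(token_evidence("click now to read the monday report"))
print(spam_score("cheap pills, click now"))
print(spam_score("lunch meeting on monday"))
```

Printing token_evidence() for a message is the pedagogically interesting
part: it shows which words pushed the score toward spam and which pushed it
away.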

A quick search also indicated that Mozilla apparently offers a plug-in for
the same spam filter.  That would offer a quick way to get the filter up
and running with real e-mail.  But I don't know if Mozilla offers
interesting diagnostics about which features it's using, which is the
pedagogically interesting part.  Mozilla mentions it here:

     http://www.mozilla.org/mailnews/spam.html

Of course, you can use any number of classification techniques to
distinguish spam from other e-mail; you just need data.  Hastie,
Tibshirani, and Friedman's _The Elements of Statistical Learning_
demonstrates a couple of types of models applied to the spam problem, and
points to data at

     ftp.ics.uci.edu
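
As a hedged illustration of the "any classifier will do" point, the sketch
below fits a logistic regression with glm().  Logistic regression is just
one simple choice here, not necessarily a model the book uses.  The data
are simulated, NOT real e-mail, and the column names (is_spam, freq_free,
freq_excl) are hypothetical stand-ins for word- and character-frequency
features.

```r
set.seed(1)
n <- 200

# Simulated labels and features (not real e-mail data): spam messages are
# assumed to use the word "free" and the character "!" more often.
is_spam   <- rbinom(n, 1, 0.4)
freq_free <- rexp(n, rate = ifelse(is_spam == 1, 1, 5))
freq_excl <- rexp(n, rate = ifelse(is_spam == 1, 0.5, 3))
mail <- data.frame(is_spam, freq_free, freq_excl)

# Hold out the last 50 messages to check the fitted model.
train <- mail[1:150, ]
test  <- mail[151:200, ]

# Logistic regression of the spam label on the two frequency features.
fit <- glm(is_spam ~ freq_free + freq_excl, data = train, family = binomial)
print(coef(fit))  # which features the model leans on, and in which direction

# Classify held-out messages as spam when the fitted probability exceeds 0.5.
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
accuracy <- mean(pred == test$is_spam)

print(table(predicted = pred, actual = test$is_spam))
print(accuracy)
```

With real data, the same steps apply once the messages have been turned
into a data frame of per-message features and a spam/not-spam label.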

Ideally, you would do some exploration to design a filter, implement it in
R, and then integrate it with your nephew's e-mail program.  This would be
a long-term project, maybe even a science-fair project, with long-term
benefits (educational and practical).  I know this can be done with Linux,
but I have no idea about Mac OS 9!  It's probably a stretch for typical
13-year-olds, but for the right 13-year-old, it would be a blast.

Good luck!

Jim Garrett
Baltimore, Maryland, USA

