[R] Histogram over a Large Data Set (DB): How?

Sean Davis sdavis2 at mail.nih.gov
Fri Nov 18 01:03:57 CET 2005


[R] Histogram over a Large Data Set (DB): How?

Have you tried just grabbing the whole column using dbGetQuery?  Try doing 
this:

spams <- dbGetQuery(con,"select unixtime from email limit 1000000")

Then increase from 1,000,000 to 1.5 million, to 2 million, etc.  until you 
break something (run out of memory), if you do at all.

However, the BETTER way to do this, if you already have the data in the 
database is to allow the database to do the histogram for you.  For example, 
to get a count of spams by day, in MySQL do something like:

spams <- dbGetQuery(con,"select date_format(unixtime,'%m/%d/%y'),count(*) 
from email group by date_format(unixtime,'%m/%d/%y')")

Then, you will have a count of the number of spams by date and can plot as 
you like.

Sean

----- Original Message ----- 
From: Eric Eide
To: r-help at stat.math.ethz.ch
Sent: Thursday, November 17, 2005 6:25 PM
Subject: [R] Histogram over a Large Data Set (DB): How?


Hi!  I'm new to R, and I have a question about how R works with large data 
sets
--- in particular, data sets that come from databases.
I'm using R 2.2.0 with the DBI package (0.1-9) and the RMySQL package 
(0.5-5).
My get-my-feet-wet-with-R project is to make a histogram from a data set 
stored
in a MySQL database.  In particular, I have a table that describes some
observed spam emails.  The 'unixtime' column of the table contains the
timestamps of the messages.  My current goal is to plot the number of spams 
per
day during the recording period.
So far, this is my script (edited for brevity), and it works well:
----- 
library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, group="spam")
# "where" clause to limit data set size, as described below.
spams <- dbGetQuery(con, "select unixtime from email where 
LENGTH(email_to)=4")
firstspam <- min(spams$unixtime)
lastspam <- max(spams$unixtime)
# spansize == one day
spansize <- 60 * 60 * 24
firstbreak = floor(firstspam / spansize)
lastbreak = celing(lastspam / spansize)
spambreaks = (firstbreak:lastbreak) * spansize
hist(spams$unixtime, br=spambreaks, plot=TRUE, col="red")
----- 
The "where" clause serves to limit the number of records, while I figure out
the surrounding parts.  And now I'd like to do that... but I'm not sure how!
The actual "email" table has 2.9 million rows (a lot of spam!), so I presume
that I cannot (or at least, shouldn't) read it all at once.  Reading the
documentation of the RMySQL package, I understand that the syntax I want is
something like this:
----- 
rs <- dbSendQuery(con, "select unixtime from email order by unixtime")
out <- dbApply(rs, INDEX = "unixtime",
         FUN = function(x, grp) hist(x$unixtime, ...))
----- 
But I can't quite seem to make this work.  When I try the above directly, I 
get
errors like this:
  Error in mysqlDBApply(res, ...) : unimplemented type 'NULL' in 'length<-'
More generally, I'm pretty sure that I'm misusing `hist' in the code above. 
I
don't want a series of histograms; I want just one histogram, made from the
concatenation of all the records fetched from the table.
I didn't find a recipe for doing this sort of thing in the R FAQ or in the
archives for this mailing list, so I'm hoping that someone can set me on the
right track.  How does one code in R with functions that expect "whole"
vectors, and somehow provide those vectors in a "piecemeal" fashion?  Is 
there
a general recipe for this situation in R (aren't large data sets common?), 
or
do I need to code the histogram generator myself?
Thanks for any help! --- 
Eric.
-- 
-------------------------------------------------------------------------------
Eric Eide <eeide at cs.utah.edu>  .         University of Utah School of 
Computing
http://www.cs.utah.edu/~eeide/ . +1 (801) 585-5512 voice, +1 (801) 581-5843 
FAX
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html




More information about the R-help mailing list