[R] ggplot2 / reshape / Question on manipulating data

Pete Kazmier pete-expires-20070910 at kazmier.com
Thu Jul 12 07:01:21 CEST 2007

I'm an R newbie but recently discovered the ggplot2 and reshape
packages which seem incredibly useful and much easier to use for a
beginner.  Using the data from the IMDB, I'm trying to see how the
average movie rating varies by year.  Here is what my data looks like:

> ratings <- read.delim("groomed.list", header = TRUE, sep = "|", comment.char = "")
> ratings <- subset(ratings, VoteCount > 100)
> head(ratings)
                             Title  Histogram VoteCount VoteMean Year
1                !Huff (2004) (TV) 0000000016       299      8.4 2004
8              'Allo 'Allo! (1982) 0000000125       829      8.6 1982
50              .hack//SIGN (2002) 0000001113       150      7.0 2002
56            1-800-Missing (2003) 0000000103       118      5.4 2003
66  Greatest Artists (2000) (mini) 00..000016       110      7.8 2000
77 00 Scariest Movie (2004) (mini) 00..000115       256      8.6 2004

The above data is not aggregated.  So after playing around with basic
R functionality, I stumbled across the 'aggregate' function and was
able to see the information in the manner I desired (average movie
rating by year).

> byYear <- aggregate(ratings$VoteMean, list(Year = ratings$Year), mean)
> plot(byYear)
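
For reference, here is a minimal sketch of what that call produces, run
on a made-up data frame ('toy' and its values are hypothetical, not my
IMDB data).  When 'aggregate' is given a bare vector, it names the
result column 'x', which is why 'x' shows up in the 'qplot' call further
down:

```r
# Hypothetical stand-in for the real ratings data
toy <- data.frame(
  Year     = c(2000, 2000, 2001, 2001),
  VoteMean = c(6, 8, 5, 7)
)

# Same call shape as above: mean of VoteMean within each Year
by_year <- aggregate(toy$VoteMean, list(Year = toy$Year), mean)

# The grouping column keeps the name given in the list ("Year");
# the aggregated column is called "x" by default
print(names(by_year))
print(by_year)
```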

Having just discovered ggplot2, I wanted to create the same graph but
augment it with a color attribute based on the total number of votes
in a year.  So first I tried to see if I could reproduce the above:

> library(ggplot2)
> qplot(Year, x, byYear)

This did not work as expected: the x-axis was labeled with each and
every year, which made it impossible to read, whereas the plot created
with basic R had nice x-axis labels.  How do I get 'qplot' to treat the
x-axis in a similar manner to 'plot'?

After playing around further, I was able to get 'qplot' to work in a
manner similar to 'plot' with regard to the x-axis labels by using
'melt' and 'cast'.  The 'qplot' now behaves correctly:

> mratings <- melt(ratings, id = c("Title", "Year"), measure = c("VoteCount", "VoteMean"))
> byYear2 <- cast(mratings, Year ~ variable, mean, subset = variable == "VoteMean")
> qplot(Year, VoteMean, data = byYear2)

How do 'byYear' and 'byYear2' differ?  I tried 'typeof', but it reports
both as lists.  However, they are clearly different in some way because
'qplot' graphs them differently.
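
A possible way to dig further, sketched on made-up data ('toy' is
hypothetical): 'typeof' only reports the storage type, and every data
frame is stored as a list, so 'str' and 'class' on the individual
columns are probably more telling.  My guess, which is an assumption I
have not confirmed, is that the 'Year' column is a factor in one object
and a number in the other; the conversion below is a no-op if it is
already numeric:

```r
# Hypothetical stand-in for the real ratings data
toy <- data.frame(
  Year     = c(2000, 2000, 2001, 2001),
  VoteMean = c(6, 8, 5, 7)
)
by_year <- aggregate(toy$VoteMean, list(Year = toy$Year), mean)

# typeof() says "list" for any data frame, so look at the columns instead
print(typeof(by_year))
str(by_year)
print(class(by_year$Year))

# If Year turned out to be a factor, go through character to recover
# the year values rather than the internal level codes
if (is.factor(by_year$Year)) {
  by_year$Year <- as.numeric(as.character(by_year$Year))
}
```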

Finally, I'd like to use a color attribute to 'qplot' to augment each
point with a color based on the total number of votes for the year.
Using attributes with 'qplot' seems simple, but I'm having a hard time
grooming my data appropriately.  I believe this requires aggregation
by summing the VoteCount column.  Is there a way to cast the data
using different aggregation functions for various columns?  In my
case, I want the mean of the VoteMean column, and the sum of the
VoteCount column.  Then I want to produce a graph showing the average
movie rating per year but with each point colored to reflect the total
number of votes for that year.  Any pointers?
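
To make the goal concrete, here is a minimal sketch of the result I am
after, again on made-up data ('toy', 'mean_by_year', 'sum_by_year' and
'by_year3' are hypothetical names).  It sidesteps 'cast' entirely and
uses two base R 'aggregate' calls joined with 'merge', and it draws the
plot with 'ggplot' and 'geom_point' rather than 'qplot', so it is a
workaround under those assumptions and not necessarily the intended
reshape approach:

```r
# Hypothetical stand-in for the real ratings data
toy <- data.frame(
  Year      = c(2000, 2000, 2001, 2001),
  VoteCount = c(100, 300, 200, 400),
  VoteMean  = c(6, 8, 5, 7)
)

# One aggregation per column, each with its own function
mean_by_year <- aggregate(VoteMean ~ Year, data = toy, FUN = mean)
sum_by_year  <- aggregate(VoteCount ~ Year, data = toy, FUN = sum)

# Join the two summaries on Year: one row per year, both columns
by_year3 <- merge(mean_by_year, sum_by_year, by = "Year")
print(by_year3)

# Plot only if ggplot2 is installed; point colour follows total votes
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    by_year3,
    ggplot2::aes(x = Year, y = VoteMean, colour = VoteCount)
  ) +
    ggplot2::geom_point()
  print(p)
}
```

Note that the mean here is a plain average of the per-title means; it
does not weight titles by their vote counts.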

