[R] ggplot2 / reshape / Question on manipulating data

Thu Jul 12 20:15:30 CEST 2007

On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
> "hadley wickham" <h.wickham at gmail.com> writes:
>
> > On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
> >> I'm an R newbie but recently discovered the ggplot2 and reshape
> >> packages which seem incredibly useful and much easier to use for a
> >> beginner.  Using the data from the IMDB, I'm trying to see how the
> >> average movie rating varies by year.  Here is what my data looks like:
> >>
> >> > ratings <- read.delim("groomed.list", header = TRUE, sep = "|", comment.char = "")
> >> > ratings <- subset(ratings, VoteCount > 100)
> >> > head(ratings)
> >>                              Title  Histogram VoteCount VoteMean Year
> >> 1                !Huff (2004) (TV) 0000000016       299      8.4 2004
> >> 8              'Allo 'Allo! (1982) 0000000125       829      8.6 1982
> >> 50              .hack//SIGN (2002) 0000001113       150      7.0 2002
> >> 56            1-800-Missing (2003) 0000000103       118      5.4 2003
> >> 66  Greatest Artists (2000) (mini) 00..000016       110      7.8 2000
> >> 77 00 Scariest Movie (2004) (mini) 00..000115       256      8.6 2004
> >
> > Have you tried using the movies dataset included in ggplot?  Or is
> > there some data that you want that is not in that dataset?
>
> It's funny that you mention this because I had intended to write this
> email about a month ago but was delayed due to other reasons.  In any
> case, when I was typing this up last night, I wanted to recreate my
> steps but I could not find the IMDB movie data I had used originally.
> I searched everywhere to no avail so I downloaded the data myself and
> groomed it.  Only now do I remember that I had used the movies dataset
> included in ggplot.
>
> >> How do 'byYear' and 'byYear2' differ?  I am trying to use 'typeof' but
> >> both seem to be lists.  However, they are clearly different in some
> >> way because 'qplot' graphs them differently.
> >
> > Try using str - it's much more helpful, and you should see the
> > difference quickly.
>
> Thanks!  This is the function I've been looking for in my quest to
> learn about internal data types of R.  Too bad it has such a terrible
> name!
>
> > Using the built-in movies data:
> >
> > mm <- melt(movies, id=1:2, m=c("rating", "votes"))
> > msum <- cast(mm, year ~ variable, c(mean, sum))
> >
> > qplot(year, rating_mean, data=msum, colour=votes_sum)
> > qplot(year, rating_mean, data=msum, colour=votes_sum, geom="line")
>
> Great!  This is exactly what I was looking to do.  By the way, does
> any of your documentation use the movie dataset as an example?  I'm
> curious what else I can do with the dataset.  For example, how can I
> use ggplot's facets to see the same information by type of movie?  I'm
> unsure of how to manipulate the binary variables into a single
> variable so that it can be treated as levels.
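
For readers without reshape installed, the yearly summary quoted above
can be approximated in base R.  This is only a sketch on a made-up toy
data frame (the name `toy` and its values are assumptions, not the real
movies data), and it computes only the two columns used in the plots,
rating_mean and votes_sum:

```r
# Toy stand-in for the movies data (hypothetical values).
toy <- data.frame(
  year   = c(1990, 1990, 2000, 2000),
  rating = c(6.0, 8.0, 5.0, 7.0),
  votes  = c(100, 300, 50, 150)
)

# One summary value per year: mean rating and total votes.
rating_mean <- tapply(toy$rating, toy$year, mean)
votes_sum   <- tapply(toy$votes, toy$year, sum)

# Assemble a data frame with one row per year.
msum_toy <- data.frame(
  year        = as.numeric(names(rating_mean)),
  rating_mean = as.vector(rating_mean),
  votes_sum   = as.vector(votes_sum)
)
print(msum_toy)
```

The resulting data frame can be passed to qplot in the same way as msum.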

A lot of the examples do use the movies data, but I don't think any of
them are particularly revealing.  You might want to look at the results
for the 2007 infovis visualisation challenge
(http://www.apl.jhu.edu/Misc/Visualization/), which uses similar data.
Submission isn't complete yet, but you can see my team's entry at
http://had.co.nz/infovis-2007/.  There are lots of interesting stories
to pursue.

I think I will update the movies data to include the first genre as
another column.  That will make it easier to facet by genre.
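
In the meantime, a "first genre" column can be derived by hand from the
0/1 indicator columns.  The following is a rough sketch in base R on a
made-up toy data frame; `movies_toy`, its values, and the choice of
three genre columns are assumptions for illustration, and a movie with
no genre flagged gets NA:

```r
# Toy stand-in for the movies data: one 0/1 indicator column per genre
# (hypothetical values, only three genres shown).
movies_toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  year   = c(1990, 1990, 2000, 2000),
  rating = c(6.0, 7.5, 5.0, 8.0),
  Action = c(1, 0, 0, 0),
  Comedy = c(1, 1, 0, 0),
  Drama  = c(0, 1, 1, 0)
)
genres <- c("Action", "Comedy", "Drama")

# For each row, the index of the first indicator equal to 1 (NA if none).
first_idx <- apply(movies_toy[genres] == 1, 1, function(x) which(x)[1])

# Map the index back to the genre name and store it as a factor.
movies_toy$genre <- factor(genres[first_idx], levels = genres)
print(movies_toy[c("title", "genre")])
```

With a column like this, a call along the lines of
qplot(year, rating, data = movies_toy, facets = . ~ genre) should draw
one panel per genre, assuming a ggplot2 version whose qplot accepts a
facets argument.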

Hadley