[R] polymorphic functions in ggplot? (WAS Re: Drawing rectangles in multiple panels)

Stephen Tucker brown_emu at yahoo.com
Sat Jul 14 23:38:34 CEST 2007


Regarding your earlier statement,

"I tend to think in very data centric approach, where you first generate the
data (in a data frame) and then you plot it. There is very little data
creation/modification during the plotting itself..."

Are data generation and plotting truly separate and sequential? I'm
not entirely clear on this point, since statistical
transformations/operations return objects that require new variables
to be created. This may be rooted in the semantics (verbal, not
computational) of the grammar of graphics. The online book draft of
'ggplot' says (p. 37):

"The explicit transformation stage was dropped because variable
transformations are already so easy in R: they do not need to be part
of the grammar."

In my understanding of what transformations are defined to be, they
involve statistical ones - which perhaps I'm not truly getting because
transformations are defined (by L. Wilkinson) as a mapping of elements
of one set to elements of the same set, and yet a function like
median() will accept a list (of values) and return a single
value... in any case maybe there is a distinction between a
statistical 'transformation' and a statistical 'operation' that I've
missed, but statistical 'transformations' are included in ggplot's
"stat" functions. L. Wilkinson also seems to include an explicit TRANS
specification at times (for example, in the case of the boxplot on
p.60) and at other times nest it into the ELEMENT specification (for
example, the histogram on p. 47).
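
To make the distinction concrete, here is a minimal sketch in base R
(my own illustration, not taken from Wilkinson's book; the variable
names are made up). A transformation such as log10() returns one value
per input element, whereas median() collapses the whole vector to a
single value:

```r
x <- c(1, 10, 100, 1000)

# Element-wise 'transformation': the output has the same length as the input
transformed <- log10(x)

# Summary 'operation': many values in, one value out
summarised <- median(x)

length(transformed) == length(x)  # TRUE
length(summarised)                # 1
```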

In any case, I interpret that the following progression is achieved
through 'data operations' and 'application of algebra' in the language
of L. Wilkinson and through I/O, merge, reshape, and other functions
in R:

source object -> variables -> varset

A statistic might then be computed on the varset, which will return
another source object (true in R as well: e.g., class 'histogram' or
'lm') from which variables can again be extracted, varsets
constructed, etc. to yield a list of tuples to be associated with
geometrical and aesthetic attributes. Indeed, in the bootstrap
example, L. Wilkinson begins by extracting variables from a bootstrap
function on another variable that has not explicitly been created from
source (dataset).
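
As a rough sketch of that cycle in base R (the names source_obj,
stats_obj and varset are only illustrative, and I'm assuming a
histogram as the statistic): the statistic returns a new object of
class 'histogram', from which variables are extracted again and
assembled into a new data frame.

```r
set.seed(1)

# source object -> variable
source_obj <- data.frame(x = rnorm(100))
variable <- source_obj$x

# A statistic computed on the variable returns another 'source object'
stats_obj <- hist(variable, breaks = 20, plot = FALSE)
class(stats_obj)  # "histogram"

# Variables are extracted from that object and combined into a new varset,
# i.e. a list of tuples ready to be mapped to geometry and aesthetics
varset <- data.frame(mid = stats_obj$mids, count = stats_obj$counts)
head(varset)
```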

So it's not clear to me that the data creation step is necessarily
distinct from the plotting, as it is more (but not completely) so in
the traditional graphics system:

## DATA specification
variable <- rnorm(100)
## TRANS specification
statsObj <- hist(variable,nclass=20,plot=FALSE)
## Transformed data is plotted (variables extracted implicitly and
## associated with default geometry/aesthetic mappings)
plot(statsObj)

Below is an analogous plot in ggplot, where the creation of the
summary object occurs as part of the grammar:

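library(ggplot2)  # provides ggplot(), aes(), stat_bin()
## Reuses 'variable' and 'statsObj' from the traditional graphics example
ggplot(data = data.frame(variable), mapping = aes(x = variable)) +
  stat_bin(breaks = statsObj$breaks)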

Since not all statistical transformations/operations are handled by
ggplot, it seems that working with non-data-frame objects (for
example, of class 'nls' or 'rlm') requires data operations (p.7) (to
extract fitted values, etc.). Of course, R provides these facilities,
but the plotting functions in the traditional graphics system
accommodate a number of object classes through polymorphic
functions. I wonder if in a similar way for ggplot, stat_bin could
accept objects of 'histogram' class [hist() allows the user to specify
'nclass', which will then compute the breaks], or stat_smooth could
accept 'rlm' objects. Of course, in the case of an 'lm' object, plot()
additionally gives diagnostic (residual and Q-Q) plots but that type of
response does not seem to fit in with the expected behavior of ggplot
functions...
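
To illustrate the kind of dispatch I have in mind, here is a minimal
sketch using S3 methods in base R. The generic as_plot_data() and its
methods are hypothetical names of my own, not part of ggplot; the idea
is only that each object class knows how to hand back a plain data
frame that a data-frame-centric plotting function could then consume.

```r
# Hypothetical generic: convert a statistical object into a data frame
as_plot_data <- function(obj, ...) UseMethod("as_plot_data")

# Method for objects returned by hist(..., plot = FALSE)
as_plot_data.histogram <- function(obj, ...) {
  data.frame(xmin  = head(obj$breaks, -1),
             xmax  = tail(obj$breaks, -1),
             mid   = obj$mids,
             count = obj$counts)
}

# Method for 'lm' objects: fitted values and residuals
as_plot_data.lm <- function(obj, ...) {
  data.frame(fitted = fitted(obj), resid = residuals(obj))
}

set.seed(42)
variable <- rnorm(100)
hist_df <- as_plot_data(hist(variable, nclass = 20, plot = FALSE))

d <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
fit_df <- as_plot_data(lm(y ~ x, data = d))
```

Since S3 dispatch follows the class vector, an object whose class
inherits from 'lm' would fall through to the 'lm' method unless a more
specific method were written for it.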


--- hadley wickham <h.wickham at gmail.com> wrote:

> On 7/12/07, Deepayan Sarkar <deepayan.sarkar at gmail.com> wrote:
> > On 7/11/07, hadley wickham <h.wickham at gmail.com> wrote:
> > > > A question/comment: I have usually found that the subscripts argument is
> > > > what I need when passing *external* information into the panel function, for
> > > > example, when I wish to add results from a fit done external to the trellis
> > > > call. Fits[subscripts] gives me the fits (or whatever) I want to plot for
> > > > each panel. It is not clear to me how the panel layout information from
> > > > panel.number(), etc. would be helpful here instead. Am I correct? -- or is
> > > > there a smarter way to do this that I've missed?
> > >
> > > This is one of things that I think ggplot does better - it's much
> > > easier to plot multiple data sources.  I don't have many examples of
> > > this yet, but the final example on
> > > http://had.co.nz/ggplot2/geom_abline.html illustrates the basic idea.
> >
> > That's probably true. The Trellis approach is to define a plot by
> > "data source" + "type of plot", whereas the ggplot approach (if I
> > understand correctly) is to create a specification for the display
> > (incrementally?) and then render it. Since the specification can be
> > very general, the approach is very flexible. The downside is that you
> > need to learn the language.
> 
> Yes, that's right.  ggplot basically decomposes "type of plot" into
> statistical transformation (stat) + geometric object and allows you to
> control each component separately.  ggplot also explicitly includes
> the idea of layers (ie. one layer is a scatterplot and another layer
> is a loess smooth) and allows you to supply different datasets to
> different layers.
> 
> > On a philosophical note, I think the apparent limitations of Trellis
> > in some (not all) cases is just due to the artificial importance given
> > to data frames as the one true container for data. Now that we have
> > proper multiple dispatch in S4, we can write methods that behave like
> > traditional Trellis calls but work with more complex data structures.
> > We have tried this in one bioconductor package (flowViz) with
> > encouraging results.
> 
> That's one area which I haven't thought much about.  ggplot is very
> data.frame centric and it's not yet clear to me how plotting a linear
> model (say) would fit into the grammar.
> 
> Hadley
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
