[R] Data views (Re: (Another) Bates fortune?)

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Sun Feb 7 21:40:21 CET 2010


Note: this post was motivated more by the "hierarchical data" subject
than by Douglas Bates's aside, but it might be of interest to its
respondents.

On Friday, 5 February 2010 at 21:56 +0100, Peter Dalgaard wrote:
> Peter Ehlers wrote:
> > I vote to 'fortunize' Doug Bates on
> > 
> >  Hierarchical data sets: which software to use?
> > 
> > "The widespread use of spreadsheets or SPSS data sets or SAS data sets
> > which encourage the "single table with a gargantuan number of columns,
> > most of which are missing data in most cases" approach to organization
> > of longitudinal data is regrettable."
> > 
> > http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430 
> > 
> > 
> 
> Hmm, well, it's not like "long format" data frames (which I actually 
> think are more common in connection with SAS's PROC MIXED) are much 
> better. Those tend to replicate base data unnecessarily - "as if rats 
> change sex with millisecond resolution".

[ Note to Achim Zeileis: the "rats changing sex with millisecond
resolution" quote is well worth a nomination to "fortune" fame; it
seems it is not one already... ]

>                                           The correct data structure 
> would be a relational database with multiple levels of tables, but, to 
> my knowledge, no statistical software, including R, is prepared to deal 
> with data in that form.

Well, I can think of two exceptions:

- BUGS, in its various incarnations (WinBUGS, OpenBUGS, JAGS), does not
require its data to come from a single source. For example, when
programming a hierarchical model (a.k.a. mixed-effects model),
individual-level variables may come from one source and the various
group-level variables from others. Quite handy: no prior merge()
required. Now, writing (and debugging!) such models in BUGS is another
story...

- SAS has had the concept of a "data view" for a long time, its most
useful incarnation being a "data view" of an SQL view. Again, this
avoids the need to actually merge the datasets (which, AFAICR, is a
serious pain in the @$$ in SAS (maybe that's the *real* etymology of
the name?)).

This problem has bugged me for a while. I think that the concept of a
"data view" is right (after all, it is one of the core concepts of SQL
for a reason...), but implementing it *cleanly* in R is probably hard
work. Using a DBMS to maintain tables and views and to query them
"just at the right time" does help, but the ability to use such DBMS
data without importing them into R is, AFAIK, currently lacking.

Once upon a time, a very old version of RPgSQL (a Bioconductor
package) aimed at such a representation: it created objects inheriting
from data.frame to represent Postgres-based data, allowing these data
to be used "transparently". This package dropped into oblivion when its
creator and sole maintainer became unable to maintain it further.

As far as I understand it, the DBI specification *might* allow the
creation of such objects, but I am not aware of any driver actually
implementing that.

In fact, there are two elements of a solution to this problem:
a) creation of (abstract) objects representing data collections as data
frames, with the same properties, but not requiring the creation of an
actual data frame. As far as my (very poor) object-oriented knowledge
goes, these objects should, in C++/Python parlance, inherit from
data.frame.
b) creation of objects implementing various realizations of the
objects defined in a): DBMS querying, actual data.frame querying (here
I'm thinking of sqldf, which works in the reverse direction, allowing
R data frames to be queried in SQL. Quite handy...), etc.
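To make a) and b) concrete, here is a deliberately naive sketch in
Python (class name, table and columns all invented for the purpose; a
real R implementation would inherit from data.frame instead): a
read-only proxy object that looks like a small table but pulls its
contents from the DBMS only when actually indexed.

```python
import sqlite3

class DBView:
    """A read-only, data.frame-like proxy: nothing is fetched
    until the object is actually indexed."""
    def __init__(self, con, table):
        self.con, self.table = con, table
        # Query zero rows just to learn the column names.
        cur = con.execute(f"SELECT * FROM {table} LIMIT 0")
        self.columns = [d[0] for d in cur.description]

    def __len__(self):
        return self.con.execute(
            f"SELECT COUNT(*) FROM {self.table}").fetchone()[0]

    def __getitem__(self, column):
        # One column at a time, pulled from the DBMS on demand.
        # (A sketch only: real code would validate names, not
        # interpolate them into SQL.)
        return [r[0] for r in self.con.execute(
            f"SELECT {column} FROM {self.table}")]

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rats (rat_id INTEGER, sex TEXT);
INSERT INTO rats VALUES (1, 'F'), (2, 'M');
""")
v = DBView(con, "rats")
print(len(v), v.columns, v["sex"])
```

The point is only that the "data frame" interface (length, column
names, column extraction) can be honoured without an in-memory copy;
a writable version is where, as noted above, things get seriously
buggy.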

I tried my hand once at building such a representation (for
DBMS-deposited data), with partial success (read-only was OK,
read-write was seriously buggy). But my S3 object-oriented code
stinks, my Python is pytiful, and, as a public health measure, I won't
even try to qualify my C++... So I leave the implementation to better
programmers as an exercise (a term project, or even a master's thesis
subject, is probably closer to the truth...).

A third, much larger (implementation) element is missing from this
picture: the algorithms used on these data. SAS is notoriously good
(in some simple cases, such as ordinary regression) at handling
datasets larger than available memory, because its algorithms were
written with punched cards (maybe even paper tape) in mind: *one*
*sequential* read of the data was the only *practical* way to go in
those days. So all the matrices and vectors necessary to the
computation (notionally, X'X and X'Y) were built in memory in *one*
pass.
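The punched-card style of computation can be sketched in a few lines of
Python (pure stdlib; function name and data invented): for ordinary
least squares with one predictor, a single sequential read accumulates
the entries of X'X and X'Y, so the dataset itself never has to fit in
memory.

```python
# One sequential pass for simple linear regression y = a + b*x:
# only five running sums are kept, whatever the size of the stream.
def one_pass_ols(stream):
    n = sx = sxx = sy = sxy = 0.0
    for x, y in stream:          # e.g. records read one by one from disk
        n += 1
        sx += x; sxx += x * x    # entries of X'X
        sy += y; sxy += x * y    # entries of X'Y
    # Solve the 2x2 normal equations (X'X) beta = X'Y by elimination.
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Tiny illustration on data that satisfy y = 1 + 2*x exactly.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
print(one_pass_ols(data))
```

biglm, mentioned below, works in essentially this spirit, updating the
fit chunk by chunk rather than record by record.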

Such an organization is probably impossible with most "modern"
algorithms: see Douglas Bates' description of the lmer() algorithms
for a nice, big counter-example, or consider MCMC... But coming closer
to such an organization *seems* possible: see for example biglm.

So I think that data views are a worthy but not-so-easy goal aimed at
various data-structure problems (including hierarchical data), but not
*the* solution to the data-representation problem in R.

Any thoughts ?

					Emmanuel Charpentier



More information about the R-help mailing list