[Rd] Suggestion: Dimension-sensitive attributes

Laurent Gautier lgautier at gmail.com
Fri Jul 10 08:41:34 CEST 2009


Bengoechea Bartolomé Enrique (SIES 73) wrote:
> Very good points. They closely match the current prototype I have
> written...
> 
>> Starting by working on an interface for such object(s) is probably
>> the first step toward a unified solution
> 
> Agree. Getting a good API is always the most important step.
> 
>> Dimension-level is what seems to be the most needed...
> 
> True, and that was Henrik's original suggestion.



> But I find all three
> are closely related to the same topic (metadata) and as such deserve
> to be worked out together, but if most people agree otherwise, the
> direction is clear.
> 
>> - Object-level, if not linked to any dimension-attribute, is like
>> saying that one wants to attach anything to any object. That's what
>> attr() is already doing.
> 
> Except that plain attributes are dropped when subsetting. I've found
> myself dozens of times creating classes just to create a `[` method
> for them that preserves some attributes. This looks like such a
> common situation that having a mechanism to avoid the user
> programming the same stuff again and again would be handy.

I see. I never faced the issue, but I agree that this can be somewhat 
counter-intuitive.
Thinking about it, it seems natural nowadays to consider objects with 
attributes as a kind of prototype-based programming (and to expect "[" 
to keep the attributes, although it does already treat some attributes 
specially, such as "dim", "names", "dimnames").
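
To make the point about subsetting concrete, here is a minimal sketch. The class name "keepattr" and the attribute "unit" are made up for illustration and are not part of any proposal in this thread:

```r
# Plain attributes are lost when subsetting a vector.
x <- structure(1:5, unit = "cm")
attributes(x[2:3])   # NULL: "unit" is gone

# Hypothetical class "keepattr": its `[` method re-attaches "unit".
`[.keepattr` <- function(x, ...) {
  out <- NextMethod()                  # default subsetting drops "unit"
  attr(out, "unit") <- attr(x, "unit") # put it back by hand
  class(out) <- class(x)
  out
}

y <- structure(1:5, unit = "cm", class = "keepattr")
attr(y[2:3], "unit") # "cm"
```

This is the kind of boilerplate `[` method that has to be rewritten for every class today.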


>> - Cell-level may be out of scope for a first trial (but maybe
>> I missed the use-cases for it)
> 
> Although I agree that cell-level is far less common, here are a
> couple of use cases I've hit recently:
> 
> 1) the array represents time series in columns. The original data
> comes in a different frequency for each column, with some data
> missing. When you align to a common frequency and interpolate missing
> values, I needed a factor array of the same dimension as the data
> array identifying whether each observation corresponded to the actual
> original series, or had been interpolated, and whether interpolation
> was due to missing data or to frequency alignment. Of course, I
> needed the factor array to be subsetted together with the array.

In that respect, and as you outline it, this is then like 
"stacking"/"putting side-by-side" arrays of identical dimensions. Your 
time series data is in one array, the origin of the observations in 
another...

I would see that as a separate data structure (that could implement the 
metadata interface we are discussing).
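
As a rough illustration of such a separate structure, the sketch below assumes a hypothetical class "stacked" (not an existing one) that holds several matrices of identical dimensions and subsets them together. It is limited to two dimensions, and the origin flags are stored as a character matrix rather than a factor to keep the example short:

```r
# Hypothetical class "stacked": a named list of matrices sharing one dim.
stacked <- function(...) {
  parts <- list(...)
  dims <- lapply(parts, dim)
  stopifnot(length(parts) > 0,
            all(vapply(dims, identical, logical(1), dims[[1]])))
  structure(parts, class = "stacked")
}

# Apply the same row/column indices to every component.
`[.stacked` <- function(x, i, j) {
  parts <- unclass(x)
  if (missing(i)) i <- seq_len(nrow(parts[[1]]))
  if (missing(j)) j <- seq_len(ncol(parts[[1]]))
  structure(lapply(parts, function(a) a[i, j, drop = FALSE]),
            class = "stacked")
}

values <- matrix(c(1.0, 1.5, 2.0, 10, 11, 12), nrow = 3)
origin <- matrix(c("original", "interpolated", "original",
                   "original", "original", "interpolated"), nrow = 3)
s <- stacked(values = values, origin = origin)
sub <- s[2:3, ]   # both matrices are cut down to rows 2 and 3
```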

> 2) the array is a table representing data to be formatted by a
> reporting system (Sweave, R2HTML, etc), similar to the 'xtable'
> class. So I needed to associate formatting information to each
> individual "cell" (font, color, borders...), as well as to each
> dimension and to the whole table.
> 
> Anyway, it's far easier to add "cell-level" metadata on top of the
> other features with a new class: for `[` subscripting just call
> NextMethod() and then apply the same indexes to the object storing
> the cell-level metadata. But I still think it's useful to work out
> data object's metadata at all possible levels with a unified
> interface.
> 

I understand the use cases, but I can't stop thinking that this 
should be separated from the dimension-associated metadata.

In the examples above, the data structures are two-dimensional and 
therefore dimension-associated metadata will be for "rows" and for 
"columns"; all the cells in a table/array as a sequence are not mapped 
to any *dimension*.



> About the subscripting `[` methods, I don't see the need to modify
> `[<-` for arrays, as out-of-bound indexes generate errors with arrays
> (unlike vectors or data frames), so `[<-` would only replace data and
> leave metadata untouched. Am I missing something?

That's what I am thinking.
I bundled "[" with "[<-" to specify that the way indexing is done would 
remain the same (for a second I considered that someone thought of 
somehow indexing on the names of the dimensions, or on the metadata).
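
A minimal sketch of that behaviour, assuming a hypothetical S3 class "metamatrix" that keeps one data.frame per dimension in an attribute named "dimmeta" (both names are placeholders, and only positional or logical indices are handled): "[" subsets the data together with the matching metadata rows, while "[<-" is left at its default and so only replaces data.

```r
# Hypothetical class "metamatrix": a matrix with a "dimmeta" attribute,
# a list holding one data.frame per dimension (one row per index).
`[.metamatrix` <- function(x, i, j) {
  meta <- attr(x, "dimmeta")
  if (missing(i)) i <- seq_len(nrow(x))
  if (missing(j)) j <- seq_len(ncol(x))
  out <- unclass(x)[i, j, drop = FALSE]   # usual matrix indexing
  attr(out, "dimmeta") <- list(meta[[1]][i, , drop = FALSE],
                               meta[[2]][j, , drop = FALSE])
  class(out) <- class(x)
  out
}

m <- structure(matrix(1:6, nrow = 2),
               dimmeta = list(data.frame(batch = c("a", "b")),
                              data.frame(unit = c("cm", "kg", "s"))),
               class = "metamatrix")

sub <- m[2, c(1, 3)]   # data and metadata are subset together

m2 <- m
m2[1, 1] <- 0L         # default "[<-": data replaced, metadata untouched
```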

>> may be a function called "dimmeta()" (for consistency with
>> "dimnames()") ?
> 
> I'm using 'dimdata' in my current prototype, and Henrik suggested
> 'dimattr', but I really like your proposal more.

  the colour of the bikeshed

> Wrappers to the two first elements of 'dimmeta' for 2-dim arrays
> could be added in the same vein as 'rownames' and 'colnames':
> 'rowmeta' and 'colmeta'.

Yes. That's the spirit.

>> The signature could be dimmeta(x, i), with x the object,
> 
> For consistency with 'dimnames', the 'i' argument could be dropped
> and use dimmeta(x)[[i]] instead...
> 

I thought about that, but also thought that it could have implications 
on the actual storage of those metadata. In the case the metadata are 
stored in a list, that interface enforces the building of a list.
(I said to ignore implementation for now, but paradoxically this made me 
consider possible implementations).

Let's ignore that and go for consistency first (there will always be 
time to come back on that and make backward compatible changes, such as 
dimmeta(x, i=NULL) # return the list if i is NULL ).
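
For illustration only, one possible shape for that interface. The storage in a list attribute named "dimmeta" is an assumption made for the sketch, not a decision:

```r
# Hypothetical accessor: metadata is assumed to live in a list attribute
# named "dimmeta", with one element per dimension.
dimmeta <- function(x, i = NULL) {
  meta <- attr(x, "dimmeta")
  if (is.null(i)) meta else meta[[i]]   # whole list when i is NULL
}

`dimmeta<-` <- function(x, i = NULL, value) {
  if (is.null(i)) {
    attr(x, "dimmeta") <- value
  } else {
    meta <- attr(x, "dimmeta")
    if (is.null(meta)) meta <- vector("list", length(dim(x)))
    meta[i] <- list(value)   # keeps the slot even when value is NULL
    attr(x, "dimmeta") <- meta
  }
  x
}

x <- matrix(1:6, nrow = 2)
dimmeta(x, 2) <- data.frame(unit = c("cm", "kg", "s"))
identical(dimmeta(x, 2), dimmeta(x)[[2]])   # TRUE: both spellings agree
dimmeta(x, 2) <- NULL                       # drop metadata for dimension 2
```

With this shape, dimmeta(x)[[i]] stays consistent with dimnames(x)[[i]], and the optional i leaves room for a storage that is not a list.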


> Other standard generics to be affected would be:
> 
> * rbind & cbind for 2-dim arrays/matrices: they should combine the
> metadata, and for dimension-sensitive metadata can be modelled upon
> what is done with dimnames: use rowmeta (colmeta) of the first object
> with them in cbind (rbind), and combine colmeta (rowmeta) of all
> objects with them, filling with NAs/NULLs/.. for non
> metadata-sensitive objects being combined. An issue of coercing
> dimmeta of different classes may arise.

It may be good to be trigger-happy for a first pass ( stop("mismatching 
meta data - sorry") )... and mix-and-match use cases might be fewer.
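
A sketch of what "trigger-happy" could look like. The helper name rbind_meta() and the "dimmeta" list attribute are assumptions for the example; it is a plain function for two matrices, not an rbind method:

```r
# Hypothetical strict helper: row metadata are stacked, column metadata
# must be identical, otherwise we stop instead of trying to coerce.
rbind_meta <- function(a, b) {
  ma <- attr(a, "dimmeta")
  mb <- attr(b, "dimmeta")
  if (!identical(ma[[2]], mb[[2]]))
    stop("mismatching meta data - sorry")
  out <- rbind(unclass(a), unclass(b))
  attr(out, "dimmeta") <- list(rbind(ma[[1]], mb[[1]]), ma[[2]])
  out
}

cols <- data.frame(unit = c("cm", "kg"))
a <- structure(matrix(1:4, nrow = 2),
               dimmeta = list(data.frame(id = 1:2), cols))
b <- structure(matrix(5:6, nrow = 1),
               dimmeta = list(data.frame(id = 3L), cols))
ab <- rbind_meta(a, b)   # 3 x 2 matrix with three rows of row metadata
```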

> * `dim<-`, but this may raise the same problem of coercing dimmeta of
> different classes.
> 

Disabling "dim<-" is, I think, choosing sanity for now.
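
Disabling it could be as small as the following sketch, again assuming a hypothetical class "metamatrix" (the error message is made up):

```r
# Hypothetical: refuse to reshape objects of class "metamatrix".
`dim<-.metamatrix` <- function(x, value) {
  stop("changing 'dim' is not supported for objects with dimension metadata")
}

m <- structure(matrix(1:6, nrow = 2), class = "metamatrix")
res <- try(dim(m) <- c(3L, 2L), silent = TRUE)   # fails, m is unchanged
```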


> ...and I agree with the rest of your comments.


Same for me (about your comments).
This thread seems to be leading to something great.


L.


> Best,
> 
> Enrique
> 
> -----Original Message----- From: Laurent Gautier
> [mailto:lgautier at gmail.com] Sent: jueves, 09 de julio de 2009 14:15 
> Cc: Heinz Tuechler; Bengoechea Bartolomé Enrique (SIES 73); Tony
> Plate; Henrik Bengtsson; r-devel at r-project.org Subject: Re: [Rd]
> Suggestion: Dimension-sensitive attributes
> 
> Starting by working on an interface for such object(s) is probably
> the first step toward a unified solution, and this before thinking
> about whether and how R attributes are used.
> 
> It would also help to ensure a smooth transition from the existing
> classes implementing a similar solution (first the interface is added
> to those classes, then after a grace period the classes are
> eventually refactored).
> 
> Dimension-level is what seems to be the most needed... but I am not
> convinced of the practicality of the object-level and cell-level
> schemes proposed:
> 
> - Object-level, if not linked to any dimension-attribute, is like
> saying that one wants to attach anything to any object. That's what
> attr() is already doing.
> 
> - Cell-level may be out of scope for a first trial (but maybe
> I missed the use-cases for it)
> 
> 
> 
> If starting with behaviour, it seems to boil down to having "["/"[<-"
> and "dimmeta()"/"dimmeta<-()":
> 
> - extract "[" / replace "[<-" :
> 
> * keeps working the way it already does
> 
> * extracts a subset of the object as well as a subset of the 
> dimension-associated metadata.
> 
> * departing too much from the way "[" is working and adding 
> behind-the-curtain name matching will only compromise the chances of
>  adoption.
> 
> * forget about the bit about which metadata is kept and which one 
> isn't when using "[". Make a function "unmeta()" (similar behavior to
>  "unname()") to drop them all, or work it out with something like
>> dimmeta(x, 1) <- NULL # drop the metadata associated with dimension 1
> 
> - access the dimension-associated metadata:
> 
> * may be a function called "dimmeta()" (for consistency with 
> "dimnames()") ? The signature could be dimmeta(x, i), with x the
> object, and i the dimension requested. A replace function
> "dimmeta<-"(x, i, value) would be provided.
> 
> 
> In the abstract the "names" associated with a given dimension is just
>  one of possible metadata, but I'd keep away from meddling with it
> for a start.
> 
> 
> It would seem natural that metadata associated with one dimension 
> would be a table-like object (data.frame seems natural in R, and 
> unfortunately there is no data.frame-like structure in R).
> 
> 
> 
> L.
> 
>


