[R] Class that wraps Data Frame

Ramiro Barrantes ramiro at precisionbioassay.com
Mon Sep 3 15:33:07 CEST 2012


Thanks to everyone for all their help.  I will investigate more.  I am not attached to S4 at all, but it sounds like it might be a good option.  I will look into Bioconductor for examples.

Thanks again,

Ramiro

________________________________________
From: Martin Morgan [mtmorgan at fhcrc.org]
Sent: Friday, August 31, 2012 12:33 PM
To: Bert Gunter
Cc: David Winsemius; r-help at r-project.org; Ramiro Barrantes
Subject: Re: [R] Class that wraps Data Frame

I guess there are two issues with data.frame. It comes with more than
you probably want to support (e.g., both list-like and matrix-like
subsetting via [, and the user expecting to be able to modify any
column independently). And it lacks things you'd like (e.g., support
for a 'column' of S4 objects). By making a class that contains ('is a')
data.frame, you commit to both limitations.
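
As a small illustration of the 'more than you want' side, here is a
sketch on a plain data.frame. The column names are made up for the
example; a class that contains data.frame would be expected to support
all of this.

```r
# Plain data.frame; the column names are illustrative only
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

cols <- df["x"]      # list-like subsetting: a one-column data.frame
rows <- df[2:3, ]    # matrix-like subsetting: rows 2 and 3, all columns
cell <- df[2, "y"]   # matrix-like subsetting: a single element

# Nothing stops a user from silently changing a column's type
df$x <- as.character(df$x)
```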

You're probably using data.frame as a way to implement some basic
restrictions -- equal-length columns, for instance. But there are
additional restrictions, too: columns x, y, z must be present and of
type integer, character, numeric respectively. For this scenario one is
better off implementing an S4 class (whose typed slots provide type
checking and required columns), a validity method (for enforcing the
equal-length constraint), accessors, and subsetting that follows the
semantics you'd like to support, e.g., just along the length of the
required slots.
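
A minimal sketch of that design, under stated assumptions: the class
name 'Measurements', the constructor, and the accessor 'xvals' are
hypothetical names invented for this example, not part of any package.

```r
library(methods)

# Hypothetical class: the required 'columns' are typed slots
setClass("Measurements",
    representation = representation(
        x = "integer", y = "character", z = "numeric"),
    validity = function(object) {
        # Enforce the equal-length constraint
        n <- c(length(object@x), length(object@y), length(object@z))
        if (length(unique(n)) != 1L)
            "slots 'x', 'y' and 'z' must have equal length"
        else TRUE
    })

# Hypothetical constructor; new() checks slot types and validity
Measurements <- function(x = integer(), y = character(), z = numeric())
    new("Measurements", x = x, y = y, z = z)

# One accessor shown; 'y' and 'z' would follow the same pattern
setGeneric("xvals", function(object) standardGeneric("xvals"))
setMethod("xvals", "Measurements", function(object) object@x)

setMethod("length", "Measurements", function(x) length(x@x))

# Subsetting only along the length of the required slots;
# initialize() re-runs the validity method on the result
setMethod("[", "Measurements", function(x, i, j, ..., drop = TRUE) {
    initialize(x, x = x@x[i], y = x@y[i], z = x@z[i])
})

m <- Measurements(x = 1:3, y = c("a", "b", "c"), z = c(0.1, 0.2, 0.3))
sub <- m[2:3]
```

Here new() rejects a slot of the wrong type, and the validity method
rejects slots of unequal length, so both kinds of restriction are
enforced in one place.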

The richest place for this in Bioconductor is the IRanges package,
though it can be a bit daunting from an architecture point of view. A
couple of things to point to. One is the DataFrame class, which is like
a data.frame but supports a broader set of column types (in particular
S4 objects) and allows 'metadata' (itself a DataFrame, so recursive) on
each column. It is relevant if it is important to maintain S4 classes
in a data.frame-like structure.

Another is the IRanges class, which in some ways fits your overall use
case. It is basically a rectangular data structure, but with required
'columns' (the start and width of the range) and then arbitrary columns
the user can add. It's implemented with slots for start and width, and
then 'has a' DataFrame, stored in another slot, as 'metadata columns'
(the actual implementation is more complicated than this). There are
start and width accessors. Subsetting is always list-like
(single-dimensional, along the ranges). Users wanting to access one of
'their' columns use $ or extract the metadata columns (via mcols()) as a
DataFrame and then work on that. Maybe it's worth pointing out that the
basic definitions are column-oriented: an IRanges instance contains
start and width vectors; there is no 'IRange' class.
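
The 'has a' pattern can be sketched in base R as below. This is not the
IRanges implementation: the class 'SimpleRanges', the slot 'extra', and
the accessors 'starts' and 'extraCols' are hypothetical, and a plain
data.frame stands in for DataFrame.

```r
library(methods)

# Required 'columns' are slots; user columns live in a data.frame slot
setClass("SimpleRanges",
    representation = representation(
        start = "integer", width = "integer", extra = "data.frame"),
    validity = function(object) {
        n <- length(object@start)
        if (length(object@width) != n)
            "'start' and 'width' must have equal length"
        else if (nrow(object@extra) != n)
            "'extra' must have one row per range"
        else TRUE
    })

# Hypothetical constructor: named arguments in '...' become user columns
SimpleRanges <- function(start, width, ...) {
    extra <- data.frame(..., stringsAsFactors = FALSE)
    if (ncol(extra) == 0L)
        extra <- data.frame(row.names = seq_along(start))
    new("SimpleRanges", start = as.integer(start),
        width = as.integer(width), extra = extra)
}

setGeneric("starts", function(object) standardGeneric("starts"))
setMethod("starts", "SimpleRanges", function(object) object@start)

setGeneric("extraCols", function(object) standardGeneric("extraCols"))
setMethod("extraCols", "SimpleRanges", function(object) object@extra)

setMethod("length", "SimpleRanges", function(x) length(x@start))

# '$' reaches only the user columns, never the required slots
setMethod("$", "SimpleRanges", function(x, name) x@extra[[name]])

# List-like subsetting: one dimension, along the ranges
setMethod("[", "SimpleRanges", function(x, i, j, ..., drop = TRUE) {
    initialize(x, start = x@start[i], width = x@width[i],
               extra = x@extra[i, , drop = FALSE])
})

r <- SimpleRanges(start = c(1, 10, 20), width = c(5, 5, 3),
                  score = c(0.5, 0.9, 0.1))
r2 <- r[2:3]
```

Because the object only 'has a' data.frame, none of the data.frame
subsetting or replacement behaviour leaks into the class unless a
method is written for it.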

The GRanges class (in the GenomicRanges package) 'has a' IRanges, but
adds additional required slots ('seqnames' to reference the names of the
chromosome sequences to which the ranges refer, 'strand' to indicate the
strand to which the range belongs, etc.). So the pattern here avoids the
'is a' relationship that simple class extension would imply.

The IRanges package is at

   http://bioconductor.org/packages/devel/bioc/html/IRanges.html

I've described the 'devel' version of Bioconductor:

   http://bioconductor.org/developers/useDevel/

Martin


On 08/31/2012 08:39 AM, Bert Gunter wrote:
> To add to what David said ...
>
> Of course, there are already S3 "getters" and "setters" methods for data
> frames ("[.data.frame" and "[<-.data.frame" )*. These could clearly be
> extended -- i.e. the data.frame class could be extended and appropriate S3
> methods written. Whether you use S3 or S4 depends on the degree of control,
> type checking, reuse etc. you want/need. David's suggestion to look at
> Bioconductor is a good one.
>
> Cheers,
> Bert
> *If you are unfamiliar with the S3 extract methods, consult the R Language
> Definition Manual.
>
> On Fri, Aug 31, 2012 at 8:14 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>>
>> On Aug 31, 2012, at 5:57 AM, Ramiro Barrantes wrote:
>>
>>> Hello,
>>>
>>> I have again a "good practices"/programming theory question regarding
>> data.frames.
>>>
>>> One of the fundamental objects that I use is the data frame with a
>> particular set of columns that I would fill or get information from, and an
>> entire system would revolve around getting information from or putting
>> information to such data.frame.
>>>
>>> In a different OOP programming language I would be tempted to create a
>> class that would "wrap-around" that data.frame and create "getters" and
>> "setters" methods that would return whatever information I need. I started
>> doing that using S4.
>>>
>>> Does anyone have examples of packages that use that approach or any
>> suggestions?  It just seems to me that a class/object would be a better
>> idea because it would create a single, hopefully well validated way to
>> access information and edit the fundamental data.frame object, which would
>> be helpful if there are several programmers on the team and/or if some of
>> the data.frame manipulations are not straightforward and are best left
>> encapsulated in a method of a class, and then have people use that method.
>>   I would just like to know if there are reasons not to do it that way and if
>> there are any examples of packages that use that approach and that I can
>> learn from.
>>
>> You could argue that the entire Bioconductor project represents such an
>> effort. It makes extensive use of S4 methods. I'm not a user so cannot
>> readily point to examples of S4 functions that have set. and get. methods
>> for particular sorts of dataframes, but I suspect you can pose the same
>> question on the BioC mailing list and get a more informed answer.
>>
>> --
>> David Winsemius, MD
>> Alameda, CA, USA


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


