[R] any updates w.r.t. lapply, sapply, apply retaining classes

Joshua Wiley jwiley.psych at gmail.com
Fri Nov 4 02:20:19 CET 2011


Hi Mike,

I definitely understand your point.  I don't have any particularly
good ideas, though I think you might like S4, which is the newer
formal class/methods system.
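
As a rough sketch of what S4 offers here: the class name PosReal and its
validity rule are hypothetical, borrowed from the "posreal" thought
example quoted below, and are not part of any package.

```r
library(methods)

# Hypothetical S4 class: a numeric vector that must be strictly positive.
setClass(
  "PosReal",
  contains = "numeric",
  validity = function(object) {
    if (any(object@.Data <= 0, na.rm = TRUE)) {
      "all values must be strictly positive"
    } else {
      TRUE
    }
  }
)

ok <- new("PosReal", c(1, 2.5, 3))
is(ok, "numeric")        # TRUE: PosReal formally extends numeric

# Constructing an invalid object is refused instead of silently accepted.
bad <- tryCatch(new("PosReal", c(1, -2)), error = function(e) e)
inherits(bad, "error")   # TRUE
```

Validity is checked by new() and by an explicit validObject() call, not
after every operation, so this is a partial answer at best.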

As a note, I misspoke (or miswrote) when I said that difftime inherits
from numeric.  Its mode is numeric, but it does not inherit from numeric.
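
A small illustration of that distinction, using a shortened version of
the difftime vector from the original post (the variable name x is just
for this sketch):

```r
x <- as.difftime(c(1, 8, 0.25), units = "days")

class(x)                 # "difftime"
mode(x)                  # "numeric"
inherits(x, "numeric")   # FALSE
units(x)                 # "days"

# Converting explicitly makes the unit of the resulting numbers unambiguous.
as.numeric(x, units = "hours")   # 24 192 6
```

Asking for the units explicitly at conversion time is one way to avoid
guessing what a bare number means after the class has been dropped.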

Cheers,

Josh

On Thu, Nov 3, 2011 at 4:49 PM, Mike Williamson <this.is.mvw at gmail.com> wrote:
> Hi Joshua,
>     Thank you for the input!
>     I agree that it is non-trivial to solve the cases you & I have posed.
>  However, I would wholeheartedly support having an error spit back for any
> function that does not explicitly support a class.  In this case, if I
> attempt to do   sapply(x, class), and 'x' is of class "difftime", then I
> should receive an error "sapply cannot function upon class 'difftime' ".
>  Why do I take this stance?  There are at least 2 strong reasons:
>
> Most importantly, an incorrect answer is far more dangerous than no answer.
>  E.g., if I ask "what is 3 + 3?", I would far prefer to receive "I don't
> know" than "5".  The former lets me know I need to choose another path, the
> latter mistakenly makes me think I have an answer, when I do not, and I
> continue with analyses on the assumption that answer is correct.  In the
> case of dates, this happens often.  E.g., is the numeric that is returned
> from sapply the number of seconds since 1970-01-01, or the number
> of days since 1970-01-01?  This depends upon how 'R' internally attempts to
> fix any incongruities.
> But also very significantly, an error will get me in the habit of avoiding
> any marginalized class types.  I keep thinking, for instance, that I can use
> the "Date" class, since 'R' says that it supports it.  But if I got into
> the habit of converting all dates into numerics myself beforehand (maybe
> counting the number of seconds from 1970-01-01, since that seems a magic
> date), then I would be guaranteed that a function will either (a) cause an
> error (e.g., if I try a character function on it), or (b) function properly.
>  However, since I don't overtly receive errors when attempting to use dates
> (or difftimes, or factors, or whatever), I keep using them, instead of
> relying solely upon the true & trusted classes.
>
> The trickiest case here is really factors.  Factors are, by most accounts,
> considered a core class.  In some cases, you can only use factors.  E.g.,
> when you want some sort of ordinal categorical variable.  Therefore, the
> fact that factors also barf similarly to other classes like difftime (albeit
> much more rarely), is especially dangerous.
>
>     There are, of course, habits that we can create to make ourselves better
> programmers, and I will recognize that I can improve.  However, this issue
> of functions generating "wrong" answers is such a huge problem with 'R', and
> other languages are catching up to 'R' so quickly, as far as their
> capability to handle higher level math, that this issue is making 'R' a less
> desirable language to use, as time progresses.  I don't mean to claim that
> my opinion is the end-all-be-all, but I would like to hear others chime in,
> whether this is a large concern, or whether there is a very small minority
> of folks impacted by it.
>                                                   Regards,
>                                                          Mike
>
> ---
> XKCD
>
>
>
> On Thu, Nov 3, 2011 at 2:51 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>
>> Hi Mike,
>>
>> This isn't really an answer to your question, but perhaps will serve
>> to continue discussion.  I think that there are some fundamental
>> issues when working with special classes.  As a thought example, suppose I
>> wrote a class, "posreal", which inherits from the numeric class.  It
>> is only valid for positive, real numbers.  I use it in a package, but
>> do not develop methods for it.  A user comes along and creates a
>> vector, x, that is a posreal, then tries: mean(x * -3).  Since I never
>> bothered to write a special method for mean for my class, R falls back
>> to the inherited numeric, but gives a value that is clearly not valid
>> for posreal.  What should happen?  S3 classes do not really have
>> validation, so in principle, one could write a function like:
>>
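>> f <- function(x) {
>>   vclass <- class(x)    # remember the class of the input
>>   res <- mean(x)        # mean() may drop or change the class
>>   class(res) <- vclass  # re-stamp the original class, with no validation
>>   return(res)
>> }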
>>
>> which "retains" the appropriate class, but in name only.  R core
>> cannot possibly know or imagine all classes that may be written that
>> inherit from more basic types but with possible special aspects and
>> requirements.  I think the inherited class is considered to be more
>> generic, and that is what is returned.  It is usually up to the user
>> to ensure that the function (whose methods were written not for that
>> special class but for the inherited one) is valid for that class, and
>> the user can then manually convert the result back:
>>
>> res <- as.posreal(res)
>>
>> What about lapply and sapply?  Neither is generic or has methods for
>> difftime, and so they do some unexpected/undesirable things.  Again,
>> without methods defined for a particular class, they cannot know what
>> is special about it or the appropriate way to handle it.  They use
>> defaults which sometimes work but may give unexpected or undesirable
>> results, but what else can be done?  (Okay, they could just throw an
>> error.)  If a function is naive about a class, it does not seem right
>> to operate on it using unknown methods and then pretend to be returning
>> the same type of data.  As it stands, they convert to a data type they
>> know and return that.
>>
>> Now, you mention that for loops are slow in R, and this is true to a
>> degree.  However, the *apply functions are basically just internal
>> loops, so they do not really save you (they are certainly not
>> vectorized!), though they are more elegant than explicit loops IMO.
>> One way to use them while retaining class would be like:
>>
>> sapply(seq_along(test), function(i) class(test[i]))
>>
>> this is less efficient than sapply(test, class), but the overhead
>> drops considerably as the function does nontrivial calculations.
>> Finally, I find the (relatively) new compiler package really shines at
>> making functions that are just wrappers for for loops more efficient.
>> Take a look at the examples from:
>>
>> require(compiler)
>> ?cmpfun
>>
>> I am not familiar with NumPy so I do not know how it handles new
>> classes, but with some tweaks to my workflow, I do not find myself
>> running into problems with how R handles them.  I definitely
>> appreciate your position because I have been there.  As I became more
>> familiar with R, classes, and methods, I found I work in a way that
>> avoids passing objects to functions that do not know how to handle
>> them properly.
>>
>> Cheers,
>>
>> Josh
>>
>>
>> On Thu, Nov 3, 2011 at 11:08 AM, Mike Williamson <this.is.mvw at gmail.com> wrote:
>> > Hi All,
>> >
>> >    I don't have an "I need help" question so much as a query about
>> > whether 'R' has made any progress with some of the core functions
>> > retaining classes.  As an example, because it's one of the cases that
>> > most egregiously impacts me & my work and keeps pushing me away from
>> > 'R' and into other numerical languages (such as NumPy in Python), I
>> > will use sapply / lapply to demonstrate, but this behavior is
>> > ubiquitous throughout 'R'.
>> >
>> >    Let's say I have a class which is theoretically supported, but not
>> > one of the core "numeric" or "character" classes (and, to some degree,
>> > "factor" classes).  Many of the basic functions will convert my desired
>> > class into either numeric or character, so that my returned answer is
>> > gibberish.
>> >
>> > E.g.:
>> >
>> > ## create a small array of time differences
>> > test <- as.difftime(c(1, 1, 8, 0.25, 8, 1.25), units = "days")
>> > class(test)          ## this will return the proper class, "difftime"
>> > class(test[1])       ## this will also return the proper class, "difftime"
>> > sapply(test, class)  ## this will return *numerics* for all of the
>> >                      ## classes.  Ack!!
>> >
>> >    In the example I give above, the impact might seem small, but the
>> > implications are *huge*.  This means that I am, in effect, not allowed
>> > to use *any* of the vectorized functions in 'R', which avoid performing
>> > loops, thereby speeding up process time extraordinarily.  Many can
>> > sympathize that 'R' is ridiculously slow with "for" loops, compared to
>> > other languages.  But that's theoretically OK, a good statistician or
>> > data analyst should be able to work comfortably with matrices and
>> > vectors.  However, *'R' cannot work comfortably* with matrices or
>> > vectors, *unless* they are using the numeric or character classes.
>> > Many of the classes suffer the problem I just described, although I
>> > only used "difftime" in the example.  Factors seem a bit more
>> > "comfortable", and can be handled most of the time, but not as well as
>> > numerics, and at times functions working on factors can return the
>> > numerical representation of the factor instead of the original factor.
>> >
>> >    Is there any progress in guaranteeing that all core functions either
>> > (a) ideally return exactly the classes, and hierarchy of classes, that
>> > they received (e.g., a list of data frames with difftimes & dates &
>> > characters would return a list of data frames with difftimes & dates &
>> > characters), or (b) barring that, at least error out with a clear
>> > error explaining that sapply, for example, cannot vectorize on the
>> > class being used?  Returning incorrect answers is far worse than
>> > returning an error, from a perspective of stability.
>> >
>> >    This is, by far, the largest Achilles' heel to 'R'.  Personally, as
>> > my career advances and I work on more technical things, I am finding
>> > that I have to leave 'R' by the wayside and use other languages for
>> > robust numerical calculations and programming.  This saddens me,
>> > because there are so many wonderful packages developed by the
>> > community.  The example above came up because I am using the
>> > "forecast" library to great effect in predicting how long our product
>> > cycle time will be.  However, I spend much of my time fighting all
>> > these class & typing bugs in 'R' (and we have to start recognizing
>> > that they are bugs, otherwise they may never get resolved), such that
>> > many of the improvements in my productivity due to all the wonderful
>> > computational packages are entirely offset by the time I spend
>> > fighting this issue of poor classes.
>> >
>> >                                     Thanks & Regards!
>> >                                              Mike
>> >
>> > ---
>> > XKCD <http://www.xkcd.com>
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> Programmer Analyst II, ATS Statistical Consulting Group
>> University of California, Los Angeles
>> https://joshuawiley.com/
>
>





