[Rd] Wrong length of POSIXt vectors (PR#10507)

Tony Plate tplate at acm.org
Mon Dec 17 07:53:40 CET 2007


Duncan Murdoch wrote:
> On 15/12/2007 5:17 PM, Martin Maechler wrote:
>>>>>>> "TP" == Tony Plate <tplate at acm.org>
>>>>>>>     on Fri, 14 Dec 2007 13:58:30 -0700 writes:
>>     TP> Duncan Murdoch wrote:
>>     >> On 12/13/2007 1:59 PM, Tony Plate wrote:
>>     >>> Duncan Murdoch wrote:
>>     >>>> On 12/11/2007 6:20 AM, simecek at gmail.com wrote:
>>     >>>>> Full_Name: Petr Simecek
>>     >>>>> Version: 2.5.1, 2.6.1
>>     >>>>> OS: Windows XP
>>     >>>>> Submission from: (NULL) (195.113.231.2)
>>     >>>>> 
>>     >>>>> 
>>     >>>>> Several times I have found that the length of a POSIXt vector 
>>     >>>>> has not been
>>     >>>>> computed correctly.
>>     >>>>> 
>>     >>>>> Example:
>>     >>>>> 
>>     >>>>> tv<-structure(list(sec = c(50, 0, 55, 12, 2, 0, 37, NA, 17, 3, 31
>>     >>>>> ), min = c(1L, 10L, 11L, 15L, 16L, 18L, 18L, NA, 20L, 22L, 22L
>>     >>>>> ), hour = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, NA, 12L, 12L, 12L), 
>>     >>>>> mday = c(13L, 13L, 13L, 13L, 13L, 13L, 13L, NA, 13L, 13L, 13L), mon 
>>     >>>>> = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, NA, 5L, 5L, 5L), year = c(105L, 
>>     >>>>> 105L, 105L, 105L, 105L, 105L, 105L, NA, 105L, 105L, 105L), wday = 
>>     >>>>> c(1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L), yday = c(163L, 163L, 
>>     >>>>> 163L, 163L, 163L, 163L, 163L, NA, 163L, 163L, 163L), isdst = c(1L, 
>>     >>>>> 1L, 1L, 1L, 1L, 1L, 1L, -1L, 1L, 1L, 1L)), .Names = c("sec", "min", 
>>     >>>>> "hour", "mday", "mon", "year", "wday", "yday", "isdst"
>>     >>>>> ), class = c("POSIXt", "POSIXlt"))
>>     >>>>> 
>>     >>>>> print(tv)
>>     >>>>> # print 11 time points (right)
>>     >>>>> 
>>     >>>>> length(tv)
>>     >>>>> # returns 9 (wrong)
>>     >>>> 
>>     >>>> tv is a list of length 9.  The answer is right, your expectation is 
>>     >>>> wrong.
>>     >>>>> I have tried that on several computers with/without switching to 
>>     >>>>> English
>>     >>>>> locales, i.e. Sys.setlocale("LC_TIME", "en"). I have searched the 
>>     >>>>> help pages but I
>>     >>>>> cannot imagine how that could be OK.
>>     >>>> 
>>     >>>> See this in ?POSIXt:
>>     >>>> 
>>     >>>> Class '"POSIXlt"' is a named list of vectors...
>>     >>>> 
>>     >>>> You could define your own length measurement as
>>     >>>> 
>>     >>>> length.POSIXlt <- function(x) length(x$sec)
>>     >>>> 
>>     >>>> and you'll get the answer you expect, but be aware that length.XXX 
>>     >>>> methods are quite rare, and you may surprise some of your users.
>>     >>>> 
>>     >>> 
>>     >>> On the other hand, isn't the fact that length() currently always 
>>     >>> returns 9 for POSIXlt objects likely to be a surprise to many users 
>>     >>> of POSIXlt?
>>     >>> 
>>     >>> The back of "The New S Language" says "Easy-to-use facilities allow 
>>     >>> you to organize, store and retrieve all sorts of data. ... S 
>>     >>> functions and data organization make applications easy to write."
>>     >>> 
>>     >>> Now, POSIXlt has methods for c() and vector subsetting "[" (and many 
>>     >>> other vector-manipulation methods - see methods(class="POSIXlt")).  
>>     >>> Hence, from the point of view of intending to supply "easy-to-use 
>>     >>> facilities ... [for] all sorts of data", isn't it a little 
>>     >>> incongruous that length() is not also provided -- as these 3 functions 
>>     >>> (any others?) comprise a core set of vector-manipulation functions?
>>     >>> 
>>     >>> Would it make sense to have an informal prescription (e.g., in 
>>     >>> R-exts) that a class that implements a vector-like object and 
>>     >>> provides at least one of the functions 'c', '[' and 'length' should 
>>     >>> provide all three?  It would also be easy to describe a test-suite 
>>     >>> that should be included in the 'tests' directory of a package 
>>     >>> implementing such a class, that had some tests of the basic 
>>     >>> vector-manipulation functionality, such as:
>>     >>> 
>>     >>> > # at this point, x0, x1, x3, & x10 should exist, as vectors of the
>>     >>> > # class being tested, of length 0, 1, 3, and 10, and they should
>>     >>> > # contain no duplicate elements
>>     >>> > length(x0)
>>     >>> [1] 0
>>     >>> > length(c(x0, x1))
>>     >>> [1] 1
>>     >>> > length(c(x1,x10))
>>     >>> [1] 11
>>     >>> > all(x3 == x3[seq(len=length(x3))])
>>     >>> [1] TRUE
>>     >>> > all(x3 == c(x3[1], x3[2], x3[3]))
>>     >>> [1] TRUE
>>     >>> > length(c(x3[2], x10[5:7]))
>>     >>> [1] 4
>>     >>> >
>>     >>> 
>>     >>> It would also be possible to describe a larger set of vector 
>>     >>> manipulation functions that should be implemented together, including 
>>     >>> e.g., 'rep', 'unique', 'duplicated', '==', 'sort', '[<-', 'is.na', 
>>     >>> head, tail ... (many of which are provided for POSIXlt).
>>     >>> 
>>     >>> Or is there some good reason that length() cannot be provided (while 
>>     >>> 'c' and '[' can) for some vector-like classes such as "POSIXlt"?
>>     >> 
>>     >> What you say sounds good in general, but the devil is in the details. 
>>     >> Changing the meaning of length(x) for some objects has fairly 
>>     >> widespread effects.  Are they all positive?  I don't know.
>>     >> 
>>     >> Adding a prescription like the one you suggest would be good if it's 
>>     >> easy to implement, but bad if it's already widely violated.  How many 
>>     >> base or CRAN or Bioconductor packages violate it currently?   Do the 
>>     >> ones that provide all 3 methods do so in a consistent way, i.e. does 
>>     >> "length(x)" mean the same thing in all of them?
>>     TP> I'm not sure doing something like this would be so bad even if it is 
>>     TP> already widely violated.  R has evolved significantly over time, and 
>>     TP> many rough edges have been cleaned up, sometimes in ways that were not 
>>     TP> backward compatible.  This is a great thing & my thanks go to the people 
>>     TP> working on R.
>>
>>     TP> If some base or CRAN or Bioconductor packages currently don't implement 
>>     TP> vector operations consistently, wouldn't it be good to know that?  
>>     TP> Wouldn't it be useful to have an automatic way of determining whether a 
>>     TP> particular vector-like class is consistent with a generally agreed set of 
>>     TP> principles for how basic vector operations should work -- things like 
>>     TP> length(x)+length(y)==length(c(x,y))?  This could help developers check, 
>>     TP> document & improve their code, and it could help users understand how to 
>>     TP> use a class, and to evaluate the software quality of a class 
>>     TP> implementation and whether or not it provides the functionality they need.
>>     >> I agree that the current state is less than perfect, but making it 
>>     >> better would really be a lot of work.  I suspect there are better ways 
>>     >> to spend my time, so I'm not going to volunteer to do it.  I'm not 
>>     >> even going to invite someone else to do it, or offer to review your 
>>     >> work if you volunteer.  I think this falls into the class of "next 
>>     >> time we write a language, let's handle this better" problems.
>>
>>     TP> Thanks very much for the thoughtful (and honest) feedback!  I suspect 
>>     TP> that the current state could be improved with just a little work, and 
>>     TP> without forcing anyone to do any work they don't want to do.  I'll think 
>>     TP> about this more and try to come back with a better & more concrete 
>>     TP> suggestion.
>>
>> Good. From "the outside" (i.e. superficial gut feeling :-)
>> I've sympathized with your suggestion, Tony, quite a bit.
>> Further, my own taste would probably also have led me to define
>> length.POSIXlt differently ..
>> OTOH, I agree with Duncan that it may be too late to change it
>> and even more to enforce the consistency rules you propose.
>> If with a small bit of code (and some patience) we could check
>> all of CRAN and hopefully bioconductor packages and find only a
>> very few where it was violated, the whole endeavor may be worth it
>> ... for the sake of making  R more consistent, easier to teach, etc..
>>
>> Unfortunately I don't remember now what happened many months ago
>> when I indeed did experiment with having something like
>>
>>   length.POSIXlt <- function(x) length(x$sec)
>>
>> Martin Maechler
> 
> One reason I don't want to work on this is because the appropriate 
> action depends on what "length(x)" is intended to mean.  Currently for 
> POSIXlt objects, it gives the physical length of the underlying basic 
> type (the list).  This is the same behaviour as we have for matrices, 
> data frames and every other object without a specific length method, so 
> it's not outrageous.
> 
> The proposed change is to have it return the logical length of the 
> object, which also seems quite reasonable.  I don't think matrices and 
> data frames have a "logical length", so there would be no contradiction 
> in those examples.  The thing that worries me is that there are probably 
> objects in packages where both logical length and physical length make 
> sense but are different.  I don't have any expectation that length(x) on 
> those currently is consistent in which type of value it returns.
> 
> If we were to decide that "length(x)" *always* meant logical length, 
> then we would have a problem:  matrices and data frames don't have a 
> logical length, so we shouldn't be getting an answer there.  Changing 
> length(x) for those is not acceptable.
> 
> On the other hand, if we decide that "length(x)" *always* means physical 
> length, we don't need to do anything to the POSIXlt or matrices or data 
> frames, but there may well be other kinds of objects out there that 
> violate this rule.
> 
> We could leave the meaning of length(x) ambiguous.  If you want to know 
> what it does for a POSIXlt object, you need to read the documentation or 
> look at the source code.  As a policy, this isn't particularly 
> appealing, but I could probably live with it if someone else did the 
> research and showed that current usage is ambiguous.

Leaving the meaning of length(x) ambiguous seems reasonable to me (the 
meanings of 'c' and '[' are ambiguous in the same way).

I was thinking more in terms of consistency of either supplying all or 
none of the tightly related group of functions 'c', '[', and 'length'. 
It seems diabolically confusing that 'c' and '[' exist for POSIXlt and 
do the expected things in terms of the vector-of-dates interpretation, 
but length does something completely different.  (And this is not 
mentioned in ?POSIXlt).
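
To make the two interpretations concrete, here is a minimal sketch. The 
timestamps and variable names are illustrative assumptions, and it uses 
unclass() so that it does not depend on whether a length method for 
POSIXlt happens to be defined:

```r
# A small POSIXlt vector of three time points (illustrative values).
x <- as.POSIXlt(c("2005-06-13 12:01:50", "2005-06-13 12:10:00",
                  "2005-06-13 12:11:55"), tz = "UTC")

# "Physical" length: the number of components in the underlying list
# (sec, min, hour, ...). It does not depend on how many dates are stored.
physical_length <- length(unclass(x))

# "Logical" length: the number of time points, i.e. the interpretation
# that 'c' and '[' follow.
logical_length <- length(unclass(x)$sec)

# 'c' and '[' act on the vector of dates: one more time point is appended.
y <- c(x, x[2])
stopifnot(length(unclass(y)$sec) == logical_length + 1)
```

Here logical_length is 3, while physical_length is whatever number of 
list components the running R version stores.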

Coding & documentation guidelines & tools could help R to move towards 
more consistency with regard to this kind of behavior.
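
As a rough illustration of what such a tool might look like, the sketch 
below checks a few of the identities discussed in this thread. The 
function name check_vector_like and the choice of checks are my own 
assumptions, not an existing R facility:

```r
# Hypothetical helper: report whether 'c', '[' and 'length' agree with
# each other for two objects x and y of the class being tested.
check_vector_like <- function(x, y) {
  n <- length(x)
  c(
    # Concatenation should add lengths.
    c_adds_lengths = length(c(x, y)) == length(x) + length(y),
    # Subsetting with all indices should keep the length.
    full_subset_keeps_length = length(x[seq_len(n)]) == n,
    # A single-element subset should have length 1.
    single_subset_has_length_1 = n == 0 || length(x[1]) == 1
  )
}

# Plain numeric vectors satisfy all three checks.
num_result <- check_vector_like(c(1.5, 2.5, 3.5), c(4, 5))

# Dates stored as POSIXct can be checked the same way.
d <- as.POSIXct(c("2005-06-13 12:01:50", "2005-06-13 12:10:00"), tz = "UTC")
ct_result <- check_vector_like(d, d[1])
```

With length() always returning 9 for POSIXlt, as reported in this 
thread, the first check would be expected to fail for that class, 
since 9 + 9 is not 9.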

-- Tony Plate

> 
> Duncan Murdoch
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


