[Rd] Suggestions to speed up median() and has.na()

Duncan Murdoch murdoch at stats.uwo.ca
Tue Apr 11 02:54:39 CEST 2006


On 4/10/2006 8:08 PM, Thomas Lumley wrote:
> On Mon, 10 Apr 2006, Duncan Murdoch wrote:
> 
>> On 4/10/2006 7:22 PM, Thomas Lumley wrote:
>>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
>>>
>>>> Suggestion 2:
>>>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
>>>> as soon as a NA value is detected.  In the best case it returns after
>>>> the first index with TRUE, in the worst case it returns after the last
>>>> index N with FALSE.  The cost for is.na(x) is always O(N), and any()
>>>> in the best case O(1) and in the worst case O(N) (if any() is
>>>> implemented as I hope).  An has.na() function would be very useful
>>>> elsewhere too.
>>> This sounds useful (though it has missed the deadline for 2.3.0).
>>>
>>> It won't help if the typical case is no missing values, as you suggest, but 
>>> it will be faster when there are missing values.
>> I think it would help even in that case if the vector is large, because it 
>> avoids allocating and disposing of the logical vector of the same length as 
>> x.
> 
> That makes sense. I have just tried, and for vectors of length ten 
> million it does make a measurable difference.
> 
> 
>>>> BTW, without having checked the source code, it looks like is.na() is
>>>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
>>>> a vector without NAs.  On the other hand, is.na(sum(x)) becomes
>>>> awfully slow if 'x' contains NAs.
>>>>
>>> I don't think  it is unnecessarily slow.  It has to dispatch methods and it 
>>> has to make sure that matrix structure is preserved.  After that the code 
>>> is just
>>>
>>>      case REALSXP:
>>>          for (i = 0; i < n; i++)
>>>              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
>>>          break;
>>>
>>> and it's hard to see how that can be improved. It does suggest that a 
>>> faster anyNA() function would have to not be generic.
>> If it's necessary to make it not generic to achieve the speedup, I don't 
>> think it's worth doing.  If anyNA is written not to be generic I'd guess a 
>> very common error will be to apply it to a dataframe and get a misleading 
>> "FALSE" answer.  If we do that, I predict that the total amount of r-help 
>> time wasted on it will exceed the CPU time saved by orders of magnitude.
>>
> 
> I wasn't proposing that it should be stupid, just not generic.  It could 
> support data frames (sum(), does, for example). If it didn't support data 
> frames it should certainly give an error rather than the wrong answer, but 
> if we are seriously trying to avoid delays around 0.1 seconds then going 
> through the generic function mechanism may be a problem.

If it's not dataframes, it will be something else.  I think it's highly 
desirable that any(is.na(x)) == anyNA(x) within base packages, and we 
should make it straightforward to maintain this identity in contributed 
packages.

By the way, I think Bill's suggestion of calling it anyMissing makes a 
lot of sense.

Duncan



More information about the R-devel mailing list