[Rd] Suggestions to speed up median() and has.na()

Duncan Murdoch murdoch at stats.uwo.ca
Tue Apr 11 01:48:12 CEST 2006


On 4/10/2006 7:22 PM, Thomas Lumley wrote:
> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
> 
>> Hi,
>>
>> I've got two suggestions on how to speed up median() by about 50%.  For
>> all iterative methods calling median() in a loop this has a major
>> impact.  The second suggestion applies to other methods too.
> 
> I'm surprised this has a major impact -- in your example it takes much 
> longer to generate the ten million numbers than to find the median.
> 
>> Suggestion 1:
>> Replace the sort() calls with the .Internal(psort(x, partial)).   This
>> will avoid unnecessary overhead, especially an expensive second check
>> for NAs using any(is.na(x)).  Simple benchmarking with
>>
>> x <- rnorm(10e6)
>> system.time(median(x))/system.time(median2(x))
>>
>> where median2() is the function with the above replacements, gives
>> about 20-25% speed up.
> 
> There's something that seems a bit undesirable about having median() call 
> the .Internal function for sort().
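
For illustration, the partial sort that psort performs can be sketched as a quickselect-style selection. This is a generic sketch of the technique (the function name quickselect and the code are not R's actual psort implementation): it partitions around a pivot and recurses into only the side containing the k-th element, giving expected O(n) cost instead of the O(n log n) of a full sort.

```c
#include <stddef.h>

/* Hypothetical sketch (not R's psort source): select the k-th
 * smallest element of a[0..n-1] by partial partitioning.  Only the
 * partition containing index k is ever refined, so the expected
 * cost is O(n).  The array is reordered in place. */
static double quickselect(double *a, size_t n, size_t k)
{
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        double pivot = a[(lo + hi) / 2];
        size_t i = lo, j = hi;
        while (i <= j) {                 /* Hoare-style partition */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
                if (j == 0) break;       /* guard size_t underflow */
                j--;
            }
        }
        if (k <= j)      hi = j;         /* k-th element is on the left  */
        else if (k >= i) lo = i;         /* k-th element is on the right */
        else break;                      /* a[k] already equals pivot    */
    }
    return a[k];
}
```

For an odd-length vector, the median is quickselect(a, n, n / 2); for an even length, two selections (or one selection plus a scan for the neighboring order statistic) are needed.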
> 
>> Suggestion 2:
>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
>> as soon as an NA value is detected.  In the best case it returns TRUE
>> after the first index; in the worst case it returns FALSE after the
>> last index N.  The cost of is.na(x) is always O(N), while any() is
>> O(1) in the best case and O(N) in the worst case (if any() is
>> implemented as I hope).  A has.na() function would be very useful
>> elsewhere too.
> 
> This sounds useful (though it has missed the deadline for 2.3.0).
> 
> It won't help if the typical case is no missing values, as you suggest, 
> but it will be faster when there are missing values.

I think it would help even in that case if the vector is large, because 
it avoids allocating and disposing of the logical vector of the same 
length as x.

>> BTW, without having checked the source code, it looks like is.na() is
>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
>> a vector without NAs.  On the other hand, is.na(sum(x)) becomes
>> awfully slow if 'x' contains NAs.
>>
> 
> I don't think it is unnecessarily slow.  It has to dispatch methods and 
> it has to make sure that matrix structure is preserved.  After that the 
> code is just
> 
>      case REALSXP:
>          for (i = 0; i < n; i++)
>              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
>          break;
> 
> and it's hard to see how that can be improved. It does suggest that a 
> faster anyNA() function would have to not be generic.
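
The early-exit scan being discussed can be sketched as follows. This is a hypothetical illustration (the name any_na is an assumption, not R's actual anyNA, which was added to R much later): unlike the is.na() loop above, it returns as soon as an NA is seen and never allocates the length-n logical result vector, which is the saving Duncan points out below.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical sketch of an anyNA-style scan (not R's source):
 * return 1 as soon as a NaN/NA is found, so the best case is O(1)
 * and no length-n logical vector is ever allocated.  isnan() also
 * catches R's NA_real_, which is a NaN with a special payload. */
static int any_na(const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (isnan(x[i]))
            return 1;           /* early exit on the first NA/NaN */
    return 0;
}
```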

If it's necessary to make it not generic to achieve the speedup, I don't 
think it's worth doing.  If anyNA is written not to be generic I'd guess 
a very common error will be to apply it to a dataframe and get a 
misleading "FALSE" answer.  If we do that, I predict that the total 
amount of r-help time wasted on it will exceed the CPU time saved by 
orders of magnitude.

Duncan Murdoch
