[Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

Thu Sep 30 22:05:28 CEST 2021

Mildly related (?) to this discussion: if you happen to be in a situation
where you know something is a C NAN but need to check whether it's a proper
R NA, the R_IsNA function is surprisingly (to me, at least) expensive to
call in a tight loop, because it calls the (again, surprisingly expensive
to me) isnan function. This can happen with known-sorted ALTREP REALSXPs,
where you can easily determine the C-NAN status of all elements in the
vector with a binary search for the edge of the NANs, i.e. in O(log n)
calls to isnan. Notably, you could also determine the finiteness of all
elements this way with a couple more O(log n) passes, if you needed to, in
the sorted case.
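
For concreteness, the binary search for the NaN edge could look roughly
like the C sketch below. It assumes the data is sorted with every NaN (NA
or NaN alike) placed after all ordinary values, as with na.last = TRUE.
first_nan_index is a name made up for illustration, not part of the R API.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an R API function).
 * Assumes x[0..n) is sorted with all NaN values at the end.
 * Returns the index of the first NaN, or n if there is none,
 * using O(log n) calls to isnan. */
static size_t first_nan_index(const double *x, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (isnan(x[mid]))
            hi = mid;       /* edge is at mid or to its left */
        else
            lo = mid + 1;   /* edge is strictly to the right */
    }
    return lo;
}
```

Everything at or past the returned index is then known to be NaN without
testing each element individually.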

This came up when I was developing the patch for the unique/duplicated
fast path for known-sorted vectors (thanks to Michael for working with me
on that and putting it in); I ended up writing a NAN_IS_R_NA macro to avoid
that isnan call, since the NaN status is already known. This was necessary
(well, helpful at least) because unique/duplicated care about the
difference between NA and NaN, while sorting and REAL_NO_NA (because ALTREP
metadata/behavior is closely linked to sort behavior) do not. In the case
where you have a lot of NAN values of solely one type or the other (by far
most often because they are all NAs and none are NaNs), the speedup was
noticeable, as I recall. I don't have the numbers handy but I could run
them again if desired.
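
A minimal sketch of the kind of check I mean, not the actual macro from the
patch: R's NA_real_ is a NaN whose low 32 bits hold 1954 (visible in the
bitC output quoted below), so once a value is known to be NaN only that low
word needs to be compared. nan_is_r_na and make_r_na are hypothetical names
used for illustration, and the code assumes 64-bit IEEE 754 doubles.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Precondition: x is already known to be NaN.
 * Returns nonzero if the low 32 bits of x equal 1954, the payload
 * that marks R's NA_real_. No isnan call is made here. */
static int nan_is_r_na(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined type punning */
    return (uint32_t)(bits & 0xFFFFFFFFu) == 1954u;
}

/* Hypothetical helper: builds a NaN with the NA payload, for testing.
 * Exponent bits all ones, low word 1954, quiet bit left clear. */
static double make_r_na(void)
{
    uint64_t bits = 0x7FF0000000000000ULL | 1954u;
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Called on a value that is not NaN the result is meaningless, which is the
point: the caller has already paid for that knowledge once.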

~G

On Thu, Sep 30, 2021 at 10:25 AM <luke-tierney using uiowa.edu> wrote:

> On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:
>
> >
> > André,
> >
> > I'm not an R core member, but happen to have looked a little bit at this
> > issue myself.  I've seen similar things on Skylake and Coffee Lake 2
> > (9700, one generation past your latest) too.  I think it would make sense
> > to have some handling of this, although I would want to show the
> > trade-off with performance impacts on CPUs that are not affected by this,
> > and on vectors that don't actually have NAs and similar.  I think the
> > performance impact is likely to be small so long as branch prediction is
> > active, but since branch prediction is involved you might need to check
> > with different ratios of NAs (not for your NA bailout branch, but for
> > e.g. interaction of what you add and the existing `na.rm=TRUE` logic).
>
> I would want to see realistic examples where this matters, not
> microbenchmarks, before thinking about complicating the code. Not all
> but most cases where sum(x) returns NaN/NA would eventually result in
> an error; getting to the error faster is not likely to be useful.
>
> My understanding is that arm64 does not support proper long doubles
> (they are the same as regular doubles). So code using long doubles
> isn't getting the hoped-for improved precision. Since that
> architecture is becoming more common we should probably be looking at
> replacing uses of long doubles with better algorithms that can work
> with regular doubles, e.g. Kahan summation or variants for sum.
>
> > You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
> > which might complicate the logic a fair bit.
> >
> > Presumably the x87 FPU will remain common for a long time, but if there
> > was reason to think otherwise, then the value of this becomes
> > questionable.
> >
> > Either way, I would probably wait to see what R Core says.
> >
> > For reference this 2012 blog post[1] discusses some aspects of the issue,
> > including that at least "historically" AMD was not affected.
> >
> > Since we're on the topic I want to point out that the default NA in R
> > starts off as a signaling NA:
> >
> >     example(numToBits)   # for `bitC`
> >     bitC(NA_real_)
> >     ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
> >     bitC(NA_real_ + 0)
> >     ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010
> >
> > Notice the leading bit of the significand starts off as zero, which marks
> > it as a signaling NA, but becomes 1, i.e. non-signaling, after any
> > operation[2].
> >
> > This is meaningful because the mere act of loading a signaling NA into
> > the x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
> > actually used in arithmetic operations.  This happens sometimes under
> > some optimization levels.  I don't know of any benefit of starting off
> > with a signaling NA, especially since the encoding is lost pretty much as
> > soon as it is used.  If folks are interested I can provide a patch to
> > turn the NA quiet by default.
>
> In principle this might be a good idea, but the current bit pattern is
> unfortunately baked into a number of packages and documents on
> internals, as well as serialized objects. The work needed to sort that
> out is probably not worth the effort.
>
> It also doesn't seem to affect the performance issue here since
> setting b[1] <- NA_real_ + 0 produces the same slowdown (at least on
> my current Intel machine).
>
> Best,
>
> luke
>
> >
> > Best,
> >
> > B.
> >
> > [1]: https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/
> > [2]: https://en.wikipedia.org/wiki/NaN#Encoding
> >
> >
> >
> >
> >
> >> On Thursday, September 30, 2021, 06:52:59 AM EDT, GILLIBERT, Andre <andre.gillibert using chu-rouen.fr> wrote:
> >>
> >> Dear R developers,
> >>
> >> By default, R uses the "long double" data type to get extra precision
> >> for intermediate computations, with a small performance tradeoff.
> >>
> >> Unfortunately, on all Intel x86 computers I have ever seen, long doubles
> >> (implemented in the x87 FPU) are extremely slow whenever a special
> >> representation (NA, NaN or infinities) is used, probably because it
> >> triggers poorly optimized microcode in the CPU firmware. A function such
> >> as sum() becomes more than a hundred times slower!
> >> Test code:
> >> a=runif(1e7);system.time(for(i in 1:100)sum(a))
> >> b=a;b[1]=NA;system.time(sum(b))
> >>
> >> The slowdown factors are as follows on a few Intel CPUs:
> >>
> >> 1)      Pentium Gold G5400 (Coffee Lake, 8th generation) with R 64 bits: 140 times slower with NA
> >>
> >> 2)      Pentium G4400 (Skylake, 6th generation) with R 64 bits: 150 times slower with NA
> >>
> >> 3)      Pentium G3220 (Haswell, 4th generation) with R 64 bits: 130 times slower with NA
> >>
> >> 4)      Celeron J1900 (Atom Silvermont) with R 64 bits: 45 times slower with NA
> >>
> >> I do not have access to more recent Intel CPUs, but I doubt that it has
> >> improved much.
> >>
> >> Recent AMD CPUs have no significant slowdown.
> >> There is no significant slowdown on Intel CPUs (more recent than Sandy
> >> Bridge) for 64 bits floating point calculations based on SSE2. Therefore,
> >> operators using doubles, such as '+', are unaffected.
> >>
> >> I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe
> >> somebody can test.
> >>
> >> Since NAs are not rare in real life, I think that it would be worth an
> >> extra check in functions based on long doubles, such as sum(). The check
> >> for special representations does not necessarily have to be done at each
> >> iteration for cumulative functions.
> >> If you are interested, I can write a bunch of patches to fix the main
> >> functions using long doubles: cumsum, cumprod, sum, prod, rowSums,
> >> colSums, matrix multiplication (matprod="internal").
> >>
> >> What do you think of that?
> >>
> >> --
> >> Sincerely
> >> André GILLIBERT
> >>
> >>
> >> ______________________________________________
> >> R-devel using r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>     Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>
