Floating point is a giant mess. There are known best practices for most issues that come up in everyday use (e.g., using Kahan summation or adding stochastic noise to reduce aggregate numerical error), but there are still corner cases. Different libraries and implementations give different results because there’s no generally accepted standard.
That’s even true for basic things like transitivity: if
a == b and
b == c, then
a == c. It’s so fundamental that we even assumed it as axiomatic in a number of math classes I’ve taken.
What happens when we check this on our computer using the our handy Haskell REPL?
$ ghci Prelude> -- Hi! I'm a comment! Prelude> -- 2^53 == 2.0^53 Prelude> 9007199254740992 == 9007199254740992.0 True Prelude> -- 2.0^53 == 2^53 + 1 Prelude> 9007199254740992.0 == 9007199254740993 True Prelude> -- 2^53 == 2^53 + 1 Prelude> 9007199254740992 == 9007199254740993 False
That’s weird. Looks like
== isn’t transitive in Haskell. What happens if we try another language?
$ julia julia> 9007199254740992 == 9007199254740992.0 true julia> 9007199254740992.0 == 9007199254740993 false julia> 9007199254740993 == 9007199254740992 false
Ok, so == is transitive over floats and ints in Julia but not in Haskell. How about Octave/Matlab?
$ octave octave:1> true ans = 1 octave:2> false ans = 0 octave:3> 9007199254740992 == 9007199254740993.0 ans = 1 octave:4> 9007199254740992.0 == 9007199254740993 ans = 1 octave:5> 9007199254740992 == 9007199254740993 ans = 1
That’s pretty weird, different ints seem to be equal. According to the docs, you can explicitly cast to an int type. Let’s try that!
octave:6> uint64(9007199254740992) == uint64(9007199254740993) ans = 1 octave:7> eq(uint64(9007199254740992),uint64(9007199254740993)) ans = 1 octave:8> 0 == 1 ans = 0
Let’s try a few more languages to see what we get. In the table below, Transitive represents when the particular operations we tried are transitive, Unique represents when these ints are unique and Allowed represents when ints and floats can be compared with
Hmm. After trying a handful of languages, it looks like there are four groups of languages.
- Languages where
==isn’t transitive over ints and floats, like C and Haskell.
- Languages where
==is transitive over ints and floats (at least on this one example), like Go and Julia.
- Languages where
- Languages where
==isn’t allowed between ints and floats.
What’s going on here? Well, there are two things that combine to cause this weird behavior.
First, (double precision) floating point numbers are represented by an exponent multiplied by a 52-bit fraction. As a more familiar example, let’s think about how that would work with decimal numbers in base 10 for a second. Say we had a floating point decimal with a 2 digit (signed) exponent and a 3 digit fraction. Our number would then be
10^(ee) * 1.(fff). So, if our exponent was
01 and our fraction was
000, we’d get
10^01 * 1.000 = 10.00. If we were to increment the fraction, to
001, we’d get
10^01 * 1.001 = 10.01. If, instead, the exponent were
02, we’d have
10^02 * 1.000 = 100.0 and
10^02 * 1.001 = 100.1. Notice that the smallest possible increment is the value of the exponent times the least significant digit in the fraction, i.e.,
(value of exponent) * 10^-(number of digits in fraction); in the last example, that works out to be
10^2 * 10^-3 = 10^-1 = 0.1.
Instead of base 10, floating point numbers operate in base 2. The limit on the smallest possible change we can make still holds, except that for a 64-bit floating point value, we have 11 bits (base-2 digits) of exponent, and 52 bits of fraction (plus 1 bit for the sign). This implies that the smallest increment is
(value of exponent) * 2^-52. Now, if we set the value of our exponent to be
53 and our fraction to
0, we get
9007199254740992. The next representable floating point value is
2^53 * 2^-52 = 2 more,
Second, you can’t compare an int and a float with
== without converting them to the same type. It’s natural to want to convert both to floats and compare, since floats are a more general type; 1.0 + 2 will give you 3.0 in many languages. But if we promote the int to a float, distinct ints can get promoted to the same float, resulting in a violation of transitivity, as we saw in Haskell2. If, instead, we convert floats to ints, the problem goes away, as we saw in Julia.
Well, that problem goes away, at least. But that still leaves a couple other problems. First, we end up with statements like
9007199254740993.0 == 9007199254740993 and
int(9007199254740993.0) > 9007199254740992 being false.
Second, what’s supposed to happen when you compare to a value that’s larger than an int? For a 64-bit float, our exponent can go up to 2^1023, which is way more than we can represent in a 64-bit int (2^63). Languages like Python and Ruby that convert values larger than a machine int into some kind of BigInt by default avoid the issue at the cost of some performance. Languages like Julia, which give you machine ints by default, have to deal with overflow.
On overflow, Julia throws an exception if you try to convert a float that’s bigger than 2^63. However, the
== comparison always returns false if you compare an int to a float that’s bigger than 2^63. It might seem surprising that
int(a) == b throws an exception while
a == b is false; that’s part of the case for forcing explicit conversions on comparisons, but given that mistyped comparisons are allowed in a language, throwing an exception on conversion and returning the correct result without an exception on comparison is the only way to avoid incorrect results.
And then there are the languages that make all numbers floats, which is fine if you don’t mind that
9007199254740992 == 9007199254740993. Personally, I mind that
9007199254740992 == 9007199254740993.
Since there’s no way to handle this well, how about just not allowing the comparison to happen? Of the languages I tried, only OCaml actually considers that to be an error. I like this approach a lot, but people often complain about how explicit you have to be with numeric types in OCaml. Technically, languages like Go and Haskell will also error out if you compare an int variable to a float variable, but if you just type an int in, as in the examples above, they’ll helpfully convert to a float for the comparison.
== example, it’s possible to maintain consistency by not allowing cross-type comparisons, but there are strange corners in floating point math even if in pure floating point land.
What happens when we try to compute
Infinity / Infinity? We get a special value reserved for values that are not a number,
NaN is supposed to fail all comparisons, leading to all kinds of fun stuff, like:
$ ghci Prelude> nan == nan False Prelude> nan < nan || nan == nan || nan > nan False
One of the justifications for having a special
NaN value is that the
NaN is supposed to propagate, indicating that an error occurred at some point in the calculation.
$ ghci Prelude> min nan 1 1.0 Prelude> min 1 nan NaN
But it doesn’t. The
NaN gets suppressed, sometimes, depending on the order of the arguments. This makes sense if you think of min as being implemented by an
function min(a,b) if a < b return a else return b end end
Since the comparison always fails, the else clause always wins.
Not all languages do that. Kahan once proposed that
NaN always gets suppressed, matching C’s behavior, treating
NaN as a missing value. Other folks think it’s better to propagate the
NaN to avoid suppressing errors. But, no matter what happens, reasonable people can disagree on whether or not the result makes sense. Some people want min to be a consistent drop in replacement for the naive way of coding it up, and some people want to avoid the possibility of missing an error. Some people even want to suppress the error. I don’t understand that one, but Kahan’s smarter than me and has thought about this more than me, so there’s probably a good reason that I just don’t understand.
We’ve covered Haskell and C. Let’s see what happens in other languages when give
NaN for one argument and
1.0 for the other. The languages we’ve tried do one of the following: always return the 1st argument, always return the 2nd argument, always return a
NaN, always return the non-
NaN, or throw an exception.
And then there are the folks who just want whatever’s fastest3, i.e., whatever hardware does. On x86, the SSE/AVX min and max instructions act like a normal
if/else comparison and return the second operand if there’s a
NaN. To get either of the alternatives above (
NaN propagation or
NaN suppression), you need to turn an instruction like
into an entire sequence of instructions like
MINPD PCMPNEQD BLENDPD
If that’s in your critical path on x86, getting
NaN to propagate is a noticeable performance hit. The equivalent ARM instruction,
NaN, so you’re forced to give up on cross-platform consistency if you want the best possible performance.
If you want to propagate
NaN, this corner case has its own corner case, which is what to do for
min(-Infinity, NaN). The argument for propagating
NaN is the same as before. The case for propagating
-Infinity is that that
min(-Infinity, foo) should be
-Infinity for all foo. This is analogous to a decades old unsettled debate in the hardware design world, X-optimism vs. X-pessimism. Unlike in hardware design, where X-ism is central to verifying the correctness of hardware,
-Infinity is just a corner case of a corner case, so we haven’t had decades of debate, but that doesn’t make the issue any simpler. I don’t want to replay the entire debate, but suffice to say that it’s possible to come up with arbitrarily many examples that make either decision look unreasonable.
To sum it up, different languages pick different, totally reasonable behaviors. And if you want good performance, you’ll get different results on different hardware. And that doesn’t even scratch the surface of floating point weirdness. In addition to the problems mentioned in Goldberg’s classic What Every Programmer Should Know About Floating Point Arithmetic, there are tens or hundreds of weird corner cases that can cause bugs4.
Thanks to Carter kicking off a discussion with Stefan about
NaN comparisons, to Stefan for his comments on
NaN comparisons as well as floating point transitivity, to Erik for comments on some floating point nuances that I missed, and to Alyssa for pointing out some omissions in my comments about Haskell.
This table just refers to this specific case.
==is famously not transitive in js for all sorts of things. ↩
Note that our in between value,
9007199254740993, will get converted to
9007199254740994.0under the default rounding mode on most platforms. This is because the normal default rounding mode is round to nearest, breaking ties by making the last digit even (i.e., 0 and not 1). That, in turn, is because the standard grade school algorithm of rounding ties up introduces a rounding bias. ↩
If you read through the IEEE standards meeting notes, you can see that they were pretty concerned about performance: “Roger’s experiments (on Itanium, x87, & SSE/SSE2) suggest that the cost of special handling of +/-0 & NaNs is high (3x, 2x, & 12x, respectively)” ↩
Here’s something to think about next time someone proposes we standardize on just high-enough precision floating point for all calculations, like in js. What happens when you calculate
log1p(x) = if (1.0 + x == 1.0) else log(1.0 + x) * x / ((1.0 + x) - 1.0)on a machine with extended precision floating point vs. one with single or double precision? ↩