> This is not the first time we can see Nvidia taking shortcuts to achieve maximum performance of their GPUs
Why is implementing it correctly not performant? For context I have no idea how rounding is typically implemented anyways.
adrian_b 1 days ago [-]
It is not correct because it does not implement the FP arithmetic standard and this can lead to much greater numerical errors than expected.
NVIDIA is not solely responsible, because the Microsoft DirectX specification includes the non-standard behavior.
Nevertheless, as shown in TFA, both the AMD and Intel GPUs allow the user to choose between correct behavior and incorrect behavior that might be faster, while NVIDIA ignores what the user requests and implements only the non-standard behavior.
The developers of graphics or ML/AI applications do not care about errors, but there are also people who want to use GPUs for normal computations, where the accuracy of the results matters, so they want to be able to choose between correct behavior and incorrect but faster behavior.
Actually "faster" is a misnomer, because denormals can be handled correctly without diminishing the speed, but that costs additional die area. Thus what NVIDIA gains by not implementing the right behavior is a reduced production cost.
codedokode 23 hours ago [-]
Maybe we don't need denormals in most cases. Denormals are extremely tiny numbers (on the order of 2^-149 for 32-bit floats). It would be nice if RISC-V, which aims for simplicity, got rid of them. If you need such small numbers, I don't know, just use doubles or write your own floats.
adrian_b 6 hours ago [-]
Denormals are not needed if you are willing to handle underflow exceptions.
Before Intel 8087 and the IEEE 754 standard, any decent floating-point unit generated overflow exceptions and underflow exceptions, which had to be handled by the programmer, unless the default behavior of crashing the program was acceptable.
Intel 8087 and the standard based on it offered lazy programmers the option of not handling the exceptions, in which case overflows generate infinities and underflows generate denormals.
When the exceptions are not handled, it is supposed that the programmer will check the final results of a long computation, and if infinities and denormals are not desired, but they exist nonetheless in the results, the programmer will investigate the reason and then the bug will be fixed.
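The "check the final results" step described above can be sketched in a few lines. This is only an illustration: the helper name `audit` and the sample values are made up here, and it relies on Python floats being IEEE 754 doubles with exceptions masked for multiplication and subtraction.

```python
import math
import sys

def audit(values):
    """Count the special values that reveal masked overflows and underflows."""
    counts = {"inf": 0, "nan": 0, "subnormal": 0}
    for v in values:
        if math.isinf(v):
            counts["inf"] += 1
        elif math.isnan(v):
            counts["nan"] += 1
        elif 0.0 < abs(v) < sys.float_info.min:
            # Nonzero but below the smallest normal double: a denormal.
            counts["subnormal"] += 1
    return counts

# Hypothetical results of a long computation.
results = [
    1.0,
    1e308 * 10.0,                 # masked overflow produces infinity
    1e-200 * 1e-120,              # masked underflow produces a denormal
    float("inf") - float("inf"),  # invalid operation produces NaN
    0.0,
]
report = audit(results)
print(report)
```

A non-empty report is the cue to go back and investigate where the overflow or underflow happened.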
So anyone is free to ensure that no denormals will ever appear in an application, by enabling the underflow exception. If it is desired that the program must not crash, then the program must be written carefully, so that underflows are impossible.
There is no correct way of eliminating denormals, except throwing exceptions on underflows.
The "flush denormals to zero on output" and "treat denormals as zero on input" behaviors are not permissible in any program that must produce correct results. Anyone who uses -ffast-math or similar options for compiling a program that is not intended for graphics or ML/AI, where errors supposedly do not matter, makes an unforgivable mistake.
Unfortunately, "-ffast-math" enables a very large number of compilation options. Some of them are safe and can cause a great increase in performance, like using fused multiply-add instructions. Others are not only dangerous, like flushing denormals to zero, but also provide a negligible performance increase on many processors.
Therefore, instead of aggregate options like "-ffast-math" one must enable only some of the component options, for maximum performance, without affecting result accuracy. For example, in gcc and clang one must use "-ffp-contract=fast", for enabling FMA instructions.
codedokode 4 hours ago [-]
Sorry, I did not understand the point about underflows. Underflow happens with and without denormals - denormals just allow you to go a little further before the number turns into a zero.
Also, there are cases when you do not want to crash on overflow - for example, in live audio processing you do not want the sound to stop only because there was an overflow in one audio sample.
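How much "further" denormals let a value go before it becomes zero can be measured directly. The sketch below assumes Python floats are IEEE 754 doubles. The `flush_subnormals` switch only emulates flush-to-zero in software, it does not change any hardware mode.

```python
import sys

MIN_NORMAL = sys.float_info.min  # 2**-1022, smallest normal double

def count_halvings(flush_subnormals):
    """Halve MIN_NORMAL until it reaches zero and count the steps."""
    x = MIN_NORMAL
    steps = 0
    while x > 0.0:
        x /= 2.0
        if flush_subnormals and x < MIN_NORMAL:
            x = 0.0  # emulated flush-to-zero, not a real hardware mode
        steps += 1
    return steps

gradual_steps = count_halvings(flush_subnormals=False)
flushed_steps = count_halvings(flush_subnormals=True)
print(gradual_steps, flushed_steps)
```

With gradual underflow the value survives 52 more halvings, losing one bit of precision at each step, before the 53rd halving rounds it to zero. With the emulated flush it becomes zero on the first halving.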
adrian_b 4 hours ago [-]
When underflows happen there are 2 possible actions on any standard processor, which are selected by a bit in a configuration register.
One option is to throw an exception. In that case a program will never generate any denormals. Denormals can appear in such a program only when they are present in input data generated by another program, which has been executed with a different exception configuration.
The second option is to generate denormals. In such a configuration every underflow generates a denormal and such denormals are propagated through later arithmetic operations, unless they are added to or subtracted from a much bigger number, when they disappear.
What you say about audio is the very reason why denormals have been introduced in the standard. You are allowed to choose the second option, i.e. to mask the underflow exception, so there will be no underflow exceptions and the underflows will generate denormals, which will introduce minimal errors that are acceptable in most applications.
Unlike with the errors introduced by denormals, which are typically negligible, the non-standard third option, to flush denormals to zero, can easily produce huge errors that are unacceptable for most applications. Thus this non-standard option is acceptable only in applications where the errors cannot have serious consequences, e.g. they may cause bad pixels that are not noticed in an animation or they may change the probability of some token in LLM inference, which may not change the actual sampled output, and even if the output is changed, it might not matter among more important causes of hallucinations.
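The second option (masked underflow) is what plain Python arithmetic shows, assuming the usual IEEE 754 double implementation of `float`. A small sketch of a denormal being generated, propagated, and then absorbed by a much bigger number:

```python
import sys

MIN_NORMAL = sys.float_info.min   # smallest normal double, about 2.2e-308

# The exact product 1e-320 is below MIN_NORMAL, so the masked underflow
# yields a denormal instead of zero or an exception.
tiny = 1e-200 * 1e-120
is_subnormal = 0.0 < tiny < MIN_NORMAL

# The denormal is carried through later operations.
propagated = tiny * 4.0

# Added to a much bigger number it disappears without any further signal.
absorbed = 1.0 + tiny

print(tiny, is_subnormal, propagated, absorbed)
```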
rwallace 5 hours ago [-]
The one way I know of for denormals to arise in real calculations is when you have a process that converges on zero. In which case, values will pass through the denormal range on the way from the normal range down to zero. And converting denormals to zero is obviously correct.
What other cases have you seen of denormals arising in real calculations?
adrian_b 5 hours ago [-]
If you compute a limit numerically, the decision that you have reached the limit must normally be taken long before there is any possibility of underflow, i.e. of generating denormals.
Typically you compute the difference between 2 adjacent terms of the convergent sequence and you decide that you have reached the limit when adding the difference to the current term to get the next does not change it (or when the relative error is smaller than some threshold). At this time the difference will still be many orders of magnitude greater than a denormal.
If the limit happens to be zero, then what you describe can happen. The programs where this can happen normally combine two different criteria to decide that the limit has been reached, i.e. that either the relative error is small enough, i.e. like I said that adding the difference to the previous term does not change it, or that the absolute error is small enough, i.e. that the difference is smaller than some threshold. The absolute error used as threshold is normally chosen based on what is physically meaningful in that problem, i.e. depending on what kind of physical quantity corresponds to the values of the terms of the convergent series.
An example of applications where the users must configure both relative errors and absolute errors, to be used as criteria of convergence, are the SPICE-like programs used to simulate electronic circuits, where the user must configure both a generally-applicable relative error, and absolute errors for different kinds of physical quantities, e.g. an absolute error for voltages and an absolute error for electric currents, which will be used, respectively, when a sequence of voltages converges towards zero or a sequence of currents converges towards zero, so that the absolute error criterion will be satisfied before the relative error criterion is satisfied.
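A minimal sketch of the combined stopping rule described above. The function name, the default tolerances and the two test iterations are illustrative assumptions, they are not taken from SPICE or any other real simulator.

```python
import sys

def iterate_to_limit(step, x0, reltol=1e-12, abstol=1e-9, max_iter=10_000):
    """Iterate x <- step(x) until the change is small in relative OR absolute terms."""
    x = x0
    for n in range(1, max_iter + 1):
        x_next = step(x)
        delta = abs(x_next - x)
        # Relative criterion for nonzero limits, absolute criterion for
        # sequences that converge towards zero.
        if delta <= reltol * abs(x_next) or delta <= abstol:
            return x_next, n
        x = x_next
    raise RuntimeError("no convergence within max_iter iterations")

# A sequence converging to zero: the relative test can never succeed
# (the change is always as large as the term), so the absolute one stops it.
limit, iterations = iterate_to_limit(lambda x: 0.5 * x, 1.0)

# A sequence with a nonzero limit: Newton's iteration for sqrt(2).
root, _ = iterate_to_limit(lambda x: 0.5 * (x + 2.0 / x), 1.0)

print(limit, iterations, root)
```

The zero-limit iteration stops at 2^-30, hundreds of binary orders of magnitude above the denormal range, which is the point made above: a sensible absolute tolerance ends the search long before underflow is possible.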
In any case, in a correct program the attempt to find the limit should always stop before underflows become possible, so denormals should never be generated even when the underflow exceptions are masked.
Denormals can appear in a lot of cases when almost equal values are subtracted, e.g. in the solution of many kinds of badly conditioned equations, but in most such cases there may exist alternative formulae that avoid underflows, i.e. the generation of denormals.
In general, when denormals appear, this signals bugs in the program, which must be investigated and fixed. The purpose of denormals is to allow the programmer to not fix the bugs, without causing catastrophic errors, like those that can happen when underflows generate null results, i.e. when "denormals are flushed to zero".
Careful programmers should nonetheless fix any bugs that generate denormals.
rwallace 4 hours ago [-]
Ah! So let me paraphrase to make sure I understand. In your experience, the real use case of denormals is not that you want to compute with them. It's that they are a useful error signal, because they typically indicate that the equations are badly conditioned, which means the results cannot be trusted. So in that scenario, silently flushing them to zero is bad, but so is silently computing with them. What you really want is for denormals to raise an exception?
adrian_b 4 hours ago [-]
They are indeed a useful error signal, as are infinities and NaNs.
Moreover, they are much more benign than infinities or NaNs, which signal bugs that you most likely have to fix.
Denormals cause only very small errors, which may be acceptable in most applications. Therefore you may choose to ignore such bugs and not fix them, because a modified algorithm that avoids underflows may be significantly more complex than the original algorithm.
Configuring the non-standard option to flush denormals to zero is bad for two reasons. Not only does it remove the evidence that something bad has happened, but unlike with denormals, this can cause huge errors, even infinite errors.
When the operations are done according to the IEEE standard, every operation has a limit for the relative error and it is possible to do a numeric analysis of some computational algorithm and estimate the maximum errors that can affect its results.
When underflows generate neither exceptions nor denormals, all guarantees are removed and you can no longer predict anything about the final errors of an algorithm.
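One way to see how an "infinite error" can arise: with gradual underflow, `x != y` guarantees `x - y != 0`, so a guard on equality protects a later division. Flushing breaks that guarantee. The sketch below emulates flush-to-zero in software (`sub_flushed` is a stand-in, not a hardware mode) and assumes IEEE 754 doubles.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # 2**-1022

def sub_gradual(a, b):
    """Standard subtraction: a denormal result is kept."""
    return a - b

def sub_flushed(a, b):
    """Emulated non-standard subtraction: a denormal result becomes zero."""
    d = a - b
    return 0.0 if 0.0 < abs(d) < MIN_NORMAL else d

def guarded_reciprocal(a, b, subtract):
    if a == b:
        return 0.0  # caller-defined fallback for equal inputs
    d = subtract(a, b)
    # Hardware with the divide-by-zero exception masked returns infinity here;
    # Python would raise ZeroDivisionError, so the infinity is spelled out.
    return math.inf if d == 0.0 else 1.0 / d

x = 1.5 * MIN_NORMAL
y = MIN_NORMAL
with_denormals = guarded_reciprocal(x, y, sub_gradual)
with_flush = guarded_reciprocal(x, y, sub_flushed)
print(with_denormals, with_flush)
```

With denormals the result is the exact, finite value 2^1023. With the emulated flush the guard passes, the difference is zero anyway, and the result is infinity.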
crote 1 days ago [-]
Another thing to keep in mind is that CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago.
For a lot of applications the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.
mananaysiempre 1 days ago [-]
> CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago
Intel CPU processing, where slowdowns can be as bad as a couple hundred cycles. AMD CPUs penalize them much more mildly, usually single-digit cycles. (No idea about ARM.)
adrian_b 1 days ago [-]
Denormal processing is slow only on certain CPUs, where the designers have been lazy, so denormals, when encountered, are handled by a microprogrammed sequence.
During the last half century there have been plenty of CPUs where denormals have been handled in hardware, so that any slowdown caused by them is negligible.
Except for generating graphic images seen by humans or in ML/AI applications, neither flushing results to zero nor treating denormal inputs as zero are acceptable, because they can lead to huge errors.
Whoever fears that denormals can slow down an application, must enable the underflow exception. In that case denormals are never generated, but the underflow exceptions must be handled, because when denormals are not desired but underflows happen, that means that there are bugs in the program, which must be fixed.
Denormals have been created so that people can mask the underflow exception and avoid handling it, without dire consequences.
However this habit of no longer handling the floating-point exceptions, like before the IEEE 754 standard, has created younger developers who are no longer aware of how FP arithmetic must be handled to avoid errors, so now there are too many who believe that the use of "-ffast-math" is permitted in general-purpose programs, not only in special applications where result accuracy does not matter.
For correct results, you must use either denormals or underflow exception handling. There is no third choice. The third choice, like in GPUs, is only for when correctness is irrelevant.
adgjlsfhk1 1 days ago [-]
CPUs that aren't Intel are plenty fast on denormals. Intel is the only one where denormals are 100x slower. (And Intel has fixed that on their new CPUs, but only on their E-cores.)
andrepd 1 days ago [-]
More like 100x, but not sure how true that is nowadays.
yosefk 1 days ago [-]
Flush denormals to zero. Even their inventor had trouble writing correct code in their presence - see the Appendix to that "what every programmer should know..." paper
mananaysiempre 1 days ago [-]
On the other hand, they (unexpectedly to the inventor, who intended them to be a debugging tool) underpin a few foundational results in correctly rounded computation, such as https://en.wikipedia.org/wiki/Sterbenz_lemma.
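The Sterbenz lemma says that if y/2 <= x <= 2y, then x - y is computed exactly, and it needs gradual underflow to hold near the bottom of the range. A quick empirical check, using exact rational arithmetic as the reference. This is a spot check on random samples under the assumption of IEEE 754 doubles, not a proof, and the helper name is made up.

```python
from fractions import Fraction
import random
import sys

def sterbenz_holds(x, y):
    """Return True if the float subtraction x - y is exact for y/2 <= x <= 2*y."""
    assert y / 2 <= x <= 2 * y, "precondition of the lemma"
    return Fraction(x) - Fraction(y) == Fraction(x - y)

rng = random.Random(0)
samples = [rng.uniform(1.0, 1e6) for _ in range(1000)]
normal_ok = all(sterbenz_holds(y * rng.uniform(0.5, 2.0), y) for y in samples)

# Near the bottom of the normal range the exact difference is a denormal,
# so the lemma survives only because denormals are representable.
MIN_NORMAL = sys.float_info.min
x, y = 1.25 * MIN_NORMAL, MIN_NORMAL
subnormal_ok = sterbenz_holds(x, y)
difference_is_subnormal = 0.0 < x - y < MIN_NORMAL

print(normal_ok, subnormal_ok, difference_is_subnormal)
```

If the last difference were flushed to zero, the subtraction would no longer be exact and the lemma would fail for this pair.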
loicd 1 days ago [-]
> Even their inventor had trouble writing correct code in their presence
I didn't know that. Could you provide a more specific reference?
Dwedit 1 days ago [-]
Denormals happen to be the way that Zero can even be represented at all?
aleph_minus_one 1 days ago [-]
For zeros (+0 and -0), there exist special representations in the IEEE 754 standard. Denormalized numbers are a different concept.
Dwedit 20 hours ago [-]
Zero is a denormal number where the mantissa is zero. If you disallow denormals and treat them all as zero, then actual zero follows as well.
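Both views can be read off the binary32 encoding: zeros and denormals share the all-zero exponent field, and the fraction field tells them apart. A small classifier as an illustration (the function names are made up, the field layout is the standard 1/8/23-bit split):

```python
import struct

def classify_f32(bits):
    """Classify a 32-bit IEEE 754 pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "nan"
    return "normal"

def f32_value(bits):
    """Reinterpret a 32-bit pattern as a float (returned as a Python double)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for pattern in (0x00000000, 0x80000000, 0x00000001, 0x007FFFFF, 0x00800000):
    print(f"{pattern:#010x} {classify_f32(pattern):9s} {f32_value(pattern)!r}")
```

The pattern 0x00000001 is the smallest denormal, 2^-149, and 0x00800000 is the smallest normal number, 2^-126.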
andrepd 1 days ago [-]
It's one of several issues with the design of IEEE floats, unfortunately. I wish we could start thinking more seriously about a new design, to complement if not replace IEEE in the long term. Posits are an example https://github.com/andrepd/posit-rust
freeopinion 1 days ago [-]
Thank you for this contribution.
Your repo has a link to the standard[0], which might interest some people. It makes me unreasonably happy to know that this was funded out of Singapore.
The IEEE floats do not have any serious design issue. They are much better than any other floating-point formats that have ever been used.
There have been many examples of badly designed floating-point numbers, e.g. those used by the IBM mainframes and those used by the DEC minicomputers. In both cases, minimizing the costs for computer vendors was prioritized over the properties needed by the end users.
In scientific and technical computations, a constant relative error over the entire range of the numbers is optimal.
Only logarithms would be better than IEEE floats, but the addition and subtraction of numbers stored as logarithms is too slow, so the IEEE floating-point numbers are a compromise between the speed of multiplication and the speed of addition.
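The near-constant relative error of IEEE doubles can be checked by comparing the spacing between adjacent numbers with their magnitude. A sketch, assuming Python 3.9 or later for `math.ulp`; the sample magnitudes are arbitrary:

```python
import math

def relative_spacing(x):
    """Gap to the next representable double, relative to x itself."""
    return math.ulp(x) / x

magnitudes = [1e-300, 1e-100, 1.0, 1e100, 1e300]
spacings = [relative_spacing(x) for x in magnitudes]
for x, s in zip(magnitudes, spacings):
    print(f"{x:8.0e}  {s:.3e}")

# In the denormal range the spacing stays fixed while the values shrink,
# so the relative spacing grows; for the smallest denormal it is 100%.
smallest_subnormal = 5e-324
print(relative_spacing(smallest_subnormal))
```

Across 600 decimal orders of magnitude the relative spacing stays between 2^-53 and 2^-52. It varies by at most a factor of two within each binade, which is the price paid for not storing logarithms.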
Posits redistribute the relative error over the range of numbers, increasing the precision of numbers close to unity by reducing the precision of numbers far from unity. This is a property that may seem desirable for input data and output data, but it is extremely undesirable for the intermediate values that are generated in computations.
Even for input data and output data, in engineering there are many characteristic values of parts used in manufacturing, such as electronic components, where constant relative errors are needed over a range greater than 10^9 or even greater than 10^12 (e.g. the values of standard resistors, with a given tolerance, may vary from milliohms to gigaohms and those of standard capacitors from femtofarads to farads; it would not be acceptable to represent such nominal values as posits, with a tolerance depending on the nominal value).
There may exist some applications where this is useful, but all such applications are among those that need low precision numbers, of 32 bits or less. So 16-bit or even 32-bit posits might be useful in certain circumstances, but it is pretty certain that 64-bit or bigger posits will always be inferior to IEEE floating-point numbers.
The problem is that while the IEEE floating-point numbers are either optimal or at least acceptable in almost all applications, the fact that posits might be better only in certain special applications makes it unlikely that the development of dedicated fast hardware for them can be worthwhile.
Even if some applications might indeed like a better precision for numbers close to unity, for those applications posits must contend not only with floating-point numbers but also with fixed-point numbers, which do not need any special hardware and can be implemented in any standard CPU with a simple software library.
Fixed-point numbers have even better precision than posits for numbers close to unity, but posits have a greater range, where their precision drops quickly towards the extremities of the range.
So the application domain of posits is squeezed between those of floating-point numbers and those of fixed-point numbers, leaving a very small number of applications where posits are optimal.
I am not aware of any application that would justify the additional cost for posit-processing hardware. I think that the only chance for posits would be if someone would show that some posit format is better for certain ML/AI/LLM computations than narrow FP formats like BF16, FP8, FP4.
That will be the only use case that could find people willing to pay enough money for the design of posit-processing hardware, in some kind of NPU for training or for inference.
Like I have said, for technical/scientific computing posits are much inferior to FP64, graphic applications are happy with FP32 and FP16 and they do not have any reason for a change, while the applications that need high precision around unity are happy with fixed-point numbers and they also do not have any incentive to change.
andrepd 2 hours ago [-]
Ahahaha, I'm sorry but that's just a ridiculous thing to say. "The IEEE floats do not have any serious design issue. They are much better than any other floating-point formats that have ever been used". Really?? x)))
It's obvious that IEEE floats have several issues: low precision, excessive dynamic range, underflow to 0 causing infinite loss of precision (ditto for overflow), denormals, non-portability, lack of a total order, redundancy of NaN bit patterns, poor scaling to low precision... I can go on! Whether you regard those as relevant or as unimportant, or perhaps as unavoidable, is your opinion. But they do exist!
To re-iterate: claiming that IEEE floats are superior to any existing or proposed alternative is a claim you can attempt to make, although I disagree with it. But claiming "IEEE floats do not have any problems whatsoever" is simply not a good-faith conversation...
> In scientific and technical computations, a constant relative error over the entire range of the numbers is optimal.
What do you possibly mean by "optimal"? For instance, inference on a neural network using a 16-bit (or even 8-bit!) posit type tailored to the distribution of the weights in that neural network can yield better results than with 32-bit floats! Obviously floats are not "optimal" in any possible conceivable situation (neither are posits "more optimal" than floats in any conceivable situation).
Even in "traditional" HPC applications, like weather modelling, experiments have shown 16-bit posits to be acceptable replacements for 32 or even 64-bit computations.
> Like I have said, for technical/scientific computing posits are much inferior to FP64
Repeating something does not make it true :) Like I have said, 32 and 16-bit posits can replace 64 and 32-bit floats in many important applications (obviously not all). HPC and ML workloads are largely memory-bound nowadays; halving the number of bits can yield a doubling of performance, roughly speaking.
> it is pretty certain that 64-bit or bigger posits will always be inferior to IEEE floating-point numbers.
> The problem is that while the IEEE floating-point numbers are either optimal or at least acceptable in almost all applications, the fact that posits might be better only in certain special applications
You assert this repeatedly as dogma, without proof x) I don't get it, do you have a dog in this race somehow? Bizarre.
In fact, the simplest way to see that this is wrong is to consider a "posit"-like format with no regime bits, but with otherwise the same structure: twos-complement representation, no underflow or overflow, deterministic rounding, a quire. This format is essentially an improved version of an IEEE float, without most of its warts, but still a constant relative error (actually constant, unlike IEEE with its subnormals!), similar hardware encode/decode implementations, etc.
adrian_b 21 minutes ago [-]
>It's obvious that IEEE floats have several issues: low precision, excessive dynamic range, underflow to 0 causing infinite loss of precision (ditto for overflow), denormals, non-portability, lack of a total order, redundancy of NaN bit patterns, poor scaling to low precision... I can go on!
From your list of supposed defects, only one can be considered a defect, i.e. that there are too many NaN values. However there is a justification for this, because using fewer values as NaNs would have required more expensive hardware for detecting which operands are NaNs.
> low precision, excessive dynamic range
This is a single item, because choosing the dynamic range determines the precision and vice versa.
Contrary to your claim, a greater dynamic range was one of the greatest improvements of the IEEE 754 format over the earlier FP formats, like that of IBM and that of DEC.
With a lower dynamic range, overflows in intermediate computation results become extremely frequent and unavoidable. This was a major problem before Intel 8087 and the IEEE 754 standard. The dynamic range of the IEEE FP64 format is great enough to make overflows very unlikely in typical technical/engineering computations, and this is a very desirable property.
> underflow to 0 causing infinite loss of precision
I do not know why you have written this false statement, but there exists no such thing in the IEEE standard for floating-point arithmetic.
Underflows have only 2 standard behaviors, they either generate exceptions that must be handled by the programmer or they generate denormal numbers, which minimize the loss of precision. It is impossible to have "infinite loss of precision" on underflows in a standard-conforming processor.
> ditto for overflow
Overflows have 2 possible standard behaviors, they can generate either exceptions or infinities. Both possible behaviors allow the programmer to detect that there is a bug in the program, which must be fixed. There exist no methods of handling overflow that can avoid the loss of precision (when using fixed-length numbers), so the only thing that can be done, and it is done by the standard, is to ensure that the user is made aware that an overflow happened.
> denormals
Denormals are an optional feature of the standard, like infinities and NaNs. They can be completely avoided by enabling the underflow exception.
Denormals offer a choice to the programmer, to avoid handling the underflow exceptions. If the programmer chooses to use denormals, they minimize the loss of precision when underflows happen.
> non-portability
Huh ?!
The IEEE FP numbers are the most portable FP format known in history. Before this standard, every computer-making company had their own FP formats that were incompatible. The conversion of numeric data between different computers was a very complex problem.
Now this is far in the past. Even if some processors, mainly GPUs, do not support all features of the standard, at least the numeric formats are everywhere the same so no conversions are required.
> poor scaling to low precision
This has nothing to do with the IEEE standard. Any kind of floating-point numbers must scale poorly towards very low precision, because when very few bits are available for the complete number, then even fewer bits can be used for the significand and for the exponent, which makes the tradeoff between precision and dynamic range difficult.
Even so, the IEEE standard FP16 format has adequate precision and dynamic range for its main application, which is storing color component values in pixels, in graphics and video applications.
For ML/AI applications, where even lower precision is desired, 8-bit to 4-bit floating-point numeric formats have become preferred to fixed-point numbers, despite the "poor scaling" of floating-point numbers to low precision.
So your whole list contains no defect of the IEEE standard floating-point numbers, especially not in comparison with other FP number formats.
The only point that is specific to the IEEE format and about which one could argue regarding some specific application is the choice of the dynamic ranges, which for FP64 and FP128 are greater than those used in the older FP formats, which had been in use before 1980.
I have started using computers with IBM mainframes and DEC minicomputers, so I have practical experience with those older formats.
Switching to IBM PCs with 8087 and their successors was a great improvement by eliminating the problem of overflows. On older computers, in order to avoid overflows it was frequently necessary to introduce well chosen scale factors in various formulae and equations. The necessity of handling those scale factors removed much of the advantage that floating-point numbers have over fixed-point numbers. Floating-point numbers were invented precisely to free the programmers from the chores of having to deal with scale factors.
For solving practical engineering problems, like the design of electronic devices or integrated circuits, the dynamic range of IEEE FP64 numbers is good, while the dynamic ranges of the older FP formats (and also that of IEEE FP32 numbers) are insufficient.
https://www.w3.org/TR/WGSL/#concrete-float-accuracy
This is all fully tested in the CTS.
https://gpuweb.github.io/cts/standalone/?q=webgpu:shader,*
Why is implementing it correctly not performant? For context I have no idea how rounding is typically implemented anyways.
NVIDIA is not responsible alone, because the Microsoft DirectX specification includes the non-standard behavior.
Nevertheless, as shown in TFA, both the AMD and Intel GPUs allow the user to choose between correct behavior and incorrect behavior that might be faster, while NVIDIA ignores what the user requests and implements only the non-standard behavior.
The developers of graphics or ML/AI applications do not care about errors, but there are also people who want to use GPUs for normal computations, where the accuracy of the results matters, so they want to be able to choose between correct behavior and incorrect but faster behavior.
Actually "faster" is a misnomer, because denormals can be handled correctly without diminishing the speed, but that costs additional die area. Thus what NVIDIA gains by not implementing the right behavior is a reduced production cost.
Before Intel 8087 and the IEEE 754 standard, any decent floating-point unit generated overflow exceptions and underflow exceptions, which had to be handled by the programmer, unless the default behavior of crashing the program was acceptable.
Intel 8087 and the standard based on it have offered to the lazy programmers the option to not handle the exceptions, in which case overflow exceptions generate infinities and the underflow exceptions generate denormals.
When the exceptions are not handled, it is supposed that the programmer will check the final results of a long computation, and if infinities and denormals are not desired, but they exist nonetheless in the results, the programmer will investigate the reason and then the bug will be fixed.
So anyone is free to ensure that no denormals will ever appear in an application , by enabling the underflow exception. If it is desired that the program must not crash, then the program must be written carefully, so that underflows are impossible.
There is no correct way of eliminating denormals, except throwing exceptions on underflows.
The flush denormals to zero on output and interpret denormals as zero on input behaviors are not permissible in any program that must produce correct results. Anyone who uses -ffast-math or similar options for compiling a program that is not intended for graphics or ML/AI, where errors supposedly do not matter, makes an unforgivable mistake.
Unfortunately, "-ffast-math" enables a very large number of compilation options. A part of them are safe and they can cause a great increase in performance, like using fused multiply-add instructions. Others not only are dangerous, like flushing denormals to zero, but they also provide negligible performance increase on many processors.
Therefore, instead of aggregate options like "-ffast-math" one must enable only some of the component options, for maximum performance, without affecting result accuracy. For example, in gcc and clang one must use "-ffp-contract=fast", for enabling FMA instructions.
Also, there are cases when you do not want to crash on overflow - for example, live audio processing, you do not want sound to stop only because there ws an overflow in one audio sample.
One option is to throw an exception. In that case a program will never generate any denormals. Denormals can appear in such a program only when they are present in input data generated by another program, which has been executed with a different exception configuration.
The second option is to generate denormals. In such a configuration every underflow generates a denormal and such denormals are propagated through later arithmetic operations, unless they are added to or subtracted from a much bigger number, when they disappear.
What you say about audio is the very reason why denormals have been introduced in the standard. You are allowed to choose the second option, i.e. to mask the underflow exception, so there will be no underflow exceptions and the underflows will generate denormals, which will introduce minimal errors that are acceptable in most applications.
Unlike with the errors introduced by denormals, which are typically negligible, the non-standard third option, to flush denormals to zero, can easily produce huge errors that are unacceptable for most applications. Thus this non-standard option is acceptable only in applications where the errors cannot have serious consequences, e.g. they may cause bad pixels that are not noticed in an animation or they may change the probability of some token in LLM inference, which may not change the actual sampled output, and even if the output is changed, it might not matter among more important causes of hallucinations.
What other cases have you seen of denormals arising in real calculations?
Typically you compute the difference between 2 adjacent terms of the convergent sequence and you decide that you have reached the limit when adding the difference to the current term to get the next does not change it (or when the relative error is smaller than some threshold). At this time the difference will still be many orders of magnitude greater than a denormal.
If the limit happens to be zero, then what you describe can happen. The programs where this can happen normally combine two different criteria to decide that the limit has been reached, i.e. that either the relative error is small enough, i.e. like I said that adding the difference to the previous term does not change it, or that the absolute error is small enough, i.e. that the difference is smaller than some threshold. The absolute error used as threshold is normally chosen based on what is physically meaningful in that problem, i.e. depending on what kind of physical quantity corresponds to the values of the terms of the convergent series.
Examples of applications where the users must configure both relative and absolute errors as convergence criteria are the SPICE-like programs used to simulate electronic circuits. There the user must configure a generally applicable relative error plus absolute errors for the different kinds of physical quantities, e.g. one absolute error for voltages and another for electric currents. These are used, respectively, when a sequence of voltages or a sequence of currents converges towards zero, so that the absolute error criterion is satisfied before the relative error criterion.
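A minimal sketch of such a combined stopping test, in Python. The names `converged`, `limit_of`, `rel_tol` and `abs_tol`, and the default tolerance values, are made up for illustration and are not the settings of any particular simulator.

```python
def converged(prev, cur, rel_tol=1e-9, abs_tol=1e-12):
    """Return True when the step from prev to cur is small enough.

    rel_tol and abs_tol are hypothetical names and values; abs_tol should
    be chosen from what is physically meaningful (volts, amperes, ...).
    """
    step = abs(cur - prev)
    # Relative criterion: the step is negligible compared with the term.
    # Absolute criterion: the step is below the meaningful threshold.
    return step <= rel_tol * abs(cur) or step <= abs_tol


def limit_of(next_term, x0, max_iter=10_000, **tols):
    """Iterate x -> next_term(x) until converged() reports the limit."""
    prev = x0
    for _ in range(max_iter):
        cur = next_term(prev)
        if converged(prev, cur, **tols):
            return cur
        prev = cur
    raise RuntimeError("no convergence within max_iter iterations")


# A sequence converging to zero: the absolute criterion stops the loop
# long before the terms come anywhere near the subnormal range.
x = limit_of(lambda v: v / 2, 1.0)
```

With the halving sequence the relative test alone would never trigger, because each step is always as large as the term itself; it is the absolute threshold that ends the iteration.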
In any case, in a correct program the attempt to find the limit should always stop before underflows become possible, so denormals should never be generated if the underflow exceptions are masked.
Denormals can appear in a lot of cases when almost equal values are subtracted, e.g. in the solution of many kinds of badly conditioned equations, but in most such cases there may exist alternative formulae that avoid underflows, i.e. the generation of denormals.
In general, when denormals appear, this signals bugs in the program, which must be investigated and fixed. The purpose of denormals is to allow the programmer to not fix the bugs, without causing catastrophic errors, like those that can happen when underflows generate null results, i.e. when "denormals are flushed to zero".
Careful programmers should nonetheless fix any bugs that generate denormals.
Moreover, they are much more benign than infinities or NaNs, which signal bugs that you most likely have to fix.
Denormals cause only very small errors, which may be acceptable in most applications. Therefore you may choose to ignore such bugs and not fix them, because a modified algorithm that avoids underflows may be significantly more complex than the original algorithm.
Configuring the non-standard option to flush denormals to zero is bad for two reasons. Not only does it remove the evidence that something bad has happened, but, unlike denormals, it can cause huge errors, even infinite errors.
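To make the difference concrete, here is a small Python sketch. It assumes the interpreter runs with the usual IEEE behavior (gradual underflow) for FP64, and it models flush-to-zero with a hypothetical helper `flush_to_zero`, since Python has no switch for the hardware mode.

```python
import sys

TINY = sys.float_info.min          # smallest positive normal double, 2**-1022

def flush_to_zero(x):
    """Hypothetical model of FTZ: any subnormal result becomes zero."""
    return 0.0 if 0.0 < abs(x) < TINY else x

a, b = 1.5 * TINY, 1.0 * TINY      # two distinct normal numbers

diff_ieee = a - b                  # exact subnormal result, 0.5 * TINY
diff_ftz = flush_to_zero(a - b)    # a != b, yet the difference is zero

# Scaling the difference back up shows that nothing was lost with
# denormals, while the flushed value stays zero (100% relative error).
scaled_ieee = diff_ieee * 2.0 ** 1000
scaled_ftz = diff_ftz * 2.0 ** 1000

# Dividing by the difference: finite with denormals, infinite with FTZ.
inv_ieee = 1.0 / diff_ieee
try:
    inv_ftz = 1.0 / diff_ftz
except ZeroDivisionError:          # Python raises where hardware returns inf
    inv_ftz = float("inf")
```

The property that gets lost is that `a != b` implies `a - b != 0`, which a lot of code silently relies on before dividing.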
When the operations are done according to the IEEE standard, every operation has a limit for the relative error and it is possible to do a numeric analysis of some computational algorithm and estimate the maximum errors that can affect its results.
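The per-operation bound can be checked directly. The sketch below is only an illustration for FP64 addition with round-to-nearest, where the bound is 2^-53 as long as the result is a normal number; `relative_error` is a name invented for this example.

```python
from fractions import Fraction

U = Fraction(1, 2 ** 53)    # unit roundoff of FP64 with round-to-nearest

def relative_error(a, b):
    """Exact relative error committed by the floating-point sum a + b."""
    exact = Fraction(a) + Fraction(b)     # Fraction converts doubles exactly
    computed = Fraction(a + b)            # the rounded hardware result
    return abs(computed - exact) / abs(exact)

err = relative_error(0.1, 0.2)
print(err <= U)
```

An error analysis of a whole algorithm is built by chaining such bounds from one operation to the next, which is why it stops working once a single operation no longer respects them.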
When underflows generate neither exceptions nor denormals, all guarantees are removed and you can no longer predict anything about the final errors of an algorithm.
For a lot of applications the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.
Intel CPU processing, where slowdowns can be as bad as a couple hundred cycles. AMD CPUs penalize them much more mildly, usually single-digit cycles. (No idea about ARM.)
During the last half century there have been plenty of CPUs where denormals have been handled in hardware, so that any slowdown caused by them is negligible.
Except when generating graphic images seen by humans or in ML/AI applications, neither flushing results to zero nor treating denormal inputs as zero is acceptable, because both can lead to huge errors.
Whoever fears that denormals can slow down an application must enable the underflow exception. In that case denormals are never generated, but the underflow exceptions must be handled, because when denormals are not desired but underflows happen, there are bugs in the program, which must be fixed.
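Python does not expose the hardware underflow trap for binary floats, so the following is only a software imitation of the idea: a hypothetical `checked_mul` that raises a made-up `FloatUnderflow` exception instead of quietly returning a denormal or zero. It tests the rounded result, which is an approximation of the way tininess is detected in hardware.

```python
import sys

class FloatUnderflow(ArithmeticError):
    """Hypothetical exception standing in for a trapped underflow."""

def checked_mul(a, b):
    """Multiply, raising instead of returning a subnormal or a zero
    produced from two nonzero operands."""
    r = a * b
    if a != 0.0 and b != 0.0 and abs(r) < sys.float_info.min:
        raise FloatUnderflow(f"{a!r} * {b!r} underflowed to {r!r}")
    return r

try:
    checked_mul(1e-160, 1e-160)        # true product is about 1e-320
except FloatUnderflow as exc:
    print("bug to investigate:", exc)
```

With real trapping hardware the check costs nothing in the common case; the point here is only that an underflow becomes a visible event that must be handled.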
Denormals have been created so that people can mask the underflow exception and avoid handling it, without dire consequences.
However, this habit of no longer handling the floating-point exceptions, as was required before the IEEE 754 standard, has produced younger developers who are no longer aware of how FP arithmetic must be handled to avoid errors, so now too many believe that using "-ffast-math" is acceptable in general-purpose programs, not only in special applications where result accuracy does not matter.
For correct results, you must use either denormals or underflow exception handling. There is no third choice. The third choice, like in GPUs, is only for when correctness is irrelevant.
I didn't know that. Could you provide a more specific reference?
Your repo has a link to the standard[0], which might interest some people. It makes me unreasonably happy to know that this was funded out of Singapore.
[0] https://posithub.org/docs/posit_standard-2.pdf
There have been many examples of badly designed floating-point numbers, e.g. those used by the IBM mainframes and those used by the DEC minicomputers. In both cases, minimizing the costs for computer vendors was prioritized over the properties needed by the end users.
In scientific and technical computations, a constant relative error over the entire range of the numbers is optimal.
Only logarithms would be better than IEEE floats, but the addition and subtraction of numbers stored as logarithms is too slow, so the IEEE floating-point numbers are a compromise between the speed of multiplication and the speed of addition.
Posits redistribute the relative error over the range of numbers, increasing the precision of numbers close to unity by reducing the precision of numbers far from unity. This is a property that may seem desirable for input data and output data, but it is extremely undesirable for the intermediate values that are generated in computations.
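The redistribution can be quantified with a few lines of Python. This is a sketch that assumes the layout of the 2022 Posit Standard linked in this thread (1 sign bit, a run-length encoded regime, 2 exponent bits, the rest fraction); `posit_fraction_bits` is a name invented here, and the function is only meaningful for scales inside the posit range.

```python
def posit_fraction_bits(n, scale, es=2):
    """Fraction bits left in an n-bit posit for a value near 2**scale."""
    k = scale // (1 << es)                    # regime value (floor division)
    regime_len = k + 2 if k >= 0 else -k + 1  # run of bits plus terminator
    regime_len = min(regime_len, n - 1)       # regime cannot exceed the word
    return max(0, n - 1 - regime_len - es)

FP32_FRACTION_BITS = 23   # constant over the whole normal range of FP32

for scale in (0, 20, 40, 80, 118):
    print(f"2**{scale}: posit32 {posit_fraction_bits(32, scale)} bits, "
          f"FP32 {FP32_FRACTION_BITS} bits")
```

Near unity the 32-bit posit has a few more fraction bits than FP32, but the count drops by one bit for every factor of 16 in magnitude, which is exactly what hurts intermediate values.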
Even for input data and output data, in engineering there are many characteristic values of parts used in manufacturing, such as electronic components, where constant relative errors are needed over a range greater than 10^9 or even greater than 10^12 (e.g. the values of standard resistors, with a given tolerance, may vary from milliohms to gigaohms and those of standard capacitors from femtofarads to farads; it would not be acceptable to represent such nominal values as posits, with a tolerance depending on the nominal value).
There may exist some applications where this is useful, but all such applications are among those that need low precision numbers, of 32 bits or less. So 16-bit or even 32-bit posits might be useful in certain circumstances, but it is pretty certain that 64-bit or bigger posits will always be inferior to IEEE floating-point numbers.
The problem is that while the IEEE floating-point numbers are either optimal or at least acceptable in almost all applications, the fact that posits might be better only in certain special applications makes it unlikely that the development of dedicated fast hardware for them can be worthwhile.
Even if some applications might indeed like a better precision for numbers close to unity, for those applications posits must contend not only with floating-point numbers but also with fixed-point numbers, which do not need any special hardware and can be implemented on any standard CPU with a simple software library.
Fixed-point numbers have even better precision than posits for numbers close to unity, but posits have a greater range, where their precision drops quickly towards the extremities of the range.
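As an illustration of how little software fixed-point arithmetic needs, here is a sketch of a Q2.30 format (30 fraction bits in a 32-bit word, covering values in [-2, 2)). The layout and the helper names are assumptions made for this example, and range checking is left out because Python integers do not overflow.

```python
FRAC_BITS = 30                 # hypothetical Q2.30 layout in a 32-bit word
ONE = 1 << FRAC_BITS           # fixed-point representation of 1.0

def to_fixed(x):
    """Round a Python float to the nearest Q2.30 integer."""
    return round(x * ONE)

def fixed_mul(a, b):
    """Multiply two Q2.30 values, rounding the dropped bits to nearest."""
    return (a * b + (ONE >> 1)) >> FRAC_BITS

def to_float(a):
    """Convert a Q2.30 integer back to a Python float."""
    return a / ONE

a, b = to_fixed(1.25), to_fixed(0.8)
product = to_float(fixed_mul(a, b))    # close to 1.0
```

Addition and subtraction are plain integer operations, and the resolution is a constant 2^-30 everywhere in the range, including right next to unity.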
So the application domain of posits is squeezed between those of floating-point numbers and those of fixed-point numbers, leaving a very small number of applications where posits are optimal.
I am not aware of any application that would justify the additional cost for posit-processing hardware. I think that the only chance for posits would be if someone showed that some posit format is better for certain ML/AI/LLM computations than narrow FP formats like BF16, FP8, FP4.
That would be the only use case that could find people willing to pay enough money for the design of posit-processing hardware, in some kind of NPU for training or for inference.
Like I have said, for technical/scientific computing posits are much inferior to FP64, graphic applications are happy with FP32 and FP16 and they do not have any reason for a change, while the applications that need high precision around unity are happy with fixed-point numbers and they also do not have any incentive to change.
It's obvious that IEEE floats have several issues: low precision, excessive dynamic range, underflow to 0 causing infinite loss of precision (ditto for overflow), denormals, non-portability, lack of a total order, redundancy of NaN bit patterns, poor scaling to low precision... I can go on! Whether you regard those as relevant or as unimportant, or perhaps as unavoidable, is your opinion. But they do exist!
To re-iterate: claiming that IEEE floats are superior to any existing or proposed alternative is a claim you can attempt to make, although I disagree with it. But claiming "IEEE floats do not have any problems whatsoever" is simply not a good-faith conversation...
> In scientific and technical computations, a constant relative error over the entire range of the numbers is optimal.
What could you possibly mean by "optimal"? For instance, inference on a neural network using a 16-bit (or even 8-bit!) posit type tailored to the distribution of the weights in that neural network can yield better results than with 32-bit floats! Obviously floats are not "optimal" in every conceivable situation (nor are posits "more optimal" than floats in every conceivable situation).
Even in "traditional" HPC applications, like weather modelling, experiments have shown 16-bit posits to be acceptable replacements for 32 or even 64-bit computations.
> Like I have said, for technical/scientific computing posits are much inferior to FP64
Repeating something does not make it true :) Like I have said, 32 and 16-bit posits can replace 64 and 32-bit floats in many important applications (obviously not all). HPC and ML workloads are largely memory-bound nowadays; halving the number of bits can yield a doubling of performance, roughly speaking.
> it is pretty certain that 64-bit or bigger posits will always be inferior to IEEE floating-point numbers.
> The problem is that while the IEEE floating-point numbers are either optimal or at least acceptable in almost all applications, the fact that posits might be better only in certain special applications
You assert this repeatedly as dogma, without proof x) I don't get it, do you have a dog in this race somehow? Bizarre.
In fact, the simplest way to see that this is wrong is to consider a "posit"-like format with no regime bits, but with otherwise the same structure: twos-complement representation, no underflow or overflow, deterministic rounding, a quire. This format is essentially an improved version of an IEEE float, without most of its warts, but still a constant relative error (actually constant, unlike IEEE with its subnormals!), similar hardware encode/decode implementations, etc.
From your list of supposed defects, only one can be considered a defect, i.e. that there are too many NaN values. However there is a justification for this, because using fewer values as NaNs would have required more expensive hardware for detecting which operands are NaNs.
> low precision, excessive dynamic range
This is a single item, because choosing the dynamic range determines the precision and vice versa.
Contrary to your claim, a greater dynamic range was one of the greatest improvements of the IEEE 754 format over the earlier FP formats, like that of IBM and that of DEC.
With a lower dynamic range, overflows in intermediate computation results become extremely frequent and unavoidable. This was a major problem before Intel 8087 and the IEEE 754 standard. The dynamic range of the IEEE FP64 format is great enough to make overflows very unlikely in typical technical/engineering computations, and this is a very desirable property.
> underflow to 0 causing infinite loss of precision
I do not know why you have written this false statement, but there exists no such thing in the IEEE standard for floating-point arithmetic.
Underflows have only 2 standard behaviors, they either generate exceptions that must be handled by the programmer or they generate denormal numbers, which minimize the loss of precision. It is impossible to have "infinite loss of precision" on underflows in a standard-conforming processor.
> ditto for overflow
Overflows have 2 possible standard behaviors, they can generate either exceptions or infinities. Both possible behaviors allow the programmer to detect that there is a bug in the program, which must be fixed. There exist no methods of handling overflow that can avoid the loss of precision (when using fixed-length numbers), so the only thing that can be done, and it is done by the standard, is to ensure that the user is made aware that an overflow happened.
> denormals
Denormals are an optional feature of the standard, like infinities and NaNs. They can be completely avoided by enabling the underflow exception.
Denormals offer a choice to the programmer, to avoid handling the underflow exceptions. If the programmer chooses to use denormals, they minimize the loss of precision when underflows happen.
> non-portability
Huh ?!
The IEEE FP numbers are the most portable FP format known in history. Before this standard, every computer-making company had their own FP formats that were incompatible. The conversion of numeric data between different computers was a very complex problem.
Now this is far in the past. Even if some processors, mainly GPUs, do not support all features of the standard, at least the numeric formats are everywhere the same so no conversions are required.
> poor scaling to low precision
This has nothing to do with the IEEE standard. Any kind of floating-point numbers must scale poorly towards very low precision, because when very few bits are available for the complete number, even fewer bits can be used for the significand and for the exponent, which makes the tradeoff between precision and dynamic range difficult.
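The tradeoff is easy to tabulate. The sketch below computes the limits of an IEEE-style binary format from its field widths; `ieee_like` is a name invented here, and the 8-bit rows are hypothetical IEEE-style splits (formats actually used in ML hardware may reserve fewer encodings for infinities and NaNs, so their limits can differ).

```python
def ieee_like(exp_bits, frac_bits):
    """Largest finite value, smallest normal value and spacing at 1.0
    for an IEEE-style binary format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    largest = (2 - 2.0 ** -frac_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -frac_bits
    return largest, smallest_normal, epsilon

formats = [("FP32", 8, 23), ("FP16", 5, 10),
           ("8-bit, 4+3", 4, 3), ("8-bit, 5+2", 5, 2)]
for name, e, f in formats:
    print(name, ieee_like(e, f))
```

Every bit moved from the fraction to the exponent roughly squares the dynamic range and halves the precision, so with 8 bits or fewer no split is comfortable.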
Even so, the IEEE standard FP16 format has adequate precision and dynamic range for its main application, which is storing color component values in pixels, in graphics and video applications.
For ML/AI applications, where even lower precision is desired, 8-bit to 4-bit floating-point numeric formats have become preferred to fixed-point numbers, despite the "poor scaling" of floating-point numbers to low precision.
So your list contains no actual defect of the IEEE standard floating-point numbers, especially not in comparison with other FP number formats.
The only point that is specific to the IEEE format and about which one could argue regarding some specific application is the choice of the dynamic ranges, which for FP64 and FP128 are greater than those used in the older FP formats, which had been in use before 1980.
I have started using computers with IBM mainframes and DEC minicomputers, so I have practical experience with those older formats.
Switching to IBM PCs with the 8087 and their successors was a great improvement because it eliminated the problem of overflows. On older computers, in order to avoid overflows it was frequently necessary to introduce well chosen scale factors in various formulae and equations. The necessity of handling those scale factors removed much of the advantage that floating-point numbers have over fixed-point numbers. Floating-point numbers were invented precisely to free programmers from the chore of dealing with scale factors.
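For readers who never had to do this, here is what a scale factor looks like in code. The sketch uses FP64 with deliberately exaggerated magnitudes, because Python does not offer the older, narrower formats; on those, the same trick was needed for quite ordinary values. The function names are invented for this example.

```python
import math

def naive_norm(x, y):
    """sqrt(x^2 + y^2) written directly; x * x may overflow to infinity."""
    return math.sqrt(x * x + y * y)

def scaled_norm(x, y):
    """The same formula with a scale factor that keeps the
    intermediate squares within range."""
    s = max(abs(x), abs(y))            # hand-picked scale factor
    if s == 0.0:
        return 0.0
    return s * math.sqrt((x / s) ** 2 + (y / s) ** 2)

print(naive_norm(3e200, 4e200))        # intermediate overflow
print(scaled_norm(3e200, 4e200))       # close to 5e200
```

The scaled version gives the right answer, but every formula in the program has to be rewritten in this style, which is the chore that a sufficiently wide exponent range makes unnecessary.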
For solving practical engineering problems, like the design of electronic devices or integrated circuits, the dynamic range of IEEE FP64 numbers is good, while the dynamic ranges of the older FP formats (and also that of IEEE FP32 numbers) are insufficient.